# Practical Exam: Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.


## Data Description

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

# Task 1

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table below.

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

In [13]:
# Write your answer to Task 1 here 
import pandas as pd

# Read the data into a pandas DataFrame
raw_data = pd.read_csv('raw_customer_data.csv')

# Create a copy to hold the cleaned data
clean_data = raw_data.copy()

# Fill missing 'time_spent' values with the median
# (assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas)
time_spent_median = clean_data['time_spent'].median()
clean_data['time_spent'] = clean_data['time_spent'].fillna(time_spent_median)

# Fill missing 'pages_viewed' values with the mean, then cast to integer
# (astype(int) truncates the fractional part of the mean)
pages_viewed_mean = clean_data['pages_viewed'].mean()
clean_data['pages_viewed'] = clean_data['pages_viewed'].fillna(pages_viewed_mean).astype(int)

# Fill missing 'basket_value' values with 0
clean_data['basket_value'] = clean_data['basket_value'].fillna(0)

# Fill missing 'device_type' values with 'Unknown'
clean_data['device_type'] = clean_data['device_type'].fillna("Unknown")

# Fill missing 'customer_type' values with 'New'
clean_data['customer_type'] = clean_data['customer_type'].fillna("New")

print(clean_data.info())
print(clean_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    500 non-null    int64  
 1   time_spent     500 non-null    float64
 2   pages_viewed   500 non-null    int64  
 3   basket_value   500 non-null    float64
 4   device_type    500 non-null    object 
 5   customer_type  500 non-null    object 
 6   purchase       500 non-null    int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 27.5+ KB
None
   customer_id  time_spent  pages_viewed  ...  device_type customer_type purchase
0            1   23.097867             7  ...       Mobile     Returning        0
1            2   57.092144             3  ...       Mobile     Returning        1
2            3   44.187643            14  ...       Mobile     Returning        0
3            4   36.320851            10  ...       Mobile           New        1
4            5   10.20
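
The imputation rules from the criteria table can also be expressed as a single mapping passed to `DataFrame.fillna`. The following is a minimal sketch on a hypothetical toy frame (the rows and the names `toy`, `fill_values`, and `toy_clean` are made up for illustration and do not come from `raw_customer_data.csv`); it only shows the pattern and a quick check that no missing values remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for raw_customer_data.csv
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "time_spent": [10.0, np.nan, 30.0, 50.0],
    "pages_viewed": [2.0, 4.0, np.nan, 9.0],
    "basket_value": [np.nan, 20.5, 0.0, 99.9],
    "device_type": ["Mobile", None, "Desktop", "Tablet"],
    "customer_type": ["New", "Returning", None, "New"],
    "purchase": [0, 1, 0, 1],
})

# One replacement value per column, following the criteria table
fill_values = {
    "time_spent": toy["time_spent"].median(),
    "pages_viewed": toy["pages_viewed"].mean(),
    "basket_value": 0,
    "device_type": "Unknown",
    "customer_type": "New",
}

# fillna with a dict fills each listed column with its own value
toy_clean = toy.fillna(value=fill_values)

# pages_viewed must be an integer column after imputation
toy_clean["pages_viewed"] = toy_clean["pages_viewed"].astype(int)

# Quick check: total number of missing values left in the frame
remaining_missing = int(toy_clean.isna().sum().sum())
print(remaining_missing)
```

Keeping the rules in one dictionary makes it easier to compare the code against the criteria table line by line.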

# Task 2
The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale numerical features (`time_spent`, `pages_viewed`, `basket_value`) to 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.
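
To make the two transformations concrete, here is a small sketch on hypothetical toy data (the frame `toy` and its values are invented for illustration). It applies the min-max formula `(x - min) / (max - min)` by hand and uses `pd.get_dummies` to produce column names in the `variable_name_category_name` format. It is not the graded solution, only an illustration of what each step does.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column
toy = pd.DataFrame({
    "time_spent": [10.0, 20.0, 40.0],
    "device_type": ["Mobile", "Desktop", "Mobile"],
})

# Min-max scaling: the smallest value maps to 0, the largest to 1
col = toy["time_spent"]
toy["time_spent"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding: the column name becomes the prefix,
# joined to each category with prefix_sep
toy_encoded = pd.get_dummies(
    toy, columns=["device_type"], prefix_sep="_", dtype=int
)

print(list(toy_encoded.columns))
```

Passing `dtype=int` keeps the indicator columns as 0/1 integers; newer pandas versions otherwise return booleans.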


In [14]:
# Write your answer to Task 2 here
from sklearn.preprocessing import MinMaxScaler
model_data = pd.read_csv('model_data.csv')

# Isolate numerical and categorical features for transformation
numerical_features = ['time_spent', 'pages_viewed', 'basket_value']
categorical_features = ['device_type', 'customer_type']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Create a copy of the original data to modify
model_feature_set = model_data.copy()

# Scale numerical features and update the DataFrame
model_feature_set[numerical_features] = scaler.fit_transform(model_data[numerical_features])

# Apply one-hot encoding and get the new encoded columns
# (dtype=int keeps the indicators as 0/1; newer pandas versions default to booleans)
encoded_df = pd.get_dummies(model_data[categorical_features], prefix=categorical_features, prefix_sep='_', dtype=int)

# Combine the original DataFrame (with scaled numerical columns) with the new encoded columns
model_feature_set = pd.concat([model_feature_set, encoded_df], axis=1)

# Drop the original categorical columns as they are now encoded
model_feature_set.drop(columns=categorical_features, inplace=True)

# The 'model_feature_set' DataFrame is now ready for the neural network.
print(model_feature_set.head())
print(model_feature_set.info())


   customer_id  time_spent  ...  customer_type_New  customer_type_Returning
0          501    0.664167  ...                  1                        0
1          502    0.483681  ...                  0                        1
2          503    0.231359  ...                  0                        1
3          504    0.792944  ...                  1                        0
4          505    0.649210  ...                  1                        0

[5 rows x 11 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   customer_id              500 non-null    int64  
 1   time_spent               500 non-null    float64
 2   pages_viewed             500 non-null    float64
 3   basket_value             500 non-null    float64
 4   purchase                 500 non-null    int64  
 5   device_type_Desktop      500 non-

# Task 3

Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases. 
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model. 
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.
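
The architecture described above (one hidden layer with 8 units, ReLU, then a sigmoid output) can be sketched compactly with `nn.Sequential`. This is an illustrative alternative, not the required form of the answer: the feature count `n_features = 4`, the name `sketch_model`, and the random input batch are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical feature count; the real value depends on input_model_features.csv
n_features = 4

# Linear -> ReLU -> Linear -> Sigmoid, matching the layer list in the task
sketch_model = nn.Sequential(
    nn.Linear(n_features, 8),  # hidden layer with 8 units
    nn.ReLU(),
    nn.Linear(8, 1),           # single output unit
    nn.Sigmoid(),              # squashes the output into [0, 1]
)

# Random dummy batch of 5 rows, only to check the output shape and range
dummy_batch = torch.rand(5, n_features)
with torch.no_grad():
    probs = sketch_model(dummy_batch)

print(probs.shape)
```

Because the last layer is a sigmoid, each output can be read as a predicted purchase probability, which is why `nn.BCELoss` is a natural pairing for training.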


In [15]:
# Write your answer to Task 3 here
import torch
import torch.nn as nn
import torch.optim as optim

# --- 1. Data Loading and Preparation ---

input_model_features = pd.read_csv('input_model_features.csv')
validation_features = pd.read_csv('validation_features.csv')

# Identify feature columns (all columns except identifiers and the target)
feature_columns = [col for col in input_model_features.columns if col not in ['customer_id', 'purchase']]

# Prepare training data tensors
X_train = input_model_features[feature_columns].values
y_train = input_model_features['purchase'].values.reshape(-1, 1)
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)

# Prepare validation data tensors
X_val = validation_features[feature_columns].values
validation_customer_ids = validation_features['customer_id']
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)

# --- 2. Neural Network Definition ---

class PurchasePredictorNet(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        # First hidden layer
        self.layer1 = nn.Linear(input_size, 8)
        self.relu = nn.ReLU()
        # Output layer
        self.output_layer = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.output_layer(x)
        x = self.sigmoid(x)
        return x

# --- 3. Model Training ---

# Set a random seed for reproducible results
torch.manual_seed(42)

# Instantiate the model
input_size = len(feature_columns)
purchase_model = PurchasePredictorNet(input_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(purchase_model.parameters(), lr=0.01)

# Training loop
epochs = 500
for epoch in range(epochs):
    purchase_model.train()
    
    # Forward pass: compute predicted y by passing x to the model
    outputs = purchase_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    
    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# --- 4. Prediction on Validation Set ---

# Set model to evaluation mode (disables dropout, etc.)
purchase_model.eval()
with torch.no_grad(): # Disable gradient computation
    # Get raw predictions (probabilities)
    raw_predictions = purchase_model(X_val_tensor)
    # Convert probabilities to binary classes (0 or 1) based on a 0.5 threshold
    predicted_classes = (raw_predictions > 0.5).int()

# --- 5. Create Output DataFrame ---

validation_predictions = pd.DataFrame({
    'customer_id': validation_customer_ids.values,
    'purchase': predicted_classes.numpy().flatten()
})

print("Validation Predictions:")
print(validation_predictions)

Validation Predictions:
     customer_id  purchase
0           1801         1
1           1802         1
2           1803         1
3           1804         0
4           1805         1
..           ...       ...
195         1996         1
196         1997         1
197         1998         1
198         1999         1
199         2000         1

[200 rows x 2 columns]
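
Since the model outputs probabilities, the final labels depend on the 0.5 threshold. The sketch below uses hypothetical probabilities and customer IDs (none of these numbers come from the model above) to show the thresholding step and one way to inspect how the predicted classes are distributed, which can be a useful sanity check when most predictions fall into a single class.

```python
import numpy as np
import pandas as pd

# Hypothetical model outputs, shaped (n_rows, 1) like the sigmoid output
probs = np.array([[0.91], [0.12], [0.50], [0.67]])
ids = [101, 102, 103, 104]  # hypothetical customer IDs

# Strictly greater than 0.5 maps to class 1, so exactly 0.50 becomes 0
labels = (probs > 0.5).astype(int).flatten()

toy_predictions = pd.DataFrame({"customer_id": ids, "purchase": labels})

# Share of each predicted class
class_share = toy_predictions["purchase"].value_counts(normalize=True)
print(class_share)
```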
