# Practical Exam: Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.


## Data Description

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

# Task 1

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table below.
</br>

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

In [18]:
# Task 1

import pandas as pd

# Load data
raw_data = pd.read_csv('raw_customer_data.csv')

# Clean the data
clean_data = raw_data.copy()

# Ensure the customer_id is int and it has no missing values
clean_data['customer_id'] = clean_data['customer_id'].astype(int)

# Replace missing time_spent with median time spent
median_time_spent = clean_data['time_spent'].median()
clean_data['time_spent'] = clean_data['time_spent'].fillna(median_time_spent).astype(float)

# Replace missing pages_viewed with mean
mean_pages_viewed = clean_data['pages_viewed'].mean()
clean_data['pages_viewed'] = clean_data['pages_viewed'].fillna(mean_pages_viewed).astype(int)

# Replace missing basket_value with 0
clean_data['basket_value'] = clean_data['basket_value'].fillna(0).astype(float)

# Replace missing device_type with "Unknown"
clean_data['device_type'] = clean_data['device_type'].fillna("Unknown")

# Replace missing customer_type with "New"
clean_data['customer_type'] = clean_data['customer_type'].fillna("New")

# Ensure purchase is binary (1 or 0)
clean_data['purchase'] = clean_data['purchase'].astype(int)

# Display the cleaned data
clean_data.head()


Unnamed: 0,customer_id,time_spent,pages_viewed,basket_value,device_type,customer_type,purchase
0,1,23.097867,7,50.574647,Mobile,Returning,0
1,2,57.092144,3,56.891022,Mobile,Returning,1
2,3,44.187643,14,8.348296,Mobile,Returning,0
3,4,36.320851,10,43.481489,Mobile,New,1
4,5,10.2051,16,0.0,Mobile,Returning,1


# Task 2
The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale numerical features (`time_spent`, `pages_viewed`, `basket_value`) to 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.


In [19]:
# Task 2

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Load Pre-cleaned dataset
model_data = pd.read_csv('model_data.csv')

# Scale numerical features to 0-1 range
scaler = MinMaxScaler()
numerical_features = ['time_spent', 'pages_viewed', 'basket_value']
model_data[numerical_features] = scaler.fit_transform(model_data[numerical_features])

# Apply one-hot encoding to categorical features
categorical_features = ['device_type', 'customer_type']
encoder = OneHotEncoder(sparse=False, drop='first')  # I drop first to avoid multicollinearity
encoded_features = encoder.fit_transform(model_data[categorical_features])

# Create DataFrame for encoded features
encoded_columns = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_columns)

# Combine scaled numerical features and encoded categorical features
model_feature_set = pd.concat([model_data.drop(columns=categorical_features), encoded_df], axis=1)

# Display the prepared feature set
model_feature_set.head()

Unnamed: 0,customer_id,time_spent,pages_viewed,basket_value,purchase,device_type_Mobile,device_type_Tablet,device_type_Unknown,customer_type_Returning
0,501,0.664167,0.5,0.0,1,0.0,0.0,0.0,0.0
1,502,0.483681,0.222222,0.524981,1,1.0,0.0,0.0,1.0
2,503,0.231359,0.111111,0.457291,0,1.0,0.0,0.0,1.0
3,504,0.792944,0.277778,0.0,1,0.0,0.0,1.0,0.0
4,505,0.64921,0.166667,0.484283,1,0.0,1.0,0.0,0.0


# Task 3

Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases. 
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model. 
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.


In [30]:
# Task 3

import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Load the prepared features and validation dataset
input_features = pd.read_csv('input_model_features.csv')
validation_features = pd.read_csv('validation_features.csv')

# Get feature columns (excluding 'purchase' and 'customer_id')
feature_cols = [col for col in input_features.columns 
               if col not in ['purchase', 'customer_id']]

# Prepare features for both training and validation sets
X_train = input_features[feature_cols].values
y_train = input_features['purchase'].values
X_val = validation_features[feature_cols].values

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
X_val = torch.tensor(X_val, dtype=torch.float32)

# Define the neural network
class PurchaseModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.hidden = nn.Linear(input_size, 8)
        self.output = nn.Linear(8, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        x = self.sigmoid(self.output(x))
        return x

# Initialize the model
input_size = X_train.shape[1]
purchase_model = PurchaseModel(input_size)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(purchase_model.parameters(), lr=0.01)

# Train the model
epochs = 100
batch_size = 32
n_samples = len(X_train)

for epoch in range(epochs):
    for i in range(0, n_samples, batch_size):
        # Get batch
        batch_X = X_train[i:i+batch_size]
        batch_y = y_train[i:i+batch_size].reshape(-1, 1)
        
        optimizer.zero_grad()
        outputs = purchase_model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        
    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            train_predictions = purchase_model(X_train)
            train_predictions = (train_predictions >= 0.5).float()
            accuracy = (train_predictions.flatten() == torch.tensor(y_train)).float().mean()
            print(f'Epoch [{epoch+1}/{epochs}], Training Accuracy: {accuracy:.4f}')

# Make predictions on validation set
with torch.no_grad():
    predictions = purchase_model(X_val)
    predictions = (predictions >= 0.5).int()

# Create output DataFrame
validation_predictions = pd.DataFrame({
    'customer_id': validation_features['customer_id'],
    'purchase': predictions.numpy().flatten()
})

# Display first few predictions
validation_predictions.head()

Epoch [10/100], Training Accuracy: 0.7900
Epoch [20/100], Training Accuracy: 0.7864
Epoch [30/100], Training Accuracy: 0.7806
Epoch [40/100], Training Accuracy: 0.7791
Epoch [50/100], Training Accuracy: 0.7784
Epoch [60/100], Training Accuracy: 0.7791
Epoch [70/100], Training Accuracy: 0.7791
Epoch [80/100], Training Accuracy: 0.7784
Epoch [90/100], Training Accuracy: 0.7784
Epoch [100/100], Training Accuracy: 0.7762


Unnamed: 0,customer_id,purchase
0,1801,1
1,1802,1
2,1803,1
3,1804,0
4,1805,1
