# Practical Exam: Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.


## Data Description

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

## Task 1

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table below.

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

In [17]:
import pandas as pd

# Load raw customer session data
df = pd.read_csv('raw_customer_data.csv')

# Clean and standardize the data
clean_data = df.copy()

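# Fill missing values. Assign the result back instead of calling
# fillna(..., inplace=True) on a column selection, which recent pandas
# versions warn about and which may not modify the DataFrame at all.
clean_data['time_spent'] = clean_data['time_spent'].fillna(clean_data['time_spent'].median())
clean_data['pages_viewed'] = clean_data['pages_viewed'].fillna(clean_data['pages_viewed'].mean())
clean_data['basket_value'] = clean_data['basket_value'].fillna(0.0)
clean_data['device_type'] = clean_data['device_type'].fillna('Unknown')
clean_data['customer_type'] = clean_data['customer_type'].fillna('New')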

# Ensure correct data types
clean_data['customer_id'] = clean_data['customer_id'].astype(int)
clean_data['time_spent'] = clean_data['time_spent'].astype(float)
clean_data['pages_viewed'] = clean_data['pages_viewed'].round().astype(int)
clean_data['basket_value'] = clean_data['basket_value'].astype(float)
clean_data['purchase'] = clean_data['purchase'].astype(int)

# Preview to verify
print(" Cleaned data sample:")
print(clean_data.head())
print("\n Data types:")
print(clean_data.dtypes)


 Cleaned data sample:
   customer_id  time_spent  pages_viewed  ...  device_type customer_type purchase
0            1   23.097867             7  ...       Mobile     Returning        0
1            2   57.092144             3  ...       Mobile     Returning        1
2            3   44.187643            14  ...       Mobile     Returning        0
3            4   36.320851            10  ...       Mobile           New        1
4            5   10.205100            16  ...       Mobile     Returning        1

[5 rows x 7 columns]

 Data types:
customer_id        int64
time_spent       float64
pages_viewed       int64
basket_value     float64
device_type       object
customer_type     object
purchase           int64
dtype: object
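
Printing a sample and the dtypes does not confirm every criterion in the table. One way to go further is a small set of explicit checks. The sketch below is illustrative only: `check_clean_data` is a hypothetical helper, and the three-row `sample` frame is made-up data standing in for `clean_data`.

```python
import pandas as pd

def check_clean_data(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values remain")
    if not frame["customer_id"].is_unique:
        problems.append("customer_id is not unique")
    if not set(frame["device_type"]) <= {"Mobile", "Desktop", "Tablet", "Unknown"}:
        problems.append("unexpected device_type value")
    if not set(frame["customer_type"]) <= {"New", "Returning"}:
        problems.append("unexpected customer_type value")
    if not set(frame["purchase"]) <= {0, 1}:
        problems.append("purchase is not binary")
    return problems

# Made-up rows standing in for clean_data (not taken from raw_customer_data.csv)
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [23.1, 57.0, 44.2],
    "pages_viewed": [7, 3, 14],
    "basket_value": [0.0, 50.5, 12.0],
    "device_type": ["Mobile", "Unknown", "Desktop"],
    "customer_type": ["Returning", "New", "New"],
    "purchase": [0, 1, 0],
})

problems = check_clean_data(sample)
print(problems)
```

Calling `check_clean_data(clean_data)` on the real frame would flag, for example, inconsistent capitalization such as `mobile`, which the type conversions above would not catch.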


## Task 2
The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale the numerical features (`time_spent`, `pages_viewed`, `basket_value`) to the 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.
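
The two transformations listed above can be sketched with pandas alone, which makes the scaling arithmetic explicit: each numerical column is mapped to `(x - min) / (max - min)`, and `pd.get_dummies` names the encoded columns `variable_category`. This is an illustration rather than the required solution, and the four-row `toy` frame is made-up data, not `model_data.csv`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "time_spent": [10.0, 20.0, 30.0, 50.0],
    "pages_viewed": [1, 3, 5, 9],
    "basket_value": [0.0, 25.0, 50.0, 100.0],
    "device_type": ["Mobile", "Desktop", "Tablet", "Mobile"],
    "customer_type": ["New", "Returning", "New", "Returning"],
    "purchase": [0, 1, 0, 1],
})

numerical = ["time_spent", "pages_viewed", "basket_value"]
categorical = ["device_type", "customer_type"]

features = toy.copy()

# Min-max scaling, one column at a time: (x - min) / (max - min)
for col in numerical:
    col_min = features[col].min()
    col_max = features[col].max()
    features[col] = (features[col] - col_min) / (col_max - col_min)

# One-hot encoding; dtype=int yields 0/1 columns such as device_type_Desktop
features = pd.get_dummies(features, columns=categorical, dtype=int)

print(features.columns.tolist())
print(features["time_spent"].tolist())
```

Because only the selected columns are modified, `customer_id` and `purchase` keep their integer dtype in this version.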


In [18]:
# Write your answer to Task 2 here
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load the dataset
df_model = pd.read_csv('model_data.csv')

# Define numerical and categorical features
numerical_features = ['time_spent', 'pages_viewed', 'basket_value']
categorical_features = ['device_type', 'customer_type']

# Create a preprocessor for scaling numerical features and one-hot encoding categorical features
# The remainder='passthrough' will keep other columns (like customer_id and purchase)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ],
    remainder='passthrough'
)

# Apply the preprocessing
X_processed = preprocessor.fit_transform(df_model)

# Get feature names after one-hot encoding
# The column names for numerical features will be the same
# The column names for one-hot encoded features need to be retrieved from the OneHotEncoder
numerical_feature_names = numerical_features
categorical_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)

# Get the names of the columns that passed through (customer_id, purchase)
passthrough_columns = [col for col in df_model.columns if col not in numerical_features + categorical_features]

# Combine all feature names in the correct order
all_feature_names = list(numerical_feature_names) + list(categorical_feature_names) + list(passthrough_columns)

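# Create the final DataFrame. ColumnTransformer returns a single float array,
# so the passthrough columns (customer_id, purchase) come back as float64 here.
model_feature_set = pd.DataFrame(X_processed, columns=all_feature_names)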

# The column order from all_feature_names is kept as-is for this task.
# Display the first few rows of model_feature_set and its info to verify.
print("Model Feature Set (First 5 Rows):")
print(model_feature_set.head())
print("\nModel Feature Set Info:")
print(model_feature_set.info())

Model Feature Set (First 5 Rows):
   time_spent  pages_viewed  ...  customer_id  purchase
0    0.664167      0.500000  ...        501.0       1.0
1    0.483681      0.222222  ...        502.0       1.0
2    0.231359      0.111111  ...        503.0       0.0
3    0.792944      0.277778  ...        504.0       1.0
4    0.649210      0.166667  ...        505.0       1.0

[5 rows x 11 columns]

Model Feature Set Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   time_spent               500 non-null    float64
 1   pages_viewed             500 non-null    float64
 2   basket_value             500 non-null    float64
 3   device_type_Desktop      500 non-null    float64
 4   device_type_Mobile       500 non-null    float64
 5   device_type_Tablet       500 non-null    float64
 6   device_type_Unknown      500 non-null   

## Task 3

Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for the hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases. 
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model. 
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.
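
To make the specified architecture concrete, the forward pass of a network with one 8-unit hidden layer can be written out in NumPy. This is a sketch of the arithmetic only, not the PyTorch model the task asks for: the input width of 5, the random weights, and the names `W1`, `b1`, `W2`, `b2`, and `forward` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5   # assumed input width, for illustration only
n_hidden = 8     # hidden units, as required by the task

# Randomly initialised parameters (a trained model would learn these)
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU on the hidden layer
    logits = hidden @ W2 + b2               # single output unit
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid maps logits to (0, 1)

x = rng.random(size=(4, n_features))        # four made-up, already scaled rows
probs = forward(x)
labels = (probs >= 0.5).astype(int)         # threshold probabilities at 0.5

print(probs.shape)   # (4, 1)
```

The sigmoid output can be read as a purchase probability, which is why a binary cross-entropy loss and a 0.5 threshold pair naturally with this design.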


In [19]:
# Write your answer to Task 3 here
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Load the datasets
input_model_features = pd.read_csv('input_model_features.csv')
validation_features = pd.read_csv('validation_features.csv')

# Separate features and target for training data
# 'purchase' is the target variable
X_train_df = input_model_features.drop(columns=['customer_id', 'purchase'])
y_train_df = input_model_features['purchase']

# Convert to PyTorch tensors
X_train = torch.tensor(X_train_df.values, dtype=torch.float32)
y_train = torch.tensor(y_train_df.values, dtype=torch.float32).view(-1, 1) # Reshape for BCELoss

# Keep customer_id for the output and drop 'purchase' if it is present;
# errors='ignore' covers the case where the validation file has no target column
customer_ids_validation = validation_features['customer_id']
X_validation_df = validation_features.drop(columns=['customer_id', 'purchase'], errors='ignore')
X_validation = torch.tensor(X_validation_df.values, dtype=torch.float32)

# Define the neural network model
class PurchasePredictor(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        # At least one hidden layer with 8 units and ReLU activation
        self.fc1 = nn.Linear(input_size, 8)
        self.relu = nn.ReLU()
        # Output layer with Sigmoid activation for binary classification
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Instantiate the model
input_size = X_train.shape[1]
purchase_model = PurchasePredictor(input_size)

# Define loss function and optimizer
criterion = nn.BCELoss() # Binary Cross-Entropy Loss
optimizer = optim.Adam(purchase_model.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Train the model
num_epochs = 100 # A reasonable number of epochs for a simple model
for epoch in range(num_epochs):
    purchase_model.train() # Set model to training mode
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = purchase_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Predict on the validation dataset
purchase_model.eval() # Set model to evaluation mode
with torch.no_grad(): # Disable gradient calculation for prediction
    y_pred_proba_validation = purchase_model(X_validation)
    # Convert probabilities to binary predictions (0 or 1)
    y_pred_validation = (y_pred_proba_validation >= 0.5).int()

# Create the validation_predictions DataFrame
validation_predictions = pd.DataFrame({
    'customer_id': customer_ids_validation,
    'purchase': y_pred_validation.numpy().flatten()
})

print("Validation Predictions (First 5 Rows):")
print(validation_predictions.head())
print("\nValidation Predictions Info:")
print(validation_predictions.info())

Validation Predictions (First 5 Rows):
   customer_id  purchase
0         1801         1
1         1802         1
2         1803         1
3         1804         1
4         1805         1

Validation Predictions Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  200 non-null    int64
 1   purchase     200 non-null    int32
dtypes: int32(1), int64(1)
memory usage: 2.5 KB
None
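
The first five predictions above are all 1, which is worth checking against the class balance before trusting the model. A minimal sketch, assuming labelled hold-out rows are available (for example, a slice of the training data set aside before fitting): compare the model's accuracy with the accuracy of always predicting the majority class. The arrays below are made-up values, and `accuracy_vs_baseline` is a hypothetical helper.

```python
import numpy as np

def accuracy_vs_baseline(y_true, y_pred):
    """Return (model accuracy, accuracy of always predicting the majority class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    positive_rate = float(y_true.mean())
    baseline = max(positive_rate, 1.0 - positive_rate)
    return accuracy, baseline

# Made-up labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]

accuracy, baseline = accuracy_vs_baseline(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, majority baseline={baseline:.3f}")
```

A model whose accuracy is close to the baseline is doing little more than predicting the most common class.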
