Question 1: Data Understanding

In [2]:
# Import the datasets
import pandas as pd
clickHistory = pd.read_csv("click_history.csv")
productFeatures = pd.read_csv("product_features.csv")
userFeatures = pd.read_csv("user_features.csv")

In [3]:
# Look at the first 5 rows of the data
clickHistory.head()

Unnamed: 0,user_id,product_id,clicked
0,104863,1350,False
1,108656,1321,True
2,100120,1110,False
3,104838,1443,True
4,107304,1397,True


In [4]:
# Give summary statistics
clickHistory.describe()

Unnamed: 0,user_id,product_id
count,35990.0,35990.0
mean,106017.080161,1500.232898
std,3483.48009,288.101984
min,100001.0,1000.0
25%,102976.5,1250.0
50%,106060.0,1503.0
75%,109049.0,1749.0
max,111999.0,1999.0


In [5]:
# Get column info
clickHistory.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35990 entries, 0 to 35989
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     35990 non-null  int64
 1   product_id  35990 non-null  int64
 2   clicked     35990 non-null  bool 
dtypes: bool(1), int64(2)
memory usage: 597.6 KB


In [6]:
# Look at the first 5 rows of the data
productFeatures.head()

Unnamed: 0,product_id,category,on_sale,number_of_reviews,avg_review_score
0,1134,tools,False,101,3.349452
1,1846,skincare,False,111,5.0
2,1762,fragrance,False,220,4.882706
3,1254,hair,True,446,5.0
4,1493,body,True,513,-1.0


In [7]:
# Give summary statistics
productFeatures.describe()

Unnamed: 0,product_id,number_of_reviews,avg_review_score
count,1000.0,1000.0,1000.0
mean,1499.5,115772.5,2.660656
std,288.819436,502899.7,1.741875
min,1000.0,66.0,-1.0
25%,1249.75,257.0,1.428969
50%,1499.5,471.0,2.769397
75%,1749.25,704.25,4.18086
max,1999.0,2307390.0,5.0


In [8]:
# Get column information
productFeatures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   product_id         1000 non-null   int64  
 1   category           1000 non-null   object 
 2   on_sale            1000 non-null   bool   
 3   number_of_reviews  1000 non-null   int64  
 4   avg_review_score   1000 non-null   float64
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 32.4+ KB


In [9]:
# Look at the first 5 rows of the data
userFeatures.head()

Unnamed: 0,user_id,number_of_clicks_before,ordered_before,personal_interests
0,104939,2,True,"['body', 'makeup', 'nail', 'hand', 'foot', 'me..."
1,101562,2,True,"['men_skincare', 'men_fragrance', 'tools', 'sk..."
2,102343,2,True,"['tools', 'makeup', 'foot', 'nail']"
3,106728,5,True,"['hand', 'men_skincare']"
4,107179,0,True,"['makeup', 'body', 'skincare', 'foot', 'men_sk..."


In [10]:
# Give summary statistics
userFeatures.describe()

Unnamed: 0,user_id
count,12000.0
mean,105999.5
std,3464.24595
min,100000.0
25%,102999.75
50%,105999.5
75%,108999.25
max,111999.0


In [11]:
# Get column information
userFeatures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   user_id                  12000 non-null  int64 
 1   number_of_clicks_before  11500 non-null  object
 2   ordered_before           12000 non-null  bool  
 3   personal_interests       12000 non-null  object
dtypes: bool(1), int64(1), object(2)
memory usage: 293.1+ KB


Based on the above analysis, I noted a few important things. First, several columns are True/False (clicked, on_sale, ordered_before). We should convert these to 1 and 0. Next, in the productFeatures data, some rows have -1 as the average review score (it is the minimum in describe() and appears in row 4 of head()). A review score cannot be negative, so -1 is most likely a placeholder for a missing score, and we will need to either remove these rows or replace the value. Finally, in the userFeatures data, 500 rows have a null value for number_of_clicks_before (11,500 non-null out of 12,000). The column also contains the value '6+', which cannot be parsed as an integer, so pandas has read the column as an object dtype. We will fix all of the above in the next section.
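
These observations can be checked with a few counts. The sketch below is only an illustration: toy_products and toy_users are small made-up stand-ins (not rows from the real CSVs), and the variable names are my own.

```python
import pandas as pd

# Toy stand-ins for productFeatures and userFeatures (hypothetical values)
toy_products = pd.DataFrame({"avg_review_score": [3.35, 5.0, -1.0, 4.88, -1.0]})
toy_users = pd.DataFrame({"number_of_clicks_before": ["2", "6+", None, "5", "0", "6+"]})

# Count the -1 placeholder scores
n_placeholder = int((toy_products["avg_review_score"] == -1).sum())

# Count nulls, then find values that are present but cannot be parsed as numbers
clicks = toy_users["number_of_clicks_before"]
n_null = int(clicks.isna().sum())
parsed = pd.to_numeric(clicks, errors="coerce")
non_numeric = sorted(clicks[parsed.isna() & clicks.notna()].unique())

print(n_placeholder, n_null, non_numeric)
```

Running the same three checks on the real tables would give the exact number of -1 scores and confirm whether '6+' is the only non-numeric value.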

Question 2: Data Cleaning and Preprocessing

In [14]:
# Merge clickHistory and productFeatures
combinedTable = pd.merge(clickHistory, productFeatures, on='product_id')

In [15]:
# Merge combinedTable and userFeatures
data = pd.merge(combinedTable, userFeatures, on='user_id')
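
pd.merge defaults to an inner join, so click rows without a matching product or user would be dropped without any warning. Here is a minimal sketch of how the two merges could be guarded, using made-up toy tables (the values and the names clicks, products, users are assumptions for illustration only).

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical values)
clicks = pd.DataFrame({"user_id": [1, 1, 2], "product_id": [10, 11, 10], "clicked": [True, False, True]})
products = pd.DataFrame({"product_id": [10, 11], "category": ["hair", "body"]})
users = pd.DataFrame({"user_id": [1, 2], "ordered_before": [True, False]})

# validate="many_to_one" raises pandas.errors.MergeError if a key is duplicated
# in the right-hand table, which would otherwise multiply rows
merged = clicks.merge(products, on="product_id", how="inner", validate="many_to_one")
merged = merged.merge(users, on="user_id", how="inner", validate="many_to_one")

# An inner join drops click rows with no match, so compare row counts
rows_kept = len(merged) == len(clicks)
print(merged.shape, rows_kept)
```

In this notebook the merged table still has 35,990 rows, the same as clickHistory, so no click rows were lost.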

In [16]:
# Change true/false values to be 1 or 0
data['clicked'] = data['clicked'].astype(int)
data['on_sale'] = data['on_sale'].astype(int)
data['ordered_before'] = data['ordered_before'].astype(int)

In [17]:
# Replace the avg_review_score values of -1 with the median of the valid (non -1) scores
median_score = data.loc[data['avg_review_score'] != -1, 'avg_review_score'].median()
# Assign the result back instead of using inplace=True on a selected column
data['avg_review_score'] = data['avg_review_score'].replace(-1, median_score)

In [18]:
# Replace the '6+' values with 8 (a stand-in for "6 or more"), then convert the column to numeric
data['number_of_clicks_before'] = data['number_of_clicks_before'].replace('6+', 8)
data['number_of_clicks_before'] = pd.to_numeric(data['number_of_clicks_before'])

In [19]:
# Replace the null values with the median
median_clicks = data['number_of_clicks_before'].median()
data['number_of_clicks_before'] = data['number_of_clicks_before'].fillna(median_clicks)

In [20]:
# Check in on our dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35990 entries, 0 to 35989
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   user_id                  35990 non-null  int64  
 1   product_id               35990 non-null  int64  
 2   clicked                  35990 non-null  int32  
 3   category                 35990 non-null  object 
 4   on_sale                  35990 non-null  int32  
 5   number_of_reviews        35990 non-null  int64  
 6   avg_review_score         35990 non-null  float64
 7   number_of_clicks_before  35990 non-null  float64
 8   ordered_before           35990 non-null  int32  
 9   personal_interests       35990 non-null  object 
dtypes: float64(2), int32(3), int64(3), object(2)
memory usage: 2.3+ MB


In [21]:
# Change the category column to be a string
data['category'] = data['category'].astype('string')

In [22]:
import ast
# Convert strings in column 'personal_interests' into lists
data['personal_interests'] = data['personal_interests'].apply(ast.literal_eval)
# Initialize set of unique interests
all_interests = set()
# Get the set of unique interests
data['personal_interests'].apply(lambda x: all_interests.update(x))
# Add columns for all unique interests
for interest in all_interests:
    data[f'interest_{interest}'] = data['personal_interests'].apply(lambda x: 1 if interest in x else 0)
# Remove the 'personal_interests' column
data.drop('personal_interests', axis=1, inplace=True)
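
An alternative to looping over a set is to let pandas build the dummy columns. The sketch below is an assumption-laden illustration on a toy frame (the values are made up), using explode and get_dummies. A side benefit is that the columns come out in alphabetical order, whereas iterating over a set gives no guaranteed order.

```python
import pandas as pd

# Toy stand-in for the parsed personal_interests column (hypothetical values)
toy = pd.DataFrame({"personal_interests": [["tools", "nail"], ["hand"], []]})

# One row per (user, interest), one-hot encode, then collapse back to one row per user
exploded = toy["personal_interests"].explode()
interest_dummies = (
    pd.get_dummies(exploded, prefix="interest", dtype=int)
    .groupby(level=0)
    .max()
)

# Swap the list column for the new dummy columns
toy = pd.concat([toy.drop(columns="personal_interests"), interest_dummies], axis=1)
print(toy)
```

A row with an empty interest list ends up with 0 in every interest column.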

In [23]:
# Check in on the data. Note our new interest dummy columns.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35990 entries, 0 to 35989
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   user_id                  35990 non-null  int64  
 1   product_id               35990 non-null  int64  
 2   clicked                  35990 non-null  int32  
 3   category                 35990 non-null  string 
 4   on_sale                  35990 non-null  int32  
 5   number_of_reviews        35990 non-null  int64  
 6   avg_review_score         35990 non-null  float64
 7   number_of_clicks_before  35990 non-null  float64
 8   ordered_before           35990 non-null  int32  
 9   interest_makeup          35990 non-null  int64  
 10  interest_men_skincare    35990 non-null  int64  
 11  interest_foot            35990 non-null  int64  
 12  interest_nail            35990 non-null  int64  
 13  interest_tools           35990 non-null  int64  
 14  interest_skincare     

In [24]:
# Make dummy variables for category
category_dummies = pd.get_dummies(data['category'], prefix='category', dtype=int)
# Drop old category column
data.drop('category', axis=1, inplace=True)
# Add the dummy columns to the data
data = pd.concat([data, category_dummies], axis=1)

In [25]:
# Get a final look at our columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35990 entries, 0 to 35989
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   user_id                  35990 non-null  int64  
 1   product_id               35990 non-null  int64  
 2   clicked                  35990 non-null  int32  
 3   on_sale                  35990 non-null  int32  
 4   number_of_reviews        35990 non-null  int64  
 5   avg_review_score         35990 non-null  float64
 6   number_of_clicks_before  35990 non-null  float64
 7   ordered_before           35990 non-null  int32  
 8   interest_makeup          35990 non-null  int64  
 9   interest_men_skincare    35990 non-null  int64  
 10  interest_foot            35990 non-null  int64  
 11  interest_nail            35990 non-null  int64  
 12  interest_tools           35990 non-null  int64  
 13  interest_skincare        35990 non-null  int64  
 14  interest_fragrance    

In [26]:
# Get a final look at the first 5 rows of our data
data.head()

Unnamed: 0,user_id,product_id,clicked,on_sale,number_of_reviews,avg_review_score,number_of_clicks_before,ordered_before,interest_makeup,interest_men_skincare,...,category_foot,category_fragrance,category_hair,category_hand,category_makeup,category_men_fragrance,category_men_skincare,category_nail,category_skincare,category_tools
0,104863,1350,0,0,136,2.653361,2.0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
1,104863,1617,1,1,279,4.924063,2.0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,104863,1959,0,1,540,3.049224,2.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,104863,1920,0,1,776,1.562768,2.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,104863,1138,0,1,104,1.996069,2.0,1,0,0,...,0,0,0,0,1,0,0,0,0,0


That took a while, but the data is now in a good spot to start model creation and tuning.

Question 3: Model Generation and Evaluation

Logistic Regression

In [30]:
from sklearn.model_selection import train_test_split
# Drop the key columns and the target variable
X = data.drop(columns=['clicked', 'user_id', 'product_id'])
# Identify the 'clicked' variable as our y
y = data['clicked']
# Get a test and train split with a test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

In [31]:
from sklearn.preprocessing import MinMaxScaler
# Scale the data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [32]:
# Fit the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

In [33]:
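# Our accuracy with logistic regression is 65%, but recall on the clicked class (1) is only 0.08, so the model predicts "not clicked" for almost every row. Refer to the classification report for further statistics.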
from sklearn.metrics import classification_report
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.95      0.78      7018
           1       0.49      0.08      0.14      3779

    accuracy                           0.65     10797
   macro avg       0.58      0.52      0.46     10797
weighted avg       0.60      0.65      0.56     10797



Naive Bayes

In [35]:
# Make the naive bayes model. Our accuracy is worse at 57%.
from sklearn.naive_bayes import GaussianNB
gnb_model = GaussianNB()
gnb_model.fit(X_train_scaled, y_train)
y_pred_gnb = gnb_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_gnb))

              precision    recall  f1-score   support

           0       0.73      0.53      0.61      7018
           1       0.42      0.63      0.50      3779

    accuracy                           0.57     10797
   macro avg       0.57      0.58      0.56     10797
weighted avg       0.62      0.57      0.57     10797



Decision Tree

In [37]:
# Make a basic decision tree model. We get accuracy of 67%. Now let's try to optimize it.
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=123)
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_dt))

              precision    recall  f1-score   support

           0       0.75      0.74      0.75      7018
           1       0.53      0.54      0.54      3779

    accuracy                           0.67     10797
   macro avg       0.64      0.64      0.64     10797
weighted avg       0.67      0.67      0.67     10797



In [38]:
# Do a grid search to get the best parameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 45 candidates, totalling 225 fits


In [39]:
# Show the best parameters
grid_search.best_params_

{'max_depth': 10, 'min_samples_leaf': 5, 'min_samples_split': 20}

In [40]:
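# Train the model with tuned parameters. We get 75% accuracy now, which is by far our best model yet.
# max_depth matches the grid search; min_samples_leaf=10 and min_samples_split=2 differ from best_params_ above (5 and 20).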
dt_model = DecisionTreeClassifier(random_state=123, max_depth=10, min_samples_leaf=10, min_samples_split=2)
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_dt))

              precision    recall  f1-score   support

           0       0.80      0.82      0.81      7018
           1       0.65      0.62      0.63      3779

    accuracy                           0.75     10797
   macro avg       0.73      0.72      0.72     10797
weighted avg       0.75      0.75      0.75     10797
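
One way to avoid retyping tuned values by hand is to reuse the fitted search object, since GridSearchCV refits the best model on the full training set by default. The sketch below runs on synthetic data from make_classification, with a small illustrative grid that is not the grid used in this notebook. All the demo_ names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the real X and y come from the cells above)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# Small illustrative grid, not the grid used in this notebook
demo_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 5]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=123), demo_grid, cv=3, scoring="accuracy")
demo_search.fit(X_tr, y_tr)

# best_estimator_ is already refit on the full training set with best_params_
best_tree = demo_search.best_estimator_
tuned_params = {k: best_tree.get_params()[k] for k in demo_grid}
test_accuracy = best_tree.score(X_te, y_te)
print(demo_search.best_params_, round(test_accuracy, 2))
```

Applied to this notebook, that would mean predicting with grid_search.best_estimator_ directly, so the evaluated model always matches grid_search.best_params_.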



Neural Network

In [42]:
# Basic neural network gets 62% accuracy. Let's optimize it now.
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier(random_state=123, max_iter=1000)
mlp_model.fit(X_train_scaled, y_train)
y_pred_mlp = mlp_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_mlp))

              precision    recall  f1-score   support

           0       0.69      0.76      0.72      7018
           1       0.45      0.36      0.40      3779

    accuracy                           0.62     10797
   macro avg       0.57      0.56      0.56     10797
weighted avg       0.61      0.62      0.61     10797



In [43]:
# Do a grid search over layer sizes, activation function, and initial learning rate
mlp_param_grid = {
    'hidden_layer_sizes': [(100,), (50, 50), (30, 30, 30)],
    'activation': ['relu', 'tanh'],
    'learning_rate_init': [0.001, 0.01]
}
mlp_grid_search = GridSearchCV(MLPClassifier(random_state=123, max_iter=1000), mlp_param_grid, cv=3, scoring='accuracy', verbose=1)
mlp_grid_search.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits




In [81]:
# Get the optimized parameters
mlp_grid_search.best_params_

{'activation': 'tanh',
 'hidden_layer_sizes': (100,),
 'learning_rate_init': 0.01}

In [85]:
# Fit the model with optimized parameters. Accuracy only improved from 62% to 64%.
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier(random_state=123, max_iter=1000, activation='tanh', hidden_layer_sizes=(100,), learning_rate_init=0.01)
mlp_model.fit(X_train_scaled, y_train)
y_pred_mlp = mlp_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_mlp))

              precision    recall  f1-score   support

           0       0.69      0.83      0.75      7018
           1       0.48      0.30      0.37      3779

    accuracy                           0.64     10797
   macro avg       0.58      0.56      0.56     10797
weighted avg       0.61      0.64      0.61     10797



Question 4: Assessment and Evaluation

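I think the optimized decision tree is the best, as it had 75% accuracy. It also has the highest F1-score on the clicked class (0.63).

Logistic regression comes in 2nd with 65% accuracy, although its recall on the clicked class is only 0.08, so most of that accuracy comes from predicting "not clicked".

The optimized neural network is in 3rd with 64% accuracy. It could likely be improved through further parameter tuning, but the grid search already took about 30 minutes to run, so more tuning would be expensive.

Naive Bayes is the worst with only 57% accuracy, even though its recall on the clicked class (0.63) is the highest of the four models.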
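
The ranking can be reproduced from the numbers in the classification reports. This short sketch copies accuracy plus the clicked-class recall and F1-score into a dictionary and sorts it. The dictionary layout and key names are my own.

```python
# Accuracy and clicked-class (1) metrics copied from the classification reports above
results = {
    "Decision tree (tuned)": {"accuracy": 0.75, "recall_1": 0.62, "f1_1": 0.63},
    "Logistic regression": {"accuracy": 0.65, "recall_1": 0.08, "f1_1": 0.14},
    "Neural network (tuned)": {"accuracy": 0.64, "recall_1": 0.30, "f1_1": 0.37},
    "Naive Bayes": {"accuracy": 0.57, "recall_1": 0.63, "f1_1": 0.50},
}

# Rank by accuracy and, separately, by F1-score on the clicked class
by_accuracy = sorted(results, key=lambda m: results[m]["accuracy"], reverse=True)
by_f1 = sorted(results, key=lambda m: results[m]["f1_1"], reverse=True)

for name in by_accuracy:
    r = results[name]
    print(f"{name:<24} acc={r['accuracy']:.2f} recall_1={r['recall_1']:.2f} f1_1={r['f1_1']:.2f}")
```

The tuned decision tree is first under both orderings. The other three models change places depending on whether accuracy or clicked-class F1 is used.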