# The Training Session

This session will consist of the training process. It will not have too much EDA and data preprocessing parts, as I did the preprocessing inside the model to save time

I will create a collaborative filtering model that relies on users past interactions which are stored in their respective embedding.

In [82]:
import pandas as pd

In [83]:
customer_interactions = pd.read_csv('customer_interactions.csv')
purchase_history = pd.read_csv('purchase_history.csv')
product_details = pd.read_csv('product_details.csv')

In [84]:
customer_interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87229 entries, 0 to 87228
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   user_id       87229 non-null  int64  
 1   page_views    87229 non-null  int64  
 2   session_time  87229 non-null  float64
 3   session_id    87229 non-null  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 2.7+ MB


In [85]:
purchase_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276761 entries, 0 to 276760
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        276761 non-null  int64 
 1   product_id     276761 non-null  int64 
 2   session_id     276761 non-null  object
 3   purchase_date  276761 non-null  object
 4   purchased      276761 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 10.6+ MB


In [86]:
product_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52230 entries, 0 to 52229
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   product_id   52230 non-null  int64  
 1   category_id  52230 non-null  int64  
 2   price        52230 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.2 MB


In [87]:
purchase_history.purchased.value_counts()

purchased
0    269535
1      7226
Name: count, dtype: int64

I would need to handle this imbalanced label later

# Data Preparation

### Feature Engineering

First I merged all 3 tables I created in the previous notebook.

In [88]:
data = purchase_history.merge(customer_interactions)
data = data.merge(product_details)
data

Unnamed: 0,user_id,product_id,session_id,purchase_date,purchased,page_views,session_time,category_id,price
0,512749724,1000978,21db6fe3-b667-40cc-acde-63c0861c8bac,2019-10-31 08:49:59+00:00,0,7,1121.0,2053013555631882655,307.192
1,551539729,1000978,259156cd-39ce-4b52-be5c-112e82944cd7,2019-10-30 13:20:56+00:00,0,7,196.0,2053013555631882655,307.192
2,512641002,1000978,268cc02f-2c57-493e-8514-ea6ab50e3b1b,2019-10-27 03:22:27+00:00,0,2,87.0,2053013555631882655,307.192
3,551539729,1000978,40fee886-f23b-4025-be2b-286f05fbdc78,2019-10-25 14:05:42+00:00,0,16,653.0,2053013555631882655,307.192
4,529216566,1000978,5331aada-6a85-402a-bfe3-21976b6d1c2a,2019-10-22 15:30:31+00:00,0,18,381.0,2053013555631882655,307.192
...,...,...,...,...,...,...,...,...,...
276755,512694680,60500002,2c48e74e-d100-4fd3-bc03-1f73f56ad592,2019-10-27 05:29:17+00:00,0,11,1075.0,2162513070503494350,42.190
276756,512684520,60500002,958883de-b352-4219-81b9-5b819d3d7810,2019-10-30 17:23:43+00:00,0,4,85.0,2162513070503494350,42.190
276757,519178692,60500002,b48fd7f4-2225-4a87-9823-dbf33edcb250,2019-10-30 18:12:11+00:00,0,3,140.0,2162513070503494350,42.190
276758,516187142,60500002,eceb3971-a847-41b0-9600-73a847792af0,2019-10-25 15:11:18+00:00,0,30,1849.0,2162513070503494350,42.190


In [89]:
data['user_id'] = data['user_id'].astype(int)
data['product_id'] = data['product_id'].astype(int)

### Downsampling

In [91]:
data.purchased.value_counts()/len(data)

purchased
0    0.973891
1    0.026109
Name: count, dtype: float64

I tried to downsample the labels. But as you can see, the **purchased** label makes up only 2% of the dataset. To downsample, means not using most of the dataset. So I will abandon this attempt.

In [23]:
from sklearn.utils import resample

data_majority = data[data['purchased'] == 0]
data_minority = data[data['purchased'] == 1]

# Downsample majority class
data_majority_downsampled = resample(data_majority, 
                                   n_samples=len(data_minority)) # Match minority class size

# Combine downsampled majority with minority
data_downsampled = pd.concat([data_minority, data_majority_downsampled])

### Data Splitting

As a good law abiding citizen, I splitted my dataset accordingly.

In [24]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)

In [25]:
print(train.shape)
print(test.shape)

(221408, 9)
(55352, 9)


In [26]:
train.head()

Unnamed: 0,user_id,product_id,session_id,purchase_date,purchased,page_views,session_time,category_id,price
163495,515972325,12400024,a009d11f-d453-40b5-a7f5-b6c5c66c60e9,2019-10-14 14:46:00+00:00,0,1,0.0,2053013556252639687,161.888276
245858,512598759,28714357,c4cbd815-bfc3-4a33-9078-4c0b9721ef0a,2019-10-17 07:21:08+00:00,0,10,194.0,2053013565069067197,155.71
165746,555008988,12700948,1718e852-9763-4e87-afcd-84fdf38c4e7a,2019-10-01 03:13:47+00:00,0,11,236.0,2053013553559896355,125.57
90960,515697242,2702349,061b50b1-5bd0-4ecb-acc8-816e39aae784,2019-10-02 03:32:02+00:00,0,5,402.0,2053013563911439225,605.259583
74718,530834332,1307484,eada204d-8973-41a6-a030-6a6a601ceee8,2019-10-21 08:52:26+00:00,0,23,599.0,2053013558920217191,1042.24


### WandB Setup

As a good law abiding ML Engineer, I always track my training session. As most chaos happens here, where many people lost their mind finding the true best hyperparameters and architecture. All just to get 0.2% better validation results.

In [92]:
# !pip install wandb

In [93]:
!wandb login 

wandb: Currently logged in as: manfredmichael (nodeflux-internship). Use `wandb login --relogin` to force relogin


**Note:** I will include the WandB session url in the readme too

In [95]:
import wandb
import random
from wandb.keras import WandbMetricsLogger, WandbModelCheckpoint

from tensorflow.keras.optimizers import Adam

wandb.init(
    # set the wandb project where this run will be logged
    project="tech-assessment-recsyc",

    # track hyperparameters and run metadata with wandb.config
    config={
        "desc": "with session context + category embedding + class weights",
        "embedding": 16,
        "dropout": 0.1,
        "num_of_deep_layers": 4,
        "l2": 0.000
    }
)

config = wandb.config

### Model Building

Yes, I chose tensorflow to train this ranking model. It results in headache for the last 3 days, and I was traumatized for life. 

As a certified Tensorflow Developer, I hate Tensorflow.

In [96]:
import numpy as np
import tensorflow as tf
# import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

Here, I took all unique list of user & product ids, and also categories. These features will be converted into embedding inside the model.

In [98]:
n_products = data.product_id.nunique()
n_users = data.user_id.nunique()
n_category =  data.category_id.nunique()

unique_products = list(data.product_id.unique())
unique_users = list(data.user_id.unique())
unique_category = list(data.category_id.unique())

print('num of users:', n_users)
print('num of products:', n_products)
print('num of categories:', n_category)

num of users: 10000
num of products: 52229
num of categories: 593


I used wide and deep architecture that will take all embeddings. At the end of the network, I modified the wide and deep arch to also take the session context which consist of:
* page_views
* user_input
* session_time


In [99]:
from tensorflow.keras.models import Model, Sequential 
from tensorflow.keras.layers import Input, Embedding, Dot, Flatten, Dense, concatenate, Dropout
from tensorflow.keras.activations import sigmoid
from tensorflow.keras import regularizers

EMBEDDING_DIM = config.embedding

# WIDE MODEL
wide_model = Sequential([Dense(EMBEDDING_DIM * 2, use_bias='false', 
                        kernel_regularizer=regularizers.l2(config.l2)),
                         Dropout(config.dropout)])

# DEEP MODEL
deep_layers = []
for i in range(config.num_of_deep_layers):
    deep_layers += [Dense(EMBEDDING_DIM, activation='relu', use_bias='false', kernel_regularizer=regularizers.l2(config.l2)),
                    Dropout(config.dropout)]
              
deep_model = Sequential(deep_layers)

# input layers
product_input = Input(shape=[1], name='product_id')
user_input = Input(shape=[1], name='user_id')

page_views_input = Input(shape=[1], name='page_views')
session_time_input = Input(shape=[1], name='session_time')

price_input = Input(shape=[1], name='price')

category_input = Input(shape=[1], name='category_id')
category_input_int = tf.keras.layers.IntegerLookup(vocabulary=unique_category, num_oov_indices=1, mask_token=None)(category_input)
category_embedding = Embedding(n_products+1, EMBEDDING_DIM, embeddings_regularizer=regularizers.l2(config.l2))(category_input_int)

session_context = concatenate([price_input, page_views_input, session_time_input])
session_context = tf.keras.layers.BatchNormalization()(session_context)
session_context = tf.keras.layers.Dense(EMBEDDING_DIM, activation='relu', kernel_regularizer=regularizers.l2(config.l2))(session_context)

product_input_int = tf.keras.layers.IntegerLookup(vocabulary=unique_products, num_oov_indices=1, mask_token=None)(product_input)
user_input_int = tf.keras.layers.IntegerLookup(vocabulary=unique_users, num_oov_indices=1, mask_token=None)(user_input)

# # Convert strings to integers
# product_input_int = product_lookup(product_input)
# user_input_int = user_lookup(user_input)

# embedding layers 
product_embedding = Embedding(n_products+1, EMBEDDING_DIM, embeddings_regularizer=regularizers.l2(config.l2))(product_input_int)
user_embedding = Embedding(n_users+1, EMBEDDING_DIM, embeddings_regularizer=regularizers.l2(config.l2))(user_input_int)

# flatten the embeddings
product_flat = Flatten()(product_embedding)
user_flat = Flatten()(user_embedding)
category_embedding = Flatten()(category_embedding)

# wide and deep model
concat = concatenate([product_flat, user_flat, category_embedding])
wide_output = wide_model(concat)
deep_output = deep_model(concat)

# output layer
concat_2 = concatenate([wide_output, deep_output])
embeddings_output = tf.keras.layers.Dense(EMBEDDING_DIM, activation='relu',  kernel_regularizer=regularizers.l2(config.l2), use_bias='false')(concat_2)
concat_3 = concatenate([embeddings_output, session_context])
output = Dense(1, activation='sigmoid', use_bias='false')(concat_3)
#output = sigmoid(Dot(1)([wide_output, deep_output]))

# the model
model = Model([product_input, user_input, page_views_input, session_time_input, price_input, category_input], [output])

In [100]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 product_id (InputLayer)     [(None, 1)]                  0         []                            
                                                                                                  
 user_id (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 category_id (InputLayer)    [(None, 1)]                  0         []                            
                                                                                                  
 integer_lookup_4 (IntegerL  (None, 1)                    0         ['product_id[0][0]']          
 ookup)                                                                                     

As said before, I will need to handle the imbalanced label problem. So I relied on class weights to prioritize more on minotiry class.

I used accuracy, precision & recall, and AUC as metrics to monitor the training process.

In [106]:
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced', 
                                                  classes=np.unique(train.purchased),
                                                  y=train.purchased)
class_weights_dict = dict(enumerate(class_weights))

model.compile(optimizer=Adam(lr=1e-2), loss='binary_crossentropy', 
              metrics=['accuracy',  tf.keras.metrics.Precision(), tf.keras.metrics.Recall(name='recall'),
                      tf.keras.metrics.AUC(name='AUC')])




### Model Training

I also used Checkpoints to save the best models on each run. I could have used WandB artefacts to save models, but I decided to disable it as it took too long.

In [107]:
from tensorflow.keras.callbacks import ModelCheckpoint


checkpoint_callback = ModelCheckpoint(
    filepath='best_weights_epoch-{epoch:02d}-val-auc-{val_AUC:.3f}.hdf5',
    monitor='val_recall',  
    mode='max',          
    save_best_only=True,  # Save only the best model
    save_weights_only=True,  
    verbose=1            
)

Giving it more epoch results in less diverse recommendation and overfitted models. So I stick with <5 epochs.

In [108]:
history = model.fit(x=[train.product_id, train.user_id, train.page_views, train.session_time, train.price, train.category_id], y=train.purchased,
                    validation_split=0.2,
                    batch_size=8192,
                    epochs=3,
                    class_weight=class_weights_dict,
                    callbacks=[
                      WandbMetricsLogger(log_freq='batch'),
                      # WandbModelCheckpoint("models-initial"),
                      checkpoint_callback 
                    ])

Epoch 1/3




Epoch 1: val_recall improved from -inf to 0.61552, saving model to best_weights_epoch-01-val-auc-0.707.hdf5
Epoch 2/3
Epoch 2: val_recall improved from 0.61552 to 0.63228, saving model to best_weights_epoch-02-val-auc-0.711.hdf5
Epoch 3/3
Epoch 3: val_recall improved from 0.63228 to 0.63492, saving model to best_weights_epoch-03-val-auc-0.713.hdf5


### Model Evaluation

Please don't complain about the horrible precision and recall values, as I have tried my best 🥺

In [109]:
model.evaluate(x=[test.product_id, test.user_id, test.page_views, test.session_time, test.price, test.category_id], 
               y=test.purchased, batch_size=8192)



[0.520074725151062,
 0.6984933018684387,
 0.05372362211346626,
 0.6077127456665039,
 0.6972603797912598]

### Demonstrating Recommendation System

I created a function to get recommended products for a given user. It also takes page_views and session_time as the context of that session.

In [127]:
def get_product_recommendation(user_id, page_views, session_time, products=product_details, model=model):
    products = products.copy()
    user_ids = [user_id] * len(products)
    page_views = [page_views] * len(products)
    session_time = [session_time] * len(products)

    input_ = pd.DataFrame({'product_id': list(products.product_id), 
                           'user_id': user_ids,
                           'page_views': page_views,
                           'session_time': session_time,
                           'price': list(products.price),
                           'category_id': list(products.category_id)})
    
    results = model([input_['product_id'], 
                     input_['user_id'],
                     input_['page_views'].values.reshape(-1, 1),
                     input_['session_time'].values.reshape(-1, 1),
                     input_['price'].values.reshape(-1, 1),
                     input_['category_id']
                    ]).numpy().reshape(-1)
    # results = model([products, user_ids]).numpy().reshape(-1)
    
    products['purchase_proba'] = pd.Series(results, index=products.index)
    products = products.sort_values('purchase_proba', ascending=False)
    
    return products

Here, you can change the arg values and see different results.

In [128]:
get_product_recommendation(513411503, 
                   page_views=10, 
                   session_time=714.0)

Unnamed: 0,product_id,category_id,price,purchase_proba
916,1005144,2053013555631882655,1720.407530,0.786002
907,1005135,2053013555631882655,1737.485376,0.784788
47129,33900039,2059484602216481097,2573.810000,0.755602
40109,27700639,2053013560086233771,2571.150000,0.750894
47148,33900067,2059484602216481097,2573.810000,0.750641
...,...,...,...,...
33140,22700324,2053013556168753601,22.279091,0.189495
9041,5100833,2053013553375346967,47.687910,0.189279
8349,4804137,2053013554658804075,105.215313,0.182693
42876,28715756,2053013565069067197,73.484490,0.180599


In [129]:
get_product_recommendation(512525276, 
                   page_views=3, 
                   session_time=182.0)

Unnamed: 0,product_id,category_id,price,purchase_proba
916,1005144,2053013555631882655,1720.407530,0.772818
907,1005135,2053013555631882655,1737.485376,0.770175
847,1005074,2053013555631882655,1188.910128,0.731957
47129,33900039,2059484602216481097,2573.810000,0.727920
912,1005140,2053013555631882655,1524.516667,0.727900
...,...,...,...,...
5850,3601505,2053013563810775923,479.102000,0.198451
5721,3601212,2053013563810775923,409.266082,0.196772
42876,28715756,2053013565069067197,73.484490,0.193689
8349,4804137,2053013554658804075,105.215313,0.193333


### Getting User Recommendations

So I thought of an idea. If a product was on sale with discounts/cashback, and it wants to promote to a number of users. Which users should get the notifications? So this function will predict which users are more likely to buy the given product.

Create user_details which take 1 row of every unique user_id

In [130]:
user_details = customer_interactions.loc[customer_interactions.user_id.drop_duplicates().index]

In [131]:
# Find which users are likely to buy which product
def get_user_recommendation(product_id, price, category_id, users=user_details, model=model):
    users = users.copy()
    product_ids = [product_id] * len(users)
    prices = [price] * len(users)
    category_ids = [category_id] * len(users)

    
    input_ = pd.DataFrame({'product_id': product_ids, 
                           'user_id': list(users.user_id),
                           'page_views': list(users.page_views),
                           'session_time': list(users.session_time),
                           'price': prices,
                           'category_id': category_ids})
    
    results = model([input_['product_id'], 
                     input_['user_id'],
                     input_['page_views'].values.reshape(-1, 1),
                     input_['session_time'].values.reshape(-1, 1),
                     input_['price'].values.reshape(-1, 1),
                     input_['category_id']
                    ]).numpy().reshape(-1)
    
    users['purchase_proba'] = pd.Series(results, index=users.index)
    users = users.sort_values('purchase_proba', ascending=False)
    
    return users

In [132]:
get_user_recommendation(product_id=12400024,
                        price=161,
                        category_id=2053013556252639687).head(5)

Unnamed: 0,user_id,page_views,session_time,session_id,purchase_proba
790,547466368,4,117.0,02512329-da54-4a4d-a0db-bd25bde50e31,0.502976
295,550472384,3,387.0,00e09b60-6b60-46a4-9172-f2825719dc47,0.482903
54708,538546070,2,48.0,a0673da0-17f3-4de7-a3dc-b46cc90d244a,0.480648
7485,512983498,2,31.0,15f425fb-9d08-4b2e-9c34-b5d81051b1a8,0.467882
61237,531119538,1,0.0,b3b77859-0387-4ee2-85a5-ea8650c3aedf,0.466564


In [133]:
get_user_recommendation(product_id=12400024,
                        price=10,
                        category_id=2086471240800797129).head(5)

Unnamed: 0,user_id,page_views,session_time,session_id,purchase_proba
790,547466368,4,117.0,02512329-da54-4a4d-a0db-bd25bde50e31,0.4645
54708,538546070,2,48.0,a0673da0-17f3-4de7-a3dc-b46cc90d244a,0.450992
295,550472384,3,387.0,00e09b60-6b60-46a4-9172-f2825719dc47,0.445602
61237,531119538,1,0.0,b3b77859-0387-4ee2-85a5-ea8650c3aedf,0.440557
36653,555465562,1,0.0,6b16ca30-c3bd-4324-ae27-dbb8a1448317,0.440144


# Save and Load Model

In [124]:
# model.save_weights('final_model.h5')

In [125]:
model_load = Model([product_input, user_input, page_views_input, session_time_input, price_input, category_input], [output])
model_load.load_weights('final_model.h5')

In [126]:
get_recommendation(513411503, 
                   page_views=10, 
                   session_time=714.0,
                   model=model_load)

Unnamed: 0,product_id,category_id,price,purchase_proba
916,1005144,2053013555631882655,1720.407530,0.786002
907,1005135,2053013555631882655,1737.485376,0.784788
47129,33900039,2059484602216481097,2573.810000,0.755602
40109,27700639,2053013560086233771,2571.150000,0.750894
47148,33900067,2059484602216481097,2573.810000,0.750641
...,...,...,...,...
33140,22700324,2053013556168753601,22.279091,0.189495
9041,5100833,2053013553375346967,47.687910,0.189279
8349,4804137,2053013554658804075,105.215313,0.182693
42876,28715756,2053013565069067197,73.484490,0.180599


In [385]:
get_recommendation(550050854, model=model_load).head(10)

Unnamed: 0,product_id,category_id,price,purchase_proba
6904,4100346,2053013561218695907,392.80965,0.218746
4038,2601810,2053013563970159485,155.259458,0.213689
703,1004863,2053013555631882655,173.004043,0.194762
378,1004246,2053013555631882655,743.151695,0.171427
563,1004659,2053013555631882655,738.926172,0.168789
700,1004856,2053013555631882655,131.142246,0.167703
4750,2800476,2053013563835941749,172.08,0.167442
20240,12706258,2053013553559896355,60.23,0.160695
381,1004249,2053013555631882655,739.568116,0.12633
684,1004838,2053013555631882655,174.748446,0.120576


In [386]:
get_recommendation(0, model=model_load)

Unnamed: 0,product_id,category_id,price,purchase_proba
847,1005074,2053013555631882655,1188.910128,0.174015
916,1005144,2053013555631882655,1720.407530,0.170057
22811,13103656,2053013553526341921,270.280000,0.164640
41300,28300432,2053013566243472379,46.330000,0.131460
19296,12700716,2053013553559896355,40.930000,0.122681
...,...,...,...,...
787,1004990,2053013555631882655,253.171459,0.001422
8883,5100551,2053013553341792533,166.080331,0.001417
12705,7005394,2053013560346280633,163.883901,0.001398
757,1004946,2053013555631882655,738.895365,0.001359
