# Week 3: Session-based recommendations using RNNs

REFERENCE: https://github.com/mquad/sars_tutorial/blob/master/06_PersonalizedRNN.ipynb

The two datasets below can be used for code example 1 and/or 2. Code example 1 is an implementation of the paper discussed in the LML meeting. Here, u_id can be ignored, and the next job_listing_id is predicted solely from the sequence of job_listing_ids in each session. Code example 2 is also based on the Hidasi et al. papers from 2016 and 2018, but it can use user-specific session data. It's up to the coders to pick either implementation. Enjoy!
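To make the session-only setup of code example 1 concrete, here is a minimal sketch of how a click log could be turned into (current item, next item) pairs. The function name `next_item_pairs` and the tuple layout are assumptions for illustration; GRU4Rec builds its mini-batches internally and does not need this step.

```python
from collections import defaultdict

def next_item_pairs(events):
    """Build (current item, next item) pairs per session.

    `events` is an iterable of (session_id, item_id, timestamp) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    sessions = defaultdict(list)
    for session_id, item_id, ts in events:
        sessions[session_id].append((ts, item_id))

    pairs = []
    for clicks in sessions.values():
        clicks.sort(key=lambda c: c[0])  # order each session by time
        items = [item for _, item in clicks]
        # Each item is paired with the item that follows it in the same session.
        pairs.extend(zip(items[:-1], items[1:]))
    return pairs

# Toy data: two interleaved sessions, no user information needed.
events = [
    ("s1", 10, 1.0), ("s1", 11, 2.0), ("s2", 20, 1.0),
    ("s1", 12, 3.0), ("s2", 21, 2.0),
]
pairs = next_item_pairs(events)
print(pairs)  # [(10, 11), (11, 12), (20, 21)]
```

Pairs never cross session boundaries, which is what "merely based on job_listing_ids per session" amounts to.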

Training Dataset: searchml.user_jl_sessions_May2019_train_data
The training data contains user sessions with at least 5 applystarts in the month of May. There are 67181 unique users with at least 5 applystarts per session, and 646155 rows in total.
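A filter like the one described above could be sketched in pandas as follows. The column name `session_id` and the toy rows are assumptions; the actual query that produced the table is not shown in this notebook.

```python
import pandas as pd

MIN_EVENTS = 5  # threshold stated in the dataset description

# Toy event log; `session_id` is a hypothetical column name.
events = pd.DataFrame({
    "u_id": ["a"] * 6 + ["b"] * 3,
    "session_id": ["s1"] * 6 + ["s2"] * 3,
    "job_listing_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Attach the size of its session to every row, then keep rows from
# sessions that reach the threshold.
session_sizes = events.groupby("session_id")["job_listing_id"].transform("size")
kept = events[session_sizes >= MIN_EVENTS]

# Unique users and total rows after filtering.
print(kept["u_id"].nunique(), len(kept))
```

On the toy data only session `s1` (6 events) survives, so the sketch reports 1 user and 6 rows.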

Test Dataset: searchml.user_jl_sessions_June1_5_2019_test_data
The test set covers June 1-5, 2019, and is filtered to the same u_ids as the training data. There are 8144 distinct users and 47986 rows.
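Restricting the test set to known users could look like the sketch below. The frames are toy data. The second filter (keeping only items seen in training) is an additional assumption that is not part of the dataset description; it is shown because a model with a fixed output layer can only rank items it was trained on.

```python
import pandas as pd

# Toy stand-ins for the train and test tables.
train = pd.DataFrame({"u_id": ["a", "a", "b"], "job_listing_id": [1, 2, 3]})
test = pd.DataFrame({"u_id": ["a", "c", "b", "c"], "job_listing_id": [2, 9, 4, 1]})

# Keep only test rows whose user also appears in the training data.
test_known_users = test[test["u_id"].isin(train["u_id"].unique())]

# Optional extra step (assumption, not described in the text):
# drop test rows whose item never occurs in training.
test_known_items = test_known_users[
    test_known_users["job_listing_id"].isin(train["job_listing_id"].unique())
]

print(len(test_known_users), len(test_known_items))
```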


In [1]:
import sys
from collections import OrderedDict

import numpy as np
import pandas as pd
import theano
from theano import tensor as T
from theano import function
from theano.sandbox.rng_mrg import MRG_RandomStreams

# Extend the module search path before importing the GRU4Rec modules.
sys.path.append('..')

import gru4rec
import evaluation
import baselines
import gpu_ops
from gpu_ops import gpu_diag_wide

mrng = MRG_RandomStreams()


In [6]:
# DO ONCE ONLY
PATH_TO_DATA = '/Users/malathi.sankar/recommendations/week3/'
data_tr = pd.read_csv(PATH_TO_DATA + 'sessions_train.csv', sep=',', usecols=[0,3,4], 
                     dtype={'job_listing_id': np.int64, 'ts': np.float64})
data_tr.columns = ["SessionId", "ItemId", "Time"]
data_tr.to_csv(PATH_TO_DATA + 'sessions_train_trimmed.csv', sep=',', index=False)

data_test = pd.read_csv(PATH_TO_DATA + 'sessions_test.csv', sep=',', usecols=[0,3,4],
                        dtype={'job_listing_id': np.int64, 'ts': np.float64})
data_test.columns = ["SessionId", "ItemId", "Time"]
data_test.to_csv(PATH_TO_DATA + 'sessions_test_trimmed.csv', sep=',', index=False)


In [2]:

PATH_TO_TRAIN = '/Users/malathi.sankar/recommendations/week3/sessions_train_trimmed.csv'
PATH_TO_TEST = '/Users/malathi.sankar/recommendations/week3/sessions_test_trimmed.csv'
data = pd.read_csv(PATH_TO_TRAIN, sep=',', dtype={'ItemId': np.int64})
valid = pd.read_csv(PATH_TO_TEST, sep=',', dtype={'ItemId': np.int64})
  

In [11]:
data.head(13)

                           SessionId      ItemId          Time
0   00000F4D8B096C00DAFAA17B649D1DDC  3245706208  1.558992e+09
1   00000F4D8B096C00DAFAA17B649D1DDC  2900724925  1.558992e+09
2   00000F4D8B096C00DAFAA17B649D1DDC  3194578784  1.558994e+09
3   00000F4D8B096C00DAFAA17B649D1DDC  3231890821  1.558994e+09
4   00000F4D8B096C00DAFAA17B649D1DDC  3151435214  1.558994e+09
5   00000F4D8B096C00DAFAA17B649D1DDC  2805386421  1.558994e+09
6   00000F4D8B096C00DAFAA17B649D1DDC  3207415566  1.558994e+09
7   00000F4D8B096C00DAFAA17B649D1DDC  3194578784  1.558994e+09
8   0000CE301774342EEBB3F120844BA114  3153598402  1.558610e+09
9   0000CE301774342EEBB3F120844BA114  3236302370  1.558610e+09
10  0000CE301774342EEBB3F120844BA114  3098496172  1.558610e+09
11  0000CE301774342EEBB3F120844BA114  3198740851  1.558611e+09
12  0000CE301774342EEBB3F120844BA114  3237934285  1.558612e+09

In [8]:
data.shape

(646155, 3)

In [9]:
valid.shape

(47986, 3)

In [10]:
# State-of-the-art results on RSC15 from "Recurrent Neural Networks with Top-k Gains for
# Session-based Recommendations" (http://arxiv.org/abs/1706.03847)
# BPR-max, no embedding (RSC15: R@20 = 0.7197, M@20 = 0.3157)
gru = gru4rec.GRU4Rec(loss='bpr-max', final_act='elu-0.5', hidden_act='tanh', layers=[100], adapt='adagrad',
                      n_epochs=10, batch_size=32, dropout_p_embed=0, dropout_p_hidden=0, learning_rate=0.2,
                      momentum=0.3, n_sample=2048, sample_alpha=0, bpreg=1, constrained_embedding=False)
gru.fit(data)
res = evaluation.evaluate_gpu(gru, valid)
print('Recall@20: {}'.format(res[0]))
print('MRR@20: {}'.format(res[1]))


Epoch0	loss: 0.684163
Epoch1	loss: 0.478654
Epoch2	loss: 0.288830
Epoch3	loss: 0.199518
Epoch4	loss: 0.153353
Epoch5	loss: 0.128884
Epoch6	loss: 0.115895
Epoch7	loss: 0.108556
Epoch8	loss: 0.103890
Epoch9	loss: 0.100833
Measuring Recall@20 and MRR@20
Recall@20: 0.08554857419043016
MRR@20: 0.031932985925742785


In [12]:
# BPR-max, constrained embedding (RSC15: R@20 = 0.7261, M@20 = 0.3124)
gru = gru4rec.GRU4Rec(loss='bpr-max', final_act='elu-0.5', hidden_act='tanh', layers=[100], adapt='adagrad',
                      n_epochs=10, batch_size=32, dropout_p_embed=0, dropout_p_hidden=0, learning_rate=0.2,
                      momentum=0.1, n_sample=2048, sample_alpha=0, bpreg=0.5, constrained_embedding=True)
gru.fit(data)
res = evaluation.evaluate_gpu(gru, valid)
print('Recall@20: {}'.format(res[0]))
print('MRR@20: {}'.format(res[1]))


Epoch0	loss: 0.691879
Epoch1	loss: 0.435929
Epoch2	loss: 0.176247
Epoch3	loss: 0.107964
Epoch4	loss: 0.093723
Epoch5	loss: 0.087552
Epoch6	loss: 0.083689
Epoch7	loss: 0.080798
Epoch8	loss: 0.078662
Epoch9	loss: 0.077070
Measuring Recall@20 and MRR@20
Recall@20: 0.10662155630739488
MRR@20: 0.04241912934541998


In [3]:

# Cross-entropy (RSC15: R@20 = 0.7180, M@20 = 0.3087)
gru = gru4rec.GRU4Rec(loss='cross-entropy', final_act='softmax', hidden_act='tanh', layers=[100], adapt='adagrad',
                      n_epochs=10, batch_size=32, dropout_p_embed=0, dropout_p_hidden=0.3, learning_rate=0.1,
                      momentum=0.7, n_sample=2048, sample_alpha=0, bpreg=0, constrained_embedding=False)
gru.fit(data)
res = evaluation.evaluate_gpu(gru, valid)
print('Recall@20: {}'.format(res[0]))
print('MRR@20: {}'.format(res[1]))


Epoch0	loss: 9.139716
Epoch1	loss: 7.656699
Epoch2	loss: 4.575075
Epoch3	loss: 2.122007
Epoch4	loss: 0.624115
Epoch5	loss: 0.213050
Epoch6	loss: 0.134199
Epoch7	loss: 0.107137
Epoch8	loss: 0.094181
Epoch9	loss: 0.090308
Measuring Recall@20 and MRR@20
Recall@20: 0.051425809569840504
MRR@20: 0.024179165527695435


Notes for MS:
MRR = mean reciprocal rank
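
For reference, Recall@k is the fraction of test events whose true next item appears in the top k recommendations, and MRR@k averages 1/rank of the true item (counting 0 when it falls outside the top k). The sketch below is a plain-Python illustration of those definitions. The function name is hypothetical, and `evaluation.evaluate_gpu` may differ in details such as tie handling.

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """Compute Recall@k and MRR@k.

    ranked_lists: one list of recommended item ids per test event, best first.
    targets: the true next item id for each test event.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            rank = top_k.index(target) + 1  # ranks are 1-based
            hits += 1
            reciprocal_rank_sum += 1.0 / rank
    n = len(targets)
    return hits / n, reciprocal_rank_sum / n

# Toy example with k=2: the targets sit at rank 2, rank 1, and outside the list.
ranked_lists = [[5, 3, 9], [7, 8, 2], [1, 4, 6]]
targets = [3, 7, 0]
recall, mrr = recall_and_mrr_at_k(ranked_lists, targets, k=2)
print(recall, mrr)  # recall = 2/3, mrr = (1/2 + 1 + 0) / 3 = 0.5
```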