# RNNs for next product prediction

The goal of this notebook is to look at a simple implementation of an RNN for next product prediction.
It might seem like an obvious use case, but it wasn't very popular until the 2015 paper [Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939).

We will be using a sample of the [RecSys 2015 challenge yoochoose dataset](https://2015.recsyschallenge.com/challenge.html), which contains user click sessions of various lengths. The task is to predict, at each timestep, the next product the user will click.

First the imports.

In [None]:
import collections
import itertools
import sys
from typing import List

import numpy as np  # used below for np.array and np.expand_dims
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM, Masking, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

We now need to load the data from a URL. We will load several sessions, where a session is a succession of clicks on products, basically a shopping session.

In [None]:
def load_df_from_url(path):
    sessions_df = pd.read_csv(path, sep=",", header=None)
    sessions_df.columns = ["session_id", "timestamp", "item_id", "category"]
    sessions_df["timestamp"] = pd.to_datetime(sessions_df["timestamp"])
    return sessions_df

In [None]:
data_url = "https://www.dropbox.com/s/urf0v28umc7afg2/yoochoose-clicks-sample.dat?dl=1"

In [None]:
sessions_df = load_df_from_url(data_url)

In [None]:
sessions_df.head()

Now we will build the actual sessions, which are just lists of products. We will group rows by session_id, replace product ids with their indices (between 1 and product_vocab_size; we will see later why we are 1-indexed), keep only the most common products, and drop sessions that are too short.

In [None]:
def build_sessions(sessions_df: pd.DataFrame,
                   max_products: int = 1000,
                   min_session_size: int = 3) -> tuple:
    """Return (sessions, most_common_items): index-encoded sessions and the kept item counts."""
    print("Session Dataframe length ", len(sessions_df))
    
    all_items = sessions_df["item_id"].values
    items_counter = collections.Counter(all_items)
    most_common_items = dict(items_counter.most_common(max_products))
    ids_to_indices = dict((item_id, i+1) for i, item_id in enumerate(most_common_items.keys()))
    
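    # itertools.groupby only groups consecutive rows, so sort by session first
    session_dicts = sessions_df.sort_values("session_id", kind="stable").to_dict(orient='records')
    grouped_sessions = itertools.groupby(session_dicts, lambda d: d["session_id"])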
    sessions = []
    for _, session in grouped_sessions:
        item_list = [d["item_id"] for d in sorted(list(session), key=lambda x: x["timestamp"])]
        item_list = [ids_to_indices[item] for item in item_list if item in ids_to_indices]
        if len(item_list) >= min_session_size:
            sessions.append(item_list)
    
    print("Sessions count ", len(sessions))
    
    return sessions, most_common_items

In [None]:
sessions, most_common_items = build_sessions(sessions_df)
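
To make the 1-based indexing concrete, here is a minimal sketch on made-up item ids. The ids and the `toy_` names are illustrative and not taken from the dataset; the steps mirror the counting and filtering done in `build_sessions`.

```python
import collections

# Hypothetical item ids, flattened over all sessions.
toy_items = [214, 214, 305, 214, 305, 999]

# Keep only the 2 most frequent items; the rare item 999 is dropped.
toy_counter = collections.Counter(toy_items)
toy_most_common = dict(toy_counter.most_common(2))

# Indices start at 1 so that 0 stays free for padding.
toy_ids_to_indices = {item_id: i + 1 for i, item_id in enumerate(toy_most_common)}

# A session keeps only the items that survived the filter.
toy_session = [305, 999, 214]
toy_encoded = [toy_ids_to_indices[item] for item in toy_session if item in toy_ids_to_indices]

print(toy_ids_to_indices)
print(toy_encoded)
```

Note that filtering can shorten a session, which is why the minimum session size is checked after the rare products are removed.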

We can't have sessions of different lengths in the same tensor (or numpy array), so we need to fix a session length: longer sessions are truncated to this length and shorter ones are padded with a dummy value. We will use 0 as the dummy value, which is why we adopted 1-based indexing previously.

In [None]:
max_session_length = 50
padded_sessions = pad_sequences(sessions, 
                                maxlen=max_session_length, 
                                padding='post', 
                                truncating='pre', 
                                value=0)
padded_sessions = np.array(padded_sessions)
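
To see what `padding='post'` and `truncating='pre'` do, here is a small pure-Python sketch of the same behaviour on toy sessions. The helper `pad_or_truncate` is hypothetical and only meant for illustration; it is not part of Keras.

```python
def pad_or_truncate(session, maxlen, value=0):
    """Toy helper (assumes maxlen >= 1): truncate from the start, pad at the end."""
    kept = session[-maxlen:]                       # like truncating='pre': keep the last maxlen items
    return kept + [value] * (maxlen - len(kept))   # like padding='post': fill the end with the dummy value

toy_sessions = [[3, 1, 2], [5, 4, 3, 2, 1, 6]]
toy_padded = [pad_or_truncate(s, maxlen=4) for s in toy_sessions]

for row in toy_padded:
    print(row)
```

The short session gets trailing zeros, while the long session loses its oldest clicks and keeps the most recent ones.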

Now let's create a very simple LSTM model, where we will first embed the products in a lower-dimensional space and then use the output of the LSTM at each timestep to predict the next product in the sequence.

In [None]:
vocab_size = len(most_common_items) + 1
embedding_size = 20
input_length = max_session_length - 1

model = Sequential()
# Embedding(mask_zero=True) already masks the padding index 0,
# so a separate Masking layer on the integer ids is not needed.
model.add(Input(shape=(input_length,)))
model.add(Embedding(vocab_size, embedding_size, mask_zero=True))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.summary()
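
As a sanity check on the model summary, the parameter counts can be worked out by hand. The sketch below assumes a full vocabulary, i.e. 1001 indices (the default `max_products=1000` plus the padding index); the actual value depends on the data.

```python
assumed_vocab_size = 1001  # assumption: 1000 products + padding index 0
embedding_size = 20
lstm_units = 100

# Embedding: one vector of size embedding_size per index.
embedding_params = assumed_vocab_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)

# Dense softmax layer: the same weights are reused at every timestep.
dense_params = lstm_units * assumed_vocab_size + assumed_vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under this assumption most of the parameters sit in the output layer, because it has one row of weights per product.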

We will implement a custom metric here, as we want the accuracy over the whole sequence while masking the dummy value 0.

In [None]:
def categorical_accuracy_sequential(y_true, y_pred):
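    # Drop only the trailing singleton axis and match the dtype of the predictions
    y_true = tf.cast(tf.squeeze(y_true, axis=-1), tf.float32)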
    padding_mask = tf.greater(y_true, 0)
    
    y_pred = tf.argmax(y_pred, axis=-1)
    y_pred = tf.cast(y_pred, tf.float32)

    match = tf.cast(tf.equal(y_true, y_pred), tf.float32)

    match_masked = match * tf.cast(padding_mask, tf.float32)
    return tf.reduce_sum(match_masked) / tf.reduce_sum(tf.cast(padding_mask, tf.float32))
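
The same computation can be followed by hand on a toy batch. This is a plain-Python sketch with made-up targets and predictions, not the TensorFlow metric itself.

```python
# Toy targets for 2 sessions of length 4; 0 marks padding.
toy_y_true = [[2, 3, 0, 0],
              [1, 1, 4, 0]]
# Hypothetical predicted classes (already arg-maxed over the vocabulary).
toy_y_pred = [[2, 1, 5, 5],
              [1, 1, 4, 2]]

matches = 0
real_steps = 0
for true_row, pred_row in zip(toy_y_true, toy_y_pred):
    for t, p in zip(true_row, pred_row):
        if t > 0:                 # skip padded positions
            real_steps += 1
            matches += int(t == p)

masked_accuracy = matches / real_steps
print(matches, real_steps, masked_accuracy)
```

Only the 5 non-padded positions count, so 4 correct predictions give 0.8. Dividing by all 8 positions would instead give 0.5, which is why the mask matters.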

In [None]:
X = padded_sessions[:, :-1]
y = np.expand_dims(padded_sessions[:, 1:], -1)
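
The slicing above shifts each session by one step, so that the input at each timestep is paired with the product clicked next. A minimal sketch on one made-up padded session (plain lists, illustrative names):

```python
toy_padded = [[7, 3, 9, 0, 0]]

toy_X = [row[:-1] for row in toy_padded]  # everything except the last position
toy_y = [row[1:] for row in toy_padded]   # everything except the first position

# Each (input, target) pair reads "after clicking input, the user clicked target".
toy_pairs = list(zip(toy_X[0], toy_y[0]))
print(toy_pairs)
```

Pairs whose target is 0 correspond to padding and are ignored by the custom metric.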

In [None]:
model.compile("adam", loss="sparse_categorical_crossentropy", metrics=[categorical_accuracy_sequential])
model.fit(x=X, y=y, validation_split=0.1, batch_size=32)

### Questions

* Q1: Try increasing the max sequence length (from 50 to 100). What is the impact on performance? How about the maximum number of unique products?
* Q2: Let's make the model bigger! Try adding a second LSTM layer (or more).
* Q3: Compare the performance of the model to a naive model that always predicts the current product as the next product (in other words, y_pred = X).