# Creating a recommender system with RNNs

We will observe the history of items that customers have bought from our online store. Our objective is to predict the next item that a customer will buy, given their purchase history.

**Question 1**: Assume you can accurately predict what customers are going to buy when they visit our store. How can you use this information to improve the profits of our online store?

With this information, an online store can improve profits in several ways, for example by offering personalized product recommendations, implementing dynamic pricing strategies, and managing inventory more efficiently. Because this assignment predicts a customer's next purchase from their history, recommendations and dynamic pricing can be targeted at the item a customer is predicted to buy next. For example, if a customer bought a pregnancy test, a logical next recommendation would be baby supplies, and deals on baby supply bundles could then be offered.

In [None]:
import pandas as pd
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding, Conv1D, BatchNormalization, GRU, Dense
from sklearn.model_selection import train_test_split

## 1. The data

Each entry in the dataset corresponds to a combination of customer and item bought. Per customer, items are arranged based on the visits to our online store (e.g., customer 17850 first bought item 0, then item 399, then item 505, etc.)

In [None]:
df = pd.read_csv('https://www.dropbox.com/s/4kicl5okwlmst5i/online_retail.csv?dl=1')
df.head(10)

We will later use zero-padding to get sequences of equal length. Hence, we should avoid items with name "0" and instead shift all items by 1:

In [None]:
df['StockCode'] = df['StockCode'] + 1

In [None]:
number_items = len(df['StockCode'].unique())
number_items

In [None]:
len(df)

We convert the dataframe into a list of sequences, where each customer corresponds to one sequence of items bought.

In [None]:
sequences = []
for customer in df['CustomerID'].unique():
    temp = df[df['CustomerID'] == customer]
    sequences.append(temp['StockCode'].tolist())
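
The loop above filters the dataframe once per customer. A possible alternative, sketched here on a toy dataframe with made-up values (the name `toy` and its contents are assumptions for illustration, not the real data), is a single `groupby`. With `sort=False`, customers should come out in order of first appearance, just like with `unique()`, and the within-customer order of rows is preserved.

```python
import pandas as pd

# Toy purchase log (hypothetical values), already ordered by visit
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 13047, 17850, 13047],
    "StockCode": [1, 400, 7, 506, 8],
})

# One list of items per customer; sort=False keeps first-appearance order
toy_sequences = (
    toy.groupby("CustomerID", sort=False)["StockCode"]
    .apply(list)
    .tolist()
)
print(toy_sequences)  # [[1, 400, 506], [7, 8]]
```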

Some sequences are much longer than others, so we will only consider sequences of a certain length. In particular, we pick here approximately the 90% quantile to cut off sequences of purchases.

In [None]:
np.quantile([len(seq) for seq in sequences],0.9)

In [None]:
max_length = 160
sequences = [seq[:max_length] for seq in sequences]  # slicing past the end is safe
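
To make the cut-off step concrete, here is a minimal sketch on toy sequences (the data and the 50% quantile are assumptions chosen so the numbers are easy to follow; the notebook itself uses roughly the 90% quantile). Python slicing never runs past the end of a list, so short sequences are left unchanged.

```python
import numpy as np

# Five toy sequences with lengths 1 to 5 (hypothetical data)
toy_sequences = [[1], [2, 3], [4, 5, 6], [7, 8, 9, 10], [11, 12, 13, 14, 15]]
lengths = [len(seq) for seq in toy_sequences]

# Use the median length as cut-off for this toy example
cutoff = int(np.quantile(lengths, 0.5))

# Keep at most `cutoff` items per sequence
truncated = [seq[:cutoff] for seq in toy_sequences]
print(cutoff)     # 3
print(truncated)  # [[1], [2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13]]
```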

We also add "padding" to make all sequences the same length. To train our model, every sequence within a mini-batch has to have the same length; since we don't want to spend effort on splitting the data into mini-batches of similar length, we simply pad everything to a common length. Note that we will later need to tell our algorithm to ignore padded values when computing the loss.

In [None]:
sequences[10]

In [None]:
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="pre")

In [None]:
sequences[10]

In [None]:
len(sequences[10])
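
The effect of "pre" padding can be sketched in plain NumPy. The helper `pad_pre` below is a hypothetical stand-in written for illustration, not the Keras implementation; it left-pads with zeros so that the most recent purchases stay at the end of each row. The boolean mask shows which positions are real items, which is the information a model needs in order to skip padded values.

```python
import numpy as np

def pad_pre(seqs, value=0):
    """Left-pad each list with `value` up to the longest length (illustrative helper)."""
    width = max(len(seq) for seq in seqs)
    out = np.full((len(seqs), width), value, dtype=np.int64)
    for row, seq in enumerate(seqs):
        if seq:
            out[row, width - len(seq):] = seq
    return out

padded = pad_pre([[5, 3], [7, 2, 9, 4], [1]])
mask = padded != 0  # True where a real item is present

print(padded)
# [[0 0 5 3]
#  [7 2 9 4]
#  [0 0 0 1]]
print(mask.sum(axis=1))  # [2 4 1]
```

This is also why item codes were shifted by 1 earlier: with 0 reserved for padding, the mask cannot confuse a real item with a padded position.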

Next, we split the data into training, validation, and testing (randomly):

In [None]:
X_train, X_other = train_test_split(sequences,train_size=0.8)
X_valid, X_test = train_test_split(X_other,train_size=0.5)
print(X_train.shape)

Finally, we build our target (y) based on a sequence-to-sequence approach. That is, for each sequence of inputs, we predict a sequence of outputs.

**Question 2**: Below, we construct the variables `y_train`, `y_valid`, and `y_test`. Describe why we build the `y` variables in this way and why we also need to modify `X_train`, `X_valid`, and `X_test`.

To create `y_train`, `y_valid`, and `y_test`, we shift each sequence by one time step: `[:,1:]` drops the first element, so the target at position t is the item bought at position t+1. This matches the sequence-to-sequence approach, in which the model predicts the next item at every time step and not only at the end of the sequence. For `X_train`, `X_valid`, and `X_test`, we remove the last element (`[:,:-1]`) because there is no following item that could serve as its target; this also ensures that inputs and targets have the same length.

In [None]:
y_train = X_train[:,1:]
y_valid = X_valid[:,1:]
y_test = X_test[:,1:]

X_train = X_train[:,:-1]
X_valid = X_valid[:,:-1]
X_test = X_test[:,:-1]
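
The shift can be checked on a small toy array (made-up values, with two leading zeros standing in for padding). Each target row is the input row moved one step to the left, so the target at position t equals the input at position t+1.

```python
import numpy as np

# Two toy padded sequences (hypothetical item codes)
toy = np.array([[0, 0, 5, 3, 8],
                [7, 2, 9, 4, 6]])

X_toy = toy[:, :-1]  # everything except the last item
y_toy = toy[:, 1:]   # everything except the first item

print(X_toy)
# [[0 0 5 3]
#  [7 2 9 4]]
print(y_toy)
# [[0 5 3 8]
#  [2 9 4 6]]
```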

Check your sequence lengths:

In [None]:
print(X_train.shape)
print(y_train.shape)

## 2. Building a model

We now build a model that takes as input a sequence of orders by one customer and outputs the predictions for the next time step. Instead of directly using our sequences as inputs to a recurrent layer, we will use an `Embedding` layer.

**Question 3**: In your own words, describe what (word) embeddings do, and why we use them in deep learning. A good resource is the accompanying book "Deep Learning with Python" (2nd edition) by Francois Chollet, available online through the City University Library. You might want to check the part "Understanding Word Embeddings" within Chapter 11.3.3.

From my understanding, embeddings encode the meaning of words, or in our case stock codes, as dense vectors of relatively low dimensionality. They do essentially the same job as one-hot encoding categorical variables, except with a fixed, much lower dimensionality (which we choose to be 6 for this model) and with values that are learned. During training, the model adjusts these embeddings to improve its performance. In doing so, it tends to place similar products (stock codes) closer together in the embedding space, which allows the model to capture relationships between these codes that help predict the sequence.
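
The link to one-hot encoding can be illustrated with a minimal NumPy sketch. The matrix `E` below is a random stand-in for a learned embedding table and the sizes are assumptions for illustration. Looking up row `i` of the table gives the same result as multiplying the one-hot vector for item `i` by the table, which is why an embedding layer can be read as a one-hot encoding followed by a learned linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 6                    # toy sizes (assumed)
E = rng.normal(size=(vocab_size, embed_dim))     # stand-in for a learned table

codes = np.array([3, 7, 3])                      # a short sequence of item codes

looked_up = E[codes]                             # direct row lookup
one_hot = np.eye(vocab_size)[codes]              # shape (3, 10)
via_matmul = one_hot @ E                         # same vectors via matrix product

print(looked_up.shape)                           # (3, 6)
print(np.allclose(looked_up, via_matmul))        # True
```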

In [None]:
model = tf.keras.Sequential([
    Embedding(input_dim=number_items+1, output_dim=6, input_shape=[None], mask_zero=True),
    Conv1D(32, kernel_size=2, padding="causal", activation="relu"),
    BatchNormalization(),
    GRU(64, return_sequences=True, dropout = 0.2),
    BatchNormalization(),
    Dense(number_items+1, activation="softmax")
])
model.summary()

We want to add our own metric, to capture how well we're doing on the last prediction (that's the only one that matters after all). In particular, we will see whether the product the customer bought is within the 5 products we gave the highest probability in our prediction.

**Question 4**: Define a function `last_time_step_top_5` that takes the inputs `y_true` and `y_pred` and computes the `tf.keras.metrics.sparse_top_k_categorical_accuracy` between `y_true` and `y_pred` *for the last entry of each sequence*. Note that `sparse_top_k_categorical_accuracy` (see the [documentation here](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/sparse_top_k_categorical_accuracy)) takes as input the (modified) `y_true` and `y_pred`, as well as a value `k`.

In [None]:
def last_time_step_top_5(y_true, y_pred):
    # Axis 0 is the batch and axis 1 is time, so index -1 on axis 1
    # keeps only the final time step of every sequence
    last_true = y_true[:, -1]
    last_pred = y_pred[:, -1]

    return tf.keras.metrics.sparse_top_k_categorical_accuracy(last_true, last_pred, k=5)
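
As a sanity check of what the metric is meant to compute, here is a NumPy-only sketch on toy data. The function `last_step_top_k_np` is a hypothetical analogue written for illustration and is not part of TensorFlow. It assumes `y_true` has shape (batch, time) and `y_pred` has shape (batch, time, items), and returns 1 for every sequence whose true last item is among the k highest-scored items at the last time step.

```python
import numpy as np

def last_step_top_k_np(y_true, y_pred, k=5):
    """1.0 where the true last item is among the k top-scored items, else 0.0."""
    last_true = y_true[:, -1]                        # (batch,)
    last_pred = y_pred[:, -1, :]                     # (batch, items)
    top_k = np.argsort(-last_pred, axis=-1)[:, :k]   # indices of the k best scores
    hits = (top_k == last_true[:, None]).any(axis=1)
    return hits.astype(np.float32)

# Toy scores over 8 items: item 7 scores highest, item 0 lowest
scores = np.arange(8) / 28.0
y_pred_toy = np.tile(scores, (2, 3, 1))              # shape (2, 3, 8)
y_true_toy = np.array([[2, 4, 6],                    # last item 6: in the top 5
                       [3, 5, 1]])                   # last item 1: not in the top 5

result = last_step_top_k_np(y_true_toy, y_pred_toy, k=5)
print(result)         # [1. 0.]
print(result.mean())  # 0.5
```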

We are now ready to train the model:

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate = 0.001),
                metrics = [last_time_step_top_5])
log = model.fit(X_train, y_train, epochs=20,
                validation_data = (X_valid,y_valid))

In [None]:
# Get predictions on training data
y_pred_train = model.predict(X_train)

# Get predictions on validation data
y_pred_valid = model.predict(X_valid)

# Get predictions on test data
y_pred_test = model.predict(X_test)

In [None]:
accuracy_top_5 = last_time_step_top_5(y_train,y_pred_train)

In [None]:
sum(accuracy_top_5)/len(accuracy_top_5)

In [None]:
val_accuracy_top_5 = last_time_step_top_5(y_valid,y_pred_valid)

In [None]:
sum(val_accuracy_top_5)/len(val_accuracy_top_5)

In [None]:
test_accuracy_top_5 = last_time_step_top_5(y_test,y_pred_test)

In [None]:
# Test accuracy
sum(test_accuracy_top_5)/len(test_accuracy_top_5)

In [None]:
plt.plot(log.history['last_time_step_top_5'],label = "actual in top 5 - training",color='green')
plt.plot(log.history['val_last_time_step_top_5'], label = "actual in top 5 - validation",color='grey')
plt.legend()
ax = plt.gca()
plt.show()

On the one hand, this doesn't sound too impressive. On the other hand, keep in mind that we have looked at raw items, and 1000 of them (while only having the buying history of 3500 customers).

**Question 5**: Can you do better? Go through the frameworks we have discussed in class in order to generate an improved model. A few hints:
- Before thinking about our framework for improving bias and variance, note that the model does not yet really overfit
- While we generally don't stack recurrent layers too deeply for computational reasons, we are currently only using a single one
- Consider the specific type of dropout regularization relevant for RNNs
- Aside from the typical suspects for parameters to modify, the number of dimensions of the embedding usually has a big influence

At the end of your improvement process, evaluate your model on the test set.

In my new model, I added a second recurrent layer, doubled the number of units in the GRU layers (from 64 to 128), increased the embedding dimension from 6 to 8, doubled the number of convolution filters (from 32 to 64), and added the dropout regularization specific to RNNs by using `recurrent_dropout`.

In [None]:
model = tf.keras.Sequential([
    Embedding(input_dim=number_items+1, output_dim=8, input_shape=[None], mask_zero=True),
    Conv1D(64, kernel_size=2, padding="causal", activation="relu"),
    BatchNormalization(),
    GRU(128, return_sequences=True, dropout = 0.2, recurrent_dropout=0.2),
    GRU(128, return_sequences=True, dropout = 0.2, recurrent_dropout=0.2),
    BatchNormalization(),
    Dense(number_items+1, activation="softmax")
])
model.summary()

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate = 0.001),
                metrics = [last_time_step_top_5])
log = model.fit(X_train, y_train, epochs=10,
                validation_data = (X_valid,y_valid))

In [None]:
# Get predictions on training data
y_pred_train = model.predict(X_train)

# Get predictions on validation data
y_pred_valid = model.predict(X_valid)

# Get predictions on test data
y_pred_test = model.predict(X_test)

In [None]:
accuracy_top_5 = last_time_step_top_5(y_train,y_pred_train)

In [None]:
sum(accuracy_top_5)/len(accuracy_top_5)

In [None]:
val_accuracy_top_5 = last_time_step_top_5(y_valid,y_pred_valid)

In [None]:
sum(val_accuracy_top_5)/len(val_accuracy_top_5)

In [None]:
test_accuracy_top_5 = last_time_step_top_5(y_test,y_pred_test)

In [None]:
# Test accuracy
sum(test_accuracy_top_5)/len(test_accuracy_top_5)

In [None]:
plt.plot(log.history['last_time_step_top_5'],label = "actual in top 5 - training",color='green')
plt.plot(log.history['val_last_time_step_top_5'], label = "actual in top 5 - validation",color='grey')
plt.legend()
ax = plt.gca()
plt.show()