# Table of contents

1. [Load the dataset](#load_the_dataset)
2. [Split the dataset](#split_the_dataset)
3. [Fitting the recommender](#fitting)
4. [Sequential evaluation](#seq_evaluation)  
    4.1 [Evaluation with sequentially revaeled user profiles](#eval_seq_rev)  
    4.2 [Evaluation with "static" user profiles](#eval_static)  
5. [Analysis of next-item recommendation](#next-item)  
    5.1 [Evaluation with different recommendation list lengths](#next-item_list_length)  
    5.2 [Evaluation with different user profile lengths](#next-item_profile_length)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from util.data_utils import create_seq_db_filter_top_k, sequences_to_spfm_format
from util.split import random_holdout, temporal_holdout
from util.metrics import precision, recall, mrr
from util import evaluation
from recommenders.RNNRecommender import RNNRecommender

In [3]:
import datetime

In [4]:
def get_test_sequences_and_users(test_data, given_k, train_users):
    # we can run evaluation only over sequences longer than abs(LAST_K)
    mask = test_data['sequence'].map(len) > abs(given_k)
    mask &= test_data['user_id'].isin(train_users)
    test_sequences = test_data.loc[mask, 'sequence'].values
    test_users = test_data.loc[mask, 'user_id'].values
    return test_sequences, test_users

<a id='load_the_dataset'></a>

# 1. Load the dataset

For this hands-on session we will use a dataset of user-listening sessions crawled from [last.fm](https://www.last.fm/). In detail, we will use a subset of the following dataset:

* 30Music listening and playlists dataset, Turrin et al., ACM RecSys 2015 ([paper](https://home.deib.polimi.it/pagano/portfolio/papers/30Musiclisteningandplaylistsdataset.pdf))

In [6]:
dataset_path = 'datasets/sessions.csv'

# for the sake of speed, let's keep only the top-5k most popular items 
dataset = create_seq_db_filter_top_k(path=dataset_path,
                                     topk=1000, 
                                     last_months=1) 

is deprecated and will be removed in a future version
  aggregated = groups['item_id'].agg({'sequence': lambda x: list(map(str, x))})


Let's see at how the dataset looks like

In [7]:
dataset.head()

Unnamed: 0_level_0,sequence,ts,user_id
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
122,"[1762, 3700, 638]",1420059172,2432
223,"[3772, 3953]",1419418147,15861
226,"[245, 1271, 379]",1419433841,15861
243,"[245, 1197, 4307, 3868]",1421674741,15861
245,"[409, 234, 2334, 2431, 231, 4738, 219, 2403]",1421679507,15861


Let's show some statistics about the dataset

In [8]:
from collections import Counter
cnt = Counter()
dataset.sequence.map(cnt.update);

In [9]:
sequence_length = dataset.sequence.map(len).values
n_sessions_per_user = dataset.groupby('user_id').size()

print('Number of items: {}'.format(len(cnt)))
print('Number of users: {}'.format(dataset.user_id.nunique()))
print('Number of sessions: {}'.format(len(dataset)) )

print('\nSession length:\n\tAverage: {:.2f}\n\tMedian: {}\n\tMin: {}\n\tMax: {}'.format(
    sequence_length.mean(), 
    np.quantile(sequence_length, 0.5), 
    sequence_length.min(), 
    sequence_length.max()))

print('Sessions per user:\n\tAverage: {:.2f}\n\tMedian: {}\n\tMin: {}\n\tMax: {}'.format(
    n_sessions_per_user.mean(), 
    np.quantile(n_sessions_per_user, 0.5), 
    n_sessions_per_user.min(), 
    n_sessions_per_user.max()))

Number of items: 1000
Number of users: 17816
Number of sessions: 65917

Session length:
	Average: 4.19
	Median: 3.0
	Min: 1
	Max: 198
Sessions per user:
	Average: 3.70
	Median: 3.0
	Min: 1
	Max: 38


In [10]:
print('Most popular items: {}'.format(cnt.most_common(5)))

Most popular items: [('443', 1970), ('1065', 1526), ('67', 1462), ('1622', 1212), ('2308', 1211)]


<a id='split_the_dataset'></a>

# 2. Split the dataset

For simplicity, let's split the dataset randomly. NOTE: Sessions will be assigned either to the training or the test set.

In [11]:
train_data, test_data = random_holdout(dataset, perc=0.8, seed=1234)

In [12]:
print("Train size: {} - Test size: {}".format(len(train_data), len(test_data)))

Train size: 52733 - Test size: 13184


<a id='fitting'></a>

# 3. Fitting the recommender

Here we fit the recommedation algorithm over the sessions in the training set.  

This is a **simplified** interface to Recurrent Neural Network models for Session-based recommendation.
Based on the following two papers:

* Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, Hidasi and Karatzoglou, CIKM 2018
* Personalizing Session-based Recommendation with Hierarchical Recurrent Neural Networks, Quadrana et al, Recsys 2017

In this notebook, we will consider the session-aware (**personalized**) version of the algorithm.  
The hyper-parameters are:

* `session_layers`: number of units per layer used at session level.
    It has to be a list of integers for multi-layer networks, or a integer value for single-layer networks.
* `user_layers`: number of units per layer used at user level. Required only by personalized models.
    It has to be a list of integers for multi-layer networks, or a integer value for single-layer networks.
* `batch_size`: the mini-batch size used in training
* `learning_rate`: the learning rate used in training (Adagrad optimized)
* `momentum`: the momentum coefficient used in training
* `dropout`: it's a 3-tuple with the values for the dropout of (user hidden, session hidden, user-to-session hidden) layers.
* `epochs`: number of training epochs
* `personalized`: whether to train a personalized model using the HRNN model (`True` in this case).

**NOTE: HGRU4Rec originally has many more hyper-parameters, and checking them all is out from the scope of this tutorial. Check-out the original source code [here](https://github.com/mquad/hgru4rec).**


In [14]:
recommender = RNNRecommender(session_layers=[20], 
                             user_layers=[20],
                             batch_size=16,
                             learning_rate=0.5,
                             momentum=0.1,
                             dropout=(0.1,0.1,0.1),
                             epochs=5,
                             personalized=True)
recommender.fit(train_data)

2018-09-25 17:32:43,795 - INFO - Converting training data to GRU4Rec format
2018-09-25 17:32:44,219 - INFO - Training started
  result[diagonal_slice] = x
2018-09-25 17:33:23,066 - INFO - Epoch 0 - train cost: 0.9533
2018-09-25 17:33:38,338 - INFO - Epoch 1 - train cost: 0.8611
2018-09-25 17:33:52,569 - INFO - Epoch 2 - train cost: 0.8450
2018-09-25 17:34:05,512 - INFO - Epoch 3 - train cost: 0.8398
2018-09-25 17:34:18,010 - INFO - Epoch 4 - train cost: 0.8370
2018-09-25 17:34:18,021 - INFO - Training completed


<a id='seq_evaluation'></a>


# 4. Sequential evaluation

In the evaluation of sequence-aware recommenders, each sequence in the test set is split into:
- the _user profile_, used to compute recommendations, is composed by the first *k* events in the sequence;
- the _ground truth_, used for performance evaluation, is composed by the remainder of the sequence.

In the cells below, you can control the dimension of the _user profile_ by assigning a **positive** value to `GIVEN_K`, which correspond to the number of events from the beginning of the sequence that will be assigned to the initial user profile. This ensures that each user profile in the test set will have exactly the same initial size, but the size of the ground truth will change for every sequence.

Alternatively, by assigning a **negative** value to `GIVEN_K`, you will set the initial size of the _ground truth_. In this way the _ground truth_ will have the same size for all sequences, but the dimension of the user profile will differ.

In [15]:
METRICS = {'precision':precision, 
           'recall':recall,
           'mrr': mrr}
TOPN = 10 # length of the recommendation list

<a id='eval_seq_rev'></a>

## 4.1 Evaluation with sequentially revealed user-profiles

Here we evaluate the quality of the recommendations in a setting in which user profiles are revealed _sequentially_.

The _user profile_ starts from the first `GIVEN_K` events (or, alternatively, from the last `-GIVEN_K` events if `GIVEN_K<0`).  
The recommendations are evaluated against the next `LOOK_AHEAD` events (the _ground truth_).  
The _user profile_ is next expanded to the next `STEP` events, the ground truth is scrolled forward accordingly, and the evaluation continues until the sequence ends.

In typical **next-item recommendation**, we start with `GIVEN_K=1`, generate a set of **alternatives** that will evaluated against the next event in the sequence (`LOOK_AHEAD=1`), move forward of one step (`STEP=1`) and repeat until the sequence ends.

You can set the `LOOK_AHEAD='all'` to see what happens if you had to recommend a **whole sequence** instead of a set of a set of alternatives to a user.

NOTE: Metrics are averaged over each sequence first, then averaged over all test sequences.

** (TODO) Try out with different evaluation settings to see how the recommandation quality changes. **


![](gifs/sequential_eval.gif)

In [16]:
# GIVEN_K=1, LOOK_AHEAD=1, STEP=1 corresponds to the classical next-item evaluation
GIVEN_K = 1
LOOK_AHEAD = 1
STEP = 1

In [17]:
test_sequences, test_users = get_test_sequences_and_users(test_data, GIVEN_K, train_data['user_id'].values) # we need user ids now!
print('{} sequences available for evaluation ({} users)'.format(len(test_sequences), len(np.unique(test_users))))

results = evaluation.sequential_evaluation(recommender,
                                           test_sequences=test_sequences,
                                           users=test_users,
                                           given_k=GIVEN_K,
                                           look_ahead=LOOK_AHEAD,
                                           evaluation_functions=METRICS.values(),
                                           top_n=TOPN,
                                           scroll=False,
                                           step=STEP)

  0%|          | 0/8558 [00:00<?, ?it/s]

8558 sequences available for evaluation (5903 users)


100%|██████████| 8558/8558 [00:51<00:00, 166.78it/s]


In [18]:
print('Sequential evaluation (GIVEN_K={}, LOOK_AHEAD={}, STEP={})'.format(GIVEN_K, LOOK_AHEAD, STEP))
for mname, mvalue in zip(METRICS.keys(), results):
    print('\t{}@{}: {:.4f}'.format(mname, TOPN, mvalue))

Sequential evaluation (GIVEN_K=1, LOOK_AHEAD=1, STEP=1)
	precision@10: 0.0312
	recall@10: 0.3123
	mrr@10: 0.0899


<a id='eval_static'></a>

## 4.2 Evaluation with "static" user-profiles

Here we evaluate the quality of the recommendations in a setting in which user profiles are instead _static_.

The _user profile_ starts from the first `GIVEN_K` events (or, alternatively, from the last `-GIVEN_K` events if `GIVEN_K<0`).  
The recommendations are evaluated against the next `LOOK_AHEAD` events (the _ground truth_).  

The user profile is *not extended* and the ground truth *doesn't move forward*.
This allows to obtain "snapshots" of the recommendation performance for different user profile and ground truth lenghts.

Also here you can set the `LOOK_AHEAD='all'` to see what happens if you had to recommend a **whole sequence** instead of a set of a set of alternatives to a user.

**(TODO) Try out with different evaluation settings to see how the recommandation quality changes.**

In [17]:
GIVEN_K = 1
LOOK_AHEAD = 1
STEP=1

In [18]:
test_sequences = get_test_sequences(test_data, GIVEN_K)
print('{} sequences available for evaluation'.format(len(test_sequences)))

results = evaluation.sequential_evaluation(recommender,
                                           test_sequences=test_sequences,
                                           given_k=GIVEN_K,
                                           look_ahead=LOOK_AHEAD,
                                           evaluation_functions=METRICS.values(),
                                           top_n=TOPN,
                                           scroll=False  # notice that scrolling is disabled!
                                          )  

  0%|          | 22/9323 [00:00<00:42, 217.91it/s]

9323 sequences available for evaluation


100%|██████████| 9323/9323 [00:46<00:00, 201.74it/s]


In [19]:
print('Sequential evaluation (GIVEN_K={}, LOOK_AHEAD={}, STEP={})'.format(GIVEN_K, LOOK_AHEAD, STEP))
for mname, mvalue in zip(METRICS.keys(), results):
    print('\t{}@{}: {:.4f}'.format(mname, TOPN, mvalue))

Sequential evaluation (GIVEN_K=1, LOOK_AHEAD=1, STEP=1)
	precision@10: 0.0382
	recall@10: 0.3824
	mrr@10: 0.1034


<a id='next-item'></a>

## 5. Analysis of next-item recommendation

Here we propose to analyse the performance of the recommender system in the scenario of *next-item recommendation* over the following dimensions:

* the *length* of the **recommendation list**, and
* the *length* of the **user profile**.

NOTE: This evaluation is by no means exhaustive, as different the hyper-parameters of the recommendation algorithm should be *carefully tuned* before drawing any conclusions. Unfortunately, given the time constraints for this tutorial, we had to leave hyper-parameter tuning out. A very useful reference about careful evaluation of (session-based) recommenders can be found at:

*  Evaluation of Session-based Recommendation Algorithms, Ludewig and Jannach, 2018 ([paper](https://arxiv.org/abs/1803.09587))

<a id='next-item_list_length'></a>

### 5.1 Evaluation for different recommendation list lengths

In [20]:
GIVEN_K = 1
LOOK_AHEAD = 1
STEP = 1
topn_list = [1, 5, 10, 20, 50, 100]

In [21]:
# ensure that all sequences have the same minimum length 
test_sequences = get_test_sequences(test_data, GIVEN_K)
print('{} sequences available for evaluation'.format(len(test_sequences)))

9323 sequences available for evaluation


In [22]:
res_list = []

for topn in topn_list:
    print('Evaluating recommendation lists with length: {}'.format(topn))
    res_tmp = evaluation.sequential_evaluation(recommender,
                                               test_sequences=test_sequences,
                                               given_k=GIVEN_K,
                                               look_ahead=LOOK_AHEAD,
                                               evaluation_functions=METRICS.values(),
                                               top_n=topn,
                                               scroll=True,  # here we average over all profile lengths
                                               step=STEP)
    mvalues = list(zip(METRICS.keys(), res_tmp))
    res_list.append((topn, mvalues))

  0%|          | 6/9323 [00:00<02:42, 57.21it/s]

Evaluating recommendation lists with length: 1


100%|██████████| 9323/9323 [03:54<00:00, 39.79it/s]
  0%|          | 5/9323 [00:00<03:22, 46.08it/s]

Evaluating recommendation lists with length: 5


100%|██████████| 9323/9323 [03:35<00:00, 43.29it/s]
  0%|          | 7/9323 [00:00<02:25, 64.24it/s]

Evaluating recommendation lists with length: 10


  2%|▏         | 224/9323 [00:03<02:38, 57.24it/s]


KeyboardInterrupt: 

In [None]:
# show separate plots per metric
fig, axes = plt.subplots(nrows=1, ncols=len(METRICS), figsize=(15,5))
res_list_t = list(zip(*res_list))
for midx, metric in enumerate(METRICS):
    mvalues = [res_list_t[1][j][midx][1] for j in range(len(res_list_t[1]))]
    ax = axes[midx]
    ax.plot(topn_list, mvalues)
    ax.set_title(metric)
    ax.set_xticks(topn_list)
    ax.set_xlabel('List length')

<a id='next-item_profile_length'></a>

### 5.2 Evaluation for different user profile lengths

In [None]:
given_k_list = [1, 2, 3, 4]
LOOK_AHEAD = 1
STEP = 1
TOPN = 10

In [None]:
# ensure that all sequences have the same minimum length 
test_sequences = get_test_sequences(test_data, max(given_k_list))
print('{} sequences available for evaluation'.format(len(test_sequences)))

res_list = []

for gk in given_k_list:
    print('Evaluating profiles having length: {}'.format(gk))
    res_tmp = evaluation.sequential_evaluation(recommender,
                                               test_sequences=test_sequences,
                                               given_k=gk,
                                               look_ahead=LOOK_AHEAD,
                                               evaluation_functions=METRICS.values(),
                                               top_n=TOPN,
                                               scroll=False,  # here we stop for each sequence length
                                               step=STEP)
    mvalues = list(zip(METRICS.keys(), res_tmp))
    res_list.append((gk, mvalues))

In [None]:
# show separate plots per metric
fig, axes = plt.subplots(nrows=1, ncols=len(METRICS), figsize=(15,5))
res_list_t = list(zip(*res_list))
for midx, metric in enumerate(METRICS):
    mvalues = [res_list_t[1][j][midx][1] for j in range(len(res_list_t[1]))]
    ax = axes[midx]
    ax.plot(given_k_list, mvalues)
    ax.set_title(metric)
    ax.set_xticks(given_k_list)
    ax.set_xlabel('Profile length')