# Notebook #1: Designing and evaluating a recommendation algorithm

In this notebook, we become familiar with the Python recommendation toolbox in the simplest possible way. First, we set up the working environment in GDrive. Then, we go through the experimental pipeline by:
- loading the Movielens 1M dataset; 
- performing a train-test splitting;
- creating a pointwise / pairwise / random / mostpop recommendation object;
- training the model (if applicable);
- computing the user-item relevance matrix;
- calculating some of the recommendation metrics (e.g., NDCG, Item Coverage, Diversity, Novelty).

The trained models, together with the intermediate results we will save (e.g., the user-item relevance matrix and the metrics), will be the starting point for the analyses and treatments covered by the other Jupyter notebooks.

**IMPORTANT**: Please go to the "Runtime" option in the top menu, then click on "Change runtime type" and select "GPU".

## Set up the working environment for this notebook

- Python 3.6
- Package Requirements: matplotlib, numpy, pandas, scikit-learn, scipy, tensorflow-gpu==2.0
- Storage requirements: around 1GB

This step mounts GDrive storage within this Jupyter notebook. The command will ask us to grant access permissions to this notebook, so that we can clone the project repository into our Drive. Please follow the prompted instructions.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

We will clone the project repository in our My Drive folder. If you wish to change the target folder, please modify the command below.

In [None]:
%cd /content/gdrive/My Drive/

In [None]:
! git clone https://github.com/mirkomarras/bias-recsys-tutorial.git

We will move to the project folder in order to install the required packages. 

In [None]:
%cd bias-recsys-tutorial

In [None]:
! ls

In [None]:
! pip install -r requirements.txt

We will configure the notebooks directory as our working directory in order to simulate a local notebook execution. 

In [None]:
%cd ./notebooks

## Import packages

In [1]:
import sys 
import os

sys.path.append(os.path.join('..'))

In [2]:
import pandas as pd
import numpy as np

In [3]:
from helpers.train_test_splitter import *
from models.pointwise import PointWise
from models.pairwise import PairWise
from models.mostpop import MostPop
from models.random import Random
from helpers.utils import *

We will define the folders where we will store our pre-computed results. 

In [4]:
data_path = '../data/'

In [None]:
!mkdir -p '../data/outputs'
!mkdir -p '../data/outputs/splits'
!mkdir -p '../data/outputs/instances'
!mkdir -p '../data/outputs/models'
!mkdir -p '../data/outputs/predictions'
!mkdir -p '../data/outputs/metrics'

## Load data

First, we will load the Movielens 1M dataset, which has been pre-arranged to comply with the following structure: user_id, item_id, rating, timestamp, type (label of the item category), and type_id (unique id of the item category). To keep the tutorial simple, we assume here that each item is randomly assigned to one of the categories it has in the original dataset. The toolbox can work with any other dataset in csv format that has the same structure as the pre-arranged csv shown below; no further changes are needed to experiment with other datasets.

In [5]:
dataset = 'ml1m'  
user_field = 'user_id'
item_field = 'item_id'
rating_field = 'rating'
time_field = 'timestamp'
type_field = 'type_id'

In [6]:
data = pd.read_csv(os.path.join(data_path, 'datasets/' + dataset + '.csv'), encoding='utf8')

In [7]:
data.head()

Unnamed: 0,user_id,item_id,rating,timestamp,type,type_id
0,1,1193,5.0,2000-12-31 23:12:40,Drama,7
1,2,1193,5.0,2000-12-31 22:33:33,Drama,7
2,12,1193,4.0,2000-12-31 00:49:39,Drama,7
3,15,1193,4.0,2000-12-30 19:01:19,Drama,7
4,17,1193,5.0,2000-12-30 07:41:11,Drama,7


During this tutorial, we will simulate a scenario with implicit feedback.

In [8]:
data[rating_field] = 1.0  # implicit feedback: every observed interaction counts as positive

## Split data in train and test sets

- **smode**: 'uftime' for fixed timestamp split, 'utime' for time-based split per user, 'urandom' for random split per user 
- **train_ratio**: percentage of data to be included in the train set
- **min_train**: minimum number of train samples for a user to be included  
- **min_test**: minimum number of test samples for a user to be included
- **min_time**: start timestamp for computing the splitting timestamp (only for uftime)
- **max_time**: end timestamp for computing the splitting timestamp (only for uftime)
- **step_time**: timestamp step for computing the splitting timestamp (only for uftime)

In [9]:
smode = 'utime'
train_ratio = 0.80        
min_train_samples = 8
min_test_samples = 2
min_time = None
max_time = None
step_time = 1000

During this tutorial, we will work with a common time-based split per user. 

In [10]:
if smode == 'uftime':
    traintest = fixed_timestamp(data, min_train_samples, min_test_samples, min_time, max_time, step_time, user_field, item_field, time_field, rating_field)
elif smode == 'utime':
    traintest = user_timestamp(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field, time_field)
elif smode == 'urandom':
    traintest = user_random(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field)

> Parsing user 6000 of 6040
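
As a rough illustration of what a time-based split per user does, the sketch below labels the earliest share of each user's interactions as train and the remainder as test. It is a minimal sketch under stated assumptions: `split_per_user_by_time` and the toy data are hypothetical, and the toolbox's `user_timestamp` function may differ in details such as rounding or tie handling.

```python
import pandas as pd

def split_per_user_by_time(df, train_ratio=0.8, min_samples=10,
                           user_field='user_id', time_field='timestamp'):
    """Label each user's earliest interactions as 'train' and the rest as 'test'."""
    parts = []
    for _, group in df.groupby(user_field):
        if len(group) < min_samples:
            continue  # users with too few interactions are dropped
        group = group.sort_values(time_field).copy()
        n_train = int(len(group) * train_ratio)
        group['set'] = ['train'] * n_train + ['test'] * (len(group) - n_train)
        parts.append(group)
    return pd.concat(parts)

# Toy data: user 1 has five interactions, user 2 only two.
toy = pd.DataFrame({
    'user_id':   [1] * 5 + [2] * 2,
    'item_id':   [10, 11, 12, 13, 14, 10, 11],
    'timestamp': pd.to_datetime(['2000-01-05', '2000-01-01', '2000-01-03',
                                 '2000-01-02', '2000-01-04',
                                 '2000-01-01', '2000-01-02']),
})
split = split_per_user_by_time(toy, train_ratio=0.8, min_samples=5)
print(split[['user_id', 'item_id', 'set']])
```

In this toy run, user 2 is filtered out and the most recent interaction of user 1 (item 10) ends up in the test set.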


Please note that user_ids and item_ids have been remapped so that user_ids range from 0 to no_users - 1 and item_ids range from 0 to no_items - 1. If you wish to link these new ids to the original ones, please refer to the user_id_original and item_id_original columns.

In [11]:
traintest.head()

Unnamed: 0,user_id,item_id,rating,timestamp,type,type_id,set,user_id_original,item_id_original
34073,0,2969,1.0,2000-12-31 23:00:19,Drama,7,train,1,3186
31152,0,1574,1.0,2000-12-31 23:00:55,Romance,13,train,1,1721
37339,0,957,1.0,2000-12-31 23:00:55,Children's,3,train,1,1022
23270,0,1178,1.0,2000-12-31 23:00:55,Sci-Fi,14,train,1,1270
28157,0,2147,1.0,2000-12-31 23:01:43,Romance,13,train,1,2340
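
One common way to obtain contiguous ids while keeping a link to the original ones is `np.unique` with `return_inverse=True`. The snippet below is only a sketch of that idea on made-up ids; the toolbox may build its mapping differently.

```python
import numpy as np

# Original, non-contiguous item ids as they might appear in a raw dataset.
original_item_ids = np.array([1193, 3186, 1193, 1721, 3186])

# unique_ids holds the sorted distinct ids; new_ids holds, for each row,
# the position of its id in unique_ids (a contiguous id starting at 0).
unique_ids, new_ids = np.unique(original_item_ids, return_inverse=True)

print(unique_ids)           # [1193 1721 3186]
print(new_ids)              # [0 2 0 1 2]
print(unique_ids[new_ids])  # recovers the original ids
```

Indexing `unique_ids` with a new id plays the same role as the item_id_original column.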


For the sake of replicability and efficiency of this tutorial, we will save the pre-computed train and test sets in ./data/outputs/splits

In [12]:
traintest.to_csv(os.path.join(data_path, 'outputs/splits/' + dataset + '_' + smode + '.csv'))

## Run the model train and test

We will create two dataframes, one with train feedback and another with test feedback, from the pre-computed split data. 

In [13]:
train = traintest[traintest['set']=='train'].copy()
test = traintest[traintest['set']=='test'].copy()

In [14]:
users = list(np.unique(traintest[user_field].values))
items = list(np.unique(traintest[item_field].values))

In [15]:
len(users), len(items)

(6040, 3706)

In [16]:
category_per_item = traintest.drop_duplicates(subset=['item_id'], keep='first')[type_field].values

In [17]:
len(np.unique(category_per_item))

18

For simplicity, we will focus on four main recommendation strategies:
- Random
- MostPop
- PointWise
- PairWise

In [18]:
model_types = {'random': Random, 'mostpop': MostPop, 'pointwise': PointWise, 'pairwise': PairWise} 

First, we need to initialize the model. We will see how the process works for a PairWise algorithm. Then, we will consider the other ones. 

In [19]:
model_type = 'pairwise'
model = PairWise(users, items, train, test, category_per_item, item_field, user_field, rating_field)

Initializing user, item, and categories lists
Initializing observed, unobserved, and predicted relevance scores
Initializing item popularity lists
Initializing category per item
Initializing category preference per user
Initializing metrics


We will train the model by feeding the train data we previously prepared, with the following default values. 

- **no_epochs** (default: 100)
- **batches** (default: 1024)
- **lr** (default: 0.001)
- **no_factors** (default: 10)
- **no_negatives** (default: 10)
- **val_split** (default: 0.0001)

In [21]:
model.train(no_epochs=5)  # For the sake of tutorial efficiency, we stop after 5 epochs

Generating training instances of type pair
Computing instances for interaction 800000 / 803798 of type pair
Performing training - Epochs 5 Batch Size 1024 Learning Rate 0.001 Factors 10 Negatives 10 Mode pair
Train on 7957600 samples
Validation accuracy: 0.8651016862289221 (Sample 80000 of 80380 )
Train on 7957600 samples
Epoch 2/2
Train on 7957600 samples
Epoch 3/3
Train on 7957600 samples
Epoch 4/4
Train on 7957600 samples
Epoch 5/5
Validation accuracy: 0.9182260221747228 (Sample 80000 of 80380 )
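
The "Generating training instances of type pair" step can be pictured as negative sampling: each observed (user, item) interaction is paired with a number of items the user has not interacted with. The log reports 7957600 training samples, i.e., 795760 × 10, which would be consistent with ten negatives per training interaction. The sketch below assumes uniform sampling; `make_pair_instances` is a hypothetical helper, not the toolbox's instance creator.

```python
import random

def make_pair_instances(observed, items, no_negatives=10, seed=0):
    """Build (user, positive item, negative item) triples by uniform negative sampling."""
    rng = random.Random(seed)
    # Collect the set of items each user has interacted with.
    seen = {}
    for user, item in observed:
        seen.setdefault(user, set()).add(item)
    triples = []
    for user, pos_item in observed:
        # Candidate negatives are the items this user has never interacted with.
        candidates = [i for i in items if i not in seen[user]]
        for _ in range(no_negatives):
            triples.append((user, pos_item, rng.choice(candidates)))
    return triples

observed = [(0, 1), (0, 2), (1, 0)]
triples = make_pair_instances(observed, items=range(5), no_negatives=2)
print(len(triples))  # 6
```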


The architecture of the trained model looks as follows. 

In [22]:
model.print()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
UserInput (InputLayer)          [(None, 1)]          0                                            
__________________________________________________________________________________________________
PosItemInput (InputLayer)       [(None, 1)]          0                                            
__________________________________________________________________________________________________
NegItemInput (InputLayer)       [(None, 1)]          0                                            
__________________________________________________________________________________________________
UserEmb (Embedding)             (None, 1, 10)        60410       UserInput[0][0]                  
______________________________________________________________________________________________
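
The summary above is truncated, so only the three inputs and the user embedding are visible. As a rough mental model, and assuming (this is not confirmed by the output) that the network scores a user-item pair by the dot product of their embeddings and is trained with a BPR-style objective, the pairwise loss and a pairwise accuracy could be sketched in NumPy as follows. All names and the toy sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
no_users, no_items, no_factors = 4, 6, 10

# Randomly initialised embeddings stand in for the trained ones.
user_emb = rng.normal(scale=0.1, size=(no_users, no_factors))
item_emb = rng.normal(scale=0.1, size=(no_items, no_factors))

def pairwise_loss(users, pos_items, neg_items):
    """BPR-style loss: push the positive item's score above the negative one's."""
    pos_scores = np.sum(user_emb[users] * item_emb[pos_items], axis=1)
    neg_scores = np.sum(user_emb[users] * item_emb[neg_items], axis=1)
    diff = pos_scores - neg_scores
    loss = np.mean(np.logaddexp(0.0, -diff))  # equals -log(sigmoid(diff))
    accuracy = np.mean(diff > 0)              # share of correctly ordered pairs
    return loss, accuracy

users = np.array([0, 0, 1, 3])
pos_items = np.array([1, 2, 0, 5])
neg_items = np.array([4, 3, 2, 1])
loss, accuracy = pairwise_loss(users, pos_items, neg_items)
print(loss, accuracy)
```

Under this reading, a validation accuracy around 0.92 would mean that about 92% of the held-out (positive, negative) pairs are ranked in the right order.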

## Compute user-item relevance scores

Now, we will use the pre-trained model to predict the user-item relevance scores.

In [23]:
model.predict()

Computing predictions for user 6000 / 6040

In [24]:
scores = model.get_predictions()

As expected, the predicted scores are stored in a matrix of shape no_users x no_items.

In [25]:
scores.shape

(6040, 3706)

Hence, we can access the relevance score of user 120 for item 320 as follows.

In [26]:
user_id, item_id = 120, 320
scores[user_id, item_id]

2.7350480556488037
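
A score matrix of this kind is typically turned into recommendations by ranking, for each user, the items by decreasing score. The sketch below does this with NumPy on a toy matrix; excluding items already seen in the train set is an assumption made here for illustration, and `top_k_items` is a hypothetical helper rather than a toolbox function.

```python
import numpy as np

def top_k_items(scores, train_mask, k):
    """Return, for each user, the indices of the k highest-scored unseen items."""
    # Items seen during training get -inf so that they are ranked last.
    masked = np.where(train_mask, -np.inf, scores)
    order = np.argsort(-masked, axis=1, kind='stable')
    return order[:, :k]

scores = np.array([[0.9, 0.1, 0.5, 0.7],
                   [0.2, 0.8, 0.6, 0.4]])
train_mask = np.array([[True, False, False, False],
                       [False, False, True, False]])
print(top_k_items(scores, train_mask, k=2))
# [[3 2]
#  [1 3]]
```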

For the sake of convenience, we will save the predicted scores. 

In [27]:
save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))

## Calculate metrics

In this step, we leverage the predicted scores in order to compute a set of common recommendation metrics. 

In [28]:
cutoffs = np.array([5, 10, 20])

In [29]:
item_group = load_obj(os.path.join(data_path, 'datasets', 'ml1m-item-group')) 
# we discuss this point in detail in the third notebook

In [30]:
model.test(item_group=item_group, cutoffs=cutoffs)

Computing metrics for user 6000 / 6040

The method has pre-computed a set of metrics and saved the corresponding values in a Python dictionary, as detailed below. 

In [31]:
metrics = model.get_metrics()

In [32]:
metrics.keys()

dict_keys(['precision', 'recall', 'ndcg', 'hit', 'mean_popularity', 'diversity', 'novelty', 'item_coverage', 'visibility', 'exposure'])

The values of each metric have been computed and stored for each cutoff.

In [33]:
for name, values in metrics.items():
    print(values.shape, name)

(6, 6040) precision
(6, 6040) recall
(6, 6040) ndcg
(6, 6040) hit
(6, 6040) mean_popularity
(6, 6040) diversity
(6, 6040) novelty
(6, 3706) item_coverage
(6, 6040) visibility
(6, 6040) exposure
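
Assuming the first axis of each array indexes the cutoff and the second indexes the user (as the indexing used in this notebook suggests), an aggregated value at a given cutoff is simply the mean over the second axis. The snippet below shows this on a small synthetic dictionary, not on the real metrics.

```python
import numpy as np

cutoffs = np.array([5, 10, 20])

# Synthetic per-user values: one row per cutoff, one column per user.
toy_metrics = {'ndcg': np.array([[0.0, 0.5, 1.0],
                                 [0.2, 0.4, 0.9],
                                 [0.3, 0.3, 0.9]])}

# Position of cutoff 10 in the cutoffs array.
cutoff_index = int(np.flatnonzero(cutoffs == 10)[0])

# Average over users at that cutoff.
mean_ndcg_at_10 = toy_metrics['ndcg'][cutoff_index].mean()
print(round(mean_ndcg_at_10, 4))  # 0.5
```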


For instance, we can access the NDCG score of user 1324 at cutoff 10 with the following commands.

In [39]:
user_id, cutoff_index = 1324, int(np.where(cutoffs == 10)[0][0])
metrics['ndcg'][cutoff_index, user_id]

0.41125017975368
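
To make a per-user NDCG value more concrete, here is a minimal sketch of NDCG@k with binary relevance, where an item is relevant if it appears in the user's test set. This is the textbook formulation; `ndcg_at_k` is a hypothetical helper and the toolbox's implementation may differ in details.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary relevance for a single user."""
    # Discounted gain of the relevant items found in the top-k list.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    # Best achievable DCG: all relevant items placed at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Items 3 and 1 are relevant; they are ranked second and fourth.
print(ndcg_at_k([7, 3, 9, 1], relevant_items={3, 1}, k=4))
```

A list that places all relevant items first scores 1.0, while a list with no relevant item in the top k scores 0.0.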

For the sake of convenience, we will save the computed metrics.

In [36]:
save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))

We can also see the aggregated values. 

In [37]:
model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))

Precision: 0.1176 
Recall: 0.0485 
NDCG: 0.1272 
Hit Rate: 0.5182 
Avg Popularity: 1949.2307 
Category Diversity: 0.3261 
Novelty: 1.7604 
Item Coverage: 0.22 
User Coverage: 0.5182
Minority Exposure: 0.0425
Minority Visibility: 0.0418
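
Two of the beyond-accuracy values above can be illustrated with simple definitions. The sketch assumes that item coverage is the share of catalog items appearing in at least one recommended list, and that average popularity is the mean number of train interactions of the recommended items; both definitions and function names are assumptions made for illustration and may not match the toolbox exactly.

```python
import numpy as np

def item_coverage(top_k, no_items):
    """Share of the catalog that appears in at least one top-k list."""
    return len(np.unique(top_k)) / no_items

def average_popularity(top_k, item_popularity):
    """Mean popularity (e.g., train interaction count) of recommended items."""
    return item_popularity[top_k].mean()

# Top-2 lists for three users over a catalog of five items.
top_k = np.array([[0, 1], [0, 2], [1, 0]])
item_popularity = np.array([50, 30, 10, 5, 5])

print(item_coverage(top_k, no_items=5))            # 0.6
print(average_popularity(top_k, item_popularity))
```

Under this definition, an Item Coverage of 0.22 would mean that about 22% of the catalog appears in at least one top-10 list.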


## Repeat the experimental pipeline for Random and MostPop (optionally for PointWise)

We will define a utility function to perform all the above operations jointly.

In [40]:
def run_model(model_type, no_epochs=None):
    print('Running model', model_type)
    model = model_types[model_type](users, items, train, test, category_per_item, item_field, user_field, rating_field)
    # Use the model's default number of epochs unless one is given
    if no_epochs:
        model.train(no_epochs=no_epochs)
    else:
        model.train()
    model.predict()
    scores = model.get_predictions()
    save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))
    model.test(item_group=item_group, cutoffs=cutoffs)
    metrics = model.get_metrics()
    save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))
    print()
    model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))

In [41]:
run_model('random')

Running model random
Initializing user, item, and categories lists
Initializing observed, unobserved, and predicted relevance scores
Initializing item popularity lists
Initializing category per item
Initializing category preference per user
Initializing metrics
Computing metrics for user 6000 / 6040
Precision: 0.0104 
Recall: 0.0031 
NDCG: 0.0109 
Hit Rate: 0.0887 
Avg Popularity: 197.8719 
Category Diversity: 0.3296 
Novelty: 6.9899 
Item Coverage: 1.0 
User Coverage: 0.0887
Minority Exposure: 0.1662
Minority Visibility: 0.1658


In [43]:
run_model('mostpop')

Running model mostpop
Initializing user, item, and categories lists
Initializing observed, unobserved, and predicted relevance scores
Initializing item popularity lists
Initializing category per item
Initializing category preference per user
Initializing metrics
Computing metrics for user 6000 / 6040
Precision: 0.1007 
Recall: 0.0384 
NDCG: 0.1096 
Hit Rate: 0.4422 
Avg Popularity: 2328.0848 
Category Diversity: 0.3293 
Novelty: 1.3922 
Item Coverage: 0.03 
User Coverage: 0.4422
Minority Exposure: 0.0509
Minority Visibility: 0.0616


In [44]:
run_model('pointwise', no_epochs=5)

Running model pointwise
Initializing user, item, and categories lists
Initializing observed, unobserved, and predicted relevance scores
Initializing item popularity lists
Initializing category per item
Initializing category preference per user
Initializing metrics
Generating training instances of type point
Computing instances for interaction 800000 / 803798 of type point
Performing training - Epochs 5 Batch Size 1024 Learning Rate 0.001 Factors 10 Negatives 10 Mode point
Train on 7957600 samples, validate on 884178 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 00003: early stopping
Computing metrics for user 6000 / 6040
Precision: 0.1166 
Recall: 0.0591 
NDCG: 0.1298 
Hit Rate: 0.5541 
Avg Popularity: 1408.3628 
Category Diversity: 0.3212 
Novelty: 2.3554 
Item Coverage: 0.35 
User Coverage: 0.5541
Minority Exposure: 0.0541
Minority Visibility: 0.0574
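
The two non-learned baselines run above can be pictured as very simple scoring rules: Random assigns arbitrary scores, while MostPop gives every user the same scores, proportional to how often each item occurs in the train set. The sketch below illustrates this on toy data; the function names are hypothetical and the toolbox classes may handle details differently.

```python
import numpy as np

def random_scores(no_users, no_items, seed=0):
    """Random baseline: every user-item pair gets an independent uniform score."""
    rng = np.random.default_rng(seed)
    return rng.random((no_users, no_items))

def mostpop_scores(train_pairs, no_users, no_items):
    """MostPop baseline: the score of an item is its train interaction count."""
    item_ids = np.array([item for _, item in train_pairs])
    popularity = np.bincount(item_ids, minlength=no_items).astype(float)
    # The same popularity vector is repeated for every user.
    return np.tile(popularity, (no_users, 1))

# (user, item) train interactions: item 2 is the most popular.
train_pairs = [(0, 2), (1, 2), (2, 2), (0, 1), (1, 0), (2, 1)]
pop = mostpop_scores(train_pairs, no_users=3, no_items=4)
print(pop[0])  # [1. 2. 3. 0.]
```

This also gives an intuition for the results: identical rankings for all users lead to a very low item coverage for MostPop, whereas random scores spread recommendations over the whole catalog.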


## How to extend the toolbox

- New splitter: take a look at the helpers/train_test_splitter.py file and how the existing generators have been defined. 
- New train instances creator: similarly, take a look at the helpers/instances_creator.py file and how the existing generators have been defined. 
- New model: a new subclass of the Model class defined in models/model.py should be defined, implementing a 'train' and a 'predict' method. 
- New metrics: both the 'test' and 'show_metrics' methods of models/model.py should be extended with the computation needed by the new metric.  
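
To give an idea of the "New model" extension point, the sketch below defines a toy model with a 'train' and a 'predict' method. Since the code of models/model.py is not shown in this notebook, `ModelStub` is a hypothetical stand-in for the real Model class, with a simplified constructor and attribute names chosen for illustration only.

```python
import numpy as np

class ModelStub:
    """Hypothetical stand-in for the toolbox's Model base class (not its real code)."""
    def __init__(self, users, items):
        self.users = users
        self.items = items
        self.predictions = None

    def train(self):
        raise NotImplementedError

    def predict(self):
        raise NotImplementedError

    def get_predictions(self):
        return self.predictions

class ConstantScore(ModelStub):
    """Toy model: no training, every user-item pair gets the same score."""
    def train(self):
        pass  # nothing to learn

    def predict(self):
        # Fill the user-item relevance matrix with a constant score.
        self.predictions = np.zeros((len(self.users), len(self.items)))

model = ConstantScore(users=[0, 1, 2], items=[0, 1])
model.train()
model.predict()
print(model.get_predictions().shape)  # (3, 2)
```

A real subclass would inherit from the Model class in models/model.py and follow its constructor signature, as used when the models are created earlier in this notebook.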