[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)

# Collaborative filtering: refactoring the code
-----

In this practical, you will need to refactor the code seen during the lesson in order to deal with the [Movielens 1M Dataset](https://grouplens.org/datasets/movielens/1m/)

## 1. Preparations

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os.path as op
import imp
import numpy as np

from zipfile import ZipFile
try:
    from urllib.request import urlretrieve
except ImportError:  # Python 2 compat
    from urllib import urlretrieve

# this line need to be changed:
data_folder = '/home/andy/00_workspace/dl/data/08/content/'


ML_1M_URL = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
ML_1M_FILENAME = op.join(data_folder,ML_1M_URL.rsplit('/', 1)[1])
ML_1M_FOLDER = op.join(data_folder,'ml-1m')

  """


In [4]:
import os
if not op.exists(data_folder):
    os.makedirs(data_folder)

if not op.exists(ML_1M_FILENAME):
    print('Downloading %s to %s...' % (ML_1M_URL, ML_1M_FILENAME))
    urlretrieve(ML_1M_URL, ML_1M_FILENAME)

if not op.exists(ML_1M_FOLDER):
    print('Extracting %s to %s...' % (ML_1M_FILENAME, ML_1M_FOLDER))
    ZipFile(ML_1M_FILENAME).extractall(data_folder)

## 2. Data analysis and formating

As in the lesson, we start by loading the data with [Python Data Analysis Library](http://pandas.pydata.org/)

In [5]:
import pandas as pd
all_ratings = pd.read_csv(op.join(ML_1M_FOLDER, 'ratings.dat'), sep='::',
                          names=["user_id", "item_id", "ratings", "timestamp"],engine='python')
all_ratings.head()

Unnamed: 0,user_id,item_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [6]:
list_movies_names = []
list_item_ids = []
with open(op.join(ML_1M_FOLDER, 'movies.dat'), encoding='ISO-8859-1') as fp: # 'movurllib.request import urlretrieve
    for line in fp:
        items = line.split('::')
        list_item_ids.append(items[0])
        list_movies_names.append(items[1])
# except ImportError:  # Python 2 compat
#     from urllib import urlretrieve

# # this line need to be changed:
# # data_folder = '/content/'


# ML_1M_URL = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
# ML_1M_FILENAME = op.join(data_folder,ML_1M_URL.re.split('::')[1])
        
movies_names = pd.DataFrame(list(zip(list_item_ids, list_movies_names)), 
               columns =['item_id', 'item_name']) 
movies_names.head()

Unnamed: 0,item_id,item_name
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


Here we add the title of the movies to the data.

In [7]:
movies_names['item_id']=movies_names['item_id'].astype(int)
all_ratings['item_id']=all_ratings['item_id'].astype(int)
all_ratings = all_ratings.merge(movies_names,on='item_id')

In [8]:
all_ratings.head()

Unnamed: 0,user_id,item_id,ratings,timestamp,item_name
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975)
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975)
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975)
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975)
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975)


The dataframe `all_ratings` contains all the raw data for our problem.

In [9]:
#number of entries
len(all_ratings)

1000209

In [10]:
all_ratings['ratings'].describe()

count    1.000209e+06
mean     3.581564e+00
std      1.117102e+00
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: ratings, dtype: float64

In [11]:
all_ratings['ratings'].unique()

array([5, 4, 3, 2, 1])

In [12]:
all_ratings['user_id'].describe()

count    1.000209e+06
mean     3.024512e+03
std      1.728413e+03
min      1.000000e+00
25%      1.506000e+03
50%      3.070000e+03
75%      4.476000e+03
max      6.040000e+03
Name: user_id, dtype: float64

In [13]:
# number of unique users
total_user_id = len(all_ratings['user_id'].unique())
print(total_user_id)

6040


We see that as in the lesson, the users seem to be indexed from 1 to 6040. Let's check that below.

In [14]:
list_user_id = list(all_ratings['user_id'].unique())
list_user_id.sort()

In [15]:
for i,j in enumerate(list_user_id):
    if j != i+1:
        print(i,j) 

We create a new column `user_num` to get an index from 0 to 6039 for users:

In [16]:
all_ratings['user_num'] = all_ratings['user_id'].apply(lambda x :x-1)

In [17]:
all_ratings.head()

Unnamed: 0,user_id,item_id,ratings,timestamp,item_name,user_num
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),0
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),1
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),11
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),14
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),16


We now look at movies.

In [18]:
all_ratings['item_id'].describe()

count    1.000209e+06
mean     1.865540e+03
std      1.096041e+03
min      1.000000e+00
25%      1.030000e+03
50%      1.835000e+03
75%      2.770000e+03
max      3.952000e+03
Name: item_id, dtype: float64

In [19]:
# number of unique rated items
total_item_id = len(all_ratings['item_id'].unique())
print(total_item_id)

3706


Here there is a clear problem: there are 3706 different movies but the range of `item_id` starts at 1 and ends at 3952. So there are gaps, so the first thing you will need to do is to create a new column `item_num` so that all movies are indexed from 0 to 3705.

In [20]:
#
# your code here
#
# item_id2num_map = pd.DataFrame(all_ratings['item_id'].unique(), columns=['item_id']).sort_values(by=['item_id'])
# item_id2num_map['item_num'] = range(len(item_id2num_map))

item_id2num_map = all_ratings['item_id'].unique()
item_num2id_map = np.arange(max(item_id2num_map)+1)
for i,j in enumerate(item_id2num_map):
    item_num2id_map[j] = i
    
all_ratings['item_num'] = all_ratings['item_id'].apply(lambda x: item_num2id_map[x])

del item_id2num_map, item_num2id_map

This function will verify that your result is correct.

In [21]:
def check_ratings_num(df):
    item_num = set(df['item_num'])
    if item_num == set(range(len(item_num))):
        return True
    else:
        return False

In [22]:
check_ratings_num(all_ratings)

True

In [23]:
all_ratings.head()

Unnamed: 0,user_id,item_id,ratings,timestamp,item_name,user_num,item_num
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),0,0
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),1,0
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),11,0
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),14,0
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),16,0


Now we will split the data in _train_, _val_ and _test_ be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)

In [24]:
from sklearn.model_selection import train_test_split

ratings_trainval, ratings_test = train_test_split(all_ratings, test_size=0.1, random_state=42)

ratings_train, ratings_val = train_test_split(ratings_trainval, test_size=0.1, random_state=42)

## 3. The model

We will now modify a bit the `FactorizationModel` class seen during the lesson. Internally, we will still use the `Model_dot` but now we use the PyTorch dataloader.

In [25]:
import torch.nn as nn
import torch
import torch.nn.functional as F
import torch.optim as optim

In [26]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [27]:
def df_2_tensor(df, device):
    # return a triplet user_num, item_num, rating from the dataframe
    user_num = np.asarray(df['user_num'])
    item_num = np.asarray(df['item_num'])
    rating = np.asarray(df['ratings'])
    return torch.from_numpy(user_num).to(device), torch.from_numpy(item_num).to(device), torch.from_numpy(rating).to(device)

Below, we construct 3 tensors containing the `user_num`, `item_num` and `rating` for the training set. All tensors have the same shape so that `train_user_num[i]` watched `train_item_num[i]` and gave a rating of `train_rating[i]`.

In [28]:
train_user_num, train_item_num, train_rating = df_2_tensor(ratings_train,device)

We now do the same thing for the validation and test sets.

In [29]:
val_user_num, val_item_num, val_rating = df_2_tensor(ratings_val,device)
test_user_num, test_item_num, test_rating = df_2_tensor(ratings_test,device)

The code below is taken from the lesson

In [30]:
class ScaledEmbedding(nn.Embedding):
    """
    Embedding layer that initialises its values
    to using a normal variable scaled by the inverse
    of the embedding dimension.
    """
    def reset_parameters(self):
        """
        Initialize parameters.
        """

        self.weight.data.normal_(0, 1.0 / self.embedding_dim)
        if self.padding_idx is not None:
            self.weight.data[self.padding_idx].fill_(0)


class ZeroEmbedding(nn.Embedding):
    """
    Used for biases.
    """

    def reset_parameters(self):
        """
        Initialize parameters.
        """

        self.weight.data.zero_()
        if self.padding_idx is not None:
            self.weight.data[self.padding_idx].fill_(0)

In [31]:
class DotModel(nn.Module):
    
    def __init__(self,
                 num_users,
                 num_items,
                 embedding_dim=32):
        
        super(DotModel, self).__init__()
        
        self.embedding_dim = embedding_dim
        
        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)
        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)
        self.user_biases = ZeroEmbedding(num_users, 1)
        self.item_biases = ZeroEmbedding(num_items, 1)
                
        
    def forward(self, user_ids, item_ids):
        #
        # your code
        #
        user_emb = self.user_embeddings(user_ids) + self.user_biases(user_ids) # (M, emb=3)
        item_emb = self.item_embeddings(item_ids) + self.item_biases(item_ids) # (N, emb=3)
        return user_emb.mul(item_emb).sum(dim=1)

In [32]:
net = DotModel(total_user_id,total_item_id).to(device)

Now test your network on a small batch.

In [33]:
predicitions = net(train_user_num[:5], train_item_num[:5])
predicitions

tensor([ 0.0026,  0.0067, -0.0038, -0.0017,  0.0102], device='cuda:0',
       grad_fn=<SumBackward1>)

In [34]:
def regression_loss(predicted_ratings, observed_ratings):
    return ((observed_ratings - predicted_ratings) ** 2).mean()

In [35]:
loss_fn = regression_loss
loss = loss_fn(predicitions, train_rating[:5])
loss

tensor(11.9758, device='cuda:0', grad_fn=<MeanBackward0>)

Now you need to construct a dataset and a dataloader. For this, you can define a first function taking as arguments the tensors defined above and returning a list; then a second function taking as argument a dataset, the batchsize and a boolean for the shuffling. We will not use anymore the functions `shuffle` and `minibatch` used in the lesson.

In [1]:
def tensor_2_dataset(user,item,rating):
    # your code here
    # input_list = torch.cat((user.unsqueeze(1),item.unsqueeze(1), rating.unsqueeze(1)),dim=1)
    # output_list = rating
    dataset = torch.utils.data.TensorDataset(user, item, rating)
    return dataset

def make_dataloader(dataset,bs,shuffle):
    # your code here
    # index = np.arange(len(dataset[0]))
    # # random index for shuffling
    # if shuffle:
    #     np.random.shuffle(index)
    # # shuffle the dataset
    # bs_cnt = (len(index)-1) // bs +1
    # for i in range(bs_cnt):
    #     if i != bs_cnt-1:
    #         yield (dataset[0][index[i*bs:(i+1)*bs]], dataset[1][index[i*bs:(i+1)*bs]] )
    #     else:
    #         yield (dataset[0][index[i*bs:]], dataset[1][index[i*bs:]] )
    return torch.utils.data.DataLoader(dataset, batch_size=bs, shuffle=shuffle)


In [36]:
train_dataset = tensor_2_dataset(train_user_num,train_item_num, train_rating)
val_dataset = tensor_2_dataset(val_user_num,val_item_num,val_rating)
test_dataset = tensor_2_dataset(test_user_num, test_item_num, test_rating)

In [37]:
train_dataloader = make_dataloader(train_dataset,1024,True)
val_dataloader = make_dataloader(val_dataset,1024, False)
test_dataloader = make_dataloader(test_dataset,1024,False)

Here you need to modify the code seen during the lesson:
 - remove the batch_size in the init
 - the fit function should now take as argument a dataloader for the training and a dataloader for the validation. AT the end of each epoch, you run the test method on the validation set. Then you print both the loss on the training set and on the validation set to see if you are overfitting.

In [48]:
class FactorizationModel(object):
    
    def __init__(self, embedding_dim=32, n_iter=10, l2=0.0,
                 learning_rate=1e-2, device=device, net=None, num_users=None,
                 num_items=None,random_state=None):
        
        self._embedding_dim = embedding_dim
        self._n_iter = n_iter
        self._learning_rate = learning_rate
        self._l2 = l2
        self._device = device
        self._num_users = num_users
        self._num_items = num_items
        self._net = net
        self._optimizer = None
        self._loss_func = None
        self._random_state = random_state or np.random.RandomState()
             
        
    def _initialize(self):
        if self._net is None:
            self._net = DotModel(self._num_users, self._num_items, self._embedding_dim).to(self._device)
        
        self._optimizer = optim.Adam(
                self._net.parameters(),
                lr=self._learning_rate,
                weight_decay=self._l2
            )
        
        self._loss_func = regression_loss
        
    
    @property
    def _initialized(self):
        return self._optimizer is not None
    
    def __repr__(self):
        return _repr_model(self)
    
    def fit(self, dataloader, val_dataloader, verbose=True):       
        if not self._initialized:
            self._initialize()
            
        for epoch_num in range(self._n_iter):
            epoch_loss = 0.0
            self._net.train(True)
            minibatch_num = 0
            #
            # your code
            #
            for users, items, label in dataloader:
                output = self._net(users.to(self._device), items.to(self._device))
                loss = self._loss_func(output, label)
                
                self._optimizer.zero_grad()
                loss.backward()
                self._optimizer.step()
                
                epoch_loss += loss.cpu().item()
                minibatch_num += 1
                
            
            epoch_loss = epoch_loss / (minibatch_num)
            loss_test = self.test(val_dataloader)

            if verbose:
                print('Epoch {}: loss_train {}, loss_val {}'.format(epoch_num, epoch_loss,loss_test))
        
            if np.isnan(epoch_loss) or epoch_loss == 0.0:
                raise ValueError('Degenerate epoch loss: {}'
                                 .format(epoch_loss))
    
    
    def test(self,dataloader, verbose = False):
        self._net.train(False)
        L1loss = torch.nn.L1Loss()
        test_loss = 0.0
        test_mae = 0.0
        minibatch_num = 0
        #
        # your code here (mae = mean absolute error)
        #
        for users, items, label in dataloader:
            output = self._net(users.to(self._device), items.to(self._device))
            mae = L1loss(output, label)
            loss = self._loss_func(output, label)
            
            test_loss += loss.cpu().item()
            test_mae  += mae.cpu().item()
            minibatch_num += 1      
        
        test_loss = test_loss / (minibatch_num)
        test_mae = test_mae / (minibatch_num)
        if verbose:
            print(f"RMSE: {np.sqrt(test_loss)}, MAE: {test_mae}")
        return loss.item()

In [50]:
model = FactorizationModel(embedding_dim=50,  # latent dimensionality
                                   n_iter=5,  # number of epochs of training
                                   learning_rate=5e-4,
                                   l2=1e-8,  # strength of L2 regularization
                                   num_users=total_user_id,
                                   num_items=total_item_id,
                                   device=device)

In [42]:
for user,item, label in train_dataloader:
    print(user.shape)
    print(label.shape)
    break

torch.Size([1024])
torch.Size([1024])


In [53]:
%%time
model.fit(train_dataloader,val_dataloader)

Epoch 0: loss_train 0.8420454224552771, loss_val 0.8758383989334106
Epoch 1: loss_train 0.841398860092717, loss_val 0.8653251528739929
Epoch 2: loss_train 0.8403925844515213, loss_val 0.8735279440879822
Epoch 3: loss_train 0.8399388290414906, loss_val 0.8609673380851746
Epoch 4: loss_train 0.8391016723983216, loss_val 0.8559385538101196
CPU times: user 33.2 s, sys: 121 ms, total: 33.3 s
Wall time: 32.7 s


In [54]:
_= model.test(test_dataloader,True)

RMSE: 0.9206910512105997, MAE: 0.7284607893350173


Play with the parameter to beat the benchmarks presented here: [Surprise](https://github.com/NicolasHug/Surprise)

## 4. Best and worst movies

Now you need to rank the movies according to their bias. For this, you need to recover the biases of the movies, make a list of the pairs `[name of the movie, its bias]` and then sort this list according to the biases. You can use the method sort of a list.

## 5. PCA of movies' embeddings

Now you can also plpay with the embeddings learned by your algorithm for the movies.

In [None]:
from sklearn.decomposition import PCA
from operator import itemgetter

[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)