The objective of this notebook (project) is to test multiple recommendation systems for the movielens dataset. Here are the different models that we have tried to implement

1- **Matrix Factorization techniques with SGD learning (https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf)**
- Can't use SVD because the matrix is sparse.
- We replicate this behaviour using deep learning.
- Add bias terms & regularization

2- Wide & Deep for recommender systems: We have tried to implement the following paper https://arxiv.org/pdf/1606.07792.pdf from Google, however, as the paper states, the wide part is built using binary cross-product features. Those features would need to manually be built and would explode the amount of neurons in the wide part.
More information on feature crosses: https://datascience.stackexchange.com/questions/57435/how-is-the-cross-product-transformation-defined-for-binary-features?newreg=2093b549d07e43db92e28eccecb6a73b

3- Neural Collaborative Filtering

Dataset used: http://files.grouplens.org/datasets/movielens/ml-1m-README.txt

In [1]:
### Notes to properly run the notebook
# At the time of developping this notebook, tensorboard was not fully integrated in skorch
# so it has to be installed from the sources
# git clone https://github.com/skorch-dev/skorch.gitts && cd skorch && python setup.py install

# The ipywidgets package needs to be installed to see the progressbar checkpoint
# It also needs to be activated like this: jupyter nbextension enable --py widgetsnbextension

In [2]:
### Important notes for JSGL
# If doing gridsearch, don't activate the function from dataloaders that spawn multiprocesses, memory will be hogged

In [3]:
import datetime
import itertools
import numpy as np
import pandas as pd
import patsy
import time

import sklearn
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
#from torch.utils.tensorboard import SummaryWriter
from torch import optim
from torch.autograd import Variable

from skorch import NeuralNet
from skorch.helper import predefined_split, SliceDataset
from skorch.callbacks import BatchScoring, Checkpoint, EarlyStopping, EpochScoring, LRScheduler, TensorBoard, ProgressBar

# Install latest Tensorflow build
#!pip install -q tf-nightly-2.0-preview
import tensorflow as tf
from tensorflow import summary
#%load_ext tensorboard

In [4]:
# Torch parameters
identifier = 'cuda:0' if torch.cuda.is_available() else 'cpu'
device = torch.device(identifier)
device = 'cpu'
print('Using device ', device)

print('Using torch version ', torch.__version__)

torch.set_printoptions(precision=7)

Using device  cpu
Using torch version  1.0.1


In [6]:
#!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
#!unzip -o ml-1m.zip

### Dataset

In [5]:
class rsdataset(Dataset):
    def __init__(self, usersfile, moviesfile, ratingsfile, nrows=None):
        
        # Read files
        self.movies = pd.read_csv(moviesfile, sep='::', names=['MovieID', 'Title', 'Genres'], engine='python')
        self.users = pd.read_csv(usersfile, sep='::', names=['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode'], engine='python')
        self.ratings = pd.read_csv(ratingsfile, sep='::', names=['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python', nrows=nrows)
        
        assert self.users['UserID'].nunique() >= self.ratings['UserID'].nunique(), 'UserID with unknown information'
        assert self.movies['MovieID'].nunique() >= self.ratings['MovieID'].nunique(), 'Movies with unknown information'

        self.users_emb_columns = []
        self.users_ohe_columns = []
        self.movies_emb_columns = []
        self.movies_ohe_columns = []
        self.interact_columns = []

        self.nusers = self.ratings['UserID'].nunique()
        self.nmovies = self.ratings['MovieID'].nunique()

        self.y_range = (self.ratings['Rating'].min(), self.ratings['Rating'].max())

    def __len__(self):
        return len(self.y)
        
    def __getitem__(self, idx):
        """
        What have we learned regarding tensors and GPU memory
        -----------------------------------------------------
        I used to transfer one big chunk of data from RAM to cpu/gpu ... It
        was very long because it took time. The problem was that ohe are encoded in int64.

        It was the same thing for cpu and gpu. I thought that gpu would be longer because it has
        to transfer different memory bus probably? But it was the same thing with the CPU, very long

        I tried pinned_memory=True in the dataloader, and it was the same thing.

        I shipped everything in memory and in __getitem__ I sliced everything after.
        I saved a few seconds (5 seconds)

        One of the biggest increase was when all the dataset was transformed in 5 tensors.
        I tried changing num_workers and the speed increased a lot. 

        """
        return (((self.users_emb[idx])),
                ((self.users_ohe[idx])),
                ((self.movies_emb[idx])),
                ((self.movies_ohe[idx])),
                ((self.interact[idx]))), (self.y[idx])

    def to_tensor(self):
        self.users_emb = torch.from_numpy(self.ratings[self.users_emb_columns].values)
        self.users_ohe = torch.from_numpy(self.ratings[self.users_ohe_columns].values)
        self.movies_emb = torch.from_numpy(self.ratings[self.movies_emb_columns].values)
        self.movies_ohe = torch.from_numpy(self.ratings[self.movies_ohe_columns].values)
        self.interact = torch.from_numpy(self.ratings[self.interact_columns].values)
        self.y = torch.tensor(self.y.values, dtype=torch.float)

In [38]:
train = rsdataset('ml-1m/users.dat', 'ml-1m/movies.dat', 'ml-1m/ratings.dat', nrows=10000)

### Preprocessing of dataset

In [39]:
train.ratings = train.ratings.merge(train.movies, left_on='MovieID', right_on='MovieID')
train.movies = train.ratings[train.movies.columns]

train.ratings = train.ratings.merge(train.users, left_on='UserID', right_on='UserID')
train.users = train.ratings[train.users.columns]

train.y = train.ratings['Rating']

In [40]:
# Label Encode users
#columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode']
columns = ['UserID', 'Gender', 'Age', 'Occupation']
train.ratings[columns] = train.ratings[columns].apply(preprocessing.LabelEncoder().fit_transform)
#train.ratings[columns] = train.ratings[columns].astype('object')
train.users_emb_columns = train.users_emb_columns + columns

In [41]:
# Label Encode movies
columns = ['MovieID']
train.ratings[columns] = train.ratings[columns].apply(preprocessing.LabelEncoder().fit_transform)
#train.ratings[columns] = train.ratings[columns].astype('object')
train.movies_emb_columns = train.movies_emb_columns + columns

In [42]:
# One Hot Encode users
#columns = ['Gender', 'Age', 'Occupation', 'Zipcode']
columns = ['Gender', 'Age', 'Occupation']
ohe = preprocessing.OneHotEncoder(categories='auto', sparse=False, dtype='int8')
ohe.fit(train.ratings[columns])
train.ratings = pd.concat([train.ratings, pd.DataFrame(data=ohe.transform(train.ratings[columns]), columns=ohe.get_feature_names(columns))], axis=1)
train.users_ohe_columns = train.users_ohe_columns + columns

In [43]:
# One Hot Encode movies (non exclusive)
genres = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

for genre in genres:
    genre = genre.replace('-', '')
    column = 'Genre_' + str(genre)
    train.ratings[column] = train.ratings['Genres'].apply(lambda x: 1 if genre in x else 0)
    train.movies_ohe_columns.append(column)

In [44]:
int_genres_gender = ""
for genre in train.movies_ohe_columns:
    int_genres_gender = int_genres_gender + '+' +genre + ':Gender'

int_genres_age = ""
for genre in train.movies_ohe_columns:
    int_genres_age = int_genres_gender + '+' + genre + ':Age'
    
interact = patsy.dmatrix("0 + Gender:Age + Gender:Occupation + Age:Occupation"+int_genres_gender+int_genres_age, data=train.ratings.astype('object'), return_type='dataframe').astype('int8')
interact = interact.astype('uint8')
train.ratings = pd.concat([train.ratings, interact], axis=1)
train.interact_columns = interact.columns

In [24]:
# Drop unused columns
train.movies.drop(['Title', 'Genres'], inplace=True, axis=1)
train.ratings.drop(['Title', 'Genres', 'Zipcode'], inplace=True, axis=1)

KeyError: "['Title' 'Genres'] not found in axis"

In [25]:
train.to_tensor()

### DataLoaders

In [26]:
# Split
train_size = int(0.8 * len(train))
test_size = len(train) - train_size
train_dataset, valid_dataset = torch.utils.data.random_split(train, [train_size, test_size])

# Create dataloaders
dataloaders = {}
dataloaders['train'] = torch.utils.data.DataLoader(train_dataset, batch_size=4096, shuffle=True)
dataloaders['valid'] = torch.utils.data.DataLoader(valid_dataset, batch_size=4096, shuffle=True)

### Define structure of model

In [27]:
class deepnwide(nn.Module):
    """
    Set of hyperparams: params = {
        'lr': [0.001, 0.01],
        'module__size_emb': [30, 60, 120],
        'module__dropout': [0.2, 0.5]
}
    Best run: -0.9766627748807272 {'lr': 0.001, 'module__dropout': 0.2, 'module__size_emb': 30}

    """
    def __init__(self, users_emb, movies_emb, users_ohe, movies_ohe, interact, nemb, size_emb, y_range, dropout):
        super().__init__()
        
        self.name = 'deepnwide'
        self.y_range = y_range

        # wide
        # ohe part - We don't need to specify nothing here
        
        # deep
        self.emb_UserID = nn.Embedding(len(torch.unique(users_emb[:, 0])), size_emb)
        self.emb_UserID.weight.data.uniform_(-.01, .01)
        self.emb_Gender = nn.Embedding(len(torch.unique(users_emb[:, 1])), size_emb)
        self.emb_Gender.weight.data.uniform_(-.01, .01)
        self.emb_Age = nn.Embedding(len(torch.unique(users_emb[:, 2])), size_emb)
        self.emb_Age.weight.data.uniform_(-.01, .01)
        self.emb_Occupation = nn.Embedding(len(torch.unique(users_emb[:, 3])), size_emb)
        self.emb_Occupation.weight.data.uniform_(-.01, .01)
        self.emb_MovieID = nn.Embedding(len(torch.unique(movies_emb[:, 0])), size_emb)
        self.emb_MovieID.weight.data.uniform_(-.01, .01)

        # hidden layers
        self.h1 = nn.Linear(nemb * size_emb, 100)
        self.h2 = nn.Linear(100, 100)
        self.h3 = nn.Linear(100, 100)

        # Dropout layers
        self.dropout1 = nn.Dropout(p=dropout)
        self.dropout2 = nn.Dropout(p=dropout)
        self.dropout3 = nn.Dropout(p=dropout)

        # final dense layer 
        self.last_layer = nn.Linear((interact.shape[1]) + (movies_ohe.shape[1]) + (100), 1)


    def forward(self, X):
        """
        Classic Matrix Factorization with bias term.
        MSE: 1.07
        """
        # Assign data
        user_emb = X[0]
        user_ohe = X[1]
        movie_emb = X[2]
        movie_ohe = X[3]
        interact = X[4]
        
        UserID = user_emb[:, 0]
        Gender = user_emb[:, 1]
        Age = user_emb[:, 2]
        Occupation = user_emb[:, 3]
        MovieID = movie_emb[:, 0]

        UserID = self.emb_UserID(UserID)
        Gender = self.emb_Gender(Gender)
        Age = self.emb_Age(Age)
        Occupation = self.emb_Occupation(Occupation)
        MovieID = self.emb_MovieID(MovieID)

        emb = torch.cat([UserID,
                         Age,
                         Gender,
                         Occupation,
                         MovieID],
                         dim=1)
        
        emb = F.relu(self.dropout1(self.h1(emb)))
        emb = F.relu(self.dropout2(self.h2(emb)))
        emb = F.relu(self.dropout3(self.h3(emb)))

        result = self.last_layer(torch.cat([interact.float(), movie_ohe.float(), emb.float()], dim=1))

        return (torch.sigmoid(result) * (self.y_range[1]-self.y_range[0]) + self.y_range[0]).squeeze()


model = deepnwide(train.users_emb, train.movies_emb, train.users_ohe, train.movies_ohe, train.interact, 5, 60, train.y_range, 0.5)
model.to(device)
print(model)

deepnwide(
  (emb_UserID): Embedding(6040, 60)
  (emb_Gender): Embedding(2, 60)
  (emb_Age): Embedding(7, 60)
  (emb_Occupation): Embedding(21, 60)
  (emb_MovieID): Embedding(3706, 60)
  (h1): Linear(in_features=300, out_features=100, bias=True)
  (h2): Linear(in_features=100, out_features=100, bias=True)
  (h3): Linear(in_features=100, out_features=100, bias=True)
  (dropout1): Dropout(p=0.5)
  (dropout2): Dropout(p=0.5)
  (dropout3): Dropout(p=0.5)
  (last_layer): Linear(in_features=542, out_features=1, bias=True)
)


In [28]:
class twoembeds(torch.nn.Module):
    """
    The best results that we were able to achieve on a randomly split test dataset of size 0.2:
    - embsize: 30
    - LR: 0.005 with LRstep of gamma 0.1 step size 7
    - batch size of 4096

    Results:
    epoch:18
    RMSE: 0.8578
    train MSE: 0.5658
    valid MSE: 0.7359
    dur: 25.3224    
    """
    def __init__(self, size_emb):
        super().__init__()

        # set name of model
        self.name = 'twoembeds'

        # User and movie embeddings
        self.emb_UserID = nn.Embedding(train.nusers, size_emb)
        self.emb_MovieID = nn.Embedding(train.nmovies, size_emb)
        self.emb_UserID.weight.data.uniform_(-.01, .01)
        self.emb_MovieID.weight.data.uniform_(-.01, .01)
        
        # User and movie embeddings weights
        self.emb_UserID_b = nn.Embedding(train.nusers, 1)
        self.emb_MovieID_b = nn.Embedding(train.nmovies, 1)
        self.emb_UserID_b.weight.data.uniform_(-.01, .01)
        self.emb_MovieID_b.weight.data.uniform_(-.01, .01)
 

    def forward(self, X):
        """
        Classic Matrix Factorization with bias term.
        Best RMSE on validation dataset: ~1.08
        """
        user_emb = X[0]
        user_ohe = X[1]
        movie_emb = X[2]
        movie_ohe = X[3]
        interact = X[4]

        UserID = user_emb[:, 0]
        MovieID = movie_emb[:, 0]

        user_emb = self.emb_UserID(UserID)
        movie_emb = self.emb_MovieID(MovieID)

        mult = (user_emb * movie_emb).sum(1)

        # add bias
        multb = mult + self.emb_UserID_b(UserID).squeeze() + self.emb_MovieID_b(MovieID).squeeze()

        multb = multb.float()

        return multb


model = twoembeds(15)
model.to(device)
print(model)

twoembeds(
  (emb_UserID): Embedding(6040, 15)
  (emb_MovieID): Embedding(3706, 15)
  (emb_UserID_b): Embedding(6040, 1)
  (emb_MovieID_b): Embedding(3706, 1)
)


### Skorch

In [29]:
# Earlystopping callback
earlystopping = EarlyStopping(monitor='valid_loss', patience=5, threshold=0.005)

In [30]:
# RMSE callback
def rmseloss(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmseloss)

epoch_rmse = EpochScoring(rmse_scorer, name='rmse_score', lower_is_better=True)

In [31]:
# Checkpoint callback
checkpoint = Checkpoint(monitor='rmse_score_best', f_params='params.pt', f_optimizer='optimizer.pt', f_history='history.json', f_pickle='model')

In [32]:
# Learning rate scheduler callback
lr_scheduler = LRScheduler(policy="StepLR", step_size=7, gamma=0.1)

In [33]:
# Progressbar callback
progressbar = ProgressBar()

In [25]:
# Tensorboard
!rm -rf runs/*
writer = SummaryWriter()
%tensorboard --logdir 'runs/'

ERROR: Failed to launch TensorBoard (exited with 1).
Contents of stderr:
Traceback (most recent call last):
  File "/home/jsleroux/anaconda3/envs/recsys/bin/tensorboard", line 10, in <module>
    sys.exit(run_main())
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/tensorboard/main.py", line 64, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/tensorboard/program.py", line 220, in main
    server = self._make_server()
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/tensorboard/program.py", line 301, in _make_server
    self.assets_zip_provider)
  File "/home/jsleroux/anaconda3/

In [36]:
deepnwidenet = NeuralNet(
    deepnwide,
    module__users_emb=train.users_emb,
    module__movies_emb=train.movies_emb,
    module__users_ohe=train.users_ohe,
    module__movies_ohe=train.movies_ohe,
    module__interact=train.interact,
    module__nemb=5,
    module__size_emb=30,
    module__y_range=train.y_range,
    module__dropout=0.2,
    max_epochs=30,
    lr=0.001,
    optimizer=torch.optim.Adam,
    criterion=torch.nn.MSELoss,
    device=device,
    iterator_train__batch_size=1024,
    iterator_train__num_workers=0,
    iterator_train__shuffle=True,
    iterator_valid__batch_size=4096,
    train_split=predefined_split(valid_dataset),
    callbacks=[
               earlystopping,
               epoch_rmse,
               #checkpoint,
               lr_scheduler,
               #TensorBoard(writer),
               #progressbar
               ]
)

In [37]:
deepnwidenet.fit(train_dataset)

  epoch    rmse_score    train_loss    valid_loss       dur
-------  ------------  ------------  ------------  --------
      1        0.9029        0.8925        0.8152  109.6266
      2        0.8949        0.8114        0.8009  112.3800
      3        0.8859        0.7915        0.7848  118.3299
      4        0.8789        0.7723        0.7724  119.1442
      5        0.8771        0.7565        0.7694  121.6192


<class 'skorch.net.NeuralNet'>[initialized](
  module_=deepnwide(
    (emb_UserID): Embedding(6040, 30)
    (emb_Gender): Embedding(2, 30)
    (emb_Age): Embedding(7, 30)
    (emb_Occupation): Embedding(21, 30)
    (emb_MovieID): Embedding(3706, 30)
    (h1): Linear(in_features=150, out_features=100, bias=True)
    (h2): Linear(in_features=100, out_features=100, bias=True)
    (h3): Linear(in_features=100, out_features=100, bias=True)
    (dropout1): Dropout(p=0.2)
    (dropout2): Dropout(p=0.2)
    (dropout3): Dropout(p=0.2)
    (last_layer): Linear(in_features=542, out_features=1, bias=True)
  ),
)

In [53]:
params = {
    'lr': [0.001, 0.01],
    'module__size_emb': [30, 60, 120],
    'module__dropout': [0.2, 0.5]
}
gs = GridSearchCV(deepnwidenet,
                  params,
                  verbose=50,
                  refit=False,
                  pre_dispatch=8,
                  n_jobs=8,
                  cv=3,
                  scoring='neg_mean_squared_error')

X_ds = SliceDataset(train, idx=0)
y_ds = SliceDataset(train, idx=1)
gs.fit(X_ds, y_ds)

print(gs.best_score_, gs.best_params_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   1 tasks      | elapsed: 22.9min
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed: 23.4min
[Parallel(n_jobs=8)]: Done   3 tasks      | elapsed: 29.1min
[Parallel(n_jobs=8)]: Done   4 tasks      | elapsed: 29.5min
[Parallel(n_jobs=8)]: Done   5 tasks      | elapsed: 34.9min
[Parallel(n_jobs=8)]: Done   6 tasks      | elapsed: 40.3min
[Parallel(n_jobs=8)]: Done   7 tasks      | elapsed: 41.0min
[Parallel(n_jobs=8)]: Done   8 tasks      | elapsed: 47.0min
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed: 47.9min
[Parallel(n_jobs=8)]: Done  10 tasks      | elapsed: 52.2min
[Parallel(n_jobs=8)]: Done  11 tasks      | elapsed: 62.8min
[Parallel(n_jobs=8)]: Done  12 tasks      | elapsed: 68.2min
[Parallel(n_jobs=8)]: Done  13 tasks      | elapsed: 72.5min
[Parallel(n_jobs=8)]: Done  14 tasks      | elapsed: 80.8min
[Parallel(

In [34]:
twoembedsnet = NeuralNet(
    twoembeds,
    module__size_emb=128,
    max_epochs=30,
    lr=0.001,
    optimizer=torch.optim.Adam,
    criterion=torch.nn.MSELoss,
    device=device,
    iterator_train__batch_size=4096,
    iterator_train__num_workers=4,
    iterator_train__shuffle=True,
    iterator_valid__batch_size=4096,
    train_split=predefined_split(valid_dataset),
    callbacks=[earlystopping,
               epoch_rmse,
               #checkpoint,
               lr_scheduler]
)

In [0]:
twoembedsnet.fit(train_dataset)

In [37]:
params = {
    #'lr': [0.001, 0.01],
    'module__size_emb': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
gs = GridSearchCV(twoembedsnet, params, verbose=50, refit=False, n_jobs=-1, cv=2, scoring='neg_mean_squared_error')

X_ds = SliceDataset(train, idx=0)
y_ds = SliceDataset(train, idx=1)
gs.fit(X_ds, y_ds)

print(gs.best_score_, gs.best_params_)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   52.1s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   54.0s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   55.9s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   57.9s


Exception in thread QueueManagerThread:
Traceback (most recent call last):
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/joblib-0.13.2-py3.7.egg/joblib/externals/loky/process_executor.py", line 662, in _queue_management_worker
    for work_id, work_item in pending_work_items.items():
RuntimeError: dictionary changed size during iteration



TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/joblib-0.13.2-py3.7.egg/joblib/externals/loky/backend/queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/joblib-0.13.2-py3.7.egg/joblib/externals/loky/backend/reduction.py", line 243, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/joblib-0.13.2-py3.7.egg/joblib/externals/loky/backend/reduction.py", line 236, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/site-packages/joblib-0.13.2-py3.7.egg/joblib/externals/cloudpickle/cloudpickle.py", line 267, in dump
    return Pickler.dump(self, obj)
  File "/home/jsleroux/anaconda3/envs/recsys/lib/python3.7/pickle.py", line 437, i

In [0]:
### Helper function to time the speed of the dataset in a dataloader
for i in torch.utils.data.DataLoader(train_dataset, batch_size=4096, num_workers=4, shuffle=False):
    #a, b = i
    a = i
    #for j in a:
        #print(j.type())

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fef7a5fd9e8>>
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fef7a5fd9e8>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 677, in __del__
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 677, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 659, in _shutdown_workers
    self._shutdown_workers()
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fef7a5fd9e8>>
    w.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child 

CPU times: user 883 ms, sys: 354 ms, total: 1.24 s
Wall time: 26.3 s


In [0]:
current_time = str(datetime.datetime.now().timestamp())
train_log_dir = '/content/drive2/My Drive/tb/logs/tensorboard/train/' + current_time
valid_log_dir = '/content/drive2/My Drive/tb/logs/tensorboard/valid/' + current_time
train_summary_writer = summary.create_file_writer(train_log_dir)
valid_summary_writer = summary.create_file_writer(valid_log_dir)

In [0]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=50):
    since = time.time()
    
    globaliter = 0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        
        globaliter +=1

        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train(True)
            else:
                model.train(False)
            
            running_loss = 0.0

            #for users_emb, users_ohe, movies_emb, movies_ohe, interact, labels in dataloaders[phase]:
            for X, labels in dataloaders[phase]:
   
                #users_emb, users_ohe, movies_emb, movies_ohe, interact, labels = (Variable(users_emb.to(device)),
                #                                                              Variable(users_ohe.to(device)), 
                #                                                                  Variable(movies_emb.to(device)),
                #                                                                  Variable(movies_ohe.to(device)),
                #                                                                  Variable(interact.to(device)),
                #                                                                  Variable(labels.to(device)).float())
                
                optimizer.zero_grad()
                
                outputs = model(X)

                loss = criterion(outputs, labels)

                if phase == 'train':
                    loss.backward()
                    optimizer.step()    
                
                #print(model.emb_UserID.weight.grad)

                running_loss += loss.data
            
            epoch_loss = running_loss / (len(dataloaders[phase].dataset))
            sqrt_loss = torch.sqrt(epoch_loss)
            
            # Tensorboard logging
            if phase == 'train':
                with train_summary_writer.as_default():
                    tf.summary.scalar(model.name + ' RMSE', sqrt_loss.item(), step=globaliter)
            else:
                with valid_summary_writer.as_default():
                    tf.summary.scalar(model.name + ' RMSE', sqrt_loss.item(), step=globaliter)
            
            print('{} loss: MSE: {:.6f} RMSE: {:.6f}'.format(phase, epoch_loss, sqrt_loss))

        time_elapsed = time.time() - since
        print(time_elapsed)

In [0]:
criterion = nn.MSELoss(reduction='sum') # we use sum because in the training loop, we divide by the length of the dataset
#optimizer_ft = optim.SGD(model.parameters(), lr=0.001, weight_decay=1e-5)#, momentum=0.9, weight_decay=1e-5)
optimizer_ft = optim.Adam(model.parameters(), lr=0.001)
#optimizer_ft = RAdam(model.parameters())
train_model(model, criterion, optimizer_ft, None, 50)

Epoch 0/49
----------


RuntimeError: ignored

In [0]:
%tensorboard --logdir '/content/drive2/My Drive/tb/logs/tensorboard'

### Surprise

In [0]:
!pip install surprise

In [0]:
from surprise import NormalPredictor
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, KFold

In [0]:
user = train[dataloaders['train'].dataset.indices][0][:, 0].data.numpy()
movie = train[dataloaders['train'].dataset.indices][2][:, 0].data.numpy()
y = train[dataloaders['train'].dataset.indices][5].data.numpy()
df = pd.DataFrame({'user': user, 'movie': movie, 'y': y})
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'movie', 'y']], reader)

In [0]:
data = Dataset.load_from_df(train.ratings.loc[dataloaders['train'].dataset.indices, ['UserID', 'MovieID', 'Rating']], reader)

In [0]:
a = train.ratings.loc[dataloaders['train'].dataset.indices, ['UserID', 'MovieID', 'Rating']]
b = pd.DataFrame({'UserID': user, 'MovieID': movie, 'Rating': y})

In [0]:
#data = Dataset.load_builtin('ml-1m')

In [0]:
# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)