<a href="https://colab.research.google.com/github/osamaoun97/MovieLens_Recommender_System/blob/model_training/notebooks/Multi_task_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook trains a multi-task matrix factorization model for recommendation.<br>
We'll use both the implicit interactions and the explicit ratings in the MovieLens `ml-latest-small` dataset (about 100k ratings).<br>
We'll use TensorFlow Recommenders (TFRS) to do this.

## Import TFRS

First, install TFRS and import the packages we need.

In [1]:
!pip install -q tensorflow_recommenders


In [2]:
from typing import Dict, Text
import tensorflow as tf
import tensorflow_recommenders as tfrs
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd
import os
from sklearn.model_selection import train_test_split

In [3]:
SEED = 19011

In [4]:
# python version: 3.10.11
tf.__version__, tfrs.__version__

('2.12.0', 'v0.7.3')

## Download and extract data

In [5]:
DATA_DIR = 'data'

In [6]:
if not os.path.exists(DATA_DIR):
  os.mkdir(DATA_DIR)

In [7]:
compressed_file_URL = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
compressed_filename = compressed_file_URL.split('/')[-1] # ml-latest-small.zip
extracted_filename = compressed_filename.split('.')[0] # ml-latest-small

In [8]:
compressed_file_path = os.path.join(DATA_DIR, compressed_filename) # data/ml-latest-small.zip

In [9]:
urlretrieve(compressed_file_URL, compressed_file_path) # Download file from url

('data/ml-latest-small.zip', <http.client.HTTPMessage at 0x7fbaec289ab0>)

In [10]:
with ZipFile(compressed_file_path, 'r') as zip_ref:
    zip_ref.extractall(path=DATA_DIR) # extract file to data/
os.remove(compressed_file_path)

## Load, prepare and split data

In [11]:
file_path = os.path.join(DATA_DIR, extracted_filename)

In [12]:
ratings = pd.read_csv(file_path + '/ratings.csv')
movies = pd.read_csv(file_path + '/movies.csv')

In [13]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [15]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [17]:
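# NOTE: DataFrame.join(on='movieId') matches ratings['movieId'] against the *index* of
# `movies` (0, 1, 2, ...), not against movies['movieId'], so titles are shifted by one row:
# in the output below, row 0 (movieId 1, 'Toy Story (1995)') is labelled 'Jumanji (1995)'.
# To join on the column instead, use ratings.merge(movies, on='movieId', how='inner').
ratings = ratings.join(movies, on='movieId', lsuffix='', rsuffix='_', how='inner')[['userId', 'title', 'rating', 'timestamp']].rename(columns={'title':'movieTitle'})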
ratings

Unnamed: 0,userId,movieTitle,rating,timestamp
0,1,Jumanji (1995),4.0,964982703
516,5,Jumanji (1995),4.0,847434962
874,7,Jumanji (1995),4.5,1106635946
1434,15,Jumanji (1995),2.5,1510577970
1667,17,Jumanji (1995),4.5,1305696483
...,...,...,...,...
99945,610,Mrs. Henderson Presents (2005),3.5,1479542444
100012,610,Planet 51 (2009),3.0,1493848602
100033,610,Source Code (2011),2.5,1479544865
100038,610,"Master, The (2012)",4.0,1495959169


In [18]:
movies = movies.rename(columns={'title':'movieTitle'})
movies

Unnamed: 0,movieId,movieTitle,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


We'll convert the IDs from int to string so that they can be processed by the Keras StringLookup layer.

In [19]:
ratings['userId'] = ratings['userId'].map(lambda id_int: str(id_int))
movies['movieId'] = movies['movieId'].map(lambda id_int: str(id_int))
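To make the role of the string IDs concrete, here is a minimal pure-Python sketch of what a string lookup does conceptually: it maps each distinct string to an integer index that can then address a row of an embedding table. The helper names `build_vocabulary` and `lookup` are hypothetical, and the first-seen ordering is a simplification; the real Keras layer chooses its own ordering when adapted.

```python
def build_vocabulary(values):
    """Map each distinct string to a consecutive integer index (first-seen order)."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def lookup(vocab, values):
    """Translate strings to indices; an unknown string raises KeyError,
    mirroring a lookup that has no out-of-vocabulary bucket."""
    return [vocab[value] for value in values]


# Toy user IDs, already converted from int to str.
user_ids = [str(i) for i in [1, 5, 7, 5, 1]]
user_vocab = build_vocabulary(user_ids)
indices = lookup(user_vocab, ["7", "1"])
print(user_vocab, indices)
```

Because the lookups in this notebook are created with `num_oov_indices=0`, every ID passed to the model has to be in the vocabulary, which is why the validation and test sets are filtered to known movies.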

In [20]:
train_valid , test = train_test_split(ratings, test_size=0.2, stratify=ratings['userId'], random_state=SEED)
train, valid = train_test_split(train_valid, test_size=0.1, stratify=train_valid['userId'], random_state=SEED)

In [21]:
# Keep only interactions of movies seen in training data
valid = valid[valid['movieTitle'].isin(train['movieTitle'].unique())]
test = test[test['movieTitle'].isin(train_valid['movieTitle'].unique())]

In [22]:
train.shape, valid.shape, test.shape

((56115, 4), (6095, 4), (15248, 4))

## Data Preparation

We'll create a TensorFlow dataset for each of the train, validation and test sets. This makes modelling easier.

Each element in the dataset is a dictionary that has the following keys:
- userId: This is the id of the user for whom we make recommendations. It is an input feature.
- movieTitle: This is the title of the movie. It is an input feature for the ranking task and the target for the retrieval task.
- rating: This is the rating the user gave to the movie. It is our target label for the ranking task.
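As a rough illustration of the element structure described above, the following pure-Python sketch slices a dictionary of equal-length columns into one dictionary per row, which is conceptually what slicing a dict of arrays along its first axis yields. The helper `slice_rows` and the toy values are made up for the example.

```python
def slice_rows(columns):
    """Turn a dict of equal-length columns into a list of per-row dicts."""
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*(columns[name] for name in names))]


# Toy columns (not taken from the dataset).
columns = {
    "userId": ["1", "1", "5"],
    "movieTitle": ["Movie A", "Movie B", "Movie A"],
    "rating": [4.0, 3.5, 5.0],
}

rows = slice_rows(columns)
for row in rows:
    print(row)
```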

In [23]:
train_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':train['userId'].values, 'movieTitle': train['movieTitle'].values, 'rating': train['rating'].values})
valid_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':valid['userId'].values, 'movieTitle': valid['movieTitle'].values, 'rating': valid['rating'].values})
test_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':test['userId'].values, 'movieTitle': test['movieTitle'].values, 'rating': test['rating'].values})
train_rating_dataset

<_TensorSliceDataset element_spec={'userId': TensorSpec(shape=(), dtype=tf.string, name=None), 'movieTitle': TensorSpec(shape=(), dtype=tf.string, name=None), 'rating': TensorSpec(shape=(), dtype=tf.float64, name=None)}>

In [24]:
movie_dataset = tf.data.Dataset.from_tensor_slices(train['movieTitle'].unique())
user_dataset = tf.data.Dataset.from_tensor_slices(train['userId'].unique())

In [25]:
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='users_lookup', num_oov_indices=0)
movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='movies_lookup', num_oov_indices=0)

In [26]:
user_ids_vocabulary.adapt(user_dataset.map(lambda x: x))

In [27]:
movie_titles_vocabulary.adapt(movie_dataset.map(lambda x: x))

In [28]:
n_users = user_ids_vocabulary.vocabulary_size()
n_movies = movie_titles_vocabulary.vocabulary_size()
n_users, n_movies

(610, 4960)

## Define model

The model has three components:
- user_model: This converts the userId to a dense vector representing the user's preferences.
- movie_model: This converts the movieTitle to a dense vector representing the movie's characteristics.
- rating_model: This takes the outputs of the user_model and movie_model and predicts the rating this user would give this movie.

The model has two tasks (each task has a loss and metrics):
- Ranking task: This task ranks each candidate by estimating the rating the user would give it. It uses a regression loss computed between the predicted rating and the true rating. Here the loss is mean squared error and the metric is root mean squared error.
- Retrieval task: This task generates candidates for a user from all the movies in our dataset. It is framed as classification, since the output is one of a fixed set of movies. The loss is cross-entropy and the metric is top-k accuracy.
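The two losses and their weighted sum can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the TFRS implementation: the retrieval loss is written as a softmax cross-entropy over the other movies in the same batch, averaged over rows, and the function names and toy values are hypothetical.

```python
import numpy as np


def mse_loss(true_ratings, predicted_ratings):
    """Regression loss for the ranking task."""
    return float(np.mean((true_ratings - predicted_ratings) ** 2))


def in_batch_retrieval_loss(user_emb, movie_emb):
    """Cross-entropy where, for row i, movie i is the positive and the
    other movies in the batch act as negatives."""
    scores = user_emb @ movie_emb.T                        # scores[i, j]: user i vs. movie j
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
user_emb = rng.normal(size=(4, 8))
movie_emb = rng.normal(size=(4, 8))
true_ratings = np.array([4.0, 3.5, 5.0, 2.0])
predicted_ratings = np.array([3.8, 3.0, 4.5, 2.5])

rating_weight, retrieval_weight = 1.0, 1.0
rating_loss = mse_loss(true_ratings, predicted_ratings)
retrieval_loss = in_batch_retrieval_loss(user_emb, movie_emb)
total_loss = rating_weight * rating_loss + retrieval_weight * retrieval_loss
print(rating_loss, retrieval_loss, total_loss)
```

Lowering one of the weights shifts the shared embeddings toward the other task, which is the knob the trials in this notebook vary.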


Note: Top-k accuracy (for example, k=10) measures how often the target (a movie, in our case) appears among the top k candidates generated by the model.
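A small NumPy sketch of this metric, assuming we already have a score for every (user, candidate) pair; the function name and the toy scores are illustrative only.

```python
import numpy as np


def top_k_accuracy(scores, true_items, k):
    """Fraction of rows whose true item is among the k highest-scoring candidates."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_items[:, None]).any(axis=1)
    return float(hits.mean())


# Three queries, four candidates each (toy values).
scores = np.array([
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.1, 0.7],
    [0.4, 0.3, 0.2, 0.1],
])
true_items = np.array([2, 3, 3])

top_1 = top_k_accuracy(scores, true_items, k=1)
top_2 = top_k_accuracy(scores, true_items, k=2)
top_4 = top_k_accuracy(scores, true_items, k=4)
print(top_1, top_2, top_4)
```

In this toy case no target is ranked first, two of the three targets are ranked second, and the last target is ranked fourth, so accuracy grows from 0 to 2/3 to 1 as k increases.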


In [29]:
class MultitaskRecommender(tfrs.Model):
  def __init__(self, embedding_dimension=32, rating_weight: float=1., retrieval_weight: float =1.) -> None:
    super().__init__()

    # Set up user and movie representations.
    self.movie_model = tf.keras.Sequential(
        [
          movie_titles_vocabulary,
          tf.keras.layers.Embedding(n_movies, embedding_dimension, name='movie_embedding')
        ],
        name='movie_model')

    self.user_model = tf.keras.Sequential(
        [
          user_ids_vocabulary,
          tf.keras.layers.Embedding(n_users, embedding_dimension, name='user_embedding')
        ],
        name='user_model')

    # Set up MLP to predict rating from user and movie representation
    self.rating_model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dense(128,  activation='relu'),
            tf.keras.layers.Dense(1)
        ],
        name='rating_model')

    # Set up ranking and retrieval tasks
    self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
          loss=tf.keras.losses.MeanSquaredError(name='MSE'),
          metrics=[tf.keras.metrics.RootMeanSquaredError(name="RMSE")],
      )

    self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movie_dataset.batch(128).map(self.movie_model),
            ks = (5,10)
        )
    )

    # Set up weights for rating task and retrieval task
    self.rating_weight = rating_weight
    self.retrieval_weight = retrieval_weight

  def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["userId"])
    # And pick out the movie features and pass them into the movie model.
    movie_embeddings = self.movie_model(features["movieTitle"])

    return (
        user_embeddings,
        movie_embeddings,
        # We apply the multi-layered rating model to a concatenation of
        # user and movie embeddings.
        self.rating_model(
            tf.concat([user_embeddings, movie_embeddings], axis=1)
        ),
    )

  def compute_loss(self, features_label: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

    ratings = features_label.pop("rating")

    user_embeddings, movie_embeddings, rating_predictions = self(features_label)

    # We compute the loss for each task.
    rating_loss = self.rating_task(labels=ratings, predictions=rating_predictions)
    retrieval_loss = self.retrieval_task(user_embeddings, movie_embeddings)

    # And combine them using the loss weights.
    return self.rating_weight*rating_loss + self.retrieval_weight*retrieval_loss

## Compile and fit

Let's batch and cache the training and validation datasets first. We'll use a batch size of 8192 for training and 4096 for validation.

In [30]:
cached_train = train_rating_dataset.shuffle(100_000).batch(8192).cache()
# Validate on the held-out validation split (not on the training data).
cached_valid = valid_rating_dataset.batch(4096).cache()

### Trial 1

In [31]:
# This function decays the learning rate exponentially:
# every epoch, the current rate is multiplied by exp(-0.05).
def scheduler(epoch, lr):
  return lr * tf.math.exp(-0.05)
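To see what this schedule does over a run, the sketch below replays it in plain Python, assuming the callback calls the function once at the start of each epoch with the current rate. `math.exp` stands in for `tf.math.exp`, and `simulate` is a hypothetical helper.

```python
import math


def scheduler(epoch, lr):
    # Same rule as in the notebook: multiply the current rate by exp(-0.05).
    return lr * math.exp(-0.05)


def simulate(initial_lr, epochs):
    """Return the learning rate used in each epoch."""
    lr = initial_lr
    history = []
    for epoch in range(epochs):
        lr = scheduler(epoch, lr)
        history.append(lr)
    return history


rates = simulate(0.01, 30)
print(f"epoch 1: {rates[0]:.5f}, epoch 10: {rates[9]:.5f}, epoch 30: {rates[-1]:.5f}")
```

After n epochs the rate is the initial rate times exp(-0.05 n), so a 30-epoch run ends at roughly 22% of the starting value.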

In [32]:
multi_model = MultitaskRecommender(embedding_dimension=32, rating_weight=1., retrieval_weight=1)
multi_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [33]:
multi_model.fit(cached_train, epochs=10, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbaec937ee0>

In [34]:
multi_model.fit(cached_train, epochs=20, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)], initial_epoch=10)

Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fba535cd840>

In [35]:
multi_model.fit(cached_train, epochs=30, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)], initial_epoch=10)

Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fba535fb3a0>

### Trial 2: Higher Embedding Dimension

In [36]:
multi_model_deeper = MultitaskRecommender(embedding_dimension=64, rating_weight=1., retrieval_weight=1)
multi_model_deeper.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [37]:
multi_model_deeper.fit(cached_train, epochs=30, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fba535cf880>

### Trial 3: More weight for retrieval

In [38]:
multi_model_retrieval = MultitaskRecommender(embedding_dimension=32, rating_weight=0.8, retrieval_weight=1)
multi_model_retrieval.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [39]:
multi_model_retrieval.fit(cached_train, epochs=30, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fb9b4202980>

In [40]:
multi_model_retrieval.fit(cached_train, epochs=40, validation_data=cached_valid, callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)], initial_epoch=30)

Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7fb9b9dbc940>

## Final Model & Evaluation

We'll choose the first model since it has the best performance with fewer parameters.

Before testing it on the test set, we'll retrain it on the train + validation sets combined.

### Retraining on train + valid

In [41]:
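# NOTE: `ratings` holds every interaction, including the rows in `test`, so the test
# metrics reported below are optimistic. Build this dataset from `train_valid` instead
# to retrain on train + validation only.
total_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':ratings['userId'].values, 'movieTitle': ratings['movieTitle'].values, 'rating': ratings['rating'].values})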
total_rating_dataset

<_TensorSliceDataset element_spec={'userId': TensorSpec(shape=(), dtype=tf.string, name=None), 'movieTitle': TensorSpec(shape=(), dtype=tf.string, name=None), 'rating': TensorSpec(shape=(), dtype=tf.float64, name=None)}>

In [42]:
total_movie_dataset = tf.data.Dataset.from_tensor_slices(ratings['movieTitle'].unique())
total_user_dataset = tf.data.Dataset.from_tensor_slices(ratings['userId'].unique())

In [43]:
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='users_lookup', num_oov_indices=0)
movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='movies_lookup', num_oov_indices=0)

In [44]:
user_ids_vocabulary.adapt(total_user_dataset.map(lambda x: x))

In [45]:
movie_titles_vocabulary.adapt(total_movie_dataset.map(lambda x: x))

In [46]:
n_users = user_ids_vocabulary.vocabulary_size()
n_movies = movie_titles_vocabulary.vocabulary_size()
n_users, n_movies

(610, 5389)

In [47]:
class MultitaskRecommender(tfrs.Model):
  def __init__(self, embedding_dimension=32, rating_weight: float=1., retrieval_weight: float =1.) -> None:
    super().__init__()

    # Set up user and movie representations.
    self.movie_model = tf.keras.Sequential(
        [
          movie_titles_vocabulary,
          tf.keras.layers.Embedding(n_movies, embedding_dimension, name='movie_embedding')
        ],
        name='movie_model')

    self.user_model = tf.keras.Sequential(
        [
          user_ids_vocabulary,
          tf.keras.layers.Embedding(n_users, embedding_dimension, name='user_embedding')
        ],
        name='user_model')

    # Set up MLP to predict rating from user and movie representation
    self.rating_model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dense(128,  activation='relu'),
            tf.keras.layers.Dense(1)
        ],
        name='rating_model')

    # Set up ranking and retrieval tasks
    self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
          loss=tf.keras.losses.MeanSquaredError(name='MSE'),
          metrics=[tf.keras.metrics.RootMeanSquaredError(name="RMSE")],
      )

    self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movie_dataset.batch(128).map(self.movie_model),
            ks = (5,10)
        )
    )

    # Set up weights for rating task and retrieval task
    self.rating_weight = rating_weight
    self.retrieval_weight = retrieval_weight

  def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["userId"])
    # And pick out the movie features and pass them into the movie model.
    movie_embeddings = self.movie_model(features["movieTitle"])

    return (
        user_embeddings,
        movie_embeddings,
        # We apply the multi-layered rating model to a concatenation of
        # user and movie embeddings.
        self.rating_model(
            tf.concat([user_embeddings, movie_embeddings], axis=1)
        ),
    )

  def compute_loss(self, features_label: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

    ratings = features_label.pop("rating")

    user_embeddings, movie_embeddings, rating_predictions = self(features_label)

    # We compute the loss for each task.
    rating_loss = self.rating_task(labels=ratings, predictions=rating_predictions)
    retrieval_loss = self.retrieval_task(user_embeddings, movie_embeddings)

    # And combine them using the loss weights.
    return self.rating_weight*rating_loss + self.retrieval_weight*retrieval_loss

In [48]:
multi_model_final = MultitaskRecommender(embedding_dimension=32, rating_weight=1., retrieval_weight=1)
multi_model_final.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [49]:
cached_total = total_rating_dataset.shuffle(100_000).batch(8192).cache()

In [50]:
multi_model_final.fit(cached_total, epochs=30, callbacks = [tf.keras.callbacks.LearningRateScheduler(scheduler)])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fb9af6d4ee0>

### Evaluation

In [51]:
cached_test = test_rating_dataset.batch(4096).cache()

In [52]:
multi_model_final.evaluate(cached_test)



[0.9608275294303894,
 0.022232424467802048,
 0.0409889817237854,
 23123.341796875,
 0,
 23123.341796875]

## Indexers

An indexer stores the embeddings of the candidates as keys. When it receives a query, it embeds the query and retrieves the closest keys.

For our recommendation task, it stores the movie embeddings and uses the user model to embed queries. To recommend for a user, it returns the movies whose embeddings are most similar (by dot product) to the user's embedding.
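The idea can be sketched in a few lines of NumPy: score every stored key against the query with a dot product and keep the k best. This is a conceptual illustration with made-up embeddings, not the TFRS layer; `brute_force_top_k` is a hypothetical helper.

```python
import numpy as np


def brute_force_top_k(query_emb, candidate_emb, candidate_ids, k):
    """Score all candidates by dot product and return the k best with their scores."""
    scores = candidate_emb @ query_emb
    order = np.argsort(-scores)[:k]
    return [candidate_ids[i] for i in order], scores[order]


# Toy 2-dimensional embeddings.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
movie_emb = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
user_emb = np.array([0.9, 0.2])

top_titles, top_scores = brute_force_top_k(user_emb, movie_emb, titles, k=2)
print(top_titles, top_scores)
```

Brute force scans every candidate, which is fine for a few thousand movies; larger catalogues usually call for an approximate index.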

In [53]:
# Use brute-force search to set up retrieval using the trained representations.
user_recommender = tfrs.layers.factorized_top_k.BruteForce(multi_model_final.user_model, k=100)

In [54]:
user_recommender.index_from_dataset(
    total_movie_dataset.batch(100).map(lambda title: (title, multi_model_final.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7fb9af7f7dc0>

In [55]:
# Get some recommendations.
_, titles = user_recommender(tf.constant(["90"]))
print(f"Top 10 recommendations for user 90: {titles[:,:10]}")

Top 10 recommendations for user 90: [[b'Foxfire (1996)' b'Cutthroat Island (1995)'
  b'Postman, The (Postino, Il) (1994)' b'Boxing Helena (1993)'
  b'All About Eve (1950)' b'When Night Is Falling (1995)'
  b'Free Willy 2: The Adventure Home (1995)' b'Tom and Huck (1995)'
  b"Jason's Lyric (1994)" b'Little Rascals, The (1994)']]


### Item-Item recommendation

For item similarity, we can use the movie embeddings as both queries and keys.
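One thing to keep in mind with this setup is that dot-product similarity is not normalized, so a movie is not guaranteed to be its own nearest neighbour: an embedding with a large norm can outrank it. The NumPy sketch below uses made-up embeddings to contrast dot-product and cosine ranking; whether this explains any particular result in this notebook is an assumption, not something verified here.

```python
import numpy as np


def rank_by_dot(query_index, emb):
    """Indices of all items, ordered by dot product with the query item."""
    scores = emb @ emb[query_index]
    return np.argsort(-scores).tolist()


def rank_by_cosine(query_index, emb):
    """Same ranking after scaling every embedding to unit length."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return rank_by_dot(query_index, unit)


# Toy embeddings: item 1 points in nearly the same direction as item 0 but is much longer.
movie_emb = np.array([
    [1.0, 0.0],
    [3.0, 0.5],
    [0.0, 1.0],
])

dot_order = rank_by_dot(0, movie_emb)
cos_order = rank_by_cosine(0, movie_emb)
print(dot_order, cos_order)
```

With the dot product, item 1 ranks ahead of the query item itself; with cosine similarity, the query item comes first.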

In [56]:
movie_recommender = tfrs.layers.factorized_top_k.BruteForce(multi_model_final.movie_model, k=100)

In [57]:
movie_recommender.index_from_dataset(
    total_movie_dataset.batch(100).map(lambda title: (title, multi_model_final.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7fb9a2e5dd20>

In [58]:
# Get some recommendations.
_, titles2 = movie_recommender(tf.constant(["Withnail & I (1987)"]))
print(f"Top 10 movies similar to 'Withnail & I (1987)': {titles2[:,:10]}")

Top 10 movies similar to 'Withnail & I (1987)': [[b'Withnail & I (1987)' b'Mother (1996)' b'Escape from New York (1981)'
  b'Indian Summer (a.k.a. Alive & Kicking) (1996)'
  b'Kiss Me, Guido (1997)' b'Event Horizon (1997)'
  b'Wings of Desire (Himmel \xc3\xbcber Berlin, Der) (1987)'
  b'Wishmaster (1997)' b'Kull the Conqueror (1997)' b'Stripes (1981)']]


In [59]:
# Get some recommendations.
_, titles2 = movie_recommender(tf.constant(["Freaky Friday (2003)"]), k=25)
print(f"Top 10 movies similar to 'Freaky Friday (2003)': {titles2[:,:10]}")

Top 10 movies similar to 'Freaky Friday (2003)': [[b'Christmas Story, A (1983)' b'Foxfire (1996)' b'Murder at 1600 (1997)'
  b"When the Cat's Away (Chacun cherche son chat) (1996)" b'Shaft (2000)'
  b'Mexican, The (2001)'
  b"Cat o' Nine Tails, The (Gatto a nove code, Il) (1971)"
  b'Breakfast Club, The (1985)' b'Wildcats (1986)'
  b'Bronx Tale, A (1993)']]


# Exporting the model

We'll save and export this model for use in our movie recommendation platform.

This time we'll retrain it on the entire MovieLens dataset.

In [60]:
movielens_dataset = tf.data.Dataset.from_tensor_slices({'userId':ratings['userId'].values, 'movieTitle': ratings['movieTitle'].values, 'rating': ratings['rating'].values})

In [61]:
movie_dataset = tf.data.Dataset.from_tensor_slices(ratings['movieTitle'].unique())
user_dataset = tf.data.Dataset.from_tensor_slices(ratings['userId'].unique())

In [62]:
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='users_lookup', num_oov_indices=0)
movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='movies_lookup', num_oov_indices=0)

In [63]:
user_ids_vocabulary.adapt(user_dataset.map(lambda x: x))

In [64]:
movie_titles_vocabulary.adapt(movie_dataset.map(lambda x: x))

In [65]:
n_users = user_ids_vocabulary.vocabulary_size()
n_movies = movie_titles_vocabulary.vocabulary_size()
n_users, n_movies

(610, 5389)

In [66]:
class MultitaskRecommender(tfrs.Model):
  def __init__(self, embedding_dimension=32, rating_weight: float=1., retrieval_weight: float =1.) -> None:
    super().__init__()

    # Set up user and movie representations.
    self.movie_model = tf.keras.Sequential(
        [
          movie_titles_vocabulary,
          tf.keras.layers.Embedding(n_movies, embedding_dimension, name='movie_embedding')
        ],
        name='movie_model')

    self.user_model = tf.keras.Sequential(
        [
          user_ids_vocabulary,
          tf.keras.layers.Embedding(n_users, embedding_dimension, name='user_embedding')
        ],
        name='user_model')

    # Set up MLP to predict rating from user and movie representation
    self.rating_model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dense(128,  activation='relu'),
            tf.keras.layers.Dense(1)
        ],
        name='rating_model')

    # Set up ranking and retrieval tasks
    self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
          loss=tf.keras.losses.MeanSquaredError(name='MSE'),
          metrics=[tf.keras.metrics.RootMeanSquaredError(name="RMSE")],
      )

    self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movie_dataset.batch(128).map(self.movie_model),
            ks = (5,10)
        )
    )

    # Set up weights for rating task and retrieval task
    self.rating_weight = rating_weight
    self.retrieval_weight = retrieval_weight

  def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["userId"])
    # And pick out the movie features and pass them into the movie model.
    movie_embeddings = self.movie_model(features["movieTitle"])

    return (
        user_embeddings,
        movie_embeddings,
        # We apply the multi-layered rating model to a concatenation of
        # user and movie embeddings.
        self.rating_model(
            tf.concat([user_embeddings, movie_embeddings], axis=1)
        ),
    )

  def compute_loss(self, features_label: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

    ratings = features_label.pop("rating")

    user_embeddings, movie_embeddings, rating_predictions = self(features_label)

    # We compute the loss for each task.
    rating_loss = self.rating_task(labels=ratings, predictions=rating_predictions)
    retrieval_loss = self.retrieval_task(user_embeddings, movie_embeddings)

    # And combine them using the loss weights.
    return self.rating_weight*rating_loss + self.retrieval_weight*retrieval_loss

In [67]:
final_model = MultitaskRecommender(embedding_dimension=32, rating_weight=1., retrieval_weight=1)
final_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [68]:
cached_movielens = movielens_dataset.shuffle(100_000).batch(8192).cache()

In [69]:
final_model.fit(cached_movielens, epochs=30, callbacks = [tf.keras.callbacks.LearningRateScheduler(scheduler)])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fb9a2da50f0>

In [71]:
final_model.retrieval_task =  tfrs.tasks.Retrieval()

In [72]:
final_model.save('multitask_recommender')



In [73]:
loaded = tf.keras.models.load_model('multitask_recommender')

In [75]:
ratings['movieTitle'].unique()

array(['Jumanji (1995)', 'Waiting to Exhale (1995)', 'Sabrina (1995)',
       ..., 'Source Code (2011)', 'Master, The (2012)', 'Breathe (2014)'],
      dtype=object)

In [76]:
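all_titles = ratings['movieTitle'].unique()
# Score every movie for user 52: repeat the user id once per title.
model_input = {"userId": tf.tile(["52"], [len(all_titles)]), "movieTitle": all_titles}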
user_embeddings, movie_embeddings, predicted_ratings = loaded(model_input)

In [77]:
predicted_ratings

<tf.Tensor: shape=(5389, 1), dtype=float32, numpy=
array([[3.8308444],
       [3.1783068],
       [3.541618 ],
       ...,
       [3.5341933],
       [3.481803 ],
       [3.5044954]], dtype=float32)>

In [79]:
recommended_items = tf.gather(ratings['movieTitle'].unique(), tf.squeeze(tf.argsort(predicted_ratings, axis=0, direction='DESCENDING')))
recommended_items

<tf.Tensor: shape=(5389,), dtype=string, numpy=
array([b'Kiss Me, Guido (1997)', b'Billy Elliot (2000)',
       b'Oklahoma! (1955)', ..., b'Keeping the Faith (2000)',
       b'Primal Fear (1996)', b'Total Eclipse (1995)'], dtype=object)>

In [82]:
movie_recommender = tfrs.layers.factorized_top_k.BruteForce(loaded.movie_model, k=100)

In [83]:
movie_recommender.index_from_dataset(
    movie_dataset.batch(100).map(lambda title: (title, loaded.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7fb9844b3640>

In [84]:
# Get some recommendations.
_, titles2 = movie_recommender(tf.constant(["Freaky Friday (2003)"]), k=25)
print(f"Top 10 movies similar to 'Freaky Friday (2003)': {titles2[:,:10]}")

Top 10 movies similar to 'Freaky Friday (2003)': [[b'Ishtar (1987)' b'Christmas Story, A (1983)' b'15 Minutes (2001)'
  b"Hard Day's Night, A (1964)" b'Places in the Heart (1984)'
  b'Toys (1992)' b'Deep Impact (1998)' b'Mrs. Dalloway (1997)'
  b'Two Jakes, The (1990)' b'American Ninja (1985)']]
