# Movie Recommendation using TensorFlow Recommenders!
This notebook demonstrates how to use your own dataset with TensorFlow Recommenders.  

For reference, the same data can also be loaded with **tfds.load("movielens/100k-ratings", split="train")**.  

I don't use that here, because the official tutorials already show how to work with it.

Most of the code below is based on the [official tutorial provided by TensorFlow Recommenders](https://www.tensorflow.org/recommenders/examples/basic_retrieval).


---

# preparation

## import libraries

In [1]:
%pip install -q tensorflow-recommenders


In [2]:
from typing import Dict, Text

import numpy as np
import pandas as pd
import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

## download zip file from [MovieLens official website](https://grouplens.org/datasets/movielens/100k/) and unzip it

In [3]:
!wget https://files.grouplens.org/datasets/movielens/ml-100k.zip

--2022-03-11 15:10:06--  https://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2022-03-11 15:10:06 (23.7 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]



In [4]:
!unzip /content/ml-100k.zip

Archive:  /content/ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


## read files

In [5]:
df_ratings = pd.read_csv(
    "/content/ml-100k/u.data", 
    sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"]
)
df_ratings.head(2)

   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742


In [6]:
df_ratings.shape

(100000, 4)

In [7]:
df_movies = pd.read_csv(
    "/content/ml-100k/u.item", 
    sep="|",
    usecols=[0,1], 
    names=["movie_id", "movie_title"],
    encoding="latin-1"
)
df_movies.head(2)

   movie_id       movie_title
0         1  Toy Story (1995)
1         2  GoldenEye (1995)


In [8]:
df_movies.shape

(1682, 2)

In [9]:
df_merged = df_ratings.merge(df_movies, on="movie_id")
df_merged.head(2)

   user_id  movie_id  rating  timestamp   movie_title
0      196       242       3  881250949  Kolya (1996)
1       63       242       3  875747190  Kolya (1996)


# create dataset objects

### create ratings dataset from df_merged

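**IMPORTANT NOTE**    
df_merged.user_id is currently an integer column (int64).   
The user model defined later looks up ids with a StringLookup layer, which expects string inputs, so we convert the column to strings first.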

In [10]:
df_merged.dtypes

user_id         int64
movie_id        int64
rating          int64
timestamp       int64
movie_title    object
dtype: object

In [11]:
df_merged.user_id = df_merged.user_id.astype(str)
df_merged.head(2)

  user_id  movie_id  rating  timestamp   movie_title
0     196       242       3  881250949  Kolya (1996)
1      63       242       3  875747190  Kolya (1996)


In [12]:
df_merged.dtypes

user_id        object
movie_id        int64
rating          int64
timestamp       int64
movie_title    object
dtype: object

In [13]:
ratings = tf.data.Dataset.from_tensor_slices({
    "user_id": df_merged.user_id.tolist(),
    "movie_title": df_merged.movie_title.tolist(),
    "rating": df_merged.rating.tolist(),
    "timestamp": df_merged.timestamp.tolist()
})

In [14]:
# print the first 2 elements
list(ratings.take(2).as_numpy_iterator())

[{'movie_title': b'Kolya (1996)',
  'rating': 3,
  'timestamp': 881250949,
  'user_id': b'196'},
 {'movie_title': b'Kolya (1996)',
  'rating': 3,
  'timestamp': 875747190,
  'user_id': b'63'}]

In [15]:
# I'll use only user_id and movie_title for now.
ratings = ratings.map(lambda x: {
    "user_id": x["user_id"],
    "movie_title": x["movie_title"]
})
list(ratings.take(2).as_numpy_iterator())

[{'movie_title': b'Kolya (1996)', 'user_id': b'196'},
 {'movie_title': b'Kolya (1996)', 'user_id': b'63'}]

### movies dataset

In [16]:
movies = tf.data.Dataset.from_tensor_slices({
    "movie_title": df_movies.movie_title.tolist()
})

In [17]:
list(movies.take(2).as_numpy_iterator())

[{'movie_title': b'Toy Story (1995)'}, {'movie_title': b'GoldenEye (1995)'}]

In [18]:
movies = movies.map(lambda x: x["movie_title"])
list(movies.take(2).as_numpy_iterator())

[b'Toy Story (1995)', b'GoldenEye (1995)']

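**ANOTHER IMPORTANT NOTE**  
The movies dataset should contain each title only once. u.item has 1682 rows but only 1664 distinct titles, and this dataset is later used as the candidate set for retrieval, where a repeated title could show up more than once in the results.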

In [19]:
movies = movies.unique()
list(movies.take(2).as_numpy_iterator())

[b'Toy Story (1995)', b'GoldenEye (1995)']
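
To see which titles are repeated before dropping them, a quick count over the title list is enough. The sketch below uses a made-up list as a stand-in for df_movies.movie_title.tolist(); the titles are placeholders, not real duplicates from MovieLens.

```python
from collections import Counter

# Hypothetical stand-in for df_movies.movie_title.tolist().
titles = ["Movie A (1990)", "Movie B (1991)", "Movie A (1990)", "Movie C (1992)"]

# Titles that occur more than once.
counts = Counter(titles)
duplicated = [title for title, n in counts.items() if n > 1]

# Order-preserving deduplication: keep the first occurrence of each title.
unique_titles = list(dict.fromkeys(titles))

print(duplicated)     # ['Movie A (1990)']
print(unique_titles)  # ['Movie A (1990)', 'Movie B (1991)', 'Movie C (1992)']
```

On the real data, the same check can be run on the pandas column before building the tf.data dataset.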

# define our model

## get unique user_ids and movie_titles

In [20]:
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])
movie_titles = movies.batch(1_000)

unique_user_ids = np.unique(np.concatenate(list(user_ids)))
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))

In [21]:
len(unique_user_ids), len(unique_movie_titles)

(943, 1664)

In [22]:
class MovieLensModel(tfrs.Model):
    def __init__(self, embedding_dimension=32):
        super().__init__()

        self.user_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids)+1, embedding_dimension)
        ])

        self.movie_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles)+1, embedding_dimension)
        ])

        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(self.movie_model)
            )
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.movie_model(features["movie_title"])

        return self.task(user_embeddings, movie_embeddings)
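
The `+1` in each Embedding size accounts for out-of-vocabulary tokens. Assuming the Keras defaults (mask_token=None and one OOV bucket), StringLookup maps unknown strings to index 0 and the vocabulary to indices 1..N, so the embedding table needs N+1 rows. The plain-Python sketch below mimics that mapping; build_lookup is a hypothetical helper, not a Keras function.

```python
def build_lookup(vocabulary):
    """Map each token to 1..N, reserving index 0 for unknown tokens."""
    table = {token: i + 1 for i, token in enumerate(vocabulary)}
    return lambda token: table.get(token, 0)

# Illustrative user ids, already converted to strings.
vocab = ["196", "63", "42"]
lookup = build_lookup(vocab)

indices = [lookup(token) for token in ["63", "42", "9999"]]
num_embedding_rows = len(vocab) + 1  # one extra row for the OOV index

print(indices)             # [2, 3, 0]
print(num_embedding_rows)  # 4
```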

In [23]:
model = MovieLensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

# train our model

## before training, shuffle datasets and cache them to make training faster

In [24]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

In [25]:
cached_train = shuffled.shuffle(100_000).batch(8192).cache()
cached_test = shuffled.batch(4096).cache()
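
Note that cached_train and cached_test above are both built from the full `shuffled` dataset, so validation runs on ratings the model has already trained on. For a held-out score, the usual approach is to shuffle once with a fixed seed and cut the result into two disjoint parts; with tf.data this corresponds to `shuffled.take(80_000)` and `shuffled.skip(80_000)`, as in the official tutorial. The sketch below shows the idea in plain Python; holdout_split and the example pairs are illustrative only.

```python
import random

def holdout_split(examples, train_fraction=0.8, seed=42):
    """Shuffle once with a fixed seed, then cut into disjoint train/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up (user_id, movie_title) pairs, not real MovieLens rows.
examples = [(str(u), f"Movie {m}") for u in range(5) for m in range(4)]
train, test = holdout_split(examples)

print(len(train), len(test))        # 16 4
print(set(train).isdisjoint(test))  # True
```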

In [26]:
%time model.fit(cached_train, validation_data=cached_test, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 4min 41s, sys: 12.6 s, total: 4min 53s
Wall time: 3min 17s


<keras.callbacks.History at 0x7effdc277510>

# It's time for recommendation!

In [27]:
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)

_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'Rudy (1993)' b'Client, The (1994)' b'Maverick (1994)']
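
Conceptually, a brute-force index scores every candidate embedding against the query embedding and keeps the k highest scores. The NumPy sketch below illustrates that idea with dot-product scores; it is not the TFRS implementation, and the embeddings and identifiers are made up.

```python
import numpy as np

def brute_force_top_k(query, candidates, identifiers, k=3):
    """Score every candidate by dot product and return the k best."""
    scores = candidates @ query        # shape: (num_candidates,)
    top = np.argsort(-scores)[:k]      # indices of the highest scores first
    return scores[top], [identifiers[i] for i in top]

# Toy 2-d embeddings for four candidates.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
identifiers = ["A", "B", "C", "D"]
query = np.array([1.0, 0.2])

top_scores, top_ids = brute_force_top_k(query, candidates, identifiers, k=2)
print(top_ids)  # ['A', 'C']
```

Scoring all candidates is exact but scales linearly with the catalogue size, which is fine for the 1664 titles here.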


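After 3 epochs, "val_factorized_top_k/top_100_categorical_accuracy" reached 0.3359.    
Keep in mind that cached_test is built from the same ratings as cached_train, so this is not a held-out score.  
Still, the prediction seems to be working 😊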
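
A top-k categorical accuracy can be read as the fraction of examples whose true movie appears among the k highest-scoring candidates. The NumPy sketch below computes that quantity on made-up scores; top_k_accuracy is a hypothetical helper for illustration, not the TFRS metric itself.

```python
import numpy as np

def top_k_accuracy(scores, true_index, k):
    """Fraction of rows whose true candidate is among the k highest scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_index[:, None]).any(axis=1)
    return hits.mean()

# Made-up scores: 3 queries x 4 candidates.
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],   # true candidate 0 is ranked 1st
    [0.2, 0.8, 0.1, 0.5],   # true candidate 3 is ranked 2nd
    [0.6, 0.5, 0.1, 0.4],   # true candidate 2 is ranked last
])
true_index = np.array([0, 3, 2])

acc_at_1 = top_k_accuracy(scores, true_index, k=1)
acc_at_2 = top_k_accuracy(scores, true_index, k=2)
print(acc_at_1, acc_at_2)
```

Here one of three queries is a hit at k=1 and two of three at k=2, so the metric can only grow as k increases.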

---

## optional: decode byte strings

In [28]:
# predictions are byte strings...
titles[0, :3].numpy().tolist()

[b'Rudy (1993)', b'Client, The (1994)', b'Maverick (1994)']

In [29]:
# Let's decode them!
(b" ".join(titles[0, :3].numpy().tolist())).decode()

'Rudy (1993) Client, The (1994) Maverick (1994)'
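
Joining with a space makes it hard to tell where one title ends and the next begins. Decoding each element separately keeps the titles apart. The list below is a stand-in for titles[0, :3].numpy().tolist() from the cell above.

```python
# Stand-in for titles[0, :3].numpy().tolist().
raw_titles = [b"Rudy (1993)", b"Client, The (1994)", b"Maverick (1994)"]

# Decode each byte string on its own so the titles stay separate.
decoded = [title.decode("utf-8") for title in raw_titles]
print(decoded)  # ['Rudy (1993)', 'Client, The (1994)', 'Maverick (1994)']
```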