### Case study – Building a hybrid model
In this section, we build a content-based model that borrows from collaborative filtering. Imagine that you have built a website like Netflix. Every time a user watches a movie, you want to display a list of recommendations in the side pane (as YouTube does). At first glance, a content-based recommender seems appropriate for this task: if a person is watching something they find interesting, they will be more inclined to watch something similar.

Let's say our user is watching The Dark Knight. Since this is a Batman movie, our content-based recommender is likely to recommend other Batman (or superhero) movies regardless of quality. This may not always lead to the best recommendations. For instance, most people who like The Dark Knight do not rate Batman and Robin very highly, although both feature the same lead character. Therefore, we introduce a collaborative filter that predicts the ratings the user would give to the movies recommended by our content-based model, and we return the few movies with the highest predicted ratings.

In other words, the workflow of our hybrid model will be as follows:
1. Take in a movie title and user as input
2. Use a content-based model to compute the 25 most similar movies
3. Compute the predicted ratings that the user might give these 25 movies using a collaborative filter
4. Return the top 10 movies with the highest predicted rating
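
The re-ranking in steps 3 and 4 can be sketched independently of any particular model. In the minimal example below, `rerank_candidates`, the candidate list, and the toy rating table are all hypothetical stand-ins: the list plays the role of the content-based output and the dictionary plays the role of the collaborative filter.

```python
from typing import Callable, Dict, List, Tuple


def rerank_candidates(
    candidates: List[str],
    predict_rating: Callable[[str], float],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score each content-based candidate and keep the top_n by predicted rating."""
    scored = [(title, predict_rating(title)) for title in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-ins (made-up values): the output of step 2 and a fake collaborative filter.
similar_movies = ["Batman Begins", "Batman & Robin", "The Dark Knight Rises"]
toy_predictions: Dict[str, float] = {
    "Batman Begins": 4.2,
    "Batman & Robin": 2.1,
    "The Dark Knight Rises": 4.0,
}

top = rerank_candidates(similar_movies, toy_predictions.__getitem__, top_n=2)
print(top)
```

The content-based step decides which movies are eligible, and the predicted rating only decides their order, so a similar but poorly rated title can drop out of the final list.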

In [1]:
import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings("ignore")

In [50]:
# ratings_small.csv file contains 100,000 ratings for 9,000 movies from 700 users.
ratings_small = pd.read_csv("../../data/ratings_small.csv")
ratings_small.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [3]:
ratings_small.shape

(100004, 4)

In [4]:
# links_small.csv maps the MovieLens movieId of each movie rated in ratings_small.csv to its IMDb and TMDB IDs (it contains no titles).
links_small = pd.read_csv("../../data/links_small.csv")
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
links_small.shape

(9125, 3)

In [6]:
movies_metadata = pd.read_csv("../../data/movies_metadata.csv")
movies_metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### Preprocessing metadata

In [7]:
movies_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [8]:
movies_metadata.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [9]:
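# Keep only the columns used to build the content-based features.
relevant_columns = ['id', 'original_title', 'genres', 'original_language', 'overview', 'production_companies', 'production_countries']
# .copy() gives an independent DataFrame, so the column assignments below do not act on a slice of movies_metadata.
relevant_movies_metadata = movies_metadata[relevant_columns].copy()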
relevant_movies_metadata.head()

Unnamed: 0,id,original_title,genres,original_language,overview,production_companies,production_countries
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",en,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o..."


In [10]:
def convert_to_int(x):
    # Return x as an int, or NaN when the value cannot be parsed.
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
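
As an aside, pandas offers a vectorized way to coerce malformed values to NaN. The sketch below uses a small made-up Series rather than the real `id` column. It is not an exact substitute for `convert_to_int`: a string such as `"3.5"` would be kept as `3.5` instead of becoming NaN.

```python
import pandas as pd

# Hypothetical sample: two valid ids and one malformed value.
raw_ids = pd.Series(["862", "8844", "not-an-id"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")
num_missing = int(numeric_ids.isna().sum())
print(numeric_ids.tolist(), num_missing)
```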

In [11]:
# Converting movies id to integer
relevant_movies_metadata['id'] = relevant_movies_metadata['id'].apply(convert_to_int)

In [12]:
relevant_movies_metadata['id'].isna().sum()

3

In [13]:
# Dropping movies without id
relevant_movies_metadata.dropna(subset=['id'], inplace=True)
relevant_movies_metadata.reset_index(drop=True, inplace=True)

In [14]:
relevant_movies_metadata = relevant_movies_metadata.rename(columns={'id':'movieId'})
relevant_movies_metadata.head()

Unnamed: 0,movieId,original_title,genres,original_language,overview,production_companies,production_countries
0,862.0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o..."
1,8844.0,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o..."
2,15602.0,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o..."
3,31357.0,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o..."
4,11862.0,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",en,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o..."


In [15]:
relevant_movies_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45463 entries, 0 to 45462
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movieId               45463 non-null  float64
 1   original_title        45463 non-null  object 
 2   genres                45463 non-null  object 
 3   original_language     45452 non-null  object 
 4   overview              44509 non-null  object 
 5   production_companies  45460 non-null  object 
 6   production_countries  45460 non-null  object 
dtypes: float64(1), object(6)
memory usage: 2.4+ MB


In [16]:
from ast import literal_eval

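def convert_obj_to_list(column):
    # Missing cells become an empty list literal so that literal_eval can parse every row.
    relevant_movies_metadata[column] = relevant_movies_metadata[column].fillna('[]')
    # Each cell is a stringified Python list of dicts; parse it into real objects.
    relevant_movies_metadata[column] = relevant_movies_metadata[column].apply(literal_eval)
    # Keep only the 'name' field of each dict; anything that is not a list becomes [].
    relevant_movies_metadata[column] = relevant_movies_metadata[column].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])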
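
To see what this function does to a single cell, here is a minimal sketch on one sample string written in the same shape as the `genres` column (the value is typed out by hand for illustration).

```python
from ast import literal_eval

# A cell as it might appear in the raw CSV: a string, not a list.
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw_genres)            # now a real list of dicts
names = [entry["name"] for entry in parsed]  # keep only the names
print(names)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.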

In [17]:
convert_obj_to_list('genres')

In [18]:
convert_obj_to_list("production_companies")

In [19]:
convert_obj_to_list("production_countries")

In [20]:
relevant_movies_metadata.head()

Unnamed: 0,movieId,original_title,genres,original_language,overview,production_companies,production_countries
0,862.0,Toy Story,"[Animation, Comedy, Family]",en,"Led by Woody, Andy's toys live happily in his ...",[Pixar Animation Studios],[United States of America]
1,8844.0,Jumanji,"[Adventure, Fantasy, Family]",en,When siblings Judy and Peter discover an encha...,"[TriStar Pictures, Teitler Film, Interscope Co...",[United States of America]
2,15602.0,Grumpier Old Men,"[Romance, Comedy]",en,A family wedding reignites the ancient feud be...,"[Warner Bros., Lancaster Gate]",[United States of America]
3,31357.0,Waiting to Exhale,"[Comedy, Drama, Romance]",en,"Cheated on, mistreated and stepped on, the wom...",[Twentieth Century Fox Film Corporation],[United States of America]
4,11862.0,Father of the Bride Part II,[Comedy],en,Just when George Banks has recovered from his ...,"[Sandollar Productions, Touchstone Pictures]",[United States of America]


In [21]:
for column in ["genres", "production_companies", "production_countries"]:
    relevant_movies_metadata[column] = relevant_movies_metadata[column].apply(lambda x: " ".join(x))

In [22]:
relevant_movies_metadata["movieId"] = relevant_movies_metadata["movieId"].astype(int)

In [23]:
relevant_movies_metadata.head()

Unnamed: 0,movieId,original_title,genres,original_language,overview,production_companies,production_countries
0,862,Toy Story,Animation Comedy Family,en,"Led by Woody, Andy's toys live happily in his ...",Pixar Animation Studios,United States of America
1,8844,Jumanji,Adventure Fantasy Family,en,When siblings Judy and Peter discover an encha...,TriStar Pictures Teitler Film Interscope Commu...,United States of America
2,15602,Grumpier Old Men,Romance Comedy,en,A family wedding reignites the ancient feud be...,Warner Bros. Lancaster Gate,United States of America
3,31357,Waiting to Exhale,Comedy Drama Romance,en,"Cheated on, mistreated and stepped on, the wom...",Twentieth Century Fox Film Corporation,United States of America
4,11862,Father of the Bride Part II,Comedy,en,Just when George Banks has recovered from his ...,Sandollar Productions Touchstone Pictures,United States of America


In [24]:
relevant_movies_metadata.isna().sum()

movieId                   0
original_title            0
genres                    0
original_language        11
overview                954
production_companies      0
production_countries      0
dtype: int64

In [25]:
relevant_movies_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45463 entries, 0 to 45462
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   movieId               45463 non-null  int32 
 1   original_title        45463 non-null  object
 2   genres                45463 non-null  object
 3   original_language     45452 non-null  object
 4   overview              44509 non-null  object
 5   production_companies  45463 non-null  object
 6   production_countries  45463 non-null  object
dtypes: int32(1), object(6)
memory usage: 2.3+ MB


In [26]:
relevant_movies_metadata.fillna('', inplace=True)

In [27]:
def combine_all(x):
    # Concatenate every column except movieId (position 0) into one lowercase string.
    result = ""
    for col in x.iloc[1:]:
        result += " " + str(col).lower()
    return result

In [28]:
relevant_movies_metadata['metasoup'] = relevant_movies_metadata.apply(combine_all, axis=1)

In [29]:
relevant_movies_metadata.head()

Unnamed: 0,movieId,original_title,genres,original_language,overview,production_companies,production_countries,metasoup
0,862,Toy Story,Animation Comedy Family,en,"Led by Woody, Andy's toys live happily in his ...",Pixar Animation Studios,United States of America,toy story animation comedy family en led by w...
1,8844,Jumanji,Adventure Fantasy Family,en,When siblings Judy and Peter discover an encha...,TriStar Pictures Teitler Film Interscope Commu...,United States of America,jumanji adventure fantasy family en when sibl...
2,15602,Grumpier Old Men,Romance Comedy,en,A family wedding reignites the ancient feud be...,Warner Bros. Lancaster Gate,United States of America,grumpier old men romance comedy en a family w...
3,31357,Waiting to Exhale,Comedy Drama Romance,en,"Cheated on, mistreated and stepped on, the wom...",Twentieth Century Fox Film Corporation,United States of America,waiting to exhale comedy drama romance en che...
4,11862,Father of the Bride Part II,Comedy,en,Just when George Banks has recovered from his ...,Sandollar Productions Touchstone Pictures,United States of America,father of the bride part ii comedy en just wh...


### Text encoding

In [30]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
movies_encoding = model.encode(relevant_movies_metadata["metasoup"])

In [31]:
movies_encoding.shape

(45463, 384)

In [32]:
movies_encoding[:5]

array([[ 0.03592231, -0.05000314,  0.07487418, ...,  0.01259368,
         0.09317448,  0.05876607],
       [ 0.0272309 ,  0.07721959, -0.01427438, ..., -0.01563095,
        -0.10285594, -0.02959121],
       [-0.04545005, -0.0388744 , -0.0217113 , ...,  0.0117311 ,
        -0.00314587, -0.00389413],
       [ 0.00624025, -0.07834873, -0.03801201, ..., -0.05192328,
         0.0134725 ,  0.00920105],
       [-0.00046618, -0.05234849,  0.04314076, ...,  0.05090494,
         0.08229707, -0.06663051]], dtype=float32)

### Build title to id and id to title map

In [33]:
# build title to id and id to title mappings
id_map = relevant_movies_metadata[["movieId", "original_title"]]
id_to_title = id_map.set_index("movieId")
id_to_title.head()

Unnamed: 0_level_0,original_title
movieId,Unnamed: 1_level_1
862,Toy Story
8844,Jumanji
15602,Grumpier Old Men
31357,Waiting to Exhale
11862,Father of the Bride Part II


In [34]:
title_to_id = id_map.set_index("original_title")
title_to_id.head()

Unnamed: 0_level_0,movieId
original_title,Unnamed: 1_level_1
Toy Story,862
Jumanji,8844
Grumpier Old Men,15602
Waiting to Exhale,31357
Father of the Bride Part II,11862


### Cosine similarity

In [35]:
from sklearn.metrics.pairwise import linear_kernel

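# linear_kernel computes plain dot products, which equal cosine similarities only for unit-length vectors.
# all-MiniLM-L6-v2 is expected to return normalized embeddings, so the two coincide here.
cosine_similarities = linear_kernel(movies_encoding, movies_encoding)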

In [36]:
cosine_similarities.shape

(45463, 45463)
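
A 45463 × 45463 matrix holds about 2.07 billion entries, roughly 8 GB if stored as float32. When that is too much memory, one alternative is to compute a single row of similarities on demand. The sketch below is one way to do this with NumPy; `top_k_similar` and the toy array are hypothetical and not part of the notebook.

```python
import numpy as np


def top_k_similar(embeddings: np.ndarray, idx: int, k: int = 25) -> np.ndarray:
    """Return row indices of the k rows most similar to row `idx`, best first."""
    # L2-normalize so that the dot product equals cosine similarity.
    # (In practice this would be done once, not on every call.)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[idx]              # one row of the similarity matrix
    scores[idx] = -np.inf                  # exclude the query row itself
    k = max(1, min(k, len(scores) - 1))    # assumes at least two rows
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # order the k hits by score


# Toy embeddings: row 1 points almost the same way as row 0, row 3 the opposite way.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_similar(toy, idx=0, k=2))
```

This trades the one-off cost of the full matrix for a matrix-vector product per query.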

### Movies to indices

In [37]:
indices = pd.Series(relevant_movies_metadata.index, index=relevant_movies_metadata["original_title"])
indices[:2]

original_title
Toy Story    0
Jumanji      1
dtype: int64

In [38]:
movie_title = relevant_movies_metadata["original_title"]
movie_title[:2]

0    Toy Story
1      Jumanji
Name: original_title, dtype: object

### Content based recommender

In [39]:
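def content_recommender(title, num_movies=25):
    if title not in indices:
        raise KeyError("Title Not Found in database.")
    # Row position of the movie (assumes the title is unique in the metadata).
    idx = indices[title]
    # Pair every row position with its similarity to the chosen movie.
    sim_scores = list(enumerate(cosine_similarities[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the movie itself, and keep the next num_movies.
    sim_scores = sim_scores[1:num_movies + 1]
    movie_indices = [i[0] for i in sim_scores]
    return movie_title.iloc[movie_indices]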

In [40]:
content_recommender('Toy Story')

2997                           Toy Story 2
15348                          Toy Story 3
21927                 Toy Story of Terror!
33109                     Babes in Toyland
4799                               The Toy
16396                      The Pixar Story
24520                    Hawaiian Vacation
24522                            Small Fry
25799           Toy Story That Time Forgot
25797                      Partysaurus Rex
36091                 Welcome to Happiness
13724                                   Up
10659                             Luxo Jr.
30313                           Inside Out
19216                        For the Birds
28984                      Superstar Goofy
36424                 The Bear That Wasn't
39012                                Gooby
42232                        The Boss Baby
25565                The Nutcracker Prince
19111                              Tin Toy
27933    Family Guy Presents: Blue Harvest
6186                 It Runs in the Family
10796      

### Saving models

In [41]:
import pickle

with open("../../data/cosine_similarities.pkl", "wb") as f:
    pickle.dump(cosine_similarities, f)

### Preprocessing rating

In [53]:
ratings = ratings_small.drop(['timestamp'], axis=1)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [54]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100004 non-null  int64  
 1   movieId  100004 non-null  int64  
 2   rating   100004 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


In [55]:
ratings.isna().sum()

userId     0
movieId    0
rating     0
dtype: int64

### Collaborative filtering

In [71]:
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate

In [80]:
# MovieLens ratings run from 0.5 to 5.0 in half-star steps.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings, reader)
svd = SVD()

cross_validate(svd, data, measures=["RMSE"], cv=5)

{'test_rmse': array([0.88754807, 0.89934801, 0.89230729, 0.90120948, 0.89700389]),
 'fit_time': (0.8648698329925537,
  0.8833386898040771,
  0.9000144004821777,
  0.8998897075653076,
  0.8833327293395996),
 'test_time': (0.11600899696350098,
  0.09994840621948242,
  0.366657018661499,
  0.10737276077270508,
  0.11591625213623047)}

In [81]:
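# Fit on the full dataset before predicting; cross_validate only fits on the individual folds.
svd.fit(data.build_full_trainset())
svd.predict(1, 12)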

Prediction(uid=1, iid=12, r_ui=None, est=2.453103071108528, details={'was_impossible': False})
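
For reference, Surprise's documentation gives the SVD prediction as a baseline plus a latent-factor interaction:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
```

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the learned factor vectors. When the item is unknown to the model, $b_i$ and $q_i$ are taken as zero, so the estimate reduces to $\mu + b_u$ and is the same for every unseen item for a given user.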

### Hybrid model

In [77]:
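def hybrid_model(user_id, title, num_movies=30):
    movies_list = content_recommender(title, num_movies=num_movies)
    # Caution: this index holds row positions in relevant_movies_metadata, not the MovieLens
    # movieIds that svd was trained on (links_small relates tmdbId to movieId). Ids the model
    # has never seen get a fallback estimate, hence the many identical scores in the outputs.
    movies_id = list(movies_list.index)
    rank = []
    for movie_id in movies_id:
        score = svd.predict(user_id, movie_id).est
        rank.append((score, movies_list[movie_id]))

    # Highest predicted rating first.
    result = sorted(rank, key=lambda x: x[0], reverse=True)
    return result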
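
The predictions are only meaningful when the model receives ids from the same id space as ratings_small.csv. The metadata ids are TMDB ids, the ratings use MovieLens movieIds, and links_small.csv relates the two. A minimal sketch of that translation step follows. The names `rank_by_prediction`, `links`, and `fake_ratings` are hypothetical, the ratings are made up, and the two id pairs mirror the first rows of links_small shown earlier.

```python
from typing import Callable, Dict, List, Optional, Tuple


def rank_by_prediction(
    user_id: int,
    candidates: List[Tuple[int, str]],      # (TMDB id, title) pairs
    tmdb_to_movielens: Dict[int, int],      # built from a links table
    predict: Callable[[int, int], float],   # (user_id, MovieLens id) -> rating
) -> List[Tuple[float, str]]:
    ranked = []
    for tmdb_id, title in candidates:
        movielens_id: Optional[int] = tmdb_to_movielens.get(tmdb_id)
        if movielens_id is None:
            continue  # no ratings exist for this movie, so skip it
        ranked.append((predict(user_id, movielens_id), title))
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)


# Toy data: tmdbId -> movieId pairs and made-up ratings for user 1.
links = {862: 1, 8844: 2}
fake_ratings = {(1, 1): 4.5, (1, 2): 3.0}
result = rank_by_prediction(
    user_id=1,
    candidates=[(8844, "Jumanji"), (862, "Toy Story"), (999999, "Unrated Movie")],
    tmdb_to_movielens=links,
    predict=lambda u, m: fake_ratings[(u, m)],
)
print(result)
```

With Surprise, the `predict` argument could be something like `lambda u, m: svd.predict(u, m).est`. Whether to skip unrated movies, as done here, or to keep them with a fallback score is a design choice.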

In [84]:
user_id = 1
movie_name = "Toy Story"
hybrid_model(user_id, movie_name, num_movies=30)

[(3.2928119803942306, 'Toy Story 2'),
 (2.990187245115343, 'A Goofy Movie'),
 (2.8082546653135436, 'The Toy'),
 (2.748072884295331, 'Toy Story 3'),
 (2.748072884295331, 'Toy Story of Terror!'),
 (2.748072884295331, 'Babes in Toyland'),
 (2.748072884295331, 'The Pixar Story'),
 (2.748072884295331, 'Hawaiian Vacation'),
 (2.748072884295331, 'Small Fry'),
 (2.748072884295331, 'Toy Story That Time Forgot'),
 (2.748072884295331, 'Partysaurus Rex'),
 (2.748072884295331, 'Welcome to Happiness'),
 (2.748072884295331, 'Up'),
 (2.748072884295331, 'Luxo Jr.'),
 (2.748072884295331, 'Inside Out'),
 (2.748072884295331, 'For the Birds'),
 (2.748072884295331, 'Superstar Goofy'),
 (2.748072884295331, "The Bear That Wasn't"),
 (2.748072884295331, 'Gooby'),
 (2.748072884295331, 'The Boss Baby'),
 (2.748072884295331, 'The Nutcracker Prince'),
 (2.748072884295331, 'Tin Toy'),
 (2.748072884295331, 'Family Guy Presents: Blue Harvest'),
 (2.748072884295331, 'It Runs in the Family'),
 (2.748072884295331, 'Curi

In [85]:
user_id = 100
movie_name = "Toy Story"
hybrid_model(user_id, movie_name, num_movies=30)

[(3.855951637718828, 'Toy Story 2'),
 (3.709782739165575, 'A Goofy Movie'),
 (3.599377895095934, 'The Toy'),
 (3.4066239615230685, 'Toy Story 3'),
 (3.4066239615230685, 'Toy Story of Terror!'),
 (3.4066239615230685, 'Babes in Toyland'),
 (3.4066239615230685, 'The Pixar Story'),
 (3.4066239615230685, 'Hawaiian Vacation'),
 (3.4066239615230685, 'Small Fry'),
 (3.4066239615230685, 'Toy Story That Time Forgot'),
 (3.4066239615230685, 'Partysaurus Rex'),
 (3.4066239615230685, 'Welcome to Happiness'),
 (3.4066239615230685, 'Up'),
 (3.4066239615230685, 'Luxo Jr.'),
 (3.4066239615230685, 'Inside Out'),
 (3.4066239615230685, 'For the Birds'),
 (3.4066239615230685, 'Superstar Goofy'),
 (3.4066239615230685, "The Bear That Wasn't"),
 (3.4066239615230685, 'Gooby'),
 (3.4066239615230685, 'The Boss Baby'),
 (3.4066239615230685, 'The Nutcracker Prince'),
 (3.4066239615230685, 'Tin Toy'),
 (3.4066239615230685, 'Family Guy Presents: Blue Harvest'),
 (3.4066239615230685, 'It Runs in the Family'),
 (3.406