# **Movies Recommender Systems**

![](https://i.kinja-img.com/gawker-media/image/upload/s--e3_2HgIC--/c_scale,f_auto,fl_progressive,q_80,w_800/1259003599478673704.jpg)

In this notebook we'll build a baseline movie recommendation system using the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata). We cover two approaches:

> *  **Collaborative Filtering** - This approach matches people with similar interests and gives recommendations based on that matching.

> *  **Content Based Filtering** - This approach suggests items similar to a particular item. It uses item metadata (for movies: genre, director, description, actors, and so on) to make its recommendations. The general idea is that if a person liked a particular item, they will probably also like an item that is similar to it.

Let's load the data now.

In [1]:
import numpy as np
np.random.seed(42)
import pandas as pd
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
# The benchmark cell below instantiates every one of these algorithms, so import them all
from surprise import (Reader, Dataset, accuracy, SVD, SVDpp, SlopeOne, NMF, NormalPredictor,
                      KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering)
from surprise.model_selection import train_test_split, cross_validate
from ast import literal_eval
from warnings import filterwarnings
filterwarnings('ignore')

# Collaborative Filtering

In [2]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [3]:
ratings.shape

(100004, 4)

In [4]:
ratings.rating.unique()

array([2.5, 3. , 2. , 4. , 3.5, 1. , 5. , 4.5, 1.5, 0.5])

In [5]:
reader = Reader(rating_scale=(min(ratings.rating.unique()), max(ratings.rating.unique())))

Note that ratings in this dataset run from 0.5 to 5 in half-point steps, so we pass the observed minimum and maximum to `Reader` as the rating scale.

In [6]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [146]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Algorithm,test_rmse,fit_time,test_time
SVDpp,0.892241,36.929111,10.957998
BaselineOnly,0.89724,0.285474,0.457537
SVD,0.902373,0.854992,0.270315
KNNBaseline,0.907518,0.420924,2.722181
KNNWithZScore,0.927317,0.244732,2.512498
KNNWithMeans,0.929421,0.192646,2.361296
SlopeOne,0.940538,2.214963,7.182505
NMF,0.961535,1.958267,0.224148
CoClustering,0.972899,2.562874,0.16141
KNNBasic,0.978931,0.1653,2.459502


In [123]:
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'test_rmse': array([0.89798978, 0.88597238, 0.89043227]),
 'fit_time': (0.14056181907653809, 0.14896345138549805, 0.14439773559570312),
 'test_time': (0.2081165313720703, 0.15616345405578613, 0.1260836124420166)}

In [7]:
#Splitting the dataset
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

# Baseline options: estimate the biases with alternating least squares (ALS), using 5 epochs
# and separate regularization strengths for users (reg_u) and items (reg_i)
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
algo.fit(trainset)

Estimating biases using als...


<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x2756a0de9e0>

In [8]:
# run the trained model against the testset
test_pred = algo.test(testset)

# get RMSE
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.8835


0.8835003710582577

We get a Root Mean Square Error of approximately 0.88 on the held-out test set, which is good enough for a baseline. The model has already been fitted on the training split, so we can now use it to make predictions.
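To make the metric concrete, here is a minimal sketch of how an RMSE is computed from pairs of true and estimated ratings. The helper name `rmse` and the four toy ratings are illustrative assumptions, not values from `ratings_small.csv`, and this is not Surprise's own implementation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, made up for illustration
true_ratings = [4.0, 3.0, 5.0, 2.0]
estimated = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(true_ratings, estimated)
print(toy_rmse)  # 0.75
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones, which is worth keeping in mind when reading the worst predictions later on.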

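Let us pick the user with ID 1 and estimate the rating they would give to the movie with ID 31 (the first row of the ratings table shows that this user actually rated it 2.5).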

In [9]:
algo.predict(uid = 1, iid = 31).est #, r_ui=3)

2.6009259934657907

For the movie with ID 31, we get an estimated rating of **2.6**. One notable feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have rated the movie.
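The point that only IDs and ratings matter can be illustrated with a small sketch of a baseline estimate of the form "global mean + user bias + item bias", fitted by alternating least squares. This is a simplified illustration under the assumption that it mirrors the kind of estimate `BaselineOnly` is documented to produce; the function names `fit_baselines` and `predict_baseline` and the toy ratings are hypothetical, and this is not Surprise's code.

```python
from collections import defaultdict

def fit_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate the global mean and user/item biases by alternating updates.

    ratings: list of (user_id, item_id, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)  # user biases, start at 0
    b_i = defaultdict(float)  # item biases, start at 0
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Update user biases with item biases held fixed
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)

def predict_baseline(mu, b_u, b_i, user, item):
    # Unknown users or items contribute a zero bias
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Hypothetical ratings: nothing but IDs and scores
toy = [("a", "x", 5.0), ("a", "y", 3.0), ("b", "x", 4.0), ("b", "y", 2.0)]
mu, b_u, b_i = fit_baselines(toy)
print(predict_baseline(mu, b_u, b_i, "a", "x"))
```

Nothing in the inputs describes the movies themselves, which is why a model of this kind cannot tell a thriller from a comedy. The regularization terms shrink the biases of users and items with few ratings toward zero.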

In [182]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0

In [184]:
d = pd.DataFrame(test_pred, columns=['uid', 'iid', 'rui', 'est', 'details'])
d['Iu'] = d.uid.apply(get_Iu)
d['Ui'] = d.iid.apply(get_Ui)
d['err'] = abs(d.est - d.rui)
best_predictions = d.sort_values(by='err')[:10]
worst_predictions = d.sort_values(by='err')[-10:]

In [189]:
best_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
17834,287,98491,5.0,5.0,{'was_impossible': False},194,6,0.0
18280,242,908,5.0,5.0,{'was_impossible': False},308,61,0.0
14947,656,1203,5.0,5.0,{'was_impossible': False},91,55,0.0
13405,656,2858,5.0,5.0,{'was_impossible': False},91,169,0.0
12912,559,926,5.0,5.0,{'was_impossible': False},99,27,0.0
746,656,1208,5.0,5.0,{'was_impossible': False},91,85,0.0
20072,656,913,5.0,5.0,{'was_impossible': False},91,45,0.0
6185,242,527,5.0,5.0,{'was_impossible': False},308,186,0.0
22010,287,1198,5.0,5.0,{'was_impossible': False},194,167,0.0
11023,287,4993,5.0,5.0,{'was_impossible': False},194,155,0.0


In [186]:
worst_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
4936,174,1252,0.5,4.125798,{'was_impossible': False},17,55,3.625798
22747,430,8665,0.5,4.140969,{'was_impossible': False},219,57,3.640969
528,431,920,1.0,4.65143,{'was_impossible': False},188,44,3.65143
20171,646,955,1.0,4.738314,{'was_impossible': False},133,22,3.738314
2191,405,2858,0.5,4.273652,{'was_impossible': False},322,169,3.773652
9040,479,5618,0.5,4.324019,{'was_impossible': False},59,50,3.824019
18178,78,1089,1.0,4.83705,{'was_impossible': False},194,99,3.83705
12452,219,4993,0.5,4.338166,{'was_impossible': False},101,155,3.838166
11442,410,527,0.5,4.487503,{'was_impossible': False},20,186,3.987503
5174,546,1235,0.5,4.814421,{'was_impossible': False},49,30,4.314421


In [31]:
def predict(user):
    pred = pd.DataFrame(index= ratings['movieId'].unique())
    pred.reset_index(inplace=True)
    pred.rename(columns={'index': 'itemId'}, inplace=True)
    pred['similar_item'] = pred.apply(lambda x: round(algo.predict(uid= user, iid= x['itemId']).est, 1), axis=1)
    pred.sort_values(by='similar_item', inplace=True, ascending=False)
    return pred

In [36]:
prediction = predict(31)

In [37]:
prediction

Unnamed: 0,itemId,similar_item
157,858,4.7
99,318,4.7
740,926,4.6
24,50,4.6
160,913,4.6
...,...,...
2234,181,2.5
821,1556,2.5
4788,829,2.4
4210,546,2.4


### For Content Based Filtering we will use two datasets

The first dataset contains the following features:-

* movie_id - A unique identifier for each movie.
* cast - The name of lead and supporting actors.
* crew - The name of Director, Editor, Composer, Writer etc.

The second dataset has the following features:- 

* budget - The budget in which the movie was made.
* genre - The genre of the movie: Action, Comedy, Thriller etc.
* homepage - A link to the homepage of the movie.
* id - This is in fact the movie_id as in the first dataset.
* keywords - The keywords or tags related to the movie.
* original_language - The language in which the movie was made.
* original_title - The title of the movie before translation or adaptation.
* overview - A brief description of the movie.
* popularity - A numeric quantity specifying the movie popularity.
* production_companies - The production house of the movie.
* production_countries - The country in which it was produced.
* release_date - The date on which it was released.
* revenue - The worldwide revenue generated by the movie.
* runtime - The running time of the movie in minutes.
* status - "Released" or "Rumored".
* tagline - Movie's tagline.
* title - Title of the movie.
* vote_average - The average rating the movie received.
* vote_count - The number of votes received.

Let's join the two datasets on the 'id' column.


In [22]:
df1 = pd.read_csv('tmdb_5000_credits.csv')
df1.columns = ['id','tittle','cast','crew']

df2 = pd.read_csv('tmdb_5000_movies.csv')

In [23]:
df = df2.merge(df1, on='id')

In [24]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [25]:
def alphanum(row):
    try:
        return (re.sub(r'[^A-Za-z0-9 ]+', '', row)).lower()
    except TypeError:  # non-string input, e.g. NaN for a missing overview
        return ''

# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Returns the first 3 names in the list, or the entire list if it has fewer than 3 elements.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
        
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])# + ' ' + str(x['overview'])

# **Content Based Filtering**

We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on those scores. The plot description is given in the **overview** feature of our dataset. 
Let's take a look at the data.

For the metadata-based recommender, the next step is to convert the names and keyword instances to lowercase and strip all the spaces from them. This is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same token.

We are then in a position to create our "metadata soup", a string that contains all the metadata we want to feed to our vectorizer (namely keywords, actors, director and genres).
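The cleaning and soup-building steps described above can be sketched on a single made-up record. The helper name `squash` and the record itself (including the director "Jane Doe") are illustrative assumptions; the notebook's own `clean_data` and `create_soup` functions do the real work on the DataFrame.

```python
def squash(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.replace(" ", "").lower()

# Hypothetical record, shaped like one row after the list-extraction step
movie = {
    "keywords": ["space war"],
    "cast": ["Johnny Depp", "Johnny Galecki"],
    "director": "Jane Doe",
    "genres": ["Science Fiction"],
}

# Join every cleaned field into one whitespace-separated string
soup = " ".join(
    [squash(k) for k in movie["keywords"]]
    + [squash(c) for c in movie["cast"]]
    + [squash(movie["director"])]
    + [squash(g) for g in movie["genres"]]
)
print(soup)  # spacewar johnnydepp johnnygalecki janedoe sciencefiction

# Without squashing, the two actors share the token "johnny"
shared_raw = set("johnny depp".split()) & set("johnny galecki".split())
# After squashing, each actor is a distinct token
shared_squashed = {squash("Johnny Depp")} & {squash("Johnny Galecki")}
print(shared_raw, shared_squashed)
```

The trade-off is that a squashed name only matches exactly the same person, so any partial overlap between names is deliberately ignored.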

In [26]:
df.overview[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [27]:
# Remove non-alphanumeric characters and lowercase the text so that 'A' and 'a' are treated as the same character
df.overview = df.overview.apply(alphanum)

In [37]:
# Parse the stringified features into their corresponding python objects
df['cast'] = df['cast'].apply(literal_eval)
df['crew'] = df['crew'].apply(literal_eval)
df['keywords'] = df['keywords'].apply(literal_eval)
df['genres'] = df['genres'].apply(literal_eval)

# Define new director, cast, genres and keywords features that are in a suitable form.
df['director'] = df['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(get_list)

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df[feature] = df[feature].apply(clean_data)

# create column for combined features
df['soup'] = df.apply(create_soup, axis=1)

In [28]:
df[df.overview.isna()]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew


In [29]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
#Construct the required TF-IDF matrix by fitting and transforming the data
tfv = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfv.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 23277)
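To show what the entries of such a matrix represent, here is a small pure-Python sketch of TF-IDF weighting. It assumes whitespace tokenisation, no stop-word removal, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1 and L2 normalisation of each row, which to my understanding corresponds to scikit-learn's default settings; the function name `tfidf` and the three toy overviews are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical overviews
overviews = ["marine alien moon", "pirate captain ship", "alien ship invasion"]
vecs = tfidf(overviews)
print(vecs[0])
```

In the first toy overview, "marine" occurs in one document and "alien" in two, so "marine" receives the larger weight: rarer words are treated as more informative.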

In [30]:
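# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the plain dot product computed by linear_kernel equals the cosine similarity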
tfidf_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
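Using `linear_kernel` (a plain dot product) for a cosine similarity matrix relies on the rows being unit-length vectors. The sketch below checks that identity on two toy vectors; the helper names and the numbers are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine(a, b)
dot_unit = dot(l2_normalise(a), l2_normalise(b))
print(cos_ab, dot_unit)  # both are 4 / sqrt(30), about 0.7303
```

When the rows are already normalised, skipping the division by the norms saves work, which is the usual reason for preferring the dot product on TF-IDF matrices.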

In [31]:
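# Drop duplicate titles and reset the index of our main DataFrame.
# Note: tfidf_sim above was computed before this step (on 4803 rows), so its row positions
# can drift from df's after the reset; recompute it after deduplication to keep them aligned.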
df.drop_duplicates(subset=['title'], inplace=True)
df.reset_index(drop=True, inplace=True)

In [32]:
df[df['title'].duplicated(keep='first')]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew


We are going to define a function that takes a movie title as input and returns the five most similar movies. For this, we need a mechanism to identify the index of a movie in our metadata DataFrame given its title; here we look it up directly by filtering on the `title` column.
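The lookup described above can be sketched on a toy similarity matrix without pandas. The function name `top_k_similar`, the four placeholder titles and the similarity values are assumptions made for illustration.

```python
def top_k_similar(sim_matrix, titles, title, k=5):
    """Return the k titles most similar to `title`, excluding the title itself."""
    if title not in titles:
        return []
    idx = titles.index(title)
    # Pair each other movie's position with its similarity to the query movie
    scored = [(j, s) for j, s in enumerate(sim_matrix[idx]) if j != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[j] for j, _ in scored[:k]]

# Hypothetical titles and a symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.4, 0.2, 0.9, 1.0],
]

print(top_k_similar(sim, titles, "A", k=2))  # ['B', 'D']
```

This sketch excludes the query movie by position rather than by dropping the first sorted entry. The two approaches agree whenever the movie is its own best match, but the positional check also holds up if another row happens to tie with it.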

In [33]:
def overview_based_recommender(title, df):
    try:
        # Get the index that matches the title
        index = df[df['title'] == title].index.values[0]
        # Pair every movie's position with its similarity to the chosen movie
        sim = list(enumerate(tfidf_sim[index]))
        # Sort by similarity, skip the first entry (normally the movie itself) and keep the next five
        sorted_sim = sorted(sim, key= lambda x: x[1], reverse=True)[1:6]
        res = df[['title', 'vote_average', 'genres']].iloc[[i[0] for i in sorted_sim]]
        #res.sort_values(by='vote_average', ascending=False, inplace=True)
        return res
    except IndexError:  # raised when no row matches the title
        return 'Sorry, This Movie is not in the dataset'

In [38]:
overview_based_recommender('The Dark Knight Rises', df)

Unnamed: 0,title,vote_average,genres
428,Batman Returns,6.6,"[action, fantasy]"
65,The Dark Knight,8.2,"[drama, action, crime]"
1359,Batman,7.0,"[fantasy, action]"
3854,In the Name of the King III,3.3,"[action, adventure, drama]"
299,Batman Forever,5.2,"[action, crime, fantasy]"


In [39]:
overview_based_recommender('The Avengers', df)

Unnamed: 0,title,vote_average,genres
7,Avengers: Age of Ultron,7.3,"[action, adventure, sciencefiction]"
3144,Amour,7.5,"[drama, romance]"
1715,Timecop,5.5,"[thriller, sciencefiction, action]"
4124,The Last Five Years,5.5,"[comedy, drama, music]"
3033,Mud,7.0,[drama]


In [40]:
overview_based_recommender('The Godfather', df)

Unnamed: 0,title,vote_average,genres
3143,Plastic,6.1,"[drama, action, comedy]"
3251,New Nightmare,6.4,"[horror, thriller, mystery]"
1743,Octopussy,6.2,"[adventure, action, thriller]"
1343,Never Say Never Again,5.8,"[adventure, action, thriller]"
170,The World Is Not Enough,6.0,"[adventure, action, thriller]"


In [41]:
df.soup[0]

'cultureclash future spacewar samworthington zoesaldana sigourneyweaver jamescameron action adventure fantasy'

In [42]:
#Define a CountVectorizer object. Remove all English stop words such as 'the', 'a'
#Construct the required count matrix by fitting and transforming the data
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

#Output the shape of count_matrix
count_matrix.shape

(4800, 11508)

In [43]:
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [108]:
# Reset index of our main DataFrame
df.drop_duplicates(subset=['title'], inplace=True)
df.reset_index(drop=True, inplace=True)

In [44]:
df[df['title'].duplicated(keep='first')]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,status,tagline,title,vote_average,vote_count,tittle,cast,crew,director,soup


We now define the same kind of function for the metadata soup: it takes a movie title as input and returns the five most similar movies, this time ranked by `cosine_sim`. As before, we find the movie's index in our metadata DataFrame by filtering on the `title` column.

In [45]:
def content_based_recommender(title, df):
    try:
        # Get the index that matches the title
        index = df[df['title'] == title].index.values[0]
        # Pair every movie's position with its similarity to the chosen movie
        sim = list(enumerate(cosine_sim[index]))
        # Sort by similarity, skip the first entry (normally the movie itself) and keep the next five
        sorted_sim = sorted(sim, key= lambda x: x[1], reverse=True)[1:6]
        res = df[['title', 'vote_average', 'genres']].iloc[[i[0] for i in sorted_sim]]
        #res.sort_values(by='vote_average', ascending=False, inplace=True)
        return res
    except IndexError:  # raised when no row matches the title
        return 'Sorry, This Movie is not in the dataset'

In [46]:
content_based_recommender('The Dark Knight Rises', df)

Unnamed: 0,title,vote_average,genres
65,The Dark Knight,8.2,"[drama, action, crime]"
119,Batman Begins,7.5,"[action, crime, drama]"
4635,Amidst the Devil's Wings,0.0,"[drama, action, crime]"
1196,The Prestige,8.0,"[drama, mystery, thriller]"
3072,Romeo Is Bleeding,5.7,"[action, crime, drama]"


In [47]:
content_based_recommender('The Avengers', df)

Unnamed: 0,title,vote_average,genres
7,Avengers: Age of Ultron,7.3,"[action, adventure, sciencefiction]"
26,Captain America: Civil War,7.1,"[adventure, action, sciencefiction]"
79,Iron Man 2,6.6,"[adventure, action, sciencefiction]"
169,Captain America: The First Avenger,6.6,"[action, adventure, sciencefiction]"
174,The Incredible Hulk,6.1,"[sciencefiction, action, adventure]"


In [48]:
content_based_recommender('The Godfather', df)

Unnamed: 0,title,vote_average,genres
867,The Godfather: Part III,7.1,"[crime, drama, thriller]"
2731,The Godfather: Part II,8.3,"[drama, crime]"
4635,Amidst the Devil's Wings,0.0,"[drama, action, crime]"
2649,The Son of No One,4.8,"[drama, thriller, crime]"
1525,Apocalypse Now,8.0,"[drama, war]"


In [121]:
def hybrid_content(title, df):
    try:
        # Get the index that matches the title
        index = df[df['title'] == title].index.values[0]
        # Top five matches by metadata-soup similarity (the first entry is normally the movie itself)
        sim1 = list(enumerate(cosine_sim[index]))
        sorted_sim1 = sorted(sim1, key= lambda x: x[1], reverse=True)[1:6]
        res1 = df[['title', 'vote_average', 'genres']].iloc[[i[0] for i in sorted_sim1]]
        
        # Top five matches by overview TF-IDF similarity
        sim2 = list(enumerate(tfidf_sim[index]))
        sorted_sim2 = sorted(sim2, key= lambda x: x[1], reverse=True)[1:6]
        res2 = df[['title', 'vote_average', 'genres']].iloc[[i[0] for i in sorted_sim2]]
    except IndexError:  # raised when no row matches the title
        return 'Sorry, This Movie is not in the dataset'

    # DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row order
    res = pd.concat([res1, res2])
    res.drop_duplicates(subset=['title'], inplace=True)

    return res[['title', 'vote_average', 'genres']]

In [122]:
hybrid_content("Harry Potter and the Half-Blood Prince", df)

Unnamed: 0,title,vote_average,genres
113,Harry Potter and the Order of the Phoenix,7.4,"[adventure, fantasy, family]"
114,Harry Potter and the Goblet of Fire,7.5,"[adventure, fantasy, family]"
197,Harry Potter and the Philosopher's Stone,7.5,"[adventure, fantasy, family]"
276,Harry Potter and the Chamber of Secrets,7.4,"[adventure, fantasy, family]"
191,Harry Potter and the Prisoner of Azkaban,7.7,"[adventure, fantasy, family]"
501,The Little Prince,7.6,"[adventure, animation, fantasy]"
