## Movie Recommendation Model

### Introduction: 

A key component of our chatbot is recognizing a recommendation question and suggesting movies in response. Our recommendation model is based on movie metadata: cast, keywords, genres, and director. The chatbot parses the user's query and classifies it as a recommendation question. It then extracts the movie name, if present, and passes it to the recommendation model. The model builds a feature vector from the metadata above, ranks other movies by cosine similarity, and returns the names of up to 10 movies that pass a vote-based quality filter.

In [606]:
import pandas as pd
from ast import literal_eval
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
import warnings; warnings.simplefilter('ignore')


In [607]:
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
md = pd.read_csv(r'C:\ADS_Project_files\movies_metadata.csv', encoding='utf-8')
credits = pd.read_csv(r'C:\ADS_Project_files\credits.csv', encoding='utf-8')
keywords = pd.read_csv(r'C:\ADS_Project_files\keywords_n.csv', encoding='utf-8')

### Genres:

Genres in our dataset are stored as a stringified list of dictionaries, for example:
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]

Here we extract each 'name' value and store the names as a single Python list. Null values are replaced with '[]' before parsing.

In [608]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
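
The parsing step can be checked on a single value without loading the dataset. The following is a minimal, stdlib-only sketch; the helper name `extract_names` is hypothetical and only mirrors the logic of the cell above.

```python
from ast import literal_eval

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = literal_eval(raw) if isinstance(raw, str) else []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed]

# Sample value in the same shape as the 'genres' column
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

genre_names = extract_names(raw_genres)
empty_names = extract_names("[]")  # what a filled-in null value produces
print(genre_names)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.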

In [609]:
md['genres']

0                        [Animation, Comedy, Family]
1                       [Adventure, Fantasy, Family]
2                                  [Romance, Comedy]
3                           [Comedy, Drama, Romance]
4                                           [Comedy]
5                   [Action, Crime, Drama, Thriller]
6                                  [Comedy, Romance]
7                 [Action, Adventure, Drama, Family]
8                      [Action, Adventure, Thriller]
9                      [Adventure, Action, Thriller]
10                          [Comedy, Drama, Romance]
11                                  [Comedy, Horror]
12                    [Family, Animation, Adventure]
13                                  [History, Drama]
14                               [Action, Adventure]
15                                    [Drama, Crime]
16                                  [Drama, Romance]
17                                   [Crime, Comedy]
18                        [Crime, Comedy, Adve

Next, we extract the year from each movie's release date.

In [610]:
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
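
One detail matters when guarding against missing dates: NaN never compares equal to anything, including itself, so a `!= nan` test is always true and cannot detect missing values. The stdlib-only sketch below illustrates this and shows one way to extract a year defensively. The helper `extract_year` is hypothetical, returns `None` for bad input, and is not the pandas code used in this notebook.

```python
from datetime import datetime

def extract_year(release_date):
    """Return the year as a string, or None when the date is missing or malformed."""
    try:
        return str(datetime.strptime(release_date, "%Y-%m-%d").year)
    except (TypeError, ValueError):
        # TypeError: input is not a string (e.g. None or NaN)
        # ValueError: string does not match the expected format
        return None

nan = float("nan")
nan_differs_from_itself = nan != nan  # NaN is unequal even to itself

years = [extract_year(d) for d in ["1995-10-30", "not a date", None, nan]]
print(nan_differs_from_itself, years)
```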

In [611]:
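# Drop three rows by index label (presumably malformed records that would break the integer conversion of 'id' below)
md = md.drop([19730, 29503, 35587])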

### Merging DataFrames:

Because the data comes from multiple CSV files, we first convert the 'id' column of the three main dataframes to integers so that they can be merged into a single dataframe, as shown below.

In [612]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [613]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
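
`DataFrame.merge` performs an inner join by default, so only ids present in all three dataframes survive. The stdlib-only sketch below shows why the keys need a common type first: the string `"1"` and the integer `1` would never match. The function `inner_join` and the sample rows are hypothetical and stand in for the pandas merge.

```python
def inner_join(left, right, key="id"):
    """Join two lists of dict rows on an integer-normalized key."""
    lookup = {int(row[key]): row for row in right}
    joined = []
    for row in left:
        row_id = int(row[key])
        if row_id in lookup:
            # Merge the two rows and store the key as an int
            joined.append({**row, **lookup[row_id], key: row_id})
    return joined

# Hypothetical rows: ids arrive as strings in one file and as ints in another
metadata_rows = [{"id": "1", "title": "Movie A"}, {"id": "2", "title": "Movie B"}]
credit_rows = [{"id": 1, "crew": "crew for A"}]

merged = inner_join(metadata_rows, credit_rows)
print(merged)
```

"Movie B" is dropped because it has no matching credits row, which is the same effect the inner join has on the real data.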

We import a reduced version of the movies dataset for faster processing and, as with the dataframes above, convert its ID column ('tmdbId') to integers.

In [614]:
links_small = pd.read_csv(r'C:\ADS_Project_files\links_small.csv', encoding='utf-8')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

We keep only the rows of the merged dataframe whose 'id' appears in the reduced set and store them in a new dataframe.

In [615]:
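# .copy() makes smd an independent dataframe, so the column assignments below do not trigger SettingWithCopyWarning
smd = md[md['id'].isin(links_small)].copy()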

In [616]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [617]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

The director is a feature in our recommendation model because users who liked a particular movie are likely to enjoy that director's other work and may want to watch more of it.

Here we extract the director's name from the 'crew' column of the dataframe by applying the 'get_director' function.

In [618]:
smd['director'] = smd['crew'].apply(get_director)
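
To see what the extraction does on one record, here is a self-contained sketch with a made-up crew list. It restates `get_director` using `math.nan` in place of `np.nan` so that it runs without NumPy; the crew entries are hypothetical.

```python
import math

def get_director(crew):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return math.nan  # no director listed

# Hypothetical crew entries in the same shape as the 'crew' column
crew = [
    {"job": "Producer", "name": "Person A"},
    {"job": "Director", "name": "Person B"},
    {"job": "Editor", "name": "Person C"},
]

director = get_director(crew)
missing = get_director([{"job": "Producer", "name": "Person A"}])
print(director, missing)
```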

### Cast:

Choosing the cast is a little trickier. Lesser-known actors and minor roles do not really affect people's opinion of a movie, so we keep only the major characters and their respective actors. Somewhat arbitrarily, we choose the top 3 actors listed in the credits.

In [619]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3])  # slicing already handles lists shorter than 3

Similarly, we extract the keywords, which are stored in our dataset as a list of dictionaries.

In [620]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

We convert the names of the top 3 cast members to lower case and remove the whitespace between first and last names.

In [621]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
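
A likely reason for collapsing each name into one token is that a word-level vectorizer would otherwise treat two different people who share a first name as partially matching. The sketch below illustrates this with plain Python; the helper `normalize` is hypothetical and mirrors the lambda above.

```python
def normalize(names):
    """Lower-case each name and remove the spaces inside it."""
    return [name.replace(" ", "").lower() for name in names]

cast = ["Tom Hanks", "Tom Cruise"]

# Without normalization, word tokens overlap on the shared first name
raw_tokens = [set(name.lower().split()) for name in cast]
shared_raw = raw_tokens[0] & raw_tokens[1]

# After normalization, each person is a single distinct token
normalized = normalize(cast)
shared_normalized = {normalized[0]} & {normalized[1]}

print(shared_raw, normalized, shared_normalized)
```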

### Director:
We convert the director's name to lower case and remove the spaces. The director's name is then repeated 3 times to give it more weight relative to the cast.

In [622]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

### Keywords:
We do a small amount of pre-processing on the keywords before using them. As a first step, we calculate the frequency count of every keyword that appears in the dataset.

In [623]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

Keywords occur with frequencies ranging from 1 to 610. A keyword that occurs only once cannot link two movies, so these keywords can be safely removed.

In [624]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

In [625]:
s = s[s > 1]

In [626]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

We filter each movie's keywords with this function, then remove whitespace and convert them to lower case before storing them back in the dataframe.

In [627]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
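
The frequency filter can be sketched without pandas using `collections.Counter`. The keyword lists below are made up for illustration, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists, one per movie
movie_keywords = [
    ["murder", "heist"],
    ["murder", "based on novel"],
    ["based on novel", "space"],
]

# Count how often each keyword appears across all movies
counts = Counter(kw for kws in movie_keywords for kw in kws)

# Keep only keywords seen more than once, as in s = s[s > 1]
frequent = {kw for kw, n in counts.items() if n > 1}

# Filter each movie's list, then normalize the surviving keywords
filtered = [
    [kw.replace(" ", "").lower() for kw in kws if kw in frequent]
    for kws in movie_keywords
]
print(filtered)
```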

We now concatenate the keywords, cast, director, and genres lists into a new column, 'soup', and join each list into a single space-separated string per movie. We then strip non-ASCII characters from each string.

In [628]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [629]:
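# Python 3: str has no .decode(), so drop non-ASCII characters by encoding
# with errors ignored and decoding the bytes back to str
smd['soup'] = smd['soup'].apply(lambda x: x.encode('ascii', 'ignore').
                                          decode('ascii').
                                          strip())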
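
For a single movie, the soup construction and an ASCII cleanup can be sketched as follows. The record is hypothetical, and the cleanup shown is one Python 3 way to drop non-ASCII characters; it is an assumption about the intent of the cell above, not a statement of how the original environment behaved.

```python
# Hypothetical, already-normalized record for one movie
movie = {
    "keywords": ["heist", "dreamworld"],
    "cast": ["actorone", "actortwo", "actorthree"],
    "director": ["josédirector"] * 3,  # repeated 3 times for extra weight
    "genres": ["Action", "Thriller"],
}

# Concatenate the four lists and join them into one string
soup = " ".join(
    movie["keywords"] + movie["cast"] + movie["director"] + movie["genres"]
)

# Drop characters outside ASCII, then trim surrounding whitespace
ascii_soup = soup.encode("ascii", "ignore").decode("ascii").strip()
print(ascii_soup)
```

Note that this cleanup silently removes accented letters rather than transliterating them, so "josédirector" becomes "josdirector".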

### Count Vectorizer:

We use CountVectorizer to tokenize the soup strings, build a vocabulary of the unigrams and bigrams they contain, and encode each movie as a vector of token counts.

In [630]:
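# min_df=1 keeps every term; newer scikit-learn versions validate min_df and may reject an integer 0
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')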
count_matrix = count.fit_transform(smd['soup'])
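
To make the encoding concrete, here is a pure-Python sketch of unigram and bigram counting. It is not scikit-learn's implementation: the helper `ngram_counts` is hypothetical, it splits on whitespace only, and it ignores stop-word removal.

```python
from collections import Counter

def ngram_counts(text, n_range=(1, 2)):
    """Count word n-grams for every n in the inclusive range n_range."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical soup strings for two movies
docs = ["heist actorone directorx directorx", "heist directorx drama"]

# Shared vocabulary, then one count vector per document
vocab = sorted(set().union(*(ngram_counts(d) for d in docs)))
matrix = [[ngram_counts(d)[term] for term in vocab] for d in docs]

print(vocab)
print(matrix)
```

Each row of `matrix` plays the role of one row of `count_matrix`: a movie represented by how often each vocabulary term occurs in its soup.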

### Cosine Similarity:

Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. We compute it between the count vectors of every pair of movies and use the resulting scores to find and recommend similar movies.

In [631]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
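
For two count vectors a and b, cosine similarity is the dot product divided by the product of the vector lengths. The sketch below computes it in plain Python and also shows how repeating a shared token raises the score, using made-up vectors over a three-term vocabulary (director, first actor, second actor). The function `cosine` is a hypothetical helper, not the scikit-learn routine.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as 0
    return dot / (norm_a * norm_b)

# Two movies sharing a director but with different lead actors.
# Director counted once:
unweighted = cosine([1, 1, 0], [1, 0, 1])
# Director counted three times:
weighted = cosine([3, 1, 0], [3, 0, 1])

print(unweighted, weighted)
```

With the director counted once the similarity is about 0.5, and with the director counted three times it rises to about 0.9, which is the effect the repetition is meant to have.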

In [632]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

### Weighted Rating:

We use IMDB's weighted rating formula to construct our chart. Mathematically, it is represented as follows:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where

- v is the number of votes for the movie
- m is the minimum number of votes required to be listed in the chart
- R is the average rating of the movie
- C is the mean vote across the whole report

The next step is to determine an appropriate value for m, the minimum number of votes required to be listed in the chart. We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the chart, it must have more votes than at least 95% of the movies in the list.

In [633]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()

m = vote_counts.quantile(0.95)

In [634]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
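
A short worked example shows how the formula pulls movies with few votes toward the overall mean. The function `wr` below takes `m` and `C` as explicit arguments, unlike the notebook's `weighted_rating`, and all numbers are hypothetical.

```python
def wr(v, R, m, C):
    """Weighted rating: blend of the movie's rating R and the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 1000   # hypothetical vote threshold
C = 6.0    # hypothetical mean rating across all movies

few_votes = wr(v=1000, R=8.0, m=m, C=C)    # equal weight on R and C
many_votes = wr(v=9000, R=8.0, m=m, C=C)   # 90% weight on R
no_votes = wr(v=0, R=8.0, m=m, C=C)        # falls back to C entirely

print(few_votes, many_votes, no_votes)
```

Both movies have the same average rating of 8, but the one with 9000 votes scores about 7.8 while the one with 1000 votes scores 7.0, because the second has less evidence behind its average.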

### Recommendation:

Two movies can be very similar while one was well received and the other performed poorly. We therefore add a mechanism that filters out poorly received movies and returns ones that are popular and have had a good critical response.

We take the top 25 movies by similarity score and compute the 60th percentile of their vote counts. Movies at or above that cutoff qualify, and each qualifying movie is scored with IMDB's weighted rating formula. Note that `weighted_rating`, as defined above, reads the module-level `m` and `C`, so the 60th-percentile value acts as the qualification cutoff rather than as the `m` inside the formula.

In [635]:
def get_recommendations(title):
    # Row position of the query movie in smd
    idx = indices[title]
    # Rank all movies by similarity to the query, best first
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the query movie itself) and keep the next 25
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    # Local C and m: weighted_rating() reads the module-level m and C, not these,
    # so the local m only serves as the qualification cutoff below
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # .copy() avoids SettingWithCopyWarning on the assignments that follow
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    # Return the 10 best candidates by weighted rating
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
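
The qualification step can be illustrated in isolation. The sketch below computes a 60th-percentile vote-count cutoff with linear interpolation, which is the default method of pandas' `quantile`, and keeps the candidates at or above it. The helper `quantile` and the candidate tuples are hypothetical.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical candidates: (title, vote_count, vote_average)
candidates = [
    ("A", 50, 8),
    ("B", 500, 6),
    ("C", 5000, 7),
    ("D", 20, 9),
    ("E", 900, 7),
]

# Cutoff at the 60th percentile of vote counts among the candidates
cutoff = quantile([votes for _, votes, _ in candidates], 0.60)

# Keep only candidates with enough votes
qualified = [c for c in candidates if c[1] >= cutoff]
print(cutoff, [title for title, _, _ in qualified])
```

Movie "D" has the highest average rating but only 20 votes, so it is filtered out before any ranking takes place.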

In [638]:
get_recommendations('The Dark Knight').head(5)

| | title | vote_count | vote_average | year | wr |
|---|---|---|---|---|---|
| 7648 | Inception | 14075 | 8 | 2010 | 7.919065 |
| 8613 | Interstellar | 11187 | 8 | 2014 | 7.898936 |
| 6623 | The Prestige | 4510 | 8 | 2006 | 7.762198 |
| 3381 | Memento | 4168 | 8 | 2000 | 7.744491 |
| 8031 | The Dark Knight Rises | 9263 | 7 | 2012 | 6.922734 |
