In this notebook, I build two content-based movie recommendation engines. The first uses each movie's description and tagline; the second uses more detailed metadata: genres, director, actors and keywords. (Use the links below to jump to each engine.)

* [1. Movie Description and Tagline Based Recommender](#part_2)

* [2. Genres, Director, Actors and Keywords Based Recommender](#part_3)

(All of the datasets can be found here: https://grouplens.org/datasets/movielens/)

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import datetime as dt
import warnings  
warnings.filterwarnings('ignore')
import ast

In [2]:
df = pd.read_csv('./movies_metadata.csv')
df.head().transpose()

Unnamed: 0,0,1,2,3,4
adult,False,False,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",,"{'id': 96871, 'name': 'Father of the Bride Col..."
budget,30000000,65000000,0,16000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 35, 'name': 'Comedy'}]"
homepage,http://toystory.disney.com/toy-story,,,,
id,862,8844,15602,31357,11862
imdb_id,tt0114709,tt0113497,tt0113228,tt0114885,tt0113041
original_language,en,en,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...


<a id='part_2'></a>
## 1. Movie Description and Tagline Based Recommender

Here we build a recommender that ranks movies by the cosine similarity of their descriptions and taglines.

We can't use the full movie dataset here: computing the pairwise similarity for that many movies raises a MemoryError on my computer (16 GB of RAM). Instead, we use links_small.csv to select a smaller subset of movies to study.
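A rough back-of-the-envelope estimate suggests why the full dataset does not fit. This is only a sketch: it assumes the similarity matrix is stored densely as 64-bit floats, and it uses 45,000 movies as an assumed, approximate size for the full metadata file. The function name `dense_matrix_gib` is made up for this illustration.

```python
# Rough memory estimate for a dense n x n similarity matrix.
# Assumption: every entry is a float64, i.e. 8 bytes.

def dense_matrix_gib(n_rows, bytes_per_entry=8):
    """Return the size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_entry / 2**30

small = dense_matrix_gib(9099)    # size of the subset selected via links_small.csv
full = dense_matrix_gib(45000)    # assumed approximate size of the full dataset

print(f"small subset: {small:.2f} GiB")
print(f"full dataset: {full:.2f} GiB")
```

Under these assumptions the subset needs well under 1 GiB, while the full matrix alone would need roughly as much memory as the whole machine has.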

In [3]:
links_small = pd.read_csv('./links_small.csv')
print(links_small.columns)
# we only need tmdbId, since it is the id used in movies_metadata
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int') 
links_small.head()

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')


0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int32

In [4]:
print(type(df['id'][0]))
print(df['id'][19730])
# As we can see, some ids are invalid (this one is a date), so we need to drop those rows
# Also, id is stored as a string by default; to combine df and links_small, both ids must have the same data type

<class 'str'>
1997-08-20


In [5]:
df = df.drop([19730, 29503, 35587]) # the ids in these three rows are not integers
df['id'] = df['id'].astype('int') # cast id to the same data type as in links_small, so we can join/combine them

df_small = df[df['id'].isin(links_small)]
df_small.head()
df_small.shape

(9099, 24)

In [6]:
# combine "overview" and "tagline" into a single "description" for each movie
df_small['tagline'] = df_small['tagline'].fillna('')
df_small['description'] = df_small['overview'] + df_small['tagline']
df_small['description'] = df_small['description'].fillna('')

In [7]:
# reference: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 keeps every term; newer scikit-learn releases validate min_df and reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_small['description'])

In [8]:
df_small['description'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: description, dtype: object

Next, we use the description feature above to calculate the similarity between each pair of movies. The links below include a good tutorial on how to do this for text with a TF-IDF vectorizer.

Tutorials on calculating cosine similarity with scikit-learn:
* http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
* http://scikit-learn.org/stable/modules/metrics.html
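As a small illustration of the idea in these tutorials, the sketch below builds TF-IDF vectors for three made-up descriptions (they are not taken from the dataset) and compares two ways of scoring them. Because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product from `linear_kernel` should match `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions, for illustration only.
docs = [
    "a hero fights crime in a dark city",
    "a dark hero returns to the city",
    "two friends go fishing at the lake",
]

# Each row of the TF-IDF matrix has unit length (default norm='l2').
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)       # plain dot products
cos = cosine_similarity(tfidf, tfidf)   # dot products divided by the row norms

same = np.allclose(dot, cos)
print(same)
print(np.round(cos, 2))
```

The first two descriptions share several words and get a positive score, while the third shares none with the first and scores zero.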

In [9]:
tfidf_matrix.shape

(9099, 268124)

In [11]:
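# This raises a MemoryError if we use the full dataset
# TfidfVectorizer L2-normalizes each row by default, so the dot product (linear_kernel) equals the cosine similarity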
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim.shape)
print(cosine_sim[0])

(9099, 9099)
[ 1.          0.00680476  0.         ...,  0.          0.00344913  0.        ]


Now that we have the similarity matrix, we can use it to recommend movies similar to a target movie.
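The lookup itself is simple: take the target movie's row of the matrix, sort the scores, and keep the best ones. The sketch below shows one possible variant on a made-up row of scores; `top_similar` is a hypothetical helper, not part of this notebook. It skips the target by index rather than by position, which also works when another movie ties with the target at the top.

```python
def top_similar(sim_row, query_idx, k=10):
    """Return (index, score) pairs for the k highest scores, skipping query_idx."""
    scored = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical similarity row for movie 2 (its score with itself is 1.0).
row = [0.10, 0.45, 1.00, 0.00, 0.30]
result = top_similar(row, query_idx=2, k=3)
print(result)  # [(1, 0.45), (4, 0.3), (0, 0.1)]
```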

In [13]:
df_small = df_small.reset_index()
idx = df_small[df_small['title'] == 'The Dark Knight'].index.values[0] # assumes only one movie has this title
cosine_sim[idx]

array([ 0.        ,  0.00777413,  0.        , ...,  0.        ,
        0.        ,  0.        ])

In [25]:
def get_recommendations(title):
    # row position of the first movie with this title (same position as in cosine_sim)
    idx = df_small[df_small['title'] == title].index.values[0]
    # pair every movie index with its similarity score to the target movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort the (index, score) tuples by score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip position 0, which is expected to be the movie itself, and keep the next 10
    top_sim_scores = sim_scores[1:11]
    print('The similarity scores of top similar movies are: ', top_sim_scores)
    # positions of the top similar movies in df_small (same as in the array)
    top_movie_indices = [x[0] for x in top_sim_scores]
    return df_small['title'].iloc[top_movie_indices]

In [26]:
# Let's get the top 10 movies similar to 'The Dark Knight'
get_recommendations('The Dark Knight').head(10)

The similarity scores of top similar movies are:  [(7931, 0.171373648623319), (132, 0.1224441524010626), (1113, 0.10088975721500877), (8227, 0.084762418538368439), (7565, 0.084196920814179024), (524, 0.081623429210359449), (7901, 0.077806921264712711), (2579, 0.069628630875434111), (2696, 0.061758594450787727), (8165, 0.060949423343999609)]


7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

<a id='part_3'></a>
## 2. Genres, Director, Actors and Keywords Based Recommender

First, let's combine the metadata, credits and keywords files.

In [3]:
# import libraries again, so we can start from here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import datetime as dt
import warnings  
warnings.filterwarnings('ignore')
import ast

In [4]:
# load df again, so that we don't mix it up with the previous recommender's work
df = pd.read_csv('./movies_metadata.csv')

In [5]:
df = df.drop([19730, 29503, 35587]) # the ids in these three rows are not integers
df['id'] = df['id'].astype('int') # cast id to the same data type as in links_small, so we can join/combine them

In [6]:
credits = pd.read_csv('./credits.csv')
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [7]:
credits['id'] = credits['id'].astype('int')
df = df.merge(credits, on='id')
df.shape

(45538, 26)

In [8]:
keywords = pd.read_csv('./keywords.csv')
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [9]:
keywords['id'] = keywords['id'].astype('int')
df = df.merge(keywords, on='id')
df.shape

(46628, 27)

Again, we focus on the small dataset here to avoid memory errors.

In [10]:
links_small = pd.read_csv('./links_small.csv')
print(links_small.columns)
# we only need tmdbId, since it is the id used in movies_metadata
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int') 
links_small.head()

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')


0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int32

In [11]:
df_small = df[df['id'].isin(links_small)]
#df_small.head()
df_small.shape

(9219, 27)

In [12]:
df = df_small.copy()
df = df.reset_index() # this is important: it resets the index from the 0-4xxxx range to 0-9xxx

### Convert genres data types
The cast, crew and keywords columns are stored in the same format as genres (a string holding a list of dictionaries), so we need to convert all of them.

In [13]:
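def get_values(data_str):
    # missing values are read as float NaN; treat them as "no values"
    if isinstance(data_str, float):
        return []
    # parse the string into a Python object (expected: a list of dicts)
    data = ast.literal_eval(data_str)
    if isinstance(data, list):
        return [k_v['name'] for k_v in data]
    # anything else: return an empty list so later list operations still work
    return []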

In [14]:
df['genres'] = df['genres'].map(get_values) # extract the genre names
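To make the conversion concrete, here is a minimal sketch on a single value. The string below is a shortened, hypothetical cell in the same shape as the `genres` column; `ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this kind of data.

```python
import ast

# A shortened, hypothetical cell value shaped like the 'genres' column.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = ast.literal_eval(raw)               # string -> list of dicts
names = [item['name'] for item in parsed]    # keep only the 'name' field
print(names)  # ['Animation', 'Comedy']
```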

### Convert crew data types and extract director info

When converting the crew column, we also need to extract some information from it, such as the director. Many people take a movie's director into account when deciding whether to watch it.

In [15]:
df['crew'] = df['crew'].apply(ast.literal_eval)

In [16]:
df['crew'].head()
# Note: we keep this list-of-dicts form for now, since we still need to extract the director from it

0    [{'name': 'John Lasseter', 'credit_id': '52fe4...
1    [{'name': 'Larry J. Franco', 'credit_id': '52f...
2    [{'name': 'Howard Deutch', 'credit_id': '52fe4...
3    [{'name': 'Forest Whitaker', 'credit_id': '52f...
4    [{'name': 'Alan Silvestri', 'credit_id': '52fe...
Name: crew, dtype: object

In [17]:
def get_director(data):
    for x in data:
        #print(x)
        if x['job'] == 'Director':
            return x['name']
    return np.nan

In [18]:
df['director'] = df['crew'].apply(get_director)

In [19]:
df['director'].head()

0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: director, dtype: object

In [20]:
# convert to lower case and remove spaces
# note: astype('str') turns a missing director (NaN) into the string 'nan'
df['director'] = df['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
df['director'].head()

0      johnlasseter
1       joejohnston
2      howarddeutch
3    forestwhitaker
4      charlesshyer
Name: director, dtype: object

The next step is interesting: we want the director to count for more than any single actor, so we repeat the director's name three times, which triples its weight in the word counts.


In [21]:
df['director'] = df['director'].apply(lambda x: [x,x,x])
df['director'].head()

0          [johnlasseter, johnlasseter, johnlasseter]
1             [joejohnston, joejohnston, joejohnston]
2          [howarddeutch, howarddeutch, howarddeutch]
3    [forestwhitaker, forestwhitaker, forestwhitaker]
4          [charlesshyer, charlesshyer, charlesshyer]
Name: director, dtype: object
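The effect of repeating the director can be checked on a toy case. The sketch below uses made-up token names and a small hand-written `cosine` helper over raw counts (both are assumptions for illustration, not the notebook's code). Two movies share a director but no actors; repeating the director token three times raises their similarity.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical movies that share a director but no actors.
cast_a = ["actorone", "actortwo", "actorthree"]
cast_b = ["actorfour", "actorfive", "actorsix"]

once = cosine(cast_a + ["directorx"], cast_b + ["directorx"])
thrice = cosine(cast_a + ["directorx"] * 3, cast_b + ["directorx"] * 3)

print(round(once, 3), round(thrice, 3))  # 0.25 0.75
```

With three actors per movie, a single shared director token gives 1/4, while the tripled token gives 9/12.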

### Convert cast data types and limit the number of actors
Again, we do the same thing as for crew. We also keep only the first few (main) actors.

In [22]:
df['cast'] = df['cast'].apply(ast.literal_eval)

In [23]:
df['cast'].head()
# now the data has the same list-of-dicts form as crew, so we can extract the names with get_values_after_literal_eval()

0    [{'order': 0, 'name': 'Tom Hanks', 'credit_id'...
1    [{'order': 0, 'name': 'Robin Williams', 'credi...
2    [{'order': 0, 'name': 'Walter Matthau', 'credi...
3    [{'order': 0, 'name': 'Whitney Houston', 'cred...
4    [{'order': 0, 'name': 'Steve Martin', 'credit_...
Name: cast, dtype: object

In [24]:
def get_values_after_literal_eval(data):
    # same as get_values(), but for data already parsed by ast.literal_eval
    if isinstance(data, list):
        return [k_v['name'] for k_v in data]
    # missing or unexpected values: return an empty list so later list operations still work
    return []

In [25]:
df['cast'] = df['cast'].map(get_values_after_literal_eval) # extract the names, as we did for genres

In [26]:
df['cast'].head()

0    [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...
1    [Robin Williams, Jonathan Hyde, Kirsten Dunst,...
2    [Walter Matthau, Jack Lemmon, Ann-Margret, Sop...
3    [Whitney Houston, Angela Bassett, Loretta Devi...
4    [Steve Martin, Diane Keaton, Martin Short, Kim...
Name: cast, dtype: object

Since there are too many actors, we keep only the main ones. If we added the others, the system would give them the same weight as the leads, and the cosine similarity would be less meaningful.

In [27]:
df['cast'] = df['cast'].apply(lambda x: x[:3]) # slicing also handles casts with fewer than 3 actors
df['cast'].head()

0                  [Tom Hanks, Tim Allen, Don Rickles]
1       [Robin Williams, Jonathan Hyde, Kirsten Dunst]
2           [Walter Matthau, Jack Lemmon, Ann-Margret]
3    [Whitney Houston, Angela Bassett, Loretta Devine]
4           [Steve Martin, Diane Keaton, Martin Short]
Name: cast, dtype: object

In [28]:
# also convert to lower case for comparison, and remove the space between first and last name
df['cast'] = df['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
df['cast'].head()

0                  [tomhanks, timallen, donrickles]
1       [robinwilliams, jonathanhyde, kirstendunst]
2          [waltermatthau, jacklemmon, ann-margret]
3    [whitneyhouston, angelabassett, lorettadevine]
4           [stevemartin, dianekeaton, martinshort]
Name: cast, dtype: object
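Why remove the space inside each name? A word-level vectorizer splits on whitespace, so two different actors who share a first name would look partly alike. The sketch below illustrates this with a hypothetical `shared_tokens` helper and two example names chosen only for illustration.

```python
def shared_tokens(names_a, names_b):
    """Tokens common to two casts after whitespace tokenization."""
    tokens_a = set(" ".join(names_a).lower().split())
    tokens_b = set(" ".join(names_b).lower().split())
    return tokens_a & tokens_b

# Two different actors who share a first name.
with_spaces = shared_tokens(["Tom Hanks"], ["Tom Cruise"])
joined = shared_tokens(["tomhanks"], ["tomcruise"])

print(with_spaces)  # {'tom'}
print(joined)       # set()
```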

### Convert the keyword data types

In [29]:
df['keywords'].head()

0    [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
1    [{'id': 10090, 'name': 'board game'}, {'id': 1...
2    [{'id': 1495, 'name': 'fishing'}, {'id': 12392...
3    [{'id': 818, 'name': 'based on novel'}, {'id':...
4    [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...
Name: keywords, dtype: object

In [30]:
df['keywords'] = df['keywords'].map(lambda x: get_values(x)) # we will use same methods as genres

In [31]:
df['keywords'].head()
#df['keywords'][0]

0    [jealousy, toy, boy, friendship, friends, riva...
1    [board game, disappearance, based on children'...
2    [fishing, best friend, duringcreditsstinger, o...
3    [based on novel, interracial relationship, sin...
4    [baby, midlife crisis, confidence, aging, daug...
Name: keywords, dtype: object

In [32]:
# learned from Udemy, machine learning A-Z
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # find the root of a word: loved, loves, loving... -> love

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lighterkey\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
# purpose here is to convert [board game, dis...] to [board, game, ....] 
# iterate rows and change value, see NYC-Taxi and 
# https://stackoverflow.com/questions/23330654/update-a-dataframe-in-pandas-while-iterating-row-by-row
stop_words = set(stopwords.words('english')) # build the stop-word set once, not once per word
ps = PorterStemmer()
for index, row in df.iterrows():
    final_keywords = []
    for keyword in row['keywords']: # 'keyword' avoids shadowing the keywords DataFrame loaded earlier
        final_keywords += keyword.lower().split(' ')
    # drop stop words and stem the remaining words
    final_keywords = [ps.stem(word) for word in final_keywords if word not in stop_words]
    final_keywords = ' '.join(final_keywords).strip() # "wow", "love", "place" -> "wow love place"
    # store a one-element list so it can be concatenated with the other lists later
    df.at[index, 'keywords'] = [final_keywords]
df['keywords'].head()

0    [jealousi toy boy friendship friend rivalri bo...
1    [board game disappear base children' book new ...
2        [fish best friend duringcreditssting old men]
3    [base novel interraci relationship singl mothe...
4    [babi midlif crisi confid age daughter mother ...
Name: keywords, dtype: object

### Combine cast, director, genres and keywords

https://www.kaggle.com/rounakbanik/movie-recommender-systems

In [36]:
df['cast'].head(2)

0               [tomhanks, timallen, donrickles]
1    [robinwilliams, jonathanhyde, kirstendunst]
Name: cast, dtype: object

In [37]:
df['director'].head(2)

0    [johnlasseter, johnlasseter, johnlasseter]
1       [joejohnston, joejohnston, joejohnston]
Name: director, dtype: object

In [38]:
df['genres'].head(2)

0     [Animation, Comedy, Family]
1    [Adventure, Fantasy, Family]
Name: genres, dtype: object

In [39]:
df['keywords'].head(2)

0    [jealousi toy boy friendship friend rivalri bo...
1    [board game disappear base children' book new ...
Name: keywords, dtype: object

In [41]:
# define 'cdgk_bag' (cast, director, genres, keywords) as the concatenation of the four lists
df['cdgk_bag'] = df['keywords'] + df['cast'] + df['director'] + df['genres']
df['cdgk_bag'].head(2)

0    [jealousi toy boy friendship friend rivalri bo...
1    [board game disappear base children' book new ...
Name: cdgk_bag, dtype: object

In [43]:
df['cdgk_bag'] = df['cdgk_bag'].apply(lambda x: ' '.join(x)) # join the list of words into a single string
df['cdgk_bag'].head(2)

0    jealousi toy boy friendship friend rivalri boy...
1    board game disappear base children' book new h...
Name: cdgk_bag, dtype: object

In [44]:
# This raises a MemoryError if we use the full dataset
from sklearn.feature_extraction.text import CountVectorizer
# min_df=1 keeps every term; newer scikit-learn releases validate min_df and reject the integer 0
cv = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = cv.fit_transform(df['cdgk_bag'])
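To see what `ngram_range=(1, 2)` produces, the sketch below fits a `CountVectorizer` on one made-up bag (the tokens and the name `cv_demo` are hypothetical) and lists the learned features. Each word becomes a feature, and so does each pair of adjacent words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical bag for one movie: two actor tokens followed by one genre.
bags = ["tomhanks timallen animation"]

cv_demo = CountVectorizer(analyzer="word", ngram_range=(1, 2))
cv_demo.fit(bags)

features = sorted(cv_demo.vocabulary_)  # vocabulary_ maps each feature to a column index
print(features)
```

Note that bigrams also span field boundaries (here an actor followed by a genre), so the order in which the lists are concatenated affects which pairs appear.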

In [46]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [47]:
print(cosine_sim.shape)
print(cosine_sim[0])

(9219, 9219)
[ 1.          0.03798001  0.04280585 ...,  0.          0.          0.        ]


### Final Recommendation
Now that we have the similarity matrix, we can use it to recommend movies similar to a target movie.

In [48]:
df = df.reset_index(drop=True) # the index must run from 0 to 9xxx so that row positions match the rows of cosine_sim

# Below we just check that we can get the correct row cosine_sim[j] by index j
idx = df[df['title'] == 'The Dark Knight'].index.values[0] # assumes only one movie has this title
cosine_sim[idx]

array([ 0.        ,  0.01481035,  0.        , ...,  0.0188545 ,
        0.02962069,  0.        ])

In [51]:
# same method as before
def get_recommendations(title):
    # row position of the first movie with this title (same position as in cosine_sim)
    idx = df[df['title'] == title].index.values[0]
    # pair every movie index with its similarity score to the target movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort the (index, score) tuples by score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip position 0, which is expected to be the movie itself, and keep the next 10
    top_sim_scores = sim_scores[1:11]
    print('The similarity scores of top similar movies are: ', top_sim_scores)
    # positions of the top similar movies in df (same as in the array)
    top_movie_indices = [x[0] for x in top_sim_scores]
    return df['title'].iloc[top_movie_indices]

In [52]:
# Let's get the top 10 movies similar to 'The Dark Knight'
get_recommendations('The Dark Knight').head(10)

The similarity scores of top similar movies are:  [(6218, 0.53783063827745781), (8031, 0.5264324633366726), (7659, 0.30897692791791959), (1134, 0.29480217946853654), (1260, 0.29421966330711236), (6623, 0.27893704369178529), (2085, 0.26108901389150202), (524, 0.26012757305655326), (2131, 0.24228545929153814), (4145, 0.24170073228323968)]


6218                 Batman Begins
8031         The Dark Knight Rises
7659    Batman: Under the Red Hood
1134                Batman Returns
1260                Batman & Robin
6623                  The Prestige
2085                     Following
524                         Batman
2131                      Superman
4145                      Insomnia
Name: title, dtype: object

References:
* http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
* https://medium.com/@tomar.ankur287/content-based-recommender-system-in-python-2e8e94b16b9e
* https://www.kaggle.com/rounakbanik/movie-recommender-systems
* https://www.kaggle.com/sohier/film-recommendation-engine-converted-to-use-tmdb