# Content-Based Recommender System
<p>We will try to build a system that recommends movies that are similar to a particular movie. More specifically, we will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score.</p>

In [0]:
#from google.colab import files
#files.upload()

In [29]:
import pandas as pd
from ast import literal_eval
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [3]:
movie_data = pd.read_csv('movies_metadata.csv', low_memory=False)
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

In [4]:
movie_data['overview'].head(5)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

<p>We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview. This gives us a matrix in which each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one overview).</p>
<p>
In essence, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that appear in many plot overviews and, therefore, their influence on the final similarity score.</p>
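<p>To make the down-weighting concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences and the <code>toy_</code> names are invented for illustration). It only shows that a word found in more documents receives a smaller inverse-document-frequency weight.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for plot overviews.
toy_overviews = [
    "toys come alive in a bedroom",
    "a board game comes alive",
    "a wedding in a small town",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# One row per document, one column per vocabulary term.
print(toy_matrix.shape)

# idf_ holds the inverse-document-frequency weight learned for each term.
vocab = toy_vectorizer.vocabulary_
idf = toy_vectorizer.idf_
print("idf('alive')   =", idf[vocab["alive"]])    # appears in 2 documents
print("idf('wedding') =", idf[vocab["wedding"]])  # appears in 1 document
```

<p>The weight for 'alive' comes out lower than the weight for 'wedding', because 'alive' occurs in two of the three documents.</p>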

<p>The overviews contain words such as 'the' and 'a'. These are stop words: they carry little meaning and add no value to our system, so we remove them.</p>

In [5]:
word_vector = TfidfVectorizer(stop_words='english')
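<p>As a small illustration of what <code>stop_words='english'</code> does, the sketch below fits two vectorizers on a made-up pair of sentences and compares their vocabularies. The sentences are assumptions chosen only for the demonstration.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, not taken from the dataset.
toy_overviews = ["the toys and the boy", "the game in the attic"]

with_stops = TfidfVectorizer().fit(toy_overviews)
without_stops = TfidfVectorizer(stop_words="english").fit(toy_overviews)

# Vocabulary with and without scikit-learn's built-in English stop list.
print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```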

<p>Some overviews are missing, so we replace them with an empty string.</p>

In [6]:
movie_data['overview'] = movie_data['overview'].fillna('')

<p>Next we fit the vectorizer on the overviews and transform them into a TF-IDF matrix.</p>

In [7]:
word_matrix = word_vector.fit_transform(movie_data['overview'])
word_matrix.shape

(45466, 75827)

<p>From the shape we can see that the vocabulary contains 75,827 distinct words across 45,466 movie overviews.<br/>From this matrix we can compute a similarity score. We will use cosine similarity, a numeric quantity that denotes how similar two movies are. The cosine similarity score is independent of vector magnitude and is relatively easy and fast to calculate. Because <code>TfidfVectorizer</code> L2-normalizes each row by default, the plain dot product computed by <code>linear_kernel</code> already equals the cosine similarity.</p>
<p>It is defined as:<br/>
    $$\text{cosine}(x, y) = \frac{x \cdot y^{\top}}{\lVert x \rVert \, \lVert y \rVert}$$
</p>
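<p>The formula can be checked by hand with NumPy. This is a minimal sketch on three made-up vectors; the helper name <code>cosine</code> is an assumption for illustration. It also shows that for unit-length vectors the dot product alone gives the same value.</p>

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])   # same direction as x, twice the length
z = np.array([0.0, 0.0, 3.0])   # orthogonal to x

print(cosine(x, y))  # 1.0 up to floating-point rounding
print(cosine(x, z))  # 0.0

# For unit-length vectors the denominator is 1, so the dot product
# alone already equals the cosine similarity.
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
print(np.dot(x_unit, y_unit))
```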

In [34]:
# Named cosine_sim so that it does not shadow the imported cosine_similarity function
cosine_sim = linear_kernel(word_matrix, word_matrix)

MemoryError: 

In [0]:
indices = pd.Series(movie_data.index, index=movie_data['title']).drop_duplicates()

In [0]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # index of the movie that matches the title
    movie_index = indices[title]

    # pairwise similarity scores of all movies with that movie
    similarity_scores = list(enumerate(cosine_sim[movie_index]))

    # Sort the movies based on the similarity scores
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    similarity_scores = similarity_scores[1:11]

    # movie indices
    movie_indices = [i[0] for i in similarity_scores]

    # Return most similar movies
    return movie_data['title'].iloc[movie_indices]
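
<p>The <code>MemoryError</code> above is not surprising: a dense 45,466 × 45,466 matrix of float64 values needs about 16.5 GB. One possible workaround, sketched below on hypothetical toy data, is to compute only the single row of similarities needed for the queried movie. The names <code>toy_titles</code>, <code>toy_overviews</code> and <code>recommend_on_demand</code> are assumptions for illustration and are not part of the notebook.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for movie_data['title'] and movie_data['overview'].
toy_titles = ["Mob Story", "Mob Story II", "Toy Tale", "Wedding Day"]
toy_overviews = [
    "a mafia family fights rival gangsters",
    "the mafia family returns to fight new gangsters",
    "toys come alive when the boy leaves",
    "a family wedding in a small town",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

def recommend_on_demand(title, top_n=2):
    """Score one movie against all others without building the full matrix."""
    movie_index = toy_titles.index(title)
    # A single 1 x N row of similarities instead of the N x N matrix.
    scores = linear_kernel(toy_matrix[movie_index], toy_matrix).ravel()
    # Sort descending and skip the query movie itself.
    ranked = [i for i in np.argsort(-scores) if i != movie_index]
    return [toy_titles[i] for i in ranked[:top_n]]

print(recommend_on_demand("Mob Story"))
```

<p>The trade-off is that every query recomputes one row, which is cheap for a sparse matrix, instead of looking it up in a precomputed table.</p>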

In [16]:
get_recommendations('The Godfather')

1178      The Godfather: Part II
1914     The Godfather: Part III
11297           Household Saints
10821                   Election
8653                Violent City
13177               I Am the Law
6711                    Mobsters
6977             Queen of Hearts
2891              American Movie
12661              The FBI Story
Name: title, dtype: object

<p>Our system has done a decent job: the two sequels come first. However, it only looks at plot descriptions, and viewers are often interested in movies that share the same cast, director, genres or keywords. To capture this we will add two more datasets: credits and keywords.</p>

In [8]:
# Load
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove the rows with bad ids so that the int conversion below succeeds
movie_data = movie_data.drop([19730, 29503, 35587])

# Convert IDs to int to merge
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
movie_data['id'] = movie_data['id'].astype('int')

# Merge keywords and credits into the main metadata dataframe
movie_data = movie_data.merge(credits, on='id')
movie_data = movie_data.merge(keywords, on='id')
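
<p>Dropping rows by hard-coded label works for this exact file. A sketch of an alternative that does not depend on row labels is shown below, using <code>pd.to_numeric</code> with <code>errors='coerce'</code> on hypothetical miniature tables (the malformed id value is invented for the example). It also illustrates that <code>merge</code> performs an inner join by default.</p>

```python
import pandas as pd

# Hypothetical miniature versions of the metadata and credits tables.
toy_movies = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20"],   # the last id is malformed
    "title": ["Toy Story", "Jumanji", "Broken Row"],
})
toy_credits = pd.DataFrame({
    "id": [862, 8844],
    "cast": ["Tom Hanks", "Robin Williams"],
})

# Non-numeric ids become NaN and can be dropped without knowing row labels.
toy_movies = (
    toy_movies
    .assign(id=pd.to_numeric(toy_movies["id"], errors="coerce"))
    .dropna(subset=["id"])
    .astype({"id": "int"})
)

# merge defaults to an inner join: only ids present in both tables survive.
merged = toy_movies.merge(toy_credits, on="id")
print(merged)
```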

In [15]:
movie_data.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


<p>From the new features we need to extract the actors, director and keywords associated with each movie. The first step is to convert them from stringified lists into a usable format.</p>

In [11]:

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movie_data[feature] = movie_data[feature].apply(literal_eval)
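
<p>To see what <code>literal_eval</code> does to a single cell, here is a minimal sketch on a hand-written sample string in the same shape as the <code>genres</code> column. The sample value is an assumption for illustration.</p>

```python
from ast import literal_eval

# A stringified list of dicts, in the same shape as the 'genres' column.
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

print(type(raw_genres).__name__)    # str
parsed = literal_eval(raw_genres)
print(type(parsed).__name__)        # list
print([g["name"] for g in parsed])  # ['Animation', 'Comedy']
```

<p>Unlike <code>eval</code>, <code>literal_eval</code> only accepts Python literals, so it will not execute arbitrary code found in the CSV.</p>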


<p>Next we extract the director's name from the crew, and the top three entries from the cast, keywords and genres.</p>

In [14]:
# Extract the director's name from the crew; return NaN if there is none
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


In [16]:
# Return the first 3 elements of the list, or the entire list if it has fewer.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
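
<p>The behaviour of the two helpers can be checked on small hand-made inputs. The sketch below restates them compactly (a slice <code>[:3]</code> is equivalent to the explicit length check) and runs them on hypothetical crew and cast lists with placeholder names.</p>

```python
import numpy as np

def get_director(crew):
    """Return the first crew member whose job is 'Director', else NaN."""
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return np.nan

def get_list(entries):
    """Return up to the first three names, or [] for missing data."""
    if isinstance(entries, list):
        return [entry["name"] for entry in entries][:3]
    return []

# Hypothetical inputs with placeholder names.
toy_crew = [
    {"job": "Producer", "name": "Person A"},
    {"job": "Director", "name": "Person B"},
]
toy_cast = [{"name": n} for n in ["Actor 1", "Actor 2", "Actor 3", "Actor 4"]]

print(get_director(toy_crew))   # Person B
print(get_list(toy_cast))       # ['Actor 1', 'Actor 2', 'Actor 3']
print(get_list(float("nan")))   # [] for missing data
```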

In [17]:
# Define new director, cast, genres and keywords features that are in a suitable form.
movie_data['director'] = movie_data['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movie_data[feature] = movie_data[feature].apply(get_list)

In [18]:
# Print the new features of the first 3 films
movie_data[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


<p>At this point we have the required features. Next we need to clean them: convert everything to lowercase and strip the spaces, so that each name becomes a single token.</p>

In [19]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
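
<p>Why strip the spaces? A minimal sketch, assuming two illustrative actor names: without the cleaning step the vectorizer would split each name into separate tokens, and two different people who share a first name would appear to have a token in common.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative names; any two people sharing a first name would do.
raw_names = ["Tom Hanks", "Tom Cruise"]
joined_names = [name.replace(" ", "").lower() for name in raw_names]

raw_vocab = sorted(CountVectorizer().fit(raw_names).vocabulary_)
joined_vocab = sorted(CountVectorizer().fit(joined_names).vocabulary_)

print(raw_vocab)     # ['cruise', 'hanks', 'tom']
print(joined_vocab)  # ['tomcruise', 'tomhanks']
```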



In [20]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movie_data[feature] = movie_data[feature].apply(clean_data)

In [21]:
movie_data[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[tomhanks, timallen, donrickles]",johnlasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[robinwilliams, jonathanhyde, kirstendunst]",joejohnston,"[boardgame, disappearance, basedonchildren'sbook]","[adventure, fantasy, family]"
2,Grumpier Old Men,"[waltermatthau, jacklemmon, ann-margret]",howarddeutch,"[fishing, bestfriend, duringcreditsstinger]","[romance, comedy]"


<p>In order to vectorize our data, we need to join the features into a single string (a "soup") that we can feed to the vectorizer.</p>

In [22]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


In [23]:
movie_data['soup'] = movie_data.apply(create_soup, axis=1)
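
<p>For a single row, the result looks like this. The sketch restates <code>create_soup</code> so that it runs on its own and applies it to a hand-built row using the cleaned Toy Story values shown earlier.</p>

```python
import pandas as pd

def create_soup(row):
    """Join keywords, cast, director and genres into one space-separated string."""
    return (" ".join(row["keywords"]) + " " + " ".join(row["cast"]) + " "
            + row["director"] + " " + " ".join(row["genres"]))

# Hand-built row mirroring the cleaned Toy Story features.
toy_row = pd.Series({
    "keywords": ["jealousy", "toy", "boy"],
    "cast": ["tomhanks", "timallen", "donrickles"],
    "director": "johnlasseter",
    "genres": ["animation", "comedy", "family"],
})

soup = create_soup(toy_row)
print(soup)
```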

In [28]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movie_data['soup'])

In [33]:
cosine_similarity2 = cosine_similarity(count_matrix, count_matrix)

MemoryError: 
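
<p>Raw counts, unlike TF-IDF rows, are not normalized, so here the dot product and the cosine similarity differ. The sketch below shows the difference on three hypothetical soups (the strings are assumptions assembled from the cleaned tokens shown earlier).</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical soups, not actual rows of movie_data['soup'].
toy_soups = [
    "toy boy tomhanks johnlasseter animation comedy",
    "toy tomhanks johnlasseter animation comedy family",
    "fishing waltermatthau howarddeutch romance comedy",
]
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_soups)

dot_products = linear_kernel(toy_counts, toy_counts)
cosines = cosine_similarity(toy_counts, toy_counts)

print(dot_products[0, 1])  # raw number of shared tokens
print(cosines[0, 1])       # the same overlap, scaled into [0, 1]
print(cosines[0, 2])       # only one shared token, so a much lower score
```

<p>This is why the count-based matrix is passed to <code>cosine_similarity</code> rather than <code>linear_kernel</code>.</p>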

In [32]:
# Reset index of main DataFrame and construct reverse mapping
movie_data = movie_data.reset_index()
indices = pd.Series(movie_data.index, index=movie_data['title'])
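
<p>The reverse mapping turns a title into a row position. One caveat worth illustrating: if a title occurs more than once, the lookup returns a Series rather than a single integer. The sketch below uses a hypothetical four-row table; the tie-break shown (keep the first occurrence) is an assumption, not the notebook's method.</p>

```python
import pandas as pd

# Hypothetical table with one duplicated title.
toy_movies = pd.DataFrame({"title": ["Toy Story", "Jumanji", "Hamlet", "Hamlet"]})
indices = pd.Series(toy_movies.index, index=toy_movies["title"])

print(indices["Jumanji"])  # a single integer position
print(indices["Hamlet"])   # a Series: the title is not unique

# One possible tie-break (an assumption): keep the first row for each title.
unique_indices = indices[~indices.index.duplicated(keep="first")]
print(unique_indices["Hamlet"])
```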

In [None]:
get_recommendations('The Dark Knight Rises', cosine_similarity2)

<p>These recommendations are based on more features than the ones we used before and are therefore more likely to offer better results.</p>