# Content-Based Recommender System
<p>We will try to build a system that recommends movies that are similar to a particular movie. More specifically, we will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score.</p>

In [0]:
#from google.colab import files
#files.upload()

In [0]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [5]:
movie_data = pd.read_csv('movies_metadata.csv', low_memory=False)
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13959 entries, 0 to 13958
Data columns (total 24 columns):
adult                    13959 non-null object
belongs_to_collection    1876 non-null object
budget                   13958 non-null float64
genres                   13958 non-null object
homepage                 1517 non-null object
id                       13958 non-null float64
imdb_id                  13955 non-null object
original_language        13958 non-null object
original_title           13958 non-null object
overview                 13912 non-null object
popularity               13958 non-null float64
poster_path              13917 non-null object
production_companies     13958 non-null object
production_countries     13958 non-null object
release_date             13952 non-null object
revenue                  13958 non-null float64
runtime                  13948 non-null float64
spoken_languages         13958 non-null object
status                   13948 non-null ob

In [6]:
movie_data['overview'].head(5)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

<p>We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give us a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie.</p>
<p>
In essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.</p>
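<p>As a minimal sketch of this down-weighting, the toy corpus below (made up for illustration, not taken from movies_metadata.csv) is vectorized with scikit-learn's default settings. The names toy_overviews, toy_vectorizer and idf_by_word are assumptions introduced here. Words that appear in more documents should end up with a smaller inverse document frequency.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
toy_overviews = [
    "a detective hunts a killer",
    "a detective falls in love",
    "a robot falls in love",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# Rows are documents, columns are vocabulary words.
# The default tokenizer ignores one-letter tokens such as "a".
print(toy_matrix.shape)

# Inverse document frequency learned for each word:
# the more documents contain a word, the smaller its value
idf_by_word = {
    word: toy_vectorizer.idf_[column]
    for word, column in toy_vectorizer.vocabulary_.items()
}
for word, idf in sorted(idf_by_word.items(), key=lambda item: item[1]):
    print(f"{word:10s} {idf:.3f}")
```

<p>Here "detective" occurs in two overviews and "killer" in only one, so "killer" receives the larger weight.</p>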

<p>The overviews contain words such as 'the' and 'a'. These are stop words that add no value to our system, so we need to remove them.</p>

In [0]:
word_vector = TfidfVectorizer(stop_words='english')
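<p>A small sketch of what the stop_words='english' option changes, using a made-up sentence. The names sample_overview, with_stop_words and without_stop_words are assumptions for illustration.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentence, not taken from the dataset
sample_overview = ["The toys and the boy live in the house"]

# Fit once with the default settings and once with English stop words removed
with_stop_words = TfidfVectorizer().fit(sample_overview)
without_stop_words = TfidfVectorizer(stop_words='english').fit(sample_overview)

# vocabulary_ maps each kept word to its column index
print(sorted(with_stop_words.vocabulary_))
print(sorted(without_stop_words.vocabulary_))
```

<p>The second vocabulary should no longer contain 'the', while content words such as 'toys' remain.</p>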

<p>Some overviews are missing, so we need to replace them with an empty string.</p>

In [0]:
movie_data['overview'] = movie_data['overview'].fillna('')

<p>Next, we fit the vectorizer on the overviews and transform them into a TF-IDF matrix.</p>

In [9]:
word_matrix = word_vector.fit_transform(movie_data['overview'])
word_matrix.shape

(13959, 38602)

<p>From the shape we can see that 38,602 distinct words are used to describe 13,959 movies.<br/>From this matrix we can compute a similarity score. We will use cosine similarity to obtain a numeric quantity that denotes the similarity between two movies. The cosine similarity score is independent of magnitude and is relatively easy and fast to calculate. Because TfidfVectorizer scales each row to unit length by default, the dot product computed by linear_kernel is already the cosine similarity.</p>
<p>It is given by:<br/>
    $$\text{cosine}(x, y) = \frac{x \cdot y^\top}{\lVert x \rVert \, \lVert y \rVert}$$
</p>
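<p>To make the formula concrete, here is a minimal NumPy sketch on made-up vectors. The helper name cosine and the vectors x, y and z are illustrative assumptions, not part of the notebook. It shows that scaling a vector does not change the score, and that for unit-length vectors the dot product alone gives the same number.</p>

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two 1-D vectors: dot product over the product of norms."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])  # same direction as x, twice the length
z = np.array([0.0, 0.0, 3.0])  # orthogonal to x

print(cosine(x, y))  # parallel vectors: the length difference does not matter
print(cosine(x, z))  # orthogonal vectors share nothing

# After scaling both vectors to unit length, the plain dot product matches cosine(x, y)
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
print(np.dot(x_unit, y_unit))
```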

In [0]:
cosine_similarity = linear_kernel(word_matrix, word_matrix)
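<p>The full matrix above is dense: with 13,959 movies it holds 13,959 × 13,959 float64 values, roughly 1.5 GB. If memory is tight, one possible alternative is to compute only the row for the movie being queried. The sketch below uses made-up overviews, and similarity_row is a hypothetical helper, not something the notebook defines.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up overviews standing in for movie_data['overview']
toy_overviews = [
    "mafia family crime boss",
    "mafia crime family son takes over",
    "toys come alive when the boy leaves",
    "a boy and his toys go on an adventure",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)

def similarity_row(matrix, row_index):
    """Similarity of one document against every document (hypothetical helper)."""
    # matrix[row_index] is a 1 x n_words sparse row; the result is flattened to 1-D
    return linear_kernel(matrix[row_index], matrix).ravel()

row = similarity_row(toy_matrix, 0)

# Sanity check against the corresponding row of the full matrix
full = linear_kernel(toy_matrix, toy_matrix)
print(np.allclose(row, full[0]))
```

<p>The trade-off is one kernel call per query instead of a single up-front computation.</p>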

In [0]:
# Map each title to its row index, keeping the first row when a title repeats
indices = pd.Series(movie_data.index, index=movie_data['title']).loc[lambda s: ~s.index.duplicated()]

In [0]:
def get_recommendations(title, cosine_similarity=cosine_similarity):
    # Index of the movie that matches the title
    movie_index = indices[title]

    # Pairwise similarity scores of all movies with that movie
    similarity_scores = list(enumerate(cosine_similarity[movie_index]))

    # Sort the movies by similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next 10
    similarity_scores = similarity_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in similarity_scores]

    # Return the titles of the 10 most similar movies
    return movie_data['title'].iloc[movie_indices]

In [16]:
get_recommendations('The Godfather')

1178      The Godfather: Part II
1914     The Godfather: Part III
11297           Household Saints
10821                   Election
8653                Violent City
13177               I Am the Law
6711                    Mobsters
6977             Queen of Hearts
2891              American Movie
12661              The FBI Story
Name: title, dtype: object