# Content-Based Recommender
### Plot Description Based Recommender
This recommender suggests items similar to a given item. The general idea behind such systems is that a person who likes a particular item will also like items that are similar to it. The subset of the dataset used here can be downloaded from [here](https://www.kaggle.com/rounakbanik/the-movies-dataset/data).

In [1]:
import pandas as pd

In [3]:
# Load the movies metadata into a DataFrame
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
# Show the first two rows
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


We will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

In [4]:
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

We need to extract features from the text above before we can compute the similarity or dissimilarity between overviews. To do this, we compute a TF-IDF vector for each overview (document).

In [5]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

In [6]:
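# Array mapping from feature integer indices to feature names.
# get_feature_names_out() requires scikit-learn >= 1.0 (it replaces get_feature_names())
list(tfidf.get_feature_names_out()[3030:3040])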

['anamika',
 'anamorphosis',
 'anand',
 'ananda',
 'anang',
 'ananga',
 'ananka',
 'anant',
 'ananya',
 'anapolis']

From the above output, you can see that the 45,466 movie overviews in the dataset contain 75,827 distinct words. With this matrix, we can now compute similarity scores. Because `TfidfVectorizer` normalizes each document vector to unit length by default, the dot product of two vectors directly gives their cosine similarity. Therefore, we use sklearn's `linear_kernel()` instead of `cosine_similarity()`, which skips the redundant normalization step.
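
The equivalence rests on the default `norm='l2'` setting of `TfidfVectorizer`: every non-empty row has unit length, so the dot product and the cosine coincide. The following is a quick check on a hypothetical toy corpus (the `docs` list is invented for illustration), offered as a sketch rather than a proof.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, invented for illustration only
docs = [
    "a cowboy toy and a space ranger toy",
    "a magical board game traps two children",
    "a space ranger fights an evil emperor",
]

vectors = TfidfVectorizer(stop_words='english').fit_transform(docs)

# With the default norm='l2', each non-empty row has length 1
row_norms = np.linalg.norm(vectors.toarray(), axis=1)
print(row_norms)

# Plain dot products versus explicitly normalized cosine similarity
dot = linear_kernel(vectors, vectors)
cos = cosine_similarity(vectors, vectors)
print(np.allclose(dot, cos))  # expected: True
```

An overview that is empty (or contains only stop words) yields an all-zero row, so its similarity with every movie is 0 under both functions.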

In [7]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

This returns a matrix of shape 45466x45466 that holds the cosine similarity score of each movie overview with every other movie overview. Each movie corresponds to a 1x45466 row vector in which each column is the similarity score with one movie.

In [8]:
cosine_sim.shape

(45466, 45466)

In [9]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])
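
A dense 45466x45466 matrix of 64-bit floats takes roughly 45466 × 45466 × 8 bytes, or about 16.5 GB of memory. If that does not fit, one possible alternative is to compute only the row for the movie being queried. The sketch below uses a hypothetical helper named `similarity_row` and a made-up toy corpus so that it runs on its own; it is not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def similarity_row(matrix, idx):
    """Hypothetical helper: similarity of row `idx` against every row."""
    # Slicing keeps the query row 2-D (1 x n_features), as linear_kernel expects
    query = matrix[idx:idx + 1]
    # Result has shape (1, n_rows); flatten it to a 1-D score vector
    return linear_kernel(query, matrix).ravel()


# Hypothetical toy corpus, invented for illustration only
toy_docs = [
    "a cowboy toy and a space ranger toy",
    "a magical board game traps two children",
    "a space ranger fights an evil emperor",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

row = similarity_row(toy_matrix, 0)
print(row.shape)
print(row)
```

With the notebook's data, `similarity_row(tfidf_matrix, 1)` should produce the same values as `cosine_sim[1]` while holding only 45,466 scores in memory at a time.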

Next, define a function that takes a movie title as input and outputs the 10 most similar movies. For this, we need a reverse mapping from movie titles to DataFrame indices.

In [10]:
# Reverse mapping: movie title -> row index in metadata.
# Titles may repeat, in which case one title maps to several rows.
movie_indices = pd.Series(data=metadata.index, index=metadata['title'])
movie_indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
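
Note that `Series.drop_duplicates()` returns a new Series and compares values (here the row indices, which are already unique), so it does not remove repeated titles. When a title repeats, `movie_indices[title]` returns a Series instead of a single integer. One possible way to keep only the first row per title is to filter on the index, sketched here on a small made-up frame (`toy`, `mapping`, and `unique_mapping` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical miniature stand-in for the metadata frame
toy = pd.DataFrame({'title': ['Hamlet', 'Heat', 'Hamlet']})

mapping = pd.Series(data=toy.index, index=toy['title'])

# A repeated title yields a Series of row indices, not a single integer
print(type(mapping['Hamlet']))

# Keep only the first occurrence of each title in the index
unique_mapping = mapping[~mapping.index.duplicated(keep='first')]
print(unique_mapping['Hamlet'])
print(len(unique_mapping))
```

Keeping the first occurrence is an arbitrary choice; for remakes that share a title, disambiguating by release year may be more appropriate.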

Define our recommendation function:
1. Get the index of the movie given its title.
2. Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is the movie's position and the second is the similarity score.
3. Sort this list of tuples by similarity score, in descending order.
4. Get the top 10 elements of this list. (Ignore the first element, as it refers to the movie itself.)
5. Return the titles corresponding to the indices of the top elements.

In [11]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = movie_indices[title]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Keep the top 10, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return metadata['title'].iloc[indices]

In [12]:
get_recommendations('The Dark Knight Rises', cosine_sim)

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [13]:
get_recommendations('The Godfather')

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object
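
Sorting all 45,466 scores to keep only ten does more work than necessary. A possible variant, sketched here with a hypothetical `top_k_similar` helper on a made-up score vector, uses NumPy's `argpartition` to pick the k best candidates and then orders just those. It also excludes the query movie by index instead of assuming it sorts first.

```python
import numpy as np


def top_k_similar(scores, idx, k=10):
    """Hypothetical helper: positions of the k highest scores, excluding idx."""
    scores = np.asarray(scores, dtype=float).copy()
    # Exclude the query item itself from the ranking
    scores[idx] = -np.inf
    k = min(k, len(scores) - 1)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest scores into the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # Order only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]


# Hypothetical similarity scores for one query item at position 1
toy_scores = np.array([0.10, 1.00, 0.40, 0.00, 0.25])
print(top_k_similar(toy_scores, idx=1, k=3))  # prints [2 4 0]
```

With the notebook's objects, this could be used as `metadata['title'].iloc[top_k_similar(cosine_sim[idx], idx)]`, where `idx` is the integer looked up in `movie_indices`.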