In this notebook we recommend movies to a user with the Content-Based methodology, one of the 2 widely adopted approaches to building recommender systems.

The 2 widely used approaches are:
### 1. Content-Based Recommenders
Here, recommendations are provided to users on the basis of their profile, which captures their preferences and tastes.

### 2. Collaborative Filtering
Here, the user is matched to similar users (based on preferences), and is then recommended the items that those similar users have liked. Since users are matched to each other, there is no need to extract information from the recommended items, unlike in content-based filters.

# Content-Based Recommenders

Content-based recommenders first try to figure out which aspects of an item a user likes most, and then recommend items that share those aspects.

We will build 3 variants of this type of recommendation engine, each relying on a different kind of movie metadata:

1. Recommendations based on genres
2. Recommendations based on movie plot description
3. Recommendations based on credits, genres and keywords

### I. Genre Based Recommender

The approach here is to figure out the user's favourite genres from the movies and ratings they provide.
We first compute a weighted genre matrix for the user; then, using those genre weights, we score every movie by the weighted average of its genres and recommend the highest-scoring movies.

The process, step by step, is listed below:

#### 1. Data Acquisition

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

movies_df = pd.read_csv('../input/moviesdataset/movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### 2. Data Modification
We will transform this dataset a bit.

First we will remove the year from the movie title and store it in a separate column.

Next we will split the genres column on '|', so that each row stores its genres as a list.

In [2]:
#Extracting the four-digit year (without its parentheses) into a new column
movies_df['year'] = movies_df.title.str.extract(r'\((\d{4})\)', expand=False)
#Removing the parenthesised year from the title column (regex=True is needed in pandas 2.0+)
movies_df['title'] = movies_df.title.str.replace(r'\(\d{4}\)', '', regex=True)
#Stripping any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].str.strip()

#splitting genres column on basis of '|'
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


For our genre-based recommender, we need the genres to be one-hot encoded into the dataframe, so that recommendations can be computed with simple vector arithmetic.

With one-hot encoding, the list of genres is converted into a vector in which each column corresponds to one possible value of the feature.

A 1 in the column would indicate that the movie has that genre, whereas a 0 would indicate otherwise.
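
As a small illustration of the encoding itself, here is a minimal sketch on toy data. The two rows and the names `toy` and `toy_encoded` are made up for this example, and `str.get_dummies` is a vectorized alternative rather than the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical rows) with genres already split into lists
toy = pd.DataFrame({
    "title": ["Toy Story", "Grumpier Old Men"],
    "genres": [["Adventure", "Comedy"], ["Comedy", "Romance"]],
})

# Re-join each list with '|' so that str.get_dummies can split and encode it
toy_encoded = toy["genres"].str.join("|").str.get_dummies(sep="|")
print(toy_encoded)
```

Each genre becomes its own column, holding 1 where the movie has that genre and 0 otherwise.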

In [3]:
encoded_genre_movies = movies_df.copy()

for index, row in movies_df.iterrows():
    for genre in row['genres']:
        encoded_genre_movies.at[index, genre] = 1
encoded_genre_movies = encoded_genre_movies.fillna(0)
encoded_genre_movies.head()


Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 3. Input Dataframe
Now we will take the user's input, i.e. a list of a few movies with their ratings.

In [4]:
input_movies_dict = [
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Shutter Island', 'rating':4.5},
            {'title':'Prestige, The', 'rating':4},
            {'title':"Source Code", 'rating':5},
            {'title':'Interstellar', 'rating':4}
         ] 
input_movies = pd.DataFrame(input_movies_dict)
input_movies

Unnamed: 0,title,rating
0,Pulp Fiction,5.0
1,Shutter Island,4.5
2,"Prestige, The",4.0
3,Source Code,5.0
4,Interstellar,4.0


#### 4. Genre Matrix and Weighted Genre Matrix
The next part involves computing the Weighted Genre Matrix. To do that, we first filter **encoded_genre_movies** down to the movie titles entered by the user, which gives us their genre matrix.

In [5]:
input_movies_genre_matrix = encoded_genre_movies[encoded_genre_movies['title'].isin(input_movies['title'].tolist())]
input_movies_genre_matrix
#We'll remove columns other than the genre values to get just the genre matrix
input_movies_genre_matrix = input_movies_genre_matrix.reset_index(drop=True)
input_movies_genre_matrix = input_movies_genre_matrix.drop(columns=['movieId', 'title', 'genres', 'year'])
input_movies_genre_matrix

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


Now, to start learning the user's preferences, we will compute the dot product of the transposed **input_movies_genre_matrix** and the ratings from **input_movies**. The result is the Weighted Genre Matrix, i.e. the user profile, with one weight per genre.

In [6]:
user_profile = input_movies_genre_matrix.transpose().dot(input_movies['rating'])
user_profile

Adventure              0.0
Animation              0.0
Children               0.0
Comedy                 5.0
Fantasy                0.0
Romance                0.0
Drama                 18.5
Action                 5.0
Crime                  5.0
Thriller              18.5
Horror                 0.0
Mystery               13.5
Sci-Fi                13.5
War                    0.0
Musical                0.0
Documentary            0.0
IMAX                   4.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
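
One thing worth checking at this step: `isin` keeps rows in the order of the movies dataframe, not in the order the user typed the titles, so a positional dot product can pair a rating with the wrong movie. The sketch below shows one way to align ratings and genre rows by title first. The toy rows and the names `genre_rows`, `ratings`, `aligned` and `profile` are assumptions made for this illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical one-hot rows, deliberately in a different order from the input
genre_rows = pd.DataFrame({
    "title": ["Prestige, The", "Pulp Fiction", "Shutter Island"],
    "Drama": [1, 1, 1],
    "Sci-Fi": [1, 0, 0],
    "Comedy": [0, 1, 0],
})
ratings = pd.DataFrame({
    "title": ["Pulp Fiction", "Shutter Island", "Prestige, The"],
    "rating": [5.0, 4.5, 4.0],
})

# Merge on title so each rating sits on the same row as its genre vector
aligned = ratings.merge(genre_rows, on="title", how="inner")

genre_cols = ["Drama", "Sci-Fi", "Comedy"]
# Genre weights: each genre column multiplied by the ratings and summed
profile = aligned[genre_cols].T.dot(aligned["rating"])
print(profile)
```

With the merge, the Sci-Fi weight comes only from the rating given to the one Sci-Fi movie in the toy data, regardless of row order.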

Next we will get the genre matrix for the entire movie database, compute the weighted average of every movie's genres under the user profile, and recommend the top 20 movies.

In [7]:
complete_genre_matrix = encoded_genre_movies.set_index(encoded_genre_movies['movieId'])
#Also we'll drop the unnecessary columns
complete_genre_matrix = complete_genre_matrix.drop(columns=['movieId', 'title', 'genres', 'year'])
complete_genre_matrix.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
recommendation_list = ((complete_genre_matrix*user_profile).sum(axis=1))/(user_profile.sum())
recommendation_list = recommendation_list.sort_values(ascending=False)
recommendation_list.head()

movieId
79132    0.939759
198      0.891566
26701    0.891566
60684    0.879518
67197    0.831325
dtype: float64
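
To make the score concrete, the following sketch recomputes it by hand for Strange Days (movieId 198), using the non-zero weights from the user profile printed earlier and the genres listed for that movie in the dataset. The variable names are chosen for this example only.

```python
# Non-zero weights copied from the user_profile output
profile = {"Comedy": 5.0, "Drama": 18.5, "Action": 5.0, "Crime": 5.0,
           "Thriller": 18.5, "Mystery": 13.5, "Sci-Fi": 13.5, "IMAX": 4.0}

# Genres of Strange Days (movieId 198)
strange_days = ["Action", "Crime", "Drama", "Mystery", "Sci-Fi", "Thriller"]

matched = sum(profile.get(g, 0.0) for g in strange_days)  # weight of the genres the movie has
total = sum(profile.values())                              # total weight in the profile
score = matched / total
print(matched, total, round(score, 6))  # 74.0 83.0 0.891566
```

The score is the share of the user's total genre weight that the movie covers, so a movie matching every weighted genre would score 1.0.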

In [9]:
movies_df[movies_df['movieId'].isin(recommendation_list.head(20).index)]

Unnamed: 0,movieId,title,genres,year
167,198,Strange Days,"[Action, Crime, Drama, Mystery, Sci-Fi, Thriller]",1995
563,680,"Alphaville (Alphaville, une étrange aventure d...","[Drama, Mystery, Romance, Sci-Fi, Thriller]",1965
1484,2009,Soylent Green,"[Drama, Mystery, Sci-Fi, Thriller]",1973
4631,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
5333,8870,"Forgotten, The","[Drama, Mystery, Sci-Fi, Thriller]",2004
5556,26701,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,"[Action, Animation, Crime, Drama, Film-Noir, M...",1989
5699,27788,"Jacket, The","[Drama, Mystery, Sci-Fi, Thriller]",2005
5729,27904,"Scanner Darkly, A","[Animation, Drama, Mystery, Sci-Fi, Thriller]",2006
5750,30894,White Noise,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2005
6145,43932,Pulse,"[Action, Drama, Fantasy, Horror, Mystery, Sci-...",2006
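
Note that filtering with `isin` returns the matching rows in dataframe order, so the ranking by score is not visible in the table. A minimal sketch of a lookup that keeps the ranked order is shown below, on toy data; `movies`, `scores` and `ranked` are hypothetical names for this example.

```python
import pandas as pd

# Toy stand-ins for movies_df and recommendation_list
movies = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})
scores = pd.Series({30: 0.9, 10: 0.7, 20: 0.1}).sort_values(ascending=False)

# Index by movieId and select with the ranked ids, which preserves their order
top_ids = scores.head(2).index
ranked = movies.set_index("movieId").loc[top_ids].assign(score=scores.head(2))
print(ranked)
```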


### II. Movie Plot Based Recommender

In this approach we find movies that are similar to an input movie. We calculate similarity scores between movies based on their plots (the overview feature in our tmdb dataset), and then provide the 10 movies with the highest similarity score to our input as the recommendation.

Since the plot is present in the overview column in form of sentences, we need to transform the plots into word vectors in order to be able to compare them for generating the similarity scores.

We will use the Term Frequency-Inverse Document Frequency (TF-IDF) vectors for computing the word vector for each overview feature (called each document in the context of TF-IDF), which will provide us with a matrix where each column represents a word and each row represents a movie.
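
Before applying this to the real overviews, here is a minimal sketch of what the vectorizer produces on three made-up sentences. The sentences and the names `docs`, `vec` and `toy_tfidf` are assumptions for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up plot summaries
docs = [
    "a hero saves the city",
    "a hero fights a villain in the city",
    "two friends travel across the country",
]

vec = TfidfVectorizer(stop_words="english")
toy_tfidf = vec.fit_transform(docs)

print(toy_tfidf.shape)          # (number of documents, vocabulary size)
print(sorted(vec.vocabulary_))  # the words that became columns
```

Each row is one document and each column one retained word; stop words such as "the" do not get a column.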

In [10]:
#Extracting necessary features from the tmdb dataset
tmdb_df = pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')
tmdb_df_final = tmdb_df[['id', 'original_title', 'cast', 'director', 'overview', 'genres', 'keywords']].copy()
tmdb_df_final.head()

Unnamed: 0,id,original_title,cast,director,overview,genres,keywords
0,135397,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,Action|Adventure|Science Fiction|Thriller,monster|dna|tyrannosaurus rex|velociraptor|island
1,76341,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,An apocalyptic story set in the furthest reach...,Action|Adventure|Science Fiction|Thriller,future|chase|post-apocalyptic|dystopia|australia
2,262500,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,Beatrice Prior must confront her inner demons ...,Adventure|Science Fiction|Thriller,based on novel|revolution|dystopia|sequel|dyst...
3,140607,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,Thirty years after defeating the Galactic Empi...,Action|Adventure|Science Fiction|Fantasy,android|spaceship|jedi|space opera|3d
4,168259,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,Deckard Shaw seeks revenge against Dominic Tor...,Action|Crime|Thriller,car race|speed|revenge|suspense|car


To build the TF-IDF vectors we will use scikit-learn's *TfidfVectorizer* class, which produces the TF-IDF matrix conveniently and without much overhead.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

#The vectorizer will drop English stop words such as 'a' and 'the' from the overviews
tfidf_object = TfidfVectorizer(stop_words='english')

#Lets replace any Na's in the overview column with empty strings
tmdb_df_final['overview'] = tmdb_df_final['overview'].fillna('')

tfidf_matrix = tfidf_object.fit_transform(tmdb_df_final['overview'])

tfidf_matrix.shape


(10866, 32786)

The shape of the TF-IDF matrix shows that, after stop-word removal, a vocabulary of 32,786 distinct words describes the 10,866 movies in our dataframe.

Our next task is to measure the similarity between every pair of movie vectors, i.e. rows of the TF-IDF matrix. This similarity score is what we will use to recommend movies to our user.

Of the several ways to calculate a similarity score (Euclidean, Pearson, cosine), we will use cosine similarity. TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows already equals their cosine similarity, and a simple linear kernel over the matrix gives the full cosine similarity matrix. This matrix contains the cosine similarity score of each movie with every other movie in the dataframe.
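
The equivalence between the linear kernel and cosine similarity can be checked on a small example. This is a sketch on made-up sentences, relying on the vectorizer's default `norm='l2'` setting; `docs`, `X` and `same` are names chosen for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["hero saves city", "hero fights villain", "friends travel country"]
X = TfidfVectorizer().fit_transform(docs)

# Rows are unit length, so plain dot products equal cosine similarities
same = np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
print(same)
```

The linear kernel skips the normalization step that `cosine_similarity` performs, which is why it is the cheaper choice when the rows are already normalized.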




In [12]:
from sklearn.metrics.pairwise import linear_kernel
cosine_similarity_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_similarity_matrix

array([[1.        , 0.00710078, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00710078, 1.        , 0.        , ..., 0.        , 0.04902739,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.04902739, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

Now that we have the cosine similarities calculated in a matrix, we will write a function that takes a movie title as input and returns the 10 recommendations that have the highest cosine similarity with the input movie. To do this we'll need a reverse mapping from movie titles to dataframe indices, i.e. a way to look up the index of a movie in our tmdb_df_final dataframe.

In [13]:
movie_indices = pd.Series(tmdb_df_final.index, index=tmdb_df_final['original_title']).drop_duplicates()
movie_indices

original_title
Jurassic World                      0
Mad Max: Fury Road                  1
Insurgent                           2
Star Wars: The Force Awakens        3
Furious 7                           4
                                ...  
The Endless Summer              10861
Grand Prix                      10862
Beregis Avtomobilya             10863
What's Up, Tiger Lily?          10864
Manos: The Hands of Fate        10865
Length: 10866, dtype: int64
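
The unchanged length of 10866 suggests that `drop_duplicates()` removed nothing here: on a Series it compares the values (the row positions, which are all unique), not the titles in the index. If two movies share a title, the lookup returns several positions instead of one. Below is a hedged sketch of de-duplicating on the title instead, using toy titles; `titles`, `lookup`, `unchanged` and `unique_lookup` are hypothetical names.

```python
import pandas as pd

# Toy titles with one repeated entry
titles = pd.Series(["Batman", "Hamlet", "Batman"], name="original_title")
lookup = pd.Series(titles.index, index=titles)

# Series.drop_duplicates compares the values (0, 1, 2), so nothing is removed
unchanged = lookup.drop_duplicates()

# Index.duplicated flags repeated titles; keep the first position of each
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(len(unchanged), len(unique_lookup))
```

Keeping the first occurrence is an arbitrary choice for the sketch; picking the most popular or most recent movie with that title would be equally valid.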

Now our final recommendation function will accomplish the following tasks:

1. Take the movie title as input and fetch its index in the dataframe.
2. Get a list of tuples of the cosine similarity scores of the input movie with all movies present in the dataset, each of the form (position, similarity score).
3. Sort this list in descending order of similarity score, so that the entered movie itself (a movie is most similar to itself) is at the top.
4. Skip that first element, and return the titles corresponding to the next 10 elements of the sorted list.

In [14]:
def recommend(title, cosine_similarity=cosine_similarity_matrix):
    #Position of the input movie in the dataframe
    movie_index = movie_indices[title]
    
    #(position, similarity score) pairs against every movie
    similarity_scores = list(enumerate(cosine_similarity[movie_index]))
    
    #Most similar first
    similarity_scores = sorted(similarity_scores, key = lambda x: x[1], reverse=True)
    
    #Skip the first entry (the input movie itself) and keep the next 10
    similarity_scores = similarity_scores[1:11]
    
    indices = [x[0] for x in similarity_scores]
    
    return tmdb_df_final['original_title'].iloc[indices]
    

In [15]:
recommend('The Dark Knight')

4363                                The Dark Knight Rises
8245                                       Batman Returns
2024                           Batman: Under the Red Hood
8082                                       Batman Forever
5463              Batman: The Dark Knight Returns, Part 2
3246    Batman Unmasked: The Psychology of the Dark Kn...
4400              Batman: The Dark Knight Returns, Part 1
9182                                               Batman
3570                                     Batman: Year One
6330                                The Batman vs Dracula
Name: original_title, dtype: object
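
Dropping the first sorted element assumes the input movie always ranks first, which may not hold if another movie has an identical overview. The sketch below is a hypothetical variant, not the function used in this notebook: it excludes the query by position and takes the number of results as a parameter. The names `recommend_top_n`, `titles`, `lookup` and `sim` are made up for the example.

```python
import numpy as np
import pandas as pd

def recommend_top_n(title, titles, lookup, similarity, top_n=10):
    """Hypothetical variant of recommend(): rank by similarity, skipping the query itself."""
    idx = lookup[title]
    scores = np.asarray(similarity[idx], dtype=float).copy()
    scores[idx] = -np.inf  # exclude the query movie explicitly
    order = np.argsort(-scores, kind="stable")[:top_n]  # highest scores first
    return titles.iloc[order]

# Toy data: four movies and a symmetric similarity matrix
titles = pd.Series(["A", "B", "C", "D"])
lookup = pd.Series(titles.index, index=titles)
sim = np.array([[1.0, 0.2, 0.9, 0.0],
                [0.2, 1.0, 0.1, 0.3],
                [0.9, 0.1, 1.0, 0.4],
                [0.0, 0.3, 0.4, 1.0]])

result = recommend_top_n("A", titles, lookup, sim, top_n=2)
print(result.tolist())  # ['C', 'B']
```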

Now the problem with the plot-based recommender is that it is only as good as the description of the plot.
For example, since it compares word vectors, plots that share specific keywords are always prioritized. This can be seen when recommendations are sought for a Batman movie: the top 10 recommendations for 'The Dark Knight' are all Batman movies, even the ones that weren't as great. We should also account for other metadata in the dataset such as directors, actors and keywords, because when a user enters a movie like Pulp Fiction or Inglourious Basterds, it is highly likely that they'll like other movies by the director Quentin Tarantino.

### III. Movie Credits, Genres and Keywords based recommender

Here we will build a combined recommender that takes into account the genres, credits (actors and directors of a movie) and keywords available in the dataset for each movie. The idea is to improve the quality of recommendations by using as much metadata as we can, without hurting performance by a large margin.

We will use data from 4 columns of the tmdb_df_final dataframe to build this recommender: cast, director, genres and keywords.

From cast, genres and keywords, we will keep only the first 3 entries listed for each movie.

In [16]:
#storing data in cast, genre and keyword as a list in the dataset
tmdb_df_final['cast'] = tmdb_df_final.cast.str.split('|')
tmdb_df_final['genres'] = tmdb_df_final.genres.str.split('|')
tmdb_df_final['keywords'] = tmdb_df_final.keywords.str.split('|')
    
tmdb_df_final.head()



Unnamed: 0,id,original_title,cast,director,overview,genres,keywords
0,135397,Jurassic World,"[Chris Pratt, Bryce Dallas Howard, Irrfan Khan...",Colin Trevorrow,Twenty-two years after the events of Jurassic ...,"[Action, Adventure, Science Fiction, Thriller]","[monster, dna, tyrannosaurus rex, velociraptor..."
1,76341,Mad Max: Fury Road,"[Tom Hardy, Charlize Theron, Hugh Keays-Byrne,...",George Miller,An apocalyptic story set in the furthest reach...,"[Action, Adventure, Science Fiction, Thriller]","[future, chase, post-apocalyptic, dystopia, au..."
2,262500,Insurgent,"[Shailene Woodley, Theo James, Kate Winslet, A...",Robert Schwentke,Beatrice Prior must confront her inner demons ...,"[Adventure, Science Fiction, Thriller]","[based on novel, revolution, dystopia, sequel,..."
3,140607,Star Wars: The Force Awakens,"[Harrison Ford, Mark Hamill, Carrie Fisher, Ad...",J.J. Abrams,Thirty years after defeating the Galactic Empi...,"[Action, Adventure, Science Fiction, Fantasy]","[android, spaceship, jedi, space opera, 3d]"
4,168259,Furious 7,"[Vin Diesel, Paul Walker, Jason Statham, Miche...",James Wan,Deckard Shaw seeks revenge against Dominic Tor...,"[Action, Crime, Thriller]","[car race, speed, revenge, suspense, car]"


In [17]:
#function to return top 3 elements from input lists
def return_top_3(x):
    if isinstance(x, list):
        if(len(x) > 3):
            x = x[:3]
        return x
    return []

In [18]:
features = ['cast', 'genres', 'keywords']
for feature in features:
    tmdb_df_final[feature] = tmdb_df_final[feature].apply(return_top_3)
    
tmdb_df_final[['original_title', 'cast', 'director', 'genres', 'keywords']].head()



Unnamed: 0,original_title,cast,director,genres,keywords
0,Jurassic World,"[Chris Pratt, Bryce Dallas Howard, Irrfan Khan]",Colin Trevorrow,"[Action, Adventure, Science Fiction]","[monster, dna, tyrannosaurus rex]"
1,Mad Max: Fury Road,"[Tom Hardy, Charlize Theron, Hugh Keays-Byrne]",George Miller,"[Action, Adventure, Science Fiction]","[future, chase, post-apocalyptic]"
2,Insurgent,"[Shailene Woodley, Theo James, Kate Winslet]",Robert Schwentke,"[Adventure, Science Fiction, Thriller]","[based on novel, revolution, dystopia]"
3,Star Wars: The Force Awakens,"[Harrison Ford, Mark Hamill, Carrie Fisher]",J.J. Abrams,"[Action, Adventure, Science Fiction]","[android, spaceship, jedi]"
4,Furious 7,"[Vin Diesel, Paul Walker, Jason Statham]",James Wan,"[Action, Crime, Thriller]","[car race, speed, revenge]"


Now, so that the vectorizer treats each name or phrase as a single distinct token, we will strip the spaces inside each item and convert it to lowercase (for example, 'Chris Pratt' becomes chrispratt). This way, two actors who merely share a first name are not counted as a match.

In [19]:
def data_transform(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [20]:
features = ['cast', 'genres', 'director', 'keywords']
for feature in features:
    tmdb_df_final[feature] = tmdb_df_final[feature].apply(data_transform)
tmdb_df_final[features].head()



Unnamed: 0,cast,genres,director,keywords
0,"[chrispratt, brycedallashoward, irrfankhan]","[action, adventure, sciencefiction]",colintrevorrow,"[monster, dna, tyrannosaurusrex]"
1,"[tomhardy, charlizetheron, hughkeays-byrne]","[action, adventure, sciencefiction]",georgemiller,"[future, chase, post-apocalyptic]"
2,"[shailenewoodley, theojames, katewinslet]","[adventure, sciencefiction, thriller]",robertschwentke,"[basedonnovel, revolution, dystopia]"
3,"[harrisonford, markhamill, carriefisher]","[action, adventure, sciencefiction]",j.j.abrams,"[android, spaceship, jedi]"
4,"[vindiesel, paulwalker, jasonstatham]","[action, crime, thriller]",jameswan,"[carrace, speed, revenge]"


We will now create the metadata soup. This metadata soup is a string that contains all of the metadata that we will be feeding to our vectorizer.

In [21]:
def data_soup(x):
    return ' '.join(x['cast']) + ' ' + ' '.join(x['genres']) + ' ' + x['director'] + ' ' + ' '.join(x['keywords'])

tmdb_df_final['soup'] = tmdb_df_final.apply(data_soup, axis=1)
tmdb_df_final['soup'].head()



0    chrispratt brycedallashoward irrfankhan action...
1    tomhardy charlizetheron hughkeays-byrne action...
2    shailenewoodley theojames katewinslet adventur...
3    harrisonford markhamill carriefisher action ad...
4    vindiesel paulwalker jasonstatham action crime...
Name: soup, dtype: object

Now that we have the metadata soup, we will follow the same methodology as in the Plot Based Recommender, i.e. calculate a similarity score between movies, this time based on the soup instead of the plot description. For the plots we used the TF-IDF vectorizer, which down-weights words that occur in many overviews so that they contribute less to the similarity computation. Here we do not want that down-weighting: the same directors and actors appear in many movies, and the genres come from a fixed set, so they repeat by design. In fact, this repetition of shared metadata is exactly what forms a good basis for recommendation.

Hence we will use a different vectorizer for the similarity computation, CountVectorizer(), together with scikit-learn's cosine_similarity, since raw count vectors are not normalized.
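
One detail to be aware of: with its default token pattern, CountVectorizer splits on punctuation and drops single-character tokens, so names such as `hughkeays-byrne` or `j.j.abrams` do not survive as one token. The sketch below compares the default analyzer with a whitespace tokenizer. Passing `tokenizer=str.split` is an alternative shown for illustration only, not what this notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = "hughkeays-byrne j.j.abrams sciencefiction"

# Default analyzer: tokens are runs of two or more word characters
default_tokens = CountVectorizer().build_analyzer()(sample)

# Whitespace tokenizer: every space-separated item stays one token
whitespace_tokens = CountVectorizer(
    tokenizer=str.split, token_pattern=None
).build_analyzer()(sample)

print(default_tokens)     # ['hughkeays', 'byrne', 'abrams', 'sciencefiction']
print(whitespace_tokens)  # ['hughkeays-byrne', 'j.j.abrams', 'sciencefiction']
```

Whether this matters depends on the data; for most names in the soup the two tokenizations give the same result.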

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

cnt = CountVectorizer(stop_words='english')
count_matrix = cnt.fit_transform(tmdb_df_final['soup'])

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity_matrix2 = cosine_similarity(count_matrix, count_matrix)
cosine_similarity_matrix2

array([[1.        , 0.27386128, 0.2       , ..., 0.        , 0.11952286,
        0.        ],
       [0.27386128, 1.        , 0.18257419, ..., 0.        , 0.10910895,
        0.        ],
       [0.2       , 0.18257419, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.12598816,
        0.        ],
       [0.11952286, 0.10910895, 0.        , ..., 0.12598816, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [23]:
tmdb_data_copy = tmdb_df_final.copy()
tmdb_data_copy = tmdb_data_copy.reset_index()

indices = pd.Series(tmdb_data_copy.index, index=tmdb_data_copy['original_title'])

#re-using the recommend function
recommend('The Dark Knight', cosine_similarity_matrix2)

4363      The Dark Knight Rises
6191              Batman Begins
6565               The Prestige
7468                     Hitman
164     Kidnapping Mr. Heineken
1124               The Outsider
1246                       Tell
1318                   Kill Dil
2176                 Boy Wonder
3813    House of the Rising Sun
Name: original_title, dtype: object

This recommendation list takes into account the genres of a movie, its leading actors, its director and the keywords associated with it. Since these recommendations share more of the entered movie's metadata, there is arguably a higher chance of the user liking them.

A good approach would also be to combine the 2 recommenders, for example by returning 4 plot-based and 6 soup-based recommendations.
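
A minimal sketch of such a combination is given below. The helper name `blend_recommendations` and its parameters are hypothetical; it takes two ranked lists of titles (for example the results of the two recommenders converted with `.tolist()`) and skips titles that appear in both.

```python
def blend_recommendations(plot_recs, soup_recs, n_plot=4, n_soup=6):
    """Hypothetical helper: take n_plot plot-based titles, then fill up with soup-based ones."""
    blended = list(plot_recs[:n_plot])
    for title in soup_recs:
        if len(blended) >= n_plot + n_soup:
            break
        if title not in blended:  # avoid recommending the same movie twice
            blended.append(title)
    return blended

# Toy ranked lists standing in for the two recommenders' outputs
plot_based = ["A", "B", "C", "D", "E"]
soup_based = ["B", "X", "Y", "Z", "W", "V", "U", "T"]

blended = blend_recommendations(plot_based, soup_based)
print(blended)  # ['A', 'B', 'C', 'D', 'X', 'Y', 'Z', 'W', 'V', 'U']
```

The soup-based list needs to be longer than 6 entries for the blend to reach 10 titles when some of them overlap with the plot-based picks.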