# <font color=dark>Content Based Recommender System</font>

<hr style="border:2px solid gray">

## <font color=blue>Objective</font>
### Given the information we supply and the metadata already available for each movie, the system builds a watchlist for us.

<hr style="border:2px solid gray">

# <font color=red>Plot Description Based Recommender</font>

### The model takes a movie title as its argument and recommends the movies whose plots are most similar to it.


### Steps to Build

1. Create TF-IDF vectors for the plot description for every movie.

2. Compute the pairwise cosine similarity score between every pair of movies.

3. Write the recommender function, which takes a movie title as input and returns the movies with the most similar plots.
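
The three steps above can be sketched end to end on a tiny corpus. This is only an illustration: the plot strings and the names `toy_plots`, `toy_tfidf`, `toy_sim`, and `ranked` are invented here and are not part of the dataset or the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy plot descriptions (invented for illustration only)
toy_plots = [
    "A young lion prince flees his kingdom after his father dies",
    "A lion cub grows up to reclaim his kingdom",
    "A vampire terrorises a remote mountain village",
]

# Step 1: one TF-IDF vector per plot
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_plots)

# Step 2: pairwise similarity between all plots
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)

# Step 3: rank the other plots by their similarity to plot 0
ranked = sorted(
    (j for j in range(len(toy_plots)) if j != 0),
    key=lambda j: toy_sim[0, j],
    reverse=True,
)
print(ranked)
```

The second plot shares the words "lion" and "kingdom" with the first, so it ranks ahead of the vampire plot, which shares none.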

## <font color=blue>Importing Libraries & Data Pre-Processing</font>

In [5]:
import pandas as pd
import numpy as np
import sklearn
import warnings
warnings.filterwarnings('ignore')

#Import data from the clean file 

df = pd.read_csv('metadata_clean.csv')

#Print the head of the cleaned DataFrame

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995


In [8]:
#Import the original movie dataset

orig_df = pd.read_csv('movies_metadata.csv', low_memory=False)

#Add the overview and id features from the original movie dataset to the cleaned DataFrame (saved from the knowledge-based recommender), because the recommendations are based on the plot. The columns are aligned by row index.

df['overview'], df['id'] = orig_df['overview'], orig_df['id']

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862.0
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844.0
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602.0
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357.0
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862.0


### To work with the plot text, we need to remove punctuation and stop words and convert every word to lowercase.

### The TfidfVectorizer from scikit-learn handles all of this.

In [9]:
#Import TfIdfVectorizer from the scikit-learn library

from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stopwords

tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string

df['overview'] = df['overview'].fillna('')

# Here each row will represent the TF-IDF vector of the overview feature of the corresponding movie present in our Data Frame.
#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature

tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix

tfidf_matrix.shape

(45466, 75827)

In [10]:
# Import linear_kernel to compute the dot product

from sklearn.metrics.pairwise import linear_kernel

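# Compute the cosine similarity matrix.
# TfidfVectorizer L2-normalises each row by default, so the dot product equals the cosine similarity.
# The result is a 45,466 × 45,466 matrix.
# The value in the ith row and jth column is the similarity score between movies i and j.
# Each diagonal element is 1 (a movie's similarity with itself), except for movies with an empty overview, whose rows are all zero.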


cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print("similarity scores of every movie with every other movie \n \n ",cosine_sim)

[[1.         0.01504121 0.         ... 0.         0.         0.        ]
 [0.01504121 1.         0.04681953 ... 0.         0.         0.        ]
 [0.         0.04681953 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
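
Two properties of this matrix can be checked on toy data. For TF-IDF vectors, `linear_kernel` and `cosine_similarity` give the same result, and a movie with an empty overview gets an all-zero row, including its diagonal entry, as in the last rows printed above. The documents below are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# The last "overview" is empty, as happens after fillna('')
docs = ["lion king jungle", "vampire haunts castle", ""]
m = TfidfVectorizer(stop_words="english").fit_transform(docs)

lk = linear_kernel(m, m)      # plain dot products
cs = cosine_similarity(m, m)  # dot products of normalised rows

print(np.allclose(lk, cs))    # rows are already unit length, so these agree
print(np.diag(lk))            # the empty document scores 0 against itself
```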


In [13]:
# We need a reverse mapping from movie titles to their row indices.
# Build a pandas Series with the movie title as the index and the row position as the value.


# Note: drop_duplicates() compares the Series values (the row indices), which are all unique,
# so duplicate titles are not removed here and the length stays at 45,466.
indices = pd.Series(df.index, index=df['title']).drop_duplicates()
print(indices)

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
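
The printed length is still 45,466, which suggests that nothing was dropped. A small sketch of why: `drop_duplicates()` on a Series compares its values, not its index labels. The toy titles and the names `by_value` and `by_title` are invented here. Filtering on `index.duplicated()` is one possible way to keep a single row per title, not something the notebook itself does.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Dracula", "Nosferatu", "Dracula"]})

# The values are the row positions 0, 1, 2, which are all unique,
# so the repeated title survives
by_value = pd.Series(toy.index, index=toy["title"]).drop_duplicates()

# Filtering on the index keeps only the first row for each title
by_title = by_value[~by_value.index.duplicated(keep="first")]

print(len(by_value), len(by_title))
print(by_title["Dracula"])
```

With a unique index, looking up a title always returns a single integer rather than a Series of several positions.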


<hr style="border:2px solid gray">

## <font color=blue>Creating Recommender System Function</font>

### 1. The title of the movie is given.
### 2. The index of the movie is looked up in the reverse mapping.
### 3. The movie's row of cosine_sim is converted into a list of tuples, where the first element is the position and the second is the similarity score.
### 4. The list is sorted by cosine similarity in descending order.
### 5. The first entry is dropped, since it is the movie itself with a similarity score of 1, and the next 10 movies are returned.


In [50]:
# Function that takes in movie title as input and gives recommendations 
def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [18]:
#Get plot-based recommendations for The Lion King
content_recommender('The Lion King')

34680    Metamorphosis : The Alien Factor
9353                     The Lion King 1½
9115       The Lion King 2: Simba's Pride
42826                Lead with Your Heart
25653                               Fanny
17041                        African Cats
27932                          Rio Diablo
6094                            Born Free
37406            The Field of Enchantment
3203                     The Waiting Game
Name: title, dtype: object

In [17]:
content_recommender('Count Dracula')

1300                       Nosferatu
6516                         Dracula
5355           Nosferatu the Vampyre
2530                         Dracula
6080            Bend It Like Beckham
8649      Taste the Blood of Dracula
1282               Blood for Dracula
8589     Dracula: Prince of Darkness
26734                 Monster Island
19407                Vampir-Cuadecuc
Name: title, dtype: object
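
A full 45,466 × 45,466 matrix of 64-bit floats takes roughly 16.5 GB of memory. As a possible alternative, the similarity row for a single movie can be computed on demand from the TF-IDF matrix. The sketch below is not the notebook's method: `recommend_on_demand` and its parameters are hypothetical names, and the toy titles and plots are invented.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def recommend_on_demand(title, titles, matrix, top_n=10):
    """Hypothetical variant: score one movie against all others, one row at a time."""
    # Position of the first row whose title matches
    positions = np.flatnonzero(titles.to_numpy() == title)
    if positions.size == 0:
        raise KeyError(f"title not found: {title!r}")
    pos = int(positions[0])

    # A single row of similarity scores, shape (n_movies,)
    scores = linear_kernel(matrix[pos], matrix).ravel()

    # Sort by descending score and skip the query movie by position
    order = [i for i in scores.argsort()[::-1] if i != pos][:top_n]
    return titles.iloc[order]

# Toy data (invented for illustration)
toy_titles = pd.Series(["Lion A", "Lion B", "Vampire"])
toy_matrix = TfidfVectorizer().fit_transform(
    ["lion cub kingdom", "lion kingdom pride", "vampire castle night"]
)
print(recommend_on_demand("Lion A", toy_titles, toy_matrix, top_n=1))
```

Skipping the query movie by position, rather than dropping the first sorted entry, also behaves sensibly when the query has an empty overview and therefore scores 0 against itself.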

<hr style="border:2px solid gray">

# <font color=red>Metadata Based Recommender</font>

### It takes features, such as genres, keywords, cast, and crew, into consideration and provides recommendations that are the most similar with respect to the above features.

### The metadata used here are the genres, the three lead cast members, the director, and the keywords (sub-genres) of the movie.


In [19]:
# Now build a recommender system based on the metadata of the movies.

# Load the keywords and credits files for the features not available in df

cred_df = pd.read_csv('credits.csv')
key_df = pd.read_csv('keywords.csv')

In [20]:
# Print the head of the credit dataframe

cred_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [21]:
# Print the head of the keywords dataframe

key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [22]:
#Check for missing values in df

df.isnull().sum()

title             6
genres            0
runtime         263
vote_average      6
vote_count        6
year              0
overview          0
id                3
dtype: int64

In [23]:
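# Function to convert IDs to integers; any value that cannot be converted becomes NaN

def clean_ids(x):
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        # Catch only conversion failures instead of using a bare except
        return np.nan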
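
A vectorised way to get a similar effect is `pd.to_numeric` with `errors="coerce"`. This is a sketch on invented values, not the notebook's approach. One difference to keep in mind is that `int(x)` truncates a float such as 862.7 to 862, while `pd.to_numeric` keeps the fractional part.

```python
import pandas as pd

# Invented raw IDs: two valid ones, one malformed string, one missing value
raw_ids = pd.Series(["862", "8844", "not-an-id", None])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

# Drop the NaN rows, then cast the remainder to integers
valid_ids = numeric_ids.dropna().astype(int)
print(valid_ids.tolist())
```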

In [25]:
# Clean the ids of df

df['id'] = df['id'].apply(clean_ids)

# Filter all rows that have a null ID

df = df[df['id'].notnull()]

In [26]:
# Convert IDs into integer

df['id'] = df['id'].astype('int')
key_df['id'] = key_df['id'].astype('int')
cred_df['id'] = cred_df['id'].astype('int')

# Merge keywords and credits into the main metadata dataframe

df = df.merge(cred_df, on='id')
df = df.merge(key_df, on='id')

# Display the head of df

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46628 entries, 0 to 46627
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         46620 non-null  object 
 1   genres        46628 non-null  object 
 2   runtime       46356 non-null  float64
 3   vote_average  46620 non-null  float64
 4   vote_count    46620 non-null  float64
 5   year          46628 non-null  int64  
 6   overview      46628 non-null  object 
 7   id            46628 non-null  int32  
 8   cast          46628 non-null  object 
 9   crew          46628 non-null  object 
 10  keywords      46628 non-null  object 
dtypes: float64(3), int32(1), int64(1), object(6)
memory usage: 4.1+ MB
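
The merged frame has 46,628 rows, more than the 45,466 rows before the merge. One plausible explanation, not verified against the files here, is that some `id` values appear more than once in `credits.csv` or `keywords.csv`, because an inner merge emits one row per matching pair. The toy frames below are invented to show the effect and one possible way to avoid it.

```python
import pandas as pd

movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
# id 1 appears twice in the right-hand frame
credits = pd.DataFrame({"id": [1, 1, 2], "cast": ["x", "x", "y"]})

# Each duplicate id on the right produces an extra row in the result
merged = movies.merge(credits, on="id")

# Dropping duplicate ids first keeps one row per movie
deduped = movies.merge(credits.drop_duplicates(subset="id"), on="id")

print(len(merged), len(deduped))
```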


In [28]:
# Convert the stringified objects into the native python objects

from ast import literal_eval

features = ['cast','crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)
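
As a small illustration of what `literal_eval` does to one cell, the sketch below parses a stringified list of dictionaries into real Python objects. The string is invented and only mimics the shape of the keywords column.

```python
from ast import literal_eval

# A cell as read from CSV: a plain string that looks like a list of dicts
raw = "[{'id': 1, 'name': 'example keyword'}, {'id': 2, 'name': 'another one'}]"

parsed = literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)
print([entry["name"] for entry in parsed])
```

pandas reports both strings and lists as `object` dtype, so `df.info()` looks the same before and after this conversion.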

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46628 entries, 0 to 46627
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         46620 non-null  object 
 1   genres        46628 non-null  object 
 2   runtime       46356 non-null  float64
 3   vote_average  46620 non-null  float64
 4   vote_count    46620 non-null  float64
 5   year          46628 non-null  int64  
 6   overview      46628 non-null  object 
 7   id            46628 non-null  int32  
 8   cast          46628 non-null  object 
 9   crew          46628 non-null  object 
 10  keywords      46628 non-null  object 
dtypes: float64(3), int32(1), int64(1), object(6)
memory usage: 4.1+ MB


In [30]:
# Print the first crew member of the first movie in df

df.iloc[0]['crew'][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

In [31]:
# Extract the director's name. If director is not listed, return NaN

def director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return crew_member['name']
    return np.nan

In [32]:
# Define the new director feature

df['director'] = df['crew'].apply(director)

# Print the directors of the first five movies

df['director'].head()

0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: director, dtype: object

In [33]:
# Returns the first 3 elements of the list, or the entire list if it has 3 or fewer elements.

def generate_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
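
Python slicing already returns the whole list when it is shorter than the slice, so the length check can be folded into the slice itself. The sketch below is a hypothetical compact variant, named `top_names` here, and not a replacement the notebook uses.

```python
def top_names(items, limit=3):
    """Hypothetical compact variant of generate_list."""
    # Missing or malformed data (for example NaN) yields an empty list
    if not isinstance(items, list):
        return []
    # items[:limit] is the whole list when it has fewer than `limit` entries
    return [item["name"] for item in items[:limit]]

print(top_names([{"name": "a"}, {"name": "b"}]))
print(top_names([{"name": "a"}, {"name": "b"}, {"name": "c"}, {"name": "d"}]))
print(top_names(float("nan")))
```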

In [34]:
#Apply the generate_list function to cast and keywords as we are supposed to take 3 cast members & 3 keywords

df['cast'] = df['cast'].apply(generate_list)
df['keywords'] = df['keywords'].apply(generate_list)

In [35]:
#Only consider a maximum of 3 genres

df['genres'] = df['genres'].apply(lambda x: x[:3])

In [37]:
# Print the new features of the first 5 movies along with title

df[['title', 'cast', 'director', 'keywords', 'genres']].head()

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[adventure, fantasy, family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[romance, comedy]"
3,Waiting to Exhale,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[comedy, drama, romance]"
4,Father of the Bride Part II,"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence]",[comedy]


In [38]:
# Function to clean data to prevent ambiguity. It removes spaces and converts to lowercase

def clean_data(x):
    if isinstance(x, list):
        #Strip spaces and convert to lowercase
        return [i.replace(" ", "").lower() for i in x]
    #Check if director exists. If not, return empty string
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''

In [39]:
#Apply the clean_data function to cast, keywords, director and genres
for feature in ['cast', 'director', 'genres', 'keywords']:
    df[feature] = df[feature].apply(clean_data)
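
The ambiguity that removing spaces prevents can be seen with two names that share a first name. If the spaces are kept, the vectorizer splits each name into separate words, so both actors contribute the common token "tom". The names below are only an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["Tom Hanks", "Tom Cruise"]

# With spaces kept, each name is split and "tom" is shared by both
with_spaces = sorted(CountVectorizer().fit(names).vocabulary_)

# With spaces removed, each person becomes one distinct token
joined = [n.replace(" ", "").lower() for n in names]
without_spaces = sorted(CountVectorizer().fit(joined).vocabulary_)

print(with_spaces)
print(without_spaces)
```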

In [40]:
# Function that creates a join out of the desired metadata i.e. joining all to make one feature

def create_join(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [41]:
# Create the new JOINED feature

df['JOINED'] = df.apply(create_join, axis=1)

In [42]:
#Display the JOINED feature of the first movie

df.iloc[0]['JOINED']

'jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family'

In [43]:
df

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords,director,JOINED
0,Toy Story,"[animation, comedy, family]",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy]",johnlasseter,jealousy toy boy tomhanks timallen donrickles ...
1,Jumanji,"[adventure, fantasy, family]",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgame, disappearance, basedonchildren'sbook]",joejohnston,boardgame disappearance basedonchildren'sbook ...
2,Grumpier Old Men,"[romance, comedy]",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, bestfriend, duringcreditsstinger]",howarddeutch,fishing bestfriend duringcreditsstinger walter...
3,Waiting to Exhale,"[comedy, drama, romance]",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[basedonnovel, interracialrelationship, single...",forestwhitaker,basedonnovel interracialrelationship singlemot...
4,Father of the Bride Part II,[comedy],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlifecrisis, confidence]",charlesshyer,baby midlifecrisis confidence stevemartin dian...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
46623,The Burkittsville 7,[horror],30.0,7.0,1.0,2000,Rising and falling between a man and woman.,439050,"[leilahatami, kouroshtahami, elhamkorda]","[{'credit_id': '5894a97d925141426c00818c', 'de...",[tragiclove],hamidnematollah,tragiclove leilahatami kouroshtahami elhamkord...
46624,Caged Heat 3000,[sciencefiction],85.0,3.5,1.0,1995,An artist struggles to finish his work while a...,111109,"[angelaquino, perrydizon, hazelorencio]","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...","[artist, play, pinoy]",lavdiaz,artist play pinoy angelaquino perrydizon hazel...
46625,Robin Hood,"[drama, action, romance]",104.0,5.7,26.0,1991,"When one of her hits goes wrong, a professiona...",67758,"[erikaeleniak, adambaldwin, juliedupage]","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",[],markl.lester,erikaeleniak adambaldwin juliedupage markl.le...
46626,Subdue,"[drama, family]",90.0,4.0,1.0,0,"In a small town live two brothers, one a minis...",227506,"[iwanmosschuchin, nathalielissenko, pavelpavlov]","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",[],yakovprotazanov,iwanmosschuchin nathalielissenko pavelpavlov ...


In [44]:
# Import CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

#Define a new CountVectorizer object and create vectors for the JOINED feature

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['JOINED'])

In [45]:
count_matrix.shape

(46628, 73890)

In [46]:
#Import cosine_similarity function

from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity score. Count vectors are not normalised, so cosine_similarity is used here instead of a plain dot product

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [47]:
print(cosine_sim2)

[[1.         0.09534626 0.1        ... 0.         0.12909944 0.        ]
 [0.09534626 1.         0.         ... 0.         0.12309149 0.        ]
 [0.1        0.         1.         ... 0.1118034  0.         0.        ]
 ...
 [0.         0.         0.1118034  ... 1.         0.14433757 0.25      ]
 [0.12909944 0.12309149 0.         ... 0.14433757 1.         0.28867513]
 [0.         0.         0.         ... 0.25       0.28867513 1.        ]]
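
The difference between a raw dot product and cosine similarity on count vectors can be worked through by hand on two short strings. The strings below are invented and only mimic the JOINED feature. The first has 5 distinct tokens, the second has 7, and they share 4, so the cosine similarity is 4 / sqrt(5 × 7).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

joined_toy = [
    "toy jealousy tomhanks animation comedy",
    "toy tomhanks animation comedy family adventure fantasy",
]
counts = CountVectorizer().fit_transform(joined_toy)

dot = linear_kernel(counts, counts)      # raw dot products, not bounded by 1
cos = cosine_similarity(counts, counts)  # dot products divided by the vector norms

print(dot[0, 0], dot[0, 1])
print(cos[0, 1], 4 / np.sqrt(35))
```

The raw dot product grows with the number of tokens in each row, which is why the normalised score is the one to compare across movies.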


In [48]:
# Reset index of your df and construct reverse mapping again

df = df.reset_index()
indices2 = pd.Series(df.index, index=df['title'])

In [49]:
print (indices2)

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
The Burkittsville 7            46623
Caged Heat 3000                46624
Robin Hood                     46625
Subdue                         46626
Century of Birthing            46627
Length: 46628, dtype: int64


In [51]:
#Get metadata-based recommendations for The Lion King, using similarity over the JOINED feature

content_recommender('The Lion King', cosine_sim2, df, indices2)

31151             Kung Fu Panda: Secrets of the Masters
40901          VeggieTales: Where's God When I'm Scared
40913    VeggieTales: The Ultimate Silly Song Countdown
42569                                    A Silent Voice
20983                                     Wolf Children
41624                                        John Henry
15209          Spiderman: The Ultimate Villain Showdown
16613                         Cirque du Soleil: Varekai
20079                                   Baby Take a Bow
20080                                   Baby Take a Bow
Name: title, dtype: object

In [52]:
content_recommender('Count Dracula', cosine_sim2, df, indices2)

19597                 Vampir-Cuadecuc
8764                 Scars of Dracula
6420     The Satanic Rites of Dracula
8672      Dracula: Prince of Darkness
8732       Taste the Blood of Dracula
8772                Dracula A.D. 1972
17764             Drive-In Horrorshow
31571             Babysitter Massacre
31884                           Jacob
31941            Devil In The Flesh 2
Name: title, dtype: object

<hr style="border:2px solid gray">

## <font color=blue>Conclusion</font>

### The metadata-based recommender seems to capture more about The Lion King than the plot-based recommender, which mostly returns movies whose plots mention lions or similar terms.

### For The Lion King, most of the movies returned by the metadata-based recommender are animated and feature anthropomorphic characters.


<hr style="border:2px solid gray">