### 3 Types of Recommender Systems:
        Collaborative Filtering => Recommend based on user behaviour such as purchases or ratings (two users who bought or rated the same items receive similar recommendations)

        Content-Based Filtering => Focus on item attributes such as plot, genre, or cast (a user is recommended items whose content resembles items they already liked)

        Hybrid Filtering => Combine Collaborative and Content-Based signals

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
pwd

'C:\\Users\\jorge.grisman\\OneDrive - Quantum Health\\jorge\\Recommendation system Real World Projects\\source'

In [3]:
### Read the cleaned dataset (clean_data_jorge.csv)

df=pd.read_csv("../datasets/clean_data_jorge.csv",index_col=[0])
df.reset_index(drop=True, inplace=True)

In [6]:
from pandas import option_context

with option_context('display.max_rows', None,'display.max_colwidth', 50,'display.max_columns',50):
              display(df.head(3))

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",A Plan No One Escapes,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [7]:
df.shape

(4803, 18)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   release_date          4802 non-null   object 
 10  revenue               4803 non-null   int64  
 11  runtime               4801 non-null   float64
 12  spoken_languages      4803 non-null   object 
 13  tagline               3959 non-null   object 
 14  vote_average          4803 non-null   float64
 15  vote_count           

In [9]:
df.dtypes

budget                    int64
genres                   object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
tagline                  object
vote_average            float64
vote_count                int64
cast                     object
crew                     object
dtype: object

### Content Based Recommendation System
    Now let's make recommendations based on the movies' plot summaries given in the overview column. If the user gives us a movie title, our goal is to recommend movies with similar plot summaries.

In [10]:
from pandas import option_context

with option_context('display.max_rows', None,'display.max_colwidth', 50,'display.max_columns',50):
    display(df['overview'].head(20))

0     In the 22nd century, a paraplegic Marine is di...
1     Captain Barbossa, long believed to be dead, ha...
2     A cryptic message from Bond’s past sends him o...
3     Following the death of District Attorney Harve...
4     John Carter is a war-weary, former military ca...
5     The seemingly invincible Spider-Man goes up ag...
6     When the kingdom's most wanted-and most charmi...
7     When Tony Stark tries to jumpstart a dormant p...
8     As Harry begins his sixth year at Hogwarts, he...
9     Fearing the actions of a god-like Super Hero l...
10    Superman returns to discover his 5-year absenc...
11    Quantum of Solace continues the adventures of ...
12    Captain Jack Sparrow works his way out of a bl...
13    The Texas Rangers chase down a gang of outlaws...
14    A young boy learns that he has extraordinary p...
15    One year after their incredible adventures in ...
16    When an unexpected enemy emerges and threatens...
17    Captain Jack Sparrow crosses paths with a 

In [None]:
# To build a recommendation engine on the overview column, each movie must be represented as a numeric vector.
# The overview column holds free text, which an ML model cannot consume directly, so we use an
# NLP technique (TF-IDF, Bag of Words, Word2Vec, etc.) to convert the text into numerical vectors.
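
As a minimal sketch of what that conversion looks like, the snippet below runs TF-IDF on three made-up summaries. The names `toy_overviews`, `toy_vectorizer` and `toy_matrix` are illustrative and not part of this notebook; the only assumption is scikit-learn's default `TfidfVectorizer` settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy plot summaries (hypothetical, not taken from the dataset).
toy_overviews = [
    "a marine is sent to an alien moon",
    "a pirate captain sails to the end of the world",
    "a spy receives a cryptic message",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# One row per document, one column per distinct term.
print(toy_matrix.shape)

# A term found in fewer documents receives a larger IDF weight.
vocab = toy_vectorizer.vocabulary_
marine_idf = toy_vectorizer.idf_[vocab["marine"]]   # appears in 1 document
to_idf = toy_vectorizer.idf_[vocab["to"]]           # appears in 2 documents
print(marine_idf > to_idf)
```

Rare terms therefore contribute more to a document's vector than terms shared by many documents, which is what makes TF-IDF useful for comparing plots.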


In [11]:
df['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [12]:
df['overview'].isnull().sum()

3

In [13]:
# Filling NaNs with empty string
df['overview']=df['overview'].fillna('')

In [14]:
df['overview'].isnull().sum()

0

In [15]:
### Import the TF-IDF vectorizer to convert text data into numerical data
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
# ngram_range=(1, 3): build features from unigrams, bigrams and trigrams (sequences of 1 to 3 consecutive words)
# stop_words='english': drop common English words such as the, that, of, she, he, is
# min_df=3: ignore n-grams that appear in fewer than 3 overviews

tfv=TfidfVectorizer(min_df=3,max_features=None,ngram_range=(1,3),stop_words='english')
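
The effect of `ngram_range` and `min_df` is easier to see on a tiny corpus. The following is a sketch on made-up sentences (`toy_docs`, `uni`, `tri` and `frequent` are hypothetical names), not a rerun of the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "the secret agent returns",
    "the secret agent escapes",
    "a secret mission begins",
]

# Unigrams only versus unigrams + bigrams + trigrams.
uni = TfidfVectorizer(ngram_range=(1, 1), stop_words="english").fit(toy_docs)
tri = TfidfVectorizer(ngram_range=(1, 3), stop_words="english").fit(toy_docs)
print(sorted(uni.vocabulary_))
print(sorted(tri.vocabulary_))

# min_df=2 keeps only n-grams found in at least two documents.
frequent = TfidfVectorizer(
    ngram_range=(1, 3), stop_words="english", min_df=2
).fit(toy_docs)
print(sorted(frequent.vocabulary_))
```

Widening the n-gram range grows the vocabulary, while raising `min_df` prunes it, so the final column count depends on both settings together.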

In [17]:
# Fitting the TF-IDF on the 'overview' text
tfv_matrix=tfv.fit_transform(df['overview'])

In [18]:
tfv_matrix

<4803x9919 sparse matrix of type '<class 'numpy.float64'>'
	with 121480 stored elements in Compressed Sparse Row format>

In [19]:
tfv_matrix.shape
# The shape is (4803, 9919): one row per movie and one column per n-gram feature.
# Because ngram_range=(1,3), the 9919 columns are unigrams, bigrams and trigrams rather than single words,
# and min_df=3 keeps only the n-grams that occur in at least 3 overviews.

(4803, 9919)

In [20]:
df['overview'].shape

(4803,)

In [21]:
df['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [22]:
## TF-IDF vector of the first overview
tfv_matrix.toarray()[0]

### toarray() returns a dense NumPy array; most entries are 0 because each overview uses only a few of the features

array([0., 0., 0., ..., 0., 0., 0.])
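
A dense row printed this way hides the few non-zero weights. One way to inspect them without densifying is to read the CSR arrays directly, sketched here on a toy corpus (`toy_docs`, `vec`, `mat` and `weights` are hypothetical names).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["alien moon mission", "pirate ship mission", "alien spy"]
vec = TfidfVectorizer()
mat = vec.fit_transform(toy_docs)          # CSR sparse matrix

# Map column index -> term so the weights are readable.
terms = {col: term for term, col in vec.vocabulary_.items()}

# In CSR format, row r occupies the slice indptr[r]:indptr[r + 1]
# of the `indices` (column ids) and `data` (values) arrays.
row_id = 0
start, end = mat.indptr[row_id], mat.indptr[row_id + 1]
weights = {
    terms[col]: round(float(val), 3)
    for col, val in zip(mat.indices[start:end], mat.data[start:end])
}
print(weights)
```

Only the terms that actually occur in the first document show up, each with its TF-IDF weight.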

In [23]:
### Transpose the first row into a column vector
tfv_matrix[0].T.toarray()

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

### Getting the dense TF-IDF matrix for the entire dataset

In [24]:

X=tfv_matrix.toarray()

In [26]:
df2= pd.DataFrame(X)

In [27]:
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9909,9910,9911,9912,9913,9914,9915,9916,9917,9918
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4801,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
df['overview']

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4798    El Mariachi just wants to play his guitar and ...
4799    A newlywed couple's honeymoon is upended by th...
4800    "Signed, Sealed, Delivered" introduces a dedic...
4801    When ambitious New York attorney Sam is sent t...
4802    Ever since the second grade when he first saw ...
Name: overview, Length: 4803, dtype: object

In [None]:
# The vectors are now ready, so the next step is to compute a similarity score between every pair of movies

In [29]:
## Apply the sigmoid kernel
## Note: sigmoid_kernel computes the sigmoid kernel between every pair of rows in the two inputs
### https://scikit-learn.org/stable/modules/metrics.html


from sklearn.metrics.pairwise import sigmoid_kernel

In [30]:
# Compute the sigmoid kernel
sig=sigmoid_kernel(tfv_matrix,tfv_matrix)
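
For reference, scikit-learn's `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)` with defaults `gamma = 1 / n_features` and `coef0 = 1`. With thousands of features and dot products between 0 and 1, every score therefore sits just above `tanh(1) ≈ 0.7616`. The sketch below checks the formula on a toy corpus; `toy_docs`, `sig_toy`, `dots` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel

toy_docs = ["alien moon mission", "pirate ship mission", "alien spy", "pirate treasure"]
X = TfidfVectorizer().fit_transform(toy_docs)
n_features = X.shape[1]

sig_toy = sigmoid_kernel(X, X)        # defaults: gamma = 1 / n_features, coef0 = 1
dots = linear_kernel(X, X)            # plain dot products between rows
manual = np.tanh(dots / n_features + 1.0)

# The kernel matches the closed-form expression.
print(np.allclose(sig_toy, manual))

# tanh is monotonic, so the kernel ranks neighbours exactly like the dot product.
same_ranking = np.array_equal(np.argsort(-sig_toy[0]), np.argsort(-dots[0]))
print(same_ranking)
```

Because the ranking is unchanged, the tiny differences between scores still order the movies meaningfully, even though the raw values look almost identical.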

In [31]:
sig

array([[0.76163649, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
        0.76159416],
       [0.76159416, 0.76163649, 0.76159416, ..., 0.76159519, 0.76159416,
        0.76159416],
       [0.76159416, 0.76159416, 0.76163649, ..., 0.76159484, 0.76159416,
        0.76159416],
       ...,
       [0.76159416, 0.76159519, 0.76159484, ..., 0.76163649, 0.76159488,
        0.76159447],
       [0.76159416, 0.76159416, 0.76159416, ..., 0.76159488, 0.76163649,
        0.76159467],
       [0.76159416, 0.76159416, 0.76159416, ..., 0.76159447, 0.76159467,
        0.76163649]])

In [29]:
sig[1]

array([0.76159416, 0.76163649, 0.76159416, ..., 0.76159519, 0.76159416,
       0.76159416])

In [32]:
index=df.index

In [33]:
index

RangeIndex(start=0, stop=4803, step=1)

In [34]:
index[4802]

4802

In [35]:
title=df['original_title']

In [36]:
title[4802]

'My Date with Drew'

In [37]:
df.index

RangeIndex(start=0, stop=4803, step=1)

In [38]:
# create a series of indices and movie titles
indices=pd.Series(df.index,index=df['original_title'])
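
This lookup returns a single integer only when each title is unique. Whether `original_title` contains duplicates is not checked in this notebook, so the following is a sketch of how one might guard against it, using made-up titles (`toy_titles`, `lookup` and `unique_lookup` are hypothetical names).

```python
import pandas as pd

toy_titles = pd.Series(["Avatar", "Spectre", "Avatar"])
lookup = pd.Series(toy_titles.index, index=toy_titles)

# A repeated label returns a Series of positions instead of one integer.
print(type(lookup["Avatar"]).__name__)

# Keep only the first occurrence of each title.
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(int(unique_lookup["Avatar"]))
```

Keeping the first occurrence is an arbitrary choice; disambiguating by release year would be another option.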

In [39]:
indices

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

In [40]:
indices['Avatar']

0

In [41]:
sig[0]

array([0.76163649, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])

In [42]:
sig[indices['Avatar']]

array([0.76163649, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])

In [43]:
list(sig[indices['Avatar']])

[0.7616364930962501,
 0.7615941559557649,
 0.7615941559557649,
 0.761595125164868,
 0.7615941559557649,
 0.761595309447695,
 0.7615941559557649,
 0.761595908255829,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615950846887262,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.76159572705,
 0.7615962614863476,
 0.7615941559557649,
 0.7615952950243241,
 0.7615941559557649,
 0.7615968611459101,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.761597988130195,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615941559557649,
 0.7615951605621958,
 0.7615941559557649,
 0.7615968356260668,
 0.7615941559557649,
 0.7615956092492897,
 

In [44]:
len(list(sig[indices['Avatar']]))

4803

In [45]:
list(enumerate(list(sig[indices['Avatar']])))

[(0, 0.7616364930962501),
 (1, 0.7615941559557649),
 (2, 0.7615941559557649),
 (3, 0.761595125164868),
 (4, 0.7615941559557649),
 (5, 0.761595309447695),
 (6, 0.7615941559557649),
 (7, 0.761595908255829),
 (8, 0.7615941559557649),
 (9, 0.7615941559557649),
 (10, 0.7615941559557649),
 (11, 0.7615950846887262),
 (12, 0.7615941559557649),
 (13, 0.7615941559557649),
 (14, 0.7615941559557649),
 (15, 0.7615941559557649),
 (16, 0.7615941559557649),
 (17, 0.7615941559557649),
 (18, 0.7615941559557649),
 (19, 0.7615941559557649),
 (20, 0.7615941559557649),
 (21, 0.7615941559557649),
 (22, 0.7615941559557649),
 (23, 0.7615941559557649),
 (24, 0.7615941559557649),
 (25, 0.7615941559557649),
 (26, 0.76159572705),
 (27, 0.7615962614863476),
 (28, 0.7615941559557649),
 (29, 0.7615952950243241),
 (30, 0.7615941559557649),
 (31, 0.7615968611459101),
 (32, 0.7615941559557649),
 (33, 0.7615941559557649),
 (34, 0.7615941559557649),
 (35, 0.7615941559557649),
 (36, 0.761597988130195),
 (37, 0.761594155955

In [46]:
# Sort similarity values in descending order (reverse=True), so the most similar movies come first
sigma_scores=sorted(list(enumerate(list(sig[indices['Avatar']]))),key=lambda x:x[1],reverse=True)
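
The same ranking can be obtained with NumPy instead of sorting a list of tuples. This is a sketch on made-up scores (`scores`, `order` and `top_k` are hypothetical names), assuming the query movie has the highest score against itself.

```python
import numpy as np

# Hypothetical similarity scores of one movie against five movies (index 0 is itself).
scores = np.array([0.90, 0.10, 0.75, 0.30, 0.80])

order = np.argsort(scores)[::-1]   # indices from highest to lowest score
top_k = order[1:4]                 # skip position 0 (the movie itself), keep the next three
print(top_k.tolist())              # [4, 2, 3]
```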

In [47]:
sigma_scores

[(0, 0.7616364930962501),
 (1341, 0.7616030155858681),
 (634, 0.7616028561141562),
 (3604, 0.761601930611584),
 (2130, 0.7616015339622925),
 (775, 0.7616011086528327),
 (529, 0.7615996114069044),
 (151, 0.7615991171152051),
 (311, 0.7615990624497703),
 (847, 0.7615987706430225),
 (570, 0.7615986450599548),
 (942, 0.7615984376900236),
 (36, 0.761597988130195),
 (1610, 0.7615979793934843),
 (3070, 0.7615978406764746),
 (1033, 0.7615978182403835),
 (2628, 0.7615977834088159),
 (1784, 0.7615977150705628),
 (2578, 0.7615976778191441),
 (150, 0.7615976453752453),
 (3724, 0.7615975951237102),
 (1013, 0.761597590729192),
 (4211, 0.7615975631290406),
 (1213, 0.7615975380289366),
 (1345, 0.7615974549075267),
 (312, 0.7615974086679764),
 (4039, 0.7615973645677722),
 (2967, 0.7615973512232982),
 (614, 0.7615972949789032),
 (281, 0.7615972537743877),
 (174, 0.7615972462403858),
 (3493, 0.7615971922075142),
 (3624, 0.7615971821325882),
 (972, 0.7615971791001622),
 (1274, 0.7615971587672579),
 (1959,

In [48]:
sigma_scores[1:11]

[(1341, 0.7616030155858681),
 (634, 0.7616028561141562),
 (3604, 0.761601930611584),
 (2130, 0.7616015339622925),
 (775, 0.7616011086528327),
 (529, 0.7615996114069044),
 (151, 0.7615991171152051),
 (311, 0.7615990624497703),
 (847, 0.7615987706430225),
 (570, 0.7615986450599548)]

In [49]:
## Extract the movie indices from the (index, score) pairs above
ind=[index[0] for index in sigma_scores[1:11]]

In [50]:
ind

[1341, 634, 3604, 2130, 775, 529, 151, 311, 847, 570]

In [51]:
df['original_title'].iloc[ind]

1341                Obitaemyy Ostrov
634                       The Matrix
3604                       Apollo 18
2130                    The American
775                        Supernova
529                 Tears of the Sun
151                          Beowulf
311     The Adventures of Pluto Nash
847                         Semi-Pro
570                           Ransom
Name: original_title, dtype: object

In [None]:
# ##### func for getting recommendation
# def get_recomendation(title, n = 6, cos_sim = cos_sim):
#     if title in title_ind.index:
#         indx = title_ind.loc[title] 
#         similar = list(enumerate(cos_sim[indx][0]))
#         top_sim = sorted(similar, reverse = True, key = lambda x: x[1])
#         return [anime.loc[x[0]]["name"] for x in top_sim[1:n]]
#     return "There is no such an anime"

In [74]:
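###### Define a function that returns recommendations for a given title

def give_rec(title,model,n=6):
    # Row position of the requested title in df
    idx=indices[title]

    # Pair every movie index with its similarity score to the requested title
    model_scores=list(enumerate(list(model[idx])))

    # Sort by score, highest first
    model_scores=sorted(model_scores,key=lambda x:x[1],reverse=True)

    # Skip position 0 (the title itself); this slice yields n-1 recommendations
    model_scores=model_scores[1:n]

    movie_indices=[index[0] for index in model_scores]

    return df['original_title'].iloc[movie_indices]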
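
The same idea can be written without relying on notebook globals. Below is a self-contained sketch on a four-movie toy dataset; `toy`, `toy_sig`, `toy_indices` and `recommend` are hypothetical names, and returning an empty list for unknown titles is an assumption about the desired behaviour.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

toy = pd.DataFrame({
    "original_title": ["Space War", "Alien Moon", "Pirate Bay", "Spy Game"],
    "overview": [
        "soldiers fight an alien army in space",
        "a marine explores an alien moon in space",
        "pirates hunt for treasure at sea",
        "a spy hunts a traitor",
    ],
})
toy_sig = sigmoid_kernel(TfidfVectorizer().fit_transform(toy["overview"]))
toy_indices = pd.Series(toy.index, index=toy["original_title"])

def recommend(title, scores, lookup, titles, n=2):
    """Return up to n titles most similar to `title`, excluding the title itself."""
    if title not in lookup.index:
        return []                                    # unknown title: nothing to recommend
    idx = int(lookup[title])
    ranked = sorted(enumerate(scores[idx]), key=lambda pair: pair[1], reverse=True)
    picks = [i for i, _ in ranked if i != idx][:n]   # drop the query movie explicitly
    return titles.iloc[picks].tolist()

known = recommend("Space War", toy_sig, toy_indices, toy["original_title"], n=1)
unknown = recommend("Unknown Film", toy_sig, toy_indices, toy["original_title"])
print(known)
print(unknown)
```

Filtering on `i != idx` rather than slicing from position 1 means exactly `n` titles come back, and the query movie is excluded even if another movie ties with it.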

In [75]:
# Testing our content-based recommendation system with the film 'Avatar'
give_rec('Avatar',sig)

1341    Obitaemyy Ostrov
634           The Matrix
3604           Apollo 18
2130        The American
775            Supernova
Name: original_title, dtype: object

In [76]:
from pandas import option_context

with option_context('display.max_rows', None,'display.max_colwidth',60,'display.max_columns',50):
        display(df['original_title'].head(20))

0                                          Avatar
1        Pirates of the Caribbean: At World's End
2                                         Spectre
3                           The Dark Knight Rises
4                                     John Carter
5                                    Spider-Man 3
6                                         Tangled
7                         Avengers: Age of Ultron
8          Harry Potter and the Half-Blood Prince
9              Batman v Superman: Dawn of Justice
10                               Superman Returns
11                              Quantum of Solace
12     Pirates of the Caribbean: Dead Man's Chest
13                                The Lone Ranger
14                                   Man of Steel
15       The Chronicles of Narnia: Prince Caspian
16                                   The Avengers
17    Pirates of the Caribbean: On Stranger Tides
18                                 Men in Black 3
19      The Hobbit: The Battle of the Five Armies


In [78]:
# Testing our content-based recommendation system with the film 'Man of Steel'
give_rec('Man of Steel',sig, 20)

223                 The Chronicles of Riddick
2644                                Ong Bak 2
539                                Titan A.E.
1934                          Say It Isn't So
1761                    The Neverending Story
3390                                   Hesher
117         Charlie and the Chocolate Factory
1275                                 Sunshine
90                          The Polar Express
2226                                   Doogal
1519                                    Bogus
2208                                Let Me In
3402                         Снежная королева
220                                Prometheus
3881                                Beginners
585                                 War Horse
2886                                     Duma
3087                        Nicholas Nickleby
200     The Hunger Games: Mockingjay - Part 1
Name: original_title, dtype: object

In [56]:
df.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'cast', 'crew'],
      dtype='object')

In [57]:
df['original_title']

0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4798                                 El Mariachi
4799                                   Newlyweds
4800                   Signed, Sealed, Delivered
4801                            Shanghai Calling
4802                           My Date with Drew
Name: original_title, Length: 4803, dtype: object

In [58]:
# Testing our content-based recommendation system with the film 'El Mariachi'
give_rec('El Mariachi',sig)

1701      Once Upon a Time in Mexico
3959    My Big Fat Independent Movie
3707       Salvando al Soldado Perez
3704                        Salvador
4769         The Legend of God's Gun
324            The Road to El Dorado
3997                 Blazing Saddles
729                   A Civil Action
3475                Casa De Mi Padre
1965                       Footloose
Name: original_title, dtype: object

## Improve Existing Content Based Recommendation System..

In [1]:
###### Consider more features like Credits, Genres and Keywords that can impact my recommendations############

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import os
import math
import warnings 
import pyodbc
import time

In [2]:
pwd

'C:\\Users\\jorge.grisman\\OneDrive - Quantum Health\\jorge\\Recommendation system Real World Projects\\source'

In [3]:
### Read the cleaned dataset (clean_data_jorge.csv)

df=pd.read_csv("../datasets/clean_data_jorge.csv",index_col=[0])
df.reset_index(drop=True, inplace=True)

In [4]:
df.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'cast', 'crew'],
      dtype='object')

In [5]:
df['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [None]:
### We want the director's name, i.e. the crew entry whose job is 'Director', but the value above is one long string and hard to work with.
#### So first parse it into Python objects with eval or literal_eval.
#### literal_eval is the safer choice: it only accepts literals, whereas eval executes arbitrary code.
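
A short sketch of the difference, using a made-up string rather than a row from the dataset (`raw` and `parsed` are hypothetical names):

```python
from ast import literal_eval

raw = '[{"job": "Director", "name": "James Cameron"}]'
parsed = literal_eval(raw)          # plain literals (lists, dicts, strings, numbers) parse fine
print(parsed[0]["name"])            # James Cameron

# An expression that calls a function is rejected instead of being executed.
rejected = False
try:
    literal_eval('__import__("os").getcwd()')
except ValueError as err:
    rejected = True
    print("rejected:", type(err).__name__)
```

`eval` would have run the second expression, which is why `literal_eval` is preferable for data read from a file.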

In [6]:
eval(df['crew'][0])

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [7]:
from ast import literal_eval

In [8]:
literal_eval(df['crew'][0])

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [9]:
#### Parse the stringified features into their corresponding python objects

features=['cast','crew','keywords','genres']

In [10]:
for feature in features:
    df[feature]=df[feature].apply(literal_eval)

In [12]:
df['crew'][1]

[{'credit_id': '52fe4232c3a36847f800b579',
  'department': 'Camera',
  'gender': 2,
  'id': 120,
  'job': 'Director of Photography',
  'name': 'Dariusz Wolski'},
 {'credit_id': '52fe4232c3a36847f800b4fd',
  'department': 'Directing',
  'gender': 2,
  'id': 1704,
  'job': 'Director',
  'name': 'Gore Verbinski'},
 {'credit_id': '52fe4232c3a36847f800b54f',
  'department': 'Production',
  'gender': 2,
  'id': 770,
  'job': 'Producer',
  'name': 'Jerry Bruckheimer'},
 {'credit_id': '52fe4232c3a36847f800b503',
  'department': 'Writing',
  'gender': 2,
  'id': 1705,
  'job': 'Screenplay',
  'name': 'Ted Elliott'},
 {'credit_id': '52fe4232c3a36847f800b509',
  'department': 'Writing',
  'gender': 2,
  'id': 1706,
  'job': 'Screenplay',
  'name': 'Terry Rossio'},
 {'credit_id': '52fe4232c3a36847f800b57f',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '52fe4232c3a36847f800b585',
  'department': 'Editing',
  'gender': 2,
 

In [13]:
df[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de...","[{'id': 1463, 'name': 'culture clash'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
1,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...","[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...","[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
3,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...","[{'id': 849, 'name': 'dc comics'}, {'id': 853,...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4,"[{'cast_id': 5, 'character': 'John Carter', 'c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
...,...,...,...,...
4798,"[{'cast_id': 1, 'character': 'El Mariachi', 'c...","[{'credit_id': '52fe44eec3a36847f80b280b', 'de...","[{'id': 5616, 'name': 'united states–mexico ba...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4799,"[{'cast_id': 1, 'character': 'Buzzy', 'credit_...","[{'credit_id': '52fe487dc3a368484e0fb013', 'de...",[],"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '..."
4800,"[{'cast_id': 8, 'character': 'Oliver O’Toole',...","[{'credit_id': '52fe4df3c3a36847f8275ecf', 'de...","[{'id': 248, 'name': 'date'}, {'id': 699, 'nam...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4801,"[{'cast_id': 3, 'character': 'Sam', 'credit_id...","[{'credit_id': '52fe4ad9c3a368484e16a36b', 'de...",[],[]


In [14]:
df['crew'][0]

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [15]:
'shan '.strip()

'shan'

In [17]:
#### Get the director's name from the crew feature. If no director is listed, return NaN
def get_director(x):
    for i in x:
        if i['job']=='Director':
            return i['name'].strip()
    return np.nan
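
An equivalent, more defensive variant is sketched below. `first_director` and `toy_crew` are hypothetical names; tolerating non-list input and missing `job` keys is an assumption, since the notebook does not show such rows.

```python
import numpy as np

def first_director(crew):
    """Return the first crew member whose job is 'Director', or NaN if there is none."""
    if not isinstance(crew, list):
        return np.nan
    return next(
        (member["name"].strip() for member in crew if member.get("job") == "Director"),
        np.nan,
    )

toy_crew = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron "},
]
print(first_director(toy_crew))     # James Cameron
print(first_director([]))           # nan
```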

In [18]:
get_director(df['crew'][0])

'James Cameron'

In [19]:
df['director']=df['crew'].apply(get_director)

In [20]:
df['director'].shape

(4803,)

In [66]:

from pandas import option_context


with option_context('display.max_rows', None,'display.max_colwidth', 50,'display.max_columns',50):
              display(df.head(3))

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew,director,important_feature
0,237000000,"[action, adventure, fantasy]",19995,"[culture clash, future, space war]",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[sam worthington, zoe saldana, sigourney weaver]","[stephen e. rivkin, rick carter, christopher b...",James Cameron,culture clash future space war sam worthington...
1,300000000,"[adventure, fantasy, action]",285,"[ocean, drug abuse, exotic island]",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[johnny depp, orlando bloom, keira knightley]","[dariusz wolski, gore verbinski, jerry bruckhe...",Gore Verbinski,ocean drug abuse exotic island johnny depp orl...
2,245000000,"[action, adventure, crime]",206647,"[spy, based on novel, secret agent]",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",A Plan No One Escapes,6.3,4466,"[daniel craig, christoph waltz, léa seydoux]","[thomas newman, sam mendes, anna pinnock]",Sam Mendes,spy based on novel secret agent daniel craig c...


In [21]:
df['director'].isnull().sum()

30

In [22]:
df['cast'][0]

[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2},
 {'cast_id': 4,
  'character': 'Col. Quaritch',
  'credit_id': '52fe48009251416c750ac9cf',
  'gender': 2,
  'id': 32747,
  'name': 'Stephen Lang',
  'order': 3},
 {'cast_id': 5,
  'character': 'Trudy Chacon',
  'credit_id': '52fe48009251416c750ac9d3',
  'gender': 1,
  'id': 17647,
  'name': 'Michelle Rodriguez',
  'order': 4},
 {'cast_id': 8,
  'character': 'Selfridge',
  'credit_id': '52fe48009251416c750ac9e1',
  'gender': 2,
  'id': 1771,
  'name': 'Giovanni Ribisi',
  'order': 5},
 {'cast_id': 7,
  'c

In [23]:
type(df['cast'][0])

list

In [24]:
'Shan singh '.strip().lower()

'shan singh'

In [25]:
# Returns the first 3 names, or the entire list if it has 3 or fewer entries.
def get_list(x):
    # Non-list values (e.g. NaN for a missing entry) yield an empty list instead of raising
    if not isinstance(x, list):
        return []
    names=[i['name'].strip().lower() for i in x]
    return names[:3]

In [26]:
'''
def get_list(x):
    names=[]
    for i in x:
        if type(x)==list:
            name=i['name']
            name=name.strip()
            name=name.lower()
            names.append(name)
    
    if len(names)>3:
        return names[0:3]
    else:
        return names
        
'''

"\ndef get_list(x):\n    names=[]\n    for i in x:\n        if type(x)==list:\n            name=i['name']\n            name=name.strip()\n            name=name.lower()\n            names.append(name)\n    \n    if len(names)>3:\n        return names[0:3]\n    else:\n        return names\n        \n"

In [27]:
get_list(df['crew'][0])

['stephen e. rivkin', 'rick carter', 'christopher boyes']
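
To see how the truncation behaves at the edges, here is a small self-contained sketch on made-up rows. `top_names` is a hypothetical stand-in for `get_list`; the guard for non-list values (for example NaN) is an assumption about how missing entries should be treated, not something the dataset documentation specifies.

```python
import math

def top_names(x, k=3):
    """Return up to k lowercase, stripped 'name' values from a list of dicts."""
    if not isinstance(x, list):      # e.g. NaN for a missing entry (assumed behaviour)
        return []
    return [d['name'].strip().lower() for d in x][:k]

# Made-up inputs: one longer than k, one shorter, one missing
sample_long = [{'name': ' Sam Worthington'}, {'name': 'Zoe Saldana'},
               {'name': 'Sigourney Weaver '}, {'name': 'Stephen Lang'}]
sample_short = [{'name': 'Daniel Hsia'}]

print(top_names(sample_long))    # first three names only
print(top_names(sample_short))   # the whole (short) list
print(top_names(math.nan))       # empty list for a non-list value
```

Slicing with `[:k]` already handles lists shorter than `k`, so no explicit length check is needed.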

In [28]:
df2=df.copy()

In [29]:
for feature in features:
    df[feature]=df[feature].apply(get_list)

In [30]:
df[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[sam worthington, zoe saldana, sigourney weaver]","[stephen e. rivkin, rick carter, christopher b...","[culture clash, future, space war]","[action, adventure, fantasy]"
1,"[johnny depp, orlando bloom, keira knightley]","[dariusz wolski, gore verbinski, jerry bruckhe...","[ocean, drug abuse, exotic island]","[adventure, fantasy, action]"
2,"[daniel craig, christoph waltz, léa seydoux]","[thomas newman, sam mendes, anna pinnock]","[spy, based on novel, secret agent]","[action, adventure, crime]"
3,"[christian bale, michael caine, gary oldman]","[hans zimmer, charles roven, christopher nolan]","[dc comics, crime fighter, terrorist]","[action, crime, drama]"
4,"[taylor kitsch, lynn collins, samantha morton]","[andrew stanton, andrew stanton, john lasseter]","[based on novel, mars, medallion]","[action, adventure, science fiction]"
...,...,...,...,...
4798,"[carlos gallardo, jaime de hoyos, peter marqua...","[robert rodriguez, robert rodriguez, robert ro...","[united states–mexico barrier, legs, arms]","[action, crime, thriller]"
4799,"[edward burns, kerry bishé, marsha dietlein]","[edward burns, edward burns, edward burns]",[],"[comedy, romance]"
4800,"[eric mabius, kristin booth, crystal lowe]","[carla hetland, harvey kahn, adam sliwinski]","[date, love at first sight, narration]","[comedy, drama, romance]"
4801,"[daniel henney, eliza coupe, bill paxton]","[daniel hsia, daniel hsia]",[],[]


In [31]:
df2[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de...","[{'id': 1463, 'name': 'culture clash'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
1,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...","[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...","[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
3,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...","[{'id': 849, 'name': 'dc comics'}, {'id': 853,...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4,"[{'cast_id': 5, 'character': 'John Carter', 'c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
...,...,...,...,...
4798,"[{'cast_id': 1, 'character': 'El Mariachi', 'c...","[{'credit_id': '52fe44eec3a36847f80b280b', 'de...","[{'id': 5616, 'name': 'united states–mexico ba...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4799,"[{'cast_id': 1, 'character': 'Buzzy', 'credit_...","[{'credit_id': '52fe487dc3a368484e0fb013', 'de...",[],"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '..."
4800,"[{'cast_id': 8, 'character': 'Oliver O’Toole',...","[{'credit_id': '52fe4df3c3a36847f8275ecf', 'de...","[{'id': 248, 'name': 'date'}, {'id': 699, 'nam...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4801,"[{'cast_id': 3, 'character': 'Sam', 'credit_id...","[{'credit_id': '52fe4ad9c3a368484e16a36b', 'de...",[],[]


In [32]:
df.isnull().sum()

budget                    0
genres                    0
id                        0
keywords                  0
original_language         0
original_title            0
overview                  3
popularity                0
production_companies      0
release_date              1
revenue                   0
runtime                   2
spoken_languages          0
tagline                 844
vote_average              0
vote_count                0
cast                      0
crew                      0
director                 30
dtype: int64

In [33]:
df.dropna(subset=['director'],inplace=True)

In [34]:
df.isnull().sum()

budget                    0
genres                    0
id                        0
keywords                  0
original_language         0
original_title            0
overview                  3
popularity                0
production_companies      0
release_date              0
revenue                   0
runtime                   2
spoken_languages          0
tagline                 822
vote_average              0
vote_count                0
cast                      0
crew                      0
director                  0
dtype: int64

In [None]:
'''
We are now in a position to create our metadata string, stored as important_feature:
a single string that contains all the metadata we want to feed to our vectorizer,
CountVectorizer (namely keywords, actors, director and genres).

'''

In [35]:
df['cast'][0]

['sam worthington', 'zoe saldana', 'sigourney weaver']

In [36]:
type(df['cast'][0])

list

In [37]:
' '.join(df['cast'][0])

'sam worthington zoe saldana sigourney weaver'

In [38]:
type(' '.join(df['cast'][0]))

str

In [39]:
df.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')

In [40]:
df[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[sam worthington, zoe saldana, sigourney weaver]","[stephen e. rivkin, rick carter, christopher b...","[culture clash, future, space war]","[action, adventure, fantasy]"
1,"[johnny depp, orlando bloom, keira knightley]","[dariusz wolski, gore verbinski, jerry bruckhe...","[ocean, drug abuse, exotic island]","[adventure, fantasy, action]"
2,"[daniel craig, christoph waltz, léa seydoux]","[thomas newman, sam mendes, anna pinnock]","[spy, based on novel, secret agent]","[action, adventure, crime]"
3,"[christian bale, michael caine, gary oldman]","[hans zimmer, charles roven, christopher nolan]","[dc comics, crime fighter, terrorist]","[action, crime, drama]"
4,"[taylor kitsch, lynn collins, samantha morton]","[andrew stanton, andrew stanton, john lasseter]","[based on novel, mars, medallion]","[action, adventure, science fiction]"
...,...,...,...,...
4798,"[carlos gallardo, jaime de hoyos, peter marqua...","[robert rodriguez, robert rodriguez, robert ro...","[united states–mexico barrier, legs, arms]","[action, crime, thriller]"
4799,"[edward burns, kerry bishé, marsha dietlein]","[edward burns, edward burns, edward burns]",[],"[comedy, romance]"
4800,"[eric mabius, kristin booth, crystal lowe]","[carla hetland, harvey kahn, adam sliwinski]","[date, love at first sight, narration]","[comedy, drama, romance]"
4801,"[daniel henney, eliza coupe, bill paxton]","[daniel hsia, daniel hsia]",[],[]


In [41]:
def create_feature(row):
    return ' '.join(row['keywords']) + ' ' + ' '.join(row['cast']) + ' ' + row['director'] + ' ' + ' '.join(row['genres'])

In [42]:
df['important_feature']=df.apply(create_feature,axis=1)

In [43]:
df.head(2)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew,director,important_feature
0,237000000,"[action, adventure, fantasy]",19995,"[culture clash, future, space war]",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[sam worthington, zoe saldana, sigourney weaver]","[stephen e. rivkin, rick carter, christopher b...",James Cameron,culture clash future space war sam worthington...
1,300000000,"[adventure, fantasy, action]",285,"[ocean, drug abuse, exotic island]",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[johnny depp, orlando bloom, keira knightley]","[dariusz wolski, gore verbinski, jerry bruckhe...",Gore Verbinski,ocean drug abuse exotic island johnny depp orl...


In [44]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

In [45]:
count=CountVectorizer(stop_words='english')

In [46]:
count_matrix=count.fit_transform(df['important_feature'])

In [47]:
count_matrix

<4773x10750 sparse matrix of type '<class 'numpy.int64'>'
	with 66282 stored elements in Compressed Sparse Row format>

In [48]:
count_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
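
The full count matrix is too sparse to read by eye, so a toy version may help. The sketch below fits `CountVectorizer` on three made-up strings (not rows from the dataset) and prints the learned vocabulary next to the counts. The names `toy_docs`, `toy_count` and `toy_matrix` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up metadata strings in the same style as important_feature
toy_docs = [
    "ocean pirate johnny depp gore verbinski adventure",
    "spy secret agent daniel craig sam mendes action",
    "ocean adventure johnny depp the action",
]

toy_count = CountVectorizer(stop_words='english')
toy_matrix = toy_count.fit_transform(toy_docs)

# One column per distinct token, in alphabetical order; 'the' is dropped as a stop word
print(sorted(toy_count.vocabulary_))

# One row per document, one count per vocabulary token
print(toy_matrix.toarray())
```

Note that a multi-word name such as "johnny depp" becomes two separate tokens, so two movies can match on a shared first name alone. Joining each name into one token before vectorizing is one possible way to avoid that, at the cost of changing the scores shown in this notebook.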

In [49]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

In [50]:
cosine_sim2=cosine_similarity(count_matrix,count_matrix)

In [51]:
cosine_sim2

array([[1.        , 0.1875    , 0.1875    , ..., 0.        , 0.        ,
        0.        ],
       [0.1875    , 1.        , 0.125     , ..., 0.        , 0.        ,
        0.        ],
       [0.1875    , 0.125     , 1.        , ..., 0.        , 0.16666667,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.16666667, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [52]:
cosine_sim2[0]

array([1.    , 0.1875, 0.1875, ..., 0.    , 0.    , 0.    ])
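
Each entry of the matrix is the cosine of the angle between two count vectors: the dot product divided by the product of the vector lengths. For 0/1 count vectors this is the number of shared tokens divided by the square root of the product of the two token counts, so a score of 0.1875 is what two 16-token strings sharing 3 tokens would produce (3/16), although that has not been checked against the actual rows here. A minimal sketch on two made-up vectors, comparing the manual formula with scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up count vectors over a 6-token vocabulary; they share 2 tokens
a = np.array([[1, 1, 1, 1, 0, 0]])
b = np.array([[1, 1, 0, 0, 1, 1]])

# dot(a, b) / (|a| * |b|)
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Same quantity from scikit-learn
library = cosine_similarity(a, b)[0, 0]

print(manual, library)
```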

In [53]:
df['original_title'][6]

'Tangled'

In [57]:
# create a Series that maps each movie title to its index label in df
indices=pd.Series(df.index,index=df['original_title'])

In [58]:
indices

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4773, dtype: int64
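
The values stored in `indices` are index labels, and the output above shows labels running up to 4802 for only 4773 rows, because rows without a director were dropped. A label is therefore not always the same as the row position in the similarity matrix. The toy sketch below (made-up titles and directors) shows the difference and one way to convert a label to a position with `Index.get_loc`:

```python
import pandas as pd

# Made-up frame; the row with label 1 has no director
toy = pd.DataFrame({'original_title': ['A', 'B', 'C', 'D'],
                    'director': ['x', None, 'y', 'z']})
toy = toy.dropna(subset=['director'])     # labels are now 0, 2, 3

toy_indices = pd.Series(toy.index, index=toy['original_title'])

label = toy_indices['D']                  # index label kept from the original frame
position = toy.index.get_loc(label)       # row position after the drop

print(label, position)
```

Any matrix built from the filtered frame, such as a cosine similarity matrix, is addressed by position, so the conversion matters for every title that sits after a dropped row.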

In [59]:
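# Return the titles most similar to `title` according to the similarity matrix `model`

def give_rec(title,model,n=6):
    
    # indices[title] is an index label of df; rows were dropped earlier,
    # so convert it to the row position used by the similarity matrix
    idx=df.index.get_loc(indices[title])
    
    # pair every row position with its similarity score to the chosen movie
    model_scores=list(enumerate(model[idx]))
    
    # highest similarity first
    model_scores=sorted(model_scores,key=lambda x:x[1],reverse=True)
    
    # skip the top hit (normally the movie itself), so n-1 titles are returned
    model_scores=model_scores[1:n]
    
    movie_indices=[index[0] for index in model_scores]
    
    return df['original_title'].iloc[movie_indices]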

In [63]:
give_rec('Tangled',cosine_sim2,n=10)

578     Alvin and the Chipmunks: The Squeakquel
1108                                  Pinocchio
1481                         The House of Magic
1857                            Rugrats Go Wild
42                                  Toy Story 3
390                          Hotel Transylvania
565                                     Shrek 2
899                                       Shrek
1695                                    Aladdin
Name: original_title, dtype: object

In [64]:
give_rec('The Godfather',cosine_sim2,n=10)

867     The Godfather: Part III
2731     The Godfather: Part II
1525             Apocalypse Now
2792        Glengarry Glen Ross
1209              The Rainmaker
3012              The Outsiders
4209           The Conversation
2649          The Son of No One
1018            The Cotton Club
Name: original_title, dtype: object

In [None]:
'''

We see that our recommender has been successful in capturing more information due to more metadata
and has given us better recommendations.


To improve it further, we can also increase the weight of the director by adding the director's name
multiple times to the important_feature string.

'''
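
The director weighting mentioned above can be sketched as follows. `create_weighted_feature` and its `director_weight` parameter are hypothetical names, the default of 3 is an arbitrary choice, and the row is made up; this is an illustration of the idea, not a tuned setting.

```python
def create_weighted_feature(row, director_weight=3):
    """Build the metadata string with the director's name repeated director_weight times."""
    director = ' '.join([row['director']] * director_weight)
    parts = [' '.join(row['keywords']),
             ' '.join(row['cast']),
             director,
             ' '.join(row['genres'])]
    return ' '.join(parts)

# Made-up row with the same column layout as df
toy_row = {'keywords': ['ocean', 'exotic island'],
           'cast': ['johnny depp'],
           'director': 'Gore Verbinski',
           'genres': ['adventure']}

print(create_weighted_feature(toy_row))
print(create_weighted_feature(toy_row, director_weight=1))   # same layout as create_feature
```

In the notebook this could be applied with `df.apply(create_weighted_feature, axis=1)`, followed by refitting `CountVectorizer` and recomputing the similarity matrix. Repeating the name raises the director's token counts, so movies sharing a director score higher relative to movies sharing only a keyword or an actor.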