In [20]:
import numpy as np
import pandas as pd
import ast

In [3]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [4]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


TMDB is a website that holds information about movies. In the `tmdb_5000_movies` dataset, **id** is the id of the movie on the TMDB website and **keywords** holds short descriptive tags for the movie; in total the dataset has 20 columns. The `tmdb_5000_credits` dataset has the movie_id, title, cast and crew of each movie.

In [5]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [7]:
movies = movies.merge(credits, on='title')

We have merged the two dataframes on the title column, which gives 23 columns in total.
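To illustrate how the merge affects the column count, here is a small sketch on two toy frames. The frames `left` and `right` and their values are made up for the example and are not taken from the TMDB files.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only).
left = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["first plot", "second plot"],
})
right = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["cast A", "cast B"],
})

merged = left.merge(right, on="title")

# 3 columns + 3 columns, with the shared key 'title' kept once = 5 columns.
print(merged.shape)
print(list(merged.columns))
```

Merging on `title` pairs every row with every other row that has the same title, so repeated titles produce extra rows. Merging on the id columns (`left_on="id", right_on="movie_id"`) would be one way to avoid that.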

In [8]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [11]:
# genres, id, keywords, title, overview, cast, crew
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords','cast','crew']]

Of the 23 columns (features), we keep only the ones that matter for this project and drop the rest.

In [12]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [15]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [14]:
movies.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies.dropna(inplace=True)


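The overview column has 3 missing values, meaning 3 movies have no overview, so we drop those 3 movies. As the dataset contains about 5000 movies, removing 3 of them will not affect the results much. (The `isnull()` output above shows zeros because, going by the cell counters, that cell was last run after `dropna`.)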
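The warning printed above appears because `movies` was created by selecting columns from another DataFrame, so pandas cannot tell whether an in-place change is meant to reach the original. One common way to avoid it, sketched here on a toy frame (the names `raw` and `subset` are illustrative), is to take an explicit copy and reassign the result of `dropna` instead of using `inplace=True`.

```python
import pandas as pd

raw = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["first plot", None, "third plot"],
    "budget": [10, 20, 30],
})

# .copy() makes the column selection an independent DataFrame.
subset = raw[["title", "overview"]].copy()

# Count missing values per column before dropping anything.
print(subset.isnull().sum())

# Reassign instead of modifying in place; 'raw' is left untouched.
subset = subset.dropna()
print(len(subset))
```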

In [17]:
movies.duplicated().sum()

0

In [18]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

As we can see, the genres column stores its data as a **string** that looks like a **list** of **dictionaries**. To turn it into a simple **list** of genre names we use the **`ast`** library: looping over the raw string and indexing each item with `['name']` fails with the error `string indices must be integers`, so we first parse the string with **`ast.literal_eval()`**. We will apply the same idea to the other columns stored this way.
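The difference between the raw string and the parsed value can be seen in a short sketch. The string `raw` below is a shortened version of the genres value shown above.

```python
import ast

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Iterating over the raw string yields single characters, and a
# character cannot be indexed with a key such as 'name'.
try:
    [item["name"] for item in raw]
except TypeError as exc:
    print("TypeError:", exc)

# literal_eval turns the string into a real list of dictionaries.
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]
print(names)
```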

In [21]:
def convert(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

In [23]:
movies['genres'] = movies['genres'].apply(convert)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['genres'] = movies['genres'].apply(convert)


In [24]:
movies['keywords'] = movies['keywords'].apply(convert)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['keywords'] = movies['keywords'].apply(convert)


For cast we only consider the top 3 names. The cast column contains a list of dictionaries, each describing one actor, so we take only the first 3 dictionaries and extract the name value from each. To do this we add a counter to the convert function so that only the first 3 actor names are appended.
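The same result can also be reached by slicing the parsed list instead of keeping a counter. This is only a sketch of an alternative; `top_names` is a hypothetical helper and the sample string is made up.

```python
import ast

def top_names(obj, n=3):
    """Return the 'name' of the first n entries in a stringified list of dicts."""
    return [entry["name"] for entry in ast.literal_eval(obj)[:n]]

sample = (
    '[{"cast_id": 1, "name": "Actor A"}, {"cast_id": 2, "name": "Actor B"}, '
    '{"cast_id": 3, "name": "Actor C"}, {"cast_id": 4, "name": "Actor D"}]'
)
print(top_names(sample))
```

Slicing also handles movies with fewer than 3 cast members, since a slice past the end of a list simply returns what is there.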

In [25]:
def convert_cast(obj):
    L = []
    c = 0
    for i in ast.literal_eval(obj):
        if c != 3:
            L.append(i['name'])
            c += 1
        else:
            break
    return L

In [26]:
movies['cast'] = movies['cast'].apply(convert_cast)

For crew we want only the name of the director, so if a dictionary has **`"job": "Director"`** we extract the name from that dictionary.
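A slightly more defensive variant of this lookup is sketched below. The helper `director_names` and the sample crew string are hypothetical, and the use of `.get` assumes that some entries might lack a `job` key, which the source does not state.

```python
import ast

def director_names(obj):
    """Return a one-element list with the first director, or [] if none is listed."""
    for member in ast.literal_eval(obj):
        # .get avoids a KeyError if an entry has no 'job' key (an assumption).
        if member.get("job") == "Director":
            return [member["name"]]
    return []

crew = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)
print(director_names(crew))
```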

In [28]:
def convert_crew(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

In [29]:
movies['crew'] = movies['crew'].apply(convert_crew)

We will also convert the overview column into a list of words so that it is easier to concatenate with the other columns.

In [31]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [32]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


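Now we remove the spaces inside each entry, so that a name like "Sam Worthington" becomes the single tag "SamWorthington". Otherwise the name would be split into separate words, and two different actors who share a first name would look partly identical to the model. Removing the spaces also makes it easier to build the tags. After that we simply add all the columns together.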
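The effect of removing the spaces can be checked with a short sketch. The second cast list is an invented example of a different person who shares a first name, and `tokens` is a hypothetical helper that mimics splitting the tags on whitespace.

```python
cast_a = ["Sam Worthington", "Zoe Saldana"]
cast_b = ["Sam Mendes"]   # illustrative: a different person with the same first name

def tokens(names, strip_spaces):
    """Join names into one string and split it on whitespace."""
    if strip_spaces:
        names = [n.replace(" ", "") for n in names]
    return set(" ".join(names).lower().split())

# With spaces kept, both movies share the token 'sam'.
print(tokens(cast_a, False) & tokens(cast_b, False))

# With spaces removed, the two movies share no token.
print(tokens(cast_a, True) & tokens(cast_b, True))
```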

In [37]:
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords']= movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
movies['overview']= movies['overview'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])

In [39]:
movies['tags'] = movies['genres'] + movies['overview'] + movies['cast'] + movies['crew'] + movies['keywords']

The new dataframe has only 3 columns: **movie_id, title, tags**. We also convert all the tags to lowercase so that the same word is not counted separately because of capitalization.

In [40]:
new_df = movies[['movie_id', 'title', 'tags']]

In [42]:
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))


In [44]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


In [45]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,action adventure fantasy sciencefiction in the...


To recommend movies we have to find similar ones, and the similarity of the tags can be measured using basic concepts of NLP.<br>
**Vectorization :-** Converting text into vectors. This can be done with various techniques such as **Bag of Words (BOW), TF-IDF, Word2Vec,** etc. In this project we use the BOW technique. We first join all the tags into one large string and extract the 5000 most common words from it. We then go through the tags of every movie and count how many times each of these 5000 words appears. This gives a matrix of shape **4806 X 5000**, in which each row is the vector representation of the tags of one movie. As part of this process we also remove **stopwords** and apply **stemming**.<br>
**Stopwords :-** Words that are needed to form a sentence but contribute little to its actual meaning, such as "the", "is" and "and". <br>
**Stemming :-** Reducing each word to its base form. Eg. love, loved and loving all get converted into love.
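The Bag of Words idea can be sketched in plain Python on three made-up documents. This is only an illustration of the counting scheme described above, not scikit-learn's implementation, and `max_features = 4` stands in for the 5000 used in the project.

```python
from collections import Counter

docs = [
    "action space alien action",
    "space war alien war",
    "romance love city",
]
max_features = 4  # plays the role of the 5000 in the text

# 1. Count every word over all documents and keep the most common ones.
totals = Counter(word for doc in docs for word in doc.split())
vocab = sorted(word for word, _ in totals.most_common(max_features))

# 2. For each document, count how often each vocabulary word occurs.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in vectors:
    print(row)
print(len(vectors), "x", len(vocab))
```

The result has one row per document and one column per vocabulary word, which is why the matrix for the movies has as many rows as there are movies.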

In [59]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
cv = CountVectorizer(max_features=5000, stop_words='english')

In [56]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [57]:
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)


In [60]:
vectors = cv.fit_transform(new_df['tags']).toarray()


In [61]:
cv.get_feature_names_out() #list of 5000 words

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

So in total we have **4806** movies and their corresponding vectors. Now we calculate the distance between every pair of vectors: `the greater the distance, the lower the similarity`. Note that we do not use the **Euclidean distance**, which measures the distance between the tips of the vectors; we use the **cosine similarity**, which depends only on the angle between them. If the angle is 0 degrees the vectors point in the same direction and the similarity is 1; if the angle is 5-10 degrees the vectors are still very similar. In general an angle of 180 degrees means the vectors are completely opposite, but word counts are never negative, so here the angle stays between 0 and 90 degrees, and at 90 degrees two movies share no words at all. Euclidean distance is a poor measure for high-dimensional data such as these 5000-dimensional vectors, so it is not advisable here. <br>
After applying the **`cosine_similarity`** function we get the **similarity matrix**, of shape 4806 X 4806, in which the first row holds the similarity of the first movie to each of the 4806 movies.
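As a small worked example of the measure, the sketch below computes the cosine of the angle between toy count vectors in plain Python. The function `cosine` is written out here for illustration and is not the scikit-learn function.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 1, 0]
b = [2, 2, 0]   # same direction as a, twice as long
c = [0, 0, 3]   # shares no words with a

print(cosine(a, b))     # same direction, so the similarity is at its maximum
print(cosine(a, c))     # no shared words, so the similarity is zero
print(math.dist(a, b))  # Euclidean distance is non-zero despite the same direction
```

The vectors `a` and `b` differ only in length, for instance because one description is longer, yet the Euclidean distance treats them as different while the cosine does not.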

In [62]:
from sklearn.metrics.pairwise import cosine_similarity

In [64]:
similarity = cosine_similarity(vectors)

After this we write a function that takes a movie title and prints the 5 most similar movies. For this we take the row of the similarity matrix that belongs to the movie and sort it in descending order, so that the highest similarities come first. There is one problem: if we sort the values directly we lose track of which movie each value belongs to. So we use the **`enumerate`** function, which produces tuples whose first element is the index number, and sort those by the similarity value. The first entry after sorting is the movie itself, so we skip it and take the next 5.
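The role of `enumerate` in this step can be seen on a short made-up row of similarities (the list `row` is illustrative, with position 0 standing for the movie itself).

```python
row = [1.0, 0.2, 0.7, 0.05, 0.9]   # made-up similarities for movie 0

# Sorting the bare scores loses track of which movie each score belongs to.
print(sorted(row, reverse=True))

# enumerate pairs each score with its position before sorting.
ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
print(ranked)

# Skip the first entry (the movie itself) and keep the next two indices.
top = [index for index, _ in ranked[1:3]]
print(top)
```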

In [69]:
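def recommend(movie):
    # Positional row of the movie; index labels have gaps after dropna
    movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    distances = similarity[movie_index]
    # Sort (index, score) pairs by score, highest first, and skip the movie itself
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:6]
    
    for i in movies_list:
        print(new_df.iloc[i[0]].title)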

In [70]:
recommend('Avatar')

Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.


In [71]:
import pickle

In [75]:
with open('movie_dict.pkl', 'wb') as f:
    pickle.dump(new_df.to_dict(), f)

In [76]:
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
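The saved files can later be read back with `pickle.load`. The sketch below shows the round trip on a toy dictionary in a temporary directory; the dictionary `movie_dict` is a made-up stand-in for `new_df.to_dict()`.

```python
import os
import pickle
import tempfile

# Toy stand-in for new_df.to_dict(): {column name: {row label: value}}.
movie_dict = {"title": {0: "Movie A", 1: "Movie B"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie_dict.pkl")

    # Write the object, closing the file when the block ends.
    with open(path, "wb") as f:
        pickle.dump(movie_dict, f)

    # Read it back in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == movie_dict)
```

A dictionary loaded this way can be turned back into a dataframe with `pd.DataFrame(restored)`.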