# Content Based Movie recommender System

    In this project, we will use item metadata, such as genre, director, description, actors, etc., for movies to make recommendations. The general idea behind these recommender systems is that if a person likes a particular item, they will also like an item similar to it.
    We will preprocess the data and make a single sentence consisting of all metadata. Then we make tokens out of that sentence and vectorize it to input numerical data to our model using the bag of words method. Then using our model, we will find cosine similarities (inverse of cosine distance) between every vector and suggest 'n' similar vectors representing movies. 

Dataset used : https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

In [7]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

We will merge all collected datasets into single one on the basis of title

In [51]:
moviedata= pd.read_csv('Raw_data/movies.csv')
creditsdata= pd.read_csv('Raw_data/credits.csv')

In [9]:
data=moviedata.merge(creditsdata,on='title')

### Preprocessing

In [10]:
data.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [11]:
data.isnull().sum()

budget                     0
genres                     0
homepage                3096
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
movie_id                   0
cast                       0
crew                       0
dtype: int64

In [12]:
data = data.drop(['homepage','tagline'],axis=1)

In [13]:
data=data.dropna()

In [14]:
# data.isnull().sum()

In [15]:
data.duplicated().sum()

0

In [16]:
data.columns.values

array(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'], dtype=object)

In [17]:
# We will select important columns to determine similarities between two movies
# It should explain content of movie properly
movies=data[['movie_id','title','overview','genres','keywords','cast', 'crew']]

In [18]:
# we need another dataframe to fetch all the info about the movie
movieinfo= data[['movie_id','release_date','runtime','vote_count','vote_average','original_language','budget']]

In [19]:
movieinfo.head(2)

Unnamed: 0,movie_id,release_date,runtime,vote_count,vote_average,original_language,budget
0,19995,2009-12-10,162.0,11800,7.2,en,237000000
1,285,2007-05-19,169.0,4500,6.9,en,300000000


In [20]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [21]:
# Here genres, keywords, cast and crew are in dict format we should convert it into list of str
movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [22]:
import ast

In [23]:
def converter(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append((i['name']))
    return L
movies['genres']=movies['genres'].apply(converter)
movies['genres'][0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['genres']=movies['genres'].apply(converter)


['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [24]:
movies['keywords']=movies['keywords'].apply(converter)
movies['keywords'][0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['keywords']=movies['keywords'].apply(converter)


['culture clash',
 'future',
 'space war',
 'space colony',
 'society',
 'space travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love affair',
 'anti war',
 'power relations',
 'mind and soul',
 '3d']

In [25]:
def converter2(obj):
    #we need only top 3 casts
    L=[]
    counter=0
    for i in ast.literal_eval(obj):
        if counter!=3:
            L.append((i['name']))
            counter+=1
        else:
            break
    return L
movies['cast']=movies['cast'].apply(converter2)
movies['cast'][0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['cast']=movies['cast'].apply(converter2)


['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

In [26]:
def converter3(obj):
    #We only need director of the movie
    L=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append((i['name']))
            break
        else:
            continue
    return L
movies['crew']=movies['crew'].apply(converter3)
movies['crew'][0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['crew']=movies['crew'].apply(converter3)


['James Cameron']

In [27]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [28]:
# Now we will bring overview in similar format
movies['overview']=movies['overview'].apply(lambda x:x.split())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['overview']=movies['overview'].apply(lambda x:x.split())


In [29]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [30]:
movieinfo['cast']=movies['cast']
movieinfo['director']=movies['crew']
movieinfo['genre']=movies['genres']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movieinfo['cast']=movies['cast']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movieinfo['director']=movies['crew']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movieinfo['genre']=movies['genres']


In [31]:
# Now, we should remove stopwords and perform stemming
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
sw=stopwords.words('english')
ps=PorterStemmer() #<-- To stem

In [32]:
def clean(wordlist):
    L=[]
    for i in wordlist:
        if i not in sw:
            L.append(ps.stem(i.lower()))
    return L
def clean2(wordlist):
    L=[]
    for i in wordlist:
        L.append(ps.stem(i.lower()))
    return L
movies['overview']=movies['overview'].apply(clean)
movies['keywords']=movies['keywords'].apply(clean2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['overview']=movies['overview'].apply(clean)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['keywords']=movies['keywords'].apply(clean2)


In [33]:
# Now we will remove spaces between sentences like culture clash --> cultureclash, James Cameron --> JamesCameron
# Because in the process of tokenization of words they could lose meanings 
movies['cast']=movies['cast'].apply(lambda x:[i.replace(" ","")for i in x])
movies['crew']=movies['crew'].apply(lambda x:[i.replace(" ","")for i in x])
movies['keywords']=movies['keywords'].apply(lambda x:[i.replace(" ","")for i in x])
movies['genres']=movies['genres'].apply(lambda x:[i.replace(" ","")for i in x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['cast']=movies['cast'].apply(lambda x:[i.replace(" ","")for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['crew']=movies['crew'].apply(lambda x:[i.replace(" ","")for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['keywords']=movies['keywords'].apply(lambda x:

In [34]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[in, 22nd, century,, parapleg, marin, dispatch...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, futur, spacewar, spacecoloni, s...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[captain, barbossa,, long, believ, dead,, come...","[Adventure, Fantasy, Action]","[ocean, drugabus, exoticisland, eastindiatradi...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]


In [35]:
# we save a copy of movies
moviesdata=movies.copy()

In [36]:
#Now we will join all the columns except title & make a single string by joining all of them
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']
movies['tags']=movies['tags'].apply(lambda x: " ".join(x) )
movies['tags']=movies['tags'].apply(lambda x: x.lower() )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['tags']=movies['tags'].apply(lambda x: " ".join(x) )
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['tags']=movies['tags'].apply(lambda x:

In [37]:
movies['tags'][1]

"captain barbossa, long believ dead, come back life head edg earth will turner elizabeth swann. but noth quit seems. adventure fantasy action ocean drugabus exoticisland eastindiatradingcompani loveofone'slif traitor shipwreck strongwoman ship allianc calypso afterlif fighter pirat swashbuckl aftercreditssting johnnydepp orlandobloom keiraknightley goreverbinski"

In [38]:
movies.columns.values

array(['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast',
       'crew', 'tags'], dtype=object)

In [39]:
movies=movies.drop(['overview','genres','keywords','cast','crew'],axis=1)
movies.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in 22nd century, parapleg marin dispatch moon ..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ dead, come back ..."


This is the format ie [ Index | title | tags ] we need for further processing

In [40]:
# movies.to_csv('Preprocessed.csv')

## Vectorization and Model building

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [42]:
t=3
type(t)

int

Methods in class movierecomender will process our cleaned data and recomend movies. In future if we found more data the we can append it by using update method of this class.

In [43]:
class MovieRecommender:
#     data
#     vectors
#     cv
#     max_features
#     This class can help to increase reusability of the code. Once data has been cleaned we remain no more modifications
#     so the same code can help us. It also allows to update data and we don't have to perform all steps again
    def __init__(self,data=pd.DataFrame(columns=['movie_id','title','tags']),max_features=5000):
        self.data=data
        self.max_features=max_features
        self.cv =CountVectorizer(max_features=self.max_features,stop_words='english')
        self.vectorize()
    def vectorize(self):
        self.vectors=self.cv.fit_transform(self.data['tags']).toarray()
        self.similarity=cosine_similarity(self.vectors)
    def get_cosine_similarities(self,index=0):
        return sorted(list(enumerate(self.similarity[index])),reverse=True,key=lambda x : x[1])
    def index_of(self,title):
        return self.data[self.data['title']==title].index[0]
    def recommend(self,input_movie="Avatar",numbers=5):
        try:
            if (type(input_movie)==int):
                movie_index=input_movie
            else :
                movie_index=self.index_of(input_movie)
            movies_list = self.get_cosine_similarities(movie_index)[1:numbers+1]
            recommended=[]
            for i in movies_list:
                recommended.append(self.data['title'][i[0]])
            return recommended
        except IndexError:
            return "Invalid input!"
    @property
    def feature_names(self):
        return self.cv.get_feature_names_out()
    
    def update(self, data2,new_max_features=None):
        if new_max_features!=None:
            self.max_features=new_max_features
        self.data= pd.concat([self.data,data2])
        self.data=self.data.drop_duplicates()
        self.cv =CountVectorizer(max_features=self.max_features,stop_words='english')
        self.vectorize()
    def dumpdata(self):
        pickle.dump((self.data.drop(['tags'],axis=1)).to_dict() ,open('Data/movies.pkl','wb'))
        pickle.dump(self.similarity,open('Data/similarity.pkl','wb'))

In [44]:
mr= MovieRecommender(movies)

In [45]:
mr.recommend('Avatar',13)

['Aliens vs Predator: Requiem',
 'Aliens',
 'Anne of Green Gables',
 'Titan A.E.',
 'Independence Day',
 'Battle: Los Angeles',
 'Predators',
 'Small Soldiers',
 'Meet Dave',
 'Jupiter Ascending',
 'Lifeforce',
 'This Is England',
 'The Vatican Tapes']

In [46]:
mr.similarity[0]

array([1.        , 0.08471737, 0.0860309 , ..., 0.02271554, 0.        ,
       0.        ])

Dumpdata method will export movies list and similarity matrix into .pkl file

In [47]:
# mr.dumpdata()