## Movie Recommender System

In [30]:
import numpy as np 
import pandas as pd

### Importing Datasets


In [31]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [32]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [33]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### merging the data frames based on title column

In [34]:
movies_merged = movies.merge(credits,on='title')

<p> now we need to create a content based filtering system. For that we need to create tags. Look for columns for creating tags</p>

<pre> List of required columns to create tags:
       1. Genres
       2. id - required while developing website
       3. keywords 
       4. title 
       5. overview - if the summaries of the movies are similar then movies are similar.
       6. cast - some people choose movie based on director or actor
       7. crew
       </pre>

In [35]:
final_movies_frame = movies_merged[['movie_id','title','overview','genres','keywords','cast','crew']]

<pre> The concise dataframe will have 3 colums - movie-id, title, tags.
 tags - overview, genres, keywords, cast and crew will be merged for tags</pre>

### preprocessing the data

In [36]:
final_movies_frame.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [49]:
final_movies_frame.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


In [38]:
final_movies_frame.duplicated().sum()

0

In [39]:
final_movies_frame.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

### Formatting the columns

In [40]:
final_movies_frame.iloc[0].genres
# converting this list of dictionaries to list

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [41]:
## creating a helper function to create list of dictionaries to list of genre tags
import ast
def helper(obj):
    genre_list = []

    for i in ast.literal_eval(obj):
        genre_list.append(i['name'])

    return genre_list



In [42]:
#example
helper('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

# As genre is a string passing it into the function will produce an error as below.

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [20]:
# To sort the above error
# converting the string of list to a list

import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

# we will use this snippet in the above code.

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

<p> Formmating Genres </p>

In [43]:
final_movies_frame['genres'] = final_movies_frame['genres'].apply(helper)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['genres'] = final_movies_frame['genres'].apply(helper)


In [44]:
## applying the same for keywords

final_movies_frame['keywords']=final_movies_frame['keywords'].apply(helper)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['keywords']=final_movies_frame['keywords'].apply(helper)


In [45]:
final_movies_frame.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [46]:
#keeping the first 3 main actors name -- for tags
def helper2(obj):
    list_cast = []

    counter = 0

    for i in ast.literal_eval(obj):
        if counter != 3:
            list_cast.append(i['name'])
            counter += 1
        else:
            break

    return list_cast

In [47]:
final_movies_frame['cast'] = final_movies_frame['cast'].apply(helper2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['cast'] = final_movies_frame['cast'].apply(helper2)


In [50]:
# creating a function to format crew - only filtering director

def filter_director(obj):
    list_director = []

    for i in ast.literal_eval(obj):

        if i['job'] == 'Director':
            list_director.append(i['name'])
            break

    return list_director



In [51]:
final_movies_frame['crew'] = final_movies_frame['crew'].apply(filter_director)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['crew'] = final_movies_frame['crew'].apply(filter_director)


In [52]:
final_movies_frame.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]


<p> Formmating overview </p>

In [53]:
final_movies_frame['overview'] = final_movies_frame['overview'].apply(lambda x:x.split())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['overview'] = final_movies_frame['overview'].apply(lambda x:x.split())


In [54]:
final_movies_frame['overview'][0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']

<pre> Now we need to remove spaces within each words or names as suppose name of the director is sam worthington
 this would create two different tags instead of one. </pre>

In [59]:
final_movies_frame['genres']=final_movies_frame['genres'].apply(lambda x:[i.replace(" ","") for i in x])
final_movies_frame['cast']=final_movies_frame['cast'].apply(lambda x:[i.replace(" ","") for i in x])
final_movies_frame['crew']=final_movies_frame['crew'].apply(lambda x:[i.replace(" ","") for i in x])
final_movies_frame['Keywords']=final_movies_frame['keywords'].apply(lambda x:[i.replace(" ","") for i in x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['genres']=final_movies_frame['genres'].apply(lambda x:[i.replace(" ","") for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['cast']=final_movies_frame['cast'].apply(lambda x:[i.replace(" ","") for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
 

<p> Creating tags column combination of all the formatted columns </p>

In [60]:
final_movies_frame['tags'] = final_movies_frame['overview'] + final_movies_frame['keywords'] + final_movies_frame['cast'] + final_movies_frame['crew']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_movies_frame['tags'] = final_movies_frame['overview'] + final_movies_frame['keywords'] + final_movies_frame['cast'] + final_movies_frame['crew']


In [61]:
final_movies_frame.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,Keywords,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[culture clash, future, space war, space colon...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century,, a, paraplegic, Marin..."


In [63]:
## This is the final created dataframe which will be used further. 

movie_tags_df = final_movies_frame[['movie_id','title','tags']]

In [64]:
## converting the list in tags to string

movie_tags_df['tags'] = movie_tags_df['tags'].apply(lambda x:" ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_tags_df['tags'] = movie_tags_df['tags'].apply(lambda x:" ".join(x))


In [66]:
movie_tags_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3d SamWorthington ZoeSaldana SigourneyWeaver StephenLang MichelleRodriguez GiovanniRibisi JoelDavidMoore CCHPounder WesStudi LazAlonso DileepRao MattGerald SeanAnthonyMoran JasonWhyte ScottLawrence KellyKilgour JamesPatrickPitt SeanPatrickMurphy PeterDillon KevinDorman KelsonHenderson DavidVanHorn JacobTomuri MichaelBlain-Rozgay JonCurry LukeHawker WoodySchultz PeterMensah SoniaYee JahnelCurfman IlramChoi KylaWarren LisaRoumain DebraWilson ChrisMala TaylorKibby JodieLandau JulieLamm CullenB.Madden JosephBradyMadden FrankieTorres AustinWilson SaraWilson TamicaWashington-Miller LucyBriant NathanMeister GerryBlair MatthewChamb

In [67]:
## converting all the strings in tags to lowercase
## This process is recommeded

movie_tags_df['tags'] = movie_tags_df['tags'].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_tags_df['tags'] = movie_tags_df['tags'].apply(lambda x: x.lower())


## Vectorization

<pre> 
Our main goal is to determine how similar the tags of two movies are. This is crucial for a recommendation system. If a customer selects a movie,
 we want to recommend other movies that have similar tags.

we'll convert the movie tags into vectors. These vectors will represent the tag in a numerical form that allows us to easily calculate similarities between them. Once the tags are converted into vectors, when a customer picks a movie, we can find other movies whose tags vectors are closest to the chosen vector

Method used: 

BAG OF WORDS:

1. Create a list of all unique tags across all movies. This is called vocabulary or create a list of x most frequently used tags from all 
the unique tags. We do not consider stop words in these tags.

2. For each movie we look at its tags and represent them as evctor. Each position in the vector corresponds to a tag from the vocabular list. 
If a movie has a particular tag, we put a "1" in the corresponding position in its vector. If it doesn’t have the tag, we put a "0."

3. After converting the tag of 2 movies into vectors, we can measure how similar they are by comparing the vectors.

One of the common method for comparision is cosine similarity. 


STOP WORDS - are the words that are used for sentence formation but do not 
contribute to the meaning of the sentence.
Ex: are, the, a, an etc..
</pre>

In [70]:
from sklearn.feature_extraction.text import CountVectorizer

#`CountVectorizer` in `sklearn.feature_extraction.text` converts a collection of 
# text documents into a matrix of token counts, representing the frequency of
#  each word in each document. It's commonly used in NLP to prepare text data 
# for machine learning models.

cv = CountVectorizer(max_features=5000, stop_words='english')
#removes the english stop-words

### Stemming


<pre> 
Here in the 5000 most frequent words there are different words with similar meaning like accept, accepts, accepted. 

To avoid that we apply stemming

Stemming: it is a process that reduces words to their base or root form by removing suffixes or prefixes. The goal is to group together 
different forms of a word so they can be analyzed as a single item. 
</pre>

In [86]:
!pip install nltk




[notice] A new release of pip is available: 24.1.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [87]:
import nltk

In [88]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [89]:
def stem(text):
    y = []

    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)



In [90]:
movie_tags_df['tags'] = movie_tags_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_tags_df['tags'] = movie_tags_df['tags'].apply(stem)


In [91]:
vectors = cv.fit_transform(movie_tags_df['tags']).toarray()

# by default many values will be 0. and by default cv will return sparse matrix
#we convert it into numpy array

In [92]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [93]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zoo', 'zooeydeschanel', 'zoëkravitz'],
      dtype=object)

In [95]:
# now we the similarity of every movie with every other movie in the space 
# in higher dimension euclidian distance is not a reliable measure. We use 
#COSINE SIMILARITY.

from sklearn.metrics.pairwise import cosine_similarity




In [96]:
similarity_matrix = cosine_similarity(vectors)

In [99]:
print(similarity_matrix.shape)

(4806, 4806)


In [101]:
similarity_matrix[0]

array([1.        , 0.01623069, 0.01707221, ..., 0.03378687, 0.        ,
       0.        ])

## Recommedation Function

In [111]:
# returns 5 similar movies
def recommend(movie):
    movie_index = movie_tags_df[movie_tags_df['title'] == movie].index[0]
    distances = similarity_matrix[movie_index]
    similar_movie_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

    movies_name = []

    for x in similar_movie_list:
        movies_name.append(movie_tags_df.iloc[x[0]].title)


    return movies_name


    

In [112]:
recommend('Avatar')

['Aliens', 'Silent Running', 'Moonraker', 'Alien', 'Spaceballs']