# Movie Recommender System

This system recommends movies similar to one the user has already watched.

It uses two datasets, Movies and Credits, from the TMDB movie metadata collection on Kaggle: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

**1. MOVIE DATASET** 
* budget 
* genres 
* homepage
* id
* keywords
* original_language
* original_title
* overview	
* popularity	
* production_companies	
* production_countries	
* release_date	
* revenue	
* runtime	
* spoken_languages	
* status	
* tagline	
* title	
* vote_average	
* vote_count

**2. CREDITS DATASET**
* movie_id	
* title	
* cast	
* crew

**Importing relevant libraries**

In [295]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

**Loading Datasets**

In [296]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

# Exploring data

In [297]:
movies.shape

(4803, 20)

In [298]:
credits.shape

(4803, 4)

In [299]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [300]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [301]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [302]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


**Merging both datasets on title**

In [303]:
df = movies.merge(credits, on = 'title')

In [304]:
df.shape

(4809, 23)
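
The merged frame has 4809 rows although each input has 4803, which suggests that some titles are not unique, so a join on `title` pairs up rows that belong to different films. Below is a minimal sketch of that effect on toy data, together with a join on the numeric id instead. It assumes that `id` in the movies file and `movie_id` in the credits file refer to the same film, as the `head()` outputs above suggest; the toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files; all values are made up for illustration.
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],  # two different films share a title
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on the non-unique title pairs every "Batman" with every "Batman".
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the id keeps one row per film; the redundant title column is dropped first.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```

The rest of this notebook keeps the title-based merge, so the row counts shown in later outputs correspond to that choice.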

In [305]:
df.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Selecting significant features for the recommendation system

In [306]:
features = ['movie_id','title','overview','genres','keywords','cast','crew']

In [307]:
# Take an explicit copy so later in-place edits do not act on a view of df
data = df[features].copy()

In [308]:
data.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


**Checking for missing values**

In [309]:
data.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

**Dropping rows with any missing value**

In [310]:
data.dropna(inplace = True)

In [311]:
#checking for duplicate rows
data.duplicated().sum()

0

# Data Pre-processing

**Extracting 'name' from the genres and keywords into a list**

In [312]:
import ast

def extract_(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 
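
To see what this extraction does, here is a self-contained sketch run on a short hand-written sample in the same shape as the `genres` column. The sample string and the name `extract_names` are illustrative and not taken from the dataset.

```python
import ast

def extract_names(text):
    # Parse the string into a list of dicts, then keep only each 'name' value.
    return [item["name"] for item in ast.literal_eval(text)]

# Hand-written sample in the same shape as one 'genres' cell.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = extract_names(sample)
print(names)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for turning these strings into lists.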

In [313]:
data['genres'] = data['genres'].apply(extract_)
data.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [314]:
data['keywords'] = data['keywords'].apply(extract_)
data.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


**Extracting 3 lead actors from the cast into a list**

In [315]:
def extract3(text):
    # Keep only the names of the first three cast entries
    return [i['name'] for i in ast.literal_eval(text)[:3]]

In [316]:
data['cast'] = data['cast'].apply(extract3)
data.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [317]:
# Safeguard only: extract3 already returns at most three names, so this changes nothing
data['cast'] = data['cast'].apply(lambda x:x[0:3])

**Extracting only the director's name from crew**

In [318]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [319]:
data['crew'] = data['crew'].apply(fetch_director)

In [320]:
data.sample(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
2783,24227,Excessive Force,Chicago policeman Terry McCain is determined t...,[Action],"[police, shoulder holster, police shootout, sh...","[Thomas Ian Griffith, Lance Henriksen, James E...",[Jon Hess]
3550,51995,Salvation Boulevard,Set in the world of mega-churches in which a f...,"[Comedy, Thriller, Action, Drama]","[pastor, church service, spirituality, religion]","[Jennifer Connelly, Marisa Tomei, Pierce Brosnan]",[George Ratliff]
4109,17994,Witchboard,"Playing around with a Ouija board, a trio of f...",[Horror],"[ax, possession, psychic power, ouija, ouija b...","[Stephen Nichols, Tawny Kitaen, Todd Allen]",[Kevin Tenney]
574,9257,S.W.A.T.,Hondo Harrelson recruits Jim Street to join an...,"[Action, Thriller, Crime]","[liberation, transport of prisoners, special u...","[Samuel L. Jackson, Colin Farrell, Michelle Ro...",[Clark Johnson]
2982,214,Saw III,Jigsaw has disappeared. Along with his new app...,"[Horror, Thriller, Crime]","[brain tumor, nudity, suffocation, mutilation,...","[Tobin Bell, Shawnee Smith, Angus Macfadyen]",[Darren Lynn Bousman]


**Removing spaces inside multi-word names and phrases**

In [321]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
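
The reason for removing the spaces is that the text is later split into individual word tokens. With the spaces left in, "Sam Worthington" and "Sam Mendes" would share the token "sam" and look related. A minimal sketch of the difference, using a simplified whitespace tokenizer (the helper `tokens` is hypothetical and only stands in for the real vectorizer):

```python
def collapse(names):
    # Remove internal spaces so each full name becomes a single token.
    return [name.replace(" ", "") for name in names]

def tokens(names):
    # Simplified tokenizer for this sketch: lowercase, then split on whitespace.
    return set(" ".join(names).lower().split())

cast = ["Sam Worthington"]
crew = ["Sam Mendes"]

# Tokens the two people have in common before and after collapsing.
shared_before = tokens(cast) & tokens(crew)
shared_after = tokens(collapse(cast)) & tokens(collapse(crew))
print(shared_before, shared_after)
```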

In [322]:
data['cast'] = data['cast'].apply(collapse)
data['crew'] = data['crew'].apply(collapse)
data['keywords'] = data['keywords'].apply(collapse)
data['genres'] = data['genres'].apply(collapse)

In [323]:
data.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]


**Converting string into a list for 'overview'**

In [324]:
data['overview'] = data['overview'].apply(lambda x:x.split())

# CREATING TAGS
* Here, a movie's tags are a single list of words built by concatenating its overview, genres, keywords, lead cast and director. The recommender compares movies by how much of this combined text they share.

In [325]:
data['tags'] = data['overview'] + data['genres'] + data['keywords'] + data['cast'] + data['crew']

In [326]:
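#FINAL DATAFRAME
# dropna() left gaps in the index; reset it so that row labels match row positions
# in the similarity matrix that recommend() looks up later
movies_df = data[['movie_id','title', 'tags']].reset_index(drop=True)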

In [327]:
# converting list of strings to a single string

movies_df['tags'] = movies_df['tags'].apply( lambda x: " ".join(x))

#changing to lower case letters

movies_df['tags'] = movies_df['tags'].apply(lambda x : x.lower())

In [328]:
movies_df.head(5)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [329]:
movies_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

# Vectorizing tags

**STEMMING all the tags**
* Stemming reduces a word to its stem by stripping affixes such as suffixes, so that different forms of the same word map to a single token.
* For example, ['run', 'running', 'runs'] is converted to ['run', 'run', 'run'].

In [330]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
ps.stem('running')

'run'

In [331]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)

In [332]:
movies_df['tags'] = movies_df['tags'].apply(stem)

In [333]:
movies_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."


**Keeping the 5001 most frequent words (English stop words excluded)**

In [334]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5001,stop_words='english')

In [335]:
vector = cv.fit_transform(movies_df['tags']).toarray()

In [336]:
vector.shape

(4806, 5001)
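
Conceptually, the vectorizer builds a vocabulary from the most frequent words and turns each document into a row of word counts, which is why the result has one row per movie and one column per kept word. The following is a simplified pure-Python sketch of that idea, not scikit-learn's implementation: it ignores stop words and splits on whitespace only, and the function `bag_of_words` and the toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    # Count words over the whole corpus and keep the most frequent ones.
    totals = Counter(word for doc in docs for word in doc.split())
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word.
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Toy documents standing in for the 'tags' strings.
docs = ["action space alien", "action spy alien", "space space war"]
vocab, rows = bag_of_words(docs, max_features=3)
print(vocab)
print(rows)
```

Words outside the kept vocabulary ("spy" and "war" here) simply do not get a column.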

**Using COSINE similarity to recommend similar content**
* Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
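
For two count vectors a and b, the cosine similarity is the dot product a·b divided by the product of their lengths |a||b|. A minimal pure-Python sketch of the formula follows; the function name `cosine` and the handling of all-zero vectors are choices made for this sketch.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)

# Same word proportions, different lengths: the angle between them is zero.
same_direction = cosine([1, 2, 0], [2, 4, 0])
# No words in common: the vectors are orthogonal.
no_overlap = cosine([1, 0, 0], [0, 3, 0])
print(same_direction, no_overlap)
```

Because only the angle matters, a long overview and a short one can still score as very similar if they use the same words in similar proportions.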

In [337]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.08346223, 0.0860309 , ..., 0.04499213, 0.        ,
        0.        ],
       [0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
        0.02615329],
       [0.0860309 , 0.06063391, 1.        , ..., 0.02451452, 0.        ,
        0.        ],
       ...,
       [0.04499213, 0.02378257, 0.02451452, ..., 1.        , 0.03962144,
        0.04229549],
       [0.        , 0.        , 0.        , ..., 0.03962144, 1.        ,
        0.08714204],
       [0.        , 0.02615329, 0.        , ..., 0.04229549, 0.08714204,
        1.        ]])

**Defining our recommendation function**

In [338]:
def recommend(movie):
    
    #Extracting the index of the given movie
    index = movies_df[movies_df['title'] == movie].index[0]
    
    #calculating the similarity with each movie and sorting in descending order
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    
    for i in distances[1:6]:
        
        #printing top 5 similar movies
        print(movies_df.iloc[i[0]].title)
        

In [339]:
recommend('Iron Man')

Iron Man 3
Iron Man 2
Avengers: Age of Ultron
The Avengers
Captain America: Civil War
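
The slice `distances[1:6]` skips the first entry on the assumption that the movie itself sorts first. That holds as long as no other movie also scores the maximum similarity with it. A small sketch of the ranking step that excludes the query explicitly instead (the helper `top_k_similar` and the scores are hypothetical):

```python
def top_k_similar(scores, index, k=5):
    # Pair every position with its score, skipping the query item itself.
    candidates = [(i, s) for i, s in enumerate(scores) if i != index]
    # Highest similarity first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Made-up similarities of item 0 to items 0..4.
row = [1.0, 0.08, 0.30, 0.0, 0.25]
best = top_k_similar(row, index=0, k=2)
print(best)
```

The returned positions can then be mapped back to titles in the same way as in `recommend`.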


**Saving the movie table and similarity matrix for deployment**

In [343]:
import pickle
import bz2  # the standard-library module also provides BZ2File

def compressed_pickle(title, data):

    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        pickle.dump(data, f)

compressed_pickle('movies_dict', movies_df.to_dict())
compressed_pickle('similarity', similarity)
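
An application that serves the recommendations has to read these files back. Below is a minimal round-trip sketch using the standard-library `bz2` module; the helper `decompress_pickle`, the temporary directory and the small payload are assumptions for illustration, and the writer here takes a full path rather than a bare title.

```python
import bz2
import os
import pickle
import tempfile

def compressed_pickle(path, obj):
    # Write the object as a bz2-compressed pickle.
    with bz2.BZ2File(path, "wb") as f:
        pickle.dump(obj, f)

def decompress_pickle(path):
    # Read a bz2-compressed pickle back into memory.
    with bz2.BZ2File(path, "rb") as f:
        return pickle.load(f)

# Small stand-in for movies_df.to_dict().
payload = {"title": {0: "Avatar", 1: "Spectre"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies_dict.pbz2")
    compressed_pickle(path, payload)
    restored = decompress_pickle(path)

print(restored == payload)
```

Pickle files can execute code when loaded, so only files from a trusted source should be unpickled.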