# Movie Recommender System using Content-based filtering

Recommender systems are commonly built on one of three approaches:
1. Content-based filtering: Relies on the similarity of item content such as genre, tags, etc. It recommends new items that resemble the ones the viewer has already interacted with or shown interest in.
2. Collaborative filtering: Groups viewers/users together based on similarities in their activity. If A likes 'alpha' and B likes 'alpha' and 'gamma', then 'gamma' may be recommended to A.
3. Hybrid: Combines the content-based and collaborative filtering approaches.
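
The collaborative filtering example in item 2 can be sketched in a few lines. This is only a toy illustration of the idea and not part of the content-based recommender built in this notebook; the `likes` dictionary and the `recommend_for` helper are hypothetical names introduced here.

```python
# Toy data taken from the example above: A likes 'alpha', B likes 'alpha' and 'gamma'
likes = {"A": {"alpha"}, "B": {"alpha", "gamma"}}

def recommend_for(user, likes):
    """Suggest items liked by users who share at least one liked item with `user`."""
    seen = likes[user]
    suggestions = set()
    for other, items in likes.items():
        # Another user is considered similar if the two sets of liked items overlap
        if other != user and seen & items:
            suggestions |= items - seen
    return suggestions

print(recommend_for("A", likes))
```

Here B overlaps with A on 'alpha', so B's remaining item 'gamma' is suggested to A.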

In [295]:
#importing libraries
import numpy as np
import pandas as pd
#to convert string to literal expression
from ast import literal_eval
#for stem words
import nltk
from nltk.stem.porter import PorterStemmer
#For text vectorization
from sklearn.feature_extraction.text import CountVectorizer
#For cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

#Show all columns when displaying a dataframe
#(pd.option_context only takes effect inside a `with` block, so pd.set_option is used here)
pd.set_option("display.max_columns", None)

## Load the Dataset

In [296]:
#Read the data 
credits = pd.read_csv('tmdb_5000_credits.csv')

movies = pd.read_csv('tmdb_5000_movies.csv')

#Display the datasets
print("Credits")
display(credits.head())
print("Movies")
display(movies.head())

Credits


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Movies


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


### Merge the two loaded datasets
We want to analyze the data together so that the complete set of information is available for each sample.
The two datasets share two columns: the movie's title and its id (named 'id' in movies and 'movie_id' in credits). We will merge on both of them so that the title column is not duplicated.
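
To illustrate why both keys are used, here is a minimal sketch on two tiny, made-up frames (the `toy_movies` and `toy_credits` names and the 'cast' values are assumptions for illustration only). Merging on the id alone keeps both title columns with suffixes, while merging on id and title keeps a single 'title' column.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
toy_movies = pd.DataFrame({"id": [19995, 285], "title": ["Avatar", "Spectre"]})
toy_credits = pd.DataFrame({"movie_id": [19995, 285],
                            "title": ["Avatar", "Spectre"],
                            "cast": ["cast A", "cast B"]})

# Merge on the id only: pandas keeps both title columns as title_x and title_y
by_id = pd.merge(toy_movies, toy_credits, left_on="id", right_on="movie_id")

# Merge on id and title: the shared 'title' key appears only once
by_both = pd.merge(toy_movies, toy_credits,
                   left_on=["id", "title"], right_on=["movie_id", "title"])

print(list(by_id.columns))
print(list(by_both.columns))
```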

In [297]:
#Merge the two datasets to give reference to the data we are trying to explore
movies_credits = pd.merge(movies, credits, right_on = ['movie_id', 'title'], left_on = ['id', 'title'], how = 'inner')
display(movies_credits.head())

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [298]:
#How big is the merged dataset?
print("Shape of dataset = ", movies_credits.shape)

Shape of dataset =  (4803, 23)


## Selecting the features
There are 4803 samples/rows and 23 features/columns in this merged dataset, which is a large amount of data.
How much of this data do we actually need for our model?
Are all the features relevant? Do they all contribute equally to the analysis we are trying to do?
Since this is a content based algorithm, we are interested in the features that give us key details.

We will be using the following columns:
1. id (movie id)
2. genres
3. keywords (keywords related to the movie)
4. title
5. overview
6. cast
7. crew

In [299]:
#Separate these columns out of the dataset and store a copy of them in a new dataframe
#(.copy() avoids pandas' SettingWithCopyWarning when these columns are modified later)
df = movies_credits[['id', 'genres', 'keywords', 'title', 'overview', 'cast', 'crew']].copy()
display(df)

Unnamed: 0,id,genres,keywords,title,overview,cast,crew
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...
4798,9367,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...",El Mariachi,El Mariachi just wants to play his guitar and ...,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4799,72766,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],Newlyweds,A newlywed couple's honeymoon is upended by th...,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4800,231617,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4801,126186,[],[],Shanghai Calling,When ambitious New York attorney Sam is sent t...,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


### Handling missing values

In [300]:
#Missing data?
print("Before removing missing data")
print(df.isnull().sum())
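#The data has very few missing values, so we can drop these rows.
#Reset the index afterwards so that row labels keep matching row positions,
#which the similarity matrix lookup relies on later.
df = df.dropna().reset_index(drop=True)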
#Missing data after
print("After removing missing data")
print(df.isnull().sum())

#Check duplicates
print("No. of duplicate rows")
df.duplicated().sum()

Before removing missing data
id          0
genres      0
keywords    0
title       0
overview    3
cast        0
crew        0
dtype: int64
After removing missing data
id          0
genres      0
keywords    0
title       0
overview    0
cast        0
crew        0
dtype: int64
No. of duplicate rows


0

## Data Processing/Feature Engineering
- Some of the text columns need to be processed to remove information that does not contribute to the recommendation.  
- The columns 'genres', 'keywords', 'overview', 'cast' and 'crew' contain data that can be combined to give a brief description of each sample.  
- We do not need all the data contained in these columns, so we extract the relevant parts and combine them into one column.  

In [301]:
#Look at the dataframe
df.head()

Unnamed: 0,id,genres,keywords,title,overview,cast,crew
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


#### List of dictionaries
- The columns 'genres', 'keywords', 'cast' and 'crew' contain lists of dictionaries, but these are stored as strings.  
- We must first convert them to lists in order to perform any operation on these rows/samples.
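
As a small sketch of what this conversion does, the snippet below parses one string shaped like the 'genres' values shown above. The `raw` string is a shortened sample written for illustration, not a full row from the dataset.

```python
from ast import literal_eval

# A string that looks like a list of dictionaries, as stored in the CSV
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(type(raw))

# literal_eval safely evaluates the string into real Python objects
parsed = literal_eval(raw)
print(type(parsed), type(parsed[0]))

# Once parsed, the dictionaries can be indexed by key
names = [entry["name"] for entry in parsed]
print(names)
```

For strings that are strictly JSON, `json.loads` from the standard library would be another option.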

In [302]:
#Convert string for columns that are collection of dictionaries
for col in ['genres', 'keywords', 'cast', 'crew']:
    
    #Print data type of each sample in the column before converting
    print("Data type of each sample in the column ",col," is ",type(df[col][0]))
    
    #Make the conversion
    df[col] = df[col].apply(lambda x: literal_eval(x))
    
    #print data type after conversion
    print("After converting the string, data type is ", type(df[col][0]),"\n")

Data type of each sample in the column  genres  is  <class 'str'>
After converting the string, data type is  <class 'list'> 

Data type of each sample in the column  keywords  is  <class 'str'>
After converting the string, data type is  <class 'list'> 

Data type of each sample in the column  cast  is  <class 'str'>
After converting the string, data type is  <class 'list'> 

Data type of each sample in the column  crew  is  <class 'str'>
After converting the string, data type is  <class 'list'> 



#### Extract relevant information from these dictionaries
- 'name' from genres and keywords columns
- 'name' of three cast members for each movie
- 'name' of the director of the movie
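
The dataframe previews truncate the dictionaries, so here is a rough sketch of the extraction on trimmed-down, made-up entries. The people listed and the variable names are hypothetical, and the real dictionaries carry more keys than shown; the comprehension and `next` used here are just a compact alternative to the loops in the cell below.

```python
# Hypothetical, trimmed-down cast and crew entries for a single movie
cast = [
    {"character": "Lead", "name": "Actor One"},
    {"character": "Friend", "name": "Actor Two"},
    {"character": "Rival", "name": "Actor Three"},
    {"character": "Extra", "name": "Actor Four"},
]
crew = [
    {"job": "Editor", "name": "Editor One"},
    {"job": "Director", "name": "Director One"},
]

# Keep the names of the first three cast members only
top_cast = [member["name"] for member in cast[:3]]

# Take the first crew member whose job is 'Director' (None if there is none)
director = next((member["name"] for member in crew if member["job"] == "Director"), None)

print(top_cast, director)
```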

In [303]:
#Function for the columns ['genres', 'keywords'] from all dictionaries
def extract_info(row, dict_key):
    #declare list to store string
    info = []
    for dicti in row:
        info.append(dicti[dict_key])
    return info


#Extract genres information from the list of dictionaries using key 'name'
df['genres'] = df['genres'].apply(lambda x: extract_info(x, 'name'))

#Extract keywords information from the list of dictionaries using key 'name'
df['keywords'] = df['keywords'].apply(lambda x: extract_info(x, 'name'))


#Function for the column ['cast']. We are only interested in the first three main cast members
def extract_cast(row, dict_key):
    #declare list to store string
    info = []
    #counter to count the number of cast members saved
    counter = 0
    for dicti in row:
        if counter < 3:
            info.append(dicti[dict_key])
            #increment counter after adding cast name
            counter += 1 
        else:
            break
    return info

#Extract 3 names from cast
df['cast'] = df['cast'].apply(lambda x: extract_cast(x, 'name'))


#Function for the column ['crew']. We are only interested in the dictionary
#that contains the director's name.
#If a viewer watched a movie by a director, they are likely to watch more content by that director,
#as directors play a major role in the making of a movie and can have signature styles.
def extract_director(row, dict_key):
    info = []
    for dicti in row:
        #if job is director save the name
        if dicti['job'] == 'Director':
            info.append(dicti[dict_key])
            #stop here since we already extracted the director
            break
    return info

#Extract the director's name from crew
df['crew'] = df['crew'].apply(lambda x: extract_director(x, 'name'))

#What does the dataframe look like after applying these function to these four columns?
df.head()

Unnamed: 0,id,genres,keywords,title,overview,cast,crew
0,19995,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Spectre,A cryptic message from Bond’s past sends him o...,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",John Carter,"John Carter is a war-weary, former military ca...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


#### Convert the string column 'overview' to lists so that the columns can be combined later
- The 'overview' column holds a string, which must first be split into a list of words

In [304]:
#convert overview to list
df['overview'] = df['overview'].apply(lambda x: x.split())

#View dataframes after this list conversion
df.head()

Unnamed: 0,id,genres,keywords,title,overview,cast,crew
0,19995,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


#### Before combining the columns, delete the spaces within phrases so that each phrase is uniquely identified
- Remove white space within the names of cast, crew, genres and keywords
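
A short sketch of why this matters, using two names that appear in the data above: once the text is split into words, 'Sam Worthington' and 'Sam Mendes' would share the token 'sam' and look related. The helper functions are hypothetical and only mimic the effect of word-level tokenization.

```python
cast = ["Sam Worthington"]
director = ["Sam Mendes"]

def word_tokens(names):
    # Split every name on spaces, as a word-level tokenizer would
    return {word.lower() for name in names for word in name.split()}

def joined_tokens(names):
    # Remove the spaces first so that each full name becomes one token
    return {name.replace(" ", "").lower() for name in names}

shared_with_spaces = word_tokens(cast) & word_tokens(director)
shared_without_spaces = joined_tokens(cast) & joined_tokens(director)

print(shared_with_spaces)      # the two people overlap on 'sam'
print(shared_without_spaces)   # no overlap once names are joined
```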

In [305]:
#For each column in this list remove white spaces
for col in ['genres', 'keywords', 'cast', 'crew']:
    df[col] = df[col].apply(lambda x: [word.replace(" ", "") for word in x])
    
#View the dataframe after these changes
df.head()

Unnamed: 0,id,genres,keywords,title,overview,cast,crew
0,19995,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


## Combine columns 'genres', 'keywords', 'overview', 'cast' and 'crew' 

In [306]:
#Combine the columns
df['identifier'] = df['overview'] + df['genres'] + df['keywords'] + df['cast'] + df['crew']

#Convert list to a single string
df['identifier'] = df['identifier'].apply(lambda x: " ".join(x))

#Select the columns we will use further (as a copy) and view the dataframe
df = df[['id', 'title', 'identifier']].copy()

df.head()

Unnamed: 0,id,title,identifier
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


### Convert text into stem words so that different forms of a word are counted as one

In [307]:
#Function to convert text into stem words
ps = PorterStemmer()
def get_stem(line):
    stem_word = []
    for word in line.split():
        stem_word.append(ps.stem(word))
    return " ".join(stem_word)

print("Text looks like this before stem conversion:\n",df['identifier'][0],"\n")
df['identifier'] = df['identifier'].apply(get_stem)
print("After stem conversion:\n",df['identifier'][0])

Text looks like this before stem conversion:
 In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron 

After stem conversion:
 in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron
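
As the output above suggests, stemming reduces inflected forms to a shared root and lowercases them. A minimal sketch on a few words taken from that output (the `words` list is chosen here for illustration):

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Words taken from the Avatar identifier shown above
words = ["orders", "following", "protecting", "Fantasy"]
stems = [ps.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that a stem such as 'fantasi' is not necessarily a dictionary word; it only needs to be the same for every form of the word.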


## Text Vectorization using Bag of Words
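
Bag of words represents each text as a vector of word counts over a shared vocabulary. The sketch below shows the idea by hand on two made-up texts; it is a simplified illustration and does not reproduce everything `CountVectorizer` does (for example stop-word removal or the `max_features` limit).

```python
from collections import Counter

# Two made-up identifier strings
texts = ["action space alien space", "action romance"]

# The vocabulary is the sorted set of all words seen in any text
vocabulary = sorted({word for text in texts for word in text.split()})

def to_vector(text, vocabulary):
    # Count how often each vocabulary word occurs in this text
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(text, vocabulary) for text in texts]

print(vocabulary)
print(vectors)
```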

In [308]:
#Create the vectorizer: keep the 5000 most frequent words and drop English stop words
cv = CountVectorizer(max_features = 5000, stop_words = 'english')
#Convert each identifier into a vector of word counts
vectors = cv.fit_transform(df['identifier']).toarray()

## Cosine Similarity
Cosine similarity measures the angle between two movie vectors: the closer the value is to 1, the more similar the movies are, and the closer it is to 0, the more different they are.
- We now have a vector for each of the movies in our dataset, derived from the 'identifier' column. 
- We can use these vectors to calculate cosine similarity.
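
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for three small, made-up count vectors; the `cosine` helper is a hypothetical name used only here, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word-count vectors for three movies
movie_1 = [1, 1, 0, 1]
movie_2 = [1, 0, 1, 0]
movie_3 = [2, 2, 0, 2]

sim_12 = cosine(movie_1, movie_2)   # only one word in common
sim_13 = cosine(movie_1, movie_3)   # same direction, different length

print(round(sim_12, 3), round(sim_13, 3))
```

Because only the angle matters, `movie_3` (the same words repeated twice as often) is treated as essentially identical to `movie_1`.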

In [309]:
#Compute the pairwise cosine similarity between all movie vectors
similarity = cosine_similarity(vectors)

#print shape of similarity matrix
print("Similarity is calculated between every pair of movies, so each dimension must equal the number of movies, shape =",similarity.shape)

Similarity is calculated between every pair of movies, so each dimension must equal the number of movies, shape = (4800, 4800)


## Define function to recommend movies based on similarity

In [310]:
def recommend_5_movies(movie):
    #position of the movie in the dataframe (row positions match the rows of the similarity matrix)
    movie_index = np.flatnonzero((df['title'] == movie).to_numpy())[0]
    #similarity of this movie to all the other movies in the dataframe
    movie_distance = similarity[movie_index]
    #movie_distance has one entry per movie, as similarity is calculated against each one
    #We want to select the 5 closest ones
    #Define list of tuples that contain the movie index and the similarity
    index_distance = list(enumerate(movie_distance))
    #Sort movies from most to least similar
    sorted_index_distance = sorted(index_distance, reverse = True, key = lambda x: x[1])
    #Top 5 movies; we skip position 0 because a movie is most similar to itself
    movies_5_index = sorted_index_distance[1:6]
    for ind in movies_5_index:
        movie_title = df.iloc[ind[0]]['title']
        print(movie_title)
    
recommend_5_movies('John Carter')

Riddick
Krrish
The Other Side of Heaven
The Legend of Hercules
Get Carter
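
The ranking step can also be sketched with `numpy.argsort` on a small, made-up similarity matrix. The `toy_similarity` values and the `top_k` helper are assumptions for illustration; unlike slicing off the first sorted entry, this variant excludes the query movie by its index.

```python
import numpy as np

# Made-up symmetric similarity matrix for four movies
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.4, 0.1],
    [0.9, 0.4, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
])

def top_k(similarity_matrix, query, k):
    # Sort positions from most to least similar to the query movie
    order = np.argsort(-similarity_matrix[query], kind="stable")
    # Drop the query movie itself, then keep the first k positions
    return [int(i) for i in order if i != query][:k]

print(top_k(toy_similarity, query=0, k=2))
```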


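## Conclusion:
This notebook recommends the 5 movies most similar to a given title, using cosine similarity between bag-of-words vectors built from each movie's overview, genres, keywords, cast and director.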