# Movie Recommender System
## (Self-Guided Project)

## by Justin Sierchio

In this project, we will use TMDB data to construct a movie recommender system. These systems use rating or user preference data to make recommendations. They are a form of information filtering system.

This data is in .csv file format and is from Kaggle at: https://www.kaggle.com/tmdb/tmdb-movie-metadata/download. More information related to the dataset can be found at: https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system.

Note: this is a self-guided project following the tutorial provided by Ibtesam Ahmed at Kaggle.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import re

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Load Datasets for Study
df1 = pd.read_csv('tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb_5000_movies.csv')

print('Datasets uploaded!')

Datasets uploaded!


Let's display the first 5 rows for each of these datasets.

In [3]:
# Display 1st 5 rows of TMDB credits dataset
df1.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [4]:
# Display 1st 5 rows of TMDB movies dataset
df2.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Let's list what the columns in these datasets represent.

The first dataset contains the following features:
<ul>
    <li>movie_id - A unique identifier for each movie.</li>
    <li>title - Title of the movie.</li>
    <li>cast - The names of the lead and supporting actors.</li>
    <li>crew - The names of the Director, Editor, Composer, Writer, etc.</li>
</ul>

The second dataset has the following features:
<ul>
    <li>budget - The budget with which the movie was made.</li>
    <li>genres - The genres of the movie: Action, Comedy, Thriller, etc.</li>
    <li>homepage - A link to the homepage of the movie.</li>
    <li>id - This is in fact the same identifier as movie_id in the first dataset.</li>
    <li>keywords - The keywords or tags related to the movie.</li>
    <li>original_language - The language in which the movie was made.</li>
    <li>original_title - The title of the movie before translation or adaptation.</li>
    <li>overview - A brief description of the movie.</li>
    <li>popularity - A numeric quantity specifying the movie popularity.</li>
    <li>production_companies - The production house of the movie.</li>
    <li>production_countries - The country in which it was produced.</li>
    <li>release_date - The date on which it was released.</li>
    <li>revenue - The worldwide revenue generated by the movie.</li>
    <li>runtime - The running time of the movie in minutes.</li>
    <li>status - "Released" or "Rumored".</li>
    <li>tagline - Movie's tagline.</li>
    <li>title - Title of the movie.</li>
    <li>vote_average - The average rating the movie received.</li>
    <li>vote_count - The number of votes the movie received.</li>
</ul>

## Data Cleaning

To begin, let's merge these two datasets together by joining them along the 'id' column.

In [5]:
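# Rename the credits columns so the key matches df2's 'id' column
# (the spelling 'tittle' keeps this column from clashing with df2's 'title')
df1.columns = ['id', 'tittle', 'cast', 'crew']
# Join Datasets together along 'id' column
df2 = df2.merge(df1, on='id')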

Let's take a look at the resulting join.

In [6]:
df2.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


As we can see, the datasets have been joined along the 'id' column.
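The rename-then-merge step can be illustrated on a small toy example. This is only a sketch: the two tiny DataFrames below are made up for illustration, and the `validate='one_to_one'` argument is an optional check that is not used in the notebook itself. It asks pandas to raise an error if an `id` appears more than once on either side.

```python
import pandas as pd

# Hypothetical miniature versions of the credits and movies tables
credits = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['Film A', 'Film B'],
    'cast': ['cast A', 'cast B'],
    'crew': ['crew A', 'crew B'],
})
movies = pd.DataFrame({
    'id': [1, 2],
    'title': ['Film A', 'Film B'],
    'vote_count': [10, 20],
})

# Same renaming as in the notebook: 'movie_id' becomes 'id' and the
# credits title becomes 'tittle', so it does not collide with 'title'
credits.columns = ['id', 'tittle', 'cast', 'crew']

# Merge on the shared key; validate is an optional one-to-one safety check
merged = movies.merge(credits, on='id', validate='one_to_one')
print(merged.columns.tolist())
```

Had both tables kept a column named `title`, pandas would have disambiguated them with `_x` and `_y` suffixes instead.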

## Data Filtering

For demographic filtering, we need to do the following:
<ul>
    <li> choose a metric to score (rate) each film</li>
    <li> calculate the score for each film</li>
    <li> sort the films by score and recommend the best-rated films to the user</li>
</ul>

To make sure the ratings are reliable (ratings backed by more votes are given a higher weight), we will use IMDB's weighted rating formula.

https://image.ibb.co/jYWZp9/wr.png
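
Written out as an equation, and consistent with the weighted_rating() function defined later in this notebook, the formula reads as follows (the label WR for the weighted rating is an assumption made here for readability):

$$
\mathrm{WR} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
$$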

where in the formula above:
<ul>
    <li> v = # of votes for the movie</li>
    <li> m = minimum votes required to be listed on the chart</li>
    <li> R = average rating of the movie</li>
    <li> C = mean vote across the whole report</li>
</ul>

The dataset already provides v (vote_count) and R (vote_average) for each film, so C can be calculated easily as the mean of vote_average.

In [7]:
# Calculate Mean Vote Rating across the entire TMDB dataset.
C= df2['vote_average'].mean()
C

6.092171559442011

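So across the entire dataset, the average film is rated about 6.1 out of 10. Next we need to establish a cutoff m: the minimum number of votes a film must have to be considered for recommendation. Let's use the 90th percentile of vote counts, so that a film qualifies only if it has at least as many votes as 90% of the films in the dataset.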

In [11]:
# Compute the 90th percentile of vote counts (the minimum votes required)
m = df2['vote_count'].quantile(0.9)
m

1838.4000000000015
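
The non-integer value of m comes from how the quantile is computed. As a minimal sketch on made-up numbers (the toy Series below is not from the TMDB data), pandas by default interpolates linearly between the two nearest data points when the requested quantile falls between them:

```python
import pandas as pd

# Hypothetical vote counts, already sorted for readability
toy_votes = pd.Series([0, 10, 20, 30, 40])

# With 5 values, the 0.9 quantile sits at position 0.9 * (5 - 1) = 3.6,
# i.e. 60% of the way from the value at index 3 (30) to index 4 (40)
toy_m = toy_votes.quantile(0.9)
print(toy_m)  # approximately 36

# Films at or above the threshold would qualify
qualified = toy_votes[toy_votes >= toy_m]
print(qualified.tolist())
```

Only the largest value qualifies here, mirroring how roughly the top 10% of films by vote count qualify in the notebook.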

Now let us select the movies that meet our criteria.

In [12]:
# Create subset of films with at least m votes (copy to avoid modifying df2)
q_movies = df2.loc[df2['vote_count'] >= m].copy()
q_movies.shape

(481, 23)

So this tells us that there are 481 films that meet our criteria. Next, we need to calculate our metric for each qualified movie. To do this, we define a function called "weighted_rating()" and a new feature, score, whose value we calculate by applying this function to our DataFrame of qualified movies.

In [13]:
# Create a Function called 'Weighted Rating'
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [14]:
# Define new feature score by applying 'Weighted Rating' function
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
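
As a worked example of the formula, the sketch below plugs in the values of m and C computed earlier in this notebook together with the vote count and average rating of The Shawshank Redemption. The second film is hypothetical and is included only to show how a small vote count pulls the score toward the overall mean C.

```python
# Values computed earlier in this notebook
m = 1838.4000000000015   # 90th percentile of vote_count
C = 6.092171559442011    # mean of vote_average

def weighted_rating_values(v, R, m=m, C=C):
    """Weighted rating for a single film with v votes and average rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# The Shawshank Redemption: 8205 votes, average rating 8.5
shawshank = weighted_rating_values(8205, 8.5)
print(round(shawshank, 6))

# Hypothetical film with the same average rating but exactly m votes:
# both weights equal 0.5, so the score is the midpoint of R and C
few_votes = weighted_rating_values(m, 8.5)
print(round(few_votes, 6))
```

The first value agrees with the score reported for The Shawshank Redemption in the ranking (about 8.059), while the hypothetical film lands well below its raw average of 8.5.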

Now let us sort the DataFrame by the score feature. We will display the title, vote count, vote average and weighted rating (score) for the top 10 movies meeting these criteria.

In [15]:
# Sort movies by the new 'score' feature, highest first
q_movies = q_movies.sort_values('score', ascending=False)

# Display the Top 10 movies meeting our criteria
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059258
662,Fight Club,9413,8.3,7.939256
65,The Dark Knight,12002,8.2,7.92002
3232,Pulp Fiction,8428,8.3,7.904645
96,Inception,13752,8.1,7.863239
3337,The Godfather,5893,8.4,7.851236
95,Interstellar,10867,8.1,7.809479
809,Forrest Gump,7927,8.2,7.803188
329,The Lord of the Rings: The Return of the King,8064,8.1,7.727243
1990,The Empire Strikes Back,5879,8.2,7.697884


At this juncture, we have a very basic recommender system. However, there is more to be done!
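
For reuse, the steps of this section (compute C and m, filter, score, sort) could be bundled into a single function. The following is a minimal sketch under stated assumptions: the function name `top_movies`, its parameters, and the toy DataFrame are all hypothetical and not part of the original notebook, and the score is computed in vectorized form rather than with `apply`.

```python
import pandas as pd

def top_movies(df, quantile=0.9, n=10):
    """Rank films by weighted rating (hypothetical helper, not from the notebook)."""
    C = df['vote_average'].mean()            # mean rating across all films
    m = df['vote_count'].quantile(quantile)  # minimum votes required
    qualified = df.loc[df['vote_count'] >= m].copy()
    v = qualified['vote_count']
    R = qualified['vote_average']
    # Same weighted rating formula as weighted_rating(), applied column-wise
    qualified['score'] = (v / (v + m)) * R + (m / (v + m)) * C
    return qualified.sort_values('score', ascending=False).head(n)

# Made-up data with the column names used in this notebook
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_count': [100, 2000, 5000, 50],
    'vote_average': [9.0, 8.0, 7.5, 9.5],
})

ranked = top_movies(toy, quantile=0.5, n=2)
print(ranked[['title', 'vote_count', 'vote_average', 'score']])
```

In this toy data, films A and D have the highest raw averages but too few votes to qualify, so only B and C are ranked.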