## DATA 612 Project 2: Content-Based and Collaborative Filtering
#### Team members: Mia Chen, Wei Zhou
#### Date: 6/21/2020

The goal of this assignment is to try out different ways of implementing and configuring a recommender and to evaluate the different approaches. We use the MovieLens-based [dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from Kaggle. Its files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset, covering movies released on or before July 2017. The features include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.

This dataset consists of the following files:

* movies_metadata.csv: This file contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.

* keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

* credits.csv: Consists of Cast and Crew Information for all the movies. Available in the form of a stringified JSON Object.

* links.csv: This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

* links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

* ratings_small.csv: A subset of 100,000 ratings from 700 users on 9,000 movies.

## Content-Based Filtering
Content-based filtering recommends movies that are similar to a particular movie. To achieve this, we compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

In [1]:
# Load modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

# Inspect the plots of a few movies
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

To measure how similar two overviews are, we compute a Term Frequency-Inverse Document Frequency (TF-IDF) vector for each overview. This gives us a matrix in which each row represents a movie and each column represents a word in the overview vocabulary.
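To make the shape of such a matrix concrete, here is a minimal sketch on three made-up overviews. The `toy_overviews` list and the `toy_` variable names are assumptions for illustration only and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for metadata['overview']
toy_overviews = [
    "A cowboy doll is threatened by a new spaceman toy.",
    "Siblings discover an enchanted board game in the attic.",
    "A toy cowboy and a spaceman toy rescue their friends.",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per overview, one column per distinct term kept by the vectorizer
n_docs, n_terms = toy_matrix.shape
print(n_docs, n_terms)

# vocabulary_ maps each term to its column index
toy_column = toy_tfidf.vocabulary_['toy']

# The weight is zero where the term does not occur (second overview)
print(toy_matrix[1, toy_column])
```

The result is a sparse matrix, which is why the full 45,466-row version in the next cell fits in memory even with tens of thousands of columns.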

In [3]:
# Define a TF-IDF Vectorizer Object
# Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

In [4]:
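# Array mapping from feature integer indices to feature name
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out())
tfidf.get_feature_names()[1000:1010]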

['abdel',
 'abdelatif',
 'abdelhakim',
 'abdelilah',
 'abdelkader',
 'abdicate',
 'abdicated',
 'abdicates',
 'abdicating',
 'abdication']

In [5]:
# Import sklearn's linear_kernel(): TF-IDF rows are L2-normalized by default, so the
# dot product equals the cosine similarity and is faster than cosine_similarity()
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim.shape

(45466, 45466)
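The dot product can stand in for cosine similarity here because `TfidfVectorizer` scales every row to unit Euclidean length by default (`norm='l2'`). A minimal check on a made-up corpus; the `docs` list and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, each containing at least one non-stop-word
docs = [
    "toys come alive when the boy leaves the room",
    "a board game pulls two children into the jungle",
    "the toys plan a daring rescue",
]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Euclidean length of every row of the sparse matrix
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

dot = linear_kernel(X, X)        # plain dot products
cos = cosine_similarity(X, X)    # dot products divided by the row norms
same = np.allclose(dot, cos)
print(row_norms, same)
```

A row built from an empty overview has length zero rather than one; its dot product with every row, including itself, is 0.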

In [6]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [7]:
# Reverse mapping of movie titles and DataFrame indices
# Make title the index of a Series, keeping the first row for any repeated title
index = pd.Series(metadata.index, index=metadata['title'])
index = index[~index.index.duplicated(keep='first')]

index[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
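A title-keyed lookup returns a single position only when titles are unique. Note that `Series.drop_duplicates()` compares values rather than index labels, so de-duplicating by title means filtering on the index. A minimal sketch with made-up titles; `titles`, `lookup`, and `unique_lookup` are hypothetical names.

```python
import pandas as pd

# Hypothetical titles in which 'Heat' appears twice
titles = pd.Series(['Toy Story', 'Heat', 'Sabrina', 'Heat'])
lookup = pd.Series(titles.index, index=titles)

# The values (row positions 0..3) are all distinct, so nothing is dropped
n_after_drop = len(lookup.drop_duplicates())
print(n_after_drop)

# Looking up a repeated title yields a Series of positions, not one integer
print(type(lookup['Heat']).__name__)

# Filtering on the index keeps the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]
print(unique_lookup['Heat'])
```

Keeping the first occurrence is one possible policy; disambiguating by release year would be another.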

In [10]:
# Define a function that takes in a movie title and outputs similar movies
def movie_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    ind = index[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[ind]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the next 15
    sim_scores = sim_scores[1:16]
    # Get the movie indices
    movie_index = [i[0] for i in sim_scores]
    # Return the titles of the 15 most similar movies
    return metadata['title'].iloc[movie_index]

In [11]:
movie_recommendations('Toy Story')

15348                                     Toy Story 3
2997                                      Toy Story 2
10301                          The 40 Year Old Virgin
24523                                       Small Fry
23843                     Andy Hardy's Blonde Trouble
29202                                      Hot Splash
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
27206                      Life Begins for Andy Hardy
1071                            Rebel Without a Cause
36094                            Welcome to Happiness
40261                                   Wabash Avenue
1932                                        Condorman
Name: title, dtype: object
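The full similarity matrix computed in In [5] has 45,466 × 45,466 entries, or roughly 16.5 GB as 64-bit floats. As an alternative sketch (not the approach used in this notebook), the scores can be computed for one query movie at a time, which needs only a single row of similarities. The function name `top_k_similar` and the toy corpus are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(matrix, row, k=15):
    """Return the indices of the k rows most similar to `row`, best first."""
    # Similarity of the query row against every row: shape (n_rows,)
    scores = linear_kernel(matrix[row], matrix).ravel()
    # Exclude the query itself explicitly instead of assuming it sorts first
    scores[row] = -np.inf
    k = min(k, len(scores) - 1)
    # argpartition selects the k largest scores without sorting all of them
    top = np.argpartition(-scores, k - 1)[:k]
    # Order only the selected k by descending score
    return top[np.argsort(-scores[top])]

# Hypothetical four-document corpus
docs = [
    "toy cowboy spaceman",
    "cowboy western sheriff",
    "enchanted board game jungle",
    "toy spaceman rescue",
]
X = TfidfVectorizer().fit_transform(docs)

result = top_k_similar(X, 0, k=2)
print(result)
```

With the notebook's variables, `metadata['title'].iloc[top_k_similar(tfidf_matrix, index['Toy Story'])]` would be the corresponding call, assuming `index['Toy Story']` resolves to a single integer.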