# Movie recommendation model

Build movie recommendation models based on:
* Highest Rating Filtering
* Content Based Filtering
* Collaborative Filtering

Evaluate and combine models to improve recommendations.

In [None]:
import pandas as pd
import numpy as np

credits_df = pd.read_csv('../data/tmdb_5000_credits.csv')
movies_df_original = pd.read_csv('../data/tmdb_5000_movies.csv')
credits_df = credits_df.rename(columns={'movie_id': 'id'})
movies_df_original = movies_df_original.merge(credits_df, on='id')
len(movies_df_original)

In [None]:
movies_df_original.head(3)

## Highest Rating Filtering
Recommend the highest-rated movies, using user ratings on a scale of 10.

In [None]:
mean_rating = movies_df_original['vote_average'].mean()
mean_rating

Exclude movies with fewer votes than the required minimum, defined as the 90th percentile of `vote_count`.

In [None]:
min_votes = movies_df_original['vote_count'].quantile(0.9)
# Filter first, then copy, so that later column assignments act on an independent frame
movies_df = movies_df_original.loc[movies_df_original['vote_count'] >= min_votes].copy()
len(movies_df)

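The average rating alone is not a reliable measure, since a movie with a high rating but a low number of votes might not be representative. Therefore, IMDB's weighted rating is used: WR = v / (v + m) * R + m / (v + m) * C, where v is the number of votes for the movie, m is the minimum number of votes required, R is the average rating of the movie and C is the mean rating across all movies.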

In [None]:
def weighted_rating(x, m=min_votes, C=mean_rating):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)
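
As a rough illustration of how the weighted rating behaves, the sketch below applies the same formula to two made-up movies. The function name `weighted_rating_value` and all numbers are assumptions chosen for illustration, not values from the dataset.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating: pull R toward the global mean C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not taken from the dataset)
m, C = 1000, 6.0

# A movie rated 9.0 that barely passes the vote threshold
few_votes = weighted_rating_value(v=1200, R=9.0, m=m, C=C)
# A movie rated 8.0 with far more votes
many_votes = weighted_rating_value(v=20000, R=8.0, m=m, C=C)

print(round(few_votes, 3))   # 7.636
print(round(many_votes, 3))  # 7.905
```

Although the first movie has the higher raw average, its score is pulled further toward the global mean because fewer votes support it.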

Compute weighted rating for each movie and print top 10 movies.

In [None]:
movies_df['score'] = movies_df.apply(weighted_rating, axis=1)
movies_df = movies_df.sort_values('score', ascending=False)
movies_df[['original_title', 'vote_count', 'vote_average', 'score']].head(10)

## Content based filtering
Recommend movies whose features (e.g. overview, cast, keywords) are similar to those of other movies that the user liked, as inferred from previous actions or explicit feedback.

### Using movie overview
Use similarity scores between movie overviews to make movie recommendations.

In [None]:
movies_df = movies_df_original.copy()
movies_df['overview'].head(3)

To process the overview text, the Term Frequency-Inverse Document Frequency (TF-IDF) vectors will be computed for each movie.

TF-IDF is the product of:
* TF: the relative frequency of a word in a document (occurrences of the word / total number of words in the document)
* IDF: the log-scaled inverse of the fraction of documents containing the word, log(number of documents / number of documents containing the word)
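
The two definitions above can be checked by hand on a toy corpus. This is a minimal sketch of the textbook formula only; the corpus and the helper `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from these.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "a hero saves the city",
    "a hero falls in love",
    "the city sleeps",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: TF * IDF} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {weight:.4f}")
```

The word "saves" appears in only one document, so it receives the largest weight in the first document, while words shared with other documents are weighted lower.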

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
movies_df['overview'] = movies_df['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_df['overview'])

Use cosine similarity to calculate the similarity score between two movies.

In [None]:
from sklearn.metrics.pairwise import linear_kernel

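# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product (linear_kernel) equals the cosine similarity and is faster to compute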
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
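
The reason a dot product is enough here can be verified on two small vectors: once both vectors are scaled to unit length, their dot product equals their cosine similarity. The vectors and helper names below are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

print(math.isclose(cos_ab, dot_normalized))  # True
```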

Get the top 10 most similar movies given a movie.

In [None]:
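# Map each movie title to its row index. Keep only the first entry for duplicated
# titles so that indices[title] always returns a single index
indices = pd.Series(movies_df.index, index=movies_df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]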

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get movie index with title
    idx = indices[title]

    # Get similarity scores for movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies_df['original_title'].iloc[movie_indices]

In [None]:
get_recommendations('The Dark Knight Rises')
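
The slice `[1:11]` relies on the queried movie being ranked first in its own similarity row. That may not hold if, for example, a movie has an empty overview and therefore an all-zero vector. A possible alternative, sketched here with a hypothetical helper `top_k_similar` and made-up scores, is to exclude the query index explicitly.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k highest scores in sim_row, excluding query_idx."""
    scored = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Made-up similarity row for the movie at index 1
sim_row = [0.1, 1.0, 0.7, 0.0, 0.4]
result = top_k_similar(sim_row, query_idx=1, k=3)
print(result)  # [2, 4, 0]
```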

### Using movie credits, genre and keywords
Use similarity scores between the movies' top actors, director, genres and keywords to make movie recommendations.

In [None]:
from ast import literal_eval

movies_df = movies_df_original.copy()

# Parse features into objects
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(literal_eval)

In [None]:
movies_df[['original_title', 'cast', 'crew', 'keywords', 'genres']].head(3)

Process required features to compute similarity.

In [None]:
# Get the director's name. If no director is listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Return the names of the first 3 elements, or of all elements if there are fewer
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

# Process director, cast, genres and keywords
movies_df['director'] = movies_df['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(get_list)

movies_df[['original_title', 'cast', 'director', 'keywords', 'genres']].head(3)

In [None]:
# Convert strings to lower case and strip spaces
def clean_data(x):
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    # Missing values (e.g. a NaN director) become an empty string
    return ''

# Clean director, cast, genres and keywords
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(clean_data)
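
One likely motivation for stripping spaces is that the word-count vectorizer splits text on whitespace-like boundaries, so two different people sharing a first name would otherwise share a token. The sketch below shows the effect with a hypothetical `tokens` helper and two example names.

```python
def tokens(names, strip_spaces):
    """Return the set of tokens a whitespace tokenizer would see for these names."""
    out = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            out.add(name.replace(" ", ""))
        else:
            out.update(name.split())
    return out

cast_a = ["Johnny Depp"]
cast_b = ["Johnny Galecki"]

shared_raw = tokens(cast_a, strip_spaces=False) & tokens(cast_b, strip_spaces=False)
shared_clean = tokens(cast_a, strip_spaces=True) & tokens(cast_b, strip_spaces=True)

print(shared_raw)    # {'johnny'}
print(shared_clean)  # set()
```

Without stripping, the two casts appear to overlap on "johnny". After stripping, each full name is a single distinct token.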

Combine the features into a single string per movie and convert these strings into a matrix of word counts. Then, compute the cosine similarity.

In [None]:
def combine_features(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
movies_df['mix'] = movies_df.apply(combine_features, axis=1)

from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies_df['mix'])

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

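# Reset index and construct the reverse map from title to row index,
# keeping only the first entry for duplicated titles
movies_df = movies_df.reset_index()
indices = pd.Series(movies_df.index, index=movies_df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]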

In [None]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

## Collaborative filtering
To avoid the scalability and sparsity issues of user-based and item-based collaborative filtering, Singular Value Decomposition (SVD) is used. SVD reduces the dimension of the utility matrix by extracting its latent factors, mapping each user and each item into a latent space of dimension r. The rating that a user would give to a certain movie is predicted from these latent factors, and the Root Mean Square Error (RMSE) is used to evaluate how close the predictions are to the actual ratings.
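
To make the latent-factor idea concrete, here is a minimal sketch of a matrix factorization model trained by stochastic gradient descent, where a rating is predicted as the global mean plus a user bias, an item bias and the dot product of a user vector and an item vector. This is an illustrative simplification, not the implementation used by the `surprise` library; the function `train_mf`, its hyperparameters and the toy ratings are all assumptions.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i with SGD and return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random latent vectors of dimension n_factors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

# Toy (user, item, rating) triples, made up for illustration
toy_ratings = [
    ("ann", "m1", 5.0), ("ann", "m2", 4.0),
    ("bob", "m1", 4.5), ("bob", "m3", 1.0),
    ("cat", "m2", 4.0), ("cat", "m3", 1.5),
]
predict = train_mf(toy_ratings)

train_rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in toy_ratings) / len(toy_ratings)) ** 0.5
print(f"training RMSE: {train_rmse:.3f}")
print(f"predicted rating of ann for m3: {predict('ann', 'm3'):.2f}")
```

The last line predicts a rating for a user and movie pair that never appears in the toy data, which is the purpose of the model.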

In [None]:
from surprise import Dataset, SVD, Reader
from surprise.model_selection import cross_validate

ratings = pd.read_csv('../data/themoviesdataset/ratings_small.csv')
ratings.head()

In [None]:
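# Reader defaults to a 1-5 rating scale; the ratings in this file are assumed to range from 0.5 to 5
reader = Reader(rating_scale=(0.5, 5))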
svd = SVD()
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

The RMSE is around 0.89, which is acceptable (lower is better). The model can now be trained on the full dataset.
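
For reference, the two error measures requested from cross-validation can be computed by hand as follows. The rating values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings and predictions
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```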

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset)

Predict the rating that the user with id 1 would give to the movie with id 302.

In [None]:
svd.predict(1, 302)

## Further work
* For content based and collaborative filtering, only recommend movies with a weighted rating above a certain threshold.
* Create a hybrid model where content based filtering is used to retrieve a list of top movies and then collaborative filtering is used to predict the user's rating for each of them.
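
The hybrid idea in the last bullet could be sketched as follows. The function `hybrid_recommend` and the stand-in callables are hypothetical. In practice, `similar_titles` would wrap something like `get_recommendations` and `predict_rating` would wrap the trained SVD model, and a mapping between the ids of the two datasets would be needed.

```python
def hybrid_recommend(user_id, title, similar_titles, predict_rating, k=10):
    """Retrieve content-based candidates, then rank them by predicted user rating."""
    candidates = similar_titles(title)
    scored = [(t, predict_rating(user_id, t)) for t in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Stand-ins for the content-based and collaborative models (made up for illustration)
def fake_similar_titles(title):
    return ["Movie A", "Movie B", "Movie C"]

fake_predictions = {"Movie A": 3.1, "Movie B": 4.4, "Movie C": 2.0}

def fake_predict_rating(user_id, title):
    return fake_predictions[title]

ranked = hybrid_recommend(1, "Some Movie", fake_similar_titles, fake_predict_rating, k=2)
print(ranked)  # [('Movie B', 4.4), ('Movie A', 3.1)]
```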

## Reference
https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system/