The point of this notebook is to build a movie recommender systems based on movie genres and ratings 

Dataset from MovieLens: https://grouplens.org/datasets/movielens/latest/

In [2]:
# load libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Data reading and EDA

In [33]:
def summary(df):
    print (f"shape of data: {df.shape}")
    sum = pd.DataFrame(df.dtypes, columns=['data type'])
    sum['#missing'] = df.isnull().sum().values
    sum['%missing'] = df.isnull().sum().values / len(df)
    sum['unique'] = df.nunique().values

    # add statistics
    desc = pd.DataFrame(df.describe(include='all').transpose())
    sum['mean'] = desc['mean'].values
    sum['std'] = desc['std'].values
    sum['min'] = desc['min'].values
    sum['25%'] = desc['25%'].values
    sum['50%'] = desc['50%'].values
    sum['75%'] = desc['75%'].values
    sum['max'] = desc['max'].values

    return sum

In [49]:
links = pd.read_csv("links.csv")
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [25]:
summary(links).style.background_gradient(cmap='YlOrBr')

shape of data: (9742, 3)


Unnamed: 0,data type,#missing,%missing,unique,mean,std,min,25%,50%,75%,max
movieId,int64,0,0.0,9742,42200.353623,52160.494854,1.0,3248.25,7300.0,76232.0,193609.0
imdbId,int64,0,0.0,9742,677183.898173,1107227.57676,417.0,95180.75,167260.5,805568.5,8391976.0
tmdbId,float64,8,0.000821,9733,55162.123793,93653.481487,2.0,9665.5,16529.0,44205.75,525662.0


there are some missing values for `tmdbld` variable 

In [7]:
summary(movies).style.background_gradient(cmap='YlOrBr')

shape of data: (9742, 3)


Unnamed: 0,data type,#missing,%missing,unique,mean,std,min,25%,50%,75%,max
movieId,int64,0,0.0,9742,42200.353623,52160.494854,1.0,3248.25,7300.0,76232.0,193609.0
title,object,0,0.0,9737,,,,,,,
genres,object,0,0.0,951,,,,,,,


In [8]:
summary(ratings).style.background_gradient(cmap='YlOrBr')

shape of data: (100836, 4)


Unnamed: 0,data type,#missing,%missing,unique,mean,std,min,25%,50%,75%,max
userId,int64,0,0.0,610,326.127564,182.618491,1.0,177.0,325.0,477.0,610.0
movieId,int64,0,0.0,9724,19435.295718,35530.987199,1.0,1199.0,2991.0,8122.0,193609.0
rating,float64,0,0.0,10,3.501557,1.042529,0.5,3.0,3.5,4.0,5.0
timestamp,int64,0,0.0,85043,1205946087.368469,216261035.995132,828124615.0,1019123866.0,1186086662.0,1435994144.5,1537799250.0


lowest rating is 0.5, highest rating is 5 across all movies

In [9]:
summary(tags).style.background_gradient(cmap='YlOrBr')

shape of data: (3683, 4)


Unnamed: 0,data type,#missing,%missing,unique,mean,std,min,25%,50%,75%,max
userId,int64,0,0.0,58,431.149335,158.472553,2.0,424.0,474.0,477.0,610.0
movieId,int64,0,0.0,1572,27252.013576,43490.558803,1.0,1262.5,4454.0,39263.0,193565.0
tag,object,0,0.0,1589,,,,,,,
timestamp,int64,0,0.0,3411,1320031966.823785,172102450.437126,1137179352.0,1137521216.0,1269832564.0,1498456765.5,1537098603.0


Insights from the summary of these files:
1. Not much missing values. Great !
2. It seems like movie id is what we will be using to connect these tables
3. There were quite alot of duplicates for movie Id, for now I will just drop them


In [50]:
# drop duplicates
movies.drop_duplicates(subset=['movieId'], inplace=True)
ratings.drop_duplicates(subset=['userId', 'movieId'], inplace=True)
tags.drop_duplicates(subset=['userId', 'movieId', 'tag'], inplace=True)

## Feature Engineering

In [None]:
# Extract the year of release of the movie and create a new column for it
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)

# Converet genres into a list of genres
movies['genres'] = movies['genres'].apply(lambda x: x.split(" | "))

# Create a new df for movie ratings, containing the movieId and its average rating
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.columns = ['movieId', 'average_rating']

In [53]:
movies_df = movies.merge(average_ratings, on='movieId')
movies_df.head()

Unnamed: 0,movieId,title,genres,year,average_rating
0,1,Toy Story (1995),[Adventure|Animation|Children|Comedy|Fantasy],1995,3.92093
1,2,Jumanji (1995),[Adventure|Children|Fantasy],1995,3.431818
2,3,Grumpier Old Men (1995),[Comedy|Romance],1995,3.259615
3,4,Waiting to Exhale (1995),[Comedy|Drama|Romance],1995,2.357143
4,5,Father of the Bride Part II (1995),[Comedy],1995,3.071429


I will just use a [jaccard similarity test](https://en.wikipedia.org/wiki/Jaccard_index) as I think using a ML model is an overkill for this simple project.

In [62]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1) + len(set2) - intersection 
    return intersection / union

# genre and rating based recommendations
def recommend_by_genres_and_ratings(genres, movies_df, top_n=10):
    input_genres = set(genres)
    movies_df['similarity'] = movies_df['genres'].apply(lambda x: jaccard_similarity(input_genres, set(x)))
    return movies_df.sort_values(by=['similarity', 'average_rating'], ascending=[False, False]).head(top_n)

## Let's test !

In [74]:
try:
    input_movie = 'Up'
    input_genres = movies_df[movies_df['title'].str.contains(input_movie)]['genres'].iloc[0]
    recommendations = recommend_by_genres_and_ratings(input_genres, movies_df)
    print(f'Movie: {input_movie}, Genre: {input_genres}\n')
    print("Recommended Movies:")
    print(recommendations[['title', 'genres', 'average_rating']])

# If the movie does not exists in the original list
except ValueError and IndexError:
    print ("There are no related movies !")

Movie: Up, Genre: ['Drama|Romance']

Recommended Movies:
                                                  title           genres  \
2232  Man and a Woman, A (Un homme et une femme) (1966)  [Drama|Romance]   
2317                              Sandpiper, The (1965)  [Drama|Romance]   
3499  Moscow Does Not Believe in Tears (Moskva sleza...  [Drama|Romance]   
3802                                        Rain (2001)  [Drama|Romance]   
4103         Cruel Romance, A (Zhestokij Romans) (1984)  [Drama|Romance]   
4245                                   Lady Jane (1986)  [Drama|Romance]   
4667                                   Jane Eyre (1944)  [Drama|Romance]   
5417                             Mr. Skeffington (1944)  [Drama|Romance]   
2878  Affair of Love, An (Liaison pornographique, Un...  [Drama|Romance]   
4946  Happy Together (a.k.a. Buenos Aires Affair) (C...  [Drama|Romance]   

      average_rating  
2232            5.00  
2317            5.00  
3499            5.00  
3802          

## Next Steps

That's it ! It's just a weekend project so I am not spending that much time on it. Some of next steps worth considering if you want to expand it include:
1. Use the larger/full dataset.
2. Do content-based filtering such as director, actors or other relevant keyword. Can use natural language techniques like TF-IDF.
3. Do collaborative filtering, using past behaviour of users (their ratings or interactions) to make personalized recommendations.
4. Combine content-based filtering and collaborative filtering (Hybrid systems).

One great example is to understand how [Netflix's recommendation system](https://help.netflix.com/en/node/100639) works, which could really gives you an idea on how real-world recommendation system works.