Let's take a look at the data. First, we'll use ratings to create a collaborative filtering algorithm. Then, we'll use movie metadata to find similar movies.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# low_memory=False reads the file in one pass, avoiding mixed-dtype warnings
movies_metadata = pd.read_csv('./data/movies_metadata.csv', low_memory=False)
ratings = pd.read_csv('./data/ratings_small.csv')

In [None]:
movies_metadata.head()

In [None]:
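# C: mean vote across all movies; m: minimum vote count to qualify (90th percentile)
C = movies_metadata['vote_average'].mean()
m = movies_metadata['vote_count'].quantile(0.9)

# Filter first, then copy, so adding a column later does not touch a view
q_movies = movies_metadata.loc[movies_metadata['vote_count'] >= m].copy()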

In [None]:
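def weighted_rating(x, m=m, C=C):
    """Weighted rating: shrink a movie's average toward C when it has few votes."""
    v = x['vote_count']    # number of votes for this movie
    R = x['vote_average']  # this movie's average vote
    return (v/(v+m) * R) + (m/(m+v) * C)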

q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
q_movies = q_movies.sort_values('score', ascending=False)

q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)
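
To make the effect of the weighting concrete, here is a small standalone sketch of the same formula applied to made-up numbers. The names `weighted_score`, `m_demo`, and `C_demo` are hypothetical and chosen so they do not overwrite the notebook's `m` and `C`; the values are illustrative, not taken from the dataset.

```python
def weighted_score(v, R, m, C):
    """Blend a movie's own average R (from v votes) with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Made-up threshold and global mean, for illustration only
m_demo, C_demo = 50, 6.0

# Same average vote (9.0), very different vote counts
barely_qualifies = weighted_score(v=50, R=9.0, m=m_demo, C=C_demo)
widely_rated = weighted_score(v=5000, R=9.0, m=m_demo, C=C_demo)

print(barely_qualifies, round(widely_rated, 2))  # 7.5 8.97
```

A movie sitting exactly at the vote threshold lands halfway between its own average and the global mean, while a widely rated movie keeps almost all of its own average.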

In [None]:
ratings = pd.read_csv("./data/ratings_small.csv")
print(ratings.head())
print(len(ratings))

In [None]:
print(ratings['userId'].nunique())  # number of distinct users

In [None]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

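# MovieLens-style ratings run from 0.5 to 5.0; Reader() defaults to (1, 5)
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# cross_validate only fits on the folds; refit on all ratings before predicting
svd.fit(data.build_full_trainset())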

In [None]:
# Estimated ratings of user 1 for movieIds 0-99 (ids unseen in training get a baseline estimate)
pred = [svd.predict(1, i).est for i in range(100)]

In [None]:
print(pred)

Problem: new users, and new ratings from existing users, are expected to arrive quickly. It isn't feasible to retrain the entire model each time a new rating comes in. Unfortunately, common implementations of SVD, or even of KNN-based collaborative filtering algorithms, do not support online learning: incorporating a new rating means retraining the whole model.

For this, I'll try to implement the online-updating algorithm presented in: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.8010&rep=rep1&type=pdf
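
To give a feel for what "updating online" can mean, here is a minimal sketch of a plain stochastic gradient step on a single new rating for a dot-product matrix factorization model. This is a generic technique, not the algorithm from the linked paper, and everything in it (`online_update`, the factor matrices `P` and `Q`, the learning rate `lr`, the regularization `reg`, the random data) is a hypothetical stand-in for illustration.

```python
import numpy as np

def online_update(P, Q, user, item, rating, lr=0.05, reg=0.02, n_steps=200):
    """Run a few SGD steps on one (user, item, rating) triple, in place.

    Only row `user` of P and row `item` of Q are touched; all other
    factors stay as they were, so no full retraining is needed.
    """
    for _ in range(n_steps):
        p, q = P[user].copy(), Q[item].copy()
        err = rating - p @ q                 # current prediction error
        P[user] += lr * (err * q - reg * p)  # gradient step on user factors
        Q[item] += lr * (err * p - reg * q)  # gradient step on item factors
    return rating - P[user] @ Q[item]        # remaining error

# Toy factors: 5 users, 7 items, 3 latent dimensions (made-up sizes)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(5, 3))
Q = rng.normal(scale=0.1, size=(7, 3))
P_before, Q_before = P.copy(), Q.copy()

err_before = abs(4.0 - P[2] @ Q[4])
err_after = abs(online_update(P, Q, user=2, item=4, rating=4.0))
print(err_before, err_after)
```

The appeal of this kind of update is that its cost depends only on the number of latent dimensions, not on the size of the ratings table. The trade-off is that the rest of the model never sees the new rating until the next full retrain.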