Let's take a look at the data. First, we'll use ratings to create a collaborative filtering algorithm. Then, we'll use movie metadata to find similar movies.

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies_metadata = pd.read_csv('./data/movies_metadata.csv')
credits = pd.read_csv('./data/credits.csv')
keywords = pd.read_csv('./data/keywords.csv')

# There are IDs in date format

def is_integer(field):
    try:
        int(field)
        return True
    except Exception:
        return False

movies_metadata = movies_metadata[movies_metadata['id'].apply(is_integer)]

movies_metadata = movies_metadata.astype({'id': 'int64'})
movies_metadata = movies_metadata.merge(credits, on='id')
movies_metadata = movies_metadata.merge(keywords, on='id')

ratings = pd.read_csv('./data/ratings_small.csv')
links = pd.read_csv('./data/links.csv')

movies_metadata.head()

  movies_metadata = pd.read_csv('./data/movies_metadata.csv')


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [3]:
print(ratings.head())
print(len(ratings))

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
100004


In [4]:
print(len(ratings[['userId']].drop_duplicates()))

671


Problem: new users or new ratings by existing users are expected to come in quickly. It's not really feasible to retrain the entire model each time a new ratings comes in. Unfortunately, common implementations of SVD of even KNN-based collaborative filtering algorithms do not support online-learning for new ratings without retraining the whole model.

For this, I'll try to implement the online-updating algorithm presented in: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.8010&rep=rep1&type=pdf

The implementation will follow default parameters from the surprise package: https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

I use a sigmoid activation function so values are bounded between 0 and 5 (the minimum and maximum ratings in the dataset). The item bias is used for training but, different from the Surprise implementation, it's not used for inference. That's because movies with a very high average rating (e.g. The Godfather) have large positive biases, therefore they are almost always the top predicted ones. We want to model this bias to better learn the similarity between user and movie embeddings, but in inference mode it seems desirable not to use this bias.

In [5]:
import torch
from torch import nn

class KMF(nn.Module):
    def __init__(self, n_users: int, n_items: int, emb_dim: int, max_score: int):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.user_bias = nn.Parameter(torch.zeros(n_users))
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.item_bias = nn.Parameter(torch.zeros(n_items))
        nn.init.normal_(self.user_emb.weight, 0, 0.1)
        nn.init.normal_(self.item_emb.weight, 0, 0.1)
        self.global_bias = nn.Parameter(torch.zeros(1))
        self.max_score = max_score

    def forward(self, users, items):
        users_emb = self.user_emb(users)
        items_emb = self.item_emb(items)
        users_bias = self.user_bias[users]
        items_bias = self.item_bias[items]
        pred_score = self.max_score * torch.sigmoid(self.global_bias + items_bias + users_bias + (items_emb * users_emb).sum(1))
        return pred_score, users_emb, items_emb, users_bias, items_bias

    @torch.no_grad()
    def inference(self, user, avg_rating: float, threshold: float, vote_counts: torch.Tensor, with_bias=False):
        user_emb = self.user_emb(user)
        items_emb = self.item_emb.weight
        user_bias = self.user_bias[user]
        items_bias = self.item_bias
        if with_bias:
            pred_score = self.max_score * torch.sigmoid(self.global_bias + items_bias +user_bias + (items_emb * user_emb).sum(1))
        else:
            pred_score = self.max_score * torch.sigmoid(self.global_bias + user_bias + (items_emb * user_emb).sum(1))
        return (vote_counts / (vote_counts + threshold) * pred_score) + (threshold / (threshold + vote_counts) * avg_rating)


  return torch._C._cuda_getDeviceCount() > 0


In [6]:
import random
from torch.utils.data import DataLoader


def mse(scores, pred):
    return (scores - pred).pow(2).mean().sqrt()


def get_param_squared_norms(users_emb, items_emb, users_bias, items_bias):
    emb_norms = users_emb.pow(2).sum(1).mean() + items_emb.pow(2).sum(1).mean()
    bias_norms = users_bias.pow(2).mean() + items_bias.pow(2).mean()
    return emb_norms + bias_norms


def setup_weight_decay(model: nn.Module, weight_decay: float):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if len(param.shape) > 1:
            decay.append(param)
        else:
            no_decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0},
    ]


n_users = len(ratings['userId'].unique())
n_items = len(ratings['movieId'].unique())

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = KMF(n_users, n_items, 40, 5)
model.to(device)
# not all movieIds and userId exist in the dataset
movie_id_to_emb = {movie_id: i for i, movie_id in enumerate(ratings['movieId'].unique())}
user_id_to_emb = {user_id: i for i, user_id in enumerate(ratings['userId'].unique())}
ratings_values = ratings['rating'].values
users = ratings['userId'].values
movies = ratings['movieId'].values
dataset = [
    (torch.tensor([value, movie_id_to_emb[movie], user_id_to_emb[user]]).float().to(device))
    for value, movie, user in zip(ratings_values, movies, users)
]

random.shuffle(dataset)

split_point = int(0.8 * len(dataset))
train_dataset = dataset[:split_point]
test_dataset = dataset[split_point:]

train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)
optimizer = torch.optim.Adam(setup_weight_decay(model, weight_decay=0), lr=1e-2)
alpha = 0.1


for epoch in range(20):
    mse_acc = l2_acc = steps = 0
    model.train()
    for i, batch in enumerate(train_loader):
        optimizer.zero_grad()
        scores = batch[:, 0]
        items = batch[:, 1].long()
        users = batch[:, 2].long()
        pred, users_emb, items_emb, users_bias, items_bias = model(users, items)
        mse_loss = mse(scores, pred)
        l2_loss = alpha * get_param_squared_norms(users_emb, items_emb, users_bias, items_bias)
        mse_acc += float(mse_loss)
        l2_acc += float(l2_loss)
        loss = mse_loss + l2_loss
        loss.backward()
        steps += 1
        optimizer.step()
    mse_acc /= steps
    l2_acc /= steps
    print(f"MSE = {mse_acc:.3f} | L2 reg = {l2_acc:.2f}")

    test_loss = test_steps = 0
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            scores = batch[:, 0]
            items = batch[:, 1].long()
            users = batch[:, 2].long()
            pred, *_ = model(users, items)
            mse_loss = mse(scores, pred)
            test_loss += float(mse_loss)
            test_steps += 1
    print(f"Test: MSE = {test_loss / test_steps:.3f}")

MSE = 1.088 | L2 reg = 0.05
Test: MSE = 0.931
MSE = 0.884 | L2 reg = 0.05
Test: MSE = 0.887
MSE = 0.822 | L2 reg = 0.07
Test: MSE = 0.875
MSE = 0.776 | L2 reg = 0.10
Test: MSE = 0.870
MSE = 0.742 | L2 reg = 0.12
Test: MSE = 0.869
MSE = 0.718 | L2 reg = 0.14
Test: MSE = 0.868
MSE = 0.697 | L2 reg = 0.15
Test: MSE = 0.869
MSE = 0.682 | L2 reg = 0.16
Test: MSE = 0.868
MSE = 0.671 | L2 reg = 0.17
Test: MSE = 0.869
MSE = 0.660 | L2 reg = 0.18
Test: MSE = 0.870
MSE = 0.652 | L2 reg = 0.19
Test: MSE = 0.870
MSE = 0.646 | L2 reg = 0.19
Test: MSE = 0.870
MSE = 0.640 | L2 reg = 0.19
Test: MSE = 0.870
MSE = 0.636 | L2 reg = 0.20
Test: MSE = 0.871
MSE = 0.632 | L2 reg = 0.20
Test: MSE = 0.871
MSE = 0.629 | L2 reg = 0.20
Test: MSE = 0.872
MSE = 0.626 | L2 reg = 0.20
Test: MSE = 0.872
MSE = 0.622 | L2 reg = 0.21
Test: MSE = 0.872
MSE = 0.619 | L2 reg = 0.21
Test: MSE = 0.873
MSE = 0.616 | L2 reg = 0.21
Test: MSE = 0.872


In [7]:
emb = model.user_emb.weight.data
print(emb.norm(dim=1).mean())

emb = model.item_emb.weight.data
print(emb.norm(dim=1).mean())

tensor(0.9401)
tensor(0.7996)


Next step: implement steps to add/update users and check if it's too slow

In [28]:
test_user = random.choice(ratings['userId'].unique())

print(f'{test_user = }')
new_ratings = ratings[ratings['userId'] == test_user]

new_items = torch.tensor([movie_id_to_emb[x] for x in new_ratings['movieId'].to_list()]).to(device)
new_ratings_values = torch.tensor(new_ratings['rating'].to_list())
new_user = torch.tensor([user_id_to_emb[x] for x in new_ratings['userId'].to_list()]).to(device)

with torch.no_grad():
    pred, *_ = model(new_user, new_items)
assert len(pred) == len(new_ratings_values)

print(mse(new_ratings_values, pred.cpu()))

test_user = 46
tensor(0.4385)


In [29]:
def cos_sim(a, b):
    a = a / (torch.norm(a, dim=-1, keepdim=True) + 1e-4)
    b = b / (torch.norm(b, dim=-1, keepdim=True) + 1e-4)
    return a @ b.T

new_ratings = ratings[ratings['userId'] == test_user]

new_user_emb = nn.Parameter(torch.zeros(1, 40))
nn.init.normal_(new_user_emb, 0, 0.1)
new_user_bias = nn.Parameter(torch.tensor(0.0))

new_items = torch.tensor([movie_id_to_emb[x] for x in new_ratings['movieId'].to_list()]).to(device)
new_ratings_values = torch.tensor(new_ratings['rating'].to_list())

print(len(new_items))
user_optim = torch.optim.SGD([new_user_emb, new_user_bias], lr=1e-2)
global_bias = model.global_bias.data.cpu()
for i in range(20):
    user_optim.zero_grad()
    items_emb = model.item_emb(new_items)
    items_emb = items_emb.cpu().detach()
    items_bias = model.item_bias[new_items]
    items_bias = items_bias.cpu().detach()

    pred = 5 * torch.sigmoid(model.global_bias.detach() + new_user_bias + items_bias + (items_emb * new_user_emb).sum(1))
    mse_loss = mse(new_ratings_values, pred)
    if i % 5 == 0:
        print(f"loss = {float(mse_loss):.2f}")
    l2_loss = alpha * new_user_emb.pow(2).sum() + new_user_bias.pow(2)
    l2_loss = 0
    loss = mse_loss * len(new_items) + l2_loss
    loss.backward()
    nn.utils.clip_grad_norm_([new_user_emb, new_user_bias], 100)
    user_optim.step()
print(f"final loss = {float(mse_loss):.2f}")


user_emb = model.user_emb(torch.tensor(test_user - 1).to(device)).cpu().flatten()
print(f"cos sim between new user vector and fully-trained one: {cos_sim(new_user_emb[0, :], user_emb):.2f}")

39
loss = 1.72
loss = 0.47
loss = 0.27
loss = 0.21
final loss = 0.20
cos sim between new user vector and fully-trained one: 0.92


Gladly, it's pretty fast and also the cosine similarity between the true user vector and the one trained with this approach is reasonably high. Therefore it seems a valid approach. The entire model is retrained only after a big number of changes accumulate. Meanwhile, weights are only updated using this approach.

Of course, we don't want to recommend all movies equally. We know that some movies have very few ratings, therefore the predictions are not really reliable. Thus, I use a common weighting strategy - the more ratings a movie has, the closer the prediction will be to the output of the model. Movies with very few ratings will use the global average rating instead.

In [14]:
avg_rating = ratings['rating'].mean()
print(avg_rating)
threshold = np.quantile(ratings['movieId'].value_counts().to_numpy(), 0.75)
print(threshold)

3.543608255669773
9.0


In [15]:
vote_counts = ratings['movieId'].value_counts()
vote_counts_tensor = torch.zeros(len(movie_id_to_emb), dtype=torch.long)
for idx, count in zip(vote_counts.index, vote_counts.values):
    idx = movie_id_to_emb[idx]
    vote_counts_tensor[idx] = count

Let's get recommendations for one user

In [30]:
import math

user_emb_to_id = {v: k for k, v in user_id_to_emb.items()}
movie_emb_to_id = {v: k for k, v in movie_id_to_emb.items()}

user_input = torch.tensor([user_id_to_emb[test_user]] * len(movie_id_to_emb))
movies_input = torch.arange(len(movie_id_to_emb))
preds = model.inference(torch.tensor(test_user), avg_rating, threshold, vote_counts_tensor, False)
preds = preds.detach().numpy() 
topk = np.argsort(-preds)[:15]

topk_ids = [movie_emb_to_id[x] for x in topk]


def get_tmdb_ids(rating_ids, links):
    tmdb_ids = []
    for idx in rating_ids:
        values = links[links['movieId'] == idx]['tmdbId'].values
        if len(values) == 0:
            tmdb_ids.append(None)
        elif math.isnan(values[0]):
            tmdb_ids.append(None)
        else:
            tmdb_ids.append(int(values[0]))
    return tmdb_ids


def get_movie_titles(tmbd_ids, movies_metadata):
    titles = []
    for idx in tmbd_ids:
        values = movies_metadata.loc[movies_metadata['id'] == idx]['title'].values
        if len(values) == 0:
            titles.append(None)
        else:
            titles.append(values[0])
    return titles


def get_rated_movies(user_id, ratings, links, movies_metadata):
    rated_movies_ids = ratings[ratings['userId'] == user_id]['movieId'].values
    rated_movies_tmdb_ids = get_tmdb_ids(rated_movies_ids, links)
    return set(rated_movies_tmdb_ids)
    

tmdb_ids = get_tmdb_ids(topk_ids, links)
titles = get_movie_titles(tmdb_ids, movies_metadata)
rated_movies = get_rated_movies(test_user, ratings, links, movies_metadata)
watched = []
for idx in tmdb_ids:
    if idx in rated_movies:
        watched.append("rated")
    else:
        watched.append("new")
print(f"Top recommendations for user_id {test_user}:")
print("\n".join([f"{preds[x]:.2f} -> {title} -- {watched_}" for title, x, watched_ in zip(titles, topk, watched)]))

Top recommendations for user_id 46:
4.17 -> Judge Dredd -- new
4.16 -> Armageddon -- new
4.14 -> Titanic -- new
4.14 -> DragonHeart -- new
4.12 -> Braveheart -- new
4.10 -> The Game -- new
4.10 -> Outbreak -- new
4.10 -> Star Wars: Episode I - The Phantom Menace -- new
4.10 -> American Pie -- new
4.09 -> Stargate -- new
4.09 -> Independence Day -- new
4.09 -> Star Wars: Episode III - Revenge of the Sith -- new
4.08 -> The Patriot -- new
4.07 -> Batman Forever -- new
4.07 -> The Santa Clause -- new


And below are the top rated movies by this user:

In [31]:
def get_top_user_ratings(user_id, ratings, links, movies_metadata, top_k=20):
    user_ratings = ratings[ratings['userId'] == user_id]
    user_ratings = user_ratings.sort_values('rating', ascending=False)
    user_ratings = user_ratings[:top_k]
    tmdb_ids = get_tmdb_ids(user_ratings['movieId'].values, links)
    ratings = user_ratings['rating'].values
    movie_titles = get_movie_titles(tmdb_ids, movies_metadata)
    return [(title, rating) or "NOT FOUND" for title, rating in zip(movie_titles, ratings)]

top_ratings = get_top_user_ratings(test_user, ratings, links, movies_metadata)
print("\n".join(f"{rating} -> {title}" for title, rating in top_ratings))

5.0 -> Les Miserables
5.0 -> Ratatouille
5.0 -> Kindergarten Cop
5.0 -> The Lord of the Rings: The Return of the King
5.0 -> Police Academy: Mission to Moscow
5.0 -> The Bourne Identity
5.0 -> Madagascar
5.0 -> Batman Begins
5.0 -> Blood Diamond
5.0 -> The Dark Knight
5.0 -> The Lord of the Rings: The Fellowship of the Ring
5.0 -> The Visitor
5.0 -> Inception
5.0 -> The Social Network
5.0 -> Harry Potter and the Deathly Hallows: Part 1
5.0 -> Harry Potter and the Deathly Hallows: Part 2
5.0 -> The Avengers
5.0 -> The Dark Knight Rises
5.0 -> The Lord of the Rings: The Two Towers
5.0 -> The Naked Gun: From the Files of Police Squad!


The code below is heavily inspired by the [excellent notebook](https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system/data) by Ibtesam Ahmed. The idea is to use movie metadata to find similar movies.

In [32]:
import json
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(literal_eval)

def get_genres(genres):
    return [x['name'] for x in genres]

print(len(set(x for sublist in movies_metadata['genres'].apply(get_genres) for x in sublist)))

20


In [33]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


# Returns the list top 3 elements or entire list; whichever is more.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [34]:
# Define new director, cast, genres and keywords features that are in a suitable form.
movies_metadata['director'] = movies_metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(get_list)

In [35]:
movies_metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


In [36]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(clean_data)

In [37]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
movies_metadata['soup'] = movies_metadata.apply(create_soup, axis=1)

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies_metadata['soup'])

print(count_matrix.shape)

(46628, 73881)


Let's time how long it takes to find the cosine similarity between two movies

In [39]:
from sklearn.metrics.pairwise import cosine_similarity
import timeit

def f():
    sim = cosine_similarity(count_matrix[0, :], count_matrix)
    return np.argsort(-sim)

%timeit f()

5.26 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Not bad. Let's take a look at similar movies according to this vectorization

In [67]:
chosen_movie = 1346
sim = cosine_similarity(count_matrix[chosen_movie, :], count_matrix)
top_k = np.argsort(-sim.flatten())[0:10]
for k in top_k:
    print(movies_metadata['title'][k])

Star Trek II: The Wrath of Khan
Star Trek III: The Search for Spock
Star Trek V: The Final Frontier
Star Trek VI: The Undiscovered Country
Star Trek IV: The Voyage Home
Star Trek: The Motion Picture
Behind Enemy Lines
The Empire Strikes Back
Star Trek
Bells of Innocence


In [68]:
count_matrix.data.nbytes

2770352

We can use bitwise encoding to encode genders efficiently for extremely fast overlap lookup. Then, movies with overlapping gender can be further processed.

In [69]:
genres = set(x for sublist in movies_metadata['genres'] for x in sublist)

genre_to_int = {genre: 2**i for i, genre in enumerate(genres)}

encoded_genres = movies_metadata['genres'].apply(lambda genres: sum(genre_to_int[x] for x in genres)).to_numpy()

def rank_genres(a, b):
    overlap = np.sum(np.bitwise_and(a, b) != 0)
    return np.argsort(-overlap)

%timeit rank_genres(encoded_genres[10], encoded_genres)

52.9 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


And this structure fits smoothly into memory!

In [70]:
encoded_genres.data.nbytes

373024

Useful resources:

- https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system
- https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.8010&rep=rep1&type=pdf
- https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD