# Movie Recommendation System using Genre Embeddings and Vector DB

#### This Colab notebook illustrates how to build a recommendation system from genre embeddings that are learned by a Neural Network and stored in a Vector DB. For a deeper understanding, refer to this [paper](https://arxiv.org/pdf/2309.08787), and for the whole story, refer to my [blog](https://colab.research.google.com/drive/1B6I5SEXzuuEVaHcy4IwaJlrMy8wJfPSx?usp=sharing).

### Genre Spectrum Embeddings

The Genre Spectrum approach combines a movie's genres and other characteristics into initial embeddings that describe the movie's content. These embeddings are then used as input to train a Deep Learning model, and the activations at its penultimate layer are taken as the `Genre Spectrum embeddings`.

These embeddings serve two purposes: they can be fed directly into a classification model for genre classification, or they can be stored in a VectorDB. Storing the embeddings in a VectorDB makes retrieval and similarity search for recommendations efficient at a later stage.


### Installing the relevant dependencies

In [None]:
!pip install torch scikit-learn lancedb nltk gensim scipy==1.12 kaggle


## Kaggle Configuration and Data Needs

We use a movie metadata dataset hosted on Kaggle. To download it for our recommendation system, we need a `kaggle.json` file containing our credentials.

You can download the `kaggle.json` file from your Kaggle account settings by following these steps:

1. Go to Kaggle and log in to your account.
2. Click your profile picture in the top right corner of the page and select `Account` from the dropdown menu.
3. Scroll down to the `API` section and click `Create New API Token`. This downloads a file named `kaggle.json` to your computer.

Once you have the `kaggle.json` file, upload it to the Colab file space. Then run the following code to set up the credentials and download the dataset into the `data` directory.

In [None]:
import json
import os

# Assuming kaggle.json is uploaded to the current directory
with open('kaggle.json') as f:
    kaggle_credentials = json.load(f)

os.environ['KAGGLE_USERNAME'] = kaggle_credentials['username']
os.environ['KAGGLE_KEY'] = kaggle_credentials['key']

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

# Initialize the Kaggle API
api = KaggleApi()
api.authenticate()

# Specify the dataset you want to download
dataset = 'rounakbanik/the-movies-dataset'
destination = 'data/'

# Create the destination directory if it doesn't exist
os.makedirs(destination, exist_ok=True)

# Download the dataset
api.dataset_download_files(dataset, path=destination, unzip=True)

print(f"Dataset {dataset} downloaded to {destination}")

Dataset URL: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
Dataset rounakbanik/the-movies-dataset downloaded to data/


### Training a `Doc2Vec` model and building the Vocab

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from nltk import word_tokenize
from torch.utils.data import DataLoader, TensorDataset
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tqdm import tqdm

import ast
import nltk
nltk.download('punkt')

# Read data from CSV file
movie_data = pd.read_csv('data/movies_metadata.csv', low_memory=False)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def preprocess_data(movie_data_chunk):
    tagged_docs = []
    valid_indices = []
    movie_info = []

    # Show a progress bar while iterating over the rows
    for i, row in tqdm(movie_data_chunk.iterrows(), total=len(movie_data_chunk)):
        try:
            # Build the movie text from its genres and title;
            # ast.literal_eval parses the genres string without executing code
            genres = ', '.join([genre['name'] for genre in ast.literal_eval(row['genres'])])
            movies_text = "Genres: " + genres + '\n'
            movies_text += "Title: " + row['title'] + '\n'
            tagged_docs.append(TaggedDocument(words=word_tokenize(movies_text.lower()), tags=[str(i)]))
            valid_indices.append(i)
            movie_info.append((row['title'], genres))
        except Exception:
            # Skip rows with malformed genres or a missing title
            continue

    return tagged_docs, valid_indices, movie_info

def train_doc2vec_model(tagged_data, num_epochs=10):
    # Initialize Doc2Vec model
    doc2vec_model = Doc2Vec(vector_size=100, min_count=2, epochs=num_epochs)
    doc2vec_model.build_vocab(tqdm(tagged_data, desc="Building Vocabulary"))
    # A single train() call already runs `epochs` passes over the corpus,
    # so it should not be wrapped in another per-epoch loop
    doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

    return doc2vec_model

# Preprocess the whole dataset in chunks of 1000 movies and collect the tagged documents, indices, and genres
chunk_size = 1000
tagged_data = []
valid_indices = []
movie_info = []
for chunk_start in range(0, len(movie_data), chunk_size):
    movie_data_chunk = movie_data.iloc[chunk_start:chunk_start+chunk_size]
    chunk_tagged_data, chunk_valid_indices, chunk_movie_info = preprocess_data(movie_data_chunk)
    tagged_data.extend(chunk_tagged_data)
    valid_indices.extend(chunk_valid_indices)
    movie_info.extend(chunk_movie_info)

doc2vec_model = train_doc2vec_model(tagged_data)
doc2vec_model.save("doc2vec_model")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|██████████| 1000/1000 [00:00<00:00, 1027.42it/s]
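
The `genres` column in `movies_metadata.csv` stores each movie's genres as the string form of a Python list of dictionaries. As a minimal sketch, the parsing step can be isolated into a small helper built on `ast.literal_eval`, which only accepts literals and therefore does not execute arbitrary code the way `eval` can. The function name `parse_genres` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import ast


def parse_genres(raw_value):
    """Return the list of genre names stored in one `genres` cell.

    Hypothetical helper: malformed or non-string cells yield an empty list.
    """
    if not isinstance(raw_value, str):
        return []
    try:
        parsed = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))        # ['Animation', 'Comedy']
print(parse_genres("not a list"))  # []
print(parse_genres(float("nan")))  # []
```

Returning an empty list instead of raising keeps the row-skipping decision in one place: the caller can decide whether a movie without genres should be dropped.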

### Training a Neural Network to Predict the Genres for a given Movie!

In [None]:
import ast

# Extract genre labels for the valid indices
genres_list = []
for i in valid_indices:
    row = movie_data.loc[i]
    # ast.literal_eval parses the genres string without executing code
    genres = [genre['name'] for genre in ast.literal_eval(row['genres'])]
    genres_list.append(genres)

mlb = MultiLabelBinarizer()
genre_labels = mlb.fit_transform(genres_list)

embeddings = []
for i in valid_indices:
    embeddings.append(doc2vec_model.dv[str(i)])
X_train, X_test, y_train, y_test = train_test_split(embeddings, genre_labels, test_size=0.2, random_state=42)

X_train_np = np.array(X_train, dtype=np.float32)
y_train_np = np.array(y_train, dtype=np.float32)
X_test_np = np.array(X_test, dtype=np.float32)
y_test_np = np.array(y_test, dtype=np.float32)

X_train_tensor = torch.tensor(X_train_np)
y_train_tensor = torch.tensor(y_train_np)
X_test_tensor = torch.tensor(X_test_np)
y_test_tensor = torch.tensor(y_test_np)

class GenreClassifier(nn.Module):
    def __init__(self, input_size, output_size):
        super(GenreClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.fc3 = nn.Linear(256, 128)
        self.bn3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)  # Adjust the dropout rate as needed

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc3(x)
        x = self.bn3(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc4(x)
        return x

# Move model to the selected device
model = GenreClassifier(input_size=100, output_size=len(mlb.classes_)).to(device)

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 50
batch_size = 64

train_dataset = TensorDataset(X_train_tensor.to(device), y_train_tensor.to(device))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)  # Move data to device
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f'Epoch [{epoch + 1}/{epochs}], Loss: {epoch_loss:.4f}')

Epoch [1/50], Loss: 0.1100
Epoch [2/50], Loss: 0.0473
Epoch [3/50], Loss: 0.0396
Epoch [4/50], Loss: 0.0352
Epoch [5/50], Loss: 0.0326
Epoch [6/50], Loss: 0.0308
Epoch [7/50], Loss: 0.0293
Epoch [8/50], Loss: 0.0278
Epoch [9/50], Loss: 0.0266
Epoch [10/50], Loss: 0.0256
Epoch [11/50], Loss: 0.0253
Epoch [12/50], Loss: 0.0245
Epoch [13/50], Loss: 0.0232
Epoch [14/50], Loss: 0.0229
Epoch [15/50], Loss: 0.0227
Epoch [16/50], Loss: 0.0222
Epoch [17/50], Loss: 0.0213
Epoch [18/50], Loss: 0.0212
Epoch [19/50], Loss: 0.0203
Epoch [20/50], Loss: 0.0197
Epoch [21/50], Loss: 0.0193
Epoch [22/50], Loss: 0.0199
Epoch [23/50], Loss: 0.0188
Epoch [24/50], Loss: 0.0184
Epoch [25/50], Loss: 0.0182
Epoch [26/50], Loss: 0.0183
Epoch [27/50], Loss: 0.0175
Epoch [28/50], Loss: 0.0172
Epoch [29/50], Loss: 0.0169
Epoch [30/50], Loss: 0.0169
Epoch [31/50], Loss: 0.0168
Epoch [32/50], Loss: 0.0165
Epoch [33/50], Loss: 0.0163
Epoch [34/50], Loss: 0.0160
Epoch [35/50], Loss: 0.0160
Epoch [36/50], Loss: 0.0158
E
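
`nn.BCEWithLogitsLoss` applies a sigmoid to each raw output and then averages the binary cross-entropy over every (movie, genre) pair, which is why the last layer of the network has no activation of its own. The plain-Python sketch below reproduces that computation for a tiny batch so the loss values printed during training are easier to interpret. The function name and all numbers are made up for illustration.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (sample, label) pairs, computed from logits.

    Uses the numerically stable form max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, t in zip(row_logits, row_targets):
            total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# Two movies, three genres (multi-hot targets); values are made up
logits = [[2.0, -1.0, 0.0], [-3.0, 4.0, -2.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(f"{bce_with_logits(logits, targets):.4f}")
```

A logit of 0 for a positive label contributes `log(2)` to the sum, so a mean loss far below that indicates that most genre slots are being predicted confidently and correctly.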

### Testing the `model` to see whether it can predict the genres of the movies in the test dataset

In [None]:
from sklearn.metrics import f1_score

model.eval()
with torch.no_grad():
    X_test_tensor, y_test_tensor = X_test_tensor.to(device), y_test_tensor.to(device)  # Move test data to device
    outputs = model(X_test_tensor)
    test_loss = criterion(outputs, y_test_tensor)
    print(f'Test Loss: {test_loss.item():.4f}')


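# The model returns raw logits, so convert them to probabilities before thresholding
probabilities = torch.sigmoid(outputs)

# One threshold per genre (all 0.5 here; they can be tuned per genre)
thresholds = [0.5] * len(mlb.classes_)
thresholds_tensor = torch.tensor(thresholds, device=device).unsqueeze(0)

# Convert the probabilities to binary predictions using the per-genre thresholds
predicted_labels = (probabilities > thresholds_tensor).cpu().numpy().astype(int)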

# Convert binary predictions and actual labels to multi-label format
predicted_multilabels = mlb.inverse_transform(predicted_labels)
actual_multilabels = mlb.inverse_transform(y_test_np)

# Print the Predicted and Actual Labels for each movie
for i, (predicted, actual) in enumerate(zip(predicted_multilabels, actual_multilabels)):
    print(f'Movie {i+1}:')
    print(f'    Predicted Labels: {predicted}')
    print(f'    Actual Labels: {actual}')


# Compute F1-score
f1 = f1_score(y_test_np, predicted_labels, average='micro')
print(f'F1-score: {f1:.4f}')

# Saving the trained model
torch.save(model.state_dict(), 'trained_model.pth')

Streaming output truncated to the last 5000 lines.
    Actual Labels: ('Comedy', 'Romance')
Movie 7427:
    Predicted Labels: ('Romance',)
    Actual Labels: ('Romance', 'TV Movie')
Movie 7428:
    Predicted Labels: ('Comedy', 'Romance')
    Actual Labels: ('Comedy', 'Romance')
Movie 7429:
    Predicted Labels: ('Comedy', 'Drama')
    Actual Labels: ('Comedy', 'Drama')
Movie 7430:
    Predicted Labels: ('Action', 'Crime', 'Drama', 'Thriller')
    Actual Labels: ('Action', 'Crime', 'Drama', 'Thriller')
Movie 7431:
    Predicted Labels: ('Comedy',)
    Actual Labels: ('Comedy',)
Movie 7432:
    Predicted Labels: ('Action', 'Comedy', 'Crime')
    Actual Labels: ('Action', 'Comedy', 'Crime')
Movie 7433:
    Predicted Labels: ()
    Actual Labels: ('Comedy',)
Movie 7434:
    Predicted Labels: ('Horror',)
    Actual Labels: ('Horror',)
Movie 7435:
    Predicted Labels: ('Comedy',)
    Actual Labels: ('Comedy',)
Movie 7436:
    Predicted Labels: ('Drama', 'Fantasy')
    Actual L
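
Micro-averaged F1, the metric requested with `f1_score(..., average='micro')`, pools true positives, false positives, and false negatives across all genres before computing the score. A small hand-rolled version, shown here as a sketch with made-up label matrices, makes that pooling explicit.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-hot label matrices (lists of 0/1 rows)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Three movies, columns = (Comedy, Drama, Romance); values are made up
y_true = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
print(micro_f1(y_true, y_pred))  # 2 TP, 0 FP, 2 FN, so F1 = 4 / 6
```

Because the counts are pooled, frequent genres such as Drama and Comedy dominate the score; a macro average would weight every genre equally instead.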

In [None]:
def test_model(movie_descriptions, doc2vec_model, model, mlb, top_n=3):
    # Tokenize each description and infer its Doc2Vec embedding
    embeddings = [doc2vec_model.infer_vector(word_tokenize(desc.lower())) for desc in movie_descriptions]
    X_test_np = np.array(embeddings, dtype=np.float32)
    X_test_tensor = torch.tensor(X_test_np).to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(X_test_tensor)

    # Select the top_n genres with the highest scores
    # (sorting the logits gives the same order as sorting the probabilities)
    top_n_indices = np.argsort(-outputs.cpu().numpy(), axis=1)[:, :top_n]
    predicted_genres = mlb.classes_[top_n_indices]

    return predicted_genres


# Example movie descriptions to test the model
example_movie_descriptions = [
    "A boy discovers some powers and ready to conquer the world",
    "Ailens comes from the future and destorying the earth",
    "An amazing comedy movie with an unexpected turn of events",
    "A gas station becomes the center of social life in the village after six Swedish girls start working there.",
]

test_model(example_movie_descriptions, doc2vec_model, model, mlb)

array([['Comedy', 'Drama', 'Documentary'],
       ['Comedy', 'Drama', 'Adventure'],
       ['Comedy', 'Drama', 'Horror'],
       ['Drama', 'Comedy', 'War']], dtype=object)
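
Because the sigmoid is monotonic, ranking genres by raw logit gives the same order as ranking them by probability, so the network outputs can be sorted directly. The sketch below illustrates this with hypothetical genre names and made-up logits; `top_n_genres` is an illustrative helper, not part of the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def top_n_genres(scores, classes, n=3):
    """Return the n class names with the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [classes[i] for i in ranked[:n]]


classes = ["Action", "Comedy", "Drama", "Horror"]  # hypothetical label order
logits = [-1.2, 2.3, 0.7, -3.0]                    # made-up network outputs
probabilities = [sigmoid(x) for x in logits]

print(top_n_genres(logits, classes, n=2))         # ['Comedy', 'Drama']
print(top_n_genres(probabilities, classes, n=2))  # same ranking
```

Note that a top-N rule always returns N genres, even when every probability is low, so it says nothing about how confident the model actually is.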

### Extracting the `genre_embeddings` to use it for our `movie_recommendation` system

The `genre_embeddings` are taken from the `penultimate` layer of our Neural Network. The practical intuition is that if the classification model can predict the `genres` of different movies, it has likely learned how `movies` relate to one another, which is what we need for our `movie_recommendation` system.

Refer to [blog](https://colab.research.google.com/drive/1B6I5SEXzuuEVaHcy4IwaJlrMy8wJfPSx?usp=sharing) to understand more about it.
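
To make "penultimate layer" concrete, here is a toy forward pass in plain Python. The weights, layer sizes, and function names are made up for illustration; the point is only that the hidden activations feeding the final classification layer are what get stored as the embedding, while the logits are what the classifier uses to predict genres.

```python
def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def forward(x, params):
    """Return (penultimate_activations, logits) for a two-layer toy network."""
    hidden = relu(linear(x, params["w1"], params["b1"]))  # penultimate layer
    logits = linear(hidden, params["w2"], params["b2"])   # classification head
    return hidden, logits


# Made-up parameters: 3 input features -> 2 hidden units -> 2 genres
params = {
    "w1": [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0], [0.5, 0.5]],
    "b2": [0.0, 0.0],
}

embedding, logits = forward([1.0, 2.0, 3.0], params)
print(embedding)  # the 2-dimensional vector that would go into the vector DB
print(logits)     # the scores used for genre classification
```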

In [None]:
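def extract_genre_embeddings(model, X_data):
    model.eval()
    with torch.no_grad():
        # Follow the same path as forward() (dropout is a no-op in eval mode)
        # and stop at the output of fc3, the 128-dimensional penultimate layer
        x = model.relu(model.bn1(model.fc1(X_data.to(device))))
        x = model.relu(model.bn2(model.fc2(x)))
        embeddings = model.fc3(x)
    return embeddings.cpu().numpy()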

train_embeddings = extract_genre_embeddings(model, X_train_tensor)
test_embeddings = extract_genre_embeddings(model, X_test_tensor)

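# Combine training and test data.
# train_test_split shuffled the rows, so recover the matching movie indices by
# repeating the split with the same test_size and random_state
idx_train, idx_test = train_test_split(valid_indices, test_size=0.2, random_state=42)
all_indices = idx_train + idx_test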
all_embeddings = np.concatenate((train_embeddings, test_embeddings), axis=0)
all_genres = np.concatenate((y_train_np, y_test_np), axis=0)

# Create a dataframe
movie_embeddings_df = pd.DataFrame({
    'movie_index': all_indices,
    'title': [movie_data.loc[idx, 'title'] for idx in all_indices],
    'genre_embeddings': [list(embeddings) for embeddings in all_embeddings],  # Convert each array of embeddings to a list
    'genre_labels': [mlb.classes_[labels.nonzero()[0]] for labels in all_genres]
})

# Save the data as a csv file
movie_embeddings_df.to_csv("movie_embeddings.csv", index=False)

### Storing the `genre_embeddings` to `LanceDB` Vector Database

In [None]:
import ast
import lancedb
import pandas as pd
from lancedb.pydantic import LanceModel, Vector

data = pd.read_csv("movie_embeddings.csv")
data.drop(columns=["movie_index"], inplace=True)

movie_data = []
for index, row in data.iterrows():
    embedding_vector = ast.literal_eval(row["genre_embeddings"])
    movie_data.append(
        {
            "title": row['title'],
            "embeddings": embedding_vector,
            "genre_labels": row['genre_labels']
        }
    )

# Define LanceDB model
class Movie(LanceModel):
    title: str
    embeddings: Vector(128)
    genre_labels: str

# Create LanceDB connection
db = lancedb.connect("./db")
movie_table = db.create_table(
    "movies",
    schema=Movie,
    mode="overwrite")
movie_table.add(movie_data)

### Prediction based on the `genre_embeddings`

In [None]:
movie_data_pd = pd.DataFrame(movie_data)
def get_recommendation(title):
    result = (
        movie_table.search(movie_data_pd[movie_data_pd["title"] == title]["embeddings"].values[0]).metric('cosine')
        .limit(5)
        .to_pandas()
    )
    return result

result = get_recommendation("Toy Story")
result[['title']]

| | title |
|---|---|
| 0 | Toy Story |
| 1 | Blood Sisters of Lesbian Sin |
| 2 | Hostile Intentions |
| 3 | Satyricon |
| 4 | Underworld |
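
Conceptually, the LanceDB query above ranks the stored vectors by cosine distance to the query vector. The brute-force sketch below, using made-up titles and 3-dimensional vectors, shows the same idea. It also shows why the queried movie comes back as the first hit (its distance to itself is zero), so in practice one may want to exclude it from the results. The helper names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def recommend(title, catalog, k=2):
    """Return the k nearest titles to `title`, excluding the title itself."""
    query = catalog[title]
    distances = [
        (cosine_distance(query, vector), other)
        for other, vector in catalog.items()
        if other != title
    ]
    return [other for _, other in sorted(distances)[:k]]


# Made-up embeddings for illustration only
catalog = {
    "Movie A": [1.0, 0.0, 0.2],
    "Movie B": [0.9, 0.1, 0.3],
    "Movie C": [0.0, 1.0, 0.0],
    "Movie D": [0.8, 0.0, 0.1],
}
print(recommend("Movie A", catalog))  # ['Movie D', 'Movie B']
```

A vector database performs the same ranking but uses an index to avoid comparing the query against every stored vector.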


### Note: The current Neural Network is intentionally basic and serves primarily as a demonstration. The recommendations it generates are simple and may be wrong, but they can be improved by using a larger, more sophisticated Neural Network within the same workflow.