# 2.0 Embeddings Construction

## Reading cleaned dataframes

In [None]:
import pandas as pd

users = pd.read_csv('../data/interim/users.csv')
movies = pd.read_csv('../data/interim/movies.csv')
ratings = pd.read_csv('../data/interim/ratings.csv')
occupations = pd.read_csv('../data/interim/occupations.csv')

## Constructing user embeddings

**Idea**: I will construct the user embeddings with simple one-hot encoding of the `gender`, `occupation`, and `age` columns. I will also keep the `user_id` column so that the embeddings can be joined with the `ratings` dataframe (and, through it, with `movies`).
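
As a quick illustration of the one-hot idea, here is a minimal sketch on a hypothetical toy frame (the values below are made up, not taken from `users.csv`). It uses `pd.get_dummies` rather than the `OneHotEncoder` used later in this notebook, only because it names each indicator column after the value it encodes.

```python
import pandas as pd

# Hypothetical toy users; the real frame is read from users.csv.
toy_users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'gender': ['M', 'F', 'M'],
    'occupation': ['writer', 'artist', 'writer'],
})

# Each distinct value becomes its own 0/1 column, named "<column>_<value>".
one_hot = pd.get_dummies(toy_users[['gender', 'occupation']], dtype=int)

# Keep user_id next to the indicators so the result can be joined later.
toy_embeddings = pd.concat([toy_users[['user_id']], one_hot], axis=1)
print(toy_embeddings)
```

Every row has exactly one `1` per encoded column group, which is what makes the representation "one-hot".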

The `zip-code` column contains too many unique values, and their distribution is nearly uniform, so I will not use it.

The `age` column also holds many distinct integers. To turn it into a usable categorical feature, we can split all users into 6 half-open age bins (lower bound included, upper bound excluded):
- 0: 0-18
- 1: 18-25
- 2: 25-35
- 3: 35-45
- 4: 45-60
- 5: 60-75

These bins are meant to separate age groups whose movie-watching habits are likely to differ.
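
The binning can also be sketched with `pd.cut`, as an alternative to a hand-written lambda. This is only an illustration on hypothetical ages; the edges and column names mirror the bins listed above, and `right=False` is what makes each bin include its lower bound and exclude its upper bound.

```python
import pandas as pd

ages = pd.Series([7, 18, 24, 25, 44, 60, 73])  # hypothetical ages
edges = [0, 18, 25, 35, 45, 60, 75]
labels = ['age0_18', 'age18_25', 'age25_35', 'age35_45', 'age45_60', 'age60_75']

# right=False gives half-open bins [lower, upper), so 18 falls into age18_25.
bins = pd.cut(ages, bins=edges, labels=labels, right=False)

# One indicator column per bin, including bins with no users in this sample.
age_one_hot = pd.get_dummies(bins, dtype=int)
print(age_one_hot)
```

Note that an age of 75 or more falls outside every bin, so `pd.cut` returns `NaN` for it and the corresponding row is all zeros.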

In [None]:
users.head()

In [None]:
gender_cols = ['genderF', 'genderM']
age_cols = ['age0_18', 'age18_25', 'age25_35', 'age35_45', 'age45_60', 'age60_75']

In [None]:
age_mappings = {
    0: [0, 18],
    1: [18, 25],
    2: [25, 35],
    3: [35, 45],
    4: [45, 60],
    5: [60, 75]
}

In [None]:
user_embeddings = pd.DataFrame()

In [None]:
user_embeddings['user_id'] = users['user_id']

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)

user_embeddings[gender_cols] = enc.fit_transform(users['gender'].to_numpy().reshape(-1, 1))
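# NOTE: OneHotEncoder orders its output columns by sorted category value,
# so these column names are only correct if occupations['occupation']
# is listed in that same order.
user_embeddings[occupations['occupation'].tolist()] = enc.fit_transform(users['occupation'].to_numpy().reshape(-1, 1))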

In [None]:
# Splitting age column into 6 one-hot columns, one per half-open bin [lower, upper)
user_embeddings[age_cols] = users['age'].apply(
    lambda x: pd.Series([1 if age_mappings[i][0] <= x < age_mappings[i][1] else 0 for i in range(6)]))

In [None]:
user_embeddings.head()

Here are our user embeddings, which contain all the necessary information. Compared to the original dataframe, we now have 30 columns instead of 5.

In [None]:
user_embeddings.to_csv('../data/processed/user_embeddings.csv', index=False)

## Constructing movie embeddings

In [None]:
movies.head()

Part of the work is already done: the `genre` of each movie is already one-hot encoded. For the `release_year` column I will again use bin splitting. Movie titles will be encoded using BERT embeddings.

In [None]:
year_cols = ['year0_1980', 'year1980_1990', 'year1990_1994', 'year1994', 'year1995', 'year1996',
             'year1997_2000']
year_mappings = {
    0: [0, 1980],
    1: [1980, 1990],
    2: [1990, 1994],
    3: [1994, 1995],
    4: [1995, 1996],
    5: [1996, 1997],
    6: [1997, 2000],
}

In [None]:
movie_embeddings = movies.copy()

In [None]:
movie_embeddings[year_cols] = movies['release_year'].apply(
    lambda x: pd.Series([1 if year_mappings[i][0] <= x < year_mappings[i][1] else 0 for i in range(7)]))
movie_embeddings.drop('release_year', axis=1, inplace=True)

In [None]:
movie_embeddings.head()

### Getting embeddings for movie titles

In [None]:
movie_titles = movie_embeddings['title'].tolist()

In [None]:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
import torch


def bert_text_preparation(text, tokenizer):
    """
    Preprocesses text input in a way that BERT can interpret.
    """
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(indexed_tokens)
    # convert inputs to tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segments_ids])
    return tokens_tensor, segments_tensor

In [None]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens.
    """
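    # gradient calculation is disabled
    with torch.no_grad():
        # obtain hidden states; the second positional argument of
        # BertModel.forward is attention_mask, so the all-ones tensor
        # is passed under that name explicitly
        outputs = model(tokens_tensor, attention_mask=segments_tensor)
        hidden_states = outputs[2]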
    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)
    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)
    # initialize list to store embeddings
    token_vecs_sum = []
    # "token_embeddings" is a [Y x 13 x 768] tensor
    # (embedding output + 12 encoder layers),
    # where Y is the number of tokens in the sentence
    # loop over tokens in sentence
    for token in token_embeddings:
        # "token" is a [13 x 768] tensor
        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)
    return token_vecs_sum
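
To make the shapes in the function above easier to follow, here is a minimal sketch of the same pooling steps, with random NumPy arrays standing in for the BERT hidden states (an assumption made so the example runs without downloading a model). It sums the last four layers for each token and then sums over tokens, which is how a single title vector is obtained in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)
# 13 = embedding output + 12 encoder layers of bert-base
n_layers, n_tokens, hidden = 13, 5, 768

# Stand-in for the hidden states: one [1, n_tokens, hidden] array per layer.
hidden_states = [rng.normal(size=(1, n_tokens, hidden)) for _ in range(n_layers)]

stacked = np.stack(hidden_states, axis=0)  # [13, 1, n_tokens, hidden]
stacked = np.squeeze(stacked, axis=1)      # [13, n_tokens, hidden]
per_token = stacked[-4:].sum(axis=0)       # last four layers summed: [n_tokens, hidden]
title_vector = per_token.sum(axis=0)       # summed over tokens: [hidden]

# The same result with an explicit per-token loop, as in the function above.
looped = sum(tok[-4:].sum(axis=0) for tok in stacked.transpose(1, 0, 2))
print(title_vector.shape)
```

Summing over tokens (rather than averaging) means longer titles produce vectors with larger norms; averaging would be one possible alternative if that turns out to matter.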

In [None]:
bert_embeddings_list = [
    sum(get_bert_embeddings(*bert_text_preparation(title, tokenizer), model)).numpy()
    for title in movie_titles
]

In [None]:
bert_embeddings = pd.DataFrame(bert_embeddings_list, columns=[f'bert{i}' for i in range(768)])

In [None]:
bert_embeddings.head()

In [None]:
movie_embeddings = pd.concat([movie_embeddings, bert_embeddings], axis=1)

In [None]:
movie_embeddings.drop('title', axis=1, inplace=True)

In [None]:
movie_embeddings.head()

## Joining embeddings with ratings

In [None]:
ratings.head()

In [None]:
ratings_embeddings = ratings.merge(user_embeddings, on='user_id', how='left')
ratings_embeddings = ratings_embeddings.merge(movie_embeddings, on='item_id', how='left')
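
The two left joins can be sanity-checked on a toy example. The sketch below uses hypothetical frames and column values; `validate='many_to_one'` is an optional `merge` argument that raises an error if the right-hand key is duplicated, which would otherwise silently multiply rating rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real ones.
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 11, 10], 'rating': [4, 3, 5]})
toy_user_emb = pd.DataFrame({'user_id': [1, 2], 'genderF': [0, 1], 'genderM': [1, 0]})
toy_movie_emb = pd.DataFrame({'item_id': [10, 11], 'Action': [1, 0], 'Comedy': [0, 1]})

joined = (
    toy_ratings
    .merge(toy_user_emb, on='user_id', how='left', validate='many_to_one')
    .merge(toy_movie_emb, on='item_id', how='left', validate='many_to_one')
)

# A left join keeps every rating row; NaN would signal a missing user or movie.
print(len(joined) == len(toy_ratings), joined.isna().any().any())
print(joined)
```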

In [None]:
ratings_embeddings.drop(['user_id', 'item_id', 'timestamp'], axis=1, inplace=True)

In [None]:
ratings_embeddings.head()

In [None]:
ratings_embeddings.to_csv('../data/processed/ratings_embeddings.csv', index=False)

In [None]:
pd.read_csv('../data/processed/ratings_embeddings.csv').head()

## Splitting data into train and test for baseline model

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ratings_embeddings, test_size=0.2, random_state=42, stratify=ratings_embeddings['rating'])

## Training

My idea is to train a model that predicts the rating from the user and movie embeddings, and then to recommend to each user the movies with the highest predicted ratings.
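
The recommendation step could look roughly like the sketch below. It is a variant of the idea above, not the method used in this notebook: instead of the hard predicted class, it ranks candidate movies by the expected rating computed from class probabilities. The probability matrix and movie ids are hypothetical; in practice they would come from a fitted classifier's `predict_proba` and `classes_`.

```python
import numpy as np

classes = np.array([1, 2, 3, 4, 5])  # rating labels, as in clf.classes_

# Hypothetical predict_proba output: one row per candidate movie for one user.
proba = np.array([
    [0.1, 0.1, 0.2, 0.3, 0.3],  # movie 101
    [0.5, 0.2, 0.1, 0.1, 0.1],  # movie 102
    [0.0, 0.1, 0.1, 0.3, 0.5],  # movie 103
])
candidate_ids = np.array([101, 102, 103])

# Expected rating per movie: sum over classes of probability * rating.
expected = proba @ classes

# Ids of the k movies with the highest expected rating, best first.
k = 2
top_k = candidate_ids[np.argsort(-expected)[:k]]
print(top_k)
```

Ranking by expected rating breaks ties between movies that share the same predicted class, which happens often when there are only five possible ratings.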

As a baseline model I will use a Decision Tree Classifier. It is a reasonable choice because it is easy to interpret, and limiting its depth (`max_depth=5` below) keeps its tendency to overfit in check.

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42, max_depth=5)

clf.fit(train.drop('rating', axis=1), train['rating'])

In [None]:
clf.score(test.drop('rating', axis=1), test['rating'])

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test['rating'], clf.predict(test.drop('rating', axis=1))))

## Plotting tree

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 20))
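# pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(clf, feature_names=train.drop('rating', axis=1).columns.tolist(), fontsize=10)
plt.show()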