# Experiments with custom NCF model

The goal of this project is to work towards an xDeepFM implementation, one iteration at a time. We start with an SVD baseline and gradually improve on it, eventually adding content-based filtering to form a hybrid recommender.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import torch
from torch import nn
import ncf

In [2]:
TEST_USER = 43708 # ankh

## Training the neural network

In [3]:
# Load wrangled data; the context manager closes the store even if a read fails
with pd.HDFStore('store.h5') as store:
    votes, vn = store['votes'], store['vn']

In [36]:
train_dl, test_dl, num_users, num_books = ncf.get_data(votes)
mf_net = ncf.MatrixFactor('mf', num_users, num_books)
J = []
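
The `ncf` module is not shown in this notebook, so the exact architecture behind `ncf.MatrixFactor` is unknown here. As a rough illustration of what a matrix-factorization network of this kind usually computes, the sketch below scores a (user, item) pair as the dot product of two learned embeddings plus per-user and per-item biases. The class name `MatrixFactorSketch`, the bias terms, and the embedding size of 32 are assumptions made for illustration, not the author's implementation.

```python
import torch
from torch import nn


class MatrixFactorSketch(nn.Module):
    """Hypothetical stand-in for ncf.MatrixFactor (the real class is not shown)."""

    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        # One learned vector and one scalar bias per user and per item
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)

    def forward(self, user: torch.Tensor, item: torch.Tensor) -> torch.Tensor:
        # Row-wise dot product of the user and item vectors
        dot = (self.user_emb(user) * self.item_emb(item)).sum(dim=1)
        bias = self.user_bias(user).squeeze(1) + self.item_bias(item).squeeze(1)
        return dot + bias


# Toy usage: three (user, item) pairs yield three predicted scores
model = MatrixFactorSketch(num_users=100, num_items=50)
users = torch.tensor([0, 1, 2])
items = torch.tensor([10, 20, 30])
scores = model(users, items)
```

The forward signature mirrors the `mf_net(user, unseen_vns)` call used for inference later on: two equal-length index tensors in, one score per pair out.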

In [38]:
J += ncf.fit(mf_net, train_dl, test_dl, nn.MSELoss(), epochs=5)

Epoch 1/5: Loss 0.07

Average Loss: 0.0869

Epoch 2/5: Loss 0.06

Average Loss: 0.0794

Epoch 3/5: Loss 0.06

Average Loss: 0.0733

Epoch 4/5: Loss 0.05

Average Loss: 0.0682

Epoch 5/5: Loss 0.05

Average Loss: 0.0641

100%|██████████| 5/5 [01:18<00:00, 15.66s/it]

The final validation loss (0.0641) is about half that of the SVD baseline, which is a promising start. Now for the fun part: generating the actual recommendations!

## Recommendation sampling


Ideally, the model would now be retrained on the entire dataset (train and test splits combined) to get the best-quality predictions. For now, let's continue with the current model and see how it does.

In [5]:
votes_user_indexed = votes.set_index('user_id')

Let's carry out inference with the neural net:

In [40]:
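# Every VN the test user has not voted on is a recommendation candidate
seen_vns = votes_user_indexed.loc[TEST_USER]
unseen_vns = vn.loc[~vn.index.isin(seen_vns['vn_id'])].index.to_numpy()
unseen_vns = torch.tensor(unseen_vns).cuda()
# Repeat the user id so that it pairs up with each candidate VN
user = np.full(len(unseen_vns), TEST_USER)
user = torch.tensor(user).cuda()

# Switch to inference mode and skip gradient tracking
mf_net.eval()
with torch.no_grad():
    preds = mf_net(user, unseen_vns)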

In [41]:
unseen_vns = unseen_vns.cpu().numpy()
preds = preds.cpu().numpy()
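
One thing worth checking at this point: the inference cell passes raw `user_id` and `vn_id` values straight into the model. If `ncf.get_data` re-indexes ids into contiguous ranges (the fact that it returns `num_users` and `num_books` hints at this, but the module is not shown, so this is only an assumption), the same mapping would have to be applied before inference. A minimal sketch of such a mapping, with the hypothetical helper `build_index` and made-up ids:

```python
def build_index(raw_ids):
    """Map each distinct raw id to a contiguous index 0..n-1, in sorted order."""
    return {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}


# Made-up raw ids as they might appear in a votes table
vn_index = build_index([17, 4, 92, 4])

# Translate raw candidate ids to embedding rows before calling the model
unseen_raw = [92, 17]
unseen_idx = [vn_index[v] for v in unseen_raw]
```

If the model already works on raw ids, no translation is needed and this step can be ignored.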

Compare the resulting prediction distribution with the test user's own rating distribution:

In [None]:
preds_hist = pd.Series(preds)
sns.histplot(preds_hist, kde=True, stat='probability')
sns.histplot(votes_user_indexed.loc[TEST_USER]['vote'], kde=True, stat='probability')
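
Alongside the overlaid histograms, a few summary statistics can make the comparison more concrete. The sketch below uses only the standard library and made-up numbers; in this notebook the inputs would presumably be `preds` and the test user's `vote` column. The helper name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Return the mean, population standard deviation, and quartiles of a sequence."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


# Made-up values standing in for predicted and actual ratings
predicted = [0.55, 0.60, 0.62, 0.70, 0.71]
actual = [0.30, 0.60, 0.80, 0.90, 1.00]

for name, values in (("predicted", predicted), ("actual", actual)):
    print(name, summarize(values))
```

A predicted distribution that is much narrower than the user's actual ratings would suggest the model is regressing towards the mean.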

In [None]:
K = 10

# Sort once, best prediction first, and reuse the order for both ids and scores
order = np.argsort(preds)[::-1]
top = unseen_vns[order]
reccs_data = vn.loc[top, ['title', 'description', 'c_votecount', 'c_rating', 'tags']].copy()
reccs_data['predicted_rating'] = preds[order]

Applying the same filtering as before, keeping only titles with more than 50 votes and a non-missing rating:

In [None]:
mask = (reccs_data['c_votecount'] > 50) & reccs_data['c_rating'].notna()
# Keep the 30 highest predictions that pass the filter, then order them by vote count
reccs_data.loc[mask].head(30).sort_values(by='c_votecount', ascending=False)

## Evaluating recommendation quality

- MAP@K and MAR@K (mean average precision and mean average recall at K)
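
As a starting point for these metrics, here is a minimal pure-Python sketch. It assumes each user has a ranked recommendation list without duplicates and a set of held-out relevant items. Definitions of MAR@K vary between libraries; here it is taken as recall@K averaged over users, which is one common convention and not necessarily the one this project will settle on. All function names are hypothetical.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K: precision at each rank that holds a relevant item, averaged."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    # Normalise by the best achievable number of hits within the top k
    return score / min(len(relevant), k)


def recall_at_k(recommended, relevant, k):
    """Recall@K: share of the relevant items that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)


def mean_over_users(metric, recommendations, relevant_items, k):
    """Average a per-user metric over every user that has recommendations."""
    if not recommendations:
        return 0.0
    total = sum(
        metric(recs, relevant_items.get(user, []), k)
        for user, recs in recommendations.items()
    )
    return total / len(recommendations)


# Toy data: ranked recommendations and held-out relevant items per user
recommendations = {"u1": [1, 2, 3, 4], "u2": [5, 6, 7, 8]}
relevant_items = {"u1": [1, 3], "u2": [9]}

map_at_3 = mean_over_users(average_precision_at_k, recommendations, relevant_items, k=3)
mar_at_3 = mean_over_users(recall_at_k, recommendations, relevant_items, k=3)
```

In this notebook, the ranked list for a user would come from the sorted predictions, and the relevant items would be that user's highly rated titles in the test split; the threshold for "highly rated" is a design choice left open here.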