# Evaluation

## Set up

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Set your root directory below. Make sure the `/data` and `/data_exports` folders are uploaded to this directory.

In [None]:
# Adjust your root directory
root = '/content/drive/MyDrive/KuaiRec/'

## Load Prediction Scores

In [18]:
import pandas as pd

import eval_fns

In [4]:
# Note: this overrides the root set in the set-up cell; keep the one that matches your environment
root = './'

prediction_scores_caption = pd.read_csv(root + 'recommendations/recommendations_caption_test_full.csv')
prediction_scores_ncf = pd.read_csv(root + 'recommendations/final_w_clustering_batch_size512_num_epochs20_lr0.001_embedding_dim64_dropout0.3_decay0.01.csv')
prediction_scores_random = pd.read_csv(root + 'recommendations/recommendations_random_test_full.csv')
joined_train_data = pd.read_csv(root + 'data_exports/joined_train_data.csv')
joined_val_data = pd.read_csv(root + 'data_exports/joined_val_data.csv')
joined_test_data = pd.read_csv(root + 'data_exports/joined_test_data.csv')

joined_train_val_data = pd.concat([joined_train_data, joined_val_data])

video_data = pd.read_csv(root + 'data/kuairec_caption_category_translated.csv', index_col=0)

In [5]:
# Rename
prediction_scores_caption = prediction_scores_caption.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_ncf = prediction_scores_ncf.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_random = prediction_scores_random.rename(columns={'watch_ratio': 'predicted_watch_ratio'})

## Get user watch history

We build each user's watch history so that we can filter out videos the user has already watched and recommend only new ones.

In [6]:
user_watch_history = eval_fns.get_user_watch_history(joined_train_val_data)

# Show 10 videos watched by user 14
list(user_watch_history[14])[:10]

[8195, 8200, 8201, 8204, 8207, 8212, 8213, 8220, 8222, 8228]
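
The implementation of `eval_fns.get_user_watch_history` is not shown in this notebook. As a minimal sketch, assuming it maps each `user_id` to the set of `video_id`s found in the interaction data, a helper along these lines would give a lookup of the same shape (`build_watch_history` and the toy data are hypothetical):

```python
import pandas as pd

def build_watch_history(interactions: pd.DataFrame) -> dict:
    """Hypothetical helper: map each user_id to the set of video_ids they interacted with."""
    # Collect the video_ids of each user into a set, then convert the Series to a dict
    return interactions.groupby('user_id')['video_id'].apply(set).to_dict()

# Toy interactions (not from KuaiRec)
toy = pd.DataFrame({'user_id': [14, 14, 21, 14],
                    'video_id': [8195, 8200, 8195, 8195]})
history = build_watch_history(toy)
```

Sets make the later "has this user seen this video?" check a constant-time lookup and remove duplicate interactions.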

## Getting ground truth videos for each user

Next, we process the test set to obtain the ground-truth watch ratios. We keep only interactions whose user and video both appear in the training and validation data, since we cannot make recommendations for unseen users or videos, and we drop videos that the user has already watched. The remaining rows are sorted by `user_id` in ascending order and by `watch_ratio` in descending order.

In [7]:
# Obtain users and videos in training and validation data
users_in_train_val_data = set(joined_train_val_data['user_id'])
videos_in_train_val_data = set(joined_train_val_data['video_id'])

# Get ground truth user-item combinations and their watch ratios
ground_truth = eval_fns.get_ground_truth(joined_test_data[['user_id', 'video_id', 'watch_ratio']], users_in_train_val_data, videos_in_train_val_data, user_watch_history)
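
`eval_fns.get_ground_truth` is not shown here. The sketch below follows the steps described above under the assumption that the helper does nothing more than filter and sort; `filter_ground_truth` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def filter_ground_truth(test_df, known_users, known_videos, watch_history):
    """Hypothetical stand-in for eval_fns.get_ground_truth."""
    # Keep only users and videos that appear in the training/validation data
    df = test_df[test_df['user_id'].isin(known_users)
                 & test_df['video_id'].isin(known_videos)]
    # Flag (user, video) pairs that are already in the user's watch history
    seen = np.array([v in watch_history.get(u, set())
                     for u, v in zip(df['user_id'], df['video_id'])], dtype=bool)
    df = df[~seen]
    # Sort by user ascending, then by watch_ratio descending
    return df.sort_values(['user_id', 'watch_ratio'], ascending=[True, False])

# Toy data (not from KuaiRec): user 99 is unknown, user 14 has already watched video 3
toy_test = pd.DataFrame({'user_id': [14, 14, 14, 99],
                         'video_id': [1, 2, 3, 1],
                         'watch_ratio': [0.5, 2.0, 1.0, 0.9]})
toy_truth = filter_ground_truth(toy_test, {14}, {1, 2, 3}, {14: {3}})
```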

In [8]:
# Ground truth scores for user 14
ground_truth[ground_truth['user_id'] == 14]

| | user_id | video_id | watch_ratio |
|---|---|---|---|
| 27 | 14 | 4184 | 3.234123 |
| 0 | 14 | 6293 | 2.442865 |
| 184 | 14 | 5954 | 1.899621 |
| 31 | 14 | 10354 | 1.884053 |
| 18 | 14 | 1352 | 1.780083 |
| ... | ... | ... | ... |
| 169 | 14 | 6270 | 0.062602 |
| 113 | 14 | 7736 | 0.060968 |
| 116 | 14 | 10140 | 0.032761 |
| 183 | 14 | 2755 | 0.032283 |


## Getting recommendations for each user

Using the prediction scores generated by our models, we build the video recommendations for each user. We first keep only videos that the user has not watched before, then sort them by predicted watch ratio in descending order.

In [9]:
videos_in_test_data = set(joined_test_data['video_id'])

recommendations_caption = eval_fns.get_user_recommendations(prediction_scores_caption, videos_in_test_data, user_watch_history)
recommendations_ncf = eval_fns.get_user_recommendations(prediction_scores_ncf, videos_in_test_data, user_watch_history)
recommendations_random = eval_fns.get_user_recommendations(prediction_scores_random, videos_in_test_data, user_watch_history)

100%|██████████| 1411/1411 [00:05<00:00, 267.46it/s]
100%|██████████| 1411/1411 [00:04<00:00, 301.95it/s]
100%|██████████| 1411/1411 [00:05<00:00, 273.02it/s]
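
`eval_fns.get_user_recommendations` is not shown in this notebook. As a rough sketch of the filtering and ranking described above, assuming the second argument restricts candidates to videos that occur in the test data (`rank_recommendations` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd

def rank_recommendations(scores, candidate_videos, watch_history):
    """Hypothetical stand-in for eval_fns.get_user_recommendations."""
    # Restrict to candidate videos (assumed role of the second argument)
    df = scores[scores['video_id'].isin(candidate_videos)]
    # Drop videos the user has already watched
    seen = np.array([v in watch_history.get(u, set())
                     for u, v in zip(df['user_id'], df['video_id'])], dtype=bool)
    df = df[~seen]
    # Highest predicted watch ratio first within each user
    return df.sort_values(['user_id', 'predicted_watch_ratio'],
                          ascending=[True, False])

# Toy scores (not from KuaiRec): video 4 is not a candidate, video 1 was already watched
toy_scores = pd.DataFrame({'user_id': [14, 14, 14, 14],
                           'video_id': [1, 2, 3, 4],
                           'predicted_watch_ratio': [0.9, 0.2, 1.5, 0.7]})
toy_recs = rank_recommendations(toy_scores, {1, 2, 3}, {14: {1}})
```

Because the rows are sorted within each user, a later `groupby('user_id').head(k)` would give each user's top k recommendations.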


In [10]:
# Recommendations from NCF model for user 14
recommendations_ncf[recommendations_ncf['user_id'] == 14]

| | user_id | video_id | predicted_watch_ratio | cluster |
|---|---|---|---|---|
| 143915 | 14 | 1306 | 1.224071e+00 | 0 |
| 152254 | 14 | 1352 | 1.217318e+00 | 0 |
| 422868 | 14 | 4719 | 1.121088e+00 | 0 |
| 156289 | 14 | 1379 | 1.109590e+00 | 0 |
| 830672 | 14 | 10404 | 1.068774e+00 | 0 |
| ... | ... | ... | ... | ... |
| 648559 | 14 | 7736 | 1.086311e-07 | 0 |
| 797585 | 14 | 9986 | 8.865571e-08 | 0 |
| 130465 | 14 | 1166 | 2.004953e-08 | 0 |
| 811573 | 14 | 10140 | 1.299631e-09 | 0 |


## Calculation of Evaluation Metrics

We use several evaluation metrics to give a comprehensive view of our models' performance. They fall into three broad categories: Engagement, Relevance, and Diversity.

### Engagement
1. **Average Watch Ratio @ k**: The average watch ratio of the top k videos recommended to each user, averaged over all users.
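
As an illustration for a single user, the sketch below assumes the metric is the mean ground-truth `watch_ratio` of those top k recommended videos that also appear in the user's ground truth. The function name and toy data are hypothetical, and `eval_fns` may compute this differently.

```python
import pandas as pd

def avg_watch_ratio_at_k(recs_user, truth_user, k):
    """Mean observed watch_ratio over one user's top-k recommendations (assumed definition)."""
    # recs_user is assumed to be sorted by predicted_watch_ratio, best first
    top_k = recs_user.head(k)
    # Attach the observed watch_ratio; videos without ground truth are dropped
    merged = top_k.merge(truth_user[['video_id', 'watch_ratio']],
                         on='video_id', how='inner')
    return merged['watch_ratio'].mean()

# Toy data for one user (not from KuaiRec)
toy_recs_user = pd.DataFrame({'video_id': [3, 2, 5],
                              'predicted_watch_ratio': [1.5, 0.9, 0.1]})
toy_truth_user = pd.DataFrame({'video_id': [2, 3, 7],
                               'watch_ratio': [0.4, 1.6, 2.0]})
toy_awr = avg_watch_ratio_at_k(toy_recs_user, toy_truth_user, k=2)
```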

### Relevance
1. **Precision@k**: Proportion of recommended videos in the top k that are relevant.

2. **Recall@k**: Proportion of all relevant videos that appear in the top k recommendations.

3. **F1-Score@k**: The harmonic mean of precision and recall at k, balancing the trade-off between recommending relevant videos (precision) and capturing all relevant videos (recall).

Because these metrics require a binary label, we set a threshold of 0.7 on `predicted_watch_ratio`: a video is labelled relevant if `predicted_watch_ratio` >= 0.7 and irrelevant otherwise.
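
The exact computation in `eval_fns` is not shown. One common formulation, sketched here as an assumption, applies the threshold to both the predicted and the observed watch ratio: a top-k video counts as recommended if its prediction clears the threshold, and as a hit if its observed watch ratio does too. The function name and toy pairs are hypothetical.

```python
def precision_recall_f1_at_k(pairs, k, threshold):
    """pairs: list of (predicted_watch_ratio, observed_watch_ratio) for one user."""
    # Rank by predicted watch ratio, best first, and take the top k
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    top_k = ranked[:k]
    # Relevant: observed watch ratio clears the threshold (over all of the user's pairs)
    n_relevant = sum(obs >= threshold for _, obs in ranked)
    # Recommended: predicted watch ratio clears the threshold within the top k
    n_recommended = sum(pred >= threshold for pred, _ in top_k)
    # Hits: recommended and relevant
    n_hits = sum(pred >= threshold and obs >= threshold for pred, obs in top_k)
    precision = n_hits / n_recommended if n_recommended else 0.0
    recall = n_hits / n_relevant if n_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy (predicted, observed) pairs for one user
toy_pairs = [(1.2, 0.9), (0.9, 0.3), (0.8, 1.1), (0.5, 0.8), (0.1, 0.2)]
p, r, f1 = precision_recall_f1_at_k(toy_pairs, k=3, threshold=0.7)
```

In the toy data, all three top-ranked videos clear the threshold on the predicted side, two of them are relevant, and three videos are relevant overall, so precision, recall and F1 all come out to 2/3.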

### Diversity
1. **Category-Aware NDCG @ k**: Measures how well the recommended videos' category distribution matches the user's true category preference ranking.

2. **Distinct Categories @ k**: Number of distinct categories that appear in the top k recommendations.
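
The definitions used by `eval_fns` are not shown, so the following is only one possible reading. Distinct categories is a simple count of unique labels. For the category-aware NDCG, this sketch assumes that a user's preference for a category is their mean ground-truth watch ratio in that category, that each recommended video's gain is the preference of its category, and that the ideal ordering is the same gains sorted in descending order. All names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def distinct_categories_at_k(rec_categories, k):
    """Number of unique category labels among the top-k recommendations."""
    return len(set(rec_categories[:k]))

def category_ndcg_at_k(rec_categories, truth, k):
    """truth: list of (category, watch_ratio) pairs from one user's ground truth."""
    # Assumed preference score: mean observed watch ratio per category
    totals, counts = defaultdict(float), defaultdict(int)
    for cat, ratio in truth:
        totals[cat] += ratio
        counts[cat] += 1
    pref = {c: totals[c] / counts[c] for c in totals}
    # Gain of each recommended video = preference of its category (0 if unseen)
    gains = [pref.get(c, 0.0) for c in rec_categories[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal if ideal else 0.0

# Toy data: categories of one user's ranked recommendations and their ground truth
toy_rec_categories = ['food', 'pets', 'food', 'news']
toy_truth_pairs = [('food', 1.0), ('food', 2.0), ('pets', 0.5)]
n_distinct = distinct_categories_at_k(toy_rec_categories, k=3)
ndcg = category_ndcg_at_k(toy_rec_categories, toy_truth_pairs, k=3)
```

Under this reading the score reaches 1 only when the recommended categories are already ordered by the user's preference, and because the ideal ordering reuses the same gains, even a weak ranking can score fairly high.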


We set k = 100.

In [11]:
k = 100
threshold = 0.7

In [12]:
reco_grp_caption = recommendations_caption.groupby('user_id')
reco_grp_ncf = recommendations_ncf.groupby('user_id')
reco_grp_random = recommendations_random.groupby('user_id')
ground_truth_grp = ground_truth.groupby('user_id')

### Performance Across Models

In [13]:
metrics_df_caption = eval_fns.get_all_metrics(k, ground_truth_grp, reco_grp_caption, video_data, threshold, by_cluster = False)
metrics_df_ncf = eval_fns.get_all_metrics(k, ground_truth_grp, reco_grp_ncf, video_data, threshold, by_cluster = False)
metrics_df_random = eval_fns.get_all_metrics(k, ground_truth_grp, reco_grp_random, video_data, threshold, by_cluster = False)

100%|██████████| 1411/1411 [01:11<00:00, 19.86it/s]
100%|██████████| 1411/1411 [00:52<00:00, 27.04it/s]
100%|██████████| 1411/1411 [00:01<00:00, 932.52it/s]
100%|██████████| 1411/1411 [00:36<00:00, 38.82it/s]
100%|██████████| 1411/1411 [01:26<00:00, 16.34it/s]
100%|██████████| 1411/1411 [00:38<00:00, 36.26it/s]
100%|██████████| 1411/1411 [00:02<00:00, 614.88it/s]
100%|██████████| 1411/1411 [00:25<00:00, 54.35it/s]
100%|██████████| 1411/1411 [01:00<00:00, 23.51it/s]
100%|██████████| 1411/1411 [00:37<00:00, 38.01it/s]
100%|██████████| 1411/1411 [00:02<00:00, 684.49it/s]
100%|██████████| 1411/1411 [00:55<00:00, 25.39it/s]


In [None]:
# Concatenate the metrics dataframes
metrics_combined = pd.concat([metrics_df_ncf, metrics_df_caption, metrics_df_random], axis=0)

# Add model names
metrics_combined.index = ['Neural Collaborative Filtering with Time Decay', 'Caption-based Video Filtering with Time Decay', 'Random']

metrics_combined.drop(columns=['cluster'], inplace=True)

metrics_combined

| Model | Avg Watch Ratio @ 100 | Avg Precision@100 | Avg Recall@100 | Avg F1@100 | Category-Aware NDCG @ 100 | Distinct Categories @ 100 |
|---|---|---|---|---|---|---|
| Neural Collaborative Filtering with Time Decay | 0.93781 | 0.629683 | 0.775293 | 0.676081 | 0.970287 | 25.73494 |
| Caption-based Video Filtering with Time Decay | 0.821867 | 0.534254 | 0.662118 | 0.575053 | 0.907141 | 21.40893 |
| Random | 0.797052 | 0.510781 | 0.635939 | 0.550876 | 0.929594 | 26.715804 |


## NCF Model Performance

Previously, we segmented users into four distinct clusters based on their behavioral patterns, in the hope of capturing subtle patterns unique to each group and improving model performance. Let us see whether performance is indeed better with user segmentation.

### Performance With User Segmentation

In [19]:
metrics_ncf_per_cluster = eval_fns.get_all_metrics(k, ground_truth, recommendations_ncf, video_data, threshold, by_cluster=True)

100%|██████████| 269/269 [00:30<00:00,  8.90it/s]
100%|██████████| 419/419 [00:46<00:00,  8.97it/s]
100%|██████████| 345/345 [00:38<00:00,  8.93it/s]
100%|██████████| 378/378 [00:41<00:00,  9.19it/s]
100%|██████████| 269/269 [00:10<00:00, 24.63it/s]
100%|██████████| 419/419 [00:16<00:00, 24.72it/s]
100%|██████████| 345/345 [00:14<00:00, 23.48it/s]
100%|██████████| 378/378 [00:15<00:00, 23.80it/s]
100%|██████████| 269/269 [00:06<00:00, 38.70it/s]
100%|██████████| 419/419 [00:10<00:00, 41.07it/s]
100%|██████████| 345/345 [00:08<00:00, 40.92it/s]
100%|██████████| 378/378 [00:09<00:00, 41.01it/s]
100%|██████████| 269/269 [00:14<00:00, 18.08it/s]
100%|██████████| 419/419 [00:23<00:00, 18.19it/s]
100%|██████████| 345/345 [00:18<00:00, 18.35it/s]
100%|██████████| 378/378 [00:20<00:00, 18.17it/s]


In [None]:
metrics_ncf_per_cluster

| | cluster | Avg Watch Ratio @ 100 | Avg Precision@100 | Avg Recall@100 | Avg F1@100 | Category-Aware NDCG @ 100 | Distinct Categories @ 100 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.955919 | 0.609204 | 0.760632 | 0.657452 | 0.967586 | 26.230483 |
| 0 | 1 | 0.941045 | 0.654094 | 0.768648 | 0.686297 | 0.970797 | 25.842482 |
| 0 | 2 | 0.916312 | 0.609012 | 0.780041 | 0.665949 | 0.970502 | 25.637681 |
| 0 | 3 | 0.940958 | 0.636063 | 0.788758 | 0.687261 | 0.971448 | 25.351852 |
| 0 | Overall | 0.93781 | 0.629683 | 0.775293 | 0.676081 | 0.970287 | 25.73494 |
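
The `Overall` row appears to be a user-weighted average of the cluster rows. The quick check below assumes the cluster sizes are the totals shown in the progress bars (269, 419, 345 and 378 users, in cluster order 0 to 3, which is itself an assumption) and recomputes two of the columns.

```python
import numpy as np

# Users per cluster, read off the progress bars (assumed order: clusters 0-3)
cluster_sizes = np.array([269, 419, 345, 378])

# Per-cluster values copied from the table above
avg_watch_ratio = np.array([0.955919, 0.941045, 0.916312, 0.940958])
avg_precision = np.array([0.609204, 0.654094, 0.609012, 0.636063])

# Weight each cluster's average by its number of users
overall_watch_ratio = np.average(avg_watch_ratio, weights=cluster_sizes)
overall_precision = np.average(avg_precision, weights=cluster_sizes)
print(overall_watch_ratio, overall_precision)
```

Both values agree with the `Overall` row (0.93781 and 0.629683) to within rounding, which is consistent with that row being the average over all 1411 users rather than a plain mean of the four cluster rows.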



### Performance Without Segmentation