# Evaluation

## Set up

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Set your root directory below. Make sure the `/data`, `/data_exports` and `/recommendations` folders have been uploaded to this directory.

In [None]:
# Adjust your root directory
root = '/content/drive/MyDrive/KuaiRec/'

## Load Data and Predictions

In [3]:
import pandas as pd

import eval_fns

In total, we have eight sets of predictions, two per model. One set is generated by a model trained only on the training set and is evaluated against the validation set; the other is generated after training on both the training and validation sets and is evaluated on the final test set. Hyperparameters were chosen based on validation performance, and the final predictions are generated with these hyperparameters. The test set represents the final unseen data, so performance on it matters most.

There is also an additional set of predictions from the NCF model, trained on the train and validation data without user segmentation. It is evaluated on the test set and compared with the NCF model that uses user segmentation, to see how much user segmentation contributes to the NCF model's overall performance.

In [4]:
root = './'

# Predictions to be tested on the validation set
prediction_scores_caption_on_val = pd.read_csv(root + 'recommendations/recommendations_caption_val_full.csv')
prediction_scores_ncf_on_val = pd.read_csv(root + 'recommendations/w_clustering_batch_size512_num_epochs20_lr0.001_embedding_dim64_dropout0.3_decay0.01.csv')
prediction_scores_hybrid_on_val = pd.read_csv(root + 'recommendations/recommendations_hybrid_val_full.csv')
prediction_scores_random_on_val = pd.read_csv(root + 'recommendations/recommendations_random_val_full.csv')

# Predictions to be tested on the test set
prediction_scores_caption_on_test = pd.read_csv(root + 'recommendations/recommendations_caption_test_full.csv')
prediction_scores_ncf_on_test = pd.read_csv(root + 'recommendations/final_w_clustering_batch_size512_num_epochs20_lr0.001_embedding_dim64_dropout0.3_decay0.01.csv')
prediction_scores_hybrid_on_test = pd.read_csv(root + 'recommendations/recommendations_hybrid_test_full.csv')
prediction_scores_random_on_test = pd.read_csv(root + 'recommendations/recommendations_random_test_full.csv')

prediction_scores_ncf_on_test_without_clustering = pd.read_csv(root + 'recommendations/final_wo_clustering_batch_size512_num_epochs20_lr0.001_embedding_dim64_dropout0.3_decay0.01.csv')

# Obtain the ground truth watch ratios from data
joined_train_data = pd.read_csv(root + 'data_exports/joined_train_data.csv')
joined_val_data = pd.read_csv(root + 'data_exports/joined_val_data.csv')
joined_test_data = pd.read_csv(root + 'data_exports/joined_test_data.csv')

joined_train_val_data = pd.concat([joined_train_data, joined_val_data])

# Load the video data in order to get the video categories
video_data = pd.read_csv(root + 'data/kuairec_caption_category_translated.csv', index_col=0)

## Data Preprocessing

In [23]:
# Rename
prediction_scores_caption_on_val = prediction_scores_caption_on_val.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_ncf_on_val = prediction_scores_ncf_on_val.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_hybrid_on_val = prediction_scores_hybrid_on_val.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_random_on_val = prediction_scores_random_on_val.rename(columns={'watch_ratio': 'predicted_watch_ratio'})

prediction_scores_caption_on_test = prediction_scores_caption_on_test.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_ncf_on_test = prediction_scores_ncf_on_test.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_hybrid_on_test = prediction_scores_hybrid_on_test.rename(columns={'watch_ratio': 'predicted_watch_ratio'})
prediction_scores_random_on_test = prediction_scores_random_on_test.rename(columns={'watch_ratio': 'predicted_watch_ratio'})

prediction_scores_ncf_on_test_without_clustering = prediction_scores_ncf_on_test_without_clustering.rename(columns={'watch_ratio': 'predicted_watch_ratio'})

# Sort
prediction_scores_caption_on_val = prediction_scores_caption_on_val.sort_values(by=['user_id', 'video_id'])
prediction_scores_ncf_on_val = prediction_scores_ncf_on_val.sort_values(by=['user_id', 'video_id'])
prediction_scores_hybrid_on_val = prediction_scores_hybrid_on_val.sort_values(by=['user_id', 'video_id'])
prediction_scores_random_on_val = prediction_scores_random_on_val.sort_values(by=['user_id', 'video_id'])

prediction_scores_caption_on_test = prediction_scores_caption_on_test.sort_values(by=['user_id', 'video_id'])
prediction_scores_ncf_on_test = prediction_scores_ncf_on_test.sort_values(by=['user_id', 'video_id'])
prediction_scores_hybrid_on_test = prediction_scores_hybrid_on_test.sort_values(by=['user_id', 'video_id'])
prediction_scores_random_on_test = prediction_scores_random_on_test.sort_values(by=['user_id', 'video_id'])

prediction_scores_ncf_on_test_without_clustering = prediction_scores_ncf_on_test_without_clustering.sort_values(by=['user_id', 'video_id'])

## Get user watch history

We want to be able to filter out videos that the user has already watched, so that we recommend only new videos.

In [24]:
user_watch_history_from_train = eval_fns.get_user_watch_history(joined_train_data)
user_watch_history_from_train_val = eval_fns.get_user_watch_history(joined_train_val_data)
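
The `eval_fns` module is not shown in this notebook, so the following is only a minimal sketch of what a helper like `get_user_watch_history` might do. It assumes the result maps each `user_id` to the set of `video_id`s that user has already interacted with; the name `build_watch_history` and the toy data are hypothetical.

```python
import pandas as pd


def build_watch_history(interactions: pd.DataFrame) -> dict:
    """Map each user_id to the set of video_ids the user has watched.

    Hypothetical stand-in for eval_fns.get_user_watch_history.
    """
    return interactions.groupby('user_id')['video_id'].apply(set).to_dict()


# Toy interactions: user 14 watched two videos, user 15 watched one.
toy_interactions = pd.DataFrame({
    'user_id': [14, 14, 15],
    'video_id': [1306, 1352, 1306],
})
toy_history = build_watch_history(toy_interactions)
```

A dictionary of sets keeps the later "has this user seen this video?" check to a constant-time lookup per user-video pair.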

## Getting ground truth videos for each user

Next, we process the validation and test sets to obtain the ground truth watch ratios. Each evaluation set is filtered to contain only videos that are present in the data the corresponding model was trained on (training data for validation; training and validation data for test) and that the user has not watched before. Users and videos that are not in that training data are filtered out as well, as we cannot make recommendations for them. The remaining data is then sorted by user in ascending order and by `watch_ratio` in descending order.

In [25]:
# Obtain users and videos in training data
users_in_train_data = set(joined_train_data['user_id'])
videos_in_train_data = set(joined_train_data['video_id'])

# Obtain users and videos in training and validation data
users_in_train_val_data = set(joined_train_val_data['user_id'])
videos_in_train_val_data = set(joined_train_val_data['video_id'])

# Get ground truth user-item combinations and their watch ratios
ground_truth_val = eval_fns.get_ground_truth(joined_val_data[['user_id', 'video_id', 'watch_ratio']], users_in_train_data, videos_in_train_data, user_watch_history_from_train)
ground_truth_test = eval_fns.get_ground_truth(joined_test_data[['user_id', 'video_id', 'watch_ratio']], users_in_train_val_data, videos_in_train_val_data, user_watch_history_from_train_val)

In [26]:
# Ground truth watch ratios for user 14
ground_truth_test[ground_truth_test['user_id'] == 14]

Unnamed: 0,user_id,video_id,watch_ratio
27,14,4184,3.234123
0,14,6293,2.442865
184,14,5954,1.899621
31,14,10354,1.884053
18,14,1352,1.780083
...,...,...,...
169,14,6270,0.062602
113,14,7736,0.060968
116,14,10140,0.032761
183,14,2755,0.032283
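
The filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `eval_fns.get_ground_truth`: the function name `filter_ground_truth` is hypothetical, and the watch history is assumed to be a dictionary mapping `user_id` to a set of `video_id`s.

```python
import pandas as pd


def filter_ground_truth(interactions, known_users, known_videos, watch_history):
    """Keep rows for known users and videos that the user has not watched
    before, sorted by user_id ascending and watch_ratio descending.

    Hypothetical stand-in for eval_fns.get_ground_truth.
    """
    # Drop users and videos that never appeared in the training data.
    df = interactions[
        interactions['user_id'].isin(list(known_users))
        & interactions['video_id'].isin(list(known_videos))
    ]
    # Flag (user, video) pairs the user has already watched.
    seen = [
        video in watch_history.get(user, set())
        for user, video in zip(df['user_id'], df['video_id'])
    ]
    df = df[~pd.Series(seen, index=df.index, dtype=bool)]
    return df.sort_values(['user_id', 'watch_ratio'], ascending=[True, False])


toy_eval = pd.DataFrame({
    'user_id': [14, 14, 14, 99],
    'video_id': [1, 2, 3, 1],
    'watch_ratio': [0.5, 2.0, 1.0, 0.9],
})
# User 99 is unknown and user 14 has already watched video 3.
toy_truth = filter_ground_truth(toy_eval, {14}, {1, 2, 3}, {14: {3}})
```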


## Getting recommendations for each user

With the prediction scores generated from our models, we obtain the video recommendations for each user. This is done by first filtering for videos that the user has not watched before, then sorting the predicted watch ratio in descending order.

In [27]:
# Get recommendations on validation set
videos_in_val_data = set(joined_val_data['video_id'])

recommendations_caption_for_val = eval_fns.get_user_recommendations(prediction_scores_caption_on_val, videos_in_val_data, user_watch_history_from_train)
recommendations_ncf_for_val = eval_fns.get_user_recommendations(prediction_scores_ncf_on_val, videos_in_val_data, user_watch_history_from_train)
recommendations_hybrid_for_val = eval_fns.get_user_recommendations(prediction_scores_hybrid_on_val, videos_in_val_data, user_watch_history_from_train)
recommendations_random_for_val = eval_fns.get_user_recommendations(prediction_scores_random_on_val, videos_in_val_data, user_watch_history_from_train)

# Get recommendations on test set
videos_in_test_data = set(joined_test_data['video_id'])

recommendations_caption_for_test = eval_fns.get_user_recommendations(prediction_scores_caption_on_test, videos_in_test_data, user_watch_history_from_train_val)
recommendations_ncf_for_test = eval_fns.get_user_recommendations(prediction_scores_ncf_on_test, videos_in_test_data, user_watch_history_from_train_val)
recommendations_hybrid_for_test = eval_fns.get_user_recommendations(prediction_scores_hybrid_on_test, videos_in_test_data, user_watch_history_from_train_val)
recommendations_random_for_test = eval_fns.get_user_recommendations(prediction_scores_random_on_test, videos_in_test_data, user_watch_history_from_train_val)

recommendations_ncf_for_test_without_clustering = eval_fns.get_user_recommendations(prediction_scores_ncf_on_test_without_clustering, videos_in_test_data, user_watch_history_from_train_val)



In [28]:
# Recommendations from NCF model for user 14 to be evaluated on the test set
recommendations_ncf_for_test[recommendations_ncf_for_test['user_id'] == 14]

Unnamed: 0,user_id,video_id,predicted_watch_ratio,cluster
143915,14,1306,1.224071e+00,0
152254,14,1352,1.217318e+00,0
422868,14,4719,1.121088e+00,0
156289,14,1379,1.109590e+00,0
830672,14,10404,1.068774e+00,0
...,...,...,...,...
648559,14,7736,1.086311e-07,0
797585,14,9986,8.865571e-08,0
130465,14,1166,2.004953e-08,0
811573,14,10140,1.299631e-09,0
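
As a rough illustration of the filter-then-sort step, the sketch below ranks unseen candidate videos per user. It is not the actual `eval_fns.get_user_recommendations`; the name `rank_unseen` is hypothetical, and it assumes the second argument restricts candidates to videos present in the evaluation set.

```python
import pandas as pd


def rank_unseen(predictions, candidate_videos, watch_history):
    """Drop videos outside the candidate set or already watched by the user,
    then sort each user's remaining videos by predicted score, highest first.

    Hypothetical stand-in for eval_fns.get_user_recommendations.
    """
    df = predictions[predictions['video_id'].isin(list(candidate_videos))]
    seen = [
        video in watch_history.get(user, set())
        for user, video in zip(df['user_id'], df['video_id'])
    ]
    df = df[~pd.Series(seen, index=df.index, dtype=bool)]
    return df.sort_values(
        ['user_id', 'predicted_watch_ratio'], ascending=[True, False]
    )


toy_predictions = pd.DataFrame({
    'user_id': [14, 14, 14, 14],
    'video_id': [1, 2, 3, 4],
    'predicted_watch_ratio': [0.2, 1.1, 0.9, 1.5],
})
# Video 4 is not a candidate and video 3 was already watched by user 14.
toy_recommendations = rank_unseen(toy_predictions, {1, 2, 3}, {14: {3}})
```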


## Evaluation Metrics

We have chosen various evaluation metrics to provide a comprehensive evaluation of our models' performance. They can be grouped into 3 broad categories - Engagement, Relevance and Diversity.

### Engagement
1. **Average Watch Ratio @ k**: Measures the average proportion of content that users watch across all recommended videos.
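
A minimal sketch of the engagement metric for a single user, assuming it averages the ground-truth watch ratio over the top-k recommended videos and skips videos with no ground-truth value. The function name and the handling of missing videos are assumptions; the actual computation lives in `eval_fns`.

```python
def avg_watch_ratio_at_k(ranked_videos, true_watch_ratio, k):
    """Mean ground-truth watch ratio over the top-k recommended videos.

    ranked_videos: video ids sorted by predicted score, best first.
    true_watch_ratio: dict of video id -> observed watch ratio for this user.
    Videos without a ground-truth value are skipped (an assumption).
    """
    top_k = ranked_videos[:k]
    ratios = [true_watch_ratio[v] for v in top_k if v in true_watch_ratio]
    return sum(ratios) / len(ratios) if ratios else 0.0


toy_ranked = [1306, 1352, 4719]
toy_truth_ratios = {1352: 1.5, 4719: 0.5}
toy_awr_at_2 = avg_watch_ratio_at_k(toy_ranked, toy_truth_ratios, k=2)
toy_awr_at_3 = avg_watch_ratio_at_k(toy_ranked, toy_truth_ratios, k=3)
```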

### Relevance
1. **Precision@k**: Proportion of recommended videos in top K that are relevant.

2. **Recall@k**: Proportion of all relevant videos that appear in top K recommendations.

3. **F1-Score@k**: The harmonic mean of precision and recall at K, balancing the trade-off between recommending relevant videos (precision) and capturing all relevant videos (recall). 

As these metrics require a binary label, we apply a threshold of 0.7 to `predicted_watch_ratio`: a video with `predicted_watch_ratio` >= 0.7 is labeled relevant, and a video with `predicted_watch_ratio` < 0.7 is labeled irrelevant.
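
The three relevance metrics can be sketched for a single user as follows. This is a generic illustration: it takes the set of relevant videos as given and does not reproduce how `eval_fns` derives that set from the 0.7 threshold. The function name is hypothetical.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Precision, recall and F1 over the top-k recommended video ids.

    recommended: video ids sorted by predicted score, best first.
    relevant: set of video ids labeled relevant for this user.
    """
    top_k = recommended[:k]
    hits = sum(1 for video in top_k if video in relevant)
    # Precision divides by the number of videos actually recommended,
    # which can be smaller than k for users with few candidates.
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# One of the top three recommendations is among the three relevant videos.
toy_p, toy_r, toy_f1 = precision_recall_f1_at_k([1, 2, 3, 4], {2, 4, 5}, k=3)
```

With k = 100 and many relevant videos per user, recall is capped at 100 divided by the number of relevant videos, which is worth keeping in mind when comparing recall across evaluation sets of different sizes.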

### Diversity
1. **Category-Aware NDCG @ k**: Measures how well the recommended videos' category distribution matches the user's true category preference ranking.

2. **Distinct Categories @ k**: Number of distinct categories that appear in the top K recommendations.
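
The sketch below illustrates the two diversity metrics for a single user. `distinct_categories_at_k` counts the categories in the top k, and `ndcg_at_k` is the standard NDCG formula with a log2 discount. How the category-aware variant assigns a gain to each position is defined in `eval_fns` and is not shown here, so the gains are taken as an input; all names are hypothetical.

```python
import math


def distinct_categories_at_k(ranked_videos, video_to_category, k):
    """Number of distinct categories among the top-k recommended videos."""
    return len({
        video_to_category[v] for v in ranked_videos[:k] if v in video_to_category
    })


def dcg(gains):
    """Discounted cumulative gain with the usual log2(position + 1) discount."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))


def ndcg_at_k(gains_in_ranked_order, k):
    """DCG of the top k divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True)[:k])
    return dcg(gains_in_ranked_order[:k]) / ideal if ideal > 0 else 0.0


toy_categories = {1: 'food', 2: 'food', 3: 'pets'}
toy_distinct = distinct_categories_at_k([1, 2, 3], toy_categories, k=3)
toy_ndcg_ideal = ndcg_at_k([3, 2, 1], k=3)     # already in the best order
toy_ndcg_reversed = ndcg_at_k([1, 2, 3], k=3)  # worst order of the same gains
```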


We have chosen k to be 100.

In [29]:
k = 100
threshold = 0.7

In [30]:
# For Validation
reco_grp_caption_for_val = recommendations_caption_for_val.groupby('user_id')
reco_grp_ncf_for_val = recommendations_ncf_for_val.groupby('user_id')
reco_grp_hybrid_for_val = recommendations_hybrid_for_val.groupby('user_id')
reco_grp_random_for_val = recommendations_random_for_val.groupby('user_id')

ground_truth_grp_for_val = ground_truth_val.groupby('user_id')

# For Test
reco_grp_caption_for_test = recommendations_caption_for_test.groupby('user_id')
reco_grp_ncf_for_test = recommendations_ncf_for_test.groupby('user_id')
reco_grp_hybrid_for_test = recommendations_hybrid_for_test.groupby('user_id')
reco_grp_random_for_test = recommendations_random_for_test.groupby('user_id')

reco_grp_ncf_for_test_without_clustering = recommendations_ncf_for_test_without_clustering.groupby('user_id')

ground_truth_grp_for_test = ground_truth_test.groupby('user_id')

## Performance Across Models on the Validation Set
### Results

In [31]:
metrics_df_caption_for_val = eval_fns.get_all_metrics(k, ground_truth_grp_for_val, reco_grp_caption_for_val, video_data, threshold, by_cluster = False)
metrics_df_ncf_for_val = eval_fns.get_all_metrics(k, ground_truth_grp_for_val, reco_grp_ncf_for_val, video_data, threshold, by_cluster = False)
metrics_df_hybrid_for_val = eval_fns.get_all_metrics(k, ground_truth_grp_for_val, reco_grp_hybrid_for_val, video_data, threshold, by_cluster = False)
metrics_df_random_for_val = eval_fns.get_all_metrics(k, ground_truth_grp_for_val, reco_grp_random_for_val, video_data, threshold, by_cluster = False)



In [32]:
# Concatenate the metrics dataframes
metrics_for_val_combined = pd.concat([metrics_df_ncf_for_val, metrics_df_caption_for_val, metrics_df_hybrid_for_val, metrics_df_random_for_val], axis=0)

# Add model names
metrics_for_val_combined.index = ['Neural Collaborative Filtering with Time Decay', 'Caption-based Video Filtering with Time Decay', 'Hybrid', 'Random']

metrics_for_val_combined.drop(columns=['cluster'], inplace=True)

metrics_for_val_combined

Model,Avg Watch Ratio @ 100,Avg Precision@100,Avg Recall@100,Avg F1@100,Category-Aware NDCG @ 100,Distinct Categories @ 100
Neural Collaborative Filtering with Time Decay,1.03608,0.709447,0.323266,0.435504,0.950681,24.956768
Caption-based Video Filtering with Time Decay,0.862624,0.57533,0.260211,0.351417,0.868736,18.795889
Hybrid,1.034439,0.70798,0.322404,0.434463,0.95272,24.672573
Random,0.843575,0.557066,0.252694,0.340944,0.921808,25.933381


## Performance Across Models on the Testing Set
### Results

In [33]:
metrics_df_caption_for_test = eval_fns.get_all_metrics(k, ground_truth_grp_for_test, reco_grp_caption_for_test, video_data, threshold, by_cluster = False)
metrics_df_ncf_for_test = eval_fns.get_all_metrics(k, ground_truth_grp_for_test, reco_grp_ncf_for_test, video_data, threshold, by_cluster = False)
metrics_df_hybrid_for_test = eval_fns.get_all_metrics(k, ground_truth_grp_for_test, reco_grp_hybrid_for_test, video_data, threshold, by_cluster = False)
metrics_df_random_for_test = eval_fns.get_all_metrics(k, ground_truth_grp_for_test, reco_grp_random_for_test, video_data, threshold, by_cluster = False)

metrics_df_ncf_for_test_without_clustering = eval_fns.get_all_metrics(k, ground_truth_grp_for_test, reco_grp_ncf_for_test_without_clustering, video_data, threshold, by_cluster = False)


In [34]:
# Concatenate the metrics dataframes
metrics_for_test_combined = pd.concat([metrics_df_ncf_for_test, metrics_df_caption_for_test, metrics_df_hybrid_for_test, metrics_df_random_for_test], axis=0)

# Add model names
metrics_for_test_combined.index = ['Neural Collaborative Filtering with Time Decay', 'Caption-based Video Filtering with Time Decay', 'Hybrid', 'Random']

metrics_for_test_combined.drop(columns=['cluster'], inplace=True)

metrics_for_test_combined

Model,Avg Watch Ratio @ 100,Avg Precision@100,Avg Recall@100,Avg F1@100,Category-Aware NDCG @ 100,Distinct Categories @ 100
Neural Collaborative Filtering with Time Decay,0.93781,0.629683,0.775293,0.676081,0.970287,25.73494
Caption-based Video Filtering with Time Decay,0.821867,0.534254,0.662118,0.575053,0.907141,21.40893
Hybrid,0.911009,0.605267,0.742098,0.648776,0.962718,25.575478
Random,0.797052,0.510781,0.635939,0.550876,0.929594,26.715804


### Insights
1. Engagement

    On the test set, the Average Watch Ratio @ k is higher for all of our models (0.938 for NCF, 0.911 for Hybrid, 0.822 for Caption-based) than for the random baseline (0.797), indicating that users watch more of the videos our models recommend. This is a good sign, as it suggests that our models recommend videos that users are more likely to watch and enjoy. The NCF model performs considerably better than the Caption-based model, which is expected, as NCF is a more expressive model that can capture more complex patterns in the data.

2. Relevance

    Precision@k, Recall@k and F1-Score@k are all higher for our models than for the random baseline. On the test set, for example, F1@100 is 0.676 for NCF, 0.649 for Hybrid and 0.575 for Caption-based, against 0.551 for Random. This indicates that our models recommend more relevant videos to the users. The higher F1-Score@k also indicates that our models balance the trade-off between recommending relevant videos and capturing all relevant videos. As with the engagement metric, the NCF model performs considerably better than the Caption-based model.

3. Diversity

    In terms of Distinct Categories @ k, all our models recommend fewer distinct categories than the random baseline. The Caption-based model recommends the fewest (21.4 against 26.7 for Random on the test set), which is expected as its recommendations were partially based on the embeddings of the category of the video. In general, this suggests that our models are more targeted than random recommendations, but might not be as diverse in terms of the categories of the recommended videos. However, on Category-Aware NDCG @ 100, the NCF model scores higher than the random baseline (0.970 against 0.930 on the test set), which means that even though it recommends fewer distinct categories, the categories of its recommended videos are more aligned with the user's true category preference ranking.

### NCF Model Performance

Previously, we segmented users into four distinct clusters based on their behavioral patterns, in the hope of capturing subtle patterns unique to each group and improving model performance. To test this, we trained the NCF model on the combined train and validation data both with and without user segmentation.

Let us see whether performance is indeed better with user segmentation.

#### Results
##### Performance With User Segmentation

In [35]:
metrics_ncf_per_cluster = eval_fns.get_all_metrics(k, ground_truth_test, recommendations_ncf_for_test, video_data, threshold, by_cluster=True)



In [36]:
metrics_ncf_per_cluster

Unnamed: 0,cluster,Avg Watch Ratio @ 100,Avg Precision@100,Avg Recall@100,Avg F1@100,Category-Aware NDCG @ 100,Distinct Categories @ 100
0,0,0.955919,0.609204,0.760632,0.657452,0.967586,26.230483
0,1,0.941045,0.654094,0.768648,0.686297,0.970797,25.842482
0,2,0.916312,0.609012,0.780041,0.665949,0.970502,25.637681
0,3,0.940958,0.636063,0.788758,0.687261,0.971448,25.351852
0,Overall,0.93781,0.629683,0.775293,0.676081,0.970287,25.73494


##### Performance Without Segmentation

In [37]:
metrics_ncf_per_without_clustering = eval_fns.get_all_metrics(k, ground_truth_test, recommendations_ncf_for_test_without_clustering, video_data, threshold, by_cluster=False)



In [38]:
metrics_ncf_per_without_clustering

Unnamed: 0,cluster,Avg Watch Ratio @ 100,Avg Precision@100,Avg Recall@100,Avg F1@100,Category-Aware NDCG @ 100,Distinct Categories @ 100
0,Overall,0.819306,0.514577,0.341954,0.399085,0.895342,25.192771
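
To put the two `Overall` rows side by side, the following sketch copies the reported values into a small table and computes the absolute and relative differences. The numbers are taken directly from the two result tables above; the variable names and the choice of "without segmentation" as the reference for the relative difference are assumptions made for illustration.

```python
import pandas as pd

metric_names = [
    'Avg Watch Ratio @ 100', 'Avg Precision@100', 'Avg Recall@100',
    'Avg F1@100', 'Category-Aware NDCG @ 100', 'Distinct Categories @ 100',
]

# "Overall" rows copied from the two result tables above.
with_segmentation = pd.Series(
    [0.93781, 0.629683, 0.775293, 0.676081, 0.970287, 25.73494],
    index=metric_names,
)
without_segmentation = pd.Series(
    [0.819306, 0.514577, 0.341954, 0.399085, 0.895342, 25.192771],
    index=metric_names,
)

ncf_comparison = pd.DataFrame({
    'with_segmentation': with_segmentation,
    'without_segmentation': without_segmentation,
})
ncf_comparison['abs_diff'] = (
    ncf_comparison['with_segmentation'] - ncf_comparison['without_segmentation']
)
# Relative difference in percent, using the unsegmented model as reference.
ncf_comparison['rel_diff_pct'] = (
    100 * ncf_comparison['abs_diff'] / ncf_comparison['without_segmentation']
)
```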


#### Insights