In [4]:
import sys
import os.path as osp

PROJECT_DIR = '../../'
PROJECT_DIR = osp.abspath(PROJECT_DIR)
print(PROJECT_DIR in sys.path)
if PROJECT_DIR not in sys.path:
    print(f'Adding project directory to the sys.path: {PROJECT_DIR!r}')
    sys.path.insert(1, PROJECT_DIR)

True


Let's construct our baseline model based on the items' popularity.

For this system, we need popularity scores on the same scale as the ground-truth user ratings, so that they can be evaluated as rating predictions. The raw rating counts are therefore normalized to the range [1.0, 5.0]: 1.0 corresponds to an item that no user has rated yet, and 5.0 to the item with the most ratings at the given time point.
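The normalization described above can be sketched on a toy example. This is a minimal illustration in plain Python, not the notebook's implementation: the function `popularity_scores` and the item ids are hypothetical, and unrated items are passed in explicitly so that the lower bound of 1.0 is visible.

```python
from collections import Counter

def popularity_scores(rated_item_ids, all_item_ids):
    """Map rating counts linearly onto [1.0, 5.0]."""
    counts = Counter(rated_item_ids)
    max_count = max(counts.values())
    # Counter returns 0 for items that never appear in the log.
    return {item: 1.0 + 4.0 * counts[item] / max_count for item in all_item_ids}

# Hypothetical toy log: one entry per rating event.
ratings_log = ["A", "A", "A", "A", "B", "B", "C"]
scores = popularity_scores(ratings_log, all_item_ids=["A", "B", "C", "D"])
print(scores)  # {'A': 5.0, 'B': 3.0, 'C': 2.0, 'D': 1.0}
```

Note that a `groupby` over the ratings only yields items that have at least one rating, so in that setting the lowest score actually assigned is slightly above 1.0.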

The evaluation pipeline here will follow the same two approaches as for the previous mean-rating baseline experiment.

In [6]:
from src.models.abstract_rs_model import AbstractRSModel

In [8]:
import numpy as np
import pandas as pd
import scipy
from tqdm.notebook import tqdm
import json

In [9]:
df_ratings = pd.read_csv('../../data/ml-1m/ratings.dat',
                         delimiter='::',
                         header=None,
                         names=['UserID','MovieID','Rating','Timestamp'],
                         engine ='python')

In [27]:
count_ratings = df_ratings.groupby('MovieID')['Rating'].count()
count_ratings_candidates = count_ratings[~count_ratings.index.isin(
    df_ratings[df_ratings['UserID'] == 0]['MovieID'].unique())]

In [41]:
count_ratings_candidates.sort_values(ascending=False)

MovieID
2858    3428
260     2991
1196    2990
1210    2883
480     2672
        ... 
3237       1
763        1
624        1
2563       1
3290       1
Name: Rating, Length: 3706, dtype: int64

In [35]:
from src.evaluation import EvaluationPipeline

In [37]:
class BaselinePopularityModel(AbstractRSModel):
    def __init__(self):
        self.pre_fit = False

    def fit(self, train_data, pre_fit: bool = False):
        # Popularity of a movie = number of ratings it has received.
        # The pre_fit flag is only recorded; the counts are computed the same way either way.
        self.count_ratings = train_data.groupby('MovieID')['Rating'].count()
        # Scale the counts so that the most-rated movie gets a score of 5.0.
        self.count_ratings = (self.count_ratings / self.count_ratings.max()) * 4 + 1
        self.pre_fit = pre_fit

    def predict(self, data_at_test_timestamp, test_user, test_timestamp):
        # Movies the test user has already rated are excluded from the candidates.
        seen_movies = data_at_test_timestamp.loc[
            data_at_test_timestamp['UserID'] == test_user, 'MovieID'].unique()
        candidates = self.count_ratings[~self.count_ratings.index.isin(seen_movies)]
        candidates = candidates.sort_values(ascending=False)  # kind='mergesort'
        # Returns (movie ids, popularity scores), most popular first.
        return candidates.index.to_numpy(), candidates.to_numpy()

    def fit_predict(self, data, test_user, test_timestamp):
        self.fit(data)
        return self.predict(data, test_user, test_timestamp)

In [38]:
items_pred, ratings_pred = BaselinePopularityModel().fit_predict(df_ratings[
                    df_ratings['Timestamp'] < 978301777], 1, 978301777) # 1028

In [39]:
print(list(zip(items_pred[:20], ratings_pred[:20])))

[(2858, 5.0), (1196, 4.45913096323306), (1210, 4.36918869644485), (480, 4.110300820419326), (589, 4.095715587967183), (2571, 3.972956548161653), (1580, 3.9571558796718325), (593, 3.955940443634154), (1198, 3.8903068975995136), (110, 3.839258584017016), (2762, 3.7675478577939834), (2396, 3.728653904588271), (1197, 3.672743846855059), (527, 3.65572774232756), (1617, 3.6241264053479187), (1097, 3.587663324217563), (1265, 3.5779398359161347), (2997, 3.568216347614707), (2628, 3.5670009115770283), (318, 3.5378304466727437)]


As we can see, the predictions for a given time point differ slightly from the ones built on the whole dataset: the number of ratings changes over time, and so do our predictions, which should reflect the exact time point at which the recommendations are made. Therefore, the evaluation approach in which the model is kept as up to date as possible at each time point is the most appropriate here, just as it was for the mean-rating baseline model.
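To make the time dependence concrete, here is a small sketch in plain Python with a hypothetical event log (the item ids and timestamps are invented for illustration). It ranks items by the number of ratings received strictly before a cutoff, mirroring the `Timestamp < ...` filter used above.

```python
from collections import Counter

# Hypothetical toy log of (item_id, timestamp) rating events.
events = [("A", 1), ("B", 2), ("B", 3), ("A", 4), ("A", 5), ("C", 6)]

def ranking_at(events, cutoff):
    """Rank items by the number of ratings received strictly before `cutoff`."""
    counts = Counter(item for item, ts in events if ts < cutoff)
    # Sort by descending count; ties are broken by item id for determinism.
    return sorted(counts, key=lambda item: (-counts[item], item))

early = ranking_at(events, cutoff=4)
late = ranking_at(events, cutoff=7)
print(early)  # ['B', 'A']
print(late)   # ['A', 'B', 'C']
```

The top item changes between the two cutoffs, which is why a model frozen at an early time point can recommend differently from one that is refreshed.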

Some advantages of this model compared to the mean-rating one are visible even here. For example, a single user rating a movie 5.0 no longer has an immediate, large effect on the predictions. The values are more spread out across the distribution, which is further improved by normalization.

However, this comes with a tradeoff: the predicted ratings have nothing to do with the estimated quality of a movie or with other users' opinions of it. We may recommend a very low-rated movie that many people just happened to watch. On the other hand, the target user may want to watch such a movie anyway, precisely because it is very popular. None of this gives us any insight into the users themselves, though.
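The contrast between the two baselines can be illustrated with a toy example. The two movies and their ratings below are hypothetical, and the popularity score uses the same count-based scaling to [1.0, 5.0] as the model above.

```python
from statistics import mean

# Hypothetical ratings: a widely watched but disliked movie,
# and a niche movie with a single enthusiastic rating.
ratings = {
    "blockbuster": [2, 1, 2, 3, 2, 1, 2, 3],
    "niche": [5],
}

max_count = max(len(r) for r in ratings.values())
summary = {
    title: (mean(r), 1.0 + 4.0 * len(r) / max_count)
    for title, r in ratings.items()
}

for title, (mean_rating, popularity) in summary.items():
    print(f"{title}: mean={mean_rating:.2f}, popularity={popularity:.2f}")
# blockbuster: mean=2.00, popularity=5.00
# niche: mean=5.00, popularity=1.50
```

The mean-rating baseline would put the niche movie first on the strength of one rating, while the popularity baseline puts the disliked blockbuster first. Neither ordering uses any information about the target user.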

So, let's start with the first evaluation approach, where the model is updated at each new prediction time point:

In [46]:
eval_baseline = EvaluationPipeline(df_ratings, 0.2) # .sample(frac=0.01, random_state=5)

In [47]:
metrics_output_dict_baseline = eval_baseline.evaluate( # recommendation_results_baseline
    BaselinePopularityModel(),
    user_average_metrics=False,
    retrain_model_each_point=True)

  0%|          | 0/200016 [00:00<?, ?it/s]

In [48]:
metrics_output_dict_baseline

{'mae': 2.9840492508644663,
 'rmse': 3.230121187663979,
 'precision': 0.0004649628029757619,
 'average_precision': 0.12319586042903112,
 'mean_reciprocal_rank': 0.003951362090124058,
 'ndcg': 0.9513780267109001,
 'coverage': 0.012417218543046357}

In [49]:
with open('baseline_popularity_with_updates_metrics.json', 'w') as f:
    json.dump(metrics_output_dict_baseline, f)

As we can see, almost all of the metrics are worse than in the corresponding evaluation of the mean-rating baseline model. The most noticeable change is in `RMSE`: it has increased significantly but, more importantly, it is now closer to `MAE` than it was for the mean-rating model. This indicates that fewer predictions stand out as being very far from the ground-truth ratings, or rather that the errors are more similar to each other in absolute value. The mean-rating baseline produced occasional large errors because the mean rating given by other users is sometimes predictive for the target user, while at other times the target user holds a different opinion and the prediction is far off. For the popularity-based model, the number of ratings is less predictive of the target user's rating overall, so there are fewer cases where the prediction is actually close to the target. From this we can conclude that popularity is indeed not a good approximation of the rating, although it gives more uniform results.
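The relationship between `RMSE` and `MAE` used in this argument can be checked on a toy example. `RMSE` is never smaller than `MAE`, and the two coincide only when all absolute errors are equal; the error vectors below are hypothetical and chosen to have the same `MAE`.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform_errors = [3.0, 3.0, 3.0, 3.0]  # every prediction is off by the same amount
mixed_errors = [0.0, 0.0, 6.0, 6.0]    # some exact hits, some large misses

for name, errs in [("uniform", uniform_errors), ("mixed", mixed_errors)]:
    print(f"{name}: MAE={mae(errs):.3f}, RMSE={rmse(errs):.3f}")
# uniform: MAE=3.000, RMSE=3.000
# mixed: MAE=3.000, RMSE=4.243
```

In the results above, the `RMSE`/`MAE` ratio is about 1.08, which is closer to the uniform case than to the mixed one.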

The same can be said about the ranking-based metrics, though `NDCG` continues to produce largely inaccurate results due to the previously mentioned properties of the dataset (coarse discretization of the ground-truth values, with only 5 rating options available to the users).
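One plausible mechanism behind the inflated `NDCG` can be sketched as follows. This assumes the common formulation with linear gain and a log2 discount, which may differ from the one implemented in `EvaluationPipeline`; the relevance values are hypothetical ratings on a 1 to 5 scale.

```python
import math

def dcg(relevances):
    # Linear gain with a log2 position discount (an assumed formulation).
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

best_order = [5, 4, 4, 3, 3]      # ratings already sorted from best to worst
worst_order = sorted(best_order)  # the same ratings in the worst possible order

ndcg_best = ndcg(best_order)
ndcg_worst = ndcg(worst_order)
print(round(ndcg_best, 3), round(ndcg_worst, 3))  # the second value is roughly 0.88
```

Even the worst possible ordering of these items scores close to 0.9, because the ratings span a narrow range and contain many ties. Under these assumptions, high `NDCG` values say little about ranking quality.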

Now, let's evaluate the baseline popularity-based model with the evaluation approach without updates:

In [50]:
eval_baseline_no_updates = EvaluationPipeline(df_ratings, 0.2) # .sample(frac=0.01, random_state=5)

In [51]:
baseline_model_not_each_point = BaselinePopularityModel()
baseline_model_not_each_point.fit(eval_baseline_no_updates.train_data)

In [52]:
metrics_output_dict_baseline_no_updates = eval_baseline_no_updates.evaluate( # recommendation_results_baseline_no_updates
    baseline_model_not_each_point,
    user_average_metrics=False,
    retrain_model_each_point=False)

  0%|          | 0/200016 [00:00<?, ?it/s]

In [53]:
metrics_output_dict_baseline_no_updates

{'mae': 2.9430047722260126,
 'rmse': 3.190256461999607,
 'precision': 0.000374970002399808,
 'average_precision': 0.13829390681488837,
 'mean_reciprocal_rank': 0.0033665109136543046,
 'ndcg': 0.9273651300033005,
 'coverage': 0.007781456953642384}

In [54]:
with open('baseline_popularity_no_updates_metrics.json', 'w') as f:
    json.dump(metrics_output_dict_baseline_no_updates, f)

Every trend in the difference between the two evaluation approaches is the same here as for the mean-rating baseline model, except that `NDCG` decreases more noticeably.

Another important point is that precision has again decreased compared to the first evaluation approach. This shows once more that the evaluation approach with updates suggests the top recommendation slightly more accurately, because it takes the newest information in the data into account.