<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# Estimating Baseline Performance
<br>
Estimating baseline performance is as important as choosing the right metrics for model evaluation. In this notebook, we briefly discuss why baseline performance matters and how to measure it.

The notebook covers two example scenarios in the context of movie recommendation: 1) rating prediction and 2) top-k recommendation.

### Why does baseline performance matter? 
<br>
Before we dive into baseline performance estimation, it is worth thinking about why we need it.

As the word 'baseline' suggests, <b>baseline performance</b> is the minimum performance we expect a model to achieve, or a starting point for model comparisons.

Once we train a model and compute the evaluation metrics we chose, we need a way to interpret the numbers and to tell whether the trained model is better than a simple rule-based one. Baseline results give us that reference.

Let's say we are building a food recommender. We evaluated the model on the test set and got nDCG (at 10) = 0.3. At that point, we cannot tell whether the model is good or bad. But once we find out that a simple rule of <i>'recommending the top-10 most popular foods to all users'</i> achieves nDCG = 0.4, we see that our model is not good enough. Maybe the model is not trained well, or maybe we should reconsider whether nDCG is the right metric for predicting user behavior in the given problem.

### How can we estimate the baseline performance?
<br>
To estimate the baseline performance, we first pick a baseline model and evaluate it with the same evaluation metrics we will use for our main model. In general, a very simple rule or even the <b>zero rule</b>, which <i>predicts the mean for regression or the mode for classification</i>, is enough as a baseline model. (Random prediction might be acceptable for certain problems, but it usually performs worse than the zero rule.) If we already have a running model and are trying to improve it, we can use its previous results as the baseline.
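
As a minimal sketch of the zero rule described above, the snippet below builds two constant predictors from toy data. The function names and the data are hypothetical and only serve the illustration; they are not part of the `recommenders` library.

```python
from collections import Counter
from statistics import mean


def zero_rule_regression(train_targets):
    """Return a predictor that always outputs the training mean."""
    constant = mean(train_targets)
    return lambda _features=None: constant


def zero_rule_classification(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    constant = Counter(train_labels).most_common(1)[0][0]
    return lambda _features=None: constant


# Toy data, made up for illustration only
predict_rating = zero_rule_regression([4.0, 3.0, 5.0, 4.0])
predict_label = zero_rule_classification(["like", "dislike", "like"])

print(predict_rating())  # mean of the four toy ratings
print(predict_label())   # most frequent toy label
```

Both predictors ignore their input entirely, which is what makes them a useful lower bound: any model that looks at features should do at least as well.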

Most importantly, <b>different baseline approaches should be taken for different problems and business goals</b>. For example, recommending previously purchased items could serve as a baseline model for food or restaurant recommendation, since people tend to eat the same foods repeatedly. For TV show or movie recommendation, on the other hand, recommending previously watched items makes little sense. There, recommending the most popular (most watched or highly rated) items is more likely to be useful as a baseline.

In this notebook, we demonstrate how to estimate the baseline performance for movie recommendation with the MovieLens dataset. For rating prediction we use the mean, i.e. our baseline model predicts a user's rating of a movie by averaging the ratings that user gave to other movies in the training set. For the top-k recommendation problem, we use the top-k most-rated movies as the baseline model. We rank by the number of ratings because we treat the binary signal of 'rated vs. not rated' as the user's implicit preference when evaluating ranking metrics.

Now, let's jump into the implementation!

In [1]:
import sys
import itertools
import pandas as pd

from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.datasets.pandas_df_utils import filter_by
from recommenders.evaluation.python_evaluation import (
    rmse,
    mae,
    rsquared,
    exp_var,
    map,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
)
from recommenders.utils.notebook_utils import store_metadata

print(f"System version: {sys.version}")
print(f"Pandas version: {pd.__version__}")


System version: 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0]
Pandas version: 1.5.3


First, let's prepare the training, validation, and test sets.

In [16]:
MOVIELENS_DATA_SIZE = "1m"
TOP_K = 20

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE, 
    header=["userID", "itemID", "rating", "timestamp"]
)

data.head()

100%|██████████| 5.78k/5.78k [00:03<00:00, 1.68kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,1,1193,5.0,978300760
1,1,661,3.0,978302109
2,1,914,3.0,978301968
3,1,3408,4.0,978300275
4,1,2355,5.0,978824291


In [4]:
data.info

<bound method DataFrame.info of          userID  itemID  rating  timestamp
0             1    1193     5.0  978300760
1             1     661     3.0  978302109
2             1     914     3.0  978301968
3             1    3408     4.0  978300275
4             1    2355     5.0  978824291
...         ...     ...     ...        ...
1000204    6040    1091     1.0  956716541
1000205    6040    1094     5.0  956704887
1000206    6040     562     5.0  956704746
1000207    6040    1096     4.0  956715648
1000208    6040    1097     4.0  956715569

[1000209 rows x 4 columns]>

In [5]:
train, validate, test = python_chrono_split(
    data,
    ratio=[0.8, 0.1, 0.1],
    filter_by="user",
    col_user="userID",
    col_item="itemID",
    col_timestamp="timestamp",
)
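
With `filter_by="user"`, the split above is chronological within each user's history: earlier ratings go to training and later ones to the held-out sets. The snippet below is a rough pandas sketch of that idea on toy data with a two-way 80/20 split. It is an assumption-laden illustration, not the implementation of `python_chrono_split`, and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy interactions, made up for illustration only
toy = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "itemID":    [10, 11, 12, 13, 14, 10, 12, 15, 16, 17],
    "timestamp": [5, 1, 3, 2, 4, 9, 6, 8, 7, 10],
})

# Order every user's history by time
toy = toy.sort_values(["userID", "timestamp"])

# Relative position of each row within its user's history, in [0, 1)
position = toy.groupby("userID").cumcount()
history_len = toy.groupby("userID")["itemID"].transform("count")
fraction = position / history_len

# The earliest 80% of each user's ratings go to train, the rest to test
toy_train = toy[fraction < 0.8]
toy_test = toy[fraction >= 0.8]

print(toy_test)
```

Splitting this way means the baseline is always evaluated on ratings that come after the ones it was built from, which mirrors how a deployed recommender is used.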

### 1. Rating prediction baseline

As we discussed earlier, we use each user's **mean rating** as the baseline prediction.

In [7]:
# Calculate avg ratings from the training set
users_ratings = train.groupby(["userID"])["rating"].mean()
users_ratings = users_ratings.to_frame().reset_index()
users_ratings.rename(columns={"rating": "AvgRating"}, inplace=True)

users_ratings.head()

Unnamed: 0,userID,AvgRating
0,1,4.190476
1,2,3.873786
2,3,3.926829
3,4,4.117647
4,5,3.246835


In [8]:
# Generate prediction for the test set
baseline_predictions = pd.merge(test, users_ratings, on=["userID"], how="inner")

baseline_predictions.loc[baseline_predictions["userID"] == 1].head()

Unnamed: 0,userID,itemID,rating,timestamp,AvgRating
0,1,2294,4.0,978824291,4.190476
1,1,783,4.0,978824291,4.190476
2,1,1566,4.0,978824330,4.190476
3,1,1907,4.0,978824330,4.190476
4,1,48,5.0,978824351,4.190476


In [9]:
baseline_predictions = baseline_predictions[["userID", "itemID", "AvgRating"]]
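
One caveat about the inner merge used above: it drops test rows for any user who has no ratings in the training set. With a per-user chronological split this should not happen, but with other splitting strategies it can. The sketch below shows one possible fallback, assuming we are happy to predict the global training mean for unseen users; the toy data and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration; user 3 has no training ratings
toy_train = pd.DataFrame({"userID": [1, 1, 2], "rating": [4.0, 5.0, 2.0]})
toy_test = pd.DataFrame({"userID": [1, 2, 3], "itemID": [7, 8, 9]})

# Global mean and per-user means, both from the training data only
global_mean = toy_train["rating"].mean()
user_means = (
    toy_train.groupby("userID")["rating"].mean().rename("AvgRating").reset_index()
)

# A left merge keeps every test row; unseen users get NaN here ...
toy_predictions = toy_test.merge(user_means, on="userID", how="left")
# ... which we then replace with the global mean
toy_predictions["AvgRating"] = toy_predictions["AvgRating"].fillna(global_mean)

print(toy_predictions)
```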

Now, let's evaluate how our baseline model performs on regression metrics.

In [10]:


cols = {
    "col_user": "userID",
    "col_item": "itemID",
    "col_rating": "rating",
    "col_prediction": "AvgRating",
}

eval_rmse = rmse(test, baseline_predictions, **cols)
eval_mae = mae(test, baseline_predictions, **cols)
eval_rsquared = rsquared(test, baseline_predictions, **cols)
eval_exp_var = exp_var(test, baseline_predictions, **cols)

print("RMSE:\t\t%f" % eval_rmse,
      "MAE:\t\t%f" % eval_mae,
      "rsquared:\t%f" % eval_rsquared,
      "exp var:\t%f" % eval_exp_var, sep='\n')

RMSE:		1.099271
MAE:		0.869504
rsquared:	0.080973
exp var:	0.113273


As you can see, the baseline gives us concrete reference numbers. For example, MAE (Mean Absolute Error) is about 0.87 on the MovieLens 1M data, meaning that users' actual ratings deviate from their mean training rating by about 0.87 on average. The low R-squared (about 0.08) shows that the per-user mean explains only a small part of the rating variance, yet it still captures user bias: some users tend to give high ratings to all movies while others give low ratings.

The next time we build a machine-learning model, we will want it to perform better than this baseline.
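
To make the regression metrics reported above concrete, here is a small hand computation of MAE, RMSE, and R-squared using their textbook definitions. The ratings are made up, and the code is only a sketch; the `recommenders` evaluation functions may differ in implementation details.

```python
import math

# Toy ratings and a constant prediction, made up for illustration only
actual = [4.0, 5.0, 3.0, 2.0]
predicted = [4.0, 4.0, 4.0, 4.0]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average absolute error
toy_mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error (penalizes large errors more)
toy_rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
toy_r2 = 1 - ss_res / ss_tot

print(toy_mae, toy_rmse, toy_r2)
```

In this toy case R-squared is negative (about -0.2) because the constant 4.0 fits the ratings worse than their own mean of 3.5 would. An R-squared near zero therefore means "roughly as good as predicting the mean of the evaluated ratings".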

### 2. Top-k recommendation baseline

Recommending the **most popular items** is an intuitive and simple approach that works for many recommendation scenarios. Here, we use the top-k most-rated movies as the baseline model, as discussed earlier.

In [11]:
item_counts = train["itemID"].value_counts().to_frame().reset_index()
item_counts.columns = ["itemID", "Count"]
item_counts.head()

Unnamed: 0,itemID,Count
0,2858,3165
1,1196,2745
2,260,2716
3,1210,2650
4,2028,2423


In [12]:
user_item_col = ["userID", "itemID"]

# Cross join users and items
test_users = test['userID'].unique()
user_item_list = list(itertools.product(test_users, item_counts['itemID']))
users_items = pd.DataFrame(user_item_list, columns=user_item_col)

print("Number of user-item pairs:", len(users_items))

# Remove seen items (items in the train set), as we will not recommend those to the users again
users_items_remove_seen = filter_by(users_items, train, user_item_col)

print("After remove seen items:", len(users_items_remove_seen))

Number of user-item pairs: 22148680
After remove seen items: 21348487
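
The cell above builds every user-item pair and then removes the pairs already present in the training set. The same two steps can be sketched with plain pandas on toy data: a cross join followed by an anti-join. This is only an illustration of the idea and is not how `filter_by` is implemented; the `toy_*` names and values are hypothetical, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Toy data, made up for illustration only
toy_users = pd.DataFrame({"userID": [1, 2]})
toy_items = pd.DataFrame({"itemID": [10, 11, 12]})
toy_seen = pd.DataFrame({"userID": [1, 2, 2], "itemID": [10, 10, 12]})

# Every (user, item) combination
candidates = toy_users.merge(toy_items, how="cross")

# Anti-join: flag where each pair comes from, then keep pairs found only on the left
flagged = candidates.merge(
    toy_seen, on=["userID", "itemID"], how="left", indicator=True
)
unseen = flagged.loc[flagged["_merge"] == "left_only", ["userID", "itemID"]]

print(unseen)
```

Note that the cross join grows as (number of users) x (number of items), which is why the real data produces more than 22 million candidate pairs.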


In [18]:
# Generate recommendations
baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, on=['itemID'], how='inner')
baseline_recommendations.tail()

Unnamed: 0,itemID,Count,userID
21348482,3904,1,6036
21348483,3904,1,6037
21348484,3904,1,6038
21348485,3904,1,6039
21348486,3904,1,6040


In [17]:
cols["col_prediction"] = "Count"

eval_map = map(test, baseline_recommendations, k=TOP_K, **cols)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, k=TOP_K, **cols)
eval_precision = precision_at_k(test, baseline_recommendations, k=TOP_K, **cols)
eval_recall = recall_at_k(test, baseline_recommendations, k=TOP_K, **cols)

print("MAP:\t%f" % eval_map,
      "NDCG@K:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

MAP:	0.017911
NDCG@K:	0.062968
Precision@K:	0.043767
Recall@K:	0.065228


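Again, the baseline gives us reference numbers to beat: with k = 20 on this split, the popularity baseline reaches nDCG@K of about 0.063 and Precision@K of about 0.044.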
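
For intuition about what the ranking metrics measure, the sketch below computes precision@k and recall@k for a single user with the textbook definitions. The helper function, item IDs, and relevance list are hypothetical, and the `recommenders` implementations average over users and may handle edge cases differently.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)


# Toy example, made up for illustration only
recommended = [2858, 1196, 260, 1210, 2028]  # ranked by popularity
relevant = [260, 589, 2028, 1]               # items the user rated in the test set

toy_precision, toy_recall = precision_recall_at_k(recommended, relevant, k=5)

print(toy_precision, toy_recall)
```

Two of the five recommended items are relevant, so precision@5 is 2/5, and two of the user's four relevant items were retrieved, so recall@5 is 2/4.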

<br>

### Concluding remarks

In this notebook, we discussed how to measure baseline performance for the movie recommendation example.
We covered very naive approaches as baselines, but they are still useful in the sense that they provide reference numbers for estimating the complexity of the given problem as well as the relative performance of the recommender models we are building.

In [15]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("rmse", eval_rmse)
store_metadata("mae", eval_mae)
store_metadata("exp_var", eval_exp_var)
store_metadata("rsquared", eval_rsquared)
