# Offline Evaluation Framework
**Objective:** Define a consistent protocol that simulates real-world usage.
**Strategy:**
1.  **Split:** Per-user temporal split (80/10/10). Each user's history is sorted by time to respect causality: the model is always evaluated on interactions that happen after the ones it was trained on (predicting the next movies the user will watch).
2.  **Metrics:** A mix of ranking quality (NDCG), relevance (Recall/Precision), and diversity (Coverage, popularity bias) to detect popularity bias.

In [2]:
import os
import pandas as pd
import numpy as np
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

DATA_DIR = "../data/ml-1m"

ratings = pd.read_csv(
    os.path.join(DATA_DIR, "ratings.dat"),
    sep="::",
    engine="python",
    names=["user_id", "item_id", "rating", "timestamp"],
    encoding="latin-1"
)

movies = pd.read_csv(
    os.path.join(DATA_DIR, "movies.dat"),
    sep="::",
    engine="python",
    names=["item_id", "title", "genres"],
    encoding="latin-1"
)

We use a per-user temporal split to ensure that every user in the test set also has a history in the training set. This avoids the cold-start user problem during evaluation, so that we can focus on the algorithms themselves.
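As a rough illustration of how the 80/10/10 ratios translate into per-user counts, the sketch below assumes the slice boundaries are rounded down with `int()`, as in the split function in this notebook. `split_sizes` is a hypothetical helper, not part of the pipeline. MovieLens 1M only includes users with at least 20 ratings, so under this assumption every user contributes at least 16 training, 2 validation, and 2 test interactions.

```python
def split_sizes(n, train_ratio=0.8, val_ratio=0.1):
    """Return (train, val, test) counts for a user with n interactions.

    Hypothetical helper: mirrors a split whose boundaries are floored with int().
    """
    train_n = int(n * train_ratio)   # rounded down
    val_n = int(n * val_ratio)       # rounded down
    test_n = n - train_n - val_n     # the remainder goes to test
    return train_n, val_n, test_n


for n in (20, 25, 37):
    print(n, split_sizes(n))
# 20 (16, 2, 2)
# 25 (20, 2, 3)
# 37 (29, 3, 5)
```

Because both boundaries are rounded down, leftover rows fall into the test slice, so the overall test share can end up slightly above 10%.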

In [5]:
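def per_user_temporal_split(df, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    # test_ratio is kept for readability only: the test set is whatever remains
    # after the train and validation slices are taken.
    train_data = []
    val_data = []
    test_data = []
    
    for user_id, user_df in df.groupby('user_id'):
        # Oldest interactions first, so later slices always lie in the user's future
        user_sorted = user_df.sort_values('timestamp')
        n = len(user_sorted)
        
        # int() rounds down, so rows left over from rounding end up in the test slice
        train_end = int(n * train_ratio)
        val_end = train_end + int(n * val_ratio)
        
        train_data.append(user_sorted.iloc[:train_end])
        val_data.append(user_sorted.iloc[train_end:val_end])
        test_data.append(user_sorted.iloc[val_end:])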
    
    train_df = pd.concat(train_data, ignore_index=True)
    val_df = pd.concat(val_data, ignore_index=True)
    test_df = pd.concat(test_data, ignore_index=True)
    
    return train_df, val_df, test_df

train, val, test = per_user_temporal_split(ratings)

print(f"Train: {len(train):,} ({len(train)/len(ratings)*100:.1f}%)")
print(f"Val: {len(val):,} ({len(val)/len(ratings)*100:.1f}%)")
print(f"Test: {len(test):,} ({len(test)/len(ratings)*100:.1f}%)")

print(f"\nTrain users: {train['user_id'].nunique():,}")
print(f"Val users: {val['user_id'].nunique():,}")
print(f"Test users: {test['user_id'].nunique():,}")

print(f"\nTrain items: {train['item_id'].nunique():,}")
print(f"Val items: {val['item_id'].nunique():,}")
print(f"Test items: {test['item_id'].nunique():,}")

Train: 797,758 (79.8%)
Val: 97,383 (9.7%)
Test: 105,068 (10.5%)

Train users: 6,040
Val users: 6,040
Test users: 6,040

Train items: 3,666
Val items: 3,360
Test items: 3,445


* All three sets contain the same number of users (6,040)
* The number of distinct items differs slightly between the sets, which suggests that some items may appear only in a user's later interactions and never in training (cold-start items)
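One way to double-check that the split respects time order is to compare, per user, the latest timestamp in the earlier set with the earliest timestamp in the later set. The sketch below is a minimal check under that idea; `temporal_order_violations` is a hypothetical helper and the toy frames are made up, but the same call should work on `train`, `val`, and `test`, assuming they keep the `user_id` and `timestamp` columns.

```python
import pandas as pd


def temporal_order_violations(earlier: pd.DataFrame, later: pd.DataFrame) -> int:
    """Count users whose latest `earlier` timestamp is after their earliest `later` one."""
    last_earlier = earlier.groupby("user_id")["timestamp"].max()
    first_later = later.groupby("user_id")["timestamp"].min()
    # Align the two series on user_id; users missing from either side are skipped
    both = pd.concat([last_earlier, first_later], axis=1, keys=["last", "first"]).dropna()
    return int((both["last"] > both["first"]).sum())


# Toy data: user 1 is ordered correctly, user 2 has a "future" row in the earlier set
toy_train = pd.DataFrame({"user_id": [1, 1, 2, 2], "timestamp": [10, 20, 5, 7]})
toy_test = pd.DataFrame({"user_id": [1, 2], "timestamp": [30, 6]})

print(temporal_order_violations(toy_train, toy_test))  # 1
```

A result of 0 on the real splits would indicate that no user has training interactions dated after their validation or test interactions.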

## Cold-start items check

In [7]:
train_items = set(train['item_id'].unique())
val_items = set(val['item_id'].unique())
test_items = set(test['item_id'].unique())

new_items_val = val_items - train_items
new_items_test = test_items - train_items

print(f"New items in val (not in train): {len(new_items_val)}")
print(f"New items in test (not in train): {len(new_items_test)}")
print(f"Val interactions with new items: {val[val['item_id'].isin(new_items_val)].shape[0]}")
print(f"Test interactions with new items: {test[test['item_id'].isin(new_items_test)].shape[0]}")

New items in val (not in train): 18
New items in test (not in train): 27
Val interactions with new items: 20
Test interactions with new items: 36


* There are 27 movies in the test set that the model will not see during training. Pure collaborative filtering will fail on these 36 interactions (it cannot recommend an ID it has never seen), so we need a content-based approach to handle such cases
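If the effect of cold-start items needs to be isolated, the evaluation set can be partitioned into interactions with items known from training ("warm") and interactions with unseen items ("cold"). The following is a minimal sketch of that idea; `split_warm_cold` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd


def split_warm_cold(eval_df: pd.DataFrame, train_df: pd.DataFrame):
    """Partition eval_df by whether each item also occurs in train_df."""
    known_items = set(train_df["item_id"].unique())
    is_warm = eval_df["item_id"].isin(known_items)
    return eval_df[is_warm], eval_df[~is_warm]


toy_train = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 11, 10]})
toy_test = pd.DataFrame({"user_id": [1, 2, 2], "item_id": [10, 11, 99]})

warm, cold = split_warm_cold(toy_test, toy_train)
print(len(warm), len(cold))  # 2 1
```

Evaluating a collaborative filtering model on the warm part only would show how it ranks items it can actually recommend, while the cold part stays available for a content-based model.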

## Metrics definition
The metrics below were chosen to cover ranking quality, relevance, and diversity:

* NDCG: measures ranking quality. It penalizes placing relevant items low in the list
* Precision/Recall@K: treats recommendation as a classification problem. Out of the retrieved items, how many are good (precision), and how many of the good items did we retrieve (recall)? We set the relevance threshold to 4.0 because we care whether the user liked the recommendation, not merely whether they rated it
* Coverage: the share of catalog items (items seen in training) that the model recommends to at least one user. Low coverage signals that the model recommends only popular items
* Popularity bias: the average popularity (number of training ratings) of the recommended items
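
For reference, the definitions can be written out as formulas. The notation is our own and is meant to match the implementation in this notebook: \(\mathrm{rel}_i\) is the test rating of the item at rank \(i\) (0 if the user did not rate it in the test set), \(P_u\) is the recommendation list for user \(u\) and \(P_u^k\) its top \(k\) items, \(R_u\) is the set of test items the user rated at or above the threshold, \(I_{\mathrm{train}}\) is the set of items seen in training, and \(\mathrm{pop}(i)\) is the number of training ratings of item \(i\). IDCG is the DCG of the user's test ratings sorted in descending order.

```latex
\begin{aligned}
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, &
\mathrm{NDCG@}k &= \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \\
\mathrm{Precision@}k &= \frac{|P_u^k \cap R_u|}{|P_u^k|}, &
\mathrm{Recall@}k &= \frac{|P_u^k \cap R_u|}{|R_u|}, \\
\mathrm{Coverage} &= \frac{\left|\bigcup_u P_u\right|}{|I_{\mathrm{train}}|}, &
\mathrm{PopularityBias} &= \frac{1}{\sum_u |P_u|} \sum_u \sum_{i \in P_u} \mathrm{pop}(i).
\end{aligned}
```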

In [8]:
def dcg_at_k(relevance_scores, k):
    relevance_scores = np.array(relevance_scores)[:k]
    if relevance_scores.size == 0:
        return 0.0
    gains = 2 ** relevance_scores - 1
    discounts = np.log2(np.arange(2, relevance_scores.size + 2))
    return np.sum(gains / discounts)

def ndcg_at_k(predictions, ground_truth, k=10):
    if not predictions or not ground_truth:
        return 0.0
    
    relevance = [ground_truth.get(item_id, 0) for item_id, _ in predictions[:k]]
    
    ideal_relevance = sorted(ground_truth.values(), reverse=True)
    
    dcg = dcg_at_k(relevance, k)
    idcg = dcg_at_k(ideal_relevance, k)
    
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(predictions, ground_truth, k=10, threshold=4.0):
    if not predictions or not ground_truth:
        return 0.0
    
    relevant_items = {item for item, rating in ground_truth.items() if rating >= threshold}
    
    if len(relevant_items) == 0:
        return 0.0
    
    predicted_items = {item_id for item_id, _ in predictions[:k]}
    
    hits = len(predicted_items & relevant_items)
    return hits / len(relevant_items)

def precision_at_k(predictions, ground_truth, k=10, threshold=4.0):
    if not predictions or not ground_truth:
        return 0.0
    
    relevant_items = {item for item, rating in ground_truth.items() if rating >= threshold}
    predicted_items = {item_id for item_id, _ in predictions[:k]}
    
    if len(predicted_items) == 0:
        return 0.0
    
    hits = len(predicted_items & relevant_items)
    return hits / len(predicted_items)

def coverage(all_predictions, item_catalog):
    recommended_items = set()
    for preds in all_predictions:
        recommended_items.update([item_id for item_id, _ in preds])
    
    return len(recommended_items) / len(item_catalog) if len(item_catalog) > 0 else 0.0

def popularity_bias(all_predictions, item_popularity):
    all_recommended = []
    for preds in all_predictions:
        all_recommended.extend([item_id for item_id, _ in preds])
    
    if not all_recommended:
        return 0.0
    
    avg_pop = np.mean([item_popularity.get(item, 0) for item in all_recommended])
    return avg_pop
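
A small worked example may help to sanity-check the NDCG definition. The sketch re-implements the same formula (exponential gain, log2 discount) in compact form so that it runs on its own; the item IDs and ratings are made up.

```python
import math


def dcg(relevances):
    # Exponential gain (2^rel - 1), discounted by log2 of the 1-based rank + 1
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(predictions, ground_truth, k):
    # Relevance of each recommended item: its test rating, or 0 if not rated
    relevances = [ground_truth.get(item, 0) for item, _ in predictions[:k]]
    # Ideal ordering: the user's test ratings sorted from best to worst
    ideal = sorted(ground_truth.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg > 0 else 0.0


toy_predictions = [("A", 0.9), ("B", 0.8), ("C", 0.7)]  # (item_id, score)
toy_truth = {"A": 5, "C": 3, "D": 4}                    # item_id -> test rating

toy_dcg = dcg([5, 0, 3])  # 31/1 + 0 + 7/2
toy_ndcg = ndcg(toy_predictions, toy_truth, k=3)
print(toy_dcg, round(toy_ndcg, 3))  # 34.5 0.785
```

Note how the exponential gain weights a 5-star item (gain 31) far more heavily than a 3-star item (gain 7), so placing the highest-rated items first matters most.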

## Evaluator class
We calculate metrics per user and then average them, so that power users do not dominate the result
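
The difference between per-user averaging and pooling all interactions can be seen on a toy case. The numbers below are hypothetical: one power user with 50 relevant test items and two light users with 2 each.

```python
import numpy as np

# Hypothetical hits and relevant-item counts for three users
hits = np.array([40, 1, 0])
relevant = np.array([50, 2, 2])

# Per-user (macro) average: every user counts equally
per_user_recall = hits / relevant          # [0.8, 0.5, 0.0]
macro_recall = per_user_recall.mean()      # (0.8 + 0.5 + 0.0) / 3

# Pooled (micro) average: the power user dominates
micro_recall = hits.sum() / relevant.sum() # 41 / 54

print(round(macro_recall, 4), round(micro_recall, 4))  # 0.4333 0.7593
```

The pooled figure looks much better only because the power user contributes most of the interactions, which is the effect per-user averaging is meant to avoid.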

In [12]:
class RecommenderEvaluator:    
    def __init__(self, train_df, test_df, k_values=(5, 10, 20), relevance_threshold=4.0):
        self.train_df = train_df
        self.test_df = test_df
        self.k_values = k_values
        self.relevance_threshold = relevance_threshold
        
        self.ground_truth = self.build_ground_truth()
        
        self.item_catalog = set(train_df['item_id'].unique())
        self.item_popularity = train_df.groupby('item_id').size().to_dict()
        
    def build_ground_truth(self):
        ground_truth = defaultdict(dict)
        for _, row in self.test_df.iterrows():
            ground_truth[row['user_id']][row['item_id']] = row['rating']
        return dict(ground_truth)
    
    def evaluate_model(self, model, model_name="Model"):
        results = defaultdict(list)
        all_predictions = []
        
        test_users = list(self.ground_truth.keys())
        
        for user_id in test_users:
            gt = self.ground_truth[user_id]
            
            max_k = max(self.k_values)
            predictions = model.predict_for_user(user_id, k=max_k)
            all_predictions.append(predictions)
            
            for k in self.k_values:
                results[f'NDCG@{k}'].append(ndcg_at_k(predictions, gt, k))
                results[f'Recall@{k}'].append(recall_at_k(predictions, gt, k, self.relevance_threshold))
                results[f'Precision@{k}'].append(precision_at_k(predictions, gt, k, self.relevance_threshold))
        
        metrics = {}
        for metric_name, values in results.items():
            metrics[metric_name] = np.mean(values)
        
        metrics['Coverage'] = coverage(all_predictions, self.item_catalog)
        metrics['Popularity_Bias'] = popularity_bias(all_predictions, self.item_popularity)
        
        return metrics
    
    def print_metrics(self, metrics, model_name="Model"):
        print(f"{model_name} - Evaluation results")
        
        print("Ranking metrics:")
        for k in self.k_values:
            print(f"NDCG@{k:2d}: {metrics[f'NDCG@{k}']:.4f}")
        
        print("\n")
        
        print("Relevance metrics (threshold={:.1f}):".format(self.relevance_threshold))
        for k in self.k_values:
            print(f"Recall@{k:2d}: {metrics[f'Recall@{k}']:.4f}")
            print(f"Precision@{k:2d}: {metrics[f'Precision@{k}']:.4f}")
        
        print("\n")
        
        print("Diversity metrics:")
        print(f"Coverage: {metrics['Coverage']:.4f}")
        print(f"Popularity bias: {metrics['Popularity_Bias']:.2f}")

In [13]:
os.makedirs("../data/processed", exist_ok=True)

train.to_csv("../data/processed/train.csv", index=False)
val.to_csv("../data/processed/val.csv", index=False)
test.to_csv("../data/processed/test.csv", index=False)

## Random baseline
Simulates random guessing (Monte Carlo simulation) to establish a floor for every metric

In [14]:
class RandomRecommender:
    def __init__(self, train_df):
        self.items = train_df['item_id'].unique()
    
    def predict_for_user(self, user_id, k=10):
        selected = np.random.choice(self.items, size=min(k, len(self.items)), replace=False)
        scores = np.random.rand(len(selected))
        return list(zip(selected, scores))

evaluator = RecommenderEvaluator(train, test, k_values=[5, 10, 20])
random_model = RandomRecommender(train)
metrics = evaluator.evaluate_model(random_model, "Random baseline")
evaluator.print_metrics(metrics, "Random baseline")

Random baseline - Evaluation results
Ranking metrics:
NDCG@ 5: 0.0022
NDCG@10: 0.0030
NDCG@20: 0.0039


Relevance metrics (threshold=4.0):
Recall@ 5: 0.0011
Precision@ 5: 0.0022
Recall@10: 0.0028
Precision@10: 0.0024
Recall@20: 0.0052
Precision@20: 0.0024


Diversity metrics:
Coverage: 1.0000
Popularity bias: 217.29


* NDCG@10 = 0.003: effectively zero, so the task is not trivial
* Coverage = 100%: random guessing naturally covers the whole catalog, hence the maximal diversity
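
Since the random baseline draws from NumPy's global random state, its numbers will vary slightly from run to run. If reproducible figures are needed, one option is a seeded variant. The sketch below is an assumption on our part, not the class used for the results above; `SeededRandomRecommender` and its `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


class SeededRandomRecommender:
    """Random baseline with its own seeded generator (hypothetical variant)."""

    def __init__(self, train_df, seed=42):
        self.items = train_df["item_id"].unique()
        self.rng = np.random.default_rng(seed)  # independent of global state

    def predict_for_user(self, user_id, k=10):
        size = min(k, len(self.items))
        selected = self.rng.choice(self.items, size=size, replace=False)
        scores = self.rng.random(size)
        return list(zip(selected, scores))


toy_train = pd.DataFrame({"item_id": [1, 2, 3, 4, 5, 5]})
first = SeededRandomRecommender(toy_train, seed=0).predict_for_user(user_id=1, k=3)
second = SeededRandomRecommender(toy_train, seed=0).predict_for_user(user_id=1, k=3)

# Two recommenders built with the same seed produce the same list
same_items = [item for item, _ in first] == [item for item, _ in second]
print(same_items)  # True
```

The results stay reproducible only as long as users are evaluated in the same order, because the generator advances with every call.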

## Popularity baseline
We check how effective it is to recommend only the most popular items, since we know that 50% of ratings come from 500 items

In [15]:
class PopularityRecommender:
    def __init__(self, train_df):
        self.popularity = train_df.groupby('item_id').size().sort_values(ascending=False)
        self.popular_items = list(self.popularity.index)
    
    def predict_for_user(self, user_id, k=10):
        items = self.popular_items[:k]
        scores = [self.popularity[item] for item in items]
        return list(zip(items, scores))

pop_model = PopularityRecommender(train)
pop_metrics = evaluator.evaluate_model(pop_model, "Popularity baseline")
evaluator.print_metrics(pop_metrics, "Popularity baseline")

Popularity baseline - Evaluation results
Ranking metrics:
NDCG@ 5: 0.0182
NDCG@10: 0.0223
NDCG@20: 0.0326


Relevance metrics (threshold=4.0):
Recall@ 5: 0.0128
Precision@ 5: 0.0167
Recall@10: 0.0240
Precision@10: 0.0161
Recall@20: 0.0498
Precision@20: 0.0170


Diversity metrics:
Coverage: 0.0055
Popularity bias: 2328.40


* NDCG@10 has jumped from 0.003 to 0.022, meaning that popularity alone explains a large share of user behaviour. Any model should beat this value to be considered useful
* The trade-off is that the model achieves this accuracy while ignoring about 99.5% of items (coverage 0.0055)
* As expected, the recommended items are very popular: about 2,328 training ratings per recommended item on average, versus about 217 for the random baseline
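
To compare the two baselines side by side, the figures can be collected in a small table. The values below are copied from the outputs printed in this notebook (K = 10 only); the ratio column is our own addition.

```python
import pandas as pd

# Values copied from the printed evaluation results of the two baselines
baselines = pd.DataFrame(
    {
        "Random": {
            "NDCG@10": 0.0030, "Recall@10": 0.0028, "Precision@10": 0.0024,
            "Coverage": 1.0000, "Popularity_Bias": 217.29,
        },
        "Popularity": {
            "NDCG@10": 0.0223, "Recall@10": 0.0240, "Precision@10": 0.0161,
            "Coverage": 0.0055, "Popularity_Bias": 2328.40,
        },
    }
)

# How many times larger (or smaller) each metric is for the popularity baseline
baselines["Popularity / Random"] = baselines["Popularity"] / baselines["Random"]
print(baselines.round(4))

ndcg_lift = baselines.loc["NDCG@10", "Popularity / Random"]
```

On these figures the popularity baseline improves NDCG@10 roughly sevenfold while covering well under 1% of the catalog.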

The goal for subsequent models is to beat this baseline's NDCG while covering far more of the catalog