## Evaluate LingoRank with Collaborative Filtering models

This notebook is used to fit collaborative filtering models (for implicit feedback) to the LingoRank data.

The data must first be prepared with the 'RecommanderData' notebook. This notebook expects the resulting CSV files to be present in 'results/recommendation'.

We also compute the metrics for the MovieLens-100k dataset to get a comparison.

### Data preparation

#### LingoRank

The data is prepared following three strategies (1, 2, 3); refer to the 'RecommanderData' notebook. We use strategies 2 and 3 only, because strategy 1 does not yield enough data.

We build a train set and a test set that satisfy the following conditions:

- Each user and each article should appear at least once in the training data
- ~ 15% of the ratings are used for testing

This is formulated as an optimization problem that splits the ratings subject to these constraints.
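
Reading the constraints off the `train_test_split` function defined later in this notebook, the split can be written as a binary program. The symbols are notation introduced here for illustration only: $x_r$ equals 1 when rating $r$ goes to the test set, $R$ is the set of positive ratings, $R_u$ and $R_a$ are the ratings of user $u$ and of article $a$, $\rho$ stands for `test_size_ratio` and $\tau$ for `tolerance`.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{r \in R} x_r \\
\text{s.t.}\quad & (\rho - \tau)\,|R| \;\le\; \sum_{r \in R} x_r \;\le\; (\rho + \tau)\,|R|, \\
& \sum_{r \in R_u} x_r \;\le\; |R_u| - 1 \quad \text{for every user } u, \\
& \sum_{r \in R_a} x_r \;\le\; |R_a| - 1 \quad \text{for every article } a, \\
& x_r \in \{0, 1\} \quad \text{for every } r \in R.
\end{aligned}
```

With the defaults $\rho = 0.25$ and $\tau = 0.1$, the test share is allowed to range from 15% to 35% of the ratings. The two families of $|R_\cdot| - 1$ constraints keep at least one rating of each user and of each article in the training set.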

#### ml-100k

We use one of the predefined train/test splits available in the released data (the first one, `u1.base` / `u1.test`).
We remove from the test set the items that do not appear in the train set.
All users are already in the train set.

We consider a rating of 4 or more as positive and less than 4 as negative.
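
A minimal sketch of the filtering and thresholding steps on toy data, in plain Python. The triples and the helper name `prepare_test` are hypothetical and only illustrate the idea; the notebook itself performs these steps with pandas.

```python
# Toy (user, item, rating) triples; the values are made up for illustration.
train = [(1, 10, 5), (1, 11, 3), (2, 10, 4)]
test = [(1, 12, 4), (2, 11, 2), (2, 10, 5)]

def prepare_test(train, test, threshold=4):
    """Drop test rows whose item is unseen in train, then label each row."""
    train_items = {item for _, item, _ in train}
    kept = [row for row in test if row[1] in train_items]
    # A rating >= threshold counts as positive (1), anything lower as negative (0).
    return [(user, item, int(rating >= threshold)) for user, item, rating in kept]

labelled_test = prepare_test(train, test)
print(labelled_test)
```

Item 12 never occurs in the toy train set, so its test row is dropped; the two remaining rows are labelled 0 and 1.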

### Evaluation strategy

We follow the idea used in https://arxiv.org/abs/2305.02182

For each user in the test set, we sample one positive item and 4 negative items. We use the model to rank these 5 items and compute MRR@3 and NDCG@3.
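
To make the two metrics concrete, here is a minimal sketch for a single candidate list of five items (one positive, four negatives). The relevance list is an illustrative assumption, and relevance is taken to be binary, so the gain is 1 for the positive item and 0 otherwise.

```python
import math

def mrr_at_k(relevance, k=3):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for position, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1 / position
    return 0.0

def ndcg_at_k(relevance, k=3):
    """DCG of the top k divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum((2 ** rel - 1) / math.log2(pos + 1)
                   for pos, rel in enumerate(values[:k], start=1))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# Relevance of the 5 candidates in ranked order: the positive item is ranked 2nd.
ranked = [0, 1, 0, 0, 0]
print(mrr_at_k(ranked), ndcg_at_k(ranked))
```

With the positive item at rank 2, MRR@3 is 1/2 and NDCG@3 is 1/log2(3), about 0.631. A positive item ranked 4th or 5th scores 0 on both metrics.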

### Models

#### Implicit feedback

- ALS
- BPR 
- LMF

#### Explicit feedback (ml-100k only)

- SVD 

#### Random

- Rank the list of items randomly, not using any model
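
As a sanity check for this baseline, the expected scores can be computed in closed form. The sketch assumes the setup described above (5 candidates, exactly one positive, every position equally likely for it) and binary gains; the variable names are illustrative.

```python
import math

n_candidates = 5  # 1 positive + 4 sampled negatives
k = 3

# The positive lands at each position 1..5 with probability 1/5;
# positions beyond k contribute 0 to both metrics.
expected_mrr = sum(1 / pos for pos in range(1, k + 1)) / n_candidates
# With a single positive the ideal DCG is 1, so NDCG@k = 1 / log2(pos + 1).
expected_ndcg = sum(1 / math.log2(pos + 1) for pos in range(1, k + 1)) / n_candidates

print(f"E[MRR@3] = {expected_mrr:.4f}, E[NDCG@3] = {expected_ndcg:.4f}")
```

Under these assumptions a random ranker should score about 0.37 on MRR@3 and about 0.43 on NDCG@3, so a model needs to beat these values to show any signal.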

In [1]:
import random
from random import shuffle

import implicit
import numpy as np
import pandas as pd
import scipy.sparse as sparse
from pulp import PULP_CBC_CMD, LpBinary, LpMaximize, LpProblem, LpVariable, lpSum
from surprise import SVD, Dataset, Reader
from tqdm.notebook import tqdm

my_seed = 2023
random.seed(my_seed)
np.random.seed(my_seed)

In [2]:
def load_ml_100k(strategy: str):
    # Load data into pandas DataFrames
    train_df = pd.read_csv("https://files.grouplens.org/datasets/movielens/ml-100k/u1.base", sep="\t", names=["userId", "itemId", "rating", "timestamp"])
    test_df = pd.read_csv("https://files.grouplens.org/datasets/movielens/ml-100k/u1.test", sep="\t", names=["userId", "itemId", "rating", "timestamp"])

    # Find items in test_df but not in train_df
    missing_items = set(test_df.itemId) - set(train_df.itemId)

    # Count of these items
    # num_missing_items = len(missing_items)

    # print("Number of items in test_df but not in train_df:", num_missing_items)
    # print("Number of items in test_df:", len(set(test_df.itemId)))

    # Filtering out rows in test_df where itemId is in missing_items
    # .copy() lets us add columns later without a SettingWithCopyWarning
    test_df = test_df[~test_df.itemId.isin(missing_items)].copy()

    # Print the updated length of test_df
    # print("Updated number of items in test_df:", len(set(test_df.itemId)))

    if strategy=="explicit":
        reader = Reader(line_format="user item rating", sep="\t", rating_scale=(train_df["rating"].min(), train_df["rating"].max()))

        # Load the filtered data using Surprise's Dataset.load_from_df()
        train_data = Dataset.load_from_df(train_df[["userId", "itemId", "rating"]], reader=reader)
        train_set = train_data.build_full_trainset()

        # predictions_set = train_set.build_anti_testset() #all pairs (u, i) that are NOT in the training set.

        test_set = list(zip(test_df["userId"], test_df["itemId"], test_df["rating"]))

        return train_df, test_df, train_set, test_set
    
    elif strategy=="implicit":
        train_df.loc[train_df['rating'] < 4, 'rating'] = 0

        # Create mapping for userIds and itemIds based on train_df
        unique_userIds = train_df['userId'].unique()
        unique_itemIds = train_df['itemId'].unique()

        userId_mapping = {userId: i for i, userId in enumerate(unique_userIds)}
        itemId_mapping = {itemId: i for i, itemId in enumerate(unique_itemIds)}

        # Map userIds and itemIds in train_df to new consecutive IDs
        train_df['mapped_userId'] = train_df['userId'].map(userId_mapping)
        train_df['mapped_itemId'] = train_df['itemId'].map(itemId_mapping)

        # Function to map new IDs in test_df and update the mapping accordingly
        def map_and_update_id(id_value, current_mapping):
            if id_value not in current_mapping:
                current_mapping[id_value] = max(current_mapping.values()) + 1
            return current_mapping[id_value]

        test_df['mapped_userId'] = test_df['userId'].apply(lambda x: map_and_update_id(x, userId_mapping))
        test_df['mapped_itemId'] = test_df['itemId'].apply(lambda x: map_and_update_id(x, itemId_mapping))

        return train_df, test_df
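
The remapping to consecutive indices is what lets user and item IDs serve directly as row and column positions of the sparse matrix built later. A minimal sketch of the idea in plain Python, with made-up IDs and a hypothetical helper name:

```python
def build_index(raw_ids):
    """Map raw IDs to 0..n-1 in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops duplicates.
    return {raw_id: idx for idx, raw_id in enumerate(dict.fromkeys(raw_ids))}

raw_item_ids = [242, 302, 242, 51, 302]
item_index = build_index(raw_item_ids)
mapped = [item_index[i] for i in raw_item_ids]
print(item_index, mapped)
```

Three distinct raw IDs yield the indices 0, 1 and 2, so a matrix with three columns is enough regardless of how large the raw IDs are.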
    

In [3]:
def train_test_split(data, test_size_ratio=0.25, tolerance=0.1):
    # Initialize the optimization problem
    prob = LpProblem("TrainTestSplit", LpMaximize)

    # Create binary variables for each rating
    ratings_vars = LpVariable.dicts("Rating", data.index.tolist(), cat=LpBinary)

    # Add the objective function
    prob += lpSum([ratings_vars[i] for i in data.index])

    # Add the constraints
    target_test_size = test_size_ratio * len(data)
    prob += lpSum([ratings_vars[i] for i in data.index]) >= target_test_size - tolerance * len(data)
    prob += lpSum([ratings_vars[i] for i in data.index]) <= target_test_size + tolerance * len(data)

    for user_id in data['user_id'].unique():
        user_ratings = data[data['user_id'] == user_id].index.tolist()
        prob += lpSum([ratings_vars[i] for i in user_ratings]) <= len(user_ratings) - 1

    for article_id in data['article_id'].unique():
        article_ratings = data[data['article_id'] == article_id].index.tolist()
        prob += lpSum([ratings_vars[i] for i in article_ratings]) <= len(article_ratings) - 1

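    # Solve the optimization problem with the CBC solver (msg=0 silences its log)
    status = prob.solve(PULP_CBC_CMD(msg=0))
    if status != 1:  # 1 is PuLP's status code for an optimal solution
        raise RuntimeError(f"Train/test split failed: solver returned status {status}")

    # Extract the train and test sets; round() guards against values such as 0.9999999
    train_data = data.loc[[i for i in data.index if round(ratings_vars[i].value()) == 0]]
    test_data = data.loc[[i for i in data.index if round(ratings_vars[i].value()) == 1]]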

    return train_data, test_data
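
A quick way to confirm that a split respects the constraints is to check them directly. The helper below is a hypothetical sketch in plain Python that works on `(user_id, article_id)` pairs; with the DataFrames of this notebook one could pass, for example, `zip(train_data['user_id'], train_data['article_id'])`.

```python
def check_split(train_pairs, test_pairs):
    """Report whether every test user/article is also seen in training."""
    train_pairs, test_pairs = list(train_pairs), list(test_pairs)
    train_users = {u for u, _ in train_pairs}
    train_articles = {a for _, a in train_pairs}
    total = len(train_pairs) + len(test_pairs)
    return {
        "users_covered": all(u in train_users for u, _ in test_pairs),
        "articles_covered": all(a in train_articles for _, a in test_pairs),
        "test_share": len(test_pairs) / total if total else 0.0,
    }

# Toy split: the values are made up for illustration.
train_pairs = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u2", "a3"), ("u3", "a3")]
test_pairs = [("u2", "a2")]
report = check_split(train_pairs, test_pairs)
print(report)
```

Both coverage flags should be `True` for a valid split, and `test_share` gives the fraction of ratings held out.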


In [4]:
def load_LingoRank(strategy: int):
    data_full = pd.read_csv(f"../results/recommendation/strategy{strategy}.csv")
    
    ## Remove the articles for which there is no positive rating 
    # Before removing articles, count the unique articles
    original_unique_articles = data_full['article_id'].nunique()

    # Identify articles that have maximum rating <= 0
    articles_to_remove = data_full.groupby('article_id')['rating'].max()
    articles_to_remove = articles_to_remove[articles_to_remove <= 0].index.tolist()

    # Remove these articles from data_full (.copy() avoids a SettingWithCopyWarning when columns are added below)
    data_full = data_full[~data_full['article_id'].isin(articles_to_remove)].copy()

    # After removing articles, count the unique articles
    # remaining_unique_articles = data_full['article_id'].nunique()

    # Calculate and print the number of removed articles
    # num_removed_articles = original_unique_articles - remaining_unique_articles
    # print("Number of removed articles:", num_removed_articles)

    data = data_full[(data_full['rating'] != 0)].copy()
    
    unique_user_ids = data['user_id'].unique()
    unique_article_ids = data['article_id'].unique()

    user_id_mapping = {user_id: i for i, user_id in enumerate(unique_user_ids)}
    article_id_mapping = {article_id: i for i, article_id in enumerate(unique_article_ids)}

    # Map user_ids and article_ids to new consecutive IDs using loc
    data.loc[:, 'mapped_user_id'] = data['user_id'].map(user_id_mapping)
    data.loc[:, 'mapped_article_id'] = data['article_id'].map(article_id_mapping)

    # data_full = data_full.copy() # Avoid SettingWithCopyWarning

    data_full.loc[:, 'mapped_user_id'] = data_full['user_id'].map(user_id_mapping)
    data_full.loc[:, 'mapped_article_id'] = data_full['article_id'].map(article_id_mapping)

    train_data, test_data = train_test_split(data)
    print(f"Strategy {strategy} - Proportion of positive ratings affected to test set: {round(len(test_data)/(len(test_data)+len(train_data))*100,2)} %")

    return data_full, data, train_data, test_data

In [5]:
datasets = {}

# 1. LingoRank
datasets['LingoRank'] = {}
datasets['LingoRank']['strategies'] = {}
for i in range(2, 4):
    keys = ('data_full', 'data', 'train_data', 'test_data')
    datasets['LingoRank']['strategies'][i] = dict(zip(keys, load_LingoRank(i)))

# 2. MovieLens-100k
datasets['ml-100k'] = {}
datasets['ml-100k']['strategies'] = {}

explicit_keys = ('train_df', 'test_df', 'train_set', 'test_set')
datasets['ml-100k']['strategies']['explicit'] = dict(zip(explicit_keys, load_ml_100k('explicit')))

implicit_keys = ('train_df', 'test_df')
datasets['ml-100k']['strategies']['implicit'] = dict(zip(implicit_keys, load_ml_100k('implicit')))

Strategy 2 - Proportion of positive ratings affected to test set: 15.66 %
Strategy 3 - Proportion of positive ratings affected to test set: 15.97 %


In [6]:
# print(len(set(train_data.user_id))>=len(set(test_data.user_id)))
# print(set(test_data['article_id']).issubset(set(train_data['article_id'])))
# print(len(set(train_data.user_id)))
# print(len(set(test_data.user_id)))
# print(len(set(train_data['article_id'])))
# print(len(set(test_data['article_id'])))

In [7]:
# Create user-item sparse matrix
# 1. Lingorank
for key, strategy in datasets['LingoRank']['strategies'].items():
    strategy['user_item_train_data'] = sparse.csr_matrix(
        (strategy['train_data']['rating'].astype(float), (strategy['train_data']['mapped_user_id'], strategy['train_data']['mapped_article_id']))
    )

# 2. ml-100k (implicit)
strategy = datasets['ml-100k']['strategies']['implicit']
strategy['user_item_train_data'] = sparse.csr_matrix(
    (strategy['train_df']['rating'].astype(float), (strategy['train_df']['mapped_userId'], strategy['train_df']['mapped_itemId']))
)

In [8]:
for key_dataset, dataset in tqdm(datasets.items(), desc="Datasets"):
    for key, strategy in tqdm(dataset['strategies'].items(), desc="Strategies"):

        if key=="explicit":
            continue #Need other specific model
        # Create a dictionary of models
        strategy['models'] = {
            "ALS": implicit.als.AlternatingLeastSquares(factors=50, random_state=my_seed),
            "BPR": implicit.bpr.BayesianPersonalizedRanking(factors=50, random_state=my_seed),
            "LMF": implicit.lmf.LogisticMatrixFactorization(factors=50, random_state=my_seed)
        }

        # Fit each model with a tqdm progress bar
        for name in tqdm(strategy['models'].keys(), desc="Fitting models"):
            if name=="ALS":
                alpha = 40
                training_data = (strategy['user_item_train_data'] * alpha).astype('double')
            else:
                # Convert data to binary preference matrix
                training_data = (strategy['user_item_train_data'] >= 1).astype('double')
            strategy['models'][name].fit(training_data)

        strategy['models']["random"] = "random"  # Add a placeholder model that will be used to compute the random-baseline metrics

strategy = datasets['ml-100k']['strategies']['explicit']
strategy['models'] = {}
strategy['models']['SVD'] = SVD(random_state=my_seed)
strategy['models']['SVD'].fit(strategy['train_set'])

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x122d9ad00>

In [9]:
# Display the structure of our dict
def display_keys(d, indent=0):
    for key, value in d.items():
        print('  ' * indent + str(key))
        if isinstance(value, dict):
            display_keys(value, indent+1)

display_keys(datasets)

LingoRank
  strategies
    2
      data_full
      data
      train_data
      test_data
      user_item_train_data
      models
        ALS
        BPR
        LMF
        random
    3
      data_full
      data
      train_data
      test_data
      user_item_train_data
      models
        ALS
        BPR
        LMF
        random
ml-100k
  strategies
    explicit
      train_df
      test_df
      train_set
      test_set
      models
        SVD
    implicit
      train_df
      test_df
      user_item_train_data
      models
        ALS
        BPR
        LMF
        random


In [10]:
def ndcg_at_k(ranked_list, k):
    """
    Compute NDCG at rank k
    """
    dcg = sum((2 ** rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(ranked_list[:k]))
    idcg = sum((2 ** rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(sorted(ranked_list, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0

def mrr_at_k(ranked_list, k):
    """
    Compute MRR at rank k
    """
    for idx, rel in enumerate(ranked_list[:k]):
        if rel > 0:
            return 1 / (idx + 1)
    return 0

In [11]:
def evaluate(test_df, data_full, model, dataset, k=3):
    
    ndcgs = []
    mrrs = []
    
    # Group by user and filter positive and negative items
    if dataset=="LingoRank":
        groupByKey = "mapped_user_id"
    elif dataset=="ml-100k" and isinstance(model,SVD):
        groupByKey = "userId"
    else:
        groupByKey = "mapped_userId"
    grouped = test_df.groupby(groupByKey)
    for user, group in tqdm(grouped, desc="Evaluating users"):
        
        if dataset=="ml-100k" and isinstance(model,SVD):
            positive_items = group[group["rating"] >= 4]["itemId"].values
            negative_items = group[group["rating"] < 4]["itemId"].values
        elif dataset=="ml-100k":
            positive_items = group[group["rating"] >= 4]["mapped_itemId"].values
            negative_items = group[group["rating"] < 4]["mapped_itemId"].values
        else:
            positive_items = group["mapped_article_id"].values
            # Negatives are the user's zero-rated articles in data_full. This returns all the articles,
            # but proceeding this way remains useful for a strategy in which zeros are clearly defined.
            negative_items = data_full[(data_full['mapped_user_id'] == user) & (data_full['rating'] == 0)].mapped_article_id.to_list()
        
        if len(positive_items) == 0 or len(negative_items) == 0:
            # print(f"Skip user with ID {user}")
            continue  # Skip if no positive or negative items
        
        # Select one positive item and four negative items
        positive_item = np.random.choice(positive_items)
        sampled_negative_items = np.random.choice(negative_items, size=min(4, len(negative_items)), replace=False)

        if len(sampled_negative_items)<4:
            # print(f"Sample negative items is {len(sampled_negative_items)} for user {user}")
            continue # Skip if we cannot constitute a list of 5
            
        
        # Candidate list of 5 items
        candidate_items = list(sampled_negative_items) + [positive_item]
        
        if model=="random":
            predictions = list(range(5))
            shuffle(predictions)
        elif isinstance(model,SVD):
            predictions = [model.predict(user, item).est for item in candidate_items]
        else:
            item_factors = model.item_factors
            user_factors = model.user_factors
            predictions = [np.dot(user_factors[user], item_factors[item]) for item in candidate_items]
            
        # Rank the items based on predicted ratings
        ranked_items = [1 if item == positive_item else 0 for _, item in sorted(zip(predictions, candidate_items), reverse=True)]
        
        # Calculate NDCG@3 and MRR@3
        ndcg = ndcg_at_k(ranked_items, k)
        mrr = mrr_at_k(ranked_items, k)
        
        ndcgs.append(ndcg)
        mrrs.append(mrr)

    avg_ndcg = np.mean(ndcgs)
    avg_mrr = np.mean(mrrs)
    return avg_ndcg, avg_mrr, ndcgs, mrrs
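
For the matrix-factorization models, `evaluate` scores a candidate as the dot product of the user's and the item's latent vectors, then sorts the candidates by score. The sketch below reproduces that step in plain Python; the helper name, the factor values and the item IDs are made up for illustration.

```python
def rank_candidates(user_vector, item_vectors, candidates, positive_item):
    """Return binary relevance of the candidates, ordered by descending score."""
    scores = {
        item: sum(u * v for u, v in zip(user_vector, item_vectors[item]))
        for item in candidates
    }
    ordered = sorted(candidates, key=lambda item: scores[item], reverse=True)
    return [1 if item == positive_item else 0 for item in ordered]

# Hypothetical 2-dimensional factors for one user and five candidate items.
user_vector = [1.0, 0.5]
item_vectors = {
    7: [0.9, 0.1],   # score 0.95
    3: [0.2, 0.2],   # score 0.30
    4: [1.0, 1.0],   # score 1.50
    8: [-0.5, 0.0],  # score -0.50
    9: [0.0, 0.4],   # score 0.20
}
ranked = rank_candidates(user_vector, item_vectors, [3, 4, 8, 9, 7], positive_item=7)
print(ranked)
```

Here the positive item 7 has the second-highest score, so the relevance list passed to the metric functions has its single 1 in second position.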

In [12]:
for key_dataset, dataset in tqdm(datasets.items(), desc="Datasets"):
    for key, strategy in tqdm(dataset['strategies'].items(), desc="Strategies"):

        # Dictionary to store evaluation metrics
        strategy['evaluation_results'] = {}
        # Evaluate each model and store the results in the evaluation_results dictionary
        for name, model in tqdm(strategy['models'].items(), desc=f"Models for strategy {key}"):
            if key_dataset=="LingoRank":
                test_data = strategy['test_data']
                data_full = strategy['data_full']
            elif key_dataset=="ml-100k":
                test_data = strategy['test_df']
                data_full = None
            avg_ndcg, avg_mrr, ndcgs, mrrs = evaluate(test_data, data_full, model, key_dataset,k=3)
            strategy['evaluation_results'][name] = {
                "NDCG@3": avg_ndcg,
                "MRR@3": avg_mrr,
                "All_NDCGs": ndcgs,
                "All_MRRs": mrrs
            }

        # print("="*50)
        # print(f"Strategy {key}")
        # print("="*50)
        # # Print the evaluation results in a structured manner
        # for name, metrics in strategy['evaluation_results'].items():
            
        #     print(f"Results for {name}:")
        #     print(f"NDCG@3: {metrics['NDCG@3']:.4f}")
        #     print(f"MRR@3: {metrics['MRR@3']:.4f}")
        #     print("-"*50)

In [13]:
# Extracting evaluation results
data = []

for dataset_key, dataset in datasets.items():
    for strategy_num, strategy_data in dataset['strategies'].items():
        for model_name, results in strategy_data['evaluation_results'].items():
            row = {
                'dataset': dataset_key,
                'strategy': strategy_num,
                'model': model_name,
                'NDCG@3': results['NDCG@3'],
                'MRR@3': results['MRR@3']
            }
            data.append(row)

# Convert to DataFrame
evaluation_results = pd.DataFrame(data)

def highlight_best(row):
    """Highlight a row in light blue if it has both the best NDCG@3 and the best MRR@3 of its dataset/strategy group."""
    subset = evaluation_results[(evaluation_results['dataset'] == row['dataset']) & (evaluation_results['strategy'] == row['strategy'])]
    max_ndcg = subset['NDCG@3'].max()
    max_mrr = subset['MRR@3'].max()
    if row['NDCG@3'] == max_ndcg and row['MRR@3'] == max_mrr:
        return ['background-color: lightblue'] * len(row)
    return [''] * len(row)

styled_evaluation_results = evaluation_results.style.apply(highlight_best, axis=1)

display(styled_evaluation_results)

|   | dataset | strategy | model | NDCG@3 | MRR@3 |
|---|---|---|---|---|---|
| 0 | LingoRank | 2 | ALS | 0.424492 | 0.376901 |
| 1 | LingoRank | 2 | BPR | 0.561544 | 0.492398 |
| 2 | LingoRank | 2 | LMF | 0.4643 | 0.416959 |
| 3 | LingoRank | 2 | random | 0.415347 | 0.357018 |
| 4 | LingoRank | 3 | ALS | 0.420522 | 0.368549 |
| 5 | LingoRank | 3 | BPR | 0.548513 | 0.481633 |
| 6 | LingoRank | 3 | LMF | 0.488481 | 0.435641 |
| 7 | LingoRank | 3 | random | 0.407176 | 0.352004 |
| 8 | ml-100k | explicit | SVD | 0.656278 | 0.599074 |
| 9 | ml-100k | implicit | ALS | 0.583889 | 0.521759 |
