*UE Learning from User-generated Data, CP MMS, JKU Linz 2022*
# Exercise 4: Evaluation

In this exercise we'll have a closer look at two very different RecSys evaluation metrics and use them to compare the three algorithms we implemented so far to each other. Please consult the lecture slides and the presentation from UE Session 4 for a recap.

The assignment submission deadline is 15.05.2022 23:59.

Make sure to rename the notebook according to the convention:

LUD22_ex03_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD22_ex03_k000007_Bond_James.ipynb

## Implementation
In this exercise, as before, you are reqired to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency.

Please **only use libraries already imported in the notebook**.

In [1]:
import pandas as pd
import numpy as np
import random as rnd

import torch
from torch import nn, optim
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, pairwise_distances
import scipy.linalg as linalg

from tqdm import tqdm
import random as rnd
import dill as pkl

## <font color='red'>TASK 1/2</font>: Evaluation Metrics

Implement DCG, nDCG and Average Artist entropy in the corresponding templates.

### DCG Score
Implement DCG following the input/output convention:
#### Input:
* prediction - (not an interaction matrix!) numpy array with recommendations. Row index corresponds to User_id, column index corresponds to the rank of the item mentioned in the sell. Every cell (i,j) contains **item id** recommended to the user (i) on the position (j) in the list. For example:

The following predictions structure [[12, 7, 99], [0, 97, 6]] means that the user with id==1 (second row) got recommended item **0** on the top of the list, item **97** on the second place and item **6** on the third place.

* test_interaction_matrix - (plain interaction matrix format as before!) interaction matrix constructed from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered

#### Output:
* DCG score

Don't forget, DCG is calculated for every user separately and then the average is returned.


<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted! Users without interactions in the test set shouldn't contribute to the score.

In [2]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK = 10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    
    returns - float - mean dcg score over all user.
    """    
    score = None
    
    # TODO: YOUR IMPLEMENTATION.

    dcg_scores = []

    for user_id, pred in enumerate(predictions):

        if sum(test_interaction_matrix[user_id]) == 0 or np.all(pred == -1):  # ignore users w/o interaction in test matrix or w/o predictions 
                continue

        current_dcg = 0

        for k, item_id in enumerate(pred[:topK]):
            current_dcg += test_interaction_matrix[user_id, item_id] / np.log2(k + 2)

        dcg_scores.append(current_dcg)

    score = np.mean(dcg_scores)
            
    return score

In [3]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

* Can DCG score be higher than 1?
Yes. Assume we have 3 topK items and there was an interaction with each one of them (highest possible gain $\rightarrow$ highest DCG). Then $\text{DCG}=1+\frac{1}{log_2(3)}+\frac{1}{log_2(4)}=2.1309... > 1$
* Can the average DCG score be higher than 1?
Yes. If we again assume to have 3 topK items and there was an interaction of every user with each recommended item, the individual DCGs would be $2.1309...$ (see above) which in this case would also be the average.
* Why?

### nDCG Score

Following the same parameter convention as for DCG implement nDCG metric.

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separetely for each user and depends on the number of tracks held out for them as a Test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

In [4]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK = 10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    topK - int - topK recommendations should be evaluated.
    
    returns - average ndcg score over all users.
    """    
    score = None
    
    # TODO: YOUR IMPLEMENTATION.
        
    scores = []

    for user_id, pred in enumerate(predictions):
        
        if sum(test_interaction_matrix[user_id]) == 0 or np.all(pred == -1):  # ignore users w/o interaction in test matrix or w/o predictions 
            continue

        # DCG
        current_dcg = 0

        for k, item_id in enumerate(pred[:topK]):
            current_dcg += test_interaction_matrix[user_id, item_id] / np.log2(k + 2)

        # iDCG
        total_interactions = sum(test_interaction_matrix[user_id])
        icg = total_interactions  # ideal cumulative gain
        if total_interactions >= topK:
            icg = topK

        current_idcg = 0
        for k in range(int(icg)):
            current_idcg += 1 / np.log2(k + 2)

        # nDCG
        scores.append(current_dcg / current_idcg)
    
    
    score = np.mean(scores)

    return score

In [5]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0], [1, 2, 3, 0], [-1, -1, -1, -1]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)
assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

* Can nDCG score be higher than 1?
No. In the optimal case $\text{DCG} = \text{iDCG}$, which means that that $\text{nDCG}=1$, since this score describes the relative share of the $\text{DCG}$ with respect to the ideal $\text{iDCG}$.

### Average Artist Entropy

Calculate the metric of Diversity as the Average Artist entropy (see UE slides).
#### Parameters:
* predictions - as above for DCG and nDCG;
* item_df - dataframe containing 'artist' and 'track' columns, index - track id (use corresponding data file)
* topK - depth of the list to be evaluated, as before

#### Result:
Average Artist Entropy over users

Recap, main points:
* First calculate diversity for each user, then return the mean over users
* For every user build distribution of recommended tracks over artists (within topK). This distribution cannot have more than topK bins! Dont forget to turn it into probability distribution dividing it by topK
* Use the formula from the UE slides for the per-user entropy

<font color='red'>**Attention!**</font> Use logarithm with base 2!

In [6]:
def get_average_entropy_score(predictions: np.ndarray, item_df: pd.DataFrame, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    item_df - pd.DataFrame - information about each song with columns 'artist' and 'track'.
    
    returns - float - average entropy score of the predictions.
    """
    
    score = None
    
    # TODO: YOUR IMPLEMENTATION.

    diversities = []

    for pred in predictions:
        if np.all(pred == -1):  # no recommendations
            continue

        current_top = topK
        if -1 in pred:
            current_top = list(pred).index(-1) + 1

        artists = item_df['artist'][pred[:current_top]]
        artist_counts = np.unique(artists, return_counts=True)[1]
        diversities.append((-np.sum((artist_counts / current_top) * np.log2(artist_counts / current_top)) / np.log2(current_top)))  # entropy / max_entropy

    score = np.mean(diversities)
                
    return score

In [7]:
item_df = pd.DataFrame({'artist': ['A1', 'A1', 'A1', 'A1', 'A2', 'A3', 'A4']})
predictions = np.array([[0, 1, 2, 3], [6, 5, 4, 3], [-1, -1, -1, -1]])

avg_entr_score = get_average_entropy_score(predictions, item_df, topK=4)
assert np.isclose(avg_entr_score, 0.5), "average entropy score is not correct."

## <font color='red'>TASK 2/2</font>: Evaluation
Use provided rec.py (see imports below) to build a simple evaluation framework. It should be able to evaluate POP, ItemKNN (CF) and SVD.

In the end for each algorithm you should be able to obtain results formatted as follows:

```
{'m': {'ndcg': <>, 'average_entropy': <>},
 'f': {'ndcg': <>, 'average_entropy': <>},
 'all': {'ndcg': <>, 'average_entropy': <>}}
```
 
Every metric calculated for three groupls of test users: only for female, only for male and for all users together.
Every value should be an average, calculated over two different data splits.

In [8]:
from rec import svd_decompose, svd_recommend_to_list
from rec import inter_matr_binary, split_interactions
from rec import recTopK
from rec import recTopKPop

In [9]:
path = 'ex05/MRS_Challenge_2022_data/'

usr_path = path + 'MRSC_2022_demo.txt'
itm_path = path + 'MRSC_2022_tracks.txt'
inter_path = path + 'MRSC_2022_inter.txt'

challenge = path + 'MRSC_2022_target_users.txt'

In [10]:
target_users = np.array(pd.read_csv(challenge, sep='\t')['1'])

### (A) Universal Launcher
Receives a config as described, train and test interaction matrices.

Returns a matrix of size (total number of users) x (topK). cells - recommended item ids, sorted according to the score. Fill the cells corresponding to users with no interactions in the test set with (-1).

### Single set evaluation (already implemented)

In [11]:
def evaluate_predictions(predictions: np.ndarray, test_interaction_matrix: np.ndarray, 
                         item_df: pd.DataFrame, topK=10) -> dict:
    """
    This function returns a dictinary with all scores of predictions.
    
    predictions - np.ndarray - predictions of the algorithm over all users.
    test_interaction_matrix - np.ndarray - test interaction matrix over all users and items.
    item_df - pd.DataFrame - information about each item with columns: 'artist', 'track'
    topK - int - topK prediction should be evaluated
    
    returns - dict - calculated metric scores, contains keys "ndcg" and "average_entropy".
    """
    
    metrics = {}
    
    ndcg = get_ndcg_score(predictions, test_interaction_matrix, topK)
    metrics['ndcg'] = ndcg
    
    average_entropy = get_average_entropy_score(predictions, item_df, topK)
    metrics['average_entropy'] = average_entropy
    
    return metrics

### (B) User-group evaluation

In [12]:
def evaluate_gender(predictions: np.ndarray, test_interaction_matrix: np.ndarray, user_df: pd.DataFrame, 
                    item_df: pd.DataFrame, num_users=500, topK=10) -> dict:
    """
    This function will evaluate certain predictions for each gender individually and return a dictionary
    following the structure:
    
    {'gender_key': {'metric_key': metric_score}}
    
    predictions - np.ndarray - predictions of the algorithm over all users.
    test_interaction_matrix - np.ndarray - test interaction matrix over all users and items.
    user_df - pd.DataFrame - information about each user with columns: location', 'age', 'gender', 'date'
    item_df - pd.DataFrame - information about each item with columns: 'artist', 'track'
    topK - int - topK prediction should be evaluated
    
    returns - dict - calculated metric scores for each gender.
    """
    
    metrics = {}
    
    # TODO: YOUR IMPLEMENTATION.
    
    for gender in ['m', 'f']:
        gender_ids = np.where(user_df['gender'] == gender)
        metrics[gender] = evaluate_predictions(predictions[gender_ids], test_interaction_matrix[gender_ids], item_df, topK)
        
    metrics['all'] = evaluate_predictions(predictions, test_interaction_matrix, item_df, topK)
    
    return metrics

### (C) Main evaluation function
Interprets the config and returns evaluation report for a single algorithm:
```
{'m': {'ndcg': <>, 'average_entropy': <>},
 'f': {'ndcg': <>, 'average_entropy': <>},
 'all': {'ndcg': <>, 'average_entropy': <>}}
```

Please pay attention to how splits are created and saved into the corresponding variables (Split Data section below)

In [13]:
def evaluate_algorithm(config) -> dict:
    """
    This function will evaluate a certain algorithm defined with the parameters in config by:
    - going over all test and train files
    - generating the recommendations for each data split
    - calling evaluate gender to get the metrics for each recommendation for each data split
    
    Then the average score for each gender and metric should be calculated over all data splits and
    a dictionary should be returned following the structure:
    {'gender_key': {'metric_key': avg_metric_score}}
    
    config - dict - configuration of this evaluation following the structure:
    
    config = {
        "algorithm": str - one of ['SVD', 'CF', 'TopPop']
        "inter_train_file_paths": str - array of inter train file paths (1 per split),
        "inter_test_file_paths": str - array of inter test file paths (1 per split),
        "user_file_path": str - usr_path,
        "item_file_path": str - itm_path,
        "top_k": int - number of recommendations to evaluate
        "n": int - used for CF.
        "f": int - length of hidden representations for SVD
    }
    
    returns - dict - average score of each metric for each gender over all data splits.
    """
    
    metrics = {}
    
    # TODO: YOUR IMPLEMENTATION.

    split_metrics = []
    df_users = pd.read_csv(config['user_file_path'], sep='\t', header=None, names=['location','age','gender', 'date'])
    df_items = pd.read_csv(config['item_file_path'], sep='\t', header=None, names=['artist','track'])

    for train_split_path, test_split_path in zip(config['inter_train_file_paths'], config['inter_test_file_paths']):
        train_split = inter_matr_binary(config['user_file_path'], config['item_file_path'], train_split_path)
        test_split = inter_matr_binary(config['user_file_path'], config['item_file_path'], test_split_path)

        current_recs = get_recommendations_for_algorithm(config, train_split, test_split)
        split_metrics.append(evaluate_gender(current_recs, test_split, df_users, df_items, config['top_k']))

    for key in ['m', 'f', 'all']:
        current_dict = {}
        for score in ['ndcg', 'average_entropy']:
            current_dict[score] = np.mean([curr_split[key][score] for curr_split in split_metrics])
        metrics[key] = current_dict

    return metrics

### Splitting Data (already implemented)

In [None]:
train_inter_files = []
test_inter_files = []

num_splits = 2
p_i = 0.3
p_u = 0.5

user_file_path = None
inter_file_path = None

user_file_path = usr_path
inter_file_path = inter_path

for i in range(num_splits):
    
    split_interactions(inter_file=inter_file_path,
                       user_file_path = user_file_path,
                       p_u = p_u,
                       p_i = p_i,
                       res_test_file="inter_TEST_" + str(i) + ".txt",
                       res_train_file="inter_TRAIN_" + str(i) + ".txt")
    
    train_inter_files.append("inter_TRAIN_" + str(i) + ".txt")
    test_inter_files.append("inter_TEST_" + str(i) + ".txt")

In [14]:
num_splits = 2
p_i = 0.3
p_u = 0.5
train_inter_files = ['ex05/train_test_data/inter_TRAIN_0.txt', 'ex05/train_test_data/inter_TRAIN_1.txt']
test_inter_files = ['ex05/train_test_data/inter_TEST_0.txt', 'ex05/train_test_data/inter_TEST_0.txt']

In [15]:
assert len(train_inter_files) == num_splits, "Number of Train files do not match the requirement"
assert len(test_inter_files) == num_splits, "Number of Test files do not match the requirement"

### Evaluating Every Algorithm
Make sure everything works. Try running evaluation with the three configs below.
We expect KNN to outperform other algorithms on our small data sample.

In [16]:
def get_recommendations_for_algorithm(config, inter_matrix_train, inter_matrix_test) -> np.ndarray:
    
    rec = None
    
    df_users = pd.read_csv(config['user_file_path'], sep='\t', header=None, names=['location','age','gender', 'date'])
    df_items = pd.read_csv(config['item_file_path'], sep='\t', header=None, names=['artist','track'])
    
    rec = np.full((len(df_users), config['top_k']), -1)
    
    # TODO: YOUR IMPLEMENTATION.
    
    user_ids = np.where(np.array([np.all(user == 0) for user in inter_matrix_test]) == False)[0]  # get user_ids of users with at least 1 test-interaction

    if config['algorithm'] == 'SVD':
        U, V = svd_decompose(inter_matrix_train, f=config['f'])

        # get ids of seen items
        seen_item_ids = []
        for interactions in inter_matrix_train[user_ids]:
            seen_item_ids.append(list(np.where(interactions == 1)[0]))
            
        rec = svd_recommend_to_list(user_ids, seen_item_ids, U, V, config['top_k'])

    elif config['algorithm'] == 'CF':
        for user in user_ids:
            current_recs = recTopK(inter_matrix_train, user, config['top_k'], config['n'])
            rec[user] = current_recs[0]
    
    elif config['algorithm'] == 'TopPop':
        for user in user_ids:
            current_recs = recTopKPop(inter_matrix_train, user, config['top_k'])
            rec[user] = current_recs
            
    elif config['algorithm'] == 'Neighbor':
        print('Calculating distances...')
        #dist = cosine_distances(inter_matrix_train.T, inter_matrix_train.T)
        dist = pairwise_distances(inter_matrix_train.T, inter_matrix_train.T, metric="jaccard", n_jobs=-1)
        print('Done')
    
        for user in tqdm(user_ids):
            interactions = inter_matrix_train[user]
            unseen_item_ids = list(np.where(interactions == 0)[0])
            seen_item_ids = list(np.where(interactions == 1)[0])
            n_seen = len(seen_item_ids)
            usr_dist = dist[:,unseen_item_ids][seen_item_ids]
            #itm_dist = np.sort(usr_dist.T)[:,:config['n']].sum(axis=1) / config['n']
            itm_dist = usr_dist.sum(axis=0)
            ranking = np.argsort(itm_dist)
            unseen_item_ids = np.array(unseen_item_ids)
            rec[user] = unseen_item_ids[ranking[:config['top_k']]]
    
    return rec

In [33]:
from sklearn.metrics.pairwise import pairwise_distances

path = 'ex05/MRS_Challenge_2022_data/'

usr_path = path + 'MRSC_2022_demo.txt'
itm_path = path + 'MRSC_2022_tracks.txt'
inter_path = path + 'MRSC_2022_inter.txt'
df_users = pd.read_csv(usr_path, sep='\t', header=None, names=['location','age','gender', 'date'])
df_items = pd.read_csv(itm_path, sep='\t', header=None, names=['artist','track'])

challenge = path + 'MRSC_2022_target_users.txt'
target_users = np.array(pd.read_csv(challenge, sep='\t')['1'])

#inter_matrix = inter_matr_binary(usr_path, itm_path, inter_path)

# rec = np.full((len(target_users), 15), -1)
rec = []

print('Calculating distances...')
#dist = cosine_distances(inter_matrix_train.T, inter_matrix_train.T)
#dist = pairwise_distances(inter_matrix.T, inter_matrix.T, metric="jaccard", n_jobs=-1)
print('Done')

for n, user in enumerate(tqdm(target_users)):
    interactions = inter_matrix[user]
    unseen_item_ids = list(np.where(interactions == 0)[0])
    seen_item_ids = list(np.where(interactions == 1)[0])

    usr_dist = dist[:,unseen_item_ids][seen_item_ids]
    #itm_dist = np.sort(usr_dist.T)[:,:config['n']].sum(axis=1) / config['n']
    itm_dist = usr_dist.sum(axis=0)
    ranking = np.argsort(itm_dist)
    unseen_item_ids = np.array(unseen_item_ids)
    usr_recs = unseen_item_ids[ranking[:15]]
    rec.append([user, ','.join(usr_recs.astype(str))])

Calculating distances...
Done


100%|██████████| 1298/1298 [02:52<00:00,  7.52it/s]


In [39]:
import csv
with open('rec_k01553060_Dungl_Lion.tsv', 'w', newline='\n') as f:
    writer = csv.writer(f, delimiter='\t')
    for r in rec:
        writer.writerow([str(r[0]), r[1]])

In [45]:
for u, r in rec:
    u_r = np.array(r.split(','), dtype=int)
    if 1 in inter_matrix[u][u_r]:
        print(u)

In [46]:
# Evaluate Item KNN (CF)
config = {
    "algorithm": "Neighbor", # ['SVD', 'CF', 'TopPop']
    "inter_train_file_paths": train_inter_files,
    "inter_test_file_paths": test_inter_files,
    "user_file_path": usr_path,
    "item_file_path": itm_path,
    "top_k": 15, # number of recommendations.
    "n": 5, # used for CF.
    "f": 32, # length of hidden representations
}

scores = evaluate_algorithm(config)
scores

Calculating distances...




Done


100%|██████████| 6499/6499 [14:53<00:00,  7.27it/s]


Calculating distances...




Done


100%|██████████| 6499/6499 [14:42<00:00,  7.37it/s]


{'m': {'ndcg': 0.08865760783175179, 'average_entropy': 0.9995653822251053},
 'f': {'ndcg': 0.08957222455779476, 'average_entropy': 0.999707501218788},
 'all': {'ndcg': 0.0888314117487641, 'average_entropy': 0.9995923889887918}}

In [None]:
# Evaluate TopPop
config = {
    "algorithm": "TopPop", # ['SVD', 'CF', 'TopPop']
    "inter_train_file_paths": train_inter_files,
    "inter_test_file_paths": test_inter_files,
    "user_file_path": usr_path,
    "item_file_path": itm_path,
    "top_k": 10, # number of recommendations.
    "n": 5, # used for CF.
    "f": 32, # length of hidden representations
}

#scores = evaluate_algorithm(config)
#scores

In [None]:
# Evaluate Item KNN (CF)
config = {
    "algorithm": "Neighbor", # ['SVD', 'CF', 'TopPop']
    "inter_train_file_paths": train_inter_files,
    "inter_test_file_paths": test_inter_files,
    "user_file_path": usr_path,
    "item_file_path": itm_path,
    "top_k": 15, # number of recommendations.
    "n": 5, # used for CF.
    "f": 32, # length of hidden representations
}

scores = evaluate_algorithm(config)
scores

In [None]:
# Evaluate SVD
config = {
    "algorithm": "SVD", # ['SVD', 'CF', 'TopPop']
    "inter_train_file_paths": train_inter_files,
    "inter_test_file_paths": test_inter_files,
    "user_file_path": usr_path,
    "item_file_path": itm_path,
    "top_k": 10, # number of recommendations.
    "n": 5, # used for CF.
    "f": 256, # length of hidden representations
}

scores = evaluate_algorithm(config)
scores

In [None]:
g_keys = ['m', 'f', 'all']
m_keys = ['ndcg', 'average_entropy']

assert all([k in g_keys for k in scores.keys()]), 'keys error'
assert all([k in m_keys for k in scores['m'].keys()]), 'keys error'
assert scores['all']['ndcg'] >= 0, "metric score should be a number."
assert scores['all']['average_entropy'] >= 0, "metric score should be a number."

## Questions and Potential Future Work
* Do all algorithms treat Female and Male users similarly? Why?
No. For TopPop and SVD the nDCG is larger for Female users than for Male ones. For SVD it is the other way around. Those differences though are changing each time executing the whole notebook, so sometimes for example the Male-nDCG for SVD is larger then the Female-nDCG (possible reason being different random train-test splits). So in total, there is not a really significant difference between Male and Female. A reason for a hypothetical difference could be more users of one gender in the train set. Regarding Entropy there is no significant difference, since it is always very close to 1.
* How would you try improve performance of all three algorithms?
-- Increase size of dataset if possible<br>
-- Online Testing to get a better estimate of the real values of the used metrics<br>
-- Using more metrics<br>
-- Increase model complexity until overfitting to find optimal model<br>
* What other metrics would you consider to compare these recommender systems?
-- Different diversities, e.g. w.r.t. genre<br>
-- generally beyond-accuracy metrics (e.g. Novelty) to achieve better user satisfaction<br>
-- Recall/Precision/F-measure; disadvantage: order of recommendations is not considered<br>
-- Reciprocal Rank to take order of recommendations into account
* What other user groups would you investigate?
User groups w.r.t...<br>
-- Age<br>
-- Nationality<br>
-- Income<br>
-- Phone brand<br>
-- Premium/Paying vs. Non-Premium/Non-Paying users

In [None]:
# The end.