## Build Evaluation Dataset `EvaluationData.parquet`

- Merge the test-run results with the characteristics of each dataset and of the "best" and "worst" users, drawing on:
    - `4-TestRuns/results/{EVAL_RUN}/{Model}-{EVAL_RUN}.csv`
    - `DatasetCharacteristics.csv`
    - `BestWorstUserTopologicalChars.csv`
    - `UserAveragePopularity.csv`

In [1]:
import pandas as pd
import numpy as np
from src.config import EVALUATION_DIRECTORY, json_dumps_user_columns
from utils import (
    build_users_popularity_lookup_file,
    get_evaluation_data,
    get_user_popularity,
    get_users_topological_chars,
    load_benchmark_datasets,
    translate_userids,
)

### 1. Load Dataset Characteristics
- read `DatasetCharacteristics.csv`
- calculate further useful metrics (density and log-transformed values)

In [2]:
dataset_split_characteristics_df = pd.read_csv(EVALUATION_DIRECTORY.joinpath("utils/DatasetCharacteristics.csv"), sep="\t")
print(dataset_split_characteristics_df.shape)

(176, 37)


In [3]:
dataset_split_characteristics_df['density'] = 1 - dataset_split_characteristics_df['sparsity']
dataset_split_characteristics_df['density_log'] = np.log10(dataset_split_characteristics_df['density'])
dataset_split_characteristics_df['average_clustering_coef_dot_user_log'] = np.log10(dataset_split_characteristics_df['average_clustering_coef_dot_user'])
dataset_split_characteristics_df['average_clustering_coef_dot_item_log'] = np.log10(dataset_split_characteristics_df['average_clustering_coef_dot_item'])
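
The derived columns above can be illustrated on a toy frame. This is a minimal sketch: the values are invented, and the masking step is an added precaution, not part of the notebook. It matters only if a column can contain zeros, for which `np.log10` returns `-inf`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for DatasetCharacteristics.csv; the values are invented for illustration.
toy_df = pd.DataFrame({
    "dataset": ["a", "b", "c"],
    "sparsity": [0.9, 0.99, 0.999],
})

# Density is the complement of sparsity.
toy_df["density"] = 1 - toy_df["sparsity"]

# log10 spreads out values that sit close to zero. Mask non-positive values
# first so that they become NaN rather than -inf.
positive_density = toy_df["density"].where(toy_df["density"] > 0)
toy_df["density_log"] = np.log10(positive_density)
```
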

### 2. Load Test Run Datasets
- read `4-TestRuns/results/{EVAL_RUN}/{Model}-{EVAL_RUN}.csv`
- merge with previous result

In [4]:
EVAL_RUN = "RO"
model_evaluation_df = load_benchmark_datasets(EVAL_RUN)

In [5]:
evaluation_dataset_characteristics_df = pd.merge(model_evaluation_df, dataset_split_characteristics_df, on='dataset', how='left')
print(evaluation_dataset_characteristics_df.shape)

(1584, 61)
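
A left merge on `dataset` should leave the number of test-run rows unchanged as long as each dataset appears only once in the characteristics frame. The sketch below shows one way to check this with pandas' `validate` and `indicator` options. The frames, the `model` and `ndcg` columns, and the model names are hypothetical stand-ins, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames merged above.
runs_df = pd.DataFrame({
    "dataset": ["d1", "d1", "d2", "d3"],
    "model": ["BPR", "ItemKNN", "BPR", "BPR"],
    "ndcg": [0.10, 0.12, 0.08, 0.05],
})
chars_df = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "sparsity": [0.95, 0.99],
})

merged_df = pd.merge(
    runs_df,
    chars_df,
    on="dataset",
    how="left",
    validate="many_to_one",  # raises MergeError if a dataset appears twice in chars_df
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# A left merge keeps every run; runs without characteristics show up as "left_only".
unmatched = merged_df.loc[merged_df["_merge"] == "left_only", "dataset"].tolist()
```

Here `unmatched` lists the datasets whose characteristics are missing, which would otherwise surface only later as NaN values.
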


### 2. Load User's Characteristics
- read / generate `BestWorstUserTopologicalChars.csv`
- merge with previous result

In [None]:
# HINT: takes approx. 5h
file_path = EVALUATION_DIRECTORY.joinpath(f"{EVAL_RUN}/BestWorstUserTopologicalChars.csv")
user_topologies_df = get_users_topological_chars(evaluation_dataset_characteristics_df, file_path, num_datasets=176)

In [7]:
evaluation_dataset_characteristics_user_topologies_df = pd.merge(evaluation_dataset_characteristics_df, user_topologies_df, on='dataset', how='left')
print(evaluation_dataset_characteristics_user_topologies_df.shape)

(1584, 65)


### 3. Load User's Average Popularity of the interacted Items
- translate user IDs from _recboleID_ -> _localID_ -> _globalID_ so that the "best" / "worst" users can be matched with the average popularity of their interacted items in `UserAveragePopularity.csv`
    - global ID: the user ID that is valid across all splits
    - local ID: the user ID that is valid only within one split
    - recbole ID: the user ID that RecBole assigns after filtering
- read / generate `UserAveragePopularity.csv`
- generate final dataframe
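
The ID translation described above amounts to chaining two lookups. The following is a minimal sketch with hypothetical mapping tables and a hypothetical helper `to_global_id`; how `translate_userids` actually stores and resolves these mappings is not shown in this notebook.

```python
# Hypothetical lookup tables for a single split; the real mappings are
# produced inside translate_userids and may be stored differently.
recbole_to_local = {1: "17", 2: "42"}
local_to_global = {"17": 1003, "42": 2250}


def to_global_id(recbole_id, recbole_to_local, local_to_global):
    """Chain the two lookups: recbole ID -> local ID -> global ID."""
    local_id = recbole_to_local[recbole_id]
    return local_to_global[local_id]


global_ids = [to_global_id(r, recbole_to_local, local_to_global) for r in (1, 2)]
```
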

In [9]:
# HINT: takes approx. 10-20min.
translated_ids_df = translate_userids(evaluation_dataset_characteristics_user_topologies_df, 
                                      num_rows=evaluation_dataset_characteristics_user_topologies_df.shape[0])

100%|██████████| 1584/1584 [12:49<00:00,  2.06rows/s]


In [None]:
# HINT: takes approx. 1min
file_path = EVALUATION_DIRECTORY.joinpath("utils/UserAveragePopularity.csv")
popularity_dict = build_users_popularity_lookup_file(file_path, num_datasets=177)

In [11]:
# NOTE: takes approx. 2-3 min
classical_user_characteristics_df = get_user_popularity(translated_ids_df, popularity_dict, num_rows=translated_ids_df.shape[0])

100%|██████████| 1584/1584 [01:22<00:00, 19.29rows/s]


### 4. Store the files as Parquet
- `.parquet` is a columnar storage format with built-in compression; compared to `.csv`, it typically produces smaller files and faster reads and writes

In [12]:
final_eval_df = json_dumps_user_columns(classical_user_characteristics_df)
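
Judging by its name, `json_dumps_user_columns` presumably serializes list-valued user columns to JSON strings so that every column has a uniform type before writing. That reading is an assumption, as the function body is not shown here. A sketch of the idea, with a hypothetical column `best_user_items`:

```python
import json

import pandas as pd

# Hypothetical frame with one list-valued column.
df = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "best_user_items": [[3, 7, 9], [1]],
})

# Encode: each list becomes a JSON string, so the column has a uniform string type.
encoded = df.assign(best_user_items=df["best_user_items"].apply(json.dumps))

# Decode after loading: turn the strings back into Python lists.
decoded = encoded.assign(best_user_items=encoded["best_user_items"].apply(json.loads))
```
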

In [13]:
file_path = EVALUATION_DIRECTORY.joinpath(f"{EVAL_RUN}/EvaluationData.parquet")
final_eval_df.to_parquet(file_path, engine="pyarrow", index=False)

#### Example: Load from `EvaluationData.parquet`

In [None]:
EVAL_RUN = "TO"
final_eval_df = get_evaluation_data(EVAL_RUN)