# Results post-processor
This notebook post-processes the experiment results. It contains the aggregation logic that summarizes the results of runs which share the same parameters but use different dataset splits.

The post-processing logic averages the results of different splits whenever they share the same combination of the parameters below:
- **Dataset** (ALOI, Annthyroid, Cardiotocography, etc.)
- **Search algorithm** (random, edb, smac)
- **Validation set strategy** (stratified, balanced)
- **Validation set size** (20, 50, 100, 200, etc.)

## Example
Consider the raw output files below, produced by the experiments for different data splits/iterations (assuming the file-naming conventions of the source code):
- ALOI_**1**_edb_balanced_100.csv
- ALOI_**2**_edb_balanced_100.csv
- ALOI_**3**_edb_balanced_100.csv
- ALOI_**4**_edb_balanced_100.csv
- ALOI_**5**_edb_balanced_100.csv

These would be summarized in a single file containing the average of the above:
- **ALOI_edb_balanced_100.csv**
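
The grouping in this example can be illustrated with a minimal sketch. It assumes that every raw filename follows the pattern `<dataset>_<split>_<algorithm>_<strategy>_<size>.csv`; the regular expression and the helper `group_by_configuration` are hypothetical and not part of the notebook, which matches files by substring instead.

```python
import re
from collections import defaultdict

# Assumed pattern: <dataset>_<split>_<algorithm>_<strategy>_<size>.csv
RAW_NAME = re.compile(r"^(?P<dataset>.+?)_(?P<split>\d+)_(?P<rest>.+\.csv)$")


def group_by_configuration(filenames):
    """Map each aggregate filename to the raw split files that feed it (hypothetical helper)."""
    groups = defaultdict(list)
    for name in filenames:
        match = RAW_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        # Dropping the split index yields the name of the aggregate file.
        groups[f"{match['dataset']}_{match['rest']}"].append(name)
    return dict(groups)


raw_files = [f"ALOI_{i}_edb_balanced_100.csv" for i in range(1, 6)]
groups = group_by_configuration(raw_files)
print(groups)
```

Dataset names that themselves contain an underscore followed by digits would be ambiguous under this pattern, so the sketch only illustrates the naming idea.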

In [None]:
# Imports
import os
import pandas as pd
from pathlib import Path
from matplotlib import pyplot as plt
from datetime import timedelta as td
from notebook_utils import fill_values, get_combinations

## Setup and metadata
This cell defines the necessary variables by parsing the `metadata.csv` file provided in the results directory. It also creates the output directory where the processed files will later be saved.
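
As a rough illustration of the parameter extraction, the sketch below parses a small in-memory table. The column names are the ones this notebook reads from `metadata.csv`, while the rows and the budget of 600 are made up for the example.

```python
import io

import pandas as pd

# Made-up metadata rows; only the column names come from this notebook.
csv_text = """dataset_name,search_type,validation_strategy,validation_size,total_budget
ALOI,edb,balanced,100,30
ALOI,edb,balanced,100,600
ALOI,random,stratified,20,600
"""
metadata = pd.read_csv(io.StringIO(csv_text))

# Filtering keeps the original row labels, so label 0 is gone afterwards.
filtered = metadata[metadata["total_budget"] != 30]

# Positional access works regardless of which labels survived the filter.
budget = int(filtered["total_budget"].iloc[0])
algorithms = list(filtered["search_type"].unique())

print("Total budget:", budget)
print("Search algorithm list:", algorithms)
```

Note that a label-based lookup such as `filtered["total_budget"][0]` would raise a `KeyError` on this table, because the row labelled 0 was filtered out.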

In [None]:
# Provide the directory of the raw output files
# Must contain a folder `raw` and a `metadata.csv` file
results_dirname = 'results' # input to the script
#
# Input/output directories
results_path = os.path.join(Path.cwd(), results_dirname)
raw_path = os.path.join(results_path, 'raw')
output_dir = 'processed'
output_path = os.path.join(results_path, output_dir)
if os.path.exists(output_path):
    raise ValueError(
        "Output directory `{}` already exists.".format(output_path))
else:
    os.mkdir(output_path)
#
# Import metadata
metadata_filepath = os.path.join(results_path, 'metadata.csv')
metadata_df = pd.read_csv(metadata_filepath)
# Remove individual edb runs
metadata_df = metadata_df[metadata_df['total_budget'] != 30]
#
# Extract experiment parameters
# Use positional access: the filter above may have dropped the row labelled 0
total_budget = metadata_df.total_budget.iloc[0]
dataset_list = list(metadata_df.dataset_name.unique())
validation_strategy_list = list(metadata_df.validation_strategy.unique())
validation_size_list = list(metadata_df.validation_size.unique())
search_algorithm_list = list(metadata_df.search_type.unique())
#
# Print the parameters
print('Total budget:', total_budget)
print('Dataset list:', dataset_list)
print('Search algorithm list:', search_algorithm_list)
print('Validation strategy list:', validation_strategy_list)
print('Validation size list:', validation_size_list)

## Core processing
This cell contains the core processing logic of this notebook. For each dataset and each combination of search algorithm, validation set strategy and validation set size, it collects the matching raw files and brings each one into a common format: timestamps are converted to seconds relative to the first record, and the series is extended to the total budget. It then averages the scores across the splits and saves the result to the output directory.
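
The transformation and averaging can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `fill_every_second` is a hypothetical stand-in for `fill_values` from `notebook_utils` (whose actual behavior is not shown here) and is assumed to produce one row per second with forward-filled scores. The runs, scores and budget are made up.

```python
import pandas as pd


def to_relative_seconds(df, total_budget):
    """Convert timestamps to seconds since the first row and close the series at the budget."""
    out = df.copy()
    elapsed = out["Timestamp"] - out["Timestamp"].iloc[0]
    out["Timestamp"] = elapsed.dt.total_seconds().astype(int)
    # Boundary row: carry the last scores forward to the end of the budget.
    boundary = out.tail(1).assign(Timestamp=total_budget)
    out = pd.concat([out, boundary], ignore_index=True)
    return out.drop_duplicates().reset_index(drop=True)


def fill_every_second(df, total_budget):
    """Hypothetical stand-in for fill_values: one row per second, scores forward-filled."""
    grid = pd.RangeIndex(0, total_budget + 1, name="Timestamp")
    return (
        df.drop_duplicates("Timestamp", keep="last")
        .set_index("Timestamp")
        .reindex(grid)
        .ffill()
        .reset_index()
    )


def average_runs(runs):
    """Average the scores of all runs at each timestamp."""
    return pd.concat(runs).groupby("Timestamp", as_index=False).mean()


def make_run(seconds, scores):
    """Build a synthetic raw result with absolute timestamps."""
    start = pd.Timestamp("2021-01-01 12:00:00")
    return pd.DataFrame({
        "Timestamp": [start + pd.Timedelta(seconds=s) for s in seconds],
        "single_best_test_score": scores,
    })


total_budget = 5  # made-up budget in seconds
run_a = make_run([0, 2], [0.50, 0.70])
run_b = make_run([0, 3], [0.60, 0.80])
processed = [
    fill_every_second(to_relative_seconds(run, total_budget), total_budget)
    for run in (run_a, run_b)
]
average = average_runs(processed)
print(average)
```

Aligning every run on the same timestamp grid before averaging is what makes an element-wise mean meaningful: without it, rows with the same position could refer to different points in time.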

In [None]:
# Calculate combinations
cross_prod = get_combinations(search_algorithm_list, validation_strategy_list, validation_size_list)
# Aggregate results
for dataset in dataset_list:
    print('Processing', dataset)
    for cp in cross_prod:
        df_list = [] # list to store processed results
        # Process raw results
        for filename in os.listdir(raw_path):
            if dataset in filename and cp in filename:
                df = pd.read_csv(
                    os.path.join(
                        raw_path,
                        filename
                    ),
                    parse_dates=['Timestamp']
                )
                df.drop(columns=['single_best_train_score'], inplace=True)
                # Transform timestamp and boundary values
                df.Timestamp = (df.Timestamp-df.Timestamp[0]).apply(td.total_seconds)
                n = df.shape[0]
                df.at[n, 'Timestamp'] = total_budget
                df = df.astype({"Timestamp": int})
                df.at[n, 'single_best_optimization_score'] = df.at[n-1, 'single_best_optimization_score']
                df.at[n, 'single_best_test_score'] = df.at[n-1, 'single_best_test_score']
                df = df.drop_duplicates().reset_index(drop=True)
                df = fill_values(df, total_budget)
                df_list.append(df)
        # Average individual results
        if len(df_list) > 0:
            df_agg = df_list[0].copy() # running sum of the scores across splits
            for df in df_list[1:]:
                df_agg['single_best_optimization_score'] += df['single_best_optimization_score']
                df_agg['single_best_test_score'] += df['single_best_test_score']
            df_agg['single_best_optimization_score'] = df_agg['single_best_optimization_score'] / len(df_list)
            df_agg['single_best_test_score'] = df_agg['single_best_test_score'] / len(df_list)
            df_agg = df_agg.astype({"Timestamp": int})
            # Save aggregate results to csv
            out_filename = dataset+'_'+cp
            df_agg.to_csv(os.path.join(output_path, out_filename), index=False)
            print('\tSaved aggregate results to:', out_filename)
print('Done.')