# Results post-processor
This notebook post-processes the experiment results. It contains the aggregation logic that summarizes the results of runs that share the same parameters but use different dataset splits.

The post-processing logic summarizes the results across splits for each combination of the following parameters:
- **Dataset** (ALOI, Annthyroid, Cardiotocography, etc.)
- **Search algorithm** (random, ue, smac)
- **Validation set strategy** (stratified, balanced)
- **Validation set size** (20, 50, 100, 200, etc.)

## Example
Consider the following raw output files, which hold the results of the experiments for different data splits/iterations (assuming the file-naming conventions of the source code):
- ALOI_**1**_ue_balanced_100.csv
- ALOI_**2**_ue_balanced_100.csv
- ALOI_**3**_ue_balanced_100.csv
- ALOI_**4**_ue_balanced_100.csv
- ALOI_**5**_ue_balanced_100.csv

These would be summarized in a single file containing the average of the above, together with the standard deviation across splits:
- **ALOI_ue_balanced_100.csv**
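
The grouping implied by this naming convention can be sketched as follows. This is a minimal illustration rather than the notebook's own logic: the regular expression and the helper `group_raw_files` are hypothetical, and the sketch assumes every raw file is named `<dataset>_<split>_<algorithm>_<strategy>_<size>.csv`, as in the example above.

```python
import re
from collections import defaultdict

# Assumed pattern: <dataset>_<split>_<algorithm>_<strategy>_<size>.csv
RAW_NAME = re.compile(
    r"^(?P<dataset>.+)_(?P<split>\d+)_(?P<algo>[^_]+)_(?P<strategy>[^_]+)_(?P<size>\d+)\.csv$"
)


def group_raw_files(filenames):
    """Map each aggregate filename to the raw files it would summarize."""
    groups = defaultdict(list)
    for name in filenames:
        match = RAW_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        # Drop the split number to obtain the name of the aggregate file
        key = "{dataset}_{algo}_{strategy}_{size}.csv".format(**match.groupdict())
        groups[key].append(name)
    return dict(groups)


raw_files = ["ALOI_{}_ue_balanced_100.csv".format(i) for i in range(1, 6)]
groups = group_raw_files(raw_files)
print(groups)
```

Parsing the name into fields, rather than testing for substrings, avoids accidental matches when one dataset or parameter name is contained in another.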

In [1]:
# Imports
import os
import pandas as pd
from pathlib import Path
from matplotlib import pyplot as plt
from notebook_utils import preprocess_df, fill_values, get_combinations

## Setup and metadata
This cell defines the necessary variables by parsing the `metadata.csv` file provided in the results directory. It also creates the output directory where the processed files will later be saved.
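
One detail worth keeping in mind when reading a scalar from a filtered frame: boolean filtering keeps the original index labels, so label-based access such as `series[0]` only works if the row labelled 0 survived the filter. The sketch below shows the position-based alternative on toy data. The column names mirror those used in this notebook, but the values are made up for illustration.

```python
import pandas as pd

# Toy metadata; values are illustrative only
metadata_df = pd.DataFrame({
    "dataset_name": ["ALOI", "ALOI", "Annthyroid"],
    "search_type": ["ue", "random", "smac"],
    "total_budget": [30, 300, 300],
})

# Filtering drops the first row but keeps the original index labels (1, 2)
filtered = metadata_df[metadata_df["total_budget"] != 30]

# Position-based access works regardless of which labels survived
total_budget = filtered["total_budget"].iloc[0]
dataset_list = list(filtered["dataset_name"].unique())
print(total_budget, dataset_list)
```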

In [2]:
# Provide the directory of the raw output files
# Must contain a folder `raw` and a `metadata.csv` file
results_dirname = '../results/results' # input to the script
#
# Input/output directories
results_path = os.path.join(Path.cwd(), results_dirname)
raw_path = os.path.join(results_path, 'raw')
output_dir = 'processed'
output_path = os.path.join(results_path, output_dir)
if os.path.exists(output_path):
    raise ValueError(
        "Output directory `{}` already exists.".format(output_path))
else:
    os.mkdir(output_path)
#
# Import metadata
metadata_filepath = os.path.join(results_path, 'metadata.csv')
metadata_df = pd.read_csv(metadata_filepath)
# Remove individual ue runs
metadata_df = metadata_df[metadata_df['total_budget'] != 30]
#
# Extract experiment parameters
total_budget = metadata_df.total_budget.iloc[0]  # positional access: label 0 may have been filtered out
dataset_list = list(metadata_df.dataset_name.unique())
validation_strategy_list = list(metadata_df.validation_strategy.unique())
validation_size_list = list(metadata_df.validation_size.unique())
search_algorithm_list = list(metadata_df.search_type.unique())
#
# Print the parameters
print('Total budget:', total_budget)
print('Dataset list:', dataset_list)
print('Search algorithm list:', search_algorithm_list)
print('Validation strategy list:', validation_strategy_list)
print('Validation size list:', validation_size_list)

Total budget: 300
Dataset list: ['ALOI', 'Annthyroid', 'Cardiotocography']
Search algorithm list: ['edb', 'random', 'smac']
Validation strategy list: ['stratified', 'balanced']
Validation size list: [20, 50, 100, 200]


## Core processing
This cell contains the core processing logic of this notebook. It iterates over all datasets and searches for the appropriate combinations of search algorithm, validation set strategy and size, and transforms the performance results to the appropriate format, before saving them to the output directory.
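
The averaging step can be illustrated in isolation on toy data. The sketch below is not the notebook's implementation: `aggregate_runs` is a hypothetical helper, and it assumes that every run has already been aligned to the same rows (for example, one row per second of the budget). It computes the mean and the sample standard deviation from untouched copies of the individual runs.

```python
import pandas as pd


def aggregate_runs(df_list, score_columns):
    """Return the row-wise mean and sample std of the score columns across runs."""
    # Copy so that the first run is not modified in place
    df_agg = df_list[0].copy()
    for col in score_columns:
        # One column per run, aligned by row position
        stacked = pd.concat(
            [df[col].reset_index(drop=True) for df in df_list], axis=1
        )
        df_agg[col] = stacked.mean(axis=1).to_numpy()
        df_agg[col + "_std"] = stacked.std(axis=1).to_numpy()
    return df_agg


run_a = pd.DataFrame({"Timestamp": [1, 2, 3],
                      "single_best_test_score": [0.5, 0.6, 0.7]})
run_b = pd.DataFrame({"Timestamp": [1, 2, 3],
                      "single_best_test_score": [0.7, 0.8, 0.9]})
summary = aggregate_runs([run_a, run_b], ["single_best_test_score"])
print(summary)
```

Computing both statistics from the same stacked frame keeps them consistent with each other and leaves the input frames unchanged.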

In [3]:
# Calculate combinations
cross_prod = get_combinations(search_algorithm_list, validation_strategy_list, validation_size_list)
# Aggregate results
for dataset in dataset_list:
    print('Processing', dataset)
    for cp in cross_prod:
        df_list = [] # list to store processed results
        # Process raw results
        for filename in os.listdir(raw_path):
            if dataset in filename and cp in filename:
                df = pd.read_csv(os.path.join(raw_path, filename),
                    usecols = [
                        'Timestamp',
                        'single_best_optimization_score',
                        'single_best_test_score'
                    ],
                    parse_dates=['Timestamp']
                )
                # Transform timestamp and boundary values
                df = preprocess_df(df, total_budget)
                # Fill missing values from 1 to total_budget seconds
                df = fill_values(df, total_budget)
                # Append to list of dataframes
                df_list.append(df)
        # Extract stats
        if len(df_list) > 0:
            # Average individual results
            # Copy so that df_list[0] is not modified before the std is computed below
            df_agg = df_list[0].copy() # aggregate results
            for df in df_list[1:]:
                df_agg['single_best_optimization_score'] += df['single_best_optimization_score']
                df_agg['single_best_test_score'] += df['single_best_test_score']
            df_agg['single_best_optimization_score'] = df_agg['single_best_optimization_score'] / len(df_list)
            df_agg['single_best_test_score'] = df_agg['single_best_test_score'] / len(df_list)
            df_agg = df_agg.astype({"Timestamp": int})
            # Compute std
            df_opt = pd.concat([df['single_best_optimization_score'] for df in df_list], axis=1)
            df_test = pd.concat([df['single_best_test_score'] for df in df_list], axis=1)
            y_std_opt = df_opt.std(axis=1).to_numpy()
            y_std_test = df_test.std(axis=1).to_numpy()
            df_agg['single_best_optimization_score_std'] = y_std_opt
            df_agg['single_best_test_score_std'] = y_std_test
            # Save aggregate results to csv
            out_filename = dataset+'_'+cp
            df_agg.to_csv(os.path.join(output_path, out_filename), index=False)
            print('\tSaved aggregate results to:', out_filename)
print('Done.')

Processing ALOI
	Saved aggregate results to: ALOI_edb_stratified_20.csv
	Saved aggregate results to: ALOI_edb_stratified_50.csv
	Saved aggregate results to: ALOI_edb_stratified_100.csv
	Saved aggregate results to: ALOI_edb_stratified_200.csv
	Saved aggregate results to: ALOI_edb_balanced_20.csv
	Saved aggregate results to: ALOI_edb_balanced_50.csv
	Saved aggregate results to: ALOI_edb_balanced_100.csv
	Saved aggregate results to: ALOI_edb_balanced_200.csv
	Saved aggregate results to: ALOI_random_stratified_20.csv
	Saved aggregate results to: ALOI_random_stratified_50.csv
	Saved aggregate results to: ALOI_random_stratified_100.csv
	Saved aggregate results to: ALOI_random_stratified_200.csv
	Saved aggregate results to: ALOI_random_balanced_20.csv
	Saved aggregate results to: ALOI_random_balanced_50.csv
	Saved aggregate results to: ALOI_random_balanced_100.csv
	Saved aggregate results to: ALOI_random_balanced_200.csv
	Saved aggregate results to: ALOI_smac_stratified_20.csv
	Saved aggregat