## Results post-processor
This notebook post-processes the experiment results. It aggregates the results of runs that share the same parameters but were executed on different dataset splits.

The post-processing logic averages the results of the different splits for each combination of the parameters below:
- **Dataset** (ALOI, Annthyroid, Cardiotocography, etc.)
- **Search algorithm** (random, edb, smac)
- **Validation set strategy** (stratified, balanced)
- **Validation set size** (20, 50, 100, 200, etc.)

### Example
The raw output files below, produced by the experiments on different data splits/iterations (assuming the file-naming conventions of the source code):
- ALOI_**1**_edb_balanced_100.csv
- ALOI_**2**_edb_balanced_100.csv
- ALOI_**3**_edb_balanced_100.csv
- ALOI_**4**_edb_balanced_100.csv
- ALOI_**5**_edb_balanced_100.csv

would be summarized in a single file containing their average:
- **ALOI_edb_balanced_100.csv**
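
The grouping implied by this example can be sketched as follows. This is only an illustration, assuming raw filenames follow the pattern `<dataset>_<split>_<algorithm>_<strategy>_<size>.csv` shown above; the names `RAW_NAME` and `group_by_configuration` are hypothetical and are not used by the cells below.

```python
import re
from collections import defaultdict

# Assumed raw filename pattern: <dataset>_<split>_<algorithm>_<strategy>_<size>.csv
RAW_NAME = re.compile(
    r'^(?P<dataset>.+)_(?P<split>\d+)_(?P<algorithm>[^_]+)'
    r'_(?P<strategy>[^_]+)_(?P<size>\d+)\.csv$'
)

def group_by_configuration(filenames):
    '''Map each aggregate filename to the raw files that feed into it.'''
    groups = defaultdict(list)
    for name in filenames:
        match = RAW_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        # Drop the split index to obtain the name of the aggregate file
        key = '{dataset}_{algorithm}_{strategy}_{size}.csv'.format(**match.groupdict())
        groups[key].append(name)
    return dict(groups)
```

Files that differ only in the split index end up under the same key, which is the name of the summarized file.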

In [1]:
# Imports
import os
import pandas as pd
from pathlib import Path
from matplotlib import pyplot as plt
from datetime import timedelta as td

In [3]:
# Function definitions
def fill_values(df, total_budget):
    '''
    Arguments:
        df(pd.DataFrame): the dataframe of the results
        total_budget(int): the total budget in seconds

    Returns:
        df(pd.DataFrame): the processed df with `total_budget` rows
    '''
    # Fill the missing values for `Timestamp` column
    ref_idx = 0 # index of the most recent row whose timestamp was observed
    for i in range(1, total_budget):
        if i not in df.Timestamp.values:
            n = df.shape[0]
            df.at[n, 'Timestamp'] = int(i) # keep column name for consistency
            df.at[n, 'single_best_optimization_score'] = df.at[ref_idx, 'single_best_optimization_score']
            df.at[n, 'single_best_test_score'] = df.at[ref_idx, 'single_best_test_score']
        else:
            ref_idx = df.index[df['Timestamp'] == i][0]
            #print('Changing index at Timestamp =', i)
    df = df.iloc[1: , :]
    df = df.sort_values(by='Timestamp').reset_index(drop=True)
    df = df.astype({"Timestamp": int})
    return df

def get_combinations(search_algorithm_list, validation_strategy_list, validation_size_list):
    '''
    Function that computes the combinations of the below values:
      - search algorithm
      - validation strategy
      - validation size

    Arguments:
        search_algorithm_list(list): list of search algorithms
        validation_strategy_list(list): list of validation strategy values
        validation_size_list(list): list of validation size values

    Returns:
        cross_prod: the cross product list of combinations as strings
    '''
    cross_prod = []
    for algorithm in search_algorithm_list:
        for strategy in validation_strategy_list:
            for size in validation_size_list:
                cross_prod.append(
                    '{}_{}_{}.csv'.format(
                        algorithm,
                        strategy,
                        size
                    )
                )
    return cross_prod
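
The forward-fill performed by `fill_values` above can also be expressed with pandas' `reindex` and `ffill`. The sketch below is a possible alternative, not the notebook's implementation. It assumes the input has a row at `Timestamp` 0 and, unlike the loop above, keeps only the last observation recorded for each second. `SCORE_COLUMNS` and `fill_values_ffill` are hypothetical names.

```python
import pandas as pd

# Hypothetical constant: the score columns carried through the aggregation
SCORE_COLUMNS = ['single_best_optimization_score', 'single_best_test_score']

def fill_values_ffill(df, total_budget):
    '''Resample the scores onto the grid 1..total_budget by forward filling.'''
    per_second = (
        df.astype({'Timestamp': int})
          # Assumption: keep the last observation recorded within each second
          .drop_duplicates(subset='Timestamp', keep='last')
          .set_index('Timestamp')[SCORE_COLUMNS]
          .sort_index()
    )
    # Include second 0 so that the first observed score can be propagated
    grid = pd.RangeIndex(0, total_budget + 1)
    filled = per_second.reindex(grid).ffill().rename_axis('Timestamp')
    # Drop second 0 so that the result has exactly `total_budget` rows
    return filled.loc[1:].reset_index()
```

Because the result always has one row per second, runs processed this way share the same index and can be averaged row by row.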

In [4]:
# Provide the directory of the raw output files
# Must contain a folder `raw` and a `metadata.csv` file
results_dirname = 'results' # input to the script
#
# Input/output directories
results_path = os.path.join(Path.cwd(), results_dirname)
raw_path = os.path.join(results_path, 'raw')
output_dir = 'processed'
output_path = os.path.join(results_path, output_dir)
if os.path.exists(output_path):
    raise ValueError(
        "Output directory `{}` already exists.".format(output_path))
else:
    os.mkdir(output_path)
#
# Import metadata
metadata_filepath = os.path.join(results_path, 'metadata.csv')
metadata_df = pd.read_csv(metadata_filepath)
# Remove individual edb runs
metadata_df = metadata_df[metadata_df['total_budget'] != 30]
#
# Extract experiment parameters
total_budget = metadata_df.total_budget.iloc[0] # positional access: row label 0 may have been filtered out
dataset_list = list(metadata_df.dataset_name.unique())
validation_strategy_list = list(metadata_df.validation_strategy.unique())
validation_size_list = list(metadata_df.validation_size.unique())
search_algorithm_list = list(metadata_df.search_type.unique())
#
# Print the parameters
print('Total budget:', total_budget)
print('Dataset list:', dataset_list)
print('Search algorithm list:', search_algorithm_list)
print('Validation strategy list:', validation_strategy_list)
print('Validation size list:', validation_size_list)

Total budget: 300
Dataset list: ['ALOI', 'Annthyroid', 'Cardiotocography', 'SpamBase']
Search algorithm list: ['edb', 'random', 'smac']
Validation strategy list: ['stratified', 'balanced']
Validation size list: [20, 50, 100, 200]

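
Once every run has been resampled onto a common one-second grid, averaging across splits amounts to a group-by on `Timestamp`. The following is a minimal sketch under that assumption; `average_runs` and `SCORE_COLUMNS` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical constant: the score columns that are averaged across splits
SCORE_COLUMNS = ['single_best_optimization_score', 'single_best_test_score']

def average_runs(df_list):
    '''Average the score columns of several runs, second by second.'''
    # Stack all runs, then take the mean of each score per timestamp
    stacked = pd.concat(df_list, ignore_index=True)
    return stacked.groupby('Timestamp', as_index=False)[SCORE_COLUMNS].mean()
```

Grouping on `Timestamp` does not rely on the runs sharing the same row index, so a run with a missing second would simply contribute fewer values to that second's mean.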

In [5]:
# Calculate combinations
cross_prod = get_combinations(search_algorithm_list, validation_strategy_list, validation_size_list)
# Aggregate results
for dataset in dataset_list:
    print('Processing', dataset)
    for cp in cross_prod:
        df_list = [] # list to store processed results
        # Process raw results
        for filename in os.listdir(raw_path):
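            # Match on prefix/suffix so that a dataset name contained in another one is not picked up
            if filename.startswith(dataset + '_') and filename.endswith('_' + cp):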
                df = pd.read_csv(
                    os.path.join(
                        raw_path,
                        filename
                    ),
                    parse_dates=['Timestamp']
                )
                df.drop(columns=['single_best_train_score'], inplace=True)
                # Transform timestamp and boundary values
                df.Timestamp = (df.Timestamp - df.Timestamp.iloc[0]).dt.total_seconds()
                n = df.shape[0]
                df.at[n, 'Timestamp'] = total_budget
                df = df.astype({"Timestamp": int})
                df.at[n, 'single_best_optimization_score'] = df.at[n-1, 'single_best_optimization_score']
                df.at[n, 'single_best_test_score'] = df.at[n-1, 'single_best_test_score']
                df = df.drop_duplicates().reset_index(drop=True)
                df = fill_values(df, total_budget)
                df_list.append(df)
        # Average individual results
        if len(df_list) > 0:
            df_agg = df_list[0] # aggregate results
            for df in df_list[1:]:
                df_agg['single_best_optimization_score'] += df['single_best_optimization_score']
                df_agg['single_best_test_score'] += df['single_best_test_score']
            df_agg['single_best_optimization_score'] = df_agg['single_best_optimization_score'] / len(df_list)
            df_agg['single_best_test_score'] = df_agg['single_best_test_score'] / len(df_list)
            df_agg = df_agg.astype({"Timestamp": int})
            # Save aggregate results to csv
            out_filename = dataset+'_'+cp
            df_agg.to_csv(os.path.join(output_path, out_filename), index=False)
            print('\tSaved aggregate results to:', out_filename)
print('Done.')

Processing ALOI
	Saved aggregate results to: ALOI_edb_stratified_20.csv
	Saved aggregate results to: ALOI_edb_stratified_50.csv
	Saved aggregate results to: ALOI_edb_stratified_100.csv
	Saved aggregate results to: ALOI_edb_stratified_200.csv
	Saved aggregate results to: ALOI_edb_balanced_20.csv
	Saved aggregate results to: ALOI_edb_balanced_50.csv
	Saved aggregate results to: ALOI_edb_balanced_100.csv
	Saved aggregate results to: ALOI_edb_balanced_200.csv
	Saved aggregate results to: ALOI_random_stratified_20.csv
	Saved aggregate results to: ALOI_random_stratified_50.csv
	Saved aggregate results to: ALOI_random_stratified_100.csv
	Saved aggregate results to: ALOI_random_stratified_200.csv
	Saved aggregate results to: ALOI_random_balanced_20.csv
	Saved aggregate results to: ALOI_random_balanced_50.csv
	Saved aggregate results to: ALOI_random_balanced_100.csv
	Saved aggregate results to: ALOI_random_balanced_200.csv
	Saved aggregate results to: ALOI_smac_stratified_20.csv
	Saved aggregat