# Testing the post-processing aggregation logic
This notebook contains tests for the aggregation logic that is applied to the raw results of AutoSklearn as a post-processing step.

### Input data
The input data consists of the raw performance metrics output by AutoSklearn and saved to disk by the thesis source code.

| Timestamp | single_best_optimization_score | single_best_test_score |
|:---:|:---:|:---:|
| 2022-04-11 19:15:01 | 0.5 | 0.5 |
| 2022-04-11 19:15:04 | 0.6 | 0.6 |
| 2022-04-11 19:15:08 | 0.7 | 0.7 |

### Data after *preprocess_df*
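After applying the *preprocess_df* function, the data should look as follows. `Timestamp` now holds the seconds elapsed since the first record, and a final row at the budget (10 seconds in this test) repeats the last recorded scores: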

| Timestamp | single_best_optimization_score | single_best_test_score |
|:---:|:---:|:---:|
| 0 | 0.5 | 0.5 |
| 3 | 0.6 | 0.6 |
| 7 | 0.7 | 0.7 |
| 10 | 0.7 | 0.7 |

This dataframe can already be plotted: a line plot linearly interpolates the scores for the seconds that are missing from the `Timestamp` column. However, to correctly average the performance of several runs, i.e. several such dataframes, the missing seconds must be filled in explicitly so that every run has a row for the same timestamps.
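The transformation described above can be sketched as follows. This is a minimal illustration, not the `preprocess_df` from `notebooks.notebook_utils`, whose implementation is not shown here. The helper name `to_elapsed_seconds` is hypothetical, and the sketch assumes that the step (a) converts timestamps to whole seconds since the first record and (b) appends a copy of the last row at the budget.

```python
import io

import pandas as pd

# The three input rows from the "Input data" table, inlined so the
# sketch does not depend on mock_results.csv.
RAW_CSV = """Timestamp,single_best_optimization_score,single_best_test_score
2022-04-11 19:15:01,0.5,0.5
2022-04-11 19:15:04,0.6,0.6
2022-04-11 19:15:08,0.7,0.7
"""


def to_elapsed_seconds(df: pd.DataFrame, budget: int) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_df (the real function may differ)."""
    out = df.copy()
    start = out["Timestamp"].iloc[0]
    # Replace absolute timestamps with whole seconds since the first record.
    out["Timestamp"] = (out["Timestamp"] - start).dt.total_seconds().astype(int)
    # Assumption: the last known scores are repeated at the end of the budget.
    if out["Timestamp"].iloc[-1] < budget:
        last = out.iloc[[-1]].copy()
        last["Timestamp"] = budget
        out = pd.concat([out, last], ignore_index=True)
    return out


raw = pd.read_csv(io.StringIO(RAW_CSV), parse_dates=["Timestamp"])
elapsed = to_elapsed_seconds(raw, budget=10)
print(elapsed)
```

Under these assumptions, `elapsed` has the same four rows as the table above, with `Timestamp` values 0, 3, 7 and 10.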

### Data after *fill_values*
After applying the *fill_values* function, the data should contain one row per second from 1 up to the budget, with each missing second carrying forward the most recently recorded scores:

| Timestamp | single_best_optimization_score | single_best_test_score |
|:---:|:---:|:---:|
| 1 | 0.5 | 0.5 |
| 2 | 0.5 | 0.5 |
| 3 | 0.6 | 0.6 |
| 4 | 0.6 | 0.6 |
| 5 | 0.6 | 0.6 |
| 6 | 0.6 | 0.6 |
| 7 | 0.7 | 0.7 |
| 8 | 0.7 | 0.7 |
| 9 | 0.7 | 0.7 |
| 10 | 0.7 | 0.7 |

The dataframe is now ready to be plotted with `Timestamp` on the horizontal axis and the two metrics on the vertical axis.
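To illustrate why the fill step matters for averaging, here is a minimal sketch. It is not the `fill_values` from `notebooks.notebook_utils`: the helper name `forward_fill_to_budget` is hypothetical, and it assumes the fill is a forward fill onto the seconds 1 to `budget`. The second run (`run_b`) is made-up data, included only to show two runs that record scores at different seconds.

```python
import pandas as pd

METRICS = ["single_best_optimization_score", "single_best_test_score"]


def forward_fill_to_budget(df: pd.DataFrame, budget: int) -> pd.DataFrame:
    """Hypothetical stand-in for fill_values (the real function may differ)."""
    seconds = pd.RangeIndex(1, budget + 1, name="Timestamp")
    # Each second takes the scores of the latest record at or before it.
    return (
        df.set_index("Timestamp")
        .reindex(seconds, method="ffill")
        .reset_index()
    )


def make_run(seconds, scores):
    """Build a preprocessed-style frame with identical values in both metrics."""
    return pd.DataFrame(
        {"Timestamp": seconds, METRICS[0]: scores, METRICS[1]: scores}
    )


# run_a mirrors the preprocessed table above; run_b is hypothetical.
run_a = make_run([0, 3, 7, 10], [0.5, 0.6, 0.7, 0.7])
run_b = make_run([0, 2, 5, 10], [0.4, 0.6, 0.8, 0.8])

filled_a = forward_fill_to_budget(run_a, budget=10)
filled_b = forward_fill_to_budget(run_b, budget=10)

# Both runs now share the same timestamps, so a per-second mean is well defined.
mean_scores = pd.concat([filled_a, filled_b]).groupby("Timestamp")[METRICS].mean()
print(mean_scores)
```

Without the fill, the two runs would only share the timestamps 0 and 10, and grouping by `Timestamp` would average a single run at every other second.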

In [1]:
# Imports
import os
import sys
import pandas as pd
p = os.path.abspath('..')
sys.path.insert(1, p)
from notebooks.notebook_utils import preprocess_df, fill_values

In [2]:
# Load input data
budget = 10
df = pd.read_csv(
    'mock_results.csv',
    parse_dates=['Timestamp']
)

In [3]:
# preprocess_df
df = preprocess_df(df, budget)
df.head(10)

| | Timestamp | single_best_optimization_score | single_best_test_score |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0.5 | 0.5 |
| 1 | 3 | 0.6 | 0.6 |
| 2 | 7 | 0.7 | 0.7 |
| 3 | 10 | 0.7 | 0.7 |


In [4]:
# fill_values
df_new = fill_values(df, budget)
df_new.head(20)

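| | Timestamp | single_best_optimization_score | single_best_test_score |
|:---:|:---:|:---:|:---:|
| 0 | 1 | 0.5 | 0.5 |
| 1 | 2 | 0.5 | 0.5 |
| 2 | 3 | 0.6 | 0.6 |
| 3 | 4 | 0.6 | 0.6 |
| 4 | 5 | 0.6 | 0.6 |
| 5 | 6 | 0.6 | 0.6 |
| 6 | 7 | 0.7 | 0.7 |
| 7 | 8 | 0.7 | 0.7 |
| 8 | 9 | 0.7 | 0.7 |
| 9 | 10 | 0.7 | 0.7 |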
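The output above can also be checked against the expected table programmatically instead of by inspection. The sketch below is one possible way to do that with `pandas.testing.assert_frame_equal`. The helper names `expected_filled_frame` and `check_filled_frame` are hypothetical, and `observed` is parsed from the output shown above rather than produced by `fill_values`, so that the example runs on its own.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal

METRICS = ["single_best_optimization_score", "single_best_test_score"]

# The fill_values output shown above, without the index column.
OBSERVED_CSV = """Timestamp,single_best_optimization_score,single_best_test_score
1,0.5,0.5
2,0.5,0.5
3,0.6,0.6
4,0.6,0.6
5,0.6,0.6
6,0.6,0.6
7,0.7,0.7
8,0.7,0.7
9,0.7,0.7
10,0.7,0.7
"""


def expected_filled_frame() -> pd.DataFrame:
    """Expected result for the mock input with a budget of 10 seconds."""
    scores = [0.5] * 2 + [0.6] * 4 + [0.7] * 4
    return pd.DataFrame(
        {"Timestamp": range(1, 11), METRICS[0]: scores, METRICS[1]: scores}
    )


def check_filled_frame(actual: pd.DataFrame, budget: int) -> None:
    """Raise AssertionError if `actual` deviates from the expected table."""
    # One row per second, from 1 up to and including the budget.
    assert actual["Timestamp"].tolist() == list(range(1, budget + 1))
    # No gaps may remain in either metric.
    assert actual[METRICS].notna().all().all()
    # Values must match the expected table (dtype differences are ignored).
    assert_frame_equal(
        actual.reset_index(drop=True), expected_filled_frame(), check_dtype=False
    )


observed = pd.read_csv(io.StringIO(OBSERVED_CSV))
check_filled_frame(observed, budget=10)
```

In the notebook itself, the same check would be called as `check_filled_frame(df_new, budget)` after cell 4.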
