# Tutorial for adding a new search space to HPO-B

In this tutorial, we explain how to pre-process a group of flows (e.g., a search space) and datasets from OpenML so that they can be added to HPO-B. Although we focus on a single search space as an example, this notebook also mirrors the process used to create the original benchmark. Bear in mind, however, that creating the original benchmark was an iterative process built on the main steps presented here.

First, we will import the main libraries that we will need.

In [1]:
import openml
from collections import defaultdict
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

Now we select a specific flow ID. This ID is just a placeholder that stands in for potential new flows added to OpenML. The example also works for a list of flows, although we select only one. The function **list_runs** returns the IDs of the matching runs, but not the actual hyperparameter configurations and responses. This information will be queried later on.

In [2]:
flow_id = 423
runs = openml.runs.list_runs(flow = [flow_id])


We create a matrix to structure the runs, so that we can tell apart the runs belonging to each flow and dataset. Afterwards, we define the lists of selected datasets and flows, which we limit to 100 datasets for the sake of simplicity. The original benchmark creation imposed some restrictions on the number of runs per flow and dataset. For more information on this, please refer to **data_creation.py**.

In [3]:
matrix = defaultdict(lambda: defaultdict(list))  # flow ID -> task ID -> list of run IDs

for run in runs.values():
    matrix[run["flow_id"]][run["task_id"]].append(run['run_id']) 
    
selected_datasets_ids = list(matrix[flow_id].keys())[:100] #limiting to 100 datasets for demonstration purposes
selected_flows_ids = [flow_id]*len(selected_datasets_ids)
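
To make the grouping concrete, the following minimal sketch applies the same nested `defaultdict` pattern to a few hand-written records. The records are made up for illustration and only mimic the fields used above (`flow_id`, `task_id`, `run_id`); they are not real OpenML runs.

```python
from collections import defaultdict

# Hypothetical run records that mimic the fields used above.
toy_runs = {
    101: {"run_id": 101, "flow_id": 423, "task_id": 1},
    102: {"run_id": 102, "flow_id": 423, "task_id": 1},
    103: {"run_id": 103, "flow_id": 423, "task_id": 13},
}

# flow ID -> task ID -> list of run IDs
toy_matrix = defaultdict(lambda: defaultdict(list))
for run in toy_runs.values():
    toy_matrix[run["flow_id"]][run["task_id"]].append(run["run_id"])

# Number of runs available per task for flow 423
runs_per_task = {task: len(ids) for task, ids in toy_matrix[423].items()}
print(runs_per_task)  # {1: 2, 13: 1}
```

A count like `runs_per_task` is one possible starting point for restricting the selection to datasets with enough runs.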

Now we query the actual runs. Notice that we do this in batches of 1000 runs to avoid overloading the servers. We also keep an index that tracks the current position in the list of selected datasets. This way, if the process stops, the download can be resumed by assigning the highest index reached to **starting_idx**.

In [5]:

def process_run_settings (settings):
    parameter_dict ={}
    for setting in settings:
        value = setting["oml:value"]
        parameter_dict[setting["oml:name"]] = value

    return parameter_dict

starting_idx = 0
meta_dataset = defaultdict(lambda: defaultdict(list))

for idx, (dataset, flow) in enumerate(zip(selected_datasets_ids[starting_idx:], selected_flows_ids[starting_idx:])):
    
    if idx%10==0:
        print("Processing dataset:", dataset, " and flow: ", flow, " position:",idx)
    
    
    run_list = list(matrix[flow][dataset])
    
    # Download the runs in batches of 1000 to avoid overloading the servers
    for run_list_idx in range(0, len(run_list), 1000):

        current_runs = openml.runs.get_runs(run_list[run_list_idx:run_list_idx+1000])

        for i, current_run in enumerate(current_runs):

            temp_dict = {}
            temp_dict["run_id"] = current_run.run_id
            temp_dict["task_id"] = current_run.task_id
            temp_dict["flow_name"] = current_run.flow_name
            temp_dict["accuracy"] = current_run.evaluations["predictive_accuracy"]
            temp_dict["parameter_settings"] =  process_run_settings (current_run.parameter_settings)
            meta_dataset[flow][dataset].append(temp_dict)

Processing dataset: 1  and flow:  423  position: 0
Processing dataset: 13  and flow:  423  position: 10
Processing dataset: 27  and flow:  423  position: 20
Processing dataset: 40  and flow:  423  position: 30
Processing dataset: 53  and flow:  423  position: 40
Processing dataset: 235  and flow:  423  position: 50
Processing dataset: 248  and flow:  423  position: 60
Processing dataset: 264  and flow:  423  position: 70
Processing dataset: 277  and flow:  423  position: 80
Processing dataset: 288  and flow:  423  position: 90
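
The batching logic can be tried out without contacting the servers. The sketch below uses a hypothetical helper, `iter_batches`, which is not part of OpenML or HPO-B, and a list of integers as a stand-in for the run IDs of one flow and dataset.

```python
def iter_batches(items, batch_size=1000):
    """Yield (offset, batch) pairs that cover `items` in order."""
    for offset in range(0, len(items), batch_size):
        yield offset, items[offset:offset + batch_size]

# Stand-in for the run IDs of one flow/dataset pair.
run_ids = list(range(2500))

batches = list(iter_batches(run_ids))
batch_sizes = [len(batch) for _, batch in batches]
print(batch_sizes)  # [1000, 1000, 500]

# The global position of an element is the batch offset plus its index in the batch.
offset, batch = batches[2]
print(run_ids[offset + 3] == batch[3])  # True
```

Keeping the offset next to each batch is what allows an element to be mapped back to its position in the full list; an index that only counts within the current batch would point at the wrong entry from the second batch onwards.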


## Casting hyperparameters type

Since all hyperparameter values are originally strings, we need to cast them to their original type. For this, we consider a hyperparameter to be either an integer, a float or a string. Moreover, we create a **check_value** function that checks whether the value of a hyperparameter is valid. Although we only check the length here, this part is fully customizable: it could instead check whether a hyperparameter lies within a range or follows a specific format. If a hyperparameter does not satisfy the checking conditions, the whole run is excluded.

In [6]:
meta_dataset_checkpoint1 = defaultdict(lambda: defaultdict(list))
max_len = 2000

check_value = lambda x: len(x) < max_len

for flow_id in meta_dataset.keys():
    for dataset_id in meta_dataset[flow_id].keys():
        for run_id, run in enumerate(meta_dataset[flow_id][dataset_id]):
            append_run = 1
            for parameter in run["parameter_settings"].keys():
                try:
                    try:
                        run["parameter_settings"][parameter]= int(run["parameter_settings"][parameter])
                    except:
                        run["parameter_settings"][parameter]= float(run["parameter_settings"][parameter])
                except:
                    value = run["parameter_settings"][parameter]

                    if value is not None:
                        if not check_value(value):
                            append_run = 0
            if append_run == 1:
                meta_dataset_checkpoint1[flow_id][dataset_id].append(run)
                    
with open('meta_dataset_checkpoint1.json', 'w') as outfile:
    json.dump(meta_dataset_checkpoint1, outfile)
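
The casting rule used above (try integer, then float, otherwise keep the string) can be isolated in a small function. The following is a minimal sketch of that rule; `cast_hyperparameter` is a hypothetical helper name and the example values are made up.

```python
def cast_hyperparameter(raw):
    """Cast a raw value to int, then to float; otherwise return it unchanged."""
    if raw is None:
        return None
    for caster in (int, float):
        try:
            return caster(raw)
        except (TypeError, ValueError):
            continue
    return raw

# Made-up raw values, as they might arrive in string form.
examples = ["16", "0.001", "1e-3", "rbf", None]
casted = [cast_hyperparameter(value) for value in examples]
print(casted)  # [16, 0.001, 0.001, 'rbf', None]
```

Catching only `TypeError` and `ValueError` keeps unrelated errors visible, which a bare `except` would silently swallow. Note that strings such as `"nan"` or `"inf"` are accepted by `float`, so a custom validity check may still be needed for them.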

## Filtering datasets

Additionally, some datasets may not have enough evaluations, or may contain wrong evaluations, which makes them of little use. Therefore, we exclude all their runs from the final meta-dataset. Here, we filter out the datasets in which every hyperparameter takes only one unique value, but the selection rule can involve more complex conditions, referring, for instance, to the dimensionality of the search space.

In [7]:
# Exclude the (dataset, flow) pairs whose hyperparameters never vary
exclude_list = []

for flow_id in meta_dataset_checkpoint1.keys():
    for dataset_id in meta_dataset_checkpoint1[flow_id].keys():
        run_parameters=[run["parameter_settings"] for run in  meta_dataset_checkpoint1[flow_id][dataset_id]]
        counter= {}
        for run in run_parameters:
            for param, value in run.items():
                if param in counter.keys():
                    counter[param].append(value)
                else:
                    counter[param] = [value]  # keep the first observed value as well
        only_ones = True
        for key, values in counter.items():
            try:
                only_ones= only_ones and (np.unique(values).shape[0]==1)
            except:
                pass
        if only_ones: 
            exclude_list.append((dataset_id, flow_id))

# Filtering the meta-dataset (dropping the pairs listed in exclude_list)
meta_dataset_checkpoint2 = defaultdict(lambda: defaultdict(list))

for flow_id in meta_dataset_checkpoint1.keys():
    for dataset_id in meta_dataset_checkpoint1[flow_id].keys():
        if (dataset_id, flow_id) not in exclude_list:
            for run in meta_dataset_checkpoint1[flow_id][dataset_id]:
                meta_dataset_checkpoint2[flow_id][dataset_id].append(run)

with open('meta_dataset_checkpoint2.json', 'w') as outfile:
    json.dump(meta_dataset_checkpoint2, outfile)
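
The filtering rule can be checked on toy data. The sketch below is one possible way to express it with sets, assuming that hyperparameter values are hashable (numbers, strings or `None`); the helper name and the toy datasets are hypothetical.

```python
def has_only_constant_hyperparameters(runs):
    """Return True if no hyperparameter takes more than one distinct value."""
    seen = {}
    for run in runs:
        for name, value in run["parameter_settings"].items():
            seen.setdefault(name, set()).add(value)
    return all(len(values) <= 1 for values in seen.values())

# Made-up datasets: in dataset_a nothing varies, in dataset_b "C" varies.
toy = {
    "dataset_a": [{"parameter_settings": {"C": 1.0, "kernel": "rbf"}},
                  {"parameter_settings": {"C": 1.0, "kernel": "rbf"}}],
    "dataset_b": [{"parameter_settings": {"C": 1.0, "kernel": "rbf"}},
                  {"parameter_settings": {"C": 10.0, "kernel": "rbf"}}],
}

excluded = [name for name, runs in toy.items()
            if has_only_constant_hyperparameters(runs)]
print(excluded)  # ['dataset_a']
```

A dataset with a single run is also excluded by this rule, since none of its hyperparameters can vary.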

## Imputation and hyperparameter transformation

The next step is to impute, transform and filter hyperparameters. Before this, we merge all the datasets within the same search space into a "single view", a dataframe that contains the hyperparameters as columns and the runs as rows. This integrated dataframe allows comparisons across datasets, so that the normalization considers the whole range of values.

During imputation, we add an indicator column for each numerical hyperparameter that flags whether its value was missing (or invalid). We also delete any hyperparameter that has only one value across the datasets. Finally, we specify which hyperparameters receive a log-transformation; in this example, **get_hps_to_apply_log** returns an empty list, so none do.

In [8]:
def get_dataframe(set_of_runs):
    df = pd.DataFrame([run["parameter_settings"] for run in set_of_runs])
    df["accuracy"] = [run["accuracy"] for run in set_of_runs]
    df["dataset"] = [run["task_id"] for run in set_of_runs] 
    return df

def get_hps_to_keep(df):
    
    nunique = df.nunique()
    hps_to_keep = list(df.columns[nunique>1])
    return hps_to_keep

def get_hps_to_apply_log(df):
    
    return []

def normalize(x):
    
    return (x-min(x))/(max(x)-min(x))


def impute_dataframe(df):
    df = df.replace("None", np.nan)
    df_ = pd.DataFrame()
    
    columns = list(df.columns)
    columns.remove("accuracy")
    columns.remove("dataset")
    for column in columns:

        if(df[column].dtype == "float64" ):
            df_na = df[[column]].isna().astype(float)
            df_na.columns += ".na"
            df_ = pd.concat((df_, df[[column]].fillna(0.0), df_na), axis=1)
        elif df[column].dtype == "int" :
            df_na = df[[column]].isna().astype(int)
            df_na.columns += ".na"
            df_ = pd.concat((df_, df[[column]].fillna(0.0), df_na), axis=1)            
        else:
            df[[column]] = df[[column]].fillna("INVALID")
            df_ = pd.concat((df_, df[[column]]), axis=1)

    return df_

hps_to_keep = dict()
hps_to_apply_log = dict()
single_views = dict()

for flow_id in meta_dataset_checkpoint2.keys():
    single_view_hp = pd.DataFrame()
    for dataset_id in meta_dataset_checkpoint2[flow_id].keys():
        temp_df = get_dataframe(meta_dataset_checkpoint2[flow_id][dataset_id])
        # Reset the index so that row labels stay unique across datasets
        single_view_hp = pd.concat((single_view_hp, temp_df), ignore_index=True)
        
    single_views[flow_id] = impute_dataframe(single_view_hp)
    hps_to_keep[flow_id] = get_hps_to_keep(single_views[flow_id] )
    hps_to_apply_log[flow_id] = get_hps_to_apply_log(single_views[flow_id] )

    for hp in single_views[flow_id].columns:
        if hp in hps_to_keep[flow_id]:
            if hp in hps_to_apply_log[flow_id]:
                single_views[flow_id][hp] = np.log(single_views[flow_id][hp])
            single_views[flow_id][hp] = normalize(single_views[flow_id][hp])
        else:
            del single_views[flow_id][hp]
    
    single_views[flow_id]["accuracy"] = single_view_hp["accuracy"]
    single_views[flow_id]["dataset"] = single_view_hp["dataset"]

In [9]:
single_views

{423:             I  accuracy  dataset
 0    0.000000  0.993318        1
 1    0.066667  0.993318        1
 2    0.200000  0.992205        1
 3    0.000000  0.893096        2
 4    0.066667  0.898664        2
 ..        ...       ...      ...
 475  0.466667  0.678049     1773
 476  0.000000  0.824324     1774
 477  0.066667  0.821622     1774
 478  0.200000  0.818919     1774
 479  0.466667  0.820270     1774
 
 [480 rows x 3 columns]}
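
The three transformations (imputation with an indicator, log-transformation and min-max normalization) can be illustrated on plain Python lists. This is a simplified sketch rather than the dataframe code above; the helper names and the two example hyperparameters are assumptions made for illustration.

```python
import math

def impute_with_indicator(values, fill=0.0):
    """Replace None by `fill` and return (imputed values, missing flags)."""
    flags = [1.0 if v is None else 0.0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, flags

def min_max(values):
    """Scale values linearly to [0, 1]; assumes at least two distinct values."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Made-up hyperparameters: one spans several orders of magnitude,
# the other has a missing value.
learning_rate = [0.001, 0.01, 0.1, 1.0]
depth = [2, None, 8, 4]

# The log-transformation spreads the learning rates evenly before scaling.
log_scaled = min_max([math.log(v) for v in learning_rate])
depth_imputed, depth_na = impute_with_indicator(depth)
depth_scaled = min_max(depth_imputed)

print([round(v, 3) for v in log_scaled])
print(depth_scaled, depth_na)
```

One caveat worth keeping in mind: if a hyperparameter is imputed with 0.0 and then log-transformed, the imputed entries become negative infinity, so the two steps should not be combined on the same column without choosing a different fill value.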

## Creating the meta-dataset for the new search space

Once we have imputed and transformed the hyperparameters, we create the final JSON object. In Python, this corresponds to a nested dictionary, where the first-level key is the flow ID and the second-level key is the dataset ID. A quick example of what this should look like:

```python
meta_dataset = { "flow_id_1" : { "dataset_id_1": {"X": [[1,1], [0,2]],
                                                  "y": [[0.9], [0.1]]},
                                 "dataset_id_2": ... },
                 "flow_id_2" : ...
                }
```

In [10]:
meta_dataset_new_space = {}
output_path = "../data/"

for flow_id in single_views.keys():
    
    df = single_views[flow_id]
    flow_id = str(flow_id)
    cols = [col for col in df.columns if col not in ("dataset", "accuracy")]
    datasets_list = df["dataset"].unique()
    meta_dataset_new_space [flow_id] = {}
    for dataset in datasets_list:
        df_temp = df[df["dataset"]==dataset]
        dataset = str(dataset)
        if(df_temp.shape[0]>0):
            meta_dataset_new_space[flow_id][dataset] = {"X": df_temp[cols].to_numpy().tolist(),
                                                 "y": df_temp["accuracy"].to_numpy().tolist()}

Now we can inspect and save the created dictionary. We can see that the format matches our previous description.

In [12]:
dataset_id="1"
meta_dataset_new_space[flow_id][dataset_id]

{'X': [[0.0], [0.06666666666666667], [0.2]],
 'y': [0.993318, 0.993318, 0.992205]}

In [13]:
with open('meta_dataset_new_space.json', 'w') as outfile:
    json.dump(meta_dataset_new_space, outfile)
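
Before using the saved file, it can help to run a quick structural check. The sketch below is a hypothetical validator, not part of HPO-B: it only verifies that every dataset has `X` and `y`, that both have the same number of entries, and that all rows of `X` have the same length. The toy dictionary is made up and contains one deliberate inconsistency.

```python
import json

def validate_meta_dataset(meta_dataset):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for flow_id, datasets in meta_dataset.items():
        for dataset_id, data in datasets.items():
            X, y = data.get("X"), data.get("y")
            if X is None or y is None:
                problems.append(f"{flow_id}/{dataset_id}: missing 'X' or 'y'")
                continue
            if len(X) != len(y):
                problems.append(
                    f"{flow_id}/{dataset_id}: {len(X)} configurations "
                    f"but {len(y)} responses")
            if len({len(row) for row in X}) > 1:
                problems.append(f"{flow_id}/{dataset_id}: rows of 'X' differ in length")
    return problems

# Made-up meta-dataset; dataset "2" has fewer responses than configurations.
toy = {"423": {"1": {"X": [[0.0], [0.2]], "y": [0.99, 0.98]},
               "2": {"X": [[0.0], [0.2]], "y": [0.89]}}}

# All keys are already strings, so a JSON round trip preserves the structure.
round_trip = json.loads(json.dumps(toy))
problems = validate_meta_dataset(round_trip)
print(problems)
```

Keys are converted to strings explicitly in the cell above because JSON object keys are always strings; integer keys would come back as strings after loading the file.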

## Updating HPO-B with the new search space

Once we have created a meta-dataset for new search spaces in the format described above, we can update HPO-B easily. If we use the provided API, we just need to update the **meta_test_data** attribute, which is a dictionary. Notice, however, that this procedure may differ depending on the goal.

In [14]:
import sys
import os
sys.path.append("..")
from hpob_handler import HPOBHandler

hpob_hdlr = HPOBHandler(root_dir="../hpob-data/", mode="v1")

Loading HPO-B handler
Loading data...


In [15]:
hpob_hdlr.meta_test_data.update(meta_dataset_new_space)
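
Since `dict.update` replaces the value stored under an existing key instead of merging it, it may be worth checking for key collisions first. The following sketch uses plain dictionaries as stand-ins for `hpob_hdlr.meta_test_data` and `meta_dataset_new_space`; the helper name and the toy IDs are hypothetical.

```python
def find_key_collisions(existing, new):
    """Return the first-level keys (search space IDs) present in both dictionaries."""
    return sorted(set(existing) & set(new))

# Toy stand-ins for hpob_hdlr.meta_test_data and meta_dataset_new_space.
existing = {"100": {"3": {"X": [[0.1]], "y": [0.5]}}}
new_space = {"423": {"1": {"X": [[0.0]], "y": [0.99]}}}

collisions = find_key_collisions(existing, new_space)
if not collisions:
    # No shared search space ID, so nothing is overwritten.
    existing.update(new_space)

print(sorted(existing))  # ['100', '423']
```

If a collision is reported, the datasets of that search space would need to be merged explicitly, for example with a nested `update` on the second-level dictionaries.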