[View in Colaboratory](https://colab.research.google.com/github/mlindauer/lab_course_add/blob/master/Run_SMAC_HPO_RF_with_Instances.ipynb)

# Using SMAC to optimize the hyperparameters of a random forest (RF) with instances

* Installation of SMAC
* Definition of function to be optimized (RF on Boston Dataset)
* Definition of RF's configspace
* Definition of SMAC's scenario including instances
* Running SMAC

## Installation of SMAC and its dependencies

In [0]:
!apt-get install swig -y
!pip install Cython
!pip install pyrfr==0.8.0 --no-cache --user
# hack to find pyrfr
import sys
sys.path.insert(0,"./.local/lib/python3.6/site-packages")

!pip install git+https://github.com/automl/SMAC3.git@development
  
import logging
logging.basicConfig(level=logging.INFO)

## Optimize RF RMSE performance on Boston Dataset

In [0]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
import numpy as np

boston = load_boston()

# get k-fold cross validation
kfold = KFold(n_splits=10)
splits = list(kfold.split(boston.data))

def rf_from_cfg(cfg, instance, seed):
    """
        Creates a random forest regressor from sklearn and fits it on the
        training part of one cross-validation fold. This is the function SMAC
        calls and whose return value it minimizes. Chosen values are stored in
        the configuration (cfg).

        Parameters:
        -----------
        cfg: Configuration
            configuration chosen by smac
        instance: str
            id of cv fold ("0" to "9")
        seed: int or RandomState
            used to initialize the rf's random generator

        Returns:
        -----------
        rmse: float
            root mean squared error of the random forest's predictions
            on the test part of the given cv fold
    """
    rfr = RandomForestRegressor(
        n_estimators=cfg["num_trees"],
        criterion=cfg["criterion"],
        min_samples_split=cfg["min_samples_to_split"],
        min_samples_leaf=cfg["min_samples_in_leaf"],
        min_weight_fraction_leaf=cfg["min_weight_frac_leaf"],
        max_features=cfg["max_features"],
        max_leaf_nodes=cfg["max_leaf_nodes"],
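        # the config stores "true"/"false" as strings; convert to a real bool,
        # because any non-empty string (including "false") is truthy
        bootstrap=(cfg["do_bootstrapping"] == "true"),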
        random_state=seed)

    fold = int(instance) # smac only accepts str as "instances"
    train_id, test_id = splits[fold][0], splits[fold][1]
    X_train, y_train = boston.data[train_id], boston.target[train_id]
    X_test, y_test = boston.data[test_id], boston.target[test_id]
    # fit random forest
    rfr.fit(X_train, y_train)
    # predict on test set
    y_pred = rfr.predict(X_test)
    
    # return error on test set
    return np.sqrt(mean_squared_error(y_test, y_pred))

## Define Configuration Space

In [0]:
from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter, \
    UniformFloatHyperparameter, UniformIntegerHyperparameter

cs = ConfigurationSpace()

do_bootstrapping = CategoricalHyperparameter(
    "do_bootstrapping", ["true", "false"], default_value="true")
num_trees = UniformIntegerHyperparameter("num_trees", 10, 50, default_value=10)
max_features = UniformIntegerHyperparameter("max_features", 1, boston.data.shape[1], default_value=1)
min_weight_frac_leaf = UniformFloatHyperparameter("min_weight_frac_leaf", 0.0, 0.5, default_value=0.0)
criterion = CategoricalHyperparameter("criterion", ["mse", "mae"], default_value="mse")
min_samples_to_split = UniformIntegerHyperparameter("min_samples_to_split", 2, 20, default_value=2)
min_samples_in_leaf = UniformIntegerHyperparameter("min_samples_in_leaf", 1, 20, default_value=1)
max_leaf_nodes = UniformIntegerHyperparameter("max_leaf_nodes", 10, 1000, default_value=100)

cs.add_hyperparameters([do_bootstrapping,
                        num_trees, min_weight_frac_leaf, 
                        criterion, max_features, min_samples_to_split, 
                        min_samples_in_leaf, max_leaf_nodes])

print(cs)

def_config = cs.get_default_configuration()
print("Default %s" %(def_config))

for fold in range(10):
    # default seed of deterministic algorithms is 0
    rmse = rf_from_cfg(cfg=def_config, instance=str(fold), seed=0)
    print("RMSE of default configuration on %d-th fold: %f" % (fold, rmse))



## Define Scenario

A list of instances can be passed to the scenario,
so that SMAC optimizes the average performance across all instances.
To avoid evaluating every configuration on all instances, SMAC uses an aggressive racing strategy:
a configuration is dropped as soon as it performs worse than the current incumbent (i.e., the best configuration found so far) on the subset of instances evaluated so far.
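
The racing idea can be illustrated with a small toy sketch. It is not SMAC's implementation, whose intensification procedure is more elaborate. The function `race` and the cost dictionaries are hypothetical names introduced only for this illustration, and the costs are made-up numbers rather than real RMSE values.

```python
from statistics import mean


def race(challenger_costs, incumbent_costs, instance_order):
    """Toy racing: evaluate the challenger one instance at a time.

    Returns (survived, n_runs), where n_runs is the number of instances
    the challenger was evaluated on before the race ended.
    """
    seen = []
    for n_runs, inst in enumerate(instance_order, start=1):
        seen.append(inst)
        # compare mean cost on the instances seen so far (lower is better)
        challenger_mean = mean(challenger_costs[i] for i in seen)
        incumbent_mean = mean(incumbent_costs[i] for i in seen)
        if challenger_mean > incumbent_mean:
            return False, n_runs  # dropped early
    return True, len(instance_order)


instances = [str(fold) for fold in range(10)]
incumbent_costs = {inst: 3.0 for inst in instances}

bad = {inst: 4.0 for inst in instances}    # worse everywhere
good = {inst: 2.5 for inst in instances}   # better everywhere
late_loser = dict(good, **{"2": 10.0})     # fine on "0" and "1", bad on "2"

print(race(bad, incumbent_costs, instances))         # (False, 1)
print(race(good, incumbent_costs, instances))        # (True, 10)
print(race(late_loser, incumbent_costs, instances))  # (False, 3)
```

In this toy version a clearly worse configuration costs only a single run, while a configuration has to be evaluated on all ten instances before it can replace the incumbent.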


In [0]:
from smac.scenario.scenario import Scenario

# define cv-folds as instances "0","1",..."9"
# (materialized as a list so the instances can be iterated more than once)
instances = list(map(str, range(0, 10)))

scenario = Scenario({"run_obj": "quality",   # we optimize quality (the alternative is runtime)
                     "wallclock_limit": 60,  # time for running SMAC
                     "cs": cs,               # configuration space
                     "deterministic": "true",
                     "memory_limit": 3072,   # adapt this to reasonable value for your hardware
                     "instances" : instances,
                     "output_dir": ""        # deactivate output
                     })

## Run SMAC

Note that SMAC will evaluate each configuration in at most 10 runs, one per instance:
we defined 10 instances (10-fold CV) and declared the algorithm deterministic, so repeating a run on the same instance would add no new information.
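
The bound of 10 runs follows from counting distinct (instance, seed) pairs. The sketch below is plain Python and does not call SMAC. `distinct_runs` is a hypothetical helper, and the single seed 0 is an assumption chosen to match the default-configuration check earlier in this notebook.

```python
def distinct_runs(instances, seeds):
    """Return the set of distinct (instance, seed) pairs for one configuration."""
    return {(inst, seed) for inst in instances for seed in seeds}


instances = [str(fold) for fold in range(10)]

# deterministic case: a single seed, so there is one distinct run per instance
deterministic_runs = distinct_runs(instances, seeds=[0])
print(len(deterministic_runs))  # 10
```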

In [0]:
from smac.facade.smac_facade import SMAC

smac = SMAC(scenario=scenario, rng=np.random.RandomState(42),
            tae_runner=rf_from_cfg)

incumbent = smac.optimize()

print("Number of evaluated configurations: %d" % len(smac.solver.runhistory.ids_config))

## SMAC on non-deterministic Algorithms

Please note that the number of runs per configuration can be much higher now,
because SMAC will try different instance-seed pairs for each configuration instead of a single run per instance.
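
To see why the number of runs grows, consider how many instance-seed pairs become available once seeds vary. This is a sketch in plain Python, not SMAC's actual seed handling: `sample_instance_seed_pairs`, the number of rounds, and the seed range are all assumptions made for illustration.

```python
import random


def sample_instance_seed_pairs(instances, n_rounds, rng):
    """Draw one fresh seed per instance in each round (hypothetical scheme)."""
    pairs = []
    for _ in range(n_rounds):
        for inst in instances:
            pairs.append((inst, rng.randrange(2**31)))
    return pairs


instances = [str(fold) for fold in range(10)]
pairs = sample_instance_seed_pairs(instances, n_rounds=3, rng=random.Random(42))

# three rounds over ten instances give three times as many runs
# as the deterministic setting, which stops at one run per instance
print(len(pairs))  # 30
```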

In [0]:
from smac.scenario.scenario import Scenario

# define cv-folds as instances "0","1",..."9"
# (materialized as a list so the instances can be iterated more than once)
instances = list(map(str, range(0, 10)))

scenario = Scenario({"run_obj": "quality",   # we optimize quality (the alternative is runtime)
                     "wallclock_limit": 60,  # time for running SMAC
                     "cs": cs,               # configuration space
                     "deterministic": "false", # ! important change compared to above !
                     "memory_limit": 3072,   # adapt this to reasonable value for your hardware
                     "instances" : instances,
                     "output_dir": ""        # deactivate output
                     })

smac = SMAC(scenario=scenario, rng=np.random.RandomState(42),
            tae_runner=rf_from_cfg)

incumbent = smac.optimize()

print("Number of evaluated configurations: %d" % len(smac.solver.runhistory.ids_config))