# Tune Model Hyperparameters with the Data-Driven Library

The datadriven library provides an extensible command-line interface for training and evaluating data-driven simulators and for generating predictions with them. However, you may prefer to train and sweep models inside a notebook. This notebook shows how to do so.

## Set Working Directory and Import Necessary Libraries

In [1]:
cd ..

/Users/jill/bonsai/datadrivenmodel


In [2]:
from hydra.experimental import initialize, compose
from omegaconf import DictConfig, ListConfig, OmegaConf
from model_loader import available_models
from base import plot_parallel_coords
import logging
import matplotlib.pyplot as plt
import numpy as np
from rich import print
from rich.logging import RichHandler
import copy
import pandas as pd
from assessment_metrics_loader import available_metrics

logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()]
)
logger = logging.getLogger("ddm_training")
logger.setLevel(logging.INFO)

## Initialize Configuration

While you can provide every argument manually, it is usually better to load an existing configuration file through `hydra`. This keeps your parameters saved in a file for later use, and you automatically get the logging and model artifacts provided by our `hydra` and `mlflow` workflow.

To override any configuration settings, pass them as a list of `overrides`, as shown below.

In [3]:
initialize(config_path="../conf", job_name="ddm_training")
cfg = compose(config_name="config", overrides=["data=house_energy", "model=xgboost"])

Missing @package directive model/xgboost.yaml in file:///Users/jill/bonsai/datadrivenmodel/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
Missing @package directive data/house_energy.yaml in file:///Users/jill/bonsai/datadrivenmodel/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
Missing @package directive simulator/house_energy_simparam.yaml in file:///Users/jill/bonsai/datadrivenmodel/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
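
Besides selecting config groups (such as `data=house_energy`), Hydra overrides can also set individual values with dotted keys, for example `data.test_perc=0.25`. The standard-library sketch below only illustrates how a dotted override maps onto a nested configuration. It is not Hydra's implementation: the helper `apply_override` is hypothetical, and the starting values are made up for illustration.

```python
import ast
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    key, _, raw_value = override.partition("=")
    try:
        # Interpret numbers, booleans, lists, etc.; fall back to a plain string.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value

    result = copy.deepcopy(config)
    node = result
    *parents, leaf = key.split(".")
    for part in parents:
        # Walk down the nested dicts, creating levels that do not exist yet.
        node = node.setdefault(part, {})
    node[leaf] = value
    return result


# Illustrative values only; the real ones come from the YAML files in `conf`.
base = {"data": {"test_perc": 0.15}, "model": {"sweep": {"num_trials": 10}}}

updated = apply_override(base, "data.test_perc=0.25")
updated = apply_override(updated, "model.sweep.num_trials=20")
print(updated)
```

With Hydra itself, the equivalent is to append these strings to the `overrides` list passed to `compose`.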


In [4]:
print(OmegaConf.to_yaml(cfg))

In [5]:
# Extract features from yaml file
input_cols = cfg['data']['inputs']
output_cols = cfg['data']['outputs']
augmented_cols = cfg['data']['augmented_cols']
dataset_path = cfg['data']['path']
iteration_order = cfg['data']['iteration_order']
episode_col = cfg['data']['episode_col']
iteration_col = cfg['data']['iteration_col']
max_rows = cfg['data']['max_rows']
test_perc = cfg['data']['test_perc']

## Model Trainer

To make it easy to sweep over models later, we create a simple `train_models` function here:

In [6]:
def train_models(config=cfg):

    logger.info(f'Model type: {available_models[config["model"]["name"]]}')
    Model = available_models[config["model"]["name"]]
    model = Model()
    logger.info(f"Building model with parameters: {config}")
    model.build_model(
        **config["model"]["build_params"]
    )
    logger.info(f"Loading data from {dataset_path}")
    X, y = model.load_csv(
        input_cols=input_cols,
        output_cols=output_cols,
        augm_cols=list(augmented_cols),
        dataset_path=dataset_path,
        iteration_order=iteration_order,
        episode_col=episode_col,
        iteration_col=iteration_col,
        max_rows=max_rows,
    )
    # Expose the splits as globals so later cells (sweep, final scoring) can reuse them
    global X_train, y_train, episode_ids_train, X_test, y_test, episode_ids_test
    # Positional split: the first (1 - test_perc) fraction of rows is used for training
    train_id_end = int(np.floor(X.shape[0] * (1 - test_perc)))
    X_train, y_train, episode_ids_train = (
        X[:train_id_end],
        y[:train_id_end],
        model.episode_ids[:train_id_end],
    )
    X_test, y_test, episode_ids_test = (
        X[train_id_end:],
        y[train_id_end:],
        model.episode_ids[train_id_end:],
    )

    logger.info("Fitting model...")
    model.fit(X_train, y_train)
    logger.info("Model trained!")
    y_pred = model.predict(X_test)
    r2_score = available_metrics["r2_score"]
    logger.info(f"R^2 score is {r2_score(y_test,y_pred)} for the test set.")

    return model
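
The split inside `train_models` is positional: the first `1 - test_perc` fraction of rows is used for training and the remainder for testing, with no shuffling. A minimal sketch of the same arithmetic on a toy list (the values and the helper name `split_rows` are illustrative):

```python
import math


def split_rows(rows, test_perc):
    """Split `rows` by position: leading rows train, trailing rows test."""
    train_id_end = int(math.floor(len(rows) * (1 - test_perc)))
    return rows[:train_id_end], rows[train_id_end:]


rows = list(range(10))
train, test = split_rows(rows, test_perc=0.25)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
```

Because the cut is purely positional, rows from the same episode can land on both sides of the boundary; whether that matters depends on how the data is ordered.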

In [7]:
model = train_models(cfg)

  exec(code_obj, self.user_global_ns, self.user_ns)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


## Hyperparameter Sweeping

The `datadrivenmodel` library has a built-in solution for hyperparameter sweeping and tuning, configured through the `model.sweep` section of the config. Provide the limits of the variables you want to sweep over, and the `sweep` method will parallelize the sweep across the available cores and select the best parameters according to your `scoring_func`.

### Configuration Parameters

You can select the search algorithm you'd like to use: `bayesian` runs Bayesian optimization (using scikit-optimize); `hyperopt` runs [Tree-structured Parzen Estimators](https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf) with the `hyperopt` package; `bohb` uses Bayesian optimization with HyperBand; and `optuna` also runs Tree-structured Parzen Estimators, but with the [`optuna`](https://optuna.readthedocs.io/en/stable/) package.
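
Whichever algorithm is selected, the overall pattern is the same: propose candidate parameters, score each one, and keep the best. The toy sketch below uses plain random search, which is simpler than any of the algorithms listed above and is not how `model.sweep` is implemented. The scoring function and the search space are made up for illustration.

```python
import random


def toy_score(max_depth, eta):
    """Stand-in scoring function; a real sweep would cross-validate a model."""
    return -((max_depth - 6) ** 2) - 10 * (eta - 0.3) ** 2


# Hypothetical search space, loosely modeled on XGBoost-style parameters.
search_space = {"max_depth": [1, 3, 5, 10], "eta": [0.1, 0.3, 0.5]}

rng = random.Random(0)  # fixed seed so the run is repeatable
num_trials = 8

trials = []
for _ in range(num_trials):
    # Draw one value per parameter, then record the score with the candidate.
    candidate = {name: rng.choice(values) for name, values in search_space.items()}
    trials.append((toy_score(**candidate), candidate))

best_score, best_params = max(trials, key=lambda trial: trial[0])
print(best_params, best_score)
```

The smarter algorithms differ mainly in how they propose the next candidate, using earlier scores instead of sampling blindly.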

In [8]:
print(OmegaConf.to_yaml(cfg["model"]["sweep"]))

In [9]:
params = OmegaConf.to_container(cfg["model"]["sweep"]["params"])
logger.info(f"Sweeping with parameters: {params}")

# Perform the sweep
sweep_df = model.sweep(
    params=params,
    X=X_train,
    y=y_train,
    search_algorithm=cfg["model"]["sweep"]["search_algorithm"],
    num_trials=cfg["model"]["sweep"]["num_trials"],
    scoring_func=cfg["model"]["sweep"]["scoring_func"],
    results_csv_path=cfg["model"]["sweep"]["results_csv_path"],
)

In [10]:
# Print the final score for the held out test set
y_pred = model.predict(X_test)
r2_score = available_metrics["r2_score"]
logger.info(f"R^2 score is {r2_score(y_test,y_pred)} for the test set.")
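
For reference, R^2 (the coefficient of determination) compares the squared prediction error with the spread of the targets around their mean: R^2 = 1 - SS_res / SS_tot. The single-output sketch below assumes `available_metrics["r2_score"]` follows this standard definition, which the notebook does not confirm. The function name `r2` and the toy values are illustrative.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for a single output column."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                # perfect predictions score 1.0
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean scores 0.0
```

A score close to 1 therefore means the model explains nearly all of the variance in the held-out targets, and a negative score means it does worse than predicting the mean.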

In [11]:
# Print some of the results from the sweep
sweep_df.head()

Unnamed: 0,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,time_total_s,training_iteration,param_estimator__max_depth,param_estimator__gamma,param_estimator__subsample,param_estimator__eta
0,"{'estimator__max_depth': 5, 'estimator__gamma'...",0.99915,0.999252,0.99931,0.999304,0.999222,0.999248,5.9e-05,6,22.460661,1,5,0.5,1.0,0.3
1,"{'estimator__max_depth': 10, 'estimator__gamma...",0.999115,0.999268,0.999245,0.999324,0.999322,0.999255,7.7e-05,4,34.877752,1,10,0.5,0.5,0.5
2,"{'estimator__max_depth': 1, 'estimator__gamma'...",0.992494,0.993475,0.992761,0.993419,0.993241,0.993078,0.000385,14,6.184767,1,1,1.0,1.0,0.5
3,"{'estimator__max_depth': 5, 'estimator__gamma'...",0.998728,0.998939,0.998942,0.998929,0.998957,0.998899,8.6e-05,11,24.567032,1,5,1.0,0.5,0.3
4,"{'estimator__max_depth': 3, 'estimator__gamma'...",0.994716,0.995125,0.994683,0.994927,0.994409,0.994772,0.000242,13,15.76656,1,3,5.0,0.5,0.3
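
The `split*_test_score`, `mean_test_score`, and `std_test_score` columns resemble scikit-learn's `cv_results_` layout, in which the mean and the population standard deviation are taken across the cross-validation splits. Assuming that convention applies here, the summary columns can be reproduced from the per-split scores:

```python
import statistics

# Per-split scores copied from rows 0 and 2 of the table above.
split_scores = {
    0: [0.99915, 0.999252, 0.99931, 0.999304, 0.999222],
    2: [0.992494, 0.993475, 0.992761, 0.993419, 0.993241],
}

summary = {}
for row, scores in split_scores.items():
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population standard deviation (ddof=0)
    summary[row] = (mean, std)
    print(f"row {row}: mean={mean:.6f} std={std:.6f}")
```

For these two rows the results agree with the `mean_test_score` and `std_test_score` columns to the precision shown in the table.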


### Visualizing Hyperparameter Results

In [None]:
plot_parallel_coords(sweep_df)

### Reading Saved Runs from CSV

Runs are automatically saved to a CSV file in the outputs directory (at the `results_csv_path` passed to `sweep`), so you can reload them later:

In [None]:
sweep_df2 = pd.read_csv("xgboost_gridsearch/search_results.csv")

In [None]:
plot_parallel_coords(sweep_df2)
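
Once the results are back in memory, you will usually want the best-ranked trial. The sketch below does this with the standard-library `csv` module on a few columns copied from the rows shown earlier, so it runs without the saved file. With pandas, sorting `sweep_df2` by `rank_test_score` would achieve the same. It assumes that a lower `rank_test_score` means a better trial, as in scikit-learn.

```python
import csv
import io

# A subset of columns from the five rows displayed by `sweep_df.head()`.
sample_csv = """rank_test_score,mean_test_score,param_estimator__max_depth,param_estimator__eta
6,0.999248,5,0.3
4,0.999255,10,0.5
14,0.993078,1,0.5
11,0.998899,5,0.3
13,0.994772,3,0.3
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# csv yields strings, so convert the rank before comparing.
best = min(rows, key=lambda row: int(row["rank_test_score"]))
print(best["param_estimator__max_depth"], best["mean_test_score"])
```

Among these five rows the best trial has rank 4; the overall winner (rank 1) is further down the full results file.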