# Hyperparameter Optimization In Ludwig



This notebook demonstrates the hyperparameter tuning capabilities of Ludwig. It covers the following steps:
* Prepare the training data
* Programmatically create a Ludwig model definition dictionary from the training dataframe
* Set up the parameter space for hyperparameter optimization
* Perform three hyperparameter optimization runs
  * Parallel workers using random search strategy
  * Serial processing using random search strategy
  * Parallel workers using grid search strategy (optional; its cell must be enabled manually)
* Convert the results returned from hyperparameter optimization to a dataframe

## Import required libraries

In [1]:
import warnings
warnings.simplefilter('ignore')

import logging
import shutil
import tempfile
import datetime

import pandas as pd
import numpy as np

from ludwig.api import LudwigModel
from ludwig.utils.data_utils import load_json
from ludwig.utils.defaults import merge_with_defaults, ACCURACY
from ludwig.utils.tf_utils import get_available_gpus_cuda_string
from ludwig.visualize import learning_curves
from ludwig.hyperopt.execution import get_build_hyperopt_executor
from ludwig.hyperopt.sampling import (get_build_hyperopt_sampler)
from ludwig.hyperopt.utils import update_hyperopt_params_with_defaults
from ludwig.visualize import hyperopt_results_to_dataframe, hyperopt_hiplot_cli, hyperopt_report_cli

from sklearn.model_selection import train_test_split

## Retrieve data for training

In [2]:
train_df = pd.read_csv('./data/winequalityN.csv')
train_df.shape

(6497, 13)

## Standardize column names by replacing spaces (" ") with underscores ("_")

In [3]:
# replace spaces in column names with underscores
train_df.columns = [col.replace(' ', '_') for col in train_df.columns]


## Data Set Overview

In [4]:
train_df.dtypes

type                     object
fixed_acidity           float64
volatile_acidity        float64
citric_acid             float64
residual_sugar          float64
chlorides               float64
free_sulfur_dioxide     float64
total_sulfur_dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

## Create training and test data sets

In [5]:
train_df['quality'].value_counts().sort_index()

3      30
4     216
5    2138
6    2836
7    1079
8     193
9       5
Name: quality, dtype: int64
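
The counts above show that the target is heavily imbalanced: qualities 5 and 6 dominate, while 3 and 9 are rare. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
quality_counts = {3: 30, 4: 216, 5: 2138, 6: 2836, 7: 1079, 8: 193, 9: 5}

total = sum(quality_counts.values())

# Share of each class as a fraction of all records.
class_share = {label: count / total for label, count in quality_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(quality_counts, key=quality_counts.get)
majority_baseline = class_share[majority_label]

print(f"total records: {total}")
print(f"majority class: {majority_label} ({majority_baseline:.1%})")
```

The majority-class share (roughly 44%) is a useful floor to keep in mind when judging any accuracy figure reported for this target.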

In [6]:
# isolate the predictor variables only
predictor_vars = list(set(train_df.columns) - set(['quality']))

# extract categorical variables
categorical_vars = []
for p in predictor_vars:
    if train_df[p].dtype == 'object':
        categorical_vars.append(p)
        
print("categorical variables:", categorical_vars,'\n')

# get numerical variables
numerical_vars = list(set(predictor_vars) - set(categorical_vars))

print("numerical variables:", numerical_vars,"\n")

categorical variables: ['type'] 

numerical variables: ['alcohol', 'pH', 'sulphates', 'free_sulfur_dioxide', 'chlorides', 'citric_acid', 'volatile_acidity', 'fixed_acidity', 'density', 'residual_sugar', 'total_sulfur_dioxide'] 



In [7]:
train_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 6487.0 | 7.216579 | 1.29675 | 3.8 | 6.4 | 7.0 | 7.7 | 15.9 |
| volatile_acidity | 6489.0 | 0.339691 | 0.164649 | 0.08 | 0.23 | 0.29 | 0.4 | 1.58 |
| citric_acid | 6494.0 | 0.318722 | 0.145265 | 0.0 | 0.25 | 0.31 | 0.39 | 1.66 |
| residual_sugar | 6495.0 | 5.444326 | 4.758125 | 0.6 | 1.8 | 3.0 | 8.1 | 65.8 |
| chlorides | 6495.0 | 0.056042 | 0.035036 | 0.009 | 0.038 | 0.047 | 0.065 | 0.611 |
| free_sulfur_dioxide | 6497.0 | 30.525319 | 17.7494 | 1.0 | 17.0 | 29.0 | 41.0 | 289.0 |
| total_sulfur_dioxide | 6497.0 | 115.744574 | 56.521855 | 6.0 | 77.0 | 118.0 | 156.0 | 440.0 |
| density | 6497.0 | 0.994697 | 0.002999 | 0.98711 | 0.99234 | 0.99489 | 0.99699 | 1.03898 |
| pH | 6488.0 | 3.218395 | 0.160748 | 2.72 | 3.11 | 3.21 | 3.32 | 4.01 |
| sulphates | 6493.0 | 0.531215 | 0.148814 | 0.22 | 0.43 | 0.51 | 0.6 | 2.0 |
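
The `count` column reports non-null values, so its difference from the 6497 rows of the dataframe gives the number of missing values per column. The sketch below works this out from the figures shown above; the dictionary is copied by hand from the output and the variable names are illustrative.

```python
TOTAL_ROWS = 6497  # first element of train_df.shape shown earlier

# Non-null counts copied from the describe() output above.
non_null_counts = {
    "fixed_acidity": 6487,
    "volatile_acidity": 6489,
    "citric_acid": 6494,
    "residual_sugar": 6495,
    "chlorides": 6495,
    "free_sulfur_dioxide": 6497,
    "total_sulfur_dioxide": 6497,
    "density": 6497,
    "pH": 6488,
    "sulphates": 6493,
}

# Missing values per column = total rows minus non-null values.
missing = {col: TOTAL_ROWS - n for col, n in non_null_counts.items()}

# Keep only the columns that actually have gaps.
columns_with_gaps = {col: n for col, n in missing.items() if n > 0}

for col, n in sorted(columns_with_gaps.items(), key=lambda item: -item[1]):
    print(f"{col}: {n} missing")
```

Several numerical columns have a handful of gaps, which is why a missing-value strategy is needed for the numerical input features.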


In [8]:
for p in categorical_vars:
    print("unique values for",p,"is",train_df[p].nunique())

unique values for type is 2


## Create model definition

In [9]:
# template for model definition
model_definition = {'input_features':[], 'output_features': [], 'training':{}}

# setup input features for categorical variables
for p in categorical_vars:
    a_feature = {'name': p.replace(' ','_'), 'type': 'category', 'representation': 'sparse'}
    model_definition['input_features'].append(a_feature)


# setup input features for numerical variables
for p in numerical_vars:
    a_feature = {'name': p.replace(' ','_'), 'type': 'numerical', 
                'preprocessing': {'missing_value_strategy': 'fill_with_mean', 'normalization': 'zscore'}}
    model_definition['input_features'].append(a_feature)

# set up output variable
model_definition['output_features'].append({'name': 'quality', 'type':'category'})

# set up training
model_definition['training'] = {'epochs': 20}
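
The numerical features above request `fill_with_mean` for missing values and `zscore` normalization. The following sketch illustrates what those two steps mean conceptually, using only the standard library. It is not Ludwig's implementation; the helper names, the sample values, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical pH readings with one missing entry.
ph_values = [3.0, None, 3.2, 3.4]

filled = fill_with_mean(ph_values)   # the gap becomes the mean of the rest
normalized = zscore(filled)          # centered on 0 with unit spread

print(filled)
print(normalized)
```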

In [10]:
# View the model definition
print("model definition:")
model_definition

model definition:


{'input_features': [{'name': 'type',
   'type': 'category',
   'representation': 'sparse'},
  {'name': 'alcohol',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'pH',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'sulphates',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'free_sulfur_dioxide',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'chlorides',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'citric_acid',
   'type': 'numerical',
   'preprocessing': {'missing_value_strategy': 'fill_with_mean',
    'normalization': 'zscore'}},
  {'name': 'volatile_acidity'

## Define hyperparameter search space

In [11]:
SEED=13

HYPEROPT_CONFIG = {
    "parameters": {
        "training.learning_rate": {
            "type": "float",
            "low": 0.0001,
            "high": 0.01,
            "space": "log",
            "steps": 3,
        },
        "training.batch_size": {
            "type": "int",
            "low": 32,
            "high": 256,
            "space": "log",
            "steps": 5,
            "base" : 2
        },
        "quality.fc_size": {
            "type": "int",
            "low": 32,
            "high": 256,
            "steps": 5
        },
        "quality.num_fc_layers": {
            'type': 'int',
            'low': 1,
            'high': 5,
            'space': 'linear',
            'steps': 4
        }
    },
    "goal": "minimize",
    'output_feature': "quality",
    'validation_metrics': 'loss'
}
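
To make the `low`, `high`, `space`, `steps` and `base` settings more concrete, here is a small standard-library sketch of how evenly spaced grid points can be derived in linear and log space. It illustrates the idea only: `grid_values` is a hypothetical helper, and Ludwig's samplers may space or round values differently. The total at the end assumes that a grid search crosses every parameter's steps with all the others.

```python
import math


def grid_values(low, high, steps, space="linear", base=10, as_int=False):
    """Return `steps` evenly spaced points between low and high.

    In "log" space the points are evenly spaced in the exponent, so each
    value is a constant multiple of the previous one.
    """
    if steps < 2:
        return [low]
    if space == "log":
        lo, hi = math.log(low, base), math.log(high, base)
        points = [base ** (lo + i * (hi - lo) / (steps - 1)) for i in range(steps)]
    else:
        points = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return [round(p) for p in points] if as_int else points


# Ranges copied from HYPEROPT_CONFIG above.
learning_rates = grid_values(0.0001, 0.01, steps=3, space="log")
batch_sizes = grid_values(32, 256, steps=5, space="log", base=2, as_int=True)
fc_sizes = grid_values(32, 256, steps=5, as_int=True)

# Number of combinations if every parameter's grid is crossed with the others.
steps_per_parameter = [3, 5, 5, 4]
grid_size = math.prod(steps_per_parameter)

print(learning_rates, batch_sizes, fc_sizes, grid_size)
```

Under that assumption the grid contains 300 combinations, compared with the 10 samples drawn by each random search below. Random search draws values from the ranges rather than from these grid points.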

## Helper functions to run Hyperopt

In [12]:
# run one hyperparameter optimization experiment with the given sampler and executor
def run_hyperopt_executor(sampler, executor, model_definition,
                          dataset,
                          output_directory='results'):

    # update model definition with remaining defaults
    model_definition = merge_with_defaults(model_definition)

    # get copy of hyperparameter configuration parameters to optimize
    hyperopt_config = HYPEROPT_CONFIG.copy()

    # update with remaining defaults
    update_hyperopt_params_with_defaults(hyperopt_config)

    # Extract relevant parameters
    parameters = hyperopt_config["parameters"]
    split = hyperopt_config["split"]
    output_feature = hyperopt_config["output_feature"]
    metric = hyperopt_config["metric"]
    goal = hyperopt_config["goal"]

    # setup sampler
    hyperopt_sampler = get_build_hyperopt_sampler(
        sampler["type"])(goal, parameters, **sampler)

    # setup executor
    hyperopt_executor = get_build_hyperopt_executor(executor["type"])(
        hyperopt_sampler, output_feature, metric, split, **executor)

    # run hyperparameter executor
    hyperopt_results = hyperopt_executor.execute(
        model_definition,
        dataset=dataset,
        gpus=get_available_gpus_cuda_string(),
        output_directory=output_directory)

    return hyperopt_results

# flatten a single hyperopt result into one row (parameters plus metric score)
def extract_row_data(hyperopt_result):
    # copy the parameters so the original result dictionary is not modified
    row = dict(hyperopt_result['parameters'])
    row['metric_score'] = hyperopt_result['metric_score']
    return row
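
As a usage illustration, the sketch below flattens a list of results into rows that could be passed to a dataframe constructor. It assumes each result is a dictionary with `parameters` and `metric_score` keys, as `extract_row_data` implies. The helper `results_to_rows` and the sample values are hypothetical.

```python
def results_to_rows(hyperopt_results):
    """Flatten each result into one dict of parameters plus its metric score."""
    rows = []
    for result in hyperopt_results:
        row = dict(result["parameters"])  # copy so the input is not mutated
        row["metric_score"] = result["metric_score"]
        rows.append(row)
    return rows


# Hypothetical results in the shape assumed above.
example_results = [
    {
        "parameters": {"training.learning_rate": 0.001, "quality.fc_size": 64},
        "metric_score": 1.05,
    },
    {
        "parameters": {"training.learning_rate": 0.005, "quality.fc_size": 128},
        "metric_score": 0.98,
    },
]

rows = results_to_rows(example_results)
for row in rows:
    print(row)
```

Copying the `parameters` dictionary keeps the original results unchanged, which matters if they are reused later, for example by `hyperopt_results_to_dataframe`.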

## Run hyperparameter optimization

In [13]:
# clean out old results
shutil.rmtree('./results_random_parallel', ignore_errors=True)
shutil.rmtree('./results_random_serial', ignore_errors=True)
shutil.rmtree('./results_grid_parallel', ignore_errors=True)
shutil.rmtree('./visualizations', ignore_errors=True)

#### Random Search with 4 parallel executors

In [14]:
%%time
print("starting:", datetime.datetime.now())
random_parallel_results = run_hyperopt_executor(
    {'type': 'random', 'num_samples': 10},  # sampler
    {'type': 'parallel', 'num_workers': 4},  # executor
    model_definition,
    train_df.sample(4000, random_state=42),  # limit number of records for demonstration purposes
    output_directory='results_random_parallel'  # location to place results
)

starting: 2020-09-21 04:22:02.370269
CPU times: user 232 ms, sys: 79.4 ms, total: 311 ms
Wall time: 1min 38s


#### Random Search with serial executor

In [15]:
%%time
print("starting:", datetime.datetime.now())
random_serial_results = run_hyperopt_executor(
    {'type': 'random', 'num_samples': 10},  # sampler
    {'type': 'serial'},  # executor
    model_definition,
    train_df.sample(4000, random_state=42),  # limit number of records for demonstration purposes
    output_directory='results_random_serial'
)

starting: 2020-09-21 04:23:41.060151
CPU times: user 1min 19s, sys: 15.4 s, total: 1min 34s
Wall time: 1min 25s
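
The two wall-clock times can be compared directly. The sketch below parses the `%%time` strings shown above and computes their ratio; `wall_time_seconds` is a hypothetical helper written for this illustration.

```python
import re


def wall_time_seconds(text):
    """Convert a wall-time string such as '1min 38s' into seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+(?:\.\d+)?)\s*s\b", text)
    total = 0.0
    if minutes:
        total += 60 * int(minutes.group(1))
    if seconds:
        total += float(seconds.group(1))
    return total


# Wall times copied from the two cell outputs above.
parallel_seconds = wall_time_seconds("1min 38s")
serial_seconds = wall_time_seconds("1min 25s")

# A ratio above 1 means the parallel run took longer than the serial one.
ratio = parallel_seconds / serial_seconds
print(f"parallel: {parallel_seconds:.0f}s, serial: {serial_seconds:.0f}s, ratio: {ratio:.2f}")
```

In these two runs the parallel executor took 98 seconds against 85 seconds for the serial one, so four workers gave no speedup for only 10 samples. A single timing per configuration is not a benchmark, so the comparison should be read as indicative only.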


#### Grid Search with 4 parallel executors (takes about 35 minutes)
To run the next cell, change its type from `Raw NBConvert` to `Code`.

### Note:
`random_parallel_results`, `random_serial_results` and `grid_parallel_results` are lists (`grid_parallel_results` exists only if the grid search cell has been run). The first element of each list holds the best metric score together with the parameters that produced it.
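
When the ordering should not be taken for granted, the best trial can also be selected explicitly. The sketch below assumes each result is a dictionary with a `metric_score` key, as in `extract_row_data`. `best_trial` and the sample values are hypothetical.

```python
def best_trial(results, goal="minimize"):
    """Return the result with the best metric score for the given goal."""
    pick = min if goal == "minimize" else max
    return pick(results, key=lambda result: result["metric_score"])


# Hypothetical results, already sorted best-first for a "minimize" goal.
example_results = [
    {"parameters": {"training.batch_size": 127}, "metric_score": 0.95},
    {"parameters": {"training.batch_size": 191}, "metric_score": 1.00},
    {"parameters": {"training.batch_size": 33}, "metric_score": 1.06},
]

best = best_trial(example_results, goal="minimize")

# For a best-first list, the explicit selection agrees with the first element.
first_is_best = best is example_results[0]
print(best["parameters"], best["metric_score"], first_is_best)
```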

## Convert hyperparameter optimization results to dataframe

#### Results For Random Search with 4 parallel executors

In [16]:
df1 = hyperopt_results_to_dataframe(
    random_parallel_results,
    HYPEROPT_CONFIG['parameters'],
    HYPEROPT_CONFIG['validation_metrics']
)
df1

| | loss | quality.fc_size | quality.num_fc_layers | training.batch_size | training.learning_rate |
|---|---|---|---|---|---|
| 0 | 0.953846 | 119 | 4 | 127 | 0.001021 |
| 1 | 0.998579 | 90 | 1 | 191 | 0.000575 |
| 2 | 1.056674 | 211 | 3 | 198 | 0.001204 |
| 3 | 1.060105 | 176 | 3 | 33 | 0.000652 |
| 4 | 1.074196 | 136 | 5 | 54 | 0.000867 |
| 5 | 1.077958 | 49 | 4 | 37 | 0.00095 |
| 6 | 1.087947 | 155 | 3 | 189 | 0.002448 |
| 7 | 1.08905 | 195 | 1 | 253 | 0.003234 |
| 8 | 1.106168 | 241 | 5 | 93 | 0.000726 |
| 9 | 1.119513 | 33 | 3 | 35 | 0.00012 |


#### Results for Random Search with serial executor

In [17]:
df2 = hyperopt_results_to_dataframe(
    random_serial_results,
    HYPEROPT_CONFIG['parameters'],
    HYPEROPT_CONFIG['validation_metrics']
)
df2

| | loss | quality.fc_size | quality.num_fc_layers | training.batch_size | training.learning_rate |
|---|---|---|---|---|---|
| 0 | 0.846123 | 87 | 3 | 127 | 0.004581 |
| 1 | 0.974156 | 42 | 2 | 95 | 0.003879 |
| 2 | 1.043028 | 215 | 5 | 77 | 0.001727 |
| 3 | 1.048163 | 87 | 5 | 44 | 0.007121 |
| 4 | 1.051152 | 93 | 3 | 130 | 0.00545 |
| 5 | 1.052577 | 123 | 2 | 54 | 0.008912 |
| 6 | 1.095933 | 179 | 5 | 117 | 0.000217 |
| 7 | 1.109733 | 43 | 5 | 61 | 0.000232 |
| 8 | 1.11034 | 56 | 2 | 72 | 0.000224 |
| 9 | 1.204301 | 73 | 3 | 125 | 0.000111 |


#### Results for Grid Search with 4 parallel executors
To run the next cell, change its type from `Raw NBConvert` to `Code`.

## Example Hyperopt Visualizations

In [19]:
hyperopt_report_cli(
    'results_random_parallel/test_statistics.json',
    output_directory='./visualizations'
)

KeyError: 'hyperopt_config'
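
The `KeyError` indicates that the JSON file passed to `hyperopt_report_cli` has no top-level `hyperopt_config` entry. A small standard-library check like the one below can confirm which expected keys a file lacks before the report function is called. `missing_top_level_keys` is a hypothetical helper. The only required key it assumes is the one named in the error message, and the demo file stands in for a real statistics file.

```python
import json
import os
import tempfile


def missing_top_level_keys(json_path, required=("hyperopt_config",)):
    """Return the required top-level keys that are absent from a JSON file."""
    with open(json_path) as f:
        content = json.load(f)
    return [key for key in required if key not in content]


# Demo with a throwaway file standing in for a statistics file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stats.json")
    with open(demo_path, "w") as f:
        json.dump({"quality": {"loss": 0.95}}, f)
    missing_keys = missing_top_level_keys(demo_path)

print(missing_keys)
```

A non-empty result means the file is not in the shape the report function reads, so a different statistics file is needed.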