# Scaling Optuna hyperparameter optimization with Dask

<img src="optuna-logo.jpg"
     width="35%"
     alt="Optuna logo">
<img src="dask-logo.svg"
     width="35%"
     alt="Dask logo">

[Optuna](https://optuna.org/) and [Dask](https://dask.org) are popular Python libraries for hyperparameter optimization and parallel computing, respectively. This notebook walks through a workload which uses [Dask-Optuna](https://jrbourbeau.github.io/dask-optuna/), a library for integrating Dask and Optuna, to optimize an [XGBoost](https://xgboost.readthedocs.io/en/latest/) classification model in parallel across a Dask cluster.

# Example Workload

## Step 1: Define an objective function

Below is a snippet which uses Optuna to optimize several hyperparameters for an XGBoost classifier trained on the [breast cancer dataset](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-dataset).

There is no Dask-specific code here. This is exactly the same code you would write if you were to run Optuna on your local machine.

In [2]:
import numpy as np
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split
import xgboost as xgb

def objective(trial):
    # Load our dataset
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    # Get set of hyperparameters
    param = {
        "verbosity": 0,  # replaces the deprecated "silent" parameter
        "objective": "binary:logistic",
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        "max_depth": trial.suggest_int("max_depth", 1, 9),
        "eta": trial.suggest_float("eta", 1e-8, 1.0, log=True),
        "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True),
        "grow_policy": trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"]),
    }

    # Train XGBoost model and get predictions
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dtest)

    # Compute and return model accuracy
    pred_labels = np.rint(preds)
    accuracy = sklearn.metrics.accuracy_score(y_test, pred_labels)
    return accuracy
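
Several of the hyperparameters above are suggested with `log=True`, which asks Optuna to sample in the log domain so that each order of magnitude between `1e-8` and `1.0` is roughly equally likely. The standard-library sketch below illustrates that idea only; `sample_log_uniform` is a hypothetical helper, not Optuna's implementation, and Optuna's samplers may refine their proposals based on earlier trials.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-8, 1.0, rng) for _ in range(10_000)]

# About half of the draws should fall below the geometric midpoint, 1e-4,
# whereas a plain uniform draw on [1e-8, 1.0] would almost never land there.
fraction_below_midpoint = sum(s < 1e-4 for s in samples) / len(samples)
```

This is why a log scale suits parameters such as `eta` or `lambda`, where the useful values span many orders of magnitude.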

## Step 2: Create a Dask cluster

Here we'll use [Coiled](https://coiled.io) to create a remote Dask cluster on AWS. 

**Note:** Dask-Optuna works with _any_ Dask cluster. We're using Coiled here because it's a convenient way for anyone to spin up a remote Dask cluster. For more information on Coiled, see the [Coiled documentation](https://docs.coiled.io).

In [1]:
%%time

# Use coiled to create a Dask cluster on AWS
import coiled

cluster = coiled.Cluster(
    n_workers=10, 
    software="examples/optuna-xgboost",
)

Creating Cluster. This takes about a minute ...
Checking environment images
Valid environment image found
CPU times: user 1.17 s, sys: 296 ms, total: 1.47 s
Wall time: 1min 26s


<img src="dask-cluster.svg"
     width="75%"
     alt="Dask cluster">

In [3]:
# Connect my local machine to the remote cluster
from dask.distributed import Client

client = Client(cluster)
client.wait_for_workers(10)

client

Client: Scheduler: tls://ec2-18-216-150-48.us-east-2.compute.amazonaws.com:8786, Dashboard: http://ec2-18-216-150-48.us-east-2.compute.amazonaws.com:8787

Cluster: Workers: 10, Cores: 40, Memory: 171.80 GB


☝️ Don't forget to check on the Dask dashboard!
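
Because Dask-Optuna works with any Dask cluster, a local cluster can stand in for the remote one when experimenting on a laptop. The following is a minimal sketch, assuming `dask.distributed` is installed; the worker counts are arbitrary and the dashboard is disabled only to keep the example self-contained.

```python
from dask.distributed import Client, LocalCluster

# Start a small thread-based cluster inside the current process
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,  # skip the dashboard for this quick check
)
client = Client(cluster)
client.wait_for_workers(2)

# Sanity check: run a trivial task on the cluster
result = client.submit(sum, [1, 2, 3]).result()

# Release the local resources when finished
client.close()
cluster.close()
```

For real workloads you would keep the client open and leave the dashboard enabled, as with the Coiled cluster above.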

## Step 3: Run optimization trials in parallel

Dask-Optuna leverages Optuna’s existing distributed optimization capabilities to run optimization trials in parallel on a Dask cluster. It does this by providing a Dask-compatible storage class, `dask_optuna.DaskStorage`, which wraps an Optuna storage class (e.g. Optuna’s in-memory or SQLite storage) and can be used directly by Optuna.

In [4]:
import optuna
import dask_optuna
import joblib

# Create an Optuna study using a Dask-compatible Optuna storage class
storage = dask_optuna.DaskStorage()

study = optuna.create_study(
    direction="maximize",
    storage=storage,
)

# Run 200 optimization trials on our cluster
with joblib.parallel_backend("dask"):
    study.optimize(objective, n_trials=200, n_jobs=-1)

In [5]:
study.best_params

{'booster': 'gbtree',
 'lambda': 2.2958926213388737e-05,
 'alpha': 0.0024442753560806082,
 'max_depth': 7,
 'eta': 0.2593902003279806,
 'gamma': 0.01747764804032886,
 'grow_policy': 'depthwise'}

And with that, you’re able to run distributed hyperparameter optimizations using Dask and Optuna!

# How Dask-Optuna helps

## Reduces setup

Dask-Optuna helps reduce the setup required when using persistent storage (e.g. a SQLite database) to back an Optuna study. It does so by creating a single storage object on the scheduler that is accessible to all workers in a Dask cluster, so the user doesn't need to set up a storage object that is globally accessible across the entire cluster.

For example:

In [None]:
# Wraps Optuna's in-memory storage
storage_1 = dask_optuna.DaskStorage()

# Wraps Optuna's SQLite DB storage
storage_2 = dask_optuna.DaskStorage("sqlite:///example.db")

The underlying Optuna storage object lives on the cluster’s scheduler, and any method call on the `DaskStorage` instance results in the same method being called on the underlying Optuna storage object.
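
The forwarding pattern described above can be sketched in plain Python. Everything here is hypothetical and for illustration only: `InMemoryStore` stands in for an Optuna storage object and `ForwardingStore` for the wrapper. This is not Dask-Optuna's actual code, and the forwarded call is local rather than a network hop to the scheduler.

```python
class InMemoryStore:
    """Hypothetical stand-in for an Optuna storage object."""

    def __init__(self):
        self._trials = []

    def add_trial(self, params, value):
        self._trials.append({"params": params, "value": value})
        return len(self._trials) - 1  # id of the new trial

    def get_all_trials(self):
        return list(self._trials)


class ForwardingStore:
    """Hypothetical wrapper that forwards method calls to a wrapped store."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only reached for attributes not defined on the wrapper itself
        target = getattr(self._wrapped, name)

        def call(*args, **kwargs):
            # A distributed wrapper would send this call to the scheduler;
            # here it is a plain local call.
            return target(*args, **kwargs)

        return call


backend = InMemoryStore()
proxy = ForwardingStore(backend)
trial_id = proxy.add_trial({"max_depth": 7}, 0.97)
```

Callers only ever talk to the wrapper, so a single backing store can serve many clients without each of them configuring its own storage.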

**NOTE**: There are other perfectly valid approaches to parallelizing Optuna. For example, here's a nice example of [running Optuna on Kubernetes](https://github.com/optuna/optuna/tree/master/examples/kubernetes). Each method has its own pros and cons.

## Extends Optuna’s `InMemoryStorage` to multiple processes

Dask-Optuna also extends Optuna’s `InMemoryStorage` class so that it can be used across multiple processes. This is important when using remote workers in a Dask cluster, or in situations where Python’s GIL leads to less-than-ideal parallelization.

# Looking ahead

We're currently working with the Optuna developers to incorporate Dask-Optuna's functionality directly into Optuna.
This will help facilitate better integration between Dask and Optuna 🎉