## Distributed Hyperopt and automated MLflow tracking

[Hyperopt](https://github.com/hyperopt/hyperopt) is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the `SparkTrials` class for distributed tuning.  

This notebook illustrates how to scale up hyperparameter tuning for a single-machine Python ML algorithm and track the results using MLflow. In part 1, you create a single-machine Hyperopt workflow. In part 2, you learn to use the `SparkTrials` class to distribute the workflow calculations across the Spark cluster.

## Import required packages and load dataset

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

# `mlflow` is preinstalled on Databricks Runtime for Machine Learning. In other environments, install it before running this import.
import mlflow

In [2]:
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

## Part 1. Single-machine Hyperopt workflow

Here are the steps in a Hyperopt workflow:  
1. Define a function to minimize.  
2. Define a search space over hyperparameters.  
3. Select a search algorithm.  
4. Run the tuning algorithm with Hyperopt `fmin()`.

For more information, see the [Hyperopt documentation](https://github.com/hyperopt/hyperopt/wiki/FMin).

In [3]:
def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)

    # Use the cross-validation accuracy to compare the models' performance.
    # cross_val_score fits a fresh clone of clf on each fold, so no separate fit is needed.
    accuracy = cross_val_score(clf, X, y).mean()
    with mlflow.start_run():
        # C is a hyperparameter, so log it as a parameter rather than a metric
        mlflow.log_param("C", C)
        mlflow.log_metric("loss", -accuracy)

    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'C': C, 'status': STATUS_OK}
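
The sign convention in the objective can be checked with a small sketch. The accuracy values below are made-up numbers for illustration only, not results from this notebook; the point is that minimizing the negative accuracy selects the same candidate as maximizing the accuracy.

```python
# Hypothetical cross-validation accuracies for three values of C.
# These numbers are invented for illustration, not measured results.
accuracy_by_c = {0.1: 0.91, 1.0: 0.97, 10.0: 0.95}

# A minimizer sees the negated accuracy as its loss.
loss_by_c = {c: -acc for c, acc in accuracy_by_c.items()}

# The C with the smallest loss is the C with the largest accuracy.
best_by_loss = min(loss_by_c, key=loss_by_c.get)
best_by_accuracy = max(accuracy_by_c, key=accuracy_by_c.get)

print("best C by loss:", best_by_loss)
print("best C by accuracy:", best_by_accuracy)
print("loss at best C:", loss_by_c[best_by_loss])
```

This is also why the best loss reported by a minimizer is a negative number when the objective returns a negated accuracy.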

### Define the search space over hyperparameters

See the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) for details on defining a search space and parameter expressions.

In [4]:
search_space = hp.lognormal('C', 0, 1.0)
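
To get a feel for the values this search space produces: per the Hyperopt documentation, `hp.lognormal(label, mu, sigma)` draws `exp(normal(mu, sigma))`, so the logarithm of `C` is normally distributed and `C` is always positive. The sketch below does not call Hyperopt. It uses the standard library's `random.lognormvariate` as a stand-in with the same `mu = 0` and `sigma = 1.0`, and the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics

# Stand-in for hp.lognormal('C', 0, 1.0): exp of a normal draw with mu=0, sigma=1.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
mu, sigma = 0.0, 1.0
samples = [rng.lognormvariate(mu, sigma) for _ in range(10_000)]

# The median of a lognormal distribution is exp(mu), which is 1.0 here.
median_c = statistics.median(samples)

# Most draws fall between exp(-2 * sigma) and exp(2 * sigma), roughly 0.135 to 7.39.
low, high = math.exp(-2 * sigma), math.exp(2 * sigma)
share_in_band = sum(low <= s <= high for s in samples) / len(samples)

print("smallest draw:", min(samples))
print("median draw:", median_c)
print("share of draws in [exp(-2), exp(2)]:", share_in_band)
```

With these settings the search concentrates on values of `C` around 1 while still occasionally trying much smaller or larger values.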

### Select a search algorithm

The two main choices are:
* `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results
* `hyperopt.rand.suggest`: Random search, a non-adaptive approach that samples over the search space
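
As a rough illustration of what "non-adaptive" means, here is a minimal pure-Python sketch of random search. It is not Hyperopt's implementation: `toy_loss` and `random_search` are hypothetical names, and the toy objective stands in for an expensive model evaluation. Each candidate is drawn independently of earlier results, whereas an adaptive method such as TPE would use past losses to decide where to sample next.

```python
import math
import random


def toy_loss(c):
    # Hypothetical stand-in for an expensive objective; smallest at c = 3.
    return (math.log(c) - math.log(3.0)) ** 2


def random_search(loss_fn, max_evals, seed=0):
    """Evaluate max_evals independent random candidates and keep the best one."""
    rng = random.Random(seed)
    best_c, best_loss = None, float("inf")
    for _ in range(max_evals):
        # Each draw ignores all previous results: this is the non-adaptive part.
        c = rng.lognormvariate(0.0, 1.0)
        loss = loss_fn(c)
        if loss < best_loss:
            best_c, best_loss = c, loss
    return best_c, best_loss


best_c, best_loss = random_search(toy_loss, max_evals=200)
print("best c:", best_c)
print("best loss:", best_loss)
```

Because the draws never react to earlier losses, random search is simple to run in parallel, but it typically needs more evaluations than an adaptive method to reach the same loss.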

In [5]:
algo = tpe.suggest

### Run the tuning algorithm with Hyperopt `fmin()`

Set `max_evals` to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate.

In [6]:

# Send runs to the MLflow tracking server at this address (here, a local server on port 5001)
mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("MLflow + HyperOpt + Scikit-Learn")
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16)

100%|██████████| 16/16 [00:01<00:00, 13.54trial/s, best loss: -0.9866666666666667]


In [7]:
# Print the best value found for C
print("Best value found: ", argmin)
print("argmin.get('C'): ", argmin.get('C'))

Best value found:  {'C': 5.394497676445904}
argmin.get('C'):  5.394497676445904


To view the MLflow experiment associated with the notebook, click the **Experiment** icon in the notebook context bar on the upper right. There, you can view all runs. To view runs in the MLflow UI, click the icon at the far right next to **Experiment Runs**.

To examine the effect of tuning `C`:

1. Select the resulting runs and click **Compare**.
2. In the Scatter Plot, select **C** for X-axis and **loss** for Y-axis.

In [8]:
def train_and_returnModel(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    clf.fit(X,y)
    
    return clf


In [9]:
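import mlflow.sklearn

# Retrain on the full dataset with the best C found by fmin, then log and register the model
model_2_mlflow = train_and_returnModel(argmin.get('C'))
mlflow.sklearn.log_model(model_2_mlflow, "Sklearn-SVC-model", registered_model_name="Sklearn-SVC-model-reg")
# This URI pins version 1 of the registered model. If the model name was already registered,
# the model logged above gets a higher version number, so adjust the version to load it.
model_uri = "models:/Sklearn-SVC-model-reg/1"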


# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Predict on a Pandas DataFrame.
import pandas as pd
loaded_model.predict(pd.DataFrame(X))

Registered model 'Sklearn-SVC-model-reg' already exists. Creating a new version of this model...
2022/07/13 18:57:48 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: Sklearn-SVC-model-reg, version 2
Created version '2' of model 'Sklearn-SVC-model-reg'.


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])