# Hyperopt

Hyperopt is a Python library for "serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions".

In the machine learning workflow, hyperopt can be used to distribute/parallelize the hyperparameter optimization process with more advanced optimization strategies than are available in other libraries.

There are two ways to scale hyperopt with Apache Spark:
* Use single-machine hyperopt with a distributed training algorithm (e.g. MLlib)
* Use distributed hyperopt with single-machine training algorithms (e.g. scikit-learn) with the SparkTrials class. 

In this lesson, we will use single-machine hyperopt with MLlib. In the lab, you will see how to use hyperopt to distribute the hyperparameter tuning of single-node models.

Unfortunately, you can’t currently use hyperopt to distribute the hyperparameter optimization of distributed training algorithms. However, you still get the benefit of more advanced hyperparameter search algorithms (random search, TPE, etc.) with Spark ML.


Resources:
0. [Documentation](http://hyperopt.github.io/hyperopt/scaleout/spark/)
0. [Hyperopt on Databricks](https://docs.databricks.com/applications/machine-learning/automl/hyperopt/index.html)
0. [Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt](https://databricks.com/blog/2019/06/07/hyperparameter-tuning-with-mlflow-apache-spark-mllib-and-hyperopt.html)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Use hyperopt to find the optimal parameters for an MLlib model using TPE

Let's start by loading in our SF Airbnb Dataset.

In [0]:
import os

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[4]: DataFrame[key: string, value: string]

In [0]:
deltaPath = os.path.join("/", "tmp", username)    #If we were writing to the root folder and not to the DBFS
if not os.path.exists(deltaPath):
    os.mkdir(deltaPath)
    
print(deltaPath)

airbnbDF = spark.read.format("delta").load(deltaPath)

/tmp/renato


In [0]:
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)

We will then create our random forest pipeline and regression evaluator.

In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]

stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")

numericCols = [field for (field, dataType) in trainDF.dtypes if dataType == "double" and field != "price"]
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

rf = RandomForestRegressor(labelCol="price", maxBins=40, seed=42)

pipeline = Pipeline(stages=[stringIndexer, vecAssembler, rf])

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price")

Next, we get to the hyperopt-specific part of the workflow.

First, we define our **objective function**. The objective function has two primary requirements:

1. An **input** `params` including hyperparameter values to use when training the model
2. An **output** containing a loss metric on which to optimize

In this case, we are specifying values of `max_depth` and `num_trees` and returning the RMSE as our loss metric.

We are reconstructing our pipeline for the `RandomForestRegressor` to use the specified hyperparameter values.
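The two requirements above can be seen in isolation with a toy objective that needs neither Spark nor hyperopt. This is only a minimal sketch: `toy_objective` and its bowl-shaped "RMSE" are hypothetical stand-ins for the cross-validated metric, and the literal string `"ok"` is used here in place of hyperopt's `STATUS_OK` constant.

```python
# Toy stand-in for the real objective: no Spark, no hyperopt required.
# The loss surface is made up; it is smallest at max_depth=4, num_trees=60.

def toy_objective(params):
    # 1. Input: a dict of hyperparameter values chosen by the optimizer
    max_depth = params["max_depth"]
    num_trees = params["num_trees"]

    # Pretend "RMSE": a hypothetical bowl-shaped function of the hyperparameters
    fake_rmse = 100.0 + (max_depth - 4) ** 2 + ((num_trees - 60) / 10) ** 2

    # 2. Output: a dict with the loss to minimize and a status flag
    return {"loss": fake_rmse, "status": "ok"}


result = toy_objective({"max_depth": 3, "num_trees": 40})
print(result)
```

The real objective function has the same shape; only the body that computes the loss differs.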

In [0]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from hyperopt import STATUS_OK  # needed by the return value below
import mlflow

def objective_function(params):    
  # set the hyperparameters that we want to tune
  max_depth = params["max_depth"]
  num_trees = params["num_trees"]

  # create a grid with our hyperparameters
  grid = (ParamGridBuilder()
    .addGrid(rf.maxDepth, [max_depth])
    .addGrid(rf.numTrees, [num_trees])
    .build())

  # cross validate the set of hyperparameters
  cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=regressionEvaluator, numFolds=3)
  cvModel = cv.fit(trainDF)

  # get our average RMSE across all three folds
  rmse = cvModel.avgMetrics[0]

  return {"loss": rmse, "status": STATUS_OK}

Next, we define our search space. 

This is similar to the parameter grid in a grid search process. However, we are only specifying the range of values rather than the individual, specific values to be tested. It's up to hyperopt's optimization algorithm to choose the actual values.

See the [documentation](https://github.com/hyperopt/hyperopt/wiki/FMin) for helpful tips on defining your search space.
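The difference between a grid and a search space can be sketched in plain Python. This is an illustrative stand-in, not hyperopt's sampling code: the names `grid`, `space_bounds`, and `sample` are hypothetical, and the half-open `[low, high)` convention is an assumption of this sketch.

```python
import itertools
import random

# A grid lists every value explicitly; every combination gets evaluated.
grid = {"max_depth": [2, 3, 4], "num_trees": [10, 50, 90]}
grid_points = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]

# A search space only gives bounds (low inclusive, high exclusive in this
# sketch); the optimizer decides which values inside them to try.
space_bounds = {"max_depth": (2, 5), "num_trees": (10, 100)}

def sample(bounds, rng):
    # Draw one integer uniformly from each [low, high) range
    return {name: rng.randrange(low, high) for name, (low, high) in bounds.items()}

rng = random.Random(42)
samples = [sample(space_bounds, rng) for _ in range(4)]

print(len(grid_points))  # number of fixed combinations in the grid
print(samples)           # configurations drawn from the ranges
```

The grid can only ever test its listed combinations, while the ranges let the search land on any value in between.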

In [0]:
from hyperopt import hp

search_space = {
  "max_depth": hp.randint("max_depth", 2, 5),
  "num_trees": hp.randint("num_trees", 10, 100)
}

`fmin()` generates new hyperparameter configurations to pass to your `objective_function`. Here it will evaluate 4 models in total, using the results of the previous models to make a more informed choice of the next hyperparameters to try.

Hyperopt allows for parallel hyperparameter tuning using either random search or Tree of Parzen Estimators (TPE). Note that in the cell below, we are importing `tpe`. According to the [documentation](http://hyperopt.github.io/hyperopt/scaleout/spark/), TPE is an adaptive algorithm that 

> iteratively explores the hyperparameter space. Each new hyperparameter setting tested will be chosen based on previous results. 

Hence, `tpe.suggest` is a Bayesian method.
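To give a feel for what "chosen based on previous results" means, here is a heavily simplified one-dimensional sketch of the idea behind TPE. It is not hyperopt's implementation: the functions `kde`, `suggest`, and `loss_fn`, the fixed bandwidth, and the `gamma` split are all assumptions made for illustration. Earlier trials are split into a "good" and a "bad" group by loss, and the next point is the candidate that looks most likely under the good group relative to the bad one.

```python
import math
import random

def kde(x, points, bandwidth):
    # Average of Gaussian bumps centred on each observed point
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / len(points)

def suggest(history, rng, low, high, gamma=0.25, n_candidates=20, bandwidth=1.0):
    # history: list of (x, loss) pairs from earlier trials
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    # Draw candidates near the good points, clipped to the bounds
    candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                  for _ in range(n_candidates)]
    # Prefer candidates that are likely under "good" and unlikely under "bad"
    return max(candidates,
               key=lambda c: kde(c, good, bandwidth) / (kde(c, bad, bandwidth) + 1e-12))

def loss_fn(x):
    return (x - 7.0) ** 2  # hypothetical loss with its minimum at x = 7

rng = random.Random(0)
low, high = 0.0, 10.0
# Start with a few purely random trials, then let earlier results guide the rest
history = [(x, loss_fn(x)) for x in (rng.uniform(low, high) for _ in range(5))]
initial_best = min(loss for _, loss in history)
for _ in range(25):
    x = suggest(history, rng, low, high)
    history.append((x, loss_fn(x)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(round(best_x, 2), round(best_loss, 4))
```

A random search would ignore `history` entirely; the adaptive step is what concentrates later trials near the promising region.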

MLflow also integrates with Hyperopt, so you can track the results of all the models you’ve trained as part of your hyperparameter tuning. You can track the MLflow experiment in this notebook, or you can specify an external experiment.

In [0]:
from hyperopt import fmin, tpe, STATUS_OK, Trials
import numpy as np

# Creating a parent run
with mlflow.start_run():
  num_evals = 4
  trials = Trials()
  best_hyperparam = fmin(fn=objective_function, 
                         space=search_space,
                         algo=tpe.suggest, 
                         max_evals=num_evals,
                         trials=trials,
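                         # Recent hyperopt releases expect a NumPy Generator here;
                         # older releases took np.random.RandomState(42) instead.
                         rstate=np.random.default_rng(42)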
                        )
  
  # get optimal hyperparameter values
  best_max_depth = best_hyperparam["max_depth"]
  best_num_trees = best_hyperparam["num_trees"]
  
  # change RF to use optimal hyperparameter values (this is a stateful method)
  rf.setMaxDepth(best_max_depth)
  rf.setNumTrees(best_num_trees)
  
  # train pipeline on entire training data - this will use the updated RF values
  pipelineModel = pipeline.fit(trainDF)
  
  # evaluate final model on test data
  predDF = pipelineModel.transform(testDF)
  rmse = regressionEvaluator.evaluate(predDF)
  
  # Log param and metric for the final model
  mlflow.log_param("max_depth", best_max_depth)
  mlflow.log_param("num_trees", best_num_trees)
  mlflow.log_metric("rmse", rmse)



&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>