# Distributed Machine Learning on Spark with Hyperparameter Tuning

Spark + MLflow + Hyperopt + scikit-learn

[Hyperopt](https://github.com/hyperopt/hyperopt) is a Python library for hyperparameter tuning, including automated MLflow tracking and the `SparkTrials` class for distributed tuning.  

This notebook shows how to scale up hyperparameter tuning for a single-machine Python ML algorithm and track the results using MLflow. You will use the `SparkTrials` class to distribute the tuning trials across the Spark cluster.

## Initialize the parameters

In [None]:
trials = input('Please set the number of trials of parameter tuning: ').strip()

## CloudTik: scale workers and wait for them to be ready

In [None]:
from cloudtik.runtime.spark.api import ThisSparkCluster
from cloudtik.runtime.ml.api import ThisMLCluster

cluster = ThisSparkCluster()

# Scale the cluster as needed
# cluster.scale(workers=1)

# Wait for all cluster workers to be ready
cluster.wait_for_ready(min_workers=1)

ml_cluster = ThisMLCluster()
mlflow_url = ml_cluster.get_services()['mlflow']['url']

## Initialize SparkSession

In [None]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('spark-scikit').set('spark.sql.shuffle.partitions', '16')
spark = SparkSession.builder.config(conf=conf).getOrCreate()

## Load the iris dataset from scikit-learn

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

## Define a train function

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


# Function to train a model
def train(C):
    # Create a support vector classifier model
    model = SVC(C=C)
    model.fit(X, y)
    return model

## Objective function to minimize

In [None]:
import mlflow
from hyperopt import STATUS_OK


def hyper_objective(C):
    # Train a support vector classifier with the candidate value of C
    model = train(C)

    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(model, X, y).mean()
    with mlflow.start_run():
        # C is a hyperparameter, so log it as a param rather than a metric
        mlflow.log_param("C", C)
        mlflow.log_metric("loss", -accuracy)

    # Hyperopt tries to minimize the objective function.
    # A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'C': C, 'status': STATUS_OK}

## Run hyperparameter tuning with Hyperopt

Here are the steps in a Hyperopt workflow:  
1. Define a function to minimize.  
2. Define a search space over hyperparameters.  
3. Select a search algorithm.  
4. Run the tuning algorithm with Hyperopt `fmin()`.
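
Step 1 asks for a function to minimize, which is why a score where higher is better (such as accuracy) has to be negated before it is returned. The sketch below illustrates that sign convention in plain Python. The accuracy values are made up for illustration, `to_loss` is a hypothetical helper, and `STATUS_OK` is a local stand-in for the constant imported from Hyperopt.

```python
# Local stand-in for hyperopt.STATUS_OK (assumption: kept local so the sketch has no dependencies)
STATUS_OK = "ok"


def to_loss(accuracy):
    """Wrap an accuracy as a result dict whose 'loss' is the negated accuracy."""
    return {"loss": -accuracy, "status": STATUS_OK}


# Hypothetical cross-validation accuracies for three candidate values of C
accuracies = {0.1: 0.91, 1.0: 0.97, 10.0: 0.95}

# Convert each accuracy into a loss, then pick the candidate with the smallest loss
results = {C: to_loss(acc) for C, acc in accuracies.items()}
best_C = min(results, key=lambda C: results[C]["loss"])

print("Best C by minimum loss:", best_C)
```

Minimizing the negated accuracy selects the same candidate as maximizing the accuracy directly.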

Define the search space over hyperparameters. See the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) for details on defining a search space and parameter expressions.
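
To get a feel for the search space used in this notebook, `hp.lognormal('C', 0, 1.0)`, recall that Hyperopt documents `hp.lognormal(label, mu, sigma)` as a value drawn according to `exp(normal(mu, sigma))`. The following is a minimal standard-library sketch of that distribution, not Hyperopt's own sampler; `sample_lognormal` is a hypothetical helper introduced for illustration.

```python
import math
import random


def sample_lognormal(mu, sigma, rng):
    """Draw exp(N(mu, sigma)); a stand-in for what hp.lognormal describes."""
    return math.exp(rng.gauss(mu, sigma))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = sorted(sample_lognormal(0.0, 1.0, rng) for _ in range(10_000))
median = samples[len(samples) // 2]

# Every draw is positive, which suits C since SVC requires C > 0
print("smallest:", samples[0])
print("median:  ", median)
print("largest: ", samples[-1])
```

With `mu = 0` the draws cluster around 1 (the default `C` of `SVC`) while still reaching values well below and well above it.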

Select a search algorithm. The two main choices are:
* `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results
* `hyperopt.rand.suggest`: Random search, a non-adaptive approach that samples over the search space
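
As a rough illustration of the non-adaptive approach, the sketch below runs a plain random search over the same lognormal distribution for `C`. It is a simplified stand-in written with the standard library only, not Hyperopt's implementation of `rand.suggest`. `toy_loss` and `random_search` are hypothetical names, and the toy loss replaces the cross-validated objective so the example runs without a dataset.

```python
import math
import random


def toy_loss(C):
    # Hypothetical objective for illustration only; smallest at C = 2
    return (math.log(C) - math.log(2.0)) ** 2


def random_search(loss_fn, max_evals, seed=0):
    """Sample C independently max_evals times and keep the lowest loss."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_evals):
        # Each draw ignores earlier results; that is what makes it non-adaptive
        C = math.exp(rng.gauss(0.0, 1.0))
        history.append({"C": C, "loss": loss_fn(C)})
    best = min(history, key=lambda trial: trial["loss"])
    return best, history


best, history = random_search(toy_loss, max_evals=50)
print("Evaluations:", len(history))
print("Best trial: ", best)
```

An adaptive method such as TPE differs in the sampling line: it uses the accumulated history to decide where to draw the next candidate.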

Run the tuning algorithm with Hyperopt `fmin()`.

Set `max_evals` to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate.

In [None]:
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
import mlflow

trials = int(trials) if trials else 2
print('Hyperparameter tuning trials: {}'.format(trials))

# Define the search space and select a search algorithm
search_space = hp.lognormal('C', 0, 1.0)
algo = tpe.suggest
spark_trials = SparkTrials(spark_session=spark)

mlflow.set_tracking_uri(mlflow_url)
mlflow.set_experiment("MLflow + HyperOpt + Scikit-Learn")
argmin = fmin(
  fn=hyper_objective,
  space=search_space,
  algo=algo,
  max_evals=trials,
  trials=spark_trials)

# Print the best value found for C
print("Best parameter found: ", argmin)
print("argmin.get('C'): ", argmin.get('C'))

## Train the final model with the best parameter

In [None]:
best_model = train(argmin.get('C'))
model_name = 'scikit-learn-svc-model'
mlflow.sklearn.log_model(best_model, model_name, registered_model_name=model_name)

## Load the model as a PyFuncModel and predict on a pandas DataFrame

In [None]:
import pandas as pd

model_uri = 'models:/{}/latest'.format(model_name)
print('Inference with model: {}'.format(model_uri))
saved_model = mlflow.pyfunc.load_model(model_uri)
saved_model.predict(pd.DataFrame(X))

## Clean up

In [None]:
spark.stop()