## MLflow Quick Start Notebook (Azure Databricks)
This is a Quick Start notebook based on [MLflow's tutorial](https://mlflow.org/docs/latest/tutorial.html).  In this tutorial, we’ll:
* Install the MLflow library on a Databricks cluster
* Connect our notebook to an MLflow Tracking Server running on an Azure VM
* Log metrics, parameters, models and a .png plot to show how you can record arbitrary outputs from your MLflow job
* View our results on the MLflow tracking UI.

This notebook uses the `diabetes` dataset in scikit-learn and predicts the progression metric (a quantitative measure of disease progression one year after baseline) based on BMI, blood pressure, and other factors. It uses the scikit-learn ElasticNet linear regression model, where we vary the `alpha` and `l1_ratio` parameters for tuning. For more information on ElasticNet, refer to:
  * [Elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
  * [Regularization and Variable Selection via the Elastic Net](https://web.stanford.edu/~hastie/TALKS/enet_talk.pdf)
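
For orientation, the objective that scikit-learn's `ElasticNet` minimizes can be written as below. The notation follows the scikit-learn documentation rather than anything stated in this notebook: $w$ is the coefficient vector, $n$ the number of samples, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty is purely L1 (lasso), and with $\rho = 0$ it is purely L2 (ridge); values in between mix the two.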

A good reference for MLflow in general is [Matei's Spark Summit 2018 Keynote](https://databricks.com/sparkaisummit/north-america/spark-summit-2018-keynotes).

To get started, you will first need to:

1. Install MLflow, Azure Storage, and Python ML and math libraries on your Databricks cluster 
2. Set up a Remote MLflow tracking server
3. Configure Azure Blob Storage

### Install Libraries on Your Databricks Cluster

1. Use an existing cluster, or [create a cluster](https://docs.databricks.com/user-guide/clusters/create.html), with the following settings:
  * **Databricks Runtime Version:** Databricks Runtime 4.1 
  * **Python Version:** Python 3
1. Install `mlflow` as a [PyPi library](https://docs.databricks.com/user-guide/libraries.html#upload-a-python-pypi-package-or-python-egg).
  1. Choose **PyPi** and enter `mlflow`.
1. Install `azure-storage` as a PyPi library.
  1. Choose **PyPi** and enter `azure-storage`.
1. For our ElasticNet Descent Path visualizations, install `scikit-learn` and `matplotlib` as PyPI libraries, pinned to the versions below.
  1. Choose **PyPi** and enter `scikit-learn==0.19.1`
  1. Choose **PyPi** and enter `matplotlib==2.2.2`

### Set up a Remote MLflow Tracking Server

To run a long-lived, shared MLflow tracking server, launch a Linux VM instance to run the [MLflow Tracking server](https://mlflow.org/docs/latest/tracking.html). To do this:

1. Create a Linux VM instance.
  1. Open port 5000 for the MLflow server; for an example of how to do this, see [How to open ports to a virtual machine with the Azure portal](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal). Opening port 5000 to the Internet allows anyone to access your server, so we recommend opening the port only within an [Azure virtual network](https://azure.microsoft.com/en-us/services/virtual-network/) that your Databricks clusters have access to.
  1. Install conda onto your Linux instance via [Conda > Installing on Linux](https://conda.io/docs/user-guide/install/linux.html).
  1. Install `mlflow`: `pip install mlflow`
  1. Install `azure-storage`: `pip install azure-storage`
1. Run the tracking server.
  1. Start the tracking server: `mlflow server --default-artifact-root wasbs://<container>@<account>.blob.core.windows.net/ --host 0.0.0.0`. For more information, refer to [MLflow > Running a Tracking Server](https://mlflow.org/docs/latest/tracking.html?highlight=server#running-a-tracking-server).
1. Test connectivity of your tracking server:
  1. Get the hostname of your Azure VM instance.
  1. Go to `http://<mlflow-server-dns>:5000`; it should look similar to
    <img src="https://docs.azuredatabricks.net/_static/images/mlflow/mlflow-web-ui.png" width=1000/>
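
The browser check above can also be scripted, for example from a notebook cell on the cluster. The sketch below is one way to do that with only the Python standard library. The helper names `build_tracking_uri` and `is_reachable` are made up for this example, and it assumes the server speaks plain HTTP on port 5000 as configured above.

```python
import urllib.error
import urllib.request


def build_tracking_uri(host, port=5000):
    """Build the HTTP URL of the tracking server from its DNS name."""
    return "http://%s:%d" % (host, port)


def is_reachable(uri, timeout=5):
    """Return True if an HTTP GET to `uri` gets any response from a server."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return False
```

A call such as `is_reachable(build_tracking_uri('<mlflow-server-dns>'))` returning `False` usually points at the network security group rule or the `--host` setting rather than at MLflow itself.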

### Configure Azure Blob Storage for MLflow Artifact Store
From your Databricks cluster, configure Databricks cluster environment variables with the connection string required to access Azure Blob Storage; a good reference is [Spark configuration properties](https://docs.azuredatabricks.net/user-guide/clusters/spark-config.html#spark-configuration-properties).

1. In a cell, run the following (fill out the template with your Azure Blob Storage account information):
    ```
    dbutils.fs.put("dbfs:/databricks/init/init.bash" ,"""
    #!/bin/bash
    sudo echo export AZURE_STORAGE_CONNECTION_STRING="\\"DefaultEndpointsProtocol=https;AccountName=$myAccountName$;AccountKey=$myAccountKey$\\"" >> /databricks/spark/conf/spark-env.sh
    """, True)
    ```
1. Validate that this was correctly written by running `%fs head dbfs:/databricks/init/init.bash`.
1. Restart your cluster so the environment variables will be set.
1. Detach and reattach your notebook.
1. Validate the environment variables were set correctly via `%sh cat /databricks/spark/conf/spark-env.sh` and/or running
    ```
    %scala
    sys.env.get("AZURE_STORAGE_CONNECTION_STRING")
    ```
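
If you prefer to validate from Python, the sketch below checks that the connection string contains the fields the init script is meant to set, without printing the key itself. It assumes the variable is visible to the notebook's Python process; the helper names `parse_connection_string` and `missing_fields` are hypothetical and not part of any Azure library.

```python
import os


def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' into a dict, keeping '=' inside values."""
    items = conn_str.strip().strip('"').split(";")
    # Split on the first '=' only, since account keys often end in '=='.
    return dict(item.split("=", 1) for item in items if "=" in item)


def missing_fields(conn_str, required=("AccountName", "AccountKey")):
    """Return the required field names that are absent or empty."""
    fields = parse_connection_string(conn_str)
    return [name for name in required if not fields.get(name)]


conn_str = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")
print("Missing fields:", missing_fields(conn_str))
```

An empty list means both fields are present; it does not prove that the key is valid for the account.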

### Start Using MLflow in a Notebook

The first step is to import `mlflow` and call `mlflow.set_tracking_uri` to point to your server:

In [0]:
# Set this variable to your MLflow server's DNS name
mlflow_server = '<mlflow-server-dns>'

# Tracking URI
mlflow_tracking_URI = 'http://' + mlflow_server + ':5000'
print ("MLflow Tracking URI: %s" % (mlflow_tracking_URI))

# Import MLflow and set the Tracking URI
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_URI)

#### Write Your ML Code Based on the `train_diabetes.py` Code
This tutorial is based on MLflow's [train_diabetes.py](https://github.com/mlflow/mlflow/blob/master/examples/sklearn_elasticnet_diabetes/train_diabetes.py), which uses the built-in diabetes dataset (loaded with `sklearn.datasets.load_diabetes`) to predict disease progression based on various factors.

In [0]:
# Import various libraries including matplotlib, sklearn, mlflow
import os
import warnings
import sys

import pandas as pd
import numpy as np
from itertools import cycle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

# Import mlflow
import mlflow
import mlflow.sklearn

# Load Diabetes datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Create pandas DataFrame for sklearn ElasticNet linear_model
Y = np.array([y]).transpose()
d = np.concatenate((X, Y), axis=1)
cols = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'progression']
data = pd.DataFrame(d, columns=cols)

#### Plot the ElasticNet Descent Path
As an example of recording arbitrary output files in MLflow, we'll plot the [ElasticNet Descent Path](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html) for the ElasticNet model by *alpha* for the specified *l1_ratio*.

The `plot_enet_descent_path` function below:
* Returns a figure that can be displayed in our Databricks notebook via `display`
* Saves the figure as `ElasticNet-paths.png` on the Databricks cluster's driver node

The saved file is then uploaded to MLflow by the `log_artifact` call within `train_diabetes`.

In [0]:
def plot_enet_descent_path(X, y, l1_ratio):
    # Compute paths
    eps = 5e-3  # the smaller it is the longer is the path

    # Reference the global image variable
    global image
    
    print("Computing regularization path using the elastic net.")
    alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio, fit_intercept=False)

    # Display results
    fig = plt.figure(1)
    ax = plt.gca()

    colors = cycle(['b', 'r', 'g', 'c', 'k'])
    neg_log_alphas_enet = -np.log10(alphas_enet)
    for coef_e, c in zip(coefs_enet, colors):
        l1 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)

    plt.xlabel('-Log(alpha)')
    plt.ylabel('coefficients')
    title = 'ElasticNet Path by alpha for l1_ratio = ' + str(l1_ratio)
    plt.title(title)
    plt.axis('tight')

    # Display images
    image = fig
    
    # Save figure
    fig.savefig("ElasticNet-paths.png")

    # Close plot
    plt.close(fig)

    # Return images
    return image    

#### Train the Diabetes Model
The next function trains an ElasticNet linear regression model based on the input parameters `alpha` (`in_alpha`) and `l1_ratio` (`in_l1_ratio`).

In addition, this function uses MLflow Tracking to record its
* parameters,
* metrics,
* model,
* and arbitrary files, namely the ElasticNet Descent Path plot described above.

**Tip:** We use `with mlflow.start_run():` in the Python code to create a new MLflow run. This is the recommended way to use MLflow in notebook cells. Whether your code completes or exits with an error, the `with` context makes sure that the MLflow run is closed, so you don't have to call `mlflow.end_run` later in the code.
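
The cleanup guarantee of `with` can be seen without a tracking server. The sketch below uses a hypothetical stand-in class, `DemoRun`, which is not MLflow's implementation; it only records whether the "run" is open, so the example shows the general context-manager behaviour that `mlflow.start_run()` relies on.

```python
class DemoRun:
    """Hypothetical stand-in for an MLflow run; it only tracks open/closed state."""

    def __init__(self):
        self.active = False

    def __enter__(self):
        self.active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on normal exit and also when the body raises.
        self.active = False
        return False  # do not swallow the exception


run = DemoRun()
try:
    with run:
        raise ValueError("simulated training failure")
except ValueError as err:
    print("caught:", err)

print("run still active?", run.active)
```

Running it prints the caught error and then `run still active? False`: the run was closed even though the body failed.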

In [0]:
# train_diabetes
#   Uses the sklearn Diabetes dataset to predict diabetes progression using ElasticNet
#       The predicted "progression" column is a quantitative measure of disease progression one year after baseline
#       http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
def train_diabetes(data, in_alpha, in_l1_ratio):
  # Evaluate metrics
  def eval_metrics(actual, pred):
      rmse = np.sqrt(mean_squared_error(actual, pred))
      mae = mean_absolute_error(actual, pred)
      r2 = r2_score(actual, pred)
      return rmse, mae, r2

  warnings.filterwarnings("ignore")
  np.random.seed(40)

  # Split the data into training and test sets. (0.75, 0.25) split.
  train, test = train_test_split(data)

  # The predicted column is "progression" which is a quantitative measure of disease progression one year after baseline
  train_x = train.drop(["progression"], axis=1)
  test_x = test.drop(["progression"], axis=1)
  train_y = train[["progression"]]
  test_y = test[["progression"]]

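  # Fall back to the defaults when a parameter is not supplied.
  # (Check for None before converting: float(None) raises a TypeError.)
  if in_alpha is None:
    alpha = 0.05
  else:
    alpha = float(in_alpha)
    
  if in_l1_ratio is None:
    l1_ratio = 0.05
  else:
    l1_ratio = float(in_l1_ratio)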
  
  # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
  with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    # Print out ElasticNet model metrics
    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    # The tracking URI was already set in an earlier cell,
    # so the call below is left commented out
    #mlflow.set_tracking_uri(mlflow_tracking_URI)

    # Log mlflow attributes for mlflow UI
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
    
    # Call plot_enet_descent_path
    image = plot_enet_descent_path(X, y, l1_ratio)
    
    # Log artifacts (output files)
    mlflow.log_artifact("ElasticNet-paths.png")

![](https://docs.databricks.com/_static/images/mlflow/elasticnet-paths-by-alpha-per-l1-ratio.png)
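
The three metrics that `train_diabetes` logs can also be computed by hand, which makes it clearer what each one measures. The sketch below uses only the standard library; the function names and the sample values are invented for illustration and are not taken from the diabetes dataset. In the notebook itself these values come from scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score`.

```python
import math


def rmse(actual, pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def mae(actual, pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def r2(actual, pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
actual = [3.0, -0.5, 2.0, 7.0]
pred = [2.5, 0.0, 2.0, 8.0]
print("RMSE:", rmse(actual, pred))
print("MAE:", mae(actual, pred))
print("R2:", r2(actual, pred))
```

Lower RMSE and MAE are better, while R2 is better the closer it is to 1, which is worth keeping in mind when sorting runs in the MLflow UI.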

#### Experiment with Different Parameters

Now that we have a `train_diabetes` function that records MLflow runs, we can simply call it with different parameters to explore them. Later, we'll be able to visualize all these runs on our MLflow tracking server.

In [0]:
# Start with alpha and l1_ratio values of 0.01, 0.01
train_diabetes(data, 0.01, 0.01)

In [0]:
display(image)

In [0]:
# Next, try alpha and l1_ratio values of 0.01, 0.75
train_diabetes(data, 0.01, 0.75)

In [0]:
display(image)

In [0]:
# Finally, try alpha and l1_ratio values of 0.01, 1
train_diabetes(data, 0.01, 1)

In [0]:
display(image)
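
Calling `train_diabetes` cell by cell works for a handful of settings; for a larger grid, a loop is one option. The helper below is a hypothetical sketch (the name `sweep` is not part of MLflow or of this notebook) that calls a training function once per `(alpha, l1_ratio)` pair. It takes the training function as an argument, so it makes no assumptions beyond the `(data, alpha, l1_ratio)` call signature used above.

```python
from itertools import product


def sweep(train_fn, data, alphas, l1_ratios):
    """Call train_fn(data, alpha, l1_ratio) for every combination; return the pairs tried."""
    tried = []
    for alpha, l1_ratio in product(alphas, l1_ratios):
        train_fn(data, alpha, l1_ratio)
        tried.append((alpha, l1_ratio))
    return tried
```

In this notebook the call would look like `sweep(train_diabetes, data, [0.01, 0.1], [0.01, 0.75, 1])`, which would record one MLflow run per pair. Note that `image` would then hold only the plot from the last run.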

### Review the MLflow UI
Open the URL of your tracking server in a web browser. In case you forgot it, you can get it from `mlflow.tracking.get_tracking_uri()`:

In [0]:
# Identify the location of the runs
mlflow.tracking.get_tracking_uri()

The MLflow UI should look similar to the animated GIF below. Inside the UI, you can:
* View your experiments and runs
* Review the parameters and metrics on each run
* Click each run for a detailed view to see the model, images, and other artifacts produced.

<img src="https://docs.azuredatabricks.net/_static/images/mlflow/mlflow-ui-azure.gif" width=1000/>

#### Organize MLflow Runs into Experiments

As you start using your MLflow server for more tasks, you may want to separate them out. MLflow allows you to create [experiments](https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments) to organize your runs. To report your run to a specific experiment, pass an `experiment_id` parameter to `mlflow.start_run`, as in `mlflow.start_run(experiment_id=1)`.