
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Packaging ML Projects

Machine learning projects need to produce both reusable code and reproducible results.  This lesson examines creating, organizing, and packaging machine learning projects with a focus on reproducibility and collaborating with a team.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Introduce organizing code into projects
 - Package a basic project with parameters and an environment
 - Run a basic project locally and remotely
 - Design a multi-step workflow with many different components

### The Case for Packaging

There are a number of different reasons why teams need to package their machine learning projects:<br><br>

1. Projects have various library dependencies
  - Shipping a machine learning solution means shipping the environment in which it was built
  - MLflow allows this environment to be a conda environment or a Docker container
  - This means that teams can easily share and publish their code for others to use
2. Machine learning projects become increasingly complex as time goes on
  - This includes ETL and featurization steps, machine learning models used for pre-processing, and finally the model training itself
3. Each component of a machine learning pipeline needs to allow for tracing its lineage
  - If there's a failure at some point, tracing the full end-to-end lineage of a model allows for easier debugging.

**ML Projects is a specification for how to organize code in a project.**<br><br>

- The heart of this is an **MLproject file,** a YAML specification for the components of the ML project
- This allows for more complex workflows since a project can execute another project
   - This allows for encapsulation of each stage of a more complex machine learning architecture
- This means that teams can collaborate more easily using this architecture

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-project.png" style="height: 400px; margin: 20px"/></div>

In [5]:
%run "./Includes/Classroom-Setup"

### Packaging a Simple Project

First we're going to create a simple MLflow project consisting of the following elements:<br><br>

1. MLproject file
2. Conda environment
3. Basic code

We're going to want to be able to pass parameters into this code so that we can try different hyperparameter options.

Create a new experiment for this exercise.  Navigate to the UI in another tab.

In [8]:
import mlflow
from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient

experimentPath = "/Users/" + username + "/experiment-L3"

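try:
  experimentID = mlflow.create_experiment(experimentPath)
except MlflowException:
  # The experiment already exists, so look up its ID instead
  experimentID = MlflowClient().get_experiment_by_name(experimentPath).experiment_id

# Make this the active experiment whether it was just created or already existed
mlflow.set_experiment(experimentPath)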

print("The experiment can be found at the path `{}` and has an experiment_id of `{}`".format(experimentPath, experimentID))

First, examine the code we're going to run.  This looks similar to what we ran in the last lesson with the addition of decorators from the `click` library.  This allows us to parameterize our code.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We'll uncomment the `__main__` block when we save this code as a Python file.<br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out the <a href="https://click.palletsprojects.com/en/7.x/" target="_blank">`click` docs here.</a>

In [10]:
import click
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

@click.command()
@click.option("--data_path", default="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", type=str)
@click.option("--n_estimators", default=10, type=int)
@click.option("--max_depth", default=20, type=int)
@click.option("--max_features", default="auto", type=str)
def mlflow_rf(data_path, n_estimators, max_depth, max_features):

  with mlflow.start_run() as run:
    # Import the data
    df = pd.read_csv(data_path)
    X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)
    
    # Create model, train it, and create predictions
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)

    # Log model
    mlflow.sklearn.log_model(rf, "random-forest-model")
    
    # Log params
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("max_features", max_features)

    # Log metrics
    mlflow.log_metric("mse", mean_squared_error(y_test, predictions))
    mlflow.log_metric("mae", mean_absolute_error(y_test, predictions))  
    mlflow.log_metric("r2", r2_score(y_test, predictions))  

# if __name__ == "__main__":
#   mlflow_rf() # Note that this does not need arguments thanks to click

Test that it works using the `click` `CliRunner`, which invokes the command the same way the command line would.

In [12]:
from click.testing import CliRunner

runner = CliRunner()
# Command-line arguments arrive as strings; click converts them using each option's `type`
result = runner.invoke(mlflow_rf, ['--n_estimators', '10', '--max_depth', '20'], catch_exceptions=True)

assert result.exit_code == 0, "Code failed" # Check to see that it worked

print("Success!")

Now create a directory to hold our project files.  This will be a unique directory for your username.

In [14]:
train_path = userhome + "/ml-production/mlflow-model-training/"

dbutils.fs.rm(train_path, True) # Clears the directory if it already exists
dbutils.fs.mkdirs(train_path)

print("Created directory `{}` to house the project files.".format(train_path))

Create the `MLproject` file.  This is the heart of an MLflow project.  It includes pointers to the conda environment and a `main` entry point, which is backed by the file `train.py`.

In [16]:
dbutils.fs.put(train_path + "/MLproject", 
'''
name: Lesson-3-Model-Training

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      data_path: {type: str, default: "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv"}
      n_estimators: {type: int, default: 10}
      max_depth: {type: int, default: 20}
      max_features: {type: str, default: "auto"}
    command: "python train.py --data_path {data_path} --n_estimators {n_estimators} --max_depth {max_depth} --max_features {max_features}"
'''.strip())
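
The `command` line in the `MLproject` file is a template: each `{name}` placeholder is filled with the matching parameter, falling back to its declared default.  The sketch below illustrates that idea in plain Python.  `render_command` is a hypothetical helper written for this lesson, not part of MLflow, and MLflow's own substitution logic may differ in details such as quoting.

```python
import shlex

def render_command(template, defaults, overrides=None):
    """Fill a command template with parameter values (hypothetical helper, not MLflow API)."""
    values = dict(defaults)          # start from the defaults declared in the MLproject file
    values.update(overrides or {})   # values supplied at run time take precedence
    # Quote each value so spaces or shell metacharacters cannot break the command
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return template.format(**quoted)

template = ("python train.py --data_path {data_path} --n_estimators {n_estimators} "
            "--max_depth {max_depth} --max_features {max_features}")

defaults = {
    "data_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
    "n_estimators": 10,
    "max_depth": 20,
    "max_features": "auto",
}

# Override a single parameter; the rest keep their defaults
command = render_command(template, defaults, {"n_estimators": 100})
print(command)
```

The rendered string is what ends up being executed inside the project's environment, which is why the parameter names in the `MLproject` file must match the options that `train.py` accepts.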

Create the conda environment.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You can also look up a package version dynamically by reading the `__version__` attribute of the package.

In [18]:
dbutils.fs.put(train_path + "/conda.yaml", 
'''
name: Lesson-03
channels:
  - defaults
dependencies:
  - cloudpickle=0.5.3
  - numpy=1.14.3
  - pandas=0.23.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow==0.9.0
'''.strip())
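
Instead of hard-coding versions, the environment file could be generated from whatever is installed in the current session.  The following is a minimal sketch of that approach using the standard library's `importlib.metadata` (Python 3.8+).  `installed_version` and `render_conda_yaml` are hypothetical helpers, and the fallback versions are simply the ones pinned above.

```python
from importlib import metadata

def installed_version(package, fallback):
    """Return the installed version of `package`, or `fallback` if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return fallback

def render_conda_yaml(env_name, conda_deps, pip_deps):
    """Build the text of a conda environment file from {package: version} mappings."""
    lines = ["name: " + env_name, "channels:", "  - defaults", "dependencies:"]
    lines += ["  - {}={}".format(name, version) for name, version in conda_deps.items()]
    lines.append("  - pip:")
    lines += ["    - {}=={}".format(name, version) for name, version in pip_deps.items()]
    return "\n".join(lines)

conda_yaml = render_conda_yaml(
    "Lesson-03",
    conda_deps={
        "numpy": installed_version("numpy", "1.14.3"),
        "pandas": installed_version("pandas", "0.23.0"),
        "scikit-learn": installed_version("scikit-learn", "0.19.1"),
    },
    pip_deps={"mlflow": installed_version("mlflow", "0.9.0")},
)
print(conda_yaml)
```

The resulting string could then be written with `dbutils.fs.put` in place of the literal text, so the packaged environment tracks the versions the code was actually developed against.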

Now create the code itself.  This is the same as above, except that the `__main__` block is now included.  Note how there are no arguments passed into `mlflow_rf()` on the final line.  `click` is handling the arguments for us.

In [20]:
dbutils.fs.put(train_path + "/train.py", 
'''
import click
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

@click.command()
@click.option("--data_path", default="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", type=str)
@click.option("--n_estimators", default=10, type=int)
@click.option("--max_depth", default=20, type=int)
@click.option("--max_features", default="auto", type=str)
def mlflow_rf(data_path, n_estimators, max_depth, max_features):

  with mlflow.start_run() as run:
    # Import the data
    df = pd.read_csv(data_path)
    X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)
    
    # Create model, train it, and create predictions
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)

    # Log model
    mlflow.sklearn.log_model(rf, "random-forest-model")
    
    # Log params
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("max_features", max_features)

    # Log metrics
    mlflow.log_metric("mse", mean_squared_error(y_test, predictions))
    mlflow.log_metric("mae", mean_absolute_error(y_test, predictions))  
    mlflow.log_metric("r2", r2_score(y_test, predictions))  

if __name__ == "__main__":
  mlflow_rf() # Note that this does not need arguments thanks to click
'''.strip())

To summarize, you now have three files: `MLproject`, `conda.yaml`, and `train.py`

In [22]:
dbutils.fs.ls(train_path)
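
Outside Databricks, where `dbutils` is not available, the same three-file layout could be written with `pathlib`.  This is a sketch under that assumption: `write_project` is a hypothetical helper, and the file contents are abbreviated placeholders rather than the full files created above.

```python
import tempfile
from pathlib import Path

def write_project(project_dir, files):
    """Write {file name: contents} into project_dir and return the sorted file names."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    for name, contents in files.items():
        (project_dir / name).write_text(contents.strip() + "\n")
    return sorted(path.name for path in project_dir.iterdir())

# Abbreviated placeholder contents; the real files are the ones created earlier
files = {
    "MLproject": """
name: Lesson-3-Model-Training
conda_env: conda.yaml
entry_points:
  main:
    command: "python train.py"
""",
    "conda.yaml": "name: Lesson-03",
    "train.py": "print('training placeholder')",
}

# A temporary directory keeps the sketch from leaving files behind
with tempfile.TemporaryDirectory() as tmp:
    created = write_project(Path(tmp) / "mlflow-model-training", files)
    print(created)
```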

### Running Projects

Now that you have the three files needed to run the project, you can trigger the run.  We'll do this in a few different ways:<br><br>

1. On the driver node of our Spark cluster
2. On a new Spark cluster submitted as a job
3. Using files backed by GitHub

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This currently relies on environment variables.  See the setup script for details.

Now run the experiment.  This command will execute against the driver node of a Spark cluster, though it could just as well run locally or on a different remote VM.

First set the experiment using the `experimentPath` defined earlier.  Prepend `/dbfs` to the file path, which allows the cluster's file system to access DBFS.  Then, pass your parameters.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This will take a few minutes to build the environment for the first time.  Subsequent runs are faster since `mlflow` can reuse the same environment after it has been built.

In [25]:
import mlflow

mlflow.projects.run(train_path.replace("dbfs:","/dbfs"),
  parameters={
    "data_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
    "n_estimators": 10,
    "max_depth": 20,
    "max_features": "auto"
})
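
The call above converts the path with `train_path.replace("dbfs:","/dbfs")`, which rewrites every occurrence of `dbfs:` in the string.  A slightly more defensive variant touches only the leading scheme.  `dbfs_to_local` is a hypothetical helper, and the user directory in the example is made up for illustration.

```python
def dbfs_to_local(path):
    """Convert a `dbfs:/...` URI into the `/dbfs/...` path used by local file APIs."""
    prefix = "dbfs:"
    if path.startswith(prefix):
        # Replace only the leading scheme, leaving the rest of the path untouched
        return "/dbfs" + path[len(prefix):]
    return path  # already a local-style path, so return it unchanged

# Hypothetical user directory, for illustration only
print(dbfs_to_local("dbfs:/user/me/ml-production/mlflow-model-training/"))
```

A value such as `dbfs_to_local(train_path)` could then be passed as the project URI in place of the `replace` call.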

Check the run in the UI.  Notice that you can see the run command.  **This is very helpful in debugging.**

Now that it's working, experiment with other parameters.  Note how much faster it runs the second time.

In [27]:
mlflow.projects.run(train_path.replace("dbfs:","/dbfs"),
  parameters={
    "data_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
    "n_estimators": 1000,
    "max_depth": 10,
    "max_features": "log2"
})

How did the new model do?

Now try executing this code against a new Databricks cluster.  This requires defining cluster specifications.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/>  <a href="https://docs.databricks.com/api/latest/clusters.html" target="_blank">See the clusters API docs</a> to see how to define cluster specifications.

In [30]:
# clusterspecs = {
#     "num_workers": 2,
#     "spark_version": "5.3.x-cpu-ml-scala2.11",
#     "node_type_id": "Standard_DS3_v2",
#     "driver_node_type_id": "Standard_DS3_v2",
# }
# 
# mlflow.projects.run(
#   uri=train_path.replace("dbfs:","/dbfs"),
#   parameters={
#     "data_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
#     "n_estimators": 1500,
#     "max_depth": 5,
#     "max_features": "sqrt"
# },
#   mode="databricks",
#   cluster_spec=clusterspecs
# )

Finally, run this example, which is <a href="https://github.com/mlflow/mlflow-example" target="_blank">a project backed by GitHub.</a>

In [32]:
mlflow.run(
  uri="https://github.com/mlflow/mlflow-example",
  parameters={'alpha':0.4}
)

### Multi-Step Workflows

Now that we can package projects and run them in their environment, let's look at how we can make workflows consisting of multiple steps.  The underlying idea that makes this possible is that **runs can recursively call other runs.**  This means that steps in a machine learning pipeline can be isolated.  There are three general architectures to consider:<br><br>

1. One driver project calls other entry points in that same project
2. One driver project calls other projects 
3. One project calls another project as its final step

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlproject-architecture1.png" style="height: 250px; margin: 20px"/></div>
<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlproject-architecture2.png" style="height: 250px; margin: 20px"/></div>
<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlproject-architecture3.png" style="height: 250px; margin: 20px"/></div>

Create a pre-processing project that reads and saves the data to a new location.  This is just a minimum viable product for a working pipeline.  In practice, this would likely include an ETL stage.

Do this by first creating a new directory for it to live in.

In [35]:
load_path = userhome + "/ml-production/mlflow-data-loading/"

dbutils.fs.rm(load_path, True) # Clears the directory if it already exists
dbutils.fs.mkdirs(load_path)

print("Created directory `{}` to house the project files.".format(load_path))

Now create another `MLproject` file.

In [37]:
dbutils.fs.put(load_path + "/MLproject", 
'''
name: Lesson-3-Data-Loading

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      data_input_path: {type: str, default: "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv"}
    command: "python load.py --data_input_path {data_input_path}"
'''.strip())

Create the environment as well.

In [39]:
dbutils.fs.put(load_path + "/conda.yaml", 
'''
name: Lesson-03
channels:
  - defaults
dependencies:
  - cloudpickle=0.5.3
  - numpy=1.14.3
  - pandas=0.23.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow==0.9.0
'''.strip())

You can test the code below.  It simply takes an input path and logs the related data as an MLflow artifact.

In [41]:
import click
import mlflow

# @click.command()
# @click.option("--data_input_path", default="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", type=str)
def data_load(data_input_path):

  with mlflow.start_run() as run:
    # Log the data
    mlflow.log_artifact(data_input_path, "data-csv-dir")

if __name__ == "__main__":
  data_load("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")
  
dbutils.fs.put(load_path + "/load.py", 
'''
import click
import mlflow

@click.command()
@click.option("--data_input_path", default="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", type=str)
def data_load(data_input_path):

  with mlflow.start_run() as run:
    # Log the data
    mlflow.log_artifact(data_input_path, "data-csv-dir")

if __name__ == "__main__":
  data_load()
'''.strip())

dbutils.fs.ls(load_path)

Now run the data loading code.

In [43]:
submitted_run = mlflow.projects.run(load_path.replace("dbfs:","/dbfs"),
  parameters={
    "data_input_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv"
  })

Get the artifact URI from the MLflow client.

In [45]:
artifact_uri = mlflow.tracking.MlflowClient().get_run(submitted_run.run_id).info.artifact_uri

dbutils.fs.ls(artifact_uri+"/data-csv-dir")

Run the training code using the URI.

In [47]:
mlflow.projects.run(train_path.replace("dbfs:","/dbfs"),
  parameters={
    "data_path": artifact_uri.replace("dbfs:", "/dbfs")+"/data-csv-dir/airbnb-cleaned-mlflow.csv"
})
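
The last three cells (run the loading project, look up its artifact URI, run the training project) can be folded into a single driver function.  The sketch below shows the chaining pattern with the two MLflow calls passed in as arguments, so it runs here with simple stand-ins.  `run_pipeline` and both `fake_*` functions are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace

def run_pipeline(run_project, get_artifact_uri, load_uri, train_uri, data_input_path):
    """Chain a data-loading project into a training project."""
    # Step 1: the loading project logs the CSV as an artifact of its run
    load_run = run_project(load_uri, parameters={"data_input_path": data_input_path})

    # Step 2: find where step 1 stored its artifacts and point training at that copy
    artifact_uri = get_artifact_uri(load_run.run_id)
    file_name = data_input_path.rsplit("/", 1)[-1]
    data_path = artifact_uri.replace("dbfs:", "/dbfs") + "/data-csv-dir/" + file_name
    return run_project(train_uri, parameters={"data_path": data_path})

# Stand-ins so the sketch runs without MLflow; they only record what was requested
calls = []

def fake_run_project(uri, parameters):
    calls.append((uri, parameters))
    return SimpleNamespace(run_id="run-{}".format(len(calls)))

def fake_get_artifact_uri(run_id):
    return "dbfs:/artifacts/" + run_id

final_run = run_pipeline(
    fake_run_project,
    fake_get_artifact_uri,
    load_uri="/dbfs/load",
    train_uri="/dbfs/train",
    data_input_path="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
)
print(calls[1][1]["data_path"])
```

In the notebook, `mlflow.projects.run` would take the place of `fake_run_project`, and a small function wrapping `MlflowClient().get_run(run_id).info.artifact_uri` would take the place of `fake_get_artifact_uri`.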

## Review

**Question:** Why is packaging important?  
**Answer:** Packaging manages not only your code but also the environment in which it was run.  This environment can be a Conda or Docker environment.  This ensures that you have reproducible code and models that can be used in a number of downstream environments.

**Question:** What are the core components of MLflow projects?  
**Answer:** An MLproject file specifies the project components using YAML.  The environment file contains specifics about the environment.  The code itself contains the steps to create a model or process data.

**Question:** What code can I run and where can I run it?  
**Answer:** Arbitrary code can be run in any number of different languages.  It can be run locally or remotely, whether on a remote VM, Spark cluster, or submitted as a Databricks job.

**Question:** How can I manage a pipeline using MLflow?  
**Answer:** Multi-step workflows chain together multiple MLflow jobs, allowing for better encapsulation of steps such as fetching data, ETL, machine learning as a pre-processing step, and the training of the final model.

## Next Steps

Start the next lesson, [Model Management]($./04-Model-Management).

## Additional Topics & Resources

**Q:** Where can I find out more information on MLflow Projects?  
**A:** Check out the <a href="https://www.mlflow.org/docs/latest/projects.html" target="_blank">MLflow docs</a>

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>