
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Lab: Running a Project within a Project

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:
- Check directory for appropriate files
- Create and run a driver-less workflow

In [3]:
%run "./../Includes/Classroom-Setup"

## Check for Appropriate Files

This lab reuses your work from Lesson 3. Run the following two cells to check that your `/user/<username>/ml-production` folder still contains the files created and saved in the 03 notebook.

Under `train_path`, you should have the following 3 files: 
* `MLproject`
* `conda.yaml`
* `train.py`

Under `load_path`, you should have the following 3 files: 
* `MLproject`
* `conda.yaml`
* `load.py`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If these files are not present, [re-run Lesson 3.]($../03-Packaging-ML-Projects )

In [5]:
train_path = userhome + "/ml-production/mlflow-model-training/"

dbutils.fs.ls(train_path)

In [6]:
load_path = userhome + "/ml-production/mlflow-data-loading/"

dbutils.fs.ls(load_path)
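
Instead of comparing the listings by eye, the check can be scripted. The following is a minimal sketch, not part of the original lab: `missing_files` and the two `EXPECTED_*` sets are hypothetical names introduced here, and the hard-coded listing stands in for the names returned by `dbutils.fs.ls` so that the snippet runs outside Databricks.

```python
# Expected file names, taken from the two lists above.
EXPECTED_TRAIN_FILES = {"MLproject", "conda.yaml", "train.py"}
EXPECTED_LOAD_FILES = {"MLproject", "conda.yaml", "load.py"}


def missing_files(listed_names, expected):
    """Return the expected file names absent from a directory listing, sorted."""
    return sorted(set(expected) - set(listed_names))


# In the notebook, the names would come from the real listing, for example:
#   names = [f.name for f in dbutils.fs.ls(train_path)]
# A hard-coded listing is used here (an assumption for illustration only).
example_listing = ["MLproject", "conda.yaml"]
print(missing_files(example_listing, EXPECTED_TRAIN_FILES))  # prints ['train.py']
```

An empty list means every expected file is present; anything else names the files to regenerate by re-running Lesson 3.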

## Create a Driver-less Workflow

At the end of notebook 3, we retrieved the data path artifact saved from the data loading run and used that as the `data_path` parameter of our training code. We did this by making two separate calls to `mlflow.projects.run`.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlproject-architecture3.png" style="height: 250px; margin: 20px"/></div>

Now we want to run and log the same projects through **only one explicit `mlflow.projects.run` call**. In other words, the first project should call the second project:

1. Edit the following `data_load` function to take `train_path` as an additional parameter
2. Have the function run the MLproject saved at `train_path` as its last step

In [8]:
# ANSWER
import mlflow

def data_load(data_input_path, train_path):
  # First run: log the input data as an artifact
  with mlflow.start_run() as run:
    mlflow.log_artifact(data_input_path, "data-csv-dir")

  # Second run: launch the training project with the same data path
  mlflow.projects.run(
    uri=train_path,
    parameters={"data_path": data_input_path}
  )

if __name__ == "__main__":
  # Convert the dbfs:/ URI to the driver's local /dbfs mount path
  data_load("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", train_path.replace("dbfs:", "/dbfs"))

Double-check that the UI correctly logged your two project runs from above (data loading, then training) by comparing them to the last two runs from the end of the 03 notebook.

Then fill in the code below to overwrite the original `load.py` file with the new `data_load(data_input_path, train_path)` function. Be sure to add an appropriate `@click.option` for the new `train_path` parameter.

In [10]:
# ANSWER
dbutils.fs.put(load_path + "load.py",
'''
import click
import mlflow

@click.command()
@click.option("--data_input_path", default="/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv", type=str)
@click.option("--train_path", default="", type=str)
def data_load(data_input_path, train_path):

  # First run: log the input data as an artifact
  with mlflow.start_run() as run:
    mlflow.log_artifact(data_input_path, "data-csv-dir")

  # Second run: launch the training project with the same data path
  mlflow.projects.run(
    uri=train_path,
    parameters={"data_path": data_input_path}
  )

if __name__ == "__main__":
  data_load()

'''.strip(), overwrite = True)

dbutils.fs.ls(load_path)
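
Because `--train_path` defaults to an empty string, running the project without that option would hand an empty URI to `mlflow.projects.run`. One possible guard is sketched below, assuming you would rather fail early with a clear message. `validate_train_path` is a hypothetical helper introduced here (it is not part of MLflow or click), and it assumes `train_path` is a local directory path such as one under `/dbfs`.

```python
import os


def validate_train_path(train_path: str) -> str:
    """Return train_path unchanged if it points at a project directory.

    Raises ValueError when the path is empty or contains no MLproject file.
    """
    if not train_path:
        raise ValueError("train_path is empty; pass --train_path <project directory>")
    if not os.path.isfile(os.path.join(train_path, "MLproject")):
        raise ValueError(f"no MLproject file found under {train_path!r}")
    return train_path
```

Inside `data_load`, such a check could run before the call to `mlflow.projects.run`, so that a missing option surfaces before any run is started.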

Lastly, fill in the following single `mlflow.projects.run` call to run the data loading project directly. That project should then invoke the training code, all on the driver node of our Spark cluster.

In [12]:
# ANSWER

mlflow.projects.run(load_path.replace("dbfs:", "/dbfs"),
  parameters={
    "data_input_path": "/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv",
    "train_path": train_path.replace("dbfs:", "/dbfs")
})

Check that these logged runs also show up properly on the MLflow UI.
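
The `.replace("dbfs:", "/dbfs")` calls above turn a `dbfs:/...` URI into the matching path under the driver's local `/dbfs` mount. A slightly stricter variant is sketched below; `to_local_path` is a hypothetical helper name, and the example path is made up for illustration. It rewrites only a leading `dbfs:` scheme, so a path that is already local passes through unchanged.

```python
def to_local_path(dbfs_uri: str) -> str:
    """Map a dbfs:/ URI to its path under the local /dbfs mount.

    Paths without the dbfs: prefix are returned unchanged.
    """
    prefix = "dbfs:"
    if dbfs_uri.startswith(prefix):
        return "/dbfs" + dbfs_uri[len(prefix):]
    return dbfs_uri


# Hypothetical example path, for illustration only.
print(to_local_path("dbfs:/user/someone/ml-production/mlflow-data-loading/"))
# prints /dbfs/user/someone/ml-production/mlflow-data-loading/
```

Unlike a plain `str.replace`, this version never touches a `dbfs:` substring that appears later in the path.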

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>