In [None]:
## FILL IN YOUR NAME
NAME = "example"

# Running MLflow Projects

Any local directory or Git repository can be treated as an MLflow project. Let's run [an example project](https://github.com/mlflow/mlflow-example) from the official MLflow GitHub organization.

There are two ways to run the project:

- Using the CLI: `mlflow run <uri>`
- Using the Python API: `mlflow.projects.run()`
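
For comparison, the CLI form of a run can be sketched as below. This is only an illustration: `build_cli_command` is a hypothetical helper introduced here, and it merely assembles the argument list for `mlflow run`, passing each project parameter with the `-P key=value` flag. It does not execute anything, and environment-related options are left out.

```python
from typing import Dict, List


def build_cli_command(project_uri: str, parameters: Dict[str, object]) -> List[str]:
    """Assemble the `mlflow run` argument list (hypothetical helper, illustration only)."""
    command = ["mlflow", "run", project_uri]
    for key, value in parameters.items():
        # Each project parameter is passed as its own -P key=value pair.
        command += ["-P", f"{key}={value}"]
    return command


command = build_cli_command(
    "https://github.com/mlflow/mlflow-example",
    {"alpha": 0.5, "l1_ratio": 0.01},
)
print(" ".join(command))
```

The printed line can be pasted into a shell where MLflow is installed.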

For this example, we will use the Python API.

In [None]:
import mlflow

%load_ext dotenv
%dotenv

project_uri = "https://github.com/mlflow/mlflow-example"
params = {"alpha": 0.5, "l1_ratio": 0.01}

# Run the MLflow project in the current environment (use_conda=False skips creating a conda environment)
submitted_run = mlflow.run(project_uri, parameters=params, use_conda=False)

In [None]:
submitted_run

In [None]:
submitted_run.run_id

# Retrieving Run Details

Using the `submitted_run` object, we can retrieve the details of the run that we just submitted. To do so, we will use the [Python API](https://www.mlflow.org/docs/latest/python_api/mlflow.tracking.html). In particular, we are interested in the path to the run's artifacts, because we will need it later.
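
As a rough sketch of which fields are involved, the helper below collects a few attributes from a run object: `info.run_id`, `info.status`, `info.artifact_uri` and `data.params`. The name `summarize_run` is hypothetical, and the stand-in object with made-up values exists only so that the sketch runs without a tracking server.

```python
from types import SimpleNamespace


def summarize_run(run) -> dict:
    """Collect a few fields from a run object (hypothetical helper)."""
    return {
        "run_id": run.info.run_id,
        "status": run.info.status,
        "artifact_uri": run.info.artifact_uri,
        "params": dict(run.data.params),
    }


# Stand-in for a real run; every value here is made up for illustration.
fake_run = SimpleNamespace(
    info=SimpleNamespace(
        run_id="abc123",
        status="FINISHED",
        artifact_uri="file:///tmp/mlruns/0/abc123/artifacts",
    ),
    data=SimpleNamespace(params={"alpha": "0.5", "l1_ratio": "0.01"}),
)

summary = summarize_run(fake_run)
print(summary["artifact_uri"])
```

With a real run, the object returned by `client.get_run(...)` would take the place of `fake_run`.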

In [None]:
from mlflow.tracking import MlflowClient

# retrieve the run by using the MLflow client
client = MlflowClient()
run = client.get_run(submitted_run.run_id)

In [None]:
# inspect the info about the run
run.info

In [None]:
# retrieve the run's artifacts path
run.info.artifact_uri

## What did we just learn?

* It is possible to run projects easily by using the Python API.
* Projects can be stored as local folders or Git repositories.
* After running a project we can use the `mlflow.tracking` module to retrieve all the information about the run.

This will be useful for the next exercise.

# Defining ML pipelines with MLflow


MLflow allows us to chain together multiple runs. Each run encapsulates a transformation or training step. For this exercise, we will run the following ML pipeline using MLflow Projects:

![multistep-workflow](https://github.com/mlflow/mlflow/raw/master/docs/source/_static/images/tutorial-multistep-workflow.png?raw=true)

There are four entry points that make up the pipeline:

* **load_raw_data.py**: Downloads the MovieLens dataset (a set of triples of user id, movie id, and rating) as a CSV and puts it into the artifact store.
* **etl_data.py**: Converts the MovieLens CSV from the previous step into Parquet, dropping unnecessary columns along the way. This reduces the input size from 500 MB to 49 MB, and allows columnar access of the data.
* **als.py**: Runs Alternating Least Squares for collaborative filtering on the Parquet version of MovieLens to estimate the movieFactors and userFactors. This produces a relatively accurate estimator.
* **train_keras.py**: Trains a neural network on the original data, supplemented by the ALS movie/userFactors -- we hope this can improve upon the ALS estimations.
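
The dependencies between these steps can be written down explicitly. The sketch below is only an illustration and not part of the project: it maps each entry point to the steps whose output it consumes, then uses the standard library's `graphlib` (Python 3.9+) to derive a valid execution order.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose output it consumes.
dependencies = {
    "load_raw_data": set(),
    "etl_data": {"load_raw_data"},
    "als": {"etl_data"},
    "train_keras": {"etl_data", "als"},
}

# static_order() yields every step after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because `train_keras` needs both the Parquet data and the ALS model, this graph admits only one order, and it matches the order of the list above.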

### Example: multi-step workflow

While we could run each of these steps manually, here we use a **driver run**, defined as the function `mlflow_pipeline` below. It runs the steps in order, passing the results of one step to the next.
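
Passing a result to the next step comes down to building the location of an artifact directory underneath the previous run's artifact URI. A minimal sketch follows, with two assumptions: `artifact_subpath` is a hypothetical helper, and the example URI is made up. It uses `posixpath.join` rather than `os.path.join` so that the separator is always `/`, whatever the operating system.

```python
import posixpath


def artifact_subpath(artifact_uri: str, name: str) -> str:
    """Join an artifact URI and a sub-directory name with forward slashes (hypothetical helper)."""
    return posixpath.join(artifact_uri, name)


# Made-up artifact URI; a real one would come from run.info.artifact_uri.
ratings_csv_uri = artifact_subpath("file:///tmp/mlruns/1/abc123/artifacts", "ratings-csv-dir")
print(ratings_csv_uri)
```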

We provide an auxiliary function that, given an entry point and some parameters, launches a run using MLflow's Python API.

In [None]:
def _run(entrypoint: str, parameters: dict, project_dir: str = '../mlflow-project/'):
    """Launches an entry point by providing the given parameters."""
    
    print("Launching new run for entrypoint=%s and parameters=%s" % (entrypoint, parameters))
    submitted_run = mlflow.run(project_dir, 
                               entrypoint, 
                               parameters=parameters,
                               use_conda=False,
                               storage_dir="../../data/")
    
    client = mlflow.tracking.MlflowClient()
    return client.get_run(submitted_run.run_id)

Here are some tips in case you want to implement the code on your own:

* You can use the provided method `_run` to execute each step of the pipeline
* Make sure you are passing the correct values for `entrypoint` and `parameters`.
* The entry point names and input parameters are defined in the MLproject file located in the folder `mlflow-project`.
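
A typo in an entry point or parameter name only surfaces once the run is launched, so it can help to check the names first. The sketch below makes one loud assumption: the table `EXPECTED_PARAMETERS` is hand-copied from the parameter names used by the driver function in this notebook, not read from the MLproject file, which remains the source of truth. The helper `check_parameters` is hypothetical. It only flags unknown names, since an entry point may define further parameters with defaults.

```python
from typing import Dict, Set

# Hand-copied table (an assumption); the MLproject file is the real source of truth.
EXPECTED_PARAMETERS: Dict[str, Set[str]] = {
    "load_raw_data": set(),
    "etl_data": {"ratings_csv", "max_row_limit"},
    "als": {"ratings_data", "max_iter"},
    "train_keras": {"ratings_data", "als_model_uri", "hidden_units"},
}


def check_parameters(entrypoint: str, parameters: dict) -> None:
    """Raise ValueError for an unknown entry point or unknown parameter names."""
    if entrypoint not in EXPECTED_PARAMETERS:
        raise ValueError(f"Unknown entry point: {entrypoint!r}")
    unknown = set(parameters) - EXPECTED_PARAMETERS[entrypoint]
    if unknown:
        raise ValueError(f"Unexpected parameters for {entrypoint!r}: {sorted(unknown)}")


# Passes silently: both names are known for this entry point.
check_parameters("etl_data", {"ratings_csv": "some/uri", "max_row_limit": 100000})
```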

In [None]:
import os
import mlflow

%load_ext dotenv
%dotenv

# set experiment
experiment_name = f"pipeline-{NAME}"
mlflow.set_experiment(experiment_name)

def mlflow_pipeline(als_max_iter, keras_hidden_units, max_row_limit):
    """Driver run: launches the four pipeline steps in order, passing artifact URIs between them."""

    with mlflow.start_run() as active_run:
        os.environ["SPARK_CONF_DIR"] = os.path.abspath(".")

        # Step 1: download the MovieLens ratings as CSV
        load_raw_data_run = _run("load_raw_data", {})
        ratings_csv_uri = os.path.join(load_raw_data_run.info.artifact_uri, "ratings-csv-dir")

        # Step 2: convert the CSV into Parquet
        etl_data_run = _run("etl_data", {"ratings_csv": ratings_csv_uri, "max_row_limit": max_row_limit})
        ratings_parquet_uri = os.path.join(etl_data_run.info.artifact_uri, "ratings-parquet-dir")

        # Step 3: fit the ALS model on the Parquet data
        als_run = _run("als", {"ratings_data": ratings_parquet_uri, "max_iter": str(als_max_iter)})
        als_model_uri = os.path.join(als_run.info.artifact_uri, "als-model")

        # Step 4: train the neural network on the ratings plus the ALS factors
        keras_params = {
            "ratings_data": ratings_parquet_uri,
            "als_model_uri": als_model_uri,
            "hidden_units": keras_hidden_units,
        }
        train_keras_run = _run("train_keras", keras_params)


After completing the code, run the next cell and go to the MLflow UI to check the results :)

In [None]:
# once you have finished the function `mlflow_pipeline`, run this line!
mlflow_pipeline(als_max_iter=10, keras_hidden_units=20, max_row_limit=100000)