dennyglee and smurching Updated multistep workflow example to include diagram (#581)
Include tutorial image for multistep workflow
Latest commit c715f20 Oct 8, 2018

README.rst

Multistep Workflow Example

This MLproject aims to be a fully self-contained example of how to chain together multiple different MLflow runs which each encapsulate a transformation or training step, allowing a clear definition of the interface between the steps, as well as allowing for caching and reuse of the intermediate results.

At a high level, our goal is to predict users' ratings of movie given a history of their ratings for other movies. This example is based on this webinar by @brookewenig and @smurching.

../../docs/source/_static/images/tutorial-multistep-workflow.png?raw=true

There are four steps to this workflow:

  • load_raw_data.py: Downloads the MovieLens dataset (a set of triples of user id, movie id, and rating) as a CSV and puts it into the artifact store.
  • etl_data.py: Converts the MovieLens CSV from the previous step into Parquet, dropping unnecessary columns along the way. This reduces the input size from 500 MB to 49 MB, and allows columnar access of the data.
  • als.py: Runs Alternating Least Squares for collaborative filtering on the Parquet version of MovieLens to estimate the movieFactors and userFactors. This produces a relatively accurate estimator.
  • train_keras.py: Trains a neural network on the original data, supplemented by the ALS movie/userFactors -- we hope this can improve upon the ALS estimations.

While we can run each of these steps manually, here we have a driver run, defined as main (main.py). This run will run the steps in order, passing the results of one to the next. Additionally, this run will attempt to determine if a sub-run has already been executed successfully with the same parameters and, if so, reuse the cached results.

Running this Example

In order for the multistep workflow to find the other steps, you must execute mlflow run from this directory. So, in order to find out if the Keras model does in fact improve upon the ALS model, you can simply run:

cd example/multistep
mlflow run .

This will download and transform the MovieLens dataset, train an ALS model, and then train a Keras model -- you can compare the results by using mlflow ui!

You can also try changing the number of ALS iterations or Keras hidden units:

mlflow run . -P als_max_iter=20 -P keras_hidden_units=50