# Lesson 1

## Before this lesson
* Clone this repository
    * `git clone https://github.com/outerbounds/tutorials.git`
* [Install conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
* Install requirements
    * From the root of this repository run `conda env create -f env.yml`
* Open this notebook
    * `jupyter notebook nbs/intro-to-mf/01-ML-DAG-Workflows.ipynb`
    
    
    
[HBA: lets discuss whether we need learners to do the above or if there are other ways (if we do, let's have a super minimal env and use mamba also?). the current tutorials are cool because there's minimal friction in running them]


[HBA: I think we're homing in on a good structure for lessons and episodes but next let's be really explicit; e.g. each sub-episode has i) showcasing, ii) brief context/description, iii) the flow, iv) execution, then v) interpretation or closing. something like that. whatever we decide works best but let's align and be explicit]

## Episode 1. Metaflow Fundamentals

[HBA: current prose is a first pass so don't overindex feedback on it; we'll also link out to how tos and core concepts]

### Showcasing
- DAGs and `metaflow.FlowSpec`
- Decorators and `metaflow.step`
- Metaflow Artifacts
- Metaflow Parameters

Metaflow is a tool for data scientists. It helps you efficiently access the environments, data, and infrastructure you need to get data science jobs done. Metaflow delivers these benefits by giving your data science workflows a consistent structure: a directed acyclic graph (DAG). In addition to providing mechanisms to encode your workflow as a DAG, Metaflow offers many design patterns relevant to machine learning workflows. The next sections introduce the Metaflow DAG, show you how to structure a Metaflow flow, and then introduce several common Metaflow constructs that build on this structure.

### 1a. FlowSpec and step

A DAG is a directed graph with no cycles: every edge points in one direction, and following the edges never leads back to a node you have already visited. In Metaflow, you create a DAG by subclassing `metaflow.FlowSpec`. The nodes of the DAG are the methods of your `FlowSpec` subclass that are annotated with the `@step` decorator. Every flow you create must contain a `start` and an `end` step. Here you can see an example of a minimal Metaflow flow:

In [57]:
%%writefile minimum_flow.py
from metaflow import FlowSpec, step

class MinimumFlow(FlowSpec):
    
    @step
    def start(self):
        self.next(self.end)
    
    @step
    def end(self):
        print("Flow is done!")

if __name__ == "__main__":
    MinimumFlow()

Overwriting minimum_flow.py


[HBA: Let's add a note about writing flows in scripts using your fave text editor and then executing from CLI, as we do below]

Notice that each step method takes the `MinimumFlow` object itself as its `self` argument and uses it to define the structure of the DAG. This is done by calling `self.next(self.next_step)` at the end of a step, where `next_step` is the name of the step to run next.
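
To picture the structure that these `self.next` calls describe, the following sketch uses Python's standard-library `graphlib` module (Python 3.9 or later) rather than Metaflow itself. It is only an illustration: the `middle` step and the dictionary layout are assumptions made for this example, and Metaflow builds and validates its graph with its own machinery.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; each value is the set of steps that must finish first.
# `middle` is a hypothetical extra step between `start` and `end`.
dag = {"start": set(), "middle": {"start"}, "end": {"middle"}}

# A DAG always has a valid execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)

# If `start` also waited on `end`, the graph would contain a cycle
# and no valid execution order would exist.
cyclic = {"start": {"end"}, "end": {"start"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

The first graph can be ordered because every edge points forward; the second cannot, which is why a workflow has to be acyclic before it can be scheduled.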

The flow can be run from the command line: 

In [58]:
! python minimum_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mMinimumFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:37:14.040 [0m[1mWorkflow starting (run-id 1659418634035942):[0m
[35m2022-08-02 00:37:14.097 [0m[32m[1659418634035942/start/1 (pid 4271)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:14.461 [0m[32m[1659418634035942/start/1 (pid 4271)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:14.469 [0m[32m[1659418634035942/end/2 (pid 4275)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:14.793 [0m[32m[1659418634035942/end/2 (pid 4275)] [0m[22mFlow is done![0m
[35m2022-08-02 00:37:14.840 [0m[32m[1659418634035942/end/2 (pid 4275)] [0m[1mTask finished successfully.[0m


[HBA: can we include some notes on how to read the above output, what it means, and why it is useful?]

For now, just remember that the general command is `python <FLOW SCRIPT> run`. Later you will see more ways to interact with your flows from the command line. Stay tuned! 
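
Each task line in the output of the run above follows the pattern `timestamp [run-id/step-name/task-id (pid N)] message`. As a rough sketch, the standard-library snippet below pulls those fields out of one line copied from that run, with terminal color codes removed. The regular expression and the name `LOG_LINE` are assumptions based on the lines shown in this lesson, not a format that Metaflow documents or guarantees.

```python
import re

# Hypothetical pattern inferred from the console output shown in this lesson.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<run_id>\d+)/(?P<step>\w+)/(?P<task_id>\d+) \(pid (?P<pid>\d+)\)\] "
    r"(?P<message>.*)$"
)

line = (
    "2022-08-02 00:37:14.097 "
    "[1659418634035942/start/1 (pid 4271)] Task is starting."
)

# Split the line into its named parts.
fields = LOG_LINE.match(line).groupdict()
for name, value in fields.items():
    print("{:>9}: {}".format(name, value))
```

Reading the fields this way shows which run and step a message belongs to, and which operating system process executed the task.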

Bonus points: Try adding another step to the previous flow in addition to `start` and `end`. Don't forget the `@step` decorator.

### 1b. Decorators

Using Metaflow requires the use of decorators. In Python, a decorator is a function that takes another function and extends its behavior without modifying it directly. In episode 1a you saw Metaflow's `@step` decorator in action. This is just the beginning. There are many decorators built into Metaflow, and community members have built more as plugins. You don't have to understand all of these now, but keep in mind that there is a wide variety of decorators you can use.

For example, at the step level there are decorators for use cases including:
* `@conda` for dependency management of a single step's environment
* `@batch` or `@kubernetes` to run a step on AWS Batch or Kubernetes

There are also flow-level decorators such as:
* `@conda_base` for dependency management of a flow's environment
* `@schedule` to run jobs automatically on a production orchestrator

You can view a list of all step decorators [here](https://docs.metaflow.org/api/step-decorators) and all flow decorators [here](https://docs.metaflow.org/api/flow-decorators).
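
To make the idea of a decorator concrete, here is a minimal plain-Python sketch that does not use Metaflow. The `announce` decorator and the `add` function are hypothetical names invented for this illustration; Metaflow's own decorators, such as `@step`, do considerably more.

```python
import functools


def announce(func):
    """Hypothetical decorator: print a message before and after calling func."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("Starting {}".format(func.__name__))
        result = func(*args, **kwargs)
        print("Finished {}".format(func.__name__))
        return result

    return wrapper


@announce
def add(a, b):
    """Add two numbers; note that the body knows nothing about the logging."""
    return a + b


total = add(2, 3)
print(total)
```

The body of `add` is unchanged, yet calling it now produces the extra messages. This is the same mechanism that lets a decorator attach behavior to a step without changing the step's code.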

[HBA: this sub-episode doesn't have any executable code or flows written so let's consider adding some or merging this into 1a]

### 1c. Flow Artifacts

[HBA: let's make sure to explicitly define what an artifact is]
[HBA: part of me wonders whether we want to spice up this sub-episode by having real data or an actual ML model. Just an idea]

Machine learning is all about the data. In Metaflow, an artifact is any value that you assign to an attribute of `self`, the flow object, inside a step. This section shows how the state of an artifact changes as a flow runs. When you store a value on `self`, Metaflow automatically serializes it and makes it available to the downstream steps of your flow. This is especially useful when you run different steps of the flow on different computers.

Here is a flow that creates an artifact on `self` and then updates it in later steps:

In [59]:
%%writefile artifact_flow.py
from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):
    
    @step
    def start(self):
        self.data_artifact = 1 # create `data_artifact`
        self.next(self.middle)
        
    @step
    def middle(self):
        self.data_artifact = 3 # update `data_artifact`
        self.next(self.end)
    
    @step
    def end(self):
        self.data_artifact += 1 # update `data_artifact`
        print("Artifact is {}".format(self.data_artifact))

if __name__ == "__main__":
    ArtifactFlow()

Overwriting artifact_flow.py


In [60]:
! python artifact_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mArtifactFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:37:17.622 [0m[1mWorkflow starting (run-id 1659418637598266):[0m
[35m2022-08-02 00:37:17.629 [0m[32m[1659418637598266/start/1 (pid 4281)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:17.992 [0m[32m[1659418637598266/start/1 (pid 4281)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:18.000 [0m[32m[1659418637598266/middle/2 (pid 4284)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:18.366 [0m[32m[1659418637598266/middle/2 (pid 4284)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:18.374 [0m[32m[1659418637598266/end/3 (pid 4287)] [0m[1mTask is star

Note that although the artifact pattern covers many cases, it is not meant for storing big data. If you have a large dataset (e.g., a training dataset of many images), consider using [Metaflow's S3 utilities](https://docs.metaflow.org/api/S3) instead.
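
To build intuition for what it means for artifacts to be serialized and handed from one step to the next, here is a small standard-library sketch that mimics the three steps of `ArtifactFlow`. It assumes pickle-style serialization to local files, and `save_artifacts` and `load_artifacts` are hypothetical helpers written for this illustration. Metaflow's real datastore is more involved, so treat this as an analogy rather than a description of its internals.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(directory, step_name, artifacts):
    """Hypothetical helper: persist one step's artifacts to its own file."""
    path = Path(directory) / "{}.pkl".format(step_name)
    path.write_bytes(pickle.dumps(artifacts))
    return path


def load_artifacts(path):
    """Hypothetical helper: read back the artifacts a step saved."""
    return pickle.loads(Path(path).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    # start: create `data_artifact`
    start_path = save_artifacts(tmp, "start", {"data_artifact": 1})

    # middle: load the upstream state, update it, and save a new snapshot
    state = load_artifacts(start_path)
    state["data_artifact"] = 3
    middle_path = save_artifacts(tmp, "middle", state)

    # end: load the state from `middle` and update it again
    state = load_artifacts(middle_path)
    state["data_artifact"] += 1
    final_value = state["data_artifact"]

    # The snapshot written by `start` is untouched by the later steps.
    start_value = load_artifacts(start_path)["data_artifact"]

print("Artifact is {}".format(final_value))
print("Value saved by start was {}".format(start_value))
```

Because each step reads its inputs from storage instead of from shared memory, the steps could run in separate processes or on separate machines and still see the same values.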

### 1d. Flow Parameters

You can also pass data into flows as a `Parameter`. You can use these to parameterize any aspect of your flow. Each parameter can have a default value, which you can override from the command line when you (or a scheduler) run the flow. Here is the `MinimumFlow` example with a `Parameter` added to the mix.

[HBA: it's not quite clear why this would be useful yet. We could say a few words about when you would want to use and/or have an actual ML example (see above also)]

In [61]:
%%writefile parameter_flow.py
from metaflow import FlowSpec, step, Parameter

class ParameterizedFlow(FlowSpec):
    
    my_param = Parameter('my-param', default=999)
    
    @step
    def start(self):
        self.next(self.end)
    
    @step
    def end(self):
        print("Parameter value is {}".format(self.my_param))

if __name__ == "__main__":
    ParameterizedFlow()

Overwriting parameter_flow.py


The flow can be run like before, in which case the default value is used:

In [62]:
! python parameter_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mParameterizedFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:37:22.613 [0m[1mWorkflow starting (run-id 1659418642608733):[0m
[35m2022-08-02 00:37:22.620 [0m[32m[1659418642608733/start/1 (pid 4293)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:22.990 [0m[32m[1659418642608733/start/1 (pid 4293)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:22.998 [0m[32m[1659418642608733/end/2 (pid 4297)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:23.323 [0m[32m[1659418642608733/end/2 (pid 4297)] [0m[22mParameter value is 999[0m
[35m2022-08-02 00:37:23.369 [0m[32m[1659418642608733/end/2 (pid 4297)] [0m[1mTask finished suc

Or you can override the default value at run time:

In [63]:
! python parameter_flow.py run --my-param 123

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mParameterizedFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:37:25.853 [0m[1mWorkflow starting (run-id 1659418645848463):[0m
[35m2022-08-02 00:37:25.861 [0m[32m[1659418645848463/start/1 (pid 4304)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:26.228 [0m[32m[1659418645848463/start/1 (pid 4304)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:26.235 [0m[32m[1659418645848463/end/2 (pid 4307)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:26.565 [0m[32m[1659418645848463/end/2 (pid 4307)] [0m[22mParameter value is 123[0m
[35m2022-08-02 00:37:26.611 [0m[32m[1659418645848463/end/2 (pid 4307)] [0m[1mTask finished suc

## Episode 2. Running Flows

### Showcasing
- Machine learning workflows
- Branching

This episode demonstrates a more realistic machine learning job. You will see a flow that trains two models on the same classification task, one Random Forest and one Gradient Boosted Trees model. 

[HBA: we could split this into 2-3 sub-episodes? e.g. ML model then branching? the previous episode contains 4 sections and this has one]
[HBA: perhaps have "Machine Learning" in title of episode?]

### Tree Models Flow

[HBA: this h3 could have a few more words]

The flow has the following structure:
* Parameter values are defined at the beginning of the class.
    * Defaults can be overridden using command line arguments as shown in episode 1d.
* The `start` step loads and splits a dataset to be used in downstream tasks.
    * The dataset for this task is small, so we can store it in `self` without introducing much copying and storage overhead.
    * Notice that this step calls two downstream steps in `self.next(self.train_rf, self.train_xgb)`. This is called branching in Metaflow. These "branching" steps are run in parallel. 
* The `train_rf` step fits a `sklearn.ensemble.RandomForestClassifier` for the classification task. 
* The `train_xgb` step fits a `xgboost.XGBClassifier` for the classification task. 
* The `score` step evaluates each classifier on a held-out test dataset.
* The `end` step prints the accuracy scores for each classifier.
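
The branch-and-join shape described in the list above can be pictured with a small standard-library sketch. It uses threads and toy functions purely for illustration: `train_a`, `train_b`, and `join` are hypothetical stand-ins for `train_rf`, `train_xgb`, and `score`, and Metaflow runs its branches as separate tasks rather than as threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def start():
    # Hypothetical stand-in for loading and splitting a dataset.
    return {"X_train": [1, 2, 3, 4], "y_train": [0, 0, 1, 1]}


def train_a(data):
    # Stand-in for one branch, such as `train_rf`.
    return {"model": "a", "n_rows": len(data["X_train"])}


def train_b(data):
    # Stand-in for the other branch, such as `train_xgb`.
    return {"model": "b", "n_rows": len(data["X_train"])}


def join(inputs):
    # Like a join step: it receives one result per branch.
    return [branch["model"] for branch in inputs]


data = start()

# Fan out: both branches receive the same upstream data and run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fn, data) for fn in (train_a, train_b)]
    # Collect results in submission order, not completion order.
    inputs = [future.result() for future in futures]

# Fan in: the join step sees the output of every branch.
names = join(inputs)
print(names)
```

The key idea is that neither branch depends on the other, so they can proceed independently, while the join waits for all of them before continuing.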


[HBA: the following flow is long; one aspect it will be important for us to discuss is whether we want ALL the code on the tutorial page or only some of it and then to point people to the repo]

In [64]:
%%writefile tree_models_flow.py
from metaflow import FlowSpec, step, Parameter

class TreeModelsFlow(FlowSpec):

    test_size = Parameter("tst-sz", default=0.2)
    random_state = Parameter("seed", default=42)
    n_estimators = Parameter("n-est", default=10)
    min_samples_split = Parameter("min-samples", default=2)
    eval_metric = Parameter("eval-metric", default='mlogloss')

    @step
    def start(self):
        from sklearn import datasets
        from sklearn.model_selection import train_test_split
        iris = datasets.load_iris()
        self.X = iris['data']
        self.y = iris['target']
        data = train_test_split(self.X, self.y, 
                                test_size=self.test_size, 
                                random_state=self.random_state)
        self.X_train = data[0]
        self.X_test = data[1]
        self.y_train = data[2]
        self.y_test = data[3]
        self.next(self.train_rf, self.train_xgb)

    @step
    def train_rf(self):
        from sklearn.ensemble import RandomForestClassifier
        self.clf = RandomForestClassifier(n_estimators=self.n_estimators,
                                          min_samples_split=self.min_samples_split, 
                                          random_state=self.random_state)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def train_xgb(self):
        from xgboost import XGBClassifier
        self.clf = XGBClassifier(n_estimators=self.n_estimators,
                                 random_state=self.random_state,
                                 eval_metric=self.eval_metric,
                                 use_label_encoder=False)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def score(self, inputs):
        self.merge_artifacts(inputs, include=["X_test", "y_test"])
        self.accuracies = [
            train_step.clf.score(self.X_test, self.y_test)
            for train_step in inputs
        ]
        self.next(self.end)

    @step
    def end(self):
        self.model_names = ["Random Forest", "XGBoost"]
        for name, acc in zip(self.model_names, self.accuracies):
            print("{} Model Accuracy: {}%".format(
                name, round(100*acc, 3)))

if __name__ == "__main__":
    TreeModelsFlow()

Overwriting tree_models_flow.py


In [65]:
! python tree_models_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mTreeModelsFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:37:31.315 [0m[1mWorkflow starting (run-id 1659418651310098):[0m
[35m2022-08-02 00:37:31.326 [0m[32m[1659418651310098/start/1 (pid 4313)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:32.356 [0m[32m[1659418651310098/start/1 (pid 4313)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:37:32.366 [0m[32m[1659418651310098/train_rf/2 (pid 4316)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:32.377 [0m[32m[1659418651310098/train_xgb/3 (pid 4317)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:37:33.312 [0m[32m[1659418651310098/train_rf/2 (pid 4316)] [0m[1mTask finis

## Episode 3. Visualizing Flows with Cards

### Showcasing
- Neural net workflows
- Metaflow cards

Data visualization is a crucial aspect of communicating machine learning results successfully. Metaflow offers several utilities to help you in this regard. The quickest way to start viewing your flow artifacts and flow structures is to use cards. These are step-level entities that let you visualize plots, data tables, HTML, and more in your browser. You can read more about cards [here](https://docs.metaflow.org/metaflow/visualizing-results#what-are-cards). In this episode, you will see a new model type trained on the same classification task as in episode 2.

### Neural Net Flow

The flow has the following structure:
* Parameter values are defined at the beginning of the class.
    * Defaults can be overridden using command line arguments as shown in episode 1d.
* The `start` step loads and splits a dataset to be used in downstream tasks.
* The `scale_features` step normalizes the feature data. 
* The `visualize_feature_distributions` step makes a `matplotlib.Figure` that is appended to the step's card.
    * Later, we will use this step name in the command line to visualize the card for this step.
    * The figure produced compares the training and testing features before and after scaling.
* The `train` step fits the neural net.
* The `score` step evaluates the model accuracy.
* The `end` step prints the accuracy of the model. 
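
The normalization performed in `scale_features` can be worked through by hand. With its default settings, `StandardScaler` subtracts each feature's mean and divides by its standard deviation, and it learns both statistics from the training data only before reusing them on the test data. The sketch below reproduces that arithmetic for a single feature with the standard library. The helpers `fit_scaler` and `transform` and the toy numbers are assumptions made for this illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Hypothetical helper: learn the mean and population std of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Hypothetical helper: apply (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training values for one feature
test = [3.0, 6.0]                  # toy test values for the same feature

# Learn the statistics from the training data only.
mean, std = fit_scaler(train)

# Reuse the training statistics for both splits.
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print("train mean after scaling:", fmean(train_scaled))
print("train std after scaling:", pstdev(train_scaled))
print("test after scaling:", test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally do not, because they are scaled with the training statistics, and that is the comparison the card's histograms are meant to show.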

[HBA: same question as above about including all code in tutorial or not]

[HBA: perhaps include utils.py for boilerplate?]

In [69]:
%%writefile neural_net_flow.py
from metaflow import FlowSpec, step, Parameter, card, current
from metaflow.cards import Image
from tensorflow import keras

def build_model(hidden_layer_dim, meta):
    # meta is a scikeras argument that will be
    # handed a dict containing input metadata
    n_features_in_ = meta["n_features_in_"]
    X_shape_ = meta["X_shape_"]
    n_classes_ = meta["n_classes_"]
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(n_features_in_, 
                                 input_shape=X_shape_[1:]))
    model.add(keras.layers.Activation("relu"))
    model.add(keras.layers.Dense(hidden_layer_dim))
    model.add(keras.layers.Activation("relu"))
    model.add(keras.layers.Dense(n_classes_))
    model.add(keras.layers.Activation("softmax"))
    return model

class NeuralNetFlow(FlowSpec):

    test_size = Parameter("tst-sz", default=0.2)
    random_state = Parameter("seed", default=42)
    hidden_layer_dim = Parameter("hidden-dim", default=100)
    epochs = Parameter("epochs", default=200)
    loss_fn = Parameter("loss", default='categorical_crossentropy')

    @step
    def start(self):
        from sklearn import datasets
        from sklearn.model_selection import train_test_split
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        data = train_test_split(self.X, self.y, 
                                test_size=self.test_size, 
                                random_state=self.random_state)
        self.X_train = data[0]
        self.X_test = data[1]
        self.y_train = data[2]
        self.y_test = data[3]
        self.next(self.scale_features)

    @step
    def scale_features(self):
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        self.X_train_scaled = scaler.fit_transform(self.X_train)
        self.X_test_scaled = scaler.transform(self.X_test)
        self.next(self.visualize_feature_distributions)

    @card
    @step
    def visualize_feature_distributions(self):
        import matplotlib.pyplot as plt
        n_features = self.X_train.shape[1]
        assert n_features == self.X_test.shape[1], "Train and test feature dimensions are not the same!"
        feature_datasets = [self.X_train, self.X_train_scaled, self.X_test, self.X_test_scaled]
        n_bins = 10
        fig, axs = plt.subplots(len(feature_datasets), n_features, figsize=(16,16))
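        # Row i is one dataset; column j is one feature.
        dataset_names = ["X train", "X train scaled", "X test", "X test scaled"]
        for i, data in enumerate(feature_datasets):
            for j in range(n_features):
                axs[i, j].hist(data[:, j], bins=n_bins)
                axs[i, j].set_title("{} - {}".format(dataset_names[i], self.iris['feature_names'][j]))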
        current.card.append(Image.from_matplotlib(fig))
        self.next(self.train)

    @step
    def train(self):
        from scikeras.wrappers import KerasClassifier
        self.clf = KerasClassifier(build_model, 
                                   loss=self.loss_fn,
                                   hidden_layer_dim=self.hidden_layer_dim,
                                   epochs=self.epochs,
                                   verbose=0)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def score(self):
        self.accuracy = self.clf.score(self.X_test, self.y_test)
        self.next(self.end)

    @step
    def end(self):
        print("Neural Net Model Accuracy: {}%".format(round(100*self.accuracy, 3)))

if __name__ == "__main__":
    NeuralNetFlow()

Overwriting neural_net_flow.py


A flow that uses cards is run in the same way as any other flow:

In [70]:
! python neural_net_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mNeuralNetFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:38:01.522 [0m[1mWorkflow starting (run-id 1659418681517256):[0m
[35m2022-08-02 00:38:01.540 [0m[32m[1659418681517256/start/1 (pid 4334)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:38:04.476 [0m[32m[1659418681517256/start/1 (pid 4334)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:38:04.489 [0m[32m[1659418681517256/scale_features/2 (pid 4337)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:38:06.813 [0m[32m[1659418681517256/scale_features/2 (pid 4337)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:38:06.826 [0m[32m[1659418681517256/visualize_feature_dis

Now the cards can be visualized for each step. In this case, let's look at the card associated with the `visualize_feature_distributions` step:

In [71]:
! python neural_net_flow.py card view visualize_feature_distributions

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mNeuralNetFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[32m[22mResolving card: NeuralNetFlow/1659418681517256/visualize_feature_distributions/3[K[0m[32m[22m[0m


## Episode 4. Analyze Flow Results Using Client API

### Showcasing
- The Metaflow Client API
- Tagging, filtering, and accessing data from flows

Cards are handy for quick visualizations and generating report elements from flows. For more involved analysis of a flow's run history, you can use the Metaflow Client API. The Client offers ways to tag, filter, and access data from flows. For example, you can access the results of the latest run of `TreeModelsFlow` from episode 2 like this:


[HBA: should we talk about whether to do this in notebooks or from the terminal or?]

In [75]:
from metaflow import Flow
tree_flow_run = Flow('TreeModelsFlow').latest_run
assert tree_flow_run.successful

Once you have fetched the run, you can do things like:
* add, remove, or replace tags on the run
* view DAG structure
* view artifact state throughout steps
* view metadata about the run

For example, you can access any artifact stored using `self` with `<RUN NAME>.data.<ARTIFACT NAME>`:

In [73]:
tree_flow_run.data.model_names, tree_flow_run.data.accuracies

(['Random Forest', 'XGBoost'], [1.0, 1.0])

Let's compare the accuracy scores from each of the two tree models and the neural network in episode 3. 

In [74]:
from metaflow import Flow
tree_flow_run = Flow('TreeModelsFlow').latest_run
neural_net_run = Flow('NeuralNetFlow').latest_run

for model_name, acc in zip(
    [*tree_flow_run.data.model_names, "Neural Net"],
    [*tree_flow_run.data.accuracies, neural_net_run.data.accuracy]
):
    print("{} Accuracy: {}".format(model_name, acc))

Random Forest Accuracy: 1.0
XGBoost Accuracy: 1.0
Neural Net Accuracy: 1.0


## Episode 5. Debugging Flows

### Showcasing
- Metaflow `resume`
- Debugging flows

The team behind Metaflow wants you to have a great experience working with Metaflow. To this end, debugging is a first-class workflow in the Metaflow developer experience. In this episode, we focus on using `resume` from the command line when debugging your flows. A typical debugging session looks like this:
* Write `my_sweet_flow.py`
* Run `python my_sweet_flow.py run`
    * Oh no, something broke! Analyzing stack trace...
    * Found the bug! 
    * Save `my_sweet_flow.py` with the fix. 
* Run `python my_sweet_flow.py resume`
    * This picks up the state of the last flow execution *from the step that failed*.
    * Note: You can also choose a different step to resume from with `python my_sweet_flow.py resume <DIFFERENT STEP NAME>`
    
Let's look at an example. In this flow:
* The `time_consuming_step` mimics some process you'd rather not re-run because of a downstream error.
* The `error_prone_step` raises an `Exception` that halts your flow.

In [76]:
%%writefile debuggable_flow.py
from metaflow import FlowSpec, step

class DebuggableFlow(FlowSpec):
    
    @step
    def start(self):
        self.next(self.time_consuming_step)
        
    @step
    def time_consuming_step(self):
        import time
        time.sleep(12)
        self.next(self.error_prone_step)
        
    @step
    def error_prone_step(self):
        raise Exception()
        self.next(self.end)
    
    @step
    def end(self):
        print("Flow is done!")

if __name__ == "__main__":
    DebuggableFlow()

Overwriting debuggable_flow.py


If you run this flow using the following command, the `error_prone_step` will produce an error.

In [77]:
! python debuggable_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mDebuggableFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:39:11.657 [0m[1mWorkflow starting (run-id 1659418751653854):[0m
[35m2022-08-02 00:39:11.665 [0m[32m[1659418751653854/start/1 (pid 4366)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:39:12.059 [0m[32m[1659418751653854/start/1 (pid 4366)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:39:12.069 [0m[32m[1659418751653854/time_consuming_step/2 (pid 4369)] [0m[1mTask is starting.[0m
[35m2022-08-02 00:39:24.537 [0m[32m[1659418751653854/time_consuming_step/2 (pid 4369)] [0m[1mTask finished successfully.[0m
[35m2022-08-02 00:39:24.545 [0m[32m[1659418751653854/error_pron

You can resume from the step that failed by:
1. Finding and fixing the bug. 
2. Saving the flow script.
3. Running `python <FLOW SCRIPT> resume`. 

In [80]:
%%writefile debuggable_flow.py
from metaflow import FlowSpec, step

class DebuggableFlow(FlowSpec):
    
    @step
    def start(self):
        self.next(self.time_consuming_step)
        
    @step
    def time_consuming_step(self):
        import time
        time.sleep(12)
        self.next(self.error_prone_step)
        
    @step
    def error_prone_step(self):
        print("Squashed bug")
        self.next(self.end)
    
    @step
    def end(self):
        print("Flow is done!")

if __name__ == "__main__":
    DebuggableFlow()

Overwriting debuggable_flow.py


When you run the following command, notice that the console output includes notes about tasks from the previous run being cloned, and that the flow doesn't wait on `time_consuming_step`, since we resumed downstream of this step.

In [81]:
! python debuggable_flow.py resume

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mDebuggableFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-08-02 00:39:36.594 [0m[22mGathering required information to resume run (this may take a bit of time)...[0m
[35m2022-08-02 00:39:36.599 [0m[1mWorkflow starting (run-id 1659418776593706):[0m
[35m2022-08-02 00:39:36.600 [0m[32m[1659418776593706/start/1] [0m[1mCloning results of a previously run task 1659418751653854/start/1[0m
[35m2022-08-02 00:39:36.972 [0m[32m[1659418776593706/time_consuming_step/2] [0m[1mCloning results of a previously run task 1659418751653854/time_consuming_step/2[0m
[35m2022-08-02 00:39:37.365 [0m[32m[1659418776593706/error_prone_step/3 (pid 4399)] [0m[1mTask is s

Note that this same functionality works even when the steps are run on different computers. In fact, you can even resume a Metaflow run on your local machine for a flow that was run on a production scheduler like AWS Step Functions or Argo.
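
The idea behind `resume` in this episode can be pictured with a toy runner that reuses the stored results of steps that already succeeded and only executes the rest. This is a simplified standard-library sketch, not how Metaflow is implemented: `run_flow`, the `results` dictionary, and the step functions are hypothetical, whereas the real command starts a new run and clones the earlier task results into it.

```python
def run_flow(steps, results):
    """Hypothetical runner: skip steps whose results are already stored."""
    executed = []
    for name, func in steps:
        if name in results:
            print("Cloning results of a previously run task:", name)
            continue
        results[name] = func()  # an exception here halts the flow
        executed.append(name)
    return executed


def start():
    return "ok"


def time_consuming_step():
    return "slow result"  # imagine this took a long time to compute


def broken_step():
    raise RuntimeError("bug in error_prone_step")


def fixed_step():
    return "Squashed bug"


def end():
    return "Flow is done!"


results = {}  # stands in for the results persisted by earlier attempts

# First attempt: fails at `error_prone_step`, but earlier results are kept.
first_attempt = [("start", start), ("time_consuming_step", time_consuming_step),
                 ("error_prone_step", broken_step), ("end", end)]
try:
    run_flow(first_attempt, results)
except RuntimeError as err:
    print("First attempt failed:", err)

# Second attempt with the fix: only the failed step and its successors run.
second_attempt = [("start", start), ("time_consuming_step", time_consuming_step),
                  ("error_prone_step", fixed_step), ("end", end)]
executed = run_flow(second_attempt, results)
print("Executed on resume:", executed)
```

Only the failed step and the steps after it are executed on the second attempt, which is why the resumed run in this episode does not wait on `time_consuming_step` again.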

### Lesson Close
Way to stick with this lesson through all five episodes. You are going to be a great Metaflower. If you are ready for more, check out the next lesson in this tutorial where we move to the cloud. Hope to see you there!