# Lesson 2

## Showcasing

* Metaflow Fundamentals
    * DAGs and `metaflow.FlowSpec`
    * Decorators and `metaflow.step`
* Running Flows
    * RandomForestFlow
    * GradientBoostedTreesFlow
    * NeuralNetFlow
* Analyzing Flows
    * `metaflow.cards`
    * Using the Client API

## Orchestrating ML Workflows with DAGs

Short paragraph motivating DAGs and linking out to Hugo's CC post.
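
As a rough illustration of what a DAG buys you, the sketch below uses Python's standard-library `graphlib` (Python 3.9+) to derive an execution order from step dependencies. The step names mirror the flows later in this lesson; the `dependencies` dict is an assumption made for the example and is not how Metaflow stores its graph.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps that must finish before it can run.
dependencies = {
    "train": {"start"},
    "score": {"train"},
    "end": {"score"},
}

# A topological sort yields an order in which every step runs after its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the graph is acyclic, a valid order always exists, and an orchestrator can use it to decide what to run next.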

## Building DAGs with Metaflow

Metaflow is built for defining and running ML workflows as DAGs.

### Metaflow Fundamentals

One to two sentence value prop of each section under this H3. Progressively build up to full template flow by asking the user to add one element at a time.

#### DAGs and FlowSpec

Objects common to all flows. Here is a minimal flow:

#### Decorators and steps

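Decorators are how you talk to Metaflow. The `@step` decorator marks a method as a step in the DAG, and `self.next()` declares which step runs after it.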
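
To make the decorator idea concrete, here is a toy sketch in plain Python. It is not Metaflow's implementation: `toy_step`, `ToyFlow`, and `find_steps` are hypothetical names, used only to show how a decorator can tag methods so that a framework can discover them later.

```python
def toy_step(func):
    """Toy stand-in for a step decorator: tag the function and return it unchanged."""
    func.is_step = True
    return func


class ToyFlow:
    @toy_step
    def start(self):
        return "start ran"

    @toy_step
    def end(self):
        return "end ran"

    def helper(self):
        # Not decorated, so the framework ignores it.
        return "not a step"


def find_steps(cls):
    """Return the names of tagged methods in definition order."""
    return [name for name, attr in vars(cls).items() if getattr(attr, "is_step", False)]


print(find_steps(ToyFlow))
```

The design point is that the decorated methods stay ordinary Python functions; the decorator only adds the metadata a framework needs to treat them as nodes in a graph.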

#### Running Flows

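Run a flow from the command line by passing the `run` command to the flow file, for example `python random_forest_flow.py run`.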
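
Outside a notebook, the same command can be assembled and launched from Python. The helpers below are a sketch, not part of Metaflow: `build_run_command` and `run_flow` are hypothetical names, and the sketch assumes each `Parameter` is exposed on the command line as `--<name>`.

```python
import subprocess
import sys


def build_run_command(flow_file, **parameters):
    """Build the argument list for `python <flow_file> run --name value ...`."""
    command = [sys.executable, flow_file, "run"]
    for name, value in parameters.items():
        # Assumption: a Parameter named "n_estimators" maps to the option --n_estimators.
        command += ["--" + name, str(value)]
    return command


def run_flow(flow_file, **parameters):
    """Launch the flow in a subprocess and raise if it exits with an error."""
    return subprocess.run(build_run_command(flow_file, **parameters), check=True)


command = build_run_command("random_forest_flow.py", n_estimators=50, test_size=0.3)
print(" ".join(command[1:]))
```

Calling `run_flow("random_forest_flow.py", n_estimators=50)` would then start a run with that parameter overridden, provided the flow file exists in the working directory.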

#### Visualizing Results with Cards

Cards attach a visual report to a step, so you can track the state of your data as it moves through the flow.

#### Flow Analysis with Metaflow Client API

### Random Forest Flow

%%writefile random_forest_flow.py
from metaflow import FlowSpec, step, Parameter

class RandomForestFlow(FlowSpec):
    
    test_size = Parameter("test_size", default=0.2)
    random_state = Parameter("random_state", default=42)
    n_estimators = Parameter("n_estimators", default=10)
    min_samples_split = Parameter("min_samples_split", default=2)
    
    @step
    def start(self):
        from sklearn import datasets
        from sklearn.model_selection import train_test_split
        iris = datasets.load_iris()
        self.X = iris['data']
        self.y = iris['target']
        data = train_test_split(self.X, self.y, 
                                test_size=self.test_size, 
                                random_state=self.random_state)
        self.X_train = data[0]
        self.X_test = data[1]
        self.y_train = data[2]
        self.y_test = data[3]
        self.next(self.train)
        
    @step
    def train(self):
        from sklearn.ensemble import RandomForestClassifier
        self.clf = RandomForestClassifier(n_estimators=self.n_estimators,
                                          min_samples_split=self.min_samples_split, 
                                          random_state=self.random_state)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def score(self):
        self.accuracy = self.clf.score(self.X_test, self.y_test)
        self.next(self.end)
    
    @step
    def end(self):
        print("Random Forest Model Accuracy: {}%".format(round(100*self.accuracy, 3)))
        
if __name__ == "__main__":
    RandomForestFlow()

! python random_forest_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mRandomForestFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-07-21 08:35:27.537 [0m[1mWorkflow starting (run-id 1658410527532180):[0m
[35m2022-07-21 08:35:27.546 [0m[32m[1658410527532180/start/1 (pid 50580)] [0m[1mTask is starting.[0m
[35m2022-07-21 08:35:28.398 [0m[32m[1658410527532180/start/1 (pid 50580)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 08:35:28.406 [0m[32m[1658410527532180/train/2 (pid 50583)] [0m[1mTask is starting.[0m
[35m2022-07-21 08:35:29.168 [0m[32m[1658410527532180/train/2 (pid 50583)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 08:35:29.176 [0m[32m[1658410527532180/score/3 (pid 50586)] [0m[1mTas

### Gradient Boosted Trees Flow

%%writefile gradient_boosted_trees_flow.py

from metaflow import FlowSpec, step, Parameter

class GradientBoostedTreesFlow(FlowSpec):
    
    test_size = Parameter("test_size", default=0.2)
    random_state = Parameter("random_state", default=42)
    n_estimators = Parameter("n_estimators", default=10)
    eval_metric = Parameter("eval_metric", default='mlogloss')
    
    @step
    def start(self):
        from sklearn import datasets
        from sklearn.model_selection import train_test_split
        iris = datasets.load_iris()
        self.X = iris['data']
        self.y = iris['target']
        data = train_test_split(self.X, self.y, 
                                test_size=self.test_size, 
                                random_state=self.random_state)
        self.X_train = data[0]
        self.X_test = data[1]
        self.y_train = data[2]
        self.y_test = data[3]
        self.next(self.train)
        
    @step
    def train(self):
        from xgboost import XGBClassifier
        self.clf = XGBClassifier(n_estimators=self.n_estimators,
                                 random_state=self.random_state,
                                 eval_metric=self.eval_metric,
                                 use_label_encoder=False)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def score(self):
        self.accuracy = self.clf.score(self.X_test, self.y_test)
        self.next(self.end)
    
    @step
    def end(self):
        print("Gradient Boosted Trees Model Accuracy: {}%".format(round(100*self.accuracy, 3)))
        
if __name__ == "__main__":
    GradientBoostedTreesFlow()

! python gradient_boosted_trees_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mGradientBoostedTreesFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-07-21 08:37:18.112 [0m[1mWorkflow starting (run-id 1658410638107504):[0m
[35m2022-07-21 08:37:18.122 [0m[32m[1658410638107504/start/1 (pid 50625)] [0m[1mTask is starting.[0m
[35m2022-07-21 08:37:18.959 [0m[32m[1658410638107504/start/1 (pid 50625)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 08:37:18.968 [0m[32m[1658410638107504/train/2 (pid 50628)] [0m[1mTask is starting.[0m
[35m2022-07-21 08:37:20.010 [0m[32m[1658410638107504/train/2 (pid 50628)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 08:37:20.019 [0m[32m[1658410638107504/score/3 (pid 50631)] [0

### Neural Net Flow

Use a card to show the feature distributions before and after scaling for `X_train` and `X_test`.
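
For intuition about what "before and after scaling" means, the following standard-library sketch standardizes a single feature: the mean and standard deviation are estimated on the training values only and then reused for the test values. The numbers and the helper names `fit_scaler` and `transform` are made up for illustration; the flow itself uses scikit-learn's `StandardScaler`.

```python
import statistics


def fit_scaler(values):
    """Estimate the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Shift and rescale values using statistics learned from the training set."""
    return [(value - mean) / std for value in values]


train = [4.0, 5.0, 6.0]  # illustrative values for one feature
test = [5.0, 7.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values are only approximately centered because they reuse the training statistics.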

%%writefile neural_net_flow.py
from metaflow import FlowSpec, step, Parameter, card, current
from metaflow.cards import Image
from tensorflow import keras

def build_model(hidden_layer_dim, meta):
    # meta is a scikeras argument that will be
    # handed a dict containing input metadata
    n_features_in_ = meta["n_features_in_"]
    X_shape_ = meta["X_shape_"]
    n_classes_ = meta["n_classes_"]

    # build neural net model 
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(n_features_in_, 
                                 input_shape=X_shape_[1:]))
    model.add(keras.layers.Activation("relu"))
    model.add(keras.layers.Dense(hidden_layer_dim))
    model.add(keras.layers.Activation("relu"))
    model.add(keras.layers.Dense(n_classes_))
    model.add(keras.layers.Activation("softmax"))
    return model

class NeuralNetFlow(FlowSpec):
    
    test_size = Parameter("test_size", default=0.2)
    random_state = Parameter("random_state", default=42)
    hidden_layer_dim = Parameter("hidden_layer_dim", default=100)
    epochs = Parameter("epochs", default=200)
    loss_fn = Parameter("loss_fn", default='categorical_crossentropy')
    
    @step
    def start(self):
        from sklearn import datasets
        from sklearn.model_selection import train_test_split
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        data = train_test_split(self.X, self.y, 
                                test_size=self.test_size, 
                                random_state=self.random_state)
        self.X_train = data[0]
        self.X_test = data[1]
        self.y_train = data[2]
        self.y_test = data[3]
        self.next(self.scale_features)
    
    @card
    @step
    def scale_features(self):
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        self.X_train_scaled = scaler.fit_transform(self.X_train)
        self.X_test_scaled = scaler.transform(self.X_test)
        self.next(self.visualize_feature_distributions)
        
    @card()
    @step
    def visualize_feature_distributions(self):
        import matplotlib.pyplot as plt
        n_features = self.X_train.shape[1]
        assert n_features == self.X_test.shape[1], "Train and test feature dimensions are not the same!"
        feature_datasets = [self.X_train, self.X_train_scaled, self.X_test, self.X_test_scaled]
        n_bins = 10
        fig, axs = plt.subplots(len(feature_datasets), n_features, figsize=(16,16))
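        dataset_names = ["X train", "X train scaled", "X test", "X test scaled"]
        for i, data in enumerate(feature_datasets):
            for j in range(n_features):
                # Row i selects the dataset, column j selects the feature.
                axs[i, j].hist(data[:, j], bins=n_bins)
                axs[i, j].set_title("{} - {}".format(dataset_names[i], self.iris['feature_names'][j]))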
        current.card.append(Image.from_matplotlib(fig))
        self.next(self.train)
        
        
    @step
    def train(self):
        from scikeras.wrappers import KerasClassifier
        self.clf = KerasClassifier(build_model, 
                                   loss=self.loss_fn,
                                   hidden_layer_dim=self.hidden_layer_dim,
                                   epochs=self.epochs,
                                   verbose=0)
        self.clf.fit(self.X_train, self.y_train)
        self.next(self.score)

    @step
    def score(self):
        self.accuracy = self.clf.score(self.X_test, self.y_test)
        self.next(self.end)
    
    @step
    def end(self):
        print("Neural Net Model Accuracy: {}%".format(round(100*self.accuracy, 3)))
        
if __name__ == "__main__":
    NeuralNetFlow()

! python neural_net_flow.py run

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mNeuralNetFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-07-21 09:13:36.804 [0m[1mWorkflow starting (run-id 1658412816797829):[0m
[35m2022-07-21 09:13:36.819 [0m[32m[1658412816797829/start/1 (pid 51723)] [0m[1mTask is starting.[0m
[35m2022-07-21 09:13:39.289 [0m[32m[1658412816797829/start/1 (pid 51723)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 09:13:39.301 [0m[32m[1658412816797829/scale_features/2 (pid 51726)] [0m[1mTask is starting.[0m
[35m2022-07-21 09:13:44.138 [0m[32m[1658412816797829/scale_features/2 (pid 51726)] [0m[1mTask finished successfully.[0m
[35m2022-07-21 09:13:44.149 [0m[32m[1658412816797829/visualize_feature

#### Visualize card created in `visualize_feature_distributions`

! python neural_net_flow.py card view visualize_feature_distributions

[35m[1mMetaflow 2.7.1[0m[35m[22m executing [0m[31m[1mNeuralNetFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[32m[22mResolving card: NeuralNetFlow/1658412816797829/visualize_feature_distributions/3[K[0m[32m[22m[0m


### Analyze Flow Results Using Client API

from metaflow import Flow

random_forest_data = Flow('RandomForestFlow').latest_successful_run.data
gradient_boosted_trees_data = Flow('GradientBoostedTreesFlow').latest_successful_run.data
neural_net_data = Flow('NeuralNetFlow').latest_successful_run.data

for model_name, run_data in zip(["Random Forest", "Gradient Boosted Trees", "Neural Net"], 
                                [random_forest_data, gradient_boosted_trees_data, neural_net_data]):
    print("{} Accuracy: {}".format(model_name, run_data.accuracy))

Random Forest Accuracy: 0.9666666666666667
Gradient Boosted Trees Accuracy: 0.9666666666666667
Neural Net Accuracy: 1.0
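
Once the accuracies are in hand, ranking the models takes a few lines of plain Python. The sketch below hard-codes the values printed above so that it runs without a Metaflow datastore; in practice each value would come from `Flow(...).latest_successful_run.data.accuracy` as in the cell above. The helper name `rank_models` is hypothetical.

```python
def rank_models(accuracies):
    """Return (model_name, accuracy) pairs sorted from best to worst."""
    return sorted(accuracies.items(), key=lambda item: item[1], reverse=True)


# Accuracies copied from the output above.
accuracies = {
    "Random Forest": 0.9666666666666667,
    "Gradient Boosted Trees": 0.9666666666666667,
    "Neural Net": 1.0,
}

ranked = rank_models(accuracies)
for model_name, accuracy in ranked:
    print("{}: {:.1%}".format(model_name, accuracy))
```

Ties keep their original order because Python's sort is stable, so the two tree-based models stay in the order they were listed.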


#### Learn m