# Creating flows from your laptop machine learning code

To introduce data scientists to Metaflow, it helps to show them how to take their existing ML code and turn it into flows. Arguably, the three most practical types of models are

* Random forests
* Boosted trees, and
* Neural nets.

To this end, in what follows, we show how you would take code for each of these types of models and turn it into a Metaflow flow.

## Setup instructions

We'll use `conda` to install the necessary packages, but you can also use `pip` or `virtualenv`. To use `conda`, install the Anaconda distribution from [here](https://www.anaconda.com/products/individual).
Then, from the command line, execute

```bash
conda env create -f env.yml
```
to create your environment. You can then activate it by executing

```bash
conda activate full-stack-metaflow
```
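
Before running the cells below, it can be worth confirming that the activated environment exposes the libraries this tutorial imports. The following is a minimal sketch rather than part of the original setup: the helper name `missing_packages` is hypothetical, and the package list is an assumption inferred from the imports that appear later in this document (the contents of `env.yml` are not shown here).

```python
from importlib.util import find_spec

# Top-level import names used later in this tutorial
# (an assumption; env.yml itself is not shown in this document).
REQUIRED = ["metaflow", "sklearn", "xgboost", "tensorflow"]


def missing_packages(names):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, re-check that the `full-stack-metaflow` environment is the one currently activated.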

## Random forests

This is typical random forest code:

In [1]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape, y.shape

((150, 4), (150,))

In [2]:
# https://scikit-learn.org/stable/modules/ensemble.html#forest
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
    random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)


clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)


clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

[0.96666667 0.96666667 0.9        0.96666667 1.        ]
[0.96666667 0.96666667 0.9        0.93333333 1.        ]
[0.96666667 0.96666667 0.93333333 0.9        1.        ]
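
Five fold scores per model are hard to compare by eye, so a common next step is to reduce each list to a mean and a spread. The sketch below does this with the standard library only, using the fold scores copied from the output above; the helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above.
cv_scores = {
    "DecisionTreeClassifier": [0.96666667, 0.96666667, 0.9, 0.96666667, 1.0],
    "RandomForestClassifier": [0.96666667, 0.96666667, 0.9, 0.93333333, 1.0],
    "ExtraTreesClassifier": [0.96666667, 0.96666667, 0.93333333, 0.9, 1.0],
}


def summarize(scores):
    """Return (mean, population standard deviation) of the fold scores."""
    return mean(scores), pstdev(scores)


summary = {name: summarize(scores) for name, scores in cv_scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.3f} +/- {spread:.3f}")
```

On these numbers the single decision tree averages about 0.960 and the two ensembles about 0.953. With only 150 samples, a gap of that size is probably too small to treat as a meaningful ranking.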


## Boosted trees



This is typical boosted tree code:

In [3]:
import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('data/agaricus.txt.train')
dtest = xgb.DMatrix('data/agaricus.txt.test')
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)
print(preds)

[0.28583017 0.9239239  0.28583017 ... 0.9239239  0.05169873 0.9239239 ]
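
With the `binary:logistic` objective, the values printed above are predicted probabilities of the positive class, not class labels. A minimal sketch of turning them into 0/1 labels follows; the helper name `to_labels` and the 0.5 cutoff are assumptions chosen for illustration, and only the six values visible in the truncated output are used.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels using a cutoff."""
    return [1 if p > threshold else 0 for p in probabilities]


# The six values visible in the (truncated) output above.
visible_preds = [0.28583017, 0.9239239, 0.28583017, 0.9239239, 0.05169873, 0.9239239]
labels = to_labels(visible_preds)
print(labels)
```

A different cutoff may suit problems where false positives and false negatives have different costs.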


## Neural nets

This is (somewhat) typical deep learning code:

In [4]:
# https://keras.io/examples/vision/mnist_convnet/
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

In [5]:
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)


x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
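
The call to `keras.utils.to_categorical` above turns each integer digit label into a vector of length `num_classes` with a single 1 at the label's index. The pure-Python sketch below illustrates that encoding; `one_hot` is a hypothetical helper written for this explanation, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label with 1.0 at the label's index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded[0])
```

This is the label format that the `categorical_crossentropy` loss used later expects.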


In [6]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                1

2022-03-16 12:19:22.170252: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
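
The output shapes and parameter counts in the summary can be checked by hand. The sketch below reproduces the arithmetic for the convolution and pooling layers, assuming the Keras defaults the model relies on (stride 1 and no padding for `Conv2D`, non-overlapping windows for `MaxPooling2D`); the helper names are hypothetical.

```python
def conv2d_output_size(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def conv2d_params(kernel, in_channels, filters):
    """Kernel weights for every filter plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def max_pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling (floor division)."""
    return size // pool


size = 28                                # MNIST images are 28 x 28 x 1
size = conv2d_output_size(size, 3)       # after the first Conv2D
conv1_params = conv2d_params(3, 1, 32)
size = max_pool_output_size(size, 2)     # after the first MaxPooling2D
size = conv2d_output_size(size, 3)       # after the second Conv2D
conv2_params = conv2d_params(3, 32, 64)
size = max_pool_output_size(size, 2)     # after the second MaxPooling2D
flattened = size * size * 64             # length of the Flatten output

print(conv1_params, conv2_params, flattened)
```

These values agree with the `Param #` column for the two convolution layers and with the `(None, 1600)` shape of the flatten layer.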


In [7]:
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

2022-03-16 12:19:27.876131: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7fcbb283d100>

In [8]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])


Test loss: 0.027363663539290428
Test accuracy: 0.9904999732971191


## Writing Local Machine Learning Flows


### Random Forests

In [9]:
%%writefile flows/local/rf_flow.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile, card
import json

class ClassificationFlow(FlowSpec):
    """
    train a random forest
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.rf_model)
        

    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.next(self.train)

        
        
    @step
    def train(self):
        """
        Train the model
        """
        from sklearn.model_selection import cross_val_score
        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.end)
        
        
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print("ClassificationFlow is all done.")


if __name__ == "__main__":
    ClassificationFlow()

Writing flows/local/rf_flow.py


Execute the above from the command line with

```bash
python flows/local/rf_flow.py run
```
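
Once a run has finished, anything assigned to `self` in a step (for example `self.scores`) is stored as an artifact and can be read back through Metaflow's Client API. The following is a sketch under stated assumptions: the helper names `format_scores` and `latest_scores` are hypothetical, and the import is guarded so the formatting helper still works where Metaflow is not installed.

```python
try:
    from metaflow import Flow  # Metaflow Client API
except ImportError:  # allows the formatting helper to be used without Metaflow
    Flow = None


def format_scores(scores):
    """Render a sequence of scores as a short comma-separated string."""
    return ", ".join(f"{float(s):.3f}" for s in scores)


def latest_scores(flow_name="ClassificationFlow"):
    """Return the `scores` artifact of the latest successful run of `flow_name`."""
    if Flow is None:
        raise RuntimeError("Metaflow is not installed in this environment")
    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"No successful run found for {flow_name}")
    return run.data.scores


# Example usage after a successful run of rf_flow.py:
# print(format_scores(latest_scores()))
```

Several flows in this tutorial share the class name `ClassificationFlow`, so the latest successful run may belong to a different file than the one you expect, and it may not expose a `scores` artifact.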

In [10]:
! python flows/local/rf_flow.py run

Metaflow 2.5.0 executing ClassificationFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Creating local datastore in current directory (/Users/hba/Documents/Projects/full-stack-ML-metaflow-tutorial-main/.metaflow)
2022-03-16 12:22:31.412 Workflow starting (run-id 7226):
2022-03-16 12:22:37.340 [7226/start/135712 (pid 15908)] Task is starting.
2022-03-16 12:23:18.430 [7226/start/135712 (pid 15908)] Task finished successfully.
2022-03-16 12:23:22.519 [7226/rf_model/135713 (pid 16056)] Task is starting.
2022-03-16 12:23:39.167 [7226/rf_model/135713 (pid 16056)] Task fin

In [11]:
%%writefile flows/local/tree_branch_flow.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile, card
import json





class ClassificationFlow(FlowSpec):
    """
    train multiple tree based methods
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.rf_model, self.xt_model, self.dt_model)
    
                
    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.choose_model)

    @step
    def xt_model(self):
        """
        build extra trees classifier
        """
        from sklearn.ensemble import ExtraTreesClassifier
        from sklearn.model_selection import cross_val_score
        

        self.clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)

        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.choose_model)

    @step
    def dt_model(self):
        """
        build decision tree classifier
        """
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import cross_val_score
        
        self.clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
            random_state=0)

        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)

        self.next(self.choose_model)
                        
    @step
    def choose_model(self, inputs):
        """
        find 'best' model
        """
        import numpy as np

        def score(inp):
            return inp.clf,\
                   np.mean(inp.scores)

            
        self.results = sorted(map(score, inputs), key=lambda x: -x[1]) 
        self.model = self.results[0][0]
        self.next(self.end)
        
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print('Scores:')
        print('\n'.join('%s %f' % res for res in self.results))


if __name__ == "__main__":
    ClassificationFlow()

Writing flows/local/tree_branch_flow.py


Execute the above from the command line with

```bash
python flows/local/tree_branch_flow.py run
```
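
The `choose_model` step is a join: it receives one input per branch, maps each to a `(model, mean score)` pair, and sorts the pairs from best to worst. The sketch below mimics that logic outside Metaflow. The `SimpleNamespace` objects are stand-ins for the branch inputs, the model names are plain strings, and the fold scores are copied from the earlier notebook output rather than from a flow run.

```python
from statistics import mean
from types import SimpleNamespace

# Stand-ins for the three branch results that a join step receives as `inputs`.
inputs = [
    SimpleNamespace(clf="RandomForestClassifier",
                    scores=[0.96666667, 0.96666667, 0.9, 0.93333333, 1.0]),
    SimpleNamespace(clf="ExtraTreesClassifier",
                    scores=[0.96666667, 0.96666667, 0.93333333, 0.9, 1.0]),
    SimpleNamespace(clf="DecisionTreeClassifier",
                    scores=[0.96666667, 0.96666667, 0.9, 0.96666667, 1.0]),
]


def score(inp):
    """Pair a branch's model with its mean cross-validation score."""
    return inp.clf, mean(inp.scores)


# Sort from highest to lowest mean score, as choose_model does.
results = sorted(map(score, inputs), key=lambda pair: -pair[1])
best_model = results[0][0]

print("\n".join("%s %f" % res for res in results))
```

In the flow itself, `inp.clf` is the estimator object and `inputs` is supplied by Metaflow; only the ranking logic is the same.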

In [12]:
! python flows/local/rf_flow.py run

Metaflow 2.5.0 executing ClassificationFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-03-16 12:24:59.536 Workflow starting (run-id 7227):
2022-03-16 12:25:05.357 [7227/start/135717 (pid 16163)] Task is starting.
2022-03-16 12:25:45.236 [7227/start/135717 (pid 16163)] Task finished successfully.
2022-03-16 12:25:49.090 [7227/rf_model/135718 (pid 16266)] Task is starting.
2022-03-16 12:26:10.637 [7227/rf_model/135718 (pid 16266)] Task finished successfully.
2022-03-16 12:26:14.690 [7227/train/135719 (pid 16273)] Task is starting.
2022-03-16 12:26:

### Boosted Trees

In [13]:
%%writefile flows/local/boosted_flow.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile




class BSTFlow(FlowSpec):
    """
    train a boosted tree
    """

    @step
    def start(self):
        """
        Load the data & train model
        """
        import xgboost as xgb
        # from io import StringIO
        # read in data
        dtrain = xgb.DMatrix('data/agaricus.txt.train')
        #dtest = xgb.DMatrix('data/agaricus.txt.test')

        # specify parameters
        param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
        num_round = 2
        bst = xgb.train(param, dtrain, num_round)
        bst.save_model("model.json")
        self.next(self.predict)
        

        
        
    @step
    def predict(self):
        """
        make predictions
        """
        import xgboost as xgb

        dtest = xgb.DMatrix('data/agaricus.txt.test')
        # make prediction
        bst = xgb.Booster()
        bst.load_model("model.json")
        # store the predictions on self so they are kept as a flow artifact
        self.preds = bst.predict(dtest)
        self.next(self.end)
        
        
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print("BSTFlow is all done.")


if __name__ == "__main__":
    BSTFlow()

Writing flows/local/boosted_flow.py


Execute the above from the command line with

```bash
python flows/local/boosted_flow.py run
```

In [14]:
! python flows/local/boosted_flow.py run

Metaflow 2.5.0 executing BSTFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-03-16 12:31:30.621 Workflow starting (run-id 7228):
2022-03-16 12:31:36.449 [7228/start/135722 (pid 16732)] Task is starting.
2022-03-16 12:31:54.058 [7228/start/135722 (pid 16732)] Task finished successfully.
2022-03-16 12:31:58.364 [7228/predict/135723 (pid 16738)] Task is starting.
2022-03-16 12:32:15.850 [7228/predict/135723 (pid 16738)] Task finished successfully.
2022-03-16 12:32:19.806 [7228/end/135724 (pid 16746)] Task is starting.
2022-03-16 12:32:27.232

### Deep Learning

In [15]:
%%writefile flows/local/NN_flow.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile
import json


class NNFlow(FlowSpec):
    """
    train a NN
    """

    @step
    def start(self):
        """
        Load the data
        """
        from tensorflow import keras

        # the data, split between train and test sets
        (self.x_train, self.y_train), (self.x_test, self.y_test) = keras.datasets.mnist.load_data()
        self.next(self.wrangle)
        
    @step
    def wrangle(self):
        """
        massage data
        """
        import numpy as np
        from tensorflow import keras
        # Model / data parameters
        self.num_classes = 10
        self.input_shape = (28, 28, 1)

        # Scale images to the [0, 1] range
        self.x_train = self.x_train.astype("float32") / 255
        self.x_test = self.x_test.astype("float32") / 255
        # Make sure images have shape (28, 28, 1)
        self.x_train = np.expand_dims(self.x_train, -1)
        self.x_test = np.expand_dims(self.x_test, -1)

        # convert class vectors to binary class matrices
        self.y_train = keras.utils.to_categorical(self.y_train, self.num_classes)
        self.y_test = keras.utils.to_categorical(self.y_test, self.num_classes)
        
        self.next(self.build_model)


    @step
    def build_model(self):
        """
        build NN model
        """
        import tempfile
        import numpy as np
        import tensorflow as tf
        from tensorflow import keras
        from tensorflow.keras import layers

        model = keras.Sequential(
            [
                keras.Input(shape=self.input_shape),
                layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
                layers.MaxPooling2D(pool_size=(2, 2)),
                layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
                layers.MaxPooling2D(pool_size=(2, 2)),
                layers.Flatten(),
                layers.Dropout(0.5),
                layers.Dense(self.num_classes, activation="softmax"),
            ]
        )
        model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
        with tempfile.NamedTemporaryFile() as f:
            tf.keras.models.save_model(model, f.name, save_format='h5')
            self.model = f.read()
        self.next(self.train)

        
        
    @step
    def train(self):
        """
        Train the model
        """
        import tempfile
        import tensorflow as tf
        self.batch_size = 128
        self.epochs = 15
        
        with tempfile.NamedTemporaryFile() as f:
            f.write(self.model)
            f.flush()
            model =  tf.keras.models.load_model(f.name)
        model.fit(self.x_train, self.y_train, batch_size=self.batch_size, epochs=self.epochs, validation_split=0.1)
        
        self.next(self.end)
        
        
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print("NNFlow is all done.")


if __name__ == "__main__":
    NNFlow()

Writing flows/local/NN_flow.py


Execute the above from the command line with

```bash
python flows/local/NN_flow.py run
```
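
`NNFlow` passes the Keras model between steps as raw bytes: `build_model` saves the model to a temporary file and stores the file's contents in `self.model`, and `train` writes those bytes back to disk before loading them. The sketch below shows the same round trip using only the standard library. It is a variant rather than the flow's own code: it uses a temporary directory instead of `NamedTemporaryFile`, because reopening a `NamedTemporaryFile` by name is not portable to Windows, and the helper names and toy save/load functions are hypothetical.

```python
import os
import tempfile


def save_to_bytes(save_fn, filename="model.bin"):
    """Call save_fn(path) on a scratch path and return the file's bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, filename)
        save_fn(path)
        with open(path, "rb") as f:
            return f.read()


def load_from_bytes(blob, load_fn, filename="model.bin"):
    """Write blob to a scratch path and return load_fn(path)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, filename)
        with open(path, "wb") as f:
            f.write(blob)
        return load_fn(path)


# Toy stand-ins for a library's save and load functions.
def toy_save(path):
    with open(path, "w", encoding="utf-8") as f:
        f.write("weights: 1 2 3")


def toy_load(path):
    with open(path, encoding="utf-8") as f:
        return f.read()


blob = save_to_bytes(toy_save)              # what a step would store on self
restored = load_from_bytes(blob, toy_load)  # what a later step would rebuild
print(restored)
```

With a real model, `save_fn` and `load_fn` would wrap the library's own save and load calls. The loader must finish reading the file before the temporary directory is removed.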

In [16]:
! python flows/local/NN_flow.py run

Metaflow 2.5.0 executing NNFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-03-16 12:33:56.785 Workflow starting (run-id 7229):
2022-03-16 12:34:02.641 [7229/start/135726 (pid 16786)] Task is starting.
2022-03-16 12:34:42.119 [7229/start/135726 (pid 16786)] Task finished successfully.
2022-03-16 12:34:45.873 [7229/wrangle/135727 (pid 16806)] Task is starting.
2022-03-16 12:35:38.151 [7229/wrangle/135727 (pid 16806)] Task finished successfully.
2022-03-16 12:35:42.146 [7229/build_model/135728 (pid 16832)] Task is starting.
2022-03-16 12:35:51.337

## Flows for the Cloud

In [17]:
%%writefile flows/cloud/rf_flow_cloud.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile, card, conda, conda_base
import json





class ClassificationFlow(FlowSpec):
    """
    train a random forest
    """
    @conda(libraries={'scikit-learn':'1.0.2'}) 
    @card
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.rf_model)
        
    @conda(libraries={'scikit-learn':'1.0.2'})
    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.next(self.train)

        
    @conda(libraries={'scikit-learn':'1.0.2'})       
    @step
    def train(self):
        """
        Train the model
        """
        from sklearn.model_selection import cross_val_score
        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.end)
        
        
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print("ClassificationFlow is all done.")


if __name__ == "__main__":
    ClassificationFlow()

Writing flows/cloud/rf_flow_cloud.py


Execute the above from the command line with

```bash
python flows/cloud/rf_flow_cloud.py --environment=conda run --with batch
```

In [18]:
! python flows/cloud/rf_flow_cloud.py --environment=conda run --with batch

Metaflow 2.5.0 executing ClassificationFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
2022-03-16 12:44:06.562 Workflow starting (run-id 7230):
2022-03-16 12:44:17.829 [7230/start/135732 (pid 17269)] Task is starting.
2022-03-16 12:44:20.408 [7230/start/135732 (pid 17269)] [d2dc4990-242e-44f3-bec0-98109e7d0c0b] Task is starting (status SUBMITTED)...
2022-03-16 12:44:23.603 [7230/start/135732 (pid 17269)] [d2dc4990-242e-44f3-bec0-98109e7d0c0b] Task is starting (status RUNNABLE)...
2022-03-16 12:44:53.604 [7

In [19]:
%%writefile flows/cloud/tree_branch_flow_cloud.py

from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile, card, conda, conda_base
import json





class ClassificationFlow(FlowSpec):
    """
    train multiple tree based methods
    """
    @conda(libraries={'scikit-learn':'1.0.2'}) 
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.rf_model, self.xt_model, self.dt_model)
    
    @conda(libraries={'scikit-learn':'1.0.2'})             
    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.choose_model)
    
    @conda(libraries={'scikit-learn':'1.0.2'}) 
    @step
    def xt_model(self):
        """
        build extra trees classifier
        """
        from sklearn.ensemble import ExtraTreesClassifier
        from sklearn.model_selection import cross_val_score
        

        self.clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)

        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)
        self.next(self.choose_model)
    
    @conda(libraries={'scikit-learn':'1.0.2'}) 
    @step
    def dt_model(self):
        """
        build decision tree classifier
        """
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import cross_val_score
        
        self.clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
            random_state=0)

        self.scores = cross_val_score(self.clf, self.X, self.y, cv=5)

        self.next(self.choose_model)

    @conda(libraries={'scikit-learn':'1.0.2'})                         
    @step
    def choose_model(self, inputs):
        """
        find 'best' model
        """
        import numpy as np

        def score(inp):
            return inp.clf,\
                   np.mean(inp.scores)

            
        self.results = sorted(map(score, inputs), key=lambda x: -x[1]) 
        self.model = self.results[0][0]
        self.next(self.end)

    @conda(libraries={'scikit-learn':'1.0.2'})         
    @step
    def end(self):
        """
        End of flow, yo!
        """
        print('Scores:')
        print('\n'.join('%s %f' % res for res in self.results))


if __name__ == "__main__":
    ClassificationFlow()

Writing flows/cloud/tree_branch_flow_cloud.py


Execute the above from the command line with

```bash
python flows/cloud/tree_branch_flow_cloud.py --environment=conda run --with batch
```

In [20]:
! python flows/cloud/tree_branch_flow_cloud.py --environment=conda run --with batch

Metaflow 2.5.0 executing ClassificationFlow for user:hba
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
2022-03-16 12:54:01.664 Workflow starting (run-id 7231):
2022-03-16 12:54:12.959 [7231/start/135737 (pid 17939)] Task is starting.
2022-03-16 12:54:15.637 [7231/start/135737 (pid 17939)] [843ed1e4-6b6c-4ab7-909a-1c277736f837] Task is starting (status SUBMITTED)...
2022-03-16 12:54:16.705 [7231/start/135737 (pid 17939)] [843ed1e4-6b6c-4ab7-909a-1c277736f837] Task is starting (status RUNNABLE)...
2022-03-16 12:54:46.713 [7