# Lesson 4: Metaflow and the MLOps ecosystem

## Learning Objectives

* Incorporate other tools from the MLOps ecosystem into your ML workflows, including
    - Experiment tracking,
    - Data validation, and
    - Deploying your model to an endpoint.
    
Lessons 1-3 already get you far! As your projects mature, the more advanced topics in Lesson 4 become relevant. In this lesson, we'll demo some of the more advanced things that are possible with modern tooling. Setting all this up requires some effort, so treat this as an inspirational tour rather than a step-by-step tutorial.

## Interoperability as a Foundational Part of Full-Stack ML

_Human-centricity_ is a foundational principle of Metaflow. As a result, Metaflow strives to be interoperable and compatible with all the other ML tools that you already use (and ones you may want to use!). In this lesson, we'll show how to incorporate three _types of tools_, those for
* experiment tracking,
* data validation, and
* deployment.

We'll be using [Weights & Biases](https://wandb.ai/site) for experiment tracking, [Great Expectations](https://greatexpectations.io/) for data validation, and [Amazon SageMaker](https://aws.amazon.com/pm/sagemaker/) for deployment, but keep in mind that Metaflow is agnostic with respect to the other tools you use.

![flow0](../img/recsys_flow.jpg)

This figure is from the wonderful repo [You Don't Need a Bigger Boat](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat)!

Let's jump in:

## Experiment Tracking

Experiment tracking is a way to keep track of all the model runs you try, along with the models you have in production. In the following, we use [Weights & Biases](https://wandb.ai/site), but there are other options, such as [Neptune.ai](https://neptune.ai/) and [Comet](https://www.comet.ml/site/). To reproduce the following, you will need to create a free Weights & Biases account.

Note that Metaflow tracks all experiments for you automatically, as we saw in previous lessons, so you don't strictly need a separate tool for that. However, a tool like W&B is convenient for many things, such as comparing the results of runs in an easy-to-use UI out of the box, and it is easy to use with Metaflow.

Let's do it!

In [None]:
%%writefile ../flows/ecosystem/rf_flow_monitor.py
from metaflow import FlowSpec, step, card
import json

class Tracking_Flow(FlowSpec):
    """
    train a random forest
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets
        from sklearn.model_selection import train_test_split

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.labels = self.iris['target_names']

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.2)
        self.next(self.rf_model)
        

    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.next(self.train)

        
        
    @step
    def train(self):
        """
        Train the model
        """
        from sklearn.model_selection import cross_val_score
        self.clf.fit(self.X_train, self.y_train)
        self.y_pred = self.clf.predict(self.X_test)
        self.y_probs = self.clf.predict_proba(self.X_test)
        self.next(self.monitor)
        
    @step
    def monitor(self):
        """
        plot some things using an experiment tracker
        
        """
        import wandb
        # edit the following with your username, project name, etc ...
        wandb.init(project="mf-rf-wandb", entity="hugobowne", name="mf-tutorial-iris")

        wandb.sklearn.plot_class_proportions(self.y_train, self.y_test, self.labels)
        wandb.sklearn.plot_learning_curve(self.clf, self.X_train, self.y_train)
        wandb.sklearn.plot_roc(self.y_test, self.y_probs, self.labels)
        wandb.sklearn.plot_precision_recall(self.y_test, self.y_probs, self.labels)
        wandb.sklearn.plot_feature_importances(self.clf)

        wandb.sklearn.plot_classifier(self.clf, 
                              self.X_train, self.X_test, 
                              self.y_train, self.y_test, 
                              self.y_pred, self.y_probs, 
                              self.labels, 
                              is_binary=False,  # iris has three classes, so this is not a binary problem
                              model_name='RandomForest')

        wandb.finish()
        self.next(self.end)
    
    @step
    def end(self):
        """
        End of flow!
        """
        print("Tracking_Flow is all done.")


if __name__ == "__main__":
    Tracking_Flow()


Execute the above from the command line with

```bash
python flows/ecosystem/rf_flow_monitor.py run
```
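Since Metaflow records every run on its own, the results of this flow can also be read back without W&B. The following is a minimal sketch using Metaflow's Client API (`Flow`, `latest_successful_run`, and `run.data`). The helper name `summarize_latest_run` and its `flow_loader` argument are hypothetical and exist only to keep the example easy to try out.

```python
def summarize_latest_run(flow_name="Tracking_Flow", flow_loader=None):
    """Return a small summary of the latest successful run of a flow.

    flow_loader is a hypothetical hook for testing; by default the
    real metaflow.Flow class is used.
    """
    if flow_loader is None:
        # Imported lazily so the sketch can be read without Metaflow installed.
        from metaflow import Flow as flow_loader

    run = flow_loader(flow_name).latest_successful_run
    if run is None:
        return None  # the flow has not completed successfully yet

    # Artifacts assigned to `self` in the flow are available under run.data.
    y_pred = run.data.y_pred
    return {"run_id": run.id, "n_predictions": len(y_pred)}
```

After a successful run, calling `summarize_latest_run()` should return the run ID together with the number of test-set predictions stored in `self.y_pred`.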

Check out the monitoring dashboards you built:

In [None]:
import wandb
%wandb hugobowne/mf-rf-wandb

## Data Validation

Data validation is an essential and underappreciated part of machine learning and, more generally, of data science! The basic idea is that if you expect your data to have certain characteristics, you need to check that it actually does, and you need to automate that check in production.

For example, you may expect 

* your data to have particular features or
* your features to be in certain ranges.
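The two kinds of checks above can be sketched in plain Python before reaching for a framework. This is only an illustration: the function name `validate_rows` and the numeric bounds are assumptions made for the example, not expectations taken from this lesson's flows.

```python
def validate_rows(rows, expected_ranges):
    """Check that each row has every expected feature and each value is in range.

    rows: list of dicts mapping feature name -> value.
    expected_ranges: dict mapping feature name -> (low, high), inclusive.
    Returns a list of human-readable problems; an empty list means success.
    """
    problems = []
    for i, row in enumerate(rows):
        for feature, (low, high) in expected_ranges.items():
            if feature not in row:
                problems.append(f"row {i}: missing feature {feature!r}")
            elif not low <= row[feature] <= high:
                problems.append(
                    f"row {i}: {feature!r}={row[feature]} outside [{low}, {high}]"
                )
    return problems


# Illustrative bounds only; real expectations should come from domain knowledge.
expected_ranges = {"petal length (cm)": (0.0, 10.0), "petal width (cm)": (0.0, 5.0)}
rows = [
    {"petal length (cm)": 1.4, "petal width (cm)": 0.2},
    {"petal length (cm)": -1.0, "petal width (cm)": 0.2},  # invalid: negative length
    {"petal width (cm)": 1.8},  # invalid: a feature is missing
]
problems = validate_rows(rows, expected_ranges)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it possible to report every violation in a single pass, which is roughly what a validation report from a dedicated tool gives you.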

There are many ways to do this, including using [pytest](https://ericmjl.github.io/data-testing-tutorial/3-pytest-introduction/). Here we'll use the open source framework [Great Expectations](https://greatexpectations.io/). We've already defined the "expectations" we have of our data, and we'll go through them when we run our flow below. The core of our data validation is contained in this type of step:

```
@step
def data_validation(self):
    """
    Perform data validation with great_expectations
    """
    from metaflow import current  # provides current.run_id and current.flow_name
    from data_validation import validate_data

    validate_data(current.run_id, current.flow_name, self.data_paths)

    self.next(...)
```

In [None]:
%%writefile ../flows/ecosystem/iris_validate.py

from metaflow import FlowSpec, step, card
import json

class Validation_Flow(FlowSpec):
    """
    validate the iris data with great_expectations
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets
        from sklearn.model_selection import train_test_split

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.labels = self.iris['target_names']

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.2)
        self.next(self.data_validation)
        


    @step
    def data_validation(self):
        """
        Perform data validation with great_expectations
        """
        import pandas as pd
        import great_expectations as ge

        context = ge.get_context()

        
        from sklearn import datasets
        iris = datasets.load_iris()
        df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
        df["target"] = iris['target']
        # Uncomment to inject an invalid (negative) value and see how the validation reacts:
        # df.loc[0, "petal length (cm)"] = -1

        # configuration for data validation checkpoint
        checkpoint_config = {
            "name": "flowers-test-flow-checkpoint",
            "config_version": 1,
            "class_name": "SimpleCheckpoint",
            "run_name_template": "%Y%m%d-%H%M%S-flower-power",
            "validations": [
                {
                    "batch_request": {
                        "datasource_name": "flowers",
                        "data_connector_name": "default_runtime_data_connector_name",
                        "data_asset_name": "iris",
                    },
                    "expectation_suite_name": "flowers-testing-suite",
                }
            ],
        }
        context.add_checkpoint(**checkpoint_config)

        # results of data validation
        # then build and view docs
        results = context.run_checkpoint(
            checkpoint_name="flowers-test-flow-checkpoint",
            batch_request={
                "runtime_parameters": {"batch_data": df},
                "batch_identifiers": {
                    "default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"
                },
            },
        )
        context.build_data_docs()
        context.open_data_docs()

        self.next(self.end)

    
    @step
    def end(self):
        """
        End of flow!
        """
        print("Validation_Flow is all done.")


if __name__ == "__main__":
    Validation_Flow()


Execute the following to run the flow with data validation:

```bash
python flows/ecosystem/iris_validate.py run
```

## Combination station: data validation + experiment tracking

Let's combine the above into a single flow. Notice that our flows are getting longer and harder to read. For this reason, we have refactored our code to decouple the business logic (or modeling-related logic) from the execution logic. We have done so by wrapping the data validation code in a function, `validate`, and putting it in `utils.py`, so that the flow only needs to call it:

In [None]:
%%writefile ../flows/ecosystem/rf_flow_monitor_validate.py


from metaflow import FlowSpec, step, card
import json

class Combination_Flow(FlowSpec):
    """
    train a random forest
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets
        from sklearn.model_selection import train_test_split

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.labels = self.iris['target_names']

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.2)
        self.next(self.data_validation)
        

    @step
    def data_validation(self):
        """
        Perform data validation with great_expectations
        """
        import pandas as pd
        from utils import validate



        
        from sklearn import datasets
        iris = datasets.load_iris()
        df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
        df["target"] = iris['target']
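        # Deliberately inject an invalid (negative) value for the validation to check.
        # .loc avoids pandas' chained-assignment pitfalls.
        df.loc[0, "petal length (cm)"] = -1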

        validate(df)


        self.next(self.rf_model)
        
        
    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.next(self.train)

        
        
    @step
    def train(self):
        """
        Train the model
        """
        from sklearn.model_selection import cross_val_score
        self.clf.fit(self.X_train, self.y_train)
        self.y_pred = self.clf.predict(self.X_test)
        self.y_probs = self.clf.predict_proba(self.X_test)
        self.next(self.monitor)
        

    
        
    @step
    def monitor(self):
        """
        plot some things using an experiment tracker
        
        """
        import wandb
        wandb.init(project="mf-rf-wandb", entity="hugobowne", name="mf-tutorial-iris")

        wandb.sklearn.plot_class_proportions(self.y_train, self.y_test, self.labels)
        wandb.sklearn.plot_learning_curve(self.clf, self.X_train, self.y_train)
        wandb.sklearn.plot_roc(self.y_test, self.y_probs, self.labels)
        wandb.sklearn.plot_precision_recall(self.y_test, self.y_probs, self.labels)
        wandb.sklearn.plot_feature_importances(self.clf)

        wandb.sklearn.plot_classifier(self.clf, 
                              self.X_train, self.X_test, 
                              self.y_train, self.y_test, 
                              self.y_pred, self.y_probs, 
                              self.labels, 
                              is_binary=False,  # iris has three classes, so this is not a binary problem
                              model_name='RandomForest')

        wandb.finish()
        self.next(self.end)
    
    @step
    def end(self):
        """
        End of flow!
        """
        print("Combination_Flow is all done.")


if __name__ == "__main__":
    Combination_Flow()


Execute the above with the following:

```bash
python flows/ecosystem/rf_flow_monitor_validate.py run
```
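The `utils.py` module that the flow imports `validate` from is not shown in this lesson. A plausible reconstruction, assuming it simply wraps the Great Expectations calls from `Validation_Flow` and is used with the same Great Expectations version and project configuration, might look like the sketch below. The optional `context` argument and the returned results object are assumptions added for illustration.

```python
def validate(df, context=None):
    """Run the Great Expectations checkpoint from Validation_Flow on df.

    Hypothetical reconstruction: the tutorial's real utils.py is not shown.
    context is an optional hook for passing in an existing data context.
    """
    if context is None:
        import great_expectations as ge  # imported lazily, as in the flow steps

        context = ge.get_context()

    checkpoint_name = "flowers-test-flow-checkpoint"
    # Same checkpoint configuration as in Validation_Flow.
    context.add_checkpoint(
        name=checkpoint_name,
        config_version=1,
        class_name="SimpleCheckpoint",
        run_name_template="%Y%m%d-%H%M%S-flower-power",
        validations=[
            {
                "batch_request": {
                    "datasource_name": "flowers",
                    "data_connector_name": "default_runtime_data_connector_name",
                    "data_asset_name": "iris",
                },
                "expectation_suite_name": "flowers-testing-suite",
            }
        ],
    )
    # Validate the in-memory DataFrame, then build and open the data docs.
    results = context.run_checkpoint(
        checkpoint_name=checkpoint_name,
        batch_request={
            "runtime_parameters": {"batch_data": df},
            "batch_identifiers": {
                "default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"
            },
        },
    )
    context.build_data_docs()
    context.open_data_docs()
    return results
```

Keeping this logic in a plain function means the `data_validation` step stays a few lines long, and the validation code can be exercised outside of Metaflow.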

We can check out our experiment tracking once again with the following:

In [None]:
import wandb
%wandb hugobowne/mf-rf-wandb

## Deploying your model

Now we get to deploy our model to an endpoint that we can ping for predictions from anywhere around the globe. Wow!

In [None]:
print('\U0001F92F')

To do this, you'll need to have the correct permissions set up on Amazon SageMaker. You can find out how to get set up with SageMaker [here](https://docs.aws.amazon.com/sagemaker/index.html).

Now a few words on deploying to an endpoint:
- It is not the only way to deploy ML to production. For example, batch predictions are easier to structure as a workflow, and you don't need endpoints for that, just regular workflows.
- However, when integrating with other services, say, a product UI, you need a service that other services can call. This is where a system like SageMaker hosting comes in handy.
- SageMaker hosting is just one option among others. You could also use an open-source project such as Seldon, or even build your own simple service with Python's Flask. SageMaker, however, is hosted by AWS, so once you have managed to configure it, you don't have to worry about infrastructure.
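The deployment flow below reads two settings from the environment (loaded from `my.env`): `ROLE`, which it passes to SageMaker as the IAM role, and `CODE_LOCATION`, which it passes as the `entry_point` script. Because `os.getenv` quietly returns `None` for unset variables, it can help to check them up front. The following is a minimal sketch; the helper name `load_deploy_config` and the placeholder values are hypothetical.

```python
import os


def load_deploy_config(env=None):
    """Read the settings the deploy step expects, failing fast if any is missing."""
    env = os.environ if env is None else env
    required = ("ROLE", "CODE_LOCATION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Placeholder values for illustration only; use your own role ARN and script path.
example_env = {
    "ROLE": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "CODE_LOCATION": "inference.py",
}
config = load_deploy_config(example_env)
print(config["CODE_LOCATION"])
```

Calling `load_deploy_config()` with no argument checks the real process environment, so it could be run right after `load_dotenv('my.env')`.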

In [None]:
%%writefile ../flows/ecosystem/RF-deploy.py


from metaflow import FlowSpec, step, Parameter, JSONType, IncludeFile, card, S3, environment
import os
import json
from dotenv import load_dotenv
load_dotenv('my.env')



class Deployment_Flow(FlowSpec):
    """
    train a random forest
    """
    @card 
    @step
    def start(self):
        """
        Load the data
        """
        #Import scikit-learn dataset library
        from sklearn import datasets
        from sklearn.model_selection import train_test_split

        #Load dataset
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.labels = self.iris['target_names']

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.2)
        self.next(self.rf_model)
        
    @step
    def rf_model(self):
        """
        build random forest model
        """
        from sklearn.ensemble import RandomForestClassifier
        
        
        self.clf = RandomForestClassifier(n_estimators=10, max_depth=None,
            min_samples_split=2, random_state=0)
        self.next(self.train)

        
        
    @step
    def train(self):
        """
        Train the model
        """
        from sklearn.model_selection import cross_val_score
        self.clf.fit(self.X_train, self.y_train)
        self.y_pred = self.clf.predict(self.X_test)
        self.y_probs = self.clf.predict_proba(self.X_test)
        self.next(self.deploy)

    @step
    def deploy(self):
        """
        Use SageMaker to deploy the model as a stand-alone, PaaS endpoint.
        """
        import os
        import time
        import joblib
        import shutil
        import tarfile
        from sagemaker.sklearn import SKLearnModel
        
        ROLE = os.getenv('ROLE')
        CODE_LOCATION = os.getenv('CODE_LOCATION')


        model_name = "model"
        local_tar_name = "model.tar.gz"


        os.makedirs(model_name, exist_ok=True)
        # save model to local folder
        joblib.dump(self.clf, "{}/{}.joblib".format(model_name, model_name))
        # save model as tar.gz
        with tarfile.open(local_tar_name, mode="w:gz") as _tar:
            _tar.add(model_name, recursive=True)
        # save model onto S3
        with S3(run=self) as s3:
            with open(local_tar_name, "rb") as in_file:
                data = in_file.read()
                self.model_s3_path = s3.put(local_tar_name, data)
                #print('Model saved at {}'.format(self.model_s3_path))
        # remove local model folder and tar
        shutil.rmtree(model_name)
        os.remove(local_tar_name)
        # initialize SageMaker SKLearn Model
        # (replace the code_location below with an S3 path in a bucket you own)
        sklearn_model = SKLearnModel(model_data=self.model_s3_path,
                                     role=ROLE,
                                     entry_point=CODE_LOCATION,
                                     framework_version='0.23-1',
                                     code_location='s3://oleg2-s3-mztdpcvj/sagemaker/')
        endpoint_name = 'HBA-RF-endpoint-{}'.format(int(round(time.time() * 1000)))
        print("\n\n================\nEndpoint name is: {}\n\n".format(endpoint_name))
        # deploy model
        predictor = sklearn_model.deploy(instance_type='ml.c5.2xlarge',
                                         initial_instance_count=1,
                                         endpoint_name=endpoint_name)
        # prepare a test input and check response
        test_input = self.X
        result = predictor.predict(test_input)
        print(result)
        
        self.next(self.end)
    
    @step
    def end(self):
        """
        End of flow!
        """
        print("Deployment_Flow is all done.")


if __name__ == "__main__":
    Deployment_Flow()




Execute the above with:
```bash
python flows/ecosystem/RF-deploy.py run
```
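The packaging part of the `deploy` step (save the model into a folder, then bundle that folder into a `.tar.gz` archive for upload to S3) can be tried in isolation. The sketch below uses only the standard library: it swaps `joblib` for `pickle`, works in a temporary directory, and uses a plain dictionary as a stand-in for the trained classifier. The function name `package_model` is hypothetical.

```python
import pickle
import tarfile
import tempfile
from pathlib import Path


def package_model(model, workdir, model_name="model"):
    """Serialize model into <workdir>/<model_name>/ and bundle it as a .tar.gz."""
    workdir = Path(workdir)
    model_dir = workdir / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    # The flow uses joblib.dump here; pickle keeps this sketch dependency-free.
    with open(model_dir / f"{model_name}.pkl", "wb") as out_file:
        pickle.dump(model, out_file)

    tar_path = workdir / f"{model_name}.tar.gz"
    with tarfile.open(tar_path, mode="w:gz") as tar:
        # arcname keeps paths inside the archive relative to the model folder.
        tar.add(model_dir, arcname=model_name, recursive=True)
    return tar_path


with tempfile.TemporaryDirectory() as tmp:
    # A dictionary stands in for the fitted classifier.
    tar_path = package_model({"n_estimators": 10}, tmp)
    with tarfile.open(tar_path, mode="r:gz") as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

Listing the archive members is a quick way to confirm the folder layout before the archive is uploaded.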

We can also test pinging the endpoint with the following:

In [None]:
import boto3
import pandas as pd
from sklearn import datasets


iris = datasets.load_iris()
X = iris['data']

# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='us-west-2')

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
# Replace this with the endpoint name printed by your own run of the deployment flow.
endpoint_name='HBA-RF-endpoint-1657310388379'


# csv serialization
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=pd.DataFrame(X).to_csv(header=False, index=False),
    ContentType="text/csv",
)

print(response["Body"].read())

Note that although we're pinging this endpoint from Python using the AWS SDK (`boto3`), the endpoint can be called from anywhere in a language-agnostic way, so external systems and any software you build can interface with your model.
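Part of what makes the endpoint easy to call from other systems is that the request body above is plain CSV text. As a small sketch, the same kind of header-less payload can be built with only the standard library; the helper name `rows_to_csv_body` is hypothetical, and the two feature rows are made up for illustration.

```python
import csv
import io


def rows_to_csv_body(rows):
    """Serialize rows of numbers to header-less CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Two iris-like rows: sepal length, sepal width, petal length, petal width.
body = rows_to_csv_body([[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]])
print(body)
```

A string like `body` is what would be sent as `Body` with `ContentType="text/csv"` in the `invoke_endpoint` call.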

**Exercise for the avid reader:** Combine all the above into a flow that includes
* data validation,
* experiment tracking, and
* deployment.

## Lesson Recap

In this lesson, you have seen the power of interoperability in Metaflow. We have

* Incorporated other tools from the MLOps ecosystem into your ML workflows, including
    - Experiment tracking,
    - Data validation, and
    - Deploying your model to an endpoint.
    
Check out these guides to further your knowledge of using [Weights & Biases](https://outerbounds.com/docs/track-wandb) and [SageMaker](https://outerbounds.com/docs/deploy-with-sagemaker) with Metaflow.

And yet this is just the tip of the iceberg! To explore more and discuss this quickly evolving space, come chat with us on our [community Slack](http://slack.outerbounds.co).

In [None]:
print('\U0001F91F')