# Setup

To run this tutorial, we recommend [creating a conda environment defined by an **environment.yml** file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file):

---
```yaml
name: mf-nlp-example
channels:
  - conda-forge
  - defaults
dependencies:
  - jupyterlab=3.2.9
  - metaflow
  - nbformat=5.4.0
  - notebook=6.4.10
  - numpy=1.22.0
  - pandas=1.3.5
  - pip
  - pyarrow=8.0.0
  - python-dotenv=0.20.0
  - ruamel.yaml=0.17.17
  - scikit-learn=0.23.2
  - tensorflow=2.4.0
```

You can create this conda environment with the command `conda env create -f environment.yml`, and activate it with the command `conda activate mf-nlp-example`.

## Background

We are going to build a model that classifies customer reviews as positive or negative sentiment, using the [Women's E-Commerce Clothing Reviews Dataset](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews). Here is what the data looks like:

In [1]:
import pandas as pd
df = pd.read_parquet('train.parquet')
print(f'num of rows: {df.shape[0]}')

num of rows: 20377


In [2]:
df.head()

| Unnamed: 0 | labels | review |
|---|---|---|
| 0 | 0 | Odd fit: I wanted to love this sweater but the... |
| 1 | 1 | Very comfy dress: The quality and material of ... |
| 2 | 0 | Fits nicely but fabric a bit thin: I ordered t... |
| 3 | 1 | Great fit: Love these jeans, fit and style... ... |
| 4 | 0 | Stretches out, washes poorly. wish i could ret... |


## Define The Model

In this case we define our model in a separate file, as a custom class called `NbowModel`.  The model contains two subcomponents: the count vectorizer for preprocessing and the neural network.  This class combines the two components so that we don't have to deal with them separately.  Here is an explanation of the various methods in this model:

1. `__init__`: initializes the count vectorizer, a preprocessor that counts the tokens in the text, and a neural network that does the modeling.
2. `fit`: fits the count vectorizer first, followed by the model.
3. `predict`: transforms the data with the count vectorizer before making predictions.
4. `eval_acc`: calculates model accuracy given a dataset and labels.
5. `eval_rocauc`: calculates the area under the ROC curve given a dataset and labels.
6. `model_dict`: exposes a dictionary holding the two components that form this model, the count vectorizer and the neural network.  We will use this to serialize the model's data into Metaflow.
7. `from_dict`: instantiates an `NbowModel` from a `model_dict`, which is useful for de-serializing data in Metaflow.

**NB:** Any time you create your own model library or define models in custom classes, we recommend explicitly defining how you will serialize and load the model.  This minimizes the chance that things break as your model code changes: you can make sure new versions of your code stay backwards compatible when loading older models, or handle serialization and de-serialization changes in a way that is transparent to you.
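
The serialization pattern described in this note can be sketched in plain Python. The class below is a toy stand-in, not the `NbowModel` from `model.py`: `ToyModel`, `FORMAT_VERSION`, and the dictionary keys are hypothetical names chosen for illustration. It shows one possible way to expose a `model_dict` and rebuild from it, assuming you want loading to fail loudly when the stored format is not one the current code understands.

```python
class ToyModel:
    """Toy stand-in for a custom model class (hypothetical, not model.py)."""

    FORMAT_VERSION = 1  # bump this whenever the serialized layout changes

    def __init__(self, vocab, weights):
        self.vocab = vocab      # e.g. token -> column index
        self.weights = weights  # e.g. one weight per token

    @property
    def model_dict(self):
        "Plain-dict view of everything needed to rebuild the model."
        return {
            "format_version": self.FORMAT_VERSION,
            "vocab": dict(self.vocab),
            "weights": list(self.weights),
        }

    @classmethod
    def from_dict(cls, d):
        "Rebuild a model, refusing formats this code does not understand."
        version = d.get("format_version")
        if version != cls.FORMAT_VERSION:
            raise ValueError(f"unsupported model format: {version!r}")
        return cls(vocab=d["vocab"], weights=d["weights"])


original = ToyModel(vocab={"great": 0, "poor": 1}, weights=[1.5, -2.0])
restored = ToyModel.from_dict(original.model_dict)
```

Because the dictionary contains only plain Python objects, it can be stored as a Metaflow artifact without depending on how the class itself is pickled.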

In [3]:
%pycat model.py

import tensorflow as tf
from tensorflow.keras import layers, optimizers, regularizers
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

class NbowModel():
    def __init__(self, vocab_sz):
        self...

## Constructing The Metaflow Flow

We will walk you through how we would organize this task in Metaflow.  Concretely, we will demonstrate the following steps:

1. **Read data from a parquet file** in the `start` step.
    - We use pandas to read `train.parquet`
    - Notice how we assign the training data to `self.df`.  This stores the data as an artifact in Metaflow, which means it will be versioned and saved in the artifact store for later retrieval.  It also allows you to pass data to another step.
    - We log the number of rows in the data.  It is always a good idea to log information about your dataset for debugging.
2. **Show a branching workflow to create a baseline and candidate model in parallel**  in the `baseline` and `train` steps.
    - When we call `self.next(self.baseline, self.train)`,  this creates a [branching flow](https://docs.metaflow.org/metaflow/basics#branch) that will allow the `baseline` and  `train` steps to run in parallel.
    - The `baseline` step records the performance metrics (accuracy and roc auc score) that result from classifying all examples with the majority class.  This will be our baseline against which we evaluate our model.
    - The `train` step uses a neural bag-of-words model to train a text classifier.  We serialize this model in a special way by getting the `model_dict` property of our custom model.
3. **Evaluate The Model** in the `join` step:
    - Any time you have branching in your flow, you [must have a join step](https://docs.metaflow.org/metaflow/basics#branch). The join step lets you access data from the various branches via the `inputs` parameter.  For more details, see the section about [data flow](https://docs.metaflow.org/metaflow/basics#data-flow-through-the-graph). In this step we also evaluate the model on the holdout set and log the performance metrics using print statements.
4. **The model is tagged as a `deployment_candidate`** in the `end` step, depending on whether it meets our performance criteria.
    - First we do a smoke test by testing the model on a few obvious examples where we expect the model to make good predictions.
    - If the model beats the baseline and passes the smoke test, we tag it as a deployment candidate.
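
The baseline and the deployment check described in the steps above can be sketched without any libraries. This is an illustration only: the helper names (`majority_baseline`, `roc_auc`, `is_deployment_candidate`) and the labels are made up, and the flow itself relies on scikit-learn for the metrics. The pairwise AUC below is the standard "probability that a positive outranks a negative" definition, with ties counted as one half.

```python
from collections import Counter


def majority_baseline(labels):
    "Predict the most frequent label for every example."
    majority, _ = Counter(labels).most_common(1)[0]
    return [majority] * len(labels)


def accuracy(labels, preds):
    "Fraction of predictions that match the labels."
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def roc_auc(labels, scores):
    "Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly."
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def is_deployment_candidate(model_auc, base_auc, smoke_scores):
    "Gate: beat the baseline AND score the obvious negative/positive correctly."
    neg_score, pos_score = smoke_scores
    return model_auc > base_auc and neg_score < 0.5 and pos_score > 0.5


labels = [1, 1, 1, 0, 1, 0]  # made-up labels: 4 positive, 2 negative
base_preds = majority_baseline(labels)
base_acc = accuracy(labels, base_preds)  # 4 correct out of 6
base_auc = roc_auc(labels, base_preds)   # constant scores always give 0.5
```

A constant prediction can never rank a positive example above a negative one, so the baseline AUC is 0.5 regardless of class balance, while the baseline accuracy equals the share of the majority class. With made-up numbers, `is_deployment_candidate(0.9, base_auc, (0.1, 0.95))` passes the gate, and a model that scores the obviously negative review at 0.7 does not.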

In [4]:
%%writefile flow.py

from metaflow import FlowSpec, step, Flow, current

class NLPFlow(FlowSpec):
        
    @step
    def start(self):
        "Read the data"
        import pandas as pd
        self.df = pd.read_parquet('train.parquet')
        print(f'num of rows: {self.df.shape[0]}')
        self.next(self.baseline, self.train)

    @step
    def baseline(self):
        "Compute the baseline"
        from sklearn.metrics import accuracy_score, roc_auc_score
        baseline_predictions = [1] * self.df.shape[0]
        self.base_acc = accuracy_score(self.df.labels, baseline_predictions)
        self.base_rocauc = roc_auc_score(self.df.labels, baseline_predictions)
        self.next(self.join)

    @step
    def train(self):
        "Train the model"
        from model import NbowModel
        model = NbowModel(vocab_sz=750)
        model.fit(X=self.df['review'], y=self.df['labels'])
        self.model_dict = model.model_dict #save model
        self.next(self.join)
        
    @step
    def join(self, inputs):
        "Compare the model results with the baseline."
        import pandas as pd
        from model import NbowModel
        self.model_dict = inputs.train.model_dict
        self.train_df = inputs.train.df
        # Carry the baseline metrics forward: `inputs` only exists in the join step.
        self.base_acc = inputs.baseline.base_acc
        self.base_rocauc = inputs.baseline.base_rocauc
        self.holdout_df = pd.read_parquet('holdout.parquet')
        model = NbowModel.from_dict(self.model_dict)
        
        self.model_acc = model.eval_acc(X=self.holdout_df['review'], labels=self.holdout_df['labels'])
        self.model_rocauc = model.eval_rocauc(X=self.holdout_df['review'], labels=self.holdout_df['labels'])
        
        print(f'Baseline Accuracy: {self.base_acc:.2%}')
        print(f'Baseline AUC: {self.base_rocauc:.2}')
        print(f'Model Accuracy: {self.model_acc:.2%}')
        print(f'Model AUC: {self.model_rocauc:.2}')
        self.next(self.end)
        
    @step
    def end(self):
        "Tags model as a deployment candidate if it beats the baseline and passes smoke tests."
        from model import NbowModel
        self.beats_baseline = self.model_rocauc > self.base_rocauc
        print(f'Model beats baseline (T/F): {self.beats_baseline}')
        # Rebuild the model from its artifact: local variables do not carry over between steps.
        model = NbowModel.from_dict(self.model_dict)
        #smoke test to make sure model is doing the right thing on obvious examples.
        _tst_reviews = ["poor fit its baggy in places where it isn't supposed to be.",
                        "love it, very high quality and great value"]
        _tst_preds = model.predict(_tst_reviews)
        self.passed_smoke_test = _tst_preds[0][0] < .5 and _tst_preds[1][0] > .5
        print(f'Model passed smoke test (T/F): {self.passed_smoke_test}')
        
        if self.beats_baseline and self.passed_smoke_test:
            run = Flow(current.flow_name)[current.run_id]
            run.add_tag('deployment_candidate')
        

if __name__ == '__main__':
    NLPFlow()

Overwriting flow.py


We can execute the flow like so:

In [5]:
#|eval: false
!python flow.py run

## Using The Model In Production

After you have trained a model in Metaflow, you may want to use this model to make predictions or for further testing.  There are two common patterns for this: (1) retrieve the model from Metaflow in an external system, or (2) have another flow that does predictions.  We illustrate both patterns here:

### 1. Retrieve Model From Metaflow To Use In External Systems

You can now retrieve the model tagged as a `deployment_candidate` outside Metaflow, so you can use this in whatever downstream application you want, or even just for ad-hoc testing:

In [6]:
from metaflow import Flow
import pandas as pd
from model import NbowModel

In [7]:
predict_df = pd.read_parquet('predict.parquet')

In [8]:
def get_latest_successful_run(flow_nm, tag):
    "Gets the latest successful run for a flow with a specific tag."
    for r in Flow(flow_nm).runs(tag):
        if r.successful: return r
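
Note that the helper above returns `None` when no run matches, so a later `run.data` would fail with an `AttributeError`. A stricter variant could look like the sketch below. `first_successful` is a hypothetical name, and it takes any iterable of objects with a `successful` attribute, so in this notebook you would pass it `Flow('NLPFlow').runs('deployment_candidate')`. Like the original helper, it assumes the runs are yielded newest first. The `SimpleNamespace` objects are stand-ins so that the example runs without a Metaflow backend.

```python
from types import SimpleNamespace


def first_successful(runs, description="run"):
    "Return the first successful item in `runs`, or raise a clear error."
    for r in runs:
        if r.successful:
            return r
    raise LookupError(f"no successful {description} found")


# Stand-ins for Metaflow Run objects (hypothetical ids), newest first.
fake_runs = [
    SimpleNamespace(id="3", successful=False),
    SimpleNamespace(id="2", successful=True),
    SimpleNamespace(id="1", successful=True),
]
latest = first_successful(fake_runs, "deployment candidate")
```

Here `latest` is the stand-in with id `"2"`: the newest run is skipped because it did not succeed.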

In [9]:
run = get_latest_successful_run('NLPFlow', 'deployment_candidate')
model = NbowModel.from_dict(run.data.model_dict)

2022-07-21 13:18:43.080782: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Now that we have retrieved the model using the tag, we can use it to make predictions:

In [10]:
preds = model.predict(predict_df['review'])
preds

array([[0.99896604],
       [0.9890883 ],
       [0.99934345],
       ...,
       [0.99978065],
       [0.9995602 ],
       [0.5956297 ]], dtype=float32)
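
The smoke test in `flow.py` treats a score below 0.5 as negative and a score above 0.5 as positive. Under that assumption, turning scores into class labels could look like the sketch below. `to_labels` and its `threshold` parameter are hypothetical names, and the last score is made up; in the notebook you would first flatten the predictions with `preds.squeeze()`.

```python
def to_labels(scores, threshold=0.5):
    "Map each score to 1 (positive) if it exceeds the threshold, else 0."
    return [1 if s > threshold else 0 for s in scores]


# Three scores copied from the output above, plus one made-up low score.
scores = [0.99896604, 0.9890883, 0.5956297, 0.12]
labels = to_labels(scores)  # [1, 1, 1, 0]
```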

You can write these predictions to a parquet file like so:

In [11]:
import pyarrow as pa
import pyarrow.parquet as pq  # import the parquet submodule explicitly
pa_tbl = pa.table({"data": preds.squeeze()})
pq.write_table(pa_tbl, "sentiment_predictions.parquet")

### 2. Use a Flow To Make Predictions

You may want to do batch predictions in a Flow as well.  In this flow, we will perform the following steps:

1. Get the latest deployment candidate using the Metaflow API in the `start` step.
2. Make predictions with our deployment candidate on a new dataset and write that to a parquet file in the `end` step.

In [12]:
%%writefile predflow.py

from metaflow import FlowSpec, step, Flow, current

class NLPredictionFlow(FlowSpec):
    
    def get_latest_successful_run(self, flow_nm, tag):
        "Gets the latest successful run for a flow with a specific tag."
        for r in Flow(flow_nm).runs(tag):
            if r.successful: return r
        
    @step
    def start(self):
        "Get the latest deployment candidate that is from a successful run"
        self.deploy_run = self.get_latest_successful_run('NLPFlow', 'deployment_candidate')
        self.next(self.end)
    
    @step
    def end(self):
        "Make predictions"
        from model import NbowModel
        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq  # import the parquet submodule explicitly
        new_reviews = pd.read_parquet('predict.parquet')['review']
        
        # Make predictions
        model = NbowModel.from_dict(self.deploy_run.data.model_dict)
        predictions = model.predict(new_reviews)
        print(f'Writing predictions to parquet: {predictions.shape[0]:,} rows')
        pa_tbl = pa.table({"data": predictions.squeeze()})
        pq.write_table(pa_tbl, "sentiment_predictions.parquet")
        
if __name__ == '__main__':
    NLPredictionFlow()

Overwriting predflow.py


In [13]:
! python predflow.py --no-pylint --datastore local --metadata local run

Metaflow 2.7.1 executing NLPredictionFlow for user:hamel
Validating your flow...
    The graph looks good!
2022-07-21 13:18:44.210 Workflow starting (run-id 1658434724205954):
2022-07-21 13:18:44.219 [1658434724205954/start/1 (pid 57802)] Task is starting.
2022-07-21 13:18:45.026 [1658434724205954/start/1 (pid 57802)] Task finished successfully.
2022-07-21 13:18:45.033 [1658434724205954/end/2 (pid 57806)] Task is starting.
2022-07-21 13:18:47.556 [1658434724205954/end/2 (pid 57806)] 2022-07-21 13:18:47.556819: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-cri

### Further Discussion

This is a very simple example that will also run on your laptop.  However, for production use cases you may want to use [@conda](https://docs.metaflow.org/metaflow/dependencies#managing-dependencies-with-conda-decorator) for dependency management, [@batch](https://docs.metaflow.org/v/r/metaflow/scaling#using-aws-batch) or [@kubernetes](https://docs.metaflow.org/metaflow/scaling-out-and-up/effortless-scaling-with-kubernetes) for remote execution, and [@schedule](https://docs.metaflow.org/going-to-production-with-metaflow/scheduling-metaflow-flows/scheduling-with-aws-step-functions#scheduling-a-flow) to schedule jobs to run periodically.  