### Setup Instructions

If you haven't done so, please follow the [setup instructions](./nlp-tutorial-setup) to prepare your environment. This tutorial references the Python script [baselineflow.py](https://github.com/outerbounds/tutorials/blob/main/nlp/baselineflow.py).

#### What You Will Learn
By the end of this lesson, you will:

* Learn how to operationalize loading data and computing a baseline.
* Learn how to run and view tasks with Metaflow.

In Lesson 2, you saw how we constructed a model in preparation for Metaflow.  In this lesson, we will construct a basic flow that reads our data and reports a baseline.

When creating flows, we recommend starting simple: create a flow that reads your data and reports a baseline metric.  This way, you can ensure you have the right foundation to incorporate your model.  Furthermore, starting simple helps with debugging.

Our baseline flow has just three steps: a `start` step where we read the data, a `baseline` step, and an `end` step that is a placeholder for now.

Below is a detailed explanation of each step:

1. **Read data from parquet files** in the `start` step.
    - We use pandas to read `train.parquet` and `valid.parquet`.
    - Notice that we assign the training data to `self.df` and the validation data to `self.valdf`. Assigning to `self` stores the data as artifacts in Metaflow, which means they are versioned and saved in the artifact store for later retrieval. It also lets you pass data to another step. The only prerequisite is that the data you store must be pickleable.
    - We log the number of rows in the data. It is always a good idea to log information about your dataset for debugging.
2. **Compute the baseline** in the `baseline` step.
    - The `baseline` step records the performance metrics (accuracy and ROC AUC score) that result from classifying all examples with the majority class. The flow does this by predicting label `1` for every validation example. This is the baseline against which we evaluate our model.
3. **Print the baseline metrics** in the `end` step.
    - This step is just a placeholder for now, but it also illustrates how artifacts stored in one step can be retrieved in a later step.
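
Because artifacts must be pickleable, it can help to check an object before assigning it to `self`. The sketch below is a minimal illustration that assumes the standard-library `pickle` module is a reasonable stand-in for how artifacts are serialized. The helper `is_pickleable` and the sample data are hypothetical and not part of `baselineflow.py`.

```python
import pickle


def is_pickleable(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # pickle can raise several different exception types
        return False


# Plain containers of numbers and strings serialize without trouble.
rows = {"labels": [1, 0, 1], "review": ["great", "poor", "fine"]}
print(is_pickleable(rows))

# A generator holds live execution state, so pickle rejects it.
lazy_rows = (r for r in rows["review"])
print(is_pickleable(lazy_rows))
```

If a check like this fails, one option is to convert the object to a plain list, dict, or DataFrame before storing it.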

In [None]:
%%writefile baselineflow.py

from metaflow import FlowSpec, step

class BaselineNLPFlow(FlowSpec):

    @step
    def start(self):
        "Read the data"
        import pandas as pd
        # Attributes assigned to self are stored as Metaflow artifacts.
        self.df = pd.read_parquet('train.parquet')
        self.valdf = pd.read_parquet('valid.parquet')
        print(f'num of rows: {self.df.shape[0]}')
        self.next(self.baseline)

    @step
    def baseline(self):
        "Compute the baseline"
        from sklearn.metrics import accuracy_score, roc_auc_score
        # Predict label 1 for every validation example.
        baseline_predictions = [1] * self.valdf.shape[0]
        self.base_acc = accuracy_score(self.valdf.labels, baseline_predictions)
        self.base_rocauc = roc_auc_score(self.valdf.labels, baseline_predictions)
        self.next(self.end)

    @step
    def end(self):
        "Print the baseline metrics"
        # Artifacts stored in earlier steps are available here through self.
        print(f'Baseline Accuracy: {self.base_acc:.3f}\nBaseline AUC: {self.base_rocauc}')


if __name__ == '__main__':
    BaselineNLPFlow()

Overwriting baselineflow.py
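
To see what the `baseline` step measures, the following dependency-free sketch recomputes both metrics for a constant prediction. The labels are made up for illustration, and `accuracy` and `roc_auc` are hypothetical helpers that use the pairwise (rank-based) definition of ROC AUC rather than the scikit-learn functions the flow calls.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """Pairwise ROC AUC: the chance that a positive example outscores a
    negative one, with ties counted as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0]            # made-up labels: three positives, one negative
constant = [1] * len(labels)     # predict label 1 for every example

print(accuracy(labels, constant))  # 0.75, the share of the predicted class
print(roc_auc(labels, constant))   # 0.5, because every pair is a tie
```

A constant prediction cannot rank any positive above any negative, so its AUC is 0.5 regardless of class balance, while its accuracy equals the share of the predicted class. A model is only useful if it beats both numbers.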


In [None]:
#notest
! python baselineflow.py run

Metaflow 2.7.1 executing BaselineNLPFlow for user:hamel
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-08-11 09:09:27.864 Workflow starting (run-id 1660234167860082):
2022-08-11 09:09:27.876 [1660234167860082/start/1 (pid 28912)] Task is starting.
2022-08-11 09:09:28.856 [1660234167860082/start/1 (pid 28912)] num of rows: 20377
2022-08-11 09:09:28.986 [1660234167860082/start/1 (pid 28912)] Task finished successfully.
2022-08-11 09:09:28.999 [1660234167860082/baseline/2 (pid 28916)] Task is starting.
2022-08-11 09:09:31.060 [1660234167860082/baseline/2 (pid 28916)] Task f

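Once a run has finished, its artifacts can also be inspected outside the flow. The sketch below assumes Metaflow's Client API (`Flow`, `latest_successful_run`, and `run.data`) and a completed run of `BaselineNLPFlow`. The helpers `latest_run_data` and `format_baseline` are hypothetical names, and the stand-in object uses made-up numbers so the formatting can be tried without a run.

```python
from types import SimpleNamespace


def latest_run_data(flow_name="BaselineNLPFlow"):
    """Return the artifacts of the latest successful run of flow_name."""
    from metaflow import Flow  # imported lazily; requires Metaflow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"no successful run found for {flow_name}")
    return run.data


def format_baseline(data):
    """Format the two baseline artifacts stored by the flow."""
    return f"accuracy={data.base_acc:.3f} auc={data.base_rocauc:.3f}"


# Stand-in with made-up values; after a real run, pass latest_run_data()
# to format_baseline instead.
fake = SimpleNamespace(base_acc=0.75, base_rocauc=0.5)
print(format_baseline(fake))
```

Keeping the lookup and the formatting separate means the formatting can be exercised without a Metaflow datastore.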
### Next Steps

In the next lesson, you will learn how to incorporate your model into the flow as well as deal with branching for parallel runs.