#### What You Will Learn
By the end of this lesson, you will know how to:

* Build a simple NLP model using TensorFlow and scikit-learn to classify fashion reviews
* Set up your model so that it can be easily used in Metaflow

#### Background

We are going to build a model that classifies customer reviews as having positive or negative sentiment, using the [Women's E-Commerce Clothing Reviews Dataset](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews). Here is what the data looks like:

In [None]:
import pandas as pd
df = pd.read_parquet('train.parquet')
print(f'num of rows: {df.shape[0]}')

num of rows: 20377


In [None]:
df.head()

|   | labels | review |
|---|--------|--------|
| 0 | 0 | Odd fit: I wanted to love this sweater but the... |
| 1 | 1 | Very comfy dress: The quality and material of ... |
| 2 | 0 | Fits nicely but fabric a bit thin: I ordered t... |
| 3 | 1 | Great fit: Love these jeans, fit and style... ... |
| 4 | 0 | Stretches out, washes poorly. wish i could ret... |


Before we begin training a model, it is useful to set a baseline.  One such baseline is the majority-class classifier, which labels every example with the majority class.  We can then calculate our performance metrics for this baseline model, which in this case are [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) and the [area under the ROC curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html):

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score

valdf = pd.read_parquet('valid.parquet')
# Predict the majority class (label 1) for every validation example
baseline_predictions = [1] * valdf.shape[0]
base_acc = accuracy_score(valdf.labels, baseline_predictions)
base_rocauc = roc_auc_score(valdf.labels, baseline_predictions)

print(f'Baseline Accuracy: {base_acc:.3f}\nBaseline AUC: {base_rocauc}')

Baseline Accuracy: 0.773
Baseline AUC: 0.5
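
The cell above hardcodes `1` as the majority class. As a minimal sketch of an alternative, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` can infer the majority class from the labels instead. The toy labels and the names `y_true`, `X_dummy`, and `majority_clf` below are illustrative assumptions, not part of the lesson's dataset; in practice you would likely fit on the training labels and predict on the validation set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels standing in for valdf.labels (illustrative, not the real data):
# 7 positives and 3 negatives, so the majority class is 1.
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
# DummyClassifier ignores the feature values, so a placeholder column is enough.
X_dummy = np.zeros((len(y_true), 1))

# "most_frequent" learns the majority class from the labels it is fit on.
majority_clf = DummyClassifier(strategy="most_frequent")
majority_clf.fit(X_dummy, y_true)
toy_predictions = majority_clf.predict(X_dummy)

toy_acc = accuracy_score(y_true, toy_predictions)
toy_rocauc = roc_auc_score(y_true, toy_predictions)
print(f"Toy baseline accuracy: {toy_acc:.3f}")
print(f"Toy baseline AUC: {toy_rocauc:.3f}")
```

Because every example receives the same prediction, the baseline cannot rank positives above negatives, which is why its AUC is 0.5 regardless of how imbalanced the classes are. Its accuracy simply equals the share of the majority class.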


Now it's time to build our ML model.  We are going to define our model in a separate file with a custom class called `NbowModel`.  The model contains two subcomponents: the count vectorizer for preprocessing and the model itself.  This class combines the two components so that we don't have to deal with them separately.
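
To make the wrapper idea concrete, here is a rough sketch of how a class might bundle a count vectorizer with a classifier behind `fit`, `eval_acc`, and `eval_rocauc` methods. This is not the lesson's actual `model.py`: the class name `NbowSketch`, the `predict_proba` helper, the `threshold` argument, and the toy reviews are all hypothetical, and scikit-learn's `LogisticRegression` is swapped in for the TensorFlow model to keep the example dependency-light.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


class NbowSketch:
    """Hypothetical wrapper bundling a count vectorizer with a classifier."""

    def __init__(self, vocab_sz=750):
        # max_features keeps only the vocab_sz most frequent tokens.
        self.vectorizer = CountVectorizer(max_features=vocab_sz)
        # Stand-in classifier (assumption); the lesson's model uses TensorFlow.
        self.classifier = LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        # Learn the vocabulary and train the classifier in one call.
        counts = self.vectorizer.fit_transform(X)
        self.classifier.fit(counts, y)
        return self

    def predict_proba(self, X):
        # Reuse the fitted vocabulary; return the probability of class 1.
        counts = self.vectorizer.transform(X)
        return self.classifier.predict_proba(counts)[:, 1]

    def eval_acc(self, X, y, threshold=0.5):
        predictions = (self.predict_proba(X) > threshold).astype(int)
        return accuracy_score(y, predictions)

    def eval_rocauc(self, X, y):
        return roc_auc_score(y, self.predict_proba(X))


# Toy reviews for illustration only (not taken from the dataset).
toy_reviews = [
    "love this dress great fit",
    "great quality very comfy",
    "love the fabric perfect fit",
    "poor quality returned it",
    "terrible fit very disappointed",
    "fabric is thin and itchy returned",
]
toy_labels = [1, 1, 1, 0, 0, 0]

sketch = NbowSketch(vocab_sz=50).fit(toy_reviews, toy_labels)
sketch_acc = sketch.eval_acc(toy_reviews, toy_labels)
sketch_rocauc = sketch.eval_rocauc(toy_reviews, toy_labels)
print(f"Toy accuracy: {sketch_acc:.3f}\nToy AUC: {sketch_rocauc:.3f}")
```

Keeping the vectorizer inside the class means callers pass raw review text to every method, so the same vocabulary is applied at training and evaluation time without extra bookkeeping.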

Next, let's import `NbowModel` and train it on this dataset.  The purpose of doing this is to make sure the code works as we expect before using Metaflow.  For this example, we will set `vocab_sz = 750`:

In [None]:
#notest
from model import NbowModel
model = NbowModel(vocab_sz=750)
model.fit(X=df['review'], y=df['labels'])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next, we can evaluate our model on the validation set using the built-in evaluation methods we created:

In [None]:
model_acc = model.eval_acc(valdf['review'], valdf['labels'])
model_rocauc = model.eval_rocauc(valdf['review'], valdf['labels'])

print(f'Model Accuracy: {model_acc:.3f}\nModel AUC: {model_rocauc:.3f}')

Great! This is an improvement upon our baseline!  Now we have set up what we need to start using Metaflow.  In the next lesson, we are going to operationalize the steps we manually performed here into a Flow.