# Week 2 Project

Once again, we’ll be using the [Women's Ecommerce Clothing Reviews Dataset from Kaggle](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews), which Kaggle states is a 

> dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”. 

The machine learning task will be sentiment analysis, classifying each review as having positive or negative sentiment.

## Task 1: Training and Evaluating Sentiment Analysis Models Using Metaflow

In this task, you'll use Metaflow to build two machine learning models for sentiment analysis: a baseline *"majority class"* classifier and your own custom model. You'll then train both models in parallel and experiment with different hyperparameters to optimize their performance. Finally, you'll use this notebook and the Metaflow Client API to analyze the results of your different models and hyperparameters. Here's what you'll need to do:

### Step 1: Build the Workflows
The first step in this task is to build the workflow(s) for sentiment analysis using the Metaflow framework. Start by creating a new flow in Metaflow and implementing the baseline *"majority class"* classifier. Then, build your own custom classifier using techniques you learned in Week 1, or any [helpful resources](https://outerbounds.com/docs/nlp-tutorial-L2/) you'd like. For your custom model, be sure to include steps for data preprocessing, model training, and evaluation.
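
As a rough illustration of what a majority class baseline does, here is a minimal pure-Python sketch. The toy label lists and the helper name `majority_class` are assumptions made for this example and are not part of the project template.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label in an iterable of labels."""
    return Counter(labels).most_common(1)[0][0]

# Toy labels for illustration only (1 = positive, 0 = negative).
train_labels = [1, 1, 0, 1, 0, 1]
val_labels = [1, 0, 1, 1]

majority = majority_class(train_labels)
# The baseline ignores the review text and always predicts the majority label.
predictions = [majority] * len(val_labels)
accuracy = sum(p == y for p, y in zip(predictions, val_labels)) / len(val_labels)
print(f"majority class: {majority}, accuracy: {accuracy:.2f}")
```

Because the baseline never looks at the features, its accuracy simply equals the share of the majority label in the validation set.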

### Step 2: Train Both Models in Parallel
Once you've built your models, the next step is to train both models in parallel using the Metaflow framework. Use Metaflow to run both training jobs in parallel steps. If you get stuck, you may want to review the [FlowSpec branching documentation](https://docs.metaflow.org/metaflow/basics#branch).

### Step 3: Experiment with Hyperparameters
After you've trained both models in parallel, the next step is to experiment with different hyperparameters to optimize their performance. Try different values for hyperparameters such as learning rate, batch size, and number of epochs, and record the results for each combination of hyperparameters as Data Artifacts in Metaflow.
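
One possible way to enumerate hyperparameter combinations is sketched below using only the standard library. The search space values are hypothetical, and the name `hyperparam_set` is borrowed from the flow later in this notebook.

```python
from itertools import product

# Hypothetical search space; pick values that suit your own model.
grid = {
    "learning_rate": [0.001, 0.002],
    "batch_size": [32, 64],
    "epochs": [5, 10],
}

# Expand the grid into one dict per combination of values.
hyperparam_set = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(hyperparam_set))
print(hyperparam_set[0])
```

Inside a Metaflow step, assigning a list like this (and the scores computed for each entry) to an attribute on `self` stores it as a data artifact of that task.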

### Step 4: Analyze the Results
Finally, use this notebook and the Metaflow Client API to analyze the results of your different models and hyperparameters. Create visualizations to compare the performance of the two models and identify the best hyperparameters for each one.
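
Once the results have been loaded into the notebook, comparing them can be as simple as sorting a list. The sketch below assumes a simplified result record, and the numbers are placeholders for illustration, not results from a real run.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    """Simplified stand-in for the result struct used in the flow."""
    name: str
    acc: float
    rocauc: float

# Placeholder numbers for illustration; they are not results from a real run.
results = [
    ModelResult("Baseline", 0.77, 0.50),
    ModelResult("NbowModel - vocab_sz: 100", 0.80, 0.85),
    ModelResult("NbowModel - vocab_sz: 300", 0.84, 0.90),
]

# Pick the best model by ROC AUC and rank all models by accuracy.
best_by_auc = max(results, key=lambda r: r.rocauc)
ranked_by_acc = sorted(results, key=lambda r: r.acc, reverse=True)

print(best_by_auc.name)
print([r.name for r in ranked_by_acc])
```

In a real analysis the `results` list would come from the artifacts of a finished run rather than from hard-coded values.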

By completing this task, you'll gain experience working with the Metaflow framework and learn how to build and optimize machine learning workflows for sentiment analysis.

In [1]:
from collections import Counter
import pandas as pd
import numpy as np 
from termcolor import colored
import matplotlib.pyplot as plt
import seaborn as sns
import string

# You can style your plots here, but it is not part of the project.
YELLOW = '#FFBC00'
GREEN = '#37795D'
PURPLE = '#5460C0'
BACKGROUND = '#F4EBE6'
colors = [GREEN, PURPLE]
custom_params = {
    'axes.spines.right': False, 'axes.spines.top': False,
    'axes.facecolor':BACKGROUND, 'figure.facecolor': BACKGROUND, 
    'figure.figsize':(8, 8)
}
sns_palette = sns.color_palette(colors, len(colors))
sns.set_theme(style='ticks', rc=custom_params)

In [2]:
# TODO: load the data. 
from sklearn.model_selection import train_test_split
df = pd.read_csv('../../full-stack-ml-metaflow-corise-week-1/data/Womens Clothing E-Commerce Reviews.csv', index_col=0)
labeling_function = lambda row: 1 if row['rating'] >= 4 else 0

# transformations
df.columns = ["_".join(name.lower().strip().split()) for name in df.columns]
df = df[~df.review_text.isna()]
df['review'] = df['review_text'].astype('str')
_has_review_df = df[df['review_text'] != 'nan']
reviews = _has_review_df['review_text']
labels = _has_review_df.apply(labeling_function, axis=1)
df = pd.DataFrame({'label': labels, **_has_review_df})

# split into training and validation.
_df = pd.DataFrame({'review': reviews, 'label': labels})
traindf, valdf = train_test_split(_df, test_size=0.2)

In [3]:
# TODO: build the majority class baseline model. 
# TODO: find the majority class in the labels. 🤔
# TODO: score the model on valdf with a 2D metric space: sklearn.metrics.accuracy_score, sklearn.metrics.roc_auc_score
    # Documentation on suggested model-scoring approach: https://scikit-learn.org/stable/modules/model_evaluation.html

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(
        min_df=.005, max_df = .75, stop_words='english', 
        strip_accents='ascii', max_features=500)

X_train = cv.fit_transform(traindf['review'].copy()).toarray()
y_train = traindf['label']

model = DummyClassifier(strategy="most_frequent")
model.fit(X_train, y_train)

X_val = cv.transform(valdf['review'].copy()).toarray()
y_val = valdf['label']
pred = model.predict(X_val).tolist()

base_acc = accuracy_score(valdf['label'], pred)
base_rocauc = roc_auc_score(valdf['label'], pred)

msg = 'Baseline Accuracy: {}\nBaseline AUC: {}'
print(msg.format(round(base_acc, 3), round(base_rocauc, 3)))

Baseline Accuracy: 0.771
Baseline AUC: 0.5
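
The AUC of 0.5 is expected for any model that gives every row the same score. The sketch below is a minimal pure-Python illustration that computes ROC AUC as the fraction of correctly ranked (positive, negative) pairs. The helper `pairwise_auc` and the toy data are assumptions for this example, not the scikit-learn implementation.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

labels = [1, 1, 1, 0]                  # toy labels, majority class is 1
constant_scores = [1, 1, 1, 1]         # a majority class model scores every row the same
separating_scores = [0.9, 0.8, 0.7, 0.1]

constant_auc = pairwise_auc(labels, constant_scores)
separating_auc = pairwise_auc(labels, separating_scores)
print(constant_auc, separating_auc)
```

With constant scores every pair is a tie, so the ranking carries no information and the AUC lands at 0.5 regardless of the class balance. This is why accuracy alone can make the baseline look stronger than it is.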


In [4]:
%%writefile model.py
# TODO: modify this custom model to your liking. Check out this tutorial for more on this class: https://outerbounds.com/docs/nlp-tutorial-L2/
# TODO: train the model on traindf.
# TODO: score the model on valdf with _the same_ 2D metric space you used in previous cell.
# TODO: test your model works by importing the model module in notebook cells, and trying to fit traindf and score predictions on the valdf data!

import tensorflow as tf
from tensorflow.keras import layers, optimizers, regularizers
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

class NbowModel():
    def __init__(self, vocab_sz):

        self.vocab_sz = vocab_sz

        # Instantiate the CountVectorizer
        self.cv = CountVectorizer(
            min_df=.005, max_df = .75, stop_words='english', 
            strip_accents='ascii', max_features=self.vocab_sz
        )

        # Define the keras model
        inputs = tf.keras.Input(shape=(self.vocab_sz,), 
                                name='input')
        x = layers.Dropout(0.10)(inputs)
        x = layers.Dense(
            15, activation="relu",
            kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4)
        )(x)
        predictions = layers.Dense(1, activation="sigmoid",)(x)
        self.model = tf.keras.Model(inputs, predictions)
        opt = optimizers.Adam(learning_rate=0.002)
        self.model.compile(loss="binary_crossentropy", 
                           optimizer=opt, metrics=["accuracy"])

    def fit(self, X, y):
        print("fit: type of X:", type(X))
        print("fit: shape:", X.shape)
        #print("fit: 1st item:", X[0])
        res = self.cv.fit_transform(X).toarray()
        self.model.fit(x=res, y=y, batch_size=32, 
                       epochs=10, validation_split=.2)
    
    def predict(self, X):
        print("pred: type of X:", type(X))
        print("pred: shape:", X.shape)
        #print("pred: 1st item:", X[0])
        res = self.cv.transform(X).toarray()
        return self.model.predict(res)
    
    def eval_acc(self, X, labels, threshold=.5):
        return accuracy_score(labels, 
                              self.predict(X) > threshold)
    
    def eval_rocauc(self, X, labels):
        return roc_auc_score(labels,  self.predict(X))

    @property
    def model_dict(self): 
        return {'vectorizer':self.cv, 'model': self.model}

    @classmethod
    def from_dict(cls, model_dict):
        "Get Model from dictionary"
        nbow_model = cls(len(
            model_dict['vectorizer'].vocabulary_
        ))
        nbow_model.model = model_dict['model']
        nbow_model.cv = model_dict['vectorizer']
        return nbow_model

Overwriting model.py


In [5]:
%%writefile baseline_challenge.py
# TODO: In this cell, write your BaselineChallenge flow in the baseline_challenge.py file.

from metaflow import FlowSpec, step, Flow, current, Parameter, IncludeFile, card
from metaflow.cards import Table, Markdown, Artifact, Image
import numpy as np 
from dataclasses import dataclass

labeling_function = lambda row: 1 if row['rating'] >= 4 else 0 # TODO: Define your labeling function here.

@dataclass
class ModelResult:
    "A custom struct for storing model evaluation results."
    name: str
    params: object  # a description string for the baseline, a dict for NbowModel
    pathspec: str
    acc: float
    rocauc: float

class BaselineChallenge(FlowSpec):

    split_size = Parameter('split-sz', default=0.2)
    data = IncludeFile('data', default='../data/Womens Clothing E-Commerce Reviews.csv')
    kfold = Parameter('k', default=5)
    scoring = Parameter('scoring', default='accuracy')

    @step
    def start(self):

        import pandas as pd
        import io 
        from sklearn.model_selection import train_test_split
        
        # load dataset packaged with the flow.
        # this technique is convenient when working with small datasets that need to move to remote tasks.
        df = pd.read_csv(io.StringIO(self.data))
        # TODO: load the data. 
        # Look up a few lines to the IncludeFile('data', default='Womens Clothing E-Commerce Reviews.csv'). 
        # You can find documentation on IncludeFile here: https://docs.metaflow.org/scaling/data#data-in-local-files


        # filter down to reviews and labels 
        df.columns = ["_".join(name.lower().strip().split()) for name in df.columns]
        df = df[~df.review_text.isna()]
        df['review'] = df['review_text'].astype('str')
        _has_review_df = df[df['review_text'] != 'nan']
        reviews = _has_review_df['review_text']
        labels = _has_review_df.apply(labeling_function, axis=1)
        self.df = pd.DataFrame({'label': labels, **_has_review_df})

        # split the data 80/20, or by using the flow's split-sz CLI argument
        _df = pd.DataFrame({'review': reviews, 'label': labels})
        self.traindf, self.valdf = train_test_split(_df, test_size=self.split_size)
        print(f'num of rows in train set: {self.traindf.shape[0]}')
        print(f'num of rows in validation set: {self.valdf.shape[0]}')

        self.next(self.baseline, self.model)

    @step
    def baseline(self):
        "Compute the baseline"

        from sklearn.metrics import accuracy_score, roc_auc_score
        self._name = "baseline"
        params = "Always predict 1"
        pathspec = f"{current.flow_name}/{current.run_id}/{current.step_name}/{current.task_id}"

        predictions = np.ones(len(self.valdf)).astype('int') # TODO: predict the majority class
        acc = accuracy_score(self.valdf['label'], predictions) # TODO: return the accuracy_score of these predictions

        rocauc = roc_auc_score(self.valdf['label'], predictions) # TODO: return the roc_auc_score of these predictions
        print(f"model: {self._name}, acc: {acc:.3f}, rocauc: {rocauc:.3f}")
        self.result = ModelResult("Baseline", params, pathspec, acc, rocauc)
        self.next(self.aggregate)

    @step
    def model(self):

        # TODO: import your model if it is defined in another file.
        from model import NbowModel

        self._name = "model"
        # NOTE: If you followed the link above to find a custom model implementation, 
            # you will have noticed your model's vocab_sz hyperparameter.
            # Too big of vocab_sz causes an error. Can you explain why? 
        self.hyperparam_set = [{'vocab_sz': 100}, {'vocab_sz': 300}, {'vocab_sz': 500}]  
        pathspec = f"{current.flow_name}/{current.run_id}/{current.step_name}/{current.task_id}"

        self.results = []
        for params in self.hyperparam_set:
            model = NbowModel(params['vocab_sz'])   # TODO: instantiate your custom model here!
            model.fit(X=self.traindf['review'], y=self.traindf['label'])  # MO: changed from df to traindf
            acc = model.eval_acc(X=self.valdf['review'], labels=self.valdf['label'])  # TODO: evaluate your custom model in an equivalent way to accuracy_score.
            rocauc = model.eval_rocauc(X=self.valdf['review'], labels=self.valdf['label'])  # TODO: evaluate your custom model in an equivalent way to roc_auc_score.
            print(f"model: {self._name}, vocab_sz: {params['vocab_sz']}, acc: {acc:.3f}, rocauc: {rocauc:.3f}")
            self.results.append(ModelResult(f"NbowModel - vocab_sz: {params['vocab_sz']}", params, pathspec, acc, rocauc))

        self.next(self.aggregate)
        
    @step
    def aggregate(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    BaselineChallenge()

Overwriting baseline_challenge.py


## Task 2: Anticipating Failure in Your Machine Learning Project

In this task, you'll practice anticipating potential failure modes in a sentiment analysis classifier and develop strategies to mitigate them. Here's what you'll need to do:

### Step 1: Identify Potential Failure Modes

The first step in anticipating failure in your machine learning project is to identify potential failure modes. Start by brainstorming ways in which your project could fail from an engineering point of view. For example, your model could overfit to the training data or suffer from data bias.

* overfit
* underfit
* data bias
* data scarcity
* data sparsity
* data shift
* target shift
* inappropriate loss function

### Step 2: Develop Strategies to Mitigate Failure Modes

Once you've identified potential failure modes, the next step is to develop strategies to mitigate them. Think about what measures you could take to fix the issue if it were to occur. For example, if your model is overfitting to the training data, you could try regularization techniques such as L1 or L2 regularization to reduce the complexity of the model.

* collect/buy more data
* regularization: L1, L2
* shift monitoring
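
As one example of what shift monitoring could look like, the sketch below compares the share of positive labels in the training data with the share in a newer batch. The helper names, the toy data, and the `tolerance` value are assumptions for illustration. A production setup would likely track more than one statistic.

```python
def positive_rate(labels):
    """Fraction of labels equal to 1."""
    return sum(1 for y in labels if y == 1) / len(labels)

def label_shift_detected(train_labels, new_labels, tolerance=0.1):
    """Flag a shift when the positive rate moves by more than `tolerance`."""
    gap = abs(positive_rate(train_labels) - positive_rate(new_labels))
    return gap > tolerance

train_labels = [1, 1, 1, 0]    # 75% positive
similar_batch = [1, 1, 0, 1]   # 75% positive
shifted_batch = [0, 0, 1, 0]   # 25% positive

print(label_shift_detected(train_labels, similar_batch))
print(label_shift_detected(train_labels, shifted_batch))
```

The same comparison can be applied to input statistics such as review length or vocabulary coverage to watch for data shift as well as target shift.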

### Step 3: Plan Ahead to Avoid Failure Modes

Finally, it's important to plan ahead to avoid potential failure modes in the first place. Think about what you could have done initially to avoid these failure modes. For example, you could have collected a diverse set of training data to reduce data bias or experimented with different model architectures to find the best solution for your problem.

The key to anticipating failure in your machine learning project is to be proactive rather than reactive. By identifying potential failure modes ahead of time and developing strategies to mitigate them, you'll be better equipped to build a successful machine learning project.

## Task 3: Visualizing ML Results with MF Cards
Now it is time to iterate. Extend the flow in your `baseline_challenge.py` file to include a step that aggregates all of the results from the hyperparameter tuning jobs and logs the results and a data visualization in a Metaflow card.

In [18]:
%%writefile baseline_challenge.py
# TODO: In this cell, write your BaselineChallenge flow in the baseline_challenge.py file.

from metaflow import FlowSpec, step, Flow, current, Parameter, IncludeFile, card
from metaflow.cards import Table, Markdown, Artifact, Image
import numpy as np 
from dataclasses import dataclass

labeling_function = lambda row: 1 if row['rating'] >= 4 else 0 # TODO: Define your labeling function here.

@dataclass
class ModelResult:
    "A custom struct for storing model evaluation results."
    name: str
    params: object  # a description string for the baseline, a dict for NbowModel
    pathspec: str
    acc: float
    rocauc: float

class BaselineChallenge(FlowSpec):

    split_size = Parameter('split-sz', default=0.2)
    data = IncludeFile('data', default='Womens Clothing E-Commerce Reviews.csv')
    kfold = Parameter('k', default=5)
    scoring = Parameter('scoring', default='accuracy')

    @step
    def start(self):

        import pandas as pd
        import io 
        from sklearn.model_selection import train_test_split
        
        # load dataset packaged with the flow.
        # this technique is convenient when working with small datasets that need to move to remote tasks.
        df = pd.read_csv(io.StringIO(self.data)) 
        # TODO: load the data. 
        # Look up a few lines to the IncludeFile('data', default='Womens Clothing E-Commerce Reviews.csv'). 
        # You can find documentation on IncludeFile here: https://docs.metaflow.org/scaling/data#data-in-local-files


        # filter down to reviews and labels 
        df.columns = ["_".join(name.lower().strip().split()) for name in df.columns]
        df = df[~df.review_text.isna()]
        df['review'] = df['review_text'].astype('str')
        _has_review_df = df[df['review_text'] != 'nan']
        reviews = _has_review_df['review_text']
        labels = _has_review_df.apply(labeling_function, axis=1)
        self.df = pd.DataFrame({'label': labels, **_has_review_df})

        # split the data 80/20, or by using the flow's split-sz CLI argument
        _df = pd.DataFrame({'review': reviews, 'label': labels})
        self.traindf, self.valdf = train_test_split(_df, test_size=self.split_size)
        print(f'num of rows in train set: {self.traindf.shape[0]}')
        print(f'num of rows in validation set: {self.valdf.shape[0]}')

        self.next(self.baseline, self.model)

    @step
    def baseline(self):
        "Compute the baseline"

        from sklearn.metrics import accuracy_score, roc_auc_score
        self._name = "baseline"
        params = "Always predict 1"
        pathspec = f"{current.flow_name}/{current.run_id}/{current.step_name}/{current.task_id}"

        predictions = np.ones(len(self.valdf)).astype('int') # TODO: predict the majority class
        acc = accuracy_score(self.valdf['label'], predictions) # TODO: return the accuracy_score of these predictions

        rocauc = roc_auc_score(self.valdf['label'], predictions) # TODO: return the roc_auc_score of these predictions
        self.result = ModelResult("Baseline", params, pathspec, acc, rocauc)
        self.next(self.aggregate)

    @step
    def model(self):

        # TODO: import your model if it is defined in another file.
        from model import NbowModel
        
        self._name = "model"
        # NOTE: If you followed the link above to find a custom model implementation, 
            # you will have noticed your model's vocab_sz hyperparameter.
            # Too big of vocab_sz causes an error. Can you explain why? 
        self.hyperparam_set = [{'vocab_sz': 100}, {'vocab_sz': 300}, {'vocab_sz': 500}]  
        pathspec = f"{current.flow_name}/{current.run_id}/{current.step_name}/{current.task_id}"

        self.results = []
        for params in self.hyperparam_set:
            model = NbowModel(params['vocab_sz']) # TODO: instantiate your custom model here!
            model.fit(X=self.traindf['review'], y=self.traindf['label'])  # MO: changed for fair training/validation
            acc = model.eval_acc(X=self.valdf['review'], labels=self.valdf['label'])  # TODO: evaluate your custom model in an equivalent way to accuracy_score.
            rocauc = model.eval_rocauc(X=self.valdf['review'], labels=self.valdf['label'])  # TODO: evaluate your custom model in an equivalent way to roc_auc_score.
            self.results.append(ModelResult(f"NbowModel - vocab_sz: {params['vocab_sz']}", params, pathspec, acc, rocauc))

        self.next(self.aggregate)

    def add_one(self, rows, result, df):
        "A helper function to load results."
        rows.append([
            Markdown(result.name),
            Artifact(result.params),
            Artifact(result.pathspec),
            Artifact(result.acc),
            Artifact(result.rocauc)
        ])
        df['name'].append(result.name)
        df['accuracy'].append(result.acc)
        df['rocauc'].append(result.rocauc)
        return rows, df

    # TODO: Set your card type to "corise".
    # I wonder what other card types there are?
    # https://docs.metaflow.org/metaflow/visualizing-results
    # https://github.com/outerbounds/metaflow-card-altair/blob/main/altairflow.py
    # MO: saw 'html'
    @card(type="corise")
    @step
    def aggregate(self, inputs):

        import seaborn as sns
        import matplotlib.pyplot as plt
        from matplotlib import rcParams 
        rcParams.update({'figure.autolayout': True})

        rows = []
        violin_plot_df = {'name': [], 'accuracy': [], 'rocauc': []}
        for task in inputs:
            if task._name == "model": 
                for result in task.results:
                    print(result)
                    rows, violin_plot_df = self.add_one(rows, result, violin_plot_df)
            elif task._name == "baseline":
                print(task.result)
                rows, violin_plot_df = self.add_one(rows, task.result, violin_plot_df)
            else:
                raise ValueError("Unknown task._name type. Cannot parse results.")
            
        current.card.append(Markdown("# All models from this flow run"))

        # TODO: Add a Table of the results to your card! 
        current.card.append(
            Table(
                rows, # TODO: What goes here to populate the Table in the card? 
                headers=["Model name", "Params", "Task pathspec", "Accuracy", "ROCAUC"]
            )
        )
        
        fig, ax = plt.subplots(1,1)
        plt.xticks(rotation=40)
        #sns.violinplot(data=violin_plot_df, x="name", y="accuracy", ax=ax)  # MO: seemed inappropriate for acc comparison
        ax.set_title('Accuracy of baseline and trained models')
        ax.set_ylabel('Model')
        ax.set_xlabel('Accuracy')
        sns.barplot(data=violin_plot_df, x='accuracy', y='name', ax=ax)
        
        # TODO: Append the matplotlib fig to the card
        # Docs: https://docs.metaflow.org/metaflow/visualizing-results/easy-custom-reports-with-card-components#showing-plots
        current.card.append(Image.from_matplotlib(fig))

        # MO: added rocauc plot
        fig2, ax2 = plt.subplots(1, 1)
        plt.xticks(rotation=40)
        ax2.set_title('ROCAUC of baseline and trained models')
        ax2.set_ylabel('Model')
        ax2.set_xlabel('ROCAUC')
        sns.barplot(data=violin_plot_df, x='rocauc', y='name', ax=ax2)
        current.card.append(Image.from_matplotlib(fig2))

        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    BaselineChallenge()

Overwriting baseline_challenge.py


In [19]:
! python baseline_challenge.py run --data '../data/Womens Clothing E-Commerce Reviews.csv'

[35m[1mMetaflow 2.8.3.1+ob(v1)[0m[35m[22m executing [0m[31m[1mBaselineChallenge[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:sandbox[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[22mIncluding file ../data/Womens Clothing E-Commerce Reviews.csv of size 8MB [K[0m[22m[0m
[35m2023-05-07 05:11:14.811 [0m[1mWorkflow starting (run-id 53), see it in the UI at https://ui-pw-1437394648.outerbounds.dev/BaselineChallenge/53[0m
[35m2023-05-07 05:11:15.000 [0m[32m[53/start/260 (pid 14079)] [0m[1mTask is starting.[0m
[35m2023-05-07 05:11:18.134 [0m[32m[53/start/260 (pid 14079)] [0m[22mnum of rows in train set: 18112[0m
[35m2023-05-07 05:11:20.897 [0m[32m[53/start/260 (pid 14079)] [0m[22mnum of rows in validation set: 4529[0m
[35m2023-05-07 05:11:21.116 [0m[32m[53