# Lab: Data-Centric vs Model-Centric approaches

This lab gives an introduction to data-centric vs model-centric approaches to machine learning problems, showing how data-centric approaches can outperform purely model-centric approaches.

In this lab, we'll build a classifier for product reviews (restricted to the magazine category), like:

> Excellent! I look forward to every issue. I had no idea just how much I didn't know.  The letters from the subscribers are educational, too.

Label: ⭐️⭐️⭐️⭐️⭐️ (good)

> My son waited and waited, it took the 6 weeks to get delivered that they said it would but when it got here he was so dissapointed, it only took him a few minutes to read it.

Label: ⭐️ (bad)

We'll work with a dataset that has some issues, and we'll see how we can squeeze only so much performance out of the model by being clever about model choice, searching for better hyperparameters, etc. Then, we'll take a look at the data (as any good data scientist should), develop an understanding of the issues, and use simple approaches to improve the data. Finally, we'll see how improving the data can improve results.

## Installing software

For this lab, you'll need to install [scikit-learn](https://scikit-learn.org/) and [pandas](https://pandas.pydata.org/). If you don't have them installed already, you can install them by running the following cell:

In [1]:
!pip install scikit-learn pandas



# Loading the data

First, let's load the train/test sets and take a look at the data.

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv('reviews_train.csv')
test = pd.read_csv('reviews_test.csv')

test.sample(5)

| | review | label |
|---|---|---|
| 35 | my mother kept every Christmas issue for years... | good |
| 337 | I give this magazine to my niece for her birth... | good |
| 485 | Love it! Great Magazine!! | good |
| 520 | Order this and a subscription to Computer Shop... | bad |
| 133 | Best DIY magazine for beginners!! | good |


# Training a baseline model

There are many ways to train a classification model for text data. In this lab, we're giving you code that mirrors what you'd find if you looked up [how to train a text classifier](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html): we'll train a linear SVM (fit with stochastic gradient descent) on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) features, which are numeric representations of each review based on how often words occur in it and across the dataset.
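
To make the tf-idf idea concrete, here is a minimal pure-Python sketch on three made-up reviews. The toy texts and the helper name `tfidf_weights` are illustrative assumptions, not part of the lab. The sketch uses the smoothed idf and L2 normalization that scikit-learn's `TfidfTransformer` applies by default; the lab itself leaves this step to scikit-learn.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only
toy_docs = ["great magazine", "great price", "never arrived"]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_weights(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_weights(tokenized[0])
# "magazine" occurs in one document and "great" in two,
# so "magazine" receives the larger weight
print(weights)
```

Words that appear in fewer documents get a larger weight, which is why rare but telling words can matter more to the classifier than common ones.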

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
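
The cell that defines `sgd_clf` does not appear in this write-up. A minimal sketch of what it presumably looks like, assuming it chains the three classes imported above in the same way as the pipelines used later in the lab (the step names `vect`, `tfidf`, and `clf` are borrowed from those later pipelines):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Assumed reconstruction, not the original cell.
# SGDClassifier uses hinge loss by default, i.e. a linear SVM trained with SGD.
sgd_clf = Pipeline([
    ('vect', CountVectorizer()),     # raw text -> word counts
    ('tfidf', TfidfTransformer()),   # word counts -> tf-idf features
    ('clf', SGDClassifier()),        # linear classifier on the tf-idf features
])
```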

In [6]:
_ = sgd_clf.fit(train['review'], train['label'])

## Evaluating model accuracy

In [7]:
from sklearn import metrics

In [8]:
def evaluate(clf):
    pred = clf.predict(test['review'])
    acc = metrics.accuracy_score(test['label'], pred)
    print(f'Accuracy: {100*acc:.1f}%')

In [9]:
evaluate(sgd_clf)

Accuracy: 76.2%


## Trying another model

76% accuracy is not great for this binary classification problem. Can you do better with a different model, or by tuning hyperparameters for the SVM trained with SGD?

# Exercise 1

Can you train a more accurate model on the dataset (without changing the dataset)? You might find this [scikit-learn classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) handy, as well as the [documentation for supervised learning in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

One idea for a model you could try is a [naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

You could also try experimenting with different values of the model hyperparameters, perhaps tuning them via a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 

Or you can even try training multiple different models and [ensembling their predictions](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier), a strategy often used to win prediction competitions like Kaggle.

**Advanced:** If you want to be more ambitious, you could try an even fancier model, like training a Transformer neural network. If you go with that, you'll want to fine-tune a pre-trained model. This [guide from HuggingFace](https://huggingface.co/docs/transformers/training) may be helpful.
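
The ensembling idea mentioned above can be sketched with scikit-learn's `VotingClassifier`, which combines several classifiers by majority vote. This is only an illustrative sketch: the toy reviews are made up, and the choice of the three base models is an assumption, not a recommendation from the lab.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data, for illustration only;
# in the lab you would use train['review'] and train['label']
toy_reviews = [
    "great magazine, love every issue",
    "excellent articles and photos",
    "never arrived, waste of money",
    "mostly ads, very disappointed",
]
toy_labels = ["good", "good", "bad", "bad"]

voting_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # CountVectorizer + TfidfTransformer in one step
    ('clf', VotingClassifier(
        estimators=[
            ('nb', MultinomialNB()),
            ('logreg', LogisticRegression()),
            ('sgd', SGDClassifier(random_state=0)),
        ],
        voting='hard',  # majority vote over the predicted labels
    )),
])

voting_clf.fit(toy_reviews, toy_labels)
toy_pred = voting_clf.predict(["love this magazine"])
print(toy_pred)
```

Because the fitted pipeline exposes `predict`, it can be passed to the `evaluate` helper defined earlier in the same way as any other model.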

In [12]:
# YOUR CODE HERE
# Naive Bayes (as suggested)
from sklearn.naive_bayes import MultinomialNB

nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

nb_clf.fit(train['review'], train['label'])

# Evaluate your model and see if it does better
# than the ones we provided
evaluate(nb_clf)

Accuracy: 85.3%


In [23]:
# Grid Search using Support Vector Machine
from sklearn import svm
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC()),
])

# Define the parameter grid for search (note the double underscore)
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__C': [1, 10, 100],
    'clf__kernel': ['linear', 'rbf']
}

# Grid search
grid_search = GridSearchCV(pipeline, parameters, cv = 10, n_jobs = -1, 
                          verbose = 1)
grid_search.fit(train['review'], train['label'])


# Evaluate accuracy
evaluate(grid_search)

Fitting 10 folds for each of 36 candidates, totalling 360 fits
Accuracy: 77.1%
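
The candidate count in the log above is the product of the number of options per hyperparameter, and each candidate is fit once per cross-validation fold. A small standard-library sketch of that arithmetic, reusing the grid from the cell above (the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative):

```python
from itertools import product

# Same grid as in the grid search cell above
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__C': [1, 10, 100],
    'clf__kernel': ['linear', 'rbf'],
}
cv_folds = 10

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 36 360
```

This is worth checking before launching a search: adding one more hyperparameter with four options multiplies the number of fits by four.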


In [85]:
# Compare tree-based, nearest-neighbour, and ensemble classifiers via grid search
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Pipeline template: the 'clf' placeholder is swapped for each classifier below
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', None),
])

# Define Classifiers
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
knn = KNeighborsClassifier()
adaboost = AdaBoostClassifier()  # defined here but not included in the grid search below
bagging = BaggingClassifier()

# Define Parameter Grids
dtc_param_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [5, 10, 15, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 4],
    'clf__max_features': ['auto', 'sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
}

rfc_param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [5, 10, 15, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 4],
    'clf__max_features': ['auto', 'sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
}

knn_param_grid = {
    'clf__n_neighbors': [3, 5, 7],
    'clf__weights': ['uniform', 'distance'],
    'clf__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'clf__leaf_size': [10, 20, 30, 40, 50]
}

bagging_param_grid = {
    'clf__n_estimators': [50, 100, 200],
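    'clf__max_samples': [0.5, 0.75, 1.0],  # floats are fractions; an int 1 would mean a single sample
    'clf__max_features': [0.5, 0.75, 1.0],  # likewise, an int 1 would mean a single feature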
    'clf__bootstrap': [True, False],
    'clf__bootstrap_features': [True, False],
    'clf__n_jobs':[-1]
}

# Group the classifiers and their parameters together for grid search
classifiers = [
    (dtc, dtc_param_grid),
    (rfc, rfc_param_grid),
    (knn, knn_param_grid),
    (bagging, bagging_param_grid)
]

In [86]:
# To identify the best model
best_clf = None
best_params = {}
best_score = 0.0

for clf, param_grid in classifiers:
    pipeline.set_params(clf = clf)
    grid_search = GridSearchCV(
        pipeline,
        param_grid = param_grid,
        cv = 10,
        n_jobs = -1,
        scoring = 'accuracy',
        verbose = 1
    )
        
    grid_search.fit(train['review'], train['label'])
    print(f"Best parameters for {clf.__class__.__name__}: {grid_search.best_params_}")
    print(f"Best cross-validation score for {clf.__class__.__name__}: {grid_search.best_score_}")
        
    if grid_search.best_score_ > best_score:
        best_clf = clf
        best_params = grid_search.best_params_
        best_score = grid_search.best_score_

# Print the best model after grid search
print(f"\nBest model: {best_clf.__class__.__name__} with optimized parameters: {best_params}")

Fitting 10 folds for each of 288 candidates, totalling 2880 fits
Best parameters for DecisionTreeClassifier: {'clf__criterion': 'gini', 'clf__max_depth': None, 'clf__max_features': None, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}
Best cross-validation score for DecisionTreeClassifier: 0.7472306639473055
Fitting 10 folds for each of 864 candidates, totalling 8640 fits
Best parameters for RandomForestClassifier: {'clf__criterion': 'gini', 'clf__max_depth': None, 'clf__max_features': None, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2, 'clf__n_estimators': 200}
Best cross-validation score for RandomForestClassifier: 0.825681078379729
Fitting 10 folds for each of 120 candidates, totalling 1200 fits
Best parameters for KNeighborsClassifier: {'clf__algorithm': 'auto', 'clf__leaf_size': 10, 'clf__n_neighbors': 7, 'clf__weights': 'uniform'}
Best cross-validation score for KNeighborsClassifier: 0.6501600551075815
Fitting 10 folds for each of 108 candidates, totalling 108

In [88]:
# Set the best model and its optimized parameters
pipeline.set_params(clf = best_clf, **best_params)

# Fit the best model on the whole training data
pipeline.fit(train['review'], train['label'])

# Evaluate the results
evaluate(pipeline)

Accuracy: 90.1%


<font color = red>**Important Note:** </font> <br>
The `clf__` prefix (note the double underscore) is crucial when defining hyperparameters. In the process above, `scikit-learn`'s `Pipeline` object chains the text preprocessing steps and the classifier together. Each step in the pipeline is <font color = red>**identified by a name**</font>, given as the first element of the `(name, estimator)` tuple passed to the `Pipeline` constructor. For example, in the code:

```
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', None),
])
```

Three steps are defined in the pipeline, namely `vect`, `tfidf` and `clf`. When specifying hyperparameters for a step, refer to each one using the <font color = red>**name of the step**</font> and the <font color = red>**name of the hyperparameter**</font>, separated by <font color = red>**two underscores (`__`)**</font>. This is called the <font color = red>**"double underscore notation"**</font>.

So, for example, to tune the `max_depth` hyperparameter of the `DecisionTreeClassifier` when it occupies the `clf` step, write `clf__max_depth`, where `clf` is the name of the pipeline step and `max_depth` is the name of the hyperparameter you want to tune.

This notation is needed because, when `GridSearchCV` is called with the pipeline object and the hyperparameter grid, `scikit-learn` needs to know which step in the pipeline each hyperparameter belongs to. The `clf__` prefix tells `scikit-learn` that the hyperparameter belongs to the `clf` step. <font color = red>**If you don't use this notation, `scikit-learn` cannot match the hyperparameter to a step and raises an error.**</font>
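
The behaviour described above can be checked directly with `set_params` and `get_params`, without running a grid search. A minimal sketch, assuming a pipeline with the same step names as above (the names `demo_pipeline`, `depth`, and `error_message` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier()),
])

# Valid: '<step name>__<hyperparameter name>'
demo_pipeline.set_params(clf__max_depth=5, vect__ngram_range=(1, 2))
depth = demo_pipeline.get_params()['clf__max_depth']
print(depth)

# Invalid: a bare hyperparameter name is not a parameter of the Pipeline itself
try:
    demo_pipeline.set_params(max_depth=5)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Calling `demo_pipeline.get_params().keys()` lists every name that a parameter grid may use, which is a quick way to catch typos before starting a long search.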

In [78]:
# Fine-tune a pre-trained Transformer (requires the `transformers` and `datasets` packages)
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset, ClassLabel

# Convert the data into Hugging Face Dataset objects
label_map = {"bad": 0, "good": 1}
dataset_train = Dataset.from_dict({"label": train["label"].map(label_map), "text": train["review"].values})
dataset_test = Dataset.from_dict({"label": test["label"].map(label_map), "text": test["review"].values})

# Model configuration
model_name = "distilbert-base-uncased"  # which pretrained neural network weights to load for fine-tuning on our data
max_training_steps = 10  # how many optimizer steps to train for (very few; raise this for a real run)
model_folder = "test_trainer"  # directory where model checkpoints are saved

# Train the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_tokenized_dataset = dataset_train.map(tokenize_function, batched=True)
train_tokenized_dataset = train_tokenized_dataset.cast_column("label", ClassLabel(names = ["0", "1"]))

test_tokenized_dataset = dataset_test.map(tokenize_function, batched=True)
test_tokenized_dataset = test_tokenized_dataset.cast_column("label", ClassLabel(names = ["0", "1"]))

training_args = TrainingArguments(max_steps=max_training_steps, output_dir=model_folder)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
)

trainer.train()



Map:   0%|          | 0/6666 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/6666 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=10, training_loss=0.6968923091888428, metrics={'train_runtime': 731.1075, 'train_samples_per_second': 0.109, 'train_steps_per_second': 0.014, 'total_flos': 10597391892480.0, 'train_loss': 0.6968923091888428, 'epoch': 0.01})

In [79]:
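# Evaluate on the test set. `predictions` holds raw logits (one column per class),
# so argmax picks the predicted class. With only 10 training steps (about 0.01 epochs),
# expect results close to chance.
pred_logits = trainer.predict(test_tokenized_dataset).predictions
pred_classes = np.argmax(pred_logits, axis=1)
print(f"Error rate of predictions: {np.mean(pred_classes != test_tokenized_dataset['label'])}")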

***** Running Prediction *****
  Num examples = 1000
  Batch size = 8
The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.


Error rate of predictions: 0.5



## Taking a closer look at the training data

Let's actually take a look at some of the training data:

In [89]:
train.head()

| | review | label |
|---|---|---|
| 0 | Based on all the negative comments about Taste... | good |
| 1 | I still have not received this. Obviously I c... | bad |
| 2 | \</tr>The magazine is not worth the cost of sub... | good |
| 3 | This magazine is basically ads. Kindve worthle... | bad |
| 4 | The only thing I've recieved, so far, is the b... | bad |


Zooming in on one particular data point:

In [90]:
print(train.iloc[0].to_dict())

{'review': "Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>", 'label': 'good'}


This data point is labeled "good", but it's clearly a negative review. Also, there's some stray HTML markup at the end (`<br>o-p28pass4 </br>`).

# Exercise 2

Take a look at some more examples in the dataset. Do you notice any patterns with bad data points?

In [109]:
# YOUR CODE HERE
# Requires beautifulsoup4 (pip install beautifulsoup4)
from bs4 import BeautifulSoup

df = train  # alias, not a copy: new columns also appear on `train`

# Flag reviews in which BeautifulSoup finds at least one HTML tag
df['has_html'] = df['review'].apply(lambda x: BeautifulSoup(x, 'html.parser').find() is not None)
df_no_html = df[~df['has_html']]

print(f"Number of rows with no HTML tags detected: {len(df_no_html)}")
print(df[df['has_html']])

Number of rows with no HTML tags detected: 5361
                                                 review label  has_html  \
0     Based on all the negative comments about Taste...  good      True   
5     The magazines are great, but I never received ...  good      True   
10    </div>It's not the fault of the magazine, I ju...  good      True   
11    <li>dispatchEventBest magazine for current and...   bad      True   
12    <li>onEmptiedBoth my husband and I really enjo...   bad      True   
...                                                 ...   ...       ...   
6643  <!--   -->Such a great magazine for kids!<h4 c...   bad      True   
6646    <ul>Will not download. Keep getting error.</TR>  good      True   
6647  Fantastic.....makes me yean for the English ci...   bad      True   
6652  </div>Love this magazine as do grandkids ages ...   bad      True   
6654   Left a lot to be desired.<div class="api-level">  good      True   

      valid_label  
0            True  
5            T

In [104]:
# Check if there are any labels other than 'good' or 'bad'
unique_labels = df['label'].unique()
if set(unique_labels) != {'good', 'bad'}:
    print(f"The label column contains values other than 'good' or 'bad': {unique_labels}")
else:
    print("The label column contains only 'good' and 'bad' values.")

The label column contains only 'good' and 'bad' values.


## Issues in the data

It looks like there are some stray HTML tags in our dataset, and the data points that contain them have nonsense labels. Maybe this dataset was collected by scraping the internet, and the HTML wasn't parsed correctly in all cases.

# Exercise 3

To address this, a simple approach we might try is to throw out the bad data points, and train our model on only the "clean" data.

Come up with a simple heuristic to identify data points containing HTML, and filter out the bad data points to create a cleaned training set.

In [110]:
def is_bad_data(review: str) -> bool:
    # YOUR CODE HERE
    return '<' in review
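
The `'<' in review` check above is deliberately simple, but it would also flag a clean review that merely contains a less-than sign. As one possible refinement, a regular expression can require something shaped like a tag or an HTML comment. The pattern and the name `looks_like_html` below are assumptions for illustration, not part of the lab. The first three sample strings are adapted from training rows shown earlier, and the last one is made up.

```python
import re

# Hypothetical pattern: an opening or closing tag such as <div ...> or </tr>,
# or the start of an HTML comment (<!--)
HTML_TAG_RE = re.compile(r'</?[A-Za-z][^<>]*>|<!--')

def looks_like_html(review: str) -> bool:
    """Return True if the review contains something shaped like an HTML tag."""
    return HTML_TAG_RE.search(review) is not None

samples = [
    '</tr>The magazine is not worth the cost',
    'Left a lot to be desired.<div class="api-level">',
    'I still have not received this.',
    'Worth it at < $5 an issue',  # made-up clean review containing "<"
]
flags = [looks_like_html(s) for s in samples]
print(flags)  # [True, True, False, False]
```

Whether the stricter check changes the cleaned training set much depends on how often clean reviews contain a bare `<`, so it is worth comparing the row counts from both heuristics before choosing one.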

## Creating the cleaned training set

In [112]:
train_clean = train[~train['review'].map(is_bad_data)]
print(len(train_clean))
train_clean.head()

3998


| | review | label | has_html | valid_label |
|---|---|---|---|---|
| 1 | I still have not received this. Obviously I c... | bad | False | True |
| 3 | This magazine is basically ads. Kindve worthle... | bad | False | True |
| 4 | The only thing I've recieved, so far, is the b... | bad | False | True |
| 6 | This is one magazine I really love. It has pri... | good | False | True |
| 7 | Did not. Open. | bad | False | True |


## Evaluating a model trained on the clean training set

In [113]:
from sklearn import clone

In [114]:
sgd_clf_clean = clone(sgd_clf)

In [115]:
_ = sgd_clf_clean.fit(train_clean['review'], train_clean['label'])

This model should do significantly better:

In [116]:
evaluate(sgd_clf_clean)

Accuracy: 96.9%
