<a href="https://colab.research.google.com/github/jkchandalia/nlpower/blob/extra/notebooks/2.0_bert_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-Tune a BERT-Based Model**

## Use DistilBERT to classify the sentiment of Yelp reviews

Adapted from this [documentation](https://huggingface.co/docs/transformers/training)

In [1]:
!pip install evaluate transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting datasets>=2.0.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

In [2]:
import evaluate
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          Trainer, TrainingArguments)

# Part I: Explore Hugging Face Datasets and Models

In [3]:
dataset_name = "yelp_review_full"
model_name = "distilbert-base-uncased"

#### 1. Can you find the above dataset on [Hugging Face Datasets](https://huggingface.co/datasets)?
#### 2. Can you find the above model on [Hugging Face Models](https://huggingface.co/models)?

# Part II: Load, inspect and down-sample our dataset

In [4]:
load_dataset?

In [5]:
dataset = load_dataset(dataset_name)

Downloading builder script:   0%|          | 0.00/4.41k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.55k [00:00<?, ?B/s]

Downloading and preparing dataset yelp_review_full/yelp_review_full to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...


Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
dataset.shape, dataset.column_names

({'train': (650000, 2), 'test': (50000, 2)},
 {'train': ['label', 'text'], 'test': ['label', 'text']})

In [7]:
# Let's look at a sample review
dataset['train'][100]["text"]

'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more t

In [8]:
# Let's explore the labels
pd.Series(dataset['train']['label']).value_counts()

# The dataset has five labels (0 to 4), one per class
# They represent Yelp's 1 to 5 star ratings: label 0 is a 1-star review, label 4 is a 5-star review
# The classes are balanced: each one has the same number of examples

4    130000
1    130000
3    130000
0    130000
2    130000
dtype: int64

In [10]:
# We will need to know the number of output classes for our predictions
# when we instantiate our model
num_labels = len(pd.Series(dataset['train']['label']).unique())
num_labels

5

In [11]:
# Let's look at a few more reviews:
dataset['train']['text'][:10]

["dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patient

Let's create a small subset of data we can use to test our training pipeline before we train the full model on the full dataset.

In [40]:
# create a new dataset with 2,000 training samples and 500 test samples
# stratify_by_column ensures that the train and test sets have the same proportion of each class as the full dataset
dataset_small = dataset["train"].train_test_split(train_size=2000, test_size=500, seed=42, stratify_by_column="label")

In [41]:
dataset_small.shape

{'train': (2000, 2), 'test': (500, 2)}
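To make the idea of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration only, not how `train_test_split` is implemented; the helper `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_size, test_size, seed=42):
    """Return (train_idx, test_idx) where each class keeps its share of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    train_idx, test_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        # Give each class its proportional share of the requested sizes.
        n_train = round(train_size * len(indices) / total)
        n_test = round(test_size * len(indices) / total)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:n_train + n_test])
    return train_idx, test_idx

# Toy data: 5 balanced classes with 1,000 examples each (not the Yelp data).
toy_labels = [label for label in range(5) for _ in range(1000)]
train_idx, test_idx = stratified_split(toy_labels, train_size=2000, test_size=500)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(train_counts)  # every class appears equally often in the train split
print(test_counts)   # and equally often in the test split
```

Because the full dataset is balanced, a stratified sample of it is balanced too, which keeps accuracy easy to interpret on the small subset.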

# Part III: Tokenize our training and test datasets

In [42]:
tokenizer = AutoTokenizer.from_pretrained(model_name) # "uncased" means the text is lowercased before tokenization
# alternative tokenizers: distilbert-base-cased ("cased" means case-sensitive), bert-base-uncased, bert-base-cased, roberta-base

# help(tokenizer)

In [43]:
def tokenize_function(examples):
    """Tokenize a batch of training (or test) examples, truncating or padding each text to the max length of 512 tokens.

    Args:
        examples (dict): A batch of examples as a dict of columns; examples["text"] is a list of review strings.

    Returns:
        dict: The tokenizer output, with "input_ids" and "attention_mask" holding one list per example. The input IDs include the appropriate special tokens and are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model. The attention mask marks real tokens with 1 and padding with 0. (https://huggingface.co/transformers/v3.2.0/glossary.html#:~:text=The%20input%20ids%20are%20often,as%20input%20by%20the%20model.&text=The%20tokenizer%20takes%20care%20of,available%20in%20the%20tokenizer%20vocabulary.)
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True)

Discussion Question: given the description of the tokenize_function() above, before we proceed, can you think of any potential problems with this approach? How would you go about evaluating if this approach is appropriate for our use-case? (side exercise: ex2_inspect_dataset.ipynb)
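One way to start on the question above is to look at what padding to a fixed length and truncating do to sequences of different lengths. The sketch below is a toy stand-in, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, `pad_id=0` is an assumption, and it ignores details such as the real tokenizer keeping the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    kept = token_ids[:max_length]                    # truncation drops the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding fills up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# A short sequence is mostly padding; a long one loses its ending.
short = pad_or_truncate([101, 2307, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short)
print(long)
print("share of real tokens in the short example:", sum(short["attention_mask"]) / 8)
```

With a fixed length of 512, short reviews spend most of their positions on padding, while reviews longer than 512 tokens are cut off. Comparing the token-length distribution of the reviews against 512 would show how often each case occurs.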

In [44]:
# dataset.map applies the tokenize function to every example in the dataset. With batched=True, the function receives a batch of examples per call (1,000 by default) instead of a single example. This is usually much faster with "fast" tokenizers, at the cost of holding a whole batch in memory. Tokenization runs on the CPU, so the choice does not depend on whether you have a GPU.
tokenized_datasets = dataset_small.map(tokenize_function, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
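The following toy sketch illustrates the calling convention of a batched map: the function is called once per batch and receives a dict of columns, which is why `tokenize_function` can pass `examples["text"]` (a list of strings) straight to the tokenizer. The helpers `to_columns`, `toy_map`, and `count_words` are hypothetical and only mimic the idea; they are not the `datasets` implementation.

```python
def to_columns(rows):
    """Turn a list of row dicts into a dict of column lists (one batch)."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def toy_map(rows, fn, batch_size=1000):
    """Call fn once per batch and merge the columns it returns back into the rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        new_columns = fn(to_columns(batch_rows))
        for i, row in enumerate(batch_rows):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            out.append(merged)
    return out

def count_words(examples):
    # examples["text"] is a list of strings, one per row in the batch.
    return {"n_words": [len(text.split()) for text in examples["text"]]}

rows = [
    {"label": 4, "text": "great food"},
    {"label": 0, "text": "never going back again"},
    {"label": 2, "text": "ok"},
]
result = toy_map(rows, count_words, batch_size=2)
print(result)
```

The new column returned by the function is added next to the existing ones, just as `input_ids` and `attention_mask` appear alongside `label` and `text` after tokenization.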

In [45]:
# Let's look at our tokenized datasets
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 500
    })
})

In [46]:
# Let's explore the first 10 samples of our tokenized dataset
pd.DataFrame(tokenized_datasets["train"][0:10])

Unnamed: 0,label,text,input_ids,attention_mask
0,3,I went there today with some coworkers for lun...,"[101, 1045, 2253, 2045, 2651, 2007, 2070, 1119...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,2,"Great selection , clean store, friendly staff....","[101, 2307, 4989, 1010, 4550, 3573, 1010, 5379...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,1,Could have been better. Drink service was good...,"[101, 2071, 2031, 2042, 2488, 1012, 4392, 2326...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,4,I've gone to a number of groomers in town in t...,"[101, 1045, 1005, 2310, 2908, 2000, 1037, 2193...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,4,Holy Cow!!!! This has to be one of the best bu...,"[101, 4151, 11190, 999, 999, 999, 999, 2023, 2...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
5,4,Went in for brake pads and was helped by Hecto...,"[101, 2253, 1999, 2005, 13428, 19586, 1998, 20...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
6,2,"This isn't a review of one restaurant, but two...","[101, 2023, 3475, 1005, 1056, 1037, 3319, 1997...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
7,3,I liked it! Selfie heaven! haha\n\nThe ride is...,"[101, 1045, 4669, 2009, 999, 2969, 2666, 6014,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
8,4,Awesome weekday breakfast buffet! \n\nWe were ...,"[101, 12476, 16904, 6350, 28305, 999, 1032, 10...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
9,4,Amazing service! These Gentlemen are ON POINT...,"[101, 6429, 2326, 999, 2122, 11218, 2024, 2006...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


# Part IV: Load our Model

#### Yes, it's that easy!

In [47]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

# Part V: Define Metrics

In [32]:
metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [33]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
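As a small worked example of what `compute_metrics` does, the sketch below applies the same argmax step to a hand-made batch of logits and computes accuracy directly with NumPy. The logits and labels are made up for illustration.

```python
import numpy as np

# Four hypothetical examples, five classes each (one row of logits per example).
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],
    [1.5, 0.2, 0.1, 0.0, 0.3],
    [0.0, 0.1, 0.2, 0.3, 2.5],
    [0.4, 0.3, 1.9, 0.2, 0.1],
])
labels = np.array([1, 0, 3, 2])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions)  # [1 0 4 2]
print(accuracy)     # 0.75
```

The third example is predicted as class 4 but labeled 3, so three of four predictions are correct. Note that accuracy treats a 5-star prediction for a 4-star review the same as a 1-star prediction; it does not reward near misses.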


# Part VI: Train

In [38]:
TrainingArguments?

[TrainingArguments source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py#L147)

In [35]:
Trainer?

[Trainer source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py#L230)

In [48]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.094126,0.58
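The "No log" entry for the training loss can be explained with a little arithmetic. The sketch below assumes the `Trainer` defaults for this version (a per-device batch size of 8 and a logging interval of 500 steps), a single device, and no gradient accumulation; none of these values were set explicitly in the notebook, so treat them as assumptions.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

num_train_examples = 2000  # rows in the small training split
batch_size = 8             # assumed default per-device train batch size
logging_steps = 500        # assumed default logging interval

per_epoch = steps_per_epoch(num_train_examples, batch_size)
first_logged_epoch = math.ceil(logging_steps / per_epoch)

print(per_epoch)           # 250
print(first_logged_epoch)  # 2
```

Under these assumptions the first epoch ends after 250 steps, before the first training-loss log at step 500, so the table has no training loss to show for epoch 1. Passing a smaller `logging_steps` to `TrainingArguments` would make the training loss appear earlier.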


# Part VII: Evaluate

In [None]:
# Try writing a review to test our model!
review = "solid haircut by james"

In [None]:
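def evaluate_review(review):
    # Tokenize the review and move the tensors to the same device as the model
    inputs = tokenizer(review, return_tensors="pt", truncation=True).to(model.device)

    # Inference only: switch off dropout and gradient tracking
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
        print(logits)

    # The index of the largest logit is the predicted class id (0 to 4)
    predicted_class_id = logits.argmax().item()
    return predicted_class_id

# Call our function, not the imported `evaluate` module
predicted_class_id = evaluate_review(review)

# Class ids 0 to 4 correspond to 1 to 5 stars
print(f"This is predicted to be a {predicted_class_id + 1} star review")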

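Raw logits are hard to read on their own. As a minimal sketch, a softmax turns them into probabilities that sum to 1, which gives a rough sense of how confident the model is in each star rating. The logits below are made up for illustration and are not output from the trained model.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, -0.4, 0.3, 1.9, 1.1]  # hypothetical logits for one review
probs = softmax(example_logits)

# Class ids 0 to 4 correspond to 1 to 5 stars.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
for class_id, p in enumerate(probs):
    print(f"{class_id + 1} stars: {p:.2f}")
print(f"Predicted: {predicted_class_id + 1} stars")
```

Softmax does not change which class has the highest score, so the prediction is the same as taking the argmax of the logits.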