<a href="https://colab.research.google.com/github/jkchandalia/nlpower/blob/main/notebooks/2.0_bert_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-Tune a BERT-Based Model**

## Use DistilBERT to classify the sentiment of Yelp reviews

Adapted from the Hugging Face Transformers [fine-tuning documentation](https://huggingface.co/docs/transformers/training).

In [1]:
!pip install evaluate transformers | grep -v -e 'already satisfied' -e 'Downloading'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.4/81.4 kB 2.2 MB/s eta 0:00:00
Collecting transformers
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 52.9 MB/s eta 0:00:00
Collecting multiprocess
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.9/132.9 kB 5.2 MB/s eta 0:00:00
Collecting datasets>=2.0.0
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.7/468.7 kB 21.0 MB/s eta 0:00:00
Collecting huggingface-hub>=0.7.0
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.1/200.1 kB 9.1 MB/s eta 0:00:00
Collecting dill
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.5/110.5 kB 1.3 MB/s eta 0:00:00
Collecting xxhash
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.2/212.2 kB 9.3 MB/s eta 0:00:00
Collecting responses<0.19
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 49.1 MB/s eta 0:00:00
Collecting aiohttp
     

In [2]:
import evaluate
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          Trainer, TrainingArguments)

# Part I: Explore Hugging Face Datasets and Models

In [3]:
dataset_name = "yelp_review_full"
model_name = "distilbert-base-uncased"

#### 1. Can you find the above dataset on [Hugging Face Datasets](https://huggingface.co/datasets)?
#### 2. Can you find the above model on [Hugging Face Models](https://huggingface.co/models)?
#### 3. Is defining these two strings enough to build a model? :) Almost!

# Part II: Load, inspect and down-sample our dataset

In [4]:
load_dataset?

In [5]:
dataset = load_dataset(dataset_name)

Downloading builder script:   0%|          | 0.00/4.41k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.55k [00:00<?, ?B/s]

Downloading and preparing dataset yelp_review_full/yelp_review_full to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...


Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
dataset.shape, dataset.column_names

({'train': (650000, 2), 'test': (50000, 2)},
 {'train': ['label', 'text'], 'test': ['label', 'text']})

In [7]:
# Let's look at a sample review
dataset['train'][100]["text"]

'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more t

In [8]:
# Let's explore the labels
pd.Series(dataset['train']['label']).value_counts()

# The dataset has five labels (0 to 4), corresponding to 1- to 5-star Yelp reviews.
# The classes are balanced: each one has the same number of training examples.

4    130000
1    130000
3    130000
0    130000
2    130000
dtype: int64

### The other piece of information our model needs: the number of classes.

In [9]:
# We will need to know the number of output classes for our predictions
# when we instantiate our model
num_labels = len(pd.Series(dataset['train']['label']).unique())
num_labels

5

In [10]:
# Let's look at a few more reviews:
dataset['train']['text'][:10]

["dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patient

Let's create a small subset of data we can use to test our training pipeline before we train the full model on the full dataset.

In [11]:
# Create a new dataset with 800 training samples and 200 test samples, both drawn from the original train split.
# stratify_by_column ensures that the new train and test sets keep the same proportion of each class as the full dataset.
dataset_small = dataset["train"].train_test_split(train_size=800, test_size=200, seed=42, stratify_by_column="label")

In [12]:
dataset_small.shape

{'train': (800, 2), 'test': (200, 2)}
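
To make the idea of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is only conceptually similar to `stratify_by_column`, not the library's implementation; `stratified_split` and the toy `labels` list are hypothetical names introduced for illustration, and the simple rounding assumes the requested sizes divide evenly across classes.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_size, test_size, seed=42):
    """Return (train_idx, test_idx) so each class keeps its share in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    train_idx, test_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        # Each class contributes in proportion to its share of the full data
        n_train = round(train_size * len(indices) / total)
        n_test = round(test_size * len(indices) / total)
        train_idx += indices[:n_train]
        test_idx += indices[n_train:n_train + n_test]
    return train_idx, test_idx

# Toy stand-in for the label column: 5 balanced classes, 1,000 rows each
labels = [i % 5 for i in range(5000)]
train_idx, test_idx = stratified_split(labels, train_size=800, test_size=200)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(train_counts.items()))  # 160 examples per class
print(sorted(test_counts.items()))   # 40 examples per class
```

With five balanced classes, an 800/200 split should therefore contain 160 and 40 examples per class respectively.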

# Part III: Tokenize our training and test datasets

In [13]:
# Load the tokenizer that corresponds to the model_name we chose
tokenizer = AutoTokenizer.from_pretrained(model_name) 
tokenizer?

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [14]:
def tokenize_function(examples):
    """Tokenize a batch of training (or test) examples, truncating or padding each text to the model's maximum length of 512 tokens.

    Args:
        examples (dict): Batch of examples as a dict of columns; examples["text"] is a list of review strings.

    Returns:
        BatchEncoding: Dict-like object holding "input_ids" and "attention_mask" for each example. The input IDs are token indices (including the special tokens) and are often the only required input to the model. The attention mask is 1 for real tokens and 0 for padding. (https://huggingface.co/transformers/v3.2.0/glossary.html)
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [15]:
# Dataset.map applies tokenize_function to every example in each split.
# With batched=True the function receives a batch of examples per call (1,000 by default) instead of one example at a time.
# This is usually much faster than batched=False, because the tokenizer can process many texts at once, at the cost of holding a batch in memory.
# The speed-up does not depend on having a GPU: tokenization runs on the CPU.
tokenized_datasets = dataset_small.map(tokenize_function, batched=True)

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
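
The following sketch illustrates what a batched map hands to the mapped function: a dict of columns, each a list, rather than a single row. It is a simplified pure-Python illustration, not the `datasets` implementation; `batched_map`, `toy_tokenize`, and `toy_columns` are hypothetical names, and a word count stands in for real tokenization.

```python
def toy_tokenize(examples):
    # With batched=True the function receives a dict of columns, each a list
    return {"n_words": [len(text.split()) for text in examples["text"]]}

def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a column dict and merge the new columns in.

    Assumes fn only returns columns that do not already exist.
    """
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result

toy_columns = {
    "label": [3, 1, 4],
    "text": [
        "We got a phone fixed",
        "Meh. Really not all that great.",
        "When I first moved here",
    ],
}
mapped = batched_map(toy_columns, toy_tokenize, batch_size=2)
print(mapped["n_words"])  # [5, 6, 5]
```

The original columns are kept and the new ones are appended, which is why the tokenized dataset contains `label` and `text` alongside `input_ids` and `attention_mask`.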

In [16]:
# Let's look at our tokenized datasets
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 200
    })
})

In [17]:
# Let's explore the first 10 samples of our tokenized dataset
pd.DataFrame(tokenized_datasets["train"][0:10])

| | label | text | input_ids | attention_mask |
|---|---|---|---|---|
| 0 | 3 | We got a phone fixed by these guys over the we... | [101, 2057, 2288, 1037, 3042, 4964, 2011, 2122... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | 1 | Meh. Really not all that great. | [101, 2033, 2232, 1012, 2428, 2025, 2035, 2008... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ... |
| 2 | 4 | When I first moved here in 1995 this and Ri Ra... | [101, 2043, 1045, 2034, 2333, 2182, 1999, 2786... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3 | 3 | I used to frequent this location about 5 years... | [101, 1045, 2109, 2000, 6976, 2023, 3295, 2055... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 4 | 1 | I picked up take out from Chef Chiang yesterda... | [101, 1045, 3856, 2039, 2202, 2041, 2013, 1002... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 5 | 0 | We went for brunch on a Saturday (@10a). It wa... | [101, 2057, 2253, 2005, 7987, 4609, 2818, 2006... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 6 | 2 | Stopped here for a hearty traditional Scottish... | [101, 3030, 2182, 2005, 1037, 2540, 2100, 3151... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 7 | 3 | This place was in a great location in downtown... | [101, 2023, 2173, 2001, 1999, 1037, 2307, 3295... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 8 | 2 | The service here is awesome, drinks are reason... | [101, 1996, 2326, 2182, 2003, 12476, 1010, 897... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 9 | 0 | Because it's shit. Simply put, Why Not? is the... | [101, 2138, 2009, 1005, 1055, 4485, 1012, 3432... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


In [18]:
# Let's play around with the tokenizer
print(tokenizer('check this out'))

{'input_ids': [101, 4638, 2023, 2041, 102], 'attention_mask': [1, 1, 1, 1, 1]}
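
In the output above, 101 and 102 are the IDs of the special `[CLS]` and `[SEP]` tokens that the tokenizer wraps around the text. The sketch below shows, in plain Python, roughly what `padding="max_length"` and `truncation=True` then do to such a sequence. It is an illustration under simplifying assumptions, not the tokenizer's actual code: `pad_or_truncate` is a hypothetical helper, and a pad ID of 0 is assumed (the `[PAD]` ID in the uncased BERT vocabulary).

```python
def pad_or_truncate(input_ids, max_length, pad_id=0, sep_id=102):
    """Force a token ID sequence to exactly max_length entries."""
    if len(input_ids) > max_length:
        # Truncate the text tokens but keep the closing [SEP] token
        input_ids = input_ids[:max_length - 1] + [sep_id]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }

check_this_out = [101, 4638, 2023, 2041, 102]  # IDs from the cell above

padded = pad_or_truncate(check_this_out, max_length=8)
print(padded)

truncated = pad_or_truncate(check_this_out, max_length=4)
print(truncated)
```

This is why short reviews in the tokenized dataset have attention masks that end in zeros, while long reviews have masks made up entirely of ones.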


# Part IV: Load our Model

#### Yes, it's that easy!

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

# Part V: Define Metrics

In [20]:
metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [21]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
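
To see what `compute_metrics` calculates, here is a dependency-free sketch of the same two steps: take the argmax of each row of logits, then compare the predictions with the labels. The logits and labels are made-up toy values, and `argmax` and `accuracy` are hypothetical helpers standing in for `np.argmax` and the `accuracy` metric.

```python
def argmax(row):
    """Index of the largest value in a list (like np.argmax on one row)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy batch: 4 examples, 5 classes (values are invented for illustration)
toy_logits = [
    [2.0, 0.1, 0.3, 0.0, -1.0],  # predicts class 0
    [0.2, 0.1, 1.5, 0.3, 0.0],   # predicts class 2
    [0.0, 0.0, 0.1, 0.2, 3.0],   # predicts class 4
    [1.0, 0.9, 0.0, 0.0, 0.0],   # predicts class 0
]
toy_labels = [0, 2, 4, 1]

result = accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```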


# Part VI: Train

In [22]:
TrainingArguments?

Many of the parameters are left at their defaults. For example, `num_train_epochs` defaults to 3 and `per_device_train_batch_size` defaults to 8.

[TrainingArguments source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py#L147)

In [23]:
Trainer?

Many inputs are also left at their defaults. For example, when no optimizer is passed, the Trainer creates an AdamW optimizer.

[Trainer source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py#L230)

In [24]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

In [25]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.112858 | 0.52 |
| 2 | No log | 0.971641 | 0.6 |
| 3 | No log | 0.974825 | 0.61 |


TrainOutput(global_step=300, training_loss=0.9964607747395834, metrics={'train_runtime': 119.7664, 'train_samples_per_second': 20.039, 'train_steps_per_second': 2.505, 'total_flos': 317938765824000.0, 'train_loss': 0.9964607747395834, 'epoch': 3.0})
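
The `global_step=300` in the output follows from the default training arguments, and the same arithmetic likely explains the "No log" entries. The sketch below works through it, assuming a single device, no gradient accumulation, and the v4.28 defaults of batch size 8, 3 epochs, and `logging_steps=500`.

```python
import math

train_examples = 800        # size of our down-sampled training set
per_device_batch_size = 8   # TrainingArguments default (assumed unchanged)
num_epochs = 3              # TrainingArguments default (assumed unchanged)
logging_steps = 500         # TrainingArguments default (assumed unchanged)

steps_per_epoch = math.ceil(train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

# The training loss is only logged every `logging_steps` optimizer steps
training_loss_logged = total_steps >= logging_steps

print(steps_per_epoch, total_steps, training_loss_logged)  # 100 300 False
```

Passing a smaller `logging_steps` value to `TrainingArguments` should make the training loss appear in the table.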

# Part VII: Evaluate

In [26]:
# Try writing a review to test our model!
review = "pretty good haircut by james"

In [27]:
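def evaluate_review(review):
    # Tokenize and move the inputs to whichever device the model is on (GPU or CPU)
    inputs = tokenizer(review, return_tensors="pt", truncation=True).to(model.device)

    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        logits = model(**inputs).logits
        print(logits)

    # The index of the largest logit is the predicted class (0 to 4)
    predicted_class_id = logits.argmax().item()
    return predicted_class_id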

predicted_class_id = evaluate_review(review)

print(f"This is predicted to be a {predicted_class_id+1} star review")

tensor([[-1.5721, -1.0417,  1.0406,  0.9427,  0.2858]], device='cuda:0')
This is predicted to be a 3 star review
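
Raw logits are hard to interpret on their own. As a small sketch, a softmax over the logits printed above turns them into class probabilities; `softmax` is a hypothetical helper written in plain Python here (in the notebook, `torch.softmax(logits, dim=-1)` would do the same job).

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above
logits = [-1.5721, -1.0417, 1.0406, 0.9427, 0.2858]
probs = softmax(logits)

for stars, p in enumerate(probs, start=1):
    print(f"{stars} star: {p:.3f}")

predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted: {predicted_class + 1} stars")
```

The probabilities for 3 and 4 stars come out close together, so the model is far from certain about this review.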


A proper evaluation would involve much more than spot-checking, but testing a few reviews gives a feel for the model's performance.
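
As one example of going beyond spot checks, the sketch below builds a confusion matrix, per-class recall, and the mean star error from a list of true and predicted labels. The labels are invented toy values, not results from our model, and `confusion_matrix` and `per_class_recall` are hypothetical helpers; in practice the predictions would come from running the model over a held-out test set.

```python
def confusion_matrix(labels, predictions, num_classes):
    """matrix[true][pred] counts how often class `true` was predicted as `pred`."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, predictions):
        matrix[y][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Toy labels and predictions (invented for illustration)
toy_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
toy_pred = [0, 1, 1, 1, 2, 3, 3, 4, 4, 4]

cm = confusion_matrix(toy_true, toy_pred, num_classes=5)
recall = per_class_recall(cm)
# Star ratings are ordered, so the size of each miss matters too
mean_star_error = sum(abs(y - p) for y, p in zip(toy_true, toy_pred)) / len(toy_true)

for row in cm:
    print(row)
print(recall)           # [0.5, 1.0, 0.5, 0.5, 1.0]
print(mean_star_error)  # 0.3
```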

# Part VIII: Save your model

In [28]:
# Save the fine-tuned model's weights and config to a local directory
model.save_pretrained('./saved_bert_model')

### Congratulations! You've fine-tuned a DistilBERT model :)