# Part 2: Fine-Tune a BERT-Based Model for Sentiment Classification

In this section, we will fine-tune DistilBERT to classify the sentiment of Yelp reviews.

<a href="https://colab.research.google.com/github/jkchandalia/nlp/blob/main/Poetry_w_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# %%capture
# !pip install transformers
# !pip install datasets
# !pip install evaluate
# !pip install torch
# !pip install scikit-learn

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
import numpy as np
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments



# Step 1: Load, inspect and down-sample our dataset

In [4]:
dataset = load_dataset("yelp_review_full")

Found cached dataset yelp_review_full (/Users/d/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|██████████| 2/2 [00:00<00:00, 212.52it/s]


In [5]:
dataset.shape, dataset.column_names

({'train': (650000, 2), 'test': (50000, 2)},
 {'train': ['label', 'text'], 'test': ['label', 'text']})

In [6]:
dataset['train'][100]["text"]

'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more t

In [7]:
dataset['train']['text'][:10]

["dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patient

First, we'll train on a small subset of the data so that we can iterate quickly and make sure our code works.

In [8]:
# create a new dataset with 100 training samples and 100 test samples
# stratify by column: ensures that the train and test sets have the same proportion of each class as the full dataset
dataset_small = dataset["train"].train_test_split(train_size=100, test_size=100, seed=42, stratify_by_column="label")

Loading cached split indices for dataset at /Users/d/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-fad11cb6e689c3c5.arrow and /Users/d/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-8d5301c1f758ac0d.arrow


In [9]:
dataset_small.shape

{'train': (100, 2), 'test': (100, 2)}

Alternatively, we can define the train and test sizes as fractions of the original training set:

```
dataset_small = dataset["train"].train_test_split(train_size=0.05, test_size=0.05, seed=42, stratify_by_column="label")
```
In this case, 5% of the original training set goes to our new train set and another 5% goes to our new test set.
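
To check that stratification did what the comment above describes, one option is to count the labels in each split. The sketch below uses a hypothetical helper, `label_proportions`, and a stand-in list of labels so that it runs on its own; in the notebook you would pass `dataset_small["train"]["label"]` (and the `"test"` equivalent) instead.

```python
from collections import Counter


def label_proportions(labels):
    """Return {label: fraction of examples} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Stand-in for dataset_small["train"]["label"]: 100 labels, 20 per class.
# These values are illustrative, not read from the Yelp data.
example_labels = [0, 1, 2, 3, 4] * 20
print(label_proportions(example_labels))
```

If the full training set is balanced across the five classes (an assumption worth verifying the same way), each proportion in the small splits should be close to 0.2.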

# Step 2: Tokenize our training and test datasets

In [10]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") # uncased: text is lowercased before tokenization
# alternative tokenizers: distilbert-base-cased (cased: case is preserved), bert-base-uncased, bert-base-cased, roberta-base

# help(tokenizer)

In [11]:
def tokenize_function(examples):
    """Tokenize a batch of training (or test) examples, truncating or padding each text to the max length of 512 tokens.

    Args:
        examples (dict-like): A batch from the dataset that maps each column name
            (here "label" and "text") to a list of values.

    Returns:
        BatchEncoding: A dict-like object with two entries per example:
            "input_ids": token indices, including the special tokens, which are the
                numerical representations of the tokens that the model takes as input.
            "attention_mask": 1 for real tokens and 0 for padding.
        See https://huggingface.co/transformers/v3.2.0/glossary.html
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True)

Discussion question: Given the description of `tokenize_function()` above, can you think of any potential problems with this approach before we proceed? How would you evaluate whether this approach is appropriate for our use case? (Side exercise: ex2_inspect_dataset.ipynb)
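
One possible way to approach the second question is to compare review lengths, measured in tokens, against the 512-token limit. The sketch below is a minimal illustration: `truncation_report` is a hypothetical helper, and the lengths passed to it are made up. In the notebook, real lengths could be collected with something like `len(tokenizer(text)["input_ids"])` for each review, without padding or truncation.

```python
def truncation_report(token_lengths, max_length=512):
    """Summarize how a fixed max_length affects a collection of sequences."""
    n = len(token_lengths)
    # Sequences longer than max_length lose their tail when truncated.
    truncated = sum(1 for length in token_lengths if length > max_length)
    # Tokens that survive truncation, versus the total slots after padding.
    kept_tokens = sum(min(length, max_length) for length in token_lengths)
    padding_tokens = n * max_length - kept_tokens
    return {
        "fraction_truncated": truncated / n,
        "fraction_padding": padding_tokens / (n * max_length),
    }


# Hypothetical token counts, for illustration only (not measured on Yelp).
example_lengths = [64, 128, 256, 512, 1024]
print(truncation_report(example_lengths))
```

A high truncated fraction would suggest that information at the end of long reviews is being lost, while a high padding fraction would suggest that much of the compute is spent on padding tokens.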

In [12]:
# dataset.map applies the tokenize function to all the examples in each split of the dataset.
# With batched=True, the function receives a batch of examples per call (1,000 by default) instead of a single example,
# which lets the tokenizer process many texts at once and is usually much faster.
# Tokenization runs on the CPU, so this choice does not depend on whether you have a GPU.
tokenized_datasets = dataset_small.map(tokenize_function, batched=True)

Loading cached processed dataset at /Users/d/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-09e408ab33c5a034.arrow
Loading cached processed dataset at /Users/d/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-6df42b55fc519edc.arrow


In [13]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 100
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 100
    })
})
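
The two new columns can be understood with a small pure-Python sketch of what `padding="max_length"` produces. The helper `pad_to_max_length` is hypothetical and only mimics the padding step; it does not handle truncation. The token ids are the ones shown for "solid haircut by James" later in this notebook, and the pad token id of 0 is an assumption based on the `padding_idx=0` shown in the model summary.

```python
def pad_to_max_length(input_ids, max_length, pad_token_id=0):
    """Right-pad token ids to max_length and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncation is not sketched here")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_token_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


ids = [101, 5024, 2606, 12690, 2011, 2508, 102]
padded = pad_to_max_length(ids, max_length=10)
print(padded["input_ids"])       # [101, 5024, 2606, 12690, 2011, 2508, 102, 0, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With `max_length=512`, as in this notebook, a seven-token input would carry 505 padding positions.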

In [14]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

In [15]:
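def compute_metrics(eval_pred):
    # eval_pred holds the model's raw outputs (logits) and the true labels for the evaluation set
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit for each example
    predictions = np.argmax(logits, axis=-1)
    # `metric` is defined in the next cell; it only needs to exist by the time the Trainer calls this function
    return metric.compute(predictions=predictions, references=labels)


# evaluate at the end of every epoch
# note: recent transformers releases rename this argument to `eval_strategy`
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")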

In [16]:
metric = evaluate.load("accuracy")

In [53]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

In [57]:
trainer.train()



[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

                                          

[A[A                                
[A                                            


  0%|          | 0/243750 [07:14<?, ?it/s]     
[A

[A[A

[A[A

{'eval_loss': 1.6078723669052124, 'eval_accuracy': 0.2, 'eval_runtime': 7.9143, 'eval_samples_per_second': 12.635, 'eval_steps_per_second': 1.643, 'epoch': 1.0}




[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

                                          

[A[A                                
[A                                            


  0%|          | 0/243750 [07:52<?, ?it/s]     
[A

[A[A

[A[A

{'eval_loss': 1.597051978111267, 'eval_accuracy': 0.28, 'eval_runtime': 7.8127, 'eval_samples_per_second': 12.8, 'eval_steps_per_second': 1.664, 'epoch': 2.0}




[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

                                          

[A[A                                
[A                                            


  0%|          | 0/243750 [08:29<?, ?it/s]     
[A

[A[A

                                          

[A[A                                
  0%|          | 0/243750 [08:29<?, ?it/s]     
100%|██████████| 39/39 [01:52<00:00,  2.89s/it]

{'eval_loss': 1.5860011577606201, 'eval_accuracy': 0.33, 'eval_runtime': 7.7763, 'eval_samples_per_second': 12.86, 'eval_steps_per_second': 1.672, 'epoch': 3.0}
{'train_runtime': 112.8027, 'train_samples_per_second': 2.66, 'train_steps_per_second': 0.346, 'train_loss': 1.5954617231320112, 'epoch': 3.0}





TrainOutput(global_step=39, training_loss=1.5954617231320112, metrics={'train_runtime': 112.8027, 'train_samples_per_second': 2.66, 'train_steps_per_second': 0.346, 'train_loss': 1.5954617231320112, 'epoch': 3.0})
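
The numbers in this log can be sanity-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults of a batch size of 8 per device and 3 epochs on a single device, since the notebook does not set them explicitly, and it assumes the five classes are balanced in the test split.

```python
import math

num_train_samples = 100
batch_size = 8    # assumed: default per_device_train_batch_size, single device
num_epochs = 3    # assumed: default num_train_epochs
num_classes = 5

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

# A model that guesses uniformly at random over balanced classes.
chance_accuracy = 1 / num_classes
chance_loss = math.log(num_classes)  # cross-entropy of a uniform prediction

print(total_steps)            # 39
print(chance_accuracy)        # 0.2
print(round(chance_loss, 4))  # 1.6094
```

The 39 steps match `global_step=39` in the output. The first-epoch `eval_accuracy` of 0.2 and `eval_loss` of about 1.608 sit almost exactly at chance level, which suggests that 100 training examples move the model only slightly away from random guessing.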

In [58]:
model.to('cpu')

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [27]:

inputs = tokenizer("solid haircut by James", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    print(logits)

predicted_class_id = logits.argmax().item()


tensor([[ 0.0141,  0.0533,  0.0507,  0.0057, -0.1208]])


In [26]:
model(**inputs).logits

tensor([[ 0.0141,  0.0533,  0.0507,  0.0057, -0.1208]],
       grad_fn=<AddmmBackward0>)

In [28]:
logits.argmax() # returns the index of the maximum value in the tensor

tensor(1)
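
Raw logits are hard to interpret, so it can help to convert them to probabilities with a softmax. The sketch below does this in plain Python for the logits printed above; in the notebook, `torch.softmax(logits, dim=-1)` would be the usual route. Mapping class id `i` to `i + 1` stars is an assumption about how the dataset's labels are ordered.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed above for "solid haircut by James".
logits = [0.0141, 0.0533, 0.0507, 0.0057, -0.1208]
probs = softmax(logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
# Assumed label order: class id 0 is 1 star, ..., class id 4 is 5 stars.
print(predicted_class_id, "->", predicted_class_id + 1, "star(s)")
```

Every probability is close to 0.2, so although `argmax` returns class 1, the model is barely more confident in it than in the other classes.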

In [34]:

inputs = tokenizer("terrible worst haircut ever by James", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    print(logits)

predicted_class_id = logits.argmax().item()
print(predicted_class_id)


tensor([[ 0.0429,  0.0587,  0.0284,  0.0069, -0.1171]])
1


In [39]:

inputs = tokenizer("I love it so much. I couldn't be happier, I am so grateful I had so much fun it was amazing", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    print(logits)

predicted_class_id = logits.argmax().item()
print(predicted_class_id)

tensor([[ 0.0141,  0.0381,  0.0934,  0.0274, -0.0871]])
2


In [60]:
predicted_class_id

3

In [17]:
inputs_default = tokenizer("solid haircut by James")

In [20]:
type(inputs_default)

transformers.tokenization_utils_base.BatchEncoding

In [23]:
type(inputs)

transformers.tokenization_utils_base.BatchEncoding

In [24]:
inputs

{'input_ids': tensor([[  101,  5024,  2606, 12690,  2011,  2508,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [25]:
inputs_default

{'input_ids': [101, 5024, 2606, 12690, 2011, 2508, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}