## Fine-tuning DistilBERT for Sentiment Analysis
[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) was proposed by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. It is a bidirectional transformer pretrained with a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
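To make the masked language modeling objective concrete, here is a minimal sketch of BERT-style input corruption. It assumes the 15% selection rate and the 80/10/10 replacement split described in the BERT paper. The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative and not part of this notebook or of any library.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict,
    and None at every other position.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

# Hypothetical example sentence, split on whitespace instead of WordPiece
tokens = "the movie was long but the ending was great".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
```

During pretraining, the loss is computed only at the positions where `labels` is not `None`.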

From the paper's abstract: "the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications."

[DistilBERT](https://arxiv.org/abs/1910.01108) is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
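As a rough sanity check on the "40% fewer parameters" figure, the arithmetic can be sketched as follows. The BERT base count is the approximate 110M reported in the BERT paper (an assumption here, not something this notebook measures). The DistilBERT count is the number of trainable parameters printed in this notebook's training log, which includes the small classification head added for fine-tuning.

```python
# Approximate parameter count for BERT base, as reported in the BERT paper (assumption)
bert_base_params = 110_000_000

# Trainable parameters reported in this notebook's training log
# (DistilBERT plus the 2-label classification head)
distilbert_params = 66_955_010

# Fraction of parameters removed relative to BERT base
reduction = 1 - distilbert_params / bert_base_params
print(f"DistilBERT has about {reduction:.0%} fewer parameters than BERT base")
```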

In this notebook, I'll fine-tune the [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) pre-trained model on the IMDB dataset for sentiment analysis.

In [None]:
# No need to run this cell if dependencies are already installed
!pip install transformers datasets

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
# Import the IMDB dataset
from datasets import load_dataset
imdb = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...
Generating train split: 25000 examples
Generating test split: 25000 examples
Generating unsupervised split: 50000 examples
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.

I'm creating a small training set (4000 samples) and test set (300 samples) to keep training and evaluation fast. As you'll see later, the model performs well even with this few samples.

In [None]:
# Shuffle and select a subset of the training and test data
train_dataset = imdb["train"].shuffle(seed=42).select(range(4000))
test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
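The shuffle-then-select pattern can be sketched in plain Python. This is only an illustration: the helper `shuffled_subset` is hypothetical, and Python's `random` module does not reproduce the permutation that `datasets` generates for `seed=42`. The point is that shuffling first gives a random sample, whereas taking the first rows of a split that happens to be stored sorted by label would return mostly one class.

```python
import random

def shuffled_subset(num_rows, subset_size, seed=42):
    """Return `subset_size` distinct row indices drawn from a seeded shuffle."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    return indices[:subset_size]

# Split sizes taken from the dataset download log above
train_indices = shuffled_subset(25_000, 4_000)
test_indices = shuffled_subset(25_000, 300)
```

Fixing the seed keeps the subset reproducible across runs, which makes results comparable when hyperparameters change.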

In [None]:
# Load the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Define a function to tokenize the text
def tokenize_function(examples):
   return tokenizer(examples["text"], truncation = True)

# Tokenize the train and test data 
tokenized_train = train_dataset.map(tokenize_function, batched = True)
tokenized_test = test_dataset.map(tokenize_function, batched = True)
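What `truncation = True` does can be sketched as follows. This is a simplified illustration, not the tokenizer's actual implementation: it assumes a maximum length of 512 tokens for distilbert-base-uncased and the BERT uncased vocabulary ids 101 for `[CLS]` and 102 for `[SEP]`. The helper `truncate` is hypothetical.

```python
MAX_LENGTH = 512   # assumed maximum sequence length for distilbert-base-uncased
SEP_TOKEN_ID = 102 # assumed id of [SEP] in the uncased BERT vocabulary

def truncate(input_ids, max_length=MAX_LENGTH, sep_token_id=SEP_TOKEN_ID):
    """Cut a token-id sequence to max_length while keeping the closing [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    # Keep [CLS] plus the first tokens of the review, then re-append [SEP]
    return list(input_ids[: max_length - 1]) + [sep_token_id]

# A hypothetical long review: [CLS], 598 word-piece ids, [SEP]
long_review = [101] + [2000] * 598 + [102]
short_review = [101, 2023, 3185, 102]

truncated = truncate(long_review)
unchanged = truncate(short_review)
```

Long IMDB reviews therefore lose their tail, which is usually acceptable for sentiment classification but is worth keeping in mind when the verdict comes at the end of a review.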

In [None]:
# Define a data collator to handle padding
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
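The idea behind dynamic padding is that each batch is padded only to the length of its longest sequence, rather than to a global maximum. A minimal sketch, assuming right-side padding and a pad token id of 0 (the `[PAD]` id in the uncased BERT vocabulary); `pad_batch` is a hypothetical helper, not the collator's real code:

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
```

Because short reviews are not padded to 512 tokens, batches of short texts are cheaper to process.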

In [None]:
# Load the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels = 2)

In [None]:
# Import numpy and the metric loader
import numpy as np
from datasets import load_metric

# Load accuracy, precision, recall and F1 score metrics once, instead of on every evaluation call
# Note: load_metric is deprecated in recent versions of datasets in favor of the separate evaluate library
accuracy_metric = load_metric("accuracy")
precision_metric = load_metric("precision")
recall_metric = load_metric("recall")
f1_metric = load_metric("f1")

# Define a function to compute the evaluation metrics
def compute_metrics(eval_pred):
   # Unpack logits and labels from eval_pred
   logits, labels = eval_pred

   # Find predictions by taking argmax along the last axis
   predictions = np.argmax(logits, axis = -1)

   # Compute accuracy, precision, recall and F1 score using the loaded metrics
   accuracy = accuracy_metric.compute(predictions = predictions, references = labels)["accuracy"]
   precision = precision_metric.compute(predictions = predictions, references = labels)["precision"]
   recall = recall_metric.compute(predictions = predictions, references = labels)["recall"]
   f1 = f1_metric.compute(predictions = predictions, references = labels)["f1"]

   # Return a dictionary containing the computed metrics
   return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
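For reference, the same four metrics can be computed by hand. The sketch below uses plain Python, treats label 1 as the positive class, and runs on made-up logits; `binary_metrics` is a hypothetical helper and the toy values are not results from this notebook.

```python
def binary_metrics(logits, labels):
    """Compute accuracy, precision, recall and F1 with label 1 as the positive class."""
    # argmax over each row of logits gives the predicted class id
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))

    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: four reviews with two logits each (negative, positive)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2]]
toy_labels = [0, 1, 0, 1]
metrics = binary_metrics(toy_logits, toy_labels)
```

Precision answers "of the reviews predicted positive, how many were positive", while recall answers "of the positive reviews, how many were found".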

In [None]:
# Import the trainer and training arguments 
from transformers import TrainingArguments, Trainer

# Define the output directory and other training arguments 
output_dir_name = "finetuned-sentiment-model-4000-samples-imdb"
 
training_args = TrainingArguments(
   output_dir = output_dir_name,
   learning_rate = 2e-5,
   per_device_train_batch_size = 16,
   per_device_eval_batch_size = 16,
   num_train_epochs = 2,
   weight_decay = 0.01,
   save_strategy = "epoch",
   push_to_hub = False,
)

# Initialize the trainer
trainer = Trainer(
   model = model,
   args = training_args,
   train_dataset = tokenized_train,
   eval_dataset = tokenized_test,
   tokenizer = tokenizer,
   data_collator = data_collator,
   compute_metrics = compute_metrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Training arguments such as `learning_rate`, the per-device batch sizes, and `num_train_epochs` were set according to Hugging Face recommendations.
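To see how `learning_rate = 2e-5` evolves during training, here is a sketch of a linear decay schedule. It assumes the `Trainer` defaults of a linear scheduler with no warmup, which this notebook does not override, and the 500 total optimization steps reported in the training log. `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=500, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start, after the first epoch, and at the end of training
schedule = [linear_lr(step) for step in (0, 250, 500)]
```

Under these assumptions the rate starts at 2e-5, is halved by the end of the first epoch, and reaches zero at the final step.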

In [None]:
# Train the model
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 500
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500 | 0.2682 |


Saving model checkpoint to finetuned-sentiment-model-4000-samples-imdb/checkpoint-250
Configuration saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-250/config.json
Model weights saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-250/pytorch_model.bin
tokenizer config file saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-250/tokenizer_config.json
Special tokens file saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-250/special_tokens_map.json
Saving model checkpoint to finetuned-sentiment-model-4000-samples-imdb/checkpoint-500
Configuration saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-500/config.json
Model weights saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-500/pytorch_model.bin
tokenizer config file saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-500/tokenizer_config.json
Special tokens file saved in finetuned-sentiment-model-4000-samples-imdb/checkpoint-500/special_tokens_m

TrainOutput(global_step=500, training_loss=0.26823455810546876, metrics={'train_runtime': 370.9495, 'train_samples_per_second': 21.566, 'train_steps_per_second': 1.348, 'total_flos': 1048802349646464.0, 'train_loss': 0.26823455810546876, 'epoch': 2.0})
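The step counts and throughput in the log follow directly from the training arguments. A short worked check, using only numbers printed above:

```python
import math

num_examples = 4000       # "Num examples" in the training log
batch_size = 16           # per_device_train_batch_size on a single device
num_epochs = 2            # num_train_epochs
train_runtime = 370.9495  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput figures as reported in TrainOutput
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With `save_strategy = "epoch"`, a checkpoint is written at the end of each epoch, which is why the log shows `checkpoint-250` and `checkpoint-500`.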

In [None]:
# Evaluate the model on the test dataset
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 300
  Batch size = 16


{'eval_loss': 0.3177955746650696,
 'eval_accuracy': 0.89,
 'eval_precision': 0.8461538461538461,
 'eval_recall': 0.9533333333333334,
 'eval_f1': 0.896551724137931,
 'eval_runtime': 6.7974,
 'eval_samples_per_second': 44.134,
 'eval_steps_per_second': 2.795,
 'epoch': 2.0}

With an accuracy of 0.89 and an F1 score of about 0.90 on the 300 test samples, the evaluation metrics are quite good for a model trained on just 4000 samples.
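The reported metrics can be cross-checked against a confusion matrix. The counts below are not logged by the notebook; they are inferred as the integer counts consistent with the reported precision, recall, and accuracy on 300 samples, so treat them as a reconstruction.

```python
# Inferred confusion-matrix counts (positive class = label 1); not printed by the notebook
tp, fp, fn, tn = 143, 26, 7, 124

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(total, accuracy, precision, recall, f1)
```

Read this way, the model misses few positive reviews (7 false negatives) but labels a noticeable number of negative reviews as positive (26 false positives), which explains why recall is higher than precision.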

In [None]:
# Save the model
# Set `path` to the destination file before running; torch.save cannot write to an empty path
path = ""
torch.save(model, path)

In [None]:
from transformers import pipeline

In [None]:
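# Creating a pipeline for testing the model
model.to('cpu')
# Map class ids to readable labels (0 = negative, 1 = positive in the IMDB dataset)
class_labels = ['Negative', 'Positive']
model.config.id2label = dict(enumerate(class_labels))
model.config.label2id = {label: i for i, label in enumerate(class_labels)}
sentiment_model = pipeline(task = 'sentiment-analysis', model = model, tokenizer = tokenizer)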

In [None]:
# Testing the model on a sample text
sentiment_model("I love this movie")

[{'label': 'Positive', 'score': 0.9679075479507446}]
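The `score` in this output is a probability obtained by applying a softmax to the model's two output logits and picking the larger entry. A minimal sketch of that post-processing step, with made-up logits (the notebook does not print the real ones):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_labels = ["Negative", "Positive"]
example_logits = [-1.6, 1.8]  # hypothetical logits for "I love this movie"

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": class_labels[best], "score": probs[best]}
```

The score reflects the model's confidence in the chosen label relative to the other label, not a calibrated measure of how positive the text is.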