In [None]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AdamW
import os
from sklearn.metrics import precision_score, recall_score, f1_score
torch.manual_seed(42)
np.random.seed(42)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# You can create an NLP assignment 3 folder in your Google Drive and upload the data folder there.
base_dir = "/content/drive/MyDrive/CSE_354_HW3"

In [None]:
%cd $base_dir

In [None]:
!ls

## **Constants in the file**
---

The code block below contains a few constants.


1.   **BATCH_SIZE**: The batch size input to the models. It is set to 16 and should normally not be changed. If you run into CUDA out-of-memory errors while training your models, you may reduce this value below 16, but please mention this in your submission report.
2.   **EPOCHS**: The number of epochs to train your model. This should not be changed.
3. **TEST_PATH**: This is the path to the test_data.csv file. For example, "/content/drive/MyDrive/CSE_354_HW3/test_data.csv".
4. **TRAIN_PATH**: This is the path to the train_data.csv file. For example, "/content/drive/MyDrive/CSE_354_HW3/train_data.csv".
5. **VAL_PATH**: This is the path to the val_data.csv file. For example, "/content/drive/MyDrive/CSE_354_HW3/val_data.csv".
6. **SAVE_PATH**: This is the path to the directory where your model will be saved. For example, "/content/drive/MyDrive/CSE_354_HW3/DistilBERT". Note: This path is extended further down in the code by appending the name of the training type used in each experiment.
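
As a small illustration of how these constants are used, the sketch below derives the number of batches per pass from `BATCH_SIZE` and builds the per-experiment save directories from `SAVE_PATH`. The dataset sizes are the ones reported by the progress bars in this notebook's outputs; the names `dataset_sizes`, `batches`, `training_types` and `save_paths` are only for illustration.

```python
import math

BATCH_SIZE = 16
SAVE_PATH = "data/DistilBERT"

# Dataset sizes as reported by the progress bars in this notebook's outputs.
dataset_sizes = {"train": 5130, "val": 270, "test": 600}

# Number of batches per pass; the last batch may be smaller than BATCH_SIZE.
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in dataset_sizes.items()}
print(batches)

# Save directory for each training type: SAVE_PATH + '_' + training_type.
training_types = ["fully_frozen", "bottom_4_training", "top_4_training", "all_training"]
save_paths = [f"{SAVE_PATH}_{t}" for t in training_types]
print(save_paths)
```

The batch counts should line up with the lengths of the training, validation and test progress bars shown in the experiment outputs.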



In [None]:
BATCH_SIZE = 16
EPOCHS = 3
# constants which can be changed
TEST_PATH = "data/test_data.csv"
TRAIN_PATH = "data/train_data.csv"
VAL_PATH = "data/val_data.csv"
# Models are stored in this path
SAVE_PATH = "data/DistilBERT"

In [None]:
def load_dataset(path):
  dataset = pd.read_csv(path)
  return dataset

In [None]:
train_data = load_dataset(TRAIN_PATH)
val_data = load_dataset(VAL_PATH)
test_data = load_dataset(TEST_PATH)

## **Problem 1 (Initialize the Model Class)**
---

In the code block below, you need to load a pre-trained DistilBERT model and its tokenizer using Hugging Face's library. The model to load is called "distilbert-base-uncased". It also needs a classifier head on top whose output size is *num_classes* (in this case, 2). Please write your code between the given TODO markers.

More about the model and how to load it can be read at - https://huggingface.co/distilbert-base-uncased.

In [None]:
class DistillBERT():

  def __init__(self, model_name='distilbert-base-uncased', num_classes=2):
    # TODO(students): start
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = num_classes)
    # TODO(students): end

  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer

## **Problem 2 (Initialize the Dataloader Class)**
---

The code block below takes your data_frame and the tokenizer you loaded in the previous block and generates the DataLoader object for it. You need to implement part of the tokenize_data method. This method takes the given data and generates a list of token IDs for each review along with its label. You need to use the tokenizer to generate the encoded token IDs for each review. **Please ensure that the max_length of an encoded review is 512 tokens.**

You would also need to convert the labels to a corresponding numerical class using the label_dict dictionary. Please write your code between the given TODO block.
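
To make the shape of the tokenized data concrete, here is a minimal pure-Python sketch of truncation to a maximum length, right-padding to the longest sequence, and the attention mask that marks real tokens. The token IDs are made up, `PAD_ID = 0` assumes the `[PAD]` ID of `distilbert-base-uncased`, and `MAX_LENGTH` is shortened to 6 for readability (the assignment requires 512). In the notebook itself the padding is done by `pad_sequence`; the helper name `pad_and_mask` is hypothetical.

```python
PAD_ID = 0        # assumed [PAD] id; also the default fill value of pad_sequence
MAX_LENGTH = 6    # shortened for readability; the assignment uses 512

def pad_and_mask(sequences, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate each sequence, right-pad to the longest one, and build a 0/1 mask."""
    # Simplification: plain slicing drops a trailing [SEP]; a real tokenizer keeps it.
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Made-up token ids for three "reviews" of different lengths.
reviews = [
    [101, 2023, 3185, 102],
    [101, 1045, 2293, 2009, 102],
    [101, 2919, 5436, 1998, 4030, 3772, 2035, 2105, 102],
]

input_ids, attention_mask = pad_and_mask(reviews)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

A mask like this could be passed to the model as `attention_mask` so that padded positions are ignored. The training loop in this notebook does not pass one, which is what the `attention_mask` warning in the Experiment 1 output refers to.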

In [None]:
class DatasetLoader(Dataset):

  def __init__(self, data, tokenizer):
    self.data = data
    self.tokenizer = tokenizer

  def tokenize_data(self):
    print("Processing data..")
    tokens = []
    labels = []
    label_dict = {'positive': 1, 'negative': 0}

    review_list = self.data['review'].to_list()
    label_list = self.data['sentiment'].to_list()

    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      # TODO(students): start
      encoded_review = self.tokenizer.encode(review, max_length=512, truncation=True)
      tokens.append(torch.tensor(encoded_review))
      labels.append(label_dict[label])
      # TODO(students): end

    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, batch_size=32, shuffle=True):
    processed_dataset = self.tokenize_data()

    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=batch_size
    )

    return data_loader

## **Problem 3 (Training Function)**
---

The class below provides methods to train a given model. It takes a dictionary with the following parameters:



1.   device: The device to run the model on.
2.   train_data: The train_data dataframe.
3.   val_data: The val_data dataframe.
4.   batch_size: The batch_size which is input to the model.
5.   epochs: The number of epochs to train the model.
6.   training_type: The type of training that your model will undergo. This can take four values - 'fully_frozen', 'top_4_training', 'bottom_4_training' and 'all_training'.
7.   save_path: The directory where the trained model is saved.

#### Problem 3(a)

Your first problem here would be to implement the set_training_parameters() method. In this method you will select the layers of your model to train based on the training_type. **Note: By default the Hugging Face DistilBERT has 6 layers.**

1. fully_frozen: This setting is supposed to train the DistilBERT model with all parameters except the classifier 'frozen' or not trainable.
2. top_4_training: This setting is supposed to train just the final four layers of DistilBERT (layer 5, layer 4, layer 3 and layer 2). All other layers before these would need to be frozen. Do not freeze the classifier head parameters.
3. bottom_4_training: This setting is supposed to train just the bottom four layers of DistilBERT (layer 3, layer 2, layer 1 and layer 0). All other layers after these would need to be frozen. Do not freeze the classifier head parameters.
4. all_training: All layers of DistilBERT would need to be trained.

Please write your code between the given TODO block.

**Note: The classifier head on top of the final DistilBERT layer would always need to be trained, please do not freeze that layer.**

#### Problem 3(b)

Your second problem would be to implement a single training step in the given loop inside the train() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would also need to propagate the loss backwards to the model and update the given optimizer's parameters.

Please write your code between the given TODO block.
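
For reference, the sketch below computes precision, recall and F1 for the positive class from raw predictions, and contrasts averaging per-batch scores (what the loops in this notebook do) with scoring all predictions at once. It is a pure-Python illustration with made-up labels; the helper name `prf` is hypothetical, and the notebook itself relies on scikit-learn's `precision_score`, `recall_score` and `f1_score` with `zero_division=0`.

```python
def prf(labels, preds):
    """Precision, recall and F1 for class 1; returns 0.0 where a ratio is undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two made-up batches of (labels, predictions).
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 0, 0], [1, 1, 1, 0]),
]

# Average of per-batch scores.
per_batch = [prf(y, p) for y, p in batches]
batch_avg = tuple(sum(s[i] for s in per_batch) / len(per_batch) for i in range(3))

# Scores over all examples pooled together.
all_labels = [y for ys, _ in batches for y in ys]
all_preds = [p for _, ps in batches for p in ps]
pooled = prf(all_labels, all_preds)

print("batch average:", [round(v, 4) for v in batch_avg])
print("pooled:       ", [round(v, 4) for v in pooled])
```

The two sets of numbers can differ, so the scores this notebook reports are batch averages rather than dataset-level scores.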

#### Problem 3(c)

Your third problem would be to implement a single validation step in the given loop inside the eval() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

In [None]:
class Trainer():

  def __init__(self, options):
    self.device = options['device']
    self.train_data = options['train_data']
    self.val_data = options['val_data']
    self.batch_size = options['batch_size']
    self.epochs = options['epochs']
    self.save_path = options['save_path']
    self.training_type = options['training_type']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def set_training_parameters(self):
    # TODO(students): start

    # by default freeze all parameters (assume 'fully_frozen')
    for param in self.model.parameters():
        param.requires_grad = False

    # classifier params should always be trainable
    for param in self.model.classifier.parameters():
        param.requires_grad = True

    if self.training_type == 'top_4_training':
        # Unfreeze the final four transformer layers (layer 5 to layer 2)
        for layer in self.model.distilbert.transformer.layer[2:]:
            for param in layer.parameters():
                param.requires_grad = True
    elif self.training_type == 'bottom_4_training':
        # Unfreeze the first four transformer layers (layer 0 to layer 3)
        for layer in self.model.distilbert.transformer.layer[:4]:
            for param in layer.parameters():
                param.requires_grad = True
    elif self.training_type == 'all_training':
        # Unfreeze all transformer layers
        for layer in self.model.distilbert.transformer.layer:
            for param in layer.parameters():
                param.requires_grad = True
    # TODO(students): end

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start
      reviews, labels = reviews.to(self.device), labels.to(self.device)
      outputs = self.model(reviews, labels=labels)
      loss = outputs.loss
      total_loss += loss.item()
      loss.backward()
      optimizer.step()

      preds = outputs.logits.detach().cpu().numpy()
      labels_np = labels.cpu().numpy()
      precision, recall, f1 = self.get_performance_metrics(preds, labels_np)
      total_precision += precision
      total_recall += recall
      total_f1 += f1
      # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        reviews, labels = reviews.to(self.device), labels.to(self.device)
        outputs = self.model(reviews, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        preds = outputs.logits.detach().cpu().numpy()
        labels_np = labels.cpu().numpy()
        precision, recall, f1 = self.get_performance_metrics(preds, labels_np)
        total_precision += precision
        total_recall += recall
        total_f1 += f1
        # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = torch.optim.AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')

      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

## **Problem statement**
---
In this homework, you will be using pre-trained language models to predict the sentiment of a given movie review.


The dataset, which is given to you, is sampled from the [IMDB dataset of 50k movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The sentences are sampled to a smaller set to help with quicker computation on Colab. The data contains a review and an associated label for the sentiment of that review. The label can either be *positive* or *negative*. You have been given three files - train_data.csv, val_data.csv and test_data.csv. The training data will be used to fine-tune the language model, the val data will be used to select the best model while training and finally the test data will give the model's final performance on the data.
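
The model-selection rule described above can be sketched as follows: after each epoch, the checkpoint is kept only if its validation F1 beats the best seen so far. This mirrors the `if val_f1 > last_best` check in `Trainer.execute()`. The F1 values are the ones logged for Experiment 2 in this notebook; the function name `select_best_epoch` is hypothetical.

```python
def select_best_epoch(val_f1_per_epoch):
    """Return (1-based epoch, F1) of the checkpoint that would be kept."""
    last_best = 0.0
    best_epoch = None
    for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
        if val_f1 > last_best:      # strictly better: overwrite the saved model
            last_best = val_f1
            best_epoch = epoch
    return best_epoch, last_best

# Validation F1 after epochs 1-3 of the bottom-4-layers run (Experiment 2).
bottom_4_val_f1 = [0.9013, 0.9110, 0.8927]
print(select_best_epoch(bottom_4_val_f1))
```

For this run the epoch-2 checkpoint is the one kept, which agrees with the log: no "Saving model.." line is printed after epoch 3 of Experiment 2.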

To perform this task you will be using a pre-trained DistilBERT model. DistilBERT is a BERT-based language model. It is about 40% smaller than BERT, yet it retains much of BERT's capabilities for many tasks and is about 60% faster. You can read more about DistilBERT at https://arxiv.org/abs/1910.01108.


In [None]:
!pip install transformers==4.37.0

#### **Experiment 1**
---
Training your DistilBERT with frozen embeddings and layers.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_fully_frozen'
options['epochs'] = EPOCHS
options['training_type'] = 'fully_frozen'
trainer = Trainer(options)
trainer.execute()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 880.58it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 938.41it/s]
  0%|          | 0/321 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 321/321 [01:26<00:00,  3.69it/s]


Epoch 1: train_loss: 0.6962 train_precision: 0.4827 train_recall: 0.5570 train_f1: 0.4991


100%|██████████| 17/17 [00:04<00:00,  3.79it/s]


Epoch 1: val_loss: 0.6932 val_precision: 0.5248 val_recall: 0.9704 val_f1: 0.6667
Saving model..
Model saved.


100%|██████████| 321/321 [01:33<00:00,  3.45it/s]


Epoch 2: train_loss: 0.6933 train_precision: 0.5065 train_recall: 0.6774 train_f1: 0.5658


100%|██████████| 17/17 [00:04<00:00,  3.59it/s]


Epoch 2: val_loss: 0.6922 val_precision: 0.5352 val_recall: 0.9390 val_f1: 0.6687
Saving model..
Model saved.


100%|██████████| 321/321 [01:34<00:00,  3.41it/s]


Epoch 3: train_loss: 0.6933 train_precision: 0.5204 train_recall: 0.6516 train_f1: 0.5598


100%|██████████| 17/17 [00:04<00:00,  3.56it/s]


Epoch 3: val_loss: 0.6909 val_precision: 0.5273 val_recall: 0.9888 val_f1: 0.6764
Saving model..
Model saved.


#### **Experiment 2**
---
Training your DistilBERT with bottom 4 layers being trained. This should take around 5-6 minutes per epoch.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_bottom_4_training'
options['epochs'] = EPOCHS
options['training_type'] = 'bottom_4_training'
trainer = Trainer(options)
trainer.execute()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Processing data..


100%|██████████| 5130/5130 [00:10<00:00, 495.94it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 346.07it/s]
100%|██████████| 321/321 [03:51<00:00,  1.39it/s]


Epoch 1: train_loss: 0.4545 train_precision: 0.7932 train_recall: 0.7470 train_f1: 0.7429


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 1: val_loss: 0.2687 val_precision: 0.9026 val_recall: 0.9073 val_f1: 0.9013
Saving model..
Model saved.


100%|██████████| 321/321 [03:53<00:00,  1.37it/s]


Epoch 2: train_loss: 0.2566 train_precision: 0.9151 train_recall: 0.9089 train_f1: 0.9031


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 2: val_loss: 0.2380 val_precision: 0.9356 val_recall: 0.9014 val_f1: 0.9110
Saving model..
Model saved.


100%|██████████| 321/321 [03:53<00:00,  1.37it/s]


Epoch 3: train_loss: 0.1539 train_precision: 0.9551 train_recall: 0.9495 train_f1: 0.9490


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]

Epoch 3: val_loss: 0.2446 val_precision: 0.9391 val_recall: 0.8610 val_f1: 0.8927





#### **Experiment 3**
---
Training your DistilBERT with only top 4 layers being trained. This should take around 6 minutes per epoch.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_4_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_4_training'
trainer = Trainer(options)
trainer.execute()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 889.98it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 961.50it/s]
100%|██████████| 321/321 [03:18<00:00,  1.62it/s]


Epoch 1: train_loss: 0.4231 train_precision: 0.7917 train_recall: 0.8009 train_f1: 0.7720


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 1: val_loss: 0.2383 val_precision: 0.8581 val_recall: 0.9494 val_f1: 0.8953
Saving model..
Model saved.


100%|██████████| 321/321 [03:18<00:00,  1.62it/s]


Epoch 2: train_loss: 0.2199 train_precision: 0.9168 train_recall: 0.9098 train_f1: 0.9072


100%|██████████| 17/17 [00:04<00:00,  3.58it/s]


Epoch 2: val_loss: 0.2235 val_precision: 0.8982 val_recall: 0.9350 val_f1: 0.9126
Saving model..
Model saved.


100%|██████████| 321/321 [03:18<00:00,  1.62it/s]


Epoch 3: train_loss: 0.1376 train_precision: 0.9516 train_recall: 0.9502 train_f1: 0.9472


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 3: val_loss: 0.2562 val_precision: 0.8891 val_recall: 0.9674 val_f1: 0.9236
Saving model..
Model saved.


#### **Experiment 4**
---
Training your DistilBERT with all layers being trained. This should take around 8 minutes per epoch.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_all_training'
options['epochs'] = EPOCHS
options['training_type'] = 'all_training'
trainer = Trainer(options)
trainer.execute()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Processing data..


100%|██████████| 5130/5130 [00:06<00:00, 784.91it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 412.37it/s]
100%|██████████| 321/321 [04:13<00:00,  1.27it/s]


Epoch 1: train_loss: 0.4255 train_precision: 0.7906 train_recall: 0.8088 train_f1: 0.7708


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 1: val_loss: 0.3241 val_precision: 0.8085 val_recall: 0.9737 val_f1: 0.8775
Saving model..
Model saved.


100%|██████████| 321/321 [04:13<00:00,  1.27it/s]


Epoch 2: train_loss: 0.2081 train_precision: 0.9270 train_recall: 0.9190 train_f1: 0.9161


100%|██████████| 17/17 [00:04<00:00,  3.56it/s]


Epoch 2: val_loss: 0.2458 val_precision: 0.8727 val_recall: 0.9935 val_f1: 0.9257
Saving model..
Model saved.


100%|██████████| 321/321 [04:14<00:00,  1.26it/s]


Epoch 3: train_loss: 0.1216 train_precision: 0.9558 train_recall: 0.9590 train_f1: 0.9540


100%|██████████| 17/17 [00:04<00:00,  3.57it/s]


Epoch 3: val_loss: 0.1944 val_precision: 0.9178 val_recall: 0.9646 val_f1: 0.9363
Saving model..
Model saved.


## **Problem 4 (Test Function)**
---

The class below provides methods to test a given model. It takes a dictionary with the following parameters:



1.   device: The device to run the model on.
2.   test_data: The test_data dataframe.
3.   batch_size: The batch_size which is input to the model.
4.   save_path: The directory of your saved model.

You would need to implement a single test step in the given loop inside the test() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

In [None]:
class Tester():

  def __init__(self, options):
    self.save_path = options['save_path']
    self.device = options['device']
    self.test_data = options['test_data']
    self.batch_size = options['batch_size']
    transformer = DistillBERT(self.save_path)
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def test(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        reviews, labels = reviews.to(self.device), labels.to(self.device)
        outputs = self.model(reviews, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        preds = outputs.logits.detach().cpu().numpy()
        labels_np = labels.cpu().numpy()
        precision, recall, f1 = self.get_performance_metrics(preds, labels_np)
        total_precision += precision
        total_recall += recall
        total_f1 += f1
        # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def execute(self):
    test_dataset = DatasetLoader(self.test_data, self.tokenizer)
    test_data_loader = test_dataset.get_data_loaders(self.batch_size)

    test_precision, test_recall, test_f1, test_loss = self.test(test_data_loader)

    print()
    print(f'test_loss: {test_loss:.4f} test_precision: {test_precision:.4f} test_recall: {test_recall:.4f} test_f1: {test_f1:.4f}')

#### **Experiment 5**
---
Testing your DistilBERT trained with frozen embeddings.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_fully_frozen'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1033.93it/s]
100%|██████████| 38/38 [00:10<00:00,  3.75it/s]


test_loss: 0.6919 test_precision: 0.4945 test_recall: 0.9137 test_f1: 0.6313





#### **Experiment 6**
---
Testing your DistilBERT trained with all layers frozen except the bottom four layers.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_bottom_4_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 941.00it/s]
100%|██████████| 38/38 [00:10<00:00,  3.77it/s]


test_loss: 0.2593 test_precision: 0.9187 test_recall: 0.8521 test_f1: 0.8762





#### **Experiment 7**
---
Testing your DistilBERT trained with all layers frozen except the final four layers.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_4_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 911.13it/s]
100%|██████████| 38/38 [00:10<00:00,  3.68it/s]


test_loss: 0.3100 test_precision: 0.8770 test_recall: 0.8854 test_f1: 0.8772





#### **Experiment 8**
---
Testing your DistilBERT trained with all layers trainable.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_all_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1015.37it/s]
100%|██████████| 38/38 [00:10<00:00,  3.61it/s]


test_loss: 0.3065 test_precision: 0.8626 test_recall: 0.9133 test_f1: 0.8724





## **Results**
---

Answer the following questions based on the analyses you have performed above:

### 1. Briefly explain your code implementations for each TO-DO task.
#### Problem 1
The following code
```
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = num_classes)
```
uses the classes imported by:
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
```
The values `model_name` and `num_classes` (passed as `num_labels`) come from the parameters of the `__init__` method.

#### Problem 2
```
    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      # TODO(students): start
      encoded_review = self.tokenizer.encode(review, max_length=512, truncation=True)
      tokens.append(torch.tensor(encoded_review))
      labels.append(label_dict[label])
      # TODO(students): end
```

This uses the tokenizer from Problem 1 to encode each review, with `max_length=512` and `truncation=True` set explicitly so that longer reviews are cut to the 512-token limit.
The next two lines append the encoded review (as a tensor) to `tokens` and the numeric label from `label_dict` to `labels`, the two lists initialized before the loop.

#### Problem 3
```
  def set_training_parameters(self):
    # TODO(students): start

    # by default freeze all parameters (assume 'fully_frozen')
    for param in self.model.parameters():
        param.requires_grad = False

    # classifier params should always be trainable
    for param in self.model.classifier.parameters():
        param.requires_grad = True

    if self.training_type == 'top_4_training':
        # Unfreeze the final four transformer layers (layer 5 to layer 2)
        for layer in self.model.distilbert.transformer.layer[2:]:
            for param in layer.parameters():
                param.requires_grad = True
    elif self.training_type == 'bottom_4_training':
        # Unfreeze the first four transformer layers (layer 0 to layer 3)
        for layer in self.model.distilbert.transformer.layer[:4]:
            for param in layer.parameters():
                param.requires_grad = True
    elif self.training_type == 'all_training':
        # Unfreeze all transformer layers
        for layer in self.model.distilbert.transformer.layer:
            for param in layer.parameters():
                param.requires_grad = True
    # TODO(students): end
  ```

  The if-statements above handle the different training types. I made fully_frozen the default by first setting `requires_grad` to False for all of the model's parameters, which prevents them from being updated during training.

  I then made the classifier parameters trainable for all four training options, since the classifier layer always needs to be trained.

  In the if-statements, list slicing selects the relevant transformer layers of the pretrained DistilBERT model and turns training back on for them.
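
To check which parameter groups each setting leaves trainable, the freezing logic above can be mimicked on parameter names alone. This is a sketch, not the notebook's code: `trainable_groups` is a hypothetical helper, and the group names are assumed to follow the module prefixes Hugging Face uses for `DistilBertForSequenceClassification` (`distilbert.embeddings`, `distilbert.transformer.layer.N`, `pre_classifier`, `classifier`).

```python
NUM_LAYERS = 6  # DistilBERT has 6 transformer layers

# Parameter groups, named after the (assumed) Hugging Face module prefixes.
GROUPS = (
    ["distilbert.embeddings"]
    + [f"distilbert.transformer.layer.{i}" for i in range(NUM_LAYERS)]
    + ["pre_classifier", "classifier"]
)

def trainable_groups(training_type):
    """Mirror set_training_parameters(): freeze everything, then unfreeze by type."""
    layers = [f"distilbert.transformer.layer.{i}" for i in range(NUM_LAYERS)]
    trainable = {"classifier"}                 # classifier head is always trained
    if training_type == "top_4_training":
        trainable.update(layers[2:])           # layers 2, 3, 4, 5
    elif training_type == "bottom_4_training":
        trainable.update(layers[:4])           # layers 0, 1, 2, 3
    elif training_type == "all_training":
        trainable.update(layers)               # all six transformer layers
    return [g for g in GROUPS if g in trainable]

for t in ["fully_frozen", "top_4_training", "bottom_4_training", "all_training"]:
    print(t, "->", trainable_groups(t))
```

Under this logic `distilbert.embeddings` and `pre_classifier` stay frozen in every setting, including `all_training`. Since the model-loading log in this notebook lists `pre_classifier` among the newly initialized weights, unfreezing it together with `classifier` may be worth trying.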

  #### Problem 3b
  ```    
  for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start
      reviews, labels = reviews.to(self.device), labels.to(self.device)
      outputs = self.model(reviews, labels=labels)
      loss = outputs.loss
      total_loss += loss.item()
      loss.backward()
      optimizer.step()

      preds = outputs.logits.detach().cpu().numpy()
      labels_np = labels.cpu().numpy()
      precision, recall, f1 = self.get_performance_metrics(preds, labels_np)
      total_precision += precision
      total_recall += recall
      total_f1 += f1
      # TODO(students): end
  ```
  A single training step first moves the reviews and labels to the target device (typically the GPU). The reviews and labels are then passed to the model, which returns the logits and, because labels are supplied, the loss. `loss.backward()` computes the gradients and `optimizer.step()` applies them; the gradients are reset at the start of each iteration by `self.model.zero_grad()`.

  The remaining lines compute the performance metrics. The logits are detached from the graph, moved to the CPU together with the labels, and converted to NumPy arrays, after which the per-batch precision, recall and F1 are added to the running totals.

  #### Problem 3c
  ```
        for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        reviews, labels = reviews.to(self.device), labels.to(self.device)
        outputs = self.model(reviews, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        preds = outputs.logits.detach().cpu().numpy()
        labels_np = labels.cpu().numpy()
        precision, recall, f1 = self.get_performance_metrics(preds, labels_np)
        total_precision += precision
        total_recall += recall
        total_f1 += f1
        # TODO(students): end
```
Very similar to 3b, except that it runs inside `torch.no_grad()` with the model in evaluation mode, so no gradients are computed and no optimizer step is taken. The loss is still accumulated for performance monitoring.

#### Problem 4
  The code for this is identical to Problem 3c, but it evaluates the saved model on the test data rather than the validation data.


### 2. A table containing the precision, recall and F1 scores of each DistilBERT model during validation and testing.

### Validation

| Category           | Loss   | Precision | Recall | F1 Score |
|--------------------|--------|-----------|--------|----------|
| Fully Frozen       | 0.6909 | 0.5273    | 0.9888 | 0.6764   |
| Bottom 4 Training  | 0.2446 | 0.9391    | 0.8610 | 0.8927   |
| Top 4 Training     | 0.2562 | 0.8891    | 0.9674 | 0.9236   |
| All Training       | 0.1944 | 0.9178    | 0.9646 | 0.9363   |



### Test

| Category           | Loss   | Precision | Recall | F1 Score |
|--------------------|--------|-----------|--------|----------|
| Fully Frozen       | 0.6919 | 0.4945    | 0.9137 | 0.6313   |
| Bottom 4 Training  | 0.2593 | 0.9187    | 0.8521 | 0.8762   |
| Top 4 Training     | 0.3100 | 0.8770    | 0.8854 | 0.8772   |
| All Training       | 0.3065 | 0.8626    | 0.9133 | 0.8724   |


### 3. An analysis explaining your understanding of the impact freezing/training different layers has on the model's performance.

First, the fully frozen model performs the worst in both validation and testing, and its loss stays near 0.69, roughly the loss of a chance-level binary classifier. Training the classifier alone does not appear to be enough: the text transformer (separate from the classifier) likely needs to adapt to the reviews we are classifying.

The other three settings all performed similarly, though it is interesting to note that Top 4 training scored higher than Bottom 4 training on F1 in both validation and test, maybe because the bottom layers, left frozen, kept their general-purpose text representations.

One last thing worth noting is that training all layers performed the best in training and validation but slightly worse than Top 4 training on test F1 (0.8724 vs. 0.8772). It may be because updating all of the transformer layers caused the model to lose some of its ability to understand text in a general sense, somewhat akin to overfitting. Perhaps all-layer training could outperform Top 4 training with a larger and more diverse training set.



## **Submission guidelines**
---
You would need to submit the following files:


1.   `NLP_HW3.ipynb` - This Jupyter notebook. It will also serve as your report, so please add descriptions to the code wherever required. Also make sure to write the outcomes of your analyses in the RESULTS section above.
2.   `gdrive_link.txt` - Should contain a wget-able link to a folder that contains your four DistilBERT models. Please make sure you provide the necessary permissions.

**Colab design credit**: Dhruv Verma, Yash Kumar Lal