# Assignment II: Pretraining & Fine-Tuning of Language Models

In this second assignment we will continue to work with PyTorch and OpenAI's early open-source model GPT-2 to develop a deeper understanding and intuition of how language models are trained. We will look at a specific simple task, sentiment classification, and see two ways in which language models can be used for this problem.

The structure of the Assignment is as follows:

1. **Continued Pretraining of GPT-2 with a Movie Plots dataset**  

   Here we will explore how continued pretraining affects a language model. This gives us a good idea of how pretraining works. More specifically, we will look at how additional domain-specific pretraining affects the perplexity for text within this domain vs. outside of it. We will use the *CMU Movie Summary Corpus* (https://www.cs.cmu.edu/~ark/personas/), released under the Creative Commons Attribution-ShareAlike License (http://creativecommons.org/licenses/by-sa/3.0/us/legalcode).
   We will learn that additional pretraining generally helps language models for in-domain tasks.

2. **Fine-tuning of GPT-2 for Sentiment Analysis of the IMDB dataset (using Pre/Post-Modifiers)**  
   We will then use both the original GPT-2 model and the model that was further pretrained on the CMU Movie Summary dataset for a sentiment analysis of the IMDB movie review dataset, which is part of Hugging Face datasets. We will (hopefully! There are statistical fluctuations) see that fine-tuning the model that was further pretrained on the movie plot dataset behaves somewhat better than fine-tuning the original GPT-2 model.

3. **IMDB Sentiment Classification with a Masked Language Model (BERT)**
   Finally, we will also look at a Masked Language Model, specifically BERT, as a tool for Sentiment Classification of the same dataset.



For reference, please consider the Lecture material for weeks 2 & 3 as well as the two Special Session notebooks:

* Intro to PyTorch I (Basics)
* Intro to PyTorch II (Huggingface & Language Models)
* All lesson material and notebooks to date



**INSTRUCTIONS:**

* This notebook needs to be run using a GPU. If you use Google Colab, a T4 chip is the recommendation.
  
* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the answers file as you did in a1. Please do not remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.

* \### YOUR CODE HERE indicates that you are supposed to write code, all the way up to \### END YOUR CODE.

* **Important!!:** When you are done, please re-run your notebook from beginning to end so that all of the seeds apply! This is very important!

**AUTOGRADER:**

- In each code block, do NOT delete the ### comment at the top of a cell (it's needed for the auto-grading!)
  - Full autograder tests and results are on gradescope.
  - You will get the first 3 points from the autograder for this assignment.
  - You may upload and run the autograder as many times as needed in your time window to get full points
  - The assignment needs to be named Assignment_2.ipynb to be graded from the autograder!
  - The examples given are samples of how we will test/grade your code.
    - Please ensure your code outputs the exact same information / format!
    - In addition to the given example, the autograder will test other examples
    - Each autograder test tells you what input it is using
  - Once complete, the autograder will show each test, whether it passed or failed, and your total score
  - The autograder fails for a couple of reasons:
    - Your code crashes with that input (for example: `Test Failed: string index out of range`)
    - Your code output does not match the 'correct' output (for example: `Test Failed: '1 2 3 2 1' != '1 4 6 4 1'`)
- Please format your input and output strings to be user friendly
- Adding comments in your code is strongly suggested but won't be graded.
- Do not delete the output cells.  We want to see your code AND the results it produced when it ran.
- If you are stuck on a problem or do not understand a question, please come to office hours or ask questions (please don't post your code though). If it is a coding problem, send a private message on Ed Discussion or send an email to your instructor.
- We also have TA tutors for extra help and 1 on 1 sessions!
- You may use any libraries from the Python Standard Library for this assignment: https://docs.python.org/3/library/





## 0. Setup

Let us first install a few required packages. (You may want to comment this out in case you use a local environment that already has the suitable packages installed.)

In [1]:
%%capture

#!pip install torch           # not required for colabs. Uncomment if needed otherwise
#!pip install transformers    # not required for colabs. Uncomment if needed otherwise
#!pip install numpy           # not required for colabs. Uncomment if needed otherwise
!pip install portalocker
!pip install -U datasets fsspec huggingface_hub # Hugging Face's dataset library
#!pip install pandas          # not required for colabs. Uncomment if needed otherwise

Next, we will import required libraries

In [2]:
import copy
import random

import torch
import numpy as np
import pandas as pd

from datasets import load_dataset

#from torchtext import data as torchtext_data
from torch import nn

from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification, GPT2LMHeadModel

Let's make sure we will later put data and models on the proper device.

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = torch.device("mps")               # in case you run on a local Mac with metal performance shaders (setup/support is up to you)
device

device(type='cuda')

Now we will define various functions that we will use in this notebook. You can jump over this part... but you don't have to.

We'll define:

1. perplexity(text_data, model) - a calculation giving us the perplexity for a given text and model. 'How certain is the model in picking the actual next token?'
2. ClassificationData class - a class that creates the Dataset for our IMDB classification with GPT-2. It has a number of options that we'll use to augment the text to make the classification easier for the model.
3. BERTClassificationData class - same for a BERT model.
4. create_temp_set(base_data, num_examples) - a function used to massage the IMDB dataset, just as we did in PyTorch Intro I.
5. random_huggingface_blog_text - A list of text of random Hugging Face blog snippets for some validation tests.
6. fake_data_loader -  a function that converts an array of text (fixed batch size for now) into a format that the perplexity calculation can use.
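
As a rough illustration of what perplexity measures, here is a minimal sketch in plain Python (no PyTorch). It assumes we already know the probability the model assigned to each actual next token; the function name `perplexity_from_probs` and the probability values are made up for illustration and are not part of the notebook.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the average negative log-probability of the actual next tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model gave to the true next token at four positions.
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.1, 0.05, 0.2, 0.1]

ppl_confident = perplexity_from_probs(confident)
ppl_unsure = perplexity_from_probs(unsure)

# A model that spreads its probability evenly over V tokens has perplexity V.
vocab_size = 50257  # GPT-2 vocabulary size
ppl_uniform = perplexity_from_probs([1.0 / vocab_size] * 10)

print(ppl_confident, ppl_unsure, ppl_uniform)
```

The more probability the model puts on the tokens that actually occur, the closer the perplexity gets to 1; a model that guesses uniformly over the vocabulary sits at the vocabulary size.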


In [4]:
#@title Some Definitions

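def perplexity(text_data, model):
    # Average perplexity of `model` over the batches in `text_data`.
    # Relies on the notebook-level globals batch_size, max_len and loss_fn,
    # which must be defined before this function is called.

    loss = []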
    for step, batch_data in enumerate(text_data):

      batch_input = batch_data['input']
      batch_labels = batch_data['labels']

      if step % 100 == 0:
          print('Current batch: ', step)

      with torch.no_grad():
            try:
              batch_output = model(batch_input)
              batch_output_reshaped = batch_output.reshape((batch_size * (max_len - 1), -1))

              batch_labels_reshaped = batch_labels.reshape((batch_size * (max_len - 1),))
              batch_loss = loss_fn(batch_output_reshaped, batch_labels_reshaped)

              loss.append(batch_loss)
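            except Exception:
              # batches without exactly batch_size examples (e.g. a short final batch)
              # fail the fixed-size reshape above and are skipped
              continue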

    avg_cost = np.mean([x.cpu().detach() for x in loss])
    avg_perplexity = np.exp(avg_cost)

    return avg_perplexity


class ClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 use_prompt=False,
                 prompt_pre_text='',
                 prompt_post_text='',
                 classification_tokenset={1: 'good', 0: 'bad'},
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume


        prompt_pre_text = prompt_pre_text.strip()
        prompt_post_text = prompt_post_text.strip()


        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break

            if num_example == 0:
              print(example)

            stripped_example = example['text'].strip()

            token_encoder = self.tokenizer(' ' + stripped_example)['input_ids'] # simulating that the text will not be at the beginning

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review


            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!


            if use_prompt:

                additional_token_length = len(self.tokenizer(prompt_pre_text)['input_ids']) + len(self.tokenizer(' ' + prompt_post_text)['input_ids'])  # simulating that the prompt_post_text will not be at the beginning

                cutoff = self.max_len + additional_token_length

                prompted_text_line = prompt_pre_text + truncated_example + ' ' + prompt_post_text

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(self.tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue



            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            if example['label'] == 1:
              token = classification_tokenset[1]
            else:
              token = classification_tokenset[0]

            token_id = self.tokenizer.encode(' ' + token)[0]
            label = torch.tensor(token_id, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }



class BERTClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume



        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break


            token_encoder = self.tokenizer(example['text'])['input_ids']

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[1:self.max_len + 1]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            cutoff = self.max_len
            prompted_text_line = truncated_example

            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            label_val = example['label']

            label = torch.tensor(label_val, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }


def create_temp_set(base_data, num_examples=1000000000):
    num_positive = 0
    num_negative = 0
    num_other = 0

    temp_data = []
    out_data = []

    for example_num, example in enumerate(base_data):

      temp_data.append(example)

    random.shuffle(temp_data)

    for example_num, example in enumerate(temp_data):

      if num_examples != -1 and example_num > num_examples:
        break

      if example['label'] == 0:
        num_negative += 1
      elif example['label'] == 1:
        num_positive += 1
      else:
        num_other += 1

      out_data.append(example)


    print('positive: ', num_positive)
    print('negative: ', num_negative)
    print('other: ', num_other)

    return out_data


random_huggingface_blog_text = ["""1. Aspect candidate extraction

In this work we assume that aspects, which are usually features of products and services, are mostly nouns or noun compounds (strings of consecutive nouns). We use spaCy to tokenize and extract nouns/noun compounds from the sentences in the (few-shot) training set. Since not all extracted nouns/noun compounds are aspects, we refer to them as aspect candidates.

2. Aspect/Non-aspect classification

Now that we have aspect candidates, we need to train a model to be able to distinguish between nouns that are aspects and nouns that are non-aspects. For this purpose, we need training samples with aspect/no-aspect labels. This is done by considering aspects in the training set as True aspects, while other non-overlapping candidate aspects are considered non-aspects and therefore labeled as False:

Training sentence: "Waiters aren't friendly but the cream pasta is out of this world."
Tokenized: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Extracted aspect candidates: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Gold labels from training set, in BIO format: [B-ASP, O, O, O, O, O, B-ASP, I-ASP, O, O, O, O, O, .]
Generated aspect/non-aspect Labels: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Now that we have all the aspect candidates labeled, how do we use it to train the candidate aspect classification model? In other words, how do we use SetFit, a sentence classification framework, to classify individual tokens? Well, this is the trick: each aspect candidate is concatenated with the entire training sentence to create a training instance using the following template:""",
             """Normalization interrogations
During our first deeper dive in these surprising behavior, we observed that the normalization step was possibly not working as intended: in some cases, this normalization ignored the correct numerical answers when they were directly followed by a whitespace character other than a space (a line return, for example). Let's look at an example, with the generation being 10\n\nPassage: The 2011 census recorded a population of 1,001,360, and the gold answer being 10.

Normalization happens in several steps, both for generation and gold:

Split on separators |, -, or The beginning sequence of the generation 10\n\nPassage: contain no such separator, and is therefore considered a single entity after this step.
Punctuation removal The first token then becomes 10\n\nPassage (: is removed)
Homogenization of numbers Every string that can be cast to float is considered a number and cast to float, then re-converted to string. 10\n\nPassage stays the same, as it cannot be cast to float, whereas the gold 10 becomes 10.0.
Other steps A lot of other normalization steps ensue (removing articles, removing other whitespaces, etc.) and our original example becomes 10 passage 2011.0 census recorded population of 1001360.0.
However, the overall score is not computed on the string, but on the bag of words (BOW) extracted from the string, here {'recorded', 'population', 'passage', 'census', '2011.0', '1001360.0', '10'}, which is compared with the BOW of the gold, also normalized in the above manner, {10.0}. As you can see, they don’t intersect, even though the model predicted the correct output!

In summary, if a number is followed by any kind of whitespace other than a simple space, it will not pass through the number normalization, hence never match the gold if it is also a number! This first issue was likely to mess up the scores quite a bit, but clearly it was not the only factor causing DROP scores to be so low. We decided to investigate a bit more.

Diving into the results
Extending our investigations, our friends at Zeno joined us and undertook a much more thorough exploration of the results, looking at 5 models which were representative of the problems we noticed in DROP scores: falcon-180B and mistral-7B were underperforming compared to what we were expecting, Yi-34B and tigerbot-70B had a very good performance on DROP correlated with their average scores, and facebook/xglm-7.5B fell in the middle.

You can give analyzing the results a try in the Zeno project here if you want to!

The Zeno team found two even more concerning features:

Not a single model got a correct result on floating point answers
High quality models which generate long answers actually have a lower f1-score
At this point, we believed that both failure cases were actually caused by the same root factor: using . as a stopword token (to end the generations):

Floating point answers are systematically interrupted before their generation is complete
Higher quality models, which try to match the few-shot prompt format, will generate Answer\n\nPlausible prompt for the next question., and only stop during the plausible prompt continuation after the actual answer on the first ., therefore generating too many words and getting a bad f1 score.
We hypothesized that both these problems could be fixed by using \n instead of . as an end of generation stop word.""",
             """Text generation is a rich topic, and there exist several generation strategies for different purposes. We recommend this excellent overview on the subject. Many generation algorithms are supported by the text generation endpoints, and they can be configured using the following parameters:

do_sample: If set to False (the default), the generation method will be greedy search, which selects the most probable continuation sequence after the prompt you provide. Greedy search is deterministic, so the same results will always be returned from the same input. When do_sample is True, tokens will be sampled from a probability distribution and will therefore vary across invocations.
temperature: Controls the amount of variation we desire from the generation. A temperature of 0 is equivalent to greedy search. If we set a value for temperature, then do_sample will automatically be enabled. The same thing happens for top_k and top_p. When doing code-related tasks, we want less variability and hence recommend a low temperature. For other tasks, such as open-ended text generation, we recommend a higher one.""",
             """Recently, we released our Object Detection Leaderboard, ranking object detection models available in the Hub according to some metrics. In this blog, we will demonstrate how the models were evaluated and demystify the popular metrics used in Object Detection, from Intersection over Union (IoU) to Average Precision (AP) and Average Recall (AR). More importantly, we will spotlight the inherent divergences and pitfalls that can occur during evaluation, ensuring that you're equipped with the knowledge not just to understand but to assess model performance critically.

Every developer and researcher aims for a model that can accurately detect and delineate objects. Our Object Detection Leaderboard is the right place to find an open-source model that best fits their application needs. But what does "accurate" truly mean in this context? Which metrics should one trust? How are they computed? And, perhaps more crucially, why some models may present divergent results in different reports? All these questions will be answered in this blog.

So, let's embark on this exploration together and unlock the secrets of the Object Detection Leaderboard! If you prefer to skip the introduction and learn how object detection metrics are computed, go to the Metrics section. If you wish to find how to pick the best models based on the Object Detection Leaderboard, you may check the Object Detection Leaderboard section."""]

def fake_data_loader(text, tokenizer, max_len):
  return [{'input': tokenizer(text, return_tensors='pt', truncation=True, max_length=max_len)['input_ids'][:, :max_len-1].to(device),
            'labels': tokenizer(text, return_tensors='pt', truncation=True, max_length=max_len)['input_ids'][:, 1:max_len].to(device)}]
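
To make the pre/post-modifier idea in `ClassificationData` more concrete, here is a small sketch of how a prompted line is assembled, mirroring the string concatenation in the class above. The helper `build_prompted_line`, the review text, and the two prompt strings are hypothetical; no tokenizer is involved, and the truncation to `max_len` tokens is left out.

```python
def build_prompted_line(review, prompt_pre_text='', prompt_post_text='', use_prompt=False):
    # Mirror ClassificationData: prompts are stripped, and the review carries a
    # leading space because it is tokenized as ' ' + text.
    prompt_pre_text = prompt_pre_text.strip()
    prompt_post_text = prompt_post_text.strip()
    review = ' ' + review.strip()
    if use_prompt:
        return prompt_pre_text + review + ' ' + prompt_post_text
    return review

classification_tokenset = {1: 'good', 0: 'bad'}  # default used by ClassificationData

line = build_prompted_line(
    "A moving story with great acting.",        # hypothetical review
    prompt_pre_text="Review:",                  # hypothetical pre-modifier
    prompt_post_text="Overall, the movie was",  # hypothetical post-modifier
    use_prompt=True,
)
# The label is the token for ' good' or ' bad', i.e. the word the model should predict next.
target_word = ' ' + classification_tokenset[1]
print(line + target_word)
```

The point of the post-modifier is to turn classification into a next-word prediction that the language model is already good at.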


The `device` cell above should report 'cpu' if you are using a CPU, or 'cuda' if a GPU is used (or 'mps', per the comment about Macs).

Now we are ready to move to Language Models.

## 1. Continued Pretraining of GPT-2 with a Movie Plots dataset

We are now downloading GPT-2 from Hugging Face, i.e. the tokenizer and the model. We will i) make sure that it is on the proper device, and ii) copy the model to a second one that will see additional pre-training before being used for a classification task.

In [5]:
%%capture

torch.manual_seed(10)
random.seed(10)
np.random.seed(10)

gpt_2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

base_gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
additinal_pretrain_gpt2_model = copy.deepcopy(base_gpt2_model)
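
Why `copy.deepcopy` rather than a plain assignment? A minimal sketch with a toy stand-in for a model (the `ToyModel` class is hypothetical and only holds a list of numbers): plain assignment creates a second name for the same object, whereas a deep copy has its own parameters, so training one leaves the other untouched.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for a network; 'weights' plays the role of the parameters."""
    def __init__(self):
        self.weights = [0.1, 0.2, 0.3]

base = ToyModel()
alias = base                     # same object under a second name
clone = copy.deepcopy(base)      # independent copy, including the weights list

clone.weights[0] = 9.9           # "train" the clone
alias.weights[1] = 7.7           # changing the alias also changes base

print(base.weights)    # [0.1, 7.7, 0.3]
print(clone.weights)   # [9.9, 0.2, 0.3]
```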


We will now continue to pretrain the model *additinal_pretrain_gpt2_model* on a dataset from a specific domain: movies. We use the *CMU Movie Summary Corpus* (https://www.cs.cmu.edu/~ark/personas/, license: http://creativecommons.org/licenses/by-sa/3.0/us/legalcode). It contains 40k+ unlabeled movie plot summaries. As such, it stands in for the kind of domain-specific text that many companies and institutions have available in their internal documents.

Download the dataset with the cell below. You may need to rerun the notebook multiple times; once you already have the dataset, comment out the `!wget` line so that it is not downloaded again:

In [7]:
!wget https://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
# We recommend downloading the file! You can then upload the file later when needed from your local computer. Go to the folder icon on the left.

--2025-06-15 04:10:56--  https://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
Resolving www.cs.cmu.edu (www.cs.cmu.edu)... 128.2.42.95
Connecting to www.cs.cmu.edu (www.cs.cmu.edu)|128.2.42.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48002242 (46M) [application/x-gzip]
Saving to: ‘MovieSummaries.tar.gz’


2025-06-15 04:12:17 (583 KB/s) - ‘MovieSummaries.tar.gz’ saved [48002242/48002242]



In [8]:
# run this tar command once. After you have the raw files you don't need to untar again, so you can comment it out
!tar -xvf MovieSummaries.tar.gz

MovieSummaries/
MovieSummaries/tvtropes.clusters.txt
MovieSummaries/name.clusters.txt
MovieSummaries/plot_summaries.txt
MovieSummaries/README.txt
MovieSummaries/movie.metadata.tsv
MovieSummaries/character.metadata.tsv


Next, we will create a list of 15k plots and convince ourselves that the data looks roughly as expected:

In [9]:
plots =  pd.read_csv('MovieSummaries/plot_summaries.txt', delimiter='\t')
plots.columns = ['id', 'plot']
plot_summary_list = [x for x in plots['plot']][:15000]
plot_summary_list[0][:400]

'The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chose'

Next, we will create a Dataset class that takes text data and returns input token ids and labels for the next word predictions (simply the input token ids shifted one to the left). For simplicity, we will throw out any examples that are shorter than our desired length and truncate all other examples to this length:

In [10]:
#@title Class for Creation of Continued Pretraining Dataset (Movie Plots)

class ContinuedPretrainData(Dataset):
    def __init__(self, base_data, tokenizer, max_len, device):
        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        tokenized_examples = tokenizer(base_data,
                                       max_length=max_len,
                                       truncation=True, padding='max_length',
                                       return_tensors="pt")

        tokens = tokenized_examples['input_ids'][tokenized_examples['attention_mask'][:, max_len - 1] > 0]

        self.data = tokens.to(device)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):

        return {'input': self.data[index][:self.max_len - 1],
                'labels': self.data[index][1:]
        }
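
The two steps in `ContinuedPretrainData` (drop examples that are too short, then shift by one for the labels) can be sketched on plain Python lists. The token ids below are made up, and `max_len = 5` is chosen only to keep the example readable.

```python
max_len = 5

# Hypothetical tokenized texts of different lengths.
token_lists = [
    [11, 12, 13, 14, 15, 16, 17],   # long enough: truncated to max_len
    [21, 22, 23],                   # too short: dropped
    [31, 32, 33, 34, 35],           # exactly max_len: kept
]

# Keep only sequences with at least max_len tokens and truncate them.
kept = [ids[:max_len] for ids in token_lists if len(ids) >= max_len]

# Inputs are all tokens but the last; labels are the same tokens shifted by one.
examples = [{'input': ids[:max_len - 1], 'labels': ids[1:]} for ids in kept]

for ex in examples:
    print(ex)
```

At every position the label is simply the token that comes next in the input, which is why both tensors have length `max_len - 1`.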

Now, please build a simple neural net that serves for continued pre-training:

In [11]:
%%capture

class ContinuedTrainingNetwork(torch.nn.Module):
    """
    Build a simple PyTorch network that takes a batch (from a Dataloader)
    as input and returns the logits of each next word prediction.
    When instantiated, you need to pass in a pretrained base model.
    You need to define both, the __init__ and the forward methods.
    """
    def __init__(self, pretrainModel):
        ### YOUR CODE HERE
        super().__init__()
        self.pretrainModel = pretrainModel

        ### END YOUR CODE

    def forward(self, x):               # x stands for the input that the network will use/act on later
        # get the logits for all tokens in all examples and call it logits.

        ### YOUR CODE HERE
        logits = self.pretrainModel(x).logits
        ### END YOUR CODE

        return logits

pretrain_network = ContinuedTrainingNetwork(pretrainModel=additinal_pretrain_gpt2_model)

pretrain_network.to(device)

Then we create the training and test sets:

In [12]:
max_len=100

train_data = ContinuedPretrainData(plot_summary_list[:10000], tokenizer=gpt_2_tokenizer, max_len=max_len, device=device)
test_data = ContinuedPretrainData(plot_summary_list[10000:], tokenizer=gpt_2_tokenizer, max_len=max_len, device=device)

In [13]:
train_data[0]

{'input': tensor([  464,  3277,   286,  5961,   368, 10874,   286,   257, 11574, 13241,
           290, 14104, 26647, 12815,    13,  1081,  9837,   329,   257,  1613,
         21540,    11,  1123,  4783,  1276,  2148,   257,  2933,   290,  2576,
           220,  1022,   262,  9337,   286,  1105,   290,  1248,  6163,   416,
         22098,   220,   329,   262,  5079, 32367,  5776,    13,   383,   256,
          7657,  1276,  1907,   284,   262,  1918,   287,   281, 13478,    26,
           262,  6195, 23446,   318, 20945,   351, 16117,   290,  5129,    13,
           554,   607,   717,   797,  9269,    11,  1105,    12,  1941,    12,
           727, 11460, 13698, 10776, 39060,   318,  7147,   422,  5665,  1105,
            13,  2332,  4697,  6621,  8595,    77,   747, 11661,   284],
        device='cuda:0'),
 'labels': tensor([ 3277,   286,  5961,   368, 10874,   286,   257, 11574, 13241,   290,
         14104, 26647, 12815,    13,  1081,  9837,   329,   257,  1613, 21540,
            1

Here is the data loader:

In [14]:
torch.manual_seed(10)
batch_size = 4
train_texts = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_texts = DataLoader(test_data, batch_size=batch_size, shuffle=True)
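
As a rough picture of what the loader does, here is a sketch of batching in plain Python. `make_batches` is a hypothetical helper, not the `DataLoader` API; it only illustrates that the examples are shuffled and grouped into chunks of `batch_size`, and that the last chunk can be smaller when the dataset size is not a multiple of the batch size.

```python
import random

def make_batches(items, batch_size, shuffle=True, seed=10):
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)   # local RNG, leaves the global seed alone
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = make_batches(range(10), batch_size=4)
sizes = [len(b) for b in batches]
print(sizes)   # [4, 4, 2]
```

A short final batch matters here because the `perplexity` helper reshapes every batch to exactly `batch_size * (max_len - 1)` rows.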

The network we built above takes the input from the loaders and returns the logits of each next-word prediction. Test the shape of the output. Is it correct? We first need to grab an example and then look at the shape of the model output:

In [15]:
example_data = next(iter(test_texts))

# Please call your model output pretrain_model_output

### YOUR CODE HERE
pretrain_model_output = pretrain_network(example_data['input'])
### END YOUR CODE
pretrain_model_output.shape


torch.Size([4, 99, 50257])

**QUESTION:**

1.a. What do the numbers above refer to?

In [None]:
### Q1-a Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE
4: batch size
99: sequence length
50257: vocabulary size
### END YOUR ANSWER

Next, we will calculate the initial perplexity for the test set of the movie plot summaries. We need the loss function for this. Please use the cross entropy to define the loss function *loss_fn*.

In [16]:
"""
Define the loss function loss_fn as the cross entropy and validate/report on the
calculation for the average loss for the two examples

example 1:
  label: 0
  logits: [-3.1, -2.4]
example 2:
  label: 1
  logits: [2.4, -3.1]

"""

### YOUR CODE HERE
loss_fn = torch.nn.CrossEntropyLoss()
# one row of logits per example (shape [2, 2]) and one class index per example (shape [2])
test_input = torch.tensor([[-3.1, -2.4], [2.4, -3.1]], dtype=torch.float32)
test_target = torch.tensor([0, 1], dtype=torch.int64)
### END YOUR CODE

loss_fn(test_input,test_target)


tensor(3.3036)

**QUESTION:**

1.b. What is the average loss for these two examples?

In [None]:
### Q1-b Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE
3.3036
### END YOUR ANSWER

1.c. (Ideally, by just looking at the labels and logits), which example contributes the higher loss? Choose from 'first' or 'second'.

In [None]:
### Q1-c Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE
second
### END YOUR ANSWER
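
The loss for these two examples can also be checked by hand. The sketch below recomputes the cross entropy in plain Python (no PyTorch) as log-sum-exp of the logits minus the logit of the true class; the helper name `cross_entropy` is ours, not a library function.

```python
import math

def cross_entropy(logits, label):
    # -log softmax(logits)[label], computed stably as log-sum-exp minus the true-class logit
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

examples = [([-3.1, -2.4], 0),   # example 1: label 0
            ([2.4, -3.1], 1)]    # example 2: label 1

losses = [cross_entropy(logits, label) for logits, label in examples]
mean_loss = sum(losses) / len(losses)

print([round(l, 4) for l in losses])
print(round(mean_loss, 4))
```

In the second example the logit of the correct class is far below the other one, so the model is confidently wrong and that example dominates the average.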

Also, consider https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html and investigate the dimensions of the input! Our model output has shape (batch, sequence, vocabulary), with the class dimension last, while the simplest form of CrossEntropyLoss takes logits of shape (N, C) and labels of shape (N): a tensor of individual decisions, not a tensor of decision sequences. We therefore need to reshape both the output and the labels. Recall how to reshape from the previous notebook.

For the example data we calculate the loss. Then you need to calculate the perplexity.

In [17]:
# Reshape pretrain_model_output and example_data['labels']. Name them reshaped_pretrain_model_output and reshaped_pretrain_model_labels

### YOUR CODE HERE
reshaped_pretrain_model_output = pretrain_model_output.view(-1, pretrain_model_output.shape[-1])
reshaped_pretrain_model_labels = example_data['labels'].view(-1)
### END YOUR CODE

print('Shape of reshaped outputs: ', reshaped_pretrain_model_output.shape)
print('Shape of reshaped labels: ', reshaped_pretrain_model_labels.shape)

Shape of reshaped outputs:  torch.Size([396, 50257])
Shape of reshaped labels:  torch.Size([396])


In [18]:
# Now use the loss function loss_fn to first calculate - for this batch - the loss and then the perplexity.

### YOUR CODE HERE
initial_batch_loss = loss_fn(reshaped_pretrain_model_output, reshaped_pretrain_model_labels)
initial_batch_perplexity = torch.exp(initial_batch_loss)
### END YOUR CODE

print('Initial batch loss: ', initial_batch_loss)
print('Initial batch perplexity: ', initial_batch_perplexity)

Initial batch loss:  tensor(3.8234, device='cuda:0', grad_fn=<NllLossBackward0>)
Initial batch perplexity:  tensor(45.7585, device='cuda:0', grad_fn=<ExpBackward0>)
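
Perplexity is the exponential of the average per-token cross entropy, so it can be read as an effective number of equally likely choices per token. A short sketch in plain Python (the helper name `perplexity_from_loss` is an assumption for illustration): a model that spreads its probability uniformly over the 50257 vocabulary entries has perplexity 50257, while the batch loss printed above corresponds to roughly 46 effective choices.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross entropy (natural log)."""
    return math.exp(mean_cross_entropy)

vocab_size = 50257  # size of the last dimension of the reshaped outputs above

# A uniform guess over the vocabulary has cross entropy log(vocab_size) ...
uniform_loss = math.log(vocab_size)
# ... and therefore a perplexity equal to the vocabulary size.
uniform_perplexity = perplexity_from_loss(uniform_loss)

# The batch loss reported above, rounded to four decimals
reported_batch_perplexity = perplexity_from_loss(3.8234)

print(uniform_perplexity, reported_batch_perplexity)
```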


**QUESTION:**

1.d. What is the perplexity of this batch before the training?

In [None]:
### Q1-d Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE
45.7585
### END YOUR ANSWER

Next, we will calculate the perplexity of the whole test set. For that we will use the perplexity function defined at the outset:

In [19]:
%%time

test_movie_plot_perplexity_before =  perplexity(test_texts, pretrain_network)
test_movie_plot_perplexity_before

Current batch:  0
Current batch:  100
Current batch:  200
Current batch:  300
Current batch:  400
Current batch:  500
Current batch:  600
Current batch:  700
Current batch:  800
Current batch:  900
CPU times: user 22.6 s, sys: 11.2 s, total: 33.7 s
Wall time: 33.9 s


np.float32(49.22481)

**QUESTION**:

1.e. What is the perplexity of the test set before further training?

In [None]:
### Q1-e Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE
49.22481
### END YOUR ANSWER

Ok. What about random text not from this domain? Let us look at the random snippets from Hugging Face blogs defined above in random_huggingface_blog_text:

In [20]:
test_hf_perplexity_before = perplexity(fake_data_loader(random_huggingface_blog_text, tokenizer=gpt_2_tokenizer, max_len=100), pretrain_network)
test_hf_perplexity_before

Current batch:  0


np.float32(48.412525)

Good, about the same, as hoped and expected: at this point the model should not have a notably better understanding of either type of text.

Next, we need to create the optimizer and define a training loop. Nothing to do here for you, but take a look if interested.

In [26]:
pretrain_optimizer = torch.optim.AdamW(pretrain_network.parameters(), lr=0.00001)

def continued_train_loop(dataloader,
               model,
               loss_fn,
               optimizer,
               reporting_interval=100,
               max_len=100,
               steps=None):

    """
    Write the training loop for continued pre-training. In particular, you need to:
    - initialize the epoch_loss to 0 and set the model into training mode
    - iterate over the batches:
      - break if you are at 'steps' number of batches
      - get the inputs X and labels y (which in this case will be the actual next token)
      - get the model outputs
      - reshape y and model outputs in proper format for cross entropy calculation
      - zero out the gradient
      - calculate loss
      - propagate loss (loss.backward) and apply optimizer step
      - add the loss to the epoch_loss
    Reporting:
      - report the current average loss and perplexity every 'reporting_interval' batches
      - report the average loss and perplexity at the end of the epoch (done for you)

    """


    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices

    ### YOUR CODE HERE
    model.train()
    epoch_loss = 0

    batch_perplexity = 0
    batch_loss = 0

    for batch, data in enumerate(dataloader):
        if (steps is not None) and (batch >= steps):
          break

        inputs = data['input']
        labels = data['labels']
        outputs = model(inputs)

        reshaped_outputs = outputs.view(-1, outputs.shape[-1])
        reshaped_labels = labels.view(-1)

        optimizer.zero_grad()

        loss = loss_fn(reshaped_outputs, reshaped_labels)

        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        num_batches = batch + 1  # number of batches actually processed so far

        if ((batch+1) % reporting_interval) == 0:
            avg_loss = epoch_loss/(batch+1)
            avg_perplexity = torch.exp(torch.tensor(avg_loss))

            print(f"Batch {batch + 1}: Avg Loss = {avg_loss:.8f}, Avg Perplexity = {avg_perplexity:.8f}")

    ### END YOUR CODE
    print(batch)
    # Divide by the number of processed batches: 'batch' itself is off by one when the loop ends without hitting 'steps'
    print(f"Training Results: \n  Avg train loss: {epoch_loss/num_batches:>8f} \n Avg train perplexity: {np.exp(epoch_loss/num_batches):>8f} ")
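
One detail of the reporting is worth spelling out: the loop exponentiates the running average loss, which is not the same as averaging per-batch perplexities. The sketch below uses made-up batch losses to show the difference. By Jensen's inequality, the exponential of the mean is never larger than the mean of the exponentials.

```python
import math

batch_losses = [3.5, 3.7, 3.9]  # hypothetical per-batch losses, for illustration only

mean_loss = sum(batch_losses) / len(batch_losses)

# What the training loop reports: exp of the average loss
perplexity_of_mean_loss = math.exp(mean_loss)

# A different quantity: the average of the per-batch perplexities
mean_of_batch_perplexities = sum(math.exp(loss) for loss in batch_losses) / len(batch_losses)

print(f"exp(mean loss)          = {perplexity_of_mean_loss:.3f}")
print(f"mean of exp(batch loss) = {mean_of_batch_perplexities:.3f}")
```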

Now we do the training:

In [27]:
epochs = 1
for t in range(epochs):
    # we just train for 1000 batches
    print(f"Epoch {t+1}\n-------------------------------")
    continued_train_loop(train_texts, pretrain_network, loss_fn, pretrain_optimizer, steps=1000)

print("Done!")

Epoch 1
-------------------------------
Batch 100: Avg Loss = 3.72844400, Avg Perplexity = 41.61431122
Batch 200: Avg Loss = 3.72947317, Avg Perplexity = 41.65715408
Batch 300: Avg Loss = 3.71865743, Avg Perplexity = 41.20903397
Batch 400: Avg Loss = 3.71755587, Avg Perplexity = 41.16365814
Batch 500: Avg Loss = 3.71375749, Avg Perplexity = 41.00760269
Batch 600: Avg Loss = 3.70597850, Avg Perplexity = 40.68983841
Batch 700: Avg Loss = 3.70399657, Avg Perplexity = 40.60928345
Batch 800: Avg Loss = 3.70448581, Avg Perplexity = 40.62915421
Batch 900: Avg Loss = 3.70154895, Avg Perplexity = 40.51000595
Batch 1000: Avg Loss = 3.69879588, Avg Perplexity = 40.39862823
1000
Training Results: 
  Avg train loss: 3.698796 
 Avg train perplexity: 40.398630 
Done!


How did the perplexity of the test set change after the additional pre-training?

In [28]:
%%time

test_movie_plot_perplexity_after = perplexity(test_texts, pretrain_network)
test_movie_plot_perplexity_after

Current batch:  0
Current batch:  100
Current batch:  200
Current batch:  300
Current batch:  400
Current batch:  500
Current batch:  600
Current batch:  700
Current batch:  800
Current batch:  900
CPU times: user 24.8 s, sys: 11.2 s, total: 36 s
Wall time: 37.3 s


np.float32(41.759407)

What about the Hugging Face blog snippets, which are not from the movie domain?



In [29]:
test_hf_perplexity_after =  perplexity(fake_data_loader(random_huggingface_blog_text, tokenizer=gpt_2_tokenizer, max_len=100), pretrain_network)
test_hf_perplexity_after

Current batch:  0


np.float32(72.38276)


**QUESTION:**

1.f. What is your observation about the perplexity change for the test movie plot set texts after the additional pre-training? About the same ('within 2'), higher, or lower?

In [None]:
### Q1-f Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

1.g. What is your observation about the perplexity change for the Hugging Face texts after the additional pre-training? About the same ('within 2'), higher, or lower?

In [None]:
### Q1-g Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

1.h. (Free form) What would these observations imply in terms of where/how this model could be used?

In [None]:
### Q1-h Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

## 2. Sentiment Classification of the IMDB Movie dataset using GPT2 and Prompts

We will now get the IMDB dataset, just like we did in the PyTorch Intro II notebook. Refer to it for details.

In [None]:
imdb_dataset = load_dataset("IMDB")

In [None]:
imdb_dataset['train'][10]

In [None]:
imdb_train_set = create_temp_set(imdb_dataset['train'])
imdb_test_set = create_temp_set(imdb_dataset['test'])

Usually we would first build the Dataset and the DataLoader. However, for reasons that will hopefully become apparent, in this case we first build the network. You should think of the network as taking input_ids $x$ from a batch of suitably tokenized sentences, and the **output should be the logits of the last token for each example in the batch.**

Please fill in the missing line:

In [None]:
%%capture

class TokenPredictionNetworkClass(torch.nn.Module):
    """
    Define a simple PyTorch network that takes a batch (from a Dataloader)
    as input and returns the logits for the last next-token prediction.
    When instantiated, you need to pass in a pretrained base language model
    (the 'logit_model').
    You need to define both, the __init__ and the forward methods.
    """

    def __init__(self, logit_model):
        ### YOUR CODE HERE

        ### END YOUR CODE

    def forward(self, x):
        # get the logits for the last position of each example. (Call them last_token_logits.) This will be just one line.
        # Use self.logit_model(x) to get the model output
        ### YOUR CODE HERE

        ### END YOUR CODE

        return last_token_logits


loss_fn = torch.nn.CrossEntropyLoss()

token_prediction_network_base_model = TokenPredictionNetworkClass(logit_model=base_gpt2_model)
token_prediction_network_addnl_pretrain_model = TokenPredictionNetworkClass(logit_model=additinal_pretrain_gpt2_model)

token_prediction_network_base_model.to(device)
token_prediction_network_addnl_pretrain_model.to(device)
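
The indexing needed in `forward` can be checked on a random tensor, independently of GPT-2. The sketch assumes that the language model returns logits of shape [batch, sequence, vocabulary]; the sizes and `toy_` names below are made up for illustration.

```python
import torch

batch_size, seq_len, vocab_size = 4, 7, 11  # toy sizes, not GPT-2's
toy_logits = torch.randn(batch_size, seq_len, vocab_size)

# Keep every example and every vocabulary entry, but only the final sequence position
toy_last_token_logits = toy_logits[:, -1, :]

print(toy_last_token_logits.shape)
```

Taking position -1 is only meaningful if the last position holds a real token rather than padding. The `ClassificationData` class below keeps only examples whose tokenized length equals the cutoff exactly, so no padding is added.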

We now construct our training and test sets for the Sentiment Classification. We want to follow a different approach than we have in the PyTorch intro II notebook. We want to leverage the language model and what it is good at - predicting the next tokens - to the maximum. So why not put a 'wrapper' around the review in a way that the proper sentiment would be naturally the next word?

Rather than taking the last output vector and adding a classification layer, let's reframe the problem as a text completion, for example:

  
 "This is a review: <truncated review text>... The reviewer classifies reviews as good or bad. In this case they thought the movie was"

 or

 "This is a review: <truncated review text>... The reviewer has positive or negative sentiments about movies. In this case the sentiment was"

 ...

 One would think that the LM should already do a decent job getting the proper sentiment simply using the next word prediction task it is trained on!

 How could we test this? We could simply consider the cross entropy loss for the next token relative to the actual sentiment, i.e. the next word we would expect for a positive or a negative review.

So we can experiment with:

* The prefix before the review
* The text after the review
* The words we would expect for pos/neg reviews

Try a few combinations and see which ones give you the lowest loss.

**NOTE:** Usually we also would do a good chunk of text pre-processing (take out html, etc.), but for simplicity we will ignore this.
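
To make the wrapper idea concrete, here is a minimal sketch in plain Python. The helper `build_prompt` and the review text are made up for illustration; the prefix, suffix and label words are the ones suggested in the cell further below. The label word is prefixed with a space because it continues a sentence.

```python
def build_prompt(review_text, pre_text, post_text):
    """Wrap a (truncated) review so that the sentiment word is the natural next token."""
    return pre_text + review_text + post_text

pre_text = 'Here is a movie review: '
post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
label_words = {1: 'good', 0: 'bad'}

toy_review = 'I loved every minute of it'  # made-up review text
toy_label = 1                              # 1 = positive, 0 = negative

prompt = build_prompt(toy_review, pre_text, post_text)
target_word = ' ' + label_words[toy_label]  # leading space, as the word follows 'was'

print(prompt + target_word)
```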

In [None]:
class ClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 use_prompt=False,
                 prompt_pre_text='',
                 prompt_post_text='',
                 classification_tokenset={1: 'good', 0: 'bad'},
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume



        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break

            if num_example == 0:
              print(example)


            token_encoder = self.tokenizer(example['text'])['input_ids']

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!


            if use_prompt:

                additional_token_length = len(self.tokenizer(prompt_pre_text)['input_ids']) + len(self.tokenizer(prompt_post_text)['input_ids'])
                cutoff = self.max_len + additional_token_length - 1

                prompted_text_line = prompt_pre_text + truncated_example + prompt_post_text

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(self.tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue



            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            #if num_example == 0:
            #  print(self.tokenizer.decode(tokenized_example['input_ids'][0]))

            if example['label'] == 1:
              token = classification_tokenset[1]
            else:
              token = classification_tokenset[0]

            token_id = self.tokenizer.encode(' ' + token)[0]
            label = torch.tensor(token_id, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }

In [None]:
#Suggested, but try a bunch!
# prompt_pre_text = 'Here is a movie review: '
#prompt_post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
#classification_tokenset = {1: 'good', 0: 'bad'}

prompt_pre_text = 'Here is a movie review: '
prompt_post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
classification_tokenset = {1: 'good', 0: 'bad'}

# make a modification to the prompt_pre_text, prompt_post_text, and classification_tokenset
# that gets the loss below 1.7

### YOUR CODE HERE

### END YOUR CODE

play_data = ClassificationData(imdb_train_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=20
                                )

batch_size = 4
toy_texts = DataLoader(play_data, batch_size=batch_size, shuffle=True)


loss = 0
predicted_tokens = []
labels = []

for batch_num, toy_text_batch in enumerate(toy_texts):
    sample_output = token_prediction_network_base_model(toy_text_batch['input_ids']).to(device)
    sample_labels = toy_text_batch['label']
    loss += loss_fn(sample_output, sample_labels).detach()

    predicted_tokens += gpt_2_tokenizer.decode(torch.argmax(sample_output, dim=-1)).split()
    labels += gpt_2_tokenizer.decode(sample_labels).split()

loss /= (batch_num + 1)



print('Average loss: ', loss)
print('Predicted tokens vs labels: ', [(x, y) for x,y in zip(predicted_tokens, labels)])


Ok, the accuracy using the old GPT2 is not exactly amazing (newer and larger models would be MUCH better out of the box). However, even for GPT2 at least a token of the right type is predicted. Fine-tuning should make this much better!


**QUESTION:**

2.a. Write down two different prompt_pre_text/prompt_post_text combinations and their respective average loss. Pick one that sounds reasonable but is quite a bit worse (say, average loss > 3), and another that gets the loss below 1.7. (Note: for the latter you probably have to counteract the model's tendency to be positive a bit. You may also want to make it clearer where the review starts and ends.)

In [None]:
### Q2-a Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now let's do the fine-tuning that is supposed to help! Start by getting the full dataset and dataloaders:

In [None]:
imdb_train_data = ClassificationData(imdb_train_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=-1
                                )

imdb_test_data = ClassificationData(imdb_test_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=-1
                                )


batch_size = 4
imdb_train_loader = DataLoader(imdb_train_data, batch_size=batch_size, shuffle=True)
imdb_test_loader = DataLoader(imdb_test_data, batch_size=batch_size, shuffle=True)

Let's set up the optimizers as before:

In [None]:
adam_optimizer_base_model = torch.optim.AdamW(token_prediction_network_base_model.parameters(), lr=0.00001)
adam_optimizer_addtl_pretrain_model = torch.optim.AdamW(token_prediction_network_addnl_pretrain_model.parameters(), lr=0.00001)

Here is the new training loop. Please fill in the lines for optimizer zeroing, the prediction calculation, and the loss.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=100, steps=None):

    """
    Write the training loop to fine-tune the model for sentiment
    classification using the final next-token-prediction task.
    In particular, you need to:
    - initialize the epoch_loss to 0 and set the model into training mode
    - iterate over the batches:
      - break if you are at 'steps' number of batches
      - get the inputs X and labels y (which in this case will be the actual next token)
      - get the model outputs
      - reshape y and model outputs in proper format for cross entropy calculation
      - zero out the gradient
      - calculate loss
      - propagate loss (loss.backward) and apply optimizer step
      - add the loss to the epoch_loss
    Reporting:
      - report the current average loss every 'reporting_interval' batches
      - report the average loss at the end of the epoch (done for you)

    """

    ### YOUR CODE HERE

    ### END YOUR CODE

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def test_loop(dataloader, model, loss_fn, reporting_interval=100, contrast_pair=None,steps=None):
    """
    Write the test loop to evaluate the fine-tuned model for sentiment classification using the final next-token-prediction task.
    In particular, you need to:

    - set the model into eval mode and initialize the test_loss to 0. Also, set the number of correct &
      total test examples to 0, like:
      'test_loss, correct_token_predictions,  correct_label_class, total = 0, 0, 0, 0'
        (See the two approaches for accuracy below for correct_token_predictions
        and correct_label_class)

    - use torch.no_grad to iterate over the batches:
        - break if you are at 'steps' number of batches
        - from the batch, get the test inputs X and labels y (which in this case will be the actual next token). You may want to look at the format of batches by using 'next(iter(imdb_test_loader))' in a separate cell
        - get the model outputs
        - calculate loss and add to test_loss (reshaping should not be necessary)
        - For the accuracy, we can try two approaches (and in this case they should turn out to be
            probably the same in the end):
              i) Test Class Accuracy:
                  - Define the predicted class (I call it selected class) by comparing the logits for our two
                    'evaluating tokens' (like 'good', 'bad'). if the logit for (in this example) 'good' is higher,
                    then the predicted class is the positive one, etc.
              ii)  Token Prediction Accuracy:
                  - see how often the correct 'evaluation token' is predicted. I.e., here we do not compare
                    whether the model believes that 'good' is a more likely next token than 'bad', but was
                    'good' the actual next token prediction (and vice versa).
              - get these numbers for each batch and add to the totals

    - add the loss to the epoch_loss

    Reporting:

    - report on the average token accuracy, average class accuracy and average test loss every 'reporting_interval' batches
    - report on the same at the end (done for you)

    """

    # let's get the proper class ids for the 'evaluating next tokens' (like 'good', 'bad')

    if contrast_pair is not None:
      class_1, class_2 = contrast_pair
      class_1_id, class_2_id = gpt_2_tokenizer.encode(' ' + class_1 + ' ' + class_2)

    # now the loop starts:

    ### YOUR CODE HERE

    ### END YOUR CODE

    print(correct_label_class)
    print(f"Test Results: \n\t Test Token Accuracy: {(100*correct_token_predictions):>0.1f}% \n\t Test Class Accuracy: {(100*correct_label_class):>0.1f}%  \n\t Avg test loss: {test_loss:>8f} \n")
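
The two accuracy notions from the `test_loop` docstring can be illustrated on toy logits. In the sketch below the vocabulary has only 6 entries, and `good_id` and `bad_id` are hypothetical stand-ins rather than GPT-2's real token ids. The first example shows how the two measures can differ: the most likely token overall is neither candidate, yet 'good' still beats 'bad'.

```python
import torch

# Hypothetical token ids in a toy vocabulary of 6 entries
good_id, bad_id = 2, 4

toy_logits = torch.tensor([
    [0.1, 3.0, 2.0, 0.0, 1.0, 0.5],  # top token is id 1 (neither candidate); good > bad
    [0.0, 0.2, 0.3, 0.1, 2.5, 0.4],  # top token is 'bad'
    [0.3, 0.1, 1.9, 0.2, 0.4, 0.0],  # top token is 'good'
])
toy_labels = torch.tensor([good_id, bad_id, bad_id])

# Token prediction accuracy: the arg-max over the whole vocabulary must be the label token
token_predictions = toy_logits.argmax(dim=-1)
token_accuracy = (token_predictions == toy_labels).float().mean().item()

# Class accuracy: only compare the logits of the two candidate tokens
class_predictions = torch.where(toy_logits[:, good_id] > toy_logits[:, bad_id],
                                torch.tensor(good_id), torch.tensor(bad_id))
class_accuracy = (class_predictions == toy_labels).float().mean().item()

print(f"Token accuracy: {token_accuracy:.3f}, class accuracy: {class_accuracy:.3f}")
```

After fine-tuning, the model should put most of its probability mass on the two candidate tokens, which is why the two numbers are expected to end up close to each other.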

In [None]:
epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(imdb_train_loader, token_prediction_network_base_model, loss_fn, adam_optimizer_base_model, steps=2000)
    test_loop(imdb_test_loader, token_prediction_network_base_model, loss_fn, contrast_pair=('good', 'bad'),
              steps=500
              ) # no optimizer use here!
print("Done!")

**QUESTION:**

2.b. What was your test accuracy after fine-tuning, when starting with the base model?

In [None]:
### Q2-b Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now we redo this for the model that saw the additional pre-training:

In [None]:
epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(imdb_train_loader, token_prediction_network_addnl_pretrain_model, loss_fn, adam_optimizer_addtl_pretrain_model, steps=2000)
    test_loop(imdb_test_loader, token_prediction_network_addnl_pretrain_model, loss_fn, contrast_pair=('good', 'bad'),
              steps=500
              ) # no optimizer use here!
print("Done!")

This looks good! So we get better movie review sentiment classification using the model that had seen the additional pretraining (up to statistical fluctuations).


**QUESTION:**

2.c. What was your test accuracy after fine-tuning, when starting with the model that had additional pre-training?

In [None]:
### Q2-c Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.d. Based on this and what we saw in the previous section (and, as there are statistical fluctuations, based on what 'should' be the case), what would be your expectation for these two starting models when used for sentiment analysis tasks that deal with data **inside** the movie domain? ('base model slightly better', or 'additional pretrain model slightly better')

In [None]:
### Q2-d Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.e. Based on this and what we saw in the previous section (and, as there are statistical fluctuations, based on what 'should' be the case), what would be your expectation for these two starting models when used for sentiment analysis tasks that deal with data  **outside** the movie domain? ('base model slightly better', or 'additional pretrain model slightly better')

In [None]:
### Q2-e Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

## 3. Sentiment Classification with BERT

Now we will see how well the classification with BERT works in comparison. We will get the model and its tokenizer, then - as discussed in class - use the output of the initial [CLS] token to classify the sentiment. Will it be better? Or worse?

See https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel for more details around the model.



In [None]:
%%capture

from transformers import AutoTokenizer, BertModel


bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert_model = BertModel.from_pretrained("bert-base-cased").to(device)

Let us look at a simple BERT tokenization:

In [None]:
bert_toy_inputs = bert_tokenizer("This is new", return_tensors="pt").to(device)
bert_toy_outputs = bert_model(**bert_toy_inputs)

last_hidden_states = bert_toy_outputs.last_hidden_state

last_hidden_states.shape

Play with the decode method of the tokenizer to see why the shape is ... x 5 x ... . Then identify the first value of the output of the [CLS] token.

In [None]:
# Decode the tokenization. Call it bert_toy_tokens.
### YOUR CODE HERE

### END YOUR CODE
print('The tokens after tokenization: ', bert_toy_tokens)

**QUESTION:**

3.a Why is the shape .. x 5 x ... and not .. x 3 x ... ? Explain. (You may need to look up what the purpose is of one of the extra tokens. Don't write more than 2-3 lines.)

In [None]:
### Q3-a Grading Tag: Please put your answer in this cell. Don't edit this line. (THIS IS NOT AN AUTOGRADER QUESTION)

### YOUR ANSWER HERE

### END YOUR ANSWER

In [None]:
# Get the output for the [CLS] token. Call it cls_first_out.
### YOUR CODE HERE

### END YOUR CODE
print('First output of [CLS] token: ', cls_first_out)

**QUESTION:**

3.b What is the first value of the output of the [CLS] token?

In [None]:
### Q3-b Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now we construct the dataset and the dataloader. The BERT dataset class is defined at the beginning.

In [None]:
bert_train_data = BERTClassificationData(imdb_train_set,
                                tokenizer=bert_tokenizer,
                                max_len=100,
                                num_examples=-1
                                )

bert_test_data = BERTClassificationData(imdb_test_set,
                                tokenizer=bert_tokenizer,
                                max_len=100,
                                num_examples=-1
                                )

batch_size = 4
bert_imdb_train_loader = DataLoader(bert_train_data, batch_size=batch_size, shuffle=True)
bert_imdb_test_loader = DataLoader(bert_test_data, batch_size=batch_size, shuffle=True)



Now build the classification network that uses the output of the [CLS] token for the classification.

In [None]:
%%capture

class BERTClassificationNetworkClass(torch.nn.Module):
    """
    Write the class for the classification network using
    the Masked Language Model BERT.
    Specifically, you will need to extract the output of the [CLS] token
    from the BERT model (i.e., the very first token), apply a suitable linear layer,
    and apply the sigmoid function.
    """

    def __init__(self):
        ### YOUR CODE HERE

        ### END YOUR CODE

    def forward(self, x):
        # Get the forward pass. Apply the BERT model, then the linear layer, and
        # then apply the sigmoid
        ### YOUR CODE HERE

        ### END YOUR CODE

        return torch.squeeze(sigmoid_output) # removing 'x 1 x ' dimensions


loss_fn = torch.nn.BCELoss()

bert_classification_model = BERTClassificationNetworkClass().to(device)
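
Since this network ends in a sigmoid and is trained with `BCELoss`, it can help to see what that loss computes. The sketch below is plain Python with made-up scores and targets (the helper names are ours): it applies the sigmoid to hypothetical linear-layer outputs and averages the binary cross entropy. It also shows that a model that always outputs 0.5 has a loss of log(2), about 0.693, which is a useful reference value for an untrained classification head.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, targets):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

raw_scores = [2.0, -1.0, 0.0, 3.0]  # hypothetical outputs of the linear layer
targets = [1, 0, 1, 0]              # made-up sentiment labels

probs = [sigmoid(z) for z in raw_scores]
loss = binary_cross_entropy(probs, targets)

# Reference point: always predicting 0.5 gives log(2), regardless of the targets
chance_loss = binary_cross_entropy([0.5] * len(targets), targets)

print(f"BCE loss: {loss:.4f}, chance-level loss: {chance_loss:.4f}")
```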


Let's test it. Is the structure correct?

In [None]:
test = next(iter(bert_imdb_train_loader))

out = bert_classification_model({'input_ids': test['input_ids']})

loss = loss_fn(out.float(), test['label'].float())

print('Output: ', out)
print('Loss: ', loss)


Good. Finally, we need train and test loops:

In [None]:
def bert_train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=100, steps=None):
    """
    Following the same logic as above, write the training loop to use the
    Masked Language Model BERT for the sentiment classification task. You
    only need to report the average loss after the reporting interval
    and end of each epoch.
    """

    ### YOUR CODE HERE

    ### END YOUR CODE

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def bert_test_loop(dataloader, model, loss_fn, reporting_interval=100, contrast_pair=None,steps=None):
    """
    Following the same logic as above, write the test loop to use the
    Masked Language Model BERT for the sentiment classification task.
    Please report on the accuracy after the reporting interval and end of each epoch.
    """

    ### YOUR CODE HERE

    ### END YOUR CODE

    correct = float(correct.cpu().detach().numpy())

    test_loss /= batch
    correct /= total
    print(correct)

    print(f"Test Results: \n Test Accuracy: {(100*correct):>0.1f}%, Avg test loss: {test_loss:>8f} \n")

Now let's see:

In [None]:
adam_optimizer_bert = torch.optim.AdamW(bert_classification_model.parameters(), lr=0.00001)

epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    bert_train_loop(bert_imdb_train_loader, bert_classification_model, loss_fn, adam_optimizer_bert, steps=2000)
    bert_test_loop(bert_imdb_test_loader, bert_classification_model, loss_fn,
              steps=500
              ) # no optimizer use here!
print("Done!")


**QUESTION:**

3.c What was the test accuracy you got for the BERT model?


In [None]:
### Q3-c Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER


And that is it for assignment 2.  Congratulations!