### PyTorch Intro II: Applying the Basics to Sentiment Classification

This notebook supports the Slides "PyTorch Intro II: Sentiment Classification".  It assumes full understanding of the first part ("Basics") and will introduce a new concept: Text Embeddings.

1. Review: What Needs to be Done?

2. A Warm-Up: Sentence Embeddings with the Sentence Transformer

3. Sentiment Classification with GPT-2 - Option A: Straight Last Token Prediction Classification
    1. Downloading and working with GPT-2: Tokenizer & Model
    2. Data Preparation
    3. Network Setup
    4. Training


This notebook has been run on a T4 processor using Google Colab.

Let's again start with a few installs and imports.

The following installs are not required if you use Google Colab (at least not for 'Pro' version):


In [1]:
%%capture

!pip install torch
!pip install torchtext
!pip install transformers   # for our application example in the end
!pip install numpy

The following installs are still required when you use Google Colab:

In [2]:
%%capture
#!pip install torchdata
!pip install -U datasets fsspec huggingface_hub
#!pip install portalocker

Let's make sure we land on the proper device:

In [3]:
import torch
#torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available() #workaround
from torch.utils.data import Dataset, DataLoader



import numpy as np
import random

from datasets import load_dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device



device(type='cuda')

### 1. Review: What Needs to be Done?

At the core, we will have a list/set/iterable of text inputs, and we will try to predict the sentiment, positiver or negative in this case. We will assign the value 0 to the negative class, and 1 to the positive class. So what do we need to do?

1. Data:     
    
    a. We need to get the data. We will use the IMDB dataset, included in tochtext.datasets.    

    b. We need to define the Dataset. This step would do all relevant pre-processing. In particular, we will
    1. For simplicity, only include examples that have at least our target length (max_tokens) and cut the input text off there.
    1. Tokenize the text, i.e., split the text into tokens, which are then represented as integers, corresponding to the number of the token in the vocabulary. (Example: vocabulary: ('I', 'you', 'animal', '_s', '_r'); 'you' -> 1, 'animals' -> 2, 3). We will use a model from HuggingFace to do this.
    1. We will convert the labels in the dataset (1: negative, 2: positive) to our classes 0 and 1 (we need that for the binary cross-entropy calculation)
    c. We need to define the Dataloader for batching and shuffling.

2. Network:
    a. We need a way to convert text (token ids) into vectors. We will use a pre-trained HuggingFace GPT-2 model for this, which we will further train.
    b. We will add a hidden layer
    c. We will create a classification layer

2. Training
    a. We will define our cost function as the binary cross-entropy
    b. We will use the Adam Optimizer
    c. We will write training and eval loops
      1.The training loop will update the parameters
      2. The eval loop will capture test accuracy and test loss.

And that's it!

But we will start with a warm-up: Sentence Embeddings with a pre-trained Sentence Transformer.

### 2. Warm-Up: Sentence Embeddings with the Sentence Transformer

The Sentence Transformer (see: https://huggingface.co/sentence-transformers) is a very handy (set of) pre-trained model(s) that converts sentence/document into a meaningful vector. This can serve as the base for Semantic Search, Text Clustering, etc.

In [4]:
%%capture
!pip install -U sentence-transformers
!pip install portalocker

In [5]:
from sentence_transformers import SentenceTransformer
import numpy as np

In [6]:
%%capture
sentence_embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

Let's look at a few layers in this model. You'll see the individual layers of the underlying Transformer!

In [7]:
print(f"Model structure: {sentence_embedding_model}\n\n")

layer_count = 0


# Let's see the first 25 layers
num_layer_names_shown = 25


for layer_num, (name, param) in enumerate(sentence_embedding_model.named_parameters()):
    if layer_num < num_layer_names_shown:
      print(f"Layer: {name} | Size: {param.size()}")

print("\nNumber of layers in model: ", str(layer_num))



Model structure: SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)


Layer: 0.auto_model.embeddings.word_embeddings.weight | Size: torch.Size([30522, 384])
Layer: 0.auto_model.embeddings.position_embeddings.weight | Size: torch.Size([512, 384])
Layer: 0.auto_model.embeddings.token_type_embeddings.weight | Size: torch.Size([2, 384])
Layer: 0.auto_model.embeddings.LayerNorm.weight | Size: torch.Size([384])
Layer: 0.auto_model.embeddings.LayerNorm.bias | Size: torch.Size([384])
Layer: 0.auto_model.encoder.layer.0.attention.self.query.weight | Size: torch.Size([384, 384])
Layer: 0.auto_model.encoder.layer.0.attention.self.que

Lot's going on. Let's ignore that for now, and instead look at a couple of embeddings for three sample sentences. Note that sentences 0 and 1 have **similar meanings**, and that sentences 1 and 2 have **similar words**. So a word(token)-matching approach would probably suggest that sentences 1 and two are most similar... clearly not what we want. How do the sentence embeddings do?

In [8]:
sentences = ['What a nice start of the day', 'A great morning', 'A great problem']

encoded_sentences = sentence_embedding_model.encode(sentences, normalize_embeddings=True)
encoded_sentences.shape

(3, 384)

First off, this seems like what we need: three 384-dim vectors.

Let's verify that the sentence embedding vectors are normalized:

In [9]:
np.dot(encoded_sentences[0], encoded_sentences[0])

np.float32(1.0)

Good. Now, what are the mutual dot products (equivalent to cosine similarities as the vectors are normalized)?

In [10]:
np.matmul(encoded_sentences, np.transpose(encoded_sentences))

array([[0.99999946, 0.70665336, 0.08578878],
       [0.70665336, 0.9999995 , 0.2648253 ],
       [0.08578878, 0.2648253 , 1.0000002 ]], dtype=float32)

Great, the first two sentences (sentences 0 and 1) have a much larger similarity with each other than with the third sentence. That is what we want in embeddings!

We could actually use these embeddings in a network and add a classification head (and other layer in between if desired) to perform sentiment classification. But we will follow a different approach for the sentiment classification.

### 3.Sentiment Classification with GPT-2 - Option A: Straight Last Token Prediction Classification

We will now use the idea of Sentence Embeddings for our Sentiment Classification problem. We will however use the GPT-2 model from HuggingFace (provided by OpenAI). This gives us another opportunity to also look at a generative AI model explicitly.

The architectural picture can be found in the slides for this notebook. The idea is to use the **last output vector** to represent the full sentence, as its calculation is based on the entire input.

#### a. Downloading and working with GPT-2: Tokenizer & Model

We will first get the model, see here: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Model


In [11]:
%%capture
from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification

gpt_2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_2_model = GPT2Model.from_pretrained("gpt2").to(device)

In [12]:
tokenized_input = gpt_2_tokenizer(['Great. This is fun'], return_tensors="pt").to(device)
tokenized_input

{'input_ids': tensor([[13681,    13,   770,   318,  1257]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}

In [13]:
gpt_2_tokenizer.encode('Great. This is fun')

[13681, 13, 770, 318, 1257]

In [14]:
gpt_2_tokenizer.decode(13681)

'Great'

In [15]:
['Great. This is fun', 'I want to go and see a movie']

['Great. This is fun', 'I want to go and see a movie']

What do we see here?

* input_ids: the positions in the vocabulary for the words(tokens)
* attention_mask: Is this position relevant, or 'filled in'? (Like if sentences are 'padded' with padding tokens to make sure that all sentences have the same. But that is for a deep dive discussion.)

Here is a slightly more complex example:

In [16]:
gpt_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

tokenized_input_2 = gpt_2_tokenizer(['This is great', 'Marvelously done by you!'],
                              return_tensors="pt",
                              max_length=7,
                              truncation=True,
                              padding='max_length').to(device)
tokenized_input_2

{'input_ids': tensor([[ 1212,   318,  1049, 50257, 50257, 50257, 50257],
        [38864,  3481,  1760,   416,   345,     0, 50257]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0]], device='cuda:0')}

In [17]:
device

device(type='cuda')

In [18]:
tokenized_input_2['attention_mask']

tensor([[1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0]], device='cuda:0')

Two examples in batch, length 7. Seems right.

For future comparison, let's extract some individual weights and biases. (We'll later see that they were changed during the upcoming Fine-Tuning process). We'll pick a random one: 'h.11.ln_2.bias' and look at the first 10 values:

In [19]:
pre_train_param_values = []

print(f"Model structure: {gpt_2_model}\n\n")

layer_count = 0

for param in gpt_2_model.parameters():
    param.requires_grad = True   # make sure that the gpt2 model layers can be retrained (try also without...)

num_layer_names_shown = 100000


for layer_num, (name, param) in enumerate(gpt_2_model.named_parameters()):
  pre_train_param_values.append(param.cpu().detach().numpy())

pre_train_param_values[15][:15]

Model structure: GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)




array([-0.00362932, -0.00622698, -0.00362774, -0.0011091 , -0.01039783,
       -0.00634792, -0.02703804,  0.04477038, -0.01744087, -0.00305082,
       -0.00317421, -0.00448913, -0.00828817, -0.01425266, -0.01139623],
      dtype=float32)

In [20]:
gpt_2_model

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

In [21]:
pre_train_param_values[15][:15]

array([-0.00362932, -0.00622698, -0.00362774, -0.0011091 , -0.01039783,
       -0.00634792, -0.02703804,  0.04477038, -0.01744087, -0.00305082,
       -0.00317421, -0.00448913, -0.00828817, -0.01425266, -0.01139623],
      dtype=float32)

Ok. So overall, what happened here?

* We asked for truncation and padding to length 7. This made sure that we do get a tensor; all length are identical.
* The 'attention_mask' now has 1s and 0s. The 0s correspond to filled-in 'padding tokens'
* Historically, GPT-2 did not have a padding token, and we needed to add it. (Moving forward, we can ignore the padding aspect as we will only consider examples that fit. But know that this is generally an important aspect.)
* The first sentence with 3 words has 3 non-adding tokens (1212, 318, 1049). Makes sense. But the second one with 4 words and one exclamation mark has 6 non-padding tokens. Why? 'Marvelously' was split into 2 tokens, and the '!' has its own token


What about the action of the model on our inputs? First off, a lot of things are returned (and more can be returned! See the HuggungFace specifications).

In [22]:
tokenized_input

{'input_ids': tensor([[13681,    13,   770,   318,  1257]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}

In [23]:
gpt_2_model_output = gpt_2_model(**tokenized_input)
gpt_2_model_output.keys()

odict_keys(['last_hidden_state', 'past_key_values'])

What is in the last hidden state? Is that something we want?

In [24]:
gpt_2_model_output['last_hidden_state'].shape

torch.Size([1, 5, 768])

That makes sense. 2 sentences, 3 words each, and a 768-dim output for standard GPT-2. So this corresponds to the output vector for sentence _i_ and position _j_.

We will need the __last__ output vector for each sentence, as that will have seen the entire sentence (see class discussion). How do we get that?



In [25]:
last_position_logits = gpt_2_model(**tokenized_input)['last_hidden_state'][:, -1, :]
last_position_logits.shape

torch.Size([1, 768])

This will be our document embedding later.

#### b. Data Preparation

Now we need the data and create the Dataset with all pre-processing and the Dataloader. We will use the IMDB dataset sourced from Hugging Face using the dataset library (https://huggingface.co/datasets/stanfordnlp/imdb) provided by the Stanford NLP team.

In [26]:
imdb_dataset = load_dataset("IMDB")

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

You always want to look at the data right away to see the structure:

In [27]:
imdb_dataset['train'][6]

{'text': "Whoever wrote the screenplay for this movie obviously never consulted any books about Lucille Ball, especially her autobiography. I've never seen so many mistakes in a biopic, ranging from her early years in Celoron and Jamestown to her later years with Desi. I could write a whole list of factual errors, but it would go on for pages. In all, I believe that Lucille Ball is one of those inimitable people who simply cannot be portrayed by anyone other than themselves. If I were Lucie Arnaz and Desi, Jr., I would be irate at how many mistakes were made in this film. The filmmakers tried hard, but the movie seems awfully sloppy to me.",
 'label': 0}

So '0' seems to be 'negative'. (And we know separately that '1' is positive, but feel free to check yourself.) Let's count the numbers of positive and negative examples.

In [28]:
def create_temp_set(base_data, num_examples=1000000000):
    num_positive = 0
    num_negative = 0
    num_other = 0

    temp_data = []
    out_data = []

    for example_num, example in enumerate(base_data):

      temp_data.append(example)

    random.shuffle(temp_data)

    for example_num, example in enumerate(temp_data):

      if num_examples != -1 and example_num > num_examples:
        break

      if example['label'] == 0:
        num_negative += 1
      elif example['label'] == 1:
        num_positive += 1
      else:
        num_other += 1

      out_data.append(example)


    print('positive: ', num_positive)
    print('negative: ', num_negative)
    print('other: ', num_other)

    return out_data

# There appears to be an issue with the IMDB data. Let's which of the two sets has the full data and pick that one.
my_imdb_data_train = create_temp_set(imdb_dataset['train'], 10000)
my_imdb_data_test = create_temp_set(imdb_dataset['test'], 2000)

[x['label'] for x in my_imdb_data_train[:20]]

positive:  5037
negative:  4964
other:  0
positive:  950
negative:  1051
other:  0


[1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

This looks reasonable.

We will now create the actual Dataset and the Dataloader.

We will use the Dataset definition to properly format the data:

* we will ignore examples that do not have our desired minimum length
* we may (see Option B) 'add a prompt', i.e., augment the text so that _the natural next token prediction would actually be the sentiment (example: 'This did not go well.' -> 'Here is a statement: This did not go well. Is the sentiment positive or negative?' The classification would be done from the output of the last '?')
* we need to convert the input text into input ids etc.


A good reference is also [here (Sentiment Analysis using Roberta)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb#scrollTo=3vWRDemOGxJD), where the Encoder model RoBertA is used.


In [29]:
class ClassificationData(Dataset):
    def __init__(self, base_data, tokenizer, max_len, use_prompt=False):
        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []


        # really  not ideal having to iterate through the whole set. But ok for this small data volume
        for example_num, example in enumerate(base_data):


            try:
              token_encoder = tokenizer(example['text'])['input_ids']
            except:
              print(example_num)
              break

            try:
              if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.
            except:
              print(example_num)
              print(token_encoder)
              print(len(token_encoder))
              break

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!

            if use_prompt:
                cutoff = self.max_len + 13
                prompted_text_line = 'Here is a movie review: ' + truncated_example + ' Is this review positive or negative?'

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(gpt_2_tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue

            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            self.data.append({'label': (float(example['label'])),
                              'tokenized_text':
                                  {'input_ids': torch.tensor(torch.squeeze(tokenized_example['input_ids']),
                                      device=device),
                                   'attention_mask': torch.tensor(torch.squeeze(tokenized_example['attention_mask']),
                                      device=device)
                                   }
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input': self.data[index]['tokenized_text'],
            'label': torch.tensor(self.data[index]['label'],
                                  dtype=torch.float,
                                  device=device)
        }

In [30]:
train_data = ClassificationData(my_imdb_data_train, tokenizer=gpt_2_tokenizer, max_len=100)
test_data = ClassificationData(my_imdb_data_test, tokenizer=gpt_2_tokenizer, max_len=100)

  {'input_ids': torch.tensor(torch.squeeze(tokenized_example['input_ids']),
  'attention_mask': torch.tensor(torch.squeeze(tokenized_example['attention_mask']),
Token indices sequence length is longer than the specified maximum sequence length for this model (1257 > 1024). Running this sequence through the model will result in indexing errors


Quick peek at what we have created:

Perfect. This looks like the input_ids and attention masks (not critical here) for the input plus the labels.  And all data is the GPU, if available.

And now to the Dataloaders to create batches and deal with shuffling (we don't need to do this here anymore):

In [31]:
batch_size = 16
train_texts = DataLoader(train_data, batch_size=batch_size, shuffle=True) # usually we shuffle, but we shuffled already above
test_texts = DataLoader(test_data, batch_size=batch_size, shuffle=True)

Another peek, now at the Dataloader. Do we see the right shapes?

In [32]:
next(iter(test_texts)).keys()

dict_keys(['input', 'label'])


Yup!

#### c. Network Setup


How would we use this type of model for our classification goal? We could use it as the first layer in our network, where we use the (tokenized) review as input, then use the last output vector as input to the classification layer.

In [33]:
class MyTextClassificationNetworkClass(torch.nn.Module):
    def __init__(self, embedding_model, embedding_model_dim):
        super().__init__()
        self.lm = embedding_model
        self.dropout = torch.nn.Dropout(0.2)
        self.linear_2 = torch.nn.Linear(embedding_model_dim, 1)
        self.sigmoid = torch.nn.Sigmoid()


    def forward(self, x):                             # x stands for the input that the network will use/act on later
        embeddings = self.lm(**x)['last_hidden_state'][:, -1]
        dropout_embeddings = self.dropout(embeddings)
        output = self.linear_2(dropout_embeddings)[:, 0]  # flatten the last dimension, there is only 1 neuron. And this is what the cost function will expect
        output = self.sigmoid(output)
        #output = self.sigmoid(self.linear_2(dropout_embeddings))[:, 0]  # flatten the last dimension, there is only 1 neuron
        return output

my_text_classification_network = MyTextClassificationNetworkClass(embedding_model=gpt_2_model,
                                                                  embedding_model_dim=768)

my_text_classification_network.to(device)

MyTextClassificationNetworkClass(
  (lm): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (linear_2): Linear(in_features=768, out_features=1, bias=True)
  (sigmo

NOTES:

  - we can obviously build the network layer-by-layer and test whether we get the expected output dimensions!   
  - The 'last_hidden_state' does **not** refer to the last token in the sequence, but rather the output of the last Transformer layer! So we need to add the [:, -1] slicing.

Does this work?

In [34]:
example = next(iter(test_texts))
example



{'input': {'input_ids': tensor([[ 5246, 26510,   373,  ...,    11,   290,  1312],
          [ 5779,   612,   338,  ...,   355,   257,  3957],
          [ 1212,  2646,  4952,  ...,   326,   673,   561],
          ...,
          [   40,   550,   262,  ...,  4687,    74,  2427],
          [ 5703,   355,   262,  ...,  6735,  7328,   780],
          [   40,  2497,   428,  ..., 35589,    13,   843]], device='cuda:0'),
  'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          ...,
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')},
 'label': tensor([1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1.],
        device='cuda:0')}

In [35]:
print(my_text_classification_network(example['input']))

tensor([0.9962, 0.9796, 0.9993, 0.0323, 0.9978, 0.9902, 0.8796, 0.7315, 0.9720,
        0.8857, 0.9873, 0.9907, 0.8540, 0.8927, 0.7251, 0.4248],
       device='cuda:0', grad_fn=<SigmoidBackward0>)


Good. This looks right, dimension-wise. Obviously no valid predictions because the models hasn't been trained yet.


What are the layers? Do we see our added classification layer?

In [36]:
for name, param in my_text_classification_network.named_parameters():
  print(name)


lm.wte.weight
lm.wpe.weight
lm.h.0.ln_1.weight
lm.h.0.ln_1.bias
lm.h.0.attn.c_attn.weight
lm.h.0.attn.c_attn.bias
lm.h.0.attn.c_proj.weight
lm.h.0.attn.c_proj.bias
lm.h.0.ln_2.weight
lm.h.0.ln_2.bias
lm.h.0.mlp.c_fc.weight
lm.h.0.mlp.c_fc.bias
lm.h.0.mlp.c_proj.weight
lm.h.0.mlp.c_proj.bias
lm.h.1.ln_1.weight
lm.h.1.ln_1.bias
lm.h.1.attn.c_attn.weight
lm.h.1.attn.c_attn.bias
lm.h.1.attn.c_proj.weight
lm.h.1.attn.c_proj.bias
lm.h.1.ln_2.weight
lm.h.1.ln_2.bias
lm.h.1.mlp.c_fc.weight
lm.h.1.mlp.c_fc.bias
lm.h.1.mlp.c_proj.weight
lm.h.1.mlp.c_proj.bias
lm.h.2.ln_1.weight
lm.h.2.ln_1.bias
lm.h.2.attn.c_attn.weight
lm.h.2.attn.c_attn.bias
lm.h.2.attn.c_proj.weight
lm.h.2.attn.c_proj.bias
lm.h.2.ln_2.weight
lm.h.2.ln_2.bias
lm.h.2.mlp.c_fc.weight
lm.h.2.mlp.c_fc.bias
lm.h.2.mlp.c_proj.weight
lm.h.2.mlp.c_proj.bias
lm.h.3.ln_1.weight
lm.h.3.ln_1.bias
lm.h.3.attn.c_attn.weight
lm.h.3.attn.c_attn.bias
lm.h.3.attn.c_proj.weight
lm.h.3.attn.c_proj.bias
lm.h.3.ln_2.weight
lm.h.3.ln_2.bias
lm.h.3.m

Let's set up the loss function and the optimizer as before. We want to use the binary cross-entropy and the Adam optimizer (usually, you want to pick Adam or AdamW).

In [37]:
loss_fn = torch.nn.BCELoss()
adam_optimizer = torch.optim.Adam(my_text_classification_network.parameters(), lr=0.0001)

Let's test whether the loss function works for our labels and model outputs. Not that we may have to adjust the format of the labels.

In [38]:
sample_example = next(iter(train_texts))
sample_input = sample_example['input']
sample_labels = sample_example['label']

sample_labels

tensor([0., 1., 1., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0.],
       device='cuda:0')

In [39]:
sample_output = my_text_classification_network(sample_input)#.to(device)

loss_fn(sample_output, sample_labels)

tensor(1.6976, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)

In [40]:
sample_output

tensor([0.8674, 0.5647, 0.9826, 0.9856, 0.9842, 0.9855, 0.9668, 0.9878, 0.8838,
        0.0588, 0.9953, 0.9780, 0.9067, 0.9847, 0.9997, 0.3837],
       device='cuda:0', grad_fn=<SigmoidBackward0>)

Is this correct? Let's see. We need to get the labels and then calculate manually.

In [41]:
print('Labels: ', sample_labels)
print('Sample Output: ', sample_output)


Labels:  tensor([0., 1., 1., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0.],
       device='cuda:0')
Sample Output:  tensor([0.8674, 0.5647, 0.9826, 0.9856, 0.9842, 0.9855, 0.9668, 0.9878, 0.8838,
        0.0588, 0.9953, 0.9780, 0.9067, 0.9847, 0.9997, 0.3837],
       device='cuda:0', grad_fn=<SigmoidBackward0>)


In [42]:
sample_output_list = list(sample_output.cpu().detach().numpy())
sample_labels_list = list(sample_labels.cpu().detach().numpy())

loss = 0

for (label, positive_class_prob) in zip(sample_labels_list, sample_output_list):
  if label == 0:
    loss -= np.log(1 - positive_class_prob)
  else:
    loss -= np.log(positive_class_prob)

average_loss = loss/len(sample_labels_list)



print('Manual binary cross entropy discussion: ', average_loss)

Manual binary cross entropy discussion:  1.6976345


Great, the manual calculation agrees!

Off to write the loops:

In [43]:
def train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=25, steps=None):
    # size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    epoch_loss = 0
    model.train()

    for batch, example in enumerate(dataloader):

      X =  example['input']
      y = example['label']

      if steps is not None:
        if batch > steps:
          break

      optimizer.zero_grad()     # the gradients need to be zeroed out after the gradients are applied by the optimizer

      # Compute prediction and loss

      pred = model(X)

      loss = loss_fn(pred, y)

      # Backpropagation
      loss.backward()
      optimizer.step()

      epoch_loss += loss.item()

      if int(batch + 1) % reporting_interval == 0:
        print('\tFinished batches: ', str(batch + 1))
        print('\tCurrent average loss: ', epoch_loss/batch)

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def test_loop(dataloader, model, loss_fn, reporting_interval=100, steps=None):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    test_loss, correct, total = 0, 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for batch, example in enumerate(dataloader):

          X =  example['input']
          y = example['label']

          if steps is not None:
            if int(batch) > steps:
              break

          pred = model(X)
          test_loss += loss_fn(pred, y).item()
          predictions = [int(x > 0.5) for x in list(pred)]
          labels = [int(label > 0.5) for label in list(y)]
          correct += np.sum([x == y for (x, y) in zip(predictions, labels)])
          total += np.sum([1 for _ in predictions])

          if int(batch + 1) % reporting_interval == 0:
            print('Accuracy after', str(batch + 1), 'batches:', str(correct/total))

    test_loss /= batch
    correct /= total
    print(f"Test Results: \n Test Accuracy: {(100*correct):>0.1f}%, Avg test loss: {test_loss:>8f} \n")

In [44]:
#my_text_classification_network = my_text_classification_network.to(device)

epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_texts, my_text_classification_network, loss_fn, adam_optimizer, steps=500)
    test_loop(test_texts, my_text_classification_network, loss_fn, steps=125) # no optimizer use here!
print("Done!")

Epoch 1
-------------------------------
	Finished batches:  25
	Current average loss:  0.9133863424261411
	Finished batches:  50
	Current average loss:  0.8098145139460661
	Finished batches:  75
	Current average loss:  0.7534962062900131
	Finished batches:  100
	Current average loss:  0.7151649293273387
	Finished batches:  125
	Current average loss:  0.6864801807509314
	Finished batches:  150
	Current average loss:  0.6516119676748379
	Finished batches:  175
	Current average loss:  0.6188156107205084
	Finished batches:  200
	Current average loss:  0.5965864139435878
	Finished batches:  225
	Current average loss:  0.5819241707213223
	Finished batches:  250
	Current average loss:  0.5690490765025816
	Finished batches:  275
	Current average loss:  0.5601985939118984
	Finished batches:  300
	Current average loss:  0.5446470992571135
	Finished batches:  325
	Current average loss:  0.5380626094791993
	Finished batches:  350
	Current average loss:  0.5301465981968153
	Finished batches:  375
	

This looks ok. Maybe using a prompt would be better in order to prime the model for the task? Let's revisit in Assignment II.

Lastly, did the underlying GPT2 model get updated too? Let us look again at the same values as before:


In [45]:
post_train_param_values = []

print(f"Model structure: {my_text_classification_network}\n\n")

layer_count = 0

num_layer_names_shown = 100000


for layer_num, (name, param) in enumerate(my_text_classification_network.named_parameters()):
  post_train_param_values.append(param.cpu().detach().numpy())


post_train_param_values[15][:15]

Model structure: MyTextClassificationNetworkClass(
  (lm): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (linear_2): Linear(in_features=768, out_features=1, bi

array([-0.00504095, -0.00651887, -0.00320126, -0.00186698, -0.01149514,
       -0.00528767, -0.02683423,  0.04433359, -0.01597745, -0.00354672,
       -0.0031457 , -0.00185023, -0.00850381, -0.01216451, -0.01116965],
      dtype=float32)

This compares to these values before Fine-Tuning:

In [46]:
pre_train_param_values[15][:15]

array([-0.00362932, -0.00622698, -0.00362774, -0.0011091 , -0.01039783,
       -0.00634792, -0.02703804,  0.04477038, -0.01744087, -0.00305082,
       -0.00317421, -0.00448913, -0.00828817, -0.01425266, -0.01139623],
      dtype=float32)

So clearly, the values have changes, so the underlying GPT2 model weights were further fine-tuned as well, not just the linear layer!