### Lesson Notebook Week 3 - From Basics to Sentiment Classification with GPT-2

This notebook is a subset of the material for this week's special material session. It requires a T4 GPU.



Let's again start with a few installs and imports:


#### 0. Prep Work (runs for 2 min... plan for it!)

These installs do not have to be done in Google Colab (Pro):

In [None]:
%%capture

!pip install torch
!pip install torchtext
!pip install transformers   # for our application example in the end
!pip install numpy


These will need to be done:

In [None]:
%%capture

!pip install portalocker
!pip install torchdata
!pip install -U datasets fsspec huggingface_hub # Hugging Face's dataset library

In [None]:
import torch


import numpy as np
import random

from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset

from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification, GPT2LMHeadModel, BertModel, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
def create_temp_set(base_data, num_examples=1000000000):
    num_positive = 0
    num_negative = 0
    num_other = 0

    temp_data = []
    out_data = []

    for example_num, example in enumerate(base_data):

      temp_data.append(example)

    random.shuffle(temp_data)

    for example_num, example in enumerate(temp_data):

      # note: '>' (rather than '>=') keeps num_examples + 1 examples
      if num_examples != -1 and example_num > num_examples:
        break

      if example['label'] == 0:
        num_negative += 1
      elif example['label'] == 1:
        num_positive += 1
      else:
        num_other += 1

      out_data.append(example)


    print('positive: ', num_positive)
    print('negative: ', num_negative)
    print('other: ', num_other)

    return out_data
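If an exact sample size is wanted, the same idea can be sketched with `random.sample` and `collections.Counter`. This is a minimal variant under stated assumptions, not the function used in the rest of the notebook; `sample_with_counts` and the toy records are hypothetical names for illustration.

```python
import random
from collections import Counter


def sample_with_counts(base_data, num_examples, seed=0):
    """Return exactly num_examples shuffled examples plus label counts."""
    rng = random.Random(seed)          # local RNG keeps the sample reproducible
    examples = list(base_data)
    k = min(num_examples, len(examples))
    subset = rng.sample(examples, k)   # sampling without replacement
    counts = Counter(ex['label'] for ex in subset)
    return subset, counts


# Toy stand-in for the IMDB records (assumption: dicts with 'text' and 'label')
toy_data = [{'text': f'review {i}', 'label': i % 2} for i in range(20)]
subset, counts = sample_with_counts(toy_data, num_examples=6)
print(len(subset), dict(counts))
```
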

In [None]:
def cos_sim(a, b):
  return np.dot(a, b)/(np.sqrt(np.dot(a, a) * np.dot(b, b)))
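As a quick sanity check of what `cos_sim` returns, the same formula can be written in plain Python and tried on vectors where the answer is known. `cosine` is a hypothetical helper for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # opposite direction
print(same_direction, orthogonal, opposite)
```

The value depends only on the angle between the vectors, not on their lengths, which is why it is a common choice for comparing embeddings.
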

In [None]:
class ClassificationData(Dataset):
    def __init__(self, base_data, tokenizer, max_len, use_prompt=False):
        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []


        # really  not ideal having to iterate through the whole set. But ok for this small data volume
        for example_num, example in enumerate(base_data):


            try:
              token_encoder = tokenizer(example['text'])['input_ids']
            except:
              print(example_num)
              break

            try:
              if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.
            except:
              print(example_num)
              print(token_encoder)
              print(len(token_encoder))
              break

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!

            if use_prompt:
                cutoff = self.max_len + 13
                prompted_text_line = 'Here is a movie review: ' + truncated_example + ' Is this review positive or negative?'

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(self.tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue

            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            self.data.append({'label': (float(example['label'])),
                              'input':
                                  {'input_ids': torch.tensor(torch.squeeze(tokenized_example['input_ids']),
                                      device=device),
                                   'attention_mask': torch.tensor(torch.squeeze(tokenized_example['attention_mask']),
                                      device=device)
                                   }
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input': self.data[index]['input'],
            'label': torch.tensor(self.data[index]['label'],
                                  dtype=torch.float,
                                  device=device)
        }

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=25, steps=None):
    # Set the model to training mode. GPT-2 contains dropout layers,
    # so this changes the model's behavior here.
    epoch_loss = 0
    model.train()

    for batch, example in enumerate(dataloader):

      X =  example['input']
      y = example['label']

      if steps is not None:
        if batch > steps:
          break


      # Compute prediction and loss

      pred = model(X)

      loss = loss_fn(pred, y)

      # Backpropagation
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()     # the gradients need to be zeroed out after the gradients are applied by the optimizer

      epoch_loss += loss.item()

      if int(batch + 1) % reporting_interval == 0:
        print('\tFinished batches: ', str(batch + 1))
        print('\tCurrent average loss: ', epoch_loss/batch)

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def test_loop(dataloader, model, loss_fn, reporting_interval=100, steps=None):
    # Set the model to evaluation mode. This disables GPT-2's dropout layers
    # so that predictions are deterministic.
    model.eval()
    test_loss, correct, total = 0, 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for batch, example in enumerate(dataloader):

          X =  example['input']
          y = example['label']

          if steps is not None:
            if int(batch) > steps:
              break

          pred = model(X)
          test_loss += loss_fn(pred, y).item()
          predictions = [int(x > 0.5) for x in list(pred)]
          labels = [int(label > 0.5) for label in list(y)]
          correct += np.sum([x == y for (x, y) in zip(predictions, labels)])
          total += np.sum([1 for _ in predictions])

          if int(batch + 1) % reporting_interval == 0:
            print('Accuracy after', str(batch + 1), 'batches:', str(correct/total))

    test_loss /= batch
    correct /= total
    print(f"Test Results: \n Test Accuracy: {(100*correct):>0.1f}%, Avg test loss: {test_loss:>8f} \n")
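The loops above divide the accumulated loss by the batch index. A small counter-based running average is one way to avoid the edge case where that index is 0. The sketch below is a hypothetical helper (`RunningMean`) with made-up loss values, not part of the notebook's training code.

```python
class RunningMean:
    """Accumulates values and reports their mean over the number seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def mean(self):
        # Guard against asking for the mean before any update
        return self.total / self.count if self.count else float('nan')


# Toy per-batch losses (made-up numbers for illustration)
batch_losses = [1.2, 0.9, 0.6]
tracker = RunningMean()
for loss_value in batch_losses:
    tracker.update(loss_value)
print(tracker.mean)
```
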

### 1. Pretraining, Loss, and Perplexity

We will briefly look at pre-training, where a model is trained to predict the actual next token from the previous ones. We use both Phi-2 and GPT-2 as examples and look at their losses for a given sentence. We will then calculate the perplexity for both models. Note that GPT-2 is from 2019 and Phi-2 is from late 2023. Also note that we use the base version of [Phi-2](https://huggingface.co/microsoft/phi-2), which has not undergone instruction-tuning.



In [None]:
%%capture

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
phi_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
phi_2_lm_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True).to(device)

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", trust_remote_code=True)
gpt2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
gpt2_lm_model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype="auto", trust_remote_code=True).to(device)

Let's construct a short input and tokenize for both models:

In [None]:
text = '''This was really a fun event! I would go there certainly again if I get the chance. Do you know when there will be the next show? I can't wait!'''

tokenized_input_gpt2 = gpt2_tokenizer(text, return_tensors="pt").to(device)
tokenized_input_phi_2 = phi_2_tokenizer(text, return_tensors="pt").to(device)

In [None]:
tokenized_input_gpt2

{'input_ids': tensor([[1212,  373, 1107,  257, 1257, 1785,    0,  314,  561,  467,  612, 3729,
          757,  611,  314,  651,  262, 2863,   13, 2141,  345,  760,  618,  612,
          481,  307,  262, 1306,  905,   30,  314,  460,  470, 4043,    0]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [None]:
tokenized_input_phi_2

{'input_ids': tensor([[1212,  373, 1107,  257, 1257, 1785,    0,  314,  561,  467,  612, 3729,
          757,  611,  314,  651,  262, 2863,   13, 2141,  345,  760,  618,  612,
          481,  307,  262, 1306,  905,   30,  314,  460,  470, 4043,    0]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

**Questions:**
1. Do these values make sense?
2. What are input_ids and attention_mask?   
3. The tokenized values appear to be identical for both models. Did that have to be the case?
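One way to see what `attention_mask` is for: when sequences of different lengths are batched, the shorter ones are padded, and the mask marks which positions hold real tokens. The sketch below mimics this in plain Python. `pad_batch` and the pad id are assumptions for illustration, not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Two small token-id lists of different length.
# pad_id is a placeholder value; the real id depends on the tokenizer.
batch = pad_batch([[1212, 373, 1107, 257], [1212, 318]], pad_id=50257)
print(batch['input_ids'])
print(batch['attention_mask'])
```

In the single-sentence examples here no padding is needed, which is why every mask entry is 1.
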


Now we apply both language models like last week, and look at the outputs. Note that here we use the 'LMHeadModel' versions for both models. These are the Transformer parts **plus the next-token-prediction classification head**.

In [None]:
output_gpt2 = gpt2_lm_model(**tokenized_input_gpt2)
output_phi_2 = phi_2_lm_model(**tokenized_input_phi_2)

In [None]:
output_gpt2.logits.shape

torch.Size([1, 35, 50257])

In [None]:
output_phi_2.logits.shape

torch.Size([1, 35, 51200])

**Questions:**

4. Does this look right?  
5. The output shapes are not the same for both models. Do they have to be? Why/why not?  


We could now look at all of the logits and calculate the loss for how well the models predicted the next token at each position. Or... we simply use a built-in model capability: we pass the correct labels and let the model compute the loss for us (which is the average loss over all predicted positions).
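To make the "by hand" route concrete, here is a minimal sketch in plain Python of a next-token loss: the logits at position t are scored against the token at position t+1, and the per-position losses are averaged. The helper names and toy logits are made up for illustration. The Hugging Face causal-LM models perform an equivalent shift internally when `labels` are passed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def causal_lm_loss(logits_per_position, input_ids):
    """Mean negative log-likelihood of token t+1 given the logits at position t."""
    losses = []
    for t in range(len(input_ids) - 1):          # the last position has no target
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[input_ids[t + 1]])
    return sum(losses) / len(losses)


# Toy setup: vocabulary of 4 tokens, sequence of 3 tokens (made-up logits)
toy_ids = [2, 0, 3]
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform guess for the 2nd token
    [0.0, 0.0, 0.0, 0.0],   # uniform guess for the 3rd token
    [1.0, 2.0, 3.0, 4.0],   # unused: nothing follows the last token
]
loss = causal_lm_loss(toy_logits, toy_ids)
print(loss, math.log(4))
```

A sequence of n tokens therefore yields n - 1 scored predictions, so a two-token input contributes a single term to the average.
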

We will do that for a short text and a long text:

In [None]:
text_long = '''This was really a fun event! I would go there certainly again if I get the chance. Do you know when there will be the next show? I can't wait!'''
text_short = '''This is'''


tokenized_input_gpt2_long = gpt2_tokenizer(text_long, return_tensors="pt").to(device)
tokenized_input_phi_2_long = phi_2_tokenizer(text_long, return_tensors="pt").to(device)

tokenized_input_gpt2_short = gpt2_tokenizer(text_short, return_tensors="pt").to(device)
tokenized_input_phi_2_short = phi_2_tokenizer(text_short, return_tensors="pt").to(device)


output_gpt2_long = gpt2_lm_model(**tokenized_input_gpt2_long,
                          labels=tokenized_input_gpt2_long['input_ids'])

output_phi_2_long = phi_2_lm_model(**tokenized_input_phi_2_long,
                          labels=tokenized_input_phi_2_long['input_ids'])

output_gpt2_short = gpt2_lm_model(**tokenized_input_gpt2_short,
                          labels=tokenized_input_gpt2_short['input_ids'])

output_phi_2_short = phi_2_lm_model(**tokenized_input_phi_2_short,
                          labels=tokenized_input_phi_2_short['input_ids'])

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Now we look at the average losses for each model and input:

In [None]:
print('GPT2 loss - short input: ', output_gpt2_short.loss)
print('Phi-2 loss - short input: ', output_phi_2_short.loss)

print()

print('GPT2  loss - long input: ', output_gpt2_long.loss)
print('Phi-2 loss - long input: ', output_phi_2_long.loss)


GPT2 loss - short input:  tensor(3.0817, device='cuda:0', grad_fn=<NllLossBackward0>)
Phi-2 loss - short input:  tensor(1.4561, device='cuda:0', grad_fn=<NllLossBackward0>)

GPT2  loss - long input:  tensor(3.2498, device='cuda:0', grad_fn=<NllLossBackward0>)
Phi-2 loss - long input:  tensor(2.8042, device='cuda:0', grad_fn=<NllLossBackward0>)


**Questions:**

6. Why are the losses different for the longer inputs?   
7. How come the loss for Phi-2 is actually different than the one for GPT2 for the short text? What does that say about in what way Phi-2 is better than GPT2?  

What about the corresponding perplexities?

In [None]:
print('GPT2 perplexity - short input: ', str(np.exp(output_gpt2_short.loss.cpu().detach().numpy())))
print('Phi-2 perplexity - short input: ', np.exp(output_phi_2_short.loss.cpu().detach().numpy()))

print()

print('GPT2 perplexity - long input: ', np.exp(output_gpt2_long.loss.cpu().detach().numpy()))
print('Phi-2 perplexity - long input: ', np.exp(output_phi_2_long.loss.cpu().detach().numpy()))

GPT2 perplexity - short input:  21.795923
Phi-2 perplexity - short input:  4.2890596

GPT2 perplexity - long input:  25.78493
Phi-2 perplexity - long input:  16.513083
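The relation used in the cell above is simply perplexity = exp(mean loss). A small sketch in plain Python, with a hypothetical `perplexity` helper, shows one way to read the number: a model that guesses uniformly among k tokens has perplexity k.

```python
import math


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss)


# A model that spreads probability uniformly over k tokens has loss ln(k),
# so its perplexity is exactly k.
uniform_over_8 = perplexity(math.log(8))

# Loss value copied from the GPT2 short-input output above
gpt2_short = perplexity(3.0817)
print(uniform_over_8, gpt2_short)
```
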


**Question:**

8. What does this 'mean'? How certain is the model to pick the correct word?  


Next, let's see how good/not good these two models are in 'classifying sentiment out-of-the-box' by simply using pre- and post-modifiers:

In [None]:
text = '''Here is a sentence: "This was interesting event, but I did not like all of it." Sentences can have positive or negative sentiment only! Pleasde repond only with one of those two words! The sentiment of this sentence is'''




gpt2_tokenized_input = gpt2_tokenizer(text, return_tensors="pt").to(device)
phi_2_tokenized_input = phi_2_tokenizer(text, return_tensors="pt").to(device)

output_gpt2 = gpt2_lm_model(**gpt2_tokenized_input)
output_phi_2 = phi_2_lm_model(**phi_2_tokenized_input)

last_token_logits_gpt2 = output_gpt2.logits[0, -1, :].cpu().detach().numpy()
last_token_logits_phi_2 = output_phi_2.logits[0, -1, :].cpu().detach().numpy()

top_predictions_gpt2 = list(np.argsort(last_token_logits_gpt2)[-5:])
top_predictions_phi_2 = list(np.argsort(last_token_logits_phi_2)[-5:])


print('GPT2 sentiment: \n\t', ' '.join([gpt2_tokenizer.decode(x) for x in top_predictions_gpt2]))
print()
print('Phi-2 sentiment: \n\t', ' '.join([phi_2_tokenizer.decode(x) for x in top_predictions_phi_2]))

GPT2 sentiment: 
	  the  that  "  not :

Phi-2 sentiment: 
	  positive 
  "  negative  neutral


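Clearly, Phi-2 is much better than GPT2, as one should expect: its top candidates include ' positive', ' negative', and ' neutral', while GPT2 proposes generic tokens such as ' the' and ' that'. (Note that one of Phi-2's top-5 predictions is a newline character, which is why its output spans two lines.)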
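Note that `np.argsort(...)[-5:]` returns the five highest-scoring ids in ascending order, so the most likely token is printed last. The sketch below, in plain Python, lists candidates from most to least likely together with their softmax probabilities. `top_k_tokens`, the toy vocabulary, and the logits are made up for illustration.

```python
import math


def top_k_tokens(logits, vocab, k=3):
    """Return the k most likely (token, probability) pairs, best first."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]      # stable softmax numerator
    z = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], exps[i] / z) for i in ranked[:k]]


# Toy vocabulary and made-up last-position logits
toy_vocab = [' positive', ' negative', ' neutral', ' the']
toy_last_logits = [2.0, 1.0, 3.0, -1.0]
top = top_k_tokens(toy_last_logits, toy_vocab, k=2)
print(top)
```

Looking at probabilities rather than raw logits also shows how decisively a model prefers its top candidate.
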


### 2. Fine-Tuning - Sentiment Classification with GPT-2 - Option A: Straight Last Token Prediction Classification

We will now use the idea of Sentence Embeddings for our Sentiment Classification problem. However, we will use the GPT-2 model from HuggingFace (provided by OpenAI), which also gives us an opportunity to look at a generative AI model explicitly. Note that here we use the base 'Model' version (`GPT2Model`). This is the Transformer part **WITHOUT the next-token-prediction classification head**.

#### a. Downloading and working with GPT-2: Tokenizer & Model

We will first get the model, see here: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Model


In [None]:
%%capture
gpt_2_model = GPT2Model.from_pretrained("gpt2").to(device)

In [None]:
device

device(type='cuda')

In [None]:
tokenized_input = gpt2_tokenizer('This is great',
                              return_tensors="pt").to(device)
tokenized_input

{'input_ids': tensor([[1212,  318, 1049]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1]], device='cuda:0')}

In [None]:
output = gpt_2_model(**tokenized_input)
output.keys()

odict_keys(['last_hidden_state', 'past_key_values'])

Note that now we don't have 'logits' as the output anymore but 'last_hidden_state's. These are the output vectors of the last transformer layer. Let's look at the shapes:

In [None]:
output.last_hidden_state.shape

torch.Size([1, 3, 768])

**Questions:**  

2.a. What do these dimensions mean?   
2.b. Which 'last hidden state' "knows" about the entire context?   
2.c. How would we get the vector that has seen the full context?

In [None]:
output.last_hidden_state[:, -1].shape

torch.Size([1, 768])
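Indexing with `-1` picks the intended vector only when no padding follows the text. With right-padded batches, a common approach is to use the attention mask to find the last real token of each sequence. Below is a minimal sketch in plain Python on nested lists. `last_real_vectors` and the toy numbers are hypothetical, and right padding is assumed.

```python
def last_real_vectors(hidden_states, attention_mask):
    """For each sequence, return the hidden vector of its last non-padded token.

    Assumes right padding: all 1s in the mask come before all 0s.
    """
    selected = []
    for vectors, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1        # number of real tokens minus one
        selected.append(vectors[last_index])
    return selected


# Toy batch: 2 sequences, 3 positions, hidden size 2 (made-up numbers)
toy_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # no padding
    [[1.0, 1.1], [1.2, 1.3], [9.9, 9.9]],   # last position is padding
]
toy_mask = [[1, 1, 1], [1, 1, 0]]
picked = last_real_vectors(toy_hidden, toy_mask)
print(picked)
```
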

#### b. Data Preparation

Now we need the data, the Dataset with all pre-processing, and the DataLoader. We will use the IMDB dataset.

The cell below loads it and creates the shuffled subsets **my_imdb_data_train** and **my_imdb_data_test**.

Let's look at the first few labels:

In [None]:
imdb_dataset = load_dataset("IMDB")

my_imdb_data_train = create_temp_set(imdb_dataset['train'], 10000)
my_imdb_data_test = create_temp_set(imdb_dataset['test'], 2000)

[x['label'] for x in my_imdb_data_train[:20]]

positive:  4931
negative:  5070
other:  0
positive:  1012
negative:  989
other:  0


[0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

We will now create the actual Dataset and the Dataloader.



In [None]:
train_data = ClassificationData(my_imdb_data_train, tokenizer=gpt2_tokenizer, max_len=100)
test_data = ClassificationData(my_imdb_data_test, tokenizer=gpt2_tokenizer, max_len=100)

Token indices sequence length is longer than the specified maximum sequence length for this model (1125 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
batch_size = 8
train_texts = DataLoader(train_data, batch_size=batch_size, shuffle=True) # the data was already shuffled in create_temp_set; shuffling again per epoch is harmless
test_texts = DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [None]:
next(iter(train_texts))

{'input': {'input_ids': tensor([[40277, 25028, 36997,  5341,   262,  2597,   286,   257,  4346,   286,
             257, 11783, 19500,   326,  4078,  2387,   262,  1208,   286,   734,
            5510,  1938,   290,  8404,   284,  3677,   606,   329,   845,  1310,
            1637,   284,   262,  8976,  3430,   286,   262,  3932,    69,  3970,
             357, 18664,   286,   663,  2612,   828,  4851,  4347,    78,    11,
            4361,   777,  1938,   750,   407,   711,   880,    11,   290,   340,
            2227,   326,   262, 10029,  4347,   373,  2642,   276,   351,   428,
              13,   887,   644,  4325,   318,   326,   777,   734,  1938,   706,
             477,   389,   922,   290, 10029,  4347,    78,  3677,   606,   329,
             881,  1637,   284,   257,  3215,  3430,    11,  1642,   257,   922],
          [ 1212,  2646,   561,  3221, 36509,   355,   262,  5290,  3807,  3227,
            1683,    13, 10776,    13,   887,   287,   616,  4459,   340,   318,
     

Let's build the network. Of course, ideally we should have cleaned the text during dataset generation and below we should use regularization like Dropout, but let's just do the simplest approach to run our first classification using LMs.

In [None]:
class MyTextClassificationNetworkClass(torch.nn.Module):
    def __init__(self, embedding_model, embedding_model_dim):
        super().__init__()

        self.lm = embedding_model
        self.linear =  torch.nn.Linear(embedding_model_dim, 1)
        self.activation = torch.nn.Sigmoid()


    def forward(self, x):                             # x stands for the input that the network will use/act on later
        ### Code here!

        model_out = self.lm(**x)['last_hidden_state']
        last_vector = model_out[:, -1]

        linear_out = self.linear(last_vector)
        output = self.activation(linear_out)[:, 0]

        return output

my_text_classification_network = MyTextClassificationNetworkClass(embedding_model=gpt_2_model,
                                                                  embedding_model_dim=768)

my_text_classification_network.to(device)

MyTextClassificationNetworkClass(
  (lm): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (linear): Linear(in_features=768, out_features=1, bias=True)
  (activation): Sigmoid()
)

In [None]:
example = next(iter(train_texts))

In [None]:
example

{'input': {'input_ids': tensor([[ 2504,   373,  4753,   262,  1339,   351, 17654,   287,   262,  3806,
            3245,    13,   632,   373,   319,  3195,   938,  1755,   290,   314,
            1975,   314,  8020,   470,  1775,   262,  2646,  1201,   616, 24505,
             614,   287,  1029,  1524,   290,   314,  1101,   783,   287,   616,
             604,   400,   614,   286,  4152,    13,  4900,   262,  2646,   468,
             867, 17978,    11,   340,   318,   655,   523, 15241,   326,   345,
             460,   470,  1037,   475,  1650,   866,    11,  2342,   340,    11,
             290,  2883,  3511,    13,   632,   318,   635, 20105,    13, 15105,
           43487,   338,   374, 20482,   318,   655,   523,   625,   262,  1353,
             326,   345,   460,   470,  1037,   475,  6487,   503,  7812,   379],
          [  464,   691, 26831,   373,   262,  1402,   636,   416, 13633,  7920,
              13,   632,  3947,   326,   262,  3807,   373,  2111,  1165,  1327,
     

In [None]:
example['input']['input_ids'].shape

torch.Size([8, 100])

In [None]:
sample_out = my_text_classification_network(example['input'])

In [None]:
sample_out

tensor([0.0038, 0.0022, 0.0005, 0.0004, 0.0010, 0.0032, 0.0032, 0.0015],
       device='cuda:0', grad_fn=<SelectBackward0>)

NOTE: we can obviously build the network layer-by-layer and test whether we get the expected output dimensions!

Let's set up the loss function and the optimizer:

In [None]:
loss_fn = torch.nn.BCELoss()
adam_optimizer = torch.optim.Adam(my_text_classification_network.parameters(), lr=0.0001)
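To get a feeling for the numbers that binary cross-entropy produces, here is a sketch of the formula in plain Python. `binary_cross_entropy` is a hypothetical helper. The `eps` clipping is a simplification and not necessarily how `torch.nn.BCELoss` guards against `log(0)`.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)     # keep log() away from 0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predictions)


# Predicting 0.5 everywhere gives ln(2), about 0.693, whatever the labels are
chance_loss = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])

# Confident but wrong predictions are penalised heavily (made-up numbers)
wrong_loss = binary_cross_entropy([0.003, 0.002], [1.0, 1.0])
print(chance_loss, wrong_loss)
```

A loss near 0.693 therefore corresponds to chance-level predictions, which is a useful reference point when reading the training log.
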

Let's train a little bit.

In [None]:
my_text_classification_network = my_text_classification_network.to(device)

epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_texts, my_text_classification_network, loss_fn, adam_optimizer, steps=200)
    test_loop(test_texts, my_text_classification_network, loss_fn, steps=50) # no optimizer use here!
print("Done!")

Epoch 1
-------------------------------
	Finished batches:  25
	Current average loss:  1.091543046136697
	Finished batches:  50
	Current average loss:  0.9147673200587837
	Finished batches:  75
	Current average loss:  0.821073550227526
	Finished batches:  100
	Current average loss:  0.7639714377095
	Finished batches:  125
	Current average loss:  0.7007329838891183
	Finished batches:  150
	Current average loss:  0.6800354116114994
	Finished batches:  175
	Current average loss:  0.6474665130178133
	Finished batches:  200
	Current average loss:  0.623677848571509
Training Results: 
  Avg train loss: 0.620181 

Test Results: 
 Test Accuracy: 69.1%, Avg test loss: 0.543997 

Done!


Clearly not trained enough, but we see that the model did train. (Obviously, this can be done much better, e.g., with dropout.)


### 3. The 'Other Language Model': Masked Language Models and simple Sentence Embeddings

We will now look at Masked Language Models, specifically BERT as an old & famous one.

In [None]:
%%capture

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)


In [None]:
inputs = bert_tokenizer(["This was fun and useful", "I enjoyed the cool event", "Cars are slow and expensive" ],
                        padding=True,   # pads to a common length if the sentences tokenize to different lengths
                        return_tensors="pt").to(device)
outputs = bert_model(**inputs)

In [None]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

What are these? Let's first look at the dimensions:

In [None]:
outputs['last_hidden_state'].shape

torch.Size([3, 7, 768])

In [None]:
outputs['pooler_output'].shape

torch.Size([3, 768])

The BERT model has two outputs by default: the last hidden state and the pooler output. The pooler output is derived from the last hidden state of the [CLS] token by passing it through an additional dense layer with a tanh activation, which was trained with the next-sentence-prediction objective during pre-training. I.e., it is intended as a representation of the overall text. Let's look at that:

In [None]:
pooler_outputs = outputs['pooler_output'].to('cpu').detach().numpy()

cls_outputs = outputs['last_hidden_state'][:, 0, :].to('cpu').detach().numpy()


In [None]:
print('Pooler comp 0-1: ', cos_sim(pooler_outputs[0], pooler_outputs[1]))
print('CLS comp 0-1: ', cos_sim(cls_outputs[0], cls_outputs[1]))

Pooler comp 0-1:  0.9004953
CLS comp 0-1:  0.9113325


Ok. That seems reasonable. But what about the comparison with the much less similar third sentence?

In [None]:
print('Pooler comp 0-2: ', cos_sim(pooler_outputs[0], pooler_outputs[2]))
print('CLS comp 0-2: ', cos_sim(cls_outputs[0], cls_outputs[2]))

Pooler comp 0-2:  0.74846876
CLS comp 0-2:  0.8500113


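Clearly, the Pooler output has much better contrast: its similarity drops from 0.90 to 0.75, while the CLS similarity only drops from 0.91 to 0.85. So, at least for these three sentences and without fine-tuning, the Pooler output appears to be more suitable for text representations than the CLS token output. (However, fine-tuning on various tasks would take care of that too for specific situations.)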
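For reference, the step that turns the [CLS] vector into the pooler output can be sketched as a dense layer followed by `tanh`. The code below is a toy version in plain Python. The weights and the hidden size are made up, and `pooler` is a hypothetical helper rather than the library code.

```python
import math


def pooler(cls_vector, weight, bias):
    """Apply a dense layer followed by tanh to one [CLS] vector."""
    pooled = []
    for row, b in zip(weight, bias):
        pre_activation = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(pre_activation))
    return pooled


# Toy numbers: hidden size 3 instead of 768
toy_cls = [0.5, -1.0, 2.0]
toy_weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]   # identity matrix, so only tanh changes the values
toy_bias = [0.0, 0.0, 0.0]
pooled = pooler(toy_cls, toy_weight, toy_bias)
print(pooled)
```

Because of the `tanh`, every component of the pooled vector lies strictly between -1 and 1, so the two representations live on different scales even though both start from the same [CLS] state.
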