## Exploring tokenization for classification with BERT and RoBERTa

In this section we explore how BERT and RoBERTa tokenize single texts and pairs of texts, and which special tokens they add.

In [13]:
from transformers import AutoTokenizer

#### BERT

In [14]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Let's start with tokenizing a single text.

In [24]:
example_text = ["Here is some text to encode"]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1]]


As you can see, we get three different things for each text:
- `input_ids` - The indices for the corresponding tokens.
- `attention_mask` - Tells the Transformer which positions are padding, so that padding tokens are excluded from the attention computation. Since we only have one text, no padding is needed to bring all texts to the same length, so every token has its mask set to `1`.
- `token_type_ids` - This was seen during class. BERT was designed to receive either a single text or a pair of texts, and it learns a separate embedding for each text in the pair. The `token_type_ids` indicate which of the two texts each token belongs to. Here we have a single text, so it is `0` for all tokens.
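To make the role of `attention_mask` concrete, here is a minimal plain-Python sketch of what padding a batch does. The helper `pad_batch` and the second, shorter sequence are illustrative assumptions and not part of the `transformers` API; the pad ID `0` corresponds to BERT's `[PAD]` token.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to a common length and build the attention mask.

    Hypothetical helper for illustration; the real tokenizer does this internally.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# First sequence: the IDs printed above. Second: a made-up shorter one ([CLS] here [SEP]).
batch = [
    [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102],
    [101, 2182, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])       # [101, 2182, 102, 0, 0, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The zeros in the mask mark exactly the positions that were filled with the pad ID, which is what lets the model ignore them.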

In [25]:
input_ids = encoded_input['input_ids'][0]
token_type_ids = encoded_input['token_type_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

for input_id, token_type_id, mask in zip(input_ids, token_type_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}, token_type_id={token_type_id.item()}")

Token with ID 101, corresponding to '[CLS]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=0
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=0
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=0
Token with ID 3793, corresponding to 'text' - attention_mask=1, token_type_id=0
Token with ID 2000, corresponding to 'to' - attention_mask=1, token_type_id=0
Token with ID 4372, corresponding to 'en' - attention_mask=1, token_type_id=0
Token with ID 16044, corresponding to '##code' - attention_mask=1, token_type_id=0
Token with ID 102, corresponding to '[SEP]' - attention_mask=1, token_type_id=0


After inspecting each token, we can see that the `[CLS]` and `[SEP]` tokens seen in class were added. Let's now check how pairs of texts are tokenized:

In [26]:
example_text = [("Here is some text to encode", "Here is some more text to encode")]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102, 2182, 2003, 2070, 2062, 3793, 2000, 4372, 16044, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


In [27]:
input_ids = encoded_input['input_ids'][0]
token_type_ids = encoded_input['token_type_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

for input_id, token_type_id, mask in zip(input_ids, token_type_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}, token_type_id={token_type_id.item()}")

Token with ID 101, corresponding to '[CLS]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=0
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=0
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=0
Token with ID 3793, corresponding to 'text' - attention_mask=1, token_type_id=0
Token with ID 2000, corresponding to 'to' - attention_mask=1, token_type_id=0
Token with ID 4372, corresponding to 'en' - attention_mask=1, token_type_id=0
Token with ID 16044, corresponding to '##code' - attention_mask=1, token_type_id=0
Token with ID 102, corresponding to '[SEP]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=1
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=1
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=1
Token with ID 2062, corresponding to 'more' -

As you can see, after each text in the pair the `[SEP]` token was added. Further, each token of the first text has `token_type_id=0`, while all tokens of the second text have `token_type_id=1`.
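The layout described above can be reproduced by hand. The following plain-Python sketch rebuilds the pair encoding from the token IDs of each text; `build_bert_pair` is a hypothetical helper written for illustration (not a `transformers` function), and the special-token IDs `101` and `102` are taken from the output above.

```python
def build_bert_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Lay out a text pair as [CLS] A [SEP] B [SEP] with matching segment IDs.

    Hypothetical helper for illustration only.
    """
    first = [cls_id] + ids_a + [sep_id]   # segment 0 includes [CLS] and the first [SEP]
    second = ids_b + [sep_id]             # segment 1 includes the final [SEP]
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return input_ids, token_type_ids


text_a = [2182, 2003, 2070, 3793, 2000, 4372, 16044]        # here is some text to en ##code
text_b = [2182, 2003, 2070, 2062, 3793, 2000, 4372, 16044]  # here is some more text to en ##code

input_ids, token_type_ids = build_bert_pair(text_a, text_b)
print(input_ids)
print(token_type_ids)
```

The result matches the `input_ids` and `token_type_ids` printed by the tokenizer for this pair.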

#### RoBERTa
We do the same with RoBERTa:

In [28]:
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

In [29]:
example_text = [("Here is some text to encode", "Here is some more text to encode")]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[0, 11773, 16, 103, 2788, 7, 46855, 2, 2, 11773, 16, 103, 55, 2788, 7, 46855, 2]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


In [32]:
input_ids = encoded_input['input_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

for input_id, mask in zip(input_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}")

Token with ID 0, corresponding to '<s>' - attention_mask=1
Token with ID 11773, corresponding to 'Here' - attention_mask=1
Token with ID 16, corresponding to ' is' - attention_mask=1
Token with ID 103, corresponding to ' some' - attention_mask=1
Token with ID 2788, corresponding to ' text' - attention_mask=1
Token with ID 7, corresponding to ' to' - attention_mask=1
Token with ID 46855, corresponding to ' encode' - attention_mask=1
Token with ID 2, corresponding to '</s>' - attention_mask=1
Token with ID 2, corresponding to '</s>' - attention_mask=1
Token with ID 11773, corresponding to 'Here' - attention_mask=1
Token with ID 16, corresponding to ' is' - attention_mask=1
Token with ID 103, corresponding to ' some' - attention_mask=1
Token with ID 55, corresponding to ' more' - attention_mask=1
Token with ID 2788, corresponding to ' text' - attention_mask=1
Token with ID 7, corresponding to ' to' - attention_mask=1
Token with ID 46855, corresponding to ' encode' - attention_mask=1
Token

We can see two things:
1. RoBERTa does not use `token_type_ids`: it does not add separate embeddings to distinguish the two texts of a pair. Instead, the two texts are separated by two consecutive `</s>` tokens.
2. Instead of `[CLS]` and `[SEP]`, it adds `<s>` and `</s>`. We can use `<s>` as we would use `[CLS]`, since both always appear as the first token.
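As a quick check of the RoBERTa pair layout, this plain-Python sketch rebuilds the encoded pair from the per-text token IDs. `build_roberta_pair` is a hypothetical helper for illustration (not part of `transformers`); the IDs `0` for `<s>` and `2` for `</s>` are read off the output above.

```python
def build_roberta_pair(ids_a, ids_b, bos_id=0, eos_id=2):
    """Lay out a text pair as <s> A </s></s> B </s> (no segment IDs).

    Hypothetical helper for illustration only.
    """
    return [bos_id] + ids_a + [eos_id, eos_id] + ids_b + [eos_id]


text_a = [11773, 16, 103, 2788, 7, 46855]      # Here is some text to encode
text_b = [11773, 16, 103, 55, 2788, 7, 46855]  # Here is some more text to encode

input_ids = build_roberta_pair(text_a, text_b)
print(input_ids)
```

The only signal that separates the two texts is the doubled `</s>` in the middle of the sequence.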

## Freezing parameters of a pre-trained model

In [1]:
from transformers import AutoModel



We now want to be able to freeze any layers of the pre-trained transformer during fine-tuning. To do this, we just need to set `param.requires_grad = False` for the correct parameters. Unfortunately, each model may be structured in a different way. Thus, we need to inspect the model we will be using to learn how to access the correct modules for freezing:

In [2]:
bert_model = AutoModel.from_pretrained("bert-base-uncased")
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

We see the full module list of the loaded model:
1. `embeddings` - This module holds the token embeddings, the positional embeddings and the different embeddings for the first and second texts.
2. `encoder` - The transformer itself, with a list of 12 identical transformer layers.
3. `pooler` - An additional layer provided by HuggingFace that pools the sequence of embeddings into a single text embedding. **We will ignore this, as we want to have full control over how we pool the sequence and get the final logits or probabilities**. If you use a model that has already been trained for the task you need, you will want to reuse its `pooler` and final layers, as they are already trained.

Now we can access each module separately to freeze a specific number of layers:

In [3]:
num_frozen_layers = 5
# Freeze the first `num_frozen_layers` layers of the model
for layer in bert_model.encoder.layer[:num_frozen_layers]:
    for param in layer.parameters():
        param.requires_grad = False
# Freeze initial embeddings
for param in bert_model.embeddings.parameters():
    param.requires_grad = False

Similarly, we can study the modules of the RoBERTa model, and see how we need to access them in order to freeze the desired parameters:

In [4]:
roberta_model = AutoModel.from_pretrained("FacebookAI/roberta-base")
roberta_model

Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dr

It has the same structure as BERT! The same code will freeze the parameters correctly. Let's create a function for freezing layers of BERT-based models:

In [5]:
def freeze_layers_for_bert_based_models(model, num_frozen_layers):
    # Freeze the first `num_frozen_layers` layers of the model
    for layer in model.encoder.layer[:num_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False
    # Freeze initial embeddings
    for param in model.embeddings.parameters():
        param.requires_grad = False

## Training for a simple text classification task
Let's now try to fine-tune a model for a simple text classification task: classifying movie reviews as either positive or negative. We will use a pre-trained model with one unfrozen layer only.

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pytorch_lightning as pl


from transformers import AutoModel, AutoTokenizer
from torchmetrics import Accuracy
from datasets import load_dataset

We begin with some simple data loading code, which uses the tokenizer to transform the texts into batched tensors of `input_ids`, `attention_mask` and possibly `token_type_ids`.

In [7]:
class CollateFn():
    def __init__(self, tokenizer_name, max_length):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __call__(self, batch):
        texts = [example['text'] for example in batch]
        labels = [example['label'] for example in batch]

        encoded_input = self.tokenizer(
            texts,
            max_length=self.max_length,
            padding=True,
            truncation=True,
            return_tensors='pt',
        )

        labels = torch.tensor(labels)

        return encoded_input, labels

In [8]:
class IMDBDataModule(pl.LightningDataModule):
    def __init__(self, tokenizer_name, max_length, batch_size):
        super().__init__()
        self.tokenizer_name = tokenizer_name
        self.max_length = max_length
        self.batch_size = batch_size

    def setup(self, stage=None):
        dataset = load_dataset("imdb")

        trainval_dataset = dataset['train'].train_test_split(test_size=0.1)
        test_dataset = dataset['test']

        self.train, self.val = trainval_dataset['train'], trainval_dataset['test']
        self.test = test_dataset

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train,
            batch_size=self.batch_size,
            shuffle=True,
            collate_fn=CollateFn(self.tokenizer_name, self.max_length),
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.val,
            batch_size=self.batch_size,
            collate_fn=CollateFn(self.tokenizer_name, self.max_length),
        )
    
    def test_dataloader(self):
        return torch.utils.data.DataLoader(
            self.test,
            batch_size=self.batch_size,
            collate_fn=CollateFn(self.tokenizer_name, self.max_length),
        )

We now get to the PyTorch Lightning module, where we combine a pre-trained transformer with a classification head to produce the final prediction. We also pass as parameters:
- `pooling` - The mechanism used to reduce token embeddings to a single text embedding to make the prediction. We can pass `cls` to take the first token, or `mean` to take the average over all tokens. **But be careful!** If you take the mean, make sure to exclude the padding tokens by using the `attention_mask`.
- `frozen_layers` - The number of layers of the pre-trained model we want to freeze during training.
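To see why the padding warning matters for `mean` pooling, here is a small worked example in plain Python. The helper `mean_pool` and the toy vectors are made up for illustration; the real module does the same computation on tensors.

```python
def mean_pool(token_embeddings, attention_mask=None):
    """Average token vectors; with a mask, padded positions (mask 0) are skipped.

    Hypothetical helper for illustration only.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_embeddings)
    n_real = sum(attention_mask)
    dim = len(token_embeddings[0])
    return [
        sum(m * vec[d] for vec, m in zip(token_embeddings, attention_mask)) / n_real
        for d in range(dim)
    ]


# Two real tokens followed by two padding positions. The transformer still
# outputs a vector at padded positions; the values here are invented.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
attention_mask = [1, 1, 0, 0]

naive = mean_pool(token_embeddings)                   # averages over padding too
masked = mean_pool(token_embeddings, attention_mask)  # averages over real tokens only
print(naive)   # [6.0, 6.5]
print(masked)  # [2.0, 3.0]
```

With the naive mean, the embedding of a text would depend on how much padding its batch happens to need, which is not something the classifier should see.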

In [9]:
class TextBinaryClassifier(pl.LightningModule):
    def __init__(self, model_name, optimizer_params, pooling="mean", frozen_layers=0):
        """
        model_name: The name of the model to use
        optimizer_params: Parameters to pass to the optimizer
        pooling: The pooling strategy to use. Either 'cls' or 'mean'
        frozen_layers: The number of layers to freeze in the pre-trained model
        """
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.model.config.hidden_size, 1)  # Binary classification
        freeze_layers_for_bert_based_models(self.model, frozen_layers)
        
        assert pooling in ["cls", "mean"], "Pooling must be either 'cls' or 'mean'"
        self.pooling = pooling

        self.accuracy = Accuracy(task="binary")
        self.optimizer_params = optimizer_params

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # input_ids: (batch_size, seq_length)
        # attention_mask: (batch_size, seq_length)

        outputs = self.model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        last_hidden_state = outputs.last_hidden_state  # (batch_size, seq_length, hidden_size)

        if self.pooling == "cls":
            # NOTE Option 1: Use the CLS token
            pool_output = last_hidden_state[:, 0, :]  # (batch_size, hidden_size)
        else:
            # NOTE Option 2: Use the mean of all tokens
            mean_coeffs = attention_mask.float() / attention_mask.float().sum(dim=1, keepdim=True)  # (batch_size, seq_length)
            pool_output = torch.einsum("bld,bl->bd", last_hidden_state, mean_coeffs)  # (batch_size, hidden_size)

        logits = self.classifier(pool_output)  # (batch_size, 1)
        return logits

    def training_step(self, batch, batch_idx):
        loss, accuracy = self._step(batch)
        self.log('train_loss', loss, prog_bar=True, on_step=True, on_epoch=True)
        self.log('train_accuracy', accuracy, prog_bar=True, on_step=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, accuracy = self._step(batch)
        self.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log('val_accuracy', accuracy, prog_bar=True, on_step=False, on_epoch=True)
        return loss

    def test_step(self, batch, batch_idx):
        loss, accuracy = self._step(batch)
        self.log('test_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log('test_accuracy', accuracy, prog_bar=True, on_step=False, on_epoch=True)
        return loss

    def _step(self, batch):
        encoded_input, labels = batch
        labels = labels.float().view(-1, 1)

        logits = self(**encoded_input)  # (batch_size, 1)
        loss = F.binary_cross_entropy_with_logits(logits, labels)
        accuracy = self.accuracy(logits, labels)
        
        return loss, accuracy

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), **self.optimizer_params)
        return optimizer
    
    def configure_callbacks(self):
        return super().configure_callbacks() + [
            pl.callbacks.ModelCheckpoint(monitor='val_loss', mode='min'),
        ]
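The `_step` method above passes raw logits to `F.binary_cross_entropy_with_logits` instead of applying a sigmoid first. As a rough illustration of what that loss computes for a single example, here is a plain-Python sketch using a commonly used numerically stable formulation. The function names are illustrative, and this is not claimed to be PyTorch's exact implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logit, label):
    """Textbook form: -[y log p + (1 - y) log(1 - p)] with p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_with_logits(logit, label):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Both forms agree for moderate logits.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# For a very confident wrong prediction the naive form would evaluate log(0),
# while the stable form simply returns a large finite loss.
print(bce_with_logits(1000.0, 0.0))  # 1000.0
```

A logit of `0` corresponds to a probability of `0.5`, so predicting the positive class whenever the logit is positive is the same as thresholding the probability at `0.5`.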

In [10]:
MAX_LENGTH = 512
BATCH_SIZE = 128

# MODEL = "bert-base-uncased"
MODEL = "FacebookAI/roberta-base"
POOLING = "mean"  # ["cls", "mean"]
NUM_FROZEN_LAYERS = 11  # Leave one layer unfrozen
OPTIMIZER_PARAMS = {
    'lr': 2e-5,
    'weight_decay': 0.01,
}

In [11]:
data_module = IMDBDataModule(MODEL, MAX_LENGTH, BATCH_SIZE)
model = TextBinaryClassifier(MODEL, OPTIMIZER_PARAMS, POOLING, NUM_FROZEN_LAYERS)

data_module.setup()
trainer = pl.Trainer(max_epochs=3, accelerator="gpu", devices=[0], precision="16-mixed")

Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using 16bit Automatic Mixed Precision (AMP)
/home/pablo/.micromamba/envs/mdl-dl_nlp/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/amp.py:54: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/pablo/.micromamba/envs/mdl-dl_nlp/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conf

In [12]:
trainer.fit(model, data_module)

The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: ModelCheckpoint
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name       | Type           | Params
----------------------------------------------
0 | model      | RobertaModel   | 124 M 
1 | classifier | Linear         | 769   
2 | accuracy   | BinaryAccuracy | 0     
----------------------------------------------
7.7 M     Trainable params
116 M     Non-trainable params
124 M     Total params
498.586   Total estimated model params size (MB)


Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/pablo/.micromamba/envs/mdl-dl_nlp/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.


                                                                           

/home/pablo/.micromamba/envs/mdl-dl_nlp/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.


Epoch 2: 100%|██████████| 176/176 [01:18<00:00,  2.24it/s, v_num=17, train_loss_step=0.270, train_accuracy_step=0.910, val_loss=0.173, val_accuracy=0.936, train_loss_epoch=0.188, train_accuracy_epoch=0.929] 

`Trainer.fit` stopped: `max_epochs=3` reached.




In [13]:
trainer.test(ckpt_path="best", datamodule=data_module)

The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: ModelCheckpoint
Restoring states from the checkpoint path at /home/pablo/classes/MP_DL-DL_NLP/Lecture 2 - Text Classification/lightning_logs/version_17/checkpoints/epoch=1-step=352.ckpt
/home/pablo/.micromamba/envs/mdl-dl_nlp/lib/python3.11/site-packages/lightning_fabric/utilities/cloud_io.py:55: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are expli

Testing DataLoader 0: 100%|██████████| 196/196 [01:04<00:00,  3.03it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.9341999888420105
        test_loss           0.17330922186374664
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.17330922186374664, 'test_accuracy': 0.9341999888420105}]

Not bad! $93.42$% accuracy.
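As a sanity check on the `7.7 M Trainable params` line in the training summary above, the count can be reproduced by hand. This sketch assumes the standard base-model configuration: the hidden size of 768 is visible in the printed modules, but the feed-forward size of 3072 is not shown in the truncated printout, so treat it as an assumption. The helper names are illustrative.

```python
HIDDEN = 768          # hidden size, as in the printed Linear(in_features=768, ...)
INTERMEDIATE = 3072   # feed-forward size (assumed: standard base-model configuration)


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


encoder_layer = (
    4 * linear_params(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm_params(HIDDEN)              # LayerNorm after attention
    + linear_params(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear_params(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm_params(HIDDEN)              # LayerNorm after feed-forward
)
pooler = linear_params(HIDDEN, HIDDEN)       # not used in our forward pass, but not frozen either
classifier = linear_params(HIDDEN, 1)        # the 769 parameters of our classification head

trainable = encoder_layer + pooler + classifier
print(f"{trainable:,}")  # 7,679,233
```

So the trainable parameters are the single unfrozen encoder layer, our classification head, and the `pooler`, which our freezing function leaves untouched even though the forward pass never uses it.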

## Zero-shot text classification with LLMs

We now try zero-shot classification with chat-based LLMs.

In [23]:
import torch

from transformers import pipeline
from datasets import load_dataset
from tqdm import tqdm

Take a look at the following prompts; they are more or less self-explanatory:

In [None]:
SYSTEM_PROMPT = "You are a sentiment classifier. You classify movie reviews as 'positive' or 'negative'. You only respond with the label."
PROMPT_TEMPLATE = """
Classify the sentiment of the following text as 'positive' or 'negative', answering only with the label:
{text}
"""

```python
SYSTEM_PROMPT = "You are a sentiment classifier. You classify movie reviews as 'positive' or 'negative'. You only respond with the label."
```
- This is an **instruction** given to the AI model to define its behavior.  
- It tells the model that it is a **sentiment classifier** (not a chatbot or general text generator).  
- It specifies that the model should only respond with **"positive"** or **"negative"** (not explanations, extra text, or other responses).  

👉 **Purpose:** Ensures the model **stays focused** and gives a **clear, structured response**.

```python
PROMPT_TEMPLATE = """
Classify the sentiment of the following text as 'positive' or 'negative', answering only with the label:
{text}
"""
```
- This is a **template** that will be filled with actual text (a movie review) when making a prediction.
- The `{text}` part is a **placeholder** that will be replaced with the real review.
- The instruction explicitly tells the model to **only output "positive" or "negative"**.

👉 **Purpose:** This ensures that every time we classify a review, we follow a **consistent format** and keep responses structured.

**NOTE**: Even though we make it clear that we just want the final answer, the model is probabilistic and can output other things. We need to make sure to parse the answer robustly.
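One possible way to parse the answer more defensively is sketched below. The helper `parse_label` is hypothetical, and returning `None` for unclear answers is a design choice made for this illustration; how to score such cases (count as wrong, retry, or report separately) is up to you.

```python
import re


def parse_label(response):
    """Map a free-form response to 1 (positive), 0 (negative), or None (unclear).

    Hypothetical helper for illustration only.
    """
    words = set(re.findall(r"[a-z]+", response.lower()))
    has_pos = "positive" in words
    has_neg = "negative" in words
    if has_pos == has_neg:
        # Both labels or neither label appear: do not guess.
        return None
    return 1 if has_pos else 0


print(parse_label("Positive"))                               # 1
print(parse_label(" negative."))                             # 0
print(parse_label("The review is positive, not negative."))  # None
print(parse_label("I cannot classify this text."))           # None
```

Tracking how often the parser returns `None` is also a cheap way to measure how well the model follows the output format.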

In [33]:
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id
pipe.model.generation_config.pad_token_id = pipe.tokenizer.pad_token_id
pipe.tokenizer.padding_side = "left"

Device set to use cuda:0


**Creating the Pipeline**
```python
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
```
- **`pipeline("text-generation", ...)`**  
  - This creates a **text generation pipeline** that makes it easy to use the model for generating text.  
  - `"text-generation"` tells Hugging Face that we want to **generate text** from a given input.  

- **`model=MODEL_ID`**  
  - This tells the pipeline which model to use (the one we defined earlier).  

- **`torch_dtype=torch.bfloat16`**  
  - This sets the data type for the model's computations.  
  - `bfloat16` is a **reduced-precision** floating point type that makes the model run **faster and use less memory**, especially on GPUs.  

- **`device_map="cuda:0"`**  
  - This tells the pipeline to run the model on a **GPU** (specifically, the first GPU, which is `"cuda:0"`).  

**Setting Padding and Tokenizer Behavior**
```python
pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id
```
- The **"pad token"** is used when input sequences need to be of the same length (for batch processing).  
- Some models, like Llama, **don’t have a predefined pad token**, so this line **sets the pad token to be the same as the EOS (end of sequence) token**.  

```python
pipe.model.generation_config.pad_token_id = pipe.tokenizer.pad_token_id
```
- This ensures that the **generation settings** also recognize the pad token correctly.  

```python
pipe.tokenizer.padding_side = "left"
```
- This tells the tokenizer to **add padding on the left side** of sequences instead of the right.  
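- Decoder-only models continue generating from the last position of each sequence, so the padding has to go on the left. The pipeline produces a warning otherwise.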
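The effect of the padding side can be seen on a toy batch. In this plain-Python sketch, `pad` is a hypothetical helper and the token IDs are invented; `-1` stands in for the pad ID purely so it is easy to spot (the notebook reuses the EOS token ID instead).

```python
def pad(sequences, pad_id, side="left"):
    """Pad token ID lists to a common length on the given side.

    Hypothetical helper for illustration only.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (max_len - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


PAD = -1  # placeholder pad ID, chosen only for readability
prompts = [[11, 12, 13, 14], [21, 22]]

left = pad(prompts, PAD, side="left")
right = pad(prompts, PAD, side="right")

# A decoder-only model generates the next token from the last position of each row.
print([row[-1] for row in left])   # [14, 22]
print([row[-1] for row in right])  # [14, -1]
```

With right padding, the shorter prompt would be continued from a pad token rather than from its real last token.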

Let's load the dataset and manually iterate in batches running our model:

In [34]:
test_dataset = load_dataset("imdb", split="test")

In [35]:
# Batch processing parameters
batch_size = 32  # Adjust based on available VRAM
total, correct = 0, 0

# Process in batches
for i in tqdm(range(0, len(test_dataset), batch_size)):
    # Extract batch
    batch = test_dataset[i : i + batch_size]
    labels = batch["label"]
    user_prompts = [PROMPT_TEMPLATE.format(text=text) for text in batch["text"]]

    # Create messages in batch format
    messages_batch = [[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt}
    ] for prompt in user_prompts]

    # Run batch inference
    outputs = pipe(messages_batch, max_new_tokens=256, do_sample=False)

    # Extract predictions
    responses = [out[0]["generated_text"][-1]["content"] for out in outputs]

    # Get predictions from responses. In this case, we assume a positive sentiment if the word "positive" is present in the response
    predicted_labels = ["positive" in response.lower() for response in responses]

    # Evaluate accuracy
    total += len(labels)
    correct += sum(int(pred == label) for pred, label in zip(predicted_labels, labels))

# Final accuracy
accuracy = correct / total
print(f"Accuracy: {accuracy:.4f}")


100%|██████████| 782/782 [18:43<00:00,  1.44s/it]

Accuracy: 0.7882





Also not bad, I would say, especially without any training. We should note two things, however:
- This large model was very expensive to run, even in inference mode: almost 19 minutes for the test set, versus about one minute for the fine-tuned RoBERTa model above.
- I am actually not sure whether the model has seen this data during its extensive pre-training, so it's difficult to know if the measurement of performance is fair.

## Zero-shot text classification by sentence similarity with `sentence-transformers`

Finally, we get to zero-shot classification by sentence similarity. The basic ideas are:
- We have a model (we will use the `sentence-transformers` library) that is trained to embed full texts into a semantic vector space.
- In this space, texts are close together if they are semantically similar, that is, close in meaning.
- Thus, if a textual description of an arbitrary classification label is close to the text, it likely means that the text belongs to that category.
- We can compare the text against the descriptions of all the labels, and pick the label whose description is closest to it in the space.
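The idea can be sketched with made-up 2-D vectors standing in for real sentence embeddings. The helpers `cosine_similarity` and `classify` are illustrative assumptions written in plain Python, not part of `sentence-transformers`.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(text_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the text, plus all scores."""
    scores = {
        label: cosine_similarity(text_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Invented embeddings: in practice these come from the embedding model.
label_embeddings = {"negative": [1.0, 0.2], "positive": [0.1, 1.0]}
text_embedding = [0.3, 0.9]

predicted, scores = classify(text_embedding, label_embeddings)
print(predicted)  # positive
```

No training is involved: changing the set of labels only requires embedding new label descriptions.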

In [97]:
import torch
import torch.nn.functional as F

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from tqdm import tqdm

The `SentenceTransformer` class handles most of the heavy lifting for us:
- Tokenization.
- Pooling all token embeddings to a single text embedding.

We load a `SentenceTransformer` as follows.

In [98]:
model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device='cuda:0',
    model_kwargs={ "torch_dtype": torch.float16, },
)

We now compute an embedding for each label. We consider three options. Each cell overwrites `label_embeddings`, so the evaluation uses whichever option was run last:


In [99]:
# Option 1: simple labels
labels = ["negative", "positive"]
label_embeddings = model.encode(labels, convert_to_tensor=True)

In [100]:
# Option 2: more complex labels
labels = [
    "the movie is negative dull boring terrible bad",
    "the movie is positive good awesome amazing interesting entertaining",
]
label_embeddings = model.encode(labels, convert_to_tensor=True)

In [102]:
# Option 3: use the training dataset to compute accurate embeddings
train_dataset = load_dataset("imdb", split="train")
batch_size = 32  # Adjust based on available VRAM

pos_embedding_list = []
neg_embedding_list = []

for i in tqdm(range(0, len(train_dataset), batch_size)):
    batch = train_dataset[i : i + batch_size]
    texts = batch["text"]
    labels = batch["label"]

    # Encode text
    text_embeddings = model.encode(texts, convert_to_tensor=True)

    # Separate positive and negative examples
    for label, embedding in zip(labels, text_embeddings):
        if label == 1:
            pos_embedding_list.append(embedding)
        else:
            neg_embedding_list.append(embedding)

# The embeddings are averaged to get a single embedding per class
# NOTE: after averaging, the embeddings are normalized to have unit norm,
#       which is important because the model always outputs normalized embeddings,
#       and we must keep the same scale for a fair comparison between the labels

mean_pos_embedding = torch.stack(pos_embedding_list).mean(dim=0)
pos_embedding = F.normalize(mean_pos_embedding, p=2, dim=-1)

mean_neg_embedding = torch.stack(neg_embedding_list).mean(dim=0)
neg_embedding = F.normalize(mean_neg_embedding, p=2, dim=-1)

# Stack the embeddings
label_embeddings = torch.stack([neg_embedding, pos_embedding])

  0%|          | 0/782 [00:00<?, ?it/s]

100%|██████████| 782/782 [00:25<00:00, 30.38it/s]


The following function classifies the given text embeddings based on the given label embeddings. So, how do we measure similarity? The model is trained with cosine similarity, which is the dot product between normalized vectors. That's why we normalized earlier! Now we only need to compute the dot products:

In [103]:
def predict(text_embeddings, label_embeddings):
    # text_embeddings: (num_samples, embedding_dim)
    # label_embeddings: (num_labels, embedding_dim)
    scores = text_embeddings @ label_embeddings.T  # (num_samples, num_labels)
    return scores.argmax(dim=1)  # (num_samples,)
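The NOTE in the Option 3 cell says the averaged class embeddings must be re-normalized. The following plain-Python sketch, with invented 2-D unit vectors and hypothetical helper names, shows what can go wrong otherwise: the mean of unit vectors is shorter than 1, and it is shorter the more spread out the class is, so a raw dot product favors the tightly clustered class.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_one(text_embedding, label_embeddings):
    """Index of the label embedding with the largest dot product."""
    scores = [dot(text_embedding, emb) for emb in label_embeddings]
    return scores.index(max(scores))


# Invented unit vectors: negative examples tightly clustered, positive ones spread out.
neg_examples = [normalize(v) for v in ([1.0, 0.1], [1.0, -0.1])]
pos_examples = [normalize(v) for v in ([1.0, 1.0], [-1.0, 1.0])]

raw_labels = [mean(neg_examples), mean(pos_examples)]  # lengths of about 0.995 and 0.707
unit_labels = [normalize(v) for v in raw_labels]       # both have length 1

text = normalize([0.6, 0.8])  # points closer to the positive direction

print(predict_one(text, raw_labels))   # 0 (negative): the longer mean vector wins
print(predict_one(text, unit_labels))  # 1 (positive): only the direction decides
```

After normalization the dot product is again a cosine similarity, so both labels are compared on the same scale.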

Let's now evaluate our model!

In [104]:
test_dataset = load_dataset("imdb", split="test")

In [105]:
# Batch processing parameters
batch_size = 32  # Adjust based on available VRAM
total, correct = 0, 0

# Process in batches
for i in tqdm(range(0, len(test_dataset), batch_size)):
    batch = test_dataset[i : i + batch_size]
    texts = batch["text"]
    labels = batch["label"]

    # Encode text
    text_embeddings = model.encode(texts, convert_to_tensor=True)

    # Predict labels
    predicted_labels = predict(text_embeddings, label_embeddings)

    # Evaluate accuracy
    total += len(labels)
    correct += sum(int(pred == label) for pred, label in zip(predicted_labels, labels))


# Final accuracy
accuracy = correct / total
print(f"Accuracy: {accuracy:.4f}")


100%|██████████| 782/782 [00:24<00:00, 31.42it/s]

Accuracy: 0.7587



