## Topics Covered

* Introduction to HuggingFace🤗 API
* Fine-tuning pre-trained Transformer models with LoRA
* Text Classification and POS Tagging with Transformers



## Import Libraries

In [1]:
!pip install -U transformers datasets peft evaluate seqeval


In [2]:
import torch
import transformers
import evaluate
import datasets

## Check whether ``PyTorch`` can use an available GPU

In [3]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cuda


## Zero-shot inference with pre-trained models using `pipeline`

In [None]:
mask_pipe = transformers.pipeline('fill-mask', model='xlm-roberta-base')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [None]:
mask_pipe("<mask> is a popular programming language.")

[{'score': 0.40943586826324463,
  'token': 17925,
  'token_str': 'JavaScript',
  'sequence': 'JavaScript is a popular programming language.'},
 {'score': 0.2303527295589447,
  'token': 145581,
  'token_str': 'Python',
  'sequence': 'Python is a popular programming language.'},
 {'score': 0.10782749950885773,
  'token': 47302,
  'token_str': 'PHP',
  'sequence': 'PHP is a popular programming language.'},
 {'score': 0.08062639832496643,
  'token': 41925,
  'token_str': 'Java',
  'sequence': 'Java is a popular programming language.'},
 {'score': 0.06879209727048874,
  'token': 74181,
  'token_str': 'Javascript',
  'sequence': 'Javascript is a popular programming language.'}]
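The `score` values are probabilities: the pipeline applies a softmax to the model's logits at the `<mask>` position and returns the most probable tokens (five in the output above). A minimal sketch of that step, using a made-up four-word vocabulary and invented logits (both hypothetical, not taken from XLM-R):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(vocab, logits, k=2):
    """Return the k most probable tokens with their probabilities."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits for the masked position.
toy_vocab = ["JavaScript", "Python", "banana", "Paris"]
toy_logits = [3.0, 2.4, -1.0, -2.0]

toy_top = top_k(toy_vocab, toy_logits, k=2)
for token, prob in toy_top:
    print(f"{token}: {prob:.3f}")
```

Because the probabilities are normalised over the whole vocabulary, the five scores returned by the real pipeline need not sum to 1.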

In [None]:
mask_pipe(["Test this <mask> output", "<mask> is the capital of France."])

[[{'score': 0.045715004205703735,
   'token': 19097,
   'token_str': 'HTML',
   'sequence': 'Test this HTML output'},
  {'score': 0.03751788288354874,
   'token': 12,
   'token_str': ':',
   'sequence': 'Test this: output'},
  {'score': 0.03457654267549515,
   'token': 11435,
   'token_str': 'file',
   'sequence': 'Test this file output'},
  {'score': 0.027089789509773254,
   'token': 17925,
   'token_str': 'JavaScript',
   'sequence': 'Test this JavaScript output'},
  {'score': 0.027035590261220932,
   'token': 9191,
   'token_str': 'page',
   'sequence': 'Test this page output'}],
 [{'score': 0.6514368057250977,
   'token': 7270,
   'token_str': 'Paris',
   'sequence': 'Paris is the capital of France.'},
  {'score': 0.06940539181232452,
   'token': 172567,
   'token_str': 'Strasbourg',
   'sequence': 'Strasbourg is the capital of France.'},
  {'score': 0.05400371551513672,
   'token': 73398,
   'token_str': 'Nice',
   'sequence': 'Nice is the capital of France.'},
  {'score': 0.04890

In [None]:
mask_pipe.model

XLMRobertaForMaskedLM(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True

## Fine-tuning pre-trained models for text classification

We will explore the HuggingFace🤗 API for data loading and model training. We will use the IMDB dataset, which is loaded from the Hugging Face Hub with the `datasets` library.

In [4]:
dataset = datasets.load_dataset("stanfordnlp/imdb")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
dataset['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

Now that we have seen the structure of the dataset, note that you can create your own in the same format by following [this guide](https://huggingface.co/docs/datasets/create_dataset) and this [Stack Overflow thread](https://stackoverflow.com/questions/67852880/how-can-i-handle-this-datasets-to-create-a-datasetdict), or you can use custom `PyTorch` datasets, as we have seen in previous lectures.

In [6]:
# Set aside the test split of the dataset to perform a split on it
test_dataset = dataset['test']

# Perform stratified split using the label
stratified_split = test_dataset.train_test_split(
    test_size=0.5,
    stratify_by_column='label'
)

# Extract the new splits
validation_dataset = stratified_split['train']
test_dataset = stratified_split['test']

# Redefine the dataset with the stratified split
dataset = datasets.DatasetDict({
    'train': dataset['train'],
    'validation': validation_dataset,
    'test': test_dataset,
})
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
})
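Stratifying on `label` keeps the class proportions of the original test split in both halves. The idea can be sketched in plain Python. The helper `stratified_halves` and the toy labels below are hypothetical; in the notebook, `train_test_split` does the real work.

```python
import random
from collections import Counter, defaultdict

def stratified_halves(labels, seed=0):
    """Split indices into two halves that preserve the label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    first, second = [], []
    for indices in by_label.values():
        rng.shuffle(indices)        # shuffle within each class
        cut = len(indices) // 2     # then cut each class in half
        first.extend(indices[:cut])
        second.extend(indices[cut:])
    return first, second

toy_labels = [0] * 6 + [1] * 4      # 60% negative, 40% positive
val_idx, test_idx = stratified_halves(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(val_counts, test_counts)
```

Both halves keep the 60/40 ratio of the toy labels, which is the property the validation and test sets need so that their metrics are comparable.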

We will use three different ways of fine-tuning the model:

1. Train the entire model, including the classification head, for the specific task
2. Freeze the base model and train only the classification head
3. Use LoRA to perform parameter-efficient fine-tuning (PEFT)

We will mostly rely on the `Trainer` API to see how it works. You can always write a custom training loop in native `PyTorch` instead if you need more complex functionality, such as a custom loss function.

See the [Hugging Face fine-tuning guide](https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras) for more details.

### 1. Train the entire model including the classification head for the specific task

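We will use the `AutoModelForSequenceClassification` class. For RoBERTa-style models it takes the final hidden state of the first token (`<s>`, the counterpart of BERT's `[CLS]`) and passes it through a small classification head (the `classifier.dense` and `classifier.out_proj` layers named in the loading warning). For anything more complex, such as global max pooling over the last hidden states, we would need to define a custom model.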

In [7]:
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

tokenizer = transformers.AutoTokenizer.from_pretrained("distilroberta-base")
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilroberta-base",
                                                                        num_labels=2,
                                                                        id2label=id2label,
                                                                        label2id=label2id)

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())

# Count trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total parameters: 82119938
Trainable parameters: 82119938


In [None]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
           

In [8]:
from functools import partial
def prepare_dataset(examples, tokenizer):
    return tokenizer(examples['text'], truncation=True)

tokenized_imdb = dataset.map(partial(prepare_dataset, tokenizer=tokenizer),
                             batched=True)
tokenized_imdb['train'][0]


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [9]:
# Load the metrics once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(p):
    """Compute accuracy and macro-F1 from the (logits, labels) pair passed by the Trainer."""
    predictions, labels = p
    # Ensure predictions are a PyTorch tensor
    if not isinstance(predictions, torch.Tensor):
        predictions = torch.tensor(predictions)
    # Convert logits to predicted class indices
    predictions = torch.argmax(predictions, dim=1).numpy()
    # Compute metrics
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"],
    }
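For intuition, accuracy and macro-averaged F1 can also be computed by hand. The sketch below is not the `evaluate` implementation; the helper names and the toy predictions are hypothetical.

```python
def accuracy(preds, refs):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def macro_f1(preds, refs):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(refs) | set(preds))
    f1_scores = []
    for c in classes:
        tp = sum(p == c and r == c for p, r in zip(preds, refs))
        fp = sum(p == c and r != c for p, r in zip(preds, refs))
        fn = sum(p != c and r == c for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

toy_refs = [0, 0, 1, 1, 1, 0]
toy_preds = [0, 1, 1, 1, 0, 0]
toy_accuracy = accuracy(toy_preds, toy_refs)
toy_macro_f1 = macro_f1(toy_preds, toy_refs)
print(round(toy_accuracy, 4), round(toy_macro_f1, 4))
```

Macro averaging gives every class the same weight regardless of its frequency. On a balanced dataset such as IMDB it stays close to accuracy, which matches the numbers reported in this notebook.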

### Define training arguments and fine-tune model
[TrainingArguments documentation](https://huggingface.co/docs/transformers/v4.41.2/en/main_classes/trainer#transformers.TrainingArguments)

In [10]:
# Dynamically pads the sequences within each batch to avoid any shape misalignments
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

training_args = transformers.TrainingArguments(
    output_dir='./txt_cls_example/',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,  # Number of batches whose gradients are accumulated before each optimizer update
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

trainer = transformers.Trainer(
    model,
    training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
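Dynamic padding means each batch is padded only to the length of its own longest sequence rather than to a global maximum. A rough sketch of the idea follows. The helper `pad_batch` and the token ids are hypothetical, and the pad id of 1 is assumed from the `padding_idx=1` shown in the model printout.

```python
def pad_batch(batch, pad_id=1):
    """Pad every sequence in the batch to the longest one in that batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[0, 713, 2], [0, 713, 16, 10, 2]]  # made-up token ids
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch instead of per dataset avoids spending compute on padding tokens when most reviews are much shorter than the longest one.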

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.214,0.164406,0.9384,0.938389
2,0.1545,0.207895,0.9424,0.942392



TrainOutput(global_step=3126, training_loss=0.2027366849297678, metrics={'train_runtime': 2653.7502, 'train_samples_per_second': 18.841, 'train_steps_per_second': 1.178, 'total_flos': 6535291531512384.0, 'train_loss': 0.2027366849297678, 'epoch': 2.0})
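The `global_step` value follows from the dataset size, batch size, and number of epochs. The sketch below assumes a single device and no gradient accumulation, as configured above; the helper name `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer updates for a single-device run without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return batches_per_epoch * epochs

steps = optimizer_steps(num_examples=25_000, batch_size=16, epochs=2)
print(steps)  # 3126
```

The same arithmetic explains the step counts of the later runs, which use a batch size of 64: 391 batches per epoch.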

In [None]:
predictions = trainer.predict(test_dataset=tokenized_imdb["test"])

# Unpack the results
logits = predictions.predictions
labels = predictions.label_ids
metrics = predictions.metrics

metrics

{'test_loss': 0.173473060131073,
 'test_accuracy': 0.936,
 'test_f1': 0.935995738237828,
 'test_runtime': 172.8697,
 'test_samples_per_second': 72.309,
 'test_steps_per_second': 4.524}

In [None]:
sample = ['I am furious about this worthless movie']

tokenized_sample = tokenizer(sample,
                             truncation=True,
                             padding=True,
                             return_tensors='pt'
                             ).to(model.device)  # Move the inputs to the model's device

model.eval()
with torch.no_grad():
  predictions = model(**tokenized_sample)
print("Logits: ", predictions.logits)
print("shape: ", predictions.logits.shape)

Logits:  tensor([[ 2.6562, -2.1929]], device='cuda:0')
shape:  torch.Size([1, 2])
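The logits are unnormalised scores, one per class. To turn them into a label and a confidence, apply a softmax and take the largest entry. A minimal sketch using the two values printed above (the variable names are illustrative):

```python
import math

toy_id2label = {0: "Negative", 1: "Positive"}
sample_logits = [2.6562, -2.1929]  # values printed above

# Softmax: exponentiate (shifted by the max for stability) and normalise.
shift = max(sample_logits)
exps = [math.exp(x - shift) for x in sample_logits]
sample_probs = [e / sum(exps) for e in exps]

predicted_id = max(range(len(sample_probs)), key=sample_probs.__getitem__)
print(toy_id2label[predicted_id], round(sample_probs[predicted_id], 4))
```

With a gap of almost five between the two logits, the model assigns more than 99% probability to the `Negative` class.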


In [11]:
# Free GPU memory; the trainer also holds a reference to the model
import gc
del model, trainer
gc.collect()
torch.cuda.empty_cache()

### 2. Freeze the base model while only training the classification head

In [None]:
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

tokenizer = transformers.AutoTokenizer.from_pretrained("distilroberta-base")
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilroberta-base",
                                                                        num_labels=2,
                                                                        id2label=id2label,
                                                                        label2id=label2id)
model

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
           

In [None]:
# Freeze the base model (DistilRoBERTa)
for param in model.roberta.parameters():
    param.requires_grad = False

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())

# Count trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

Total parameters: 82119938
Trainable parameters: 592130
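The 592,130 trainable parameters are exactly the classification head. The warning printed when loading the model names its two layers, `classifier.dense` and `classifier.out_proj`. Assuming `dense` maps 768 to 768 features and `out_proj` maps 768 features to the 2 labels, the count works out as follows:

```python
hidden_size = 768
num_labels = 2

dense_params = hidden_size * hidden_size + hidden_size   # weight + bias
out_proj_params = hidden_size * num_labels + num_labels  # weight + bias
head_params = dense_params + out_proj_params
print(head_params)  # 592130
```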


In [None]:
training_args = transformers.TrainingArguments(
    output_dir='./txt_cls_frozen_base/',
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='none'
)

trainer = transformers.Trainer(
    model,
    training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["validation"],
    data_collator=data_collator,
    # tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.1704,0.93408,0.934047
2,0.231000,0.160825,0.94048,0.940475
3,0.112600,0.189595,0.94392,0.943917


TrainOutput(global_step=1173, training_loss=0.15816835044086844, metrics={'train_runtime': 3998.0833, 'train_samples_per_second': 18.759, 'train_steps_per_second': 0.293, 'total_flos': 9934972107075840.0, 'train_loss': 0.15816835044086844, 'epoch': 3.0})

In [None]:
predictions = trainer.predict(test_dataset=tokenized_imdb["test"])

# Unpack the results
logits = predictions.predictions
labels = predictions.label_ids
metrics = predictions.metrics

metrics

{'test_loss': 0.1600857824087143,
 'test_accuracy': 0.94152,
 'test_f1': 0.9415194876171427,
 'test_runtime': 177.0592,
 'test_samples_per_second': 70.598,
 'test_steps_per_second': 1.107}

### 3. Use LoRA to perform parameter efficient fine tuning (PEFT)

## LoRA fine-tuning technique
<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dfbd169-eb7e-41e1-a050-556ccd6fb679_1600x672.png"> </img>


**LoRA** (Low-Rank Adaptation) keeps all pre-trained model weights frozen and learns the weight update ΔW as the product of two low-rank matrices, A and B:

$$h = W_0 x + \Delta W x = W_0 x + BAx, \qquad \Delta W = BA$$

For example, consider a frozen weight matrix $W_0$ of size 1000 x 1000.
With regular fine-tuning we have to update all 1M of its parameters.
Using LoRA with rank r = 16, we instead train two update matrices, **A** and **B**:

A = 16 x 1000

B = 1000 x 16

That is only **32K << 1M** parameters.

Source: https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
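The forward pass and the parameter savings can be sketched in a few lines of plain Python. This is a toy illustration with tiny, made-up matrices, not the `peft` implementation, and it omits the `lora_alpha / r` scaling factor. Initialising B to zeros follows the LoRA paper, so the adapted layer starts out identical to the frozen one.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W0, A, B, x):
    """h = W0 x + B (A x); W0 stays frozen, only A and B would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

rng = random.Random(0)
d, r = 4, 2  # toy sizes
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # d x d, frozen
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # r x d
B = [[0.0] * r for _ in range(d)]                             # d x r, zero-initialised
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapter contributes nothing, so h equals W0 x.
h = lora_forward(W0, A, B, x)

# Parameter counts for the 1000 x 1000 example with r = 16.
full_params = 1000 * 1000
lora_params = 16 * 1000 + 1000 * 16
print(full_params, lora_params)  # 1000000 32000
```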

In [12]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2, id2label=id2label, label2id=label2id
)

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())

# Count trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total parameters: 82119938
Trainable parameters: 82119938


In [13]:
from peft import LoraConfig, get_peft_model, TaskType

# Define a LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Task type
    r=8,                         # LoRA rank
    lora_alpha=32,               # Scaling factor
    lora_dropout=0.1,            # Dropout for LoRA
    target_modules=["query", "value"],  # Apply LoRA to the query and value projections in every attention layer
)


In [14]:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 739,586 || all params: 82,859,524 || trainable%: 0.8926


* 592,130 parameters come from the classification head.
* For each adapted matrix in an encoder block, the two low-rank factors add d x r + r x d = 768 x 8 + 8 x 768 = 12,288 parameters.
* We adapt both the query and value matrices, so each block adds 2 x 12,288 = 24,576 parameters.
* There are 6 encoder blocks, so LoRA adds 6 x 24,576 = 147,456 parameters.
* Hence, 592,130 + 147,456 = 739,586 trainable parameters.
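The same accounting can be checked in code. The last line also reproduces the `all params` figure. That figure would be consistent with PEFT keeping a trainable copy of the classification head next to the original one, although this explanation is an assumption rather than something the output states.

```python
hidden_size = 768
rank = 8
num_layers = 6             # encoder blocks in distilroberta-base
targets_per_layer = 2      # query and value
head_params = 592_130      # classification head
base_params = 82_119_938   # total parameters before adding LoRA

per_matrix = hidden_size * rank + rank * hidden_size   # factors A and B
lora_params = per_matrix * targets_per_layer * num_layers
trainable = head_params + lora_params
print(lora_params, trainable)  # 147456 739586
print(base_params + trainable)  # 82859524
```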

In [15]:
training_args = transformers.TrainingArguments(
    output_dir='./txt_cls_lora/',
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='none'
)

trainer = transformers.Trainer(
    model,
    training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [16]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.170272,0.9344,0.934398
2,0.226100,0.160508,0.94048,0.940479



TrainOutput(global_step=782, training_loss=0.20555073281993036, metrics={'train_runtime': 972.0993, 'train_samples_per_second': 51.435, 'train_steps_per_second': 0.804, 'total_flos': 6736970342400000.0, 'train_loss': 0.20555073281993036, 'epoch': 2.0})

In [17]:
predictions = trainer.predict(test_dataset=tokenized_imdb["test"])

# Unpack the results
logits = predictions.predictions
labels = predictions.label_ids
metrics = predictions.metrics

metrics

{'test_loss': 0.16037975251674652,
 'test_accuracy': 0.93952,
 'test_f1': 0.9395181032877191,
 'test_runtime': 92.6112,
 'test_samples_per_second': 134.973,
 'test_steps_per_second': 2.116}

## Token classification (NER)

We will use the WNUT dataset.

In [18]:
wnut = datasets.load_dataset("wnut_17", trust_remote_code=True)
wnut


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

In [19]:
wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

## B-I-O NER schema
Each value in `ner_tags` describes an entity type, such as a corporation, location, or person. The prefix of each tag indicates the token's position within the entity:

- B- indicates the beginning of an entity.
- I- indicates that the token is inside the same entity (e.g., the State token is part of an entity like Empire State Building).
- O indicates that the token doesn't correspond to any entity.
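To see how the prefixes combine into entities, the sketch below groups B-/I- tagged tokens into spans. The helper `bio_to_spans` is hypothetical, and the tokens are a shortened version of the training example shown above. In this sketch a stray I- tag that does not continue an open entity is simply dropped.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:  # close the entity that was open
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:  # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:  # entity that runs to the end of the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans

toy_tokens = ["Empire", "State", "Building", "=", "ESB", "."]
toy_tags = ["B-location", "I-location", "I-location", "O", "B-location", "O"]
toy_spans = bio_to_spans(toy_tokens, toy_tags)
print(toy_spans)
# [('location', 'Empire State Building'), ('location', 'ESB')]
```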

In [20]:
label_list = wnut["train"].features["ner_tags"].feature.names
# Print the tag set
label_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

In [22]:
id2label = {idx:label for idx,label in enumerate(label_list)}
label2id = {label:idx for idx,label in enumerate(label_list)}
print(id2label)
print(label2id)

{0: 'O', 1: 'B-corporation', 2: 'I-corporation', 3: 'B-creative-work', 4: 'I-creative-work', 5: 'B-group', 6: 'I-group', 7: 'B-location', 8: 'I-location', 9: 'B-person', 10: 'I-person', 11: 'B-product', 12: 'I-product'}
{'O': 0, 'B-corporation': 1, 'I-corporation': 2, 'B-creative-work': 3, 'I-creative-work': 4, 'B-group': 5, 'I-group': 6, 'B-location': 7, 'I-location': 8, 'B-person': 9, 'I-person': 10, 'B-product': 11, 'I-product': 12}


In [24]:
tokenizer = transformers.AutoTokenizer.from_pretrained('xlm-roberta-base')
model = transformers.AutoModelForTokenClassification.from_pretrained("xlm-roberta-base",
                                                                     id2label=id2label,
                                                                     label2id=label2id)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
model

XLMRobertaForTokenClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768

## Handling subword splitting and special tokens
Adding special tokens (`<s>` and `</s>` for XLM-RoBERTa, the counterparts of BERT's `[CLS]` and `[SEP]`) and splitting words into subwords creates a mismatch between the input tokens and the labels: a single word with a single label may be split into several subword tokens. You need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label -100 to the special tokens so the loss function ignores them.
3. Labeling only the first token of each word and assigning -100 to the remaining subtokens of that word.

The function below realigns the tokens and labels and truncates sequences to the model's maximum input length:

In [26]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True,
                                 is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Special tokens (<s>, </s>) have no word id; mask them with -100.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = wnut.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=wnut["train"].column_names
)

tokenized_dataset['train'][0]

Map:   0%|          | 0/3394 [00:00<?, ? examples/s]

Map:   0%|          | 0/1009 [00:00<?, ? examples/s]

Map:   0%|          | 0/1287 [00:00<?, ? examples/s]

{'input_ids': [0,
  1374,
  763,
  202,
  94449,
  1650,
  242,
  7,
  70,
  21455,
  1295,
  7440,
  87,
  242,
  39,
  38043,
  100,
  6626,
  40859,
  6,
  5,
  145359,
  22836,
  104919,
  2203,
  131523,
  6,
  5,
  197570,
  6494,
  77076,
  3688,
  4568,
  105216,
  6,
  5,
  2],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100,
  0,
  -100,
  -100,
  -100,
  0,
  0,
  -100,
  0,
  0,
  0,
  0,
  0,
  0,
  -100,
  0,
  0,
  0,
  0,
  0,
  -100,
  7,
  8,
  8,
  0,
  7,
  0,
  -100,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  -100,
  -100]}

In [27]:
import numpy as np

# Load the seqeval metric
seqeval_metric = evaluate.load("seqeval")

def compute_metrics_ner(p):
    predictions, labels = p

    # Convert logits to predicted class indices
    predictions = np.argmax(predictions, axis=-1)

    # Drop positions labelled -100 (special tokens and non-first subtokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(pred_seq, label_seq) if l != -100]
        for pred_seq, label_seq in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for l in label_seq if l != -100]
        for label_seq in labels
    ]

    # Compute metrics
    results = seqeval_metric.compute(predictions=true_predictions,
                                     references=true_labels) # Micro-average by default
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


In [30]:
data_collator = transformers.DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = transformers.TrainingArguments(
    output_dir='./tok_cls_example/',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='none'
)

trainer = transformers.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics_ner
)

In [31]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.253828 | 0.520134 | 0.370813 | 0.432961 | 0.941778 |
| 2 | No log | 0.242782 | 0.657801 | 0.44378 | 0.53 | 0.947372 |
| 3 | 0.209300 | 0.221127 | 0.587586 | 0.509569 | 0.545804 | 0.952902 |




TrainOutput(global_step=639, training_loss=0.18325492996191942, metrics={'train_runtime': 162.5052, 'train_samples_per_second': 62.656, 'train_steps_per_second': 3.932, 'total_flos': 277104186547440.0, 'train_loss': 0.18325492996191942, 'epoch': 3.0})
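
The `global_step=639` value and the `No log` entries in the training table can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`; the variable names are illustrative.

```python
import math

train_examples = 3394   # training examples, as reported by the first Map progress line
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 500     # assumed: TrainingArguments default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch during which the first training-loss log (step 500) is written
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
# 213 639 3
```

Under these assumptions the first training loss is logged during epoch 3, which is why the first two rows show `No log`. Passing a smaller `logging_steps` to `TrainingArguments` should make a training loss appear in every row.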

In [32]:
predictions = trainer.predict(test_dataset=tokenized_dataset["test"])

# Unpack the results
logits = predictions.predictions
labels = predictions.label_ids
metrics = predictions.metrics

metrics

{'test_loss': 0.25295591354370117,
 'test_precision': 0.5382907880133185,
 'test_recall': 0.4494902687673772,
 'test_f1': 0.4898989898989899,
 'test_accuracy': 0.9486193040950671,
 'test_runtime': 2.8379,
 'test_samples_per_second': 453.502,
 'test_steps_per_second': 28.542}
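
The gap between token accuracy (about 0.95) and entity-level F1 (about 0.49) is expected: most tokens are `O`, so accuracy is easy to score well on, whereas entity-level scoring counts a prediction as correct only if the whole span and its type match. The following is a simplified pure-Python sketch of that idea on made-up tags. `extract_entities` and `entity_f1` are illustrative helpers, not the `seqeval` implementation, and they simplify edge cases (for example, a stray `I-` tag opens no entity here).

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:      # the open span ends here
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):                  # a new span begins
            start, ent_type = i, tag[2:]
    return entities


def entity_f1(true_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Made-up example: the model misses the second token of the person entity.
true_tags = ["O", "B-person", "I-person", "O", "O", "B-location", "O", "O"]
pred_tags = ["O", "B-person", "O", "O", "O", "B-location", "O", "O"]

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(token_accuracy)                    # 0.875
print(entity_f1(true_tags, pred_tags))   # (0.5, 0.5, 0.5)
```

A single wrong token costs one eighth of the accuracy but half of the entity-level score, because the truncated person span no longer matches the gold span.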

In [93]:
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None)


def get_raw_tokens(subtokens: list[str]) -> list[str]:
    """Merge SentencePiece subtokens back into whitespace-delimited words."""
    tokens = []
    for tok in subtokens:
        if tok in ['<s>', '</s>']:
            tokens.append(tok)      # keep special tokens as separate items
        elif tok.startswith("▁"):
            tokens.append(tok[1:])  # "▁" marks the start of a new word
        else:
            tokens[-1] += tok       # continuation piece of the previous word
    return tokens


def get_prediction(text):
    # Tokenize the text and move the tensors to the model's device
    inputs = tokenizer(text, truncation=True, return_tensors="pt").to(model.device)
    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert logits to probabilities, then take the most likely tag per subtoken
    probs = outputs.logits[0].softmax(dim=-1)
    tags = [id2label[x] for x in probs.argmax(dim=-1).tolist()]

    # word_ids maps each subtoken to its word (None for <s> and </s>)
    word_ids = inputs.word_ids()
    subtokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    raw_tokens = get_raw_tokens(subtokens)[1:-1]  # drop <s> and </s>
    assert len(word_ids) == len(tags), "ERROR"

    # Tag each word with the prediction for its first subtoken
    f_tags = [tags[word_ids.index(idx)] for idx in range(len(raw_tokens))]

    return pd.DataFrame(list(zip(raw_tokens, f_tags)), columns=['token', 'tag'])
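
To see what `get_raw_tokens` and the `word_ids.index(idx)` lookup do, here is a toy run on hand-written inputs. The subtoken pieces, `word_ids`, and tags below are hypothetical (a real tokenizer may split the text differently), and `merge_pieces` is a standalone re-implementation written for illustration.

```python
def merge_pieces(pieces):
    """Rebuild whitespace-delimited words from SentencePiece-style pieces."""
    words = []
    for piece in pieces:
        if piece.startswith("▁"):
            words.append(piece[1:])   # "▁" marks the start of a new word
        else:
            words[-1] += piece        # continuation of the previous word
    return words


# Hypothetical pieces for "MacBook Pro," (special tokens already stripped)
pieces = ["▁Mac", "Book", "▁Pro", ","]
# One entry per subtoken, including the leading <s> and trailing </s>
word_ids = [None, 0, 0, 1, 1, None]
tags = ["O", "B-product", "I-product", "I-product", "O", "O"]

words = merge_pieces(pieces)
# list.index returns the first match, i.e. the first subtoken of each word
word_tags = [tags[word_ids.index(i)] for i in range(len(words))]
print(list(zip(words, word_tags)))
# [('MacBook', 'B-product'), ('Pro,', 'I-product')]
```

Because only the first subtoken of each word is consulted, punctuation that the tokenizer attaches to a word (such as the comma in `Pro,`) inherits that word's tag.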

In [94]:
# Example #1
text1 = """
The Trump administration’s immigration raids in the California city prompted mostly peaceful protests,
which escalated when the president sent in the national guard."""

print(get_prediction(text1))

               token         tag
0                The           O
1              Trump    B-person
2   administration’s    I-person
3        immigration           O
4              raids           O
5                 in           O
6                the           O
7         California  B-location
8               city           O
9           prompted           O
10            mostly           O
11          peaceful           O
12         protests,           O
13             which           O
14         escalated           O
15              when           O
16               the           O
17         president           O
18              sent           O
19                in           O
20               the           O
21          national           O
22            guard.           O


In [95]:
# Example #2
text2 = """
Apple in October 2021 overhauled the high-end MacBook Pro, introducing
an entirely new design, new chips, new capabilities, and more."""

print(get_prediction(text2))

            token            tag
0           Apple  B-corporation
1              in              O
2         October              O
3            2021              O
4      overhauled              O
5             the              O
6        high-end              O
7         MacBook      I-product
8            Pro,      I-product
9     introducing              O
10             an              O
11       entirely              O
12            new              O
13        design,              O
14            new              O
15         chips,              O
16            new              O
17  capabilities,              O
18            and              O
19          more.              O


### Use already fine-tuned NER models from the Hugging Face Hub
Browse the available models at https://huggingface.co/models

In [96]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer_ner = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model_ner = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model_ner, tokenizer=tokenizer_ner)


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [100]:
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None)


def get_raw_tokens_bert(subtokens: list[str]) -> list[str]:
    """Merge WordPiece subtokens ("##" continuations) back into words."""
    tokens = []
    for tok in subtokens:
        if tok.startswith("##"):
            tokens[-1] += tok[2:]
        else:
            tokens.append(tok)
    return tokens


def _get_predictions(text, tokenizer, ner_pipe):
    inputs = tokenizer(text)
    text_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])

    # The pipeline returns only non-"O" subtokens, so start from all "O"
    text_tags = ["O"] * len(text_tokens)
    pred_tags = ner_pipe(text)
    print(pred_tags)
    for pr_tag in pred_tags:
        # "index" is the subtoken position, counting the leading [CLS]
        text_tags[pr_tag["index"]] = pr_tag["entity"]

    word_ids = inputs.word_ids()
    raw_tokens = get_raw_tokens_bert(text_tokens[1:-1])  # drop [CLS] and [SEP]
    assert len(word_ids) == len(text_tags), "ERROR"

    # Tag each word with the prediction for its first subtoken
    f_tags = [text_tags[word_ids.index(idx)] for idx in range(len(raw_tokens))]

    return pd.DataFrame(list(zip(raw_tokens, f_tags)), columns=['token', 'tag'])

In [101]:
print(_get_predictions(text1, tokenizer_ner, nlp))

[{'entity': 'B-PER', 'score': np.float32(0.99856985), 'index': 2, 'word': 'Trump', 'start': 5, 'end': 10}, {'entity': 'B-LOC', 'score': np.float32(0.9995659), 'index': 10, 'word': 'California', 'start': 53, 'end': 63}]
             token    tag
0              The      O
1            Trump  B-PER
2   administration      O
3                ’      O
4                s      O
5      immigration      O
6            raids      O
7               in      O
8              the      O
9       California  B-LOC
10            city      O
11        prompted      O
12          mostly      O
13        peaceful      O
14        protests      O
15               ,      O
16           which      O
17       escalated      O
18            when      O
19             the      O
20       president      O
21            sent      O
22              in      O
23             the      O
24        national      O
25           guard      O
26               .      O


In [102]:
print(_get_predictions(text2, tokenizer_ner, nlp))

[{'entity': 'B-ORG', 'score': np.float32(0.9983199), 'index': 1, 'word': 'Apple', 'start': 1, 'end': 6}, {'entity': 'B-MISC', 'score': np.float32(0.99807215), 'index': 12, 'word': 'Mac', 'start': 47, 'end': 50}, {'entity': 'I-MISC', 'score': np.float32(0.9977671), 'index': 13, 'word': '##B', 'start': 50, 'end': 51}, {'entity': 'I-MISC', 'score': np.float32(0.99782324), 'index': 14, 'word': '##ook', 'start': 51, 'end': 54}, {'entity': 'I-MISC', 'score': np.float32(0.99844617), 'index': 15, 'word': 'Pro', 'start': 55, 'end': 58}]
           token     tag
0          Apple   B-ORG
1             in       O
2        October       O
3           2021       O
4     overhauled       O
5            the       O
6           high       O
7              -       O
8            end       O
9        MacBook  B-MISC
10           Pro  I-MISC
11             ,       O
12   introducing       O
13            an       O
14      entirely       O
15           new       O
16        design       O
17             ,
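
The raw pipeline output for the second example has one entry per subtoken (`Mac`, `##B`, `##ook`, `Pro`). The sketch below merges such entries into entity spans using the `B-`/`I-` prefixes, the subtoken index, and the character offsets. `group_entities` is an illustrative helper written for this example, and the input is copied from the printed output with the scores omitted. The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping for you.

```python
def group_entities(text, token_preds):
    """Merge per-subtoken predictions into (entity_type, surface_text) spans.

    Assumes every prediction carries a B-/I- prefix (the pipeline drops "O").
    """
    spans = []
    for pred in token_preds:
        prefix, ent_type = pred["entity"].split("-", 1)
        last = spans[-1] if spans else None
        continues = (
            prefix == "I"
            and last is not None
            and last["type"] == ent_type
            and pred["index"] == last["index"] + 1   # directly follows the previous subtoken
        )
        if continues:
            last["end"], last["index"] = pred["end"], pred["index"]
        else:
            spans.append({"type": ent_type, "start": pred["start"],
                          "end": pred["end"], "index": pred["index"]})
    return [(s["type"], text[s["start"]:s["end"]]) for s in spans]


# First line of text2 (including its leading newline) so the offsets line up
text = "\nApple in October 2021 overhauled the high-end MacBook Pro, introducing"
token_preds = [
    {"entity": "B-ORG", "index": 1, "start": 1, "end": 6},
    {"entity": "B-MISC", "index": 12, "start": 47, "end": 50},
    {"entity": "I-MISC", "index": 13, "start": 50, "end": 51},
    {"entity": "I-MISC", "index": 14, "start": 51, "end": 54},
    {"entity": "I-MISC", "index": 15, "start": 55, "end": 58},
]
print(group_entities(text, token_preds))
# [('ORG', 'Apple'), ('MISC', 'MacBook Pro')]
```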

# Resources
* https://huggingface.co/docs/transformers/tasks/sequence_classification
* https://huggingface.co/docs/transformers/tasks/token_classification