# Named Entity Recognition (NER)
This notebook compares three parameter-efficient fine-tuning (PEFT) techniques:

- LoRA (Low-Rank Adaptation)
- AdaLoRA (Adaptive Low-Rank Adaptation)
- Prefix tuning

Summary
- Dataset: CoNLL-2003
- Base model: DistilGPT-2 (`distilgpt2`), tokenized with the `gpt2` tokenizer
- Evaluation metrics: precision, recall, F1 score, and accuracy (computed with `seqeval`)

## Import Libraries

In [1]:
!pip install -q evaluate peft seqeval


In [2]:
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, AdaLoraConfig, PromptTuningConfig, PrefixTuningConfig, get_peft_model
from datasets import load_dataset
import evaluate
import numpy as np
from seqeval.metrics import classification_report
from typing import List, Dict



## Data Loading and Preprocessing

In [3]:
# === 1. Preprocessing ===
def load_and_preprocess_data(num_virtual_tokens=0):
    try:
        # Load CoNLL-2003 dataset
        dataset = load_dataset("conll2003")
        print("Dataset loaded successfully")
        
        # Load tokenizer with add_prefix_space=True for pre-tokenized inputs
        tokenizer = AutoTokenizer.from_pretrained(
            "gpt2",
            add_prefix_space=True
        )
        tokenizer.pad_token = tokenizer.eos_token
        
        # Get label list
        label_list = dataset["train"].features["ner_tags"].feature.names
        num_labels = len(label_list)
        label2id = {label: idx for idx, label in enumerate(label_list)}
        id2label = {idx: label for idx, label in enumerate(label_list)}
        print(f"Labels: {label_list}")
        
        def tokenize_and_align_labels(examples):
            tokenized_inputs = tokenizer(
                examples["tokens"],
                truncation=True,
                is_split_into_words=True,
                padding="max_length",
                max_length=128,
                return_tensors="pt"
            )
            
            labels = []
            for i, label in enumerate(examples["ner_tags"]):
                word_ids = tokenized_inputs.word_ids(batch_index=i)
                label_ids = []
                # Prepend -100 for virtual prompt tokens if using Prompt Tuning
                if num_virtual_tokens > 0:
                    label_ids.extend([-100] * num_virtual_tokens)
                previous_word_idx = None
                for word_idx in word_ids:
                    if word_idx is None:
                        label_ids.append(-100)
                    elif word_idx != previous_word_idx:
                        label_ids.append(label[word_idx])
                    else:
                        label_ids.append(-100)
                    previous_word_idx = word_idx
                # Ensure label length matches input length
                target_length = 128 + num_virtual_tokens
                if len(label_ids) < target_length:
                    label_ids.extend([-100] * (target_length - len(label_ids)))
                elif len(label_ids) > target_length:
                    label_ids = label_ids[:target_length]
                labels.append(label_ids)
            
            tokenized_inputs["labels"] = labels
            return tokenized_inputs
        
        # Tokenize dataset
        tokenized_dataset = dataset.map(
            tokenize_and_align_labels,
            batched=True,
            remove_columns=dataset["train"].column_names
        )
        print("Dataset tokenized successfully")
        
        return tokenized_dataset, tokenizer, label_list, label2id, id2label
    
    except Exception as e:
        print(f"Error in preprocessing: {str(e)}")
        raise
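
The alignment rule inside `tokenize_and_align_labels` can be looked at in isolation. The sketch below is a minimal, tokenizer-free version: it assumes a `word_ids` list of the kind a fast tokenizer returns (one word index per sub-token, `None` for padding), and the example inputs are hypothetical. Only the first sub-token of each word keeps its label; every other position gets `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each word's label to its first sub-token and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Padding (or a special token): never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a new word: keep the word-level label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token of the same word: mask it.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into two sub-tokens,
# followed by two padding positions.
example_word_ids = [0, 1, 1, 2, None, None]
example_word_labels = [3, 0, 5]  # B-ORG, O, B-LOC in the CoNLL-2003 label list

print(align_labels(example_word_ids, example_word_labels))
# [3, 0, -100, 5, -100, -100]
```

Masking continuation sub-tokens means each word is scored exactly once, which keeps the token-level predictions comparable to the word-level gold tags.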

## Load Pretrained Model and PEFT Configuration

In [4]:
# === 2. Model Setup ===
def setup_model(num_labels, peft_type: str, label_list):
    try:
        model = AutoModelForTokenClassification.from_pretrained(
            "distilgpt2",
            num_labels=num_labels,
            id2label={i: label for i, label in enumerate(label_list)},
            label2id={label: i for i, label in enumerate(label_list)}
        )
        
        
        if peft_type == "lora":
            config = LoraConfig(
                r=16,
                lora_alpha=16,
                target_modules=["c_attn", "c_fc"],
                lora_dropout=0.1,
                bias="none",
                task_type="TOKEN_CLS"
            )
            model = get_peft_model(model, config)
            print("LoRA model configured")
        elif peft_type == "adalora":
            config = AdaLoraConfig(
                r=24,
                target_r=16,
                lora_alpha=16,
                lora_dropout=0.1,
                target_modules=["c_attn", "c_fc"],
                task_type="TOKEN_CLS",
                inference_mode=False,
                init_r=8,
                tinit=200,
                tfinal=1000,
                deltaT=10,
                beta1=0.85,
                beta2=0.85,
                modules_to_save=["classifier"] 
            )
            model = get_peft_model(model, config)
            print("AdaLoRA model configured")
        elif peft_type == "prefix":
            config = PrefixTuningConfig(
                task_type="TOKEN_CLS",
                num_virtual_tokens=20,
                encoder_hidden_size=768
            )
            model = get_peft_model(model, config)
            print("Prefix Tuning model configured")
        elif peft_type == "prompt_tuning":
            config = PromptTuningConfig(
                task_type="TOKEN_CLS",
                num_virtual_tokens=20,
                prompt_tuning_init="TEXT",
                prompt_tuning_init_text="Classify named entities in the following text:",
                tokenizer_name_or_path="distilgpt2"
            )
            model = get_peft_model(model, config)
            print("Prompt Tuning model configured")
        else:
            print("Using full fine-tuning")
        
        model.print_trainable_parameters()
        return model
    
    except Exception as e:
        print(f"Error in model setup: {str(e)}")
        raise
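
The trainable-parameter counts that `print_trainable_parameters()` reports in the training logs (670,473 for LoRA, 338,793 for AdaLoRA, 191,241 for Prefix tuning) can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not PEFT's own accounting. It assumes DistilGPT-2 has 6 transformer blocks with hidden size 768, that the classification head (768 x 9 weights plus 9 biases) is trainable in all three setups, and that AdaLoRA starts each adapted matrix at rank `init_r=8` with one additional rank-length vector of singular values.

```python
# Assumed DistilGPT-2 dimensions: 6 transformer blocks, hidden size 768.
N_LAYERS, HIDDEN = 6, 768
NUM_LABELS = 9  # CoNLL-2003 tag set

# Shapes (in_features, out_features) of the adapted modules in each block.
TARGETS = {
    "c_attn": (HIDDEN, 3 * HIDDEN),  # fused query/key/value projection
    "c_fc": (HIDDEN, 4 * HIDDEN),    # first MLP projection
}

# Token-classification head: weight matrix plus bias (assumed trainable).
classifier = HIDDEN * NUM_LABELS + NUM_LABELS


def low_rank_params(rank: int, extra_per_module: int = 0) -> int:
    """Parameters in the low-rank factors A (rank x in) and B (out x rank)."""
    per_layer = sum(
        rank * (d_in + d_out) + extra_per_module
        for d_in, d_out in TARGETS.values()
    )
    return N_LAYERS * per_layer


lora = low_rank_params(rank=16) + classifier
# Assumption: AdaLoRA starts at init_r=8 and adds 8 singular values per module.
adalora = low_rank_params(rank=8, extra_per_module=8) + classifier
# Prefix tuning: 20 virtual tokens, each with a key and a value per block.
prefix = 20 * (N_LAYERS * 2 * HIDDEN) + classifier

print(lora, adalora, prefix)
# 670473 338793 191241
```

Under these assumptions the AdaLoRA count matches only with rank 8, which suggests that `init_r=8`, rather than `r=24`, sets the starting rank in this configuration.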

## Model Training

In [12]:
from transformers import EarlyStoppingCallback
import numpy as np
import evaluate

# === 3. Training ===
def train_model(model, tokenized_dataset, output_dir, peft_type: str):
    try:
        training_args = TrainingArguments(
            output_dir=output_dir,
            eval_strategy="epoch",
            save_strategy="epoch",
            learning_rate=1e-4 if peft_type == "adapter" else 5e-5 if peft_type == "lora" else 1e-3,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=16,
            num_train_epochs=30,
            weight_decay=0.1,
            logging_dir="./logs",
            logging_steps=100,
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            report_to="none",
            gradient_accumulation_steps=2,
            save_total_limit=1  # optional: saves space
        )

        metric = evaluate.load("seqeval")

        def compute_metrics(p):
            predictions, labels = p
            predictions = np.argmax(predictions, axis=2)

            # label_list is read from the notebook's global scope (set in the main script)
            true_labels = [
                [label_list[l] for l in label if l != -100]
                for label in labels
            ]
            pred_labels = [
                [label_list[p] for (p, l) in zip(pred, label) if l != -100]
                for pred, label in zip(predictions, labels)
            ]

            results = metric.compute(predictions=pred_labels, references=true_labels)
            return {
                "precision": results["overall_precision"],
                "recall": results["overall_recall"],
                "f1": results["overall_f1"],
                "accuracy": results["overall_accuracy"]
            }

        # Early stopping callback
        early_stopping = EarlyStoppingCallback(
            early_stopping_patience=3,  # stop after 3 epochs with no improvement
            early_stopping_threshold=0.0  # requires strictly better score
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_dataset["train"],
            eval_dataset=tokenized_dataset["validation"],
            compute_metrics=compute_metrics,
            callbacks=[early_stopping]
        )

        print(f"Starting training for {output_dir}...")
        trainer.train()
        trainer.save_model(output_dir)
        print(f"Model saved to {output_dir}")
        return trainer

    except Exception as e:
        print(f"Error in training: {str(e)}")
        raise
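
The chained conditional used for `learning_rate` is easy to misread, so the sketch below evaluates it for the three methods trained in this notebook. The helper name `resolve_learning_rate` is hypothetical; the expression is copied from `train_model`. The effective batch size assumes training on a single device.

```python
def resolve_learning_rate(peft_type: str) -> float:
    """Same chained conditional as in train_model, isolated for inspection."""
    return 1e-4 if peft_type == "adapter" else 5e-5 if peft_type == "lora" else 1e-3


for name in ("lora", "adalora", "prefix"):
    print(f"{name}: learning rate {resolve_learning_rate(name)}")

# Effective batch size per optimizer step (single device assumed):
# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 8 * 2
print("effective batch size:", effective_batch)
```

Only LoRA gets 5e-5; AdaLoRA and Prefix tuning both fall through to the final `else` branch and train at 1e-3. The `"adapter"` branch is never taken here, because no such method is configured in `setup_model`.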

## Evaluation

In [6]:
# === 4. Evaluation and Comparison ===
def evaluate_and_compare(trainers: Dict[str, Trainer], tokenized_dataset):
    try:
        print("Evaluating models...")
        results = {}
        
        for peft_type, trainer in trainers.items():
            dataset = tokenized_dataset
            eval_results = trainer.evaluate(dataset["test"])
            results[peft_type] = eval_results
            print(f"{peft_type.capitalize()} Model Results: {eval_results}")
        
        # Compare
        print("\n=== Model Comparison ===")
        for peft_type, res in results.items():
            # param_count = "~0.1-1%" if peft_type == "lora" else "~0.5-2%" if peft_type == "adapter" else "<0.01%" if peft_type == "prompt_tuning" else "All"
            print(f"{peft_type.capitalize()} - F1: {res['eval_f1']:.4f}, Accuracy: {res['eval_accuracy']:.4f}")
        
        return results
    
    except Exception as e:
        print(f"Error in evaluation: {str(e)}")
        raise

In [7]:
# === 5. Demo Prediction ===
def demo_prediction(model, tokenizer, label_list, sentence: str, num_virtual_tokens=0):
    try:
        model.eval()
        # Tokenize with BatchEncoding to get word_ids
        encoding = tokenizer(
            sentence.split(),
            is_split_into_words=True,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=128,
            return_special_tokens_mask=True
        )
        
        # Move inputs to the same device as the model, excluding special_tokens_mask
        inputs = {k: v.to(model.device) for k, v in encoding.items() if k != "special_tokens_mask"}
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        predictions = torch.argmax(outputs.logits, dim=2)[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        
        # Get word_ids from encoding
        word_ids = encoding.word_ids(batch_index=0)
        pred_labels = []
        current_word_id = None
        token_idx = num_virtual_tokens  # Skip prompt tokens
        for word_id, pred in zip(word_ids, predictions):
            if word_id is None or token_idx < num_virtual_tokens:
                token_idx += 1
                continue
            if word_id != current_word_id:
                pred_labels.append(label_list[pred])
                current_word_id = word_id
            token_idx += 1
        
        tokens = [token for token, wid in zip(tokens, word_ids) if wid is not None]
        
        print("=== Demo Prediction ===")
        print("Sentence:", sentence)
        print("Token\t\tLabel")
        print("-----------------------")
        for token, label in zip(tokens, pred_labels):
            print(f"{token[1:]:<15}\t{label}")
    
    except Exception as e:
        print(f"Error in demo prediction: {str(e)}")
        raise

## Main Script

In [8]:
# Preprocessing for LoRA and AdaLoRA
tokenized_dataset, tokenizer, label_list, label2id, id2label = load_and_preprocess_data(
    num_virtual_tokens=0
)

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset loaded successfully


Labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Dataset tokenized successfully


In [13]:
# Train models with different PEFT techniques
trainers = {}
peft_type = "lora"
print('Training with ' + peft_type)
model = setup_model(len(label_list), peft_type=peft_type, label_list=label_list)
trainer = train_model(model, tokenized_dataset, f"./{peft_type}_model", peft_type)
trainers[peft_type] = trainer

Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training with lora
LoRA model configured
trainable params: 670,473 || all params: 82,589,970 || trainable%: 0.8118


No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training for ./lora_model...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2361 | 0.237861 | 0.656518 | 0.580906 | 0.616402 | 0.938116 |
| 2 | 0.1971 | 0.190237 | 0.662584 | 0.664927 | 0.663753 | 0.949823 |
| 3 | 0.1658 | 0.16491 | 0.646712 | 0.702138 | 0.673287 | 0.953368 |
| 4 | 0.1514 | 0.156956 | 0.687218 | 0.743223 | 0.714124 | 0.958841 |
| 5 | 0.1432 | 0.147393 | 0.702362 | 0.760903 | 0.730461 | 0.96077 |
| 6 | 0.1435 | 0.142526 | 0.714709 | 0.77151 | 0.742024 | 0.962698 |
| 7 | 0.1378 | 0.137006 | 0.718083 | 0.782287 | 0.748811 | 0.963867 |
| 8 | 0.1227 | 0.136955 | 0.729465 | 0.789527 | 0.758308 | 0.965211 |
| 9 | 0.1247 | 0.130842 | 0.73624 | 0.790537 | 0.762423 | 0.965737 |
| 10 | 0.1201 | 0.130389 | 0.733898 | 0.788517 | 0.760227 | 0.965971 |


Model saved to ./lora_model
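
To see how the early-stopping setting (`early_stopping_patience=3`, threshold `0.0`) interacts with the validation F1 values logged above, the sketch below replays those ten values through a simplified patience counter. It is an illustration of the idea for a higher-is-better metric, not the `EarlyStoppingCallback` implementation; `first_stop_epoch` and the `plateau` series are hypothetical.

```python
from typing import List, Optional


def first_stop_epoch(scores: List[float], patience: int,
                     threshold: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score - best > threshold:
            # Strict improvement over the best score so far: reset the counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation F1 for the ten LoRA epochs shown in the log above.
lora_f1 = [0.616402, 0.663753, 0.673287, 0.714124, 0.730461,
           0.742024, 0.748811, 0.758308, 0.762423, 0.760227]
print(first_stop_epoch(lora_f1, patience=3))  # None

# Hypothetical plateau: a tie with the best score does not count as improvement.
plateau = [0.70, 0.75, 0.74, 0.75, 0.73]
print(first_stop_epoch(plateau, patience=3))  # 5
```

Within the ten epochs shown, F1 improves every epoch until epoch 9 and drops only once, so the patience counter never reaches 3 and training continues.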


In [14]:
peft_type = "adalora"
print('Training with ' + peft_type)
model = setup_model(len(label_list), peft_type=peft_type, label_list=label_list)
trainer = train_model(model, tokenized_dataset, f"./{peft_type}_model", peft_type)
trainers[peft_type] = trainer

Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training with adalora
AdaLoRA model configured
trainable params: 338,793 || all params: 82,258,302 || trainable%: 0.4119


No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training for ./adalora_model...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.1697 | 0.166855 | 0.704653 | 0.749621 | 0.726442 | 0.95966 |
| 2 | 0.1522 | 0.152185 | 0.721264 | 0.760734 | 0.740474 | 0.961081 |
| 3 | 0.1355 | 0.135895 | 0.730372 | 0.786328 | 0.757318 | 0.964958 |
| 4 | 0.1279 | 0.134317 | 0.741966 | 0.793063 | 0.766664 | 0.967178 |
| 5 | 0.1211 | 0.130592 | 0.732135 | 0.795252 | 0.762389 | 0.966419 |
| 6 | 0.1219 | 0.121341 | 0.747659 | 0.806701 | 0.776059 | 0.968386 |
| 7 | 0.1165 | 0.122693 | 0.747505 | 0.794578 | 0.770323 | 0.968094 |
| 8 | 0.1064 | 0.124856 | 0.740561 | 0.795925 | 0.767246 | 0.967373 |
| 9 | 0.1073 | 0.114607 | 0.753543 | 0.805691 | 0.778745 | 0.969107 |
| 10 | 0.0998 | 0.119641 | 0.752454 | 0.813268 | 0.78168 | 0.969224 |


Model saved to ./adalora_model


In [15]:
peft_type = "prefix"
print('Training with ' + peft_type)
model = setup_model(len(label_list), peft_type=peft_type, label_list=label_list)
trainer = train_model(model, tokenized_dataset, f"./{peft_type}_model", peft_type)
trainers[peft_type] = trainer

Training with prefix


Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Prefix Tuning model configured
trainable params: 191,241 || all params: 82,110,738 || trainable%: 0.2329


No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training for ./prefix_model...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2348 | 0.226735 | 0.585542 | 0.669641 | 0.624774 | 0.942246 |
| 2 | 0.1993 | 0.184682 | 0.657964 | 0.717124 | 0.686271 | 0.953251 |
| 3 | 0.17 | 0.163755 | 0.691715 | 0.740865 | 0.715447 | 0.958413 |
| 4 | 0.1533 | 0.150088 | 0.703276 | 0.75905 | 0.7301 | 0.960906 |
| 5 | 0.1424 | 0.144494 | 0.708053 | 0.762418 | 0.734231 | 0.961919 |
| 6 | 0.1421 | 0.137892 | 0.715987 | 0.769153 | 0.741619 | 0.963088 |
| 7 | 0.1355 | 0.134029 | 0.724791 | 0.773363 | 0.748289 | 0.964101 |
| 8 | 0.1227 | 0.137211 | 0.716084 | 0.775888 | 0.744787 | 0.963497 |
| 9 | 0.1252 | 0.129607 | 0.723434 | 0.776057 | 0.748822 | 0.965094 |
| 10 | 0.1198 | 0.130118 | 0.727415 | 0.784981 | 0.755102 | 0.965308 |


Model saved to ./prefix_model


In [16]:
# Evaluate and compare
results = evaluate_and_compare(trainers, tokenized_dataset)

Evaluating models...


Lora Model Results: {'eval_loss': 0.17044632136821747, 'eval_precision': 0.699547949628673, 'eval_recall': 0.767445979454481, 'eval_f1': 0.7319256756756757, 'eval_accuracy': 0.9578675282714055, 'eval_runtime': 9.4463, 'eval_samples_per_second': 365.54, 'eval_steps_per_second': 22.866, 'epoch': 28.0}


Adalora Model Results: {'eval_loss': 0.18012863397598267, 'eval_precision': 0.7098344693281402, 'eval_recall': 0.7747077577045696, 'eval_f1': 0.7408536585365854, 'eval_accuracy': 0.9590522347872913, 'eval_runtime': 9.9584, 'eval_samples_per_second': 346.743, 'eval_steps_per_second': 21.69, 'epoch': 26.0}


Prefix Model Results: {'eval_loss': 0.16270868480205536, 'eval_precision': 0.6991643454038997, 'eval_recall': 0.7557562876372653, 'eval_f1': 0.7263596901863989, 'eval_accuracy': 0.9576305869682283, 'eval_runtime': 9.4691, 'eval_samples_per_second': 364.659, 'eval_steps_per_second': 22.811, 'epoch': 28.0}

=== Model Comparison ===
Lora - F1: 0.7319, Accuracy: 0.9579
Adalora - F1: 0.7409, Accuracy: 0.9591
Prefix - F1: 0.7264, Accuracy: 0.9576


In [17]:
# Demo prediction with LoRA model
test_sentence = "Apple is planning to open a new store in London next month."
print('\n[LORA]')
demo_prediction(
    trainers["lora"].model,
    tokenizer,
    label_list,
    test_sentence,
    num_virtual_tokens=0
)
# Demo prediction with AdaLoRA model (AdaLoRA adds no virtual tokens)
print('\n[ADALORA]')
demo_prediction(
    trainers["adalora"].model,
    tokenizer,
    label_list,
    test_sentence,
    num_virtual_tokens=0
)
# Demo prediction with Prefix Tuning model
print('\n[PREFIX]')
demo_prediction(
    trainers["prefix"].model,
    tokenizer,
    label_list,
    test_sentence,
    num_virtual_tokens=20
)


[LORA]
=== Demo Prediction ===
Sentence: Apple is planning to open a new store in London next month.
Token		Label
-----------------------
Apple          	B-ORG
is             	O
planning       	O
to             	O
open           	O
a              	O
new            	O
store          	O
in             	O
London         	B-LOC
next           	O
month          	O

[ADALORA]
=== Demo Prediction ===
Sentence: Apple is planning to open a new store in London next month.
Token		Label
-----------------------
Apple          	B-ORG
is             	O
planning       	O
to             	O
open           	O
a              	O
new            	O
store          	O
in             	O
London         	B-LOC
next           	O
month          	O

[PREFIX]
=== Demo Prediction ===
Sentence: Apple is planning to open a new store in London next month.
Token		Label
-----------------------
Apple          	B-ORG
is             	O
planning       	O
to             	O
open           	O
a              	O
new            	O


## Conclusion

- **F1 Score**: On the test set, **AdaLoRA** (0.7409) scores higher than LoRA (0.7319) and Prefix tuning (0.7264). The gaps are roughly 0.009 and 0.015 F1 from a single run per method, so they point to a modest edge rather than a decisive one. LoRA was also trained with a lower learning rate (5e-5) than the other two methods (1e-3), so the comparison is not fully controlled.

- **Accuracy**: All methods reach token-level accuracy above 0.957, with **AdaLoRA** slightly ahead (0.9591). Accuracy is dominated by the frequent `O` tag, so it barely separates the methods; the entity-level F1 score is the more informative metric for NER.

- **Efficiency**: All three methods train less than 1% of the model's parameters: LoRA 0.81%, AdaLoRA 0.41%, and Prefix tuning 0.23%, according to the training logs. AdaLoRA reaches the highest F1 with about half of LoRA's trainable parameters, while Prefix tuning trains the fewest parameters and has the lowest F1.
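
The gap between accuracy (about 0.96) and F1 (about 0.73) follows from how the two metrics are counted: accuracy is per token, while F1 is per entity span and requires both the boundaries and the type to match. The sketch below illustrates this with a simplified BIO span extractor. It is not `seqeval`'s implementation, and the example sentence is hypothetical.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return {(type, start, end)} spans from a BIO tag sequence (simplified)."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues_span = tag.startswith("I-") and ent_type == tag[2:]
        if not continues_span:
            if ent_type is not None:
                entities.add((ent_type, start, i - 1))
            if tag.startswith(("B-", "I-")):
                start, ent_type = i, tag[2:]
            else:
                start, ent_type = None, None
    return entities


def entity_f1(true_tags: List[str], pred_tags: List[str]) -> float:
    """F1 over exact span matches (type and boundaries must both agree)."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def token_accuracy(true_tags: List[str], pred_tags: List[str]) -> float:
    return sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)


# Hypothetical sentence: the model misses the second token of a two-token ORG.
gold = ["B-ORG", "I-ORG", "O", "O", "O", "O", "O", "B-LOC", "O", "O"]
pred = ["B-ORG", "O", "O", "O", "O", "O", "O", "B-LOC", "O", "O"]

print(token_accuracy(gold, pred))  # 0.9
print(entity_f1(gold, pred))       # 0.5
```

A single wrong token costs 10 points of accuracy here but invalidates an entire entity, which halves both precision and recall.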