# Personal Project

### Fine-tune a BERT model with LoRA for NER on the BC5CDR dataset

https://huggingface.co/datasets/tner/bc5cdr?viewer_api=true

- Entity Types: Chemical, Disease

- "O": 0
- "B-Chemical": 1
- "B-Disease": 2
- "I-Chemical": 3
- "I-Disease": 4

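### Installing Necessary Libraries

Downgrading the `datasets` library to 3.6.0 was necessary to load the data later, most likely because this dataset ships a loading script (`bc5cdr.py`) that newer `datasets` releases no longer run.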

In [1]:
# Install required libraries
# --quiet suppresses installation messages
!pip install transformers peft accelerate torch evaluate seqeval --quiet
!pip install datasets==3.6.0 --quiet

print("Installation complete")

Installation complete


### Importing Libraries

- transformers: Hugging Face library for BERT and other models
- datasets: Easy loading of NLP datasets
- peft: Parameter-Efficient Fine-Tuning (includes LoRA)
- accelerate: Handles device placement and mixed precision during training (used by `Trainer`)
- torch: PyTorch (deep learning framework)
- evaluate & seqeval: Compute precision, recall, F1, and accuracy for NER

In [2]:
import torch
from transformers import (
    AutoModelForTokenClassification,  # Pre-trained BERT for NER
    AutoTokenizer,                     # Converts text to numbers
    TrainingArguments,                 # Settings for training
    Trainer,                          # Handles the training loop
    DataCollatorForTokenClassification # Prepares batches of data
)
from peft import LoraConfig, get_peft_model, TaskType  # LoRA components
from datasets import load_dataset  # Download datasets
import evaluate                     # Calculate metrics
import numpy as np                  # Math operations

print("Libraries imported successfully")

Libraries imported successfully


In [3]:
import transformers

### Check GPU Availability

- Checks whether a CUDA GPU is available and falls back to the CPU otherwise

In [4]:
# Check if GPU is available (Colab usually has one)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


### Load Dataset

In [5]:
data = load_dataset("tner/bc5cdr")

In [6]:
print(data)

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})


In [8]:
train_data= data["train"]
test_data = data["test"]
validation_data = data["validation"]

print(f"   Training data: {len(train_data)}")
print(f"   Testing data: {len(test_data)}")
print(f"   Validation data: {len(validation_data)}")

   Training data: 5228
   Testing data: 5865
   Validation data: 5330


One example sentence each from the train and test splits, with its NER tags

In [9]:
print(data["train"][0])
print(data["test"][0])

{'tokens': ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.'], 'tags': [1, 0, 0, 0, 0, 0, 1, 0]}
{'tokens': ['Famotidine', '-', 'associated', 'delirium', '.'], 'tags': [1, 0, 0, 2, 0]}
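
To make the integer tags in the two examples above easier to read, here is a minimal sketch that groups BIO tags into entity spans. The helper name `decode_entities` is hypothetical (it is not part of the dataset or of any library), and the id-to-label mapping is the one listed in this notebook.

```python
# Tag ids as listed in this notebook
ID_TO_LABEL = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Chemical", 4: "I-Disease"}

def decode_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tags):
        label = ID_TO_LABEL[tag_id]
        if label.startswith("B-"):
            # A B- tag always starts a new entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

train_entities = decode_entities(
    ["Naloxone", "reverses", "the", "antihypertensive", "effect", "of", "clonidine", "."],
    [1, 0, 0, 0, 0, 0, 1, 0],
)
test_entities = decode_entities(
    ["Famotidine", "-", "associated", "delirium", "."],
    [1, 0, 0, 2, 0],
)
print(train_entities)  # [('Chemical', 'Naloxone'), ('Chemical', 'clonidine')]
print(test_entities)   # [('Chemical', 'Famotidine'), ('Disease', 'delirium')]
```

Stray `I-` tags are simply dropped in this sketch; stricter or more lenient handling is a design choice.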


Labels/NER Tags

- Entity Types: Chemical, Disease

- "O": 0
- "B-Chemical": 1
- "B-Disease": 2
- "I-Chemical": 3
- "I-Disease": 4


In [10]:
label_list = {
    "O": 0,
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Chemical": 3,
    "I-Disease": 4,
}

print("Entity types:", label_list)

Entity types: {'O': 0, 'B-Chemical': 1, 'B-Disease': 2, 'I-Chemical': 3, 'I-Disease': 4}


In [11]:
for i, label in enumerate(label_list):
    print(f"   {i}: {label}")

# Create mappings between numbers and labels (using the ids stored in label_list)
id_to_label = {i: label for label, i in label_list.items()}
label_to_id = dict(label_list)

num_labels = len(label_list)
print(f"\n Total labels: {num_labels}")
print(f"\n{id_to_label}")
print(f"{label_to_id}")

   0: O
   1: B-Chemical
   2: B-Disease
   3: I-Chemical
   4: I-Disease

 Total labels: 5

{0: 'O', 1: 'B-Chemical', 2: 'B-Disease', 3: 'I-Chemical', 4: 'I-Disease'}
{'O': 0, 'B-Chemical': 1, 'B-Disease': 2, 'I-Chemical': 3, 'I-Disease': 4}


## Load BERT Model

- Downloads a pre-trained BERT model
- Adds a classification "head" on top for NER
- Moves model to GPU for faster processing

In [12]:
model_name = "google-bert/bert-base-uncased"

print(f"Loading model: {model_name}")

Loading model: google-bert/bert-base-uncased


In [13]:
# Load tokenizer (converts words to numbers)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [14]:
# Load pre-trained model
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels ,
    id2label=id_to_label,
    label2id=label_to_id
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT architecture (note that BERT consists of encoder layers only)

In [15]:
print(model)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [16]:
# Move model to GPU
model = model.to(device)
print(f"Total parameters: {model.num_parameters():,}")

Total parameters: 108,895,493


## Apply LoRA

In [17]:
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"], # Apply LoRA to query and value matrices
    bias="none",
)
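
As a rough illustration of what this configuration means, the sketch below reproduces the LoRA update for a single linear layer in NumPy. It assumes the standard LoRA formulation (frozen weight `W`, trainable low-rank factors `A` and `B`, scaling `lora_alpha / r`) with `B` initialised to zero; bias and dropout are omitted, and the variable names are illustrative rather than PEFT internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 16                 # hidden size, LoRA rank, lora_alpha

W = rng.normal(size=(d, d))              # stands in for a frozen query/value weight
A = rng.normal(scale=0.01, size=(r, d))  # trainable down-projection (r x d)
B = np.zeros((d, r))                     # trainable up-projection (d x r), zero at start

def lora_linear(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T (bias and dropout omitted)."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))

# With B = 0 the adapted layer reproduces the frozen layer exactly
same_as_base = np.allclose(lora_linear(x, W, A, B, alpha, r), x @ W.T)

# After B changes, the layer behaves like one with weight W + (alpha / r) * B A
B_trained = rng.normal(scale=0.01, size=(d, r))
merged = W + (alpha / r) * (B_trained @ A)
matches_merged = np.allclose(lora_linear(x, W, A, B_trained, alpha, r), x @ merged.T)

extra_params = A.size + B.size           # parameters added per adapted layer
print(same_as_base, matches_merged, extra_params)
```

With `r=8`, each adapted 768x768 projection gains 12,288 trainable parameters while its original 589,824 weights stay frozen.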

In [18]:
# model.unload()


In [19]:
# Apply LoRA to the model
model = get_peft_model(model, lora_config)

LoRA Applied BERT Architecture

In [20]:
print(model)

PeftModelForTokenClassification(
  (base_model): LoraModel(
    (model): BertForTokenClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Li

In [21]:
# Show how many parameters will be trained
model.print_trainable_parameters()

trainable params: 298,757 || all params: 109,194,250 || trainable%: 0.2736
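
The printed counts can be reproduced by hand. The sketch below is a back-of-the-envelope check; it assumes that the trainable set is the LoRA factors on `query` and `value` in all 12 layers plus the 768-to-5 classification head, and that PEFT keeps a second, trainable copy of that head (which would explain why "all params" exceeds the earlier total by more than the LoRA factors alone). The size estimate assumes 4-byte float32 storage.

```python
hidden, r, layers = 768, 8, 12
targets_per_layer = 2                        # query and value
num_labels = 5

lora_per_module = r * hidden + hidden * r    # A (r x 768) + B (768 x r)
lora_total = lora_per_module * targets_per_layer * layers
classifier = hidden * num_labels + num_labels  # weight + bias of the NER head
trainable = lora_total + classifier

base_total = 108_895_493                     # printed before LoRA was applied
all_params = base_total + lora_total + classifier  # assumption: extra copy of the head

trainable_pct = round(100 * trainable / all_params, 4)
adapter_mb = trainable * 4 / 1e6             # float32 bytes, rough size of the trainable weights

print(lora_total, classifier, trainable)     # 294912 3845 298757
print(all_params, trainable_pct)             # 109194250 0.2736
print(round(adapter_mb, 2))                  # 1.2
```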


## Tokenize Data

In [22]:
def tokenize_and_align_labels(examples):

    # Tokenize the words
    tokenized_inputs = tokenizer(
        examples['tokens'],           # Input words
        truncation=True,              # Cut if too long
        is_split_into_words=True,     # Already split into words
        padding='max_length',         # Pad to same length
        max_length=128                # Max 128 tokens
    )

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            # Special tokens (CLS, SEP, PAD) get -100 (ignore)
            if word_idx is None:
                label_ids.append(-100)
            # First subword gets the real label
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # Additional subwords get -100 (ignore)
            else:
                label_ids.append(-100)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
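
The alignment rule can be checked without loading a tokenizer. The sketch below applies the same logic to a hand-written `word_ids` list; the subword split shown in the comment is hypothetical (the real WordPiece split may differ), and `align_labels` is an illustrative helper, not part of the notebook.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(ignore_index)            # [CLS], [SEP], [PAD]
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])   # first subword of a word
        else:
            aligned.append(ignore_index)            # later subwords of the same word
        previous = word_idx
    return aligned

# Hypothetical split: [CLS] fam ##oti ##dine - associated del ##iri ##um . [SEP] [PAD]
word_ids = [None, 0, 0, 0, 1, 2, 3, 3, 3, 4, None, None]
word_labels = [1, 0, 0, 2, 0]   # tags of 'Famotidine - associated delirium .'

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, -100, 0, 0, 2, -100, -100, 0, -100, -100]
```

Exactly one position per original word keeps a real label, so the number of scored positions equals the number of words.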

In [23]:
# Apply tokenization to all data
tokenized_datasets = data.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=data['train'].column_names
)
print("Tokenization complete")

Tokenization complete


In [24]:
print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5865
    })
})
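
The `-100` entries in the `labels` column matter because PyTorch's cross-entropy loss skips that index by default. The NumPy sketch below mimics that behaviour on toy logits; `masked_cross_entropy` is an illustrative stand-in, not the loss code the model actually runs.

```python
import numpy as np

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    keep = labels != ignore_index
    # Numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    kept = log_probs[keep]
    picked = kept[np.arange(kept.shape[0]), labels[keep]]
    return -picked.mean()

labels = np.array([-100, 1, -100, 0])    # two masked positions, two real labels
logits = np.zeros((4, 5))                # uniform scores over the 5 classes
loss = masked_cross_entropy(logits, labels)

# Changing the scores at a masked position leaves the loss untouched
noisy = logits.copy()
noisy[0] = [9.0, -3.0, 0.5, 2.0, -1.0]
loss_noisy = masked_cross_entropy(noisy, labels)

print(loss, loss_noisy)  # both equal ln(5)
```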


### Define Evaluation Metrics

In [25]:
# Load seqeval metric (standard for NER)
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)  # Get the highest probability class

    # Remove ignored tokens (-100)
    true_predictions = [
        [id_to_label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Calculate metrics
    results = seqeval.compute(predictions=true_predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

print("Metrics function ready!")

Metrics function ready!
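
Precision, recall, and F1 here are computed over whole entities, while accuracy is computed per token, so the two can diverge sharply. The toy sketch below illustrates the difference with a simplified span extractor; it is not seqeval's exact logic (for example, stray `I-` tags are ignored here), and the gold and predicted tags are made up.

```python
def extract_spans(tags):
    """Return a set of (type, start, end) spans from a list of BIO tag strings."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel closes a trailing entity
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                spans.add((kind, start, i))      # close the open entity
            if tag.startswith("B-"):
                start, kind = i, tag[2:]         # open a new entity
            else:
                start, kind = None, None
    return spans

gold = ["B-Disease", "I-Disease", "O", "O", "B-Chemical", "O", "O", "O", "O", "O"]
pred = ["B-Disease", "O",         "O", "O", "B-Chemical", "O", "O", "O", "O", "O"]

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
true_positives = len(gold_spans & pred_spans)    # spans must match exactly

precision = true_positives / len(pred_spans)
recall = true_positives / len(gold_spans)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(precision, recall, f1, accuracy)  # 0.5 0.5 0.5 0.9
```

One wrong token costs a whole entity, which is why token accuracy can sit well above entity-level F1.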


### Training Configuration

**What this does:**
- Sets the training hyperparameters
- learning_rate: Step size for each weight update
- batch_size: Number of examples processed together
- epochs: Number of full passes over the training set
- fp16: Mixed 16-bit precision (typically faster and lighter on GPU memory)

In [26]:
# Set up training parameters
training_args = TrainingArguments(
    output_dir="./lora-ner-results",        # Where to save model
    learning_rate=3e-4,                      # How fast to learn (higher for LoRA)
    per_device_train_batch_size=16,         # Process 16 examples at once
    per_device_eval_batch_size=16,          # Evaluate 16 at once
    num_train_epochs=3,                     # Train for 3 full passes
    weight_decay=0.01,                      # Prevent overfitting
    eval_strategy="epoch",                  # Evaluate after each epoch
    save_strategy="epoch",                  # Save after each epoch
    load_best_model_at_end=True,           # Keep the best model
    logging_steps=50,                       # Log every 50 steps
    fp16=True,                              # Use mixed precision (faster on GPU)
    report_to="none"                        # Don't send to tracking services
)

# Data collator (prepares batches)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
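
For a sense of scale, the settings above imply the following number of optimizer steps. This is a quick sketch that assumes a single GPU, no gradient accumulation, and that the last partial batch is kept.

```python
import math

train_examples = 5228      # size of the train split
batch_size = 16            # per_device_train_batch_size
epochs = 3                 # num_train_epochs
logging_steps = 50

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)  # 327 981 19
```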

### Create Trainer

In [27]:
# Trainer handles all the training logic
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # Newer transformers releases deprecate this in favor of processing_class=
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer ready!")

Trainer ready!


## Train

**What this does:**
- Runs the training loop
- Shows progress bar
- Updates model weights to minimize loss

**What you'll see** (illustrative format only; the numbers are placeholders, not results from this run):
```
Epoch 1/3
[████████████████] 88/88 [02:15<00:00, 1.54s/it]
Loss: 0.234

Epoch 2/3
[████████████████] 88/88 [02:12<00:00, 1.50s/it]
Loss: 0.112

Epoch 3/3
[████████████████] 88/88 [02:10<00:00, 1.48s/it]
Loss: 0.089
```

In [28]:
# Train the model
trainer.train()

print("\n Training complete!")

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.1444 | 0.114421 | 0.771285 | 0.78502 | 0.778092 | 0.959167 |
| 2 | 0.1173 | 0.104192 | 0.77909 | 0.8318 | 0.804583 | 0.962163 |
| 3 | 0.1033 | 0.100203 | 0.791478 | 0.8341 | 0.81223 | 0.963841 |



 Training complete!


### Evaluate

In [29]:
# Evaluate on validation set
results = trainer.evaluate()

print("\n Results:")
print(f"   Precision: {results['eval_precision']:.2%}")
print(f"   Recall: {results['eval_recall']:.2%}")
print(f"   F1-Score: {results['eval_f1']:.2%}")
print(f"   Accuracy: {results['eval_accuracy']:.2%}")


 Results:
   Precision: 79.15%
   Recall: 83.41%
   F1-Score: 81.22%
   Accuracy: 96.38%


### Save Model

What this does:
- Saves only the LoRA adapter weights, not the full BERT model
- Creates a small adapter file (about 1 MB)
- Saves the tokenizer files to the same directory

In [30]:
# Save the LoRA adapter (tiny file!)
model.save_pretrained("./lora-bert-ner")
tokenizer.save_pretrained("./lora-bert-ner")

print("Model saved!")

Model saved!


In [31]:
from google.colab import files
import shutil
shutil.make_archive('model', 'zip', './lora-bert-ner')
files.download('model.zip')
