# Fine-tuning BERT Models 

In this notebook, we will fine-tune a pre-trained transformer (TinyBERT) on a **token classification task**.  
Fine-tuning is a common technique that is useful for many NLP tasks, such as named entity recognition (NER) and question answering.

We will:
1. Load the required libraries and define the paths and hyperparameters.
2. Load and prepare the dataset (`train.txt`).
3. Load the pre-trained TinyBERT tokenizer.
4. Tokenize the data and align the labels with the tokens.
5. Define the model and set the training arguments.
6. Fine-tune the model, then evaluate and save it.



Select the **bert-env** kernel.

If you do not have the bert-env kernel, open a terminal on Nova and run the following commands:



conda create -n bert-env python=3.10 -y

conda activate bert-env

pip install -U transformers datasets accelerate evaluate scikit-learn

pip install ipykernel

pip install -U ipywidgets

Once that is done, register the environment as a Jupyter kernel:

python -m ipykernel install --user --name=bert-env --display-name "Python (bert-env)"




After this, you can select or change the kernel to bert-env.

We import all the required libraries for training. We use the Hugging Face **transformers** library; the documentation for transformers and PyTorch is a useful reference for the code below. The train and test files are from Professor Li's research group.


In [1]:
import os, re, json, random, numpy as np, torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    DataCollatorForTokenClassification, TrainingArguments, Trainer
)


We now define the **paths**: the training file, the output directory where the model will be saved, and the base model path. I have already uploaded TinyBERT to the class folder. Alternatively, you can download it directly from the Hugging Face site.

If you need to use another model in your work, you can upload it and reference its path in the script, or download it from Hugging Face.

The label mapping is specific to this training script. This exact mapping might not be required if you are using another train/test file.

TinyBERT is a lightweight model that is about 7x smaller and 9x faster than BERT, which makes it ideal for demonstration purposes. It is a student model distilled from BERT. You can read **TinyBERT: Distilling BERT for Natural Language Understanding (Findings of EMNLP 2020)** for reference.

In [2]:
# Paths and settings
DATA_PATH = "./train.txt"           # training file (already uploaded to the class folder)
OUTPUT_DIR = "./ner-model"          # local save path for the fine-tuned model
MODEL_NAME = "./tinybert-local"     # local TinyBERT (already uploaded to the class folder)

# Label mapping
id2label = {0: "O", 1: "B-Trait"}
label2id = {"O": 0, "B-Trait": 1}
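The two dictionaries must stay exact inverses of each other. One option, shown here as a minimal sketch with the same two labels, is to derive `label2id` from `id2label` instead of typing both by hand:

```python
# Same two labels as in the cell above.
id2label = {0: "O", 1: "B-Trait"}

# Derive the inverse mapping so the two dictionaries cannot drift apart.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
```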


We define the **hyperparameters** to train the model. 

**Hyperparameters are the settings that control how the model learns.**

We use fairly standard settings and keep the number of epochs low to reduce training time for the demonstration. You can change these settings as per your requirements.

**EPOCHS** is the number of full passes the model makes over the training set.

**TRAIN_BS** stands for training batch size. This is the number of examples the model processes before each weight update. Larger values usually mean faster training, but they need more memory.

**EVAL_BS** stands for evaluation batch size: the number of examples processed at once when evaluating on the validation set.

**MAX_LEN** is the maximum number of tokens per sentence considered for training. Longer sentences are truncated; shorter ones are padded to the longest sentence in their batch.

**LEARN_RATE** determines the step size used when updating the model weights, that is, how quickly the model learns. If it is too high, the model may not converge; if it is too low, the model may take a long time to converge.

**VAL_SPLIT** stands for validation split: the fraction of the data held out to validate model performance. The Hugging Face `Trainer` does not train on the validation set. In this notebook it is only used for the evaluation after training; if evaluation is enabled during training, it can also be used to keep the best-performing checkpoint.

The other parameters are specific to the environment or server we are running on.

Generally, the most important hyperparameters are the number of epochs, the batch size, the learning rate and the maximum length. Other hyperparameters, such as the dropout rate, can also be tuned.
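To see why the learning rate matters, here is a toy sketch that is unrelated to BERT itself: plain gradient descent on f(x) = x², whose gradient is 2x. The function `gradient_descent` and the three learning rates are made up for illustration only.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # update step scaled by the learning rate
    return x

small = gradient_descent(lr=0.001)  # too low: barely moves away from 1.0
good = gradient_descent(lr=0.1)     # reasonable: approaches the minimum at 0
large = gradient_descent(lr=1.1)    # too high: overshoots further on every step

print(abs(small), abs(good), abs(large))
```

The same trade-off applies to `LEARN_RATE`, although real loss surfaces are far less regular than this one-dimensional example.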



In [3]:
EPOCHS = 2
TRAIN_BS = 8
EVAL_BS = 8
MAX_LEN = 64
LEARN_RATE = 3e-5
VAL_SPLIT = 0.1
THREADS = max(1, (os.cpu_count() or 1) // 2)  # os.cpu_count() can return None

This cell forces CPU training, since we have limited resources. To train on a GPU instead, comment out the `os.environ` line.

In [4]:
# CPU only (remove/comment if you want GPU)
os.environ["CUDA_VISIBLE_DEVICES"] = ""
torch.set_num_threads(THREADS)

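We read our data in the **CoNLL** format, which is a widely used standard in NLP tasks. Each line holds a single word (token) followed by one or more whitespace-separated fields (typically tab-separated), the last of which is the label. Sentences are separated by an empty line. Each word is annotated with a label; the reader below expects the label names from `label2id` (`O` or `B-Trait`) and maps anything else to `0`.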
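As an illustration of the layout, the sketch below parses a small made-up sample in this format. The tokens and labels in `sample` are invented for the example and are not taken from `train.txt`; the label names follow the `O` / `B-Trait` mapping defined earlier.

```python
# A made-up two-sentence sample: one "token<TAB>label" pair per line,
# with a blank line between sentences.
sample = "Plant\tO\nheight\tB-Trait\nincreased\tO\n\nYield\tB-Trait\ndropped\tO\n"

sentences = []
current = []
for line in sample.splitlines():
    if not line.strip():
        # Blank line: the current sentence is complete.
        if current:
            sentences.append(current)
            current = []
        continue
    token, label = line.split("\t")
    current.append((token, label))
if current:
    sentences.append(current)

for sent in sentences:
    print(sent)
```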


In [5]:
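def read_conll(path):
    """Read a CoNLL-style file into parallel lists of token and label-id sequences."""
    tokens_all, labels_all = [], []
    sent_tokens, sent_labels = [], []
    with open(path, "r", encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line:
                # Blank line: close the current sentence, if any.
                if sent_tokens:
                    tokens_all.append(sent_tokens)
                    labels_all.append(sent_labels)
                    sent_tokens, sent_labels = [], []
                continue
            parts = re.split(r"\s+", line)
            if len(parts) >= 2:
                # First field is the token, last field is the label.
                tok, lab = parts[0], parts[-1]
                sent_tokens.append(tok)
                # Map label names to ints; unknown labels fall back to 0 ("O").
                sent_labels.append(label2id.get(lab, 0))
    # Flush the last sentence if the file does not end with a blank line.
    if sent_tokens:
        tokens_all.append(sent_tokens)
        labels_all.append(sent_labels)
    return tokens_all, labels_all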

tokens_all, labels_all = read_conll(DATA_PATH)
print(f"Loaded {len(tokens_all)} sentences")


Loaded 1480 sentences
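Since `read_conll` maps any label it does not recognize to `0` through `label2id.get(lab, 0)`, a quick sanity check is to count the raw label strings before mapping. The helper `count_raw_labels` below is hypothetical and not part of the original notebook; it is shown on made-up lines so that it runs without `train.txt`.

```python
from collections import Counter

label2id = {"O": 0, "B-Trait": 1}

def count_raw_labels(lines):
    """Count the last whitespace-separated field of every non-empty line."""
    counts = Counter()
    for raw in lines:
        parts = raw.split()
        if len(parts) >= 2:
            counts[parts[-1]] += 1
    return counts

# Made-up lines; with the real file, pass open(DATA_PATH, encoding="utf-8") instead.
lines = ["Plant O", "height B-Trait", "", "Yield 1", "dropped 0"]
counts = count_raw_labels(lines)
unknown = sorted(label for label in counts if label not in label2id)

print(counts)
print("Labels that would silently become 'O':", unknown)
```

If `unknown` is not empty for the real file, the keys of `label2id` probably need to be adjusted to the label names the file actually uses.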


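We shuffle the sentences with a fixed seed and hold out a fraction (`VAL_SPLIT`) of them as a validation set. Both splits are stored in a `DatasetDict`.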

In [6]:
def build_datasets(tokens_all, labels_all, val_split=0.1, seed=42):
    data = [{"tokens": t, "labels": l} for t, l in zip(tokens_all, labels_all)]
    random.Random(seed).shuffle(data)
    n_val = max(1, int(len(data) * val_split))
    return DatasetDict({
        "train": Dataset.from_list(data[n_val:]),
        "validation": Dataset.from_list(data[:n_val])
    })

ds = build_datasets(tokens_all, labels_all, VAL_SPLIT)
print(ds)


DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels'],
        num_rows: 1332
    })
    validation: Dataset({
        features: ['tokens', 'labels'],
        num_rows: 148
    })
})
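The split sizes printed above follow directly from the arithmetic in `build_datasets`. A small sketch that reproduces them, with plain integers standing in for sentences in the shuffle part:

```python
import random

n_sentences = 1480   # sentence count reported by read_conll above
val_split = 0.1

# Same rule as build_datasets: at least one validation example.
n_val = max(1, int(n_sentences * val_split))
n_train = n_sentences - n_val
print(n_train, n_val)

# A fixed seed makes the shuffle, and therefore the split, reproducible.
a = list(range(10))
b = list(range(10))
random.Random(42).shuffle(a)
random.Random(42).shuffle(b)
print(a == b)
```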


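We now tokenize the dataset and align the word-level labels with the resulting sub-tokens. BERT tokenizers may split one word into several sub-tokens, so only the first sub-token of each word keeps the word's label; special tokens and the remaining sub-tokens get the label `-100`, which the loss ignores. This is generic boilerplate code.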

In [7]:
def tokenize_and_align(ds, tokenizer, max_length):
    def _map(batch):
        enc = tokenizer(batch["tokens"], is_split_into_words=True,
                        truncation=True, max_length=max_length)
        aligned = []
        for i, labels in enumerate(batch["labels"]):
            word_ids = enc.word_ids(batch_index=i)
            prev = None
            out = []
            for wid in word_ids:
                if wid is None:
                    out.append(-100)
                elif wid != prev:
                    out.append(int(labels[wid]))
                else:
                    out.append(-100)
                prev = wid
            aligned.append(out)
        enc["labels"] = aligned
        return enc
    return ds.map(_map, batched=True, remove_columns=["tokens","labels"])

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
ds_tok = tokenize_and_align(ds, tokenizer, MAX_LEN)
print(ds_tok)


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1332
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 148
    })
})
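The alignment rule inside `_map` can be followed on a hand-written example. The `word_ids` list below is made up to mimic what a fast tokenizer returns (`None` for special tokens, and the same index repeated when a word is split into several sub-tokens); `align_labels` is a hypothetical standalone copy of the inner loop.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token and ignore everything else."""
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)       # special token such as [CLS] or [SEP]
        elif wid != prev:
            aligned.append(word_labels[wid])   # first sub-token of a word
        else:
            aligned.append(ignore_index)       # later sub-token of the same word
        prev = wid
    return aligned

# Three words, the second split into two sub-tokens, wrapped in special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```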


We now load the model. We use it for **token classification** and pass in our label mapping. The call to `gradient_checkpointing_enable()` trades some extra compute for lower memory use during training.

In [8]:
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id
)
model.gradient_checkpointing_enable()
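Positions labeled `-100` do not contribute to the training loss, because PyTorch's cross-entropy loss ignores that index by default. The pure-Python sketch below illustrates the idea on made-up probabilities; it is not the library implementation, and `masked_cross_entropy` is a hypothetical name.

```python
import math

def masked_cross_entropy(probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = [
        -math.log(p[label])
        for p, label in zip(probs, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)

# Per-token probabilities over the two classes (O, B-Trait); values are invented.
probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [-100, 0, 1, -100]   # first and last positions are ignored

loss = masked_cross_entropy(probs, labels)
print(round(loss, 4))
```

Only the two middle positions enter the average, so changing the probabilities at the ignored positions leaves the loss unchanged.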


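Next comes some boilerplate code for training. The data collator pads each batch to a common length, and `compute_metrics` reports token-level accuracy, precision, recall and F1 on the validation set, ignoring positions labeled `-100`. You can remove the `compute_metrics` part if you do not need it. The training arguments (`args`), including the learning rate and our other hyperparameters, are set in the cell after this one.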

In [9]:
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)


from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=-1).flatten()
    labels = p.label_ids.flatten()

    # drop ignored positions (-100): special tokens, non-first sub-tokens and padding
    valid = labels != -100
    preds = preds[valid]
    labels = labels[valid]

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
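Token-level accuracy can look perfect while precision, recall and F1 are all zero when the positive class is rare or absent. The sketch below recomputes the four metrics by hand on made-up labels to show this. `binary_metrics` is a hypothetical helper, and returning 0.0 for undefined ratios mirrors what scikit-learn reports alongside its warning.

```python
def binary_metrics(labels, preds, ignore_index=-100):
    """Accuracy, precision, recall and F1 for class 1, skipping ignored positions."""
    pairs = [(t, p) for t, p in zip(labels, preds) if t != ignore_index]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# No positive labels and no positive predictions among the kept positions.
all_negative = binary_metrics([-100, 0, 0, 0, -100], [1, 0, 0, 0, 0])
# One of two positives found, plus one false alarm.
mixed = binary_metrics([0, 1, 1, 0], [0, 1, 0, 1])
print(all_negative)
print(mixed)
```

For a dataset dominated by `O` tokens, F1 on the `B-Trait` class is therefore a more informative number than accuracy.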




We now build the **training arguments** (`args`), which collect the settings used by the trainer. Most of them are the hyperparameters we have already defined. The remaining ones control how often the loss is logged (`logging_steps`) and how many checkpoints are kept (`save_total_limit`); everything else is left at its default value.

In [10]:
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    learning_rate=LEARN_RATE,
    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=EVAL_BS,
    num_train_epochs=EPOCHS,
    logging_steps=50,
    save_total_limit=1,
)
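With these settings the number of optimizer steps can be worked out in advance, which also shows at which steps the loss will be logged. A small sketch, assuming the 1,332 training sentences reported earlier, a single device and no gradient accumulation:

```python
import math

n_train = 1332        # training sentences reported by the split above
train_bs = 8          # TRAIN_BS
epochs = 2            # EPOCHS
logging_steps = 50

steps_per_epoch = math.ceil(n_train / train_bs)   # the last batch is partial
total_steps = steps_per_epoch * epochs
logged = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps)
print(logged)
```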


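We define the **trainer** and pass it the model, arguments, data collator, tokenizer, datasets and metric function defined above. Newer `transformers` releases deprecate the `tokenizer` argument of `Trainer` in favor of `processing_class`, so you may see a warning here.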

In [11]:
trainer = Trainer(
    model=model, args=args,
    data_collator=collator, tokenizer=tokenizer,
    train_dataset=ds_tok["train"], eval_dataset=ds_tok["validation"],
    compute_metrics=compute_metrics,
)



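Finally, we are ready to **train** our model. The trainer logs the training loss every 50 steps (`logging_steps=50`); notice how the training loss decreases as training progresses. During evaluation, scikit-learn may warn that precision and recall are undefined, which happens when no validation token is labeled or predicted as `B-Trait`. This is most likely why the accuracy below is 1.0 while precision, recall and F1 are 0.0, so it is worth checking that the labels in `train.txt` match the keys of `label2id`.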

In [12]:
trainer.train()
eval_out = trainer.evaluate()
print("Eval:", eval_out)




| Step | Training Loss |
|------|---------------|
| 50   | 0.1035 |
| 100  | 0.0097 |
| 150  | 0.0078 |
| 200  | 0.0068 |
| 250  | 0.0061 |
| 300  | 0.0057 |




Eval: {'eval_loss': 0.004390090238302946, 'eval_accuracy': 1.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_runtime': 2.2419, 'eval_samples_per_second': 66.014, 'eval_steps_per_second': 8.475, 'epoch': 2.0}



The final step is to **save** the fine-tuned model in our output directory. The model has now been trained for token classification on the training text file. We can later use the saved model for inference or prediction on the test set.

In [13]:
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

with open(os.path.join(OUTPUT_DIR,"run_report.json"),"w") as f:
    json.dump({"metrics_eval": eval_out}, f, indent=2)

print(f"✅ Model saved to {OUTPUT_DIR}")


✅ Model saved to ./ner-model
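When the saved model is later used for prediction, its output is one label id per sub-token, which has to be mapped back to words. The sketch below shows that step only, on made-up `word_ids` and predictions. `subtokens_to_words` is a hypothetical helper that keeps the prediction of each word's first sub-token, matching how the labels were aligned for training.

```python
id2label = {0: "O", 1: "B-Trait"}

def subtokens_to_words(words, word_ids, pred_ids):
    """Return (word, label) pairs using the first sub-token prediction of each word."""
    result = []
    seen = set()
    for wid, pred in zip(word_ids, pred_ids):
        if wid is None or wid in seen:
            continue              # skip special tokens and later sub-tokens
        seen.add(wid)
        result.append((words[wid], id2label[pred]))
    return result

words = ["Plant", "height", "increased"]
word_ids = [None, 0, 1, 1, 2, None]      # invented tokenizer output
pred_ids = [0, 0, 1, 0, 0, 0]            # invented argmax of the model logits
tagged = subtokens_to_words(words, word_ids, pred_ids)
print(tagged)
```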


A few things to note. The model is very small, which I chose deliberately for demonstration purposes. Our training set is also small.

You can scale up by using the original BERT model or other variants such as RoBERTa or DistilBERT. You can also increase the size of the training dataset and tune hyperparameters such as the number of epochs and the batch size to improve performance.

<h1>THANK YOU</h1>