# NBAiLab - Finetuning and Evaluating a BERT model for NER and POS
<img src="https://raw.githubusercontent.com/NBAiLab/notram/master/images/nblogo_2.png">


In this notebook we will finetune the [NB-BERTbase Model](https://github.com/NBAiLab/notram) released by the National Library of Norway. This is a model trained on a large corpus (110GB) of Norwegian texts.

We will finetune this model on the [NorNE dataset](https://github.com/ltgoslo/norne) for Named Entity Recognition (NER) and Part of Speech (POS) tagging using the [Transformers Library by Huggingface](https://huggingface.co/transformers/). After training, the model should accept any text string (up to 512 tokens) and return POS or NER tags for it. This is useful for a number of NLP tasks, for instance extracting or removing names and places from a document. After training, we will save the model, evaluate it, and use it for predictions.

This notebook is intended for experimentation with the pre-release NoTram models from the National Library of Norway, and is made for educational purposes. If you just want to use the model, you can instead load one of our finetuned models.

## Before proceeding
Create a copy of this notebook by going to "File - Save a Copy in Drive"


# Fill in Parameters

In [14]:
model_training_data_path = "annotations/BIO/BIO_synthetic_spacebased.txt"
confidential_dataset_path = "annotations/BIO/BIO_manual_confidential.txt" # CHANGE THIS

# k-fold cross-validation
k = 5 


# Fill in Labels

In [15]:
# Define your label mapping
label_list = [
    "O",
    "B-PER", "I-PER",
    "B-PID", "I-PID",
    "B-LOC", "I-LOC",
    "B-ORG", "I-ORG",
    "B-OID", "I-OID",    
    "B-VEH", "I-VEH",
    "B-LIC", "I-LIC", 
    "B-MAIL", "I-MAIL",
    "B-PHO", "I-PHO",
    "B-USR", "I-USR",
    "B-FIN", "I-FIN",
    "B-ITM", "I-ITM"
]


label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}
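
The two dictionaries above are inverse lookups between BIO tag strings and integer ids. As a minimal sketch of how they are used, the snippet below encodes a tagged sentence and decodes it again. The shortened `demo_labels` list and the `demo_` names are hypothetical and only serve the illustration.

```python
# Hypothetical, shortened label list for illustration only.
demo_labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

demo_label2id = {label: i for i, label in enumerate(demo_labels)}
demo_id2label = {i: label for i, label in enumerate(demo_labels)}

# Encode a tagged sentence to integer ids and decode it back.
tags = ["B-PER", "I-PER", "O", "B-LOC"]
ids = [demo_label2id[t] for t in tags]
decoded = [demo_id2label[i] for i in ids]

# The entity type is whatever follows the "B-" / "I-" prefix.
entity_types = sorted({label.split("-", 1)[1] for label in demo_labels if label != "O"})

print(ids)
print(decoded)
print(entity_types)
```

Because the ids come from `enumerate`, reordering the label list changes every id, so a finetuned model should always be used with the same list it was trained with.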

# Install Dependencies and Define Helper Functions
You need to run the code below to import the required libraries and define some helper functions. Click "Show Code" if you later want to examine this part as well.

In [16]:
import logging
import os
import sys
import copy
from dataclasses import dataclass
from dataclasses import field
from typing import Optional

import numpy as np
import pandas as pd
from sklearn import model_selection
from datasets import Dataset, DatasetDict, ClassLabel, Sequence, concatenate_datasets
from sklearn.model_selection import KFold

import transformers
from datasets import load_dataset
from seqeval.metrics import accuracy_score
from seqeval.metrics import f1_score
from seqeval.metrics import precision_score
from seqeval.metrics import recall_score
from seqeval.metrics import classification_report
from transformers.training_args import TrainingArguments
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
    set_seed
)

# from google.colab import output
from IPython.display import Markdown
from IPython.display import display

# Helper Functions - Allows us to format output by Markdown
def printm(string):
    display(Markdown(string))

## Preprocessing the dataset
# Tokenize texts and align the labels with them.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        max_length=max_length,
        padding=padding,
        truncation=True,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(ner_label_to_id[label[word_idx]])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(ner_label_to_id[label[word_idx]] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

def get_true_pred_and_labels(pairs):
    predictions, labels = pairs
    predictions = np.argmax(predictions, axis=2)
    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return true_predictions, true_labels

# Metrics
def compute_metrics(pairs):
    

    # Remove ignored index (special tokens)
    true_predictions, true_labels = get_true_pred_and_labels(pairs)

    return {
        "accuracy_score": accuracy_score(true_labels, true_predictions),
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "f1_micro": f1_score(true_labels, true_predictions, average='micro'),
        "f1_macro": f1_score(true_labels, true_predictions, average='macro'),
        "f1_weighted": f1_score(true_labels, true_predictions, average='weighted'),
        "report": classification_report(true_labels, true_predictions, digits=4)
    }

def write_clsf_report_to_file(results, writer, k_idx, trainer_predictions, classification_reports_per_fold):
    # printm("**Eval results**")
    for key, value in results.items():
        # printm(f"{key} = {value}")
        if key == "report" or key == "eval_report":
            writer.write(f"{key}:\n{value}\n")
            test_true_pred, test_true_labels = get_true_pred_and_labels((trainer_predictions.predictions, trainer_predictions.label_ids))
            print("Get dict version of classification report...")
            report = classification_report(test_true_labels, test_true_pred, digits=4, output_dict=True)
            print("Add dict version of classification report into main dictionary...")
            classification_df = pd.DataFrame(report).transpose()
            classification_reports_per_fold[k_idx] = classification_df
        else:
            writer.write(f"{key} = {value}\n")

    return classification_reports_per_fold

def get_class_level_weighted_average(clsf_report_dict, dfs_index):
    new_df_list = dict()
    for df_name, report in clsf_report_dict.items():
        weighted_averages = pd.DataFrame(0, index=dfs_index, columns=['precision', 'recall','f1-score'])
        total_support = pd.Series(0, index=dfs_index)
        for df in report:
            weighted_averages['f1-score'] += df['f1-score'] * df['support']
            weighted_averages['precision'] += df['precision'] * df['support']
            weighted_averages['recall'] += df['recall'] * df['support']
            total_support += df['support']

        weighted_averages = weighted_averages.divide(total_support, axis=0)
        
        # break
        new_df = pd.DataFrame({
            'Weighted_Average_Precision': weighted_averages['precision'],
            'Weighted_Average_Recall': weighted_averages['recall'],
            'Weighted_Average_f1': weighted_averages['f1-score'],
            'Support': total_support
        })

        new_df_list[df_name] = new_df

    return new_df_list

def process_sentence(sentence):
    tokens = []
    ner_tags = []
    for line in sentence.split('\n'):
        if line.strip():  # Skip empty lines
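            try:
                token, tag = line.split()
                tag_id = label2id[tag]
            except (ValueError, KeyError):
                # Malformed line (not exactly "token tag") or unknown tag: skip it
                # before appending anything, so tokens and ner_tags stay aligned.
                print("======================== Skipping malformed line ==============")
                print(line)
                continue
            tokens.append(token)
            ner_tags.append(tag_id)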

    return ' '.join(tokens), tokens, ner_tags

def create_dataset_from_df(df):
    # Create a Dataset
    dataset = Dataset.from_pandas(df, preserve_index=False)
    return dataset

def create_df_from_file(file_path):
    # Open the file and read its contents
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read()

    # Process the data
    sentences = data.strip().split('\n\n')
    processed_data = [process_sentence(sentence) for sentence in sentences]

    # Create a DataFrame
    df = pd.DataFrame(processed_data, columns=['text', 'tokens', 'ner_tags'])

    return df

def fold_i_of_k(dataset, i, k):
    n = len(dataset)
    print("length of dataset: ", n)
    test = dataset.iloc[n*(i-1)//k:n*i//k]
    print("Test indices: ",n*(i-1)//k, n*i//k )
    train = dataset.loc[dataset.index.difference(test.index)]

    return train, test 

def check_directories_for_files(*directories):
    for directory in directories:
        if os.path.exists(directory):
            # List all items in the directory
            items = os.listdir(directory)
            
            # Check if there are any files in the directory
            has_files = any(os.path.isfile(os.path.join(directory, item)) for item in items)
            
            if has_files:
                raise ValueError(f"Error: The directory '{directory}' contains files. Please change directories for variables 'output_dir' and 'cache_dir'.")
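
The label alignment inside `tokenize_and_align_labels` can be hard to follow without a tokenizer at hand. The sketch below reproduces only that inner loop on hand-written `word_ids`, so it runs without downloading a model. The function name `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False, ignore_index=-100):
    """Map word-level label ids onto sub-word tokens.

    word_ids: one entry per token; None marks a special token such as [CLS].
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens are ignored by the loss function.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First sub-word token of a word gets the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining sub-word tokens: label them or ignore them.
            aligned.append(word_labels[word_idx] if label_all_tokens else ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical tokenization: the second word is split into two sub-word tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 5, 0]

first_only = align_labels(word_ids, word_labels)
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
print(first_only)
print(all_tokens)
```

With `label_all_tokens=False`, which is the default in the settings of this notebook, every word contributes exactly one labelled position to the loss and to the metrics.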


# Settings
Try running this with the default settings first. The default settings should give you a pretty good result. If you want training to go even faster, reduce the number of epochs. The first variables you should consider changing are the ones in the dropdown menus. Later you can also experiment with the other settings to get even better results.

In [26]:
#Model, Dataset, and Task
#@markdown Set the main model that the training should start from
model_name = "bert_data/nbailab-base-ner-scandi"

#@markdown ---
#@markdown Set the dataset for the task we are training on
task_name = "ner" #@param ["ner", "pos"]

#General
overwrite_cache = False  #@#param {type:"boolean"}
cache_dir = "./bert_output/scandi/v1/cache" #param {type:"string"}
output_dir = "./bert_output/scandi/v1" #param {type:"string"}
overwrite_output_dir = False #param {type:"boolean"}
seed = 42 #param {type:"number"}
set_seed(seed)

#Tokenizer
padding = False  #param ["False", "'max_length'"] {type: 'raw'}
max_length = 512 #param {type: "number"}
label_all_tokens = False #param {type:"boolean"}

# Training
#@markdown ---
#@markdown Set training parameters
per_device_train_batch_size = 8  #param {type: "integer"}
per_device_eval_batch_size = 8  #param {type: "integer"}
learning_rate = 3e-05  #@param {type: "number"}
weight_decay = 0.0  #param {type: "number"}
adam_beta1 = 0.9  #param {type: "number"}
adam_beta2 = 0.999  #param {type: "number"}
adam_epsilon = 1e-08  #param {type: "number"}
max_grad_norm = 1.0  #param {type: "number"}
num_train_epochs = 4.0  #@param {type: "number"}
num_warmup_steps = 750  #@param {type: "number"}
save_total_limit = 1  #param {type: "integer"}
load_best_model_at_end = True  #@param {type: "boolean"}
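
To get a feel for what `num_warmup_steps = 750` means relative to the whole run, the sketch below computes the learning rate at a few steps. It assumes the `Trainer` default of a linear schedule (linear warmup, then linear decay to zero), a single device, and no gradient accumulation. The training-set size of 6000 examples is hypothetical, and `linear_warmup_lr` is an illustrative helper, not a library function.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)


# Hypothetical training-set size; the other values mirror the settings cell.
num_examples = 6000
batch_size = 8
epochs = 4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

for step in (0, 375, 750, 1875, total_steps):
    print(step, linear_warmup_lr(step, 3e-05, 750, total_steps))
```

Under these assumptions the warmup covers the whole first epoch. With a much smaller training set, 750 warmup steps could exceed the total number of steps, in which case the peak learning rate is never reached.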

# Load the Dataset used for Finetuning
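The default setting is to use the NorNE dataset. This is currently the largest (and best) dataset with annotated POS/NER tags available. All sentences are tagged for both POS and NER. The dataset is available as a Huggingface dataset, so loading it is very easy. Note that the cell below does not download NorNE: it reads the BIO-formatted file at `model_training_data_path` and splits it into `k` folds for cross-validation.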
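
The helper `create_df_from_file` expects a plain-text BIO file: one `token tag` pair per line, with a blank line between sentences. The sketch below parses a tiny in-memory sample in the same way. The sample sentences and the function name `parse_bio` are hypothetical.

```python
# Hypothetical two-sentence sample in the expected BIO layout.
sample = """Jeg O
heter O
Kari B-PER
Nordmann I-PER

Hun O
bor O
i O
Oslo B-LOC
"""


def parse_bio(text):
    """Split BIO text into (tokens, tags) pairs, one pair per sentence."""
    sentences = []
    for block in text.strip().split("\n\n"):
        tokens, tags = [], []
        for line in block.split("\n"):
            if not line.strip():
                continue  # skip empty lines inside a block
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
        sentences.append((tokens, tags))
    return sentences


parsed = parse_bio(sample)
for tokens, tags in parsed:
    print(list(zip(tokens, tags)))
```

A token that itself contains whitespace would break the `line.split()` unpacking, which is why the helper in this notebook skips lines it cannot split into exactly two parts.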

In [None]:
dataset_df = create_df_from_file(model_training_data_path)
dataset_df = dataset_df.sample(frac=1).reset_index(drop=True)


# enumerate splits
k_datasets = dict()

for i in range(1, k+1):
    # print("i = ", i)
    fold_size = len(dataset_df) // k

    train_df, test_df = fold_i_of_k(dataset_df, i, k)

    
    train_dataset = create_dataset_from_df(train_df)
    test_dataset = create_dataset_from_df(test_df)

    # Create the final DatasetDict
    dataset_dict = DatasetDict({
        'train': train_dataset,     # This is our training set
        'test': test_dataset        # This is our test set
    })

    k_datasets[i] = dataset_dict
    

    

# ==============================================================================================
# train_path = 'annotations/BIO/train.txt'
# val_path = 'annotations/BIO/val.txt'
# test_path = 'annotations/BIO/test.txt'

# train_df = create_df_from_file(train_path)
# train_dataset = create_dataset_from_df(train_df)
# val_df = create_df_from_file(val_path)
# val_dataset = create_dataset_from_df(val_df)
# test_df = create_df_from_file(test_path)
# test_dataset = create_dataset_from_df(test_df)

# # Create the final DatasetDict
# dataset_dict = DatasetDict({
#     'train': train_dataset,     # This is our training set
#     'validation': val_dataset,  # This is our validation set
#     'test': test_dataset        # This is our test set
# })

# ==============================================================================================
# Define the features of the dataset
features = dataset_dict['train'].features.copy()
features['ner_tags'] = Sequence(ClassLabel(names=label_list))
 
# Print information about the dataset
print(dataset_dict)

# Print the first example
print("\nFirst example:")
print(dataset_dict['train'][1])


# print all k datasets
print(f"\n{k}-Fold Dataset:")
print(k_datasets)
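
The slicing arithmetic in `fold_i_of_k` can be checked in isolation. The sketch below computes the test-row range of every fold for a hypothetical dataset of 103 rows and confirms that the folds cover each row exactly once. `fold_bounds` is an illustrative helper that mirrors the expressions used in `fold_i_of_k`.

```python
def fold_bounds(n, i, k):
    """Return the [start, stop) row range of test fold i (1-based) out of k."""
    return n * (i - 1) // k, n * i // k


n, k_folds = 103, 5  # hypothetical dataset size, same k as in the parameters cell
bounds = [fold_bounds(n, i, k_folds) for i in range(1, k_folds + 1)]
sizes = [stop - start for start, stop in bounds]

print(bounds)
print(sizes)

# Every row index appears in exactly one test fold.
covered = [row for start, stop in bounds for row in range(start, stop)]
print(covered == list(range(n)))
```

When `n` is not divisible by `k`, the integer division spreads the remainder over the folds, so fold sizes differ by at most one row.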

In [None]:
# #Load the dataset
dataset = dataset_dict.copy()

#Getting some variables from the dataset
column_names = dataset["train"].column_names
features = dataset["train"].features
text_column_name = "tokens" if "tokens" in column_names else column_names[0]
label_column_name = (
    f"{task_name}_tags" if f"{task_name}_tags" in column_names else column_names[1]
)

num_labels = len(label2id)
ner_label_to_id = {i: i for i in range(len(label_list))}



#Look at the dataset
printm("### Quick Look at the Dataset")
print(dataset["train"].data.to_pandas()[[text_column_name, label_column_name]])


In [None]:
print(label2id)
print(id2label)

# Import Pretrained Model

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# tokenizer = AutoTokenizer.from_pretrained("bert-data/nb-bert-base-ner") # nb-bert-base-ner pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert_data/nbailab-base-ner-scandi")

# model = AutoModelForTokenClassification.from_pretrained("bert-data/nb-bert-base-ner") # nb-bert-base-ner pretrained model
model = AutoModelForTokenClassification.from_pretrained("bert_data/nbailab-base-ner-scandi")
model.gradient_checkpointing_enable()

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")
example = "Jeg heter Mohammid og bor i Lækkegaten 2."

ner_results = nlp(example)
print(ner_results)

# Start Training
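Training for the default 4 epochs should take around 10-15 minutes if you have access to a GPU. Note that the `TrainingArguments` in the training cell set `use_cpu=True`, which forces training on the CPU; set it to `False` to train on a GPU.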

In [None]:
# Abort early if output_dir or cache_dir already contain files from a previous run
try:
    check_directories_for_files(output_dir, cache_dir)

except ValueError as e:
    print(e)
    output_dir = ''
    sys.exit(1)

In [None]:
train_classification_reports_per_fold = dict()
eval_classification_reports_per_fold = dict()
#%%time

temp_cache_dir = copy.deepcopy(cache_dir)
temp_output_dir = copy.deepcopy(output_dir)
for k_idx, k_set in k_datasets.items():
    curr_cache_dir = temp_cache_dir + f"/v{k_idx}"
    print(f"current cache dir: {curr_cache_dir}")
    curr_output_dir = temp_output_dir + f"/v{k_idx}"
    print(f"current output dir: {curr_output_dir}")

    if not os.path.exists(curr_cache_dir):
        os.makedirs(curr_cache_dir)
    if not os.path.exists(curr_output_dir):
        os.makedirs(curr_output_dir)
        
    config = AutoConfig.from_pretrained(
        model_name,
        num_labels=num_labels,
        finetuning_task=task_name,
        cache_dir=curr_cache_dir,
        id2label=id2label,
        label2id=label2id
    )

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        cache_dir=curr_cache_dir,
        use_fast=True,
    )

    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        from_tf=bool(".ckpt" in model_name),
        config=config,
        cache_dir=curr_cache_dir,
        ignore_mismatched_sizes=True
    )

    data_collator = DataCollatorForTokenClassification(tokenizer)



    training_args = TrainingArguments(
        output_dir=curr_output_dir,
        overwrite_output_dir=overwrite_output_dir,
        do_train=True,
        do_eval=True,
        do_predict=True,
        eval_strategy="epoch",        # Evaluate at the end of every epoch
        eval_steps=500,               # Only used when eval_strategy="steps"
        save_strategy="epoch",        # Save a checkpoint at the end of every epoch
        save_steps=500,               # Only used when save_strategy="steps"
        metric_for_best_model="f1",   # Use F1 score to determine the best model
        greater_is_better=True,       # The higher the F1, the better
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        adam_beta1=adam_beta1,
        adam_beta2=adam_beta2,
        adam_epsilon=adam_epsilon,
        max_grad_norm=max_grad_norm,
        num_train_epochs=num_train_epochs,
        warmup_steps=num_warmup_steps,
        load_best_model_at_end=load_best_model_at_end,
        seed=seed,
        save_total_limit=save_total_limit,
        use_cpu=True
    )


    print(k_set)
    tokenized_datasets = DatasetDict()
    for split, dataset_elem in k_set.items():
        tokenized_datasets[split] = dataset_elem.map(
            tokenize_and_align_labels,
            batched=True,
            load_from_cache_file=not overwrite_cache,
            num_proc=os.cpu_count(),
        )
    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )


    
    train_result = trainer.train()
    trainer.save_model()  # Saves the tokenizer too for easy upload

    # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
    trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))

    ###########################

    # Make predictions on the training set
    train_predictions = trainer.predict(tokenized_datasets["train"])
    train_metrics = compute_metrics((train_predictions.predictions, train_predictions.label_ids))

    # Print Results
    output_train_file = os.path.join(training_args.output_dir, f"train_results_{k_idx}.txt")
    with open(output_train_file, "w") as writer:
        print("**Train results**")
        for key, value in train_result.metrics.items():
            # print(f"{key} = {value}")
            writer.write(f"{key} = {value}\n")
        
        # Write the computed metrics to the file
        writer.write("\n**Classification Report**\n")
        train_classification_reports_per_fold = write_clsf_report_to_file(train_metrics, writer, k_idx, train_predictions, train_classification_reports_per_fold)
        
    printm("**Evaluate**")
    results = trainer.evaluate()
    test_predictions = trainer.predict(tokenized_datasets["test"])

    output_eval_file = os.path.join(training_args.output_dir, f"eval_results_{k_idx}.txt")
    with open(output_eval_file, "w") as writer:
        eval_classification_reports_per_fold = write_clsf_report_to_file(results, writer, k_idx, test_predictions, eval_classification_reports_per_fold)



In [None]:
train_classification_reports_per_fold[1].head()

In [None]:
eval_classification_reports_per_fold[1].head()

## Calculate weighted average for precision/recall/f1 for TRAINING set and TEST set - Class-level

In [None]:
df_dict = {'train' : list(train_classification_reports_per_fold.values()), 
           'test' :list(eval_classification_reports_per_fold.values())}
dfs_index = train_classification_reports_per_fold[1].index

new_df_list = get_class_level_weighted_average(clsf_report_dict=df_dict, dfs_index=dfs_index)


In [None]:
print("Total number of entities in train: ", new_df_list['train']['Support'].sum())
print("Total number of entities in test: ",new_df_list['test']['Support'].sum())
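
The class-level numbers above are support-weighted averages over the folds, not plain means. The sketch below shows the difference for a single metric in plain Python. The per-fold scores and supports are made up for illustration, and `weighted_f1` is a hypothetical helper that mirrors what `get_class_level_weighted_average` does with DataFrames.

```python
# Hypothetical per-fold reports: F1 and support for two entity classes.
fold_reports = [
    {"PER": {"f1-score": 0.90, "support": 100}, "LOC": {"f1-score": 0.80, "support": 50}},
    {"PER": {"f1-score": 0.80, "support": 300}, "LOC": {"f1-score": 0.60, "support": 50}},
]


def weighted_f1(reports, label):
    """Average the F1 of `label` over folds, weighting each fold by its support."""
    total_support = sum(r[label]["support"] for r in reports)
    weighted_sum = sum(r[label]["f1-score"] * r[label]["support"] for r in reports)
    return weighted_sum / total_support


def mean_f1(reports, label):
    """Unweighted mean over folds, shown only for comparison."""
    return sum(r[label]["f1-score"] for r in reports) / len(reports)


for label in ("PER", "LOC"):
    print(label, round(weighted_f1(fold_reports, label), 4), round(mean_f1(fold_reports, label), 4))
```

The two averages agree when every fold has the same support for a class and diverge when a class is unevenly distributed over the folds.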

## Classification Report on TRAINING DATA

In [None]:
new_df_list['train'].to_csv(f"{output_dir}/train_clsf_report_weighted_avg.csv")
pd.read_csv(f"{output_dir}/train_clsf_report_weighted_avg.csv", index_col=0)

## Classification Report on TEST DATA

In [None]:
new_df_list['test'].to_csv(f"{output_dir}/test_clsf_report_weighted_avg.csv")
pd.read_csv(f"{output_dir}/test_clsf_report_weighted_avg.csv", index_col=0)

# Run Predictions on the Confidential Dataset

Final run to check whether the finetuned models also perform well on the confidential dataset. Run this only after you have finished tuning the hyperparameters.





In [None]:
dataset_df = create_df_from_file(confidential_dataset_path)
dataset_df = dataset_df.sample(frac=1).reset_index(drop=True) # shuffle - optional

dataset_df_obj = create_dataset_from_df(dataset_df)

# Create the final DatasetDict
dataset_conf = DatasetDict({
    'confidential': dataset_df_obj        # This is our test set
})
dataset_conf


In [28]:
column_names = dataset_conf["confidential"].column_names
features = dataset_conf["confidential"].features
text_column_name = "tokens" if "tokens" in column_names else column_names[0]
label_column_name = (
    f"{task_name}_tags" if f"{task_name}_tags" in column_names else column_names[1]
)

num_labels = len(label2id)
ner_label_to_id = {i: i for i in range(len(label_list))}

In [None]:
temp_cache_dir = copy.deepcopy(cache_dir)
temp_output_dir = copy.deepcopy(output_dir)

conf_classification_reports_per_fold = dict()
for k_idx in range(1, k+1): # just need the k indices for paths
    curr_cache_dir = temp_cache_dir + f"/v{k_idx}"
    print(f"current cache dir: {curr_cache_dir}")
    curr_output_dir = temp_output_dir + f"/v{k_idx}"
    print(f"current output dir: {curr_output_dir}")

    if not os.path.exists(curr_cache_dir):
        raise FileNotFoundError(f"Cache directory not found: {curr_cache_dir}")
    if not os.path.exists(curr_output_dir):
        raise FileNotFoundError(f"Output directory not found: {curr_output_dir}")
    
    model = AutoModelForTokenClassification.from_pretrained(curr_output_dir)
    tokenizer = AutoTokenizer.from_pretrained(curr_output_dir)
    
    tokenized_datasets_conf = DatasetDict()
    for split, dataset_elem in dataset_conf.items():
        tokenized_datasets_conf[split] = dataset_elem.map(
            tokenize_and_align_labels, # this calls tokenizer() so needs to be updated with the new tokenizer in every iteration
            batched=True,
            load_from_cache_file=not overwrite_cache,
            num_proc=os.cpu_count(),
        )

    data_collator = DataCollatorForTokenClassification(tokenizer)

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        eval_dataset=tokenized_datasets_conf["confidential"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        args=TrainingArguments(use_cpu=True, output_dir=curr_output_dir)
        
    )
    printm("**Test**")
    results = trainer.evaluate()
    test_predictions = trainer.predict(tokenized_datasets_conf["confidential"])

    output_eval_file = os.path.join(curr_output_dir, f"confidential_set_test_results_{k_idx}.txt")
    with open(output_eval_file, "w") as writer:
        conf_classification_reports_per_fold = write_clsf_report_to_file(results, writer, k_idx, test_predictions, conf_classification_reports_per_fold)


    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(np.argmax(test_predictions.predictions, axis=2), test_predictions.label_ids)
    ]

    # Save predictions
    output_test_predictions_file = os.path.join(curr_output_dir, "confidential_set_test_predictions.txt")
    with open(output_test_predictions_file, "w") as writer:
        for prediction in true_predictions:
            writer.write(" ".join(prediction) + "\n")
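
Each line of the predictions file holds the predicted tags for one sentence, with the ignored positions removed. As a minimal sketch of that decoding step, the snippet below takes the arg-max over hand-written logits and drops every position whose gold label is `-100`. The logits, the shortened label list, and the function name `decode` are hypothetical.

```python
IGNORE_INDEX = -100
demo_labels = ["O", "B-PER", "I-PER"]  # hypothetical, shortened label list

# Hypothetical logits for one sentence with four token positions.
logits = [
    [2.0, 0.1, 0.1],  # special token, ignored
    [0.2, 3.0, 0.1],  # first sub-word token of a name
    [0.1, 0.3, 2.5],  # second sub-word token of the same word, ignored
    [1.5, 0.2, 0.1],  # ordinary word
]
label_ids = [IGNORE_INDEX, 1, IGNORE_INDEX, 0]


def decode(token_logits, gold_ids, names):
    """Arg-max each position, then keep only positions with a real gold label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in token_logits]
    return [names[p] for p, gold in zip(predicted, gold_ids) if gold != IGNORE_INDEX]


decoded = decode(logits, label_ids, demo_labels)
print(" ".join(decoded))
```

Since only the first sub-word token of each word carries a label when `label_all_tokens` is `False`, the decoded sequence has one tag per original word.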


## Calculate class-level weighted average for precision/recall/f1 for CONFIDENTIAL DATA

In [None]:
df_dict = {'conf' :list(conf_classification_reports_per_fold.values())}
dfs_index = conf_classification_reports_per_fold[1].index

conf_df_list = get_class_level_weighted_average(clsf_report_dict=df_dict, dfs_index=dfs_index)
conf_df_list


## Classification Report for CONFIDENTIAL DATA

In [None]:
conf_df_list['conf'].to_csv(f"{output_dir}/confidential_clsf_report_weighted_avg.csv")
pd.read_csv(f"{output_dir}/confidential_clsf_report_weighted_avg.csv", index_col=0)

##### Copyright 2020 &copy; National Library of Norway