<a href="https://colab.research.google.com/github/manoushpajouh/Python-LLMs/blob/main/Named_Entity_Recognition_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Program 6: Named Entity Recognition

This notebook implements a named entity recognition (NER) system. NER consists of finding entities in a text by deciding, for each word, whether it is part of a named entity, which makes it in essence a sequence labeling task.

I'll be working with the [CoNLL 2003 dataset](https://paperswithcode.com/dataset/conll-2003). It contains sentences extracted from the Reuters corpus and annotated in the IOB format, which tags each token (word) as being [I]nside, [O]utside, or at the [B]eginning of a named entity. The O tag is used for words that are not entities, B marks the first word of an entity, and I marks every subsequent word of that same entity. This makes it possible to extract multiword entities with clearly separated boundaries.

For example, the sentence

> The European Commission said on Thursday it disagreed with German advice [...]

would be tagged as

> O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O, [...]

Note that the IOB format lets us know that "European Commission" is one entity instead of two separate ones.
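
To make the grouping concrete, here is a minimal sketch of how IOB tags can be turned back into entity spans. The function name `extract_entities` is my own and is not part of any library. In this sketch, a stray `I-` tag that does not continue an open entity is simply dropped.

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans based on their IOB tags."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # An O tag (or a stray I- tag) closes the open entity, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["The", "European", "Commission", "said", "on", "Thursday",
          "it", "disagreed", "with", "German", "advice"]
tags = ["O", "B-ORG", "I-ORG", "O", "O", "O", "O", "O", "O", "B-MISC", "O"]

entities = extract_entities(tokens, tags)
print(entities)  # [('ORG', 'European Commission'), ('MISC', 'German')]
```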

### Pretraining and fine-tuning

Instead of training a model from scratch (which would require a large amount of resources), I'll take what we call the pretraining-and-finetuning approach. This means that I will use a base model pretrained on a general task (e.g. next word prediction) and fine-tune it on NER. This approach saves resources while still achieving strong results.

I am using libraries developed by [HuggingFace](https://huggingface.co/). HuggingFace is a company that provides deep learning researchers and practitioners with a plethora of open-source resources from datasets to pretrained models to Python libraries.

# Objectives

The objectives of this model are the following:

1. Familiarize myself with the `transformers` library for training and using Transformer models
2. Fine-tune a pretrained model for NER
3. Compare different pretrained models

# Implementation

In [1]:
# install required libraries
! pip install torch transformers[torch] datasets


In [2]:
# load the dataset that we're going to use
from datasets import load_dataset
dataset = load_dataset("conll2003", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [3]:
# inspect the dataset
example = dataset["train"][0]
print(example)

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
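
The `ner_tags` in the printed example are integer ids. As a small illustration, they can be decoded with the list of label names that appears in the evaluation cell later in this notebook. The tokens and ids below are copied from the printed example.

```python
# Label names in id order, copied from the evaluation cell later in this notebook.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Values copied from the printed training example.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Map each integer id to its IOB label.
decoded = [label_names[tag_id] for tag_id in ner_tags]
for token, tag in zip(tokens, decoded):
    print(f"{token:10s} {tag}")
```

So "EU" is tagged as an organization, and "German" and "British" are tagged as miscellaneous entities.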


Below, I am importing the two most important classes in this program: the model class and the tokenizer class. In the `transformers` library, each model architecture has its own model class and its own tokenizer class, since individual architectures may use different approaches to tokenization (e.g. a different vocabulary or vocabulary size).
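
To give a feel for what such a tokenizer does, here is a toy sketch of greedy longest-match subword splitting in the style of WordPiece, which BERT tokenizers use. The vocabulary `toy_vocab` and the function `wordpiece` are made up for illustration. The real `bert-base-uncased` vocabulary is much larger and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"boy", "##cott", "lamb", "##s", "eu"}

words = ["boycott", "lambs"]
tokens, word_ids = [], []
for idx, word in enumerate(words):
    pieces = wordpiece(word, toy_vocab)
    tokens.extend(pieces)
    word_ids.extend([idx] * len(pieces))  # remember which word each piece came from

print(tokens)    # ['boy', '##cott', 'lamb', '##s']
print(word_ids)  # [0, 0, 1, 1]
```

Because one word can become several tokens, the word-level NER tags have to be re-aligned to the tokens before training.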

In the first part, I am going to use the [BERT](https://arxiv.org/abs/1810.04805) pretrained model, one of the most popular encoder-only architectures. Specifically, I want to use a BERT instance adapted for sequence labeling (also known as token classification).

In [4]:
from transformers import BertTokenizerFast, BertForTokenClassification # import the appropriate model class

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

Next, I am going to load pretrained weights into the tokenizer and model. This is easily done by calling the `.from_pretrained()` method that downloads the data from HuggingFace. The parameter to that method is the name of the specific model (not model architecture) you want to use.

Here, we want to use the `bert-base-uncased` model.

The `from_pretrained()` method on the model additionally takes a `num_labels` keyword argument. This is the number of possible labels (tags) for each word.
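
As a quick sanity check on the value of `num_labels`, the sketch below rebuilds the label set from the four CoNLL 2003 entity types: one `O` label plus a `B-` and an `I-` label per type. The variable names are my own.

```python
entity_types = ["PER", "ORG", "LOC", "MISC"]

# One "outside" label, plus a B- and an I- label for every entity type.
label_names = ["O"] + [f"{prefix}-{etype}"
                       for etype in entity_types
                       for prefix in ("B", "I")]

# Lookup tables between integer ids and label strings.
id2label = dict(enumerate(label_names))
label2id = {label: idx for idx, label in id2label.items()}

print(len(label_names))  # 9
print(label_names)
```

With this ordering, every `B-` label has an odd id and the matching `I-` label is the next even id, which is the property the label alignment function in this notebook relies on.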

In [5]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels = 9) # call .from_pretrained on the model class you imported

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, I am going to define a function that converts a dataset row (see the example above) into model inputs.

Specifically, it calls the tokenizer with the appropriate parameters and returns the NER tags of each example, aligned to the subword tokens, as the labels (targets).

In [6]:
# utility function to align labels to tokens after tokenization -- don't change this
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels


def tokenize_and_preprocess(example):
    tokens = tokenizer(example['tokens'],
                       truncation=True,
                       padding='max_length',
                       max_length=128,
                       return_tensors="pt",
                       is_split_into_words = True,) # fill in the parameters
    labels = example['ner_tags'] # NER tags (dataset already annotated just extract them)
    # align the labels
    new_labels = align_labels_with_tokens(labels, tokens.word_ids())
    # pad the labels to the input length; -100 is ignored by the loss
    # (word_ids() already covers the padded positions, so this is normally a no-op)
    new_labels = new_labels + [-100] * (128 - len(new_labels))
    return {'input_ids': tokens['input_ids'].squeeze(), 'attention_mask': tokens['attention_mask'].squeeze(), 'labels': new_labels}
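
To illustrate what the alignment does, here is a small worked example. The function is copied from the cell above so the snippet runs on its own. The `word_ids` lists are hypothetical and written by hand rather than produced by the real tokenizer.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First token of a new word (or a special token after a word)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token such as [CLS], [SEP] or padding
            new_labels.append(-100)
        else:
            # Later token of the same word: B-XXX (odd id) becomes I-XXX
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical case 1: three words, the second (tagged O) split into two tokens.
word_ids = [None, 0, 1, 1, 2, None]
labels = [3, 0, 7]           # B-ORG, O, B-MISC
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)               # [-100, 3, 0, 0, 7, -100]

# Hypothetical case 2: the B-ORG word itself is split into two tokens.
word_ids_split = [None, 0, 0, 1, None]
labels_split = [3, 0]        # B-ORG, O
aligned_split = align_labels_with_tokens(labels_split, word_ids_split)
print(aligned_split)         # [-100, 3, 4, 0, -100]
```

In the second case the continuation token receives id 4 (`I-ORG`), so the entity still has exactly one `B-` token.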


In [7]:
# apply the function on your data
train_dataset = dataset["train"].map(tokenize_and_preprocess)
eval_dataset = dataset["validation"].map(tokenize_and_preprocess)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Now, I am going to use the `Trainer` class from the `transformers` library to fine-tune the model. Note that this is different from the PyTorch Lightning `Trainer`.

First, I will define training arguments.

In [8]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',           # output directory
    num_train_epochs = 1,
    learning_rate= 0.00003,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    report_to="none"
)

Below is the code needed for the evaluation of the model.

In [9]:
! pip install seqeval evaluate


In [10]:
import evaluate
import numpy as np


metric = evaluate.load("seqeval")
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]
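
The precision, recall and F1 returned by `compute_metrics` are computed over whole entities, while the accuracy is computed over individual tokens. The sketch below is my own simplified illustration of that difference in plain Python. It is not seqeval's implementation, and it treats malformed tag sequences (such as a stray `I-` tag) differently from seqeval.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            found.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found


def entity_prf(references, predictions):
    """Micro-averaged entity-level precision, recall and F1 (exact span match)."""
    tp = n_pred = n_ref = 0
    for ref, pred in zip(references, predictions):
        ref_spans, pred_spans = spans(ref), spans(pred)
        tp += len(ref_spans & pred_spans)
        n_pred += len(pred_spans)
        n_ref += len(ref_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


references = [["B-ORG", "I-ORG", "O", "B-MISC"]]
predictions = [["B-ORG", "O", "O", "B-MISC"]]

precision, recall, f1 = entity_prf(references, predictions)
token_accuracy = sum(r == p for r, p in zip(references[0], predictions[0])) / len(references[0])

print(precision, recall, f1)  # 0.5 0.5 0.5
print(token_accuracy)         # 0.75
```

One wrong token breaks the whole `ORG` entity, so the entity-level scores drop to 0.5 even though three of four tokens are correct. This is why the accuracy reported for a model is typically higher than its F1.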

Next, define the `Trainer`.

In [11]:
trainer = Trainer(
    model=model,                            # the instantiated Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=train_dataset,            # the training dataset
    eval_dataset=eval_dataset,              # the evaluation dataset
    compute_metrics=compute_metrics,        # defines how to evaluate the model
)

Now, train the model by calling the `train()` method on the trainer.

In [12]:
trainer.train()

Step,Training Loss
500,0.4144
1000,0.225
1500,0.1804
2000,0.17
2500,0.1737
3000,0.141
3500,0.1347
4000,0.1947
4500,0.1406
5000,0.1222


TrainOutput(global_step=14041, training_loss=0.12853343254497152, metrics={'train_runtime': 1244.0308, 'train_samples_per_second': 11.287, 'train_steps_per_second': 11.287, 'total_flos': 917274987848448.0, 'train_loss': 0.12853343254497152, 'epoch': 1.0})
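
The `global_step=14041` in the output above follows from the training arguments: with a batch size of 1, every training example is its own optimizer step. A rough sketch of the arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch is kept (the helper name `total_steps` is my own):

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, keeping the last partial batch of each epoch."""
    return math.ceil(num_examples / batch_size) * num_epochs


num_train_examples = 14041   # size of the CoNLL 2003 training split
train_runtime = 1244.0308    # seconds, from the TrainOutput above

steps = total_steps(num_train_examples, batch_size=1, num_epochs=1)
print(steps)                             # 14041
print(round(steps / train_runtime, 3))   # 11.287 steps per second

# For comparison, a hypothetical batch size of 16 would need far fewer steps.
print(total_steps(num_train_examples, batch_size=16, num_epochs=1))  # 878
```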

Now, evaluate the model on the validation set by calling the `evaluate()` method on the trainer:

In [13]:
# evaluate the model
evaluation_results = trainer.evaluate()
print(evaluation_results)

{'eval_loss': 0.07906228303909302, 'eval_precision': 0.9288925895087428, 'eval_recall': 0.9392153561205591, 'eval_f1': 0.934025452109846, 'eval_accuracy': 0.9852352193261285, 'eval_runtime': 37.1757, 'eval_samples_per_second': 87.423, 'eval_steps_per_second': 87.423, 'epoch': 1.0}
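
As a quick check on how the reported numbers relate, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the values printed above.

```python
# Values copied from the evaluation output above.
eval_precision = 0.9288925895087428
eval_recall = 0.9392153561205591
eval_f1 = 0.934025452109846

# F1 is the harmonic mean of precision and recall.
recomputed_f1 = 2 * eval_precision * eval_recall / (eval_precision + eval_recall)

print(round(recomputed_f1, 4))  # 0.934
print(abs(recomputed_f1 - eval_f1) < 1e-6)
```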


## Different pretrained models

Now, I repeat the training process with a different pretrained model.

In [14]:
! pip install torch transformers[torch] datasets
! pip install seqeval

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from transformers import Trainer, TrainingArguments


tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# utility function to align labels to tokens after tokenization -- don't change this
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels


def tokenize_and_preprocess(example):
    tokens = tokenizer(example['tokens'],
                       truncation=True,
                       padding='max_length',
                       max_length=128,
                       return_tensors="pt",
                       is_split_into_words = True,) #  fill in the parameters
    labels = example['ner_tags'] # NER tags (dataset already annotated just extract them)
    # align the labels
    new_labels = align_labels_with_tokens(labels, tokens.word_ids())
    # pad the labels to the input length; -100 is ignored by the loss
    # (word_ids() already covers the padded positions, so this is normally a no-op)
    new_labels = new_labels + [-100] * (128 - len(new_labels))
    return {'input_ids': tokens['input_ids'].squeeze(), 'attention_mask': tokens['attention_mask'].squeeze(), 'labels': new_labels}

# apply the function on your data
train_dataset = dataset["train"].map(tokenize_and_preprocess)
eval_dataset = dataset["validation"].map(tokenize_and_preprocess)


import evaluate
import numpy as np


metric = evaluate.load("seqeval")
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir='./results',           # output directory
    num_train_epochs = 1,
    learning_rate= 0.00003,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    report_to="none"
)

trainer2 = Trainer(
    model=model,                            # the instantiated Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=train_dataset,            # the training dataset
    eval_dataset=eval_dataset,              # the evaluation dataset
    compute_metrics=compute_metrics,        # defines how to evaluate the model
)

trainer2.train()




tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Step,Training Loss
500,0.1688
1000,0.0588
1500,0.0362
2000,0.0736
2500,0.0527
3000,0.0597
3500,0.0803
4000,0.0713
4500,0.0741
5000,0.0617


TrainOutput(global_step=14041, training_loss=0.051142128824245285, metrics={'train_runtime': 1241.1359, 'train_samples_per_second': 11.313, 'train_steps_per_second': 11.313, 'total_flos': 917274987848448.0, 'train_loss': 0.051142128824245285, 'epoch': 1.0})

In [15]:
# evaluate the model
evaluation_results = trainer2.evaluate()
print(evaluation_results)

{'eval_loss': 0.09772291779518127, 'eval_precision': 0.9267441860465117, 'eval_recall': 0.9398584905660378, 'eval_f1': 0.9332552693208431, 'eval_accuracy': 0.983370647498969, 'eval_runtime': 55.5129, 'eval_samples_per_second': 58.545, 'eval_steps_per_second': 58.545, 'epoch': 1.0}




I compared `bert-base-uncased` and `dslim/bert-base-NER` (I asked in class and was given permission to use only one additional model instead of two). Both have roughly 110M parameters. `bert-base-NER` is case-sensitive, whereas `bert-base-uncased` is not, so it can distinguish between upper- and lower-case letters. `bert-base-uncased` is a general-purpose model, while `bert-base-NER` has already been fine-tuned specifically for named entity recognition. Because capitalization is a strong cue for proper nouns, I expected `bert-base-NER` to perform better on entity recognition.

In practice, the second model did slightly worse: an F1 of 0.9333 versus 0.9340, with a higher evaluation loss (0.0977 versus 0.0791). I think this may be due to the data it was originally trained on, for example any bias or lack of diversity in that data.
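
To put the gap between the two runs into numbers, the sketch below lines up the evaluation metrics printed by the two evaluation cells and computes the differences. The values are copied from those outputs and the dictionary layout is my own.

```python
# Evaluation metrics copied from the two evaluation outputs in this notebook.
results = {
    "bert-base-uncased": {
        "loss": 0.07906228303909302,
        "precision": 0.9288925895087428,
        "recall": 0.9392153561205591,
        "f1": 0.934025452109846,
        "accuracy": 0.9852352193261285,
    },
    "dslim/bert-base-NER": {
        "loss": 0.09772291779518127,
        "precision": 0.9267441860465117,
        "recall": 0.9398584905660378,
        "f1": 0.9332552693208431,
        "accuracy": 0.983370647498969,
    },
}

first = results["bert-base-uncased"]
second = results["dslim/bert-base-NER"]

# Positive delta means the first model scored higher on that metric.
deltas = {name: first[name] - second[name] for name in first}

for name, delta in deltas.items():
    print(f"{name:10s} {first[name]:.4f}  {second[name]:.4f}  {delta:+.4f}")
```

The F1 gap is under 0.001, and the second model actually has slightly higher recall. With a single run per model, a difference this small could plausibly come from run-to-run variation.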
