# Named Entity Recognition with BERT

In this article, we will demonstrate how to perform Named Entity Recognition (NER) using BERT. We will train a BERT model on the CONLL2003 dataset in 10 steps.

Named Entity Recognition (NER) is a common task in Natural Language Processing that extracts relevant information from text. It involves training a system to identify and categorize entities in a text by tagging them with pre-defined tags. In this example the tags are categories such as ‘Person’ or ‘Location’, since they come from the CONLL2003 dataset.

We will take a pre-trained BERT model from the Hugging Face Hub and fine-tune it to perform Named Entity Recognition (NER) on a specific dataset.

Compared to traditional NLP pipelines, Transformer architectures offer a more comprehensive, end-to-end approach. Models can be trained from scratch or fine-tuned, as in our example.

The Hugging Face Transformers library:
* Provides state-of-the-art machine learning models such as BERT, GPT-2, and T5. It is used for tasks such as text classification, information extraction, and summarization.

* Functions as a unified high-level API for AI models. Its interface is designed to be compatible with both PyTorch and TensorFlow, two of the most popular deep learning libraries.

With Hugging Face Transformers, you can easily access pre-trained models and switch between them with minimal effort. Models can optionally be pushed to the Hugging Face Hub for sharing and collaboration; alternatively, they can be downloaded and used locally.

To follow along, you need:

* CONLL2003, the dataset used for fine-tuning the model.
* A Python notebook in Google Colab (with GPU enabled).

The training process involves the following steps:

1. Load the Libraries
2. Inspect the Dataset
3. Verify the Alignment
4. Tokenize the Dataset
5. Configure the Data Collator
6. Set up the Metric Calculation
7. Initialize the Model
8. Define the Training Arguments
9. Begin Training
10. Evaluate the Results

The Hugging Face documentation was also the main resource for this article. I found it somewhat difficult to find a precise tutorial like this one.

I highly recommend reading about the Transformer architecture, BERT and other models, as well as traditional NLP methods.

* I'm planning to write more blog posts on both Medium and my blog.
* The source code is accessible on both GitHub and Colab.
* The model is accessible on the Hugging Face Hub.

Let's move on!

## Step 1 : Load the Libraries

The lines below install several Python libraries using pip, Python’s package installer:

In [None]:
# Install the required Python libraries using pip
# (evaluate is needed in Step 6 to load the seqeval metric)
!pip install transformers datasets tokenizers seqeval evaluate -q
!pip install tensorflow_probability -U
!pip install seqeval -U


* **transformers:** The Hugging Face Transformers library, which provides the pre-trained models and the training utilities.
* **datasets:** This library provides a simple way to download, process, and load datasets. It is developed by Hugging Face, the same organization that develops the transformers library.
* **tokenizers:** This library is used for text tokenization. It is capable of training new tokenizers and comes with several pre-trained ones.
* **seqeval:** This library is used for sequence labeling evaluation. It is commonly used in Named Entity Recognition (NER) tasks.

The -q flag runs the installation in quiet mode, which means pip won’t print all the installation messages.

The last two lines use the -U flag to upgrade tensorflow_probability and seqeval to their latest versions.

In [None]:
# Check GPU
import torch; print(torch.cuda.get_device_name(0))

Tesla T4


When you run this code, it will print out the name of your GPU. This is useful to confirm that PyTorch is properly configured to use your GPU, which can significantly speed up machine learning tasks. If no GPU is available, or PyTorch isn’t configured correctly, it will throw an error.
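If you are not sure whether the runtime has a GPU, a more defensive check avoids that error. The following is a minimal sketch; the helper name `describe_device` is hypothetical and is not part of the original notebook.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu (PyTorch not installed)"

    if torch.cuda.is_available():
        # Name of the first visible CUDA device, e.g. the GPU assigned by Colab.
        return torch.cuda.get_device_name(0)
    return "cpu (no CUDA device visible)"


print(describe_device())
```

In Colab, a CPU result usually means the GPU runtime type has not been selected yet.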

## Step 2 : Inspect the Dataset

In [None]:
# Import the load_dataset function from the datasets library.
# It loads a dataset from the Hugging Face Hub.
from datasets import load_dataset

# Load the "conll2003" dataset and assign it to the variable conll2003.
conll2003 = load_dataset("conll2003")


* The CoNLL-2003 dataset is a popular benchmark for Named Entity Recognition (NER), a common task in Natural Language Processing (NLP) where the goal is to classify named entities in text into pre-defined categories such as person names, organizations, locations, etc.

In [None]:
# Check Dataset
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [None]:
#  check the type of the conll2003 object.
type(conll2003)

datasets.dataset_dict.DatasetDict

* The type is DatasetDict, a dictionary-like object provided by the datasets library.

* It allows you to access subsets of the dataset (like ‘train’, ‘test’, ‘validation’) using keys, just like you would with a Python dictionary.

In [None]:
#  Check the shape attribute of the conll2003 dataset.
conll2003.shape

{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}

The shape attribute of a DatasetDict object returns a dictionary in which each key is the name of a subset of the dataset (‘train’, ‘validation’, ‘test’) and each value is a tuple giving the shape of that subset as (rows, columns).
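As a small illustration, the shape dictionary printed above can be processed like any other Python dictionary. The sketch below copies the printed values into a plain dict (it does not load the dataset) and reports each split's share of all rows; the variable names are illustrative.

```python
# Values copied from the conll2003.shape output above: (rows, columns).
shape = {"train": (14041, 5), "validation": (3250, 5), "test": (3453, 5)}

# Total number of sentences across all three splits.
total_rows = sum(rows for rows, _ in shape.values())

for split, (rows, columns) in shape.items():
    share = rows / total_rows
    print(f"{split:<10} rows={rows:>6} columns={columns} share={share:.1%}")
```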

In [None]:
# Inspect First Row Of Train Data
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
# Inspect NER Tags
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
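To see how these tag IDs describe entities, the sketch below decodes the first training row shown earlier into (type, text) pairs. It is plain Python written for illustration; the helper name `extract_entities` is hypothetical, and the tokens, tag IDs and label names are copied from the outputs above.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]


def extract_entities(tokens, tag_ids, names):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((current_type, " ".join(current_words)))

    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        if tag.startswith("I-") and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(token)
        elif tag == "O":
            flush()
            current_type, current_words = None, []
        else:
            # A B- tag (or a stray I- tag) starts a new entity.
            flush()
            current_type, current_words = tag[2:], [token]
    flush()
    return entities


print(extract_entities(tokens, ner_tags, label_names))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```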

In [None]:
# Check the description of the dataset
conll2003['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

## Step 3 : Verify the Alignment

In [None]:
# Import the AutoTokenizer class from the transformers library
# AutoTokenizer provides access to all the tokenizers available in the transformers library in a unified way
from transformers import AutoTokenizer

# Sets the model_checkpoint variable to the string
# "bert-base-cased", which is the name of a pre-trained BERT model.
# The “cased” part means that the model was trained on case-sensitive data
model_checkpoint = "bert-base-cased"
# Loads the tokenizer associated with the "bert-base-cased" model and assigns it to the variable tokenizer.
# The from_pretrained method downloads and caches the tokenizer, and then returns an instance of it.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Test tokenizer
inputs = tokenizer(conll2003["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [None]:
# Check wordids
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

inputs.word_ids() returns the word IDs for the tokenized input. This matters whenever the tokenizer splits a word into several sub-word tokens.

word_ids() is a method of the encoding object returned by the tokenizer (not of the tokenizer itself). It returns a list with one element per token; each value is the index of the word that the token belongs to, or None for special tokens such as [CLS] and [SEP]. In the output above, ‘la’ and ‘##mb’ both map to word 7 (‘lamb’).
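The mapping can be checked by hand. The sketch below uses the tokens and word IDs printed above (copied as plain lists, so no tokenizer is needed) and glues sub-word pieces back into words; the helper name `rebuild_words` is hypothetical.

```python
tokens = ["[CLS]", "EU", "rejects", "German", "call", "to", "boycott",
          "British", "la", "##mb", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


def rebuild_words(tokens, word_ids):
    """Join the sub-word tokens that share a word ID back into one word."""
    words = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:
            # Special tokens ([CLS], [SEP]) do not belong to any word.
            continue
        # WordPiece marks continuation pieces with a leading "##".
        piece = token[2:] if token.startswith("##") else token
        words[word_id] = words.get(word_id, "") + piece
    return [words[i] for i in sorted(words)]


print(rebuild_words(tokens, word_ids))
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
```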

In [None]:
# check NER tags
print(conll2003["train"][0]["ner_tags"])

[3, 0, 7, 0, 0, 0, 7, 0, 0]


In [None]:
def align_labels_with_tokens(labels, word_ids):
    # Initialize a list to store the adjusted labels
    new_labels = []

    # Initialize a variable to keep track of the current word's ID
    current_word = None

    # Iterate through each word ID in the word_ids list
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word/entity
            current_word = word_id

            # Assign -100 to labels for special tokens, else use the word's label
            label = -100 if word_id is None else labels[word_id]

            # Append the adjusted label to the new_labels list
            new_labels.append(label)
        elif word_id is None:
            # Handle special tokens by assigning them a label of -100
            new_labels.append(-100)
        else:
            # Token belongs to the same word/entity as the previous token
            label = labels[word_id]

            # If the label is in the form B-XXX, change it to I-XXX
            if label % 2 == 1:
                label += 1

            # Append the adjusted label to the new_labels list
            new_labels.append(label)

    # Return the list of adjusted labels
    return new_labels

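The function align_labels_with_tokens(labels, word_ids) expands the word-level labels to token-level labels, which is a common preprocessing step in Named Entity Recognition (NER). Special tokens receive the label -100 so that the loss function ignores them, the first token of each word keeps the word's label, and any further token of the same word receives the word's label with B-XXX changed to I-XXX.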
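The B-XXX to I-XXX step only shows up when a tagged word is split into several tokens. The following sketch repeats the function in compact form and runs it on a hypothetical input, where the word IDs and labels are made up for illustration: word 0 is tagged B-LOC (5) and split into two tokens, and word 1 is tagged O (0).

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First token of a new word, or a special token following a word.
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            # Special token such as [CLS] or [SEP].
            new_labels.append(-100)
        else:
            # Continuation token: an odd ID is a B-XXX tag, and the next ID
            # in the label list is the matching I-XXX tag.
            label = labels[word_id]
            new_labels.append(label + 1 if label % 2 == 1 else label)
    return new_labels


word_labels = [5, 0]                  # hypothetical: B-LOC, O
word_ids = [None, 0, 0, 1, None]      # hypothetical: word 0 split in two tokens
aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)  # [-100, 5, 6, 0, -100]
```

The `label % 2 == 1` test relies on the label order of this dataset, in which every B-XXX tag has an odd ID and is directly followed by its I-XXX tag.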

In [None]:
labels = conll2003["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


## Step 4 : Tokenize the Dataset

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs


In [None]:
tokenized_datasets = conll2003.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=conll2003["train"].column_names,
)



## Step 5 : Configure the Data Collator

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]
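Comparing the two outputs shows what the collator does to the labels: the shorter list is padded with -100 up to the length of the longest list in the batch, so the padded positions are ignored by the loss. The sketch below reproduces only this padding step in plain Python; it is an illustration, not the library's implementation, and the helper name `pad_labels` is hypothetical.

```python
def pad_labels(label_lists, pad_id=-100):
    """Pad every label list to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in label_lists]


# The two unpadded label lists printed above.
batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]

for row in pad_labels(batch_labels):
    print(row)
```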


## Step 6 : Set up the Metric Calculation

In [None]:
import evaluate
metric = evaluate.load("seqeval")

In [None]:
ner_feature = conll2003["train"].features["ner_tags"]
label_names = ner_feature.feature.names
labels = conll2003["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}
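The overall scores above are computed on whole entities, not on single tokens: the reference contains three entities and the prediction finds two of them. The sketch below recomputes precision, recall and F1 for this example in plain Python. It is a simplified re-implementation for illustration only (seqeval may treat edge cases differently), and the helper name `spans` is hypothetical.

```python
def spans(tags):
    """Return the set of (entity_type, start, end) spans in a tag sequence."""
    found = set()
    start, kind = None, None
    # The trailing "O" is a sentinel that closes a span ending on the last tag.
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != kind):
            found.add((kind, start, i))
            start, kind = None, None
        if tag != "O" and start is None:
            start, kind = i, tag[2:]
    return found


references = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
predictions = references.copy()
predictions[2] = "O"  # the same change as in the cell above

true_spans, pred_spans = spans(references), spans(predictions)
correct = len(true_spans & pred_spans)

precision = correct / len(pred_spans)
recall = correct / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 1.0 0.667 0.8
```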

In [None]:
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }


* The function, `compute_metrics(eval_preds)`, is used to compute the precision, recall, F1 score, and accuracy of the predictions made by a model.

* The line `logits, labels = eval_preds` unpacks `eval_preds` into `logits` and `labels`. `logits` are the raw output values from the model, and `labels` are the true labels.

* The line `predictions = np.argmax(logits, axis=-1)` uses the `np.argmax` function to find the indices of the maximum values along the last axis of `logits`. These indices represent the model's predictions.

* The line `true_labels = [[label_names[l] for l in label if l != -100] for label in labels]` creates a new list of labels, `true_labels`, by iterating over `labels` and replacing each label `l` with its corresponding name from `label_names`, but only if `l` is not equal to `-100` (which is used to represent special tokens).

* The line `true_predictions = [[label_names[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]` creates a new list of predictions, `true_predictions`, by iterating over `predictions` and `labels` together, replacing each prediction `p` with its corresponding name from `label_names`, but only if the corresponding label `l` is not equal to `-100`.

```python
all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
```
This line computes the metrics by calling the `compute` method of the `metric` object with `true_predictions` and `true_labels` as arguments.

```python
return {
    "precision": all_metrics["overall_precision"],
    "recall": all_metrics["overall_recall"],
    "f1": all_metrics["overall_f1"],
    "accuracy": all_metrics["overall_accuracy"],
}
```
Finally, this line returns a dictionary containing the precision, recall, F1 score, and accuracy.

This function is typically used in the evaluation step of a machine learning pipeline to assess the performance of a model.
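To make the argmax and filtering steps concrete, here is a small sketch in plain Python, without NumPy. The three-label scheme and the logits are made up for illustration; only the two list comprehensions mirror the function above.

```python
# Hypothetical three-label scheme, used only for this illustration.
label_names = ["O", "B-PER", "I-PER"]

# One sequence of four tokens with three scores (logits) per token.
logits = [[
    [0.1, 0.2, 0.3],  # special token, ignored through label -100
    [0.1, 2.0, 0.3],
    [0.2, 0.1, 1.5],
    [3.0, 0.1, 0.2],  # special token, ignored through label -100
]]
labels = [[-100, 1, 2, -100]]


def argmax(scores):
    """Index of the largest score, like np.argmax on a 1-D array."""
    return max(range(len(scores)), key=scores.__getitem__)


predictions = [[argmax(token_scores) for token_scores in sequence] for sequence in logits]

# Drop positions labelled -100 and convert IDs to label names.
true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

print(predictions)       # [[2, 1, 2, 0]]
print(true_labels)       # [['B-PER', 'I-PER']]
print(true_predictions)  # [['B-PER', 'I-PER']]
```

Note that the filter looks at the label, not at the prediction, so a prediction at a special-token position is discarded whatever its value.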


## Step 7 : Initialize the Model

In [None]:
# Import token classification model to be trained or fine-tuned on tasks such as Named Entity Recognition (NER), Part-of-Speech tagging (POS)
from transformers import AutoModelForTokenClassification

# Create two dictionaries: id2label and label2id.
# id2label maps each label’s ID to its name.
# label2id maps each label’s name to its ID.
# These dictionaries are used to convert between label names and ID

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

# Load a pre-trained model for token classification from the checkpoint specified by model_checkpoint,
# Configures it to use the specific labels defined by id2label and label2id.
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

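# Move the model to the GPU if one is available, otherwise keep it on the CPU.
# (Creating a torch.device object on its own does not move anything.)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)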

In [None]:
# Get the number of labels from the model’s configuration.
# In token classification, the model predicts one label out of several possible ones for each token.
# The num_labels attribute tells how many different labels the model can predict.
model.config.num_labels

9

In [None]:
# Hugging Face Hub settings (optional)
# These cells interact with the Hugging Face Model Hub.

# Import the notebook_login and create_repo functions from the huggingface_hub library.
from huggingface_hub import notebook_login, create_repo

# Log in to the Hugging Face Model Hub from a Jupyter notebook.
# notebook_login() prompts for an access token; never hard-code a token in a shared notebook.
notebook_login()


In [None]:
# Create a new repository
create_repo("csariyildiz/bert-finetuned-ner4", private=False)

# Push the tokenizer to the repository on the Hugging Face Model Hub.
tokenizer.push_to_hub("csariyildiz/bert-finetuned-ner4")

## Step 8 : Define the Training Arguments

In [None]:
#  imports the accelerate library, which is a PyTorch utility for easy multi-GPU and TPU training.
import accelerate

# Import the TrainingArguments class from the transformers library.
# Class is used to set various parameters for training a model.
from transformers import TrainingArguments

args = TrainingArguments(

    # Output directory where the model predictions and checkpoints will be written.
    "bert-finetuned-ner4",

    # Model will be evaluated at the end of each epoch.
    # (Newer transformers releases rename this parameter to eval_strategy.)
    evaluation_strategy="epoch",

    # Model checkpoint will be saved at the end of each epoch.
    save_strategy="epoch",

    # Learning rate for the optimizer.
    # Controls how much to change the model in response to the estimated error each time the model weights are updated.
    learning_rate=2e-5,

    # Total number of training epochs to perform.
    # An epoch is one complete pass through the entire training dataset.
    num_train_epochs=3,

    # Weight decay to apply (if not zero).
    # Weight decay is a regularization technique by adding a small penalty, usually the L2 norm of the weights, to the loss function to reduce overfitting.
    weight_decay=0.01,

    # This means the model, tokenizer, and model configuration will be pushed to the Hugging Face Model Hub at each save.
    push_to_hub=True,
)

The training run uses the following hyperparameters. Values that were not set explicitly in TrainingArguments above are the library defaults.

* learning_rate: 2e-05: The learning rate is a hyperparameter that determines the step size at each iteration while moving toward a minimum of a loss function. In this case, the learning rate is set to 0.00002.

* train_batch_size: 8: This is the number of training examples utilized in one iteration. The batch size can significantly impact the model’s performance and the speed of training.

* eval_batch_size: 8: This is the number of evaluation examples utilized in one iteration. It’s similar to train_batch_size but is used during the evaluation phase.

* seed: 42: A seed is used in generating random numbers, which can be useful for reproducibility. Here, the seed is set to 42.

* optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08: This refers to the optimization algorithm used to minimize the loss function. Adam is a popular choice. Betas are coefficients used for computing running averages of gradient and its square, and epsilon is a very small number to prevent any division by zero in the implementation.

* lr_scheduler_type: linear: This refers to the learning rate scheduling. In this case, a linear scheduler is used, which decreases the learning rate linearly from the initial learning rate to 0 over the course of training.

* num_epochs: 3: An epoch is one complete pass through the entire training dataset. The number of epochs is a hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset. In this case, the model will be trained for 3 epochs.
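The effect of the linear scheduler can be worked out by hand. The sketch below assumes a single device, no gradient accumulation and no warmup steps, and uses the train split size from Step 2; the function name `linear_lr` is hypothetical and only mirrors the idea of a linear decay to zero.

```python
import math

train_rows = 14041   # size of the train split shown in Step 2
batch_size = 8       # train_batch_size from the list above
num_epochs = 3
base_lr = 2e-5

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, warmup_steps=0):
    """Learning rate at a given step for a linear decay from base_lr to 0."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr (not used when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)  # 1756 5268
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Halfway through training the learning rate has dropped to about half of its initial value, and it reaches 0 at the last step.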

## Step 9 : Begin Training

In [None]:
# This line imports the Trainer class from the transformers library.
# This class provides a simple way to train and fine-tune the models.
from transformers import Trainer

# This line creates an instance of the Trainer class with the specified parameters
trainer = Trainer(
    # This is the model that will be trained.
    model=model,
    # These are the training arguments that define the training setup.
    args=args,
    # This is the training dataset.
    train_dataset=tokenized_datasets["train"],
    # This is the validation dataset.
    eval_dataset=tokenized_datasets["validation"],
    # This is the function that will be used to form a batch by collating several samples together.
    data_collator=data_collator,
    # Function that will be used to compute metrics for evaluation.
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()


## Step 10 : Evaluate the Results

After training, the Trainer reports the following values:

- **Epoch**: This is one complete pass through the entire training dataset. This run completes 3 epochs.
- **Training Loss**: This is a measure of how well the model fits the training data. It decreases with each epoch, which is a good sign that the model is learning.
- **Validation Loss**: This is a measure of the model's performance on the validation dataset. It's used to detect overfitting to the training data.
- **Precision, Recall, F1, Accuracy**: These are metrics used to evaluate the model's performance. Higher values are generally better.
- **TrainOutput**: This contains additional information about the training process:
    - `global_step`: The total number of steps (batches) during training.
    - `training_loss`: The final training loss.
    - `train_runtime`: The total runtime of the training in seconds.
    - `train_samples_per_second`: The number of training samples processed per second.
    - `train_steps_per_second`: The number of training steps taken per second.
    - `total_flos`: The total number of floating point operations.
    - `epoch`: The total number of epochs completed.

The model has been trained successfully and shows good performance based on these metrics.


In [None]:
trainer.push_to_hub(commit_message="Training complete")

# Additional: Custom Training Loop with Accelerate

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [None]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
output_dir = "bert-finetuned-ner-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        # postprocess returns (true_labels, true_predictions), in that order
        true_labels, true_predictions = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

In [None]:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "csariyildiz/bert-finetuned-ner4"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")