**Dependencies**

This cell installs the Python libraries needed for the Named Entity Recognition (NER) task: datasets for loading and processing data, tokenizers and transformers for tokenization and pretrained models, seqeval for entity-level evaluation of NER predictions, and accelerate, which the transformers Trainer uses to handle device placement and to speed up training.

In [1]:
!pip install datasets -q
!pip install tokenizers -q
!pip install transformers -q
!pip install seqeval -q
!pip install accelerate -U

  Preparing metadata (setup.py) ... done
  Building wheel for seqeval (setup.py) ... done
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


**Dataset**

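This cell loads the English configuration ('en') of the wikiann dataset through the datasets library. WikiANN is a multilingual dataset widely used for NER; the 'en' configuration contains English sentences annotated with person, organization, and location entities.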

In [2]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset('wikiann', 'en')


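This cell reads the entity label names from the dataset's features. The seven labels cover three entity types (PER, ORG, LOC), each with a B- (beginning) and an I- (inside) tag, plus O for words outside any entity. Knowing these categories is necessary both for configuring the model and for interpreting the evaluation results.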

In [3]:
label_names = dataset['train'].features['ner_tags'].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

This code displays the first two samples in the training set. Viewing data samples helps in better understanding the format and content of the data.

In [4]:
dataset['train'][:2]

{'tokens': [['R.H.',
   'Saunders',
   '(',
   'St.',
   'Lawrence',
   'River',
   ')',
   '(',
   '968',
   'MW',
   ')'],
  [';', "'", "''", 'Anders', 'Lindström', "''", "'"]],
 'ner_tags': [[3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0], [0, 0, 0, 1, 2, 0, 0]],
 'langs': [['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'],
  ['en', 'en', 'en', 'en', 'en', 'en', 'en']],
 'spans': [['ORG: R.H. Saunders', 'ORG: St. Lawrence River'],
  ['PER: Anders Lindström']]}
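
The integer `ner_tags` in the samples above index into `label_names`. As a minimal sketch of how the tags relate to the `spans` field, the snippet below decodes the first sample and groups consecutive B-/I- tags into entities. `extract_spans` is a hypothetical helper written for illustration, not part of the datasets library, and it makes the simplifying assumption that an I- tag without a matching open entity counts as outside.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = dict(enumerate(label_names))


def extract_spans(tokens, tag_ids):
    """Group B-/I- tagged words into 'TYPE: text' strings (illustrative helper)."""
    spans = []
    entity_type, words = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[tag_id]
        if tag.startswith("I-") and tag[2:] == entity_type:
            words.append(token)  # continue the currently open entity
            continue
        if words:  # any other tag closes the open entity
            spans.append(f"{entity_type}: {' '.join(words)}")
        if tag.startswith("B-"):
            entity_type, words = tag[2:], [token]  # open a new entity
        else:
            entity_type, words = None, []  # 'O' or a stray I- tag
    if words:  # close an entity that runs to the end of the sentence
        spans.append(f"{entity_type}: {' '.join(words)}")
    return spans


# First training sample, copied from the output above
tokens = ['R.H.', 'Saunders', '(', 'St.', 'Lawrence', 'River', ')', '(', '968', 'MW', ')']
ner_tags = [3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0]
print(extract_spans(tokens, ner_tags))
```

For the first sample this yields the same two ORG entries that appear in the dataset's `spans` field.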

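This cell uses AutoTokenizer from the transformers library to load the pretrained distilbert-base-uncased tokenizer. The tokenization function passes is_split_into_words=True because the dataset is already split into words, pads every sequence to the tokenizer's maximum length with padding="max_length", and truncates longer inputs. Note that this pass does not align the NER labels with the subword tokens, and its result (tokenized_datasets_) is not referenced by the later cells, which use the label-aware function defined next.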

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["tokens"], padding="max_length", truncation=True, is_split_into_words=True)
tokenized_datasets_ = dataset.map(tokenize_function, batched=True)

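This cell defines a function that tokenizes the data and adjusts the NER labels to match. Because the tokenizer can split one word into several subword tokens, the word-level labels no longer line up with the token sequence. The function gives every subword token the label of the word it came from and assigns -100 to special tokens such as [CLS] and [SEP] so that they are ignored by the loss.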

In [6]:
#Get the values for input_ids, attention_mask, and adjusted labels

def tokenize_adjust_labels(samples):

  tokenized_samples = tokenizer.batch_encode_plus(samples["tokens"], is_split_into_words=True, truncation=True)
  total_adjusted_labels = []

  for k in range(0, len(tokenized_samples["input_ids"])):
    prev_wid = -1
    word_ids_list = tokenized_samples.word_ids(batch_index=k)
    existing_label_ids = samples["ner_tags"][k]
    i = -1
    adjusted_label_ids = []
    for word_idx in word_ids_list:
      # Special tokens have a word id that is None. We set the label to -100
      # so they are automatically ignored in the loss function.
      if(word_idx is None):
        adjusted_label_ids.append(-100)

      elif(word_idx!=prev_wid):
        i = i + 1
        adjusted_label_ids.append(existing_label_ids[i])
        prev_wid = word_idx

      else:
        # Later subword of the same word: repeat that word's label
        adjusted_label_ids.append(existing_label_ids[i])

    total_adjusted_labels.append(adjusted_label_ids)

  #add adjusted labels to the tokenized samples
  tokenized_samples["labels"] = total_adjusted_labels

  return tokenized_samples
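
To see the alignment rule in isolation, the sketch below applies it to hand-written word ids instead of calling the tokenizer. `align_labels` is a hypothetical helper, and the `word_ids` list is an assumption: it is the word-to-token mapping inferred for the second training sample, where 'Lindström' is split into three subword tokens. Unlike the cell above, the helper indexes the labels by word id directly rather than keeping a separate counter; the two approaches agree whenever every word produces at least one token.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_continuations=True):
    """Expand word-level labels to token-level labels (illustrative helper)."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word or label_continuations:
            # First subword of a word, or a later subword that we choose to label
            aligned.append(word_labels[word_id])
        else:
            # Later subword, masked out when label_continuations is False
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Assumed word ids for: [CLS] ; ' '' Anders Lindström '' ' [SEP]
word_ids = [None, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, None]
word_labels = [0, 0, 0, 1, 2, 0, 0]

print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_continuations=False))
```

The notebook labels every subword, which corresponds to the default here. Labeling only the first subword of each word and masking the rest with -100 is a common alternative; which one works better is an empirical question that this notebook does not explore.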

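The function from the previous step is applied to every split of the dataset. batched=True passes the examples to the function in batches, which is faster than processing them one at a time, and remove_columns drops the raw columns so that only input_ids, attention_mask, and labels remain.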

In [7]:
tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True, remove_columns=['tokens', 'ner_tags', 'langs', 'spans'])

In [8]:
tokenized_dataset

DatasetDict({
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20000
    })
})

In [9]:
tokenized_dataset['train'][:2]

{'input_ids': [[101,
   1054,
   1012,
   1044,
   1012,
   15247,
   1006,
   2358,
   1012,
   5623,
   2314,
   1007,
   1006,
   5986,
   2620,
   12464,
   1007,
   102],
  [101,
   1025,
   1005,
   1005,
   1005,
   15387,
   11409,
   5104,
   13887,
   1005,
   1005,
   1005,
   102]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'labels': [[-100, 3, 3, 3, 3, 4, 0, 3, 3, 4, 4, 0, 0, 0, 0, 0, 0, -100],
  [-100, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, -100]]}

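This cell creates a data collator, which assembles individual tokenized examples into batches. Since the examples were tokenized without padding, the collator pads each batch to the length of its longest sequence and pads the labels with -100 so that padded positions do not contribute to the loss.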

In [10]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
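
As a rough picture of what dynamic padding does, the pure-Python sketch below pads a toy batch by hand. `pad_batch` is a hypothetical stand-in for the collator, the pad token id of 0 is an assumption about the DistilBERT vocabulary, the two features are shortened toy examples, and the real collator returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every feature to the longest sequence in the batch (illustrative)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention 0 so the model does not attend to them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded positions get -100 so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy features of different lengths
features = [
    {"input_ids": [101, 1025, 102],
     "attention_mask": [1, 1, 1],
     "labels": [-100, 0, -100]},
    {"input_ids": [101, 15387, 11409, 5104, 13887, 102],
     "attention_mask": [1, 1, 1, 1, 1, 1],
     "labels": [-100, 1, 2, 2, 2, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Padding per batch rather than to a fixed maximum length keeps batches short when the sentences are short, as they are in this dataset.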

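A pretrained distilbert-base-uncased model is loaded with a token classification head. The num_labels parameter is set to the number of labels in the dataset. The warning in the output is expected: the classification layer is not part of the pretrained checkpoint, so its weights are newly initialized and must be learned during fine-tuning.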

In [11]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_names))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


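This cell defines the function that computes the evaluation metrics. It converts the model's logits to predicted label ids, drops the positions labeled -100, maps ids back to label names, and passes the result to seqeval, which reports precision, recall, and F1 at the entity level along with token-level accuracy.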

In [12]:
import numpy as np
from datasets import load_metric

# Note: load_metric is deprecated in recent datasets releases;
# evaluate.load("seqeval") from the separate evaluate package replaces it.
metric = load_metric("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Select the label index with the highest logit for each token
    predictions = np.argmax(logits, axis=2)
    # Drop positions labeled -100 (special tokens) and map ids to label names
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
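
The filtering step is easier to follow on a single short sequence. The toy example below mirrors the logic of the metric function using plain Python in place of np.argmax; the logits are invented numbers chosen only for illustration.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# One sequence with four token positions and seven scores each (invented values)
logits = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # [CLS], ignored through its -100 label
    [0.2, 2.5, 0.1, 0.0, 0.0, 0.0, 0.0],  # highest score at index 1 (B-PER)
    [0.1, 0.3, 1.9, 0.0, 0.0, 0.0, 0.0],  # highest score at index 2 (I-PER)
    [3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # [SEP], ignored through its -100 label
]
labels = [-100, 1, 2, -100]

# Index of the largest score at each position (what np.argmax does per token)
predicted_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in logits]

# Keep only positions whose reference label is not -100
kept_predictions = [label_names[p] for p, l in zip(predicted_ids, labels) if l != -100]
kept_labels = [label_names[l] for l in labels if l != -100]

print(predicted_ids)
print(kept_predictions, kept_labels)
```

Whatever the model predicts at the special-token positions never reaches the metric, because the filter is driven by the reference labels.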

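This cell sets the basic training parameters: the batch size, the number of steps in one pass over the training set (stored in logging_steps, although this value is not passed to TrainingArguments below), and the maximum number of training epochs.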

In [13]:
from transformers import TrainingArguments, Trainer
batch_size = 16
logging_steps = len(tokenized_dataset['train']) // batch_size
epochs = 10

In [18]:
from transformers import EarlyStoppingCallback
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)

In [19]:
import torch
torch.cuda.empty_cache()

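The Trainer class brings the pieces together: the model, the training arguments, the training and validation datasets, the data collator, the tokenizer, and the metric function. It also registers an EarlyStoppingCallback with a patience of 3, so training stops once the validation loss has failed to improve for three consecutive evaluations. Because load_best_model_at_end=True with metric_for_best_model="loss", the checkpoint with the lowest validation loss is restored when training ends.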

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()

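| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------|---------------|-----------------|-----------|--------|----|----------|
| 500 | 0.2016 | 0.397577 | 0.776614 | 0.776053 | 0.776334 | 0.900324 |
| 1000 | 0.3129 | 0.275962 | 0.797143 | 0.813182 | 0.805083 | 0.914002 |
| 1500 | 0.2427 | 0.310801 | 0.812031 | 0.820897 | 0.81644 | 0.916267 |
| 2000 | 0.1734 | 0.310976 | 0.805492 | 0.823153 | 0.814226 | 0.917091 |
| 2500 | 0.1826 | 0.297433 | 0.811405 | 0.822972 | 0.817147 | 0.917747 |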

TrainOutput(global_step=2500, training_loss=0.22263404846191406, metrics={'train_runtime': 246.0263, 'train_samples_per_second': 812.921, 'train_steps_per_second': 50.808, 'total_flos': 308218110713568.0, 'train_loss': 0.22263404846191406, 'epoch': 2.0})
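
The run ends at step 2500 even though ten epochs were requested. The arithmetic and the patience rule can be replayed from the logged values above. This is a simplified model of early stopping, not the Trainer's actual implementation: `stopping_step` is a hypothetical helper, the steps-per-epoch figure assumes a single device, and the validation losses are copied from the training log.

```python
train_size, batch_size, epochs = 20000, 16, 10
steps_per_epoch = train_size // batch_size  # 1250, assuming one device
max_steps = steps_per_epoch * epochs        # 12500 if training ran to completion

# (step, validation loss) pairs copied from the training log
eval_log = [
    (500, 0.397577),
    (1000, 0.275962),
    (1500, 0.310801),
    (2000, 0.310976),
    (2500, 0.297433),
]


def stopping_step(eval_log, patience):
    """Return (step where training stops, step with the best loss)."""
    best_loss, best_step, bad_evals = float("inf"), None, 0
    for step, loss in eval_log:
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1  # one more evaluation without improvement
            if bad_evals >= patience:
                return step, best_step
    return None, best_step  # patience never exhausted within the log


stop_step, best_step = stopping_step(eval_log, patience=3)
print(stop_step, best_step, stop_step / steps_per_epoch)
```

Under these assumptions the best validation loss occurs at step 1000, the three following evaluations do not improve on it, and training stops at step 2500, which is 2.0 epochs and matches the reported TrainOutput. Note that F1 was still slightly higher at step 2500 than at step 1000, so selecting the best model by loss and selecting it by F1 would not pick the same checkpoint here.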

This code cell uses the trained model to make predictions on the test set and processes the predictions to calculate performance metrics for the model.

In [21]:
predictions, labels, _ = trainer.predict(tokenized_dataset["test"])

In [22]:
predictions = np.argmax(predictions, axis=2)

In [23]:
true_predictions = [[label_names[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

true_labels = [[label_names[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

results = metric.compute(predictions=true_predictions, references=true_labels)

results

{'LOC': {'precision': 0.8348653972422849,
  'recall': 0.872284472901898,
  'f1': 0.8531648400805188,
  'number': 8746},
 'ORG': {'precision': 0.7232183215308121,
  'recall': 0.6770098730606487,
  'f1': 0.6993516427478691,
  'number': 7090},
 'PER': {'precision': 0.8235900456486527,
  'recall': 0.8947368421052632,
  'f1': 0.8576905382610028,
  'number': 6251},
 'overall_precision': 0.7986351147744394,
 'overall_recall': 0.8159550867025852,
 'overall_f1': 0.8072022036593285,
 'overall_accuracy': 0.915621247113164}
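
The overall precision, recall, and F1 are micro-averages over entities, so they can be recomputed from the per-type entries above. The sketch below recovers integer entity counts by rounding, which is an assumption about how the per-type values were produced, and then pools the counts across types.

```python
# Per-type scores copied from the results above
per_type = {
    "LOC": {"precision": 0.8348653972422849, "recall": 0.872284472901898, "number": 8746},
    "ORG": {"precision": 0.7232183215308121, "recall": 0.6770098730606487, "number": 7090},
    "PER": {"precision": 0.8235900456486527, "recall": 0.8947368421052632, "number": 6251},
}

correct = predicted = gold = 0
for scores in per_type.values():
    # recall = correct / gold, so correct = recall * gold
    n_correct = round(scores["recall"] * scores["number"])
    correct += n_correct
    # precision = correct / predicted, so predicted = correct / precision
    predicted += round(n_correct / scores["precision"])
    gold += scores["number"]

overall_precision = correct / predicted
overall_recall = correct / gold
overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

print(correct, predicted, gold)
print(overall_precision, overall_recall, overall_f1)
```

The pooled values agree with the reported overall_precision, overall_recall, and overall_f1. The per-type breakdown also shows where the model is weakest: ORG entities have an F1 of about 0.70, against roughly 0.85 for LOC and 0.86 for PER.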