# Token Classification

Token classification encompasses any NLP task that can be formulated as “attributing a label to each token in a sentence,” such as:

- **Part-of-speech tagging** (POS): Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).

- **Named entity recognition** (NER): Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

- **Chunking**: Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.



## Abstract

This notebook explores the token classification NLP task with the 🤗Hugging Face library. In particular, it fine-tunes a pretrained BERT model on NER tagging.

## Table of Contents

>[Token Classification](#scrollTo=JqCbOLl9WcdF)

>>[Abstract](#scrollTo=t8x4Vqk7WcXk)

>>[Table of Contents](#scrollTo=_l8Y22MGWgL2)

>>[Named Entities](#scrollTo=9bUCKqSkh3oz)

>>[Named-Entity Recognition](#scrollTo=4uK9DfnUg-x2)

>>[Setup and Imports](#scrollTo=EFegxAAMXlfg)

>>[Download the Dataset](#scrollTo=DVNPPGTQXtP-)

>>[Preprocessing the Data](#scrollTo=L9xJkLDNcg5c)

>>[Data Collation](#scrollTo=E3iIQttigGaq)

>>[Model Creation](#scrollTo=mSAnvb_khtmO)

>>[Model Fine-tuning](#scrollTo=cDMu2L_WitW-)

>>[Model Evaluation](#scrollTo=h3KlL1NjjU9C)

>>[Model Inference](#scrollTo=-cQrnVuRl1QF)



## Named Entities

A named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named.

## Named-Entity Recognition

Named entity recognition (NER) is a two-step task:
- Identifying entities in text.
- Categorizing the identified entities.

Identifying entities involves detecting a word or string of words that forms an entity. Each word represents a token: “The Great Lakes” is a string of three tokens that represents one entity. **Inside-outside-beginning** (IOB) tagging, also known as BIO notation, is a common way of indicating where entities begin and end: it differentiates the beginning (B) and the inside (I) of entities from the tokens that are outside (O) any entity.

The second step classifies each identified entity into a category, such as:

- Person
- Organization
- Time
- Location

These are general categories that can be extended with domain-specific ones.
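As a rough illustration of BIO notation, the sketch below converts entity spans into per-token tags. The helper `to_bio` and the `(start, end, label)` span format are assumptions made for this example; they are not part of any library used in this notebook.

```python
def to_bio(tokens, spans):
    """Tag each token with B-<label>, I-<label>, or O.

    `spans` is a list of (start, end, label) tuples, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # The first token of the entity gets the B- prefix...
        tags[start] = f"B-{label}"
        # ...and every following token of the same entity gets the I- prefix.
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["The", "Great", "Lakes", "are", "in", "North", "America"]
spans = [(0, 3, "LOC"), (5, 7, "LOC")]  # hypothetical annotations

bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Here “The Great Lakes” becomes `B-LOC I-LOC I-LOC`, so the B/I prefixes carry the entity boundaries and the suffix carries the category.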



## Setup and Imports

In [None]:
!pip install transformers datasets evaluate seqeval -q

In [None]:
import numpy as np
import tensorflow as tf

import evaluate

from datasets import load_dataset
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import TFAutoModelForTokenClassification
from transformers import create_optimizer

## Download the Dataset

In [None]:
ds = load_dataset("conll2003")


As shown below, the dataset contains labels for the three previously mentioned tasks: NER, POS, and chunking. The input texts in this dataset are lists of words stored in the `tokens` column. However, these pre-tokenized inputs still need to go through the tokenizer for subword tokenization.

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [None]:
ds["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
ner_tags = ds["train"].features["ner_tags"].feature.names
ner_tags
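The integer tags can be decoded into readable labels. The sketch below hardcodes the label list so that it runs on its own; the list is assumed to match the order returned by `ds["train"].features["ner_tags"].feature.names` for CoNLL-2003 and should be checked against the output of the cell above.

```python
# Assumed label order for CoNLL-2003; verify it against `ner_tags` from the dataset.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training example, copied from the output of ds["train"][0].
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair every word with its decoded label.
decoded = [(token, label_names[tag_id]) for token, tag_id in zip(tokens, tag_ids)]
for token, label in decoded:
    print(f"{token}\t{label}")
```

Under that assumption, `EU` is tagged as the beginning of an organization, while `German` and `British` are tagged as miscellaneous entities.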

## Preprocessing the Data

In [None]:
# Specify the model's checkpoint
model_checkpoint = "bert-base-cased"

# Instantiate the corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
tokenized_example = tokenizer(ds["train"][0]["tokens"], is_split_into_words=True)
tokenized_example

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
print(tokenized_example.tokens())
print(tokenized_example.word_ids())

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


Tokenization leads to a misalignment between the inputs (`input_ids`) and the labels (`ner_tags`) due to:

* The addition of the special `[CLS]` and `[SEP]` tokens.
* The tokenization of words into subwords.

Realigning the tokens and their `ner_tags` consists of:

1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so the loss function ignores them.
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.
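The three steps above can be traced by hand on the first training example, using the `word_ids` and `ner_tags` values printed earlier. The helper `align_labels` is a hypothetical, single-example version written for this illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Align word-level labels with subword tokens for a single example."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) are ignored by the loss.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining subwords of the same word are ignored.
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Values copied from the outputs above.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
word_labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]

aligned = align_labels(word_ids, word_labels)
print(aligned)
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
```

The aligned list has 12 entries, one per token, and only `##mb` (the second subword of `lamb`) and the two special tokens receive `-100`.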

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):

        # 1. Map tokens to their respective word.
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None

        # 2. Set the special tokens to -100.
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)

            # 3. Only label the first token of a given word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Then, the 🤗Datasets `map` function applies the preprocessing function over the entire dataset. The function is sped up by setting `batched=True` as it processes multiple elements of the dataset at once.

In [None]:
tokenized_ds = ds.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=ds["train"].column_names,
)


In [None]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

## Data Collation

`DataCollatorWithPadding` can't be used here because it only pads the inputs (input IDs, attention mask, and token type IDs). In this case, the labels should be padded the exact same way as the inputs so that they stay the same size, using -100 as a value so that the corresponding predictions are ignored in the loss computation. This is all done by a `DataCollatorForTokenClassification`. Like the `DataCollatorWithPadding`, it takes the tokenizer used to preprocess the inputs:

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
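To make the label padding concrete, here is a minimal pure-Python sketch of the idea. The helper `pad_labels` is hypothetical and is not the library's implementation; it assumes right-side padding to the length of the longest sequence in the batch.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    max_length = max(len(labels) for labels in batch_labels)
    return [
        labels + [pad_value] * (max_length - len(labels))
        for labels in batch_labels
    ]


# Two already-aligned label sequences of different lengths (made-up values).
batch_labels = [
    [-100, 3, 0, -100],
    [-100, 1, 2, 0, 0, 0, -100],
]

padded = pad_labels(batch_labels)
for row in padded:
    print(row)
```

Because the padding value is the same `-100` used for special tokens and non-initial subwords, the loss ignores padded positions without any extra masking logic.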

In [None]:
tf_train_dataset = tokenized_ds["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16
)

tf_val_dataset = tokenized_ds["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16
)

tf_test_dataset = tokenized_ds["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16
)



## Model Creation

In [None]:
# NER tags-labels correspondences

id2label = {i: label for i, label in enumerate(ner_tags)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
model = TFAutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)


All model checkpoint layers were used when initializing TFBertForTokenClassification.

Some layers of TFBertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Model Fine-tuning

In [None]:
num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [None]:
model.fit(
    tf_train_dataset,
    validation_data=tf_val_dataset,
    epochs=num_epochs
)


## Model Evaluation

The traditional framework used to evaluate token classification predictions is [seqeval](https://github.com/chakki-works/seqeval). This metric does not behave like standard accuracy: it takes the lists of labels as strings, not integers, so the predictions and labels must be decoded before passing them to the metric.

In [None]:
metric = evaluate.load("seqeval")
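To show why an entity-level metric differs from token accuracy, the sketch below scores a made-up prediction both ways. It is a simplified illustration, not seqeval's implementation: the helpers `extract_entities` and `entity_scores` are hypothetical, and an `I-` tag that does not continue an entity of the same type is ignored here, which may differ from seqeval's handling.

```python
def extract_entities(tags):
    """Return the set of (label, start, end) spans encoded by a BIO tag list."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                entities.add((label, start, i))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return entities


def entity_scores(references, predictions):
    """Entity-level precision, recall, and F1 for one tag sequence."""
    true_entities = extract_entities(references)
    pred_entities = extract_entities(predictions)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


references = ["B-PER", "I-PER", "O", "B-LOC", "O"]
predictions = ["B-PER", "I-PER", "O", "B-ORG", "O"]

token_accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)
precision, recall, f1 = entity_scores(references, predictions)
print(token_accuracy, precision, recall, f1)
```

In this example four of five tokens match (token accuracy 0.8), yet only one of the two entities is recovered with the right type, so entity-level precision, recall, and F1 are all 0.5.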


TensorFlow can't concatenate the predictions together because they have variable sequence lengths. This means `model.predict()` can't be used directly on an entire dataset. However, it can be used to obtain the predictions for one batch at a time, which are then concatenated into one list. The `-100` positions that indicate masking/padding are dropped along the way, and the metric is computed on the resulting lists at the end:

In [None]:
all_predictions = []
all_labels = []

for batch in tf_val_dataset:

    # Get the model's predictions on the current batch
    logits = model.predict(batch)["logits"]
    predictions = np.argmax(logits, axis=-1)

    # Get the true labels for the current batch
    labels = batch["labels"]

    # Append the predictions and labels to their corresponding lists
    for prediction, label in zip(predictions, labels):
        for predicted_idx, label_idx in zip(prediction, label):

            # Discard padding/masking tokens
            if label_idx == -100:
                continue

            all_predictions.append(ner_tags[predicted_idx])
            all_labels.append(ner_tags[label_idx])

In [None]:
model_performance_dict = metric.compute(predictions=[all_predictions], references=[all_labels])

for k in model_performance_dict:
  print(f"{k}\t{model_performance_dict[k]}")

LOC	{'precision': 0.9656488549618321, 'recall': 0.9640718562874252, 'f1': 0.9648597112503404, 'number': 1837}
MISC	{'precision': 0.8835978835978836, 'recall': 0.9056399132321041, 'f1': 0.8944831280128549, 'number': 922}
ORG	{'precision': 0.9163582531458179, 'recall': 0.9231916480238628, 'f1': 0.9197622585438336, 'number': 1341}
PER	{'precision': 0.9724473257698542, 'recall': 0.9771986970684039, 'f1': 0.974817221770918, 'number': 1842}
overall_precision	0.9436549072061529
overall_recall	0.9498485358465163
overall_f1	0.946741591881238
overall_accuracy	0.9910829017561621


## Model Inference

In [None]:
token_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
token_classifier("My name is Mike. I live in Italy.")

[{'entity_group': 'PER',
  'score': 0.9972308,
  'word': 'Mike',
  'start': 11,
  'end': 15},
 {'entity_group': 'LOC',
  'score': 0.99887186,
  'word': 'Italy',
  'start': 27,
  'end': 32}]