Token classification is the task of assigning a class to each token in a token sequence, so a model for this task must produce one label per token. Part-of-speech (POS) tagging and named entity recognition (NER) are two of the best-known tasks in this category. Extractive question answering (QA) is another major NLP task that fits here, since the model marks which tokens begin and end the answer span.

# NER
One of the well-known tasks in the category of token classification is NER: deciding whether each token is part of an entity and identifying the type of each detected entity. For example, a text can contain several entities at once, such as person names, locations, organizations, and other entity types.

Consider the following example:

George Washington is one of the presidents of the United States of America.

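George Washington is a person name, while the United States of America is a location name. A sequence tagging model is expected to assign a tag to each word, and each tag carries information about the entity type and the word's position within the entity. BIO tags are the scheme most commonly used for standard NER tasks: B marks the beginning of an entity, I marks a token inside an entity, and O marks a token outside any entity.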
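
To make the BIO scheme concrete, here is a minimal sketch that tags the example sentence by hand. The tags are assigned manually for illustration (they are not model output), and treating "United" as the first token of the location is an assumption about where the entity starts.

```python
# Hand-assigned BIO tags for the example sentence (illustrative, not model output).
sentence = "George Washington is one of the presidents of the United States of America ."
tokens = sentence.split()

tags = [
    "B-PER", "I-PER",                      # George Washington
    "O", "O", "O", "O", "O", "O", "O",     # is one of the presidents of the
    "B-LOC", "I-LOC", "I-LOC", "I-LOC",    # United States of America
    "O",                                   # .
]

# Print one token per line next to its tag.
for token, tag in zip(tokens, tags):
    print(f"{token:<12} {tag}")
```

Every token receives exactly one tag, which is what makes NER a token classification problem.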

# POS tagging

POS tagging, or grammatical tagging, is annotating each word in a given text with its part of speech. As a simple example, identifying whether each word in a text is a noun, adjective, adverb, or verb is POS tagging. From a linguistic perspective, however, there are many roles other than these four.
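
As a small illustration, the sketch below pairs the words of a made-up sentence with coarse POS tags. The sentence and the tag names are assumptions chosen for the example and are assigned by hand, not produced by a tagger.

```python
# Hand-assigned coarse POS tags (illustrative only; not produced by a tagger).
tokens = ["The", "old", "dog", "barked", "loudly"]
pos_tags = ["DET", "ADJ", "NOUN", "VERB", "ADV"]

# One tag per token, just like NER.
for token, tag in zip(tokens, pos_tags):
    print(f"{token:<8} {tag}")
```

Note that "The" is a determiner, one of the roles that falls outside the four categories listed above.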

# Understanding QA
A QA or reading comprehension task consists of a set of reading passages with questions about them. An exemplary dataset is SQuAD, the Stanford Question Answering Dataset. It contains Wikipedia texts and questions asked about them. The answers are segments of the original Wikipedia text.
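
Because each answer is a segment of the passage, it can be recovered from a start offset and the answer length. The sketch below uses a made-up record whose field names follow the layout of the Hugging Face `squad` dataset; the passage and question are hypothetical and not taken from SQuAD.

```python
# A hypothetical SQuAD-style record (the text is invented for illustration).
example = {
    "context": "George Washington was the first president of the United States.",
    "question": "Who was the first president of the United States?",
    "answers": {"text": ["George Washington"], "answer_start": [0]},
}

# The answer is located by its character offset inside the context.
start = example["answers"]["answer_start"][0]
answer = example["answers"]["text"][0]
span = example["context"][start:start + len(answer)]

print(span)
```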

In [1]:
!pip install transformers datasets tokenizers

In [2]:
import datasets 
conll2003 = datasets.load_dataset("conll2003")

Downloading and preparing dataset conll2003/conll2003 (download: 959.94 KiB, generated: 9.78 MiB, post-processed: Unknown size, total: 10.72 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee...


Generating train split:   0%|          | 0/14042 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3251 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3454 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee. Subsequent calls will reuse this data.


A progress bar is displayed while the dataset downloads; once downloading and caching finish, the dataset is ready to use. We can double-check the dataset by inspecting a training sample.

In [3]:
conll2003["train"][0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}

The respective tags for POS and NER are shown above. We will use only the NER tags for now. Next we will get the NER tags available in this dataset.

In [4]:
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
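
The integer tags become readable once they are mapped through the list of class names. The sketch below does this for the first training sample, with the tokens, tag ids, and names copied from the outputs above so that it runs without downloading the dataset.

```python
# Class names and the first training sample, copied from the outputs above.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Each integer tag is an index into label_names.
readable = [(token, label_names[tag]) for token, tag in zip(tokens, ner_tags)]

for token, name in readable:
    print(f"{token:<10} {name}")
```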

The dataset defines 9 tags in total. The next step is to load the BERT tokenizer.

In [5]:
from transformers import BertTokenizerFast 
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

The tokenizer class can also work with whitespace-tokenized sentences. We need to enable this mode because the NER task has one label per token, and the tokens in this task are usually whitespace-separated words rather than the subword tokens produced by BPE, WordPiece, or similar tokenizers.

Next we will look at how the tokenizer can be used with a whitespace-tokenized sentence.

In [6]:
tokenizer(["Oh","this","sentence","is","tokenized","and", "splitted","by","spaces"], is_split_into_words=True)

{'input_ids': [101, 2821, 2023, 6251, 2003, 19204, 3550, 1998, 3975, 3064, 2011, 7258, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

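As the result shows, setting is_split_into_words=True lets the tokenizer accept a list of pre-split words. The 9 input words yield 11 subword ids plus the two special tokens, because longer words are split into several pieces.

Next we will preprocess the data before training on it.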

In [7]:
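def tokenize_and_align_labels(examples, label_all_tokens=True):
    # Tokenize pre-split words; each word may become several subword tokens.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each subword token to the index of its source word (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] get -100 so the loss ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first subword of a word takes the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later subwords repeat the label, or are masked with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs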

This function ensures that tokens and labels are aligned. The alignment is required because the tokenizer splits some words into several subword pieces, whereas the dataset provides one label per whole word. Next we will test how this function works.

In [8]:
q = tokenize_and_align_labels(conll2003['train'][4:5])
print(q)

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}


The raw output above is hard to read, so we can print each token next to its label for a readable version.

In [9]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]): 
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___
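
The alignment visible in the rows for `werner`, `z`, `##wing`, `##mann` can be reproduced without a tokenizer. The sketch below is a simplified stand-in for the inner loop of `tokenize_and_align_labels`: the helper name `align_labels` and the hand-written `word_ids` list are assumptions made for illustration, while the label ids 1 and 2 are the ones shown in the output.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Spread word-level labels over subword tokens (sketch of the loop above)."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                   # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first piece of a word
        else:
            # later pieces: repeat the label or mask it out
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned

# [CLS] werner z ##wing ##mann [SEP]; two words, the second split into three pieces.
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [1, 2]  # B-PER, I-PER

print(align_labels(word_ids, word_labels, label_all_tokens=True))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

With `label_all_tokens=False` only the first piece of each word contributes to the loss, which is the other common convention for subword labels.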

The function can be applied to the whole dataset using the map function of the datasets library.

In [10]:
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)



The next step is to load the BERT model with the matching number of labels.


In [11]:
from transformers import AutoModelForTokenClassification 

In [12]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

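The model is now loaded and ready to be trained. The warning is expected, because the token classification head is newly initialized and has not been trained yet. In the next step, we will prepare the trainer and training parameters.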

In [13]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

Next we need to prepare the data collator. It pads the inputs and labels of each batch to a common length so that they can be stacked into tensors.

In [14]:
from transformers import DataCollatorForTokenClassification 
data_collator = DataCollatorForTokenClassification(tokenizer)
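
To show what batching with padding involves, here is a plain-Python sketch. It is not the library's implementation: the helper `pad_batch` and the token ids other than 101 and 102 are hypothetical, and padding inputs with 0 and labels with -100 is an assumption that matches the `pad_token_id` and the ignored label value used elsewhere in this notebook.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example in a batch to the length of the longest one (sketch)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        pad = max_len - n
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)   # 0 marks padding
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch

# Two examples of different length (ids 7, 8, 9 are made up).
features = [
    {"input_ids": [101, 7, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 1, 2, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding only to the longest sequence in each batch, rather than to a global maximum length, keeps batches small.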

Hugging Face's datasets library provides many metrics for many tasks. In this project we will use a sequence evaluation metric for NER: seqeval, a Python framework for evaluating sequence tagging algorithms and models. The seqeval library must be installed before the metric can be loaded.

In [15]:
!pip install seqeval


In [16]:
#load the metric
metric=datasets.load_metric("seqeval")

The code below shows how the metric works by comparing the labels of one training sample against themselves.

In [17]:
example = conll2003['train'][0]
label_list = conll2003["train"].features["ner_tags"].feature.names 
labels = [label_list[i] for i in example["ner_tags"]] 
metric.compute(predictions=[labels], references=[labels])

{'MISC': {'f1': 1.0, 'number': 2, 'precision': 1.0, 'recall': 1.0},
 'ORG': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 1.0,
 'overall_f1': 1.0,
 'overall_precision': 1.0,
 'overall_recall': 1.0}

Accuracy, F1-score, precision, and recall are computed for the sample input. All scores are 1.0 here because the predictions are identical to the references.
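
The per-type counts in the output (2 MISC, 1 ORG) suggest that the scores are computed over whole entities rather than single tokens. The sketch below illustrates that idea in plain Python. It is a simplified stand-in and not seqeval's implementation: the helpers `extract_entities` and `entity_prf` are hypothetical, and stray I- tags without a preceding B- tag are simply ignored.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel flushes the last entity
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities

def entity_prf(references, predictions):
    """Entity-level precision, recall, and F1 for one sentence."""
    true, pred = extract_entities(references), extract_entities(predictions)
    correct = len(true & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Tags of the first training sample, and a prediction that misses "British".
reference = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
prediction = ["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]

print(entity_prf(reference, reference))
print(entity_prf(reference, prediction))
```

Missing one of the three entities leaves precision at 1.0 but lowers recall to 2/3, which is the kind of difference that token-level accuracy tends to hide.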

In [18]:
# Compute the metrics
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    # Pick the highest-scoring class for every token.
    predictions = np.argmax(predictions, axis=2)
    # Drop positions labelled -100 and map the remaining ids to tag names.
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
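
The two steps inside `compute_metrics`, taking the argmax over class scores and dropping positions labelled -100, can be tried on a tiny hand-made input. The sketch below uses plain lists instead of NumPy arrays; the helper `row` and the score values are hypothetical.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def row(best, n=9):
    """Hypothetical score row whose largest value sits at index `best`."""
    return [1.0 if i == best else 0.0 for i in range(n)]

# One sentence with five token positions; the first and last are special tokens.
logits = [[row(0), row(3), row(0), row(7), row(0)]]
labels = [[-100, 3, 0, 7, -100]]

# Argmax over the class axis, written without NumPy.
pred_ids = [[max(range(len(r)), key=r.__getitem__) for r in sent] for sent in logits]

# Keep only positions with a real label and convert ids to tag names.
true_predictions = [
    [label_list[p] for p, l in zip(ps, ls) if l != -100]
    for ps, ls in zip(pred_ids, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(ps, ls) if l != -100]
    for ps, ls in zip(pred_ids, labels)
]
print(true_predictions)
print(true_labels)
```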

The next step is to create a trainer and train the model.

In [19]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [20]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: id, ner_tags, tokens, chunk_tags, pos_tags. If id, ner_tags, tokens, chunk_tags, pos_tags are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 14042
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2634


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2221 | 0.062369 | 0.91977 | 0.928515 | 0.924122 | 0.982446 |
| 2 | 0.0475 | 0.061183 | 0.920581 | 0.936234 | 0.928342 | 0.983764 |
| 3 | 0.0277 | 0.056374 | 0.935155 | 0.945408 | 0.940254 | 0.986131 |


Saving model checkpoint to test-ner/checkpoint-500
Configuration saved in test-ner/checkpoint-500/config.json
Model weights saved in test-ner/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-ner/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-ner/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: id, ner_tags, tokens, chunk_tags, pos_tags. If id, ner_tags, tokens, chunk_tags, pos_tags are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3251
  Batch size = 16
Saving model checkpoint to test-ner/checkpoint-1000
Configuration saved in test-ner/checkpoint-1000/config.json
Model weights saved in test-ner/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-ner/checkpoint-1000/tokenizer_config.json
Special tokens f

TrainOutput(global_step=2634, training_loss=0.0796683730731521, metrics={'train_runtime': 331.8696, 'train_samples_per_second': 126.935, 'train_steps_per_second': 7.937, 'total_flos': 1019599557281136.0, 'train_loss': 0.0796683730731521, 'epoch': 3.0})
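
The total of 2634 optimization steps reported in the log follows from the dataset size, batch size, and number of epochs. The short check below assumes a single device, no gradient accumulation (as the log states), and that the final partial batch of each epoch is kept.

```python
import math

num_examples = 14042   # "Num examples" in the training log
batch_size = 16        # per-device batch size, one device assumed
epochs = 3

# The last batch of each epoch is smaller than 16 but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```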

In [21]:
# Save our model
model.save_pretrained("ner_model")

Configuration saved in ner_model/config.json
Model weights saved in ner_model/pytorch_model.bin


In [22]:
tokenizer.save_pretrained("tokenizer")

tokenizer config file saved in tokenizer/tokenizer_config.json
Special tokens file saved in tokenizer/special_tokens_map.json


('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

Next, to use the model with the pipeline, we must read the config file and assign label2id and id2label correctly according to the labels in the label_list object.

In [23]:
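import json

# Map class ids to tag names and back, following the order of label_list.
id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: str(i) for i, label in enumerate(label_list)}

# Read the saved config, add both mappings, and write it back.
with open("ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)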
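
One reason the ids show up as strings in config.json is that JSON object keys are always strings. The sketch below demonstrates this with a round trip through the `json` module. It uses integer ids on the Python side, which is an assumption made for illustration and differs from the cell above, where the ids are converted to strings explicitly.

```python
import json

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Integer ids on the Python side (an illustrative variant).
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

# Serialize and parse again, as happens when a config file is written and re-read.
restored = json.loads(json.dumps({"id2label": id2label, "label2id": label2id}))

print(list(restored["id2label"])[:3])   # keys come back as strings
print(restored["label2id"]["B-LOC"])    # values keep their integer type
```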

In [24]:
# It is very easy to use the model

from transformers import pipeline
mmodel = AutoModelForTokenClassification.from_pretrained("ner_model")
nlp = pipeline("ner", model=mmodel, tokenizer=tokenizer)
example = "I live in HoChiMinh"
ner_results = nlp(example)
print(ner_results)

loading configuration file ner_model/config.json
Model config BertConfig {
  "_name_or_path": "ner_model",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": "5",
    "B-MISC": "7",
    "B-ORG": "3",
    "B-PER": "1",
    "I-LOC": "6",
    "I-MISC": "8",
    "I-ORG": "4",
    "I-PER": "2",
    "O": "0"
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "tr

[{'entity': 'B-LOC', 'score': 0.9790821, 'index': 4, 'word': 'hoc', 'start': 10, 'end': 13}, {'entity': 'B-LOC', 'score': 0.9710882, 'index': 5, 'word': '##him', 'start': 13, 'end': 16}, {'entity': 'B-LOC', 'score': 0.9727854, 'index': 6, 'word': '##in', 'start': 16, 'end': 18}, {'entity': 'B-LOC', 'score': 0.958712, 'index': 7, 'word': '##h', 'start': 18, 'end': 19}]
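
The pipeline returns one entry per subword piece, so "HoChiMinh" appears as four separate B-LOC pieces. The pipeline also accepts an `aggregation_strategy` argument intended for grouping such pieces. The sketch below instead merges them by hand using the character offsets from the output above; the helper `merge_pieces` is hypothetical and the scores are omitted for brevity.

```python
text = "I live in HoChiMinh"

# Pieces copied from the pipeline output above (scores and indices left out).
pieces = [
    {"entity": "B-LOC", "word": "hoc", "start": 10, "end": 13},
    {"entity": "B-LOC", "word": "##him", "start": 13, "end": 16},
    {"entity": "B-LOC", "word": "##in", "start": 16, "end": 18},
    {"entity": "B-LOC", "word": "##h", "start": 18, "end": 19},
]

def merge_pieces(pieces, text):
    """Join pieces that touch each other and share the same entity type."""
    merged = []
    for piece in pieces:
        label = piece["entity"].split("-", 1)[-1]   # "B-LOC" -> "LOC"
        last = merged[-1] if merged else None
        if last and last["end"] == piece["start"] and last["label"] == label:
            last["end"] = piece["end"]               # extend the current span
        else:
            merged.append({"label": label, "start": piece["start"], "end": piece["end"]})
    for span in merged:
        span["text"] = text[span["start"]:span["end"]]
    return merged

entities = merge_pieces(pieces, text)
print(entities)
```

Using the offsets rather than the `word` field recovers the original casing of the input text.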


The pipeline output shows that we successfully applied NER using BERT. We trained our own NER model using transformers and also tested it.