### NER/POS tagging and chunking 

O means the word doesn’t correspond to any entity.
B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
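
To make the tagging scheme concrete, here is a minimal sketch that groups a word-level tag sequence into entity spans. The helper name `bio_to_spans` and the example sentence are hypothetical, written by hand for illustration; a stray `I-` tag with no matching `B-` before it is treated as the start of a new entity.

```python
def bio_to_spans(words, tags):
    """Group word-level B-/I-/O tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")  # "B-PER" -> ("B", "-", "PER")
        continues = prefix == "I" and ent_type == current_type
        # Close the open entity unless this tag continues it.
        if current_type is not None and not continues:
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            current_type, current_words = ent_type, [word]
        elif continues:
            current_words.append(word)
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hand-written example, not taken from the dataset.
words = ["Angela", "Merkel", "visited", "the", "European", "Union", "in", "Brussels"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
spans = bio_to_spans(words, tags)
print(spans)
```

For this input the sketch yields one person, one organization, and one location span.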

In [36]:
import datasets 
import numpy as np 
from transformers import BertTokenizerFast 
from transformers import DataCollatorForTokenClassification 
from transformers import AutoModelForTokenClassification 

Use the CoNLL-2003 dataset for fine-tuning.

In [37]:
conll2003 = datasets.load_dataset("conll2003") 


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [38]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [39]:
conll2003['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [40]:
conll2003['train'].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [41]:
conll2003['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

Load the tokenizer that matches the model we will fine-tune (bert-base-uncased).

In [42]:
tokenizer= BertTokenizerFast.from_pretrained('bert-base-uncased')

Run the tokenizer on one example.

In [43]:
example_text=conll2003['train'][0]
example_text

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [44]:
tokenized_input=tokenizer(example_text['tokens'], is_split_into_words=True)
tokenized_input

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [45]:
tokens=tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
tokens

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']

In [46]:
word_ids=tokenized_input.word_ids()
word_ids

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]

In [47]:
len(example_text['ner_tags']), len(tokenized_input['input_ids'])

(9, 11)

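The number of word-level tags (9) differs from the number of tokenized inputs (11). In this example the difference comes from the special tokens [CLS] and [SEP] that the tokenizer adds. In general, BERT's WordPiece tokenizer can also split a single word into several sub-word tokens, so one tag may have to cover several tokens. The function below aligns the tags with the tokens:

1. Set -100 as the label for the special tokens, so the loss function ignores them.
2. For sub-word tokens after the first one of a word, either repeat the word's label (`label_all_tokens=True`, the default used here) or mask them with -100 (`label_all_tokens=False`).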
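
The two rules can be tried on a toy input before touching the real tokenizer. This is a minimal single-sentence sketch: `align_labels` is a hypothetical helper, and the `word_ids` list is written by hand to imitate a three-word sentence whose second word is split into three sub-word tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level for one sentence."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                    # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first token of a word
        else:
            # later sub-word token of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned


# [CLS], word 0, three pieces of word 1, word 2, [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

with_all = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(with_all)
print(first_only)
```

With `label_all_tokens=True` every piece of the second word carries label 2; with `False` only its first piece does and the other two are masked with -100.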

In [48]:
example_text['ner_tags']

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [49]:
def tokenize_and_align_labels(examples, label_all_tokens=True): 
    """
    Function to tokenize and align labels with respect to the tokens. This function is specifically designed for
    Named Entity Recognition (NER) tasks where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A dictionary containing the tokens and the corresponding NER tags.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.
                     
    label_all_tokens (bool): A flag to indicate whether all tokens should have labels. 
                             If False, only the first token of a word will have a label, 
                             the other tokens (subwords) corresponding to the same word will be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) 
    labels = [] 
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids() maps each token to the index of the word it came from
        # (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] have no word index;
                # label them -100 so the loss function ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word: use the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later sub-word tokens of the same word: repeat the word's label,
                # or mask with -100 when label_all_tokens is False.
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels 
    return tokenized_inputs 

Test the function on a single training example.

In [50]:
q=tokenize_and_align_labels(conll2003['train'][4:5])
q

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}

In [51]:
tokenized_dataset=conll2003.map(tokenize_and_align_labels, batched=True)

In [52]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "test-ner", evaluation_strategy="epoch", learning_rate=2e-5,
    per_device_train_batch_size=16, per_device_eval_batch_size=16,
    num_train_epochs=3, weight_decay=0.01,
)
data_collator=DataCollatorForTokenClassification(tokenizer)
metric=datasets.load_metric("seqeval")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
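
The `DataCollatorForTokenClassification` created above pads every batch to its longest sequence and pads the labels with -100, so padded positions are ignored by the loss. The sketch below imitates that idea in plain Python. `pad_batch` is a hypothetical helper, not the library's implementation, and it assumes right-side padding with pad token id 0 (the `[PAD]` id in bert-base-uncased).

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of tokenized examples to the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two hand-written examples of different lengths.
features = [
    {"input_ids": [101, 7327, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2762, 1005, 1055, 102], "labels": [-100, 5, 0, 0, -100]},
]
batch = pad_batch(features)
print(batch["labels"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the tokenization step above does not pad.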


In [54]:
example=conll2003['train'][0]
label_list=conll2003['train'].features['ner_tags'].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [55]:
labels=[label_list[i] for i in example['ner_tags']]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [56]:
metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

## Computing metrics 

In [57]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds
    # Pick the highest-scoring label id at every token position.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100 (special tokens and padding).
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
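
To see what the argmax and the -100 filter do, here is a small sketch on hand-written logits for one sentence. The array values are made up for illustration; only the shapes (batch, tokens, labels) mirror what the trainer passes to `compute_metrics`.

```python
import numpy as np

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# One sentence, four token positions, nine scores per position.
logits = np.zeros((1, 4, 9))
logits[0, 0, 0] = 5.0  # [CLS] position, ignored below
logits[0, 1, 1] = 5.0  # highest score on B-PER
logits[0, 2, 2] = 5.0  # highest score on I-PER
logits[0, 3, 0] = 5.0  # [SEP] position, ignored below
labels = np.array([[-100, 1, 2, -100]])

pred_ids = np.argmax(logits, axis=2)  # shape (1, 4)

# Keep only positions with a real label, then map ids to tag names.
predictions = [
    [label_list[p] for p, gold in zip(pred_row, gold_row) if gold != -100]
    for pred_row, gold_row in zip(pred_ids, labels)
]
true_labels = [
    [label_list[gold] for gold in gold_row if gold != -100]
    for gold_row in labels
]
print(predictions, true_labels)
```

Both lists end up with two tags per sentence, because the two special-token positions are filtered out before scoring.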

In [58]:
trainer = Trainer( model, args, train_dataset=tokenized_dataset["train"], eval_dataset=tokenized_dataset["validation"], data_collator=data_collator, tokenizer=tokenizer, compute_metrics=compute_metrics) 

In [59]:
trainer.train() 

{'loss': 0.2264, 'grad_norm': 1.5905286073684692, 'learning_rate': 1.6203492786636296e-05, 'epoch': 0.57}


{'eval_loss': 0.06011531502008438, 'eval_precision': 0.9217227635075288, 'eval_recall': 0.931312227318492, 'eval_f1': 0.9264926826553891, 'eval_accuracy': 0.9831604365577391, 'eval_runtime': 208.5639, 'eval_samples_per_second': 15.583, 'eval_steps_per_second': 0.978, 'epoch': 1.0}
{'loss': 0.0702, 'grad_norm': 0.7943836450576782, 'learning_rate': 1.240698557327259e-05, 'epoch': 1.14}
{'loss': 0.0463, 'grad_norm': 2.2683959007263184, 'learning_rate': 8.610478359908885e-06, 'epoch': 1.71}


{'eval_loss': 0.05536239966750145, 'eval_precision': 0.9345804737657737, 'eval_recall': 0.9445128090390424, 'eval_f1': 0.9395203916986591, 'eval_accuracy': 0.9858293484995313, 'eval_runtime': 239.8188, 'eval_samples_per_second': 13.552, 'eval_steps_per_second': 0.851, 'epoch': 2.0}
{'loss': 0.0347, 'grad_norm': 1.6674907207489014, 'learning_rate': 4.8139711465451785e-06, 'epoch': 2.28}
{'loss': 0.0259, 'grad_norm': 2.8925271034240723, 'learning_rate': 1.0174639331814731e-06, 'epoch': 2.85}


{'eval_loss': 0.05624592304229736, 'eval_precision': 0.9388636615350537, 'eval_recall': 0.9483163664839468, 'eval_f1': 0.943566340160285, 'eval_accuracy': 0.9865283492461913, 'eval_runtime': 206.0349, 'eval_samples_per_second': 15.774, 'eval_steps_per_second': 0.99, 'epoch': 3.0}
{'train_runtime': 11611.2814, 'train_samples_per_second': 3.628, 'train_steps_per_second': 0.227, 'train_loss': 0.07772379602187496, 'epoch': 3.0}


TrainOutput(global_step=2634, training_loss=0.07772379602187496, metrics={'train_runtime': 11611.2814, 'train_samples_per_second': 3.628, 'train_steps_per_second': 0.227, 'total_flos': 1020143109346326.0, 'train_loss': 0.07772379602187496, 'epoch': 3.0})
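
As a sanity check on the training log, the step count and the first logged learning rate can be reproduced by hand. This sketch assumes a single device, no gradient accumulation, a linear decay schedule without warmup, and a log entry every 500 steps. None of these are stated in the notebook, but they reproduce the logged numbers.

```python
import math

train_rows, eval_rows = 14041, 3250        # from the DatasetDict printed earlier
batch_size, epochs, base_lr = 16, 3, 2e-5  # from TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(eval_rows / batch_size)

# Linear decay from base_lr to 0 over total_steps (assumed schedule).
lr_at_step_500 = base_lr * (1 - 500 / total_steps)
epoch_at_step_500 = 500 / steps_per_epoch

print(steps_per_epoch, total_steps, eval_batches)
print(lr_at_step_500, round(epoch_at_step_500, 2))
```

Under these assumptions `total_steps` is 2634, which matches `global_step` in the `TrainOutput`, and the learning rate and epoch at step 500 agree with the first logged entry.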

In [60]:
model.save_pretrained("ner_model")

In [61]:
tokenizer.save_pretrained("tokenizer")

('tokenizer\\tokenizer_config.json',
 'tokenizer\\special_tokens_map.json',
 'tokenizer\\vocab.txt',
 'tokenizer\\added_tokens.json',
 'tokenizer\\tokenizer.json')

In [62]:
id2label = {str(i): label for i, label in enumerate(label_list)}
# Label ids are kept as integers in label2id.
label2id = {label: i for i, label in enumerate(label_list)}

In [63]:
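import json

# Add the label mappings to the saved model config so predictions show tag names.
with open("ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)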
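
One detail worth keeping in mind when writing label maps to a JSON file: JSON object keys are always strings, so integer ids come back as strings after a round trip. The short sketch below shows this with a three-label toy list; the variable names are for illustration only.

```python
import json

toy_labels = ["O", "B-PER", "I-PER"]
toy_id2label = {i: label for i, label in enumerate(toy_labels)}  # integer keys

# Serialize and parse again, as happens when a config file is saved and reloaded.
restored = json.loads(json.dumps(toy_id2label))

# Convert the keys back to integers if integer lookups are needed.
restored_int = {int(key): value for key, value in restored.items()}
print(restored)
print(restored_int)
```

After the round trip the keys are the strings "0", "1" and "2", and converting them with `int` recovers the original mapping.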

Inference with the fine-tuned model

In [64]:
from transformers import pipeline
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

In [65]:
example = "Bill Gates is the Founder of Microsoft"
ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99691856, 'index': 1, 'word': 'bill', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.99597245, 'index': 2, 'word': 'gates', 'start': 5, 'end': 10}, {'entity': 'B-ORG', 'score': 0.94493335, 'index': 7, 'word': 'microsoft', 'start': 29, 'end': 38}]
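
The pipeline returns one entry per token, and the uncased tokenizer lowercases the words. A minimal sketch of merging adjacent `B-`/`I-` tokens into whole entities and recovering the original casing from the character offsets follows. `merge_entities` is a hypothetical helper; the results list is copied from the output above.

```python
text = "Bill Gates is the Founder of Microsoft"
ner_results = [
    {"entity": "B-PER", "score": 0.99691856, "index": 1, "word": "bill", "start": 0, "end": 4},
    {"entity": "I-PER", "score": 0.99597245, "index": 2, "word": "gates", "start": 5, "end": 10},
    {"entity": "B-ORG", "score": 0.94493335, "index": 7, "word": "microsoft", "start": 29, "end": 38},
]


def merge_entities(text, results):
    """Merge adjacent tokens of the same entity and slice the original text."""
    merged = []
    for item in results:
        prefix, _, ent_type = item["entity"].partition("-")
        extends_last = (
            prefix == "I"
            and merged
            and merged[-1]["type"] == ent_type
            and item["index"] == merged[-1]["last_index"] + 1  # directly adjacent token
        )
        if extends_last:
            merged[-1]["end"] = item["end"]
            merged[-1]["last_index"] = item["index"]
        else:
            merged.append({"type": ent_type, "start": item["start"],
                           "end": item["end"], "last_index": item["index"]})
    return [(m["type"], text[m["start"]:m["end"]]) for m in merged]


entities = merge_entities(text, ner_results)
print(entities)
```

The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping inside the library, which is usually preferable to a hand-written merge.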
