<a href="https://colab.research.google.com/github/laxmiharikumar/transformers/blob/main/finetuning_bert_for_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### There are multiple ways to get pre-trained models, such as TensorFlow Hub or Hugging Face’s transformers package. Here we use Hugging Face transformers.

In [7]:
!pip install --upgrade transformers datasets tokenizers seqeval -q

### Download the CoNLL-2003 dataset

In [36]:
from datasets import load_dataset

In [9]:
raw_dataset = load_dataset("conll2003")
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [10]:
raw_dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [15]:
raw_dataset["train"].features["ner_tags"].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
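
The integer tags are easier to read once they are mapped back to these names. A minimal sketch, with the tokens, tags, and label names copied from the outputs above rather than loaded from the dataset (the variable names are chosen for illustration):

```python
# Label names and the first training example, copied from the outputs above
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
example_tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
example_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair every word with the name of its tag
decoded = [(token, label_names[tag]) for token, tag in zip(example_tokens, example_tags)]
for token, name in decoded:
  print(f"{token:10} {name}")
```

Here EU comes out as B-ORG, German and British as B-MISC, and every other word as O.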

In [37]:
from transformers import AutoTokenizer

In [42]:
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [43]:
## Tokenize one line to see how it looks
example_text = raw_dataset["train"][0]
print(example_text["tokens"])
example_input = tokenizer(example_text["tokens"], is_split_into_words=True)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']


In [44]:
example_input

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [45]:
print(example_input["input_ids"])
tokens = tokenizer.convert_ids_to_tokens(example_input["input_ids"])
tokens

[101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102]


['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [46]:
word_ids = example_input.word_ids()
word_ids

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

The tokenizer added the special tokens used by the model ([CLS] at the beginning and [SEP] at the end) and left most of the words untouched. The word lamb, however, was tokenized into two subwords, la and ##mb. This introduces a mismatch between our inputs and the labels: the list of labels has only 9 elements, whereas our input now has 12 tokens. Accounting for the special tokens is easy (we know they are at the beginning and the end), but we also need to make sure we align all the labels with the proper words.
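
One way to see exactly where the mismatch comes from is to group the tokens by word index. This is only a sketch: the token and word-id lists are copied from the outputs above, and the helper names are made up for illustration.

```python
from collections import defaultdict

# Tokens and word ids copied from the outputs above
bert_tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott',
               'British', 'la', '##mb', '.', '[SEP]']
token_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

# Collect the subword tokens that belong to each original word
pieces_per_word = defaultdict(list)
for token, word_id in zip(bert_tokens, token_word_ids):
  if word_id is not None:  # skip the special tokens
    pieces_per_word[word_id].append(token)

# Keep only the words that were split into more than one token
split_words = {i: pieces for i, pieces in pieces_per_word.items() if len(pieces) > 1}
print(split_words)  # {7: ['la', '##mb']}
```

The 12 tokens therefore break down into 9 words, one of which contributes an extra token, plus the 2 special tokens.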

With a tiny bit of work, we can expand our label list to match the tokens. The first rule we’ll apply is that special tokens get a label of -100, because -100 is the index that the loss function we will use (cross entropy) ignores by default. Then, each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I-, since the token does not begin the entity.
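
To illustrate why -100 is a convenient label for special tokens, here is a minimal plain-Python sketch of a cross-entropy loss that skips those positions. It assumes a PyTorch-style loss, where `ignore_index` defaults to -100. The function name `masked_cross_entropy` and the scores are made up for illustration and are not part of this notebook.

```python
import math

IGNORE_INDEX = -100  # assumption: same value as PyTorch's default ignore_index

def masked_cross_entropy(logits, targets):
  """Mean cross entropy over the positions whose target is not IGNORE_INDEX."""
  total, count = 0.0, 0
  for scores, target in zip(logits, targets):
    if target == IGNORE_INDEX:
      continue  # ignored positions contribute nothing to the loss
    # Negative log-softmax of the true class, computed in a numerically stable way
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    total += log_sum_exp - scores[target]
    count += 1
  return total / count  # assumes at least one position is not ignored

# Hypothetical scores for 3 positions and 2 classes
toy_logits = [[5.0, -5.0], [0.0, 0.0], [1.0, 2.0]]
loss_all = masked_cross_entropy(toy_logits, [0, 1, 1])
loss_masked = masked_cross_entropy(toy_logits, [-100, 1, 1])
print(loss_all, loss_masked)
```

Masking the first position gives the same value as computing the loss on the last two positions alone, which is the behaviour we want for [CLS] and [SEP].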

In [55]:
# Basically this is the mismatch
print("Ner tags in input: ")
print(raw_dataset["train"][0]["ner_tags"])
labels = raw_dataset["train"][0]["ner_tags"]
print(len(labels))
print(word_ids)
print(len(word_ids))

## There are only 9 tags but 12 tokens

Ner tags in input: 
[3, 0, 7, 0, 0, 0, 7, 0, 0]
9
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
12


In [89]:
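def align_labels_with_tokens(example, labels):
  # word_ids() maps every token to the index of its source word (None for special tokens)
  word_ids = example.word_ids()

  new_labels = []
  for word_id in word_ids:
    if word_id is None:
      # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
      label_new = -100
    else:
      # Every subword token takes the label of the word it belongs to
      label_new = labels[word_id]
    new_labels.append(label_new)

  return new_labels

In [94]:
modified_labels = align_labels_with_tokens(example_input, labels)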
print(f"Modified labels: {modified_labels}")
print(f"Ner tags: {raw_dataset['train'][0]['ner_tags']}")

Modified labels: [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
Ner tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]


### As we can see, our function added the -100 for the two special tokens at the beginning and the end, and a new 0 for our word that was split into two tokens.
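
In this example the split word (lamb) is tagged O, so copying the word's label to both subwords is enough and the B- to I- rule never comes into play. A minimal sketch of that rule follows, using a hypothetical tokenization in which a B-ORG word is split in two. The function name `align_labels` and the inputs are made up for illustration. It relies on the label list shown earlier, where every B- tag has an odd id and the matching I- tag is the next id.

```python
def align_labels(word_ids, word_labels):
  """Sketch: -100 for special tokens, and B-XXX becomes I-XXX inside a word."""
  aligned = []
  previous_word_id = None
  for word_id in word_ids:
    if word_id is None:
      # Special token, ignored by the loss
      aligned.append(-100)
    elif word_id != previous_word_id:
      # First token of a word keeps the word's label
      aligned.append(word_labels[word_id])
    else:
      # Later token of the same word: odd ids are B- tags, id + 1 is the I- tag
      label = word_labels[word_id]
      aligned.append(label + 1 if label % 2 == 1 else label)
    previous_word_id = word_id
  return aligned

# Hypothetical case: the first word (B-ORG, id 3) is split into two tokens
print(align_labels([None, 0, 0, 1, None], [3, 0]))  # [-100, 3, 4, 0, -100]
```

On the sentence used in this notebook the sketch returns the same list as above, since the only split word is tagged O.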