This is a shorter version of the token classification notebook prepared by Hugging Face.

Some parts were intentionally removed to keep this notebook simple for educational purposes.

Source of the original notebook:

https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV

# Token Classification

Token classification assigns a class to each part (word, subword, ...) of a text.

The best-known application is Named Entity Recognition (NER):
- [ "My", "name", "is", "Ahmet", "." ]
- [ "O", "O", "O", "PERSON", "O" ]

Named entity recognition finds the special entities in a text, such as "person", "location", "date".

It is a type of token classification in which the classes are entity types such as "PERSON", "LOC", and "DATE", plus "O" for tokens that belong to no entity.

## Data Exploration for Named Entity Recognition

In [1]:
! pip install datasets transformers

Collecting huggingface-hub<0.1.0,>=0.0.19
  Using cached huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.4.0
    Uninstalling huggingface-hub-0.4.0:
      Successfully uninstalled huggingface-hub-0.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy-transformers 1.0.6 requires transformers<4.10.0,>=3.4.0, but you have transformers 4.11.3 which is incompatible.
Successfully installed huggingface-hub-0.0.19


- [ ] The model checkpoint can be changed.
- [ ] The dataset can be changed.
    - The original dataset is no longer available: `FileNotFoundError: Couldn't find file at https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt`
    - See https://github.com/davidsbatista/NER-datasets/issues/8
    - Use "conllpp" instead.
        - CoNLLpp is a corrected version of the CoNLL2003 NER dataset where labels of 5.38% of the sentences in the test set have been manually corrected.

In [5]:
# token classification
# named entity recognition NER
# part of speech tagging POS

# distillation learning

task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

from datasets import load_dataset
datasets = load_dataset("conllpp")

datasets

Downloading and preparing dataset conllpp/conllpp (download: 4.63 MiB, generated: 9.78 MiB, post-processed: Unknown size, total: 14.41 MiB) to /home/gsoykan20/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2...


Dataset conllpp downloaded and prepared to /home/gsoykan20/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [6]:
datasets["train"][0]

{'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}

- [ ] The meanings of the entries in the label list could be written out here.

In [9]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
print(label_list)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
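
The label list follows the BIO scheme used by CoNLL-2003: `B-` marks the first token of an entity, `I-` marks a token that continues it, and `O` marks a token outside any entity. The suffixes stand for person (`PER`), organization (`ORG`), location (`LOC`) and miscellaneous (`MISC`). As a small sketch, the integer `ner_tags` of the first training example shown earlier can be decoded by indexing into the label list (the values below are copied from the outputs above rather than loaded from the dataset):

```python
# Label names and the first training example, copied from the cell outputs above.
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Each integer tag is an index into label_list.
decoded = [(token, label_list[tag]) for token, tag in zip(tokens, ner_tags)]

for token, label in decoded:
    print(f"{token:10s} {label}")
```

Here "EU" is tagged as the beginning of an organization, while "German" and "British" are tagged as miscellaneous entities.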


### Show random elements from dataset to understand it better

In [19]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    dataset_len = len(dataset)
    assert num_examples <= dataset_len, "Can't pick more elements than there are in the dataset."
    picks = random.sample(range(dataset_len), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [20]:
show_random_elements(datasets["train"])

index,id,tokens,pos_tags,chunk_tags,ner_tags
0,11034,"[The, present, noise, levels, have, applied, at, Heathrow, ,, one, of, the, world, 's, busiest, airports, ,, since, 1959, and, at, Gatwick, since, 1968, .]","[DT, JJ, NN, NNS, VBP, VBN, IN, NNP, ,, CD, IN, DT, NN, POS, JJS, NNS, ,, IN, CD, CC, IN, NNP, IN, CD, .]","[B-NP, I-NP, I-NP, I-NP, B-VP, I-VP, B-PP, B-NP, O, B-NP, B-PP, B-NP, I-NP, B-NP, I-NP, I-NP, O, B-PP, B-NP, O, B-PP, B-NP, B-PP, B-NP, O]","[O, O, O, O, O, O, O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, O]"
1,3855,"[EASTERN, DIVISION]","[NNP, NNP]","[B-NP, I-NP]","[B-MISC, I-MISC]"
2,1905,"[Portsmouth, 1, Queens, Park, Rangers, 2]","[NNP, CD, NNP, NNP, NNPS, CD]","[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP]","[B-ORG, O, B-ORG, I-ORG, I-ORG, O]"
3,6577,"[Although, Danny, Rose, 's, 50th-minute, equaliser, gave, the, Seychellois, renewed, hope, they, could, not, find, the, net, again, and, were, eliminated, .]","[IN, NNP, NNP, POS, JJ, NN, VBD, DT, NNP, VBD, NN, PRP, MD, RB, VB, DT, JJ, RB, CC, VBD, VBN, .]","[B-SBAR, B-NP, I-NP, B-NP, I-NP, I-NP, B-VP, B-NP, I-NP, B-VP, B-NP, B-NP, B-VP, I-VP, I-VP, B-NP, I-NP, B-ADVP, O, B-VP, I-VP, O]","[O, B-PER, I-PER, O, O, O, O, O, B-MISC, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,13075,"[USDA, gross, cutout, hide, and, offal, value, .]","[NNP, JJ, JJ, NN, CC, NN, NN, .]","[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP, O]","[B-ORG, O, O, O, O, O, O, O]"
5,10019,"[Martin, Gates, ,, Anders, Haglund, (, Sweden, )]","[NNP, NNP, ,, NNP, NNP, (, NNP, )]","[B-NP, I-NP, O, B-NP, I-NP, O, B-NP, O]","[B-PER, I-PER, O, B-PER, I-PER, O, B-LOC, O]"
6,1020,"[The, HCFA, Administrator, reversed, a, previously, favorable, decision, regarding, the, reimbursement, of, costs, related, to, the, company, 's, community, liaison, personnel, ,, it, added, .]","[DT, NNP, NNP, VBD, DT, RB, JJ, NN, VBG, DT, NN, IN, NNS, VBN, TO, DT, NN, POS, NN, NN, NNS, ,, PRP, VBD, .]","[B-NP, I-NP, I-NP, B-VP, B-NP, I-NP, I-NP, I-NP, B-VP, B-NP, I-NP, B-PP, B-NP, B-VP, B-PP, B-NP, I-NP, B-NP, I-NP, I-NP, I-NP, O, B-NP, B-VP, O]","[O, B-ORG, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
7,3600,"[Worcestershire, 205-9, (, K., Spiring, 52, ), .]","[NNP, CD, (, NNP, NNP, CD, ), .]","[B-NP, I-NP, O, B-NP, I-NP, I-NP, O, O]","[B-ORG, O, O, B-PER, I-PER, O, O, O]"
8,8380,"[SQUASH, -, HONG, KONG, OPEN, FIRST, ROUND, RESULTS, .]","[NNP, :, NNP, NNP, NNP, NNP, NNP, NNS, .]","[B-NP, O, B-NP, I-NP, O, B-NP, I-NP, I-NP, O]","[O, O, B-MISC, I-MISC, I-MISC, O, O, O, O]"
9,8195,"[Birmingham, 2, 1, 1, 0, 5, 4, 4]","[NNP, CD, CD, CD, CD, CD, CD, CD]","[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP]","[B-ORG, O, O, O, O, O, O, O]"


## Preprocessing for Named Entity Recognition

### Tokenization

In [21]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [22]:
# Let's see what tokenizer does
tokenizer("Hello, this is one sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

### What is subword tokenization?

Modern approaches use subword tokens in lieu of stemming and lemmatization.

It is hard to give every possible form of a word its own token id, like:
- head -> token id: 1
- hunt -> token id: 2
- hunter -> token id: 3
- headhunter -> token id: 4

We instead do this:
- head -> token id: 1
- hunt -> token id: 2
- -er -> token id: 3
- headhunter -> token ids: 1 2 3

This way, it is easier to represent compound words and words with affixes. This matters especially in Turkish, where a word can take many suffixes.

Techniques like Byte-Pair Encoding (BPE) are also used when we want to be language-agnostic and learn our tokens from data in an unsupervised way.
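
The head/hunt/er idea can be sketched with a toy greedy longest-match splitter. This is only an illustration under simplifying assumptions: the three-entry vocabulary and the function name `greedy_subword_split` are made up for this example, the vocabulary is fixed by hand rather than learned, and this is neither BPE nor the tokenizer used in the rest of the notebook.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"cannot split {word!r} with this vocabulary")
        pieces.append(word[start:end])
        start = end
    return pieces


# Toy vocabulary mirroring the ids in the example above (hypothetical).
vocab = {"head": 1, "hunt": 2, "er": 3}

pieces = greedy_subword_split("headhunter", vocab)
ids = [vocab[piece] for piece in pieces]
print(pieces, ids)
```

With only three vocabulary entries, "head", "hunt", "hunter" and "headhunter" can all be represented.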

Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer. Let's look at an example of that:

In [23]:
example = datasets["train"][4]
print(example["tokens"])

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']


In [24]:
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']


Here the words "Zwingmann" and "sheepmeat" have each been split into three subtokens.

However, we do not have a separate label for each subword token of Zwingmann, like:

`Z -> [PER] ##wing -> [PER] ##mann -> [PER]`

("##" marks a token that is not the start of an original word but a continuation subword of one.)

Instead, we have only one label for the whole word:

`Zwingmann -> [PER]`

This means that we need to do some processing on our labels, because the input ids returned by the tokenizer are longer than the lists of labels our dataset contains: first because some special tokens might be added (we can see a `[CLS]` and a `[SEP]` above), and second because words may be split into multiple tokens.

Thankfully, the tokenizer returns outputs that have a `word_ids` method which can help us.

In [43]:
print(tokenized_input.word_ids())
print(len(tokenized_input["input_ids"]) == len(tokenized_input.word_ids()))

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]
True


As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to `None` and all other tokens to their respective word. This way, we can align the labels with the processed input ids.

### Aligning subwords with word labels

In [42]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))

39 39


Here we set the labels of all special tokens to -100 (the index that PyTorch's cross-entropy loss ignores by default) and the labels of all other tokens to the label of the word they come from.
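
To make the effect of the -100 label concrete, here is a pure-Python sketch of a cross-entropy loss that skips ignored positions. It is an illustration only, not PyTorch's implementation; the function name `masked_cross_entropy` and the logits below are made up. In PyTorch itself, `torch.nn.CrossEntropyLoss` has an `ignore_index` parameter whose default value is -100.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not `ignore_index`."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens contribute nothing to the loss
        # Numerically stable log-sum-exp of the logits.
        highest = max(token_logits)
        log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in token_logits))
        total += log_sum_exp - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Four token positions, three classes; the first and last play the role of [CLS] and [SEP].
logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [9.0, 0.0, 0.0]]
labels = [-100, 1, 2, -100]

loss_with_special = masked_cross_entropy(logits, labels)
loss_real_tokens_only = masked_cross_entropy(logits[1:3], labels[1:3])
print(loss_with_special, loss_real_tokens_only)
```

Both values are the same: whatever the model outputs at the ignored positions has no influence on the loss.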

We're now ready to write the function that will preprocess our samples. We feed them to the `tokenizer` with two arguments:
- `truncation=True`, to truncate texts that are longer than the maximum size allowed by the model
- `is_split_into_words=True`, as seen above

Then we align the labels with the token ids using the strategy we picked:

In [69]:
# If you wonder what `label_all_tokens` is, see the original notebook cited at the top of this notebook.
# Its explanation is intentionally omitted here to reduce the complexity of the notebook.
label_all_tokens = True

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"],
                                 truncation=True,
                                 is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
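
The inner loop of `tokenize_and_align_labels` can be tried without a tokenizer by writing the `word_ids` by hand. The sketch below repeats the same three-way branching on a made-up fragment; the helper name `align_labels`, the `word_ids` list and the word labels are assumptions chosen for illustration.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Expand one label per word into one label per token, as in the loop above."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first token of a word
        else:
            # continuation token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Hypothetical fragment: [CLS] werner z ##wing ##mann said [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

labels_all = align_labels(word_ids, word_labels, label_all_tokens=True)
labels_first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(labels_all)
print(labels_first_only)
```

With `label_all_tokens=True` every subword of "Zwingmann" carries the word's label; with `False` only the first subword is labeled and the rest are ignored by the loss.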

In [81]:
tokenize_and_align_labels(datasets['train'][:5])

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], [101, 2848, 13934, 102], [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], [101, 1996, 2647, 3222, 2056, 2006, 9432, 2009, 18335, 2007, 2446, 6040, 2000, 10390, 2000, 18454, 2078, 2329, 12559, 2127, 6529, 5646, 3251, 5506, 11190, 4295, 2064, 2022, 11860, 2000, 8351, 1012, 102], [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100], [-100, 1, 2, -100], [-100, 5, 0, 

In [79]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

## Fine-tuning the NER model

In [83]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers.trainer_utils import IntervalStrategy

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN t

In [84]:
args = TrainingArguments(
    f"test-{task}",
    evaluation_strategy=IntervalStrategy.EPOCH,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [85]:
from transformers import DataCollatorForTokenClassification
# Data collator that will dynamically pad the inputs received, as well as the labels.
data_collator = DataCollatorForTokenClassification(tokenizer)

In [86]:
# seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
!pip install seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=c3671f43dcfd833b980f03cb54a25ffbb39976f42ef3f0f2dc47da4c04406f13
  Stored in directory: /home/gsoykan20/.cache/pip/wheels/ad/5c/ba/05fa33fa5855777b7d686e843ec07452f22a66a138e290e732
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [88]:
from datasets import load_metric
metric = load_metric("seqeval")

This metric takes lists of labels for the predictions and references:

In [92]:
labels = [label_list[i] for i in example[f"{task}_tags"]]
# testing the metric
metric.compute(predictions=[labels], references=[labels])

{'LOC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

So we will need to do a bit of post-processing on our predictions:
- select the predicted index (with the maximum logit) for each token
- convert it to its string label
- ignore everywhere we set a label of -100

The following function does all this post-processing on the object the `Trainer` passes to `compute_metrics` during evaluation (an `EvalPrediction` named tuple containing predictions and labels) before applying the metric:

In [93]:
import numpy as np


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"]
    }

Note that we drop the precision/recall/f1 computed for each category and only focus on the overall precision/recall/f1/accuracy.
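
The three post-processing steps (argmax, conversion to string labels, dropping the -100 positions) can be followed on a tiny hand-made input. This is a pure-Python sketch with a shortened, hypothetical label list and invented logits; it mirrors the filtering in `compute_metrics` but does not call the metric.

```python
# Hypothetical three-class label list, shorter than the real one for readability.
toy_label_list = ["O", "B-PER", "I-PER"]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def postprocess(logits, labels):
    """Return string predictions and references, skipping positions labeled -100."""
    true_predictions, true_labels = [], []
    for sentence_logits, sentence_labels in zip(logits, labels):
        kept = [
            (argmax(token_logits), label)
            for token_logits, label in zip(sentence_logits, sentence_labels)
            if label != -100
        ]
        true_predictions.append([toy_label_list[p] for p, _ in kept])
        true_labels.append([toy_label_list[l] for _, l in kept])
    return true_predictions, true_labels


# One sentence with five token positions; the first and last are special tokens.
logits = [[
    [2.0, 0.1, 0.1],
    [0.2, 3.0, 0.1],
    [0.3, 0.2, 2.5],
    [1.5, 0.2, 1.0],
    [2.0, 0.1, 0.1],
]]
labels = [[-100, 1, 2, 2, -100]]

predictions_str, references_str = postprocess(logits, labels)
print(predictions_str)
print(references_str)
```

The two special-token positions disappear, and the remaining three tokens are compared as strings; in this made-up case the last real token is predicted as `O` although its reference label is `I-PER`.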

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [94]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


We can now finetune our model by just calling the `train` method:

In [96]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, ner_tags, pos_tags, chunk_tags, tokens.
***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2634
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: You can find your API key in your browser here: https://wandb.ai/authorize


2022-01-26 15:10:51.074176: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.



Saving model checkpoint to test-ner/checkpoint-500
Configuration saved in test-ner/checkpoint-500/config.json
Model weights saved in test-ner/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-ner/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-ner/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, ner_tags, pos_tags, chunk_tags, tokens.
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 16
Saving model checkpoint to test-ner/checkpoint-1000
Configuration saved in test-ner/checkpoint-1000/config.json
Model weights saved in test-ner/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-ner/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-ner/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-ner/checkpoint-1500
Configuration saved 

TrainOutput(global_step=2634, training_loss=0.08509582548134238, metrics={'train_runtime': 265.5655, 'train_samples_per_second': 158.616, 'train_steps_per_second': 9.918, 'total_flos': 509926772226690.0, 'train_loss': 0.08509582548134238, 'epoch': 3.0})
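
The number of optimization steps reported in the log can be reproduced with a little arithmetic. The sketch below assumes what the log states (one device, no gradient accumulation) and additionally assumes that the last, partially filled batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 14041  # "Num examples" in the training log
batch_size = 16       # total train batch size
num_epochs = 3

# The last batch of an epoch is smaller than 16, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 878 steps per epoch and 2634 steps in total, matching `global_step=2634` in the output above.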

## Evaluation of the NER model

The `evaluate` method allows you to evaluate again on the evaluation dataset or on another dataset:

In [97]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, ner_tags, pos_tags, chunk_tags, tokens.
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 16


{'eval_loss': 0.05975901335477829,
 'eval_precision': 0.9282709763117113,
 'eval_recall': 0.9381362568519969,
 'eval_f1': 0.9331775440939187,
 'eval_accuracy': 0.9841136193940935,
 'eval_runtime': 5.7369,
 'eval_samples_per_second': 566.509,
 'eval_steps_per_second': 35.559,
 'epoch': 3.0}
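
As a quick consistency check, the reported F1 should be the harmonic mean of the reported precision and recall. The sketch below assumes exactly that relationship for the overall scores and plugs in the numbers printed above.

```python
# Values copied from the evaluation output above.
precision = 0.9282709763117113
recall = 0.9381362568519969

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The result agrees with the `eval_f1` value in the output.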

To get the precision/recall/f1 computed for each category now that we have finished training, we can apply the same function as before on the result of the `predict` method:

In [102]:
def compute_test_results():
    predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return results

In [103]:
compute_test_results()

The following columns in the test set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, ner_tags, pos_tags, chunk_tags, tokens.
***** Running Prediction *****
  Num examples = 3453
  Batch size = 16


{'LOC': {'precision': 0.8968792401628223,
  'recall': 0.9375886524822695,
  'f1': 0.9167822468793343,
  'number': 2115},
 'MISC': {'precision': 0.7970863683662851,
  'recall': 0.7621890547263681,
  'f1': 0.7792472024415055,
  'number': 1005},
 'ORG': {'precision': 0.8780581039755352,
  'recall': 0.8664654847227461,
  'f1': 0.8722232770077842,
  'number': 2651},
 'PER': {'precision': 0.9721293199554069,
  'recall': 0.9582417582417583,
  'f1': 0.9651355838406197,
  'number': 2730},
 'overall_precision': 0.9036442976766128,
 'overall_recall': 0.9013057287377956,
 'overall_f1': 0.9024734982332157,
 'overall_accuracy': 0.9777294385746841}
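
Note that the precision, recall and F1 values above are computed over whole entities, not over single tokens, which is why they are lower than the token-level accuracy. The following pure-Python sketch shows the idea under simplifying assumptions: the function names are made up, only the BIO scheme is handled, and seqeval itself covers more tagging schemes and edge cases than this.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans = set()
    start, entity_type = None, None
    # The extra "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not continues:
            spans.add((entity_type, start, i))
            start, entity_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, entity_type = i, tag[2:]
    return spans


def entity_scores(predicted_tags, true_tags):
    """Entity-level precision, recall and F1 for one sentence."""
    predicted = extract_entities(predicted_tags)
    true = extract_entities(true_tags)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_tags = ["B-PER", "I-PER", "O", "B-LOC"]
predicted_tags = ["B-PER", "O", "O", "B-LOC"]  # second token of the person is missed

scores = entity_scores(predicted_tags, true_tags)
print(scores)
```

Three of the four tokens are tagged correctly, but only one of the two entities matches exactly, so precision, recall and F1 are all 0.5 in this made-up case.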

Saving trainer state and the model for later use.

In [106]:
trainer.save_state()
trainer.save_model()

## Testing with custom input

In [195]:
# e.g. Wings began recording sessions for its next album in London in November 1974
custom_sentence = input()

In [196]:
inputs = tokenizer(custom_sentence, return_tensors="pt", add_special_tokens=True)
inputs["input_ids"] = inputs["input_ids"].to(device=model.device)
inputs["attention_mask"] = inputs["attention_mask"].to(device=model.device)

In [197]:
outputs = model(**inputs)

In [198]:
predicted_classes = outputs['logits'].argmax(axis=2).cpu().numpy()[0]

In [199]:
tokens = tokenizer.convert_ids_to_tokens(ids=inputs["input_ids"].cpu().numpy()[0], skip_special_tokens=False)

In [200]:
for i, p in enumerate(predicted_classes):
    if tokens[i] in [tokenizer.sep_token, tokenizer.cls_token]:
        continue
    print(f"{tokens[i]} ----> {label_list[p]}")

wings ----> B-ORG
began ----> O
recording ----> O
sessions ----> O
for ----> O
its ----> O
next ----> O
album ----> O
in ----> O
london ----> B-LOC
in ----> O
november ----> O
1974 ----> O
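
The loop above prints one line per token, so a word that the tokenizer splits into several subwords would appear on several lines. A possible way to get one label per word is to glue "##" pieces back onto the preceding token and keep the label of the first piece. The sketch below is one such approach; the function name and the predicted labels are invented for illustration, and keeping the first piece's label is a design choice, not something the notebook prescribes.

```python
def merge_wordpieces(tokens, labels):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Subword tokens as produced earlier for "Werner Zwingmann said";
# the labels are hypothetical model predictions.
tokens = ["werner", "z", "##wing", "##mann", "said"]
labels = ["B-PER", "I-PER", "I-PER", "I-PER", "O"]

merged = merge_wordpieces(tokens, labels)
for word, label in merged:
    print(f"{word} ----> {label}")
```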
