# Demo of RobBERT for Dutch named entity recognition
We use a [RobBERT (Delobelle et al., 2020)](https://arxiv.org/abs/2001.06286) model for NER.

**Dependencies**
- tokenizers
- torch
- transformers

First, we load our RobBERT model, which was pretrained on OSCAR and fine-tuned for Dutch named entity recognition. We also load RobBERT's tokenizer.

Because we only want to run inference, we call `model.eval()` to disable dropout and other training-only behaviour.

In [2]:
import torch
from transformers import RobertaTokenizer, RobertaForTokenClassification

tokenizer = RobertaTokenizer.from_pretrained('pdelobelle/robbert-v2-dutch-ner')
model = RobertaForTokenClassification.from_pretrained('pdelobelle/robbert-v2-dutch-ner', return_dict=True)
model.eval()
print("RobBERT model loaded")

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


RobBERT model loaded
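
To make the effect of `model.eval()` more concrete, here is a toy, pure-Python sketch of (inverted) dropout. It is not PyTorch's implementation; the function `toy_dropout` and its parameters are made up for illustration. In training mode each value is zeroed with probability `p` and the survivors are rescaled by `1 / (1 - p)`; in evaluation mode the input passes through unchanged, which is why predictions become deterministic.

```python
import random

def toy_dropout(values, p=0.5, training=True, rng=None):
    """Toy inverted dropout over a list of floats (illustrative, not torch code)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Evaluation mode: dropout is a no-op.
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    # Training mode: drop each value with probability p, rescale the rest.
    return [0.0 if rng.random() < p else v * scale for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
print(toy_dropout(activations, training=True))   # output depends on the random draws
print(toy_dropout(activations, training=False))  # [1.0, 2.0, 3.0, 4.0]
```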


In [3]:
inputs = tokenizer.batch_encode_plus(
    ["Jan ging naar de bakker in Leuven en kocht een brood.",
     "Bedrijven zoals Google en Microsoft doen ook heel veel onderzoek naar NLP.",
     "Men moet een gegeven paard niet in de bek kijken.",
     "Hallo, mijn naam is RobBERT."],
    return_tensors="pt", padding=True)
for key, value in inputs.items():
    print("{}:\n\t{}".format(key, value))
print("Tokens:\n\t{}".format(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]) ))
print("\t{}".format(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1]) ))

input_ids:
	tensor([[    0,  6079,   499,    38,     5, 13292,    11,  6422,     8,  7010,
             9,  2617,     4,     2,     1],
        [    0, 25907,   129,  1283,     8,  3971,   113,    28,   118,    71,
           435,    38, 27600,     4,     2],
        [    0,  9396,    89,     9,   797,  2877,    22,    11,     5,  4290,
           445,     4,     2,     1,     1],
        [    0,  7751,     6,    74,   458,    12,  3663, 14334,   342,     4,
             2,     1,     1,     1,     1]])
attention_mask:
	tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
Tokens:
	['<s>', 'Jan', 'Ġging', 'Ġnaar', 'Ġde', 'Ġbakker', 'Ġin', 'ĠLeuven', 'Ġen', 'Ġkocht', 'Ġeen', 'Ġbrood', '.', '</s>', '<pad>']
	['<s>', 'Bedrijven', 'Ġzoals', 'ĠGoogle', 'Ġen', 'ĠMicrosoft', 'Ġdoen', 'Ġook', 'Ġheel', 'Ġveel', 'Ġonderzoek', 'Ġnaa
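
With `padding=True`, every sentence is padded to the length of the longest one in the batch, and `attention_mask` marks which positions hold real tokens. The sketch below mimics that behaviour in plain Python. `pad_batch` is a hypothetical helper, not part of `transformers`, and the pad id of 1 is simply read off the `input_ids` printed above.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad lists of token ids to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Third and fourth sentences from the output above, with the padding removed.
batch = [
    [0, 9396, 89, 9, 797, 2877, 22, 11, 5, 4290, 445, 4, 2],
    [0, 7751, 6, 74, 458, 12, 3663, 14334, 342, 4, 2],
]
ids, mask = pad_batch(batch)
print(ids[1])   # [0, 7751, 6, 74, 458, 12, 3663, 14334, 342, 4, 2, 1, 1]
print(mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```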

The model config stores the labels we use.
We can load these and automatically convert our predictions to a human-readable format.
For reference, we have 4 types of named entities:

- PER
- LOC
- ORG
- MISC

We mark the first token of an entity with `B-` and each continuation token with `I-`. Tokens outside any entity get the label `O`.

In [4]:
print(model.config.id2label)

{0: 'B-PER', 1: 'B-ORG', 2: 'B-LOC', 3: 'B-MISC', 4: 'I-PER', 5: 'I-ORG', 6: 'I-LOC', 7: 'I-MISC', 8: 'O'}
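
As a rough sketch of how these `B-`/`I-` labels can be turned back into entities, the helper below groups consecutive tokens into spans. `extract_entities` is a hypothetical function written for this demo, not part of `transformers`. It assumes one label per token and treats the `Ġ` prefix of the byte-level BPE tokens as a leading space. The example tokens and labels are written by hand.

```python
def extract_entities(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for token, label in zip(list(tokens) + [""], list(labels) + ["O"]):
        if label.startswith("I-") and label[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Any other label closes the open entity.
            text = "".join(current_tokens).replace("Ġ", " ").strip()
            entities.append((current_type, text))
        if label.startswith("B-"):
            current_type, current_tokens = label[2:], [token]
        else:
            # "O", or a stray "I-" without a matching "B-": nothing is open.
            current_type, current_tokens = None, []
    return entities

tokens = ["<s>", "Jan", "Ġging", "Ġnaar", "ĠNew", "ĠYork", ".", "</s>"]
labels = ["O", "B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
print(extract_entities(tokens, labels))  # [('PER', 'Jan'), ('LOC', 'New York')]
```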


Ok, let's do some predictions! Since we only have 4 sentences, we can process them in a single batch, as long as it fits on your GPU.

_If the output below wraps badly, try zooming out or making the window wider._

In [5]:
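with torch.no_grad():
    results = model(**inputs)

# For each position, pick the label id with the highest logit.
predictions = results.logits.argmax(dim=-1)
for i, token_ids in enumerate(inputs['input_ids']):
    print(f"Sentence {i}")
    tokens = tokenizer.convert_ids_to_tokens(token_ids.tolist())
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[i]]
    # Print tokens and labels in aligned 12-character columns.
    print("".join("{:12}".format(token) for token in tokens))
    print()
    print("".join("{:12}".format(label) for label in labels))
    print()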

Sentence 0
<s>         Jan         Ġging       Ġnaar       Ġde         Ġbakker     Ġin         ĠLeuven     Ġen         Ġkocht      Ġeen        Ġbrood      .           </s>        <pad>       

O           B-PER       O           O           O           O           O           B-LOC       O           O           O           O           O           O           O           

Sentence 1
<s>         Bedrijven   Ġzoals      ĠGoogle     Ġen         ĠMicrosoft  Ġdoen       Ġook        Ġheel       Ġveel       Ġonderzoek  Ġnaar       ĠNLP        .           </s>        

O           O           O           B-ORG       O           B-ORG       O           O           O           O           O           O           B-MISC      O           O           

Sentence 2
<s>         Men         Ġmoet       Ġeen        Ġgegeven    Ġpaard      Ġniet       Ġin         Ġde         Ġbek        Ġkijken     .           </s>        <pad>       <pad>       

O           O           O           O           O        

Ok, this works nicely! 'Jan', 'Leuven' and companies like 'Google' are all labeled correctly.
In addition, 'RobBERT' is split into multiple tokens (perhaps we should have added a dedicated token for its name), and the `I-` labels handle that as well.
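
The prediction cell above prints a label for every position, including `<s>`, `</s>` and `<pad>`. As one possible refinement, the sketch below takes the argmax per token and keeps only the real tokens. It uses plain Python lists in place of tensors; `decode_predictions`, the three-label `id2label` and the logit values are all made up for illustration.

```python
def decode_predictions(logits, tokens, id2label,
                       special_tokens=("<s>", "</s>", "<pad>")):
    """Pick the highest-scoring label per token and drop special tokens."""
    decoded = []
    for token, scores in zip(tokens, logits):
        if token in special_tokens:
            continue
        # Index of the largest score, i.e. a plain-Python argmax.
        best = max(range(len(scores)), key=scores.__getitem__)
        decoded.append((token, id2label[best]))
    return decoded

id2label = {0: "B-PER", 1: "B-LOC", 2: "O"}  # toy label set, not the model's
tokens = ["<s>", "Jan", "Ġin", "ĠLeuven", "</s>", "<pad>"]
logits = [  # invented scores, one row per token
    [0.1, 0.2, 3.0],
    [4.2, 0.3, 0.5],
    [0.0, 0.1, 5.1],
    [0.2, 3.9, 0.4],
    [0.1, 0.1, 2.5],
    [0.3, 0.2, 1.0],
]
print(decode_predictions(logits, tokens, id2label))
# [('Jan', 'B-PER'), ('Ġin', 'O'), ('ĠLeuven', 'B-LOC')]
```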

