# Demo of RobBERT for masked language modelling
We use a [RobBERT (Delobelle et al., 2020)](https://arxiv.org/abs/2001.06286) model with its original pretraining head for masked language modelling (MLM).

**Dependencies**
- tokenizers
- torch
- transformers

First we load the pretrained RobBERT model together with its tokenizer.

Because we only run inference, we call `model.eval()` to switch off dropout and other training-only behaviour.
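
The effect of `eval()` can be seen in isolation on a single dropout layer. This is a minimal sketch that does not involve RobBERT itself; the layer, the dropout probability `p=0.5` and the input tensor are assumptions chosen purely for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

# A stand-alone dropout layer; p=0.5 is an arbitrary value for illustration.
dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()          # training mode: entries are randomly zeroed, survivors scaled by 1/(1-p)
train_out = dropout(x)

dropout.eval()           # evaluation mode: dropout passes its input through unchanged
eval_out = dropout(x)

print("train:", train_out)
print("eval: ", eval_out)
```

In training mode the output changes from call to call, which is why predictions should be made in evaluation mode.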

*Note: we pretrained both RobBERT v1 and RobBERT v2 in [Fairseq](https://github.com/pytorch/fairseq) and converted these checkpoints to HuggingFace. The MLM task behaves a bit differently.*

In [1]:
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

# Load the tokenizer and the model with its masked-language-modelling head.
tokenizer = RobertaTokenizer.from_pretrained('pdelobelle/robbert-v2-dutch-base')
model = RobertaForMaskedLM.from_pretrained('pdelobelle/robbert-v2-dutch-base', return_dict=True)

# Use a GPU when available, and switch to evaluation mode to disable dropout.
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')
model.eval()
print("RobBERT model loaded")

Special tokens have been added in the vocabulary, make sure the associated word emebedding are fine-tuned or trained.


RobBERT model loaded


In [2]:
sequence = f"Er staat een {tokenizer.mask_token} in mijn tuin."

In [3]:
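# Encode the sentence and move the ids to the same device as the model.
input = tokenizer.encode(sequence, return_tensors="pt").to(model.device)
# Column index of the <mask> token within the single sequence in the batch.
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]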
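
To make the `torch.where` step concrete, here is a small sketch on made-up token ids. The ids and `MASK_ID` are hypothetical values for this toy example, not RobBERT's real vocabulary; in the notebook the mask id comes from `tokenizer.mask_token_id`.

```python
import torch

MASK_ID = 4  # hypothetical mask id, for illustration only

# A batch holding one sequence of made-up token ids; position 3 is the mask.
toy_input = torch.tensor([[0, 11, 12, MASK_ID, 13, 2]])

# With a single boolean argument, torch.where returns one index tensor per dimension:
# (row indices, column indices) for a 2-D input.
rows, cols = torch.where(toy_input == MASK_ID)

print(rows)  # which sequence in the batch contains a mask
print(cols)  # where the mask sits inside that sequence
```

Taking element `[1]` of the result, as the cell above does, keeps only the positions within the sequence.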

Now that we have the tokenized input and the position of the masked token, we pass the input through RobBERT.

This gives us a prediction for every token, but we are only interested in the one for the `<mask>` token.

In [4]:
with torch.no_grad():
    token_logits = model(input).logits

In [5]:
# Logits at the masked position only, turned into probabilities over the vocabulary.
logits = token_logits[0, mask_token_index, :].squeeze()
prob = logits.softmax(dim=0)
values, indices = prob.topk(k=15, dim=0)

for index, token in enumerate(tokenizer.convert_ids_to_tokens(indices.tolist())):
    print(f"{token:20} | id = {indices[index].item():4} | p = {values[index].item()}")

Ġboom                | id = 2600 | p = 0.1416003555059433
Ġvijver              | id = 8217 | p = 0.13144515454769135
Ġplant               | id = 2721 | p = 0.043418534100055695
Ġhuis                | id =  251 | p = 0.01847737282514572
Ġparkeerplaats       | id = 6889 | p = 0.018001794815063477
Ġbankje              | id = 21620 | p = 0.016940612345933914
Ġmuur                | id = 2035 | p = 0.014668751507997513
Ġmoestuin            | id = 17446 | p = 0.0144038125872612
Ġzonnebloem          | id = 30757 | p = 0.014375611208379269
Ġschutting           | id = 15000 | p = 0.013991709798574448
Ġpaal                | id = 8626 | p = 0.01358739286661148
Ġbloem               | id = 3001 | p = 0.01199684850871563
Ġstal                | id = 7416 | p = 0.011224730871617794
Ġfontein             | id = 23425 | p = 0.011203107424080372
Ġtuin                | id =  671 | p = 0.010676783509552479
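
The 15 probabilities listed above do not sum to 1, because the remaining probability mass is spread over the rest of the vocabulary. The softmax and top-k steps can be checked on a toy example; the five-word vocabulary and the logits below are made up for illustration and are not RobBERT outputs.

```python
import torch

# Made-up logits over a five-word toy vocabulary (not RobBERT outputs).
toy_vocab = ["boom", "vijver", "plant", "huis", "muur"]
toy_logits = torch.tensor([2.0, 1.5, 0.3, -0.5, -1.0])

probs = toy_logits.softmax(dim=0)          # non-negative values that sum to 1
values, indices = probs.topk(k=3, dim=0)   # sorted from most to least likely

for p, i in zip(values.tolist(), indices.tolist()):
    print(f"{toy_vocab[i]:10} | p = {p:.3f}")

print("total probability:", probs.sum().item())
print("top-3 probability:", values.sum().item())
```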


## RobBERT with pipelines
We can also use the `fill-mask` pipeline from Hugging Face, which wraps the same tokenize, predict and top-k steps in a single call.

In [6]:
from transformers import pipeline
p = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

Special tokens have been added in the vocabulary, make sure the associated word emebedding are fine-tuned or trained.


In [7]:
p(sequence)

[{'sequence': '<s>Er staat een boomin mijn tuin.</s>',
  'score': 0.1416003555059433,
  'token': 2600,
  'token_str': 'Ġboom'},
 {'sequence': '<s>Er staat een vijverin mijn tuin.</s>',
  'score': 0.13144515454769135,
  'token': 8217,
  'token_str': 'Ġvijver'},
 {'sequence': '<s>Er staat een plantin mijn tuin.</s>',
  'score': 0.043418534100055695,
  'token': 2721,
  'token_str': 'Ġplant'},
 {'sequence': '<s>Er staat een huisin mijn tuin.</s>',
  'score': 0.01847737282514572,
  'token': 251,
  'token_str': 'Ġhuis'},
 {'sequence': '<s>Er staat een parkeerplaatsin mijn tuin.</s>',
  'score': 0.018001794815063477,
  'token': 6889,
  'token_str': 'Ġparkeerplaats'}]
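
The `Ġ` at the start of each `token_str` is how the byte-level BPE vocabulary marks a leading space. The sketch below turns such predictions into plain words; the two entries are copied from the output above, and `readable` is a hypothetical helper that is not part of `transformers`.

```python
# First two predictions copied from the pipeline output above.
predictions = [
    {"score": 0.1416003555059433, "token": 2600, "token_str": "Ġboom"},
    {"score": 0.13144515454769135, "token": 8217, "token_str": "Ġvijver"},
]


def readable(token_str: str) -> str:
    """Replace the byte-level BPE space marker 'Ġ' with a space, then trim it."""
    return token_str.replace("Ġ", " ").strip()


words = [(readable(p["token_str"]), round(p["score"], 3)) for p in predictions]
print(words)  # [('boom', 0.142), ('vijver', 0.131)]
```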

That's it for this demo of the MLM head. If you use RobBERT in your academic work, you can cite it as follows:


```
@misc{delobelle2020robbert,
    title={{R}ob{BERT}: a {D}utch {R}o{BERT}a-based Language Model},
    author={Pieter Delobelle and Thomas Winters and Bettina Berendt},
    year={2020},
    eprint={2001.06286},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
