# mBERT for topic labeling

**Dependencies**
- tokenizers
- torch
- transformers

First we load the model, which was already trained for this task, together with its tokenizer.

Because we only want predictions, we switch the model to inference mode with `model.eval()`, which disables dropout and other training-only behavior.
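
To see why `model.eval()` matters, here is a minimal sketch that uses a standalone `torch.nn.Dropout` layer rather than the mBERT model itself. The layer, `p=0.5`, and the all-ones input are illustrative assumptions. In training mode, dropout zeroes random elements and rescales the rest; in eval mode it passes the input through unchanged.

```python
import torch
from torch import nn

# Illustrative stand-in for the dropout layers inside the model.
dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()          # training mode: elements are randomly zeroed
train_out = dropout(x)   # surviving elements are scaled by 1 / (1 - p) = 2.0

dropout.eval()           # inference mode: dropout becomes a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

Running the training-mode call twice generally gives different results, which is why predictions made without `model.eval()` can vary from run to run.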

In [3]:
# `!export` runs in a temporary subshell and does not affect the kernel, so we use %env instead.
%env CUDA_VISIBLE_DEVICES=0

In [1]:
model_path="DTAI-KULeuven/mbert-corona-tweets-belgium-topics"
model_path

'DTAI-KULeuven/mbert-corona-tweets-belgium-topics'

In [2]:
from tqdm import tqdm

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, return_dict=True)
model.eval().to("cuda" if torch.cuda.is_available() else "cpu")
print("model loaded")

model loaded


## Sanity check
Let's first check if the predictions make sense, just to make sure we loaded the model correctly :-) 

In [6]:
inputs = tokenizer(
    ["Context: Tip voor onze horecazaken: stap naar de rechter! 👇  ---------- Tweet: RT @SamvanRooy1: Tip voor onze horecazaken: stap naar de rechter! 👇",
     "Context: Hey @JanJambon en @de_NVA het is maar dat jullie het weten  maar ik duld geen vroegere avondklok in Vlaanderen. Er is al een samenscholingsverbod. ---------- Tweet: RT @KarenLiesens: Hey @JanJambon en @de_NVA het is maar dat jullie het weten  maar ik duld geen vroegere avondklok in Vlaanderen. Er is al…",
     "This better not be a joke! Als het geen avondklok was ging ik nu zoeken, aub kacper laat ons weten dat alles oké met je is"],
    return_tensors="pt", padding=True)
for key, value in inputs.items():
    inputs[key] = value.to("cuda" if torch.cuda.is_available() else "cpu")
    print("{}:\n\t{}".format(key, value))
print("Tokens:\n\t{}".format(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]) ))
print("\t{}".format(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1]) ))

input_ids:
	tensor([[  101, 49082, 28883,   131, 82386, 10423, 41208, 13173, 74755, 29797,
         11062,   131, 16527, 10410, 11422, 10104, 82991,   106,   100,   118,
           118,   118,   118,   118,   118,   118,   118,   118,   118, 61444,
         10123,   131, 56898,   137, 14268, 12955, 11273, 22659, 10157, 10759,
           131, 82386, 10423, 41208, 13173, 74755, 29797, 11062,   131, 16527,
         10410, 11422, 10104, 82991,   106,   100,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101, 49082, 28883,   131, 35936,   137, 11806, 15417, 11008, 22572,
         10110,   137, 10104,   168,   

In our model config, we stored the labels we use (for example, `0 = other-measure` and `1 = closing-horeca`). We can load these and automatically convert our predictions to a human-readable format. The loop below also strips stray newlines from the label names.

In [7]:
for i in model.config.id2label:
    model.config.id2label[i] = model.config.id2label[i].replace("\n","")
    
print(model.config.id2label)

{0: 'other-measure', 1: 'closing-horeca', 2: 'testing', 3: 'schools', 4: 'lockdown', 5: 'quarantine', 6: 'curfew', 7: 'masks', 8: 'not-applicable', 9: 'vaccine'}
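
The cleanup loop above mutates `model.config.id2label` in place. A side-effect-free alternative can be sketched on a plain dictionary: the `raw_id2label` values with trailing newlines are a hypothetical stand-in for what the config might contain, and `label2id` is an illustrative reverse lookup.

```python
# Hypothetical stand-in for model.config.id2label before cleanup.
raw_id2label = {0: "other-measure\n", 1: "closing-horeca\n", 2: "testing\n"}

# Build a cleaned copy instead of modifying the original mapping.
id2label = {idx: label.replace("\n", "") for idx, label in raw_id2label.items()}

# Reverse mapping, handy for looking up the index of a topic by name.
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id["testing"])
```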


Ok, let's do some predictions! Since we only have three tweets, we can run them through the model in one batch, as long as it fits on your GPU.

In [8]:
with torch.no_grad():
    results = model(**inputs)
    print([model.config.id2label[item.item()] for item in results.logits.argmax(axis=1)])
    print(results.logits)

['closing-horeca', 'curfew', 'curfew']
tensor([[ 0.0199,  2.3763, -1.3540, -0.0728,  1.7234, -1.0660, -1.9340, -1.4213,
          0.1896,  0.9748],
        [-1.2840, -0.7962, -0.6047, -0.8784,  0.7997,  0.8366,  5.6206, -0.3485,
         -1.4491, -0.9492],
        [-1.2943, -0.7917, -0.6074, -0.8710,  0.7887,  0.8474,  5.6175, -0.3549,
         -1.4373, -0.9491]], device='cuda:0')
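
The raw logits are hard to read as confidences. A minimal sketch, using only the standard library and the first row of logits printed above (copied by hand, so rounded to four decimals), converts them to probabilities with a softmax and lists the most likely topics. The helper names `softmax` and `top_k` are illustrative and not part of `transformers`.

```python
import math

id2label = {0: 'other-measure', 1: 'closing-horeca', 2: 'testing', 3: 'schools',
            4: 'lockdown', 5: 'quarantine', 6: 'curfew', 7: 'masks',
            8: 'not-applicable', 9: 'vaccine'}

# First row of the logits printed above (the horeca tweet).
logits = [0.0199, 2.3763, -1.3540, -0.0728, 1.7234,
          -1.0660, -1.9340, -1.4213, 0.1896, 0.9748]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:k]]

probs = softmax(logits)
for label, p in top_k(probs):
    print(f"{label}: {p:.3f}")
```

With the model still in memory, `torch.softmax(results.logits, dim=1)` should give the same probabilities for the whole batch at once.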
