In [1]:
### Lettuce POS Tagger Utility Scripts
### Language: Dutch
### Author: Pranaydeep Singh
### Last Update: 2024-05-06
### Description: Inference script for POS tagging using pre-trained Transformers for the Lettuce project
### Requirements: transformers, ipymarkup

In [2]:
# Import the required libraries. ipymarkup is optional: it is only needed to visualize the output.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from ipymarkup import show_span_box_markup

In [3]:
# Load the two pretrained Dutch models into token-classification pipelines for POS tagging. Both models are available on the Hugging Face model hub.

classifier_mono = pipeline("token-classification", model="pranaydeeps/lettuce_pos_nl_mono")  # Fine-tuned monolingual Dutch model
classifier_xlm = pipeline("token-classification", model="pranaydeeps/lettuce_pos_nl_xlm")    # Fine-tuned multilingual XLM model


In [4]:
# Sample text from the Dutch Wikipedia article about the city of Gent.

text = "Gent is de hoofdstad en grootste centrumstad van de Belgische provincie Oost-Vlaanderen en van het arrondissement Gent."

In [5]:
# We will now pass the text to the two models and get the output.

output_mono = classifier_mono(text)
output_xlm = classifier_xlm(text)
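
Before visualizing, it can help to look at the raw predictions as plain text, which also works when ipymarkup is not installed. The sketch below assumes each prediction is a dict with `start`, `end` and `entity` keys, the same fields used in the visualization cells. The helper name `tag_lines`, the sample sentence and its labels are made up for illustration and are not the models' actual output or tag set.

```python
def tag_lines(text, output):
    """Return one 'surface<TAB>label' string per predicted (sub-word) token."""
    return [f"{text[tok['start']:tok['end']]}\t{tok['entity']}" for tok in output]


# Hypothetical stand-in for a pipeline result; the labels are illustrative only.
sample_text = "Gent is groot."
sample_output = [
    {"start": 0, "end": 2, "entity": "NOUN"},    # "Ge"  (first sub-word of "Gent")
    {"start": 2, "end": 4, "entity": "NOUN"},    # "nt"  (second sub-word of "Gent")
    {"start": 5, "end": 7, "entity": "VERB"},    # "is"
    {"start": 8, "end": 13, "entity": "ADJ"},    # "groot"
    {"start": 13, "end": 14, "entity": "PUNCT"}, # "."
]

for line in tag_lines(sample_text, sample_output):
    print(line)
```

With the real outputs, the same helper could be called as `tag_lines(text, output_mono)` or `tag_lines(text, output_xlm)`.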

In [6]:
# We will now visualize the output using ipymarkup. First let's see the output of the monolingual model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_mono])

In [7]:
# Now let's see the output of the XLM model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_xlm])

In the end you can use the model that works better for you!

Note: The two models tokenize text differently, so a word that the multilingual model splits into 3-4 sub-words may be split into only 2 sub-words, or not split at all, by the monolingual model. The monolingual model should usually be better at tokenisation.
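
Because the sub-word splits differ between the two models, comparing them is easier with one label per word. The following is a minimal sketch of one way to do that, not part of the Lettuce models or of transformers. It assumes the predictions carry `start`, `end` and `entity` keys, treats a word as a run of letters, digits and hyphens (punctuation marks are separate), and gives each word the label of its first sub-word. The helper name `merge_subwords` and the sample labels are hypothetical.

```python
import re

# Assumption: a "word" is a run of word characters, optionally joined by hyphens
# (e.g. "Oost-Vlaanderen"); any other non-space character is its own token.
WORD_RE = re.compile(r"\w[\w-]*|[^\w\s]")


def merge_subwords(text, tokens):
    """Collapse sub-word predictions into one (start, end, label) span per word.

    Each word takes the label of the first sub-word that starts inside it.
    Words with no matching prediction are skipped.
    """
    ordered = sorted(tokens, key=lambda t: t["start"])
    spans = []
    for match in WORD_RE.finditer(text):
        w_start, w_end = match.span()
        labels = [t["entity"] for t in ordered if w_start <= t["start"] < w_end]
        if labels:
            spans.append((w_start, w_end, labels[0]))
    return spans


# Hypothetical predictions; the labels are illustrative, not the models' tag set.
demo_text = "Gent is groot."
demo_tokens = [
    {"start": 0, "end": 2, "entity": "NOUN"},
    {"start": 2, "end": 4, "entity": "NOUN"},
    {"start": 5, "end": 7, "entity": "VERB"},
    {"start": 8, "end": 13, "entity": "ADJ"},
    {"start": 13, "end": 14, "entity": "PUNCT"},
]
demo_spans = merge_subwords(demo_text, demo_tokens)
```

The returned tuples have the same shape as the spans passed to ipymarkup above, so a call such as `show_span_box_markup(text, merge_subwords(text, output_xlm))` should show one box per word. Taking the first sub-word's label is a simple choice; a majority vote or the highest-scoring sub-word are alternatives.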