In [1]:
### Lettuce POS Taggers
### Language: German
### Author: Pranaydeep Singh
### Last Update: 2024-05-06
### Description: Inference script for POS tagging using pre-trained Transformers for the Lettuce project
### Requirements: transformers, ipymarkup

In [2]:
# Import the required libraries. ipymarkup is optional; it is only needed to visualize the output.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from ipymarkup import show_span_box_markup

In [3]:
# Load the two pretrained German models into token-classification pipelines for POS tagging. Both models are available on the Hugging Face model hub.

classifier_mono = pipeline("token-classification", model="pranaydeeps/lettuce_pos_de_mono")  # Fine-tuned monolingual German model
classifier_xlm = pipeline("token-classification", model="pranaydeeps/lettuce_pos_de_xlm")    # Fine-tuned multilingual XLM model



In [8]:
# Sample text taken from the German Wikipedia article about the city of Gent.

text = "Gent ist nach Antwerpen die zweitgrößte Stadt Belgiens. Sie zählt 265.086 Einwohner und ist die Hauptstadt der Provinz Ostflandern, des Arrondissements Gent und des Wahlbezirks."

In [10]:
# We will now pass the text to the two models and get the output.

output_mono = classifier_mono(text)
output_xlm = classifier_xlm(text)
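
Each pipeline call returns a list of dictionaries, one per sub-word token, with the keys `entity`, `score`, `index`, `word`, `start` and `end`. The sketch below shows one way to print such a list as a small table. The helper name `format_predictions`, the sample entries and the tag names are made up for illustration; they are not real output from either model.

```python
def format_predictions(tokens):
    """Format token-classification predictions as aligned text rows."""
    rows = []
    for t in tokens:
        # Character offsets, sub-word text, predicted tag, confidence score.
        rows.append(
            f"{t['start']:>3}-{t['end']:<3} {t['word']:<12} {t['entity']:<6} {t['score']:.2f}"
        )
    return rows


# Hypothetical predictions, shaped like the pipeline output (not real model output).
sample_output = [
    {"entity": "PROPN", "score": 0.99, "index": 1, "word": "Gent", "start": 0, "end": 4},
    {"entity": "AUX", "score": 0.98, "index": 2, "word": "ist", "start": 5, "end": 8},
]

rows = format_predictions(sample_output)
for row in rows:
    print(row)
```

In the notebook, the same helper could be called on `output_mono` or `output_xlm` instead of `sample_output`.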

In [11]:
# We will now visualize the output using ipymarkup. First let's see the output of the monolingual model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_mono])

In [12]:
# Now let's see the output of the XLM model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_xlm])

In the end, you can use whichever model works better for your data!
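
To help decide between the two models, it can be useful to list only the words on which they disagree. The following is a minimal sketch under a few assumptions: words are taken to be whitespace-delimited chunks of the text, each word gets the tag of its first sub-word, and the helper names (`first_tag`, `tag_disagreements`) and sample predictions are hypothetical rather than part of the Lettuce project.

```python
import re


def first_tag(tokens, start, end):
    """Return the tag of the earliest sub-word starting inside [start, end), or None."""
    inside = [t for t in tokens if start <= t["start"] < end]
    if not inside:
        return None
    return min(inside, key=lambda t: t["start"])["entity"]


def tag_disagreements(text, tokens_a, tokens_b):
    """List (word, tag_a, tag_b) for whitespace-delimited words tagged differently."""
    diffs = []
    for match in re.finditer(r"\S+", text):
        tag_a = first_tag(tokens_a, *match.span())
        tag_b = first_tag(tokens_b, *match.span())
        if tag_a != tag_b:
            diffs.append((match.group(), tag_a, tag_b))
    return diffs


# Hypothetical predictions for illustration only (not real model output).
demo_text = "Gent ist Stadt"
demo_a = [
    {"start": 0, "end": 4, "entity": "PROPN"},
    {"start": 5, "end": 8, "entity": "AUX"},
    {"start": 9, "end": 14, "entity": "NOUN"},
]
demo_b = [
    {"start": 0, "end": 4, "entity": "NOUN"},
    {"start": 5, "end": 8, "entity": "AUX"},
    {"start": 9, "end": 11, "entity": "NOUN"},   # "St"
    {"start": 11, "end": 14, "entity": "NOUN"},  # "adt"
]

diffs = tag_disagreements(demo_text, demo_a, demo_b)
print(diffs)
```

Because the comparison works on character offsets, it does not matter that the two models split words into different sub-words. In the notebook, `tag_disagreements(text, output_mono, output_xlm)` would apply the same comparison to the real outputs.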

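Note: The two models tokenize the text differently, so a word might be split into 3-4 sub-words by the multilingual model but into only 2 sub-words, or not split at all, by the monolingual model. Each sub-word receives its own tag, so the visualizations above can show several boxes for a single word. The monolingual model should usually be better at tokenization.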
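
If one tag per word is preferred over one tag per sub-word, the predictions can be merged using their character offsets. This is a sketch of one possible approach, not the project's own method: it assumes words are whitespace-delimited and assigns each word the tag of its first sub-word, which means punctuation attached to a word (such as a trailing full stop) is absorbed into that word. The function name `merge_subwords`, the sample predictions and the tag names are hypothetical.

```python
import re


def merge_subwords(text, tokens):
    """Group sub-word predictions into whitespace-delimited words.

    `tokens` is a list of dicts with 'start', 'end' and 'entity' keys.
    Returns (start, end, tag) tuples, one per word that has a prediction.
    """
    words = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        # Collect the sub-words whose start offset falls inside this word.
        pieces = [t for t in tokens if w_start <= t["start"] < w_end]
        if pieces:
            first = min(pieces, key=lambda t: t["start"])
            words.append((w_start, w_end, first["entity"]))
    return words


# Hypothetical sub-word predictions for illustration only (not real model output).
sample_text = "Gent ist zweitgrößte Stadt"
sample_tokens = [
    {"start": 0, "end": 4, "entity": "PROPN"},
    {"start": 5, "end": 8, "entity": "AUX"},
    {"start": 9, "end": 14, "entity": "ADJ"},   # "zweit"
    {"start": 14, "end": 20, "entity": "ADJ"},  # "größte"
    {"start": 21, "end": 26, "entity": "NOUN"},
]

merged = merge_subwords(sample_text, sample_tokens)
print(merged)
```

The returned tuples have the same `(start, end, label)` shape that `show_span_box_markup` receives in the cells above, so `show_span_box_markup(text, merge_subwords(text, output_xlm))` should draw one box per word.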