In [1]:
### Lettuce POS Tagger Utility Scripts
### Language: French
### Author: Pranaydeep Singh
### Last Update: 2024-05-06
### Description: Inference script for POS tagging using pre-trained Transformers for the Lettuce project
### Requirements: transformers, ipymarkup

In [2]:
# Import the required libraries. ipymarkup is optional; it is only needed to visualize the output.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from ipymarkup import show_span_box_markup

In [3]:
# We will load the two pretrained models for French in a pipeline for POS tagging. The models are available on the Hugging Face model hub.

classifier_mono = pipeline("token-classification", model="pranaydeeps/lettuce_pos_fr_mono")  # Fine-tuned monolingual French model
classifier_xlm = pipeline("token-classification", model="pranaydeeps/lettuce_pos_fr_xlm")    # Fine-tuned multilingual XLM model


In [4]:
# Sample text from the French Wikipedia article about the city of Ghent.

text = "Gand, est une ville belge néerlandophone, située en Région flamande au confluent de la Lys et de l'Escaut."

In [8]:
# We will now pass the text to the two models and get the output.

output_mono = classifier_mono(text)
output_xlm = classifier_xlm(text)
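
Before visualizing, it can help to look at the raw predictions. Each pipeline call returns a list of dictionaries, one per sub-word token. The sketch below prints them as aligned rows, assuming the usual `token-classification` output keys (`word`, `entity`, `score`, `start`, `end`). The helper `format_predictions` and the `sample_output` list are hypothetical: the values are hand-written for illustration and are not real output from either model.

```python
def format_predictions(predictions):
    """Return one aligned text row per predicted sub-word token."""
    rows = []
    for tok in predictions:
        rows.append(
            f"{tok['word']:<12}{tok['entity']:<8}{tok['score']:.3f}  "
            f"[{tok['start']}:{tok['end']}]"
        )
    return rows

# Hand-written stand-in for a pipeline result (hypothetical tags and scores).
sample_output = [
    {"word": "▁Gand", "entity": "PROPN", "score": 0.99, "start": 0, "end": 4},
    {"word": ",", "entity": "PUNCT", "score": 0.98, "start": 4, "end": 5},
]

for row in format_predictions(sample_output):
    print(row)
```

In the notebook, `format_predictions(output_mono)` or `format_predictions(output_xlm)` would take the place of the sample list.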

In [9]:
# We will now visualize the output using ipymarkup. First let's see the output of the monolingual model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_mono])

In [10]:
# Now let's see the output of the XLM model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_xlm])
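
Besides comparing the two visualizations by eye, the disagreements can be listed programmatically. This is a minimal sketch, assuming both outputs carry `start`, `end` and `entity` keys. The helper `disagreements` is hypothetical, and it only compares tokens that begin at the same character offset in both outputs, so words that the two tokenizers split differently are partly skipped. The sample predictions are hand-written, not real model output.

```python
def disagreements(text, preds_a, preds_b):
    """List (surface form, tag_a, tag_b) for tokens that start at the same
    character offset in both outputs but receive different tags."""
    by_start = {tok["start"]: tok for tok in preds_b}
    result = []
    for tok in preds_a:
        other = by_start.get(tok["start"])
        if other is not None and other["entity"] != tok["entity"]:
            surface = text[tok["start"]:tok["end"]]
            result.append((surface, tok["entity"], other["entity"]))
    return result

# Hand-written illustrative predictions (hypothetical tags).
sample_text = "une ville"
sample_a = [
    {"entity": "DET", "start": 0, "end": 3},
    {"entity": "NOUN", "start": 4, "end": 9},
]
sample_b = [
    {"entity": "DET", "start": 0, "end": 3},
    {"entity": "PROPN", "start": 4, "end": 9},
]

print(disagreements(sample_text, sample_a, sample_b))
```

In the notebook this would be called as `disagreements(text, output_mono, output_xlm)`.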

In the end you can use the model that works better for you!

Note: The two models tokenize the text differently, so a word that the multilingual model breaks into 3-4 sub-words may be split into only 2 sub-words, or not split at all, by the monolingual model. The monolingual model should usually be better at tokenization.
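
Because the pipelines predict one tag per sub-word, a word-level view makes the two models easier to compare. The sketch below shows one simple option, not the project's own method: it splits the text into words and punctuation marks with a regular expression and gives each word the tag of the first sub-word that starts inside it. The helper `word_level_tags`, the regular expression, and the sample predictions are all assumptions made for illustration, and the predictions are expected to be in text order.

```python
import re

def word_level_tags(text, predictions):
    """Give each word the tag of the first sub-word starting inside its span."""
    words = []
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.span()
        tag = next(
            (tok["entity"] for tok in predictions if start <= tok["start"] < end),
            None,  # no sub-word starts inside this word
        )
        words.append((match.group(), tag))
    return words

# Hand-written example: "néerlandophone" is split into two sub-words (hypothetical tags).
sample_text = "belge néerlandophone"
sample_preds = [
    {"entity": "ADJ", "start": 0, "end": 5},
    {"entity": "ADJ", "start": 6, "end": 15},
    {"entity": "NOUN", "start": 15, "end": 20},
]

print(word_level_tags(sample_text, sample_preds))
```

Taking the first sub-word's tag is a common convention; a majority vote over the sub-words or the highest-scoring sub-word would be equally reasonable alternatives.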