In [19]:
### Lettuce POS Tagger Utility Scripts
### Language: French
### Author: Pranaydeep Singh
### Last Update: 2024-05-06
### Description: Inference script for POS tagging using pre-trained Transformers for the Lettuce project
### Requirements: transformers, ipymarkup

In [None]:
# Import the required libraries. ipymarkup is optional; it is only needed to visualize the output.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from ipymarkup import show_span_box_markup

In [21]:
# We will load the two pretrained models for French in a pipeline for POS tagging. The models are available on the Hugging Face model hub.

classifier_mono = pipeline("token-classification", model="pranaydeeps/lettuce_pos_fr_mono")  #This is a finetuned monolingual French model
classifier_xlm = pipeline("token-classification", model="pranaydeeps/lettuce_pos_fr_xlm")    #This is a finetuned multilingual XLM model

In [None]:
# Sample text from the French Wikipedia article about the city of Ghent.

text = "Gand, est une ville belge néerlandophone, située en Région flamande au confluent de la Lys et de l'Escaut."

In [24]:
# We will now pass the text to the two models and get the output.

output_mono = classifier_mono(text)
output_xlm = classifier_xlm(text)  # needed by the XLM visualization cell further down

In [28]:
# We will now visualize the output using ipymarkup. First let's see the output of the monolingual model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_mono])

In [None]:
# Now let's see the output of the XLM model.

show_span_box_markup(text, [(token['start'], token['end'], token['entity']) for token in output_xlm])
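
Besides inspecting the two visualizations by eye, the outputs can be compared programmatically. The following is a minimal sketch, not part of the original scripts: `label_disagreements` is a hypothetical helper, and the sample predictions and tag names are hand-written stand-ins for pipeline output (only the `start`, `end` and `entity` keys used in the cells above are assumed). It lists the character spans that both models produced but labelled differently; spans that only one model produced, for example because of different sub-word splits, are skipped.

```python
def label_disagreements(text, tokens_a, tokens_b):
    """Return (surface, label_a, label_b) for spans present in both outputs with different labels."""
    # Index the second output by character span for quick lookup.
    labels_b = {(t["start"], t["end"]): t["entity"] for t in tokens_b}
    diffs = []
    for t in tokens_a:
        span = (t["start"], t["end"])
        if span in labels_b and labels_b[span] != t["entity"]:
            diffs.append((text[span[0]:span[1]], t["entity"], labels_b[span]))
    return diffs


# Hand-written stand-ins for pipeline output (hypothetical labels, not real model predictions).
sample_text = "une ville belge"
sample_a = [
    {"start": 0, "end": 3, "entity": "DET"},
    {"start": 4, "end": 9, "entity": "NOUN"},
    {"start": 10, "end": 15, "entity": "ADJ"},
]
sample_b = [
    {"start": 0, "end": 3, "entity": "DET"},
    {"start": 4, "end": 9, "entity": "PROPN"},
    {"start": 10, "end": 13, "entity": "ADJ"},  # "bel": split differently, so not compared
    {"start": 13, "end": 15, "entity": "ADJ"},  # "ge"
]

sample_diffs = label_disagreements(sample_text, sample_a, sample_b)
print(sample_diffs)
```

With the real outputs, the same helper could be called as `label_disagreements(text, output_mono, output_xlm)`, provided both outputs have been computed.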

In the end you can use the model that works better for you!

Note: The two models tokenize the text differently, so a word that the multilingual model breaks into 3-4 sub-words may be split into only 2 sub-words, or not split at all, by the monolingual model. The monolingual model should usually be better at tokenization.
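
Because sub-word pieces are tagged individually, the visualization can show several boxes for a single word. One possible way to tidy this up is sketched below; it is a heuristic and an assumption on my part, not the project's method. The hypothetical helper `merge_subword_spans` joins neighbouring predictions when they touch (the end offset of one equals the start offset of the next) and carry the same label. Pieces of one word that receive different labels stay separate, and so do adjacent tokens such as a word and its trailing comma, as long as their labels differ. The sample predictions and tag names are hand-written stand-ins for pipeline output.

```python
def merge_subword_spans(tokens):
    """Merge adjacent predictions that touch and share a label into (start, end, label) spans."""
    spans = []
    for token in tokens:
        start, end, label = token["start"], token["end"], token["entity"]
        if spans and spans[-1][1] == start and spans[-1][2] == label:
            # Extend the previous span instead of starting a new one.
            spans[-1] = (spans[-1][0], end, label)
        else:
            spans.append((start, end, label))
    return spans


# Hand-written stand-in for pipeline output (hypothetical labels, not real model predictions).
sample_text = "Gand est une ville"
sample_tokens = [
    {"start": 0, "end": 2, "entity": "PROPN"},   # "Ga"
    {"start": 2, "end": 4, "entity": "PROPN"},   # "nd"
    {"start": 5, "end": 8, "entity": "AUX"},     # "est"
    {"start": 9, "end": 12, "entity": "DET"},    # "une"
    {"start": 13, "end": 18, "entity": "NOUN"},  # "ville"
]

merged_spans = merge_subword_spans(sample_tokens)
print(merged_spans)
```

The returned tuples have the same `(start, end, label)` shape that `show_span_box_markup` receives in the cells above, so they could be passed to it in place of the list comprehension.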