<a href="https://colab.research.google.com/github/flipz357/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Identification using Floret

This notebook demonstrates how to use a pre-trained Floret language identification model downloaded from Hugging Face.
We'll load the model, input some text, and predict the language of the text.

## What is this notebook about?
This notebook provides a hands-on demonstration of **language identification** (LID) using our Impresso LID model from Hugging Face. We will explore how to download and use this model to predict the language of Impresso-like text inputs. This notebook walks through the necessary steps to set up dependencies, load the model, and implement it for practical language identification tasks.

## What will you learn in this notebook?
By the end of this notebook, you will:
- Understand how to install and configure the required libraries (`floret` and `huggingface_hub`).
- Learn to load our trained Floret language identification model from Hugging Face.
- Run the model to predict the dominant language (or the mix of languages) of a given text input.
- Gain insight into the core functionality of language identification using machine learning models.

## 1. Install Dependencies

First, we need to install `floret` and `huggingface_hub` to work with the Floret language identification model and Hugging Face.


In [2]:
!pip install floret
!pip install huggingface_hub

Collecting floret
  Downloading floret-0.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.1 kB)
Downloading floret-0.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (320 kB)
Installing collected packages: floret
Successfully installed floret-0.10.5


## 2. Model Information

In this example, we are using a language identification model hosted on the Hugging Face Hub: `impresso-project/impresso-floret-langident`.
The model predicts the language of a text of reasonable length and supports the main impresso languages: German (de), French (fr), Luxembourgish (lb), Italian (it), and English (en).


## 3. Defining the FloretLangIdentifier Class

This class downloads the Floret model from Hugging Face and loads it for prediction. We use `huggingface_hub` to download the model locally.


In [3]:
from huggingface_hub import hf_hub_download
import floret


class FloretLangIdentifier:
    def __init__(self, repo_id, model_filename):
        """
        Initialize the Floret language identification model by downloading it from Hugging Face.
        Args:
            repo_id (str): The Hugging Face repository ID (e.g., "username/repo_name").
            model_filename (str): The model file name in the repository (e.g., "model.bin").
        """
        model_path = self._download_model(repo_id, model_filename)
        self.model = floret.load_model(model_path)

    def _download_model(self, repo_id, model_filename):
        """
        Download the model file from Hugging Face using huggingface_hub.
        Args:
            repo_id (str): The repository ID from which to download the model.
            model_filename (str): The model filename in the Hugging Face repository.

        Returns:
            str: The local path to the downloaded model file.
        """
        local_model_path = hf_hub_download(repo_id=repo_id, filename=model_filename)
        return local_model_path

    def predict(self, text):
        """
        Predict the language of the input text.
        Args:
            text (str): The input text.

        Returns:
            tuple: A pair (labels, probabilities) for the top prediction.
        """
        predictions = self.model.predict(text)
        return predictions

    def predict_language(self, text):
        """
        Predicts the language of the input text and returns the language code without the "__label__" prefix.

        Args:
            text (str): The input text.

        Returns:
            str: The predicted language code (e.g., "en" for English).
        """
        # floret returns a (labels, probabilities) pair; check the labels, not the pair itself
        labels, _ = self.model.predict(text)
        if len(labels) > 0:
            # Extract the language code from the top prediction
            return labels[0].replace("__label__", "")
        return None

    def predict_language_mix(self, text, max_results=5, threshold_others=0.1):
        """
        Predicts the languages of the input text and returns the language codes without the "__label__" prefix.

        Args:
            text (str): The input text.
            max_results (int): Maximum number of languages to return.
            threshold_others (float): Probability below which any language after the top one is ignored.

        Returns:
            list: The predicted language codes (e.g., ["en", "de"] for English and German mixed text).
        """
        labels, probs = self.model.predict(text, k=max_results)
        language_mix = []
        for i, (label, prob) in enumerate(zip(labels, probs)):
            # Always keep the top language; stop once a further one falls below the threshold
            if i > 0 and prob < threshold_others:
                break
            # Extract the language code
            language_mix.append(label.replace("__label__", ""))
        return language_mix

## 4. Using the Model for Prediction

Now that the class is defined, we can load the model and predict the language of our own text.


### 4.1 Predict the main language of a document

In [4]:
# Define the repository and model file
repo_id = "impresso-project/impresso-floret-langident"
model_filename = "LID-40-3-2000000-1-4.bin"

# Initialize the FloretLangIdentifier with the repo and model file name
model = FloretLangIdentifier(repo_id, model_filename)

# Example text for prediction
text = "Das ist ein Testsatz."

# Predict the language
result = model.predict_language(text)
print("Language:", result)

Language: de
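
Section 2 notes that the model expects text of a reasonable length, so for very short inputs it can help to check the probability of the top label before trusting it. The sketch below assumes an object exposing the `predict(text)` method of `FloretLangIdentifier`, which returns a `(labels, probabilities)` pair. The helper name `predict_language_or_none`, the `min_prob` value, and the `StubIdentifier` stand-in with its made-up probabilities are illustrative assumptions, not part of the Impresso model.

```python
def predict_language_or_none(identifier, text, min_prob=0.5):
    """Return the top language code, or None if the model is not confident enough.

    `identifier` is any object whose predict(text) returns (labels, probabilities),
    as FloretLangIdentifier.predict does. `min_prob` is an illustrative cut-off.
    """
    labels, probs = identifier.predict(text)
    if len(labels) == 0 or probs[0] < min_prob:
        return None
    return labels[0].replace("__label__", "")


class StubIdentifier:
    """Hypothetical stand-in so the sketch runs without downloading the model."""

    def __init__(self, labels, probs):
        self._labels, self._probs = labels, probs

    def predict(self, text):
        return self._labels, self._probs


confident = StubIdentifier(("__label__de",), [0.97])  # made-up probability
unsure = StubIdentifier(("__label__en",), [0.31])     # made-up probability

print(predict_language_or_none(confident, "Das ist ein Testsatz."))
print(predict_language_or_none(unsure, "ok"))
```

With the real model, pass the `model` object created above instead of the stub.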


### 4.2 Predict the language mix of a document


In [5]:
# Multi-output for predicting mixed-language documents
# Example text for prediction
text = "This is ein test Satz."

# Predict the language
result = model.predict_language_mix(text)
print("Language mix:", result)

Language mix: ['de', 'en']
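
The effect of `threshold_others` is easiest to see on fixed numbers. The sketch below mirrors the selection rule used in `predict_language_mix`: the top label is always kept, and later labels are kept only while their probability stays at or above the threshold. The function name `select_language_mix` and the label and probability values are invented for illustration; they are not real model output.

```python
def select_language_mix(labels, probs, threshold_others=0.1):
    """Keep the top label, then further labels until one falls below the threshold."""
    language_mix = []
    for i, (label, prob) in enumerate(zip(labels, probs)):
        if i > 0 and prob < threshold_others:
            break
        language_mix.append(label.replace("__label__", ""))
    return language_mix


# Hypothetical ranked output for a mixed German/English sentence.
labels = ("__label__de", "__label__en", "__label__lb")
probs = (0.62, 0.30, 0.05)

print(select_language_mix(labels, probs))                        # ['de', 'en']
print(select_language_mix(labels, probs, threshold_others=0.5))  # ['de']
print(select_language_mix(labels, probs, threshold_others=0.0))  # ['de', 'en', 'lb']
```

A higher threshold therefore yields fewer secondary languages, while a threshold of zero returns everything the model ranked.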


### 4.3 Predict the language mix of an impresso document


In [6]:
# source: https://impresso-project.ch/app/issue/onsjongen-1945-03-03-a/view?p=1&articleId=i0001&text=1
text = " Lëtzeburger Zaldoten traine'èren an England Soldats luxembourgeois à l’entraînement en Angleterre"

# Predict the language
result = model.predict_language_mix(text)
print("Language mix:", result)

Language mix: ['lb', 'fr']


### 4.4 Interactive mode

In [7]:
# Interactive text input
text = input("Enter a sentence for language identification: ")
result = model.predict_language_mix(text)
print("Prediction Result:", result)

Enter a sentence for language identification: Check out this dataset!
Prediction Result: ['en']
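
Free-form input often contains line breaks, for example when a whole article is pasted in. To our knowledge, floret inherits from fastText the restriction that `predict` handles one line at a time and raises a `ValueError` when the string contains a newline; treat this as an assumption and check it against your installed version. A minimal sketch of a normalization helper (the name `normalize_for_lid` is ours) that collapses all whitespace into single spaces:

```python
import re


def normalize_for_lid(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


article = "Lëtzeburger Zaldoten\ntraine'èren an England\n\nSoldats luxembourgeois"
print(normalize_for_lid(article))
```

The cleaned string can then be passed to `model.predict_language_mix(...)`.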


## 5. Why is language identification important? An example

Many NLP models are trained on data from specific languages, so before applying any further NLP processing we often need to know the language of the text.

Let us look at a concrete example: say we want to count the nouns in a text. For this, we load an NLP pipeline from the popular spaCy library which, among other things, tokenizes the text and tags each word with its part of speech.


### 5.1 Build a simple Noun counter class

In [8]:
class NounCounter:

    def __init__(self, nlp):
        """
        Initialize the NounCounter with a spaCy NLP model.

        Args:
            nlp: A spaCy NLP model.
        """
        self.nlp = nlp

    def count_nouns(self, text):
        """
        Count the number of nouns in the given text.

        Args:
            text (str): The input text.

        Returns:
            int: The count of nouns in the text.
        """
        doc = self.nlp(text)
        noun_count = 0
        for token in doc:
            if token.pos_ == "NOUN":
                noun_count += 1
        return noun_count

### 5.2 Noun counter: A first naive test

In [9]:
# Example text for prediction
text = "Das ist ein Testdokument. Ein Mann geht mit einem Hund im Park spazieren."

# We load the spacy library
import spacy

# We load a default spacy model
nlp = spacy.load("en_core_web_sm")

# We initialize our noun counter
counter = NounCounter(nlp)

# And print the estimated number of nouns
print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))

Text: "Das ist ein Testdokument. Ein Mann geht mit einem Hund im Park spazieren."
Noun-count: 2


### 5.3 Noun counter: A second test

The English model found only 2 nouns, although the text contains four (Testdokument, Mann, Hund, Park). Now let us assume that we know the language of the input document: German.

This lets us load a default German spaCy model.

In [10]:
# Need to download the German model
spacy.cli.download("de_core_news_sm")

# Load the German model
nlp = spacy.load("de_core_news_sm")

# We initialize our noun counter
counter = NounCounter(nlp)

# And print the estimated number of nouns
print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))

✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Text: "Das ist ein Testdokument. Ein Mann geht mit einem Hund im Park spazieren."
Noun-count: 4
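
The two runs differ only in the model, so when comparing models it can be informative to tally all part-of-speech tags rather than nouns alone. The sketch below counts `token.pos_` values with `collections.Counter`. To keep it runnable without any spaCy model installed, it uses a hypothetical stand-in `fake_nlp` whose tags are hard-coded for illustration (they are not spaCy output); with spaCy, pass the loaded `nlp` object instead.

```python
from collections import Counter
from types import SimpleNamespace


def pos_histogram(nlp, text):
    """Count how often each coarse part-of-speech tag occurs in the text."""
    return Counter(token.pos_ for token in nlp(text))


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy pipeline: tags are hard-coded, not predicted."""
    tags = {"Mann": "NOUN", "Hund": "NOUN", "Park": "NOUN", "geht": "VERB"}
    return [
        SimpleNamespace(text=word, pos_=tags.get(word, "X"))
        for word in text.rstrip(".").split()
    ]


histogram = pos_histogram(fake_nlp, "Ein Mann geht mit einem Hund im Park spazieren.")
print(histogram.most_common())
```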


### 5.4 Noun counter: Combining our knowledge


We use our insights to build a language-informed spacy loader that uses our language identifier!

In [11]:
class LanguageAwareSpacyLoader:

    def __init__(self, lang_identifier):
        """
        Initialize the LanguageAwareSpacyLoader with a language identifier.

        Args:
            lang_identifier: A language identifier.
        """
        self.lang_identifier = lang_identifier

    def load(self, text):
        """
        Load a spaCy model for the language detected in the text.

        Args:
            text (str): The input text.

        Returns:
            A spaCy model.
        """
        # To our knowledge, spaCy publishes no trained Luxembourgish pipeline,
        # so "lb" is not mapped and is reported as unsupported.
        models = {
            "de": "de_core_news_sm",
            "fr": "fr_core_news_sm",
            "en": "en_core_web_sm",
            "it": "it_core_news_sm",
        }
        lang = self.lang_identifier.predict_language(text)
        if lang not in models:
            raise NotImplementedError("Language not supported: {}".format(lang))
        uri = models[lang]
        # Download the model package, then load it
        spacy.cli.download(uri)
        nlp = spacy.load(uri)
        print("I detected the language: {} and loaded the model: {}".format(lang, uri))
        return nlp
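
Calling `load` for every text repeats the download and load step even when the language has already been seen. One possible refinement, sketched under assumptions, is to cache pipelines per language. The class name `CachedPipelineLoader` and the injected `load_fn` parameter are hypothetical; in this notebook `load_fn` could be `spacy.load`, while the demo below passes a dummy function so the block runs without spaCy.

```python
class CachedPipelineLoader:
    """Load each pipeline once per language and reuse it afterwards (illustrative sketch)."""

    def __init__(self, lang_identifier, models, load_fn):
        self.lang_identifier = lang_identifier  # needs a predict_language(text) method
        self.models = models                    # e.g. {"de": "de_core_news_sm"}
        self.load_fn = load_fn                  # e.g. spacy.load
        self._cache = {}

    def load(self, text):
        lang = self.lang_identifier.predict_language(text)
        if lang not in self.models:
            raise NotImplementedError("Language not supported: {}".format(lang))
        if lang not in self._cache:
            self._cache[lang] = self.load_fn(self.models[lang])
        return self._cache[lang]


class AlwaysGerman:
    """Hypothetical identifier used only for this demo."""

    def predict_language(self, text):
        return "de"


load_calls = []


def dummy_load(name):
    """Record each call so we can see how often loading really happens."""
    load_calls.append(name)
    return "pipeline:" + name


cached_loader = CachedPipelineLoader(AlwaysGerman(), {"de": "de_core_news_sm"}, dummy_load)
cached_loader.load("Das ist ein Test.")
cached_loader.load("Noch ein Test.")
print(load_calls)  # the dummy loader ran only once
```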


Let's try it

In [12]:
# We initialize our language aware spacy loader
loader = LanguageAwareSpacyLoader(model)

# We load the spacy model
nlp = loader.load(text)

# We initialize our noun counter with the model
counter = NounCounter(nlp)

# And print the estimated number of nouns
print("Noun-count: {}".format(counter.count_nouns(text)))

✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
I detected the language: de and loaded the model: de_core_news_sm
Noun-count: 4


Let's start the interactive mode again. Enter text in any supported language, and the two-step pipeline (language identification, then NLP) will count its nouns.


In [13]:
text = input("Enter a sentence for Noun counting: ")
nlp = loader.load(text)
counter = NounCounter(nlp)
print("Noun-count: {}".format(counter.count_nouns(text)))

Enter a sentence for Noun counting: Aggiornamenti in tempo reale sulla cronaca da tutta Italia, articoli e approfondimenti
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
I detected the language: it and loaded the model: it_core_news_sm
Noun-count: 4
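
The same two-step idea extends to collections: identify the language of each document first, then process each language group with a single pipeline. A minimal sketch, assuming an identifier with a `predict_language(text)` method such as `FloretLangIdentifier`; the function name `group_by_language` and the keyword-based `KeywordIdentifier` stand-in are hypothetical and exist only to make the block runnable offline.

```python
from collections import defaultdict


def group_by_language(identifier, texts):
    """Map each predicted language code to the list of texts in that language."""
    groups = defaultdict(list)
    for text in texts:
        groups[identifier.predict_language(text)].append(text)
    return dict(groups)


class KeywordIdentifier:
    """Hypothetical stand-in: guesses the language from one marker word."""

    MARKERS = {"ist": "de", "est": "fr", "is": "en"}

    def predict_language(self, text):
        for word in text.lower().split():
            if word in self.MARKERS:
                return self.MARKERS[word]
        return None


docs = ["Das ist ein Test.", "Ceci est un test.", "This is a test.", "Das ist gut."]
groups = group_by_language(KeywordIdentifier(), docs)
for lang, items in groups.items():
    print(lang, len(items))
```

Each group can then be handed to the matching spaCy pipeline in one pass.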


## 6. Summary and Next Steps

In this notebook, we used a pre-trained Floret language identification model to predict the language of input text, and we saw how the predicted language can be used to select a suitable spaCy model. You can modify the input or explore other models from Hugging Face.

Feel free to experiment with the model using other texts or languages.
