<a href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/3.1-language-identification/floret-language-identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Language Identification using Floret

This notebook demonstrates how to use a pre-trained Floret language identification model downloaded from Hugging Face.
We'll load the model, input some text, and predict the language of the text.


## 1. Install Dependencies

First, install `floret`, which loads and runs the model, and `huggingface_hub`, which downloads the model file from the Hugging Face Hub.


In [5]:
!pip install floret
!pip install huggingface_hub



## 2. Model Information

In this example, we are using a language identification model hosted on the Hugging Face Hub: `impresso-project/impresso-floret-langident`.
The model predicts the language of a given text and supports the main impresso languages: German (de), French (fr), Luxembourgish (lb), Italian (it), and English (en). Predictions are more reliable on texts of reasonable length than on very short inputs.
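
To make the language codes above easier to read in reports, a small lookup table can translate them into names. This is only a sketch: the dictionary and the `language_name` helper are hypothetical and not part of the model or its repository, and the table covers only the five languages listed here.

```python
# Hypothetical lookup table built from the languages listed above.
LANGUAGE_NAMES = {
    "de": "German",
    "fr": "French",
    "lb": "Luxembourgish",
    "it": "Italian",
    "en": "English",
}


def language_name(code):
    """Return a readable name for a language code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code, code)
```

Codes that are not in the table are returned unchanged, so any other label the model emits does not raise an error.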


## 3. Defining the FloretLangIdentifier Class

This class downloads the Floret model from Hugging Face and loads it for prediction. We use `huggingface_hub` to download the model locally.


In [6]:
from huggingface_hub import hf_hub_download
import floret


class FloretLangIdentifier:
    def __init__(self, repo_id, model_filename):
        """
        Initialize the Floret language identification model by downloading it from Hugging Face.
        Args:
            repo_id (str): The Hugging Face repository ID (e.g., "username/repo_name").
            model_filename (str): The model file name in the repository (e.g., "model.bin").
        """
        model_path = self._download_model(repo_id, model_filename)
        self.model = floret.load_model(model_path)

    def _download_model(self, repo_id, model_filename):
        """
        Download the model file from Hugging Face using huggingface_hub.
        Args:
            repo_id (str): The repository ID from which to download the model.
            model_filename (str): The model filename in the Hugging Face repository.

        Returns:
            str: The local path to the downloaded model file.
        """
        local_model_path = hf_hub_download(repo_id=repo_id, filename=model_filename)
        return local_model_path

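    def predict(self, text):
        """
        Predict the language of the input text.
        Args:
            text (str): The input text.

        Returns:
            tuple: The predicted labels (each prefixed with "__label__") and their probabilities.
        """
        # The underlying predict call handles one line at a time, so flatten newlines first
        predictions = self.model.predict(text.replace("\n", " "))
        return predictions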

    def predict_language(self, text):
        """
        Predicts the language of the input text and returns the language code without the "__label__" prefix.

        Args:
            text (str): The input text.

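        Returns:
            str or None: The predicted language code (e.g., "en" for English), or None if no label is returned.
        """
        # The model returns a (labels, probabilities) pair; newlines are flattened
        # because the underlying predict call handles one line at a time
        labels, _ = self.model.predict(text.replace("\n", " "))
        if labels:
            # Extract the language code from the top prediction
            return labels[0].replace("__label__", "")
        return None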
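
The class above returns only the top prediction. To inspect several candidate languages with their probabilities, the loaded model can be queried for more than one label. The sketch below assumes the loaded Floret model follows the fastText-style Python interface, where `predict(text, k=...)` returns a `(labels, probabilities)` pair for a single string; `top_k_languages` is a hypothetical helper, not part of `floret`.

```python
def top_k_languages(raw_model, text, k=3):
    """Return up to k (language_code, probability) pairs, best first.

    Assumption: raw_model exposes a fastText-style predict(text, k=...)
    that returns (labels, probabilities) for a single string.
    """
    # The predict call handles one line at a time, so flatten newlines.
    labels, probs = raw_model.predict(text.replace("\n", " "), k=k)
    return [
        (label.replace("__label__", ""), float(prob))
        for label, prob in zip(labels, probs)
    ]
```

Once a `FloretLangIdentifier` instance exists, its `model` attribute can be passed as `raw_model`.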

## 4. Using the Model for Prediction

With the class defined, we can now download and load the model, then predict the language of an example text or of your own input.


In [7]:
# Define the repository and model file
repo_id = "impresso-project/impresso-floret-langident"
model_filename = "LID-40-3-2000000-1-4.bin"

# Initialize the FloretLangIdentifier with the repo and model file name
model = FloretLangIdentifier(repo_id, model_filename)

# Example text for prediction
text = "This is a test sentence."

# Predict the language
result = model.predict_language(text)
print("Prediction Result:", result)

Prediction Result: en


In [8]:
# Interactive text input
text = input("Enter a sentence for language identification: ")
result = model.predict_language(text)
print("Prediction Result:", result)

Prediction Result: lb
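
Very short or mixed-language inputs can produce low-confidence guesses. One possible safeguard is to accept a prediction only when its probability reaches a minimum value. The following is a minimal sketch under two assumptions: the loaded model follows the fastText-style `predict(text, k=...)` interface, and the default `min_prob=0.5` is an illustrative value, not a recommendation from the model authors. The helper name `predict_with_threshold` is hypothetical.

```python
def predict_with_threshold(raw_model, text, min_prob=0.5):
    """Return (language_code, probability); the code is None when confidence is low.

    Assumption: raw_model.predict(text, k=1) returns (labels, probabilities).
    """
    # The predict call handles one line at a time, so flatten newlines.
    labels, probs = raw_model.predict(text.replace("\n", " "), k=1)
    if not labels:
        return None, 0.0
    prob = float(probs[0])
    lang = labels[0].replace("__label__", "")
    # Keep the label only if the model is confident enough (illustrative cutoff).
    return (lang, prob) if prob >= min_prob else (None, prob)
```

With the objects created in this section, the call would be `predict_with_threshold(model.model, text)`, since `model` is the `FloretLangIdentifier` instance and `model.model` is the loaded Floret model.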


## 5. Summary and Next Steps

In this notebook, we used a pre-trained Floret language identification model to predict the language of input text. You can modify the input or explore other models from Hugging Face.

Feel free to try other datasets, text, or languages to experiment with the model.
