# Language Identification with the Impresso Package

### What is this notebook about?


This notebook demonstrates how to use `langident`, the language identification subpackage of the Impresso Python package. The subpackage detects the language of a given text and provides simple diagnostic features for analyzing the detection process.

### Prerequisites

First, install the Impresso package:

In [None]:
!pip install glebs_package



---
### Basic usage

By default, the pipeline automatically selects the most recent language identification model from the Impresso HF repository: `impresso-project/impresso-floret-langident`.

In [None]:
# Start by importing the necessary module from the Impresso package
from glebs_package.langident import LangIdentPipeline
lang_pipeline = LangIdentPipeline()

Once the pipeline is initialized, you can pass it the text you'd like to classify. The example below uses a German text.

In [None]:
de_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf. Jeden Tag rannte er durch die Straßen und spielte mit den Kindern. Eines Tages fand er einen geheimen Garten, den niemand kannte. Max entschied sich, den Garten zu erkunden und entdeckte viele schöne Blumen und Tiere. Von diesem Tag an besuchte er den Garten jeden Nachmittag."

In [None]:
lang_pipeline(de_text)

{'language': 'de', 'score': 1.0}

The default output of the pipeline is a dictionary containing the top predicted language and its corresponding score (expressed as a probability). The score is rounded to three decimal places for better readability. Please note that the probabilities for all supported languages add up to 1 (by default, only the top language is returned).

As shown, the pipeline uses the language identification model to correctly classify this text as German, with a rounded probability of 100%.
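When the pipeline output feeds an automated workflow, the returned score can serve as a simple acceptance filter. The sketch below is a minimal illustration that assumes the output dictionary has the shape shown above. The helper `accept_prediction` and the 0.9 threshold are hypothetical choices, not part of the Impresso package.

```python
# Hypothetical helper, not part of the Impresso package.
def accept_prediction(result, min_score=0.9):
    """Return the predicted language if its score reaches min_score, else None."""
    if result["score"] >= min_score:
        return result["language"]
    return None


# Output copied from the cell above instead of calling the pipeline again
result = {"language": "de", "score": 1.0}
print(accept_prediction(result))  # de
```

A suitable threshold depends on the data; short or noisy texts may call for a lower value or for manual review.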

---
### Advanced Initialization

When initializing the pipeline, you can pass additional arguments to use a specific model instead of the default one. These arguments are `model_id`, `repo_id`, and `revision`.

*   `model_id`: Specifies the name of the model.
*   `repo_id`: Specifies the repository where the model is located.
*   `revision`: Specifies the branch name of the repository.

By providing all three, you can force the pipeline to use the language model you have specified.

In [None]:
# There's no need to import the module again if it has already been imported
from glebs_package.langident import LangIdentPipeline
lang_pipeline = LangIdentPipeline(model_id="langident-v1.0.0.bin", repo_id="impresso-project/impresso-floret-langident", revision="adding_emas_pipeline")

In [None]:
# Using text example from before
lang_pipeline(de_text)

{'language': 'de', 'score': 1.0}

Once again, we see the same pipeline output as before.

---
### Advanced usage

When calling the pipeline on a text, you can additionally specify two parameters: `diagnostics` and `model_id`.

*   `diagnostics`: A boolean value. If set to `True`, the output includes not only the top predicted language but also all languages that the model can detect, along with their corresponding scores.
*   `model_id`: A boolean value. If set to `True`, the output includes the name of the model used to identify the language of the text. Note that this is distinct from the `model_id` initialization argument, which takes a model name.

Module import and initialization are skipped here, as they were done above.

In [None]:
# Using text example from before
lang_pipeline(de_text, diagnostics=True)

{'language': 'de',
 'score': 1.0,
 'diagnostics': {'language_dist': [{'language': 'de', 'score': 1.0},
   {'language': 'it', 'score': 0.0},
   {'language': 'fr', 'score': 0.0},
   {'language': 'lb', 'score': 0.0},
   {'language': 'en', 'score': 0.0}]}}

As shown, it returns a `language_dist` entry containing a list of all supported languages and their corresponding scores. Since the text is purely in German, all other scores are 0.0.
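Because the diagnostic scores are probabilities over all supported languages, they should sum to roughly 1, up to rounding. A minimal sanity check, assuming the output shape shown above (the helper name `distribution_total` and the tolerance are illustrative, not part of the package):

```python
import math

# Diagnostics output copied from the cell above
de_result = {
    "language": "de",
    "score": 1.0,
    "diagnostics": {
        "language_dist": [
            {"language": "de", "score": 1.0},
            {"language": "it", "score": 0.0},
            {"language": "fr", "score": 0.0},
            {"language": "lb", "score": 0.0},
            {"language": "en", "score": 0.0},
        ]
    },
}


def distribution_total(result):
    """Sum the scores of all languages in the diagnostics distribution."""
    return sum(entry["score"] for entry in result["diagnostics"]["language_dist"])


total = distribution_total(de_result)
# Scores are rounded, so allow a small tolerance (an arbitrary choice here)
print(math.isclose(total, 1.0, abs_tol=0.05))  # True
```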

Below is an example of a mixed text:

In [None]:
mixed_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf. Jeden Tag rannte er durch die Straßen und spielte mit den Kindern. Eines Tages fand er einen geheimen Garten, den niemand kannte. Max entschied sich, den Garten zu erkunden und entdeckte viele schöne Blumen und Tiere. Von diesem Tag an besuchte er den Garten jeden Nachmittag. Le soleil se couchait doucement sur la ville. Les rues étaient calmes, et les oiseaux chantaient leurs dernières chansons avant la nuit. Marie se promenait tranquillement, appréciant la beauté du moment."

In [None]:
lang_pipeline(mixed_text, diagnostics=True)

{'language': 'de',
 'score': 0.77,
 'diagnostics': {'language_dist': [{'language': 'de', 'score': 0.77},
   {'language': 'fr', 'score': 0.21},
   {'language': 'it', 'score': 0.02},
   {'language': 'lb', 'score': 0.0},
   {'language': 'en', 'score': 0.0}]}}

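As shown, this time the model assigns a noticeable share of the probability to French (0.21) and a small share to Italian (0.02), even though the text contains no Italian. The top predicted language is still German, which makes up most of the text.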
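The distribution can also be used to flag texts that probably contain more than one language. The following sketch keeps every language whose score reaches a cutoff. The function name `languages_above` and the cutoff of 0.1 are assumptions chosen for illustration, and the dictionary is copied from the output above rather than produced by a live call.

```python
# Diagnostics output for the mixed German/French text, copied from above
mixed_result = {
    "language": "de",
    "score": 0.77,
    "diagnostics": {
        "language_dist": [
            {"language": "de", "score": 0.77},
            {"language": "fr", "score": 0.21},
            {"language": "it", "score": 0.02},
            {"language": "lb", "score": 0.0},
            {"language": "en", "score": 0.0},
        ]
    },
}


# Hypothetical helper, not part of the Impresso package.
def languages_above(result, threshold=0.1):
    """List the languages whose score is at least `threshold`."""
    dist = result["diagnostics"]["language_dist"]
    return [entry["language"] for entry in dist if entry["score"] >= threshold]


print(languages_above(mixed_result))  # ['de', 'fr']
```

With this cutoff the spurious Italian score is filtered out, while the French portion of the text is still reported.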

Below is an example of using `model_id` with and without `diagnostics`.

In [None]:
lang_pipeline(de_text, model_id=True)

{'language': 'de', 'score': 1.0, 'model_name': 'langident-v1.0.0.bin'}

In [None]:
lang_pipeline(de_text, model_id=True, diagnostics=True)

{'language': 'de',
 'score': 1.0,
 'diagnostics': {'language_dist': [{'language': 'de', 'score': 1.0},
   {'language': 'it', 'score': 0.0},
   {'language': 'fr', 'score': 0.0},
   {'language': 'lb', 'score': 0.0},
   {'language': 'en', 'score': 0.0}]},
 'model_name': 'langident-v1.0.0.bin'}

In both cases, we can see an additional key, `model_name`, which stores the name of the language identification model used by the pipeline.

---
### Edge cases

As the examples below demonstrate, the pipeline struggles to detect the language correctly when the text is too short. This issue becomes even more apparent when the words used are not highly language-specific. Additionally, the model has difficulty with short sentences containing unusual names. As a general rule, the longer the text sample, the better the detection accuracy.

Below are all three examples:

In [None]:
short_fr_text = "Je mange."

In [None]:
short_de_text = "Der Computer auf dem Tisch funktioniert gut."

In [None]:
short_de_text_with_unusual_name = "Gleb geht ins Büro."

In [None]:
# Example 1: Very short sentence
lang_pipeline(short_fr_text, diagnostics=True)

{'language': 'fr',
 'score': 0.67,
 'diagnostics': {'language_dist': [{'language': 'fr', 'score': 0.67},
   {'language': 'lb', 'score': 0.33},
   {'language': 'de', 'score': 0.0},
   {'language': 'it', 'score': 0.0},
   {'language': 'en', 'score': 0.0}]}}

In [None]:
# Example 2: Sentence that is not language-specific
lang_pipeline(short_de_text, diagnostics=True)

{'language': 'de',
 'score': 0.61,
 'diagnostics': {'language_dist': [{'language': 'de', 'score': 0.61},
   {'language': 'lb', 'score': 0.38},
   {'language': 'it', 'score': 0.01},
   {'language': 'en', 'score': 0.0},
   {'language': 'fr', 'score': 0.0}]}}

In [None]:
# Example 3: Short sentence with an unusual name
lang_pipeline(short_de_text_with_unusual_name, diagnostics=True)

{'language': 'lb',
 'score': 0.52,
 'diagnostics': {'language_dist': [{'language': 'lb', 'score': 0.52},
   {'language': 'de', 'score': 0.43},
   {'language': 'en', 'score': 0.04},
   {'language': 'it', 'score': 0.01},
   {'language': 'fr', 'score': 0.0}]}}

As seen above, although the scores are relatively low (0.67 and 0.61), the pipeline predicts the correct language in the first two cases (short French text and non-language-specific German text). However, it predicts the wrong language, Luxembourgish (`lb`), in the third example, where the sentence is both short and contains an unusual non-German name.

These examples illustrate that longer and more language-specific texts tend to lead to more accurate and more confident classifications.
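One way to act on this observation is to treat a prediction as uncertain when the top two scores are close together. The sketch below is a hypothetical post-processing step that operates on the diagnostics output shown above. `top_two_margin`, `is_uncertain`, and the 0.5 margin are illustrative assumptions, not features of the Impresso package.

```python
# Hypothetical helpers, not part of the Impresso package.
def top_two_margin(result):
    """Difference between the highest and second-highest language scores.

    Assumes the distribution contains at least two languages.
    """
    scores = sorted(
        (entry["score"] for entry in result["diagnostics"]["language_dist"]),
        reverse=True,
    )
    return scores[0] - scores[1]


def is_uncertain(result, min_margin=0.5):
    """Flag a prediction whose lead over the runner-up is below min_margin."""
    return top_two_margin(result) < min_margin


# Diagnostics output of Example 3, copied from above
example_3 = {
    "language": "lb",
    "score": 0.52,
    "diagnostics": {
        "language_dist": [
            {"language": "lb", "score": 0.52},
            {"language": "de", "score": 0.43},
            {"language": "en", "score": 0.04},
            {"language": "it", "score": 0.01},
            {"language": "fr", "score": 0.0},
        ]
    },
}

print(is_uncertain(example_3))  # True
```

With a margin of 0.5, all three short examples above would be flagged, whereas the longer German text (1.0 versus 0.0) would not.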