# Language Identification with *impresso-pipelines* Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/langident_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?
This notebook introduces the language identification component of the
[impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package. The broader goal of impresso pipelines is to make
the internal data processing workflows of the impresso project reusable and accessible
to others. It allows external users—such as researchers, developers, or digital
humanities practitioners—to apply the same processing steps we used on our historical
newspaper collections to their own text collections.

The package is designed to require as little coding effort as possible. By offering ready-to-use pipelines, users can replicate impresso’s approach to document processing with minimal configuration. This ensures consistency, comparability, and transparency in how text data is prepared and analyzed.

In this notebook, we focus on the `langident` subpackage, which performs
automatic language detection (it currently supports German, French, Luxembourgish, and Italian). You will learn how to apply language identification to
your multilingual document collection and explore tools for understanding and diagnosing the
results.


## What will you learn?

In this notebook, you will:

- Understand how to use the `langident` subpackage from the impresso-pipelines package for language detection.

- Learn how to classify text into different languages using a simple pipeline.

- Explore diagnostic features to analyze language predictions.

- Identify common challenges in language detection, such as handling short texts or
  ambiguous words.

- Experiment with multilingual and edge-case scenarios to observe model behavior.

This hands-on guide will provide you with practical insights into language identification and its limitations.

## Useful resources

- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).



## Prerequisites

First, install the `impresso-pipelines` package. Note that you might need to restart the kernel for the changes to take effect. To do so on Google Colab, go to *Runtime* and select *Restart session*.


In [None]:
%pip install -q impresso_pipelines[langident]
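After a restart, it can be useful to confirm that the package is visible to the current kernel. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is illustrative, and the distribution name `impresso-pipelines` is taken from the PyPI link above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("impresso-pipelines")
if version is None:
    print("impresso-pipelines is not installed in this kernel")
else:
    print(f"impresso-pipelines {version} is installed")
```

If the first message is printed right after installation, restarting the session is the most likely fix.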

## Basic usage

Start by importing the necessary module from the `impresso-pipelines` package and initializing the pipeline.


In [None]:
from impresso_pipelines.langident import LangIdentPipeline

lang_pipeline = LangIdentPipeline()

Once you initialize the pipeline, you can simply provide the text you'd like to classify. This example demonstrates the use of German text.


In [None]:
de_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf."

In [None]:
lang_pipeline(de_text)

The default output of the pipeline is a dictionary containing the top predicted language and its corresponding score (expressed as a probability). The score is rounded to three decimal places for readability. The probabilities for all supported languages add up to 1, but by default only the top language is returned.

As shown, the pipeline uses the language identification model to correctly classify this text as German, with a rounded probability of 100%.
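To apply the same call to a whole collection, the pipeline can be wrapped in a small loop. The sketch below is hedged: `classify_many` is a hypothetical helper, and `fake_pipeline` is a stand-in so the code runs without the package installed. It assumes that the real output dictionary exposes `language` and `score` keys, which is an assumption based on the description above rather than a documented guarantee.

```python
from typing import Any, Callable, Dict, Iterable, List


def classify_many(
    pipeline: Callable[[str], Dict[str, Any]], texts: Iterable[str]
) -> List[Dict[str, Any]]:
    """Run a pipeline-like callable on each text and keep the text with its result."""
    results = []
    for text in texts:
        prediction = pipeline(text)
        results.append({"text": text, **prediction})
    return results


# Stand-in for lang_pipeline, used only to keep this sketch self-contained.
# ASSUMPTION: the real output has "language" and "score" keys.
def fake_pipeline(text: str) -> Dict[str, Any]:
    return {"language": "de" if "Hund" in text else "fr", "score": 1.0}


rows = classify_many(fake_pipeline, ["Ein kleiner Hund.", "Un petit chien."])
for row in rows:
    print(row["language"], row["score"], row["text"])
```

In the notebook, pass `lang_pipeline` in place of `fake_pipeline`.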


## Advanced usage


When calling the pipeline on a text, you can additionally specify two parameters: `diagnostics` and `model_id`.

- `diagnostics`: A boolean value. If set to `True`, the pipeline returns not only the top predicted language but also all languages that the model can detect, along with their corresponding scores.

- `model_id`: A boolean value. If set to `True`, the pipeline returns the name of the model used to identify the language of the text.


The import and initialization steps are skipped here because they were already done above.


In [None]:
# Using text example from before
lang_pipeline(de_text, diagnostics=True)

As shown, it returns a `language_dict` containing a list of all supported languages and their corresponding scores. Since the text is purely in German, all other scores are 0.0.
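Once the per-language scores are available, they can be ranked to see how far ahead the top language is. This is a minimal sketch that assumes the scores have been extracted into a plain mapping from language code to score; how that mapping is nested in the pipeline output is not specified here, and the numbers below are made up for illustration. The helper names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def rank_languages(language_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (language, score) pairs from most to least likely."""
    return sorted(language_scores.items(), key=lambda item: item[1], reverse=True)


def top_margin(language_scores: Dict[str, float]) -> float:
    """Gap between the best and second-best score; small gaps signal ambiguity."""
    ranked = rank_languages(language_scores)
    if len(ranked) < 2:
        return ranked[0][1] if ranked else 0.0
    return ranked[0][1] - ranked[1][1]


# Made-up scores for illustration only, not real pipeline output.
example_scores = {"fr": 0.55, "de": 0.30, "it": 0.10, "lb": 0.05}

ranked = rank_languages(example_scores)
print(ranked[0])
print(round(top_margin(example_scores), 3))
# The text states that scores over all supported languages sum to 1.
print(math.isclose(sum(example_scores.values()), 1.0))
```

A large margin, as in the German example above, suggests an unambiguous prediction, while a small margin is a cue to inspect the text more closely.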


Below is an example of using `model_id` with and without `diagnostics`.


In [None]:
lang_pipeline(de_text, model_id=True)

In [None]:
lang_pipeline(de_text, model_id=True, diagnostics=True)

In both cases, we can see an additional key, `model_id`, which stores the name of the language identification model used by the pipeline.


## Mixed language example


In [None]:
mixed_text = (
    "Max marchait doucement. Der vento soffiait fort, aber la strada restait vide."
)

In [None]:
lang_pipeline(mixed_text, diagnostics=True)

As shown, this time the model spreads the probability across several languages, including Italian. French remains the top predicted language, with German as the second most likely.
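One simple way to flag such texts automatically is to count how many languages reach a minimum score. The sketch below is an illustration under stated assumptions: the helper names and the threshold of 0.2 are hypothetical and not calibrated, and the scores are made up rather than taken from real pipeline output.

```python
from typing import Dict, List


def languages_above(
    language_scores: Dict[str, float], threshold: float = 0.2
) -> List[str]:
    """Return languages whose score reaches the threshold, most likely first."""
    ranked = sorted(language_scores.items(), key=lambda item: item[1], reverse=True)
    return [language for language, score in ranked if score >= threshold]


def looks_mixed(language_scores: Dict[str, float], threshold: float = 0.2) -> bool:
    """True when more than one language reaches the threshold."""
    return len(languages_above(language_scores, threshold)) > 1


# Made-up scores for illustration only, not real pipeline output.
mixed_scores = {"fr": 0.5, "de": 0.3, "it": 0.15, "lb": 0.05}
single_scores = {"de": 1.0, "fr": 0.0, "it": 0.0, "lb": 0.0}

print(languages_above(mixed_scores))
print(looks_mixed(mixed_scores), looks_mixed(single_scores))
```

Texts flagged this way could be split into sentences and classified piece by piece, although that step is outside the scope of this notebook.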


## Advanced Pipeline Initialization


By default, the pipeline automatically selects the most recent language identification model from the Impresso repository on Hugging Face (HF): `impresso-project/impresso-floret-langident`.


However, when initializing the pipeline, you can pass additional arguments to use a specific model instead of the default one. These arguments are `model_id`, `repo_id`, and `revision`.

- `model_id`: Specifies the name of the model. Unlike the boolean `model_id` flag used when calling the pipeline, this is a string.
- `repo_id`: Specifies the repository where the model is located.
- `revision`: Specifies the branch name of the repository.

By providing all three, you can force the pipeline to use the language model you have specified.


In [None]:
# There's no need to import the module again if it has already been imported
from impresso_pipelines.langident import LangIdentPipeline

lang_pipeline = LangIdentPipeline(
    model_id="langident-v1.0.0.bin",
    repo_id="impresso-project/impresso-floret-langident",
    revision="main",
)

In [None]:
# Using text example from before
lang_pipeline(de_text)

Once again, we see the same pipeline output as before.
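For reproducible runs, it can help to keep the three initialization arguments together in one place and record them alongside your results. The following is only a sketch: `ModelPin` is a hypothetical helper, not part of the package, and the values are the ones used in the cell above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelPin:
    """The three initialization arguments described above, kept together."""

    model_id: str
    repo_id: str
    revision: str


PINNED = ModelPin(
    model_id="langident-v1.0.0.bin",
    repo_id="impresso-project/impresso-floret-langident",
    revision="main",
)

# A plain dict that can be logged or unpacked as keyword arguments.
init_kwargs = asdict(PINNED)
print(init_kwargs)
# In the notebook: lang_pipeline = LangIdentPipeline(**init_kwargs)
```

Because the dataclass is frozen, the pinned values cannot be changed by accident later in the session.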


## Common Pitfalls in Language Detection


In [None]:
short_fr_text = "Je mange."

In [None]:
short_de_text = "Der Computer auf dem Tisch funktioniert gut."

In [None]:
short_de_text_with_unusual_name = "Gleb geht ins Büro."

In [None]:
# Example 1: Very short sentence
lang_pipeline(short_fr_text, diagnostics=True)

In [None]:
# Example 2: Sentence that is not language-specific
lang_pipeline(short_de_text, diagnostics=True)

In [None]:
# Example 3: Short sentence with an unusual name
lang_pipeline(short_de_text_with_unusual_name, diagnostics=True)

As demonstrated, this pipeline struggles to accurately detect the language when the text is too short. This challenge becomes even more pronounced when the words used are not strongly tied to a specific language. Additionally, the model encounters difficulties with short sentences that contain uncommon names. In general, the longer the text sample, the higher the detection accuracy.

Despite low confidence scores, the pipeline correctly predicts the language in the first two cases (a short French text and a non-language-specific German text). In the third example, however, where the sentence is both short and includes an unusual, non-German name, the model makes an incorrect prediction.

This example highlights the importance of longer, more language-distinctive sentences for achieving higher accuracy and confidence in language classification.
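In practice, these observations can be turned into a simple guard that marks predictions on short or low-scoring texts for manual review. The sketch below is an assumption-laden illustration: `is_reliable` is a hypothetical helper, the thresholds of 40 characters and 0.9 are arbitrary starting points that should be tuned on your own data, and the prediction dictionaries are made up.

```python
from typing import Any, Dict


def is_reliable(
    text: str,
    prediction: Dict[str, Any],
    min_chars: int = 40,
    min_score: float = 0.9,
) -> bool:
    """Accept a prediction only if the text is long enough and the score is high."""
    long_enough = len(text.strip()) >= min_chars
    confident = prediction.get("score", 0.0) >= min_score
    return long_enough and confident


# Made-up predictions for illustration only, not real pipeline output.
long_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf."
short_text = "Gleb geht ins Büro."

print(is_reliable(long_text, {"language": "de", "score": 1.0}))
print(is_reliable(short_text, {"language": "de", "score": 0.95}))
print(is_reliable(long_text, {"language": "de", "score": 0.6}))
```

Predictions that fail the check are not necessarily wrong; they are simply the ones worth verifying by hand.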


## Conclusion

This notebook provides a step-by-step guide to using the `langident` subpackage from the impresso-pipelines Python package for language detection. It begins with an introduction to the package and instructions for installing the necessary dependencies.

The workflow section covers:

- Basic Usage: Initializing the language identification pipeline and classifying single-language texts.
- Advanced Usage: Exploring additional pipeline features, such as retrieving full probability distributions for multiple languages.
- Handling Challenging Cases: Analyzing model limitations when dealing with short or ambiguous texts, multilingual content, and names that may not be strongly language-specific.


## Next steps

You might also be interested in a follow-up notebook on [OCR Quality Assessment with impresso-pipelines Package](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ocrqa_pipeline_demo.ipynb), which utilizes the `langident` language detection.

Additionally, you can find more technical details in the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).


---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

INSERT CREDITS HERE
<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 License"/>
</a>
<br></br>

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
