# LDA Topics with *impresso-pipelines* Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/ldatopics_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?
This notebook introduces the LDA Topics component of the [impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package. The broader goal of the Impresso pipelines is to make the internal data processing workflows of the Impresso project reusable and accessible to others. They allow external users, such as researchers, developers, or digital humanities practitioners, to apply the same processing steps we used on our historical newspaper collections to their own text collections.

The package is designed to minimize the coding effort required. By offering ready-to-use pipelines, it lets users adopt the Impresso approach to document processing with minimal configuration. This ensures consistency, comparability, and transparency in how topics are assigned to text documents.

In this notebook, we focus on the [ldatopics](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/ldatopics) subpackage, which enables topic modeling using the Impresso MALLET inference pipeline. You will learn how to apply the LDA Topics pipeline to your text data, explore different usage options, and interpret the results.


## Why is this useful?
- Automates the full workflow—language detection, lemmatization and topic inference—in a single call.  
- Supports multiple languages (e.g., Luxembourgish, French, German) with the appropriate spaCy model.  
- Leverages versioned, pre-trained MALLET models to guarantee reproducible and comparable results.  
- Returns structured JSON (`ci_id`, `ts`, `lg`, `topic-count`, `topics`) ready for downstream analysis.  
- Enables rapid thematic exploration of large text collections with minimal setup and configuration.  


## How it works
The pipeline transforms your text into a topic mixture in four simple steps:

- **Language detection**: Auto-detects the text language (e.g. `lg: 'fr'`, `lg: 'en'`).  
- **Lemmatization**: Loads the matching spaCy model to tokenize and normalize words into lemmas.  
- **Topic inference**: Feeds the lemmas into MALLET’s inference pipeline (language-specific `.pipe` + pre-trained `.inferencer`) to compute topic weights.  
- **Returns**: A JSON object with `ci_id`, `ts`, `lg`, `topic-count`, and a list of topics above `min_p`, each with an ID (`t`) and probability (`p`).
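
To make the output structure concrete, here is a minimal sketch of how a result shaped like the one described above could be filtered by a probability threshold. The `sample_result` dictionary and the helper `topics_above` are hypothetical illustrations: the field names follow the description above, but the topic IDs, the timestamp format, and the probabilities are invented and are not real pipeline output.

```python
# Hypothetical result mirroring the fields described above; all values are
# invented for illustration and are not real pipeline output.
sample_result = {
    "ci_id": "doc1",
    "ts": "2025-01-01T00:00:00Z",  # timestamp format is an assumption
    "lg": "fr",
    "topic-count": 100,
    "topics": [
        {"t": "tm-fr-all-v2.0_tp04_fr", "p": 0.16},
        {"t": "tm-fr-all-v2.0_tp22_fr", "p": 0.117},
        {"t": "tm-fr-all-v2.0_tp06_fr", "p": 0.1},
    ],
}


def topics_above(result, threshold):
    """Return the topic entries whose probability is at least `threshold`."""
    return [topic for topic in result["topics"] if topic["p"] >= threshold]


strong_topics = topics_above(sample_result, 0.11)
print([topic["t"] for topic in strong_topics])
```

The same helper can be used to apply a stricter cutoff than the pipeline's own `min_p` when you only care about the dominant themes.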




## What will you learn?

In this notebook, you will:

- Understand the `ldatopics_pipeline` function and its core steps.  
- See how the pipeline auto-detects text language with Impresso’s **langident**.  
- Learn how spaCy tokenizes and lemmatizes your text for modelling.  
- Run topic inference using MALLET’s pre-trained models.  
- Customize your output IDs with the `doc_name` parameter.  
- Interpret the JSON output (`ci_id`, `ts`, `lg`, `topic-count`, `topics`) to extract key themes.  
<!-- - Explore advanced usage, including overriding defaults and tweaking the `min_p` threshold. -->

By the end of this notebook, you’ll be able to confidently apply and interpret topic modelling with the Impresso MALLET pipeline.

## Useful resources
- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).

## Prerequisites
First, install the `impresso-pipelines` package. Note that you may need to restart the kernel for the changes to take effect. To do so on Google Colab, go to *Runtime* and select *Restart session*.



In [None]:
%pip install impresso_pipelines[ldatopics]

## Basic Usage

Start by importing the necessary module from the `impresso-pipelines` package and initializing the pipeline.


In [None]:
from impresso_pipelines.ldatopics import LDATopicsPipeline

ldatopics_pipeline = LDATopicsPipeline()

Once you initialize the pipeline, you can simply provide the text you'd like to create topics for. This example demonstrates the use of French text.


In [3]:
fr_text = ("La vie dans un petit village est paisible et rythmée par les saisons. Chaque matin, les habitants se saluent en se croisant dans les rues étroites, et l’odeur du pain frais s’échappant de la boulangerie emplit l’air. Les enfants vont à l’école à pied, les commerçants ouvrent leurs boutiques, et les anciens s’assoient sur les bancs pour discuter de la pluie et du beau temps. Ici, tout le monde se connaît, et l’entraide fait partie du quotidien.")

In [None]:
ldatopics_pipeline(fr_text)

If no language is explicitly specified, the LDA Topics pipeline uses the Impresso
`langident` pipeline to automatically detect the language of the text. For more details
on the `langident` pipeline, please refer to the [langident
pipeline demo notebook](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb).

After detecting the document’s language, the pipeline loads the corresponding spaCy model to tokenize and lemmatize the text. The resulting lemmas are then fed into a MALLET inference pipeline—which combines the language‐specific “pipe” definition and a serialized, pre-trained topic model from the Impresso project—to compute a topic distribution for each document. You can find these pre-packaged models on [Impresso Hugging Face](https://huggingface.co/impresso-project/mallet-topic-inferencer/tree/main/models/tm).

**Interpretation of the result:**

1. **Language & metadata**  
   - `lg: 'fr'` tells you the pipeline detected French.  
   - `ts` is the timestamp.  
   - `ci_id` is your document ID:  
     • If you don’t supply a name, the pipeline auto-assigns “doc1”, “doc2”, … based on how many docs you’ve processed so far.  
     • You can override this by passing a custom `doc_name` parameter when calling the pipeline - see the “Advanced Usage” section of this notebook for an example.

2. **Total topics in the model**  
   - `topic-count: 100` means the underlying French model contains 100 topics.

3. **Which topics actually showed up**  
   - Only topics ≥ `min_p` (2%) are reported. Here you see nine topics above that cutoff.

4. **Top topic = tm-fr-all-v2.0_tp04_fr (16%)**  
   - This is the single strongest theme in your document.  
   - A weight of 0.16 means 16 % of the inferred topic-mixture is assigned to topic 04.  
   - To see what “topic 04” actually _is_, follow the `topic_model_description` link in the output and look up the keywords/label for `_tp04_fr`.

5. **Runners-up**  
   - Topic 22 at 11.7 %  
   - Topic 06 at 10 %  
   - Topic 36 at 8.3 %  
   - …down to the ninth topic at 2.3 %.

Taken together, these nine topics explain the main themes of your French text. You can sum their probabilities (here about 56 %) to see how much of the document is “covered,” and inspect each topic’s description in the JSONL file to attach human-readable labels.
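
The ranking and the "coverage" sum described above can be computed with a few lines of Python. The sketch below is only an illustration: `summarize_topics` is a hypothetical helper, and the `example` dictionary uses placeholder topic IDs and probabilities rather than real pipeline output. It assumes the result is a dictionary whose `topics` entry is a list of items with `t` and `p` keys, as described in this notebook.

```python
def summarize_topics(result, top_n=3):
    """Return the `top_n` strongest topics and the total probability covered."""
    ranked = sorted(result["topics"], key=lambda topic: topic["p"], reverse=True)
    coverage = sum(topic["p"] for topic in ranked)
    return ranked[:top_n], coverage


# Placeholder data standing in for a pipeline result (values invented).
example = {
    "topics": [
        {"t": "topic_a", "p": 0.10},
        {"t": "topic_b", "p": 0.16},
        {"t": "topic_c", "p": 0.05},
    ]
}

top, coverage = summarize_topics(example, top_n=2)
print([topic["t"] for topic in top], round(coverage, 2))
```

The remaining probability mass (one minus the coverage) is spread over topics that fell below the reporting threshold.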

## Advanced Usage


This pipeline accepts additional parameters that let you control how it runs and how its results are labeled: `language` and `doc_name`.

- `language`: Accepts language abbreviation strings such as "en" (English) or "de" (German). If provided, the pipeline assumes the specified language and skips the language detection step, directly using the corresponding spaCy and MALLET models.

- `doc_name`: A string identifier with no spaces (e.g., "my_document"). When you provide this, it replaces the autogenerated document ID. The internal counter only increments for calls where you don’t specify a name, so the next unnamed document will pick up from the following number.
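
The counter behaviour described for `doc_name` can be pictured with a small toy model. This is a sketch of the behaviour as stated above, not the library's actual implementation; the class `DocIdCounter` and its method `next_id` are hypothetical names introduced only for this illustration.

```python
class DocIdCounter:
    """Toy model of the ID behaviour described above (not the library's code)."""

    def __init__(self):
        self._unnamed_calls = 0

    def next_id(self, doc_name=None):
        # A supplied name is used as-is and leaves the counter untouched.
        if doc_name is not None:
            return doc_name
        # Only unnamed documents advance the counter.
        self._unnamed_calls += 1
        return f"doc{self._unnamed_calls}"


ids = DocIdCounter()
generated = [ids.next_id(), ids.next_id(doc_name="my_document"), ids.next_id()]
print(generated)  # ['doc1', 'my_document', 'doc2']
```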



**Example 1**: `language`


In [None]:
# Using the same French text example as before
ldatopics_pipeline(fr_text, language="lb")

Even though the provided text is clearly in French, specifying the language as Luxembourgish, for example, forces the pipeline to use the corresponding spaCy and MALLET models for that language. If the selected language is unsupported, the pipeline will return an appropriate error message.


**Example 2**: `doc_name`


In [None]:
# Using the same French text example as before
ldatopics_pipeline(fr_text, doc_name="my_document")

As you can see, the `ci_id` is now "my_document" as specified, making it much easier to manage your outputs when you run the ldatopics pipeline multiple times or save them in an external file.

**Example 3**: All at once

In [None]:
# Using the same French text example as before
ldatopics_pipeline(
    fr_text,
    language="fr",
    doc_name="my_document"
)

You can combine the additional parameters, or use all of them at once, for finer control over the output of the LDA Topics pipeline. In the example above, we set the language to French and the document name to "my_document".
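
If you want to keep the results of several runs in an external file, one common option is the JSON Lines format (one JSON object per line). The sketch below assumes that each pipeline result is a JSON-serialisable dictionary; `append_jsonl` and `read_jsonl` are hypothetical helpers, the file name is arbitrary, and the `records` are placeholders with invented values rather than real pipeline output.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one result dictionary to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read all non-empty lines of `path` back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Placeholder records standing in for pipeline results (values invented).
records = [
    {"ci_id": "my_document", "lg": "fr", "topics": [{"t": "topic_a", "p": 0.16}]},
    {"ci_id": "doc1", "lg": "fr", "topics": [{"t": "topic_b", "p": 0.12}]},
]

# Write to a temporary directory so the example leaves no files behind in
# your working directory.
output_path = os.path.join(tempfile.mkdtemp(), "ldatopics_results.jsonl")
for record in records:
    append_jsonl(output_path, record)

loaded = read_jsonl(output_path)
print([record["ci_id"] for record in loaded])
```

In practice you would replace the placeholder records with the value returned by each pipeline call, using `doc_name` to give every line a recognisable `ci_id`.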


## Limitations of the LDA Topics Pipeline


In [8]:
short_de_text = "Alex geht ins Büro."

In [None]:
# Example 1: Very short sentence
ldatopics_pipeline(short_de_text)

As shown, the LDA Topics pipeline struggles with **short texts**, which lead to a very flat and probably unreliable topic distribution. Additionally, any token that is **not** a **NOUN** or **PRONOUN** is dropped during lemmatization and does not contribute to topic modeling, making a short text even shorter.
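
One rough way to quantify how "flat" a topic distribution is uses normalized Shannon entropy: a value close to 1 means the probability is spread almost evenly, while a lower value indicates a few dominant topics. The sketch below is a heuristic suggested here, not part of the pipeline; `normalized_entropy` is a hypothetical helper and the input probabilities are invented. Keep in mind that the pipeline only reports topics above `min_p`, so the score describes the reported topics only.

```python
import math


def normalized_entropy(probabilities):
    """Shannon entropy of the renormalized probabilities, scaled to [0, 1]."""
    total = sum(probabilities)
    if total <= 0 or len(probabilities) < 2:
        return 0.0
    shares = [p / total for p in probabilities if p > 0]
    entropy = -sum(share * math.log(share) for share in shares)
    # Dividing by log(n) maps a perfectly even distribution to 1.0.
    return entropy / math.log(len(probabilities))


# Invented probabilities: an even spread versus one dominant topic.
flat = normalized_entropy([0.03, 0.03, 0.03, 0.03])
peaked = normalized_entropy([0.40, 0.03, 0.03, 0.03])
print(round(flat, 3), round(peaked, 3))
```

A high score on a very short input is a hint that the result should be treated with caution.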


## Conclusion
In this notebook, we’ve explored the LDA Topics Pipeline from the Impresso package, which transforms raw text into topic distributions using language detection, spaCy lemmatization, and MALLET inference. The pipeline:

- **Automatically detects** text language via langident (or accepts a manual override)  
- **Loads** the matching spaCy model to tokenize and lemmatize your text  
- **Runs** MALLET’s inference pipeline with versioned, pre-trained topic models  
- **Returns** a structured JSON (`ci_id`, `ts`, `lg`, `topic-count`, `topics`) ready for analysis  
- **Supports** custom document IDs (`doc_name`)

<!-- and adjustable reporting thresholds (`min_p`)   -->
- **Has** limitations with very short texts or texts containing few nouns and pronouns

The pipeline provides a reproducible, multi-lingual approach to topic modelling—ideal for thematic exploration of large text collections and cross-language comparisons.  


## Next steps


Since this subpackage relies on the Impresso langident subpackage, you might be interested in exploring the [demo notebook](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb) for langident.

Additionally, you can find more technical details in the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/mallet_pipeline).


---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0"/>
</a>
<br></br>

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>

