# Solr Normalization with impresso-pipelines Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/solrnormalization_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?

This notebook introduces the [solrnormalization](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization) component of the [impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package.

The goal of the `impresso-pipelines` Python package is to **make the internal data processing workflows of the Impresso Web App transparent and reproducible**. `impresso-pipelines` offers low-code, ready-to-use pipelines that require minimal configuration and allow users (such as researchers, developers, or digital humanities practitioners) to apply the *same processing steps* we used on our historical newspaper collections to their *own* text collections.

One of these processes is **text indexing**, which we do with the Solr search platform, to support information retrieval (keyword search and beyond) on the Impresso corpus. Before building the index, Solr applies a series of language analysis and normalization steps to the text. The same processes are applied to users' query terms.

This notebook lets you **replicate Solr's normalization steps** by providing a standardized, language-aware tokenization and normalization pipeline that simulates the transformations applied during text indexing and query processing. To retrieve matching results, Solr must apply exactly the same normalization steps to both the indexed text and the user's query.
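To make the index/query symmetry concrete, here is a minimal sketch in plain Python. It does not use Solr or `impresso-pipelines`; the `normalize` and `build_index` helpers and the two sample documents are invented for illustration, and the toy normalizer only lowercases and strips accents.

```python
import unicodedata
from collections import defaultdict


def normalize(term: str) -> str:
    """Toy normalizer (not Solr's): lowercase, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized term to the ids of the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[normalize(term)].add(doc_id)
    return index


# Hypothetical mini-collection, for illustration only.
docs = {"d1": "Le Café de Genève", "d2": "Zürich am Morgen"}
index = build_index(docs)

# The raw query term misses; the normalized one reaches the same index entry.
raw_hits = index.get("Genève", set())
normalized_hits = index.get(normalize("Genève"), set())
print(raw_hits, normalized_hits)
```

Because the index only contains normalized terms, a query that skips normalization cannot match, which is why both sides must go through the same steps.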

This notebook therefore **helps you understand how Solr text normalization works** by exposing the exact tokens produced after language-specific analyzers, filters, and stopword lists have been applied.

**A word on Solr normalization steps**

Typical language normalization steps applied by indexing systems include the following (for technical background, you may refer to the [Solr documentation](https://solr.apache.org/guide/solr/latest/indexing-guide/language-analysis.html)). We do not detail each step but present them in general terms:

- Tokenisation: A process by which text sequences are broken into linguistic units (no sub-word tokenisation here). Several [tokenisers](https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html) are available in Solr; the Impresso indexing pipeline uses the Standard Tokeniser.

- Lower-casing: Search engines typically convert all text to lowercase before indexing.

- Stopword removal: Common words (like "the", "and", "is") appear very frequently and are typically not searched for. As they would also take up a lot of storage (in English, the word "the" alone accounts for several percent of all tokens), removing them from the index is also resource-efficient.

- Stemming: The European languages found in the Impresso collection are inflected, meaning that the same word can appear in many different forms (e.g., "introduce", "introduces", "introduced", "introducing"). These forms are reduced to a base form through *stemming*, which makes it possible to find all forms of a word with a single search term.

- ASCII folding: Text collections, especially historical ones, often contain inconsistencies in spelling and linguistic conventions across time periods and sources; the use of special characters, diacritics, and accents can vary over time. Ignoring diacritics and mapping non-ASCII characters to their ASCII equivalents (a process called ASCII folding) helps ensure that searches are not affected by these variations.
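The steps above can be sketched as a chain of small functions. This is a toy illustration in pure Python, not the Lucene implementation: the stopword set, the suffix list, and all function names are made up for the example. The accent stripping shown here also does not cover every mapping a real ASCII folding filter performs (for instance, it leaves `ß` unchanged). The order of steps in an actual Solr analyzer is configurable and may differ.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "and", "is", "a", "of"}
SUFFIXES = ("ing", "ed", "es", "s")


def ascii_fold(token: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_normalize(text: str) -> list[str]:
    tokens = re.findall(r"\w+", text)                    # tokenisation
    tokens = [t.lower() for t in tokens]                 # lower-casing
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    tokens = [ascii_fold(t) for t in tokens]             # ASCII folding
    return [naive_stem(t) for t in tokens]               # stemming


print(toy_normalize("The café is introducing and introduces a résumé"))
# → ['cafe', 'introduc', 'introduc', 'resume']
```

Note how "introducing" and "introduces" collapse to the same stem, so a single search term would match both forms.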

## What will you learn?

In this notebook, you will:

- Understand the functionality of the `solrnormalization` subpackage from the `impresso-pipelines` package.
- Learn how to normalize a raw text using language-specific Solr analyzers.
- Explore different use cases, including **basic and advanced usage** of the SolrNormalization pipeline.
- Recognize some limitations of the pipeline, such as support for **only selected languages** and the need for JVM integration.

By the end of this notebook, you will have a clear understanding of how Solr-style normalization is applied to text and how it can be used in practical scenarios.

## Background: How it works
For technical details on this library, please refer to the repository of the [impresso-pipelines](https://github.com/impresso/impresso-pipelines/tree/main) package.

The SolrNormalization component reproduces the language-aware text processing pipeline used in Impresso Solr indexing, by combining Python and Lucene Java-based analyzers in a unified workflow. The process follows these steps:

- Initialization: Load Lucene/Solr analyzers and stopword lists for all supported languages, each with its own tokenizer and normalization rules. A generic analyzer is also loaded for unsupported or unknown languages.
(*Technical background: Solr analyzers—based on Apache Lucene—apply tokenization, lowercasing, stopword removal, ASCII folding, and stemming in a configurable sequence. We use Lucene’s CustomAnalyzer to reproduce Impresso’s exact configuration for each language.*)

- Language Detection: Automatically detect the language of the input text unless a language is provided manually.
(*Technical background: This step can use the [LangIdentPipeline](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/langident) from `impresso-pipelines` to select the appropriate language-specific analyzer.*)

- Analysis: Apply language-specific processing using Solr components, including tokenization, lowercasing, stopword removal, ASCII folding, and stemming, fully mirroring the transformations used in Impresso’s indexing workflows.
(*Technical background: Through JPype, Python code directly accesses the Java-based Lucene/Solr analysis pipeline.*)

- Output: Return the list of normalized tokens together with the detected (or specified) language.
(*Technical background: The component supports cleanup of Java resources via a context manager or destructor.*)
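The four steps above can be sketched as a small dispatcher. This is only a structural illustration under simplifying assumptions: `ToyNormalizer`, `german_like`, and `general` are hypothetical stand-ins written in plain Python, whereas the real component delegates analysis to Lucene through JPype and detects the language with a dedicated model.

```python
from typing import Callable, Optional

Analyzer = Callable[[str], list[str]]


def german_like(text: str) -> list[str]:
    """Stand-in for a language-specific analyzer (lowercase + stopwords)."""
    stopwords = {"der", "die", "das", "und"}
    return [t for t in text.lower().split() if t not in stopwords]


def general(text: str) -> list[str]:
    """Stand-in for the generic fallback analyzer (lowercase only)."""
    return text.lower().split()


class ToyNormalizer:
    def __init__(self, detect: Callable[[str], str]) -> None:
        # Initialization: register one analyzer per supported language.
        self.analyzers: dict[str, Analyzer] = {"de": german_like}
        self.detect = detect

    def __call__(self, text: str, lang: Optional[str] = None) -> dict:
        # Language detection is skipped when the caller supplies a language.
        language = lang or self.detect(text)
        # Analysis: fall back to the generic analyzer for unknown languages.
        analyzer = self.analyzers.get(language, general)
        # Output: tokens together with the language that was used.
        return {"language": language, "tokens": analyzer(text)}


toy = ToyNormalizer(detect=lambda text: "de")  # hypothetical fixed detector
auto_result = toy("Der Hund und die Katze")
forced_result = toy("Der Hund und die Katze", lang="el")
print(auto_result)
print(forced_result)
```

The second call shows the fallback path: with a language that has no registered analyzer, the stopwords are kept.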

## Useful resources

- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).


## Prerequisites

First, start by installing the `impresso-pipelines` package. Please note that it might require you to restart the runtime to apply changes. To do so, on Google Colab, go to _Runtime_ and select _Restart session_.


In [None]:
%pip install --upgrade "impresso-pipelines[solrnormalization]"
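Since the pipeline reaches the Lucene analyzers through JPype, a Java runtime is presumably needed on the machine. Whether the package locates or provides one by itself is not stated here, so the following check is only a heuristic sketch. `java_hints` is a hypothetical helper, and a missing result does not prove that the pipeline will fail.

```python
import os
import shutil


def java_hints() -> dict:
    """Collect hints about a local Java installation (heuristic only)."""
    return {
        "java_on_path": shutil.which("java"),      # path to the java binary, or None
        "JAVA_HOME": os.environ.get("JAVA_HOME"),  # value of JAVA_HOME, or None
    }


hints = java_hints()
if not any(hints.values()):
    print("No Java installation found via PATH or JAVA_HOME.")
else:
    print(hints)
```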

## Basic Usage

Import the normalization pipeline from the package and create an instance of the
`SolrNormalizationPipeline` class. This class provides methods to normalize text in multiple languages using Solr
analyzers.


In [None]:
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline

# We name the instance of SolrNormalizationPipeline as solrnormalization_pipeline
solrnormalization_pipeline = SolrNormalizationPipeline()

Let's take an example paragraph in German, with special characters and different punctuation marks, and pass it to the `solrnormalization_pipeline` instance.


In [None]:
de_text = """
In den frühen Morgenstunden verließ der Geschäftsführer der Münchner Rückversicherungsgesellschaft das Gebäude – ein Ereignis,
das unter den Mitarbeitenden für großes Aufsehen sorgte. Außerdem wurde über „öffentliche“ Statements spekuliert, obwohl keine
offiziellen Informationen verfügbar waren.
"""

In [None]:
solrnormalization_pipeline(de_text)

**Interpretation of the result:**

As can be seen, the returned result is a dictionary with two keys: `"language"` and `"tokens"`.

- `language` indicates the language detected for the input text (in this case, `'de'` for German). If a language is manually specified, it will override the detection.

- `tokens` is a list of normalized tokens extracted from the text using a Lucene analyzer. These tokens are lowercased, stripped of stopwords, and normalized to remove accents and standardize common linguistic variants.

The output shows how long compound words (e.g., `Rückversicherungsgesellschaft`), special characters (e.g., `ü` in `frühen`), and punctuation (e.g., quotes and dashes) are handled and broken down into analyzable components.
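Because the result is a plain dictionary, it can be post-processed with standard Python tools. The sketch below counts the most frequent tokens. `most_common_tokens` is a hypothetical helper, and `example_result` is a hand-written placeholder with the documented shape, not actual pipeline output.

```python
from collections import Counter


def most_common_tokens(result: dict, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent normalized tokens of one pipeline result."""
    return Counter(result["tokens"]).most_common(n)


# Hand-written placeholder with the documented keys, not real pipeline output.
example_result = {"language": "de", "tokens": ["alpha", "beta", "alpha", "gamma"]}

print(example_result["language"])
print(most_common_tokens(example_result, n=1))
```

In this notebook, the same helper could be applied to `solrnormalization_pipeline(de_text)`.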


## Advanced Usage


This pipeline currently supports two optional parameters when calling it, allowing limited control over its behavior while keeping usage simple.

- `lang`: expects a string language code (e.g., `'de'` or `'fr'`). If provided, the pipeline skips automatic language detection and directly applies the corresponding Lucene analyzer. This can improve performance slightly and is useful in controlled multilingual settings.

- `diagnostics`: expects a Boolean value. If set to `True`, the pipeline will return additional information such as removed stopwords from the provided text.

These parameters can be used individually or together, depending on the level of detail needed.


**Example 1:** `lang`


In [None]:
solrnormalization_pipeline(de_text, lang="fr")

If you specify `lang='fr'`, the pipeline will skip automatic language detection and apply the French analyzer directly—even if the input text is in another language.  
This can be useful in controlled setups where you already know the correct language, or when comparing how the same text is normalized under different analyzers.

In this example, we manually forced the pipeline to use the French analyzer on a German text. As a result, many German stopwords remain in the output, and language-specific filters (like German normalization or minimal stemming) are not applied.
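To compare analyzers more systematically, the same text can be run under several language codes and the resulting token lists contrasted. The helpers below are a sketch: `compare_analyzers` and `tokens_unique_to` are hypothetical names, and the only assumption about the pipeline is the call form `pipeline(text, lang=...)` returning a dictionary with a `tokens` key, as shown in this notebook.

```python
from typing import Callable, Iterable


def compare_analyzers(pipeline: Callable[..., dict], text: str,
                      langs: Iterable[str]) -> dict[str, list[str]]:
    """Run the same text through several analyzers and collect the tokens."""
    return {lang: pipeline(text, lang=lang)["tokens"] for lang in langs}


def tokens_unique_to(tokens_by_lang: dict[str, list[str]], lang: str) -> set[str]:
    """Tokens produced under `lang` that no other analyzer produced."""
    others: set[str] = set()
    for other, tokens in tokens_by_lang.items():
        if other != lang:
            others.update(tokens)
    return set(tokens_by_lang[lang]) - others


# Usage in this notebook (needs the pipeline instance created earlier):
# by_lang = compare_analyzers(solrnormalization_pipeline, de_text, ["de", "fr"])
# print(tokens_unique_to(by_lang, "fr"))
```

Tokens that appear only under the French analyzer are typically German stopwords or unstemmed forms that the German analyzer would have removed or reduced.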


**Example 2**: `diagnostics`


In [None]:
solrnormalization_pipeline(de_text, diagnostics=True)

If you set `diagnostics=True`, the pipeline output will additionally include:

- `stopwords_detected`: a list of words from the original text that were identified as stopwords and removed from the final tokens.
- `analyzer_pipeline`: a list describing the sequence of processing steps (tokenizer and token filters) applied to the text for the detected language.
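The diagnostics keys can be used to get a rough sense of how much text the stopword filter removes. The following is a sketch only. `stopword_share` is a hypothetical helper, the placeholder dictionary is hand-written, and the ratio is approximate because this notebook does not specify whether `stopwords_detected` lists repeated stopwords once or several times.

```python
def stopword_share(result: dict) -> float:
    """Share of removed stopwords among kept tokens plus removed stopwords."""
    kept = len(result.get("tokens", []))
    removed = len(result.get("stopwords_detected", []))
    total = kept + removed
    return removed / total if total else 0.0


# Hand-written placeholder with the documented keys, not real pipeline output.
example_diagnostics = {
    "language": "de",
    "tokens": ["alpha", "beta", "gamma"],
    "stopwords_detected": ["der"],
}

print(stopword_share(example_diagnostics))
# → 0.25
```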


**Example 3**: All at once


In [None]:
solrnormalization_pipeline(de_text, lang="de", diagnostics=True)

You can combine multiple options in a single call, like specifying the language and enabling diagnostics. The pipeline is flexible and allows using any combination of supported parameters.


**Example 4**: Tokenization of common OCR errors (`^`, `_`, `-`)


In [None]:
altered_de_text = """

In den frühen Morge^stunden verließ der Geschäftsführer der Münchner Rückversicherungsgesellschaft das Gebäude – ein Ereignis,
das unter den Mitar_eitenden für großes Aufsehen sorgte. Außerdem wurde über „öffentliche“ Statements spekuliert, obwohl keine
offiziellen Inform-tionen verfügbar waren.


"""

In [None]:
solrnormalization_pipeline(altered_de_text, diagnostics=True)

This example illustrates how common OCR-related artifacts such as `^`, `_`, and hyphens are handled during normalization and tokenization.
These characters often appear in digitized historical documents due to misrecognition or formatting inconsistencies.

As can be seen above, `^` and `-` split a word into two separate tokens, whereas `_` is treated as part of the word.
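A rough analogy in plain Python: the regular expression class `\w` also treats `_` as a word character while `^` and `-` act as separators. This is not how Lucene's tokenizer is implemented. The snippet merely reproduces the same splitting pattern on the three altered words, and `split_tokens` is a hypothetical helper.

```python
import re


def split_tokens(text: str) -> list[str]:
    """Split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"\w+", text)


for sample in ["Morge^stunden", "Mitar_eitenden", "Inform-tionen"]:
    print(sample, "->", split_tokens(sample))
```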


## Limitations of the Solr Normalization Pipeline


In [None]:
unsupported_languages = """Στις πρώτες πρωινές ώρες, ο διευθύνων σύμβουλος της αντασφαλιστικής εταιρείας του Μονάχου αποχώρησε από το κτήριο – ένα γεγονός που προκάλεσε μεγάλη αναστάτωση μεταξύ των εργαζομένων. Επιπλέον, υπήρξαν εικασίες σχετικά με «δημόσιες» δηλώσεις, παρόλο που δεν υπήρχε καμία επίσημη ενημέρωση."""

In [None]:
solrnormalization_pipeline(unsupported_languages)

Currently, the Solr Normalization Pipeline offers dedicated analyzer pipelines for German, French, English, Spanish, Italian, Dutch, and Portuguese. For any other detected language, the pipeline automatically uses a simplified "general" analyzer, which performs fewer processing steps and does not remove stop words.
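When processing a mixed collection, it can help to flag results that went through the fallback analyzer, since their tokens are not directly comparable with those from dedicated analyzers. The sketch below assumes two-letter ISO 639-1 codes for the seven languages listed above. Only `'de'` and `'fr'` appear verbatim in this notebook, so the other codes, the helper name, and the placeholder results are assumptions.

```python
# Assumed ISO 639-1 codes for the languages with dedicated analyzers.
DEDICATED_LANGUAGES = {"de", "fr", "en", "es", "it", "nl", "pt"}


def has_dedicated_analyzer(result: dict) -> bool:
    """True if the result's language is one of the dedicated analyzers."""
    return result["language"] in DEDICATED_LANGUAGES


# Hand-written placeholders with the documented keys, not real pipeline output.
placeholder_results = [
    {"language": "de", "tokens": ["alpha"]},
    {"language": "el", "tokens": ["beta"]},
]

fallback_only = [r for r in placeholder_results if not has_dedicated_analyzer(r)]
print([r["language"] for r in fallback_only])
# → ['el']
```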

## Conclusion


The `SolrNormalizationPipeline` provides a lightweight, end-to-end solution for applying Solr-style normalization to German, French, English, Spanish, Italian, Dutch, and Portuguese texts. It delivers:

- **Impresso-mirrored and consistent preprocessing**: Language-specific analyzers ensure uniform tokenization and normalization aligned with Solr standards.
- **Easy Integration**: One-line setup and inference with minimal configuration.
- **Optional Language Control**: Users can rely on built-in language detection or manually specify the language.
- **Transparent Output**: Returns clean token lists that reveal exactly how input text is normalized.
- **Future Extensibility**: Designed to be extended with more languages, diagnostics, and custom analyzer configurations.

Whether you're preparing data for indexing, building reproducible NLP pipelines, or analyzing corpora in any of the supported languages, this pipeline simplifies language-aware normalization without requiring deep Lucene or Java expertise.


## Next steps


To get a better understanding of how this pipeline works, please check out the [original repository](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization).


---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

INSERT CREDITS HERE
<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 License"/>
</a>
<br></br>

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
