# Solr Normalization with impresso-pipelines Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/solrnormalization_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If something doesn't work, you can [report a problem](https://github.com/impresso/impresso-datalab-notebooks/blob/main/reporting-problems.md).


## What is this notebook about?

This notebook introduces the SolrNormalization component of the
[impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package. The
broader goal of the Impresso pipelines is to make the internal data processing workflows
of the Impresso webapp transparent, reusable and accessible to others. The package allows
external users, such as researchers, developers, or digital humanities practitioners, to
apply the same processing steps we used on our historical newspaper collections to their
own text collections.

By offering ready-to-use pipelines, users can adopt the Impresso approach to document
processing with minimal configuration. This ensures consistency, comparability, and
transparency in how our Solr-based Information Retrieval (IR) system splits texts into words
(also known as "tokens" in the NLP community) and normalizes these words for the search index.

In this notebook, we focus on the
[solrnormalization](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization)
subpackage, which offers Solr tokenization and normalization
analyzers directly from Python. It supports automatic language detection and applies
language-specific analyzers for all languages that the Impresso indexer treats in a
language-dependent way.

## Why is this useful?

- European languages found in our Impresso collection are inflecting, meaning that the same word can appear in many
  different forms (e.g., "introduce", "introduces", "introduced", "introducing").
  Typical search engines reduce these forms to a base form (a process called stemming)
  that allows users to find all forms of a word with a single search term.
- Newspapers also use title casing and uppercasing, therefore search engines typically
  convert all text to lowercase before indexing.
- Stopwords (common words like "the", "and", "is") appear very frequently and are
  typically not searched for. As they would also need a lot of storage (e.g. in English,
  the word "the" alone accounts for roughly 6 to 7% of all tokens), it is also
  resource-efficient to remove them from the index.
- Text collections, especially historical ones, often contain inconsistencies in spelling
  and linguistic conventions across time periods and sources; the use of special
  characters, diacritics, and accents can also vary over time. Ignoring diacritics and
  mapping non-ASCII characters to their ASCII equivalents (a process called ASCII folding)
  helps to ensure that searches are not affected by these variations.
- This pipeline provides a standardized, language-aware tokenization and normalization
  pipeline to simulate and understand the text transformations that happen when we index
  our texts, or when a user formulates a query. In order to find hits with a
  user-provided query, it is necessary to apply the exact same normalizations to the query as
  to the texts.
- Therefore, this pipeline helps to understand the text normalization process by
  exposing the exact tokens produced after language-specific analyzers and stopword
  lists have been applied.
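
To make the points above concrete, here is a minimal pure-Python sketch of the four normalization steps (lowercasing, stopword removal, ASCII folding, stemming). It is only an illustration: the stopword list, the suffix rules and the names `toy_normalize`, `toy_stem` and `ascii_fold` are invented for this example and are far cruder than the Lucene analyzers the pipeline actually uses. It also shows why the query must go through the same normalization as the indexed text.

```python
import unicodedata

# Toy stopword list and suffix rules, chosen only for this illustration.
TOY_STOPWORDS = {"the", "and", "is", "a", "of"}
TOY_SUFFIXES = ("ing", "ed", "es", "s")


def ascii_fold(token: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Remove the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, drop stopwords, fold accents, stem."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in TOY_STOPWORDS]
    return [toy_stem(ascii_fold(t)) for t in tokens]


indexed = toy_normalize("The Café introduces Crêpes")
query = toy_normalize("cafe introduced")
print(indexed)  # ['cafe', 'introduc', 'crep']
print(query)    # ['cafe', 'introduc']
# Every query token matches an indexed token only because both sides
# went through the same normalization.
print(set(query) <= set(indexed))  # True
```

Without the shared normalization, the query term "introduced" would not match the indexed form "introduces", and "cafe" would not match "Café".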

## How it works

- **Init**: Load Lucene analyzers and stopword lists for all languages with
  language-specific stopwords and normalization. Additionally, load a generic text
  analyzer for any other language.
- **Language Detection**: Automatically detect the language of the input text unless manually specified.
- **Analyze**: Apply language-specific tokenization, lowercasing, stopword removal,
  ASCII folding, and stemming using Solr components.
- **Output**: Return a list of normalized tokens along with the language information.
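
The four stages can be sketched as a small toy class. This is a sketch of the control flow only, under stated assumptions: `ToyNormalizer`, `toy_detect_language` and the tiny stopword lists are hypothetical, and the real pipeline uses Lucene analyzers and a trained language identifier instead.

```python
from typing import Callable, Optional

# Hypothetical per-language stopword lists; the real pipeline loads Lucene resources.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "les", "et", "est"},
}


def make_analyzer(stopwords: set[str]) -> Callable[[str], list[str]]:
    """Build a toy analyzer: lowercase, whitespace split, stopword removal."""
    def analyze(text: str) -> list[str]:
        return [t for t in text.lower().split() if t not in stopwords]
    return analyze


def toy_detect_language(text: str) -> str:
    """Guess the language by counting stopword hits; 'general' if nothing matches."""
    words = set(text.lower().split())
    hits = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "general"


class ToyNormalizer:
    def __init__(self) -> None:
        # Init: one analyzer per language, plus a generic one without stopwords.
        self.analyzers = {lang: make_analyzer(sw) for lang, sw in STOPWORDS.items()}
        self.analyzers["general"] = make_analyzer(set())

    def __call__(self, text: str, lang: Optional[str] = None) -> dict:
        # Language detection runs only when no language is given.
        lang = lang or toy_detect_language(text)
        # Analyze: pick the matching analyzer, falling back to the generic one.
        analyzer = self.analyzers.get(lang, self.analyzers["general"])
        # Output: tokens together with the language that was used.
        return {"language": lang, "tokens": analyzer(text)}


normalizer = ToyNormalizer()
print(normalizer("Das Haus ist alt"))
print(normalizer("La maison est vieille", lang="fr"))
```

The dictionary shape with `language` and `tokens` mirrors what the real pipeline returns; everything else is simplified.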

## Technical background

- **Solr Analyzers**: Java-based components that apply tokenization, lowercasing,
  stopword removal, ASCII folding, and language-specific stemming in a configurable
  sequence. Solr uses Java-based analyzers that originally come from the Apache Lucene project.
- **CustomAnalyzer**: Lucene's builder interface allows fine-grained control over the
  filter and tokenizer pipeline for each supported language. Our pipeline reflects the
  impresso configuration exactly, including language-specific stopwords and stemming
  rules.
- **JPype Integration**: Bridges Python and Java, enabling direct access to Solr's
  analysis pipeline from within Python code.
- **Language Detection**: Optionally uses
  [`LangIdentPipeline`](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/langident)
  to automatically select the language if not specified manually.
- **Post-Processing**: Returns normalized tokens and the language; supports resource cleanup via context manager or destructor.


## What will you learn?

In this notebook, you will:

- Understand the functionality of the `solrnormalization` subpackage from the Impresso Pipelines package.
- Learn how to normalize a raw text using language-specific Solr analyzers.
- Explore different use cases, including **basic and advanced usage** of the SolrNormalization pipeline.
- Recognize some limitations of the pipeline, such as support for **only selected languages** and the need for JVM integration.

By the end of this notebook, you will have a clear understanding of how Solr-style normalization is applied to text and how it can be used in practical scenarios.


## Useful resources

- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).


## Prerequisites

Start by installing the `impresso-pipelines` package. You may need to restart the runtime for the changes to take effect. To do so on Google Colab, go to _Runtime_ and select _Restart session_.


In [None]:
%pip install --upgrade "impresso-pipelines[solrnormalization]"

## Basic Usage

Import the normalization pipeline from the package and create an instance of the
`SolrNormalizationPipeline` class. This class normalizes text in multiple languages using Solr
analyzers.


In [None]:
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline

solrnormalization_pipeline = SolrNormalizationPipeline()

Let's take a German example paragraph, which contains special characters and different
punctuation marks.


In [None]:
de_text = """
In den frühen Morgenstunden verließ der Geschäftsführer der Münchner Rückversicherungsgesellschaft das Gebäude – ein Ereignis,
das unter den Mitarbeitenden für großes Aufsehen sorgte. Außerdem wurde über „öffentliche“ Statements spekuliert, obwohl keine
offiziellen Informationen verfügbar waren.
"""

In [None]:
solrnormalization_pipeline(de_text)

**Interpretation of the result:**

As can be seen, the returned result is a dictionary with two keys: `"language"` and `"tokens"`.

- `language` indicates the language detected for the input text (in this case, `'de'` for German). If a language is manually specified, it will override the detection.

- `tokens` is a list of normalized tokens extracted from the text using a Lucene analyzer. These tokens are lowercased, stripped of stopwords, and normalized to remove accents and standardize common linguistic variants.

The output shows how long compound words (e.g., `Rückversicherungsgesellschaft`), special characters (e.g., `ü` in `früh`), and punctuation (e.g., quotes and dashes) are handled and broken down into analyzable components.
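
Since the result is a plain dictionary, it can be inspected with ordinary Python. The helper below is a small sketch; `describe_result` is a name made up for this example, and the dictionary passed to it is a placeholder in the documented shape, not actual pipeline output.

```python
def describe_result(result: dict) -> str:
    """Summarize a result dictionary with 'language' and 'tokens' keys."""
    tokens = result["tokens"]
    unique = len(set(tokens))
    return f"language={result['language']}, tokens={len(tokens)}, unique={unique}"


# Placeholder dictionary in the documented shape; not actual pipeline output.
example_result = {"language": "de", "tokens": ["haus", "alt", "haus"]}
print(describe_result(example_result))  # language=de, tokens=3, unique=2
```

In this notebook, the same helper could be applied to the real result with `describe_result(solrnormalization_pipeline(de_text))`.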


## Advanced Usage


This pipeline currently supports two optional arguments when calling it, allowing limited control over its behavior while keeping usage simple and reproducible.

- `lang`: expects a string language code (e.g., `'de'` or `'fr'`). If provided, the pipeline skips automatic language detection and directly applies the corresponding Lucene analyzer. This can improve performance slightly and is useful in controlled multilingual settings.

- `diagnostics`: expects a Boolean value. If set to `True`, the pipeline will return additional information such as removed stopwords from the provided text.

These arguments can be used individually or together, depending on the level of detail needed.


**Example 1:** `lang`


In [None]:
solrnormalization_pipeline(de_text, lang="fr")

If you specify `lang='fr'`, the pipeline will skip automatic language detection and apply the French analyzer directly—even if the input text is in another language.  
This can be useful in controlled setups where you already know the correct language, or when comparing how the same text is normalized under different analyzers.

In this example, we manually forced the pipeline to use the French analyzer on a German text. As a result, many German stopwords remain in the output, and language-specific filters (like German normalization or minimal stemming) are not applied.


**Example 2**: `diagnostics`


In [None]:
solrnormalization_pipeline(de_text, diagnostics=True)

If you set `diagnostics=True`, the pipeline will also return an additional field, `stopwords_detected` — a list of words from the original text that were identified as stopwords and consequently removed from the final tokenized output.
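
One possible use of the diagnostics output is to estimate how much of a text consists of stopwords. The sketch below assumes that `stopwords_detected` contains one entry per removed occurrence; if the pipeline deduplicates the list, the ratio would be an underestimate. The helper name `stopword_share` and the sample dictionary are invented for this illustration.

```python
def stopword_share(result: dict) -> float:
    """Fraction of all tokens (kept plus removed) that were stopwords."""
    removed = len(result.get("stopwords_detected", []))
    kept = len(result["tokens"])
    total = removed + kept
    return removed / total if total else 0.0


# Placeholder dictionary in the documented shape; not actual pipeline output.
example_diagnostics = {
    "language": "de",
    "tokens": ["haus", "alt", "gross"],
    "stopwords_detected": ["das"],
}
print(stopword_share(example_diagnostics))  # 0.25
```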


**Example 3**: All at once


In [None]:
solrnormalization_pipeline(de_text, lang="de", diagnostics=True)

You can combine multiple options in a single call, like specifying the language and enabling diagnostics. The pipeline is flexible and allows using any combination of supported parameters.


**Example 4**: Tokenization of common OCR errors (`^`, `_`, `-`)


In [None]:
altered_de_text = """

In den frühen Morge^stunden verließ der Geschäftsführer der Münchner Rückversicherungsgesellschaft das Gebäude – ein Ereignis,
das unter den Mitar_eitenden für großes Aufsehen sorgte. Außerdem wurde über „öffentliche“ Statements spekuliert, obwohl keine
offiziellen Inform-tionen verfügbar waren.


"""

In [None]:
solrnormalization_pipeline(altered_de_text, diagnostics=True)

This example illustrates how common OCR-related artifacts such as `^`, `_`, and hyphens are handled during normalization and tokenization.
These characters often appear in digitized historical documents due to misrecognition or formatting inconsistencies.

As can be seen above, `^` and `-` each split a word into two separate tokens, whereas `_` is treated as part of the word.
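
As a rough analogy only (this is not how the Lucene tokenizer is implemented), Python's `\w+` regular expression treats these three characters the same way, because `\w` matches letters, digits and the underscore but neither `^` nor `-`:

```python
import re

# The three altered words from the example text above.
sample = "Morge^stunden Mitar_eitenden Inform-tionen"

# \w+ matches runs of word characters; "_" counts as one, "^" and "-" do not.
pieces = re.findall(r"\w+", sample)
print(pieces)  # ['Morge', 'stunden', 'Mitar_eitenden', 'Inform', 'tionen']
```

This can be a quick way to anticipate which OCR artifacts will fragment a word before running the full pipeline.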


## Limitations of the Solr Normalization Pipeline


In [None]:
unsupported_languages = """In the early hours of the morning, the CEO of the Munich reinsurance company left the building – an event
 that caused a great stir among the employees. Additionally, there was speculation about “public” statements, although no
 official information was available."""

In [None]:
solrnormalization_pipeline(unsupported_languages)

As shown, the Solr Normalization Pipeline currently supports only two languages: German (`'de'`) and French (`'fr'`). If the input text is in another language, such as English, and no `lang` argument is specified, the pipeline will attempt to detect the language automatically and raise an error if it's not supported.

In this case, the detected language was `'en'`, which caused the pipeline to raise a `ValueError`. This behavior is intentional to ensure that unsupported languages are not processed with incorrect analyzers, which could lead to misleading results.
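
When processing a mixed-language collection, it may be convenient to catch this error and skip the affected texts instead of stopping. The wrapper below is a minimal sketch; `normalize_or_none` is a hypothetical helper, and `fake_pipeline` is a stand-in that merely imitates the documented `ValueError` so that the example runs without a JVM.

```python
from typing import Callable, Optional


def normalize_or_none(pipeline: Callable[..., dict], text: str) -> Optional[dict]:
    """Run the pipeline and return None instead of raising on unsupported languages."""
    try:
        return pipeline(text)
    except ValueError as err:
        print(f"Skipped: {err}")
        return None


def fake_pipeline(text: str) -> dict:
    """Stand-in that imitates the documented behaviour, for demonstration only."""
    words = text.lower().split()
    if "the" in words:
        raise ValueError("unsupported language 'en'")
    return {"language": "de", "tokens": words}


print(normalize_or_none(fake_pipeline, "Das Haus"))
print(normalize_or_none(fake_pipeline, "The building"))
```

With the real pipeline, the call would be `normalize_or_none(solrnormalization_pipeline, unsupported_languages)`.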


## Conclusion


The `SolrNormalizationPipeline` provides a lightweight, end-to-end solution for applying Solr-style normalization to German and French texts. It delivers:

- **Consistent Preprocessing**: Language-specific analyzers ensure uniform tokenization and normalization aligned with Solr standards.
- **Easy Integration**: One-line setup and inference with minimal configuration.
- **Optional Language Control**: Users can rely on built-in language detection or manually specify the language.
- **Transparent Output**: Returns clean token lists that reveal exactly how input text is normalized.
- **Future Extensibility**: Designed to be extended with more languages, diagnostics, and custom analyzer configurations.

Whether you're preparing data for indexing, building reproducible NLP pipelines, or analyzing German and French corpora, this pipeline simplifies language-aware normalization—without requiring deep Lucene or Java expertise.


## Next steps


To get a better understanding of how this pipeline works, please check out the [original repository](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization).


---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

**Writing - Original draft:** Glebs Vinarskis. **Conceptualization:** Glebs Vinarskis, Simon Clematide. **Software:** Simon Clematide, Maud Ehrmann. **Writing - Review & Editing:** Caio Mello. **Validation:** Simon Clematide, Maud Ehrmann. **Datalab editorial board:** Caio Mello (Managing), Cao Vy, Emanuela Boros, Juri Opitz, Marten Düring, Martin Grandjean, Pauline Conti. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Simon Clematide, Maud Ehrmann. **Supervision:** Simon Clematide. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 License"/>
</a>
<br></br>
For feedback on this notebook, please send an email to info@impresso-project.ch

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
