# Solr Normalization with the impresso-pipelines Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/solrnormalization_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?
This notebook introduces the SolrNormalization component of the [impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package. The broader goal of the impresso pipelines is to make the internal data processing workflows of the impresso project reusable and accessible to others. It allows external users—such as researchers, developers, or digital humanities practitioners—to apply the same processing steps we used on our historical newspaper collections to their own text collections.

The package is designed to minimize the coding effort required. By offering ready-to-use pipelines, it lets users adopt the impresso approach to document processing with minimal configuration. This ensures consistency, comparability, and transparency in how Solr-normalized data is prepared and evaluated.

In this notebook, we focus on the [solrnormalization](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization) subpackage, which enables Solr-style tokenization and normalization using Lucene analyzers directly from Python. It supports language detection and applies language-specific analyzers for German and French.


## Why is this useful?

*   Text collections—especially historical or multilingual ones—often contain inconsistencies in spelling, formatting, and linguistic conventions across languages, time periods, and sources.
*   This pipeline provides a standardized, language-aware normalization process using Lucene analyzers, making it easier to apply consistent preprocessing across supported languages and prepare clean, comparable input for downstream models or analytical tools.
*   It also makes the normalization process more transparent by exposing the exact tokens produced after language-specific analyzers are applied.


## How it works

*   **Init**: Load Lucene analyzers and stopword lists for supported languages (`de`, `fr`).
*   **Language Detection**: Automatically detect the language of the input text unless manually specified.
*   **Analyze**: Apply language-specific tokenization, lowercasing, stopword removal, ASCII folding, and stemming using Lucene filters.
*   **Output**: Return a list of normalized tokens along with the detected language.
*   **Extras**: Optionally pass a custom Lucene JAR directory, or use the pipeline as a context manager (`__enter__`/`__exit__`) so that resources are cleaned up automatically.
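
To make the analysis steps above concrete, here is a minimal pure-Python sketch of the same kind of chain: tokenize, lowercase, remove stopwords, and fold to ASCII. It is an illustration only and does not use Lucene. The names `TOY_STOPWORDS`, `fold_to_ascii`, and `toy_normalize` are hypothetical, the stopword list is a tiny hand-picked sample, and stemming is left out, so the tokens will differ from what the real pipeline returns.

```python
import re
import unicodedata

# Tiny illustrative stopword sample (assumption); the real pipeline loads
# complete stopword lists for each supported language.
TOY_STOPWORDS = {"de": {"der", "die", "das", "und", "in", "den"}}


def fold_to_ascii(token: str) -> str:
    """Replace ß with ss, then strip diacritics (ü -> u, ä -> a)."""
    token = token.replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_normalize(text: str, lang: str) -> dict:
    """Tokenize, lowercase, drop stopwords, and fold to ASCII (no stemming)."""
    # Compose characters first so that accented letters stay inside one token.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t not in TOY_STOPWORDS[lang]]
    return {"language": lang, "tokens": [fold_to_ascii(t) for t in kept]}


sample = "In den frühen Morgenstunden verließ der Geschäftsführer das Gebäude"
print(toy_normalize(sample, "de"))
```

The returned dictionary mirrors the shape described in the **Output** step (`language` plus `tokens`), which makes it easy to see what each stage contributes before moving on to the Lucene-based pipeline.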


## Technical background

- **Lucene Analyzers**: Java-based components that apply tokenization, lowercasing, stopword removal, ASCII folding, and light language-specific normalization in a configurable sequence.
- **CustomAnalyzer**: Lucene's builder interface allows fine-grained control over the filter and tokenizer pipeline for each supported language.
- **JPype Integration**: Bridges Python and Java, enabling direct access to Lucene's analysis pipeline from within Python code.
- **Language Detection**: Uses [`LangIdentPipeline`](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/langident) to automatically select the appropriate analyzer when the language is not specified manually.
- **Post-Processing**: Returns normalized tokens and the detected language; supports resource cleanup via context manager or destructor.
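
The context-manager cleanup mentioned above could be used along the following lines. This is a sketch under assumptions: `normalize_all` is a hypothetical helper, and it assumes that entering the context returns the pipeline object itself. It accepts any class that follows the same protocol, so it can be tried with a stand-in before a JVM is available.

```python
def normalize_all(pipeline_factory, texts):
    """Normalize every text with one pipeline instance, then release it."""
    # Assumption: __enter__ returns the pipeline object itself, and
    # __exit__ releases the underlying resources.
    with pipeline_factory() as pipeline:
        return [pipeline(text) for text in texts]


# Intended use (requires the installed package and a working JVM):
# from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
# results = normalize_all(SolrNormalizationPipeline, ["Erster Text", "Zweiter Text"])
```

Creating the pipeline once and reusing it for all texts avoids repeating the analyzer setup for every call.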



## What will you learn?

In this notebook, you will:

- Understand the functionality of the `solrnormalization` subpackage from the Impresso Pipelines package.
- Learn how to normalize a given raw text using language-specific Lucene analyzers.
- Explore different use cases, including **basic and advanced usage** of the SolrNormalization pipeline.
- Recognize some limitations of the pipeline, such as support for **only selected languages** and the need for JVM integration.

By the end of this notebook, you will have a clear understanding of how Solr-style normalization is applied to text and how it can be used in practical scenarios.



## Useful resources
- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).

## Prerequisites
First, install the `impresso-pipelines` package. Note that you may need to restart the kernel for the changes to take effect. On Google Colab, go to *Runtime* and select *Restart session*.



In [None]:
%pip install "impresso_pipelines[solrnormalization]"

## Basic Usage

Start by importing the necessary module from the `impresso-pipelines` package.


In [None]:
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline

solrnormalization_pipeline = SolrNormalizationPipeline()

Once you initialize the pipeline, you can simply provide the text you'd like to normalize. This example demonstrates the use of German text.


In [3]:
de_text = """

In den frühen Morgenstunden verließ der Geschäftsführer der Münchner Rückversicherungsgesellschaft das Gebäude – ein Ereignis,
das unter den Mitarbeitenden für großes Aufsehen sorgte. Außerdem wurde über „öffentliche“ Statements spekuliert, obwohl keine
offiziellen Informationen verfügbar waren.


"""

In [None]:
solrnormalization_pipeline(de_text)

**Interpretation of the result:**

As can be seen, the returned result is a dictionary with two keys: `"language"` and `"tokens"`.

- `language` indicates the language detected for the input text (in this case, `'de'` for German). If a language is manually specified, it will override the detection.

- `tokens` is a list of normalized tokens extracted from the text using a Lucene analyzer. These tokens are lowercased, stripped of stopwords, and normalized to remove accents and standardize common linguistic variants.

The output shows how long compound words (e.g., `Rückversicherungsgesellschaft`), special characters (e.g., `ü` in `früh`), and punctuation (e.g., quotes and dashes) are handled and broken down into analyzable components.
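
Because the result is a plain dictionary, it can be inspected with standard Python tools. The sketch below defines a hypothetical helper, `summarize_result`, that reports token counts and the most frequent tokens. The `example` dictionary is a hand-written stand-in with the same two keys, not real pipeline output.

```python
from collections import Counter


def summarize_result(result: dict, top_n: int = 3) -> dict:
    """Summarize a result dictionary with 'language' and 'tokens' keys."""
    tokens = result["tokens"]
    return {
        "language": result["language"],
        "n_tokens": len(tokens),
        "n_unique": len(set(tokens)),
        "most_common": Counter(tokens).most_common(top_n),
    }


# Hand-written stand-in for a pipeline result (assumption, not real output).
example = {"language": "de", "tokens": ["gebaud", "ereignis", "gebaud"]}
print(summarize_result(example))
```

In the notebook, the same helper could be applied to the value returned by `solrnormalization_pipeline(de_text)`.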



## Advanced Usage

This pipeline currently accepts a single optional argument when called, which gives limited control over its behavior while keeping usage simple and reproducible.

- `lang`: expects a string language code (e.g., `'de'` or `'fr'`). If provided, the pipeline skips automatic language detection and directly applies the corresponding Lucene analyzer. This can improve performance slightly and is useful in controlled multilingual settings.

Additional configuration options (e.g., customizing stopword lists, overriding analyzers, or enabling diagnostics) are not yet exposed but could be added in future versions to support more advanced use cases and deeper inspection of the normalization process.



**Example 1:** `lang`




In [None]:
solrnormalization_pipeline(de_text, lang='fr')

If you specify `lang='fr'`, the pipeline will skip automatic language detection and apply the French analyzer directly—even if the input text is in another language.  
This can be useful in controlled setups where you already know the correct language, or when comparing how the same text is normalized under different analyzers.

In this example, we manually forced the pipeline to use the French analyzer on a German text. As a result, many German stopwords remain in the output, and language-specific filters (like German normalization or minimal stemming) are not applied.
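
To compare how the same text is normalized under two analyzers, the token lists can be diffed as sets. The helper below, `compare_analyzers`, is a hypothetical sketch. It only assumes what the examples in this notebook show: the pipeline is callable with a `lang` keyword and returns a dictionary with a `tokens` list.

```python
def compare_analyzers(pipeline, text, lang_a="de", lang_b="fr"):
    """Run one text through two analyzers and report which tokens differ."""
    tokens_a = set(pipeline(text, lang=lang_a)["tokens"])
    tokens_b = set(pipeline(text, lang=lang_b)["tokens"])
    return {
        f"only_{lang_a}": sorted(tokens_a - tokens_b),
        f"only_{lang_b}": sorted(tokens_b - tokens_a),
        "shared": sorted(tokens_a & tokens_b),
    }


# Intended use with the pipeline created earlier in this notebook:
# diff = compare_analyzers(solrnormalization_pipeline, de_text)
# print(diff["only_fr"])  # tokens that appear only under the French analyzer
```

Tokens that appear only under the mismatched analyzer are a quick way to spot stopwords that were not removed.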


## Limitations of the Solr Normalization Pipeline

In [9]:
unsupported_languages = """In the early hours of the morning, the CEO of the Munich reinsurance company left the building – an event
 that caused a great stir among the employees. Additionally, there was speculation about “public” statements, although no
 official information was available."""

In [None]:
solrnormalization_pipeline(unsupported_languages)

As shown, the Solr Normalization Pipeline currently supports only two languages: German (`'de'`) and French (`'fr'`). If the input text is in another language, such as English, and no `lang` argument is specified, the pipeline will attempt to detect the language automatically and raise an error if it's not supported.

In this case, the detected language was `'en'`, which caused the pipeline to raise a `ValueError`. This behavior is intentional to ensure that unsupported languages are not processed with incorrect analyzers, which could lead to misleading results.
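
When processing a mixed-language collection, it can be convenient to skip unsupported texts instead of stopping at the first error. The sketch below wraps the call in a `try`/`except` block. `normalize_or_none` is a hypothetical helper, and it treats every `ValueError` as "unsupported input", which is a simplifying assumption.

```python
def normalize_or_none(pipeline, text):
    """Return the pipeline result, or None if the pipeline raises ValueError."""
    try:
        return pipeline(text)
    except ValueError as err:
        # Assumption: a ValueError here means the detected language is unsupported.
        print(f"Skipped text: {err}")
        return None


# Intended use with the pipeline created earlier in this notebook:
# results = [normalize_or_none(solrnormalization_pipeline, t) for t in texts]
# supported = [r for r in results if r is not None]
```

Returning `None` keeps the results aligned with the input list, so skipped documents can still be identified by position.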


## Conclusion

The `SolrNormalizationPipeline` provides a lightweight, end-to-end solution for applying Solr-style normalization to German and French texts. It delivers:

- **Consistent Preprocessing**: Language-specific analyzers ensure uniform tokenization and normalization aligned with Solr standards.  
- **Easy Integration**: One-line setup and inference with minimal configuration.  
- **Optional Language Control**: Users can rely on built-in language detection or manually specify the language.  
- **Transparent Output**: Returns clean token lists that reveal exactly how input text is normalized.  
- **Future Extensibility**: Designed to be extended with more languages, diagnostics, and custom analyzer configurations.

Whether you're preparing data for indexing, building reproducible NLP pipelines, or analyzing German and French corpora, this pipeline simplifies language-aware normalization—without requiring deep Lucene or Java expertise.




## Next steps

To better understand how this pipeline works, see the [original repository](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/solrnormalization).

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png" width="100" alt="CC BY 4.0 License"/>
</a>
<br></br>

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>

