# News Agencies Recognition with impresso-pipelines Package

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/newsagencies_pipeline_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?

This notebook introduces the NewsAgencies component of the [impresso-pipelines](https://pypi.org/project/impresso-pipelines/) Python package. The broader goal of the impresso pipelines is to make the internal data processing workflows of the impresso project reusable and accessible to others. It allows external users—such as researchers, developers, or digital humanities practitioners—to apply the same processing steps we used on our historical newspaper collections to their own text collections.

The package is designed to minimize the coding effort required. By offering ready-to-use pipelines, it lets users adopt the impresso approach to document processing with minimal configuration. This ensures consistency, comparability, and transparency in how news-agency data is prepared and evaluated.

In this notebook, we focus on the [newsagencies](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/newsagencies) subpackage, which enables extraction of news agencies from a chunk of text. You will learn how to apply the NewsAgencies pipeline to your text data, explore different usage options, and interpret the results.

## Why is this useful?


*   News agency datasets often contain inconsistencies in structure, naming conventions, and content formatting across countries, time periods, and document sources.
*   This pipeline provides a standardized, end-to-end approach for extracting, cleaning, and organizing metadata and textual content, making it easier to perform large-scale comparisons, identify patterns, and feed clean input into downstream models or dashboards.


## How it works

*   **Init**: Load the Hugging Face NER model and tokenizer.
*   **Inference**: Tokenize text, detect “press-agency” mentions with confidence scores.
*   **Filter**: Drop mentions below `min_relevance` or in `suppress_entities`.
*   **Aggregate**: Group by UID and attach its Wikidata link.
*   **Extras**: Enable `diagnostics` for raw mention details or swap in a custom `model_id`.
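
The filter and aggregate steps can be sketched in plain Python. This is a minimal illustration, not the package's actual implementation: the `summarise` helper, the field names `entity`, `uid` and `relevance`, and all sample values are assumptions made for the example, and Wikidata linking is left out.

```python
# Hypothetical raw mentions; field names and values are illustrative only.
mentions = [
    {"uid": "Reuters", "entity": "org.ent.pressagency.Reuters", "relevance": 0.98},
    {"uid": "Reuters", "entity": "org.ent.pressagency.Reuters", "relevance": 0.71},
    {"uid": "AFP", "entity": "org.ent.pressagency.AFP", "relevance": 0.93},
    {"uid": "unk", "entity": "org.ent.pressagency.unk", "relevance": 0.88},
    {"uid": "Havas", "entity": "org.ent.pressagency.Havas", "relevance": 0.05},
]


def summarise(mentions, min_relevance=0.1, suppress_entities=()):
    """Filter mentions, then keep the highest relevance seen for each uid."""
    best = {}
    for mention in mentions:
        if mention["relevance"] < min_relevance:
            continue  # below the confidence threshold
        if mention["entity"] in suppress_entities:
            continue  # explicitly suppressed entity type
        uid = mention["uid"]
        best[uid] = max(best.get(uid, 0.0), mention["relevance"])
    # One entry per agency, sorted by uid for a stable order.
    return {
        "agencies": [
            {"uid": uid, "relevance": score} for uid, score in sorted(best.items())
        ]
    }


summary = summarise(mentions, suppress_entities={"org.ent.pressagency.unk"})
print(summary)
```

In this toy input, the low-confidence Havas mention and the unknown-agency mention are dropped, and the two Reuters mentions collapse into a single entry carrying the higher score.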

## Technical background

- **Transformer Encoder**: A BERT-style model produces contextual embeddings for each token.
- **Token-Classification Head**: A linear layer classifies each token as “press-agency” or “O,” trained via cross-entropy.
- **Subword Tokenization**: WordPiece splits text into subwords to handle unknown vocabulary.
- **Hugging Face Pipeline**: Wraps tokenization, model inference, and offset alignment into a single call.
- **Post-Processing**: Applies a confidence threshold, groups spans by UID, links to Wikidata, and filters via `suppress_entities`; `diagnostics` and `model_id` flags enable raw outputs or custom models.
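
To make the step from per-token labels to character spans more concrete, here is a simplified sketch. It assumes that each subword token arrives with a label, a score and character offsets, and that adjacent tokens with the same label are merged and their scores averaged. The `merge_tokens` helper and the sample predictions are hypothetical; the aggregation actually used by the Hugging Face pipeline or by the impresso package may differ.

```python
text = "Agence Havas reports"

# Hypothetical per-token predictions with character offsets into `text`.
tokens = [
    {"label": "press-agency", "score": 0.5, "start": 0, "stop": 6},    # "Agence"
    {"label": "press-agency", "score": 0.75, "start": 7, "stop": 9},   # "Ha"
    {"label": "press-agency", "score": 1.0, "start": 9, "stop": 12},   # "##vas"
    {"label": "O", "score": 0.99, "start": 13, "stop": 20},            # "reports"
]


def merge_tokens(tokens):
    """Merge consecutive non-"O" tokens that share a label into one span."""
    spans = []
    current = None
    for tok in tokens:
        if tok["label"] == "O":
            current = None  # an "O" token closes any open span
            continue
        if current is not None and tok["label"] == current["label"]:
            current["stop"] = tok["stop"]
            current["scores"].append(tok["score"])
        else:
            current = {
                "label": tok["label"],
                "start": tok["start"],
                "stop": tok["stop"],
                "scores": [tok["score"]],
            }
            spans.append(current)
    # Report the mean token score as the span score.
    return [
        {
            "label": s["label"],
            "start": s["start"],
            "stop": s["stop"],
            "score": sum(s["scores"]) / len(s["scores"]),
        }
        for s in spans
    ]


spans = merge_tokens(tokens)
for span in spans:
    print(text[span["start"]:span["stop"]], span)
```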



## What will you learn?

In this notebook, you will:

- Understand the functionality of the `newsagencies` subpackage from the Impresso Pipelines package.
- Learn how to extract news agencies from a given raw text.
- Explore different use cases, including **basic and advanced usage** of the News Agencies pipeline.
- Recognize some limitations of the pipeline, such as handling **uncommon abbreviations and short texts**.

By the end of this notebook, you will have a clear understanding of how news agencies are extracted from a text and how the results can be used in practice.

## Useful resources
- For technical details on this library, please refer to the repository of the [Impresso Pipelines package](https://github.com/impresso/impresso-pipelines/tree/main).

## Prerequisites
First, install the `impresso-pipelines` package. Note that you may need to restart the kernel for the changes to take effect. On Google Colab, go to *Runtime* and select *Restart session*.



In [None]:
%pip install "impresso_pipelines[newsagencies]"

## Basic Usage

Start by importing the pipeline class from the `impresso-pipelines` package and creating an instance.


In [3]:
from impresso_pipelines.newsagencies import NewsAgenciesPipeline

newsagencies_pipeline = NewsAgenciesPipeline()

Once the pipeline is initialized, pass it the text from which you would like to extract news agencies. This example uses multilingual text (French, German, English).


In [None]:
text = """

Selon une dépêche matinale de l’Agence Havas, datée du 14 mai 1898, la Chambre s’est réunie dans une atmosphère « pleine de gravité ». Par câble, l’Agence France‑Presse (AFP) précise que les bancs de la gauche ont bruyamment salué l’allocution ministérielle. D’Amsterdam, l’Algemeen Nederlands Persbureau (ANP) fait savoir que La Haye observe l’affaire « avec le plus grand intérêt », tandis que l’Agenzia Nazionale Stampa Associata (ANSA) signale de Rome qu’un amendement sera déposé « au nom de la défense des manufactures locales ».
À Berne, ATS‑SDA expédie un avis où l’on lit que le Conseil fédéral « n’entend point se départir de la doctrine de neutralité »; Belga télégraphie que Bruxelles redoute des rétorsions douanières. De Sofia, BTA rappelle qu’en 1904 déjà, des économistes bulgares prévoyaient des secousses monétaires similaires. Dans les cafés de Saint‑Pétersbourg, rapporte Interfax, on discute du projet « comme on eût commenté la réforme de 1861 ».

Wie das Wolffs Telegraphisches Bureau (Wolff) gestern in einer abendlichen Meldung verlautbarte, trat der Haushaltsausschuss „unter beträchtlichem Andrang der Presse“ zusammen. Eine Fernschreibnotiz der Telegraphen‑Union (Telunion) fügte hinzu, man erwarte „lebhafte Zwischenrufe von der Zentrumspartei“. Gemäß Deutsches Nachrichtenbüro (DNB) erinnern die Vorgänge an die Zolltarif‑Debatten des Jahres 1879.
Der Deutsche Depeschendienst / dapd (DDP‑DAPD) berichtet von letzter Minute‑Verhandlungen, während die Deutsche Presse‑Agentur (dpa) meldet, der Reichskanzler halte sich „zu weitergehenden Stellungnahmen bedeckt“. Aus Wien telegraphiert die Österreichische Presseagentur (APA), man hege „vorsichtigen Optimismus“; gleichzeitig warnt die Schweizerische Depeschenagentur (ATS/SDA) vor juristischen Fallstricken im Alpenraum. Die Schweizer Mittelpresse (SPK‑SMP) lässt verlauten, mehrere Kantone pochten auf Kompensationen. Nach einer Drahtnachricht der tschechischen ČTK wolle Prag demnächst ein Gutachten veröffentlichen.

By cable to The Times of London, Reuters states that the conference hall fell silent when the chairman produced a memorandum drafted, it is said, by experts of the old Stefani bureau. Across the Atlantic, the Associated Press (AP) wires that a „spirit of compromise” pervades the corridors, though private circulars hint at lingering scepticism.
From New York, United    Press International (UPI/UP) recalls that Domei chroniclers followed the Tokyo tariff talks of 1934 „with equal fervour“. Market sheets collated by Extel record a brief rally in overseas securities; yet commentators consulted by Europapress counsel prudence for the Latin exchanges. Warsaw-based PAP intimates that Poland will vote „in concert with Budapest and Prague“. Stockholm’s Tidningarnas Telegrambyrå (TT), meanwhile, relays Nordic caution, whereas Belgrade’s TANJUG cautions against „any settlement that might hamper Balkan exports“.
Late in the evening, TASS issues a bulletin insisting that existing fuel conventions be honoured; a companion wire from Interfax suggests the Kremlin regards the matter as „no less vital than the grain question of 1917“.


"""

In [None]:
newsagencies_pipeline(text)

**Interpretation of the result:**

As can be seen, the returned result is a dictionary with a single key "agencies" and a list of agencies found in the text.

Each found agency is returned in a separate dictionary containing 3 key-value pairs:


1.   `uid`: the unique name of a news agency.

2.   `relevance`: a score between 0 and 1 indicating how confident the model is that a particular word or phrase belongs to a specific entity class. A score of 0 means the model has no confidence that the detected entity is correct; a score of 1 means it is fully confident. By default, a threshold of 0.1 is applied so that the output contains only meaningful results.

3.   `wikidata_link`: a link to the Wikidata entry that describes the news agency.
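
As a small illustration of working with this structure, the sketch below ranks agencies by relevance. The `result` dictionary is a hand-written stand-in with made-up scores and placeholder links, not real pipeline output.

```python
# Hand-written stand-in for a pipeline result; values are illustrative only.
result = {
    "agencies": [
        {"uid": "AFP", "relevance": 0.93, "wikidata_link": "<wikidata link placeholder>"},
        {"uid": "Reuters", "relevance": 0.98, "wikidata_link": "<wikidata link placeholder>"},
        {"uid": "Havas", "relevance": 0.42, "wikidata_link": "<wikidata link placeholder>"},
    ]
}

# Sort agencies from most to least confident.
ranked = sorted(result["agencies"], key=lambda agency: agency["relevance"], reverse=True)

for agency in ranked:
    print(f'{agency["uid"]:<10} {agency["relevance"]:.2f}  {agency["wikidata_link"]}')

top_uids = [agency["uid"] for agency in ranked]
```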



## Advanced Usage

The pipeline accepts several optional parameters at call time. They let you control how results are filtered, how much detail is returned, and whether metadata from the processing pipeline is included for documentation.

- `min_relevance`: controls the strictness of entity filtering. Lower values make the pipeline less strict (more entities, lower confidence), while higher values make it more strict (fewer, higher-confidence entities). Any value between 0 and 1 is allowed.

- `diagnostics`: expects a boolean value (True or False). If set to True, the pipeline returns detailed information about each detected agency, including the exact text span, entity type, relevance score, and Wikidata link. If set to False (default), it returns only a summary list of unique agencies with their highest relevance scores and Wikidata links.

- `model_id`: expects a string that specifies the identifier or path of the pretrained model to use (for example, `impresso-project/ner-newsagency-bert-multilingual`).
It tells the pipeline which model to load from Hugging Face Hub or a local directory.
Changing `model_id` allows you to use different models for news agency recognition, as long as they are compatible with the pipeline.

- `suppress_entities`: expects a list or sequence of entity type strings to exclude from the results.
By default, the pipeline suppresses `['org.ent.pressagency.unk', 'ag', 'pers.ind.articleauthor']`, which means unknown agencies, general agency tags, and individual article authors are filtered out.
These are suppressed by default to focus the output on recognized, specific news agencies and avoid irrelevant or overly generic entities.


**Example 1:** `min_relevance`

In [None]:
newsagencies_pipeline(text, min_relevance=0.95)

If you specify `min_relevance=0.95`, only entities with a relevance (confidence) score of 0.95 or higher will be included in the output.
Entities with a score below 0.95 will be filtered out and not returned.
This makes the pipeline more strict, so you will get fewer entities, but with higher confidence that they are correct.
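
To get a feel for how the threshold trades coverage for confidence, one can compare which agencies survive at several cut-offs. The sketch below uses made-up scores rather than real model output, and the `kept_at` helper is hypothetical.

```python
# Made-up relevance scores per agency; not real model output.
scores = {"Reuters": 0.98, "AFP": 0.93, "Havas": 0.42, "Extel": 0.12}


def kept_at(scores, threshold):
    """Return the agencies whose score is at or above the threshold."""
    return sorted(uid for uid, score in scores.items() if score >= threshold)


for threshold in (0.1, 0.5, 0.95):
    print(threshold, kept_at(scores, threshold))
```

With these toy values, raising the threshold from 0.1 to 0.95 shrinks the list from four agencies to one.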

**Example 2:** `diagnostics`

In [None]:
newsagencies_pipeline(text, diagnostics=True)

Enabling `diagnostics` adds the original text that was fed to the pipeline, for easier reference later. The result also now contains every recognised news agency mention, not just the unique agencies. Finally, `diagnostics` adds three key-value pairs for each recognised entity:

1. `surface`: contains the **exact** substring from the input text that was identified as a news agency entity.
It preserves the original spacing and formatting from the input, allowing you to see precisely which part of the text was recognized as an entity.
For example, if the input text is `"Agence France-Presse reported..."`, the `surface` might be `"Agence France-Presse"`.

2. `start`: indicates the character index in the input text where the detected entity begins.
It allows you to locate the exact position of the entity within the original text.
For example, if `start` is 10, the entity begins at the 11th character of the input string, since indexing is 0-based.

3. `stop`: indicates the character index in the input text where the detected entity ends (exclusive).
It marks the position just after the last character of the entity, so the entity spans from `start` (inclusive) to `stop` (exclusive).
This allows you to extract the exact substring for the entity using `input_text[start:stop]`.
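
The relationship between `surface`, `start` and `stop` can be checked with ordinary string slicing. In the sketch below the mention is built by hand with `str.find` instead of coming from the pipeline, so the values are illustrative only.

```python
input_text = "By cable, Reuters states that the hall fell silent."

# Build a hand-made mention in the same shape as the diagnostics fields.
surface = "Reuters"
start = input_text.find(surface)  # 0-based index of the first character
stop = start + len(surface)       # exclusive end index
mention = {"surface": surface, "start": start, "stop": stop}

# Slicing with [start:stop] recovers the surface form exactly.
recovered = input_text[mention["start"]:mention["stop"]]

# Mark the mention in context, e.g. for a quick visual check.
highlighted = (
    input_text[:mention["start"]] + "[" + recovered + "]" + input_text[mention["stop"]:]
)
print(mention)
print(highlighted)
```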


**Example 3:** `model_id`

In [None]:
newsagencies_pipeline(text, model_id='impresso-project/ner-newsagency-bert-multilingual')

As can be seen, the output is exactly the same, since the `model_id` we specified above is the default one. It tells the pipeline which model to load from Hugging Face Hub or a local directory.
Changing `model_id` allows you to use different models for news agency recognition, as long as they are compatible with the pipeline.

**Example 4:** `suppress_entities`

In [None]:
newsagencies_pipeline(text, suppress_entities=['org.ent.pressagency.AFP'])

Here, `suppress_entities` is set so that mentions of AFP are dropped from the output.

**Example 5:** All at once

In [None]:
newsagencies_pipeline(
    text,
    min_relevance=0.95,
    diagnostics=True,
    model_id='impresso-project/ner-newsagency-bert-multilingual',
    suppress_entities=['org.ent.pressagency.AFP']
)

You can combine any of these parameters, or use all of them at once, for finer control over the output. In the example above, we set the relevance threshold to 0.95, enable diagnostics, and additionally suppress the AFP news agency.

## Limitations of the News Agencies Pipeline

In [11]:
underrepresented_entities = """Selon une dépêche matinale de l’Agence Havas, datée du 14 mai 1898, la Chambre s’est réunie dans une atmosphère « pleine de gravité »."""

In [None]:
newsagencies_pipeline(underrepresented_entities, diagnostics=True)

As shown, the News Agencies pipeline struggles with agencies that were underrepresented in the training dataset. In this case, there were few instances where `Agence Havas` appeared with `l'` in the training material, which explains why the `surface` is `Agence Havas` instead of `l'Agence Havas`.
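
If your downstream use needs the elided article inside the span, one possible workaround is to widen the span after the fact. The sketch below is a hypothetical post-processing helper, not part of the package: `extend_left` and the list of articles are assumptions, and the offsets are computed by hand with `str.find` rather than taken from pipeline output.

```python
# Elided French articles, with both straight and typographic apostrophes.
ELIDED_ARTICLES = ("l'", "l’", "d'", "d’")


def extend_left(text, start):
    """Move `start` left if the span is directly preceded by an elided article."""
    prefix = text[:start].lower()
    for article in ELIDED_ARTICLES:
        if prefix.endswith(article):
            return start - len(article)
    return start


sentence = "Selon une dépêche matinale de l’Agence Havas, datée du 14 mai 1898."
surface = "Agence Havas"
start = sentence.find(surface)
stop = start + len(surface)

new_start = extend_left(sentence, start)
widened = sentence[new_start:stop]
print(widened)
```

Whether the article belongs to the mention is an annotation decision, so this kind of adjustment should only be applied when it matches your own conventions.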


## Conclusion

The `NewsAgenciesPipeline` provides a lightweight, end-to-end solution for identifying and linking news‐agency mentions in text. It delivers:

- **Accurate Detection**: High-confidence, token-level classification of “press-agency” spans.  
- **Easy Integration**: One-line setup and inference with minimal code.  
- **Customizable Filtering**: User-defined relevance thresholds and suppression lists to reduce noise.  
- **Rich Output**: Aggregated UIDs with Wikidata links, plus optional raw diagnostics for deeper analysis.  
- **Model Flexibility**: Swap in any compatible token-classification model by adjusting `model_id`.

Whether you’re processing multilingual corpora or building a downstream dashboard of source attributions, this pipeline streamlines the entire workflow—from tokenization to knowledge-graph linking—without manual feature engineering.








## Next steps

To better understand how this pipeline works, see the [original repository](https://github.com/impresso/impresso-pipelines/tree/main/impresso_pipelines/newsagencies).

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

INSERT CREDITS HERE
<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 license button"/>
</a>
<br></br>

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>

