# News Agencies Recognition and Linking with Impresso BERT models

Delivering swift and reliable news since the 1830s and 1840s, news agencies have played a pivotal role both nationally and internationally. However, understanding their precise impact on shaping news content has remained somewhat elusive. Our goal is to illuminate this aspect by identifying news agencies within historical newspaper articles. Using data from newspapers in Switzerland and Luxembourg as part of the Impresso project, we've trained our pipeline to recognize these entities.

If you're here, you likely seek to detect news agency entities in your own text. This notebook will guide you through the process of setting up a workflow to identify specific newspaper or agency mentions within your text.

You can also access our [News Agency Recognition](https://huggingface.co/spaces/impresso-project/multilingual-news-agency-recognition) demo app through [HuggingFace Spaces](https://huggingface.co/docs/hub/en/spaces).

__When you run the code, a prompt asking for a HuggingFace token may appear. Click Cancel; no token is needed.__

## Prerequisites

Install the required libraries (if they are not already installed).
The necessary NLTK data is downloaded when the pipeline is loaded in the next step.

In [1]:
!pip install transformers
!pip install stopwordsiso
!pip install nltk



The next cell downloads the required model and creates the pipeline you will use to detect news agencies in your text.

In [2]:
from transformers import pipeline

newsagency_ner_pipeline = pipeline("newsagency-ner", model="impresso-project/ner-newsagency-bert-multilingual", trust_remote_code=True)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
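
The last line of the output above is a warning that an accelerator is available but the model was placed on the CPU because no `device` argument was given. The sketch below shows one way to choose a device before building the pipeline. The helper `pick_device` is hypothetical (it is not part of the Impresso code), and it assumes that PyTorch is the backend and that this custom pipeline honours the standard `device` argument of `transformers.pipeline`.

```python
def pick_device():
    """Return a device string for transformers.pipeline (hypothetical helper).

    Prefers a CUDA GPU, then Apple's MPS backend, and falls back to the CPU.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so stay on the CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda:0"

    # Older PyTorch builds have no torch.backends.mps, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(f"Selected device: {device}")

# Assumed usage, mirroring the cell above with one extra argument:
# newsagency_ner_pipeline = pipeline(
#     "newsagency-ner",
#     model="impresso-project/ner-newsagency-bert-multilingual",
#     trust_remote_code=True,
#     device=device,
# )
```

Running on the CPU is fine for short texts such as the example below; an accelerator mainly helps when processing many articles.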


Run the example below to see how it works.

In [4]:
sentence = """Apple est créée le 1er avril 1976 dans le garage de la maison
          d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak
          et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine
          sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification
          de ses produits, le mot « computer » est retiré le 9 janvier 2015. (Reuter)"""

# Print each detected entity, one "Key: value" line per field
def print_nicely(data):
    for entry in data:
        for key, value in entry.items():
            print(f"  {key.capitalize()}: {value}")
        print()  # Blank line between entries

news_agencies = newsagency_ner_pipeline(sentence)
print_nicely(news_agencies)

  Type: org.ent.pressagency.Reuters
  Confidence: 98.47
  Index: 83
  Surface: Reuter
  Start: 422
  End: 428
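
Once the pipeline returns results like the one above, you will often want to keep only confident detections and reduce each one to the agency name. The following is a minimal sketch of that post-processing. It works on a hard-coded sample that copies the output shown above, and it assumes the result keys are lowercase (`type`, `confidence`, `surface`, ...), since `print_nicely` capitalizes them for display. The helper names and the threshold of 90 are illustrative choices, not part of the Impresso pipeline.

```python
# Sample result copied from the output above; lowercase keys are an assumption.
sample_entities = [
    {
        "type": "org.ent.pressagency.Reuters",
        "confidence": 98.47,
        "index": 83,
        "surface": "Reuter",
        "start": 422,
        "end": 428,
    },
]


def agency_name(entity_type):
    """Return the last dot-separated component of the type label."""
    return entity_type.rsplit(".", 1)[-1]


def summarize(entities, min_confidence=90.0):
    """Keep entities at or above min_confidence (assumed 0-100 scale).

    Returns a list of (agency, surface, confidence) tuples.
    """
    return [
        (agency_name(e["type"]), e["surface"], e["confidence"])
        for e in entities
        if e["confidence"] >= min_confidence
    ]


for agency, surface, confidence in summarize(sample_entities):
    print(f"{agency} (written as '{surface}'), confidence {confidence}")
```

With real data you would pass `news_agencies` instead of `sample_entities`, after checking that the keys match.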




## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
interdisciplinary research project that aims to develop and consolidate tools for
processing and exploring large collections of media archives across modalities, time,
languages and national borders. The first project (2017-2021) was funded by the Swiss
National Science Foundation under grant
No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under
the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
