# News Agencies Recognition and Linking with Impresso BERT models

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/newsagency-processing_ImpressoHF.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## What is this notebook about?

This notebook demonstrates how to load a model from HuggingFace and use it to find mentions of news agencies in historical newspaper articles.

## Why is this useful?

Delivering swift and reliable news since the 1830s and 1840s, news agencies have played a pivotal role both nationally and internationally. However, understanding their precise impact on shaping news content has remained somewhat elusive. Our goal is to illuminate this aspect by identifying news agencies within historical newspaper articles. Using data from newspapers in Switzerland and Luxembourg as part of the Impresso project, we've trained our pipeline to recognize these entities.

If you're here, you likely seek to detect news agency entities in your own text. This notebook will guide you through the process of setting up a workflow to identify specific newspaper or agency mentions within your text.

You can also access our [News Agency Recognition](https://huggingface.co/spaces/impresso-project/multilingual-news-agency-recognition) demo app through [HuggingFace Spaces](https://huggingface.co/docs/hub/en/spaces).


## What will you learn?

In this notebook, you will:

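* learn how to load a HuggingFace model into a notebook with the `transformers` library
* run the Impresso news agency recognition pipeline on a sample text
* inspect the entities returned by the pipeline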

## Useful resources

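- [HuggingFace Transformers documentation](https://huggingface.co/docs/transformers): background on HuggingFace models and the `pipeline` function used in this notebook.
- [BERT model documentation](https://huggingface.co/docs/transformers/model_doc/bert): an overview of BERT models.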

## Prerequisites

Install the necessary libraries (if not already installed).

In [None]:
!pip install transformers
!pip install stopwordsiso
!pip install nltk

Now the fun part: the call below downloads the required model and gives you everything you need to detect news agencies in your text.

In [None]:
from transformers import pipeline

newsagency_ner_pipeline = pipeline("newsagency-ner", 
                                   model="impresso-project/ner-newsagency-bert-multilingual", 
                                   trust_remote_code=True, 
                                   device='cpu')
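
The call above pins the pipeline to the CPU. If a GPU is available, inference is usually faster. The sketch below picks a device string at runtime. `pick_device` is a hypothetical helper (not part of the Impresso code), and it assumes PyTorch is the backend installed alongside `transformers`. It falls back to `'cpu'` when PyTorch is missing or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' when PyTorch sees a CUDA GPU, otherwise 'cpu'."""
    # Hypothetical helper for illustration; not part of the Impresso pipeline.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed: stay on the CPU
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string could then be passed as `device=device` in place of `device='cpu'`. Whether the custom pipeline code behaves identically on a GPU is an assumption worth checking on your own setup.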

Run the example below to see how it works.

In [None]:
sentence = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
        where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
        debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
        regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
        George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
        were drafting policies for the newly established American government following the signing of the Constitution. (Reuter)"""

# Print each detected entity as a numbered block of key/value pairs
def print_nicely(data):
    for idx, entry in enumerate(data, start=1):
        print(f"Entity {idx}:")
        for key, value in entry.items():
            print(f"  {key.capitalize()}: {value}")
        print()  # Blank line between entries

news_agencies = newsagency_ner_pipeline(sentence)

print_nicely(news_agencies)
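
To apply the same pipeline to your own texts, one option is to loop over them and keep each result together with the text it came from. The sketch below assumes only what the example above shows: the pipeline is callable on a string and returns a list of dictionaries. `annotate_texts`, `fake_pipeline` and the `surface` key are hypothetical, used here so the snippet runs without downloading the model. Inspect the real output with `print_nicely` to see the actual keys.

```python
from typing import Callable, Dict, List


def annotate_texts(
    texts: List[str], ner: Callable[[str], List[dict]]
) -> Dict[int, List[dict]]:
    """Run `ner` on each text and map the text's position to its entities."""
    results = {}
    for position, text in enumerate(texts):
        results[position] = list(ner(text))
    return results


def fake_pipeline(text: str) -> List[dict]:
    """Stand-in for newsagency_ner_pipeline (hypothetical output format)."""
    # Only flags the literal string "Reuter"; the real model does far more.
    return [{"surface": "Reuter"}] if "Reuter" in text else []


my_texts = [
    "The treaty was signed in Paris yesterday. (Reuter)",
    "Local weather: sunny with light winds.",
]

annotations = annotate_texts(my_texts, fake_pipeline)
for position, entities in annotations.items():
    print(f"Text {position}: {len(entities)} entities found")
```

Once the model is loaded, `fake_pipeline` can be swapped for `newsagency_ner_pipeline` in the call to `annotate_texts`.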

## Conclusion

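This notebook showed how to load the Impresso news agency recognition model from HuggingFace and use it to detect mentions of news agencies in a sample text.

When using it on your own material, keep in mind that the pipeline was trained on data from newspapers in Switzerland and Luxembourg, so results on other kinds of text may differ.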

## Next Steps

That's it for now! Next, you can explore:

- the [Visualising Place Entities on Maps](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb) notebook, which demonstrates how to visualise mentions of places in the Impresso corpus on a map.
- the [Named Entity Recognition with impresso-pipelines]() notebook, which allows you to reuse Impresso NER models and apply them to your own data.

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)
**Writing - Original draft:** Emanuela Boros. **Conceptualization:** Emanuela Boros, Maud Ehrmann. **Software:** Lea Marxen, Emanuela Boros, Maud Ehrmann. **Writing - Review & Editing:** Caio Mello. **Validation:** TBD. **Datalab editorial board:** Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Emanuela Boros, Maud Ehrmann. **Supervision:** Emanuela Boros, Maud Ehrmann. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.
<br></br>
This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)
<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 License"/>
</a>
<br></br>
For feedback on this notebook, please send an email to info@impresso-project.ch 

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
