# Named Entity Recognition and Linking with Impresso BERT models

## Good to know before starting



Named entity recognition (NER) identifies mentions of entities such as persons and locations in text. Named entity linking (NEL) connects each mention to an entry in a knowledge base, for example a real person who has a Wikipedia article and a unique identifier in Wikidata. Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics such as historical events, famous people, and scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data (facts and relationships between entities) for tasks where computers need to understand and process that data, such as NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset to recognise coarse- and fine-grained entities: persons and locations, but also their components such as names, titles, and functions. The _Impresso_ NEL tool then links these entity mentions to unique referents in a knowledge base, here Wikipedia and Wikidata, or leaves a mention unlinked when its referent is not found.

## Prerequisites

First, we install the necessary libraries and download the helper files.

If a prompt asking for a token appears while running the code, click Cancel; we do not need one.

In [None]:
!pip install transformers
!pip install spacy
!pip install pysbd
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/refs/heads/main/entity/utils.py
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/refs/heads/main/entity/text_utils.py

## Entity Recognition

In [1]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Define the model name to be used for token classification, we use the Impresso NER
# that can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer.


In [10]:
ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, 
                        tokenizer=ner_tokenizer, 
                        trust_remote_code=True,
                        device='cpu')

In [11]:
sentences = ["""In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
                where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
                were drafting policies for the newly established American government following the signing of the Constitution."""]

print(sentences[0])

In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
                where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
                were drafting policies for the newly established American government following the signing of the Constitution.


In [23]:
# Helper function to print entities one per row
# (defined here so the cell is self-contained; importing the helper of the
# same name from utils.py would be shadowed by this definition anyway)
def print_nicely(entities):
    for entity in entities:
        print(
            f"Entity: {entity['entity']} | Confidence: {entity['score']:.2f}% | "
            f"Text: {entity['word'].strip()} | Start: {entity['start']} | End: {entity['end']}"
        )

# Print the stacked entities for each sentence
for sentence in sentences:
    results = ner_pipeline(sentence)

    # results maps each task (coarse and fine entities) to its list of entities
    for key in results:
        print_nicely(results[key])

Entity: time | Confidence: 76.93% | Text: In the year 1789 | Start: 0 | End: 16
Entity: pers | Confidence: 95.57% | Text: King Louis XVI, ruler of France | Start: 18 | End: 49
Entity: loc | Confidence: 68.91% | Text: Palace of Versailles | Start: 87 | End: 107
Entity: pers | Confidence: 74.60% | Text: Marie Antoinette, the | Start: 132 | End: 153
Entity: pers | Confidence: 80.78% | Text: Queen of France | Start: 154 | End: 169
Entity: pers | Confidence: 90.78% | Text: Maximilien Robespierre, a leading member of the National Assembly | Start: 181 | End: 246
Entity: pers | Confidence: 93.25% | Text: Jean-Jacques Rousseau, the famous philosopher | Start: 278 | End: 323
Entity: pers | Confidence: 91.61% | Text: Charles de Talleyrand, the Bishop of Autun | Start: 329 | End: 371
Entity: loc | Confidence: 92.97% | Text: Atlantic | Start: 464 | End: 472
Entity: loc | Confidence: 97.63% | Text: Philadelphia | Start: 476 | End: 488
Entity: pers | Confidence: 84.34% | Text: George Washington, the
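
The printed rows above are plain dictionaries, so they are easy to post-process. As a minimal sketch, the hypothetical helper `group_confident_entities` below keeps only entities at or above a confidence threshold and groups their text by type. It assumes that `score` is on the 0-100 scale shown in the output, and it runs on a few rows copied from that output rather than on a live pipeline call; the threshold of 75 is an arbitrary choice for illustration.

```python
from collections import defaultdict

# A few rows copied from the output above. In a real run, pass the lists
# returned by ner_pipeline(sentence) instead of this sample.
sample_entities = [
    {"entity": "time", "score": 76.93, "word": "In the year 1789", "start": 0, "end": 16},
    {"entity": "pers", "score": 95.57, "word": "King Louis XVI, ruler of France", "start": 18, "end": 49},
    {"entity": "loc", "score": 68.91, "word": "Palace of Versailles", "start": 87, "end": 107},
    {"entity": "loc", "score": 97.63, "word": "Philadelphia", "start": 476, "end": 488},
]


def group_confident_entities(entities, min_score=75.0):
    """Group entity texts by type, keeping only those with score >= min_score.

    Hypothetical helper, not part of utils.py. Assumes scores are percentages.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity"]].append(entity["word"].strip())
    return dict(grouped)


confident = group_confident_entities(sample_entities)
for entity_type, texts in confident.items():
    print(f"{entity_type}: {texts}")
```

With the default threshold, "Palace of Versailles" (68.91) is dropped while the other three sample rows are kept.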

## Entity Linking

Next, the _Impresso_ NEL tool links the previously found entity mentions to unique referents in Wikipedia and Wikidata.

In [None]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
nel_model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-mgenre-multilingual").eval()


In [None]:
from text_utils import tokenise

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')

    # Run the NER pipeline on the input sentence and store the results
    results = ner_pipeline(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results, keeping each entity's own token span
    # (entity["index"] is read as (first_token, last_token))
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity["index"][0], entity["index"][1], entity["word"]))

    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Tokenise the sentence once; the token spans are assumed to index into this list
    language = 'en'
    tokens = tokenise(sentence, language)

    # Process each entity for linking
    for start, end, entity_text in entities:
        if entity_text not in already_done:
            # Tag the entity in the text, with up to 10 tokens of context on each side

            context_start = max(0, start - 10)
            context_end = min(len(tokens), end + 11)

            nel_sentence = (
                " ".join(tokens[context_start:start])
                + " [START] "
                + entity_text
                + " [END] "
                + " ".join(tokens[end + 1 : context_end])
            )

            # Generate Wikipedia links for the tagged text
            outputs = nel_model.generate(
                **nel_tokenizer([nel_sentence], return_tensors="pt"),
                num_beams=3,
                num_return_sequences=3
            )

            # Decode the generated output to get the Wikipedia links
            wikipedia_links = nel_tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {entity_text}, Wikipedia Links: {wikipedia_links}")

            # Add the word to the already processed list
            already_done.append(entity_text)
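
The context-window step in the loop above can be isolated as a small function, which makes it easier to inspect what the NEL model actually receives. This is a sketch under stated assumptions, not the Impresso implementation: `build_nel_input` is a hypothetical name, `start` and `end` are token indices with `end` inclusive (as the `tokens[end + 1 : ...]` slice suggests), and the mention is rebuilt from the tokens rather than taken from the entity's surface text.

```python
def build_nel_input(tokens, start, end, window=10):
    """Wrap a mention in [START]/[END] with up to `window` context tokens per side.

    Hypothetical helper. `start` and `end` are token indices, `end` inclusive.
    """
    left = tokens[max(0, start - window):start]
    mention = tokens[start:end + 1]
    # Slicing past the end of the list is safe in Python, so no min() is needed
    right = tokens[end + 1:end + 1 + window]
    return " ".join(left + ["[START]"] + mention + ["[END]"] + right)


# Whitespace tokenisation is used here only to keep the example self-contained
example_tokens = "across the Atlantic in Philadelphia , George Washington , the first President".split()
nel_input = build_nel_input(example_tokens, start=6, end=7, window=3)
print(nel_input)
```

Joining a list of tokens, instead of concatenating strings, also avoids the leading space that appears when the mention is the first token of the sentence.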


In [None]:
from utils import get_wikipedia_page_props

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')

    # Run the NER pipeline on the input sentence and store the results
    results = ner_pipeline(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results, keeping each entity's own token span
    # (entity["index"] is read as (first_token, last_token))
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity["index"][0], entity["index"][1], entity["word"]))

    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Tokenise the sentence once; the token spans are assumed to index into this list
    language = 'en'
    tokens = tokenise(sentence, language)

    # Process each entity for linking
    for start, end, entity_text in entities:
        if entity_text not in already_done:
            # Tag the entity in the text, with up to 10 tokens of context on each side

            context_start = max(0, start - 10)
            context_end = min(len(tokens), end + 11)

            nel_sentence = (
                " ".join(tokens[context_start:start])
                + " [START] "
                + entity_text
                + " [END] "
                + " ".join(tokens[end + 1 : context_end])
            )

            # Generate Wikipedia links for the tagged text
            outputs = nel_model.generate(
                **nel_tokenizer([nel_sentence], return_tensors="pt"),
                num_beams=3,
                num_return_sequences=3
            )

            # Decode the generated output to get the Wikipedia links
            wikipedia_links = nel_tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {entity_text}, Wikipedia Links: {wikipedia_links}")

            # Add the word to the already processed list
            already_done.append(entity_text)

            # Retrieve and print Wikidata QID for each Wikipedia link
            for wikipedia_link in wikipedia_links:
                qid = get_wikipedia_page_props(wikipedia_link)
                print(f"  Wikidata: {wikipedia_link} -> {qid}")
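
The body of `get_wikipedia_page_props` is not shown in this notebook, so the following is only a sketch of how such a lookup could be prepared, not a description of that helper. It assumes the decoded predictions follow the mGENRE convention `Title >> lang`, which we have not verified for this model, and it builds a query URL for the MediaWiki `pageprops` API without sending any request. Both function names are hypothetical.

```python
from urllib.parse import urlencode


def parse_prediction(prediction):
    """Split 'Title >> lang' into (title, lang); lang is None without a separator.

    Hypothetical helper; the 'Title >> lang' format is an assumption.
    """
    title, separator, lang = prediction.rpartition(">>")
    if not separator:
        return prediction.strip(), None
    return title.strip(), lang.strip()


def pageprops_query_url(title, lang):
    """Build a MediaWiki API URL that asks for the Wikidata item of a page title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


title, lang = parse_prediction("George Washington >> en")
query_url = pageprops_query_url(title, lang)
print(title, lang)
print(query_url)
```

Fetching `query_url` and reading the `wikibase_item` page property from the JSON response would give the Wikidata QID; that network step is left out here so the example runs offline.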



## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
interdisciplinary research project that aims to develop and consolidate tools for
processing and exploring large collections of media archives across modalities, time,
languages and national borders. The first project (2017-2021) was funded by the Swiss
National Science Foundation under grant
No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under
the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
