# Named Entity Recognition and Linking with Impresso BERT models

## Good to know before starting



Named entity recognition (NER) identifies mentions of entities such as persons and locations in text. Named entity linking (NEL) connects each mention to an existing entry in a knowledge base, for example a real person who has a Wikipedia article and a unique identifier in Wikidata. Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics such as historical events, famous people, or scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data, such as facts and relationships between entities, for tasks where computers need to process that data, including NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset. It recognises coarse- and fine-grained entities such as persons and locations, as well as their names, titles, and functions. The _Impresso_ NEL tool then links these entity mentions to unique referents in a knowledge base (here Wikipedia and Wikidata), or leaves a mention unlinked (NIL) when its referent is not found.
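
To make the distinction between recognising and linking concrete, here is a minimal sketch that does not use the Impresso models at all. The `LinkedMention` class, the `TOY_KB` dictionary and both helper functions are hypothetical names introduced for illustration, and the type labels `org` and `pers` are only examples. The identifier `Q312` is the one reported for Apple later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkedMention:
    surface: str        # text of the mention as it appears in the document
    entity_type: str    # illustrative coarse label, e.g. "org" or "pers"
    qid: Optional[str]  # Wikidata identifier, or None when nothing was found


# Hypothetical one-entry knowledge base, for illustration only.
TOY_KB = {"Apple": "Q312"}


def link_mention(surface: str, entity_type: str, kb: dict = TOY_KB) -> LinkedMention:
    """Look the mention up in the toy knowledge base; unknown mentions get qid=None."""
    return LinkedMention(surface, entity_type, kb.get(surface))


def format_link(mention: LinkedMention) -> str:
    """Render a linked mention, printing NIL for mentions without a referent."""
    return f"{mention.surface} ({mention.entity_type}) -> {mention.qid or 'NIL'}"


print(format_link(link_mention("Apple", "org")))         # Apple (org) -> Q312
print(format_link(link_mention("Ronald Wayne", "pers")))  # Ronald Wayne (pers) -> NIL
```

The second mention comes out as NIL only because the toy knowledge base has a single entry; a real linker searches a far larger knowledge base.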

## Prerequisites

First, we install the necessary libraries and download the helper files.

If a prompt asking for a token appears while the code is running, click Cancel: no token is needed.

In [11]:
!pip install transformers
!pip install spacy
!pip install pysbd
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/3f7afc05caef3f527db8320cdf8c131aec41d7cd/2-entity/utils.py
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/refs/heads/7-polish-repair-if-any-the-entity-notebooks/2-entity/text_utils.py



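Before moving on, it can be useful to confirm that the installation step succeeded. The following is a small sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list simply mirrors the `pip install` lines above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages installed in the cell above.
required = ["transformers", "spacy", "pysbd"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

An empty list means the imports in the following cells should succeed.
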
## Entity Recognition

In [2]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Define the model name to be used for token classification, we use the Impresso NER
# that can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer.


In [3]:
ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, tokenizer=ner_tokenizer, trust_remote_code=True)



In [5]:
sentences = ["""Apple est créée le 1er avril 1976 dans le garage de la maison
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            """]

print(sentences[0])

Apple est créée le 1er avril 1976 dans le garage de la maison
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            


In [6]:
from utils import visualize_stacked_entities

# Visualize stacked entities for each sentence
for sentence in sentences:
    results = ner_pipeline(sentence)

    # Extract coarse and fine entities
    coarse_entities = results["NE-COARSE-LIT"]
    fine_entities = results["NE-FINE-LIT"]

    # Visualize the stacked entities
    visualize_stacked_entities(sentence, coarse_entities, fine_entities)


Visualizing stacked coarse and fine-grained entities
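
The visualisation is rendered in the notebook, so a plain-text summary can be handy as well. The sketch below assumes that the pipeline result is a dictionary mapping each task name (such as `NE-COARSE-LIT`) to a list of entity dictionaries with a `word` key, which is how the result is accessed elsewhere in this notebook. The function name and the sample data are hypothetical.

```python
def surface_forms_by_task(results):
    """Map each task name to the list of entity surface forms found for it."""
    return {task: [entity["word"] for entity in ents] for task, ents in results.items()}


# Hypothetical result, shaped like the assumed pipeline output.
sample_results = {
    "NE-COARSE-LIT": [{"word": "Apple"}, {"word": "Steve Jobs"}],
    "NE-FINE-LIT": [{"word": "Steve Jobs"}],
}

for task, words in surface_forms_by_task(sample_results).items():
    print(f"{task}: {words}")
```

With real output, `surface_forms_by_task(results)` could be called right after `ner_pipeline(sentence)`.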



## Entity Linking

Next, the _Impresso_ NEL tool links the previously found entity mentions to unique referents in Wikipedia and Wikidata.

In [17]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
nel_model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-mgenre-multilingual").eval()


In [18]:
from text_utils import tokenise

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')

    # Run the NER pipeline (created above as `ner_pipeline`) on the sentence
    results = ner_pipeline(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract each entity's token span and surface form from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity["index"][0], entity["index"][1], entity['word']))

    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, entity_text in entities:
        if entity_text not in already_done:
            # Tag the entity in the text, using its own token span
            # (`start` and `end` are token indices into `tokens`)
            language = 'en'
            tokens = tokenise(sentence, language)

            context_start = max(0, start - 10)
            context_end = min(len(tokens), end + 11)

            nel_sentence = (
                " ".join(tokens[context_start:start])
                + " [START] "
                + entity_text
                + " [END] "
                + " ".join(tokens[end + 1 : context_end])
            )

            # Generate Wikipedia links for the tagged text
            outputs = nel_model.generate(
                **nel_tokenizer([nel_sentence], return_tensors="pt"),
                num_beams=3,
                num_return_sequences=3
            )

            # Decode the generated output to get the Wikipedia links
            wikipedia_links = nel_tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {entity_text}, Wikipedia Links: {wikipedia_links}")

            # Add the word to the already processed list
            already_done.append(entity_text)


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            





Entity: Apple, Wikipedia Links: ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple Group >> fr ']

Entity: Apple, Wikipedia Links: ['le 1er avril 1976 >> fr ', 'Le 1er avril 1976 >> fr ', 'Le 1er Avril 1976 >> fr ']

Entity: Apple, Wikipedia Links: ['Steve Jobs >> fr ', 'Steven Jobs >> fr ', 'Steve Job >> fr ']

Entity: Apple, Wikipedia Links: ['Steve Jobs >> fr ', 'Steven Jobs >> fr ', 'Steve Job >> fr ']

Entity: Apple, Wikipedia Links: ['Los Altos >> fr ', 'Los Altos (Californie) >> fr ', 'Los Altos (Nueva York) >> fr ']

Entity: Apple, Wikipedia Links: ['Californie >> fr ', 'California >> fr ', 'Californië >> fr ']

Entity: Apple, Wikipedia Links: ['Steve Wozniak >> fr ', 'Steve Wboněk >> fr ', 'Steve Wbonak >> fr ']

Entity: Apple, Wikipedia Links: ['Ronnie Wayne >> fr ', 'Ronald Wayne >> fr ', 'Ron Wayne >> fr ']

Entity: Apple, Wikipedia Links: ['le 3 janvier 1977 >> fr ', 'Le 3 janvier 1977 >> fr ', 'du 3 janvier 1977 >> fr ']

Entity: Apple, Wikipedia Links: ['Apple >> fr ', 'Apple

KeyboardInterrupt: 
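
The linking cell above wraps each mention in `[START]` and `[END]` markers and keeps a few tokens of context on either side. The standalone sketch below isolates that step. The function name `mark_mention` and its `window` parameter are hypothetical, and `end` is treated as an inclusive token index, as in the cell above. Unlike the cell, this version joins everything with single spaces, so it adds no leading space when the mention starts the sentence.

```python
def mark_mention(tokens, start, end, window=10):
    """Wrap tokens[start..end] in [START]/[END] and keep `window` tokens of context."""
    left = tokens[max(0, start - window):start]
    mention = tokens[start:end + 1]
    right = tokens[end + 1:min(len(tokens), end + 1 + window)]
    return " ".join(left + ["[START]"] + mention + ["[END]"] + right)


# Whitespace tokenisation is used here only to keep the example self-contained.
tokens = "Apple est créée le 1er avril 1976 dans le garage".split()

print(mark_mention(tokens, 0, 0, window=3))
# [START] Apple [END] est créée le
```

A wider window gives the linker more context to disambiguate the mention, at the cost of a longer input.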

In [19]:
from utils import get_wikipedia_page_props

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')

    # Run the NER pipeline (created above as `ner_pipeline`) on the sentence
    results = ner_pipeline(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract each entity's token span and surface form from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity["index"][0], entity["index"][1], entity['word']))

    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, entity_text in entities:
        if entity_text not in already_done:
            # Tag the entity in the text, using its own token span
            # (`start` and `end` are token indices into `tokens`)
            language = 'en'
            tokens = tokenise(sentence, language)

            context_start = max(0, start - 10)
            context_end = min(len(tokens), end + 11)

            nel_sentence = (
                " ".join(tokens[context_start:start])
                + " [START] "
                + entity_text
                + " [END] "
                + " ".join(tokens[end + 1 : context_end])
            )

            # Generate Wikipedia links for the tagged text
            outputs = nel_model.generate(
                **nel_tokenizer([nel_sentence], return_tensors="pt"),
                num_beams=3,
                num_return_sequences=3
            )

            # Decode the generated output to get the Wikipedia links
            wikipedia_links = nel_tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {entity_text}, Wikipedia Links: {wikipedia_links}")

            # Add the word to the already processed list
            already_done.append(entity_text)

            # Retrieve and print Wikidata QID for each Wikipedia link
            for wikipedia_link in wikipedia_links:
                qid = get_wikipedia_page_props(wikipedia_link)
                print(f"  Wikidata: {wikipedia_link} -> {qid}")


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            





Entity: Apple, Wikipedia Links: ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple Group >> fr ', 'Apple Corporation >> fr ', 'Apple III >> fr ']
  Wikidata: Apple >> fr  -> Q312
  Wikidata: Apple Inc. >> fr  -> NIL
  Wikidata: Apple Group >> fr  -> NIL
  Wikidata: Apple Corporation >> fr  -> NIL
  Wikidata: Apple III >> fr  -> Q420769


KeyboardInterrupt: 
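
Each candidate printed above has the form `title >> language`, followed by either a Wikidata QID or `NIL`. The sketch below shows one way to split a candidate and pick the first one that resolved to a QID. Both function names are hypothetical, and it is an assumption that an unresolved lookup is represented by the string `"NIL"`, as the printed output suggests.

```python
def parse_candidate(candidate):
    """Split a 'title >> lang' string into (title, lang); lang is None if absent."""
    title, sep, lang = candidate.rpartition(">>")
    if not sep:
        return candidate.strip(), None
    return title.strip(), lang.strip()


def first_linked(pairs):
    """Return (title, lang, qid) for the first candidate whose QID is not 'NIL'."""
    for candidate, qid in pairs:
        if qid != "NIL":
            return parse_candidate(candidate) + (qid,)
    return None


# Candidate/QID pairs copied from the output above.
pairs = [
    ("Apple >> fr ", "Q312"),
    ("Apple Inc. >> fr ", "NIL"),
    ("Apple III >> fr ", "Q420769"),
]

print(parse_candidate("Apple >> fr "))  # ('Apple', 'fr')
print(first_linked(pairs))              # ('Apple', 'fr', 'Q312')
```

Taking the first resolved candidate relies on the beam search returning its candidates in order of preference; other selection strategies are possible.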


## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
interdisciplinary research project that aims to develop and consolidate tools for
processing and exploring large collections of media archives across modalities, time,
languages and national borders. The first project (2017-2021) was funded by the Swiss
National Science Foundation under grant
No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under
the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
