# Named Entity Recognition and Linking with Impresso BERT models

## Good to know before starting



Named entity recognition (NER) identifies mentions of entities such as persons and locations in text. Named entity linking (NEL) connects each mention to an entry in a knowledge base, for example a real person who has a Wikipedia article and a unique identifier in Wikidata. Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics such as historical events, famous people, or scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data, such as facts and relationships between entities, in a form that computers can process, which makes it useful for tasks such as NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset. It recognises coarse- and fine-grained entities: persons and locations, but also their components such as names, titles, and functions. The _Impresso_ NEL tool then links these entity mentions to unique referents in a knowledge base (here Wikipedia and Wikidata), or leaves them unlinked when no referent is found.

## Prerequisites

In [15]:
!pip install transformers
!pip install spacy


## Entity Recognition

In [16]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Define the model name to be used for token classification. We use the Impresso NER model,
# which can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Note: the recorded output of this cell shows a failed run. No deep-learning backend was found (`None of PyTorch, TensorFlow >= 2.0, or Flax have been found`), and a module compiled against NumPy 1.x could not be imported under NumPy 2.1.1. The run ended with `RuntimeError: Failed to import transformers.pipelines` (`numpy.core.multiarray failed to import`). The error message itself suggests downgrading to `numpy<2` or upgrading the affected module.
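
The two problems reported above (no backend found, NumPy 2.x incompatibility) can be detected before loading any model. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and it assumes that PyTorch is the backend you intend to use.

```python
from importlib import metadata, util


def check_environment():
    """Report whether PyTorch is importable and which NumPy version is installed."""
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        numpy_version = None

    return {
        # find_spec returns None when the package cannot be located
        "torch_installed": util.find_spec("torch") is not None,
        "numpy_version": numpy_version,
        # True when the installed NumPy has major version 2 or higher
        "numpy_is_2x": numpy_version is not None
        and int(numpy_version.split(".")[0]) >= 2,
    }


print(check_environment())
```

If `torch_installed` is `False`, or `numpy_is_2x` is `True` and the import error appears, installing a backend or pinning `numpy<2` (as the error message suggests) before restarting the kernel is a reasonable first step.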

Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer.


In [16]:
ner = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
sentences = ["""Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            """]

print(sentences[0])

Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            


In [8]:
from utils import visualize_stacked_entities

# Visualize stacked entities for each sentence
for sentence in sentences:
    results = ner(sentence)
    
    # Extract coarse and fine entities
    coarse_entities = results["NE-COARSE-LIT"]
    fine_entities = results["NE-FINE-LIT"]
    
    # Visualize the stacked entities
    visualize_stacked_entities(sentence, coarse_entities, fine_entities)


Visualizing stacked coarse and fine-grained entities
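
When the visualisation helper is not available, the same results can be inspected as plain text. The sketch below assumes the structure used elsewhere in this notebook: a dictionary mapping each task name (for example `NE-COARSE-LIT`) to a list of entities with `start`, `end`, and `word` keys. The `example_results` dictionary is hand-written for illustration and is not real model output; `list_entities` is a hypothetical helper.

```python
def list_entities(results):
    """Flatten a {task: [entity, ...]} mapping into sorted (start, end, word, task) rows."""
    rows = []
    for task, entities in results.items():
        for entity in entities:
            rows.append((entity["start"], entity["end"], entity["word"], task))
    return sorted(rows)


# Hand-written stand-in for `results = ner(sentence)`; values are illustrative only
example_results = {
    "NE-COARSE-LIT": [
        {"start": 0, "end": 5, "word": "Apple"},
        {"start": 78, "end": 88, "word": "Steve Jobs"},
    ],
    "NE-FINE-LIT": [
        {"start": 78, "end": 88, "word": "Steve Jobs"},
    ],
}

for start, end, word, task in list_entities(example_results):
    print(f"{task:14} {start:>4}-{end:<4} {word}")
```

Sorting by position keeps the coarse and fine annotations of the same mention next to each other.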



## Entity Linking

Next, the _Impresso_ NEL tool links the previously found entity mentions to unique referents in Wikipedia and Wikidata.

In [17]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-mgenre-multilingual").eval()


In [2]:
!pip install impresso-essentials


In [13]:
import string

from impresso_essentials.text_utils import tokenise

# Language of the input text, passed to the tokeniser (the example sentence is French)
language = "fr"
# Number of context tokens kept on each side of the mention
CONTEXT_TOKENS = 10

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')

    # Run the NER pipeline on the input sentence and store the results
    results = ner(sentence)

    # Collect (start, end, word) for every entity of every task
    entities = []
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))

    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tokenise the text on each side of the mention and keep a window of context
            # (start and end are assumed to be character offsets into the sentence)
            left_context = tokenise(sentence[:start], language)[-CONTEXT_TOKENS:]
            right_context = tokenise(sentence[end:], language)[:CONTEXT_TOKENS]

            # Tag the mention with [START] ... [END], stripping trailing punctuation
            nel_sentence = (
                " ".join(left_context)
                + " [START] "
                + word.rstrip(string.punctuation)
                + " [END] "
                + " ".join(right_context)
            )

            # Generate Wikipedia links for the tagged text;
            # num_return_sequences must not exceed num_beams
            outputs = model.generate(
                **tokenizer([nel_sentence], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )

            # Decode the generated output to get the Wikipedia links
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")

            # Add the word to the already processed list
            already_done.append(word)
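
The tagging step in the cell above, which wraps one mention in `[START]` and `[END]` and keeps a limited window of context, can be tried in isolation without loading any model. This is a minimal sketch: `tag_mention` is a hypothetical helper, it splits on whitespace rather than calling `tokenise`, and it assumes `start` and `end` are character offsets into the text.

```python
def tag_mention(text, start, end, window=10):
    """Wrap text[start:end] in [START]/[END], keeping `window` whitespace tokens per side."""
    left_tokens = text[:start].split()
    right_tokens = text[end:].split()

    # Guard against window == 0, because tokens[-0:] would return the whole list
    left = left_tokens[-window:] if window > 0 else []
    right = right_tokens[:window]

    return " ".join(left + ["[START]", text[start:end], "[END]"] + right)


text = "Apple est créée le 1er avril 1976 par Steve Jobs et Steve Wozniak."
start = text.index("Steve Jobs")
end = start + len("Steve Jobs")

print(tag_mention(text, start, end, window=3))
# avril 1976 par [START] Steve Jobs [END] et Steve Wozniak.
```

Working from offsets rather than from the surface string means that only the selected occurrence is tagged, even when the same name appears several times in the sentence.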

In [14]:
from utils import get_wikipedia_page_props

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')
    
    # Run the NER pipeline on the input sentence and store the results
    results = ner(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))
    
    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tag the entity in the text
            entity_text = sentence.replace(word, f"[START] {word} [END]")
            # print(f"\nEntity: {word}, Tagged Text: {entity_text}\n")

            # Generate Wikipedia links for the tagged text
            outputs = model.generate(
                **tokenizer([entity_text], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )
            
            # Decode the generated output to get the Wikipedia links
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")
            
            # Add the word to the already processed list
            already_done.append(word)
            
            # Retrieve and print Wikidata QID for each Wikipedia link
            for wikipedia_link in wikipedia_links:
                qid = get_wikipedia_page_props(wikipedia_link)
                print(f"  Wikidata: {wikipedia_link} -> {qid}")


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            

Entity: Apple, Wikipedia Links: ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple Corporation >> fr ', 'Apple Group >> fr ', 'Apple Corps >> fr ']
  Wikidata: Apple >> fr  -> Q312
  Wikidata: Apple Inc. >> fr  -> NIL
  Wikidata: Apple Corporation >> fr  -> NIL
  Wikidata: Apple Group >> fr  -> NIL
  Wikidata: Apple Corps >> fr  -> Q621231

Entity: le 1er avril 1976, Wikipedia Links: ['le 1er avril 1976 >> fr ', '1er avril 1976 >> fr ', 'Le 1er avril 1976 >> fr ', '1 avril 1976 >> fr ', 'Avril 1976 >> fr ']
  Wikidata: le 1er avril 1976 >> fr 
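
The candidates printed above have the form `title >> language`, for example `'Apple Inc. >> fr '`. Before looking a candidate up, it helps to split it into its two parts. The sketch below is one way to do this; `parse_candidate` and `wikipedia_url` are hypothetical helpers and are not part of the notebook's `utils` module. The URL pattern is the usual `https://<language>.wikipedia.org/wiki/<Title>` scheme.

```python
from urllib.parse import quote


def parse_candidate(candidate):
    """Split a decoded candidate such as 'Apple Inc. >> fr ' into (title, language)."""
    title, separator, language = candidate.strip().rpartition(">>")
    if not separator:
        # No '>>' marker: treat the whole string as the title
        return candidate.strip(), None
    return title.strip(), language.strip()


def wikipedia_url(title, language):
    """Build a Wikipedia page URL from a title and a language code."""
    return f"https://{language}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


title, language = parse_candidate("Apple Inc. >> fr ")
print(title, language)
print(wikipedia_url(title, language))
```

A generated title is not guaranteed to correspond to an existing page, which is consistent with the `NIL` values in the output above.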