# Named Entity Recognition and Linking with Impresso BERT models

## Good to know before starting



Named entity recognition (NER) identifies mentions of entities such as persons and locations in text. Named entity linking (NEL) connects each mention to an entry in a knowledge base, for example a real person who has a Wikipedia article and a unique identifier (QID) in Wikidata. Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics such as historical events, famous people, and scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data, such as facts and relationships between entities, in a form that computers can process for tasks such as NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset. It recognises coarse- and fine-grained entities: persons and locations, but also components of a mention such as names, titles, and functions. The NEL tool then links each entity mention to a unique referent in a knowledge base (here Wikipedia and Wikidata), or leaves it unlinked when no referent is found.
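To make the distinction concrete, here is a minimal sketch of what a recognised and linked mention could look like as a data structure. The class name `LinkedMention`, its field names, and the type labels are hypothetical and are not part of the Impresso models; the second mention is invented to illustrate the unlinked case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkedMention:
    """One entity mention with its type and an optional Wikidata link."""
    surface: str                 # the mention as it appears in the text
    coarse_type: str             # hypothetical label, e.g. "pers" or "loc"
    wikidata_qid: Optional[str]  # None when no referent was found

    @property
    def is_linked(self) -> bool:
        return self.wikidata_qid is not None


mentions = [
    LinkedMention("Steve Jobs", "pers", "Q19837"),
    # Invented mention, used only to show what an unlinked result looks like.
    LinkedMention("M. Dupont", "pers", None),
]
for m in mentions:
    print(m.surface, "->", m.wikidata_qid if m.is_linked else "not linked")
```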

## Entity Recognition

In [3]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Name of the model used for token classification: the Impresso NER model
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

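Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer. The `generic-ner` task is a custom pipeline shipped with the model repository, which is why `trust_remote_code=True` is needed; this option runs code downloaded from that repository.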


In [4]:
ner = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True)



In [7]:
sentences = ["""Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            """]

print(sentences[0])

Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            


In [8]:
from utils import visualize_stacked_entities

# Visualize stacked entities for each sentence
for sentence in sentences:
    results = ner(sentence)
    
    # Extract coarse and fine entities
    coarse_entities = results["NE-COARSE-LIT"]
    fine_entities = results["NE-FINE-LIT"]
    
    # Visualize the stacked entities
    visualize_stacked_entities(sentence, coarse_entities, fine_entities)


Visualizing stacked coarse and fine-grained entities
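When the visualization is not available (for example outside a notebook), the same results can be listed as plain text. The sketch below assumes only what the linking cells in this notebook also rely on: that the pipeline returns a dictionary mapping each task name to a list of entities with `start`, `end`, and `word` keys. The helper `format_entities` and the example data are hypothetical.

```python
def format_entities(results):
    """Return one text line per entity: task, offsets, surface form."""
    lines = []
    for task, ents in results.items():
        for ent in ents:
            lines.append(f"{task}\t{ent['start']}-{ent['end']}\t{ent['word']}")
    return lines


# Hand-written stand-in for the output of ner(sentence); values are illustrative.
example_results = {
    "NE-COARSE-LIT": [{"start": 0, "end": 5, "word": "Apple"}],
    "NE-FINE-LIT": [],
}
for line in format_entities(example_results):
    print(line)
```

With the real pipeline, `format_entities(ner(sentence))` would replace the hand-written dictionary.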



## Entity Linking

In [9]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "impresso-project/nel-mgenre-multilingual"
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-mgenre-multilingual").eval()


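The linking model receives the sentence with the mention of interest wrapped in `[START]` and `[END]` markers. One way to mark exactly one mention is to insert the markers at its offsets. This is a sketch under an assumption that is not confirmed here: that `start` and `end` are character offsets into the sentence, with `end` exclusive. The helper `tag_mention` is hypothetical.

```python
def tag_mention(text, start, end, open_tag="[START]", close_tag="[END]"):
    """Wrap text[start:end] in the mention markers, leaving the rest untouched."""
    if not 0 <= start < end <= len(text):
        raise ValueError(f"invalid span ({start}, {end}) for text of length {len(text)}")
    return f"{text[:start]}{open_tag} {text[start:end]} {close_tag}{text[end:]}"


text = "Steve Jobs et Steve Wozniak"
# Tag only the second person; the first "Steve" is left alone.
print(tag_mention(text, 14, 27))
```

Compared with a plain string replacement, this marks a single occurrence even when the same surface form appears several times in the sentence.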


In [12]:
from utils import add_entity_tags

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Show the input sentence
    print(f'Sentence: {sentence}')
    
    # Run the NER pipeline on the input sentence and store the results
    results = ner(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))
    
    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tag the entity in the text. Note that str.replace marks every
            # occurrence of the mention, not only the one at (start, end).
            entity_text = sentence.replace(word, f"[START] {word} [END]")
            # print(f"\nEntity: {word}, Tagged Text: {entity_text}\n")

            # Generate candidate Wikipedia page titles for the tagged text
            outputs = model.generate(
                **tokenizer([entity_text], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )
            
            # Decode the generated sequences; each has the form "Title >> language"
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")
            
            # Add the word to the already processed list
            already_done.append(word)


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            

Entity: Apple, Wikipedia Links: ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple Corporation >> fr ', 'Apple Group >> fr ', 'Apple Corps >> fr ']

Entity: le 1er avril 1976, Wikipedia Links: ['le 1er avril 1976 >> fr ', '1er avril 1976 >> fr ', 'Le 1er avril 1976 >> fr ', '1 avril 1976 >> fr ', 'Avril 1976 >> fr ']

Entity: Steve Jobs, Wikipedia Links: ['Steve Jobs >> fr ', 'Steve Job >> fr ', 'Steve Steve Jobs >> fr ', 'Stephen Jobs >> fr ', 'Steven Jobs >> fr ']

Entity: Los Altos, Wikipedia Links: ['Los Altos >> fr ', 'Apple Los Altos >> 
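Each generated candidate in the output above is a string of the form `Title >> language`. A small parser can split it into its two parts before any lookup. The function name `parse_generation` is hypothetical; the input strings are copied from the output shown above.

```python
def parse_generation(generated):
    """Split 'Title >> lang' into (title, lang); lang is None if it is missing."""
    title, sep, lang = generated.rpartition(">>")
    if not sep:
        # No separator at all: treat the whole string as the title.
        return generated.strip(), None
    return title.strip(), lang.strip() or None


candidates = ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple Corps >> fr ']
print([parse_generation(c) for c in candidates])
```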

In [None]:
from utils import get_wikipedia_page_props

# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Show the input sentence
    print(f'Sentence: {sentence}')
    
    # Run the NER pipeline on the input sentence and store the results
    results = ner(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))
    
    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tag the entity in the text. Note that str.replace marks every
            # occurrence of the mention, not only the one at (start, end).
            entity_text = sentence.replace(word, f"[START] {word} [END]")
            # print(f"\nEntity: {word}, Tagged Text: {entity_text}\n")

            # Generate candidate Wikipedia page titles for the tagged text
            outputs = model.generate(
                **tokenizer([entity_text], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )
            
            # Decode the generated sequences; each has the form "Title >> language"
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")
            
            # Add the word to the already processed list
            already_done.append(word)
            
            # Retrieve and print Wikidata QID for each Wikipedia link
            for wikipedia_link in wikipedia_links:
                qid = get_wikipedia_page_props(wikipedia_link)
                print(f"  Wikidata: {wikipedia_link} -> {qid}")


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
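The implementation of `get_wikipedia_page_props` is not shown in this notebook. As an illustration of how a page title can be mapped to a Wikidata QID, the sketch below builds a MediaWiki API request for the `wikibase_item` page property and reads the QID from a decoded JSON response. The function names `pageprops_url` and `extract_qid` are hypothetical, the sample payload is hand-written (including its page id), and no network call is made here.

```python
from urllib.parse import urlencode


def pageprops_url(title, lang):
    """Build a MediaWiki API URL asking for the Wikidata item of a page title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def extract_qid(payload):
    """Return the first QID found in a decoded API response, or None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return None


# Hand-written payload in the shape returned with format=json.
sample = {
    "query": {
        "pages": {
            "12345": {"title": "Steve Jobs", "pageprops": {"wikibase_item": "Q19837"}}
        }
    }
}
print(pageprops_url("Steve Jobs", "fr"))
print(extract_qid(sample))
```

Titles that do not correspond to an existing page come back without a `pageprops` entry, in which case `extract_qid` returns `None`; many of the lower-ranked candidates generated above would likely fall into this case.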
            
