# Linking Entities in a Text to Wikipedia

Named entities such as organizations, locations, persons, and temporal expressions play a crucial role in the comprehension and analysis of both historical and contemporary texts. The HIPE-2022 project focuses on named entity recognition and classification (NERC) and entity linking (EL) in multilingual historical documents.

### About HIPE-2022
HIPE-2022 involves processing diverse datasets from historical newspapers and classical commentaries, spanning approximately 200 years and multiple languages. The primary goal is to confront systems with challenges related to multilinguality, domain-specific entities, and varying annotation tag sets.

### Datasets
HIPE-2022 draws on six primary datasets, listed below. The model used in this notebook was trained only on **hipe2020**, in French and German.
- **ajmc**: Classical commentaries in German, French, and English.
- **hipe2020**: Historical newspapers in German, French, and English.
- **letemps**: Historical newspapers in French.
- **topres19th**: Historical newspapers in English.
- **newseye**: Historical newspapers in German, Finnish, French, and Swedish.
- **sonar**: Historical newspapers in German.

### Annotation Types and Levels
HIPE-2022 employs an IOB tagging scheme (inside-outside-beginning format) for entity annotations. The annotation levels include:

1. **TOKEN**: The annotated token.
2. **NE-COARSE-LIT**: Coarse type of the entity (literal sense).
3. **NE-COARSE-METO**: Coarse type of the entity (metonymic sense).
4. **NE-FINE-LIT**: Fine-grained type of the entity (literal sense).
5. **NE-FINE-METO**: Fine-grained type of the entity (metonymic sense).
6. **NE-FINE-COMP**: Component type of the entity.
7. **NE-NESTED**: Coarse type of the nested entity.
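
To make the IOB scheme concrete, the following sketch groups token-level tags into entity spans. The tokens, the tags, and the helper name `iob_to_spans` are illustrative assumptions and are not part of the HIPE-2022 tooling.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example using coarse types similar to those shown in this notebook
tokens = ["Steve", "Jobs", "à", "Los", "Altos"]
tags = ["B-pers", "I-pers", "O", "B-loc", "I-loc"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

In this sketch an `I-` tag that does not continue an open entity of the same type is simply ignored; a stricter converter could raise an error instead.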

### Getting Started
This notebook guides you through a workflow that identifies named entities in your text with the HIPE-2022-trained pipeline and then links them to Wikipedia. With this pipeline you can detect mentions of people, places, organizations, and temporal expressions in historical and contemporary documents.

---

*Note: This notebook might require `HF_TOKEN` to be set as an environment variable. You can get a token by signing up on the [Hugging Face website](https://huggingface.co/join), and you can read more in the [official documentation](https://huggingface.co/docs/huggingface_hub/v0.20.2/en/quick-start#environment-variable).*
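
A quick way to check whether the token is available before loading any model is sketched below. The helper name `hf_token_is_set` is hypothetical, and whether the models used here actually need a token is not confirmed by this notebook.

```python
import os


def hf_token_is_set(env=os.environ):
    """Return True if a non-empty HF_TOKEN exists in the given environment mapping."""
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    # Only a reminder: public models may still load without a token
    print("HF_TOKEN is not set; set it if model downloads fail with an authorization error.")
```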

Install the necessary libraries (if not already installed).

In [None]:
!pip install transformers
!pip install nltk
!pip install torch

After installing the libraries, we download the NLTK data needed to run our POS tagger: **averaged_perceptron_tagger**.
The averaged_perceptron_tagger is an efficient part-of-speech (POS) tagger that labels each word in a sentence with its part of speech, such as noun, verb, or adjective. See [https://arxiv.org/abs/2104.02831](https://arxiv.org/abs/2104.02831) for reference.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

In [24]:
def print_nicely(results):
    """Print the entities found for each annotation level as an aligned table."""
    for key, entities in results.items():
        if entities:
            print(f"\n**{key}**\n")
            print(f"{'Entity':<15} {'Type':<10} {'Score':<8} {'Index':<5} {'Word':<20} {'Start':<5} {'End':<5}")
            print("-" * 70)
            for entity in entities:
                print(f"{entity['word']:<15} {entity['entity']:<10} {entity['score']:<8.4f} {entity['index']:<5} {entity['word']:<20} {entity['start']:<5} {entity['end']:<5}")


First, we detect the entities:

In [5]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Name of the model used for token classification (the Impresso NER model)
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [6]:
# Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer
nlp = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True)



In [31]:
sentences = ["""Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            """]

for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')
    results = nlp(sentence)
    print_nicely(results)

Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            

**NE-COARSE-LIT**

Entity          Type       Score    Index Word                 Start End  
----------------------------------------------------------------------
Apple           loc        0.5361   0     Apple                0     5    
Apple           loc        0.5361   0     Apple                280   285  
le 1er avril 1976 time       0.8447   3     le 1er avril 1976    16    33   
Steve Jobs      pers       0.8815   17    Steve Jobs           88    98   
Steve Jobs      pers       0.8815   17    Steve Jobs           129   139  
L

And then we link them:

In [32]:
def add_entity_tags(text, results):
    """Wrap every occurrence of each detected entity string in [START] ... [END] tags."""
    entities = []
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))

    already_done = []
    # str.replace tags every occurrence of a word at once, so each distinct word is handled only once
    for start, end, word in entities:
        if word not in already_done:
            text = text.replace(word, f"[START] {word} [END]")
            already_done.append(word)
    
    return text
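
Because `str.replace` tags every occurrence of a word, two mentions of "Apple" end up tagged in the same string. If each mention should be linked separately, one option is to use the `start` and `end` values instead. The sketch below assumes these values are character offsets into the string passed to the pipeline, which the NER output above suggests but does not guarantee; `tag_mention` is a hypothetical helper name.

```python
def tag_mention(text, start, end):
    """Wrap only the mention text[start:end] in [START] ... [END] markers."""
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


# Shortened, hypothetical example sentence with two mentions of "Apple"
example = "Apple est créée par Steve Jobs sous le nom d'Apple Computer."
first = tag_mention(example, 0, 5)

# Locate the second mention by searching after the first one
second_start = example.index("Apple", 5)
second = tag_mention(example, second_start, second_start + 5)

print(first)
print(second)
```

Each call produces a copy of the text with exactly one tagged mention, so the two occurrences can be disambiguated independently.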


In [None]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "impresso-project/nel-mgenre-multilingual"
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-mgenre-multilingual").eval()


In [38]:
# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')
    
    # Run the NER pipeline on the input sentence and store the results
    results = nlp(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))
    
    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tag the entity in the text
            entity_text = sentence.replace(word, f"[START] {word} [END]")
            print(f"\nEntity: {word}, Tagged Text: {entity_text}\n")

            # Generate Wikipedia links for the tagged text
            outputs = model.generate(
                **tokenizer([entity_text], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )
            
            # Decode the generated output to get the Wikipedia links
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")
            
            # Add the word to the already processed list
            already_done.append(word)


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            

Entity: Apple, Tagged Text: [START] Apple [END] est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'[START] Apple [END] Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            


Entity: Apple, Wikipedia Links: ['Apple
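
The next cell calls `get_wikipedia_page_props`, whose definition is not shown in this notebook. The sketch below is one possible implementation, not necessarily the one originally used. It assumes that each generated string has the form `title >> language` (as in the outputs printed here), that the QID can be retrieved through the MediaWiki Action API `pageprops` query, and that `NIL` is returned when nothing is found. The helper names `split_title_and_language` and `extract_qid`, the default language, and the User-Agent string are illustrative choices.

```python
import json
import urllib.parse
import urllib.request


def split_title_and_language(prediction, default_language="en"):
    """Split a generated string such as 'Apple >> fr ' into ('Apple', 'fr')."""
    title, separator, language = prediction.partition(">>")
    language = language.strip() if separator else ""
    return title.strip(), language or default_language


def extract_qid(api_response):
    """Return the Wikidata QID from a parsed pageprops response, or 'NIL'."""
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return "NIL"


def get_wikipedia_page_props(prediction, timeout=10):
    """Look up the Wikidata QID of a predicted Wikipedia page (assumed behavior)."""
    title, language = split_title_and_language(prediction)
    # Reject empty titles and language codes that cannot form a valid host name
    if not title or not language.replace("-", "").isalnum():
        return "NIL"
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
    })
    url = f"https://{language}.wikipedia.org/w/api.php?{params}"
    request = urllib.request.Request(
        url, headers={"User-Agent": "entity-linking-notebook-example"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return extract_qid(json.load(response))
    except (OSError, ValueError):
        # Network failures and malformed responses are treated as "not found"
        return "NIL"
```

With this sketch, a call such as `get_wikipedia_page_props('Apple >> fr ')` sends one request to the French Wikipedia; the result depends on the live API.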

In [37]:
# Process each sentence for named entity recognition and linking
for sentence in sentences:
    # Input text to be processed for named entity recognition
    print(f'Sentence: {sentence}')
    
    # Run the NER pipeline on the input sentence and store the results
    results = nlp(sentence)

    # Initialize a list to hold the entities
    entities = []

    # Extract entities from the results
    for task, ents in results.items():
        for entity in ents:
            entities.append((entity['start'], entity['end'], entity['word']))
    
    # List to keep track of already processed words to avoid duplicate tagging
    already_done = []

    # Process each entity for linking
    for start, end, word in entities:
        if word not in already_done:
            # Tag the entity in the text
            entity_text = sentence.replace(word, f"[START] {word} [END]")
            print(f"\nEntity: {word}, Tagged Text: {entity_text}\n")

            # Generate Wikipedia links for the tagged text
            outputs = model.generate(
                **tokenizer([entity_text], return_tensors="pt"),
                num_beams=5,
                num_return_sequences=5
            )
            
            # Decode the generated output to get the Wikipedia links
            wikipedia_links = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            print(f"\nEntity: {word}, Wikipedia Links: {wikipedia_links}")
            
            # Add the word to the already processed list
            already_done.append(word)
            
            # Retrieve and print Wikidata QID for each Wikipedia link
            for wikipedia_link in wikipedia_links:
                qid = get_wikipedia_page_props(wikipedia_link)
                print(f"  Wikidata: {wikipedia_link} -> {qid}")


Sentence: Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            

Entity: Apple, Tagged Text: [START] Apple [END] est créée le 1er avril 1976 dans le garage de la maison 
        d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
        et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
        sous le nom d'[START] Apple [END] Computer, mais pour ses 30 ans et pour refléter la diversification 
        de ses produits, le mot « computer » est retiré le 9 janvier 2015.
        


Entity: Apple, ['Apple >> fr ', 'Apple Inc. >> fr ', 'Apple


Entity: le 9 janvier 2015, ['le 9 janvier 2015 >> fr ', 'Apple >> fr ', '„ le 9 janvier 2015 >> fr ', 'dit le 9 janvier 2015 >> fr ', 'janvier 2015 >> fr ']
  Wikidata: le 9 janvier 2015 >> fr  -> NIL
  Wikidata: Apple >> fr  -> Q312
  Wikidata: „ le 9 janvier 2015 >> fr  -> NIL
  Wikidata: dit le 9 janvier 2015 >> fr  -> NIL
  Wikidata: janvier 2015 >> fr  -> Q15043331
