# Detect Entities in a Text and Link Them to Wikipedia and Wikidata through the Impresso API

Named entities such as organizations, locations, persons, and temporal expressions play a crucial role in the comprehension and analysis of both historical and contemporary texts. The HIPE-2022 project focuses on named entity recognition and classification (NERC) and entity linking (EL) in multilingual historical documents.

### About HIPE-2022
HIPE-2022 involves processing diverse datasets from historical newspapers and classical commentaries, spanning approximately 200 years and multiple languages. The primary goal is to confront systems with challenges related to multilinguality, domain-specific entities, and varying annotation tag sets.

### Datasets
HIPE-2022 builds on the six primary datasets listed below. The model used in this notebook was trained only on **hipe2020**, in French and German.
- **ajmc**: Classical commentaries in German, French, and English.
- **hipe2020**: Historical newspapers in German, French, and English.
- **letemps**: Historical newspapers in French.
- **topres19th**: Historical newspapers in English.
- **newseye**: Historical newspapers in German, Finnish, French, and Swedish.
- **sonar**: Historical newspapers in German.

### Annotation Types and Levels
HIPE-2022 employs an IOB tagging scheme (inside-outside-beginning format) for entity annotations. The annotation levels include:

1. **TOKEN**: The annotated token.
2. **NE-COARSE-LIT**: Coarse type of the entity (literal sense).
3. **NE-COARSE-METO**: Coarse type of the entity (metonymic sense).
4. **NE-FINE-LIT**: Fine-grained type of the entity (literal sense).
5. **NE-FINE-METO**: Fine-grained type of the entity (metonymic sense).
6. **NE-FINE-COMP**: Component type of the entity.
7. **NE-NESTED**: Coarse type of the nested entity.
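
To make the IOB scheme concrete, here is a minimal sketch that groups token-level tags into entity spans. The tokens, the tags, and the helper name `iob_to_spans` are illustrative assumptions and not part of the HIPE-2022 tooling; only the `B-`/`I-`/`O` convention comes from the description above.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins: close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical sentence with coarse literal tags (the NE-COARSE-LIT level).
tokens = ["Steve", "Jobs", "est", "né", "à", "San", "Francisco", "."]
tags = ["B-pers", "I-pers", "O", "O", "O", "B-loc", "I-loc", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

The same grouping logic would apply to any of the other annotation levels, since each level carries its own column of IOB tags.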

### Getting Started
This notebook will guide you through setting up a workflow to identify named entities within your text using the HIPE-2022 trained pipeline. By leveraging this pipeline, you can detect mentions of people, places, organizations, and temporal expressions, enhancing your analysis and understanding of historical and contemporary documents.

---

**Note:** This notebook *might* require `HF_TOKEN` to be set in the environment variables. You can get a token by signing up on the [Hugging Face website](https://huggingface.co/join) and read more in the [official documentation](https://huggingface.co/docs/huggingface_hub/v0.20.2/en/quick-start#environment-variable).
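
Before running the rest of the notebook, it can help to check whether the token is visible to Python. The snippet below is a small sketch using only the standard library; the helper name `hf_token_is_set` is an assumption introduced for illustration.

```python
import os


def hf_token_is_set(env=os.environ):
    """Return True if a non-empty HF_TOKEN is present in the given mapping."""
    return bool(env.get("HF_TOKEN", "").strip())


if hf_token_is_set():
    print("HF_TOKEN found in the environment.")
else:
    print("HF_TOKEN is not set; downloads that require authentication may fail.")
```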

Install the necessary libraries if they are not already installed.

In [None]:
!pip install transformers
!pip install nltk
!pip install torch

In [1]:
def print_nicely(results, text):
    """Print the detected entities in three tables to check the offsets.

    The first table shows the surface form given in the results, the second
    slices the input `text` with the returned offsets, and the third slices
    the text echoed back in `results['text']`.
    """
    # Print the timestamp and system ID
    print(f"Timestamp: {results.get('ts')}")
    print(f"System ID: {results.get('sys_id')}")

    entities = results.get('nes', [])
    if not entities:
        return

    def print_table(get_surface):
        # Header row followed by one formatted row per entity
        print(f"\n{'Entity':<20} {'Type':<15} {'Confidence NER':<15} {'Confidence NEL':<15} {'Start':<5} {'End':<5} {'Wikidata ID':<10} {'Wikipedia Page':<20}")
        print("-" * 100)
        for entity in entities:
            confidence_ner = f"{entity['confidence_ner']}%"
            confidence_nel = f"{entity['confidence_nel']}%"
            wkd_id = entity.get('wkd_id', 'N/A')
            wkpedia_pagename = entity.get('wkpedia_pagename', 'N/A')
            print(f"{get_surface(entity):<20} {entity['type']:<15} {confidence_ner:<15} {confidence_nel:<15} {entity['lOffset']:<5} {entity['rOffset']:<5} {wkd_id:<10} {wkpedia_pagename:<20}")

    def print_banner(title):
        print("*" * 100)
        print(title)
        print("*" * 100)

    # 1. Surface form as given in the results
    print_table(lambda entity: entity['surface'])

    # 2. Surface form recovered from the input text via the offsets
    print_banner('Testing offsets:')
    print_table(lambda entity: text[entity['lOffset']:entity['rOffset']])

    # 3. Surface form recovered from the text returned in the results
    print_banner('Testing offsets in the returned text:')
    print_table(lambda entity: results['text'][entity['lOffset']:entity['rOffset']])
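
The dictionary passed to `print_nicely` can also be post-processed directly. As a sketch, the helper below keeps the entities whose NER confidence reaches a threshold and builds a Wikidata URL for those that carry a linked ID. The helper name `linked_entities_above`, the threshold value, and the hand-written sample dictionary are assumptions made for illustration; only the keys (`nes`, `surface`, `type`, `confidence_ner`, `wkd_id`) and the `NIL` marker are taken from this notebook.

```python
def linked_entities_above(results, min_confidence_ner=80.0):
    """Return (surface, type, wikidata_url) for confident, linked entities."""
    selected = []
    for entity in results.get("nes", []):
        wkd_id = entity.get("wkd_id")
        # Skip entities without a Wikidata link ("NIL" marks an unlinked mention).
        if not wkd_id or wkd_id == "NIL":
            continue
        # float() accepts confidences delivered as numbers or numeric strings.
        if float(entity["confidence_ner"]) < min_confidence_ner:
            continue
        url = f"https://www.wikidata.org/wiki/{wkd_id}"
        selected.append((entity["surface"], entity["type"], url))
    return selected


# Hand-written sample mimicking the structure used in this notebook.
sample_results = {
    "nes": [
        {"surface": "Steve Jobs", "type": "pers", "confidence_ner": 88.15, "wkd_id": "Q19837"},
        {"surface": "Apple", "type": "loc", "confidence_ner": 53.61, "wkd_id": "Q312"},
        {"surface": "le 1er avril 1976", "type": "time", "confidence_ner": 84.47, "wkd_id": "NIL"},
    ]
}
for row in linked_entities_above(sample_results):
    print(row)
```

Filtering on `confidence_ner` alone is a design choice for this sketch; a stricter variant could also require a minimum `confidence_nel` before trusting the link.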
            


Now the fun part: this function will download the required model and give you the keys to successfully detect entities in your text.

In [3]:
from utils import get_linked_entities
import requests

sentences = ["Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015."]

for sentence in sentences:
    results = get_linked_entities(sentence)
    print_nicely(results, sentence)


Timestamp: 2024-08-22T11:24:15Z
System ID: stacked-2-bert-medium-historic-multilingual-v3-base|mgenre

Entity               Type            Confidence NER  Confidence NEL  Start End   Wikidata ID Wikipedia Page      
----------------------------------------------------------------------------------------------------
Apple                loc             53.61%          56.73%          0     5     Q312       Apple >> fr         
Apple                loc             53.61%          56.73%          241   246   Q312       Apple >> fr         
le 1er avril 1976    time            84.47%          42.77%          16    33    NIL        N/A                 
Steve Jobs           pers            88.15%          54.62%          75    85    Q19837     Steve Jobs >> fr    
Steve Jobs           pers            88.15%          54.62%          116   126   Q19837     Steve Jobs >> fr    
Los Altos            loc             97.44%          45.53%          88    97    Q299298    Los Altos >> fr     
Cali