# Detect Entities and Link them to Wikipedia and Wikidata in a Text through the Impresso API

We refer to "named entity recognition" as NER, which is a tool that recognises entities such as persons and locations from text. A "named entity linker" (NEL) connects these entities to an existing one such as a real person that can be found on Wikipedia (with a unique id in Wikidata). Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics like historical events, famous people, or scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data, like facts and relationships between entities, used for tasks where computers need to understand and process data, such as NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset. It was trained to recognise coarse and fine grained entities such as persons and locations, but also their names, titles, and functions. Further, the _Impresso_ NEL tool links these entity mentions to unique referents in a knowledge base – here Wikipedia and Wikidata – or not if the mention's referent is not found.

In [None]:
!pip install transformers
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/refs/heads/main/entity/utils.py
!wget https://raw.githubusercontent.com/impresso/impresso-datalab-notebooks/refs/heads/main/entity/text_utils.py

In [1]:
def print_nicely(results, text):
    # Print the timestamp and system ID
    print(f"Timestamp: {results.get('ts')}")
    print(f"System ID: {results.get('sys_id')}")

    entities = results.get('nes', [])
    if entities:
        print(f"\n{'Entity':<20} {'Type':<15} {'Confidence NER':<15} {'Confidence NEL':<15} {'Start':<5} {'End':<5} {'Wikidata ID':<10} {'Wikipedia Page':<20}")
        print("-" * 100)
        for entity in entities:
            confidence_ner = f"{entity['confidence_ner']}%"
            confidence_nel = f"{entity['confidence_nel']}%"
            wkd_id = entity.get('wkd_id', 'N/A')
            wkpedia_pagename = entity.get('wkpedia_pagename', 'N/A')
            print(f"{entity['surface']:<20} {entity['type']:<15} {confidence_ner:<15} {confidence_nel:<15} {entity['lOffset']:<5} {entity['rOffset']:<5} {wkd_id:<10} {wkpedia_pagename:<20}")

        print("*" * 100)
        print('Testing offsets:')
        print("*" * 100)
        print(f"\n{'Entity':<20} {'Type':<15} {'Confidence NER':<15} {'Confidence NEL':<15} {'Start':<5} {'End':<5} {'Wikidata ID':<10} {'Wikipedia Page':<20}")
        print("-" * 100)
        for entity in entities:
            confidence_ner = f"{entity['confidence_ner']}%"
            confidence_nel = f"{entity['confidence_nel']}%"
            wkd_id = entity.get('wkd_id', 'N/A')
            wkpedia_pagename = entity.get('wkpedia_pagename', 'N/A')
            print(f"{text[entity['lOffset']:entity['rOffset']]:<20} {entity['type']:<15} {confidence_ner:<15} {confidence_nel:<15} {entity['lOffset']:<5} {entity['rOffset']:<5} {wkd_id:<10} {wkpedia_pagename:<20}")

        print("*" * 100)
        print('Testing offsets in the returned text:')
        print("*" * 100)
        print(f"\n{'Entity':<20} {'Type':<15} {'Confidence NER':<15} {'Confidence NEL':<15} {'Start':<5} {'End':<5} {'Wikidata ID':<10} {'Wikipedia Page':<20}")
        print("-" * 100)
        for entity in entities:
            confidence_ner = f"{entity['confidence_ner']}%"
            confidence_nel = f"{entity['confidence_nel']}%"
            wkd_id = entity.get('wkd_id', 'N/A')
            wkpedia_pagename = entity.get('wkpedia_pagename', 'N/A')
            print(f"{results['text'][entity['lOffset']:entity['rOffset']]:<20} {entity['type']:<15} {confidence_ner:<15} {confidence_nel:<15} {entity['lOffset']:<5} {entity['rOffset']:<5} {wkd_id:<10} {wkpedia_pagename:<20}")



Now the fun part, this function will download the requried model and gives you the keys to successfullly detect entities in your text.

In [1]:
from utils import get_linked_entities

sentence = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
                where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
                were drafting policies for the newly established American government following the signing of the Constitution."""

# Function to print each entry nicely
def print_nicely(data):
    for idx, entry in enumerate(data, start=1):
        for key, value in entry.items():
            print(f"  {key.capitalize()}: {value}")
        print()  # Blank line between entries


results = get_linked_entities(sentence)

print(results)
print_nicely(results, sentence)


Request failed with status code 507
None


NameError: name 'print_nicely' is not defined


## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
interdisciplinary research project that aims to develop and consolidate tools for
processing and exploring large collections of media archives across modalities, time,
languages and national borders. The first project (2017-2021) was funded by the Swiss
National Science Foundation under grant
No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under
the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
