# Named Entity Recognition and Linking with Impresso Models




## Good to know before starting
Named entity recognition (NER) identifies mentions of entities such as persons and locations in text. Named entity linking (NEL) connects each mention to an entry in a knowledge base, for example a real person who has a Wikipedia page and a unique identifier in Wikidata. Wikipedia is a free, user-edited encyclopedia with articles on a wide range of topics such as historical events, famous people, and scientific concepts. Wikidata is a sister project of Wikipedia that stores structured data, such as facts and relationships between entities, in a form that computers can process for tasks such as NER and NEL.


In the context of _Impresso_, the NER tool was trained on the [HIPE 2020](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md) dataset. It recognises coarse- and fine-grained entities: persons and locations, but also their components such as names, titles, and functions. The _Impresso_ NEL tool then links these entity mentions to unique referents in a knowledge base (here Wikipedia and Wikidata), or reports `NIL` when the mention's referent is not found.

You can also access our [NERC](https://huggingface.co/spaces/impresso-project/multilingual-named-entity-recognition) and [NEL](https://huggingface.co/spaces/impresso-project/multilingual-entity-linking) demo apps through [HuggingFace Spaces](https://huggingface.co/docs/hub/en/spaces).

__When you run the code, a prompt asking for a HuggingFace token may appear. Click Cancel: no token is needed.__

## Prerequisites

First, we install the necessary libraries and download the necessary files.



In [47]:
!pip install torch
!pip install protobuf
!pip install sentencepiece
!pip install transformers
!pip install nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
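
The warning above is harmless, and it names its own remedy: setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming you would rather disable tokenizer parallelism than see the message each time a `!pip` command forks the process:

```python
import os

# Set the variable explicitly, as the warning suggests. Run this in a cell
# that executes before any tokenizer is used in the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` instead also counts as an explicit choice; `"false"` is simply the conservative option picked for this sketch.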





## Entity Recognition

In [48]:
# Import necessary Python modules from the Transformers library
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import pipeline

# Define the model name to be used for token classification, we use the Impresso NER
# that can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer.


In [49]:
ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, 
                        tokenizer=ner_tokenizer, 
                        trust_remote_code=True,
                        device='cpu')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/eboros/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [50]:
sentence = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
                where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
                were drafting policies for the newly established American government following the signing of the Constitution."""

print(sentence)

In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
                where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
                were drafting policies for the newly established American government following the signing of the Constitution.


We will use the following helper function to print each entity in a readable form.

In [51]:
def print_nicely(data):
    """Print the fields of each entry, one per line, with a blank line between entries."""
    for entry in data:
        for key, value in entry.items():
            print(f"  {key.capitalize()}: {value}")
        print()  # Blank line between entries

In [58]:
# Recognise the stacked (coarse and fine-grained) entities in the text
entities = ner_pipeline(sentence)

# Print each detected entity with its fields
print_nicely(entities)

  Type: time
  Confidence_ner: 79.68
  Index: (0, 4)
  Surface: year 1789
  Loffset: 7
  Roffset: 16

  Type: pers
  Confidence_ner: 95.57
  Index: (5, 12)
  Surface: King Louis XVI, ruler of France
  Loffset: 18
  Roffset: 49
  Title: King
  Name: Versailles
  Function: ruler of France

  Type: loc
  Confidence_ner: 68.91
  Index: (20, 23)
  Surface: Palace of Versailles
  Loffset: 87
  Roffset: 107

  Type: pers
  Confidence_ner: 77.25
  Index: (25, 32)
  Surface: Marie Antoinette, the Queen of France
  Loffset: 132
  Roffset: 169
  Name: Maximilien Robespierre
  Function: leading member of the National Assembly

  Type: pers
  Confidence_ner: 90.78
  Index: (34, 44)
  Surface: Maximilien Robespierre, a leading member of the National Assembly
  Loffset: 181
  Roffset: 246
  Name: Jean-Jacques Rousseau

  Type: pers
  Confidence_ner: 93.25
  Index: (47, 55)
  Surface: Jean-Jacques Rousseau, the famous philosopher
  Loffset: 278
  Roffset: 323
  Function: philosopher
  Name: Charles de
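
Each detected entity carries a type and a confidence score, so the list can be filtered before any further processing. The sketch below is only an illustration: the sample entries are copied from the output above, the lowercase key names are inferred from what `print_nicely` displays, and `filter_entities` and the threshold of 75 are hypothetical choices, not part of the Impresso pipeline.

```python
# Sample entries copied from the output above. The lowercase key names are an
# assumption inferred from the capitalised labels that print_nicely shows.
sample_entities = [
    {"type": "time", "confidence_ner": 79.68, "surface": "year 1789"},
    {"type": "pers", "confidence_ner": 95.57,
     "surface": "King Louis XVI, ruler of France"},
    {"type": "loc", "confidence_ner": 68.91, "surface": "Palace of Versailles"},
]


def filter_entities(entities, entity_type=None, min_confidence=0.0):
    """Keep entities of the given type (if any) that reach the confidence threshold."""
    return [
        entity
        for entity in entities
        if (entity_type is None or entity["type"] == entity_type)
        and entity["confidence_ner"] >= min_confidence
    ]


# Hypothetical usage: an arbitrary threshold of 75, then persons only.
confident = filter_entities(sample_entities, min_confidence=75.0)
persons = filter_entities(sample_entities, entity_type="pers")
print([entity["surface"] for entity in confident])
print([entity["surface"] for entity in persons])
```

With the real pipeline, `sample_entities` would be replaced by the list returned from `ner_pipeline(sentence)`.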

## Entity Linking

Next, the _Impresso_ NEL tool links the previously found entity mentions to unique referents in Wikipedia and Wikidata.

In [53]:
# Import the necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer for the specified pre-trained model
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

In [54]:
nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME, 
                        tokenizer=nel_tokenizer, 
                        trust_remote_code=True,
                        device='cpu')

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


Our entity linker needs a specific input format so that it can focus on the entity to be linked, as in this example:

```
The event was held at the [START] Palace of Versailles [END], a symbol of French monarchy.
```
Assuming that the entity `Palace of Versailles` was previously detected by an NER tool, we surround it with `[START]` and `[END]`. Our tool handles only one entity per input text, so each entity in the same sentence needs its own input, for example:


```
The event was held at the [START] Palace of Versailles [END], a symbol of French monarchy.
The event was held at the Palace of Versailles, a symbol of [START] French monarchy [END].
```
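
The two inputs above can be generated programmatically. The following is a minimal sketch, assuming the character span of each entity is known; `mark_entity` is a hypothetical helper, not part of the Impresso pipeline, and the spans are looked up with `str.find` purely for the demonstration.

```python
def mark_entity(text, start, end):
    """Wrap the character span text[start:end] in [START] and [END] markers."""
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


text = "The event was held at the Palace of Versailles, a symbol of French monarchy."

# Build one marked input per entity, since the linker handles a single
# marked entity per input text.
marked_inputs = []
for surface in ["Palace of Versailles", "French monarchy"]:
    start = text.find(surface)
    marked_inputs.append(mark_entity(text, start, start + len(surface)))

for marked in marked_inputs:
    print(marked)
```

In practice the spans would come from an NER tool rather than from a string search, which would pick the wrong occurrence if the same surface form appeared twice.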

Let's take this example:

In [55]:
simple_sentence = "The event was held at the [START] Palace of Versailles [END], a symbol of French monarchy."

linked_entity = nel_pipeline(simple_sentence)

print_nicely(linked_entity)

[{'surface': 'Palace of Versailles', 'wkd_id': 'Q2946', 'wkpedia_pagename': 'Palace of Versailles', 'wkpedia_url': 'https://en.wikipedia.org/wiki/Palace_of_Versailles', 'type': 'UNK', 'confidence_nel': np.float32(99.99), 'lOffset': 33, 'rOffset': 55}]
  Surface: Palace of Versailles
  Wkd_id: Q2946
  Wkpedia_pagename: Palace of Versailles
  Wkpedia_url: https://en.wikipedia.org/wiki/Palace_of_Versailles
  Type: UNK
  Confidence_nel: 99.98999786376953
  Loffset: 33
  Roffset: 55
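
The result is a list of dictionaries, so its fields can be used directly in code. Below is a minimal sketch, assuming the key names visible in the raw output above and assuming that a `wkd_id` of `NIL` means no referent was found (the linker returns this value for one mention in the linking loop further down). `describe_link` is a hypothetical helper, and the sample confidence values are written as plain floats.

```python
# Sample results copied from outputs shown in this notebook.
sample_results = [
    {"surface": "Palace of Versailles", "wkd_id": "Q2946",
     "wkpedia_url": "https://en.wikipedia.org/wiki/Palace_of_Versailles",
     "confidence_nel": 99.99},
    {"surface": "King Louis XVI, ruler of France", "wkd_id": "NIL",
     "wkpedia_url": "None", "confidence_nel": 100.0},
]


def describe_link(result):
    """Return a one-line summary; a wkd_id of 'NIL' is treated as 'not linked'."""
    if result["wkd_id"] == "NIL":
        return f"{result['surface']} -> not linked"
    # Wikidata item pages follow the pattern https://www.wikidata.org/wiki/<QID>.
    wikidata_url = f"https://www.wikidata.org/wiki/{result['wkd_id']}"
    return f"{result['surface']} -> {result['wkd_id']} ({wikidata_url})"


summaries = [describe_link(result) for result in sample_results]
for summary in summaries:
    print(summary)
```

Note that a high `confidence_nel` does not by itself mean a link was found: in the second sample the linker is fully confident in `NIL`.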



The linker _could_ work on a text without the special markers when the text mentions only one entity, but we do not recommend it: in the example below, it returns `France` instead of the Palace of Versailles.

In [56]:
simple_sentence = "The event was held at the Palace of Versailles, a symbol of French monarchy."

linked_entity = nel_pipeline(simple_sentence)

print_nicely(linked_entity)

[{'surface': None, 'wkd_id': 'Q142', 'wkpedia_pagename': 'France', 'wkpedia_url': 'https://en.wikipedia.org/wiki/France', 'type': 'UNK', 'confidence_nel': np.float32(63.69), 'lOffset': None, 'rOffset': None}]
  Surface: None
  Wkd_id: Q142
  Wkpedia_pagename: France
  Wkpedia_url: https://en.wikipedia.org/wiki/France
  Type: UNK
  Confidence_nel: 63.689998626708984
  Loffset: None
  Roffset: None



If we work with our NER tool, we can automatically create these sentences with entity markers and link each entity afterwards:

In [57]:
# Run the NER pipeline on the input sentence and store the results
entities = ner_pipeline(sentence)

print(f'{len(entities)} entities were detected.')

# Surface forms that have already been linked, to avoid linking duplicates
already_done = []

# Process each entity for linking
for entity in entities:
    if entity['surface'] not in already_done:
        already_done.append(entity['surface'])

        # Token span of the entity, as reported by the NER pipeline
        tokens = sentence.split(' ')
        start, end = (
            entity["index"][0],
            entity["index"][1],
        )

        # Keep up to 10 tokens of context on each side of the entity
        context_start = max(0, start - 10)
        context_end = min(len(tokens), end + 11)

        # Surround the entity with the [START] and [END] markers
        nel_sentence = (
            " ".join(tokens[context_start:start])
            + " [START] "
            + entity['surface']
            + " [END] "
            + " ".join(tokens[end + 1 : context_end])
        )

        linked_entities = nel_pipeline(nel_sentence)
        print(nel_sentence)
        print_nicely(linked_entities)

12 entities were detected.
[{'surface': 'year 1789', 'wkd_id': 'Q142', 'wkpedia_pagename': 'France', 'wkpedia_url': 'https://en.wikipedia.org/wiki/France', 'type': 'UNK', 'confidence_nel': np.float32(38.29), 'lOffset': 8, 'rOffset': 19}]
 [START] year 1789 [END] Louis XVI, ruler of France, convened the Estates-General at the
  Surface: year 1789
  Wkd_id: Q142
  Wkpedia_pagename: France
  Wkpedia_url: https://en.wikipedia.org/wiki/France
  Type: UNK
  Confidence_nel: 38.290000915527344
  Loffset: 8
  Roffset: 19

[{'surface': 'King Louis XVI, ruler of France', 'wkd_id': 'NIL', 'wkpedia_pagename': 'NIL', 'wkpedia_url': 'None', 'type': 'UNK', 'confidence_nel': np.float32(100.0), 'lOffset': 30, 'rOffset': 63}]
In the year 1789, King [START] King Louis XVI, ruler of France [END] at the Palace of Versailles, 
    
  Surface: King Louis XVI, ruler of France
  Wkd_id: NIL
  Wkpedia_pagename: NIL
  Wkpedia_url: None
  Type: UNK
  Confidence_nel: 100.0
  Loffset: 30
  Roffset: 63

[{'surface': 

KeyboardInterrupt: 
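
In the marked inputs printed above, the context around the entity is not always intact (the second input repeats "King" and drops "convened the Estates-General"). This suggests that the `index` values from the NER pipeline do not count tokens the same way as `sentence.split(' ')`. One possible alternative, sketched below, is to cut the text by character position instead. It assumes that the `lOffset` and `rOffset` fields of each entity are character offsets into the original text, which matches the values shown in the NER output; `build_nel_inputs` and its `window` parameter are hypothetical names.

```python
def build_nel_inputs(text, entities, window=80):
    """Yield (surface, marked_text) once per distinct surface form.

    `window` is the number of context characters kept on each side of the
    entity; a cut may therefore fall in the middle of a word.
    """
    seen = set()
    for entity in entities:
        surface = entity["surface"]
        if surface in seen:
            continue  # This surface form was already handled
        seen.add(surface)
        start, end = entity["lOffset"], entity["rOffset"]
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        yield surface, f"{left}[START] {text[start:end]} [END]{right}"


# Shortened version of the example sentence, with offsets taken from the NER output.
text = "In the year 1789, King Louis XVI, ruler of France, convened the Estates-General."
entities = [
    {"surface": "year 1789", "lOffset": 7, "rOffset": 16},
    {"surface": "King Louis XVI, ruler of France", "lOffset": 18, "rOffset": 49},
    {"surface": "year 1789", "lOffset": 7, "rOffset": 16},  # Duplicate, skipped
]

nel_inputs = list(build_nel_inputs(text, entities))
for surface, marked in nel_inputs:
    print(marked)
```

Each `marked` string could then be passed to `nel_pipeline` in place of `nel_sentence`.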


## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
interdisciplinary research project that aims to develop and consolidate tools for
processing and exploring large collections of media archives across modalities, time,
languages and national borders. The first project (2017-2021) was funded by the Swiss
National Science Foundation under grant
No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under
the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
