# Detect entities in a text

Named entities such as organizations, locations, persons, and temporal expressions play a crucial role in the comprehension and analysis of both historical and contemporary texts. The HIPE-2022 project focuses on named entity recognition and classification (NERC) and entity linking (EL) in multilingual historical documents.

### About HIPE-2022
HIPE-2022 involves processing diverse datasets from historical newspapers and classical commentaries, spanning approximately 200 years and multiple languages. The primary goal is to confront systems with challenges related to multilinguality, domain-specific entities, and varying annotation tag sets.

### Datasets
HIPE-2022 draws on six primary datasets, listed below. The model used in this notebook was trained only on **hipe2020**, in French and German.
- **ajmc**: Classical commentaries in German, French, and English.
- **hipe2020**: Historical newspapers in German, French, and English.
- **letemps**: Historical newspapers in French.
- **topres19th**: Historical newspapers in English.
- **newseye**: Historical newspapers in German, Finnish, French, and Swedish.
- **sonar**: Historical newspapers in German.

### Annotation Types and Levels
HIPE-2022 employs an IOB tagging scheme (inside-outside-beginning format) for entity annotations. The annotation levels include:

1. **TOKEN**: The annotated token.
2. **NE-COARSE-LIT**: Coarse type of the entity (literal sense).
3. **NE-COARSE-METO**: Coarse type of the entity (metonymic sense).
4. **NE-FINE-LIT**: Fine-grained type of the entity (literal sense).
5. **NE-FINE-METO**: Fine-grained type of the entity (metonymic sense).
6. **NE-FINE-COMP**: Component type of the entity.
7. **NE-NESTED**: Coarse type of the nested entity.
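
To make the IOB scheme concrete, here is a minimal sketch that groups a tagged token sequence into entity spans. It is illustrative only: the helper name `iob_to_spans` and the hand-tagged sample are assumptions, not part of the HIPE-2022 tooling, and the tags are assumed to follow the usual `B-`/`I-`/`O` convention.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical sample, tagged by hand for illustration.
tokens = ["Steve", "Jobs", "à", "Los", "Altos"]
tags = ["B-pers", "I-pers", "O", "B-loc", "I-loc"]
print(iob_to_spans(tokens, tags))
```

The same grouping applies independently to each annotation level, since every level carries its own column of IOB tags.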

### Getting Started
This notebook will guide you through setting up a workflow to identify named entities within your text using the HIPE-2022 trained pipeline. By leveraging this pipeline, you can detect mentions of people, places, organizations, and temporal expressions, enhancing your analysis and understanding of historical and contemporary documents.

---

*Note: This notebook may require the `HF_TOKEN` environment variable to be set. You can get a token by signing up on the [Hugging Face website](https://huggingface.co/join) and read more in the [official documentation](https://huggingface.co/docs/huggingface_hub/v0.20.2/en/quick-start#environment-variable).*
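
If you are unsure whether the token is available, a quick check such as the following can help. This is only a sketch: the helper name `hf_token_is_set` is hypothetical, and it only inspects the environment without validating the token itself.

```python
import os


def hf_token_is_set(env=os.environ):
    """Return True if a non-empty HF_TOKEN is present in the given environment mapping."""
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    print("HF_TOKEN is not set; set it before running the notebook if the model download fails.")
```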

Install the necessary libraries (if not already installed), then download the required NLTK data.

In [None]:
!pip install transformers
!pip install nltk
!pip install torch

Once the libraries are installed, we download the NLTK data needed to run our POS tagger: **averaged_perceptron_tagger**.
The averaged_perceptron_tagger is an efficient and effective part-of-speech (POS) tagger that tags each word in a sentence with its part of speech, such as noun, verb, or adjective. See [https://arxiv.org/abs/2104.02831](https://arxiv.org/abs/2104.02831) for reference.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

Now the fun part: the following cells download the required model and set up a pipeline that detects named entities in your text.

In [100]:
# Import necessary modules from the transformers library
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Define the model name to be used for token classification; we use the Impresso NER model
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer corresponding to the specified model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [101]:
# Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer
nlp = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True)

# Input text to be processed for named entity recognition
results = nlp("""Apple est créée le 1er avril 1976 dans le garage de la maison 
            d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak 
            et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine 
            sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification 
            de ses produits, le mot « computer » est retiré le 9 janvier 2015.
            """)

In [102]:
def print_nicely(results):
    """Print the detected entities as one table per annotation level."""
    for key, entities in results.items():
        if entities:
            print(f"\n**{key}**\n")
            print(f"{'Entity':<15} {'Type':<10} {'Score':<8} {'Index':<5} {'Word':<20} {'Start':<5} {'End':<5}")
            print("-" * 70)
            for entity in entities:
                print(f"{entity['word']:<15} {entity['entity']:<10} {entity['score']:<8.4f} {entity['index']:<5} {entity['word']:<20} {entity['start']:<5} {entity['end']:<5}")
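
Because the pipeline can return low-confidence or repeated entries, it can be useful to filter the results before printing them. The sketch below assumes the same structure that `print_nicely` reads (a dictionary mapping each annotation level to a list of entity dictionaries with `entity`, `score`, `start`, and `end` keys). The helper name `filter_entities`, the default threshold of 0.5, and the hand-written sample are assumptions for illustration, not part of the pipeline.

```python
def filter_entities(results, min_score=0.5):
    """Keep entities scoring at least min_score and drop repeats of the same type and span."""
    filtered = {}
    for level, entities in results.items():
        seen = set()
        kept = []
        for entity in entities:
            # Two entries count as repeats if type and character offsets match.
            key = (entity["entity"], entity["start"], entity["end"])
            if entity["score"] >= min_score and key not in seen:
                seen.add(key)
                kept.append(entity)
        filtered[level] = kept
    return filtered


# Hand-written sample mirroring the assumed result structure; the repeated
# "Los Altos" entry is hypothetical and only there to show deduplication.
sample = {
    "NE-COARSE-LIT": [
        {"word": "Apple", "entity": "loc", "score": 0.4756, "index": 0, "start": 0, "end": 5},
        {"word": "Los Altos", "entity": "loc", "score": 0.9740, "index": 20, "start": 101, "end": 110},
        {"word": "Los Altos", "entity": "loc", "score": 0.9740, "index": 20, "start": 101, "end": 110},
    ]
}
print(filter_entities(sample))
```

The filtered dictionary keeps the same shape as the input, so it can be passed to `print_nicely` unchanged, for example `print_nicely(filter_entities(results, min_score=0.6))`.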


In [103]:
print_nicely(results)



**NE-COARSE-LIT**

Entity          Type       Score    Index Word                 Start End  
----------------------------------------------------------------------
Apple           loc        0.4756   0     Apple                0     5    
Apple           loc        0.4756   0     Apple                282   287  
le 1er avril 1976 time       0.8420   3     le 1er avril 1976    16    33   
Steve Jobs      pers       0.8601   17    Steve Jobs           88    98   
Steve Jobs      pers       0.8601   17    Steve Jobs           129   139  
Los Altos       loc        0.9740   20    Los Altos            101   110  
Californie      loc        0.9822   23    Californie           114   124  
Steve Jobs      pers       0.8479   25    Steve Jobs           88    98   
Steve Jobs      pers       0.8479   25    Steve Jobs           129   139  
Steve Wozniak   pers       0.8791   28    Steve Wozniak        141   154  
Ronald Wayne14  pers       0.8972   31    Ronald Wayne14       171   185  
le 3 ja