# NER Task - Proof of Concept Pipeline

This Jupyter notebook reports on three experiments that assess a 
Named Entity Recognition (NER) task using pre-trained models and the 
spaCy and GLiNER libraries.

## Tools

* Libraries
  
  * spaCy
  
  * GLiNER

* Models
  
  * Pre-trained spaCy model `de_core_news_lg`
  
  * Pre-trained GLiNER model `gliner_multi-v2.1`
  
  * Pre-trained BERT transformer `bert_de_ner`

## Experimental setup

| Model             | Language | Genre       | Vectors                          | Sources                                     |
| ----------------- | -------- | ----------- | -------------------------------- | ------------------------------------------- |
| de_core_news_lg   | DE       | news, media | 500k keys, 500k vectors, 300 dim | Tiger Corpus, Tiger2Dep, WikiNER, Wikipedia |
| gliner_multi-v2.1 | Multi    | Multi       | 768 dim                          | Pile-Ner-Type                               |
| bert_de_ner       | DE       | Multi       |                                  | bert-base-german-dbmdz-cased                |

Install the dependencies and the large spaCy model for German.

In [1]:
!pip install spacy
!pip install transformers
!pip install torch
!pip install gliner-spacy
!python -m spacy download de_core_news_lg

Collecting de-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.7.0/de_core_news_lg-3.7.0-py3-none-any.whl (567.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')


## Experiment 1 - Baseline

The aim of this experiment is to assess the performance of the pretrained model `de_core_news_lg` and the Python library spaCy when conducting a named entity recognition (NER) task with a standard configuration. `de_core_news_lg` is used as shipped, without fine-tuning, so we assume it provides a standard NER output. It therefore serves as the baseline for comparing the performance of the other models.

### Training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.95      | 0.96   | 0.95     |




In [2]:
import spacy
from spacy import displacy

# Initialize the spaCy NLP pipeline
nlp_de = spacy.load("de_core_news_lg")

def extract_entities(doc):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

def process_text(doc):
    # Step 1: Extract entities from the text
    entities = extract_entities(doc)
    print("Entities:")
    for entity, label in entities:
        print(f"{entity} ({label})")
    # Step 2: Visualize named entities
    displacy.render(doc, style="ent", jupyter=True)

# Example text
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

doc_de = nlp_de(text)

print("Processing German text:")
process_text(doc_de)

Processing German text:
Entities:
Amedeo Modigliani (PER)
italienischer (MISC)
Livorno (LOC)
Italien (LOC)
Modigliani (PER)
Modigliani (PER)
Pariser (LOC)
Pablo Picasso (PER)
Constantin Brâncuși (PER)
Modigliani (PER)


### Observations

* Entity Identification
  
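  * DATE is not recognized. Although most entities are identified, the date 
    "12. Juli 1884" is not extracted. The NER component of `de_core_news_lg` 
    uses the labels LOC, MISC, ORG and PER, so it has no DATE label to assign. 
    The problem persists even after intervening in the code and hard-coding 
    the date format.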
  
  * Misclassification of LOC/GPE. Entities that should be identified as GPE 
    (countries, cities, states) are identified as LOC (non-GPE locations, 
    e.g. mountain ranges, bodies of water). In the text, "Livorno" and 
    "Italien" are identified as LOC but should be GPE. The model's label set 
    contains no GPE label, so it cannot make this distinction as shipped.
  
  * The adjective "Pariser" is identified as a LOC.
  
  * Typological entities that depend on context, such as "Künstler" or "künstlerisches", are not recognized.

* Tokenization: Tokenization works correctly. For example, the artist is 
  recognized as PER both under the full name "Amedeo Modigliani" and under 
  the surname "Modigliani" alone.
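
The missing DATE entities could be recovered with a rule-based step. The sketch below is one possible approach, using only Python's standard `re` module rather than spaCy itself. The function name `find_german_dates` and the pattern are assumptions made for illustration, and the pattern covers only dates written as day, month name and year, such as "12. Juli 1884".

```python
import re

# German month names; the pattern only covers "day. month year" dates.
MONTHS = (
    "Januar|Februar|März|April|Mai|Juni|Juli|"
    "August|September|Oktober|November|Dezember"
)
DATE_PATTERN = re.compile(rf"\b\d{{1,2}}\.\s?(?:{MONTHS})\s\d{{4}}\b")


def find_german_dates(text):
    """Return (matched_text, start, end) for every date found in text."""
    return [(m.group(), m.start(), m.end()) for m in DATE_PATTERN.finditer(text)]


sample = "Geboren am 12. Juli 1884 in Livorno, Italien."
for date_text, start, end in find_german_dates(sample):
    print(f"{date_text} (DATE) [{start}:{end}]")
```

The character offsets could then be turned into spaCy spans, for example with `Doc.char_span`, and added to `doc.ents`; that step is not shown here.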

## Experiment 2 - spaCy GLiNER

The aim of this experiment is to assess the performance of the pretrained multilingual model `gliner_multi-v2.1` and the Python library `gliner-spacy` when conducting a named entity recognition (NER) task. `gliner_multi-v2.1` is used without fine-tuning and is integrated into the pipeline with the following tuned hyperparameters, obtained with BERT:

| Chunk Size | Threshold |
| ---------- | --------- |
| 200        | 0.3       |
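
To make the role of the two hyperparameters more concrete, the sketch below illustrates the general idea: long text is split into chunks of at most `chunk_size` characters, and candidate entities scoring below `threshold` are discarded. This is a simplified illustration and not the implementation used by `gliner-spacy`; the helper names `chunk_text` and `filter_by_threshold` and the candidate scores are invented for demonstration.

```python
def chunk_text(text, chunk_size=200):
    """Split text at whitespace into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def filter_by_threshold(candidates, threshold=0.3):
    """Keep only candidate entities whose score reaches the threshold."""
    return [c for c in candidates if c["score"] >= threshold]


# Hypothetical candidates with made-up scores, for illustration only.
candidates = [
    {"text": "Livorno", "label": "place", "score": 0.91},
    {"text": "Stil", "label": "organization", "score": 0.12},
]
kept = filter_by_threshold(candidates, threshold=0.3)
print([c["text"] for c in kept])
print(len(chunk_text("wort " * 100, chunk_size=200)))
```

A lower threshold generally admits more candidates at the cost of more false positives, which is one thing to keep in mind when reading the output of this experiment.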

### Training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.83      | 0.84   | 0.83     |


In [4]:
import spacy
from gliner_spacy.pipeline import GlinerSpacy
from spacy import displacy

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_multi-v2.1",
    "chunk_size": 200,
    "labels": ["person", "organization", "place", "date"],
    "style": "ent",
    "threshold": 0.3
}

# Initialize a blank German spaCy pipeline and add GLiNER
nlp = spacy.blank("de")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example German text for entity detection
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Visualize the entities using displacy
displacy.render(doc, style="ent", jupyter=True)

Amedeo Modigliani person
12. Juli 1884 date
Livorno place
Italien place
Modigliani person
Pariser Kunstszene place
Pablo Picasso person
Constantin Brâncuși person
Modigliani person
Kunstgeschichte organization
weltweit place


### Observations

* Entity Identification
  
  * There is no distinction between GPE and LOC; the only location label is "place". This follows from the pipeline configuration, which passes the single label "place" to the model instead of separate GPE and LOC labels.
  
  * The topic "Pariser Kunstszene" is identified as a place.
  
  * Typological entities that depend on context, such as "Künstler" (artist), are not recognized.

* Tokenization 
  
  * Tokenization works correctly. For example, the artist is recognized as 
    "person" both under the full name "Amedeo Modigliani" and under the surname "Modigliani" alone.

* Dataset and Multilingualism

  * Despite its lower training performance, the `gliner_multi-v2.1` model has been trained as a multilingual model using more resources than the standard pre-trained model available within the spaCy library.
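
The outputs of Experiments 1 and 2 use different label names, so a direct comparison needs a common scheme. The sketch below copies the entities printed by both cells and maps the spaCy labels onto the GLiNER labels. The mapping `SPACY_TO_COMMON` is an assumption made for this comparison, not something defined by either library.

```python
# Assumed mapping from the spaCy labels to the labels used in the GLiNER config.
SPACY_TO_COMMON = {"PER": "person", "LOC": "place", "ORG": "organization", "MISC": "misc"}

# Entities as printed by Experiment 1 (de_core_news_lg).
spacy_entities = [
    ("Amedeo Modigliani", "PER"), ("italienischer", "MISC"), ("Livorno", "LOC"),
    ("Italien", "LOC"), ("Modigliani", "PER"), ("Modigliani", "PER"),
    ("Pariser", "LOC"), ("Pablo Picasso", "PER"),
    ("Constantin Brâncuși", "PER"), ("Modigliani", "PER"),
]

# Entities as printed by Experiment 2 (gliner_multi-v2.1).
gliner_entities = [
    ("Amedeo Modigliani", "person"), ("12. Juli 1884", "date"), ("Livorno", "place"),
    ("Italien", "place"), ("Modigliani", "person"), ("Pariser Kunstszene", "place"),
    ("Pablo Picasso", "person"), ("Constantin Brâncuși", "person"),
    ("Modigliani", "person"), ("Kunstgeschichte", "organization"), ("weltweit", "place"),
]


def normalize(entities, mapping=None):
    """Return the set of unique (text, label) pairs, with labels mapped if given."""
    mapping = mapping or {}
    return {(text, mapping.get(label, label)) for text, label in entities}


baseline = normalize(spacy_entities, SPACY_TO_COMMON)
gliner = normalize(gliner_entities)

shared = baseline & gliner
only_baseline = baseline - gliner
only_gliner = gliner - baseline

print("Shared:", sorted(shared))
print("Only spaCy:", sorted(only_baseline))
print("Only GLiNER:", sorted(only_gliner))
```

Because the comparison works on unique (text, label) pairs, repeated mentions such as "Modigliani" are counted once; a comparison by character offsets would be needed to compare individual mentions.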

## Experiment 3 - spaCy GLiNER and BERT Transformer

The aim of this experiment is to assess the performance of the pretrained multilingual model `gliner_multi-v2.1` together with the `bert_de_ner` model, using the Python library `gliner-spacy`, when conducting a named entity recognition (NER) task. Both models are used without fine-tuning and are integrated into the pipeline with the following tuned hyperparameters, obtained with BERT:

| Chunk Size | Threshold |
| ---------- | --------- |
| 200        | 0.3       |

### GLiNER training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.83      | 0.84   | 0.83     |

### BERT DE NER training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.81      | 0.84   | 0.82     |
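
For reference, the three metrics in these tables are related: F1 is the harmonic mean of precision and recall. The sketch below shows one common way to compute them at entity level, where a prediction counts as correct only if both text and label match. The function names are assumptions, and the `predicted` and `gold` lists are hypothetical examples, not results from this notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Entity-level scores: an entity counts only if text and label both match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Reported (precision, recall) pairs from the tables in this notebook.
reported = {
    "de_core_news_lg": (0.95, 0.96),
    "gliner_multi-v2.1": (0.83, 0.84),
    "bert_de_ner": (0.81, 0.84),
}
for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")

# Hypothetical predictions and gold annotations, for illustration only.
predicted = [("Livorno", "place"), ("Italien", "place"),
             ("weltweit", "place"), ("Pablo Picasso", "person")]
gold = [("Livorno", "place"), ("Italien", "place"),
        ("Pablo Picasso", "person"), ("12. Juli 1884", "date")]
print(precision_recall_f1(predicted, gold))
```

Recomputing F1 from the reported precision and recall gives 0.95, 0.83 and 0.82, which agrees with the F1 columns above.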


In [1]:
import spacy
from spacy import displacy
from transformers import AutoTokenizer, AutoModelForTokenClassification
from gliner_spacy.pipeline import GlinerSpacy

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_multi-v2.1",
    "chunk_size": 200,
    "labels": ["person", "organization", "place", "date"],
    "style": "ent",
    "threshold": 0.3
}

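# Load the transformer model and tokenizer.
# Note: they are only loaded here; nothing below adds them to the spaCy
# pipeline, so doc.ents contains the GLiNER predictions only.
model_name = "fhswf/bert_de_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)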

# Initialize a blank German spaCy pipeline and add GLiNER
nlp = spacy.blank("de")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example German text for entity detection
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Visualize the entities using displacy
displacy.render(doc, style="ent", jupyter=True)

Some weights of the model checkpoint at fhswf/bert_de_ner were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Amedeo Modigliani person
12. Juli 1884 date
Livorno place
Italien place
Modigliani person
Pariser Kunstszene place
Pablo person
Picasso person
Constantin Brâncuși person
Modigliani person
Kunstgeschichte organization
weltweit place


### Observations

* Entity Identification
  
  * There is no distinction between GPE and LOC; the only location label is "place". This follows from the pipeline configuration, which passes the single label "place" to the model instead of separate GPE and LOC labels.
  
  * The topic "Pariser Kunstszene" is identified as a place.
  
  * Typological entities that depend on context, such as "Künstler" (artist), are not recognized.

* Tokenization 
  
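  * Tokenization mostly works correctly. For example, the artist is recognized as 
    "person" both under the full name "Amedeo Modigliani" and under the surname "Modigliani" alone. 
    In this run, however, "Pablo Picasso" is split into two separate person entities, "Pablo" and "Picasso".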

* Dataset and Multilingualism

  * Despite its lower training performance, the `gliner_multi-v2.1` model has been trained as a multilingual model using more resources than the standard pre-trained model available within the spaCy library.
  
  * Adding the `bert_de_ner` model does not improve the results. To gain a better overview, this task should be run against a proper text dataset.
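
In the cell for this experiment, `bert_de_ner` is loaded, while the printed entities are read from `doc.ents` of the GLiNER pipeline. One possible way to combine the predictions of two models is to merge their character-offset spans and resolve overlaps by preferring the longer span, similar in spirit to spaCy's `spacy.util.filter_spans`. The sketch below is a standalone illustration under that assumption: `merge_spans` is a hypothetical helper, and both span lists are invented examples rather than actual model output.

```python
def merge_spans(*span_lists):
    """Merge (start, end, label) spans, preferring longer spans on overlap.

    On equal length, the span that comes first in the input order wins.
    """
    candidates = [span for spans in span_lists for span in spans]
    # Longest spans first; ties are broken by earlier start, then input order.
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in candidates:
        overlaps = any(start < k_end and end > k_start for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


sentence = "Modigliani wurde von Pablo Picasso beeinflusst."

# Hypothetical spans from two models, for illustration only.
spans_model_a = [(0, 10, "person"), (21, 26, "person"), (27, 34, "person")]
spans_model_b = [(0, 10, "PER"), (21, 34, "PER")]

for start, end, label in merge_spans(spans_model_a, spans_model_b):
    print(sentence[start:end], label)
```

With these example spans, the split "Pablo" and "Picasso" entries are replaced by the longer "Pablo Picasso" span. Whether such a merge improves results would still have to be measured on an annotated dataset.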