# NER Task: Proof-of-Concept Pipeline

This Jupyter notebook reports on three experiments that assess a Named
Entity Recognition (NER) task using pretrained models with the spaCy and
GLiNER libraries.

## Tools

* Libraries
  
  * spaCy
  
  * GLiNER

* Models
  
  * Pretrained spaCy model `de_core_news_lg`
  
  * Pretrained GLiNER model `gliner_multi-v2.1`
  
  * Pretrained BERT transformer `bert_de_ner`

## Experimental setup

| Model             | Lang  | Genre       | Vectors                          | Sources                                      |
| ----------------- | ----- | ----------- | -------------------------------- | -------------------------------------------- |
| de_core_news_lg   | DE    | news, media | 500k keys, 500k vectors, 300 dim | TIGER Corpus, Tiger2Dep, WikiNER, Wikipedia  |
| gliner_multi-v2.1 | Multi | Multi       | 768 Dim                          | Pile-Ner-Type                                |
| bert_de_ner       | DE    | Multi       |                                  | bert-base-german-dbmdz-cased                 |

Install the dependencies and the large spaCy model for German.

In [2]:
!pip install spacy transformers torch gliner-spacy
# The enrichment section further down also imports pandas, requests and SPARQLWrapper
!pip install pandas requests SPARQLWrapper
!python -m spacy download de_core_news_lg

Collecting de-core-news-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.8.0/de_core_news_lg-3.8.0-py3-none-any.whl (567.8 MB)

## Experiment 1: Baseline

The aim of this experiment is to assess the performance of the pretrained model `de_core_news_lg` with the Python library spaCy on a named entity recognition (NER) task in a standard configuration. `de_core_news_lg` is integrated as is, without fine-tuning, so we assume it provides a standard NER output. It therefore serves as the baseline against which the other models are compared.

### Training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.95      | 0.96   | 0.95     |




In [4]:
import spacy
from spacy import displacy
from IPython.display import HTML, display

# Initialize the spaCy NLP pipeline
nlp_de = spacy.load("de_core_news_lg")

def extract_entities(doc):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

def process_text(doc):
    # Step 1: Extract entities from the text
    entities = extract_entities(doc)
    print("Entities:")
    for entity, label in entities:
        print(f"{entity} ({label})")
    # Step 2: Visualize named entities (avoid IPython.core.display import path)
    html = displacy.render(doc, style="ent", jupyter=False)
    display(HTML(f'<span class="tex2jax_ignore">{html}</span>'))

# Example text
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

doc_de = nlp_de(text)

print("Processing German text:")
process_text(doc_de)

Processing German text:
Entities:
Amedeo Modigliani (PER)
italienischer (MISC)
Livorno (LOC)
Italien (LOC)
Modigliani (PER)
Modigliani (PER)
Pariser (LOC)
Pablo Picasso (PER)
Constantin Brâncuși (PER)
Modigliani (PER)


### Observations

* Entity Identification
  
  * DATE is not recognized. Although most entities are identified, the date
    "12. Juli 1884" is not extracted. This is consistent with the model's NER
    label set (LOC, MISC, ORG, PER), which contains no DATE label, and it
    persists even after intervening in the code and hard-coding the date format.
  
  * No LOC/GPE distinction. Entities that other label schemes would tag as GPE
    (countries, cities, states) are tagged as LOC, a label those schemes
    reserve for non-GPE locations such as mountain ranges and bodies of water.
    In the text, "Livorno" and "Italien" are tagged as LOC because the model's
    label set has no GPE label.
  
  * The adjective "Pariser" is identified as a LOC.
  
  * Typological entities that depend on context, such as "Künstler" or "künstlerisches", are not recognized.

* Tokenization: Tokenization works correctly. For example, the painter is
  recognized as PER both under the full name "Amedeo Modigliani" and under the surname "Modigliani" alone.
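
Because the baseline never emits a DATE entity, one possible workaround is a rule-based fallback. The sketch below is not part of the original notebook: it uses a plain regular expression to pick up German long-form dates such as "12. Juli 1884". The names `GERMAN_MONTHS`, `DATE_PATTERN`, and `find_german_dates` are illustrative assumptions, and the pattern covers only the day-month-year form seen in the example text.

```python
import re

# Illustrative helper, not part of the original notebook.
GERMAN_MONTHS = (
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
)

# Day (1-2 digits), a dot, a German month name, and a four-digit year.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\.\s*(" + "|".join(GERMAN_MONTHS) + r")\s+(\d{4})\b"
)

def find_german_dates(text):
    """Return (start, end, matched_text) for each long-form German date."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_PATTERN.finditer(text)]

sample = "Geboren am 12. Juli 1884 in Livorno, Italien. Künstler des 20. Jahrhunderts."
dates = find_german_dates(sample)
for start, end, value in dates:
    print(start, end, value)
```

The character offsets returned here could, for instance, be turned into spaCy spans with `doc.char_span(start, end, label="DATE")`; whether that fits the pipeline depends on how overlapping entities should be handled.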

## Experiment 2: spaCy GLiNER

The aim of this experiment is to assess the performance of the pretrained multilingual model `gliner_multi-v2.1` with the `gliner-spacy` library on a named entity recognition (NER) task. `gliner_multi-v2.1` is used as is, without fine-tuning, and is integrated into the pipeline with the following tuned hyperparameters, obtained with BERT:

| Chunk Size | Threshold |
| ---------- | --------- |
| 200        | 0.3       |

### Training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.83      | 0.84   | 0.83     |


In [5]:
import spacy
from gliner_spacy.pipeline import GlinerSpacy
from spacy import displacy
from IPython.display import HTML, display

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_multi-v2.1",
    "chunk_size": 200,
    "labels": ["person", "organization", "place", "date"],
    "style": "ent",
    "threshold": 0.3
}

# Initialize a blank German spaCy pipeline and add GLiNER
nlp = spacy.blank("de")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example German text for entity detection
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Visualize the entities using displacy (avoid IPython.core.display import path)
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(f'<span class="tex2jax_ignore">{html}</span>'))


Amedeo Modigliani person
12. Juli 1884 date
Livorno place
Italien place
Modigliani person
Pariser Kunstszene place
Pablo Picasso person
Constantin Brâncuși person
Modigliani person
Kunstgeschichte organization
weltweit place


### Observations

* Entity Identification
  
  * There is no distinction between GPE and LOC; every location is labeled "place". The pipeline is configured with the single label "place" rather than separate GPE and LOC labels.
  
  * The topic "Pariser Kunstszene" is identified as a place.
  
  * Typological entities that depend on context, such as "Künstler" (artist), are not recognized.

* Tokenization
  
  * Tokenization works correctly. For example, the painter is recognized as
  "person" both under the full name "Amedeo Modigliani" and under the surname "Modigliani" alone.

* Dataset and Multilingualism

  * Despite its lower reported training performance, `gliner_multi-v2.1` is a multilingual model trained on more resources than the standard pretrained model available in the spaCy library.
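
To make the differences between the two experiments easier to see, the outputs can be compared as sets. The sketch below is an illustration rather than part of the notebook: the entity lists are copied from the printed outputs of Experiments 1 and 2, while the `COMMON_LABELS` mapping and the `normalize` helper are assumptions introduced here to bring both label sets onto one schema.

```python
# Entity lists copied from the printed outputs of Experiment 1 and Experiment 2.
spacy_entities = [
    ("Amedeo Modigliani", "PER"), ("italienischer", "MISC"), ("Livorno", "LOC"),
    ("Italien", "LOC"), ("Modigliani", "PER"), ("Pariser", "LOC"),
    ("Pablo Picasso", "PER"), ("Constantin Brâncuși", "PER"),
]
gliner_entities = [
    ("Amedeo Modigliani", "person"), ("12. Juli 1884", "date"), ("Livorno", "place"),
    ("Italien", "place"), ("Modigliani", "person"), ("Pariser Kunstszene", "place"),
    ("Pablo Picasso", "person"), ("Constantin Brâncuși", "person"),
    ("Kunstgeschichte", "organization"), ("weltweit", "place"),
]

# Assumed mapping onto a shared schema (illustrative, not from the notebook).
COMMON_LABELS = {
    "PER": "PERSON", "LOC": "PLACE", "MISC": "MISC",
    "person": "PERSON", "place": "PLACE", "date": "DATE", "organization": "ORG",
}

def normalize(entities):
    """Map (text, label) pairs onto the shared schema and drop duplicates."""
    return {(text, COMMON_LABELS[label]) for text, label in entities}

spacy_set = normalize(spacy_entities)
gliner_set = normalize(gliner_entities)

shared = spacy_set & gliner_set        # found by both pipelines
only_spacy = spacy_set - gliner_set    # found by the baseline only
only_gliner = gliner_set - spacy_set   # found by GLiNER only

print("shared:", sorted(shared))
print("only spaCy:", sorted(only_spacy))
print("only GLiNER:", sorted(only_gliner))
```

An exact-match comparison like this treats "Pariser" and "Pariser Kunstszene" as different entities; a span-overlap comparison would be a reasonable alternative.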

## Experiment 3: spaCy GLiNER and BERT transformer

The aim of this experiment is to assess the performance of the pretrained multilingual model `gliner_multi-v2.1` together with the `bert_de_ner` model, using the `gliner-spacy` library, on a named entity recognition (NER) task. Both models are used as is, without fine-tuning, and are integrated into the pipeline with the following tuned hyperparameters, obtained with BERT:

| Chunk Size | Threshold |
| ---------- | --------- |
| 200        | 0.3       |

### GLiNER training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.83      | 0.84   | 0.83     |

### BERT DE NER training performance

| Precision | Recall | F1-Score |
| --------- | ------ | -------- |
| 0.81      | 0.84   | 0.82     |


In [1]:
import spacy
from spacy.tokens import DocBin
from spacy import displacy
from transformers import AutoTokenizer, AutoModelForTokenClassification
from gliner_spacy.pipeline import GlinerSpacy
from IPython.display import HTML, display

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_multi-v2.1",
    "chunk_size": 200,
    "labels": ["person", "organization", "place", "date"],
    "style": "ent",
    "threshold": 0.3
}

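# Load the transformer model and tokenizer.
# Note: they are only loaded here and are not applied to the text below,
# so the printed entities come from GLiNER alone.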
model_name = "fhswf/bert_de_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize a blank German spaCy pipeline and add GLiNER
nlp = spacy.blank("de")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example German text for entity detection
text = "Amedeo Modigliani war ein italienischer Künstler des 20. Jahrhunderts, der für seine einzigartigen und charakteristischen Porträts bekannt ist. Geboren am 12. Juli 1884 in Livorno, Italien, war Modigliani für seinen unverwechselbaren Stil und seine Vorliebe für lange, schlanke Figuren berühmt. Seine Werke zeichnen sich durch ihre vereinfachten Formen, starken Linien und ausdrucksstarken Gesichtszüge aus. Modigliani war Teil der Pariser Kunstszene des frühen 20. Jahrhunderts und wurde von Künstlern wie Pablo Picasso und Constantin Brâncuși beeinflusst. Seine Porträts zeugen von einer gewissen Melancholie und Intimität, die den Betrachter in ihren Bann ziehen. Trotz seines kurzen Lebens und seiner persönlichen Kämpfe mit Krankheit und Armut hinterließ Modigliani ein beeindruckendes künstlerisches Erbe. Sein einzigartiger Stil und seine künstlerische Vision haben seinen Platz in der Kunstgeschichte fest verankert, und seine Werke werden auch heute noch weltweit bewundert und geschätzt."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Visualize the entities using displacy (avoid IPython.core.display import path)
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(f'<span class="tex2jax_ignore">{html}</span>'))


Amedeo Modigliani person
12. Juli 1884 date
Livorno place
Italien place
Modigliani person
Pariser Kunstszene place
Pablo Picasso person
Constantin Brâncuși person
Modigliani person
Kunstgeschichte organization
weltweit place


### Observations

* Entity Identification
  
  * There is no distinction between GPE and LOC; every location is labeled "place". The pipeline is configured with the single label "place" rather than separate GPE and LOC labels.
  
  * The topic "Pariser Kunstszene" is identified as a place.
  
  * Typological entities that depend on context, such as "Künstler" (artist), are not recognized.

* Tokenization
  
  * Tokenization works correctly. For example, the painter is recognized as
  "person" both under the full name "Amedeo Modigliani" and under the surname "Modigliani" alone.

* Dataset and Multilingualism

  * Despite its lower reported training performance, `gliner_multi-v2.1` is a multilingual model trained on more resources than the standard pretrained model available in the spaCy library.
  
  * Adding the `bert_de_ner` model does not improve the results: the output is identical to Experiment 2. The code above only loads the model and tokenizer and never applies them to the text, so the entities still come from GLiNER alone. To gain a better overview, this task should be run against a proper text dataset.
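
Running the task against a proper dataset would mean scoring predictions against gold annotations. The sketch below shows one simple way to do this with exact matching on (text, label) pairs. It is an assumption-laden illustration: the function name `precision_recall_f1` is introduced here, and the `gold` set is hypothetical and hand-written for the example, not an annotation from this notebook.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold annotations, for illustration only.
gold = {
    ("Amedeo Modigliani", "person"),
    ("Livorno", "place"),
    ("Italien", "place"),
    ("12. Juli 1884", "date"),
}

# A few of the GLiNER predictions printed above.
predicted = {
    ("Amedeo Modigliani", "person"),
    ("Livorno", "place"),
    ("weltweit", "place"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction such as "Pariser Kunstszene" gets no credit for partially covering a gold span, so a span-overlap variant may be worth considering for a real evaluation.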

## Semantic enrichment: DBpedia + GeoNames

This section augments PERSON and PLACE entities detected by the experiments above with external knowledge:

- PERSON → DBpedia (birth date, birth place, short description)
- PLACE (GPE/LOC/place) → GeoNames (coordinates, population, feature codes)

Provide your GeoNames username below or in the `GEONAMES_USERNAME` environment variable, run one of the experiment cells to populate the entities (`doc` / `doc_de`), then run the enrichment cell.

In [None]:
# Imports and configuration for enrichment
import os
import time
from typing import Dict, Any, Optional

import pandas as pd
import requests
from SPARQLWrapper import SPARQLWrapper, JSON
from IPython.display import display

# Optional: Set your GeoNames username here or via environment variable
# Use the standard env var name and avoid hardcoded secrets
GEONAMES_USERNAME = os.getenv("GEONAMES_USERNAME", "")

# Simple in-memory caches to avoid repeated lookups within a session
_dbpedia_cache: Dict[str, Dict[str, Any]] = {}
_geonames_cache: Dict[str, Dict[str, Any]] = {}

In [20]:
# DBpedia enrichment for PERSON entities

def query_dbpedia_person(name: str, sleep: float = 0.2) -> Optional[Dict[str, Any]]:
    """
    Query DBpedia for a person by label (case-insensitive).
    Returns minimal profile: uri, label, birth_date, birth_place, description.
    """
    if not name:
        return None

    key = name.strip().lower()
    if key in _dbpedia_cache:
        return _dbpedia_cache[key]

    endpoint = "https://dbpedia.org/sparql"
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)

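    # Match de/en labels case-insensitively; prefer abstracts in de or en.
    # Escape backslashes first, then double quotes, so the name stays a valid SPARQL literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')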
    query = f"""
    SELECT ?person ?label ?birthDate ?birthPlaceLabel ?abstract WHERE {{
      ?person a dbo:Person ; rdfs:label ?label .
      FILTER(LCASE(STR(?label)) = LCASE(STR(\"{escaped}\")))
      OPTIONAL {{ ?person dbo:birthDate ?birthDate . }}
      OPTIONAL {{
        ?person dbo:birthPlace ?birthPlace .
        ?birthPlace rdfs:label ?birthPlaceLabel .
        FILTER (langMatches(lang(?birthPlaceLabel), \"de\") || langMatches(lang(?birthPlaceLabel), \"en\"))
      }}
      OPTIONAL {{
        ?person dbo:abstract ?abstract .
        FILTER (langMatches(lang(?abstract), \"de\") || langMatches(lang(?abstract), \"en\"))
      }}
      FILTER (lang(?label) = \"de\" || lang(?label) = \"en\")
    }} LIMIT 5
    """

    try:
        sparql.setQuery(query)
        time.sleep(sleep)  # be polite to the endpoint
        results = sparql.query().convert()
        bindings = results.get("results", {}).get("bindings", [])
        if not bindings:
            _dbpedia_cache[key] = {}
            return {}

        # Heuristic: pick the first row that has either birthDate or abstract
        chosen = None
        for row in bindings:
            if "birthDate" in row or "abstract" in row:
                chosen = row
                break
        if chosen is None:
            chosen = bindings[0]

        def _val(r, k):
            return r.get(k, {}).get("value") if r.get(k) else None

        record = {
            "uri": _val(chosen, "person"),
            "label": _val(chosen, "label") or name,
            "birth_date": _val(chosen, "birthDate"),
            "birth_place": _val(chosen, "birthPlaceLabel"),
            "description": _val(chosen, "abstract"),
            "source": "dbpedia"
        }
        _dbpedia_cache[key] = record
        return record
    except Exception as e:
        # Fail gracefully
        _dbpedia_cache[key] = {"error": str(e), "source": "dbpedia"}
        return _dbpedia_cache[key]

In [21]:
# GeoNames enrichment for PLACE entities

def query_geonames_place(name: str, username: Optional[str] = None, sleep: float = 0.2) -> Optional[Dict[str, Any]]:
    """
    Query GeoNames for a place by name. Requires a GeoNames username (free).
    Returns minimal profile: name, countryCode, lat, lng, population, fcode/fcl, geonameId.
    """
    if not name:
        return None

    key = name.strip().lower()
    if key in _geonames_cache:
        return _geonames_cache[key]

    username = username or GEONAMES_USERNAME
    if not username:
        # No credentials provided; return a stub so the pipeline can continue
        stub = {"warning": "Missing GeoNames username. Set GEONAMES_USERNAME.", "source": "geonames"}
        _geonames_cache[key] = stub
        return stub

    url = "http://api.geonames.org/searchJSON"
    params = {
        "q": name,
        "maxRows": 1,
        "fuzzy": 0.8,
        "orderby": "relevance",
        "featureClass": ["P", "A"],  # Populated places and administrative areas
        "lang": "de",
        "username": username,
    }

    try:
        time.sleep(sleep)
        resp = requests.get(url, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        geos = data.get("geonames", [])
        if not geos:
            _geonames_cache[key] = {}
            return {}
        g = geos[0]
        record = {
            "name": g.get("name"),
            "countryCode": g.get("countryCode"),
            "lat": g.get("lat"),
            "lng": g.get("lng"),
            "population": g.get("population"),
            "fcode": g.get("fcode"),
            "fcl": g.get("fcl"),
            "geonameId": g.get("geonameId"),
            "source": "geonames"
        }
        _geonames_cache[key] = record
        return record
    except Exception as e:
        _geonames_cache[key] = {"error": str(e), "source": "geonames"}
        return _geonames_cache[key]

In [22]:
# Integration helpers: collect entities from spaCy/GLiNER docs and enrich

from collections import OrderedDict

# Map diverse label sets to a common schema
_LABEL_MAP = {
    # spaCy (de_core_news_lg)
    "PERSON": "PERSON",
    "PER": "PERSON",
    "GPE": "PLACE",
    "LOC": "PLACE",
    # GLiNER labels
    "person": "PERSON",
    "place": "PLACE",
}

def _normalize_label(label: str) -> Optional[str]:
    return _LABEL_MAP.get(label, None)


def collect_entities_from_notebook() -> pd.DataFrame:
    """Gather entities from variables defined in earlier experiments (doc_de, doc)."""
    rows = []
    sources = []

    if "doc_de" in globals():
        try:
            for ent in doc_de.ents:
                norm = _normalize_label(ent.label_)
                if norm:
                    rows.append({"entity": ent.text, "label": norm, "source": "spacy_de"})
            sources.append("doc_de")
        except Exception:
            pass

    if "doc" in globals():
        try:
            for ent in doc.ents:
                norm = _normalize_label(ent.label_)
                if norm:
                    rows.append({"entity": ent.text, "label": norm, "source": "gliner"})
            sources.append("doc")
        except Exception:
            pass

    if not rows:
        return pd.DataFrame(columns=["entity", "label", "source"])  # empty; instruct user to run experiments first

    # Deduplicate by entity text + label
    seen = set()
    uniq_rows = []
    for r in rows:
        key = (r["entity"].strip().lower(), r["label"])
        if key not in seen:
            seen.add(key)
            uniq_rows.append(r)

    return pd.DataFrame(uniq_rows)


def enrich_entities(df: pd.DataFrame, geonames_username: Optional[str] = None) -> pd.DataFrame:
    """
    For each PERSON and PLACE entity, call the respective enrichment functions
    and merge the results into a single DataFrame for display.
    """
    if df.empty:
        return df

    # Apply enrichment per row
    enriched = []
    for _, row in df.iterrows():
        entity = row["entity"].strip()
        label = row["label"]
        payload: Dict[str, Any] = {"entity": entity, "label": label, "source": row["source"]}
        if label == "PERSON":
            info = query_dbpedia_person(entity) or {}
            payload.update({
                "dbpedia_uri": info.get("uri"),
                "birth_date": info.get("birth_date"),
                "birth_place": info.get("birth_place"),
                "description": info.get("description"),
            })
        elif label == "PLACE":
            info = query_geonames_place(entity, username=geonames_username) or {}
            payload.update({
                "lat": info.get("lat"),
                "lng": info.get("lng"),
                "population": info.get("population"),
                "fcode": info.get("fcode"),
                "fcl": info.get("fcl"),
                "countryCode": info.get("countryCode"),
                "geonameId": info.get("geonameId"),
            })
        enriched.append(payload)

    return pd.DataFrame(enriched)


def enrich_and_display(geonames_username: Optional[str] = None) -> None:
    """One-shot helper: collect entities from notebook, enrich, and display."""
    base = collect_entities_from_notebook()
    if base.empty:
        print("No entities found. Run one of the experiment cells first to populate 'doc'/'doc_de'.")
        return

    # Show the original NER output first
    display(base.reset_index(drop=True).style.set_caption("Original NER output (normalized)"))

    out = enrich_entities(base, geonames_username or GEONAMES_USERNAME)

    # Order columns nicely
    person_cols = ["entity", "label", "source", "dbpedia_uri", "birth_date", "birth_place", "description"]
    place_cols = ["entity", "label", "source", "lat", "lng", "population", "fcode", "fcl", "countryCode", "geonameId"]

    # Split and display
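    # reindex tolerates columns that are absent when one entity type is missing
    persons = out[out["label"] == "PERSON"].reindex(columns=person_cols)
    places = out[out["label"] == "PLACE"].reindex(columns=place_cols)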

    if not persons.empty:
        display(persons.reset_index(drop=True).style.set_caption("PERSON entities enriched with DBpedia"))
    else:
        print("No PERSON entities to enrich.")

    if not places.empty:
        display(places.reset_index(drop=True).style.set_caption("PLACE entities enriched with GeoNames"))
    else:
        print("No PLACE entities to enrich.")

    # Also show the combined output for convenience
    display(out.reset_index(drop=True))

In [23]:
# Usage: set (or override) your GeoNames username here, then run enrichment
# GEONAMES_USERNAME = "<your_geonames_username>"

# Example runner (safe to execute even if the username isn't set; the GeoNames columns then stay empty)
# enrich_and_display(geonames_username=GEONAMES_USERNAME)

In [24]:
# Run semantic enrichment and display results
# If GEONAMES_USERNAME is not set, PLACE rows come back with empty GeoNames columns; persons are still enriched
enrich_and_display(geonames_username=GEONAMES_USERNAME)

|     | entity              | label  | source |
| --- | ------------------- | ------ | ------ |
| 0   | Amedeo Modigliani   | PERSON | gliner |
| 1   | Livorno             | PLACE  | gliner |
| 2   | Italien             | PLACE  | gliner |
| 3   | Modigliani          | PERSON | gliner |
| 4   | Pariser Kunstszene  | PLACE  | gliner |
| 5   | Pablo Picasso       | PERSON | gliner |
| 6   | Constantin Brâncuși | PERSON | gliner |
| 7   | weltweit            | PLACE  | gliner |


|     | entity              | label  | source | dbpedia_uri | birth_date | birth_place | description |
| --- | ------------------- | ------ | ------ | ----------- | ---------- | ----------- | ----------- |
| 0   | Amedeo Modigliani   | PERSON | gliner |             |            |             |             |
| 1   | Modigliani          | PERSON | gliner |             |            |             |             |
| 2   | Pablo Picasso       | PERSON | gliner |             |            |             |             |
| 3   | Constantin Brâncuși | PERSON | gliner | http://dbpedia.org/resource/Constantin_Brâncuși | 1876-02-19 | Peștișani | Constantin Brâncuși (Romanian: [konstanˈtin brɨŋˈkuʃʲ]; February 19, 1876 – March 16, 1957) was a Romanian sculptor, painter and photographer who made his career in France. Considered one of the most influential sculptors of the 20th-century and a pioneer of modernism, Brâncuși is called the patriarch of modern sculpture. As a child he displayed an aptitude for carving wooden farm tools. Formal studies took him first to Bucharest, then to Munich, then to the École des Beaux-Arts in Paris from 1905 to 1907. His art emphasizes clean geometrical lines that balance forms inherent in his materials with the symbolic allusions of representational art. Brâncuși sought inspiration in non-European cultures as a source of primitive exoticism, as did Paul Gauguin, Pablo Picasso, André Derain and others. However, other influences emerge from Romanian folk art traceable through Byzantine and Dionysian traditions. |


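|     | entity             | label | source | lat | lng | population | fcode | fcl | countryCode | geonameId |
| --- | ------------------ | ----- | ------ | --- | --- | ---------- | ----- | --- | ----------- | --------- |
| 0   | Livorno            | PLACE | gliner |     |     |            |       |     |             |           |
| 1   | Italien            | PLACE | gliner |     |     |            |       |     |             |           |
| 2   | Pariser Kunstszene | PLACE | gliner |     |     |            |       |     |             |           |
| 3   | weltweit           | PLACE | gliner |     |     |            |       |     |             |           |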


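|     | entity              | label  | source | dbpedia_uri | birth_date | birth_place | description | lat | lng | population | fcode | fcl | countryCode | geonameId |
| --- | ------------------- | ------ | ------ | ----------- | ---------- | ----------- | ----------- | --- | --- | ---------- | ----- | --- | ----------- | --------- |
| 0   | Amedeo Modigliani   | PERSON | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 1   | Livorno             | PLACE  | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 2   | Italien             | PLACE  | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 3   | Modigliani          | PERSON | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 4   | Pariser Kunstszene  | PLACE  | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 5   | Pablo Picasso       | PERSON | gliner |             |            |             |             |     |     |            |       |     |             |           |
| 6   | Constantin Brâncuși | PERSON | gliner | http://dbpedia.org/resource/Constantin_Brâncuși | 1876-02-19 | Peștișani | Constantin Brâncuși (Romanian: [konstanˈtin br... | | | | | | | |
| 7   | weltweit            | PLACE  | gliner |             |            |             |             |     |     |            |       |     |             |           |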
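
The enrichment output above can be condensed into a fill rate per entity type, which makes it easier to see how many lookups actually returned data. The following is a minimal sketch, not part of the notebook: `fill_rate` is a hypothetical helper, and the rows are a hand-copied reduction of the combined table, keeping only the identifier columns.

```python
def fill_rate(rows, label, key):
    """Return (filled, total) for rows with the given label whose `key` is set."""
    selected = [row for row in rows if row["label"] == label]
    filled = sum(1 for row in selected if row.get(key))
    return filled, len(selected)

# Reduced copy of the combined enrichment output shown above.
rows = [
    {"entity": "Amedeo Modigliani", "label": "PERSON", "dbpedia_uri": None},
    {"entity": "Livorno", "label": "PLACE", "geonameId": None},
    {"entity": "Italien", "label": "PLACE", "geonameId": None},
    {"entity": "Modigliani", "label": "PERSON", "dbpedia_uri": None},
    {"entity": "Pariser Kunstszene", "label": "PLACE", "geonameId": None},
    {"entity": "Pablo Picasso", "label": "PERSON", "dbpedia_uri": None},
    {"entity": "Constantin Brâncuși", "label": "PERSON",
     "dbpedia_uri": "http://dbpedia.org/resource/Constantin_Brâncuși"},
    {"entity": "weltweit", "label": "PLACE", "geonameId": None},
]

persons_filled, persons_total = fill_rate(rows, "PERSON", "dbpedia_uri")
places_filled, places_total = fill_rate(rows, "PLACE", "geonameId")
print(f"PERSON: {persons_filled}/{persons_total} enriched")
print(f"PLACE: {places_filled}/{places_total} enriched")
```

In this run only one of the four PERSON entities and none of the PLACE entities were enriched. Because the query helpers store failures in the cache instead of raising, an empty row may stand for a missing match, a missing GeoNames username, or a request error.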


## Integrated pipeline: benefits and challenges

- Benefits
  - Richer context: augments bare NER spans with factual attributes (people: birth data, description; places: coordinates/population).
  - Better downstream utility: supports linking, search faceting, geo-visualization, and disambiguation.
  - Modular design: DBpedia and GeoNames lookups are decoupled and cached; easy to swap endpoints or extend for ORG.

- Challenges
  - Ambiguity: name-only matching can return wrong entities (e.g., homonyms). Production systems pair names with context (co-occurring entities, years) or use entity linkers.
  - Coverage and language: DBpedia labels/abstracts vary across languages; fallback between de/en is implemented but not foolproof.
  - Rate limits and availability: public SPARQL endpoints and GeoNames enforce quotas; apply backoff/caching and consider mirroring.
  - Credentials: GeoNames requires a username; store it in env vars, not in code.
  - Performance: network latency can dominate; batch or async lookups and caching are recommended for large corpora.

- Next steps
  - Add an entity disambiguation step (e.g., DBpedia Spotlight or spaCy–Wikidata linker) before enrichment.
  - Extend enrichment for ORGANIZATION via Wikidata/DBpedia and for DATES via normalization.
  - Persist canonical IDs (DBpedia URI, GeoNames ID) and build a knowledge graph or index for reuse.
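
The backoff recommended under "Rate limits and availability" can be sketched as a small retry wrapper. This is an illustrative assumption rather than code from the notebook: `call_with_backoff` and `flaky_lookup` are hypothetical names, and the demo uses a stand-in function instead of a real DBpedia or GeoNames request. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry.

    The last exception is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Stand-in for a network lookup that fails twice before succeeding.
attempts = {"count": 0}

def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"name": "Livorno"}

waits = []  # record the delays instead of sleeping
result = call_with_backoff(flaky_lookup, sleep=waits.append)
print(result, waits)
```

Note that `query_dbpedia_person` and `query_geonames_place` catch exceptions internally and cache the error, so they would need to let request errors propagate before a wrapper like this could retry them.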

In [None]:
# Utility: sanitize this notebook to remove invalid widget metadata for GitHub rendering
import json
from pathlib import Path

# Path relative to the working directory; adjust if the notebook lives elsewhere
nb_path = Path("NER_proof_of_concept.ipynb")

def sanitize_widgets(notebook_path: Path) -> None:
    with notebook_path.open("r", encoding="utf-8") as f:
        nb = json.load(f)
    # Remove top-level widgets metadata if present
    md = nb.get("metadata", {})
    if isinstance(md, dict) and "widgets" in md:
        md.pop("widgets", None)
        nb["metadata"] = md
    # Remove per-cell widgets metadata if present
    for cell in nb.get("cells", []):
        cmeta = cell.get("metadata")
        if isinstance(cmeta, dict) and "widgets" in cmeta:
            cmeta.pop("widgets", None)
            cell["metadata"] = cmeta
    with notebook_path.open("w", encoding="utf-8") as f:
        json.dump(nb, f, ensure_ascii=False, indent=2)
    print("Notebook sanitized: removed metadata.widgets if present.")

sanitize_widgets(nb_path)