# Entity Linking benchmark
Results:

|                                  | `spaCy` default                                                                                                                         | `spacyfishing`                                                | `spacy_entity_linker`                                                 | `spacy_dbpedia_spotlight`                                             |
|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------|
| part of spacy pipe | yes | yes | no | yes |
| includes its own NER | no | no | no | yes |
| disambiguates against | custom KB | WikiData | WikiData | DbPedia |
| based on | [spaCy custom entity linking architecture](https://spacy.io/api/architectures#entitylinker) | [entity-fishing at WikiDataCon 2017](https://grobid.s3.amazonaws.com/presentations/29-10-2017.pdf) | requires reading the code | [DBpedia Spotlight](https://www.dbpedia-spotlight.org/) |
| inference time on test paragraph on Macbook Pro M1 no GPU | (no training data available) | 3.67s | 2.80s + optional subsequent linking time | 3.73s |
| disambiguated 'DC' to | (no training data available) | DC current (wrong) | Washington DC (wrong) | DC comics (correct) |
| comment | can bring your own data but requires training |   | oldest implementation | Best  eye-balled performance on my use case |

In [11]:
import json
from typing import cast
from IPython.display import display, Markdown

from icecream import ic
import spacy
import textgraphs
import spacy_entity_linker as sel
import spacy_dbpedia_spotlight

In [2]:
with open("../data/wiki_guardians.json", "r") as fh:
    text: str = json.load(fh)["text"]
    paragraph = text.split("\n\n\n")[0]

## Using Spacyfishing

In [47]:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

nlp.add_pipe(
    "entityfishing",
    config = {
        "api_ef_base": "https://cloud.science-miner.com/nerd/service",
        "extra_info": True,
        "filter_statements": [ ],
    },
)

nlp.add_pipe(
    "merge_entities",
)

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>

In [48]:
doc = nlp(paragraph)

In [50]:
for ent in doc.ents:
    print(
        ent.text,
        ent.label_,
        ent._.kb_qid, # Wikidata identifier (QID).
        ent._.nerd_score, #  Selection confidence score for the disambiguated entity.
        ent._.url_wikidata, # URL to Wikidata ressource.
    )

Guardians of the Galaxy Vol. 3 MEDIA Q20001199 0.9158 https://www.wikidata.org/wiki/Q20001199
Marvel Comics ORG Q173496 0.4646 https://www.wikidata.org/wiki/Q173496
Guardians of the Galaxy MEDIA Q5887360 0.8887 https://www.wikidata.org/wiki/Q5887360
Marvel Studios ORG Q367466 0.9135 https://www.wikidata.org/wiki/Q367466
Walt Disney Studios Motion Pictures ORG Q1323594 0.9101 https://www.wikidata.org/wiki/Q1323594
Guardians of the Galaxy MEDIA Q5887360 0.8887 https://www.wikidata.org/wiki/Q5887360
Guardians of the Galaxy Vol. 2 MEDIA Q20001199 0.9103 https://www.wikidata.org/wiki/Q20001199
Marvel Cinematic Universe MEDIA Q642878 0.9137 https://www.wikidata.org/wiki/Q642878
James Gunn PER Q717015 0.8888 https://www.wikidata.org/wiki/Q717015
Chris Pratt PER Q503706 0.9086 https://www.wikidata.org/wiki/Q503706
Zoe Saldaña PER Q190162 0.9098 https://www.wikidata.org/wiki/Q190162
Dave Bautista PER Q44158 0.9098 https://www.wikidata.org/wiki/Q44158
Karen Gillan PER Q231237 0.9098 https://www.

In [26]:
spacy.displacy.render(
    doc,
    style = "ent",
    jupyter = True,
)

## Using `spaCy-entity-linker`

In [43]:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

<span_marker.spacy_integration.SpacySpanMarkerWrapper at 0x30feda5d0>

In [44]:
doc = nlp(paragraph)

In [45]:
classifier = sel.EntityClassifier.EntityClassifier()

for ent in doc.ents:
    term: sel.TermCandidate = sel.TermCandidate.TermCandidate(ent)
    candidates: sel.EntityCandidates.EntityCandidates = term.get_entity_candidates()
    if len(candidates) > 0:
        entity: sel.EntityElement.EntityElement = classifier(candidates)
    print(
        ent.text,
        ent.label_,
        entity.__dict__['url'],
    )

Guardians of the Galaxy Vol. 3 MEDIA https://www.wikidata.org/wiki/Q19020
Marvel Comics ORG https://www.wikidata.org/wiki/Q173496
Guardians of the Galaxy MEDIA https://www.wikidata.org/wiki/Q5887360
Marvel Studios ORG https://www.wikidata.org/wiki/Q367466
Walt Disney Studios Motion Pictures ORG https://www.wikidata.org/wiki/Q1323594
Guardians of the Galaxy MEDIA https://www.wikidata.org/wiki/Q5887360
Guardians of the Galaxy Vol. 2 MEDIA https://www.wikidata.org/wiki/Q20001199
Marvel Cinematic Universe MEDIA https://www.wikidata.org/wiki/Q642878
James Gunn PER https://www.wikidata.org/wiki/Q717015
Chris Pratt PER https://www.wikidata.org/wiki/Q503706
Zoe Saldaña PER https://www.wikidata.org/wiki/Q503706
Dave Bautista PER https://www.wikidata.org/wiki/Q44158
Karen Gillan PER https://www.wikidata.org/wiki/Q231237
Pom Klementieff PER https://www.wikidata.org/wiki/Q3395911
Vin Diesel PER https://www.wikidata.org/wiki/Q178166
Bradley Cooper PER https://www.wikidata.org/wiki/Q205707
Will Poul

In [46]:
spacy.displacy.render(
    doc,
    style = "ent",
    jupyter = True,
)

## Using `spacy-dbpedia-spotlight`
I am getting hit instantly with 
```
HTTPError: 502 Server Error: Proxy Error for url: https://api.dbpedia-spotlight.org/en/annotate
```
So I'm running DBpedia locally with docker
```sh
docker pull dbpedia/dbpedia-spotlight
docker volume create spotlight-models
docker run -it \
 --restart unless-stopped \
 --name dbpedia-spotlight.en \
 --mount source=spotlight-models,target=/opt/spotlight \
 -p 2222:80 \
 dbpedia/dbpedia-spotlight \
 spotlight.sh en
```
I needed 16 GB of RAM on Docker Desktop

In [36]:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})
# add the pipeline stage
nlp.add_pipe(
    'dbpedia_spotlight',
    config={
        'dbpedia_rest_endpoint': 'http://localhost:2222/rest', # for local docker DBpedia
        # 'dbpedia_rest_endpoint': 'https://api.dbpedia-spotlight.org/en', # for default with spacy-dbpedia-spotlight
        'overwrite_ents': True  # by default is set to True
    }  
)

<spacy_dbpedia_spotlight.entity_linker.EntityLinker at 0x32e8d3890>

In [37]:
# get the document
doc = nlp(paragraph)

In [38]:
for ent in doc.ents:
    print(
        ent.text,
        ent.label_,
        ent.kb_id_, 
        ent._.dbpedia_raw_result['@similarityScore']
    )

superhero film DBPEDIA_ENT http://dbpedia.org/resource/Superhero_film 1.0
Marvel Comics DBPEDIA_ENT http://dbpedia.org/resource/Marvel_Comics 0.999996715991464
superhero DBPEDIA_ENT http://dbpedia.org/resource/Superhero_film 1.0
Marvel Studios DBPEDIA_ENT http://dbpedia.org/resource/Marvel_Studios 1.0
Walt Disney Studios Motion Pictures DBPEDIA_ENT http://dbpedia.org/resource/Walt_Disney_Studios_Motion_Pictures 1.0
Marvel Cinematic Universe DBPEDIA_ENT http://dbpedia.org/resource/Marvel_Cinematic_Universe 1.0
James Gunn DBPEDIA_ENT http://dbpedia.org/resource/James_Gunn 1.0
ensemble cast DBPEDIA_ENT http://dbpedia.org/resource/Ensemble_cast 1.0
Chris Pratt DBPEDIA_ENT http://dbpedia.org/resource/Chris_Pratt 1.0
Zoe Saldaña DBPEDIA_ENT http://dbpedia.org/resource/Zoe_Saldaña 0.9999999882111298
Dave Bautista DBPEDIA_ENT http://dbpedia.org/resource/Dave_Bautista 1.0
Gillan DBPEDIA_ENT http://dbpedia.org/resource/Karen_Gillan 1.0
Pom Klementieff DBPEDIA_ENT http://dbpedia.org/resource/Pom_

In [39]:
spacy.displacy.render(
    doc,
    style = "ent",
    jupyter = True,
)

## Using spacy default
- [Usage](https://spacy.io/usage/linguistic-features#entity-linking)
- [Architecture](https://spacy.io/api/architectures#entitylinker)
- [API docs](https://spacy.io/api/entitylinker)
- [tutorial](https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson) and [video](https://www.youtube.com/watch?v=8u57WSXVpmw)

In [61]:
from spacy.kb import KnowledgeBase, InMemoryLookupKB
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=96)

In [62]:
import csv
from pathlib import Path

def load_entities():
    entities_loc = Path.cwd().parent / "data" / "kb" / "entities.csv"  # distributed alongside this notebook

    names = dict()
    descriptions = dict()
    with entities_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            qid = row[0]
            name = row[1]
            desc = row[2]
            names[qid] = name
            descriptions[qid] = desc
    return names, descriptions

In [63]:
name_dict, desc_dict = load_entities()

for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)   # 342 is an arbitrary value here

In [64]:
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])   # 100% prior probability P(entity|alias)

In [66]:
# adding aliases for DC
kb.add_alias(alias="DC", entities=["Q2924461", "Q1152150", "Q61", "Q159241"], probabilities=[0.20]*4)

10856056131304768290

In [73]:
[c.entity_ for c in kb.get_alias_candidates('DC')]

['Q2924461', 'Q1152150', 'Q61', 'Q159241']

In [69]:
import os
output_dir = Path.cwd().parent / "data" / "kb" / "my_output"
if not os.path.exists(output_dir):
    os.mkdir(output_dir) 
kb.to_disk(output_dir / "my_kb")

In [70]:
nlp.to_disk(output_dir / "my_nlp")

## Using integrated library `textgraphs`


In [12]:
tg: textgraphs.TextGraphs = textgraphs.TextGraphs(
     factory = textgraphs.PipelineFactory(
        ner = textgraphs.NERSpanMarker(ner_model="tomaarsen/span-marker-mbert-base-multinerd"),
        kg = textgraphs.KGWikiMedia(
            spotlight_api = textgraphs.DBPEDIA_SPOTLIGHT_API,
            dbpedia_search_api = textgraphs.DBPEDIA_SEARCH_API,
            dbpedia_sparql_api = textgraphs.DBPEDIA_SPARQL_API,
            wikidata_api = textgraphs.WIKIDATA_API,
            min_alias = textgraphs.DBPEDIA_MIN_ALIAS,
            min_similarity = textgraphs.DBPEDIA_MIN_SIM,
        ),
        infer_rels = [
    		textgraphs.InferRel_OpenNRE(
                model = textgraphs.OPENNRE_MODEL,
                max_skip = textgraphs.MAX_SKIP,
                min_prob = textgraphs.OPENNRE_MIN_PROB,
    		),
            textgraphs.InferRel_Rebel(
                lang = "en_XX",
                mrebel_model = textgraphs.MREBEL_MODEL,
            ),
        ],
     )
)

In [13]:
pipe: textgraphs.Pipeline = tg.create_pipeline(paragraph)

TODO:

- ( ) understand "perform entity linking: spaCy-DBpedia-Spotlight, WikiMedia API, etc."
- ( ) understand "infer relations, plus graph inference: REBEL, OpenNRE, qwikidata, etc."
- ( ) understand "approximate a pareto archive (hypervolume) to re-rank extracted entities with pulp"


In [14]:
tg.collect_graph_elements(
    pipe,
    debug = False,
)

IndexError: list index out of range

In [15]:
tg.perform_entity_linking(
    pipe,
    debug = False,
)

ic| ex: KeyError('redirectlabel')
Traceback (most recent call last):
  File "/Users/louis.guitton/workspace/mlops-talk-llm-kg/venv/lib/python3.11/site-packages/textgraphs/kg.py", line 704, in dbpedia_search_entity
    for alias in hit["redirectlabel"]
                 ~~~^^^^^^^^^^^^^^^^^
KeyError: 'redirectlabel'
ic| ex: KeyError('redirectlabel')
Traceback (most recent call last):
  File "/Users/louis.guitton/workspace/mlops-talk-llm-kg/venv/lib/python3.11/site-packages/textgraphs/kg.py", line 704, in dbpedia_search_entity
    for alias in hit["redirectlabel"]
                 ~~~^^^^^^^^^^^^^^^^^
KeyError: 'redirectlabel'


AttributeError: 'NoneType' object has no attribute 'prob'