# Introduction to Entity Linking

**Entity Linking** (EL) is the task of resolving ambiguous textual mentions to unique concepts in a knowledge base. A related task is **Named Entity Recognition** (NER): an NER component identifies spans of text that name real-world objects, such as people or organizations. spaCy offers pre-trained machine learning models that perform NER for a variety of languages (https://spacy.io/models).

!pip install spacy==3.0.6
!pip install spacy-lookups-data
!python -m spacy download en_core_web_lg

In [1]:
import numpy as np
import pandas as pd
import spacy
from gu_model.trf_tensor_to_vec import *



In [2]:
## Load spacy pipeline

#nlp = spacy.load("en_core_web_lg")
nlp = spacy.load('gu_model/en_ner_guardian-1.0.3/en_ner_guardian/en_ner_guardian-1.0.3',
                     disable=['transformer', 'tagger', 'parser', 'lemmatizer', 'attribute_ruler'])
nlp.add_pipe('tensor2attr')

text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)

# Find nlp model embedding dimensions
embedding_dims=len(doc.vector)



In [3]:
# Test spacy model
for ent in doc.ents:
    print(f"Named Entity '{ent.text}' with label '{ent.label_}'")

Named Entity 'Emerson' with label 'PERSON'


# Creating the Knowledge Base 

The first step in entity linking is to set up a knowledge base that contains the unique identifiers of the entities we are interested in. We load the data from a pre-defined CSV file; as the output further down shows, the `full` dataset contains 428,519 entities.

In [4]:
dataset = 'full'  # or 'open_sanctions' or 'lilsis'
kb_iteration='_2022_11_03'

In [5]:
# Import kb dataset

In [6]:
data=pd.read_csv(f'kb_datasets/kb_entities_{dataset}{kb_iteration}.csv',index_col=0)



In [7]:
data.head()

Unnamed: 0,original_index,id,name,AKA,birthdate,deathdate,wikidataId,website,desc,kb_origin,birthplace,desc_len,kb_url
0,0,acf-00040861bc3f593000830d987d09967ef3503ef1,Kolyvanov Egor,,1980-11-15,,,,Kolyvanov Egor is a Russian propagandist: host...,open_sanctions,,228,https://www.opensanctions.org/entities/acf-000...
1,1,acf-0011c68a768924609dc5da5707ac7fa4c4d645a2,Shipov Sergei Yurievich,,1966-04-17,,,,Shipov Sergei Yurievich is a Russian chess pla...,open_sanctions,,258,https://www.opensanctions.org/entities/acf-001...
2,2,acf-001e7e4c0363f08f1e784c230457960b84a6416f,Egorov Ivan Mikhailovich,,1961-01-21,,,,Egorov Ivan Mikhailovich is a Deputy of the St...,open_sanctions,,344,https://www.opensanctions.org/entities/acf-001...
3,3,acf-002c208139012c8d93b6298358188d7cadafe648,Goreslavsky Alexey Sergeyevich,,1977-07-13,,,,Goreslavsky Alexey Sergeyevich is a Russian jo...,open_sanctions,,773,https://www.opensanctions.org/entities/acf-002...
4,4,acf-002cc8fdf8fe41185091a7cb6c598663e7a22eb5,Samoilova Natalya Vladimirovna,,1987-06-24,,,,Samoilova Natalya Vladimirovna is a Russian si...,open_sanctions,,302,https://www.opensanctions.org/entities/acf-002...


In [8]:
kb_data=data[['id','name','desc']]

In [9]:
kb_data.head()

Unnamed: 0,id,name,desc
0,acf-00040861bc3f593000830d987d09967ef3503ef1,Kolyvanov Egor,Kolyvanov Egor is a Russian propagandist: host...
1,acf-0011c68a768924609dc5da5707ac7fa4c4d645a2,Shipov Sergei Yurievich,Shipov Sergei Yurievich is a Russian chess pla...
2,acf-001e7e4c0363f08f1e784c230457960b84a6416f,Egorov Ivan Mikhailovich,Egorov Ivan Mikhailovich is a Deputy of the St...
3,acf-002c208139012c8d93b6298358188d7cadafe648,Goreslavsky Alexey Sergeyevich,Goreslavsky Alexey Sergeyevich is a Russian jo...
4,acf-002cc8fdf8fe41185091a7cb6c598663e7a22eb5,Samoilova Natalya Vladimirovna,Samoilova Natalya Vladimirovna is a Russian si...


In [10]:
kb_data.shape

(428519, 3)

In [11]:
## Generate synthetic aliases 

In [12]:
aliases_data=kb_data[kb_data['name'].duplicated(keep=False)].sort_values(['name'])

In [13]:
aliases_data['id']=aliases_data['id'].astype(str)

In [14]:
alias_dict={}
for alias in aliases_data['name'].unique():
    alias_dict[alias]=list(aliases_data.loc[aliases_data['name']==alias, 'id'].values)

In [15]:
kb_data['name'].value_counts()#.to_csv('value_counts.csv')

David Smith                 13
Mark Smith                  13
David Wilson                13
John Williams               12
Robert Smith                12
                            ..
Véronique Albanel            1
David Weytsman               1
Hiroshi Oka                  1
Andrew Cray                  1
Jalbasürengiin Batzandan     1
Name: name, Length: 414002, dtype: int64

In [16]:
# Preview the KB data with the `id` column renamed to `qid`, as in the tutorial.
# rename() returns a new DataFrame; kb_data itself is left unchanged.
kb_data.rename(columns={'id': 'qid'})

Unnamed: 0,qid,name,desc
0,acf-00040861bc3f593000830d987d09967ef3503ef1,Kolyvanov Egor,Kolyvanov Egor is a Russian propagandist: host...
1,acf-0011c68a768924609dc5da5707ac7fa4c4d645a2,Shipov Sergei Yurievich,Shipov Sergei Yurievich is a Russian chess pla...
2,acf-001e7e4c0363f08f1e784c230457960b84a6416f,Egorov Ivan Mikhailovich,Egorov Ivan Mikhailovich is a Deputy of the St...
3,acf-002c208139012c8d93b6298358188d7cadafe648,Goreslavsky Alexey Sergeyevich,Goreslavsky Alexey Sergeyevich is a Russian jo...
4,acf-002cc8fdf8fe41185091a7cb6c598663e7a22eb5,Samoilova Natalya Vladimirovna,Samoilova Natalya Vladimirovna is a Russian si...
...,...,...,...
428514,Q4354299,Cory Bernardi,Cory Bernardi is a Australian politician and r...
428515,Q47668202,Jalbasürengiin Batzandan,Jalbasürengiin Batzandan is a Mongolian politi...
428516,Q5997832,Patrick Murphy,Patrick Murphy is a former US Representative f...
428517,Q28033808,Sharif Street,Sharif Street is a American politician from Pe...


In [17]:
list(kb_data.loc[kb_data['name']=='Ed Williams','desc'])

['Ed Williams is a Prospective Parliamentary Candidate for Meriden.']

In [18]:
#import csv
from pathlib import Path

def load_entities(kb_data):
    names = dict()
    descriptions = dict()

    # The columns are (id, name, desc); unpack each row by position.
    for qid, name, desc in kb_data.itertuples(index=False, name=None):
        names[str(qid)] = str(name)
        descriptions[str(qid)] = str(desc)
    
    return names, descriptions

# Call function
name_dict, desc_dict = load_entities(kb_data)

In [19]:
from spacy.kb import KnowledgeBase

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=embedding_dims)

To add each record to the knowledge base, we would normally encode its description using the built-in word vectors of our `nlp` model: the `vector` attribute of a document is the average of its token vectors. In the cell below that encoding step is commented out, and every entity gets a zero vector as a placeholder. We also need to provide a frequency, which is a raw count of how many times a certain entity appears in an annotated corpus. We're not using these frequencies in this tutorial, so we set them to an arbitrary value.
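To make the averaging concrete, here is a minimal sketch in plain Python. The 3-dimensional token vectors are made up for illustration, and `mean_vector` is a hypothetical helper, not a spaCy function.

```python
def mean_vector(token_vectors):
    """Average equal-length vectors component by component."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n_tokens = len(token_vectors)
    # zip(*...) groups the i-th component of every token vector together
    return [sum(component) / n_tokens for component in zip(*token_vectors)]


# Made-up 3-dimensional vectors for a two-token description
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
doc_vector = mean_vector(toy_token_vectors)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

The resulting vector has the same length as each token vector, which is why `entity_vector_length` has to match the embedding size of the model.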

In [20]:
descriptions_enc = dict()
for qid, desc in desc_dict.items():
    #desc_doc = nlp(desc)
    #desc_enc = desc_doc.vector
    desc_enc=np.zeros(embedding_dims)
    #descriptions_enc[qid]=desc_enc
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)   # 342 is an arbitrary value here

Now we want to specify aliases or synonyms. We first add the full names that belong to exactly one entity. Here, we are 100% certain that they resolve to their corresponding QID, as there is no ambiguity.

In [21]:
for qid, name in name_dict.items():
    if name not in alias_dict.keys():
        kb.add_alias(alias=str(name), entities=[str(qid)], probabilities=[1])   # 100% prior probability P(entity|alias)

In [22]:
for alias_, qids in alias_dict.items():
    # Equal prior for every candidate, nudged down by 0.01 so the total stays below 1
    probs = [round(1/len(qids), 2) - .01 for _ in qids]
    kb.add_alias(alias=alias_, entities=qids, probabilities=probs)  # sum(probs) must be <= 1

Names shared by several entities, such as "David Smith", are added as ambiguous aliases. We assume that each of the candidates is equally likely, so every entity sharing a name gets the same prior probability.
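The cell above gets equal priors by rounding `1/len(qids)` and subtracting 0.01. Another way to keep the total at or below 1 is to round down instead of to the nearest value. The sketch below shows that idea; `uniform_priors` is a hypothetical helper, not part of spaCy.

```python
def uniform_priors(n_candidates, decimals=2):
    """Return n equal probabilities, rounded down so they never sum to more than 1."""
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    scale = 10 ** decimals
    # Integer division rounds down, e.g. 100 // 3 == 33, giving 0.33 per candidate
    prob = (scale // n_candidates) / scale
    return [prob] * n_candidates


for n in (1, 2, 3, 13):
    probs = uniform_priors(n)
    print(n, probs[0], round(sum(probs), 2))
```

With two decimals this only works for up to 100 candidates per alias; beyond that, a larger `decimals` value would be needed to avoid zero probabilities.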

So this will be our knowledge base. We can print the candidates that are generated for an unambiguous full name such as "Joe Biden", as well as for names shared by several entities, like "Adam Smith" or "David Smith".

In [23]:
candidate_1='Joe Biden'
print(f"Candidates for {candidate_1}: {[c.entity_ for c in kb.get_alias_candidates(candidate_1)]}")

Candidates for Joe Biden: ['Q6279']


In [24]:
candidate_2='Adam Smith'
print(f"Candidates for {candidate_2}: {[c.entity_ for c in kb.get_alias_candidates(candidate_2)]}")

Candidates for Adam Smith: ['129552', '379819', '269916', '256328', 'Q350916']


In [25]:
candidate_3='David Smith'
print(f"Candidates for {candidate_3}: {[c.entity_ for c in kb.get_alias_candidates(candidate_3)]}")

Candidates for David Smith: ['280783', '211703', '53881', '200407', '377020', '204251', '184041', '77215', 'Q3018800', '221595', 'Q53960880', 'Q5239878', '200405']


We notice that querying the KB with an unambiguous alias such as "Joe Biden" gives us a single candidate, while an ambiguous alias such as "David Smith" gives us 13. If we query it with an unknown term, it just gives an empty list.
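The lookup behaviour can be pictured as a plain dictionary from alias to candidate IDs. The sketch below is a toy stand-in and not how spaCy stores its KB: `toy_alias_table` and `get_candidates` are hypothetical names, and the IDs are copied from the outputs above.

```python
# Toy alias table: alias -> list of candidate entity IDs
toy_alias_table = {
    "Joe Biden": ["Q6279"],
    "Adam Smith": ["129552", "379819", "269916", "256328", "Q350916"],
}


def get_candidates(alias):
    """Return the candidate IDs for an alias, or an empty list if it is unknown."""
    return toy_alias_table.get(alias, [])


for mention in ("Joe Biden", "Adam Smith", "Sofie"):
    print(f"Candidates for {mention}: {get_candidates(mention)}")
```

An empty candidate list means the entity linker has nothing to choose from for that mention.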

We can save the knowledge base by calling the function `to_disk` with an output location.

In [26]:
dataset='full'
dataset=f'{dataset}{kb_iteration}'

In [27]:
# change the directory and file names to whatever you like
output_dir = Path.cwd() / "assets"
output_dir.mkdir(exist_ok=True)  # create the folder if it does not exist yet
kb.to_disk(output_dir / f'kb_{dataset}')

We can store the `nlp` object to file by calling `to_disk` as well.

In [28]:
nlp.to_disk(output_dir / f'nlp_{dataset}')

In [29]:
kb_data[kb_data['name']=='David Smith']

Unnamed: 0,id,name,desc
125089,Q3018800,David Smith,David Smith is a Quebec politician. This perso...
149306,Q5239878,David Smith,David Smith is a Canadian senator. This person...
150651,Q53960880,David Smith,David Smith is a Australian Capital Territory ...
249323,53881,David Smith,David Smith is a Prospective Parliamentary Can...
265392,77215,David Smith,David Smith is a Dasa Properties LLC . .
325350,184041,David Smith,"David Smith is a Welder, Sun Coast Resources I..."
336129,200405,David Smith,"David Smith is a Professor, University of Flor..."
336131,200407,David Smith,"David Smith is a Professor, San Bernardino Com..."
338410,204251,David Smith,David Smith is a San Bernardino Community Coll...
342983,211703,David Smith,David Smith is a Property Management Administr...


# Creating a training dataset

Now, we need to create some annotated data to train an Entity Linking algorithm on. To do so, we will use the annotation tool Prodigy, but you could generate the data in whatever tool you like.

If you are watching [the video](https://www.youtube.com/watch?v=8u57WSXVpmw), it will explain how to obtain annotated data with Prodigy. The final result will be a JSONL file that is distributed alongside this notebook. We'll now use this JSONL file to train our entity linker. If you want to skip the annotation part in the video, you can fast forward to [this section](https://www.youtube.com/watch?v=8u57WSXVpmw&t=19m19s).

Let's have a look at the results in this file:

In [None]:
import json
from pathlib import Path

json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl" # distributed alongside this notebook
with json_loc.open("r", encoding="utf8") as jsonfile:
    line = jsonfile.readline()
    print(line)   # print just the first line

 We see that the full text of the original sentence is stored, together with a lot of detail about the annotation task. The most important bit is stored with the key `accept` at the end: this is the value of our manual annotation. For this specific sentence and this specific mention, the option with key `Q312545` was manually selected. This is the information that we'll train our entity linker on.
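As a rough picture of what gets read from such a line, the sketch below builds a heavily trimmed, hypothetical record and pulls out the mention and the accepted ID. Only the keys used later in this notebook (`text`, `spans`, `accept`, `answer`) are included; a real Prodigy record contains many more fields.

```python
import json

text = "Tennis champion Emerson was expected to win Wimbledon."
start = text.index("Emerson")

# Hypothetical, trimmed-down record serialized as one JSONL line
record_line = json.dumps({
    "text": text,
    "spans": [{"start": start, "end": start + len("Emerson"), "label": "PERSON"}],
    "accept": ["Q312545"],
    "answer": "accept",
})

example = json.loads(record_line)
span = example["spans"][0]
mention = example["text"][span["start"]:span["end"]]
# Only accepted answers carry a usable gold-standard ID
gold_qid = example["accept"][0] if example["answer"] == "accept" else None
print(mention, "->", gold_qid)
```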

# Training the Entity Linker

To feed training data into our Entity Linker, we format our data as a structured tuple. The first part is the raw text, and the second part is a dictionary of annotations. This dictionary defines the named entities we want to link ("entities"), as well as the actual gold-standard links ("links").

In [None]:
import json
from pathlib import Path

dataset = []
json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl"
with json_loc.open("r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        if example["answer"] == "accept":
            QID = example["accept"][0]
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            links_dict = {QID: 1.0}
            # Only keep accepted annotations; other answers carry no gold link
            dataset.append((text, {"links": {offset: links_dict}, "entities": entities}))

To check whether the conversion looks OK, we can just print the first sample in our dataset. 

In [None]:
dataset[0]
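Beyond eyeballing one sample, a small consistency check can catch conversion mistakes early. The sketch below is one possible check, assuming the `(text, annotations)` layout described above; `check_sample` is a hypothetical helper.

```python
def check_sample(sample):
    """Return a list of problems found in one (text, annotations) sample."""
    text, annot = sample
    entity_offsets = {(start, end) for start, end, _label in annot["entities"]}
    problems = []
    for offset, links in annot["links"].items():
        start, end = offset
        if not 0 <= start < end <= len(text):
            problems.append(f"offset {offset} is outside the text")
        if offset not in entity_offsets:
            problems.append(f"link at {offset} has no matching entity span")
        if not any(value == 1.0 for value in links.values()):
            problems.append(f"link at {offset} has no gold-standard ID")
    return problems


text = "Tennis champion Emerson was expected to win Wimbledon."
start = text.index("Emerson")
end = start + len("Emerson")
good = (text, {"links": {(start, end): {"Q312545": 1.0}}, "entities": [(start, end, "PERSON")]})
# Same sample, but the link points at "Tennis" instead of the entity span
bad = (text, {"links": {(0, 6): {"Q312545": 1.0}}, "entities": [(start, end, "PERSON")]})

print(check_sample(good))  # []
print(check_sample(bad))
```

Running `check_sample` over every item in the converted dataset and printing only the non-empty results is usually enough for a quick sanity check.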

We can also check some statistics in this dataset. How many cases of each QID do we have annotated?

In [None]:
gold_ids = []
for text, annot in dataset:
    for span, links_dict in annot["links"].items():
        for link, value in links_dict.items():
            if value:
                gold_ids.append(link)

from collections import Counter
print(Counter(gold_ids))

We got exactly 10 annotated sentences for each of our Emersons. Of these, we'll now set aside 2 per entity, 6 cases in total, in a separate test set.

In [None]:
import random

train_dataset = []
test_dataset = []
for QID in sorted(set(gold_ids)):   # every QID that occurs in the annotations
    indices = [i for i, j in enumerate(gold_ids) if j == QID]
    train_dataset.extend(dataset[index] for index in indices[0:8])  # first 8 in training
    test_dataset.extend(dataset[index] for index in indices[8:10])  # last 2 in test
    
random.shuffle(train_dataset)
random.shuffle(test_dataset)

With our datasets now properly set up, we'll now create `Example` objects to feed into the training process. This object is new in spaCy v3. Essentially, it contains a document with predictions (`predicted`) and one with gold-standard annotations (`reference`). During training, the pipeline will compare its predictions to the gold-standard, and update the weights of the neural network accordingly.

For entity linking, the algorithm needs access to gold-standard sentences, because it uses the context from the sentence to perform the disambiguation. You can either provide gold-standard `sent_starts` annotations, or run a component such as the `parser` or `sentencizer` on your reference documents:

In [None]:
from spacy.training import Example

TRAIN_EXAMPLES = []
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
for text, annotation in train_dataset:
    example = Example.from_dict(nlp.make_doc(text), annotation)
    example.reference = sentencizer(example.reference)
    TRAIN_EXAMPLES.append(example)
    

Then, we'll create a new Entity Linking component and add it to the pipeline. 

We also need to make sure the `entity_linker` component is properly initialized. To do this, we need a `get_examples` function that returns some example training data, as well as a `kb_loader` argument. This is a `Callable` that creates the `KnowledgeBase`, given a certain `Vocab` instance. Here, we will load our KB from disk, using the built-in [`spacy.KBFromFile.v1`](https://spacy.io/api/architectures#KBFromFile) function, which is defined in `spacy.ml.models`. 

In [None]:
from spacy.ml.models import load_kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
# Load the KB from the location it was saved to earlier
kb_path = output_dir / f"kb_full{kb_iteration}"
entity_linker.initialize(get_examples=lambda: TRAIN_EXAMPLES, kb_loader=load_kb(kb_path))

Next, we will run the actual training loop for the new component, taking care to only train the entity linker and not the other components. 

In [None]:
from spacy.util import minibatch, compounding

with nlp.select_pipes(enable=["entity_linker"]):   # train only the entity_linker
    optimizer = nlp.resume_training()
    for itn in range(500):   # 500 iterations takes about a minute to train
        random.shuffle(TRAIN_EXAMPLES)
        batches = minibatch(TRAIN_EXAMPLES, size=compounding(4.0, 32.0, 1.001))  # increasing batch sizes
        losses = {}
        for batch in batches:
            nlp.update(
                batch,   
                drop=0.2,      # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)

The final training loss is pretty small, which is a good sign. But to truly verify whether our model generalizes well, we need to test it on unseen data.
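One detail of the training loop worth unpacking is the `compounding(4.0, 32.0, 1.001)` batch-size schedule: it starts at 4, multiplies by 1.001 after every batch, and stops growing at 32. The generator below is a simplified re-implementation for illustration, not spaCy's own code, and it only handles increasing schedules.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


# A large factor makes the growth easy to see
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.5), 7))
print(sizes)  # [4.0, 6.0, 9.0, 13.5, 20.25, 30.375, 32.0]

# With the factor used in this notebook, growth is very slow:
# count how many batch sizes are produced before the cap of 32 is reached
steps = 0
for size in compounding_sketch(4.0, 32.0, 1.001):
    if size >= 32.0:
        break
    steps += 1
print(steps)
```

With a factor of 1.001 the cap is only reached after roughly two thousand batches, so on a small training set the batch size stays close to 4 for most of the run.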

# Testing the Entity Linker

Let's first apply it on our original sentence. For each entity, we print the text and label as before, but also the disambiguated QID as predicted by our entity linker.

In [None]:
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

We see that Emerson gets disambiguated to Q312545, which is the correct ID for the tennis player. Note also that the entity "Wimbledon" gets the annotation `NIL`, a placeholder value showing that the EL component could not find any relevant ID for this entity. This happens because our knowledge base and the Entity Linking component have only seen "Emerson" examples, and are thus quite limited.

Let's see what the model predicts for the 6 sentences in our test dataset, which were never seen during training.

In [None]:
for text, true_annot in test_dataset:
    print(text)
    print(f"Gold annotation: {true_annot}")
    doc = nlp(text)  # to make this more efficient, you can use nlp.pipe() just once for all the texts
    for ent in doc.ents:
        if ent.text == "Emerson":
            print(f"Prediction: {ent.text}, {ent.label_}, {ent.kb_id_}")
    print()

These results may vary a little from run to run, but usually the EL pipeline will get 5 out of 6 predictions correct (83% accuracy). Random guessing would have only achieved 33%.
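The arithmetic behind those two numbers can be sketched as follows. The gold and predicted IDs are placeholders (`Q_A`, `Q_B`, `Q_C`), not real results from the model, and `accuracy` is a hypothetical helper.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold-standard IDs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Placeholder IDs: 2 test sentences for each of 3 entities, one prediction wrong
gold = ["Q_A", "Q_A", "Q_B", "Q_B", "Q_C", "Q_C"]
predicted = ["Q_A", "Q_A", "Q_B", "Q_C", "Q_C", "Q_C"]

model_accuracy = accuracy(gold, predicted)
# Guessing uniformly among the candidates is right 1 time in 3 on average
random_baseline = 1 / len(set(gold))

print(f"Model: {model_accuracy:.0%}, random baseline: {random_baseline:.0%}")
```

On a test set of only 6 sentences a single prediction changes the accuracy by almost 17 percentage points, so these figures are a rough indication rather than a reliable estimate.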

Hopefully, this tutorial has shown you how to implement an Entity Linking component in spaCy. The knowledge base and training dataset used here were kept small for demonstration purposes, but in reality you'll want to use a much bigger representative set of entities, perhaps from an ontology or dictionary that is relevant to your use-case. 

If you have general questions on how to use this functionality in your own application, the best route is to post a new Stack Overflow question and tag it with the label `spaCy`. If you run into an actual bug with the Entity Linking functionality, you can also open an issue on spaCy's GitHub tracker. 

I hope your next NLP project will incorporate entity linking!