# Introduction to Entity Linking

This notebook provides a short tutorial on how to implement and use spaCy's Entity Linking functionality with spaCy v3.  It can be used together with [this video](https://www.youtube.com/watch?v=8u57WSXVpmw). Note that this video was originally created for spaCy v2, but this notebook includes the updates needed for v3. If you want to use spaCy v2 instead, you can find the original code [here](https://github.com/explosion/projects/tree/master/nel-emerson).

**Entity Linking** (EL) is the challenge of resolving ambiguous textual mentions to unique concepts in a knowledge base. A related task is **Named Entity Recognition** (NER). An NER component basically identifies words in text that have a specific name and refer to real-world objects, such as people or organizations. spaCy offers pre-built Machine Learning models that perform Named Entity Recognition for a variety of languages (https://spacy.io/models).

!pip install spacy==3.0.6
!pip install spacy-lookups-data
!python -m spacy download en_core_web_lg

Let's load a pretrained English model, apply it to some sample text and show the named entities that were identified by printing their text and label.

In [None]:
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(f"Named Entity '{ent.text}' with label '{ent.label_}'")

Named Entity 'Emerson' with label 'PERSON'
Named Entity 'Wimbledon' with label 'EVENT'


We see that this sentence contains a person called "Emerson" and an event called "Wimbledon". 

Unfortunately, there may be many people in the world called "Emerson", and this output still doesn't tell us which one exactly we meant. This is the challenge addressed by Entity Linking. It transforms an ambiguous textual mention to a unique identifier by looking at the context in which the mention occurs. 

In this specific case, the sentence gives us important clues: Emerson is clearly a professional tennis player. 

Searching the internet, we can establish that this sentence is most likely talking about Roy Emerson, an Australian tennis player. We can now resolve this entity in this sentence to its unique identifier from WikiData, which is a free and open, interlingual knowledge base. Its unique IDs always start with a Q, and "Roy Emerson" has the identifier Q312545: https://www.wikidata.org/wiki/Q312545

To implement an entity linking pipeline, we need 3 different steps. 

The first step, as we already saw, is Named Entity Recognition, in which the mention "Emerson" is labeled as a "Person". Next, the extracted mention needs to be resolved to a list of plausible candidates. In our case, we'll consider three different people named Emerson. Typically, this list is created by querying a knowledge base (KB) that contains various aliases and synonyms. In the final step, we need to reduce the list of candidates to just one final ID that represents the correct Emerson.

![Diagram of entity linking process](nel_schema.png)
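The three steps can be sketched in plain Python. This is a toy illustration and not how spaCy implements them: the alias table, the placeholder IDs `Q_WRITER` and `Q_FOOTBALLER`, and the word-overlap scoring are assumptions made for this example. Only Q312545 comes from the text above.

```python
import re

# Toy knowledge base: entity ID -> short description.
# Q312545 is taken from the text; the other two IDs are placeholders.
KB = {
    "Q312545": "Australian tennis player",
    "Q_WRITER": "American writer",
    "Q_FOOTBALLER": "Brazilian footballer",
}
# Alias table used for candidate generation (step 2).
ALIASES = {"Emerson": ["Q312545", "Q_WRITER", "Q_FOOTBALLER"]}


def generate_candidates(mention):
    """Step 2: look up the plausible entity IDs for a mention."""
    return ALIASES.get(mention, [])


def disambiguate(context, candidates):
    """Step 3: pick the candidate whose description shares most words with the context."""
    if not candidates:
        return None
    context_words = set(re.findall(r"\w+", context.lower()))

    def overlap(qid):
        return len(context_words & set(KB[qid].lower().split()))

    return max(candidates, key=overlap)


text = "Tennis champion Emerson was expected to win Wimbledon."
mention = "Emerson"  # step 1 (NER) is assumed to have found this mention
candidates = generate_candidates(mention)
best = disambiguate(text, candidates)
print(best)
```

A trained entity linker replaces the word-overlap heuristic with a learned model, but the flow of mention, candidates, and final ID stays the same.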

This tutorial will show you how to use spaCy v3 to create a Knowledge base that will address the second step of candidate generation. Additionally, we will create a new Entity Linking component, and train its Machine Learning model on some annotated data.

In this notebook, we implement the functions and training loop from scratch. However, spaCy v3 introduced a powerful and extensible training configuration system, which we advise using in most cases. You can find the corresponding implementation with the [config system](https://spacy.io/usage/training#config), runnable with the new [spacy projects](https://spacy.io/usage/projects), [here](https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson).

The aim of this tutorial is to help you get started implementing your own Entity Linking functionality with spaCy. If you want to know more about the technical details, check out this presentation at spaCy IRL 2019: https://www.youtube.com/watch?v=PW3RJM8tDGo&list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc&index=7&t=0s

# Creating the Knowledge Base 

The first step in performing Entity Linking is to set up a knowledge base that contains the unique identifiers of the entities we are interested in. In this tutorial we will create a very simple one with only 3 entries. We load the data from a pre-defined CSV file.

In [None]:
import pandas as pd

In [498]:
kb_data=pd.read_csv('kb_datasets/kb_entities_lilsis.csv',index_col=0)

In [499]:
kb_data.shape

(212339, 3)

In [502]:
kb_data=kb_data[['id','name','desc']]

In [7]:
## Generate synthetic aliases 

In [503]:
aliases_data=kb_data[kb_data['name'].duplicated(keep=False)].sort_values(['name'])

In [504]:
aliases_data['id']=aliases_data['id'].astype(str)

In [505]:
alias_dict={}
for alias in aliases_data['name'].unique():
    alias_dict[alias]=list(aliases_data.loc[aliases_data['name']==alias, 'id'].values)

In [506]:
kb_data['name'].value_counts()

Cool Davis            12
David Wilson          11
Mark Smith            11
David Smith           11
William Cavendish     10
                      ..
Stephen DiCarmine      1
Martin Bienenstock     1
Elyse Grinstein        1
Claire A Grinstein     1
Jim Sabia              1
Name: name, Length: 206878, dtype: int64

In [507]:
# export kb data in the right format for the tutorial
kb_data.rename(columns={'id':'qid','context':'desc'})

Unnamed: 0,qid,name,desc
0,1,Walmart Inc.,Retail merchandising
1,2,ExxonMobil,"Oil and gas exploration, production, and marke..."
2,3,Chevron,Energy Company
3,4,General Motors Company,automobile manufacturer
4,5,ConocoPhillips,Texas-based oil and gas corporation
...,...,...,...
405997,427652,"Alexander “Curt“ Meyer, III",Founder & Managing Partner of Truscott Partner...
405999,427654,"Truscott Partners, LLC",The Meyer Family Private Investment company ba...
406000,427655,Garth Hankinson,"Executive Vice President, Chief Financial Offi..."
406001,427656,SVB Securities,"Boston, MA investment bank"


In [480]:
#import csv
from pathlib import Path

def load_entities():
    """Read the KB entities CSV and return {qid: name} and {qid: description} dicts."""
    entities_loc = Path.cwd() / "kb_datasets" / "kb_entities.csv"
    kb_entities = pd.read_csv(entities_loc, names=['qid', 'name', 'desc'])

    names = dict()
    descriptions = dict()

    # itertuples avoids positional indexing into a label-indexed row Series
    for row in kb_entities.itertuples(index=False):
        qid = str(row.qid)
        names[qid] = str(row.name)
        descriptions[qid] = str(row.desc)

    return names, descriptions

In [481]:
name_dict, desc_dict = load_entities()
for QID in name_dict.keys():
    print(f"{QID}, name={name_dict[QID]}, desc={desc_dict[QID]}")

  """Entry point for launching an IPython kernel.

We have 3 entries here, of 3 different people called Emerson. One Australian tennis player, one American writer and one Brazilian footballer. We'll use this information to create our knowledge base. We need to define a fixed dimensionality for the entity vectors, which will be 300-D in our case.

In [482]:
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

To add each record to the knowledge base, we encode its description using the built-in word vectors of our `nlp` model. The `vector` attribute of a document is the average of its token vectors. We also need to provide a frequency, which is a raw count of how many times a certain entity appears in an annotated corpus. In this tutorial we're not using these frequencies, so we're setting them to an arbitrary value.

In [483]:
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)   # 342 is an arbitrary value here
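To make the averaging concrete, here is a minimal sketch in plain Python of how a document vector can be derived as the element-wise mean of its token vectors. The 3-dimensional toy vectors and the helper name `average_vectors` are assumptions for illustration. The real vectors from `en_core_web_lg` are 300-dimensional and are computed by spaCy itself.

```python
def average_vectors(token_vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


# Two toy "token vectors" standing in for the words of a description.
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
doc_vector = average_vectors(toy_token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Because every token contributes equally, long descriptions with many generic words tend to produce less distinctive entity vectors than short, specific ones.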

Now we want to specify aliases or synonyms. We first add the full names. Here, we are 100% certain that they resolve to their corresponding QID, as there is no ambiguity.

In [508]:
for qid, name in name_dict.items():
    if name not in alias_dict.keys():
        kb.add_alias(alias=str(name), entities=[str(qid)], probabilities=[1])   # 100% prior probability P(entity|alias)


In [509]:
for alias_ in alias_dict.keys():
    qids=alias_dict[alias_]
    probs = [round(1/len(qids),2)-.01 for qid in qids]
    kb.add_alias(alias=alias_, entities=qids, probabilities=probs)  # sum([probs]) should be <= 1 !
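One reading of the `-.01` in the cell above is that it guards against rounding: `round(1/n, 2)` can round up, so the priors for an alias could sum to more than 1. The sketch below shows the effect for six entities and an alternative that rounds down instead. The helper name `uniform_priors` is hypothetical and not part of spaCy.

```python
import math


def uniform_priors(n_entities):
    """Equal prior per entity, rounded down to two decimals so the total stays at or below 1
    (up to floating-point rounding)."""
    return [math.floor(100 / n_entities) / 100] * n_entities


# Rounding to the nearest hundredth can overshoot: six priors of 0.17 sum to more than 1.
rounded = [round(1 / 6, 2)] * 6
print(sum(rounded) > 1)

# Rounding down gives 0.16 per entity, which sums to less than 1.
floored = uniform_priors(6)
print(floored[0], sum(floored) <= 1)
```

Rounding down loses a little probability mass, which is harmless here because the priors only need to be equal across the entities sharing an alias.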

We also want to add the alias "Emerson". We'll assume that each of our 3 Emersons is equally famous and thus we set their probabilities to be equal for each entity.

So this will be our Knowledge base. We can check the entities and aliases that are contained in it:

We can also print the candidates that are generated for the full name of Roy Emerson, as well as for the mention "Emerson" or for any other random mention, like "Sofie".

In [516]:
candidate_1='John Biden'
print(f"Candidates for {candidate_1}: {[c.entity_ for c in kb.get_alias_candidates(candidate_1)]}")

Candidates for John Biden: []


In [517]:
candidate_2='Adam Smith'
print(f"Candidates for {candidate_2}: {[c.entity_ for c in kb.get_alias_candidates(candidate_2)]}")

Candidates for Adam Smith: ['129552', '379819', '269916', '256328', '13596']


In [518]:
candidate_3='David Smith'
print(f"Candidates for {candidate_3}: {[c.entity_ for c in kb.get_alias_candidates(candidate_3)]}")

Candidates for David Smith: ['53881', '77215', '200407', '204250', '377020', '211703', '200405', '184041', '221595', '280783', '204251']


We notice that querying the KB with the alias "Emerson" gives us 3 candidates, but if we query it with an unknown term, it just gives an empty list.

We can save the knowledge base by calling the function `to_disk` with an output location.

In [522]:
# change the directory and file names to whatever you like
import os
output_dir = Path.cwd() / "my_output"
if not os.path.exists(output_dir):
    os.mkdir(output_dir) 
kb.to_disk(output_dir / "kb_lilsis")

We can store the `nlp` object to file by calling `to_disk` as well.

In [523]:
nlp.to_disk(output_dir / "nlp_lilsis")

In [521]:
kb_data[kb_data['name']=='David Smith']

Unnamed: 0,id,name,desc
49103,53881,David Smith,Prospective Parliamentary Candidate for Wakefi...
71484,77215,David Smith,Dasa Properties LLC
173617,184041,David Smith,"Welder, Sun Coast Resources Inc - Humble Texas"
188818,200405,David Smith,"Professor, University of Florida"
188820,200407,David Smith,"Professor, San Bernardino Community College Di..."
192548,204250,David Smith,"Professor, University of Florida"
192549,204251,David Smith,San Bernardino Community College District
199732,211703,David Smith,"Property Management Administrator, Pennsylvani..."
208928,221595,David Smith,"Regional Vice President, Anthem Blue Cross Blu..."
265111,280783,David Smith,"Chief Development Officer, Leavitt Partners"


In [107]:
s=kb_data['name'].value_counts()>1

In [108]:
alias_df=kb_data[kb_data['name'].isin(s[s].index)].sort_values(by=['name','id'])
#alias_df.to_csv('aliases.csv')

In [None]:
gu_source_data='entity_source_data/gu_resampled.csv'
gu_sample=pd.read_csv(gu_source_data,index_col=0)
gu_sample=gu_sample[['body_text']]
article_containing_alias_indices=[]
for alias in alias_df['name'].unique():
    # regex=False: aliases are literal names and may contain regex metacharacters such as "." or "("
    mask = gu_sample['body_text'].str.contains(alias, regex=False, na=False)
    if mask.any():
        article_containing_alias_indices.append(gu_sample[mask])

In [112]:
article_containing_alias=gu_sample
articles_containing_alias=[df_row[['body_text']].values[0][0] for df_row in article_containing_alias_indices]
with open(output_dir /'source_data.txt', 'w') as fp:
    for item in articles_containing_alias:
        fp.write("%s\n" % item)
    print('Done')

Done


In [123]:
gu_sample['body_text']=gu_sample['body_text'].apply(lambda r: r.split('.'))

In [130]:
gu_sample=gu_sample['body_text'].explode().sample(frac=1).reset_index(drop=True)

In [133]:
with open(Path.cwd() /'entity_source_data/source_data_resampled_sentences_randomised.txt', 'w') as fp:
    for item in gu_sample:
        fp.write("%s.\n" % item)
    print('Done')

Done


In [161]:
import rapidfuzz
import numpy as np
from numpy import dot
from numpy.linalg import norm
from operator import itemgetter

In [392]:
aliases=kb.get_alias_strings()
span='Adam Smith'
matches={}
matching_thres=60
for al in aliases:
    #fuzzy_ratio=rapidfuzz.fuzz.token_set_ratio(span.lower(),al.lower())
    fuzzy_ratio=rapidfuzz.fuzz.WRatio(span.lower(),al.lower())
    if fuzzy_ratio >=matching_thres:
        matches[al]=fuzzy_ratio

candidates=[]
for match in matches:
    candidates.extend(kb.get_alias_candidates(match))
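The thresholding step in the cell above can be tried without a knowledge base. The sketch below is an illustration only: it swaps `rapidfuzz.fuzz.WRatio` for the standard library's `difflib.SequenceMatcher`, so the scores differ from those shown in this notebook, and the helper name `fuzzy_matches` and the alias list are assumptions for this example.

```python
from difflib import SequenceMatcher


def fuzzy_matches(span, aliases, threshold=60):
    """Return {alias: score} for aliases whose similarity to the span is at least threshold.

    Scores are on a 0-100 scale, like rapidfuzz, but computed with difflib.
    """
    matches = {}
    for alias in aliases:
        score = 100 * SequenceMatcher(None, span.lower(), alias.lower()).ratio()
        if score >= threshold:
            matches[alias] = score
    return matches


toy_aliases = ["Adam Smith", "Adam B Smith", "Ada Smith", "SVB Securities"]
matches_demo = fuzzy_matches("Adam Smith", toy_aliases)
for alias, score in matches_demo.items():
    print(f"{alias}: {score:.1f}")
```

A lower threshold increases recall at the cost of many more candidates to rank afterwards, which is why the later cells cut the list down to the best-scoring entries.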

In [397]:
print(f"Candidates for {span}: {[c.entity_ for c in kb.get_alias_candidates(span)]}")

Candidates for Adam Smith: ['13596', '129552', '256328', '269916', '379819']


In [414]:
[candidate.entity_ for candidate in candidates if candidate.alias_=='Adam Smith']

['13596', '129552', '256328', '269916', '379819']

In [None]:
#names = dict()
candidate_d=dict()
fuzzy_scores = dict()
for candidate in candidates:
    qid = candidate.entity_
    name = candidate.alias_
    #names[qid] = name
    candidate_d[qid] = candidate
    fuzzy_scores[qid] = matches[name]
    
entities_ordered=dict(sorted(fuzzy_scores.items(), key=itemgetter(1), reverse=True))
entities_ordered=list(entities_ordered.keys())[:10]
[candidate_d[entity].alias_ for entity in entities_ordered]

In [476]:
def order_candidates_fuzzy_score(candidates, matches, candidate_threshold=10):
    """
    """
    # names = dict()
    candidate_d = dict()
    fuzzy_scores = dict()
    for candidate in candidates:
        qid = candidate.entity_
        name = candidate.alias_
        # names[qid] = name
        candidate_d[qid] = candidate
        fuzzy_scores[qid] = matches[name]

    entities_ordered = dict(sorted(fuzzy_scores.items(), key=itemgetter(1), reverse=True))
    entities_ordered = list(entities_ordered.keys())[:candidate_threshold]
    return [candidate_d[entity] for entity in entities_ordered]

In [477]:
order_candidates_fuzzy_score(candidates, matches)

[<spacy.kb.Candidate at 0x16b231360>,
 <spacy.kb.Candidate at 0x16b2313d0>,
 <spacy.kb.Candidate at 0x16b231440>,
 <spacy.kb.Candidate at 0x16b2314b0>,
 <spacy.kb.Candidate at 0x16b231520>,
 <spacy.kb.Candidate at 0x16553ba60>,
 <spacy.kb.Candidate at 0x172511a60>,
 <spacy.kb.Candidate at 0x165544c90>,
 <spacy.kb.Candidate at 0x16553b7c0>,
 <spacy.kb.Candidate at 0x16b239600>]

In [470]:
entities_ordered=dict(sorted(fuzzy_scores.items(), key=itemgetter(1), reverse=True))

entities_ordered=list(entities_ordered.keys())[:10]

[candidate_d[entity].alias_ for entity in entities_ordered]

In [443]:
selected_candidates=[]
for entity in entities_ordered[:10]:
    if entity in candidate_d:   # keep only entities for which we stored a candidate
        selected_candidates.append(candidate_d[entity])

{'13596': 100.0,
 '129552': 100.0,
 '256328': 100.0,
 '269916': 100.0,
 '379819': 100.0,
 '253620': 95.0,
 '319442': 95.0,
 '388312': 94.73684210526316,
 '325220': 90.0,
 '52342': 90.0,
 '30799': 90.0,
 '256382': 90.0,
 '30082': 90.0,
 '270735': 85.71428571428572,
 '239067': 85.5,
 '48861': 85.5,
 '65607': 85.5,
 '195224': 85.5,
 '277586': 85.5,
 '259977': 85.5,
 '117143': 85.5,
 '50196': 85.5,
 '129289': 85.5,
 '29923': 85.5,
 '213537': 85.5,
 '368517': 85.5,
 '320027': 85.5,
 '301819': 85.5,
 '296698': 85.5,
 '98946': 85.5,
 '240046': 85.5,
 '147279': 85.5,
 '72324': 85.5,
 '119838': 85.5,
 '331292': 85.5,
 '246272': 85.5,
 '86101': 85.5,
 '228342': 85.5,
 '349496': 85.5,
 '257053': 85.5,
 '362156': 85.5,
 '46499': 85.5,
 '199129': 85.5,
 '337700': 85.5,
 '240933': 85.5,
 '364020': 85.5,
 '23336': 85.5,
 '171419': 85.5,
 '200397': 85.5,
 '14281': 85.5,
 '427309': 85.5,
 '420104': 85.5,
 '253140': 85.5,
 '310202': 85.5,
 '121767': 85.5,
 '2839': 85.5,
 '315406': 85.5,
 '76537': 85.5,


In [436]:
candidates_ordered[100]

'379819'

In [351]:
candidates_alphabetical = {candidate.alias_ + ' ' + candidate.entity_: candidate for candidate in candidates}
candidates_alphabetical = dict(sorted(candidates_alphabetical.items(), key=itemgetter(0), reverse=False))

{'Adam Smith': 100.0,
 'Adam B Smith': 95.0,
 'Smith Smith': 95.0,
 'Ada Smith': 94.73684210526316,
 'The Adam Smith Institute': 90.0,
 'Adam Smith Foundation': 90.0,
 'Adam Smith Political Action Committee, the': 90.0,
 'Adam Smith Realty Advisors Inc': 90.0,
 'Adam Smith for Congress Committee': 90.0,
 'Damon Smith': 85.71428571428572,
 'Virginians for Chuck Smith': 85.5,
 'Morgan Stanley Smith Barney Global Impact Funding Trust': 85.5,
 'Robert Henry Smith': 85.5,
 'Edward Brinton Smith': 85.5,
 'Joseph A. Smith': 85.5,
 'Adam Vandervoort': 85.5,
 'John W Smith Jr': 85.5,
 'Philander Smith College': 85.5,
 'Barbara Sexton Smith': 85.5,
 'Bob Smith for President Committee Inc': 85.5,
 'Adam Hochschild': 85.5,
 'Davey Boy Smith': 85.5,
 'Peggy Williams-Smith': 85.5,
 'Jan Mahrt-Smith': 85.5,
 'Bruce Higson-Smith': 85.5,
 'DeMaurice Smith': 85.5,
 'Adam Hasner Florida Victory Cmte': 85.5,
 'Adam J Di Vincenzo': 85.5,
 'Gregory L Smith': 85.5,
 'William Smith Conning': 85.5,
 'Adam Wolf

In [169]:
from typing import Dict, List, Tuple
from spacy.kb import Candidate

def get_candidates_from_fuzzy_matching(span, kb, matching_thres=60) -> Tuple[List[Candidate], Dict[str, float]]:
    """
    Return the candidate entities for a span based on fuzzy string matching,
    together with a dict mapping each matched alias to its fuzzy score.
    Each candidate defines the entity, the original alias,
    and the prior probability of that alias resolving to that entity.
    If no alias in the KB matches, an empty list and an empty dict are returned.
    """

    aliases=kb.get_alias_strings()
    matches={}   # alias -> fuzzy score
    for al in aliases:
        fuzzy_ratio=rapidfuzz.fuzz.token_set_ratio(span.lower(),al.lower())
        if fuzzy_ratio >=matching_thres:
            matches[al]=fuzzy_ratio

    candidates=[]
    for match in matches:
        candidates.extend(kb.get_alias_candidates(match))

    return candidates, matches

def embed_text(text,nlp):
    """
    Return spaCy embedding of a text.
    """
    return nlp(text).vector

def calculate_cosine_similarity(descriptions_vec,vector_ref_sentence):
    """
    Return a dictionary mapping the kb entity id to cosine similarity score
    between kb embedded descriptions and the reference vector.
    """
    similarity={}
    for entity_id in descriptions_vec.keys():
        vector_desc=descriptions_vec[entity_id]
        score=np.nan_to_num(
            dot(vector_ref_sentence, vector_desc)/
            (norm(vector_ref_sentence)*norm(vector_desc))
        ,0)
        similarity[entity_id]=score
    return similarity

def get_candidates_from_context(text, nlp, candidates, matches, candidate_limit=50):
    """
    Select only the top candidates to surface via the Prodigy UI. Based on
    topmost cosine similarities.
    """
    vector_ref_sentence=embed_text(text,nlp)
    names = dict()
    descriptions_vec = dict()
    for candidate in candidates:
        qid = candidate.entity_
        name = candidate.alias_
        desc_enc = candidate.entity_vector
        names[qid] = name
        descriptions_vec[qid] = desc_enc

    similarity=calculate_cosine_similarity(descriptions_vec, vector_ref_sentence)
    fuzzy_scores={qid:matches[alias] for qid, alias in names.items()}
    fuzzy_scores=dict(sorted(fuzzy_scores.items(), key=itemgetter(0), reverse=False))
    similarity=dict(sorted(similarity.items(), key=itemgetter(0), reverse=False))
    qids=set(similarity.keys()) | set(fuzzy_scores.keys())
    fuzzy_similarity={}
    for qid in qids:
        fuzzy_similarity[qid]=np.mean([similarity[qid],fuzzy_scores[qid]/100])
    top_best_candidates = dict(sorted(fuzzy_similarity.items(), key = itemgetter(1), reverse = True)[:candidate_limit])
    selected_candidates = [candidate for candidate in candidates if candidate.entity_ in top_best_candidates.keys()]
    return selected_candidates

def order_candidates_alphabetically(candidates):
    """
    Order candidate list alphabetically
    """
    candidates_alphabetical = {candidate.alias_ + ' ' + candidate.entity_: candidate for candidate in candidates}
    candidates_alphabetical = dict(sorted(candidates_alphabetical.items(), key=itemgetter(0), reverse=False))
    return [candidate for candidate in candidates_alphabetical.values()]
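The ranking in `get_candidates_from_context` averages two signals: the cosine similarity between the sentence vector and the entity description vector, and the fuzzy string score rescaled from 0-100 to 0-1. The following is a minimal dependency-free sketch of that arithmetic. The helper names and the toy vectors are assumptions for illustration, and a zero vector is mapped to a similarity of 0, mirroring the `np.nan_to_num` guard above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(cosine, fuzzy_ratio):
    """Unweighted mean of the cosine similarity and the fuzzy ratio rescaled to 0-1."""
    return (cosine + fuzzy_ratio / 100) / 2


sentence_vec = [1.0, 0.0]      # toy stand-in for the sentence embedding
description_vec = [0.0, 1.0]   # toy stand-in for an entity description embedding
cos = cosine_similarity(sentence_vec, description_vec)
print(cos, combined_score(cos, 90.0))
```

With equal weights, an exact name match (fuzzy ratio 100) contributes 0.5 to the score on its own, so the context similarity mainly decides the order among entities that share the same alias.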

In [302]:
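text="It was John Maynard Keynes who said that “when the capital development of a country becomes a byproduct of the activities of a casino, the job is likely to be ill-done”."
candidates, matches = get_candidates_from_fuzzy_matching("John Maynard Keynes", kb)
new_candidates=get_candidates_from_context(text,nlp,candidates,matches)   # matches maps each alias to its fuzzy score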


In [303]:
vector_ref_sentence=embed_text(text,nlp)
names = dict()
descriptions_vec = dict()
for candidate in candidates:
    qid = candidate.entity_
    name = candidate.alias_
    desc_enc = candidate.entity_vector
    names[qid] = name
    descriptions_vec[qid] = desc_enc

similarity=calculate_cosine_similarity(descriptions_vec, vector_ref_sentence)
fuzzy_scores={qid:matches[alias] for qid, alias in names.items()}


In [306]:
fuzzy_scores=dict(sorted(fuzzy_scores.items(), key=itemgetter(0), reverse=False))
similarity=dict(sorted(similarity.items(), key=itemgetter(0), reverse=False))
qids=set(similarity.keys()) | set(fuzzy_scores.keys())
fuzzy_similarity={}
for qid in qids:
    fuzzy_similarity[qid] = ([similarity[qid], fuzzy_scores[qid] / 100])
    #fuzzy_similarity[qid]=np.mean([similarity[qid],fuzzy_scores[qid]/100])
    #fuzzy_similarity[qid] = np.ma.average([similarity[qid], fuzzy_scores[qid] / 100], weights=[1, 2])

In [316]:
np.ma.average([similarity[qid], fuzzy_scores[qid] / 100], weights=[1, 2])

0.626984592950086

In [294]:
top_best_candidates = dict(sorted(fuzzy_similarity.items(), key = itemgetter(1), reverse = True)[:50])

In [295]:
top_best_candidates

{'398729': 0.8583432574056711,
 '97387': 0.8456965772939196,
 '93513': 0.8191439575522026,
 '165685': 0.8030752167648936,
 '113174': 0.8008316317871986,
 '411891': 0.7972990876312205,
 '11897': 0.7961998554097311,
 '333467': 0.7947114583244168,
 '50829': 0.7916899295631747,
 '176420': 0.7910833652251725,
 '17796': 0.7902848022007736,
 '280614': 0.7873476011873211,
 '27998': 0.7854471672376881,
 '280391': 0.784609600775617,
 '100566': 0.783871099027426,
 '15049': 0.7816469900861103,
 '183241': 0.7810835194767953,
 '159793': 0.7807202742684682,
 '94273': 0.7804436380506125,
 '34469': 0.7796250626423464,
 '37211': 0.779428298118812,
 '78809': 0.777897797261107,
 '224717': 0.7764405149370766,
 '355462': 0.775704424578975,
 '315036': 0.7752372412117599,
 '392163': 0.7747885422399007,
 '360968': 0.774105755893999,
 '341357': 0.7737280417575043,
 '89973': 0.7736169593350832,
 '43425': 0.7736149653493714,
 '177844': 0.773588173454582,
 '111467': 0.7735844665002619,
 '112922': 0.772883993952085

# Creating a training dataset

Now, we need to create some annotated data to train an Entity Linking algorithm on. To do so, we will use the annotation tool Prodigy, but you could generate the data in whatever tool you like.

If you are watching [the video](https://www.youtube.com/watch?v=8u57WSXVpmw), it will explain how to obtain annotated data with Prodigy. The final result will be a JSONL file that is distributed alongside this notebook. We'll now use this JSONL file to train our entity linker. If you want to skip the annotation part in the video, you can fast forward to [this section](https://www.youtube.com/watch?v=8u57WSXVpmw&t=19m19s).

Let's have a look at the results in this file:

In [None]:
import json
from pathlib import Path

json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl" # distributed alongside this notebook
with json_loc.open("r", encoding="utf8") as jsonfile:
    line = jsonfile.readline()
    print(line)   # print just the first line

We see that the full text of the original sentence is stored, together with a lot of detail about the annotation task. The most important bit is stored with the key `accept` at the end: this is the value of our manual annotation. For this specific sentence and this specific mention, the option with key `Q312545` was manually selected. This is the information that we'll train our entity linker on.

# Training the Entity Linker

To feed training data into our Entity Linker, we format our data as a structured tuple. The first part is the raw text, and the second part is a dictionary of annotations. This dictionary defines the named entities we want to link ("entities"), as well as the actual gold-standard links ("links").

In [None]:
import json
from pathlib import Path

dataset = []
json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl"
with json_loc.open("r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        if example["answer"] == "accept":
            QID = example["accept"][0]
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            links_dict = {QID: 1.0}
            # append only accepted examples; otherwise stale values from a previous line would be reused
            dataset.append((text, {"links": {offset: links_dict}, "entities": entities}))
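To see the target format in isolation, the sketch below converts a single record into the `(text, annotations)` tuple. The record is a hand-written stand-in that uses only the keys read by the cell above (`text`, `spans`, `accept`, `answer`). Real Prodigy output contains more fields, and the helper name `to_training_tuple` is hypothetical.

```python
import json

# A hand-written line imitating the JSONL file; only the keys used for conversion are included.
line = json.dumps({
    "text": "Tennis champion Emerson was expected to win Wimbledon.",
    "spans": [{"start": 16, "end": 23, "label": "PERSON"}],
    "accept": ["Q312545"],
    "answer": "accept",
})


def to_training_tuple(record):
    """Convert one annotation record to (text, annotations), or None if it was not accepted."""
    if record.get("answer") != "accept" or not record.get("accept"):
        return None
    span = record["spans"][0]
    offset = (span["start"], span["end"])
    qid = record["accept"][0]
    annotations = {
        "links": {offset: {qid: 1.0}},                       # gold-standard link for the mention
        "entities": [(offset[0], offset[1], span["label"])],  # the mention itself
    }
    return record["text"], annotations


sample = to_training_tuple(json.loads(line))
print(sample)
```

Returning `None` for rejected or skipped records keeps them out of the dataset instead of pairing their text with an unrelated annotation.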

To check whether the conversion looks OK, we can just print the first sample in our dataset. 

In [None]:
dataset[0]

We can also check some statistics in this dataset. How many cases of each QID do we have annotated?

In [None]:
gold_ids = []
for text, annot in dataset:
    for span, links_dict in annot["links"].items():
        for link, value in links_dict.items():
            if value:
                gold_ids.append(link)

from collections import Counter
print(Counter(gold_ids))

We got exactly 10 annotated sentences for each of our Emersons. Of these, we'll now set aside 6 cases (2 per entity) in a separate test set.

In [None]:
import random

train_dataset = []
test_dataset = []
for QID in sorted(set(gold_ids)):   # iterate over the distinct gold-standard QIDs
    indices = [i for i, j in enumerate(gold_ids) if j == QID]
    train_dataset.extend(dataset[index] for index in indices[0:8])  # first 8 in training
    test_dataset.extend(dataset[index] for index in indices[8:10])  # last 2 in test
    
random.shuffle(train_dataset)
random.shuffle(test_dataset)
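The split above is stratified: every entity contributes the same number of examples to the training and test sets. Below is a small sketch of the same idea on toy labels, using only the standard library. The labels `A`, `B`, `C` and the helper name `split_per_label` are placeholders for illustration.

```python
from collections import defaultdict


def split_per_label(labels, n_train=8):
    """Return (train_indices, test_indices), taking the first n_train examples of each label for training."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_indices, test_indices = [], []
    for label in sorted(indices_by_label):
        train_indices.extend(indices_by_label[label][:n_train])
        test_indices.extend(indices_by_label[label][n_train:])
    return train_indices, test_indices


toy_labels = ["A", "B", "C"] * 10   # 10 examples for each of 3 labels
train_idx, test_idx = split_per_label(toy_labels)
print(len(train_idx), len(test_idx))  # 24 6
```

Splitting per label guarantees that each entity is represented in the test set, which a purely random split of 30 examples would not.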

With our datasets now properly set up, we'll now create `Example` objects to feed into the training process. This object is new in spaCy v3. Essentially, it contains a document with predictions (`predicted`) and one with gold-standard annotations (`reference`). During training, the pipeline will compare its predictions to the gold-standard, and update the weights of the neural network accordingly.

For entity linking, the algorithm needs access to gold-standard sentences, because the algorithms use the context from the sentence to perform the disambiguation. You can either provide gold-standard `sent_starts` annotations, or run a component such as the `parser` or `sentencizer` on your reference documents:

In [None]:
from spacy.training import Example

TRAIN_EXAMPLES = []
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
for text, annotation in train_dataset:
    example = Example.from_dict(nlp.make_doc(text), annotation)
    example.reference = sentencizer(example.reference)
    TRAIN_EXAMPLES.append(example)
    

Then, we'll create a new Entity Linking component and add it to the pipeline. 

We also need to make sure the `entity_linker` component is properly initialized. To do this, we need a `get_examples` function that returns some example training data, as well as a `kb_loader` argument. This is a `Callable` function that creates the `KnowledgeBase`, given a certain `Vocab` instance. Here, we will load our KB from disk, using the built-in [`spacy.KBFromFile.v1`](https://spacy.io/api/architectures#KBFromFile) function, which is defined in `spacy.ml.models`. 

In [None]:
from spacy.ml.models import load_kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
# the path must match the location passed to kb.to_disk() earlier in this notebook
entity_linker.initialize(get_examples=lambda: TRAIN_EXAMPLES, kb_loader=load_kb(output_dir / "kb_lilsis"))

Next, we will run the actual training loop for the new component, taking care to only train the entity linker and not the other components. 

In [None]:
from spacy.util import minibatch, compounding

with nlp.select_pipes(enable=["entity_linker"]):   # train only the entity_linker
    optimizer = nlp.resume_training()
    for itn in range(500):   # 500 iterations takes about a minute to train
        random.shuffle(TRAIN_EXAMPLES)
        batches = minibatch(TRAIN_EXAMPLES, size=compounding(4.0, 32.0, 1.001))  # increasing batch sizes
        losses = {}
        for batch in batches:
            nlp.update(
                batch,   
                drop=0.2,      # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)
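The batch sizes passed to `minibatch` grow slowly from 4 towards 32. The generator below is a plain-Python approximation of that schedule, written under the assumption that `compounding(start, stop, compound)` multiplies the current value by `compound` after every batch and clips it at `stop`. The function name `compounding_sizes` is hypothetical.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    """Yield an endless series that starts at `start`, grows by `compound` each step, and is capped at `stop`."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


sizes = list(islice(compounding_sizes(4.0, 32.0, 1.001), 3000))
print(sizes[0], round(sizes[1], 3), sizes[-1])
```

With a factor of 1.001 the cap of 32 is reached only after roughly two thousand batches, so early updates use small, noisy batches and later updates use larger, more stable ones.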

The final training loss is pretty small, which is a good sign. But to truly verify whether our model generalizes well, we need to test it on unseen data.

# Testing the Entity Linker

Let's first apply it on our original sentence. For each entity, we print the text and label as before, but also the disambiguated QID as predicted by our entity linker.

In [None]:
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

We see that Emerson gets disambiguated to Q312545, which is the correct ID for the tennis player. Note also that the entity "Wimbledon" gets the annotation `NIL`, which is basically just a placeholder value, showing that the NEL component could not find any relevant ID for this entity. This happens because our Knowledge base and the Entity Linking component have only been trained on "Emerson" examples, and are thus quite limited.

Let's see what the model predicts for the 6 sentences in our test dataset, that were never seen during training.

In [None]:
for text, true_annot in test_dataset:
    print(text)
    print(f"Gold annotation: {true_annot}")
    doc = nlp(text)  # to make this more efficient, you can use nlp.pipe() just once for all the texts
    for ent in doc.ents:
        if ent.text == "Emerson":
            print(f"Prediction: {ent.text}, {ent.label_}, {ent.kb_id_}")
    print()

These results may vary a little from run to run, but usually the EL pipeline will get 5 out of 6 predictions correct (83% accuracy). Random guessing would have only achieved 33%.
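The accuracy figures quoted above follow from simple counting. A short sketch with placeholder labels makes the arithmetic explicit. The IDs `E1`, `E2` and `E3` are stand-ins for the three entities, and the single wrong prediction is made up for illustration.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted ID equals the gold ID."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


gold = ["E1", "E1", "E2", "E2", "E3", "E3"]
predicted = ["E1", "E1", "E2", "E3", "E3", "E3"]   # one mistake out of six

print(f"Model accuracy: {accuracy(gold, predicted):.0%}")
print(f"Random baseline with 3 equally likely candidates: {1 / 3:.0%}")
```

With only six test sentences, a single prediction changes the accuracy by almost 17 percentage points, so the figure should be read as a rough indication only.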

Hopefully, this tutorial has shown you how to implement an Entity Linking component in spaCy. The knowledge base and training dataset used here were kept small for demonstration purposes, but in reality you'll want to use a much bigger representative set of entities, perhaps from an ontology or dictionary that is relevant to your use-case. 

If you have general questions on how to use this functionality in your own application, the best route is to post a new Stack Overflow question and tag it with the label `spaCy`. If you run into an actual bug with the Entity Linking functionality, you can also open an issue on spaCy's GitHub tracker.

I hope your next NLP project will incorporate entity linking!