# Introduction to Entity Linking

This notebook provides a short tutorial on how to implement and use spaCy's Entity Linking functionality with spaCy v3.  It can be used together with [this video](https://www.youtube.com/watch?v=8u57WSXVpmw). Note that this video was originally created for spaCy v2, but this notebook includes the updates needed for v3. If you want to use spaCy v2 instead, you can find the original code [here](https://github.com/explosion/projects/tree/master/nel-emerson).

**Entity Linking** (EL) is the challenge of resolving ambiguous textual mentions to unique concepts in a knowledge base. A related task is **Named Entity Recognition** (NER). An NER component basically identifies words in text that have a specific name and refer to real-world objects, such as people or organizations. spaCy offers pre-built Machine Learning models that perform Named Entity Recognition for a variety of languages (https://spacy.io/models).

!pip install spacy==3.0.6
!pip install spacy-lookups-data
!python -m spacy download en_core_web_lg

Let's load a pretrained English model, apply it to some sample text, and show the named entities it identifies by printing their text and label.

In [1]:
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(f"Named Entity '{ent.text}' with label '{ent.label_}'")

Named Entity 'Emerson' with label 'PERSON'
Named Entity 'Wimbledon' with label 'EVENT'


We see that this sentence contains a person called "Emerson" and an event called "Wimbledon". 

Unfortunately, there may be many people in the world called "Emerson", and this output still doesn't tell us which one exactly we meant. This is the challenge addressed by Entity Linking. It transforms an ambiguous textual mention to a unique identifier by looking at the context in which the mention occurs. 

In this specific case, the sentence gives us important clues: Emerson is clearly a professional tennis player. 

Searching the internet, we can establish that this sentence is most likely talking about Roy Emerson, an Australian tennis player. We can now resolve this entity in this sentence to its unique identifier from WikiData, which is a free and open, interlingual knowledge base. Its unique IDs always start with a Q, and "Roy Emerson" has the identifier Q312545: https://www.wikidata.org/wiki/Q312545

To implement an entity linking pipeline, we need 3 different steps. 

The first step, as we already saw, is Named Entity Recognition, in which the mention "Emerson" is labeled as a "Person". Next, the extracted mention needs to be resolved to a list of plausible candidates. In our case, we'll consider three different people named Emerson. Typically, this list is created by querying a knowledge base (KB) that contains various aliases and synonyms. In the final step, we need to reduce the list of candidates to just one final ID that represents the correct Emerson.

![Diagram of entity linking process](nel_schema.png)
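
To make the three steps concrete, here is a minimal plain-Python sketch of the flow. It is not how spaCy implements entity linking: the lookup table, the placeholder IDs `ID_WRITER` and `ID_FOOTBALLER`, and the word-overlap scoring are all assumptions made up for illustration. Only `Q312545` and the three short descriptions come from this tutorial.

```python
# Step 1 (NER) is assumed to be done: we already have the mention and its sentence.
mention = "Emerson"
sentence = "Tennis champion Emerson was expected to win Wimbledon."

# Step 2: candidate generation, sketched as a plain alias -> candidate IDs lookup.
# "ID_WRITER" and "ID_FOOTBALLER" are placeholders, not real WikiData QIDs.
alias_to_candidates = {"Emerson": ["Q312545", "ID_WRITER", "ID_FOOTBALLER"]}
descriptions = {
    "Q312545": "Australian tennis player",
    "ID_WRITER": "American writer",
    "ID_FOOTBALLER": "Brazilian footballer",
}

def generate_candidates(alias):
    """Return all candidate IDs for an alias, or an empty list if it is unknown."""
    return alias_to_candidates.get(alias, [])

# Step 3: disambiguation, sketched as counting the words shared by the sentence
# and each description (a crude stand-in for a trained model).
def disambiguate(sentence, candidates):
    context = set(sentence.lower().replace(".", "").split())

    def overlap(candidate_id):
        return len(context & set(descriptions[candidate_id].lower().split()))

    return max(candidates, key=overlap) if candidates else "NIL"

candidates = generate_candidates(mention)
best = disambiguate(sentence, candidates)
print(candidates, best)
```

The word "tennis" is the only overlap between the sentence and any description, so this toy scorer picks the tennis player. A trained entity linker replaces the overlap count with a learned similarity between the context and the entity vectors.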

This tutorial will show you how to use spaCy v3 to create a Knowledge base that will address the second step of candidate generation. Additionally, we will create a new Entity Linking component, and train its Machine Learning model on some annotated data.

In this notebook, we implement the functions and training loop from scratch. However, spaCy v3 has introduced a powerful and extensible training configuration system, which we advise using in most cases. You can find the corresponding implementation with the [config system](https://spacy.io/usage/training#config), runnable with the new [spacy projects](https://spacy.io/usage/projects), [here](https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson).

The aim of this tutorial is to help you get started implementing your own Entity Linking functionality with spaCy. If you want to know more about the technical details, check out this presentation at spaCy IRL 2019: https://www.youtube.com/watch?v=PW3RJM8tDGo&list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc&index=7&t=0s

# Creating the Knowledge Base 

The first step in performing Entity Linking is to set up a knowledge base that contains the unique identifiers of the entities we are interested in. In this tutorial we will create a very simple one with only 3 entries. We load the data from a pre-defined CSV file.

In [2]:
import pandas as pd

In [3]:
dataset = 'open_sanctions'  # or 'lilsis'

In [4]:
## Clean kb dataset

In [5]:
kb_entities=pd.read_csv(f'kb_datasets/{dataset}_entities.csv',index_col=0)

In [10]:
# Import kb dataset

In [11]:
kb_data=pd.read_csv(f'kb_datasets/kb_entities_{dataset}.csv',index_col=0)

In [12]:
kb_data=kb_data[['id','name','desc']]

In [13]:
kb_data.shape

(168918, 3)

In [14]:
## Generate synthetic aliases 

In [15]:
aliases_data=kb_data[kb_data['name'].duplicated(keep=False)].sort_values(['name'])

In [16]:
aliases_data['id']=aliases_data['id'].astype(str)

In [17]:
alias_dict={}
for alias in aliases_data['name'].unique():
    alias_dict[alias]=list(aliases_data.loc[aliases_data['name']==alias, 'id'].values)
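
The structure of `alias_dict` may be easier to see without pandas. The sketch below builds the same kind of mapping (name to list of entity IDs, kept only for names shared by several entities) from a handful of hypothetical records. The IDs are placeholders, not real KB entries.

```python
from collections import defaultdict

# Hypothetical (id, name) records standing in for rows of kb_data.
records = [
    ("id-1", "David Smith"),
    ("id-2", "David Smith"),
    ("id-3", "Adam Smith"),
    ("id-4", "David Smith"),
]

# Group the IDs by name, storing them as strings.
ids_by_name = defaultdict(list)
for entity_id, name in records:
    ids_by_name[name].append(str(entity_id))

# Keep only names that occur more than once, like duplicated(keep=False) does.
alias_dict_sketch = {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
print(alias_dict_sketch)
```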

In [18]:
kb_data['name'].value_counts()

МІНІСТЕРСТВО АГРАРНОЇ ПОЛІТИКИ ТА ПРОДОВОЛЬСТВА УКРАЇНИ    10
MAGOMED MAGOMEDOV                                           9
МІНІСТЕРСТВО ІНФРАСТРУКТУРИ УКРАЇНИ                         7
David Anderson                                              6
ДЕРЖАВНЕ УПРАВЛІННЯ СПРАВАМИ                                6
                                                           ..
Karin Gaardsted                                             1
Karin Løhde                                                 1
Karin Nødgaard                                              1
Karina Adsbøl                                               1
Бороздина Галина Александровна                              1
Name: name, Length: 167898, dtype: int64

In [19]:
# Preview kb data with the column names the tutorial expects
# (rename returns a copy, so kb_data itself is left unchanged)
kb_data.rename(columns={'id':'qid','context':'desc'})

Unnamed: 0,qid,name,desc
0,acf-00040861bc3f593000830d987d09967ef3503ef1,Kolyvanov Egor,"Russian propagandist: host of news program ""Se..."
1,acf-0011c68a768924609dc5da5707ac7fa4c4d645a2,Shipov Sergei Yurievich,"Russian chess player, grandmaster, chess coach..."
2,acf-001e7e4c0363f08f1e784c230457960b84a6416f,Egorov Ivan Mikhailovich,Deputy of the State Council of the Republic of...
3,acf-002c208139012c8d93b6298358188d7cadafe648,Goreslavsky Alexey Sergeyevich,Russian journalist and media manager. Helped d...
4,acf-002cc8fdf8fe41185091a7cb6c598663e7a22eb5,Samoilova Natalya Vladimirovna,"Russian singer, composer. Supported the action..."
...,...,...,...
386379,ua-nsdc-person-82-2019-2660,Васькевич Алла Вікторівна,Сайт так званої «Адміністрації м.Красний Луч Л...
386380,ua-nsdc-person-82-2019-2661,Афанасьєв Сергій Павлович,Сайт так званої «Адміністрації м.Первомайськ Л...
386381,ua-nsdc-person-82-2019-2662,Дейнека Олександр Анатолійович,Сайт так званої «Адміністрації м.Слов’яносербс...
386382,ua-nsdc-person-82-2019-2663,Горенко Сергій Сергійович,Сайт так званої «Генеральної прокуратури ЛНР» ...


In [20]:
#import csv
from pathlib import Path

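def load_entities():
    entities_loc = Path.cwd()/f'kb_datasets/kb_entities_{dataset}.csv'
    # The file has a header row and an index column (see the read_csv call above),
    # so read it the same way instead of treating the header as a data row
    kb_entities = pd.read_csv(entities_loc, index_col=0)[['id', 'name', 'desc']]

    names = dict()
    descriptions = dict()

    # Map each entity ID to its name and its description
    for qid, name, desc in kb_entities.itertuples(index=False, name=None):
        names[str(qid)] = str(name)
        descriptions[str(qid)] = str(desc)
    
    return names, descriptions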

In [None]:
name_dict, desc_dict = load_entities()
for QID in name_dict.keys():
    print(f"{QID}, name={name_dict[QID]}, desc={desc_dict[QID]}")

We have 3 entries here, of 3 different people called Emerson. One Australian tennis player, one American writer and one Brazilian footballer. We'll use this information to create our knowledge base. We need to define a fixed dimensionality for the entity vectors, which will be 300-D in our case.

In [22]:
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

To add each record to the knowledge base, we encode its description using the built-in word vectors of our `nlp` model. The `vector` attribute of a document is the average of its token vectors. We also need to provide a frequency, which is a raw count of how many times a certain entity appears in an annotated corpus. In this tutorial we're not using these frequencies, so we're setting them to an arbitrary value.
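
As a small illustration of what "average of its token vectors" means, the sketch below averages toy 3-D vectors by hand. The words and numbers are invented for this example. A real model such as `en_core_web_lg` supplies the 300-D vectors, and the length of the result has to match the `entity_vector_length` of the knowledge base.

```python
# Toy 3-D vectors standing in for the word vectors of a real model.
token_vectors = {
    "australian": [1.0, 0.0, 2.0],
    "tennis": [0.0, 3.0, 1.0],
    "player": [2.0, 3.0, 0.0],
}

def average_vector(tokens):
    """Average the vectors of the given tokens, dimension by dimension."""
    vectors = [token_vectors[token] for token in tokens]
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]

desc_enc = average_vector(["australian", "tennis", "player"])
print(desc_enc)  # [1.0, 2.0, 1.0]
```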

In [23]:
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)   # 342 is an arbitrary value here

Now we want to specify aliases or synonyms. We first add the full names. Here, we are 100% certain that they resolve to their corresponding QID, as there is no ambiguity.

In [24]:
for qid, name in name_dict.items():
    if name not in alias_dict.keys():
        kb.add_alias(alias=str(name), entities=[str(qid)], probabilities=[1])   # 100% prior probability P(entity|alias)

In [25]:
for alias_ in alias_dict.keys():
    qids=alias_dict[alias_]
    probs = [round(1/len(qids),2)-.01 for qid in qids]
    kb.add_alias(alias=alias_, entities=qids, probabilities=probs)  # sum(probs) should be <= 1 !

We also want to add the alias "Emerson". We'll assume that each of our 3 Emersons is equally famous and thus we set their probabilities to be equal for each entity.
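
Equal priors need a little care, because rounding `1/n` can push the sum above 1. The cell above handles this by subtracting 0.01 from each rounded value. As an alternative sketch, the hypothetical helper `uniform_priors` below (not part of spaCy) rounds each prior down instead, so the total stays at or below 1 up to floating-point rounding.

```python
import math

def uniform_priors(n_candidates, decimals=2):
    """Equal prior P(entity|alias) per candidate, rounded down to `decimals` places."""
    scale = 10 ** decimals
    prior = math.floor(scale / n_candidates) / scale
    return [prior] * n_candidates

for n in (1, 2, 3, 7):
    priors = uniform_priors(n)
    print(n, priors[0], round(sum(priors), 2))
```

For 3 equally likely candidates this gives 0.33 each, which is the situation described for the three Emersons.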

So this will be our Knowledge base. We can check the entities and aliases that are contained in it:

We can also print the candidates that are generated for a name that is not in the KB ("John Biden"), for an unambiguous name ("Adam Smith"), and for an ambiguous name ("David Smith").

In [27]:
candidate_1='John Biden'
print(f"Candidates for {candidate_1}: {[c.entity_ for c in kb.get_alias_candidates(candidate_1)]}")

Candidates for John Biden: []


In [30]:
candidate_2='Adam Smith'
print(f"Candidates for {candidate_2}: {[c.entity_ for c in kb.get_alias_candidates(candidate_2)]}")

Candidates for Adam Smith: ['Q350916']


In [36]:
candidate_3='David Smith'
print(f"Candidates for {candidate_3}: {[c.entity_ for c in kb.get_alias_candidates(candidate_3)]}")

Candidates for David Smith: ['Q3018800', 'Q5239878', 'Q53960880']


We notice that querying the KB with the ambiguous alias "David Smith" gives us 3 candidates, but if we query it with an unknown term, it just gives an empty list.

We can save the knowledge base by calling the function `to_disk` with an output location.

In [38]:
# change the directory and file names to whatever you like
output_dir = Path.cwd() / "my_output"
output_dir.mkdir(exist_ok=True)   # create the directory if it does not exist yet
kb.to_disk(output_dir / f'kb_{dataset}')

We can store the `nlp` object to file by calling `to_disk` as well.

In [39]:
nlp.to_disk(output_dir / f'nlp_{dataset}')

In [40]:
kb_data[kb_data['name']=='David Smith']

Unnamed: 0,id,name,desc
232095,Q3018800,David Smith,Quebec politician
256582,Q5239878,David Smith,Canadian senator
257965,Q53960880,David Smith,Australian Capital Territory politician


In [41]:
s=kb_data['name'].value_counts()>1

In [42]:
alias_df=kb_data[kb_data['name'].isin(s[s].index)].sort_values(by=['name','id'])
#alias_df.to_csv('aliases.csv')

In [None]:
## Export guardian articles as random txt sentences

In [105]:
from bs4 import BeautifulSoup

def get_article_paragraphs(html_text: str):
    
    """ Takes the full html of an article (CAPI format) and strips out all HTML tags. 
        Creates paragraphs from the <p></p> HTML items.

        :param html_text: the raw HTML of an article
        
        returns: article paragraphs: list(str)
        """

    soup = BeautifulSoup(html_text, features="html.parser")
    
    # Remove article embellishments (sub-headings, spans, asides, figures)
    for tag in soup.find_all(['h2', 'span', 'aside', 'figure']):
        tag.extract()
        
    # Keep the link text but drop the <a> tags themselves
    for a in soup.find_all('a'):
        a.unwrap()
        
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    
    return paragraphs

In [127]:
#gu_source_data='entity_source_data/gu_resampled_by_section_id.csv'
#gu_sample=pd.read_csv(gu_source_data,index_col=0)
#gu_sample['paragraphs'] = gu_sample['body_html'].apply(get_article_paragraphs)
#gu_sample['paragraphs']=gu_sample['paragraphs'].apply(lambda x: '<p>'.join(x))
#gu_sample=gu_sample['paragraphs'].str.split('<p>').explode().to_frame()
article_containing_alias_indices=[]
for alias in alias_df['name'].unique():
    # regex=False matches the alias literally, even if the name contains regex metacharacters
    matches = gu_sample[gu_sample['paragraphs'].str.contains(alias, regex=False)]
    if not matches.empty:
        article_containing_alias_indices.append(matches)

In [None]:
gu_sample.reset_index().to_csv('entity_source_data/gu_resampled_by_section_id_sentences_with_full_matches.csv')

In [93]:
gu_sample=pd.read_csv('entity_source_data/gu_resampled_by_section_id_sentences_with_full_matches.csv',index_col=0)

In [95]:
gu_sample=gu_sample.sample(frac=1)

In [99]:
gu_sample.drop(columns=['index'], inplace=True)


In [100]:
output_name='gu_resampled_by_section_id_sentences_with_full_matches'
article_containing_alias=gu_sample
articles_containing_alias=[df_row[['body_text']].values[0][0] for df_row in article_containing_alias_indices]

with open(Path.cwd() / 'entity_source_data' / f'{output_name}.txt', 'w') as fp:
    for item in articles_containing_alias:
        fp.write("%s\n" % item)
    print('Done')

Done


In [87]:
article_containing_alias

Unnamed: 0,body_text
112839,[ And given Democrats’ extremely narrow majori...
4849,[ Then Edward Heath suspended Stormont]
127342,[ “I’m my own boss]
107479,"[ “After how scary the last year has been, we’..."
85172,[ But it has not been able to persuade the EU ...
...,...
4611,[ “To get that message out five days in advanc...
131659,[ Yellen says this is important]
50094,[ A Labour source has been in touch to say tha...
65216,"[ A USCIS spokeswoman, Pamela Wilson, said the..."


In [86]:
gu_sample['body_text']=gu_sample['body_text'].apply(lambda r: r.split('.'))

In [51]:
gu_sample=gu_sample['body_text'].explode().sample(frac=1).reset_index(drop=True)

In [52]:
with open(Path.cwd() /f'entity_source_data/{output_name}_resampled_sentences_randomised.txt', 'w') as fp:
    for item in gu_sample:
        fp.write("%s.\n" % item)
    print('Done')

Done


# Creating a training dataset

Now, we need to create some annotated data to train an Entity Linking algorithm on. To do so, we will use the annotation tool Prodigy, but you could generate the data in whatever tool you like.

If you are watching [the video](https://www.youtube.com/watch?v=8u57WSXVpmw), it will explain how to obtain annotated data with Prodigy. The final result will be a JSONL file that is distributed alongside this notebook. We'll now use this JSONL file to train our entity linker. If you want to skip the annotation part in the video, you can fast forward to [this section](https://www.youtube.com/watch?v=8u57WSXVpmw&t=19m19s).

Let's have a look at the results in this file:

In [None]:
import json
from pathlib import Path

json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl" # distributed alongside this notebook
with json_loc.open("r", encoding="utf8") as jsonfile:
    line = jsonfile.readline()
    print(line)   # print just the first line

We see that the full text of the original sentence is stored, together with a lot of detail about the annotation task. The most important bit is stored with the key `accept` at the end: this is the value of our manual annotation. For this specific sentence and this specific mention, the option with key `Q312545` was manually selected. This is the information that we'll train our entity linker on.
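
If you don't have the JSONL file at hand, the sketch below shows how the relevant fields can be read from one record. The record is hypothetical and heavily trimmed: it only contains the keys that this notebook uses later on (`text`, `spans`, `accept`, `answer`), and the span offsets were filled in by hand for the example sentence.

```python
import json

# Hypothetical, trimmed-down JSONL line; a real annotation file holds many more fields.
line = json.dumps({
    "text": "Tennis champion Emerson was expected to win Wimbledon.",
    "spans": [{"start": 16, "end": 23, "label": "PERSON"}],
    "accept": ["Q312545"],
    "answer": "accept",
})

record = json.loads(line)
span = record["spans"][0]
mention = record["text"][span["start"]:span["end"]]   # the annotated mention
# The chosen option is only meaningful if the annotator accepted the task.
selected_qid = record["accept"][0] if record["answer"] == "accept" else None
print(mention, selected_qid)
```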

# Training the Entity Linker

To feed training data into our Entity Linker, we format our data as a structured tuple. The first part is the raw text, and the second part is a dictionary of annotations. This dictionary defines the named entities we want to link ("entities"), as well as the actual gold-standard links ("links").

In [None]:
import json
from pathlib import Path

dataset = []
json_loc = Path.cwd().parent / "assets" / "emerson_annotated_text.jsonl"
with json_loc.open("r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        if example["answer"] == "accept":
            QID = example["accept"][0]
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            links_dict = {QID: 1.0}
            # append only accepted examples, so rejected ones don't reuse stale annotations
            dataset.append((text, {"links": {offset: links_dict}, "entities": entities}))

To check whether the conversion looks OK, we can just print the first sample in our dataset. 

In [None]:
dataset[0]
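
Beyond eyeballing the first sample, a small sanity check can catch malformed tuples early. The helper `check_sample` below is a hypothetical utility written for this sketch, not part of spaCy. It verifies that every link offset also appears in `entities`, that the span is not empty in the text, and that at least one candidate is marked as the gold link.

```python
def check_sample(sample):
    """Return a list of problems found in one (text, annotations) training tuple."""
    text, annot = sample
    problems = []
    entity_offsets = {(start, end) for start, end, _label in annot["entities"]}
    for offset, links in annot["links"].items():
        start, end = offset
        if offset not in entity_offsets:
            problems.append(f"link at {offset} has no matching entity span")
        if not text[start:end].strip():
            problems.append(f"span {offset} is empty in the text")
        if not any(value == 1.0 for value in links.values()):
            problems.append(f"span {offset} has no gold link")
    return problems

sample = (
    "Tennis champion Emerson was expected to win Wimbledon.",
    {"links": {(16, 23): {"Q312545": 1.0}}, "entities": [(16, 23, "PERSON")]},
)
print(check_sample(sample))  # [] means no problems were found
```

With the real data, the same check could be run over every item, for example `[check_sample(s) for s in dataset]`.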

We can also check some statistics in this dataset. How many cases of each QID do we have annotated?

In [None]:
gold_ids = []
for text, annot in dataset:
    for span, links_dict in annot["links"].items():
        for link, value in links_dict.items():
            if value:
                gold_ids.append(link)

from collections import Counter
print(Counter(gold_ids))

We got exactly 10 annotated sentences for each of our Emersons. Of these, we'll now set aside 6 cases in a separate test set.

In [None]:
import random

train_dataset = []
test_dataset = []
for QID in sorted(set(gold_ids)):   # one train/test split per annotated entity ID
    indices = [i for i, j in enumerate(gold_ids) if j == QID]
    train_dataset.extend(dataset[index] for index in indices[0:8])  # first 8 in training
    test_dataset.extend(dataset[index] for index in indices[8:10])  # last 2 in test
    
random.shuffle(train_dataset)
random.shuffle(test_dataset)
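
The same per-entity split can be written as a reusable function. The sketch below is one possible formulation, run on placeholder data (3 labels with 10 samples each). The function name `split_per_label` and the fixed seed are choices made for this example.

```python
import random
from collections import defaultdict

def split_per_label(samples, labels, n_train, seed=0):
    """Put the first n_train samples of each label in train and the rest in test."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, test = [], []
    for label in sorted(by_label):
        train.extend(by_label[label][:n_train])
        test.extend(by_label[label][n_train:])
    rng = random.Random(seed)   # fixed seed so the shuffle is reproducible
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Placeholder data: 3 labels with 10 samples each.
labels = [label for label in ("A", "B", "C") for _ in range(10)]
samples = [f"sentence {i}" for i in range(len(labels))]
train, test = split_per_label(samples, labels, n_train=8)
print(len(train), len(test))  # 24 6
```

Splitting per label keeps every entity represented in both sets, which a plain random split over 30 sentences would not guarantee.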

With our datasets now properly set up, we'll now create `Example` objects to feed into the training process. This object is new in spaCy v3. Essentially, it contains a document with predictions (`predicted`) and one with gold-standard annotations (`reference`). During training, the pipeline will compare its predictions to the gold-standard, and update the weights of the neural network accordingly.

For entity linking, the algorithm needs access to gold-standard sentences, because the algorithms use the context from the sentence to perform the disambiguation. You can either provide gold-standard `sent_starts` annotations, or run a component such as the `parser` or `sentencizer` on your reference documents:

In [None]:
from spacy.training import Example

TRAIN_EXAMPLES = []
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
for text, annotation in train_dataset:
    example = Example.from_dict(nlp.make_doc(text), annotation)
    example.reference = sentencizer(example.reference)
    TRAIN_EXAMPLES.append(example)
    

Then, we'll create a new Entity Linking component and add it to the pipeline. 

We also need to make sure the `entity_linker` component is properly initialized. To do this, we need a `get_examples` function that returns some example training data, as well as a `kb_loader` argument. This is a `Callable` function that creates the `KnowledgeBase`, given a certain `Vocab` instance. Here, we will load our KB from disk, using the built-in [`spacy.KBFromFile.v1`](https://spacy.io/api/architectures#KBFromFile) function, which is defined in `spacy.ml.models`. 

In [None]:
from spacy.ml.models import load_kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
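# Load the KB from the location it was saved to with kb.to_disk() above
# (kb_open_sanctions, because `dataset` was 'open_sanctions' when the KB was saved)
entity_linker.initialize(get_examples=lambda: TRAIN_EXAMPLES, kb_loader=load_kb(output_dir / "kb_open_sanctions"))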

Next, we will run the actual training loop for the new component, taking care to only train the entity linker and not the other components. 

In [None]:
from spacy.util import minibatch, compounding

with nlp.select_pipes(enable=["entity_linker"]):   # train only the entity_linker
    optimizer = nlp.resume_training()
    for itn in range(500):   # 500 iterations takes about a minute to train
        random.shuffle(TRAIN_EXAMPLES)
        batches = minibatch(TRAIN_EXAMPLES, size=compounding(4.0, 32.0, 1.001))  # increasing batch sizes
        losses = {}
        for batch in batches:
            nlp.update(
                batch,   
                drop=0.2,      # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)

The final training loss is pretty small, which is a good sign. But to truly verify whether our model generalizes well, we need to test it on unseen data.
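
For readers curious about the `compounding(4.0, 32.0, 1.001)` schedule used in the training loop, the sketch below re-creates the idea in plain Python. These are simplified stand-ins written for this notebook, not spaCy's own code, and they assume `start` is smaller than `stop`.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose sizes come from the schedule."""
    items = list(items)
    position = 0
    while position < len(items):
        batch_size = int(next(sizes))   # fractional sizes are truncated
        yield items[position:position + batch_size]
        position += batch_size

# First few values of the schedule
print(list(islice(compounding_sketch(4.0, 32.0, 1.001), 3)))

# Batches for 24 items, assuming 8 training sentences for each of 3 entities
batches = list(minibatch_sketch(range(24), compounding_sketch(4.0, 32.0, 1.001)))
print([len(batch) for batch in batches])
```

Because the loop above creates a fresh schedule in every iteration and 24 examples need only six batches, the batch size in this sketch never grows beyond 4. The compounding only starts to matter on larger datasets.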

# Testing the Entity Linker

Let's first apply it on our original sentence. For each entity, we print the text and label as before, but also the disambiguated QID as predicted by our entity linker.

In [None]:
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

We see that Emerson gets disambiguated to Q312545, which is the correct ID for the tennis player. Note also that the entity "Wimbledon" gets the annotation `NIL`, which is basically just a placeholder value, showing that the NEL component could not find any relevant ID for this entity. This happens because our Knowledge base and the Entity Linking component have only been trained on "Emerson" examples, and are thus quite limited.

Let's see what the model predicts for the 6 sentences in our test dataset, which were never seen during training.

In [None]:
for text, true_annot in test_dataset:
    print(text)
    print(f"Gold annotation: {true_annot}")
    doc = nlp(text)  # to make this more efficient, you can use nlp.pipe() just once for all the texts
    for ent in doc.ents:
        if ent.text == "Emerson":
            print(f"Prediction: {ent.text}, {ent.label_}, {ent.kb_id_}")
    print()

These results may vary a little from run to run, but usually the EL pipeline will get 5 out of 6 predictions correct (83% accuracy). Random guessing would have only achieved 33%.
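
The two percentages can be reproduced with a few lines of arithmetic. The gold and predicted labels below are placeholders, chosen so that 5 of the 6 predictions match. They are not actual model output.

```python
# Placeholder labels: 3 entities with 2 test sentences each, one prediction wrong.
gold = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "C", "C", "C"]

correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
random_baseline = 1 / len(set(gold))   # uniform guess among the 3 candidates
print(f"accuracy={accuracy:.0%}, random baseline={random_baseline:.0%}")
```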

Hopefully, this tutorial has shown you how to implement an Entity Linking component in spaCy. The knowledge base and training dataset used here were kept small for demonstration purposes, but in reality you'll want to use a much bigger representative set of entities, perhaps from an ontology or dictionary that is relevant to your use-case. 

If you have general questions on how to use this functionality in your own application, the best route is to post a new Stack Overflow question and tag it with the label `spaCy`. If you run into an actual bug with the Entity Linking functionality, you can also open an issue on spaCy's GitHub tracker.

I hope your next NLP project will incorporate entity linking!