# Named Entity Linking

A crucial part of question answering over knowledge graphs is identifying named entities in the question, then disambiguating them and linking them to the corresponding ID in the KG. There are two challenges: recognizing strings as named entities, and finding the intended entity in the KG.

In Spacy, we can use the built-in entity annotation layer for entity recognition. There are also various options for integrating entity linking into the Spacy annotation pipeline. We explore these options below, with Wikidata as the KG. In all experiments, we use a set of 50 questions about movies and actors as test data. All experiments were done with Spacy 3.1.

## Loading Spacy

We noticed a problem with Spacy's sentence segmentation when it is applied to sentences containing movie titles. Titles such as *'Doctor Who'* or *'O Brother Where Art Thou'* can confuse the sentence segmenter and lead Spacy to assume that a new sentence starts somewhere in the title. To avoid this problem, and since we already know that each input is exactly one sentence, we use the custom splitter below, as suggested by the Spacy documentation.


In [1]:
import spacy

# https://spacy.io/usage/processing-pipelines#custom-components
# Custom sentence segmentation that never splits the input question

from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc):
        # Only the first token starts a sentence. Explicitly setting False
        # for all other tokens tells the parser to leave them alone.
        token.is_sent_start = (i == 0)
    return doc

nlp = spacy.load('en_core_web_lg')

nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser


<function __main__.custom_sentencizer(doc)>

## Spacy Named Entity Recognition

First, we test the NER component: which entities does a standard Spacy model (lg) recognize?

The lg model does a reasonable job at recognizing NEs, although it misses several (for example *Bad Times at the El Royale*, *Inception*, *Jaws* and *Se7en* in the output below). The easiest way to improve coverage is to add manual patterns, as explained below. Another option is to try the transformer model (trf), which in many ways is more accurate than the lg model.

In [2]:
with open('test_all1.csv')  as questions :
    for line in questions :
        (id, question) = line.strip().split('\t')
        doc = nlp(question)
        for sent in doc.sents :
            print(sent.text)
            for ent in sent.ents:
                print(ent.text)
        print()
        
        

Who directed Tenet?
Tenet

Who directed Bad Times at the El Royale?

Who wrote the music for Once Upon a Time in the West?
Once Upon a Time
West

Which streaming service is the distributor of Tiger King?
Tiger King

What is the country of origin of The Intouchables?
Intouchables

When did John Hamilton die?
John Hamilton

Where did Alan Rickman die?
Alan Rickman

In which year was The Truman Show released?
which year
The Truman Show

What is the box office ranking of the movie Inception?

Inception was directed by whom?

By whom was Jaws directed?

What is Michael Cera's place of birth?
Michael Cera's

Walt Disney Animation Studios was founded by whom?
Walt Disney Animation Studios

What is the catchphrase of the fictional character James Bond?
James Bond

What is the main subject of Se7en?

What are the genres of Fargo?
Fargo

Who are the directors of The Big Lebowski?

Who are Christopher Nolan's children?
Christopher Nolan's

Which movies did Jan de Bont direct?
Jan de Bont

What aw

## Entity linking with the Wikidata entity finder API

For entity linking, we can use the Wikidata API to find the corresponding entity IDs. Note that the API can return multiple results for a given text string. Here, we simply return the ID of the first result. Apart from the ID, the API also returns a label and a description for each match, which could be used for disambiguation (i.e. prefer entities with certain keywords in the description field, such as *actor* or *film*).

Integration with Spacy is done by introducing an extension attribute on the spans found by NER. See https://spacy.io/usage/rule-based-matching#entityruler for an example and discussion.
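The keyword-based disambiguation mentioned above could look roughly like the sketch below. It is only an illustration, not the method used in the experiments: the function name `pick_entity_id` and the keyword list are assumptions, and the input is expected to be the list found under the `search` key of the API response, where each item has an `id` and (usually) a `description`.

```python
def pick_entity_id(results, keywords=("film", "actor", "actress", "director", "television series")):
    """Return the id of the first search result whose description contains a keyword.

    Hypothetical helper: falls back to the first result if no description
    matches, and to 'no_id_found' if the result list is empty.
    """
    for result in results:
        # The description field can be absent for some entities.
        description = (result.get("description") or "").lower()
        if any(keyword in description for keyword in keywords):
            return result["id"]
    if results:
        return results[0]["id"]
    return "no_id_found"


# Made-up search results (the ids are placeholders, not real Wikidata ids).
example_results = [
    {"id": "Q_PLACEHOLDER_CITY", "label": "Fargo", "description": "city in North Dakota"},
    {"id": "Q_PLACEHOLDER_FILM", "label": "Fargo", "description": "1996 film"},
]
print(pick_entity_id(example_results))
```

In the getter, such a helper would take the place of simply picking the first item of the search results. Plain substring matching is crude (it ignores word boundaries and context), so the keyword list needs tuning for the domain.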


In [2]:
import requests

from spacy.tokens import Span
    
def wikidata_entity_link(span) :
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities', 
              'language':'en',
              'format':'json',
              'search': span.text}
    response = requests.get(url, params=params).json()
    # this part can be replaced by fancier disambiguation methods, or by returning a list of ids from all search results
    try:
        wd_id = response['search'][0]['id']
    except (KeyError, IndexError):
        # KeyError: the response has no 'search' field; IndexError: no matches found
        wd_id = 'no_id_found'
    return wd_id
            
Span.set_extension('wikidata_id',getter=wikidata_entity_link)    

with open('test_all1.csv')  as questions :
    for line in questions :
        (id, question) = line.strip().split('\t')
        doc = nlp(question)
        for sent in doc.sents :
            print(sent.text)
        for ent in doc.ents:
            print(ent.text, ent._.wikidata_id)
        print()

Who directed Tenet?
Tenet Q63985561

Who directed Bad Times at the El Royale?

Who wrote the music for Once Upon a Time in the West?
Once Upon a Time Q23673
West Q679

Which streaming service is the distributor of Tiger King?
Tiger King Q88306935

What is the country of origin of The Intouchables?
Intouchables Q595

When did John Hamilton die?
John Hamilton Q46648449

Where did Alan Rickman die?
Alan Rickman Q106481

In which year was The Truman Show released?
which year no_id_found
The Truman Show Q214801

What is the box office ranking of the movie Inception?

Inception was directed by whom?

By whom was Jaws directed?

What is Michael Cera's place of birth?
Michael Cera's no_id_found

Walt Disney Animation Studios was founded by whom?
Walt Disney Animation Studios Q1047410

What is the catchphrase of the fictional character James Bond?
James Bond Q844

What is the main subject of Se7en?

What are the genres of Fargo?
Fargo Q34109

Who are the directors of The Big Lebowski?

Who are 

### Using the entity ruler 

For entities not recognized by the built-in NER component, we can provide a manually compiled list of patterns. If we add these using the entity_ruler pipeline component, it integrates with the built-in NER component using the logic described in the documentation (i.e. a ruler added before the NER component takes precedence over it, but other options are possible). Another nice feature is that the extension attribute wikidata_id is also set for entities recognized by the entity_ruler.

https://spacy.io/usage/rule-based-matching#entityruler

In the example below, for instance, _Bad Times at the El Royale_ is recognized as an entity, and a wikidata_id is added. Also, _Once Upon a Time in the West_ is no longer broken into two entities, but recognized as a single entity (i.e. the manual patterns override the built-in NER).

The challenge is of course to think of robust methods for creating pattern files for a given domain and application. 
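As a starting point, a pattern file can be generated from a plain list of titles. The sketch below is a minimal illustration under assumptions: the helper names, the `WORK_OF_ART` label and the output file name are chosen here for the example, and the contents of the actual movie_patterns.jsonl used in the experiments are not shown. Each line is a JSON object with a `label` and a string `pattern`, which is the JSONL format the entity ruler reads with `from_disk`.

```python
import json
import os
import tempfile


def build_patterns(titles, label="WORK_OF_ART"):
    """Turn a list of titles into entity-ruler phrase patterns (hypothetical helper)."""
    seen = set()
    patterns = []
    for title in titles:
        title = title.strip()
        # Skip blank entries and exact duplicates, keeping the original order.
        if title and title not in seen:
            seen.add(title)
            patterns.append({"label": label, "pattern": title})
    return patterns


def write_patterns(patterns, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as out:
        for pattern in patterns:
            out.write(json.dumps(pattern, ensure_ascii=False) + "\n")


titles = ["Bad Times at the El Royale", "Once Upon a Time in the West", "Inception"]
example_path = os.path.join(tempfile.gettempdir(), "movie_patterns_example.jsonl")
write_patterns(build_patterns(titles), example_path)
```

String patterns only match the exact surface form, so variants in spelling or casing need their own entries or token-based patterns. Where the list of titles comes from (for instance, labels retrieved from the KG) is the part that needs the most thought for a given domain.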



In [4]:
nlp.add_pipe("entity_ruler",first=True).\
       from_disk('./movie_patterns.jsonl')

with open('test_all1.csv')  as questions :
    for line in questions :
        (id, question) = line.strip().split('\t')
        doc = nlp(question)
        for sent in doc.sents :
            print(sent.text)
        for ent in doc.ents:
            print(ent.text, ent._.wikidata_id)
        print()
        

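ValueError: [E007] 'entity_ruler' already exists in pipeline. Existing names: ['entity_ruler', 'tok2vec', 'tagger', 'custom_sentencizer', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'ner']

This error appears when the cell is run more than once in the same session, because the entity_ruler component is then already part of the pipeline. Checking whether "entity_ruler" is in nlp.pipe_names before calling add_pipe avoids it.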

## Using Existing Entity Linkers with Spacy

There are two modules that can be added to the Spacy pipeline to also do entity linking with Wikidata as knowledge base. 

### Entity Linker

This module uses a dump of Wikidata to create a database with labels and IDs. Disambiguation is not context-sensitive: it takes the most frequently referred-to ID as the answer (as does the Wikidata API). It is therefore very comparable to the Wikidata API, but uses a static database instead of an API.

More details here: [spacy entity linker](https://pypi.org/project/spacy-entity-linker/) 

Overall, it does a reasonably good job, but its added value over the solution presented above is unclear (i.e. are there cases that this linker finds or resolves better than the Wikidata API?). One case in point is genitives (*Christopher Nolan's*), which this entity linker handles correctly (probably by stripping the genitive *s* before consulting the database), but this feature can easily be integrated into the wikidata_entity_link function as well.
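Stripping the genitive before querying the Wikidata API could be done along the following lines. This is a minimal sketch: `strip_genitive` is a hypothetical helper, and it only handles a trailing *'s* or a bare trailing apostrophe.

```python
import re


def strip_genitive(text):
    """Remove a trailing English genitive marker ('s or a bare apostrophe)."""
    # Handles both the straight (') and the typographic (’) apostrophe.
    return re.sub(r"(?:['’]s|['’])$", "", text.strip())


print(strip_genitive("Christopher Nolan's"))
print(strip_genitive("Michael Cera's"))
print(strip_genitive("Tenet"))
```

In the getter defined earlier, the search string would then be `strip_genitive(span.text)` instead of `span.text`. Note that this would also alter a title that itself ends in *'s*, so it is a heuristic rather than a general solution.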


In [25]:
nlp_el = spacy.load('en_core_web_lg')

nlp_el.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser


nlp_el.add_pipe("entityLinker", last=True)

<spacy_entity_linker.EntityLinker.EntityLinker at 0x7f94f1bf15b0>

In [26]:
with open('test_all1.csv')  as questions :
    for line in questions :
        (id, question) = line.strip().split('\t')
        print(question)
        doc = nlp_el(question)
        for sent in doc.sents :
            sent._.linkedEntities.pretty_print()
        print()

Who directed Tenet?
<EntityElement: https://www.wikidata.org/wiki/Q63985561 Tenet                     2020 film by Christopher Nolan                    >

Who directed Bad Times at the El Royale?
<EntityElement: https://www.wikidata.org/wiki/Q4840445 Bad Times                                                                   >
<EntityElement: https://www.wikidata.org/wiki/Q5352061 El Royale                 El Royal  the movie                               >

Who wrote the music for Once Upon a Time in the West?
<EntityElement: https://www.wikidata.org/wiki/Q638 music                     form of art using sound                           >
<EntityElement: https://www.wikidata.org/wiki/Q43297 Time                      American weekly news magazine                     >
<EntityElement: https://www.wikidata.org/wiki/Q313498 Benjamin West             Anglo-American painter                            >

Which streaming service is the distributor of Tiger King?
<EntityElement: https://www.wiki

<EntityElement: https://www.wikidata.org/wiki/Q11563 number                    mathematical object used to count, label, and measure>
<EntityElement: https://www.wikidata.org/wiki/Q28389 screenwriter              writer who writes for TV, films, comics and games >
<EntityElement: https://www.wikidata.org/wiki/Q319221 adventure film            film genre                                        >
<EntityElement: https://www.wikidata.org/wiki/Q208696 Tarzan                    1999 American animated adventure film produced by Walt Disney Feature Animation>

How many awards has Marilyn Monroe received?
<EntityElement: https://www.wikidata.org/wiki/Q618779 award                     something given to a person or a group of people to recognize their excellence in a certain field>
<EntityElement: https://www.wikidata.org/wiki/Q4616 Marilyn Monroe            American actress, model, and singer               >

How many episodes does Friends have?
<EntityElement: https://www.wikidata.org/wiki/Q19


### OpenTapioca

A second option is the OpenTapioca named entity linker described [here](https://opentapioca.org/), integrated in Spacy as a pipeline component. As far as I can see, it builds on the Spacy named entity recognizer and then adds Wikidata links for the entities found.

It does have a few issues:

* tends to include question words as entities
* misses all movie titles (recognizes Wall Street, but as the actual stock exchange)
* misses some actor names
* does not always find the Wikidata ID (for instance, Brad Pitt is not assigned an ID, even though a description is present)

The problems with the NER could perhaps be solved by also including a manual entity_ruler component (assuming OpenTapioca indeed iterates over all ents found by the Spacy pipeline).

Nevertheless, and in spite of being based on a context-sensitive disambiguation model, it seems to do a worse job than the options mentioned above. 



In [27]:
#import spacy 
#nlp = spacy.blank('en')
nlp_tap = spacy.load('en_core_web_lg')

nlp_tap.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser

nlp_tap.add_pipe('opentapioca')


<spacyopentapioca.entity_linker.EntityLinker at 0x7f9690faab80>

In [28]:
with open('test_all1.csv') as questions :
    for line in questions:
        (id,question) = line.strip().split('\t')
        doc = nlp_tap(question)
        print(question)
        for span in doc.ents:
            print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
        print()

Who directed Tenet?
('Tenet', '', 'ORG', 'American-Canadian heavy metal band', -0.7947068485334838)

Who directed Bad Times at the El Royale?

Who wrote the music for Once Upon a Time in the West?
('the West', 'Q160381', 'LOC', 'countries that identify themselves with an originally European', 0.13677535018577103)

Which streaming service is the distributor of Tiger King?
('Tiger King', '', 'PERSON', 'American zoo operator, internet personality, musician, and cult leader.', -0.02800093594267583)

What is the country of origin of The Intouchables?
('Intouchables', '', 'PERSON', None, None)

When did John Hamilton die?
('When', 'Q7992417', 'ORG', None, 0.3903881669222681)
('John Hamilton', '', 'PERSON', 'English navy commander (1820–1864)', -0.5938130589979281)
('die', 'Q1210678', 'ORG', 'band', 0.4659305230980987)

Where did Alan Rickman die?
('Alan Rickman', 'Q106481', 'PERSON', 'English film, television and stage actor, graphic designer (1946-2016)', 0.3719371138702773)

In which year 

## Conclusion

Combining the Wikidata API, some domain-specific disambiguation, and the entity_ruler to improve NER recall seems to work best.

There is also a module for entity linking against DBpedia (https://spacy.io/universe/project/spacy-dbpedia-spotlight), and one could use the same-as relation to go from DBpedia IDs to Wikidata IDs, but the benefits of such an approach are unclear at the moment.
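If the same-as route were explored, the final step of picking the Wikidata ID out of an entity's same-as links could be sketched as below. This rests on assumptions: that the same-as URIs of a DBpedia resource have already been retrieved as a list of strings, and that the Wikidata link has the form `http://www.wikidata.org/entity/Q...`. The function name is hypothetical and the input list is hand-written for illustration.

```python
import re

# Matches Wikidata entity URIs and captures the Q-id.
WIKIDATA_URI = re.compile(r"^https?://www\.wikidata\.org/(?:entity|wiki)/(Q\d+)$")


def wikidata_id_from_same_as(uris):
    """Return the first Wikidata Q-id found among same-as URIs, else 'no_id_found'."""
    for uri in uris:
        match = WIKIDATA_URI.match(uri.strip())
        if match:
            return match.group(1)
    return "no_id_found"


# Hand-written example input, not retrieved from DBpedia.
same_as = [
    "http://example.org/some-other-dataset/Alan_Rickman",
    "http://www.wikidata.org/entity/Q106481",
]
print(wikidata_id_from_same_as(same_as))
```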