# Linking Named Entities
---
---

## Named Entities and Linked Data
The named entities we have recognised in the Henslow data would be much more useful if they could be linked to other data known about those entities. This principle is called **linked data**. Linked data can enrich the discovery of collections and allow sophisticated searches for the knowledge within those collections.

If the data is freely available and openly licensed it is known as **linked open data (LOD)**. The diagram below shows the extent of LOD in 2010. Since then the **linked open data cloud** has grown immensely and you can explore it for yourself at [www.lod-cloud.net](https://www.lod-cloud.net/).

<img src="https://lod-cloud.net/versions/2010-09-22/lod-cloud_colored.png" alt="Linked open data cloud in 2010" title="Linked open data cloud in 2010">

Linked data is a very big topic, so this notebook will only touch on a few introductory aspects that relate to the NER we have done in this course. In particular, we will focus on the automated ways of linking data that can be enabled by writing code, though the underlying principles can be understood without it.

---
---

## Disambiguate with an Authority File
One of the challenges with named entities is that there may be many different forms, spellings or abbreviations that refer to the same person, place, country, and so on. 

An **authority file** is a way of normalising and unifying this information for each entity into a single, authoritative **authority record** and giving it a **unique identifier**. Typically, all the known forms of a particular entity will be recorded in its authority record so that every form can be resolved to the same, correct entity.
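
As a rough illustration of the idea, an authority record can be thought of as one identifier plus a list of known name forms. The sketch below is a toy in-memory authority file, not how VIAF is implemented. The identifier and preferred form are taken from the VIAF link used in this notebook, while the variant forms and the helper names (`normalise`, `build_index`, `resolve`) are assumptions made up for the example.

```python
# A toy authority file: each record has a unique identifier and known name forms.
# The variant forms are illustrative assumptions, not copied from VIAF.
AUTHORITY_FILE = {
    "27063124": {
        "preferred": "Darwin, Charles, 1809-1882",
        "variants": ["Charles Darwin", "C. Darwin", "Darwin, Charles Robert"],
    },
}

def normalise(name):
    """Lower-case and collapse whitespace so trivially different forms compare equal."""
    return " ".join(name.lower().split())

def build_index(authority_file):
    """Map every known form of every entity to its unique identifier."""
    index = {}
    for identifier, record in authority_file.items():
        for form in [record["preferred"], *record["variants"]]:
            index[normalise(form)] = identifier
    return index

def resolve(name, index):
    """Return the identifier for a name form, or None if the form is unknown."""
    return index.get(normalise(name))

index = build_index(AUTHORITY_FILE)
print(resolve("charles  darwin", index))  # a known variant, despite case and spacing
print(resolve("Erasmus Darwin", index))   # not in this toy file
```

Every recorded form resolves to the same identifier, and an unknown form resolves to nothing rather than to a wrong entity.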

You may already be familiar with [VIAF: The Virtual International Authority File](https://viaf.org/), which is an authority service that unifies multiple authority files for people, geographic places, works of art, and more.

![VIAF: The Virtual International Authority File](http://www.bnc.cat/var/bnc_site/storage/images/el-blog-de-la-bc/viaf/1729168-1-cat-ES/VIAF_large.png)

[Searching for a name in the search box](https://viaf.org/viaf/27063124/#Darwin,_Charles,_1809-1882.) returns a VIAF ID, preferred and related names, and associated works.

![assets/viaf-charles-darwin.png](assets/viaf-charles-darwin.png)

---
---

## Lookup Entities Programmatically with Web APIs
The real power of centralised authorities such as VIAF emerges when their data is exposed via an **API** (Application Programming Interface). A web API is accessed via a particular web address and allows computer programs to request information and receive it in a structured format suitable for further processing. Typically, this data will be provided in either JSON or XML.

VIAF has several different APIs. The one we will use is [Authority Cluster](https://www.oclc.org/developer/api/oclc-apis/viaf/authority-cluster.en.html) Auto Suggest. Sadly, OCLC have removed their OCLC API Explorer, which was really handy for exploring the API as a human! 😞

In the old OCLC API Explorer we could search for "john stevens henslow" or any personal name from the Henslow letters:

![assets/viaf-api-charles-darwin.png](assets/viaf-api-charles-darwin.png)

It returned a list of results, in JSON format, with VIAF's suggestions for the best match, which you can see in the right-hand "Response" pane.

We can consume this data programmatically using Python tools with which we are already familiar from earlier notebooks.

In [None]:
import requests
import json

# Make the query to the API
query = "john stevens henslow"
# Passing the query via `params` lets requests URL-encode the spaces for us
response = requests.get('http://www.viaf.org/viaf/AutoSuggest', params={'query': query})

# Parse the JSON into a Python dictionary
data = json.loads(response.text)

# Get just the first entry in the results
data['result'][0]

If you compare this with the output of the API explorer above, you should see this is the same structure and information.

The VIAF ID is found in the `'viafid'` field:

In [None]:
data['result'][0]['viafid']

With this information we could now enrich the original XML with the VIAF ID for this named entity.
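
One practical wrinkle is that a lookup may return nothing useful. The sketch below shows one way to guard against that. It runs on hand-written samples rather than a live response, and it assumes (without confirmation from the VIAF documentation) that an unsuccessful query yields a missing, empty or `null` `result` field. The helper name `first_viaf_id` and the sample values are hypothetical.

```python
def first_viaf_id(data):
    """Return the 'viafid' of the first suggestion, or None if there is none."""
    results = data.get('result') or []  # treats a missing, None or empty field alike
    if not results:
        return None
    return results[0].get('viafid')

# Hand-written samples shaped like the parsed response (values are made up)
sample_match = {'result': [{'viafid': '0000'}]}
sample_no_match = {'result': None}

print(first_viaf_id(sample_match))
print(first_viaf_id(sample_no_match))
```

Returning `None` instead of raising an error makes it easy to skip unlinked entities when enriching many names at once.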

> **EXERCISE**: What are the problems you could anticipate with this sort of automated linking?

---
---
## Lookup Named Entities in Bulk using Web APIs
The scientific community has been busy normalising, disambiguating, and aggregating similar types of data for decades, in a movement parallel to, but largely separate from, developments in library science and the humanities.

[The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/what-is-gbif) is an international open data aggregator for hundreds of millions of species records. 

![Global Biodiversity Information Facility](http://data.biodiversity.be/gbif-logo.png)

In the [last notebook](4-updating-the-model-on-henslow-data.ipynb#Add-a-New-Entity-Type) we tried to add a new named entity type `TAXONOMY` for the model to learn. We defined this as a type of entity for any Linnaean taxonomic name (domain, kingdom, phylum, division, class, order, family, genus or species). Binomials (genus plus species together) were labelled as one span. 

Imagine if we wished to link these named taxonomic entities to the corresponding genus or species in the GBIF. How could we do this *en masse*?

Like VIAF, [GBIF also has a set of web APIs](https://www.gbif.org/developer/summary) and we can use the [Species API](https://www.gbif.org/developer/species) to search for species names.

> **EXERCISE**: Reading API documentation is a common activity for coders. Before you look at the code example below, open the [Species API](https://www.gbif.org/developer/species) documentation, scroll down to the 'Searching Names' section and see if you can work out which of the four resource URLs would be most useful for our case.

<a title="Flowers of Pulmonaria officinalis. Pharaoh Hound at the English language Wikipedia, CC BY-SA 3.0 &lt;http://creativecommons.org/licenses/by-sa/3.0/&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Lungwort.jpg"><img width="256" alt="Lungwort: Flowers of Pulmonaria officinalis" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Lungwort.jpg/256px-Lungwort.jpg"></a>
<p style="text-align: center; font-style: italic;">Flowers of Pulmonaria officinalis</p>

Let's start by trying one taxonomic (genus) name "Pulmonaria" to see what sort of result we can expect:

In [None]:
query = "Pulmonaria"
response = requests.get('https://api.gbif.org/v1/species/suggest?q=' + query)

data = json.loads(response.text)
data[0]

So far, so very similar to the VIAF API.

### Reconciling Historical Taxa
In reality, we need to be aware that some of the older names given to organisms in the Henslow letters are not easily reconciled with modern named taxa. (In the Darwin Correspondence Project (DCP), Shelley Innes, editor and research associate, is an expert in historical taxonomy and her work is available in the footnotes of the published DCP letters.)

Also, the Henslow letters often use ligature ash ('æ') rather than 'ae', which is used in family names in GBIF. The GBIF `suggest` API does not recognise 'æ' and 'ae' as equivalent so either our queries will need to be normalised, or we can try a different API.
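
If we choose to normalise our queries, a small helper is enough. This is a minimal sketch: the function name `normalise_ligatures` and the choice of ligatures to expand are assumptions, and real historical spellings may need more rules than this.

```python
# Ligatures to expand before querying; extend this table as needed.
LIGATURES = {'æ': 'ae', 'Æ': 'Ae', 'œ': 'oe', 'Œ': 'Oe'}

def normalise_ligatures(name):
    """Replace ligature characters with two-letter equivalents and trim whitespace."""
    for ligature, replacement in LIGATURES.items():
        name = name.replace(ligature, replacement)
    return name.strip()

print(normalise_ligatures('Cynipidæ'))
```

The normalised string could then be sent to the `suggest` API in place of the original spelling.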

<a title="Gall Wasp - Cynipidae family, Leesylvania State Park, Woodbridge, Virginia. Judy Gallagher, CC BY 2.0 &lt;https://creativecommons.org/licenses/by/2.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Gall_Wasp_-_Cynipidae_family,_Leesylvania_State_Park,_Woodbridge,_Virginia.jpg"><img width="256" alt="Gall Wasp - Cynipidae family, Leesylvania State Park, Woodbridge, Virginia" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a6/Gall_Wasp_-_Cynipidae_family%2C_Leesylvania_State_Park%2C_Woodbridge%2C_Virginia.jpg/256px-Gall_Wasp_-_Cynipidae_family%2C_Leesylvania_State_Park%2C_Woodbridge%2C_Virginia.jpg"></a>
<p style="text-align: center; font-style: italic;">Gall Wasp - Cynipidae family</p>

If there is no matchable name in GBIF we get an empty result:

In [None]:
query = "Cynipidæ"
response = requests.get('https://api.gbif.org/v1/species/suggest?q=' + query)

data = json.loads(response.text)
data

But if we try the `search` API instead there is no problem:

In [None]:
query = "Cynipidæ"
response = requests.get('https://api.gbif.org/v1/species/search?q=' + query)

data = json.loads(response.text)
data['results'][0]

Let's now take the list of taxonomic names from the previous notebook, cleaned up and normalised, and try to make a query with the whole list:

In [None]:
taxonomy = [
 'Adippe',
 'Alisma repens',
 'Alopecurus bulbosus',
 'Althaea hirsuta',
 'Anthemis Cotula',
 'Anthericum serotinum',
 'Anthyllis vulneraria',
 'Apargia hirta',
 'Arabis thaliana',
 'Araucaria imbricata',
 'Artemisia gallica',
 'Asclepiadeae',
 'Aspidia',
 'Asterophyllites',
 'Atriplex laciniata',
 'Bechera grandis',
 'Blysmus compressus',
 'Bos',
 'Bos primigenius',
 'Campanula rapunculus',
 'Campanula rotundifolia',
 'Carex',
 'Carex laevigata',
 'Centaurea solstitialis',
 'Cerastium humile',
 'Chara gracilis',
 'Cheiranthus sinuatus',
 'Chiasognathus Grantii',
 'Chironia littoralis',
 'Chrysosplenium alternifolium',
 'Cirisia',
 'Cochlearia danica',
 'Commelina coelestis',
 'Corbula costata',
 'Coryphodon',
 'Cracidae',
 'Crocus sativus',
 'Cryllas',
 'Cucubalus baccifer',
 'Cuscuta Epilinum',
 'Cycas',
 'Cycas circinalis',
 'Cyperus',
 'Cytheraea obliqua',
 'Daucus maritimum',
 'Dianthus caryophyllus',
 'Digitalis',
 'Diptera',
 'Epilobium hirsutm',
 'Eriocaulon',
 'Eriophorum',
 'Eriophorum polystachion',
 'Eriophorum pubescens',
 'Euphorbia portlandica',
 'Favularia nodosa',
 'G.campestris',
 'Gallinula Baillonii',
 'Glaucium violaceum',
 'Globulus',
 'Hedysarum',
 'Hedysarum scandens',
 'Hemiptera',
 'Holoptychus',
 'Hortus Siccus',
 'Hymenophyllum tunbridgense',
 'Iberis amara',
 'Inula',
 'Inula helenium',
 'Jungermanniae',
 'Knautia',
 'Lathyrus hirsutus',
 'Lepidoptera',
 'Linosyris',
 'Linum angustifol',
 'Lobelia urens',
 'Lonicera caprifolium',
 'Lysimachia',
 'Malaxis Loeslii',
 'Medicago denticulata',
 'Melampyrum arvense',
 'Melissa',
 'Mentha gentilis',
 'Menyanthes Nymphaeoides',
 'Mespilus cotoneaster',
 'Milium lendigerum',
 'Narcissus poeticus',
 'Nemeolius Lucina',
 'Neuropteris cordata',
 'Oenanthe crocata',
 'Ophrys arachnites',
 'Orchideae',
 'Orobanche caryophylacea',
 'Panicum viride',
 'Peperomia',
 'Phleum paniculatum',
 'Pisidium',
 'Polyporites Bowmanni',
 'Potamides plicatus',
 'Potamogeton fluitans',
 'Potamogeton gramineum',
 'Pothos',
 'Primula',
 'Primula scotica',
 'Primula vulgaris',
 'Psammobia rudis',
 'Pulicaria',
 'Pyrola',
 'Pyrola minor',
 'Pyrus pinnatifida',
 'Pyrus torminalis',
 'Quercus sessiliflora',
 'Ribes alpinum',
 'Rubus idaeus',
 'Ruppia',
 'Ruppia maritima',
 'Salicornia radicans',
 'Salvia pratensis',
 'Santolina maritima',
 'Scirpus caricinus',
 'Scirpus pauciflorus',
 'Sisymbrium monense',
 'Sonchus palustris',
 'Statice cordata',
 'Tetrandria',
 'Thalassophytes',
 'Tormentilla reptans',
 'Trifolium',
 'Trifolium subterraneum',
 'Turritis hirsuta',
 'Typha',
 'Ulmus suberosa',
 'Vaccinium myrtillus',
 'Velleius',
 'Vinca major',
 'Viola palustris',
 'Volucella',
 'Wellingtonia',
 'Zostera marina',
]

In [None]:
%%time

result = {}
for query in taxonomy:
    print(f'Fetching: https://api.gbif.org/v1/species/suggest?q={query}')
    response = requests.get('https://api.gbif.org/v1/species/suggest?q=' + query)
    data = json.loads(response.text)
    if data:
        print(f"Result: {data[0]}")
        result[query] = data[0]

In [None]:
len(result)

Why do you think it took so long? How can you tell if no match was found?

We now have all sorts of exciting information about these species names. Try some of the entity names to see if the search got the correct match.

In [None]:
entity = 'Lobelia urens'
gbif = result[entity]
rank = gbif['rank']
status = gbif['status']
gbif_id = gbif['key']

print(f'Entity: {entity}, rank: {rank}, status: {status}, gbif_id: {gbif_id}.')
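
To check the bulk lookup more systematically, we could list the names that returned nothing and flag matches whose returned name differs from the query. The sketch below uses a tiny hand-made stand-in for `taxonomy` and `result`. The `canonicalName` field is an assumption about the shape of the GBIF response, so inspect a real result before relying on it; the helper name `audit_matches` is hypothetical.

```python
def audit_matches(names, result):
    """Split names into unmatched ones and matches whose returned name differs."""
    unmatched = [name for name in names if name not in result]
    doubtful = [
        name for name in names
        if name in result
        and result[name].get('canonicalName', '').lower() != name.strip().lower()
    ]
    return unmatched, doubtful

# Hand-made sample data (not real API output)
sample_names = ['Lobelia urens', 'Epilobium hirsutm', 'Hortus Siccus']
sample_result = {
    'Lobelia urens': {'canonicalName': 'Lobelia urens'},
    'Epilobium hirsutm': {'canonicalName': 'Epilobium hirsutum'},
}

unmatched, doubtful = audit_matches(sample_names, sample_result)
print('No match:', unmatched)
print('Check by hand:', doubtful)
```

A differing name is not necessarily wrong (it may be a corrected spelling), which is why these cases are flagged for a human rather than discarded.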

---
---
## Named Entities and Knowledge Bases
A **knowledge base** is a system that stores facts and in some way links them with one another into a store of information that can be queried for new knowledge. A knowledge base may store semantic information with **triples** to create a **knowledge graph** where entities (nodes) are linked to other entities (nodes) by relationships (edges).

Formally, a triple is made up of subject, predicate and object. For example:

> "Odysseus" (subject) -> "is married to" (predicate) -> "Penelope" (object)

Many triples together form a graph:

![Knowledge graph](https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-primer/example-graph.jpg)

Each entity is represented by a URI, which is unique and identifies it unambiguously.
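
In code, a small knowledge graph can be sketched as a list of (subject, predicate, object) tuples. This is only an illustration of the data model, using plain strings instead of URIs. The second and third triples and the helper names `objects_of` and `subjects_linked_to` are added for the example.

```python
# Each triple is (subject, predicate, object)
triples = [
    ("Odysseus", "is married to", "Penelope"),
    ("Odysseus", "is father of", "Telemachus"),
    ("Penelope", "is mother of", "Telemachus"),
]

def objects_of(subject, predicate, graph):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

def subjects_linked_to(obj, graph):
    """Return every (subject, predicate) pair that points at `obj`."""
    return [(s, p) for s, p, o in graph if o == obj]

print(objects_of("Odysseus", "is married to", triples))
print(subjects_linked_to("Telemachus", triples))
```

Following edges in both directions like this is, in miniature, what a query against a knowledge graph does.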

Perhaps the most well-known knowledge base is [Wikidata](https://www.wikidata.org), which is collaborative (relies on data donations and user editing) and open (all the data is openly licensed for re-use).

You can get an idea of the vast store of data and query possibilities by using the [Wikidata Query Service](https://query.wikidata.org/).

> **EXERCISE**: Try some of the 'Examples' queries from the [Wikidata Query Service](https://query.wikidata.org/). Notice that some queries come with visualisations. Why do you think it takes so long for some of the queries to complete?

### Find Named Entities in a Knowledge Base
To query Wikidata's knowledge graph directly from code, we use a W3C-standard query language called **SPARQL** (SPARQL Protocol And RDF Query Language).

You can see the SPARQL queries in the Wikidata Query Service examples. They look like this:

```
#Map of hospitals
#added 2017-08
#defaultView:Map
SELECT * WHERE {
  ?item wdt:P31/wdt:P279* wd:Q16917;
        wdt:P625 ?geo .
}
```

Unfortunately, SPARQL has a demanding learning curve, but fortunately there are a number of [tools for programmers](https://www.wikidata.org/wiki/Wikidata:Tools/For_programmers) that can make our lives easier. 

We are going to use a Python package called **[wptools](https://github.com/siznax/wptools/)** to make querying Wikidata as easy as writing simple Python. wptools actually uses the [MediaWiki API](https://www.mediawiki.org/wiki/API:FAQ), which is cheating, or a good idea to avoid SPARQL, or both. 😆

First, let's try a simple string query:

In [None]:
import wptools

# Construct a query for the string "Lobelia urens"
page = wptools.page("Lobelia urens")

In [None]:
# Get the Wikidata and show it
page.get_wikidata()
page.data['wikibase']

The ID that is printed out `'Q3257667'` is the unique Wikidata ID, and the `wikidata_url` goes directly to the plant's unique URI. 

> **EXERCISE**: Try the Wikidata URL now and examine all the information that Wikidata knows about Lobelia urens. Notice in particular that it has a link to the GBIF ID '5408353'.

We can even get the plant's picture programmatically!

In [None]:
# Import some modules to help display images in Jupyter notebook code cells
from IPython.display import Image

# Get the picture URL from the Wikidata info
lobelia_pic_url = page.images()[0]['url']

# Display the image
Image(url=lobelia_pic_url, embed=True, width=400)

> **EXERCISE**: Try searching Wikidata for some of the other taxonomic names and fetching their pictures. What happens if the search is unsuccessful?

Since Wikidata already has a link to the GBIF ID that we have from before, can we query Wikidata directly with the GBIF ID and get the knowledge base information that way? 

The answer is yes! But we will have to make a small dive into the world of SPARQL...

### Make Simple SPARQL Queries
Rather than use the Wikidata Query Service like a human, we're going to interact with the SPARQL **endpoint** programmatically. An endpoint is the URL where you send a query for a particular web service. For the curious, here is a big list of known [SPARQL endpoints](https://www.w3.org/wiki/SparqlEndpoints).

Wikidata is the top entry on that list! But the endpoint listed is a bit out of date. We are going to use the [main Wikidata SPARQL endpoint](https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#SPARQL_endpoint) at: https://query.wikidata.org/sparql

We're going to use a different Python library called **[SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper)** to make the query.

In [None]:
from SPARQLWrapper import SPARQLWrapper, XML, JSON

# Set the endpoint URL
# We are a bot/script! So we also need to send a descriptive user-agent otherwise we get blocked!
sparql = SPARQLWrapper("https://query.wikidata.org/sparql", 
                       agent="Cambridge Digital Humanities Data School lab@cdh.cam.ac.uk")

# SPARQL query
sparql.setQuery("""
SELECT * WHERE {
  ?item wdt:P846 "5408353"
}
""")

# The endpoint returns XML by default, so we ask for JSON instead because it's easier to work with
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
result

If you now copy and paste the URL that has been returned into your browser, you should find yourself looking once again at the Lobelia entity. So far so good.

Let's take a moment to understand a bit more about the SPARQL query we just made:

* `SELECT *` means "select all" i.e. we want every variable used in the query to be returned (here that is just `?item`)
* `WHERE {}` is a clause to filter the results by whatever is between the curly braces `{}`
* `?item wdt:P846 "5408353"` is a triple:
 * `?item` means "any items (subjects) that match"
 * `wdt:P846` is a property (predicate) and in this case the property [`P846`](https://www.wikidata.org/wiki/Property:P846) is GBIF ID
 * `"5408353"` is the specific ID (object) we are looking for. We got this from querying the GBIF endpoint above.
 
So, overall, the query says "select any entity that has a GBIF ID property of 5408353".
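
Because only the GBIF ID changes from one lookup to the next, the query string can be generated. This is a minimal sketch, assuming GBIF IDs are always plain digits. The function name `gbif_id_query` is hypothetical, and the digit check is there so that arbitrary text is never pasted into a query.

```python
def gbif_id_query(gbif_id):
    """Build a SPARQL query that finds the Wikidata item with this GBIF ID (P846)."""
    gbif_id = str(gbif_id)
    if not (gbif_id.isascii() and gbif_id.isdigit()):
        raise ValueError(f'Expected a numeric GBIF ID, got {gbif_id!r}')
    # Doubled braces {{ }} produce literal braces inside an f-string
    return f'SELECT ?item WHERE {{ ?item wdt:P846 "{gbif_id}" . }}'

query = gbif_id_query(5408353)
print(query)
```

The resulting string could be passed to `sparql.setQuery()` in place of the hand-written query.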

> You can read more about [Wikidata Identifiers](https://www.wikidata.org/wiki/Wikidata:Identifiers) like "P846" and [Wikidata prefixes](https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Basics_-_Understanding_Prefixes) like "wdt:".

Now let's try something a bit more sophisticated, by asking for some additional information available in Wikidata:

In [None]:
sparql.setQuery("""
SELECT ?item ?itemLabel ?itemDescription ?pic ?taxon ?rank WHERE {

  ?item wdt:P846 "5408353" .

  OPTIONAL{?item wdt:P18 ?pic .}
  OPTIONAL{?item wdt:P225 ?taxon .}
  OPTIONAL{?item wdt:P105 ?rank .}
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
""")

result = sparql.query().convert()
result
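
The converted result follows the standard SPARQL JSON layout, in which each row is a dictionary of variable names mapped to `{'type': ..., 'value': ...}` entries. The sketch below flattens that into simple dictionaries. It runs on a hand-written sample rather than live output, and the helper name `flatten_bindings` and the sample values are assumptions.

```python
def flatten_bindings(result):
    """Turn SPARQL JSON bindings into a list of {variable: value} dictionaries."""
    return [
        {variable: cell['value'] for variable, cell in row.items()}
        for row in result['results']['bindings']
    ]

# Hand-written sample in the SPARQL JSON results shape (values for illustration)
sample = {
    'head': {'vars': ['item', 'taxon']},
    'results': {'bindings': [
        {
            'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q3257667'},
            'taxon': {'type': 'literal', 'value': 'Lobelia urens'},
        },
    ]},
}

rows = flatten_bindings(sample)
print(rows[0]['taxon'])
```

Variables from an `OPTIONAL` clause that found no match are simply absent from a row, so `row.get('pic')` is safer than `row['pic']` when reading them.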

> If SPARQL takes your interest, and you'd like to learn more about linked open data, I can recommend the *Programming Historian*'s [Introduction to the Principles of Linked Open Data](https://programminghistorian.org/en/lessons/intro-to-linked-data#querying-rdf-with-sparql) and [Using SPARQL to access Linked Open Data](https://programminghistorian.org/en/lessons/retired/graph-databases-and-SPARQL) (this lesson has now been retired).

Finally, let's use wptools again to get all the data we might ever want about this plant.

In [None]:
# Quickly parse out the Wikidata unique ID
url = result['results']['bindings'][0]['item']['value']
id = url.rpartition('/')[-1]
id

In [None]:
page2 = wptools.page(wikibase=id)
page2.get_wikidata()
page2.data['wikibase']

The difference this time is that we looked up the Wikidata ID first, using the unique GBIF ID, so we know we will get the info from the correct entity.

---
---
## Enrich the Original Data
Let's take a moment to consider the journey we have travelled.

<a title="A view of the northern ascent of Catbells (facing south) in the Lake District near Keswick, Cumbria. Diliff, CC BY-SA 3.0 &lt;https://creativecommons.org/licenses/by-sa/3.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Catbells_Northern_Ascent,_Lake_District_-_June_2009.jpg"><img width="1024" alt="Catbells Northern Ascent, Lake District - June 2009" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Catbells_Northern_Ascent%2C_Lake_District_-_June_2009.jpg/1024px-Catbells_Northern_Ascent%2C_Lake_District_-_June_2009.jpg"></a>

* We started with blocks of *unstructured text* parsed from TEI XML documents.
* We ran the text through a *machine learning model* that predicted named entities within the text.
* We took a list of named entities and found *linked data* in various external sources of truth.

We could do many things with extra information like this:
* Add it to the catalogue records for the collection.
* Store it in a database to improve search and discovery.
* Display it on a website along with the original documents to give extra context.
* Create new markup with the linked data.

I'm sure you can think of more ideas!

### Add New XML Markup for Named Entities
To finish our exploration in code of this topic, I will show you a proof-of-concept for how we could add new TEI markup to an original Henslow Correspondence Project letter. I have had to make some simplifications in the example for the sake of brevity.

<a title="Lobelia urens. Hans Hillewaert at the English language Wikipedia, CC BY-SA 3.0 &lt;https://creativecommons.org/licenses/by-sa/3.0/&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Lobelia_urens_(spike).jpg"><img width="256" alt="Lobelia urens (spike)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Lobelia_urens_%28spike%29.jpg/256px-Lobelia_urens_%28spike%29.jpg"></a>
<p style="text-align: center; font-style: italic;">Heath lobelia close to Brigueuil, Charente, France</p>

First, we will go back to the beginning of our journey and get the original letter where the binomial "Lobelia urens" appears. We can search the XML for the named entity and wrap it in a new XML tag to mark its position.

In [None]:
from bs4 import BeautifulSoup

taxon = "Lobelia urens"

# Get transcription from original TEI
with open("data/henslow/letters_14.xml", encoding="utf-8") as file:
    xml = file.read()
    
    # Find the species name and wrap it in a new XML tag
    new_xml = xml.replace("Lobelia urens", "<name>Lobelia urens</name>")
    
    # Create soup from the new XML including the new tag
    letter = BeautifulSoup(new_xml, "lxml-xml")
    
transcription = letter.find(type='transcription')
transcription

Can you see where we have added the new tag wrapping the named entity?

Now we want to modify this markup with the linked data we collected earlier, as follows:

`<name type="taxon" ref="https://www.gbif.org/species/5408353 https://www.wikidata.org/entity/Q3257667">Lobelia urens</name>`

(Thanks to Huw Jones for supplying the correct TEI form to follow.)

We can create the new markup using BeautifulSoup:

In [None]:
# Data from the previous lookup steps
gbif_url = "https://www.gbif.org/species/5408353"
wikidata_url = "https://www.wikidata.org/entity/Q3257667"

# Create the new tag and contents
taxon_tag = letter.new_tag("name", type="taxon", ref=f'{gbif_url} {wikidata_url}')
taxon_tag.string = taxon
taxon_tag

And then place it into the XML:

In [None]:
# Find the taxonomic name in the transcription and replace it with the new tag
transcription_tag = letter.find(type='transcription')
transcription_tag.find("name").replace_with(taxon_tag)

# Print out the first paragraph of the transcription to check the new tag
print(transcription_tag.p.prettify())

Finally, we can save the new TEI document to file:

In [None]:
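from pathlib import Path

# Make sure the output folder exists, then write the file (and close it) in one step
output_file = Path('output/letters_14-taxon.xml')
output_file.parent.mkdir(parents=True, exist_ok=True)
letter_xml = letter.prettify()
output_file.write_text(letter_xml, encoding='utf-8')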

> **EXERCISE**: Review the modified TEI file [output/letters_14-taxon.xml](output/letters_14-taxon.xml) and inspect the newly added markup. You may need to download it and open it in Oxygen or another editor to see the markup.
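
The same markup could be produced for any taxon once its identifiers are known. As a sketch using only Python's standard library instead of BeautifulSoup, the hypothetical helper `make_taxon_element` below builds the `<name>` element from a GBIF ID and a Wikidata ID. The URL patterns are copied from the example above.

```python
import xml.etree.ElementTree as ET

def make_taxon_element(taxon, gbif_id, wikidata_id):
    """Build a TEI-style <name> element whose ref attribute holds both links."""
    refs = [
        f'https://www.gbif.org/species/{gbif_id}',
        f'https://www.wikidata.org/entity/{wikidata_id}',
    ]
    # Multiple references are separated by a space inside the one attribute
    element = ET.Element('name', {'type': 'taxon', 'ref': ' '.join(refs)})
    element.text = taxon
    return element

element = make_taxon_element('Lobelia urens', '5408353', 'Q3257667')
print(ET.tostring(element, encoding='unicode'))
```

Generating the element from identifiers, rather than typing the URLs by hand, keeps the markup consistent across a whole collection of letters.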

Of course Linked Open Data works both ways: once you have gone to the trouble of linking everything to its Wikidata ID, you may wish to [add your data to Wikidata](https://www.wikidata.org/wiki/Wikidata:Data_donation), but that is a big topic for another day.

---
---
## Predict Entity Links with Machine Learning
One final question may have occurred to you during the process of working through this notebook.

> If linking is done automatically, potentially without human intervention, how can we be sure the results are accurate?

It's likely in a real-world project you would need some form of human quality control, but an additional approach is to use machine learning to predict links.

There are potentially two ways of doing this:

1. Build your own **entity linker** with machine learning.

spaCy has the capability to [link named entities to identifiers stored in a knowledge base](https://spacy.io/api/entitylinker/). For anyone with a lot of computing power and time to hand, there's even some [example code](https://github.com/explosion/projects/tree/master/nel-wikipedia) to do this with Wikipedia and Wikidata data dumps.

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d9/EDSAC_2_1960.jpg" alt="EDSAC II, 10th May 1960, user queue. Copyright Computer Laboratory, University of Cambridge. Reproduced by permission. Creative Commons Attribution 2.0 UK: England & Wales." title="EDSAC II, 10th May 1960, user queue. Copyright Computer Laboratory, University of Cambridge. Reproduced by permission. Creative Commons Attribution 2.0 UK: England & Wales.">
<p style="text-align: center; font-style: italic;">The queue for computing time on the Cambridge EDSAC, 1960. To use High Performance Computing today, nothing has really changed, except the queue itself is now managed by software!</p>

2. Use someone else's pre-built entity linker if they have built something suitable for your use case.

You can check [spaCy Universe](https://spacy.io/universe) for resources developed with or for spaCy. One example of an entity linker:

* [Mordecai](https://github.com/openeventdata/mordecai) uses spaCy to extract place names, predict which [GeoNames](https://www.geonames.org/) entity is the best match, and return the linked geographic information.

<img src="https://raw.githubusercontent.com/openeventdata/mordecai/master/paper/mordecai_geoparsing.png" width="500">
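
Short of training a model, even a simple string-similarity score can help decide which automatic links to send for human review. The sketch below is not machine learning: it uses `difflib` from the standard library, and the threshold of 0.9, the function names and the sample pairs are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(query, candidate):
    """Return a similarity ratio between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

def needs_review(query, candidate, threshold=0.9):
    """Flag a proposed link for human checking when the names are too different."""
    return similarity(query, candidate) < threshold

# Hypothetical (entity, proposed match) pairs
pairs = [
    ('Lobelia urens', 'Lobelia urens'),
    ('Epilobium hirsutm', 'Epilobium hirsutum'),
    ('Bos', 'Boselaphus'),
]

for query, candidate in pairs:
    verdict = 'review' if needs_review(query, candidate) else 'accept'
    print(f'{query} -> {candidate}: {verdict}')
```

A score like this only compares spellings, so it cannot tell two different entities with the same name apart. That is the gap a trained entity linker, which also looks at the surrounding context, is meant to fill.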

---
---
## Summary

Covered in this notebook:

* **Linked data** can enrich the discovery of collections and allow sophisticated searches for the knowledge within those collections.
* **Linked open data** is linked data that is openly licensed for re-use.
* We can disambiguate a named entity by linking it to an **authority file** or other source of truth with a **unique identifier**.
* Authorities and open data aggregators can be accessed via **web APIs**, which allow rapid and automated query and retrieval of information.
* **Wikidata** is an openly licensed **knowledge base** containing a **knowledge graph** of **triples**. You can reach it through the MediaWiki API (as wptools does) or query it directly with the language **SPARQL**.
* After linking named entities to external identifiers it's possible to **enrich** the original data and, in return, donate data to knowledge bases like Wikidata.
* **Machine learning** can also be used when linking named entities automatically, to improve the likelihood of linking to the correct entity.

Congratulations! 🎉 

That's the end of this series of notebooks about named entity recognition. I hope you enjoyed your time working through them.