# Named Entity Classification

A crucial part of any geocoding task is identifying named entities in the input text and classifying them as person, geo-location, organisation, etc. Here we demonstrate how spaCy handles this task using its transformer model.

In [None]:
import spacy

# 1. Loading the model may produce some errors from TensorFlow; these can be ignored.
# 2. Google Colab only has en_core_web_sm pre-installed; run the command below to install other models.
# !python -m spacy download en_core_web_trf

import spacy_transformers  # registers the transformer components used by en_core_web_trf
nlp = spacy.load('en_core_web_trf')


### spaCy Named Entity Recognition

First, we test the NER component: which entities are recognized by a standard spaCy model (lg/trf)?

The lg model does a reasonable job of recognizing named entities, although it misses a few. The easiest way to improve coverage is to add manual patterns, as explained below. Another option is to try the transformer model (trf), which is in many ways more accurate than the lg model.

For NER, trf seems to be the preferred model, but remember that installation can be a bit tricky (it requires a proper installation of the spacy-transformers library).

We can also visualise results using the displacy library.

Wondering what all the labels mean? Try spacy.explain('LABEL').

In [None]:
# text = '''
# Prime minister Rutte and vice minister Vijlbrief visited Garmerwolde yesterday. 
# They came from The Hague to Groningen to talk about compensation for the damage done by drilling in Loppersum.
# '''

text = '''At the UG’s Campus Fryslân faculty in Leeuwarden, Van Vulpen is conducting research into discontent in different regions of the Netherlands. The results of the recent provincial elections, in which the new Farmer-Citizen Movement (BBB) scored a thumping victory over the established parties, suggest that the government is the main object of that discontent. Van Vulpen: ‘The BBB used public disaffection to its advantage, profiling itself as a rural party and magnifying the differences between city and countryside.’
Much of the support for the BBB came from the fringe regions outside the Randstad. In parts of the province of Overijssel, the BBB raked in almost 60% of the votes. Van Vulpen’s goal is to uncover the reasons underlying this regional discontent with the government and the established parties. Because of his expertise in this area, his opinion is often sought by the Dutch media.'''

doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
    for ent in sent.ents:
        print(ent.text, ent.label_)

spacy.explain('FAC')

In [None]:
from spacy import displacy 

# Outside a notebook, use: displacy.serve(doc, style="ent")
# Inside a Jupyter notebook, use this:

displacy.render(doc, style="ent", jupyter=True)

spacy.explain('NORP')

### Exercise 1

* What is the label used for geographical entities? (Use spacy.explain(LABEL) to see an explanation of a label.)
* Replace the **text** above with an alternative text containing names of geographical locations, and check whether spaCy identifies these correctly (i.e. whether the string is correct, and whether the class label is correct).
  * Are any non-geographical entities labeled as geographical?
  * Are all geographical locations labeled as geographical?
  
  

In [None]:
# You can use ent.label_ to filter named entities by their label/category

for ent in doc.ents:
    print(ent.text, ent.label_)

### Exercise 2

Modify the code above so that it only prints geographical named entities.

## Entity linking with the Wikidata entity finder API

Geocoding is the task of linking geographical names in a text to an actual location. Entity linking is the same task, but more general: it provides an unambiguous ID (such as a Wikipedia page) for names in a text.

For entity linking, you can use the __Wikidata API__ to find the corresponding entity IDs. ([Wikidata](https://www.wikidata.org/) is like a database with facts from Wikipedia.) Note that the API can return multiple results for a given text string. Here, we simply return the ID of the first result. Apart from the ID, the API also returns a label and a description for each match, which could perhaps be used for disambiguation (for example, by preferring entities with certain keywords in the description field, like *actor, film, ...*).
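
The keyword idea can be sketched without calling the API at all. The helper below is a hypothetical illustration, not part of spaCy or Wikidata: pick_candidate and its keywords argument are names invented here, and the sample hits are made-up placeholders that only mimic the shape of the search list (dictionaries with an id, a label and an optional description).

```python
def pick_candidate(results, keywords):
    """Return (id, description) of the first hit whose description mentions a keyword.

    Falls back to the first hit if no description matches,
    and to ('no_id_found', '') if the list of hits is empty.
    """
    if not results:
        return ('no_id_found', '')
    for hit in results:
        # not every hit has a description, so default to an empty string
        desc = hit.get('description', '').lower()
        if any(keyword.lower() in desc for keyword in keywords):
            return (hit['id'], hit.get('description', ''))
    first = results[0]
    return (first['id'], first.get('description', ''))


# Made-up hits: the ids and descriptions are placeholders, not real Wikidata entries
sample = [
    {'id': 'Q_EXAMPLE_1', 'label': 'Groningen', 'description': 'province of the Netherlands'},
    {'id': 'Q_EXAMPLE_2', 'label': 'Groningen', 'description': 'city in the Netherlands'},
    {'id': 'Q_EXAMPLE_3', 'label': 'Groningen'},
]

print(pick_candidate(sample, ['city', 'municipality']))  # second hit matches 'city'
print(pick_candidate(sample, ['film']))                  # no match, falls back to first hit
print(pick_candidate([], ['city']))                      # empty list of hits
```

Simple substring matching on the description is only one possible heuristic; it depends on choosing good keywords for the kind of entity you expect.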

Integration with spaCy is done by introducing an extension attribute on the spans found by NER. See https://spacy.io/usage/rule-based-matching#entityruler for an example and discussion.
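
To see how an extension attribute with a getter behaves, here is a minimal sketch that needs neither a trained model nor network access. It assumes spaCy is installed; the attribute name shout, the getter and the variable names are invented for this illustration only.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes, so no model download is needed
blank_nlp = spacy.blank('en')
demo_doc = blank_nlp('Leeuwarden is the capital of Friesland')

calls = []

def shout(span):
    # hypothetical getter: records each call, then returns the upper-cased span text
    calls.append(span.text)
    return span.text.upper()

# 'shout' is an illustrative attribute name, not something spaCy defines
# force=True allows re-running the cell without an "already registered" error
Span.set_extension('shout', getter=shout, force=True)

demo_span = demo_doc[0:1]   # the span covering the first token
print(demo_span._.shout)
print(demo_span._.shout)
print(len(calls))           # the getter ran once per access
```

Because a getter is evaluated on every access, an attribute whose getter queries a web API sends a new request each time it is read.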


In [None]:
import requests

from spacy.tokens import Span

def wikidata_entity_link(span):
    """Return (id, description) of the first Wikidata search hit for span.text."""
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action': 'wbsearchentities',
              'language': 'en',
              'format': 'json',
              'search': span.text}
    # fall back to an empty list if the response has no 'search' field
    results = requests.get(url, params=params).json().get('search', [])
    # this part can be replaced by fancier disambiguation methods,
    # or by returning a list of ids from all search results
    if not results:
        return ('no_id_found', '')
    first = results[0]
    # not every Wikidata entity has a description
    return (first['id'], first.get('description', ''))

# the getter is called whenever ent._.wikidata_id is accessed;
# force=True allows re-running this cell without an error
Span.set_extension('wikidata_id', getter=wikidata_entity_link, force=True)



### Exercise 3

Execute the code above to include the Wikidata entity linker in the spaCy entity pipeline. Now analyse the example text again and see what it provides as unambiguous IDs for the names in the text.

* You will notice that not all results are correct.
* Try some other texts containing names and see whether this behaviour also occurs there.

In [None]:
# uncomment and modify the lines below to try another text

#text = ''' “Although the final results are not in yet, we are leading by far,” 
# Mr. Erdogan told supporters gathered outside his party’s headquarters in Ankara, the capital.
# '''

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent._.wikidata_id)
    print()

### Exercise 4 (optional)

The Wikidata entity linker uses the text of a named entity recognized by spaCy to look up entities in Wikidata, and simply returns the first hit. For BBB, for instance, this gives a wrong answer, even though the correct entity is present in Wikidata as well.

* Can you think of a method to improve the performance of our simple entity linker?


