# Geocoding

Geocoding is the process of assigning a geographic reference (such as latitude and longitude coordinates) to a string in a text that has been recognized as a geographical entity. Often there are several candidate entities for the same string, and an important challenge for geocoding is finding the most likely candidate.

## Geonames

An important resource for geographical annotation is [Geonames](https://www.geonames.org/). Here, we demonstrate how to incorporate geonames in the spaCy annotation pipeline.

### geocoder library

To access geonames from a Python program, we will use the [geocoder](https://geocoder.readthedocs.io/providers/GeoNames.html) library, demonstrated below.

To use the API for **geonames**, you need to **register** as a user (i.e., create a [username](https://www.geonames.org/login)), and then go to your account page, where you can activate access to the API.

_The parameter 'username' needs to be passed with each request. The username for your application can be registered here. You will then receive an email with a confirmation link and after you have confirmed the email you can enable your account for the webservice on your account page_

Geocoder provides easy access to other geo-resources as well (such as Google Maps).



In [None]:
#run this the first time and/or when you are using google colab
#!pip install geocoder


import geocoder

# replace GosseBouma with your own username 
locations = geocoder.geonames("Dagstuhl",key='GosseBouma',maxRows=5)
# try Winsum, Groningen (why is Veendam etc in there?)
# Bar Ilan University, Dagstuhl, Wadern, Martinitoren, Schloss Dagstuhl

for g in locations :
    print(g.address,g.description, g.latlng)


### Exercise 1

* Try a few values for the string in the call to the geonames api to explore which locations and points of interest are and are not covered by geonames

## spaCy

To integrate geocoding with geonames in spaCy, we first need to load a spaCy model.

In [None]:
import spacy

# nlp = spacy.load('en_core_web_lg')

nlp = spacy.load('en_core_web_trf')


## Geocoding spaCy Named Entities using geonames

Next we define the linker function for geonames. 

Integration with spaCy is done by introducing an extension attribute on the spans found by the named entities component. See https://spacy.io/usage/rule-based-matching#entityruler for an example and discussion.

Only entities that spaCy labels as `GPE` (geopolitical entities such as countries, cities, and states) are linked using geonames. Note that we simply return the first match from geonames; other linking strategies (which ones?) might give more accurate results.


In [None]:
import geocoder

from spacy.tokens import Span

def geonames_linking(span):
    # Getter for the custom attribute: only GPE entities are looked up.
    # You can try to come up with more interesting geocoding strategies instead of picking the first solution.
    if span.label_ == "GPE":
        # replace GosseBouma with your own username
        g = geocoder.geonames(span.text, key='GosseBouma')
        if g:
            #print(g)
            return (g.latlng, g.description, g.country)
        else:
            return 'GPE_not_known'
    else:
        return "not_GPE"

# force=True lets you re-run this cell without an "extension already exists" error
Span.set_extension('geonames_link', getter=geonames_linking, force=True)
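
One possible alternative to taking the first match is to prefer the candidate that lies closest to a place the text is known to be about. The sketch below is only an illustration of that idea, not part of the geocoder library: `haversine_km` and `nearest_candidate` are hypothetical helper names, and the coordinates are approximate values typed in by hand rather than fetched from geonames.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius

def nearest_candidate(candidates, anchor):
    """Return the (name, (lat, lng)) candidate closest to anchor, or None if there are none."""
    return min(candidates, key=lambda c: haversine_km(c[1], anchor), default=None)

# Illustrative, approximate coordinates (assumption: not retrieved from geonames).
candidates = [
    ("Bergen, Norway", (60.39, 5.32)),
    ("Bergen, Netherlands", (52.67, 4.70)),
]
amsterdam = (52.37, 4.90)  # anchor: a place the article is assumed to be about

print(nearest_candidate(candidates, amsterdam)[0])  # → Bergen, Netherlands
```

In a real linker, the candidate list could be built from the `address` and `latlng` values of several geonames results (as in the `maxRows=5` query earlier), and the anchor could be an unambiguous location mentioned elsewhere in the same document.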



In [None]:
#text = '''At the UG’s Campus Fryslân faculty in Leeuwarden, Van Vulpen is conducting research into discontent in different regions of the Netherlands. The results of the recent provincial elections, in which the new Farmer-Citizen Movement (BBB) scored a thumping victory over the established parties, suggest that the government is the main object of that discontent. Van Vulpen: ‘The BBB used public disaffection to its advantage, profiling itself as a rural party and magnifying the differences between city and countryside.’
# Much of the support for the BBB came from the fringe regions outside the Randstad. In parts of the province of Overijssel, the BBB raked in almost 60% of the votes. Van Vulpen’s goal is to uncover the reasons underlying this regional discontent with the government and the established parties. Because of his expertise in this area, his opinion is often sought by the Dutch media.'''

text = '''Last August, Oleg Patsulya, a Russian citizen living near Miami, emailed a Russian airline that had been cut off from Western technology and materials with a tempting offer.
He could help circumvent the global sanctions imposed on Rossiya Airlines after Russia’s invasion of Ukraine by shuffling the aircraft parts and electronics that it so desperately needed through a network of companies based in Florida, Turkey and Russia.
'''

# the text must be defined before it is passed to the pipeline
doc = nlp(text)

for sent in doc.sents :
    print(sent.text)
for ent in doc.ents: 
    print(ent.text, ent._.geonames_link)
print()
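
Because `geonames_link` is defined with a getter, the lookup runs again every time the attribute is read, and a place name that occurs several times in a text is sent to the web service each time. A minimal sketch of one way to avoid repeated requests is to cache results by entity text with `functools.lru_cache`. The names `remote_lookup`, `cached_lookup` and `call_log` are hypothetical, and `remote_lookup` is a stand-in for the real `geocoder.geonames` call so that the example runs without network access.

```python
from functools import lru_cache

call_log = []  # records every simulated remote request

def remote_lookup(name):
    """Hypothetical stand-in for geocoder.geonames(name, key=...)."""
    call_log.append(name)
    return f"result for {name}"

@lru_cache(maxsize=1024)
def cached_lookup(name):
    # Only the first request for a given name reaches remote_lookup;
    # later requests for the same string are answered from the cache.
    return remote_lookup(name)

for mention in ["Florida", "Turkey", "Florida", "Russia", "Florida"]:
    cached_lookup(mention)

print(call_log)  # → ['Florida', 'Turkey', 'Russia']
```

In the linker, the getter would call the cached function with `span.text` instead of querying geonames directly. Note that failed lookups are cached too, so a temporary network error would be remembered until the cache is cleared with `cached_lookup.cache_clear()`.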

### Exercise 2

* Replace the text in the example above with another text containing one or more geographical locations. Are they recognized as entities? Are they located correctly and linked to the correct entity?

### Exercise 3
* Modify the geonames-entity-linking function so that it also reports the population size for each resolved location. See [geonames](https://geocoder.readthedocs.io/providers/GeoNames.html) for details.

### Exercise 4
* Modify the geonames-entity-linking function with the additional constraint that locations must be linked to a location that is situated in a given country. For example, for a newspaper article about the Netherlands, we might require *Bergen* to be resolved to the place in the Netherlands, not Norway. Check the geonames filters for details.
* Can you think of another example where such a constraint would help? If so, show that the linker gives different results with and without the constraint.