In [1]:
# install all requirements quietly
!pip install -q -r requirements.txt

# Sample NER Workflow for FromThePage

Read data from a FromThePage (FTP) XML export and pass it through the spaCy NER pipeline.

In [2]:
import spacy
import pandas as pd
import utils

In [3]:
# download the spacy models we need
model = 'en_core_web_md'
spacy.cli.download(model)
nlp = spacy.load(model)


    Linking successful
    /opt/conda/lib/python3.6/site-packages/en_core_web_md -->
    /opt/conda/lib/python3.6/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')



We first read the data from the tei.xml file exported from FromThePage.

In [4]:
texts = utils.read_ftp_xml('data/tei.xml')
texts.head()

Unnamed: 0,text
0,1855
1,[January 17 Wednesday 1855]
2,[Thursday 18 January 1855]
3,[January 19 Friday 1855]
4,"My gentle lunatic Sang, howled, blasphemed an..."
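
The `utils.read_ftp_xml` helper is not shown in this notebook. As a rough sketch of what such a reader might do, the function below pulls the text of each paragraph out of a TEI document using only the standard library. The function name `read_tei_paragraphs`, the choice of `<p>` elements and the sample XML are all assumptions made for illustration; the real helper may select different elements.

```python
import xml.etree.ElementTree as ET

# TEI documents declare this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_tei_paragraphs(xml_string):
    """Return the text of every non-empty <p> element, in document order.

    Hypothetical stand-in for utils.read_ftp_xml.
    """
    root = ET.fromstring(xml_string)
    texts = []
    for p in root.iterfind(".//tei:p", TEI_NS):
        # itertext() also collects text nested inside child elements;
        # split/join collapses runs of whitespace
        text = " ".join("".join(p.itertext()).split())
        if text:
            texts.append(text)
    return texts


# Invented sample input, not taken from the real export.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>[January 17 Wednesday 1855]</p>
    <p>Letter from my sister at <placeName>Geelong</placeName>.</p>
    <p>   </p>
  </body></text>
</TEI>"""

paragraphs = read_tei_paragraphs(sample)
print(paragraphs)
```

Wrapping the resulting list in `pd.DataFrame({'text': paragraphs})` would give a table with a single `text` column like the one above.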


## NER

We now perform NER on the text using the spaCy library.  We generate a list of location entities and, for each entity, record a snippet of the text around the occurrence.  The result is a DataFrame containing the place name, the context and the document number, which is really the row number in the original spreadsheet.

In [5]:
places = []

for i, t in texts.iterrows():
    doc = nlp(t['text'])
    for ent in doc.ents:
        # GPE is spaCy's label for countries, cities and states
        if ent.label_ == "GPE":
            # take up to four tokens either side, clamping the start at zero
            context = doc[max(ent.start - 4, 0):ent.end + 4]
            context = " ".join(w.text for w in context)
            d = {'placename': ent.text, 'context': context, 'doc': i}
            places.append(d)
locations = pd.DataFrame(places)
locations

Unnamed: 0,context,doc,placename
0,"My gentle lunatic Sang , howled , blasphemed",4,Sang
1,"Poole an actress from Jersey , nice sensible w...",8,Jersey
2,from my sister at Geelong . Not over pleased,18,Geelong
3,dine with Joe at St Kilda tomorrow it being his,35,St Kilda
4,) for him to St Kilda . Mr McNee,41,St Kilda
5,so I went to St Kilda . Mr & Mrs,42,St Kilda
6,"Lights ahead steered for Melbourne , leaving P...",42,Melbourne
7,"Dined with Newby at Richmond , and then went",54,Richmond
8,but the closeness of Atmosphere and startling ...,60,Atmosphere
9,well . Invited to St Kilda but refused weather...,68,St Kilda
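
The context snippet is a window of four tokens on either side of the entity. The sketch below shows that windowing logic on a plain list of strings, so it can be tried without loading a model. The helper name `context_window` is an illustration only and is not part of spaCy or `utils`.

```python
def context_window(tokens, start, end, width=4):
    """Join the tokens from `width` before `start` to `width` after `end`.

    `start` and `end` follow the slice convention: `end` is exclusive.
    """
    lo = max(start - width, 0)          # clamp so the index cannot go negative
    hi = min(end + width, len(tokens))  # clamp to the end of the token list
    return " ".join(tokens[lo:hi])


tokens = "My gentle lunatic Sang , howled , blasphemed".split()

# "Sang" is token 3, so the window would start at -1 without clamping;
# on a plain list a negative slice start counts from the end instead.
print(context_window(tokens, 3, 4))
```

Clamping the lower bound matters for entities that occur within the first four tokens of a text, as in the first row of the table above.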


## Visualisation

spaCy can also be used to visualise the NER results in the notebook.  This might not be too useful, but it illustrates what is possible.

In [6]:
from spacy import displacy
from IPython.display import display, HTML

doc = nlp(texts['text'][4])
display(HTML(displacy.render(doc, style='ent')))

## Geocoding

We can use the `geocoder` module to submit these place names to a geocoding service.  Here we use the GeoNames service and make a new table with the results.

In [7]:
locations = utils.geolocate_locations(locations)
locations

Unnamed: 0,context,doc,placename,address,country,lat,lng
0,"My gentle lunatic Sang , howled , blasphemed",4,Sang,Sang Pur,Australia,-34.33884,118.58149
1,"Poole an actress from Jersey , nice sensible w...",8,Jersey,Jersey Park,Australia,-33.9185,150.8835
2,from my sister at Geelong . Not over pleased,18,Geelong,Geelong,Australia,-38.14711,144.36069
3,dine with Joe at St Kilda tomorrow it being his,35,St Kilda,St Kilda,Australia,-37.8676,144.98099
4,) for him to St Kilda . Mr McNee,41,St Kilda,St Kilda,Australia,-37.8676,144.98099
5,so I went to St Kilda . Mr & Mrs,42,St Kilda,St Kilda,Australia,-37.8676,144.98099
6,"Lights ahead steered for Melbourne , leaving P...",42,Melbourne,Melbourne,Australia,-37.814,144.96332
7,"Dined with Newby at Richmond , and then went",54,Richmond,Richmond,Australia,-20.56967,142.91384
8,but the closeness of Atmosphere and startling ...,60,Atmosphere,Atmosphere Kanifushi Maldives,Maldives,5.36435,73.3345
9,well . Invited to St Kilda but refused weather...,68,St Kilda,St Kilda,Australia,-37.8676,144.98099
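
Several place names recur in the table (St Kilda appears five times), so one possible refinement is to query the service once per distinct name and reuse the answer. The sketch below shows the idea with a stand-in `fake_lookup` function rather than a real geocoding call. Both `geocode_unique` and `fake_lookup` are hypothetical names, and nothing here reflects how `utils.geolocate_locations` is actually implemented.

```python
def geocode_unique(placenames, lookup):
    """Call `lookup` once per distinct place name and reuse the result."""
    cache = {}
    results = []
    for name in placenames:
        if name not in cache:
            cache[name] = lookup(name)
        results.append(cache[name])
    return results


calls = []


def fake_lookup(name):
    # Hypothetical stand-in for a geocoding request: it records each call
    # and returns a placeholder instead of real coordinates.
    calls.append(name)
    return {"query": name}


names = ["Geelong", "St Kilda", "St Kilda", "St Kilda", "Melbourne"]
results = geocode_unique(names, fake_lookup)

# one result per input row, but only one lookup per distinct name
print(len(results), len(calls))
```

In practice `lookup` would wrap whichever geocoding call is in use, which keeps the number of requests to the service proportional to the number of distinct names rather than the number of rows.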
