# Topics in Geographic Text Analysis


Taught by Dr. Katie McDonough and Scott Bailey   
- kmcdono2@stanford.edu    
- scottbailey@stanford.edu

What are we covering in this workshop?

We will...

- Look under the hood of geoparsing to review its Natural Language Processing components
- Test the [Mordecai geoparser](https://github.com/openeventdata/mordecai), a Python library that combines the spaCy NLP model with georesolution
- Evaluate Mordecai results with NLP models for other languages
- Map geoparser results
- Review the advantages and disadvantages of using Geonames as a gazetteer

### Introduction

Geoparsing: the process of identifying place names in free text and resolving them to a location *on the earth*.

Geoparsers are software packages that help you to do this, usually in the following steps:
- input text file
- separate text into tokens
- tag parts of speech
- detect and label named entities 
- resolve a place name to a place record in a gazetteer
- use metadata in the gazetteer record (lat/long) to locate a place
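
To make these steps concrete before we bring in real tools, here is a toy sketch in plain Python. It is not how Mordecai works: the two-entry gazetteer, its rounded coordinates, and the function names `tokenize` and `toy_geoparse` are all made up for illustration, and an exact dictionary lookup stands in for the part-of-speech tagging, named entity recognition, and disambiguation that a real geoparser performs.

```python
import re

# Hypothetical two-entry gazetteer: place name -> (latitude, longitude).
# Coordinates are rounded values chosen for illustration only.
TOY_GAZETTEER = {
    "Oxford": (51.75, -1.26),
    "Ottawa": (45.41, -75.70),
}


def tokenize(text):
    """Split text into alphabetic word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def toy_geoparse(text, gazetteer=TOY_GAZETTEER):
    """Return one record per token that exactly matches a gazetteer entry."""
    results = []
    for token in tokenize(text):
        if token in gazetteer:
            lat, lon = gazetteer[token]
            results.append({"word": token, "lat": lat, "lon": lon})
    return results


places = toy_geoparse("I traveled from Oxford to Ottawa.")
print(places)
```

An exact lookup like this cannot tell Oxford, England from Oxford, Mississippi, and it misses any place that is not already in the list. Those two gaps are what the NER and georesolution steps are meant to address.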

### Getting Started

- final installs
- connect to [Geonames gazetteer](http://www.geonames.org/)


In [None]:
import spacy

In [None]:
!pip install mordecai

In [None]:
!python -m spacy download en_core_web_lg

### Let's talk about gazetteers

In [None]:
!docker pull elasticsearch:5.5.2
!wget https://s3.amazonaws.com/ahalterman-geo/geonames_index.tar.gz --output-file=wget_log.txt
!tar -xzf geonames_index.tar.gz
!docker run -d -p 127.0.0.1:9200:9200 -v $(pwd)/geonames_index/:/usr/share/elasticsearch/data elasticsearch:5.5.2

In [None]:
from mordecai import Geoparser

In [None]:
geo = Geoparser()

### Load Language Model from spaCy

English model used here.

In [None]:
nlp = spacy.load("en_core_web_lg")

### Import text

In [None]:
import requests

def get_text(url):
    return requests.get(url).text

def get_book(url):
    page = get_text(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])

In [None]:
crusoe_url = "https://raw.githubusercontent.com/kmcdono2/mordecai_workshop/master/crusoe_eng_1719.txt"
crusoe_book = get_book(crusoe_url)
crusoe_book

In [None]:
doc = nlp(crusoe_book)

In [None]:
### To insert text directly use this

#doc = nlp("I traveled from Oxford to Ottawa")
#doc = nlp ("SECKAW, ou Seckow, bourg d'Allemagne, dans la haute Stirie, sur une petite riviere nommée Gayl, à 3 lieues au nord de Iudenburg. Cette place a été érigée en évêché en 1219 par le pape Honoré III. C'est l'archevêque de Saltzbourg qui en a le droit de présentation et d'investiture; delà  vient que l'évêque de Seckaw n'a point d'entrée dans les dietes. Long. 32. 52. lat. 47. 17. (D. J.)")

### Tokenize the text

In [None]:
# word level
for token in doc:
    print(token.text)

In [None]:
# sentence level
for sent in doc.sents:
    print(sent)

### POS Tagging

In [None]:
for token in doc:
    # pos_ is the readable tag (e.g. NOUN); pos without the underscore is an integer ID
    print(token.text, token.pos_)

In [None]:
# visualize the sentence parts of speech

from spacy import displacy

In [None]:
tenth_sent = list(doc.sents)[11]
tenth_sent


# you can also slice the doc by token position instead of working sentence by sentence
# e.g. doc[3:6] extracts the tokens at positions 3, 4 and 5

In [None]:
single_doc = nlp(str(tenth_sent))
options = {"compact": True, 'bg': '#09a3d5',
          'color': 'white', 'font': 'Source Sans Pro'}
displacy.render(single_doc, style="dep", jupyter=True, options=options)

### Named Entity Recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
# entities from tokens

for token in doc:
    if token.ent_type_ != '':
        print(token.text, token.ent_type_, "----------", spacy.explain(token.ent_type_))
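
For geoparsing we mostly care about the place-like labels. The sketch below filters and counts them in plain Python. The `sample_ents` list is hand-written for illustration and is not real model output; in the notebook you would build it as `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `count_places` is also made up here.

```python
from collections import Counter

# Labels the English spaCy models use for place-like entities:
# GPE (countries, cities, states), LOC (non-GPE locations), FAC (facilities).
PLACE_LABELS = {"GPE", "LOC", "FAC"}

# Hypothetical stand-in for [(ent.text, ent.label_) for ent in doc.ents].
sample_ents = [
    ("York", "GPE"),
    ("Bremen", "GPE"),
    ("Hull", "GPE"),
    ("Robinson", "PERSON"),
    ("York", "GPE"),
    ("Africa", "LOC"),
]


def count_places(ents, labels=PLACE_LABELS):
    """Count how often each place-like entity text occurs."""
    return Counter(text for text, label in ents if label in labels)


place_counts = count_places(sample_ents)
print(place_counts.most_common())
```

Counting before geoparsing gives a quick sense of which names dominate a text and which labels the model may be getting wrong.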

In [None]:
# visualize the entities
# https://spacy.io/usage/visualizers

displacy.render(single_doc, style="ent", jupyter=True)

In [None]:
next_sent = list(doc.sents)[3]
next_doc = nlp(str(next_sent))
displacy.render(next_doc, style="ent", jupyter=True)

### Geoparsing with mordecai

In [None]:
# mordecai documentation
# https://github.com/openeventdata/mordecai

geo.geoparse(doc)

In [None]:
# create dataframe

with open('test.txt', 'r') as f:
    text = f.read()
    res = geo.geoparse(text)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data=res)

In [None]:
df

In [None]:
# export your data to csv

df.to_csv('test.csv')

### Map geoparser output

In [None]:
geo_data = df["geo"]
geo_data

In [None]:
lats = []
longs = []

In [None]:
for pt in geo_data:
    lats.append(pt["lat"])
    longs.append(pt["lon"])
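
The loop above assumes every result has a `geo` entry. A slightly more defensive version is sketched below in plain Python, under two assumptions about the geoparser output: coordinates arrive as strings, and a place that could not be resolved has no `geo` key. The `sample_results` list and the helper name `extract_coords` are made up for illustration.

```python
def extract_coords(results):
    """Return (lon, lat) float pairs, or None where no location was resolved."""
    coords = []
    for record in results:
        geo = record.get("geo")
        if isinstance(geo, dict) and "lat" in geo and "lon" in geo:
            # Cast the string values to floats; x = longitude, y = latitude.
            coords.append((float(geo["lon"]), float(geo["lat"])))
        else:
            # Keep a placeholder so positions still line up with the input.
            coords.append(None)
    return coords


# Hypothetical records shaped like geoparser results.
sample_results = [
    {"word": "Oxford", "geo": {"lat": "51.75222", "lon": "-1.25596"}},
    {"word": "Atlantis"},  # unresolved: no "geo" key
]

coords = extract_coords(sample_results)
print(coords)
```

Returning pairs in (longitude, latitude) order matches the (x, y) order that point geometries expect, and the `None` placeholders make unresolved names easy to drop before mapping.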

In [None]:
lats

In [None]:
longs

In [None]:
df["lats"] = lats
df["longs"] = longs

In [None]:
df

In [None]:
for item in df['lats']:
    print(type(item))

In [None]:
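# define point geometry: shapely expects (x, y) = (longitude, latitude)
from shapely.geometry import Point

geometry = [Point(float(lon), float(lat)) for lon, lat in zip(df['longs'], df['lats'])]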

In [None]:
for item in df['lats']:
    print(type(item))

In [None]:
df['lats'] = pd.to_numeric(df['lats'])

In [None]:
df['longs'] = pd.to_numeric(df['longs'])

In [None]:
import geopandas as gpd

crs = "EPSG:4326"  # WGS 84 latitude/longitude
gdf = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)

In [None]:
gdf

In [None]:
# note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an earlier version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

In [None]:
world = world.set_crs("EPSG:4326", allow_override=True)

In [None]:
gdf = gdf.to_crs(world.crs)

In [None]:
# plot to base map

base = world.plot(color='white', edgecolor='black')
gdf.plot(ax=base, marker='*', color='green', markersize=20)