Project Goals:

Extract Place Entities (Cities, Locations, etc) from any document(novel, news articles, etc) and plot them on a Map.

User Flow:

- Enter document(pdf, text file, doc, etc) 
- Extract Locations
- Find Long and Lat values for location entities extracted 
- Create WordCloud of Locations
- Plot those coordinates on a map

Authors:
Nono Umasy, 
Jason Hamada



In [6]:
import matplotlib.pyplot as plt
import pandas as pd 
import spacy
import folium
import os
import pickle
from spacy import displacy
from geopy.geocoders import Nominatim

In [33]:
geo_cities = "/Users/nonoumasy/Downloads/data/world-cities.csv"
geo_df = pd.read_csv(geo_cities)
geo_df = geo_df[['name','country']]

In [38]:
geo_df.shape

(23018, 2)

In [2]:
#get data
with open('/Users/nonoumasy/Downloads/data/tolstoy.txt', 'rt') as f:
    body = [f.readline()[:-1] for _ in range(1000)]
    nlp = spacy.load('en')

body_entities = nlp(''.join(body))
for word in body_entities.ents:
    print(word.text, word.label_)
    

ONE CARDINAL
1805CHAPTER CARDINAL
I“Well ORG
Genoa ORG
Lucca PERSON
Antichrist PERSON
slave,’ PERSON
July, 1805 DATE
Anna PávlovnaSchérer PRODUCT
Márya Fëdorovna PERSON
Prince Vasíli Kurágin PERSON
first ORDINAL
AnnaPávlovna ORG
some days DATE
St. Petersburg GPE
French NORP
morning TIME
follows:“If GPE
an evening TIME
tonight TIME
French NORP
Anna Pávlovna PERSON
Anna Pávlovna PERSON
English LANGUAGE
Wednesday DATE
today DATE
one CARDINAL
Buonaparte ORG
Vasíli ORG
Anna Pávlovna Schérer PERSON
her forty years DATE
Anna Pávlovna PERSON
Austria GPE
Austria GPE
Russia GPE
Europe LOC
earth LOC
Whom GPE
England GPE
Malta PERSON
Novosíltsev PERSON
English LANGUAGE
Prussia PERSON
Buonaparte ORG
Europe LOC
Hardenburg ORG
Haugwitz ORG
Prussianneutrality EVENT
Kingof Prussia’s PERSON
two CARDINAL
tonight TIME
le Vicomte de Mortemart PERSON
Rohans NORP
one CARDINAL
one CARDINAL
Abbé Morio PERSON
Buttell PERSON
Funke PERSON
first ORDINAL
Vienna GPE
Vasíli PERSON
Márya Fëdorovna PERSON
Anna Pávlovna

In [3]:
#get lontitude and latitude for cities

displacy.render(body_entities,style='ent',jupyter=True)

loc_spans = [x for x in body_entities.ents if x.label_ in ['GPE', 'LOC']]

geolocator = Nominatim(user_agent="specify_your_app_name_here")
def lat_lon(city):
    loc = geolocator.geocode(city)
    if loc is None:
        raise AttributeError(f'city not found -- {city}')
    return {'city': city, 
            'lat': loc.latitude,
            'lon': loc.longitude}

# just get location names
locations = list(set(x.text for x in body_entities.ents if x.label_ in ['GPE', 'LOC']))

coordinates = []
for city in locations:
    try:
        coordinates.append(lat_lon(city))
    except AttributeError:
        pass
    
loc_df = pd.DataFrame(coordinates)

In [4]:
# plot cities on map

m = folium.Map(
    tiles = "Stamen Toner",
    location=[45, 37],
    zoom_start=4
)



loc_df.apply(lambda row:folium.CircleMarker(location=[row["lat"], row["lon"]], 
    radius=10, color='red')
    .add_to(m), axis=1)

m.save('index.html')


m

## TODO

  * measure frequency of each location/character
  * display locations of each location/character
  * activate map cluster
  * tooltip/popup for each location
  * parameterize for different source text
  * improve NER
  * dash for website

In [0]:
#make wordcloud of cities