<a href="https://colab.research.google.com/github/nonoumasy/Find-Lat-and-Long-from-city-names/blob/master/xray.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project Goals:

Extract place entities (cities, locations, etc.) from any document (novel, news article, etc.) and plot them on a map.

User Flow:

- Enter a document (PDF, text file, DOC, etc.)
- Extract named entities, such as person names and locations, from the file using NLP
- For each location entity, find its latitude and longitude
- Plot those coordinates on a map

Authors:
Nono Umasy, 
Jason Hamada



In [0]:
import matplotlib.pyplot as plt
import pandas as pd 
import spacy
import folium
import os
import pickle
from spacy import displacy
from geopy.geocoders import Nominatim


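# Read the first 10,000 lines of the text, dropping each trailing newline.
with open('tolstoy.txt', 'rt') as f:
    body = [f.readline()[:-1] for _ in range(10000)]

# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en')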

In [0]:
body_entities = nlp(''.join(body))
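
Because each line is read with its trailing newline removed, joining the lines with an empty string runs the last word of one line into the first word of the next. That is likely why entities such as `AnnaPávlovna` and `1805CHAPTER` show up in the output. A minimal sketch of one alternative, joining on a single space, is given here; `join_lines` is a hypothetical helper, not part of the notebook.

```python
def join_lines(lines):
    """Join newline-stripped lines with single spaces, skipping empty lines."""
    return ' '.join(line.strip() for line in lines if line.strip())


# Hypothetical sample lines, not taken from tolstoy.txt.
sample = ['said Anna', 'Pávlovna', '', 'in St. Petersburg']
print(join_lines(sample))  # said Anna Pávlovna in St. Petersburg
```

With the notebook's variables this would be `nlp(join_lines(body))`. Skipping empty lines also drops paragraph breaks, which may or may not matter for entity recognition.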


In [0]:
for word in body_entities.ents:
    print(word.text, word.label_)

ONE CARDINAL
1805CHAPTER CARDINAL
I“Well ORG
Genoa ORG
Lucca PERSON
Antichrist PERSON
slave,’ PERSON
July, 1805 DATE
Anna PávlovnaSchérer PRODUCT
Márya Fëdorovna PERSON
Prince Vasíli Kurágin PERSON
first ORDINAL
AnnaPávlovna ORG
some days DATE
St. Petersburg GPE
French NORP
morning TIME
follows:“If GPE
an evening TIME
tonight TIME
French NORP
Anna Pávlovna PERSON
Anna Pávlovna PERSON
English LANGUAGE
Wednesday DATE
today DATE
one CARDINAL
Buonaparte ORG
Vasíli ORG
Anna Pávlovna Schérer PERSON
her forty years DATE
Anna Pávlovna PERSON
Austria GPE
Austria GPE
Russia GPE
Europe LOC
earth LOC
Whom GPE
England GPE
Malta PERSON
Novosíltsev PERSON
English LANGUAGE
Prussia PERSON
Buonaparte ORG
Europe LOC
Hardenburg ORG
Haugwitz ORG
Prussianneutrality EVENT
Kingof Prussia’s PERSON
two CARDINAL
tonight TIME
le Vicomte de Mortemart PERSON
Rohans NORP
one CARDINAL
one CARDINAL
Abbé Morio PERSON
Buttell PERSON
Funke PERSON
first ORDINAL
Vienna GPE
Vasíli PERSON
Márya Fëdorovna PERSON
Anna Pávlovna

In [0]:
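# Render the recognized entities inline in the notebook.
displacy.render(body_entities, style='ent', jupyter=True)

# Keep only geopolitical entities (GPE) and other locations (LOC).
loc_spans = [x for x in body_entities.ents if x.label_ in ['GPE', 'LOC']]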

In [0]:
loc_spans

[St. Petersburg,
 follows:“If,
 Austria,
 Austria,
 Russia,
 Europe,
 earth,
 Whom,
 England,
 Europe,
 Vienna,
 children?I,
 HerMajesty,
 Hélène,
 Petersburg,
 HerMajesty,
 inFrench,
 Moscow,
 Russia,
 Petersburg,
 Hélène,
 Hélène,
 Hélène.“Wait,
 Prince Hippolyte,
 Fetch,
 Paris,
 actress’,
 Europe,
 Russia,
 Europe,
 Mademoiselle George,
 Pierre,
 Pierre,
 Prince Vasíli’s,
 Petersburg,
 Prince Vasíli,
 Russia,
 toPetersburg,
 Guards,
 Drubetskáya,
 Guards,
 Moscow,
 promise.”“Do,
 atMilan,
 beware!“I,
 Russia,
 LouisXVII,
 France,
 continued.“I,
 Napoleonhas,
 Africa,
 Prince Hippolyte,
 Moscow,
 Russianas,
 Russia,
 story.“There,
 Moscow,
 Monsieur,
 laughing.“Do,
 monarch.”Hippolyte,
 Moscowhis,
 Petersburg,
 England,
 Austria,
 Pierre,
 Pierre,
 Dieu,
 china,
 Kurágins’,
 go!”“No,
 Hewas,
 Petersburg,
 Hercules,
 Pierre,
 Fifty,
 Thelad,
 him.“Don’t,
 Moscow,
 Guards,
 leftPetersburg,
 Radzivílov,
 St. Natalia’s,
 Nataly,
 Povarskáya,
 Moscow,
 china,
 Countess Apráksina,
 Peters
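
Many of the spans above are not places: some contain sentence punctuation or quotation marks (`follows:“If`), others are two words run together (`HerMajesty`, `toPetersburg`). A rough pre-filter before geocoding might look like the sketch below. `looks_like_place` and its rules are assumptions chosen for illustration; they will not catch person names such as `Pierre` that were tagged as places.

```python
import re

# Characters that rarely appear inside a real place name.
_BAD_CHARS = re.compile(r'[:;!?"“”‘’]')
# A lowercase letter directly followed by an uppercase one suggests glued words.
_GLUED = re.compile(r'[a-zà-ÿ][A-Z]')
# A period followed directly by a non-space marks a fused sentence boundary.
_FUSED_SENTENCE = re.compile(r'\.\S')


def looks_like_place(name):
    """Heuristic: True if the name shows no obvious extraction damage."""
    if not name or not name[0].isupper():
        return False
    if _BAD_CHARS.search(name) or _GLUED.search(name):
        return False
    if _FUSED_SENTENCE.search(name):
        return False
    return True


names = ['St. Petersburg', 'follows:“If', 'HerMajesty', 'Moscow', 'toPetersburg', 'china']
print([n for n in names if looks_like_place(n)])  # ['St. Petersburg', 'Moscow']
```

The rules are deliberately strict: they also reject possessives such as `Prince Vasíli’s`, which is acceptable here because those are not place names either.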

In [0]:
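# Nominatim requires a user agent that identifies your application.
geolocator = Nominatim(user_agent="specify_your_app_name_here")

def lat_lon(city):
    """Geocode a place name and return its name, latitude, and longitude."""
    loc = geolocator.geocode(city)
    if loc is None:
        # No match found; the caller catches this and skips the name.
        raise AttributeError(f'city not found -- {city}')
    return {'city': city,
            'lat': loc.latitude,
            'lon': loc.longitude}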

In [0]:
# Unique location names (GPE and LOC entities only), ready for geocoding
locations = list(set(x.text for x in body_entities.ents if x.label_ in ['GPE', 'LOC']))

In [0]:
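# Geocode every location name, skipping the ones Nominatim cannot resolve.
coordinates = []
for city in locations:
    try:
        coordinates.append(lat_lon(city))
    except AttributeError:
        pass  # name not found; leave it off the map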
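
Nominatim's public server asks clients to send at most one request per second, and the loop above issues requests back to back. One way to respect that limit is sketched below with an injectable geocoding function, so it can be tried without network access. `geocode_all`, `geocode_fn`, `delay`, and `fake_geocode` are hypothetical names; in the notebook `geocode_fn` would be `lat_lon`.

```python
import time


def geocode_all(names, geocode_fn, delay=1.0):
    """Geocode each unique name once, pausing `delay` seconds between calls.

    `geocode_fn` is assumed to return a dict for a known place and to raise
    AttributeError for an unknown one, like `lat_lon` in the notebook.
    """
    results = []
    for i, name in enumerate(sorted(set(names))):
        if i > 0:
            time.sleep(delay)  # throttle consecutive requests
        try:
            results.append(geocode_fn(name))
        except AttributeError:
            continue  # skip names the geocoder cannot resolve
    return results


# Stand-in geocoder with approximate, illustrative coordinates,
# used only to exercise the loop offline.
def fake_geocode(name):
    table = {'Moscow': (55.75, 37.62), 'Vienna': (48.21, 16.37)}
    if name not in table:
        raise AttributeError(f'city not found -- {name}')
    lat, lon = table[name]
    return {'city': name, 'lat': lat, 'lon': lon}


print(geocode_all(['Moscow', 'Whom', 'Vienna', 'Moscow'], fake_geocode, delay=0))
```

geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that may serve the same purpose with less code.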

In [0]:
loc_df = pd.DataFrame(coordinates)

In [0]:
# quickstart for Folium: https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/Quickstart.ipynb


m = folium.Map(
    tiles="OpenStreetMap",  # folium's default tile set; no API key needed
    location=[45, 37],
    zoom_start=4
)

# Add one red circle marker per geocoded location.
for _, row in loc_df.iterrows():
    folium.CircleMarker(location=[row["lat"], row["lon"]],
                        radius=10, color='red').add_to(m)

m.save('index.html')


m

We can use map marker clustering to show how often each location occurs.

![alt text](http://2008.kelvinluck.com/wp-content/uploads/2009/08/cluster_screenshot.png)
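
Clustering needs one marker per mention rather than one per unique name, so the counts have to be kept; the `set` used to build `locations` discards them. A minimal sketch of counting mentions with the standard library follows. `count_mentions` is a hypothetical helper and the sample names are illustrative. folium's `MarkerCluster` plugin (in `folium.plugins`) could then group the markers, though that step is not shown here.

```python
from collections import Counter


def count_mentions(names):
    """Return how many times each location name occurs."""
    return Counter(names)


# In the notebook this list would be [span.text for span in loc_spans].
mentions = ['Moscow', 'Russia', 'Moscow', 'Petersburg', 'Russia', 'Moscow']
counts = count_mentions(mentions)
print(counts.most_common(2))  # [('Moscow', 3), ('Russia', 2)]
```

The counts could also scale the `radius` of each `CircleMarker` as a simpler alternative to clustering.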