# Mapping Kate Chopin's The Awakening

In this essay, we will use a natural language processing method called *Named Entity Recognition* (NER) to identify places in Kate Chopin's novel *The Awakening*. Then we will geocode these locations and map them.

---

## Dataset

### Kate Chopin's *The Awakening*

```{epigraph}
Robert spoke of his intention to go to Mexico in the autumn, where fortune awaited him. He was always intending to go to Mexico, but some way never got there. Meanwhile he held on to his modest position in a mercantile house in
New Orleans, where an equal familiarity with English, French and Spanish gave him no small value as a clerk and correspondent.

-- Kate Chopin, *The Awakening*
```

## Named Entity Recognition

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We also import the `Counter` class (from Python's `collections` module) for counting places and the `pandas` library for organizing and displaying data. The last two lines raise the pandas default display limits for the number of rows and the column width.

### Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [None]:
!python -m spacy download en_core_web_sm

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. When this was written, languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean and Vietnamese did not have their own trained models, so check the linked page for the current list. spaCy also offers language and tokenization support for many of these languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [2]:
import en_core_web_sm
nlp = en_core_web_sm.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('en_core_web_sm')

If you just downloaded the model for the first time, it's advisable to use Option 1, which lets you use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).

## Process Document

In the cell below, we open and read *The Awakening*. Then we run the loaded NLP model on the text and store the result as `document`. Most of the heavy NLP lifting is done in that single line of code, `nlp(text)`.

In [3]:
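filepath = "../../data/The-Awakening-Kate-Chopin.txt"
# Read the whole novel into one string; the with-block closes the file for us
with open(filepath, encoding='utf-8') as file:
    text = file.read()
# Run the spaCy pipeline over the text; the result holds the named entities
document = nlp(text)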

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style` parameter set to `"ent"` (short for entities):

In [4]:
displacy.render(document, style="ent")

### Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

To extract and count places, we loop through the named entities in `document.ents` and use an `if` statement to keep only the entities whose label is "GPE" or "LOC". These are the type labels for "countries, cities, states" and "non-GPE locations, mountain ranges, bodies of water." We then tally the place names with `Counter` and put the counts in a pandas dataframe.

In [13]:
places = []

for named_entity in document.ents:
    if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
        places.append(named_entity.text)

places_tally = Counter(places)

places_df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
places_df

| |place|count|
|:---:|:---:|:---:|
|0|Mexico|19|
|1|New Orleans|11|
|2|Gulf|8|
|3|earth|8|
|4|Kentucky|7|
|5|the United States|7|
|6|Grand Isle|6|
|7|New York|6|
|8|Valmonde|6|
|9|Brantain|6|
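The filter in the loop above can also be written as a set-membership test, which makes it easier to add or remove labels later. The sketch below is only an illustration: `tally_places` and `PLACE_LABELS` are hypothetical names, and the `SimpleNamespace` objects are hand-written stand-ins for spaCy entities so the example runs without a language model. Anything with `.text` and `.label_` attributes, including the real `document.ents`, should work the same way.

```python
from collections import Counter
from types import SimpleNamespace

# Labels we treat as places (hypothetical constant name)
PLACE_LABELS = {"GPE", "LOC"}

def tally_places(entities, labels=PLACE_LABELS):
    """Count the text of every entity whose label is in `labels`."""
    return Counter(ent.text for ent in entities if ent.label_ in labels)

# Hand-written stand-ins for spaCy entities (not real model output)
sample_entities = [
    SimpleNamespace(text="Mexico", label_="GPE"),
    SimpleNamespace(text="Robert", label_="PERSON"),
    SimpleNamespace(text="Gulf", label_="LOC"),
    SimpleNamespace(text="Mexico", label_="GPE"),
]

sample_tally = tally_places(sample_entities)
print(sample_tally.most_common())
```

With the real document, `tally_places(document.ents)` would play the role of `places_tally`.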


## Geocoding

First, we're going to geocode data — aka get coordinates from addresses or place names — with the Python package [GeoPy](https://geopy.readthedocs.io/en/stable/#). GeoPy makes it easier to use a range of third-party [geocoding API services](https://geopy.readthedocs.io/en/stable/#), such as Google, Bing, ArcGIS, and OpenStreetMap.

Most of these services require an API key. Nominatim, which uses OpenStreetMap data, does not, so we're going to use it here.

### Install GeoPy

In [None]:
!pip install geopy

### Import Nominatim

From GeoPy's list of possible geocoding services, we're going to import Nominatim:

In [9]:
from geopy.geocoders import Nominatim

### Nominatim & OpenStreetMap

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Openstreetmap_logo.svg/256px-Openstreetmap_logo.svg.png" border=2 >

Nominatim (which means "by name" in Latin) uses [OpenStreetMap data](https://www.openstreetmap.org/relation/174979) to match addresses with geographic coordinates. Though we don't need an API key to use Nominatim, we do need to create a unique [application name](https://operations.osmfoundation.org/policies/nominatim/).

Here we're initializing Nominatim as a variable called `geolocator`. Change the application name below to your own application name:

In [16]:
geolocator = Nominatim(user_agent="YOUR NAME's mapping app", timeout=2)

To geocode an address or location, we simply call the `.geocode()` method:

In [67]:
location = geolocator.geocode("New Orleans")

In [68]:
location

Location(New Orleans, Orleans Parish, Louisiana, United States of America, (29.9499323, -90.0701156, 0.0))

## Geocode with Pandas

To geocode every location in our dataframe, we can make a Python function and `.apply()` it to every row of `places_df`.

Here we make a function with `geolocator.geocode()` and ask it to return the address, lat/lon, and importance score:

In [25]:
def find_location(row):
    # Look up the place name from this row with Nominatim
    place = row['place']
    location = geolocator.geocode(place)

    # .geocode() returns None when the service finds no match
    if location is not None:
        return location.address, location.latitude, location.longitude, location.raw['importance']
    else:
        return "Not Found", "Not Found", "Not Found", "Not Found"
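Because the function above sends one request per row, it is worth keeping Nominatim's usage policy in mind: the public service asks for no more than one request per second and encourages caching results. The sketch below shows one possible way to do both. `make_polite_geocoder` is a hypothetical helper, not part of GeoPy, and `fake_geocode` is a stand-in so the example runs without a network connection.

```python
import time

def make_polite_geocoder(geocode_fn, min_delay_seconds=1.0):
    """Wrap `geocode_fn` with a cache and a minimum delay between real requests."""
    cache = {}
    last_request_time = None

    def polite_geocode(place):
        nonlocal last_request_time
        # Repeated place names are answered from the cache, with no request
        if place in cache:
            return cache[place]
        # Wait until at least `min_delay_seconds` have passed since the last request
        if last_request_time is not None:
            elapsed = time.monotonic() - last_request_time
            if elapsed < min_delay_seconds:
                time.sleep(min_delay_seconds - elapsed)
        result = geocode_fn(place)
        last_request_time = time.monotonic()
        cache[place] = result
        return result

    return polite_geocode

# Stand-in for geolocator.geocode that records which places were requested
calls = []

def fake_geocode(place):
    calls.append(place)
    return f"coordinates for {place}"

# A delay of 0 keeps the demonstration fast; the real service would want 1 second
polite_geocode = make_polite_geocoder(fake_geocode, min_delay_seconds=0)
for place in ["Mexico", "New Orleans", "Mexico"]:
    polite_geocode(place)
print(calls)
```

With the real service, `make_polite_geocoder(geolocator.geocode)` could stand in for `geolocator.geocode` inside `find_location`. The place names in `places_df` are already unique, so the cache mostly helps when a cell is rerun. GeoPy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that handles the delay part.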

Now let's `.apply()` our function to this Pandas dataframe and see what results Nominatim's geocoding service spits out.

In [26]:
places_df[['address', 'lat', 'lon', 'importance']] = places_df.apply(find_location, axis="columns", result_type="expand")
places_df

| |place|count|address|lat|lon|importance|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Mexico|19|México|22.5|-100|0.839924|
|1|New Orleans|11|New Orleans, Orleans Parish, Louisiana, United States of America|29.9499|-90.0701|0.808026|
|2|Gulf|8|Gulf County, Florida, United States of America|29.9665|-85.2176|0.620529|
|3|earth|8|Earth, Lamb County, Texas, United States of America|34.2331|-102.411|0.533546|
|4|Kentucky|7|Kentucky, United States of America|37.5726|-85.1551|0.821405|
|5|the United States|7|United States|39.7837|-100.446|1.13569|
|6|Grand Isle|6|Grand Isle, Grand Isle Township, Aroostook County, Maine, United States of America|47.3053|-68.152|0.601334|
|7|New York|6|New York, United States of America|40.7127|-74.006|1.01758|
|8|Valmonde|6|Not Found|Not Found|Not Found|Not Found|
|9|Brantain|6|Not Found|Not Found|Not Found|Not Found|
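Several matches in the table above look doubtful for a novel set largely in Louisiana: "Gulf" resolves to a county in Florida, "earth" to a town in Texas, and "Grand Isle" to a town in Maine. One rough way to flag rows for a manual check is to look at the importance score. The sketch below copies the place names and scores from the table into plain dictionaries; the cutoff of 0.7 is an arbitrary assumption chosen for this illustration, and `needs_review` is a hypothetical helper name.

```python
# Arbitrary cutoff for this illustration, not a Nominatim recommendation
IMPORTANCE_CUTOFF = 0.7

# Place names and importance scores copied from the geocoding results above
geocoded_rows = [
    {"place": "Mexico", "importance": 0.839924},
    {"place": "New Orleans", "importance": 0.808026},
    {"place": "Gulf", "importance": 0.620529},
    {"place": "earth", "importance": 0.533546},
    {"place": "Kentucky", "importance": 0.821405},
    {"place": "the United States", "importance": 1.13569},
    {"place": "Grand Isle", "importance": 0.601334},
    {"place": "New York", "importance": 1.01758},
    {"place": "Valmonde", "importance": "Not Found"},
]

def needs_review(row, cutoff=IMPORTANCE_CUTOFF):
    """Flag rows that were not found or that scored below the cutoff."""
    importance = row["importance"]
    return importance == "Not Found" or importance < cutoff

places_to_review = [row["place"] for row in geocoded_rows if needs_review(row)]
print(places_to_review)
```

A low score only suggests that a match deserves a second look, and a high score does not guarantee that the match is the place the novel means, so this is a triage step rather than a fix.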


## Making an Interactive Map

To map our geocoded coordinates, we're going to use the Python library [Folium](https://python-visualization.github.io/folium/). Folium is built on top of the popular JavaScript library [Leaflet](https://leafletjs.com/).

To install and import Folium, run the cells below:

In [None]:
!pip install folium

In [28]:
import folium

### Base Map

First, we need to establish a base map. This is where we'll map our geocoded locations from *The Awakening*. To do so, we're going to call `folium.Map()` and enter the general latitude/longitude coordinates of the New Orleans area at a particular zoom.

(To find latitude/longitude coordinates for a particular location, you can use Google Maps, [as described here](https://support.google.com/maps/answer/18539?co=GENIE.Platform%3DDesktop&hl=en).)

In [69]:
places_map = folium.Map(location=[29.98, -90.01], zoom_start=3)
places_map
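Instead of choosing the map center by hand, one option is to compute it from the geocoded points themselves. The sketch below is a minimal illustration under that assumption: `map_center` is a hypothetical helper, and the sample coordinates are the New Orleans, Kentucky, and New York values from the geocoding results above.

```python
from statistics import mean

def map_center(points):
    """Return [mean latitude, mean longitude] for a list of (lat, lon) pairs."""
    latitudes = [lat for lat, lon in points]
    longitudes = [lon for lat, lon in points]
    return [mean(latitudes), mean(longitudes)]

# (lat, lon) pairs for New Orleans, Kentucky, and New York
sample_points = [(29.9499, -90.0701), (37.5726, -85.1551), (40.7127, -74.006)]
center = map_center(sample_points)
print(center)
```

The resulting list could be passed as the `location` argument of `folium.Map()`. A plain average works for points clustered in one region, as here, but it would mislead for points on both sides of the 180° meridian.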

### Add Markers From Pandas Data

Adding a marker to a map is easy with Folium! We'll simply call `folium.Marker()` at a particular lat/lon, enter some text to display when the marker is clicked on, and then add it to our base map.

To add markers for every location in our Pandas dataframe, we can make a Python function and `.apply()` it to every row in the dataframe.

In [70]:
def create_map_markers(row, map_name):
    folium.Marker(location=[row['lat'], row['lon']], popup=row['place']).add_to(map_name)

Before we apply this function to our dataframe, we're going to drop any locations that were "Not Found" (which would cause `folium.Marker()` to raise an error).

In [71]:
found_place_locations = places_df[places_df['address'] != "Not Found"]

In [72]:
found_place_locations.apply(create_map_markers, map_name=places_map, axis='columns')
places_map
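The popups above show only the place name. Since the dataframe also holds the number of mentions, the popup text could include it. The sketch below assumes a hypothetical helper, `popup_text`, that accepts anything indexable by `'place'` and `'count'`, so a plain dictionary stands in for a dataframe row here.

```python
def popup_text(row):
    """Build a popup label such as 'Mexico (19 mentions)' from one row."""
    count = row["count"]
    # Use the singular form when a place is mentioned exactly once
    noun = "mention" if count == 1 else "mentions"
    return f"{row['place']} ({count} {noun})"

# A plain dictionary standing in for one row of the dataframe
print(popup_text({"place": "Mexico", "count": 19}))
```

Inside `create_map_markers`, `popup=popup_text(row)` would then take the place of `popup=row['place']`.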