# Get Locations From U.S. Inaugural Addresses

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also importing `Counter`, a class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data. The two `pd.options` lines raise the pandas display limits to 600 rows and a maximum column width of 400 characters.
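As a small illustration of what `Counter` does, the sketch below tallies a hand-made list of place names. The list is invented for the example and does not come from the inaugural addresses.

```python
from collections import Counter

# Hypothetical place mentions, invented for illustration only
sample_places = ["America", "Athens", "America", "Baltimore", "America"]

place_counts = Counter(sample_places)

# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = place_counts.most_common(2)
print(top_two)
```

Looking up a single name, for example `place_counts["America"]`, returns its count directly.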

### Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will process our texts and make predictions about them. This model was trained on the annotated "OntoNotes" corpus. You can download `en_core_web_sm` by running the cell below:

In [None]:
!python -m spacy download en_core_web_sm

### Load Language Model

In [2]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [4]:
import glob
from pathlib import Path

In [5]:
filenames = glob.glob('../texts/history/US_Inaugural_Addresses/*.txt')

In [43]:
places = []

for file in filenames:
    # Read the whole address; read_text() opens and closes the file for us
    text = Path(file).read_text(encoding='utf-8')
    # Get the year from the file name -- assumed to be the last 4 characters of the stem
    year = Path(file).stem[-4:]
    # Run spacy on each address
    document = nlp(text)
    
    for named_entity in document.ents:
        if named_entity.label_ == "GPE":
            # Append one record (a dictionary) per place mention
            places.append({
                "Place": named_entity.text,
                "Year": year,
                "Address": Path(file).stem
            })

# Make DataFrame from list of dictionaries -- the dictionary keys become the column names
df = pd.DataFrame(places)
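The year lookup in the loop above relies on each file name ending in a four-digit year. Here is a minimal sketch of that slicing, using a made-up file name (the real names in `US_Inaugural_Addresses` may differ). The helper `year_from_filename` and its digit check are assumptions added for illustration and are not part of the original code.

```python
from pathlib import Path

def year_from_filename(filename):
    """Return the last four characters of the file name stem if they are digits."""
    stem = Path(filename).stem  # file name without directory or extension
    year = stem[-4:]
    if not year.isdigit():
        raise ValueError(f"No trailing 4-digit year in {filename!r}")
    return year

# Hypothetical file name, assumed for illustration
example = "../texts/history/US_Inaugural_Addresses/01_washington_1789.txt"
print(year_from_filename(example))
```

Raising an error on an unexpected file name makes a stray file in the folder fail loudly instead of silently producing a bad `Year` value.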

In [44]:
def add_decade(year):
    # Note: despite the name, this buckets each year by century ("1700s", "1800s", ...)
    year = int(year)
    if year < 1800:
        return "1700s"
    elif year < 1900:
        return "1800s"
    elif year < 2000:
        return "1900s"
    else:
        return "2000s"
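The `if`/`elif` chain above assigns each year to a century-sized bucket. For years between 1700 and 2099 the same labels can be computed arithmetically with integer division. The helper name `century_label` below is hypothetical and not part of the notebook.

```python
def century_label(year):
    """Map a year (string or int) to a label such as '1700s'."""
    year = int(year)
    # Integer division drops the last two digits: 1789 // 100 * 100 == 1700
    return f"{year // 100 * 100}s"

for sample_year in ["1789", "1865", "1961", "2009"]:
    print(sample_year, century_label(sample_year))
```

Unlike the chain above, this version would label a year before 1700 or after 2099 with its own century instead of folding it into the first or last bucket.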

In [45]:
df['Decade'] = df['Year'].apply(add_decade)

decade_df = df.groupby(['Place', 'Decade'])['Address'].count().reset_index()
decade_df = decade_df.rename(columns={'Address': 'Count'})
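To see what the `groupby` and `rename` steps produce, here is a minimal sketch on a few hand-made rows (the values are invented, not taken from the addresses). Each row stands for one place mention, so counting the `Address` column per (`Place`, `Decade`) pair gives the number of mentions.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy_df = pd.DataFrame([
    {"Place": "America", "Year": "1789", "Address": "address_a", "Decade": "1700s"},
    {"Place": "America", "Year": "1789", "Address": "address_a", "Decade": "1700s"},
    {"Place": "America", "Year": "1801", "Address": "address_b", "Decade": "1800s"},
    {"Place": "Athens", "Year": "1801", "Address": "address_b", "Decade": "1800s"},
])

toy_counts = (
    toy_df.groupby(["Place", "Decade"])["Address"]
    .count()           # number of rows (mentions) in each group
    .reset_index()     # turn the group keys back into ordinary columns
    .rename(columns={"Address": "Count"})
)
print(toy_counts)
```

Because the count is per mention, a place named twice in one address contributes two to its total.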

In [46]:
decade_df

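| | Place | Decade | Count |
|---|---|---|---|
| 0 | Afghanistan | 2000s | 1 |
| 1 | America | 1700s | 5 |
| 2 | America | 1800s | 2 |
| 3 | America | 1900s | 135 |
| 4 | America | 2000s | 63 |
| 5 | Argonne | 1900s | 1 |
| 6 | Arlington | 2000s | 1 |
| 7 | Athens | 1800s | 2 |
| 8 | Baltimore | 1800s | 1 |
| 9 | Bolivar | 1800s | 1 |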

In [47]:
decade_df.to_csv('Locations-in-Inaugural-Addresses.csv', index=False)