<div style='background-image: url("../share/Aerial_view_LLNL.jpg") ; padding: 0px ; background-size: cover ; border-radius: 15px ; height: 250px; background-position: 0% 80%'>
    <div style="float: center ; margin: 50px ; padding: 20px ; background: rgba(255 , 255 , 255 , 0.8) ; width: 50% ; height: 150px">
        <div style="position: relative ; top: 50% ; transform: translatey(-50%)">
            <div style="font-size: xx-large ; font-weight: 900 ; color: rgba(0 , 0 , 0 , 0.9) ; line-height: 100%">Notebook 5:</div>
            <div style="font-size: x-large ; padding-top: 20px ; color: rgba(0 , 0 , 0 , 0.7)">Part of Speech Tagging and Named Entity Recognition</div>
            <div style="font-size: large ; padding-top: 20px ; color: rgba(0 , 0 , 0 , 0.7)">Estimated Time: 30 minutes</div>
        </div>
    </div>
</div>






# [SpaCy](https://spacy.io/): Industrial-Strength NLP

The traditional NLP library has long been [NLTK](http://www.nltk.org/). While `NLTK` is still very useful for linguistic analysis and exploration, `spacy` has become a good option for fast, easy implementation of the NLP pipeline. What's the NLP pipeline? It's a series of common steps (tokenizing, part-of-speech tagging, lemmatizing, named entity recognition) that computational linguists perform to help them (and the computer) better understand textual data. Digital Humanists are often fond of the pipeline because it gives us more things to count! Let's see what `spacy` can give us that we can count.

In [None]:
import pandas as pd
import spacy

Let's start out with a short string from [Lydia Maria Child](https://en.wikipedia.org/wiki/Lydia_Maria_Child)'s *Romance of the Republic* and see what happens.

In [None]:
my_string = '''
"What are you going to do with yourself this evening, Alfred?" said Mr.
Royal to his companion, as they issued from his counting-house in New
Orleans. "Perhaps I ought to apologize for not calling you Mr. King,
considering the shortness of our acquaintance; but your father and I
were like brothers in our youth, and you resemble him so much, I can
hardly realize that you are not he himself, and I still a young man.
It used to be a joke with us that we must be cousins, since he was a
King and I was of the Royal family. So excuse me if I say to you, as
I used to say to him. What are you going to do with yourself, Cousin
Alfred?"

"I thank you for the friendly familiarity," rejoined the young man.
"It is pleasant to know that I remind you so strongly of my good
father. My most earnest wish is to resemble him in character as much
as I am said to resemble him in person. I have formed no plans for the
evening. I was just about to ask you what there was best worth seeing
or hearing in the Crescent City."'''.replace("\n", " ")

We've downloaded the English model, and now we just have to load it. This model will do ***everything*** for us, but we'll only get a little taste today.

In [None]:
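# spaCy v3+ no longer accepts the 'en' shortcut or parser=False.
# Assumption: the small English model `en_core_web_sm` is the one installed.
# nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('en_core_web_sm', disable=['parser'])  # run this instead if you don't have > 1GB RAM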

To parse an entire text we just call the model on a string.

In [None]:
parsed_text = nlp(my_string)
parsed_text

That was quick! So what happened? We've talked a lot about tokenizing:

In [None]:
[word.text for word in parsed_text]

What about parts of speech?

In [None]:
[word.pos_ for word in parsed_text]

Lemmata?

In [None]:
[word.lemma_ for word in parsed_text]

What else? Let's just make a function `tablefy` that will make a table of all this information for us:

In [None]:
def tablefy(parsed_text):
    df = pd.DataFrame()
    df["Word"] = [word.text for word in parsed_text]
    df["POS"] = [word.pos_ for word in parsed_text]
    df["Lemma"] = [word.lemma_ for word in parsed_text]
    df["Stop Word"] = [word.is_stop for word in parsed_text]
    df["Punctuation"] = [word.is_punct for word in parsed_text]
    df["Space"] = [word.is_space for word in parsed_text]
    df["Number"] = [word.like_num for word in parsed_text]
    df["OOV"] = [word.is_oov for word in parsed_text]
    return df

In [None]:
df = tablefy(parsed_text)
df

Now that we have it in a table format, we can use `pandas` to do some subsetting and counting. `pandas` is the most popular data analysis library for Python. While its syntax may be confusing at first, it's worth getting to know!

**Subsetting**:

In [None]:
df[df['POS'] == 'NOUN']

In [None]:
df[(df['POS'] == 'NOUN') & ~df['Stop Word']]

In [None]:
df[(df['POS'] == 'NOUN') | (df['POS'] == 'VERB')]
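
When the list of tags grows, chaining `|` conditions gets unwieldy. Here is a minimal sketch of the same kind of subset using `Series.isin`. The `toy` table and its rows are made up for illustration and only mimic two of the columns that `tablefy` produces:

```python
import pandas as pd

# Hypothetical stand-in for the table returned by tablefy().
toy = pd.DataFrame({
    "Word": ["father", "resemble", "the", "evening", "ask"],
    "POS": ["NOUN", "VERB", "DET", "NOUN", "VERB"],
})

# Keep rows whose POS is any of the listed tags; this selects the same rows as
# (toy["POS"] == "NOUN") | (toy["POS"] == "VERB").
nouns_and_verbs = toy[toy["POS"].isin(["NOUN", "VERB"])]
print(nouns_and_verbs)
```

On the real table, the equivalent would be `df[df['POS'].isin(['NOUN', 'VERB'])]`.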

**Counting**

Note that when we select a single column of a `DataFrame`, we get back a `pandas` `Series`, which we can iterate over much like a list:

In [None]:
df['Word']

In [None]:
from collections import Counter
Counter(df['Word']).most_common()

In [None]:
Counter(df[df['POS'] == 'NOUN']['Word']).most_common()
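
`Counter` works because a column can be iterated like a list, but `pandas` can also count for us. Below is a small sketch using `Series.value_counts`, which returns the counts sorted from most to least frequent. The `toy` table is invented for illustration and is not output from the model:

```python
import pandas as pd

# Hypothetical rows shaped like the tablefy() output.
toy = pd.DataFrame({
    "Word": ["you", "King", "you", "evening", "King", "you"],
    "POS": ["PRON", "PROPN", "PRON", "NOUN", "PROPN", "PRON"],
})

# Count every word in the column.
word_counts = toy["Word"].value_counts()

# Subset first, then count only the proper nouns.
propn_counts = toy[toy["POS"] == "PROPN"]["Word"].value_counts()

print(word_counts)
print(propn_counts)
```

The result is itself a `Series`, so it can be subset or plotted like any other column.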

## Challenge

What's the most common verb? Noun? What if you only include lemmata? What if you remove "stop words"?

How would lemmatizing or removing "stop words" help us better understand a text over regular tokenizing?

## Limitations

How accurate are the models? What happens if we change the style of English we're working with?

In [None]:
shakespeare = '''
Tush! Never tell me; I take it much unkindly
That thou, Iago, who hast had my purse
As if the strings were thine, shouldst know of this.
'''

shake_parsed = nlp(shakespeare.strip())
tablefy(shake_parsed)

In [None]:
huck_finn_jim = '''
“Who dah?” “Say, who is you?  Whar is you?  Dog my cats ef I didn’ hear sumf’n.
Well, I know what I’s gwyne to do:  I’s gwyne to set down here and listen tell I hears it agin.”
'''

hf_parsed = nlp(huck_finn_jim.strip())
tablefy(hf_parsed)

In [None]:
text_speech = '''
LOL where r u rn? omg that's sooo funnnnnny. c u in a sec.
'''
ts_parsed = nlp(text_speech.strip())
tablefy(ts_parsed)

In [None]:
old_english = '''
þæt wearð underne      eorðbuendum, 
þæt meotod hæfde      miht and strengðo 
ða he gefestnade      foldan sceatas. 
'''
oe_parsed = nlp(old_english.strip())
tablefy(oe_parsed)
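
Eyeballing the tables is one way to compare these samples. A rough number to put next to each one is the share of tokens flagged in the `OOV` column. The helper below is a hypothetical sketch (`oov_share` is not part of `spacy` or `pandas`), and the flags are made up. One caveat, as far as we know: `spacy` bases `is_oov` on whether a token has a word vector, so a model that ships without vectors may flag every token and make this number uninformative.

```python
def oov_share(flags):
    """Return the fraction of truthy values in an iterable of booleans."""
    flags = list(flags)
    if not flags:
        return 0.0  # avoid dividing by zero on an empty sample
    return sum(bool(f) for f in flags) / len(flags)

# Hypothetical OOV flags for two short samples (not real model output).
modern_flags = [False, False, False, True]
archaic_flags = [True, True, False, True]

print(oov_share(modern_flags))
print(oov_share(archaic_flags))
```

With the tables above, this could be called as `oov_share(tablefy(oe_parsed)["OOV"])` and compared against the value for `parsed_text`.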

## NER

In [None]:
ner_df = pd.DataFrame()
ner_df['entity_type'] = [ent.label_ for ent in parsed_text.ents]
ner_df['text'] = [ent.text for ent in parsed_text.ents]
ner_df

Cool! It's identified a few types of things for us. We can check what these mean [here](https://spacy.io/docs/usage/entity-recognition#entity-types). `GPE` covers countries, cities, and states.

Let's subset these geographic locations ('GPE'):

In [None]:
place_names = ner_df[ner_df['entity_type'] == 'GPE']['text']
place_names

The `Nominatim` geocoder from the `geopy` library looks up a place name and returns a location object that includes its latitude and longitude. Let's take the place names and map them!

In [None]:
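from geopy.geocoders import Nominatim
import time

# Recent geopy releases require a custom user_agent; the value below is a placeholder, so use your own.
geolocator = Nominatim(user_agent="pos-ner-notebook", timeout=10)
place_names_coords = []

for name in place_names.unique():  # loop through unique place names so we call once per place name
    print("Getting information for " + name + "...")

    # finds the lat and lon of each name in the locations list
    location = geolocator.geocode(name)
    time.sleep(1)  # pause between requests to be polite to the free Nominatim service

    # geocode() returns None when nothing matches, so skip those names
    if location is None:
        print("No match found for " + name)
        continue

    # index the raw response for lat and lon
    lat = float(location.raw["lat"])
    lon = float(location.raw["lon"])
    print(lat, lon)
    place_names_coords.append((name, (lat, lon)))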

We should have our place names in `place_names_coords`:

In [None]:
place_names_coords

We can use the `folium` mapping library now to put these dots on a map:

In [None]:
import folium
from IPython.display import IFrame

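# named place_map rather than map so we don't shadow Python's built-in map()
place_map = folium.Map(location=[39.8333333, -98.585522], zoom_start=3)

for name, (lat, lon) in place_names_coords:
    folium.CircleMarker((lat, lon),
                radius=1,
                popup=name,
                color="blue"
               ).add_to(place_map)

place_map.save("map.html")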
IFrame('map.html', width=700, height=400)
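
The map above is centered on a fixed point in the middle of the United States. If the place names come from somewhere else, one option is to center the map on the points themselves. The helper below is a hypothetical sketch (`mean_center` is not part of `folium`) that takes a plain average of the coordinates. That is a naive choice: it can give odd results for points spread across the globe or on either side of the 180° meridian. The sample data is made up, in the same shape as `place_names_coords`.

```python
def mean_center(named_coords):
    """Average the latitudes and longitudes of (name, (lat, lon)) pairs."""
    if not named_coords:
        raise ValueError("need at least one coordinate pair")
    lats = [lat for _, (lat, _) in named_coords]
    lons = [lon for _, (_, lon) in named_coords]
    return (sum(lats) / len(lats), sum(lons) / len(lons))

# Made-up coordinates in the same shape as place_names_coords.
sample = [("A", (10.0, 20.0)), ("B", (30.0, 40.0))]
center = mean_center(sample)
print(center)
```

The result could then be passed along as `folium.Map(location=list(mean_center(place_names_coords)), zoom_start=3)`.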