# GIAN 4: Resources and Recognition

You should now be familiar with tokenizing text and using frequency measures.

This notebook will show you how to use annotated text resources for tokenizing, lemmatizing, part-of-speech tagging, parsing, and named entity recognition.

In [None]:
# load the libraries required for this session
import re
from collections import Counter, defaultdict
import spacy
from matplotlib import pyplot as plt

In [None]:
# Tell the notebook that we want plots to be displayed inside of the notebook (don't open a new window)
%matplotlib inline

In [None]:
# Define some essential functions for this session

def clean_gutenberg_text(text):
    """Remove front and back matter from Project Gutenberg texts"""
    m1 = re.search("START OF THIS PROJECT GUTENBERG EBOOK .+\n", text)
    m2 = re.search("End of the Project Gutenberg EBook .+\n", text)
    tstart=m1.span()[1]+1 # Text starts one character after the end of the front matter
    tstop=m2.span()[0]  # Text ends one character before the beginning of the back matter
    # spaCy's tokenizer doesn't like newlines, so collapse runs of newlines into a single one
    text=re.sub("\n+", "\n", text[tstart:tstop])
    return text

def freq_over_time(doc, tracklist):
    """Track the cumulative frequency of named entities over the course of a document.
    
    Returns a dictionary (trackdict) where keys correspond to the entities from the tracklist,
    and values correspond to a list with the running count of the corresponding entity
    after each named entity in the document (the list starts at 0)."""
    trackdict={entity:[0] for entity in tracklist}
    for ent in doc.ents:
        for entity in tracklist:
            last_value=trackdict[entity][-1]
            if (ent.orth_, ent.label_)==entity:
                trackdict[entity].append(last_value+1)
            else:
                trackdict[entity].append(last_value)
    return trackdict
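
The cumulative counting done by `freq_over_time` can be tried out without loading a spaCy model. The sketch below is only an illustration: `FakeEnt`, `cumulative_counts`, and the toy entity list are hypothetical stand-ins, not part of spaCy or of this notebook's data.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity: it only carries the two
# attributes that freq_over_time reads (orth_ and label_).
FakeEnt = namedtuple("FakeEnt", ["orth_", "label_"])

def cumulative_counts(ents, tracklist):
    """Return, per tracked entity, its running count after each entity in ents."""
    counts = {entity: [0] for entity in tracklist}
    for ent in ents:
        for entity in tracklist:
            hit = (ent.orth_, ent.label_) == entity
            # extend every list at every step so all lists share the same x-axis
            counts[entity].append(counts[entity][-1] + int(hit))
    return counts

toy_ents = [
    FakeEnt("Lucie", "PERSON"),
    FakeEnt("Paris", "GPE"),
    FakeEnt("Lucie", "PERSON"),
    FakeEnt("London", "GPE"),
]
toy_counts = cumulative_counts(toy_ents, [("Lucie", "PERSON"), ("Paris", "GPE")])
print(toy_counts)
```

For this toy input, the list for Lucie is `[0, 1, 1, 2, 2]` and the list for Paris is `[0, 0, 1, 1, 1]`: each list has one value per entity in the input, plus the initial 0.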

# Running the full NLP pipeline

In [None]:
# load spaCy resources for English
# see https://spacy.io/docs/usage/models

# To avoid problems with unlinked modules,
# we use an alternative import syntax

import en_core_web_sm
nlp = en_core_web_sm.load()

Note that we haven't disabled any pipeline components, because in this lecture, in addition to tokenizing, we will also do parsing, part-of-speech tagging, and named entity recognition.

Let's import the Project Gutenberg file for "A Tale of Two Cities".

In [None]:
# read the file inside a with-block so that it is closed again afterwards
with open('GIAN3_data/pg98.txt', encoding="utf-8") as f:
    t0 = f.read()
t1 = clean_gutenberg_text(t0)
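
The clean-up step depends on the two marker lines being present in the file. As a minimal sketch of the same idea on a toy string, the helper below (a hypothetical `strip_between_markers`, not part of the notebook) falls back to the start or end of the text when a marker is missing, instead of raising an error.

```python
import re

def strip_between_markers(text, start_pattern, stop_pattern):
    """Return the text between two marker lines.

    If a marker is not found, fall back to the start or end of the text.
    """
    start = re.search(start_pattern, text)
    stop = re.search(stop_pattern, text)
    begin = start.end() if start else 0
    end = stop.start() if stop else len(text)
    # collapse runs of newlines and drop leading/trailing whitespace
    return re.sub(r"\n+", "\n", text[begin:end]).strip()

# Toy text that mimics the layout of a Project Gutenberg file
toy_text = (
    "Some licence text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK TOY BOOK ***\n"
    "\n\n"
    "It was the best of times,\n\n"
    "it was the worst of times.\n"
    "End of the Project Gutenberg EBook of Toy Book\n"
    "More licence text\n"
)
toy_body = strip_between_markers(
    toy_text,
    r"START OF THIS PROJECT GUTENBERG EBOOK .+\n",
    r"End of the Project Gutenberg EBook .+\n",
)
print(repr(toy_body))
```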

In [None]:
len(t1)

Now we can run the entire nlp pipeline on the book. 

(Warning: This will take much longer than just tokenizing)

In [None]:
doc1=nlp(t1)

In [None]:
for i, token in enumerate(doc1[:500]):
    # skip whitespace tokens; the raw string avoids an invalid-escape warning for \s
    if not re.search(r"\s",token.orth_):
        print(i,token)

Let's look at the first 100 tokens at the beginning of chapter 1 (which starts at token 368)

In [None]:
i=368
n=100
d1_tokens=[token.orth_ for token in doc1]
print(d1_tokens[i:i+n])

Since spaCy has tagged the tokens for us, we can now look at the lemmas corresponding to the tokens, ...

In [None]:
d1_lemmas=[token.lemma_ for token in doc1]
print(d1_lemmas[i:i+n])

In [None]:
# as an illustration, the following code achieves the same result without list comprehension
d1_lemmas=[]
for token in doc1:
    d1_lemmas.append(token.lemma_)
print(d1_lemmas[i:i+n])

the corresponding *general* part-of-speech tags, ...

In [None]:
d1_pos=[token.pos_ for token in doc1]
print(d1_pos[i:i+n])

and the corresponding *detailed* part-of-speech tags.

In [None]:
# use a separate name so that the general tags in d1_pos are not overwritten
d1_tags=[token.tag_ for token in doc1]
print(d1_tags[i:i+n])

An overview of the general tags can be found [here](https://spacy.io/usage/linguistic-features) and [here](https://spacy.io/api/annotation) and the detailed tags can be found [here](https://www.researchgate.net/profile/Jinho_Choi3/publication/324940566_Guidelines_for_the_CLEAR_Style_Constituent_to_Dependency_Conversion/links/5aebd3cfa6fdcc8508b6e6e8/Guidelines-for-the-CLEAR-Style-Constituent-to-Dependency-Conversion.pdf).

Meanwhile, here is a small cheat sheet.

![part-of-speech-tags](GIAN4_data/postags.png)

Let's look at the correspondence between token, lemma, and part-of-speech tag in more detail

In [None]:
for token in doc1[i:i+n]:
    if token.orth_.isalpha(): # skip punctuation, numbers, and whitespace tokens
        print ("{:s} => {:s} ({:s}, {:s})".format(token.orth_, token.lemma_, token.pos_, token.tag_))
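
Once tokens are paired with lemmas, the different surface forms of a word can be grouped together. The sketch below assumes a handful of hand-written `(token, lemma, pos)` triples rather than real spaCy output, so the lemmas and tags that spaCy actually assigns may differ.

```python
from collections import defaultdict

# Hand-written (token, lemma, general POS tag) triples; illustrative only
toy_tokens = [
    ("was", "be", "AUX"),
    ("were", "be", "AUX"),
    ("times", "time", "NOUN"),
    ("time", "time", "NOUN"),
    ("best", "good", "ADJ"),
]

# Map each (lemma, pos) pair to the set of surface forms seen for it
forms_by_lemma = defaultdict(set)
for orth, lemma, pos in toy_tokens:
    forms_by_lemma[(lemma, pos)].add(orth)

for (lemma, pos), forms in sorted(forms_by_lemma.items()):
    print("{:s} ({:s}): {:s}".format(lemma, pos, ", ".join(sorted(forms))))
```

With real data, the triples could be built from `token.orth_`, `token.lemma_`, and `token.pos_` in the same way as the cell above.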

## Sentences

Because spaCy has parsed the entire text, we can also look at sentences

In [None]:
sentences=list(doc1.sents)
# use idx as the loop variable so that i (the start of chapter 1) is not overwritten
for idx, sentence in enumerate(sentences[:100]):
    print(idx,sentence)

In [None]:
sentences[58]

In [None]:
from spacy import displacy
displacy.render(sentences[58], style='dep', jupyter=True)

## Named Entity Recognition

Running spaCy's nlp pipeline also includes [named entity recognition](https://spacy.io/usage/linguistic-features#section-named-entities)

Let's look at the first 100 named entities in the book

In [None]:
print([ent.orth_ for ent in doc1.ents[:100]])

The same thing, but now with named entity labels

In [None]:
print([(ent.orth_, ent.label_) for ent in doc1.ents[:100]])

In [None]:
displacy.render(doc1[:500], style='ent', jupyter=True)

We will now make frequency dictionaries of the persons, geopolitical entities, dates, and works of art in the document

In [None]:
# persons
doc1_person=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="PERSON"])
# geo-political entities
doc1_gpe=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="GPE"])
# dates
doc1_dates=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="DATE"])
# works of art
doc1_woa=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="WORK_OF_ART"])

This lets us answer who are the most common persons and ...

In [None]:
doc1_person.most_common(20)

... what are the most common places

In [None]:
doc1_gpe.most_common(20)
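
The four counters above all follow the same pattern: filter the entities by label, then count the `(text, label)` pairs. A minimal sketch of that pattern on toy data follows; the helper `count_by_label` and the entity list are hypothetical and only stand in for `doc1.ents`.

```python
from collections import Counter

# Toy (text, label) pairs standing in for (ent.orth_, ent.label_)
toy_pairs = [
    ("Lucie", "PERSON"),
    ("Paris", "GPE"),
    ("Lucie", "PERSON"),
    ("London", "GPE"),
    ("Paris", "GPE"),
    ("Paris", "GPE"),
]

def count_by_label(pairs, label):
    """Count the (text, label) pairs that carry the given entity label."""
    return Counter(pair for pair in pairs if pair[1] == label)

toy_gpe = count_by_label(toy_pairs, "GPE")
toy_person = count_by_label(toy_pairs, "PERSON")
print(toy_gpe.most_common(1))
print(toy_person.most_common(1))
```

Wrapping the filter in a function avoids repeating the list comprehension for every label.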

## Insights from plotting text data

As an example of how plotting data can give us insight into the text, we will use the named entity recognition data to track different persons and locations throughout the text.

Let's make a list of the 10 most common persons in the book ...

In [None]:
person_tracklist=[e for e, f in doc1_person.most_common(10)]
person_tracklist

... and track them over time

In [None]:
person_trackdict=freq_over_time(doc1, person_tracklist)

Now, we can plot the evolution of the tracked entities throughout the text

In [None]:
plt.style.use("fivethirtyeight")
# plt.plot(person_trackdict[('Lorry', 'PERSON')], label="Lorry")
# plt.plot(person_trackdict[('Cruncher', 'PERSON')], label="Cruncher")
plt.plot(person_trackdict[('Manette', 'PERSON')], label="Manette")
plt.plot(person_trackdict[('Miss Pross', 'PERSON')], label="Miss Pross")
# plt.plot(person_trackdict[('Jerry', 'PERSON')], label="Jerry")
plt.plot(person_trackdict[('Lucie', 'PERSON')], label="Lucie")
plt.ylabel("cumulative frequency")
plt.xlabel("book progression (entities)")
plt.legend(loc="upper left")

In [None]:
gpe_tracklist=[e for e, f in doc1_gpe.most_common(10)]
gpe_trackdict=freq_over_time(doc1, gpe_tracklist)

In [None]:
gpe_tracklist

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(gpe_trackdict[('Paris', 'GPE')], label="Paris")
# plt.plot(gpe_trackdict[('France', 'GPE')], label="France")
plt.plot(gpe_trackdict[('London', 'GPE')], label="London")
# plt.plot(gpe_trackdict[('England', 'GPE')], label="England")
plt.ylabel("cumulative frequency")
plt.xlabel("book progression (entities)")
plt.legend(loc="upper left")

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(gpe_trackdict[('Paris', 'GPE')], label="Paris")
plt.plot(person_trackdict[('Miss Pross', 'PERSON')], label="Miss Pross")
plt.plot(gpe_trackdict[('London', 'GPE')], label="London")
plt.plot(person_trackdict[('Lucie', 'PERSON')], label="Lucie")
plt.ylabel("cumulative frequency")
plt.xlabel("book progression (entities)")
plt.legend(loc="upper left")

## Combining data from parsing and named entity recognition

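To conclude, we combine the output of the dependency parser with the output of the named entity recognizer to see what happens to an entity in the text. The cell below prints every clause in which "Paris" is the direct object (`dobj`) of a verb ...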

In [None]:
track_entity="Paris"
for entity in doc1.ents: 
    if entity.orth_==track_entity:
        root=entity.root  # syntactic head of the entity span
        if root.dep_=="dobj":
            print("+ {:s}".format(' '.join([t.orth_ for t in root.head.subtree])))

Or to find out when two entities occur together in a sentence ...

In [None]:
track_set={"France", "England"}
for sentence in doc1.sents:
    # reuse the entities already recognized in doc1 instead of re-running the pipeline
    entities={ent.orth_ for ent in sentence.ents}
    if track_set.issubset(entities):
        print('+',' '.join([token.orth_ for token in sentence]))
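
The co-occurrence test itself is plain set logic: a sentence qualifies when every tracked entity is among the entities found in it. The sketch below shows this on toy data; the per-sentence entity sets are made up for illustration and do not come from the book.

```python
# Toy entity sets, one per sentence (hypothetical data)
toy_sentence_entities = [
    {"France", "England", "Dover"},
    {"France"},
    {"England", "London"},
    {"England", "France"},
]
wanted = {"France", "England"}

# A sentence is a hit when the tracked entities are a subset of its entities
hits = [
    idx
    for idx, ents in enumerate(toy_sentence_entities)
    if wanted.issubset(ents)
]
print(hits)
```

Only the first and the last toy sentence contain both entities, so `hits` is `[0, 3]`.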