## 3-2-2 Chunking: The Unexplored Territory

Now that we've had some experience with identifying parts of speech, let's move on to a more complex task: named entity recognition (NER). NER is the task of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

Note that the custom tokenizer below is the one we first used in Notebook 1-1, with one modification: it no longer lowercases the text. Experimentation showed that NLTK uses capitalization as a signal for named entities. (Let me emphasize the importance of experimentation in text analytics: my first attempt at this notebook used a tokenizer that lowercased all text, and the results were not as good. I encourage you to experiment with different tokenizers and see how they affect the results of NER.)

### Single Text Operations

In [42]:
# IMPORTS
import re
import nltk
from pathlib import Path

# CUSTOM TOKENIZER 
# This function takes a string and returns a list of tokens
# QUIZ COMMENT: Could this function be improved?
def tknize(a_string):
    # Raw string avoids the invalid-escape warning; keep letters, spaces, periods
    words = re.sub(r'[^a-zA-Z .]', ' ', a_string).split()
    return words

# TEST DATA
with open("../data/mdg.txt", mode="r", encoding="utf-8") as f:
    mdg = f.read()

# SINGLE TEXT
with open("../queue/scifi/alien.txt", mode="r", encoding="utf-8") as f:
    alien = f.read()

In [16]:
# Test our tokenizer
mdg_ = tknize(mdg)
alien_ = tknize(alien)
print(mdg_[0:5])
print(alien_[0:5])

['Off', 'there', 'to', 'the', 'right']
['Alien', 'early', 'draft', 'by', 'Dan']
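
The quiz comment in the tokenizer cell asks whether `tknize` could be improved. One possible answer, offered as a sketch rather than as this notebook's method: the current pattern leaves sentence-final periods attached to words and splits contractions at the apostrophe. The hypothetical `tknize_alt` below preserves case, keeps straight apostrophes inside words, and emits each period as its own token. Whether this helps the tagger is something to test by experiment.

```python
import re

# A word may contain internal straight apostrophes; a period is its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\.")

def tknize_original(a_string):
    """The notebook's tokenizer, reproduced here for comparison."""
    return re.sub(r'[^a-zA-Z .]', ' ', a_string).split()

def tknize_alt(a_string):
    """Hypothetical variant: case-preserving, apostrophe-keeping, period-separating."""
    return TOKEN_PATTERN.findall(a_string)

sample = "It's rather a mystery, said Whitney. Don't you think?"
print(tknize_original(sample))
print(tknize_alt(sample))
```

Typographic apostrophes (’) are not handled by this pattern, so a text that uses them would need them normalized first.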


In [17]:
# mdg_ is our cleaned list of tokens: only letters and periods are kept.
mdg_tagged = nltk.pos_tag(mdg_)
mdg_tagged[0:10]

[('Off', 'IN'),
 ('there', 'EX'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('right', 'NN'),
 ('somewhere', 'RB'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('large', 'JJ'),
 ('island', 'NN')]

In [18]:
for t in mdg_tagged[0:10]:
    if t[1] == 'JJ':
        print(t[0])

large
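
The loop above picks out one tag at a time. To see the distribution of all tags at once, a `collections.Counter` is one option. The sketch below hard-codes the ten tagged tokens printed earlier so that it runs without NLTK; `sample_tagged` is an illustrative name, not a variable from the notebook.

```python
from collections import Counter

# The first ten (word, tag) pairs from the mdg_tagged output above
sample_tagged = [
    ('Off', 'IN'), ('there', 'EX'), ('to', 'TO'), ('the', 'DT'),
    ('right', 'NN'), ('somewhere', 'RB'), ('is', 'VBZ'), ('a', 'DT'),
    ('large', 'JJ'), ('island', 'NN'),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts)

# The same filter as the loop above, written as a list comprehension
adjectives = [word for word, tag in sample_tagged if tag == 'JJ']
print(adjectives)
```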


In [19]:
# Let's find all the proper nouns:
mdg_nouns = []
for i in mdg_tagged:
    if i[1] == "NNP":
        mdg_nouns.append(i[0])

print(mdg_nouns)

['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island', 'Whitney', 'A', 'Rainsford', 'Whitney', 'Caribbean', 'Rainsford.', 'Ugh', 'Rio', 'Whitney.', 'Purdey', 'Amazon.', 'Great', 'Rainsford.', 'Whitney.', 'Don', 'Whitney', 'Rainsford.', 'Who', 'Whitney.', 'Bah', 'Nonsense', 'Rainsford.', 'Whitney.', 'Be', 'Luckily', 'Do', 'Cannibals', 'Rainsford.', 'Hardly.', 'Even', 'God', 'Didn', 'Captain', 'Nielsen', 'Yes', 'Swede', 'Don', 'Pure', 'Rainsford.', 'One', 'Maybe.', 'Anyhow', 'Well', 'Rainsford.', 'Rainsford.', 'Good', 'Rainsford.', 'See', 'Right.', 'Good', 'Whitney.', 'Rainsford', 'Rainsford', 'Off', 'Again', 'Somewhere', 'Rainsford', 'Caribbean', 'Sea', 'A', 'Rainsford', 'Rainsford', 'Pistol', 'Rainsford', 'Ten', 'Jagged', 'Dense', 'Rainsford', 'Sleep', 'Rainsford', 'A', 'Rainsford', 'A', 'Bleak', 'Rainsford', 'Mirage', 'Rainsford.', 'Again', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Out', 'Rainsford.', 'Don', 'Rainsford', 'My', 'Sanger', 'Rainsford', 'New', 'York',

In [20]:
def getPOS(POS, a_string):
    """
    Takes a POS tag and a string and returns a list of the words 
    in the string that were tagged with that part of speech
    """
    tokens = tknize(a_string)
    tagged = nltk.pos_tag(tokens)
    the_list = []
    for i in tagged:
        if i[1] == POS:
            the_list.append(i[0])
    return the_list

In [None]:
# Test on toy data
nouns = getPOS("NNP", mdg)
print(nouns)

['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island', 'Whitney', 'A', 'Rainsford', 'Whitney', 'Caribbean', 'Rainsford.', 'Ugh', 'Rio', 'Whitney.', 'Purdey', 'Amazon.', 'Great', 'Rainsford.', 'Whitney.', 'Don', 'Whitney', 'Rainsford.', 'Who', 'Whitney.', 'Bah', 'Nonsense', 'Rainsford.', 'Whitney.', 'Be', 'Luckily', 'Do', 'Cannibals', 'Rainsford.', 'Hardly.', 'Even', 'God', 'Didn', 'Captain', 'Nielsen', 'Yes', 'Swede', 'Don', 'Pure', 'Rainsford.', 'One', 'Maybe.', 'Anyhow', 'Well', 'Rainsford.', 'Rainsford.', 'Good', 'Rainsford.', 'See', 'Right.', 'Good', 'Whitney.', 'Rainsford', 'Rainsford', 'Off', 'Again', 'Somewhere', 'Rainsford', 'Caribbean', 'Sea', 'A', 'Rainsford', 'Rainsford', 'Pistol', 'Rainsford', 'Ten', 'Jagged', 'Dense', 'Rainsford', 'Sleep', 'Rainsford', 'A', 'Rainsford', 'A', 'Bleak', 'Rainsford', 'Mirage', 'Rainsford.', 'Again', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Out', 'Rainsford.', 'Don', 'Rainsford', 'My', 'Sanger', 'Rainsford', 'New', 'York',
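
The list above repeats names and carries trailing periods ('Whitney.' alongside 'Whitney'). One way to summarize it, sketched here with the first fourteen entries of the printed output, is to strip the trailing periods and count what remains. `sample_nouns` is an illustrative name.

```python
from collections import Counter

# First entries of the proper-noun list printed above
sample_nouns = ['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island',
                'Whitney', 'A', 'Rainsford', 'Whitney', 'Caribbean',
                'Rainsford.', 'Ugh', 'Rio', 'Whitney.']

# Strip sentence-final periods so 'Whitney.' and 'Whitney' count together
normalized = [noun.rstrip('.') for noun in sample_nouns]
noun_counts = Counter(normalized)
print(noun_counts.most_common(2))  # [('Whitney', 4), ('Rainsford', 3)]
```

Entries such as 'A' and 'Ugh' survive the cleanup, a reminder that the `NNP` tag is only a rough proxy for names.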

In [41]:
# Test on a single text from our corpus
nouns = getPOS("NNP", alien)
print(nouns)

['Alien', 'Dan', 'O', 'Bannon', 'ALIEN', 'STARBEAST', 'Story', 'Dan', 'O', 'Bannon', 'Ronald', 'Shusett', 'Screenplay', 'Dan', 'O', 'Bannon', 'SYNOPSIS', 'En', 'SNARK', 'Mankind', 'Their', 'Inside', 'Certain', 'Beneath', 'A', 'Hell', 'A', 'Finally', 'Earth', 'CAST', 'OF', 'CHARACTERS', 'CHAZ', 'STANDARD', 'Captain.................A', 'MARTIN', 'ROBY', 'Executive', 'Officer.......Cautious', 'DELL', 'BROUSSARD', 'Navigator...............Adventurer', 'SANDY', 'MELKONIS', 'Communications..........Tech', 'Intellectual', 'CLEAVE', 'HUNTER', 'Mining', 'Engineer.........High', 'JAY', 'FAUST', 'Engine', 'Tech.............A', 'Unimaginative.', 'FADE', 'IN', 'EXTREME', 'CLOSEUPS', 'OF', 'FLICKERING', 'INSTRUMENT', 'PANELS.', 'Readouts', 'Electronic', 'BEEPING', 'SIGNAL', 'Circuits', 'CAMERA', 'ANGLES', 'GRADUALLY', 'WIDEN', 'HYPERSLEEP', 'VAULT', 'A', 'FREEZER', 'COMPARTMENTS', 'FOOM', 'FOOM', 'FOOM', 'Slowly', 'ROBY', 'Oh...', 'God...', 'BROUSSARD', 'ROBY', 'BROUSSARD', 'FAUST', 'BROUSSARD', 'ME

### Putting It All Together

With POS tagging in hand, and a sense of NLTK's ability to identify at least proper nouns and some other named entities, we can now put it all together and see how well NLTK identifies named entities using fairly simple code.

In [43]:
# Read in all screenplays
screenplays = []
for p in Path('../queue/scifi/').glob('*.txt'):
    with open(p, encoding="utf8", errors='ignore') as f:
        contents = f.read()
        screenplays.append(contents)
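
The loop above keeps only the file contents. If the texts later need to be matched back to their titles, one option is to keep each file's stem alongside its text and to sort the paths so the order is stable across runs. This is a sketch: `read_corpus` is a hypothetical helper, and the demo uses a throwaway folder rather than the notebook's data.

```python
import tempfile
from pathlib import Path

def read_corpus(folder):
    """Return {file stem: text} for every .txt file in folder, in sorted order."""
    corpus = {}
    for p in sorted(Path(folder).glob('*.txt')):
        corpus[p.stem] = p.read_text(encoding='utf-8', errors='ignore')
    return corpus

# Demo on a temporary folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("second text", encoding="utf-8")
    (Path(tmp) / "a.txt").write_text("first text", encoding="utf-8")
    demo = read_corpus(tmp)

print(list(demo))  # ['a', 'b']
```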

It is tempting to deploy our already written functions, but some experimentation reveals that `getPOS` expects a raw string and returns only the matching words, so reusing it would mean tokenizing each text a second time in order to filter it. Instead, we will create a single function that tokenizes the text, tags it, collects all the proper nouns, and then uses that collection to remove the proper nouns from the text. 

Because our goal is to feed these results to scikit-learn in hopes of a better topic model, we will need to have our function return a string. 

*Phew.* That's a lot of work. Will it be worth it? Let's find out.

In [44]:
# The one function to rule them all
def removePNs(a_string):
    """
    Roundtrips a string to remove proper nouns
    """
    tokens = re.sub(r'[^a-zA-Z .]', ' ', a_string).split()
    tagged = nltk.pos_tag(tokens)
    the_list = []
    for i in tagged:
        if i[1] == "NNP":
            the_list.append(i[0])
    toremove = set(the_list)  # a set makes the membership test below fast
    filtered = []
    for token in tokens:
        if token not in toremove:
            filtered.append(token)
    return " ".join(filtered)
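
One design consequence of `removePNs` is worth noting. Because it collects every token that was tagged `NNP` anywhere and then filters by membership, a word tagged `NNP` once is removed everywhere, including places where it received a different tag. The toy-data output earlier shows 'A' and 'My' tagged `NNP`, so every occurrence of those tokens would disappear. The sketch below contrasts that behavior with an alternative that drops only the individual tokens carrying the tag. The function names are hypothetical and the tags are written by hand for illustration, not produced by NLTK.

```python
def remove_by_type(tagged, tag="NNP"):
    """Mirror removePNs: drop every token that was tagged `tag` anywhere."""
    to_remove = {word for word, t in tagged if t == tag}
    return " ".join(word for word, _ in tagged if word not in to_remove)

def remove_by_position(tagged, tag="NNP"):
    """Alternative: drop only the individual tokens that carry `tag`."""
    return " ".join(word for word, t in tagged if t != tag)

# Hand-written tags for illustration; 'A' is deliberately tagged NNP once.
toy_tagged = [
    ("A", "NNP"), ("Rainsford", "NNP"), ("heard", "VBD"),
    ("A", "DT"), ("sound", "NN"),
]

print(remove_by_type(toy_tagged))      # heard sound
print(remove_by_position(toy_tagged))  # heard A sound
```

Which behavior is preferable depends on the goal: removal by type is more aggressive about names but also discards ordinary words that were mis-tagged once.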

Having created a composite function, let's try it out:

In [None]:
    
# Apply the composite function to every screenplay
cleaned = []
for play in screenplays:
    # removePNs tokenizes, tags, and filters, then returns a string
    cleaned.append(removePNs(play))

### Chunking

In [23]:
# How to read the grammar below:
# Start with an optional (?) determiner (<DT>)
# Allow any number (*) of adjectives (<JJ>)
# End with a singular or mass noun (<NN>)
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Instantiate the chunk parser
parser = nltk.RegexpParser(grammar)

# Run it on our tagged text
tree = parser.parse(mdg_tagged)

# See some results
for i in tree[0:20]:
    print(i)

('Off', 'IN')
('there', 'EX')
('to', 'TO')
(NP the/DT right/NN)
('somewhere', 'RB')
('is', 'VBZ')
(NP a/DT large/JJ island/NN)
('said', 'VBD')
('Whitney.', 'NNP')
('It', 'PRP')
('s', 'VBZ')
('rather', 'RB')
(NP a/DT mystery/NN)
('What', 'WP')
(NP island/NN)
('is', 'VBZ')
('it', 'PRP')
('Rainsford', 'NNP')
('asked.', 'VBZ')
('The', 'DT')
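
To make the reading of the grammar concrete, here is a stdlib-only sketch that applies the same tag pattern with Python's `re` module to the first ten tagged tokens. It illustrates what the pattern matches and is not meant as a description of how `RegexpParser` works internally. `chunk_np` and `sample_tagged` are hypothetical names.

```python
import re

# Same shape as the grammar: optional <DT>, any number of <JJ>, then <NN>
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def chunk_np(tagged):
    """Return the words of each DT? JJ* NN run in a list of (word, tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map the character offset where each tag starts to its token index
    offsets = {}
    position = 0
    for index, (_, tag) in enumerate(tagged):
        offsets[position] = index
        position += len(tag) + 2  # +2 for the angle brackets
    offsets[position] = len(tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        start, end = offsets[match.start()], offsets[match.end()]
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

sample_tagged = [
    ('Off', 'IN'), ('there', 'EX'), ('to', 'TO'), ('the', 'DT'),
    ('right', 'NN'), ('somewhere', 'RB'), ('is', 'VBZ'), ('a', 'DT'),
    ('large', 'JJ'), ('island', 'NN'),
]
print(chunk_np(sample_tagged))  # [['the', 'right'], ['a', 'large', 'island']]
```

The two chunks agree with the first two `NP` subtrees in the output above. Because the pattern ends in `<NN>` exactly, tokens tagged `NNP` or `NNS` are left outside the chunks.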


In [28]:
print(tree[6])

(NP a/DT large/JJ island/NN)


In [30]:
NPtrees = [subtree for subtree in tree if isinstance(subtree, nltk.Tree) and subtree.label() == "NP"]

for i in NPtrees[0:20]:
    print(i)

(NP the/DT right/NN)
(NP a/DT large/JJ island/NN)
(NP a/DT mystery/NN)
(NP island/NN)
(NP suggestive/JJ name/NN)
(NP isn/NN)
(NP a/DT curious/JJ dread/NN)
(NP the/DT place./NN)
(NP Some/DT superstition/NN)
(NP the/DT dank/JJ tropical/JJ night/NN)
(NP thick/JJ warm/JJ blackness/NN)
(NP the/DT yacht./NN)
(NP a/DT laugh/NN)
(NP a/DT moose/NN)
(NP the/DT brown/JJ fall/NN)
(NP bush/NN)
(NP moist/NN)
(NP black/JJ velvet./NN)
(NP a/DT few/JJ days./NN)
(NP the/DT jaguar/NN)


In [31]:
NPleaves = [subtree.leaves() for subtree in tree if isinstance(subtree, nltk.Tree) and subtree.label() == "NP"]

print(len(NPleaves))
print(NPleaves[0:20])

1341
[[('the', 'DT'), ('right', 'NN')], [('a', 'DT'), ('large', 'JJ'), ('island', 'NN')], [('a', 'DT'), ('mystery', 'NN')], [('island', 'NN')], [('suggestive', 'JJ'), ('name', 'NN')], [('isn', 'NN')], [('a', 'DT'), ('curious', 'JJ'), ('dread', 'NN')], [('the', 'DT'), ('place.', 'NN')], [('Some', 'DT'), ('superstition', 'NN')], [('the', 'DT'), ('dank', 'JJ'), ('tropical', 'JJ'), ('night', 'NN')], [('thick', 'JJ'), ('warm', 'JJ'), ('blackness', 'NN')], [('the', 'DT'), ('yacht.', 'NN')], [('a', 'DT'), ('laugh', 'NN')], [('a', 'DT'), ('moose', 'NN')], [('the', 'DT'), ('brown', 'JJ'), ('fall', 'NN')], [('bush', 'NN')], [('moist', 'NN')], [('black', 'JJ'), ('velvet.', 'NN')], [('a', 'DT'), ('few', 'JJ'), ('days.', 'NN')], [('the', 'DT'), ('jaguar', 'NN')]]
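
`NPleaves` holds each chunk as a list of `(word, tag)` tuples. For reading or counting, plain strings are often easier to work with. A minimal sketch, using five chunks copied from the output above (`sample_leaves` is an illustrative name):

```python
# Five chunks copied from the NPleaves output above
sample_leaves = [
    [('the', 'DT'), ('right', 'NN')],
    [('a', 'DT'), ('large', 'JJ'), ('island', 'NN')],
    [('a', 'DT'), ('mystery', 'NN')],
    [('island', 'NN')],
    [('the', 'DT'), ('place.', 'NN')],
]

# Join the words of each chunk into one phrase
phrases = [" ".join(word for word, _ in leaves) for leaves in sample_leaves]
print(phrases)

# The grammar ends every chunk with <NN>, so the last word is the head noun
heads = [leaves[-1][0].rstrip('.') for leaves in sample_leaves]
print(heads)
```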


### Named Entities

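You will need to download the named entity chunker first. The package name depends on your NLTK version: the cell below uses `nltk.download("maxent_ne_chunker_tab")`, while older releases use `nltk.download("maxent_ne_chunker")`. If a resource is still missing, NLTK's error message names the package to download.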

In [None]:
# Uncomment the following line to download the maxent_ne_chunker
# nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/jl/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True


For more on the kinds of named entities: https://www.nltk.org/book/ch07.html#sec-ner.

In [36]:
tree = nltk.ne_chunk(mdg_tagged, binary=True)

def extract_ne(text):
    # tokenize by word
    words = tknize(text)
    # apply part of speech tags to those words
    tagged = nltk.pos_tag(words)
    # extract named entities based on those tags
    # binary=True ==> named entities won't be labeled by kind
    tree = nltk.ne_chunk(tagged, binary=True)
    ne_set = set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE"
    )
    return ne_set

In [38]:
mdg_ne = extract_ne(mdg)

for i in mdg_ne:
    print(i)

Caribbean Sea
Follow Ivan
American
Malay
Night
Ivan
America
Tonight
God
Victorian
Great White Czar
Rest
Cannibals
Monte Carlo
Nerve
Really Oh
Tibet
Puritan
Russian
Watch Out
Zaroff
Oh
Luckily
Chablis Mr. Rainsford General
English French
Dense
Sanger Rainsford
East
Amazon
France
Pol Roger
Dusk
Lazarus
Crimea
Mirage
Madame
General Zaroff
Moscow
Toward
Whitney
New York Rainsford
Pistol
Folies
Nonsense
Invariably Mr. Rainsford
Again Rainsford
San Lucar
Ship
Veuve Cliquot
Mr.
New York
Cape
Rains
Sleep
Yes Life
Poor
Mr. Sanger
Captain Nielsen Yes
Turkish
Rainsford
Marcus
Caribbean
Rio
Ganges
Africa
Chinese
Spanish
Purdey
Russia
General
London
Better
Mr. Rainsford
Caucasus
Cossack
Swam
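
Several entries in the output above ('Luckily', 'Toward', 'Again Rainsford') look like capitalized sentence openers rather than named entities. One cheap heuristic, offered as an assumption to test rather than as a fix, is to drop single-word candidates whose lowercase form also occurs somewhere in the text. `filter_sentence_starters` is a hypothetical helper, and the toy tokens are invented for illustration.

```python
def filter_sentence_starters(entities, tokens):
    """Drop one-word entities whose lowercase form also appears as a token."""
    seen = set(tokens)
    kept = set()
    for entity in entities:
        is_single_word = " " not in entity
        if is_single_word and entity.lower() in seen:
            continue  # also used as an ordinary word, so probably not a name
        kept.add(entity)
    return kept

# Invented tokens; the candidate entities are taken from the output above
toy_tokens = ["Luckily", "he", "swam", "toward", "Ship", "Trap", "Island",
              "and", "luckily", "Zaroff", "slept"]
toy_entities = {"Luckily", "Zaroff", "Toward", "Caribbean Sea"}

print(sorted(filter_sentence_starters(toy_entities, toy_tokens)))
# ['Caribbean Sea', 'Zaroff']
```

The heuristic will also discard genuine names that double as common words, and it leaves multi-word candidates such as 'Again Rainsford' untouched, so the results still need to be checked by eye.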


In [39]:
# Repeat for the alien text:
alien_ne = extract_ne(alien)

for i in alien_ne:
    print(i)

STANDARD Alas
STANDARD Put
Radar
BRIDGE Roby
STANDARD Let
Earth
RAM
MELKONIS Feast
STARBEAST Story
ROOM
THE DIRECTION
CONTROL
BURSTS
HUNTER Good
ROBY How
SPOOKY
SNARK
Standard Roby
Around
ROBY Couldn
Standard Roby Broussard
ROWS OF
HUNTER Well
THE STARSHIP
SCRAWLED
STANDARD Martin
Iron Age
LIFEBOAT
Antarctica
STANDARD Dell
MELKONIS Could
ROBY Okay
SCENE
ROBY Kitty
Kitty
Finally Faust
Snark
HUGE
HUNTER Don
STANDARD Yes
VOICES
INTERCOM
PASSAGEWAY
OMINOUS
Hunter
STANDARD How
FAUST Somebody
HULL OF
DIRECTLY
HUNTER AND
BROUSSARD Close
ROBY Thanks
TWISTS
THE STONE
Mankind
BEEPING
Don
THE ONE
HUNTER Listen
Martin Roby
MELKONIS First
ROBY Isn
ROBY If
LAYER AND
LOWER
TAPS
MELKONIS Men
METALLIC
BRIDGE Standard Roby
THE CRASHING
BEGINS TO
BEEPS
OhmygooaaAA
CRUDELY
BROUSSARD Let
ANOTHER
ENGINE
Dell
Carefully Standard
FAUST Hurt How
STANDARD Can
TINY
ROBY Hey
HUNTER Oh
Roby GASPS
MELKONIS Jesus
COMMUNICATOR
PYRAMID
SMOKE AND
THE
STANDARD Which
STANDARD Tell
How
STANDARD ROBY
INTERIOR MAIN
ROBY Wher