## 3-3 Getting at the Proper Nouns

Now that we have some experience identifying parts of speech, let's move on to a more complex task: named entity recognition (NER). NER is the task of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

Please note that the custom tokenizer below is the one we first used in Notebook 1-1, with one modification: it no longer lowercases the text. Experimentation showed that NLTK uses capitalization as a signal for named entities. (Let me emphasize the importance of experimentation in text analytics: my first attempt at this notebook used a tokenizer that lowercased all text, and the results were not as good. I encourage you to experiment with different tokenizers and see how they affect the results of NER.)
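To make the effect of lowercasing concrete, here is a minimal sketch that uses only the standard library. `tknize_keep_case` mirrors the tokenizer defined below, while `tknize_lower` and the `sample` sentence are hypothetical, introduced here only for comparison. The sketch does not call NLTK; it only shows which tokens still carry the capitalization cue after tokenizing.

```python
import re

def tknize_keep_case(a_string):
    # Keep letters, spaces, and periods; everything else becomes a space.
    return re.sub(r'[^a-zA-Z .]', ' ', a_string).split()

def tknize_lower(a_string):
    # Hypothetical lowercasing variant, shown for comparison only.
    return tknize_keep_case(a_string.lower())

# Made-up sample sentence (an assumption, not a line from the corpus)
sample = "Rainsford sailed past Ship-Trap Island."

kept = tknize_keep_case(sample)
lowered = tknize_lower(sample)

# Tokens that still begin with a capital letter after tokenizing
capitalized = [t for t in kept if t[0].isupper()]

print(kept)
print(lowered)
print(capitalized)
```

After lowercasing, no token begins with a capital letter, so a tagger that relies on case has nothing left to work with.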

### Single Text Operations

In [42]:
# IMPORTS
import re
import nltk
from pathlib import Path

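# CUSTOM TOKENIZER
# This function takes a string and returns a list of tokens.
# It keeps only ASCII letters and periods; everything else becomes a space.
# QUIZ COMMENT: Could this function be improved?
def tknize(a_string):
    # Raw string avoids an invalid-escape warning; "." needs no escape inside []
    words = re.sub(r'[^a-zA-Z .]', ' ', a_string).split()
    return words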

# TEST DATA
with open("../data/mdg.txt", mode="r", encoding="utf-8") as f:
    mdg = f.read()

# SINGLE TEXT
with open("../queue/scifi/alien.txt", mode="r", encoding="utf-8") as f:
    alien = f.read()

In [16]:
# Test our tokenizer
mdg_ = tknize(mdg)
alien_ = tknize(alien)
print(mdg_[0:5])
print(alien_[0:5])

['Off', 'there', 'to', 'the', 'right']
['Alien', 'early', 'draft', 'by', 'Dan']


In [17]:
# mdg_ is our cleaned list of tokens, which keeps only letters and periods.
mdg_tagged = nltk.pos_tag(mdg_)
mdg_tagged[0:10]

[('Off', 'IN'),
 ('there', 'EX'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('right', 'NN'),
 ('somewhere', 'RB'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('large', 'JJ'),
 ('island', 'NN')]

In [18]:
for t in mdg_tagged[0:10]:
    if t[1] == 'JJ':
        print(t[0])

large


In [19]:
# Let's find all the proper nouns:
mdg_nouns = []
for i in mdg_tagged:
    if i[1] == "NNP":
        mdg_nouns.append(i[0])

print(mdg_nouns)

['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island', 'Whitney', 'A', 'Rainsford', 'Whitney', 'Caribbean', 'Rainsford.', 'Ugh', 'Rio', 'Whitney.', 'Purdey', 'Amazon.', 'Great', 'Rainsford.', 'Whitney.', 'Don', 'Whitney', 'Rainsford.', 'Who', 'Whitney.', 'Bah', 'Nonsense', 'Rainsford.', 'Whitney.', 'Be', 'Luckily', 'Do', 'Cannibals', 'Rainsford.', 'Hardly.', 'Even', 'God', 'Didn', 'Captain', 'Nielsen', 'Yes', 'Swede', 'Don', 'Pure', 'Rainsford.', 'One', 'Maybe.', 'Anyhow', 'Well', 'Rainsford.', 'Rainsford.', 'Good', 'Rainsford.', 'See', 'Right.', 'Good', 'Whitney.', 'Rainsford', 'Rainsford', 'Off', 'Again', 'Somewhere', 'Rainsford', 'Caribbean', 'Sea', 'A', 'Rainsford', 'Rainsford', 'Pistol', 'Rainsford', 'Ten', 'Jagged', 'Dense', 'Rainsford', 'Sleep', 'Rainsford', 'A', 'Rainsford', 'A', 'Bleak', 'Rainsford', 'Mirage', 'Rainsford.', 'Again', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Out', 'Rainsford.', 'Don', 'Rainsford', 'My', 'Sanger', 'Rainsford', 'New', 'York',
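Notice that the list contains both `'Whitney'` and `'Whitney.'`: because the tokenizer keeps periods attached to words, the same name shows up in two forms. One possible cleanup, sketched here under the assumption that trailing periods carry no information we need, is to strip them before counting. `strip_periods` is a name introduced for this illustration, and `sample` is copied from the first few items of the output above.

```python
from collections import Counter

# First few proper-noun tokens from the output above
sample = ['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island', 'Whitney',
          'A', 'Rainsford', 'Whitney', 'Caribbean', 'Rainsford.']

def strip_periods(tokens):
    # Remove trailing periods and drop tokens that were nothing but periods.
    stripped = (t.rstrip('.') for t in tokens)
    return [t for t in stripped if t]

counts = Counter(strip_periods(sample))
print(counts.most_common(2))
```

With the periods stripped, the two spellings of each name are counted together.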

In [20]:
def getPOS(POS, a_string):
    """
    Takes a POS tag and a string and returns a list of the words
    in the string that were tagged with that part of speech
    """
    tokens = tknize(a_string)
    tagged = nltk.pos_tag(tokens)
    the_list = []
    for i in tagged:
        if i[1] == POS:
            the_list.append(i[0])
    return the_list

In [None]:
# Test on toy data
nouns = getPOS("NNP", mdg)
print(nouns)

['Whitney.', 'Rainsford', 'Ship', 'Trap', 'Island', 'Whitney', 'A', 'Rainsford', 'Whitney', 'Caribbean', 'Rainsford.', 'Ugh', 'Rio', 'Whitney.', 'Purdey', 'Amazon.', 'Great', 'Rainsford.', 'Whitney.', 'Don', 'Whitney', 'Rainsford.', 'Who', 'Whitney.', 'Bah', 'Nonsense', 'Rainsford.', 'Whitney.', 'Be', 'Luckily', 'Do', 'Cannibals', 'Rainsford.', 'Hardly.', 'Even', 'God', 'Didn', 'Captain', 'Nielsen', 'Yes', 'Swede', 'Don', 'Pure', 'Rainsford.', 'One', 'Maybe.', 'Anyhow', 'Well', 'Rainsford.', 'Rainsford.', 'Good', 'Rainsford.', 'See', 'Right.', 'Good', 'Whitney.', 'Rainsford', 'Rainsford', 'Off', 'Again', 'Somewhere', 'Rainsford', 'Caribbean', 'Sea', 'A', 'Rainsford', 'Rainsford', 'Pistol', 'Rainsford', 'Ten', 'Jagged', 'Dense', 'Rainsford', 'Sleep', 'Rainsford', 'A', 'Rainsford', 'A', 'Bleak', 'Rainsford', 'Mirage', 'Rainsford.', 'Again', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Rainsford', 'Out', 'Rainsford.', 'Don', 'Rainsford', 'My', 'Sanger', 'Rainsford', 'New', 'York',

In [41]:
# Test on a single text from our corpus
nouns = getPOS("NNP", alien)
print(nouns)

['Alien', 'Dan', 'O', 'Bannon', 'ALIEN', 'STARBEAST', 'Story', 'Dan', 'O', 'Bannon', 'Ronald', 'Shusett', 'Screenplay', 'Dan', 'O', 'Bannon', 'SYNOPSIS', 'En', 'SNARK', 'Mankind', 'Their', 'Inside', 'Certain', 'Beneath', 'A', 'Hell', 'A', 'Finally', 'Earth', 'CAST', 'OF', 'CHARACTERS', 'CHAZ', 'STANDARD', 'Captain.................A', 'MARTIN', 'ROBY', 'Executive', 'Officer.......Cautious', 'DELL', 'BROUSSARD', 'Navigator...............Adventurer', 'SANDY', 'MELKONIS', 'Communications..........Tech', 'Intellectual', 'CLEAVE', 'HUNTER', 'Mining', 'Engineer.........High', 'JAY', 'FAUST', 'Engine', 'Tech.............A', 'Unimaginative.', 'FADE', 'IN', 'EXTREME', 'CLOSEUPS', 'OF', 'FLICKERING', 'INSTRUMENT', 'PANELS.', 'Readouts', 'Electronic', 'BEEPING', 'SIGNAL', 'Circuits', 'CAMERA', 'ANGLES', 'GRADUALLY', 'WIDEN', 'HYPERSLEEP', 'VAULT', 'A', 'FREEZER', 'COMPARTMENTS', 'FOOM', 'FOOM', 'FOOM', 'Slowly', 'ROBY', 'Oh...', 'God...', 'BROUSSARD', 'ROBY', 'BROUSSARD', 'FAUST', 'BROUSSARD', 'ME

### Putting It All Together

With POS tagging in hand, and a sense of NLTK's ability to identify at least proper nouns and some other named entities, we can now put it all together and see how well NLTK does at identifying named entities using fairly simple code.

In [43]:
# Read in all screenplays
screenplays = []
for p in Path('../queue/scifi/').glob('*.txt'):
    with open(p, encoding="utf8", errors='ignore') as f:
        contents = f.read()
        screenplays.append(contents)

It is tempting to deploy our already-written functions, but some experimentation reveals that getPOS expects a string and tokenizes it internally, and we would not want to tokenize each text twice. So we will create a single function that tokenizes the text, tags it, collects all the proper nouns into a list, and then uses that list to remove the proper nouns from the text.

Because our goal is to feed these results to scikit-learn in hopes of a better topic model, we will need our function to return a string.

*Phew.* That's a lot of work. Will it be worth it? Let's find out.

In [None]:
# The one function to rule them all
def removePNs(a_string):
    """
    Round-trips a string to remove proper nouns: tokenizes and tags it,
    then drops every token that was tagged NNP anywhere in the text
    """
    # Same pattern as tknize: keep only letters, spaces, and periods
    tokens = re.sub(r'[^a-zA-Z .]', ' ', a_string).split()
    tagged = nltk.pos_tag(tokens)
    # A set makes the membership test below fast on long texts
    toremove = {word for word, tag in tagged if tag == "NNP"}
    filtered = [token for token in tokens if token not in toremove]
    return " ".join(filtered)

Having created a composite function, let's try it out:

In [45]:
# Test on a single text from our corpus
alien = removePNs(alien)
print(alien[0:100])

early draft by project formerly titled by by route back to from a far part of the galaxy the crew of
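One thing worth noticing about `removePNs` is that it removes by word, not by occurrence: if a word is tagged NNP even once, every copy of that word is dropped. The sketch below contrasts that behavior with a possible alternative that drops only the occurrences actually tagged NNP. The `tagged` list is hand-written to stand in for `nltk.pos_tag` output (the tags are illustrative, not real tagger output), and both function names are introduced here for the comparison.

```python
# Hand-written (token, tag) pairs standing in for nltk.pos_tag output.
# The tags are illustrative assumptions, not real tagger results.
tagged = [('A', 'NNP'), ('ship', 'NN'), ('lands', 'VBZ'),
          ('A', 'DT'), ('crew', 'NN'), ('wakes', 'VBZ')]

def remove_by_type(tagged_tokens, tag="NNP"):
    # What removePNs does: drop every occurrence of any word
    # that received `tag` at least once.
    to_remove = {word for word, pos in tagged_tokens if pos == tag}
    return [word for word, _ in tagged_tokens if word not in to_remove]

def remove_by_position(tagged_tokens, tag="NNP"):
    # Alternative: drop only the occurrences that actually received `tag`.
    return [word for word, pos in tagged_tokens if pos != tag]

print(remove_by_type(tagged))
print(remove_by_position(tagged))
```

Which behavior is preferable depends on the goal: removing by word is more aggressive about mis-tagged names, while removing by occurrence keeps ordinary words such as a sentence-initial "A" when they are tagged correctly elsewhere.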


In [46]:
# Use on all screenplays
texts = [removePNs(text) for text in screenplays]

In [47]:
# IMPORTS
# For NMF Topic Models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Our custom function for displaying topics
def display_topics(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "{:d}: ".format(topic_idx)
        message += " ".join([feature_names[i] + ' ' + str(round(topic[i], 2)) + ','
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
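The slice `[:-n_top_words - 1:-1]` in `display_topics` can look cryptic: `argsort()` returns indices ordered from smallest weight to largest, and the slice walks backwards from the end, stopping after `n_top_words` items. The sketch below reproduces the idiom with plain Python lists in place of NumPy arrays; the `names` and `weights` values are made up for illustration.

```python
# Made-up feature names and topic weights (assumptions for illustration)
names = ["ship", "crew", "door", "eyes", "planet"]
weights = [1.31, 0.25, 0.02, 0.10, 0.20]

# Pure-Python stand-in for argsort: indices ordered by ascending weight
ascending = sorted(range(len(weights)), key=weights.__getitem__)

n_top_words = 3
# Start at the last index, step backwards, stop after n_top_words items
top = ascending[:-n_top_words - 1:-1]
top_words = [names[i] for i in top]

print(ascending)
print(top_words)
```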

In [None]:
# Set the parameters for the DTM
vectorizer = TfidfVectorizer(lowercase = True,
                             min_df = 2,
                             stop_words='english')

# fit the vectorizer to the texts and build the document-term matrix
dtm = vectorizer.fit_transform(texts)

# We'll need these later
vocabulary = vectorizer.get_feature_names_out()

# see how many documents (rows) and features (columns) we have
dtm.shape

(155, 24522)

In [54]:
# Set the parameters for creating the NMF model
nmf = NMF(n_components=30, 
          max_iter=500).fit(dtm)
nmf_W = nmf.transform(dtm)
nmf_H = nmf.components_
nmf_W.shape



(155, 30)

In [55]:
display_topics(nmf, vocabulary, 7)

0: know 0.62, beat 0.48, don 0.46, like 0.42, looks 0.37, just 0.36, think 0.31,
1: ship 1.31, beat 0.26, crew 0.25, hull 0.24, planet 0.2, space 0.2, light 0.19,
2: looks 0.59, eyes 0.38, hand 0.37, door 0.37, face 0.35, like 0.31, turns 0.3,
3: cont 1.33, like 0.22, looks 0.14, head 0.14, eyes 0.13, sword 0.12, just 0.12,
4: looks 1.11, like 1.04, just 0.98, know 0.71, ll 0.68, don 0.62, got 0.6,
5: looks 0.83, twins 0.82, ragdolls 0.81, ragdoll 0.42, scientist 0.42, sees 0.38, trench 0.38,
6: comlink 0.37, fighters 0.35, speeder 0.34, droid 0.3, stormtroopers 0.29, ship 0.26, walker 0.22,
7: raptor 0.97, jungle 0.48, rex 0.36, dinosaur 0.35, jeep 0.34, looks 0.34, raptors 0.33,
8: referring 0.54, ve 0.48, car 0.47, know 0.46, studio 0.41, face 0.4, screen 0.38,
9: door 1.17, room 0.8, don 0.45, just 0.38, know 0.37, numbers 0.35, floor 0.34,
10: apes 0.84, ape 0.57, cage 0.31, looks 0.3, chimp 0.29, horse 0.18, chimpanzee 0.18,
11: disk 0.83, data 0.31, energy 0.21, grid 0.2, circui