## Part-of-Speech Tagging and Named Entity Recognition using NLTK

A long-standing task in NLP is reliably identifying a word's part of speech. This helps with the ever-present task of identifying content words, and it is useful in a variety of other analyses. Part-of-speech tagging is a specific instance of the larger category of word tagging: placing words into pre-determined categories.

Another instance of word tagging is named entity recognition.

Today we'll learn how to identify a word's part of speech and how to extract named entities using NLTK, and think through reasons we may want to do this.

### Learning Goals:

* Understand the intuition behind tagging and information extraction
* Use NLTK to tag the part of speech of each word
* Count most frequent words based on their part of speech
* Extract named entities from a text, and in doing so, learn more about chunking text
* Be equipped to explore this further to get better accuracy for your chosen corpus

### Key Terms

* *part-of-speech tagging*: 
    * the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context
* *named entity recognition*:
    * a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc
* *tree* 
    * a data structure made up of nodes (or vertices) and edges, without any cycles
* *treebank*:
    * a parsed text corpus that annotates syntactic or semantic sentence structure
* *tuple*:
    * an immutable sequence of Python objects


### Further Resources

For more information on information extraction using NLTK, see chapter 7: http://www.nltk.org/book/ch07.html

### 0. Part-of-Speech Tagging
On Monday you may have noticed that stop words are typically short function words. Intuitively, if we could identify the part of speech of a word, we would have another way of identifying content words. NLTK can do that too!

NLTK has a function that will tag the part of speech of every token in a text. For this, we will re-create our original tokenized text sentence from Monday, with the stop words and punctuation.

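NLTK tags each word using the Penn Treebank tag set. The tagger is deterministic in the sense that the same input always receives the same tags, but it is not a simple lookup of each word's most common part of speech: it is a model trained on tagged text, and it takes a word's context into account. In the output below, for example, 'that' is tagged WDT in one position and DT in another. You can find a list of all the part-of-speech tags here: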
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
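To make the tags in the output easier to read, here is a small glossary sketch. It covers only the tags that appear in this lesson's example sentence, and the names `PENN_TAG_GLOSSARY` and `describe_tag` are our own for illustration; they are not part of NLTK.

```python
# A hand-written glossary of the Penn Treebank tags that show up in this
# lesson's example sentence. This dictionary is hypothetical helper code,
# not something NLTK provides under this name.
PENN_TAG_GLOSSARY = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
}


def describe_tag(tag):
    """Return a readable description, or a fallback for tags not listed."""
    return PENN_TAG_GLOSSARY.get(tag, "not in this glossary")


for tag in ["JJ", "NNS", "VBZ", "XYZ"]:
    print(tag, "->", describe_tag(tag))
```

For the full tag set, the linked Penn Treebank page remains the reference.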

In [1]:
import nltk
from nltk import word_tokenize

sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

sentence_tokens = word_tokenize(sentence)

#check we did everything correctly
sentence_tokens

['For',
 'me',
 'it',
 'has',
 'to',
 'do',
 'with',
 'the',
 'work',
 'that',
 'gets',
 'done',
 'at',
 'the',
 'crossroads',
 'of',
 'digital',
 'media',
 'and',
 'traditional',
 'humanistic',
 'study',
 '.',
 'And',
 'that',
 'happens',
 'in',
 'two',
 'different',
 'ways',
 '.',
 'On',
 'the',
 'one',
 'hand',
 ',',
 'it',
 "'s",
 'bringing',
 'the',
 'tools',
 'and',
 'techniques',
 'of',
 'digital',
 'media',
 'to',
 'bear',
 'on',
 'traditional',
 'humanistic',
 'questions',
 ';',
 'on',
 'the',
 'other',
 ',',
 'it',
 "'s",
 'also',
 'bringing',
 'humanistic',
 'modes',
 'of',
 'inquiry',
 'to',
 'bear',
 'on',
 'digital',
 'media',
 '.']

In [2]:
#use the nltk pos function to tag the tokens
tagged_sentence_tokens = nltk.pos_tag(sentence_tokens)

#view tagged sentence
tagged_sentence_tokens

[('For', 'IN'),
 ('me', 'PRP'),
 ('it', 'PRP'),
 ('has', 'VBZ'),
 ('to', 'TO'),
 ('do', 'VB'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('work', 'NN'),
 ('that', 'WDT'),
 ('gets', 'VBZ'),
 ('done', 'VBN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('crossroads', 'NNS'),
 ('of', 'IN'),
 ('digital', 'JJ'),
 ('media', 'NNS'),
 ('and', 'CC'),
 ('traditional', 'JJ'),
 ('humanistic', 'JJ'),
 ('study', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('that', 'DT'),
 ('happens', 'VBZ'),
 ('in', 'IN'),
 ('two', 'CD'),
 ('different', 'JJ'),
 ('ways', 'NNS'),
 ('.', '.'),
 ('On', 'IN'),
 ('the', 'DT'),
 ('one', 'CD'),
 ('hand', 'NN'),
 (',', ','),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('bringing', 'VBG'),
 ('the', 'DT'),
 ('tools', 'NNS'),
 ('and', 'CC'),
 ('techniques', 'NNS'),
 ('of', 'IN'),
 ('digital', 'JJ'),
 ('media', 'NNS'),
 ('to', 'TO'),
 ('bear', 'VB'),
 ('on', 'IN'),
 ('traditional', 'JJ'),
 ('humanistic', 'JJ'),
 ('questions', 'NNS'),
 (';', ':'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('other', 'JJ'),
 (',', ','),
 ('it', 

Now comes more complicated code. Stay with me. The above output is a list of *tuples*. A tuple is an immutable sequence of Python objects. In this case, each tuple is a pair of strings: a word and its tag. Looping through a list of tuples works much like looping through any other list, but with slightly different syntax.

Note that this is not a list of lists, as we saw in our lesson on Pandas. This is a list of tuples.
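As a quick illustration of how tuples behave, here is a minimal sketch using three (word, tag) pairs copied from the output above. The variable names `tagged_sample` and `first_pair` are ours, chosen only for this example.

```python
# A few (word, tag) pairs copied from the tagged output above.
tagged_sample = [("digital", "JJ"), ("media", "NNS"), ("and", "CC")]

# Tuples are indexed just like lists...
first_pair = tagged_sample[0]
print(first_pair[0], first_pair[1])

# ...but, unlike lists, they cannot be changed in place.
try:
    first_pair[1] = "NN"
except TypeError as error:
    print("Tuples are immutable:", error)

# Unpacking gives each element of the tuple its own name inside the loop.
for word, tag in tagged_sample:
    print(word, "is tagged", tag)
```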

Let's pull out the part-of-speech tag from each tuple above and save that to a list. Notice the order stays exactly the same.

In [3]:
word_tags = [tag for (word, tag) in tagged_sentence_tokens]
print(word_tags)

['IN', 'PRP', 'PRP', 'VBZ', 'TO', 'VB', 'IN', 'DT', 'NN', 'WDT', 'VBZ', 'VBN', 'IN', 'DT', 'NNS', 'IN', 'JJ', 'NNS', 'CC', 'JJ', 'JJ', 'NN', '.', 'CC', 'DT', 'VBZ', 'IN', 'CD', 'JJ', 'NNS', '.', 'IN', 'DT', 'CD', 'NN', ',', 'PRP', 'VBZ', 'VBG', 'DT', 'NNS', 'CC', 'NNS', 'IN', 'JJ', 'NNS', 'TO', 'VB', 'IN', 'JJ', 'JJ', 'NNS', ':', 'IN', 'DT', 'JJ', ',', 'PRP', 'VBZ', 'RB', 'VBG', 'JJ', 'NNS', 'IN', 'NN', 'TO', 'VB', 'IN', 'JJ', 'NNS', '.']


Question: What is the difference in syntax for the above code compared to our standard list comprehension code?


We can count the part-of-speech tags in much the same way we counted words, to output the most frequent types of words in our text.

In [4]:
tagged_frequency = nltk.FreqDist(word_tags)
tagged_frequency.most_common()

[('IN', 11),
 ('JJ', 10),
 ('NNS', 9),
 ('DT', 6),
 ('VBZ', 5),
 ('PRP', 4),
 ('NN', 4),
 ('.', 3),
 ('CC', 3),
 ('VB', 3),
 ('TO', 3),
 ('VBG', 2),
 (',', 2),
 ('CD', 2),
 ('WDT', 1),
 (':', 1),
 ('VBN', 1),
 ('RB', 1)]
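The fine-grained tags split each word class into several pieces (NN and NNS, or VB, VBG, VBN, and VBZ), which can hide the bigger picture. One possible way to collapse them into broad classes is sketched below, using the counts copied from the output above. The helper `coarse_category` and its prefix rule are our own simplification, not an NLTK feature.

```python
from collections import Counter

# Tag counts copied from the output above.
tag_counts = {
    "IN": 11, "JJ": 10, "NNS": 9, "DT": 6, "VBZ": 5, "PRP": 4, "NN": 4,
    ".": 3, "CC": 3, "VB": 3, "TO": 3, "VBG": 2, ",": 2, "CD": 2,
    "WDT": 1, ":": 1, "VBN": 1, "RB": 1,
}


def coarse_category(tag):
    """Map a Penn Treebank tag to a broad class (hypothetical helper).

    Note that the "NN" prefix would also match proper nouns (NNP, NNPS),
    which do not occur in this sentence.
    """
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    return "other"


# Add each fine-grained count to the total for its broad class.
coarse_counts = Counter()
for tag, count in tag_counts.items():
    coarse_counts[coarse_category(tag)] += count

print(coarse_counts.most_common())
```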

This sentence contains a lot of adjectives. So let's first look at the adjectives. Notice the syntax here.

In [5]:
adjectives = [word for (word,pos) in tagged_sentence_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']

#print all of the adjectives
print(adjectives)

['digital', 'traditional', 'humanistic', 'different', 'digital', 'traditional', 'humanistic', 'other', 'humanistic', 'digital']


Let's do the same for nouns.

In [6]:
nouns = [word for (word,pos) in tagged_sentence_tokens if pos=='NN' or pos=='NNS']

#print all of the nouns
print(nouns)

['work', 'crossroads', 'media', 'study', 'ways', 'hand', 'tools', 'techniques', 'media', 'questions', 'modes', 'inquiry', 'media']


And now verbs.

In [9]:
#verbs = [word for (word,pos) in tagged_sentence_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
verbs = [word for (word,pos) in tagged_sentence_tokens if pos in ['VB', 'VBD','VBG','VBN','VBP','VBZ']]

#print all of the verbs
print(verbs)

['has', 'do', 'gets', 'done', 'happens', "'s", 'bringing', 'bear', "'s", 'bringing', 'bear']
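The three filters above repeat the same pattern with a different set of tags each time. One way to avoid spelling out every tag is to match on the tag's prefix instead, since all of the verb tags begin with VB, the noun tags with NN, and the adjective tags with JJ. The function `words_tagged` below is a hypothetical helper written for this lesson, shown on a short sample of pairs taken from the tagged sentence.

```python
def words_tagged(tagged_tokens, prefixes):
    """Return the words whose tag starts with any of the given prefixes.

    `prefixes` is a tuple of strings; str.startswith accepts a tuple and
    returns True if the string starts with any element of it.
    """
    return [word for (word, tag) in tagged_tokens if tag.startswith(prefixes)]


# A short sample of (word, tag) pairs taken from the tagged sentence.
tagged_sample = [
    ("it", "PRP"), ("'s", "VBZ"), ("bringing", "VBG"), ("the", "DT"),
    ("tools", "NNS"), ("to", "TO"), ("bear", "VB"),
]

print(words_tagged(tagged_sample, ("VB",)))
print(words_tagged(tagged_sample, ("NN", "JJ")))
```

Keep in mind that prefix matching is broader than an explicit list: the NN prefix also picks up proper nouns (NNP, NNPS), which may or may not be what you want.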


In [14]:
##Ex: Print the most frequent nouns, adjectives, and verbs in the sentence
######What does this tell us?
######Compare this to what we did yesterday with removing stop words.

freq_nouns = nltk.FreqDist(nouns)
freq_nouns.most_common()

freq_verbs = nltk.FreqDist(verbs)
freq_verbs.most_common(2)

[("'s", 2), ('bear', 2)]

### 1. Named Entity Recognition

Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. NLTK has a trained classifier to identify named entities from tagged sentences.

In [15]:
#Sample sentence with a variety of named entities
ne_sentence = "Walter W. Powell is a sociologist at Stanford University, \
with a primary appointment in the School of Education and \
courtesy appointments in the Schools of Business and Engineering, \
and in Public Policy. \
He is co-director of Stanford's Center on Philanthropy and Civil Society. \
He has been an external faculty member at the Santa Fe Institute \
since 2001. He has co-authored with John F. Padgett, Valery Yakubovich, \
Xing Zhong, and Stansilav Shekshnia, among others."

#tokenize our sentence
ne_sentence_tokens = word_tokenize(ne_sentence)
#tag each word with its part of speech
ne_tagged_sentence_tokens = nltk.pos_tag(ne_sentence_tokens)

#use the named entity chunker to tag named entities
chunked = nltk.ne_chunk(ne_tagged_sentence_tokens)
print(chunked)

(S
  (PERSON Walter/NNP)
  W./NNP
  Powell/NNP
  is/VBZ
  a/DT
  sociologist/NN
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  ,/,
  with/IN
  a/DT
  primary/JJ
  appointment/NN
  in/IN
  the/DT
  (ORGANIZATION School/NNP)
  of/IN
  (ORGANIZATION Education/NNP)
  and/CC
  courtesy/NN
  appointments/NNS
  in/IN
  the/DT
  (ORGANIZATION Schools/NNP)
  of/IN
  (ORGANIZATION Business/NNP)
  and/CC
  (GPE Engineering/NNP)
  ,/,
  and/CC
  in/IN
  (GPE Public/JJ)
  Policy/NNP
  ./.
  He/PRP
  is/VBZ
  co-director/NN
  of/IN
  (PERSON Stanford/NNP)
  's/POS
  (ORGANIZATION Center/NNP)
  on/IN
  (GPE Philanthropy/NNP)
  and/CC
  (PERSON Civil/NNP Society/NNP)
  ./.
  He/PRP
  has/VBZ
  been/VBN
  an/DT
  external/JJ
  faculty/NN
  member/NN
  at/IN
  the/DT
  (ORGANIZATION Santa/NNP Fe/NNP Institute/NNP)
  since/IN
  2001/CD
  ./.
  He/PRP
  has/VBZ
  co-authored/VBN
  with/IN
  (PERSON John/NNP F./NNP Padgett/NNP)
  ,/,
  (PERSON Valery/NNP Yakubovich/NNP)
  ,/,
  Xing/VBG
  (GPE Zhon

NLTK uses a tree data structure to represent chunked sentences. To pull out only PERSONs, or only ORGANIZATIONs, we can loop through the *subtrees* in the chunked sentence and use the .label() method to identify the named entities of interest.

In [16]:
#chunked.subtrees
print(chunked.subtrees)

<bound method Tree.subtrees of Tree('S', [Tree('PERSON', [('Walter', 'NNP')]), ('W.', 'NNP'), ('Powell', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('sociologist', 'NN'), ('at', 'IN'), Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]), (',', ','), ('with', 'IN'), ('a', 'DT'), ('primary', 'JJ'), ('appointment', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('School', 'NNP')]), ('of', 'IN'), Tree('ORGANIZATION', [('Education', 'NNP')]), ('and', 'CC'), ('courtesy', 'NN'), ('appointments', 'NNS'), ('in', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('Schools', 'NNP')]), ('of', 'IN'), Tree('ORGANIZATION', [('Business', 'NNP')]), ('and', 'CC'), Tree('GPE', [('Engineering', 'NNP')]), (',', ','), ('and', 'CC'), ('in', 'IN'), Tree('GPE', [('Public', 'JJ')]), ('Policy', 'NNP'), ('.', '.'), ('He', 'PRP'), ('is', 'VBZ'), ('co-director', 'NN'), ('of', 'IN'), Tree('PERSON', [('Stanford', 'NNP')]), ("'s", 'POS'), Tree('ORGANIZATION', [('Center', 'NNP')]), ('on', 'IN'), Tree('GPE'

In [17]:
people = [n for n in chunked.subtrees() if n.label()=="PERSON"]
orgs = [n for n in chunked.subtrees() if n.label()=="ORGANIZATION"]

#print people
people

[Tree('PERSON', [('Walter', 'NNP')]),
 Tree('PERSON', [('Stanford', 'NNP')]),
 Tree('PERSON', [('Civil', 'NNP'), ('Society', 'NNP')]),
 Tree('PERSON', [('John', 'NNP'), ('F.', 'NNP'), ('Padgett', 'NNP')]),
 Tree('PERSON', [('Valery', 'NNP'), ('Yakubovich', 'NNP')]),
 Tree('PERSON', [('Stansilav', 'NNP'), ('Shekshnia', 'NNP')])]

Comments on this list?
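The `Tree` objects in that list are awkward to read and to count. A minimal sketch of turning each labelled subtree into a plain string follows. It builds a small tree by hand from the first part of the chunked output, so no tagger or chunker models are needed to run it; `sample_chunked` and `entity_strings` are names invented for this example.

```python
from nltk import Tree

# A hand-built fragment of the chunked output above. Leaves are
# (word, tag) tuples; named entities are nested Tree objects.
sample_chunked = Tree("S", [
    Tree("PERSON", [("Walter", "NNP")]),
    ("W.", "NNP"),
    ("Powell", "NNP"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("sociologist", "NN"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Stanford", "NNP"), ("University", "NNP")]),
])


def entity_strings(tree, label):
    """Join the words of every subtree carrying the given label."""
    return [
        " ".join(word for (word, tag) in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == label
    ]


print(entity_strings(sample_chunked, "PERSON"))
print(entity_strings(sample_chunked, "ORGANIZATION"))
```

Notice that the strings inherit the chunker's mistakes: only "Walter" is returned as a person here, because "W." and "Powell" were left outside the PERSON subtree.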

In [22]:
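#count the organizations, then look inside the first person subtree
#(the notebook only displays the value of the last expression)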
len(orgs)
set(people[0])

{('Walter', 'NNP')}

Perhaps it's not very accurate. Wilkens used the Stanford NER in his paper, which by many measures is much more accurate. There is a way to call Stanford's tools in Python, but it requires downloading a bunch of things and referring to them in your Python code. I will write a blog post about this for D-Lab, so stay tuned. If you want to use this in the final project you can explore this option with me.

In [None]:
##Ex: Compare the most frequent part-of-speech used in two of the texts in our data folder.
##Ex: Compare the *number* of organizations and people mentioned in two of the texts in our data folder.