# Ch5: Categorizing & Tagging Words

Goals: 
1. What are lexical categories and how are they used in NLP?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?


The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags and tagging text automatically.
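To make the terms concrete, here is a minimal sketch of how a tagged text and its tagset might be represented in plain Python. The variable names `tagged` and `tagset` are illustrative only; the pairs are the tags NLTK assigns to the sample sentence in the next section.

```python
# A tagged text as a list of (word, tag) pairs, kept in reading order.
tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

# The tags actually used by this tiny sample: the distinct tags, sorted.
tagset = sorted({tag for _, tag in tagged})
print(tagset)  # ['CC', 'IN', 'JJ', 'NN', 'RB']
```

A full tagset such as the one used by a corpus is much larger; this only shows the subset that occurs in one sentence.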

### Using a tagger

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word:

In [4]:
import nltk
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

* CC: coordinating conjunction
* RB: adverb
* IN: preposition
* NN: noun
* JJ: adjective

In [3]:
text = nltk.word_tokenize('They refuse to permit us to obtain the refuse permit.')
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN'),
 ('.', '.')]

* VBP: verb, non-3rd person singular present
* VB: verb, base form
* DT: determiner
* NN: noun, singular or mass
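The sentence above is interesting because "refuse" and "permit" each appear once as a verb and once as a noun. The following sketch, using only the standard library and the tagger output copied from above, collects the words that received more than one tag. The names `tags_by_word` and `ambiguous` are illustrative.

```python
from collections import defaultdict

# Tagger output for the example sentence, copied from the cell above.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"), (".", "."),
]

# Collect the set of tags seen for each word form.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the word forms that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```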

In [4]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


Observe that searching for "woman" finds nouns, searching for "bought" mostly finds verbs, searching for "over" generally finds prepositions, and searching for "the" finds several determiners.

In [5]:
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt


Searching for "bought" mostly finds verbs.

In [6]:
text.similar('the')

a his this their its her an that our any all one these my in your no
some other and


Searching for "the" mostly finds determiners.
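These results make sense if similar words are those that occur in the same contexts. The sketch below is a much-simplified illustration of that idea, not NLTK's actual scoring: it treats the (previous word, next word) pair as a context and ranks other words by how many contexts they share with the target. The function `similar_words` and the toy corpus are assumptions made for this example.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words by how many (previous, next) contexts they share with target."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    # Use .get so that looking up an unseen target does not mutate the dict.
    target_contexts = contexts.get(target, set())
    shared = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    }
    # Most shared contexts first; alphabetical order breaks ties.
    return sorted(shared, key=lambda w: (-shared[w], w))

tokens = "the woman saw the dog . the man saw the cat . a man fed the dog .".split()
print(similar_words(tokens, "woman"))  # ['man']
print(similar_words(tokens, "dog"))    # ['cat']
```

Even on this toy corpus, nouns are matched with nouns because they fill the same slots between determiners and verbs.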

### Tagged Corpora

In [7]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')
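As a rough mental model of what `str2tuple` does, the sketch below splits a string at its last slash, upper-cases the tag, and returns `None` as the tag when there is no slash. `split_tagged_token` is a hypothetical stand-in written for illustration; it is not NLTK's source code.

```python
def split_tagged_token(s, sep="/"):
    """Illustrative stand-in for nltk.tag.str2tuple (an assumption, not NLTK code)."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator at all: the whole string is the word, with no tag.
        return (s, None)
    return (word, tag.upper())

print(split_tagged_token("fly/NN"))        # ('fly', 'NN')
print(split_tagged_token("Fulton/np-tl"))  # ('Fulton', 'NP-TL')
print(split_tagged_token("which"))         # ('which', None)
```

Splitting at the last separator matters for tokens that contain a slash themselves, such as `1/2/CD`.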

In [8]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX government/NNS ''/'' ./.
'''

In [9]:
[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('government', 'NNS'),
 ("''", "''"),
 ('.', '.')]

In [10]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [13]:
nltk.corpus.conll2000.tagged_words()

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]

In [14]:
nltk.corpus.sinica_treebank.tagged_words()

[('一', 'Neu'), ('友情', 'Nad'), ('嘉珍', 'Nba'), ...]

In [15]:
nltk.corpus.mac_morpho.tagged_words()

[('Jersei', 'N'), ('atinge', 'V'), ('média', 'N'), ...]

In [16]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)
tag_fd

FreqDist({'NN': 13162, 'IN': 10616, 'AT': 8893, 'NP': 6866, ',': 5133, 'NNS': 5066, '.': 4452, 'JJ': 4392, 'CC': 2664, 'VBD': 2524, ...})
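The same kind of tag count can be sketched with the standard library's `collections.Counter`, which behaves much like a frequency distribution for this purpose. The example below uses the six tagged tokens from the first sentence in this chapter rather than the Brown corpus, so the numbers are only illustrative.

```python
from collections import Counter

# Tagger output for "And now for something completely different".
tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(1))  # [('RB', 2)]
```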

In [22]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
list(nltk.FreqDist((a,b) for (a,b) in word_tag_pairs if b[1]=='NN'))

[(('per', 'IN'), ('cent', 'NN')),
 (('the', 'AT'), ('state', 'NN')),
 (('last', 'AP'), ('year', 'NN')),
 (('last', 'AP'), ('week', 'NN')),
 (('a', 'AT'), ('year', 'NN')),
 (('the', 'AT'), ('city', 'NN')),
 (('this', 'DT'), ('year', 'NN')),
 (('the', 'AT'), ('jury', 'NN')),
 (('a', 'AT'), ('result', 'NN')),
 (('the', 'AT'), ('board', 'NN')),
 (('the', 'AT'), ('ball', 'NN')),
 (('a', 'AT'), ('number', 'NN')),
 (('the', 'AT'), ('world', 'NN')),
 (('the', 'AT'), ('year', 'NN')),
 (('last', 'AP'), ('night', 'NN')),
 ((',', ','), ('president', 'NN')),
 (('his', 'PP$'), ('wife', 'NN')),
 (('the', 'AT'), ('day', 'NN')),
 (('this', 'DT'), ('week', 'NN')),
 (('a', 'AT'), ('member', 'NN')),
 (('the', 'AT'), ('administration', 'NN')),
 (('the', 'AT'), ('country', 'NN')),
 (('the', 'AT'), ('way', 'NN')),
 (('the', 'AT'), ('man', 'NN')),
 (('the', 'AT'), ('university', 'NN')),
 (('a', 'AT'), ('meeting', 'NN')),
 (('the', 'AT'), ('past', 'NN')),
 (('the', 'AT'), ('problem', 'NN')),
 (('the', 'AT'), (
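If only the tag of the preceding word is of interest, rather than the full word and tag pairs, the bigram idea can be sketched without NLTK by zipping a list with itself shifted by one. This is a small illustration on the "refuse permit" sentence from earlier, not on the Brown news data; `preceding_tags` is an illustrative name.

```python
from collections import Counter

# Tagger output for "They refuse to permit us to obtain the refuse permit."
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"), (".", "."),
]

# Pair every token with its successor, then keep the first tag
# whenever the second token is tagged as a noun.
pairs = zip(tagged, tagged[1:])
preceding_tags = Counter(
    prev_tag for (_, prev_tag), (_, tag) in pairs if tag == "NN"
)
print(preceding_tags)  # Counter({'DT': 1, 'NN': 1})
```

On a large corpus, the most common entries of such a counter show which categories typically come before nouns.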

In [26]:
wsj = nltk.corpus.treebank.tagged_words()
word_tags_fd = nltk.FreqDist(wsj)
set([word+"/"+tag for (word,tag) in word_tags_fd if tag.startswith('V')])

{'signed/VBN',
 'Declining/VBG',
 'learned/VBD',
 'builds/VBZ',
 'depending/VBG',
 'withhold/VB',
 'says/VBZ',
 'reallocate/VB',
 'borrowing/VBG',
 'eased/VBD',
 'defeats/VBZ',
 'require/VB',
 'reduce/VB',
 'clipped/VBD',
 'stare/VBP',
 'subsidize/VB',
 'fattened/VBN',
 'jeopardizing/VBG',
 'total/VB',
 'attributed/VBN',
 'contract/VB',
 'introducing/VBG',
 'adopted/VBN',
 'dropped/VBN',
 'sketching/VBG',
 'contained/VBD',
 'propagandizes/VBZ',
 'grows/VBZ',
 'admits/VBZ',
 'chose/VBD',
 'maintaining/VBG',
 'discouraging/VBG',
 'cry/VBP',
 'bar/VB',
 'painted/VBN',
 'proved/VBD',
 'executed/VBN',
 'fighting/VBG',
 'conducted/VBN',
 'parallel/VB',
 'assure/VB',
 'reported/VBD',
 'denounce/VB',
 'go/VB',
 'feel/VBP',
 'shipped/VBN',
 'attract/VB',
 'think/VBP',
 'avoid/VBP',
 'Used/VBN',
 'worry/VB',
 'fetching/VBG',
 'scrounge/VBP',
 'photocopying/VBG',
 'competed/VBN',
 'admitting/VBG',
 'own/VBP',
 'narrowed/VBD',
 'print/VBP',
 'obligated/VBD',
 'win/VB',
 'trained/VBN',
 'cushioned/

In [33]:
set([word+"/"+tag for (word,tag) in word_tags_fd])

{'Average/NNP',
 '1976/CD',
 'Midvale/NNP',
 '6.25/CD',
 'carillons/NNS',
 '2163.2/CD',
 '*T*-147/-NONE-',
 'COPPER/NNP',
 'starters/NNS',
 'after/RB',
 'stock-market/NN',
 'subscription/NN',
 'withhold/VB',
 'builds/VBZ',
 'counterattack/NN',
 'short/JJ',
 'eased/VBD',
 'defeats/VBZ',
 'Messrs./NNPS',
 'midtown/NN',
 'unlike/IN',
 '*-165/-NONE-',
 'banking/NN',
 'settlement/NN',
 'change-ringing/NN',
 '38.875/CD',
 '*T*-56/-NONE-',
 'colleges/NNS',
 'forum/NN',
 'beaten/JJ',
 'introducing/VBG',
 'Insurance/NNP',
 'Ms./NNP',
 'Labor/NNP',
 'sketching/VBG',
 'tissue/NN',
 'grows/VBZ',
 'goblins/NNS',
 'admits/VBZ',
 'maintaining/VBG',
 'Westborough/NNP',
 'lifes/NNS',
 'Teacher/NN',
 'fighting/VBG',
 '&/CC',
 'conducted/VBN',
 'triple-C/NN',
 'Lyn/NNP',
 'reported/VBD',
 'protocols/NNS',
 'Controls/NNP',
 'attract/VB',
 'judge/NN',
 'plaintiffs/NNS',
 'mailing/NN',
 'think/VBP',
 'well/RB',
 'illegal/JJ',
 'Mutual/NNP',
 'worry/VB',
 'photocopying/VBG',
 'admitting/VBG',
 'underwriters/

In [27]:
cfd1 = nltk.ConditionalFreqDist(wsj)
cfd1['yield'].keys()

dict_keys(['NN', 'VB'])

In [28]:
cfd1['cut'].keys()

dict_keys(['VBD', 'VB', 'VBN', 'NN'])

In [34]:
dict(cfd1)

{'Pierre': FreqDist({'NNP': 1}),
 'Vinken': FreqDist({'NNP': 2}),
 ',': FreqDist({',': 4885}),
 '61': FreqDist({'CD': 5}),
 'years': FreqDist({'NNS': 115}),
 'old': FreqDist({'JJ': 24}),
 'will': FreqDist({'MD': 280, 'NN': 1}),
 'join': FreqDist({'VB': 4}),
 'the': FreqDist({'DT': 4038, 'JJ': 5, 'NNP': 1, 'CD': 1}),
 'board': FreqDist({'NN': 30}),
 'as': FreqDist({'IN': 333, 'RB': 52}),
 'a': FreqDist({'DT': 1874, 'JJ': 2, 'IN': 1, 'LS': 1}),
 'nonexecutive': FreqDist({'JJ': 5}),
 'director': FreqDist({'NN': 32}),
 'Nov.': FreqDist({'NNP': 23, 'NN': 1}),
 '29': FreqDist({'CD': 5}),
 '.': FreqDist({'.': 3828}),
 'Mr.': FreqDist({'NNP': 375}),
 'is': FreqDist({'VBZ': 671}),
 'chairman': FreqDist({'NN': 45}),
 'of': FreqDist({'IN': 2319}),
 'Elsevier': FreqDist({'NNP': 1}),
 'N.V.': FreqDist({'NNP': 3}),
 'Dutch': FreqDist({'JJ': 2, 'NNP': 1}),
 'publishing': FreqDist({'NN': 9, 'VBG': 4}),
 'group': FreqDist({'NN': 43}),
 'Rudolph': FreqDist({'NNP': 3}),
 'Agnew': FreqDist({'NNP': 1}),
 '

In [31]:
cfd2 = nltk.ConditionalFreqDist((tag,word) for (word,tag) in wsj)
cfd2['VB'].keys()

dict_keys(['join', 'make', 'bring', 'be', 'support', 'regulate', 'have', 'recognize', 'slide', 'indicate', 'retain', 'capture', 'blip', 'pour', 'vary', 'go', 'obtain', 'complete', 'issue', 'lift', 'raise', 'act', 'default', 'oversee', 'treat', 'prove', 'expand', 'return', 'remain', 'keep', 'introduce', 'increase', 'cost', 'announce', 'reward', 'shore', 'justify', 'acquire', 'suffer', 'come', 'take', 'withdraw', 'speed', 'refile', 'refund', 'begin', 'entertain', 'block', 'force', 'slash', 'pay', 'audit', 'set', 'rule', 'manufacture', 'help', 'meet', 'produce', 'depend', 'want', 'fund', 'include', 'jump', 'close', 'link', 'complicate', 'do', 'require', 'contain', 'face', 'roll', 'compete', 'receive', 'occur', 'leave', 'decide', 'trade', 'report', 'work', 'continue', 'succeed', 'honor', 'improve', 'pursue', 'aid', 'protect', 'compel', 'enact', 'apply', 'pose', 'craft', 'merit', 'ask', 'halve', 'reach', 'use', 'store', 'share', 'offer', 'start', 'divest', 'grant', 'watch', 'turn', 'direct'
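The two conditional frequency distributions differ only in the order of each pair: the first element is the condition and the second is the event being counted. A minimal standard-library sketch of the same structure is shown below; `conditional_counts` is a hypothetical helper, and the data is the small "refuse permit" sentence rather than the treebank.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Map each condition to a Counter of the events seen with it."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"), (".", "."),
]

# Word as condition: which tags does each word take?
tags_of = conditional_counts(tagged)
# Tag as condition: swap each pair to ask which words carry each tag.
words_of = conditional_counts((tag, word) for word, tag in tagged)

print(sorted(tags_of["refuse"]))  # ['NN', 'VBP']
print(sorted(words_of["VB"]))     # ['obtain', 'permit']
```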

In [38]:
[w for w in cfd1.conditions() if 'VB' in cfd1[w] and 'VBN' in cfd1[w]]

['become',
 'come',
 'set',
 'own',
 'cut',
 'put',
 'read',
 'run',
 'hit',
 'split',
 'hurt',
 'offset',
 'spread',
 'shut',
 'Put',
 'overcome']

In [41]:
idx1 = wsj.index(('become','VB'))
idx1

18689

In [42]:
wsj[idx1-4:idx1+1]

[('is', 'VBZ'),
 ('studying', 'VBG'),
 ('*-1', '-NONE-'),
 ('to', 'TO'),
 ('become', 'VB')]
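The index-and-slice pattern above can be wrapped in a small helper. One detail worth guarding against is a match near the start of the list: `idx - 4` would then be negative, and Python would count from the end of the list instead. The sketch below clamps the start at zero; `window_before` is a hypothetical helper and the data is the five tokens shown above.

```python
def window_before(items, target, width=4):
    """Return target plus up to `width` items before its first occurrence."""
    idx = items.index(target)    # raises ValueError if target is absent
    start = max(0, idx - width)  # clamp so a negative start cannot wrap around
    return items[start:idx + 1]

tagged = [
    ("is", "VBZ"), ("studying", "VBG"), ("*-1", "-NONE-"),
    ("to", "TO"), ("become", "VB"),
]

print(window_before(tagged, ("become", "VB")))
# A match near the start still returns a sensible, shorter window.
print(window_before(tagged, ("studying", "VBG")))  # [('is', 'VBZ'), ('studying', 'VBG')]
```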