# Categorizing and Tagging Words

Part-of-speech tagging (POS tagging) is the process of classifying words into their **parts-of-speech** and labeling them accordingly. Parts-of-speech are also known as **word classes** or **lexical categories**. The collection of tags used for a particular task is known as a **tagset**.

## Using a Tagger

In [2]:
import nltk, re
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [3]:
# There is documentation for the tags
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


In [4]:
# Another example but containing homonyms
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

In [5]:
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
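
The point of this example is that a single word form can receive different tags depending on its context. As a minimal sketch, the snippet below collects the tags seen for each word. It assumes a hand-written list of `(word, tag)` pairs in the same shape as `nltk.pos_tag` output; the sample list and the names `tags_by_word` and `ambiguous` are illustrative and not produced by the tagger.

```python
from collections import defaultdict

# Hand-written sample in the same (word, tag) shape that nltk.pos_tag returns.
tagged = [
    ("They", "PRP"),
    ("refuse", "VBP"),
    ("the", "DT"),
    ("refuse", "NN"),
    ("permit", "NN"),
]

# Map each lowercased word to the set of tags it was given.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Words that received more than one tag are ambiguous in this sample.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

Here only `refuse` shows up with two tags (verb and noun), which is the kind of ambiguity a tagger has to resolve from context.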

In [6]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

In [7]:
# Finds mostly nouns
text.similar("woman")

man time day year car moment world house family child country boy
state job place way war girl work word


In [8]:
# Finds mostly verbs
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt


In [9]:
# Finds mostly prepositions
text.similar('over')

in on to of and for with from at by that into as up out down through
is all about


In [10]:
# Finds several determiners
text.similar('the')

a his this their its her an that our any all one these my in your no
some other and
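
`Text.similar()` is documented as finding words that appear in the same contexts as the given word, which is why its results tend to share a part of speech. The sketch below illustrates that idea on a tiny hand-written corpus. It is not NLTK's implementation: the `(previous word, next word)` context definition, the scoring by number of shared contexts, and the `similar` helper are all simplifying assumptions.

```python
from collections import Counter, defaultdict

# Tiny hand-written corpus; the cells above use the Brown corpus instead.
tokens = ("the man saw the dog . the woman saw the cat . "
          "a man fed a dog . a woman fed a cat .").split()

# Index every word by the (previous word, next word) contexts it occurs in.
contexts = defaultdict(set)
for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
    contexts[word].add((prev, nxt))

def similar(target):
    """Rank other words by how many contexts they share with target."""
    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, ctxs in contexts.items():
        if word != target:
            overlap = len(ctxs & target_contexts)
            if overlap:
                shared[word] = overlap
    return [word for word, _ in shared.most_common()]

print(similar("woman"))
print(similar("dog"))
```

In this toy corpus `woman` and `man` occur in identical contexts, as do `dog` and `cat`, so each pair comes out as mutually similar.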


## Tagged Corpora
By convention, a tagged token is represented as a tuple consisting of the token and its tag. Such a tuple can be created from the string form `word/TAG` with `nltk.tag.str2tuple()`.

In [11]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')

In [12]:
tagged_token[0]

'fly'

In [13]:
tagged_token[1]

'NN'
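
To make the conversion concrete, here is a rough pure-Python sketch of the behavior: split the string at the last separator and uppercase the tag. The helper name `split_tagged_token` is hypothetical, and the code approximates what `str2tuple` does rather than reproducing NLTK's source.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/tag' at the last separator and uppercase the tag."""
    loc = s.rfind(sep)
    if loc >= 0:
        return (s[:loc], s[loc + len(sep):].upper())
    # No separator found: treat the whole string as an untagged word.
    return (s, None)

print(split_tagged_token("fly/NN"))
print(split_tagged_token("Fulton/NP-tl"))
print(split_tagged_token("1/2/CD"))
```

Splitting at the last separator keeps tokens that themselves contain a slash intact, and uppercasing means a tag written as `NP-tl` comes back as `NP-TL`.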

A list of tagged tokens can be created directly from a string.

In [14]:
sent = '''
        The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
        other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
        Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
        said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
        accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
        interest/NN of/IN both/ABX governments/NNS ''/'' ./.
        '''

In [15]:
[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]

In [16]:
# Several of the corpora have been tagged
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [17]:
# ...and have simplified versions.
nltk.corpus.brown.tagged_words(tagset = "universal")

[('The', 'DET'), ('Fulton', 'NOUN'), ...]
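
Conceptually, the simplified version is obtained by mapping each fine-grained tag to a coarse one. The sketch below shows the idea with a small hand-picked mapping fragment. The dictionary `BROWN_TO_UNIVERSAL`, the `simplify` helper, and the fallback to `X` for unlisted tags are assumptions made for illustration, not NLTK's mapping table.

```python
# Hand-picked fragment of a Brown-to-universal tag mapping (illustrative only).
BROWN_TO_UNIVERSAL = {
    "AT": "DET",
    "NP-TL": "NOUN",
    "NN": "NOUN",
    "JJ": "ADJ",
    "VBD": "VERB",
    "IN": "ADP",
}

def simplify(tagged_words, mapping, unknown="X"):
    """Replace each fine-grained tag by its coarse tag, or `unknown`."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]

# Hand-written sample using Brown-style tags.
brown_style = [("The", "AT"), ("Fulton", "NP-TL"), ("grand", "JJ"), ("jury", "NN")]
print(simplify(brown_style, BROWN_TO_UNIVERSAL))
```

Because several fine-grained tags collapse into one coarse tag, the mapping loses information but makes tags comparable across corpora that use different tagsets.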

![](simp_pos.png)

In [18]:
nltk.corpus.nps_chat.tagged_words()

[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

In [19]:
nltk.corpus.conll2000.tagged_words()

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]

In [20]:
nltk.corpus.treebank.tagged_words()

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Finding the most common tags in the news category of the Brown corpus:

In [21]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

In [22]:
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

In [23]:
tag_fd.keys()

dict_keys(['DET', 'NOUN', 'ADJ', 'VERB', 'ADP', '.', 'ADV', 'CONJ', 'PRT', 'PRON', 'NUM', 'X'])

In [36]:
[x for x in tag_fd.items()]

[('DET', 11389),
 ('NOUN', 30654),
 ('ADJ', 6706),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRT', 2264),
 ('PRON', 2535),
 ('NUM', 2166),
 ('X', 92)]
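
The items above appear in the order the tags were first seen, not by frequency. As a small sketch, the counts can be ranked and turned into proportions with `collections.Counter`, using the numbers copied from the output above. The variable names are illustrative; `nltk.FreqDist` offers `most_common()` as well.

```python
from collections import Counter

# Tag counts copied from the tag_fd output above.
tag_counts = Counter({
    "DET": 11389, "NOUN": 30654, "ADJ": 6706, "VERB": 14399,
    "ADP": 12355, ".": 11928, "ADV": 3349, "CONJ": 2717,
    "PRT": 2264, "PRON": 2535, "NUM": 2166, "X": 92,
})

total = sum(tag_counts.values())

# Print the three most frequent tags with their share of all tokens.
for tag, count in tag_counts.most_common(3):
    print(f"{tag:5} {count:6} {count / total:.1%}")
```

Nouns, verbs, and adpositions come out on top, with nouns alone covering roughly three in ten tokens of the news category.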

In [None]:
tag_fd.plot(cumulative=True)
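
The cumulative plot needs matplotlib. The same information can be sketched as plain numbers: order the tags by frequency and keep a running total. The counts below are taken from the `tag_fd` output, while the `ranked` and `running` names are illustrative.

```python
from itertools import accumulate

# (tag, count) pairs in descending order of count, taken from the tag_fd output.
ranked = [
    ("NOUN", 30654), ("VERB", 14399), ("ADP", 12355), (".", 11928),
    ("DET", 11389), ("ADJ", 6706), ("ADV", 3349), ("CONJ", 2717),
    ("PRON", 2535), ("PRT", 2264), ("NUM", 2166), ("X", 92),
]

total = sum(count for _, count in ranked)
running = list(accumulate(count for _, count in ranked))

# Print the running total and its share of all tokens after each tag.
for (tag, _), cum in zip(ranked, running):
    print(f"{tag:5} {cum:7} {cum / total:.1%}")
```

Reading down the last column shows how quickly a handful of tags accounts for most of the tokens.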

In [None]:
# Most common verb in news text
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')

In [None]:
word_tag_fd = nltk.FreqDist(wsj)

In [None]:
# Iterate in descending order of frequency so the most common verbs come first
[word + '/' + tag for ((word, tag), _) in word_tag_fd.most_common() if tag == 'VERB']
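
A frequency ranking of verbs depends on visiting the `(word, tag)` pairs from most to least frequent. Iterating over a `Counter` directly yields keys in insertion order, whereas `most_common()` yields them by count. The sketch below shows this on hand-written pairs that stand in for the tagged corpus; the data and variable names are made up for illustration.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the tagged corpus.
tagged = [
    ("said", "VERB"), ("is", "VERB"), ("market", "NOUN"), ("is", "VERB"),
    ("said", "VERB"), ("is", "VERB"), ("rose", "VERB"), ("the", "DET"),
]

pair_counts = Counter(tagged)

# most_common() yields ((word, tag), count) pairs, most frequent first.
verbs = [word for (word, tag), _ in pair_counts.most_common() if tag == "VERB"]
print(verbs)
```

In this sample `is` occurs three times and `said` twice, so `is` is listed first even though `said` appears earlier in the data.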

## Automatic Tagging
