# Categorizing and Tagging Words

## Part-of-speech (POS) tagger

The idea of POS tagging is to assign each word to a class, i.e., a lexical category such as verb, noun, adverb, or preposition.


In [0]:
import nltk
nltk.download()  # interactive downloader: type d, then book, then q

In [0]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [0]:
# Documentation for each tag
nltk.help.upenn_tagset('CC')

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet


In [0]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

Although refuse and permit each appear twice, each occurrence belongs to a different category: the first is tagged as a verb and the second as a noun.
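
The ambiguity can be made explicit by grouping the tags observed for each word. The sketch below is plain Python over the tagger output shown above (copied by hand so that it runs without NLTK); `tags_by_word` and `ambiguous` are illustrative names, not part of the NLTK API.

```python
from collections import defaultdict

# Output of nltk.pos_tag for the sentence above, copied by hand.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Collect the set of distinct tags seen for each word.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that received more than one tag.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

Only `refuse` and `permit` end up in `ambiguous`; `to` also appears twice but receives the same tag both times.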

In [0]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


In [0]:
text.concordance('woman')

Displaying 25 of 224 matches:
arlier when a slender , bespectacled woman broke the one-week-old world record 
larimer st. . the pert , gray-haired woman who came to denver three years ago f
the sheraton-dallas hotel . the only woman recipient , miss garson will receive
ar driven by my wife , who is only a woman . even that isn't satisfactory . if 
ted to be the only one extended to a woman . fort lauderdale -- the first in a 
 judge listened quietly as the young woman poured out her frustrations -- then 
letter . old , tired , trembling the woman came to the cannery . she had , she 
etropolitian atlanta . and now , the woman , tired and trembling , came here to
. the dairy truck driver ; ; the old woman with the stew . `` don't ask me if i
 miles east of here . eleven men , a woman and a teen-age boy tramped over cold
erday afternoon . sat there and as a woman sang , she kept getting thinner and 
is a problem piece about a man and a woman and the three `` figures '' that bot
omehow . u

In [0]:
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news')
brown_news_tagged

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [0]:
tags_freq = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tags_freq.most_common(n=5)

[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133)]

To find out which category most often precedes a noun, we build bigrams over the tagged words and compute the frequency distribution of the tag that comes immediately before each `NN`. The most common one is **AT (article)**.

In [0]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NN').most_common(n=5))

[('AT', 3982), ('JJ', 2105), ('NN', 1461), ('IN', 1202), ('PP$', 530)]
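
For readers who want to see what the bigram step does without loading the corpus, here is a minimal sketch on a hand-made tagged sequence (toy data, not taken from Brown). It assumes that pairing a sequence with itself shifted by one, via `zip`, is an adequate stand-in for `nltk.bigrams`.

```python
from collections import Counter

# Toy tagged sequence in Brown style (hand-made for illustration, not corpus data).
tagged = [('The', 'AT'), ('jury', 'NN'), ('praised', 'VBD'), ('the', 'AT'),
          ('recent', 'JJ'), ('election', 'NN'), ('in', 'IN'), ('Atlanta', 'NP')]

# zip(seq, seq[1:]) yields consecutive pairs, much like nltk.bigrams.
pairs = list(zip(tagged, tagged[1:]))

# Count the tag of the first element whenever the second element is a noun.
before_noun = Counter(a[1] for a, b in pairs if b[1] == 'NN')
print(before_noun.most_common())
```

One practical difference: `nltk.bigrams` returns a generator, so `word_tag_pairs` in the cell above can be iterated only once, whereas the list built here can be reused.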

In [0]:
nltk.help.brown_tagset('AT')

AT: article
    the an no a every th' ever' ye


The word 'often' is most frequently followed by a verb.

In [0]:
brown_lrnd_tagged = nltk.corpus.brown.tagged_words(categories='learned')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

VBN  VB VBD  JJ  IN  QL   ,  CS  RB  AP VBG  RP VBZ QLP BEN WRB   .  TO  HV 
 15  10   8   5   4   3   3   3   3   1   1   1   1   1   1   1   1   1   1 
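
As a rough check on the table, the share of verb tags can be computed directly from the printed counts. This sketch copies the counts by hand and assumes that every tag starting with `VB` counts as a verb form; other verbal tags in the table (such as `BEN` and `HV`) are left out, so the share is a lower bound.

```python
# Counts copied from the table above (tags following 'often').
counts = {'VBN': 15, 'VB': 10, 'VBD': 8, 'JJ': 5, 'IN': 4, 'QL': 3, ',': 3,
          'CS': 3, 'RB': 3, 'AP': 1, 'VBG': 1, 'RP': 1, 'VBZ': 1, 'QLP': 1,
          'BEN': 1, 'WRB': 1, '.': 1, 'TO': 1, 'HV': 1}

total = sum(counts.values())
# Assumption: treat every tag starting with 'VB' as a verb form.
verb_total = sum(n for tag, n in counts.items() if tag.startswith('VB'))
verb_share = verb_total / total
print(total, verb_total, round(verb_share, 2))
```

Under that assumption, 35 of the 64 occurrences (about 55%) are followed by a verb.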


### Automatic tagging

In [0]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

In [0]:
brown_sents

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Suppose we tag every word with a single category, say noun. This makes sense as a baseline if the noun is the most frequent tag.



In [0]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
tags[:10]

['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN']

In [0]:
nltk.FreqDist(tags).most_common(n=3)

[('NN', 13162), ('IN', 10616), ('AT', 8893)]

In [0]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
tokens[:5]

['I', 'do', 'not', 'like', 'green']

In [0]:
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)[:5]

[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN')]

Let's evaluate against the gold standard (the correctly tagged set).

In [0]:
brown_tagged_sents

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

In [0]:
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028
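
The low score is what one would expect: a tagger that always answers `NN` is right exactly as often as `NN` occurs in the gold data. The sketch below illustrates this on a hand-made gold standard (toy data, not Brown); the `accuracy` helper is a hypothetical stand-in written for this example, not NLTK's evaluation code.

```python
from collections import Counter

# Toy gold standard (hand-made for illustration, not corpus data).
gold_sents = [[('The', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
              [('the', 'AT'), ('election', 'NN'), ('took', 'VBD'), ('place', 'NN')]]

def accuracy(tag_word, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(tag_word(word), tag) for sent in gold_sents for word, tag in sent]
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

# A default tagger ignores the word and always answers 'NN'.
default_accuracy = accuracy(lambda word: 'NN', gold_sents)

# Its accuracy equals the relative frequency of 'NN' in the gold data.
tag_counts = Counter(tag for sent in gold_sents for _, tag in sent)
nn_share = tag_counts['NN'] / sum(tag_counts.values())
print(default_accuracy, nn_share)
```

Both numbers are 3/7 here. On the news category the same reasoning applies to the roughly 13% of tokens tagged `NN`.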

### Training a tagger

**Unigrams**

As n grows in an n-gram tagger, we usually run into the sparse data problem: a context seen at tagging time (e.g., in a 4-gram model) may never have appeared in the training set.

In [8]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
print(size)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

4160


0.8121200039868434
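
Conceptually, a unigram tagger remembers the most frequent tag for each word seen in training. The following is a minimal pure-Python sketch of that idea on hand-made data; it is not NLTK's implementation, and `model` and `unigram_tag` are names invented for this example.

```python
from collections import Counter, defaultdict

# Toy training data (hand-made for illustration, not corpus data).
train_sents = [[('the', 'AT'), ('permit', 'NN'), ('expired', 'VBD')],
               [('they', 'PPSS'), ('permit', 'VB'), ('the', 'AT'), ('refuse', 'NN')],
               [('a', 'AT'), ('permit', 'NN')]]

# Count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sent in train_sents:
    for word, tag in sent:
        tag_counts[word][tag] += 1

# The model maps each known word to its most frequent tag.
model = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def unigram_tag(tokens):
    """Tag each token with its most frequent training tag, or None if unseen."""
    return [(tok, model.get(tok)) for tok in tokens]

print(unigram_tag(['the', 'permit', 'vanished']))
```

The unseen word `vanished` gets `None`, which is the gap that a backoff tagger is meant to fill.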

**Combining taggers**

Using the backoff technique, if a model cannot assign a tag to a word, it backs off to a simpler model: the bigram tagger falls back on the unigram tagger, which in turn falls back on the default tagger.

In [9]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

0.8452108043456593
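
The backoff chain can be sketched in plain Python as a sequence of lookups. This is an illustration under stated assumptions, not NLTK's code: the bigram context is taken to be the pair (previous tag, current word), the training data is hand-made, and `best` and `backoff_tag` are hypothetical helpers.

```python
from collections import Counter, defaultdict

# Toy training data (hand-made for illustration, not corpus data).
train_sents = [[('they', 'PPSS'), ('permit', 'VB'), ('it', 'PPO')],
               [('the', 'AT'), ('permit', 'NN'), ('expired', 'VBD')],
               [('a', 'AT'), ('permit', 'NN')]]

unigram_counts = defaultdict(Counter)   # word -> tag counts
bigram_counts = defaultdict(Counter)    # (previous tag, word) -> tag counts
for sent in train_sents:
    prev_tag = None
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def best(counts):
    """Most frequent tag in a Counter, or None if the Counter is empty."""
    return counts.most_common(1)[0][0] if counts else None

def backoff_tag(tokens, default='NN'):
    """Try the bigram context first, then the unigram model, then the default."""
    tagged, prev_tag = [], None
    for tok in tokens:
        tag = (best(bigram_counts.get((prev_tag, tok), Counter()))
               or best(unigram_counts.get(tok, Counter()))
               or default)
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged

print(backoff_tag(['they', 'permit', 'refuse']))
```

Here `permit` after `they` is resolved by the bigram context, while the unseen word `refuse` falls through both models and receives the default tag.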

**Storing Taggers**



In [13]:
# Saving and loading

from pickle import dump, load

with open('t2.pkl', 'wb') as model:
  dump(t2, model, -1)
!ls
with open('t2.pkl', 'rb') as model:
  tagger = load(model)

tagger

sample_data  t2.pkl


<BigramTagger: size=3149>
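
The same round trip can be tried in memory, without touching the file system. In this sketch a small dictionary stands in for the trained tagger (an assumption made so the example runs without NLTK); the protocol argument `-1` is the one used in the cell above.

```python
import pickle

# Stand-in for a trained tagger: any picklable object works the same way.
model = {'the': 'AT', 'permit': 'NN'}

# Protocol -1 selects pickle.HIGHEST_PROTOCOL, as in dump(t2, model, -1) above.
blob = pickle.dumps(model, -1)
restored = pickle.loads(blob)

same_object = restored == model
same_protocol = blob == pickle.dumps(model, pickle.HIGHEST_PROTOCOL)
print(same_object, same_protocol)
```

As with any pickle file, only load tagger files that come from a source you trust.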

In [14]:
# Predicting

text = """The board's action shows what free enterprise
is up against in our complex maze of regulatory laws ."""

tokens = text.split()

tagger.tag(tokens)

[('The', 'AT'),
 ("board's", 'NN$'),
 ('action', 'NN'),
 ('shows', 'NNS'),
 ('what', 'WDT'),
 ('free', 'JJ'),
 ('enterprise', 'NN'),
 ('is', 'BEZ'),
 ('up', 'RP'),
 ('against', 'IN'),
 ('in', 'IN'),
 ('our', 'PP$'),
 ('complex', 'JJ'),
 ('maze', 'NN'),
 ('of', 'IN'),
 ('regulatory', 'NN'),
 ('laws', 'NNS'),
 ('.', '.')]

In [25]:
# Confusion matrix

test_tags = [tag for sent in brown.sents(categories='editorial')
            for (word, tag) in t2.tag(sent)]
            
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print(test_tags[:10])
print(gold_tags[:10])
print(nltk.ConfusionMatrix(gold_tags[:10], test_tags[:10]))

['NN-TL', 'NN', 'VBD', 'AP', 'JJ', 'AT', 'JJ-TL', 'NN-TL', ',', 'WDT']
['NN-HL', 'NN-HL', 'VBD-HL', 'AP-HL', 'NN-HL', 'AT', 'JJ-TL', 'NN-TL', ',', 'WDT']
       |                     V   |
       |     A     J   N N   B   |
       |     P     J   N N   D   |
       |     -     -   - - V - W |
       |   A H A J T N H T B H D |
       | , P L T J L N L L D L T |
-------+-------------------------+
     , |<1>. . . . . . . . . . . |
    AP | .<.>. . . . . . . . . . |
 AP-HL | . 1<.>. . . . . . . . . |
    AT | . . .<1>. . . . . . . . |
    JJ | . . . .<.>. . . . . . . |
 JJ-TL | . . . . .<1>. . . . . . |
    NN | . . . . . .<.>. . . . . |
 NN-HL | . . . . 1 . 1<.>1 . . . |
 NN-TL | . . . . . . . .<1>. . . |
   VBD | . . . . . . . . .<.>. . |
VBD-HL | . . . . . . . . . 1<.>. |
   WDT | . . . . . . . . . . .<1>|
-------+-------------------------+
(row = reference; col = test)
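
Each cell of the matrix is a count of (gold, predicted) pairs, so the same information can be recovered with a `Counter`. The sketch below uses the ten tags printed above, copied by hand; `cells` and `errors` are illustrative names.

```python
from collections import Counter

# First ten predicted and gold tags, copied from the output above.
test_tags = ['NN-TL', 'NN', 'VBD', 'AP', 'JJ', 'AT', 'JJ-TL', 'NN-TL', ',', 'WDT']
gold_tags = ['NN-HL', 'NN-HL', 'VBD-HL', 'AP-HL', 'NN-HL', 'AT', 'JJ-TL', 'NN-TL', ',', 'WDT']

# Each (gold, predicted) pair is one cell of the confusion matrix.
cells = Counter(zip(gold_tags, test_tags))

# Off-diagonal cells are the tagging errors.
errors = {pair: n for pair, n in cells.items() if pair[0] != pair[1]}
accuracy = sum(g == t for g, t in zip(gold_tags, test_tags)) / len(gold_tags)
print(errors)
print(accuracy)
```

In this small sample half of the tags are wrong, and every error has a gold tag ending in `-HL`, which the tagger never predicted here.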



# Chapter 6. Learning to classify text