<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 17. Word tagging and categorization</h2>
    <img src="img/tag.svg" width="300">
</div>

###Table of contents

- [Lexical categories](#Lexical-categories)
- [POS tagger](#POS-tagger)
- [Automatic tagging](#Automatic-tagging)

###Lexical categories

- **nouns**: people, places, things, concepts
- **verbs**: actions
- **adjectives**: describe nouns
- **adverbs**: modify adjectives and verbs
- ...

<br/>
<div align="center">
    <figure>
        <img src="img/pos.png" width="800">
        <figcaption>Lexical categories</figcaption>
    </figure>
</div>

- These word classes are also known as parts-of-speech (POS)
- They arise from simple analysis of the distribution of words in text

###POS tagger

The process of classifying words into their parts-of-speech and labeling them accordingly is known as *part-of-speech tagging*, *POS tagging*, or simply *tagging*.

POS tagging is the third step in the typical natural language processing (NLP) pipeline, following tokenization.

<br/>
<div align="center">
    <figure>
        <img src="img/nlp-pipeline.png" width="600">
        <figcaption>NLP pipeline</figcaption>
    </figure>
</div>

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word. Steps:
1. Tokenization
2. Tagging

Note: import the `nltk` package

In [1]:
import nltk

Then run the following code only once (use an IPython shell rather than an IPython notebook)...
```
nltk.download()
```
... choose `d) Download` in the downloader and then enter `all` as the identifier. This downloads all the corpora and data needed to work with `nltk`.

Example 1:

In [2]:
text = 'And now for something completely different'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

<table align="left">
    <caption>Meaning of abbreviations (I)</caption>
    <thead>
        <th>Abbreviation</th><th>Lexical category</th>
    </thead>
    <tbody>
        <tr><td>CC</td><td>coordinating conjunction</td></tr>
        <tr><td>RB</td><td>adverb</td></tr>
        <tr><td>IN</td><td>preposition</td></tr>
        <tr><td>NN</td><td>noun</td></tr>
        <tr><td>JJ</td><td>adjective</td></tr>
    </tbody>
</table>

Example 2:

In [3]:
text = 'They refuse to permit us to obtain the refuse permit'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

<table align="left">
    <caption>Meaning of abbreviations (II)</caption>
    <thead>
        <th>Abbreviation</th><th>Lexical category</th>
    </thead>
    <tbody>
        <tr><td>PRP</td><td>personal pronoun</td></tr>
        <tr><td>VBP</td><td>verb, present tense, not 3rd person singular</td></tr>
        <tr><td>TO</td><td><em>to</em> (preposition or infinitive marker)</td></tr>
        <tr><td>VB</td><td>verb, base form</td></tr>
        <tr><td>DT</td><td>determiner</td></tr>
    </tbody>
</table>

Notice that *refuse* and *permit* each appear both as a verb (VBP and VB, respectively) and as a noun (NN). NLTK provides documentation for each tag, which can be queried using the function `nltk.help.upenn_tagset(tag)`. For example:

In [4]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
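
The ambiguity in Example 2 can be made explicit by grouping the observed tags by word. The following is a minimal sketch in plain Python: the tagged list is copied from the output of Example 2, and the names `tags_by_word` and `ambiguous` are only illustrative.

```python
from collections import defaultdict

# Tagged output of Example 2, copied from above
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Collect the set of tags observed for each word
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that received more than one tag
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)
```

Only *refuse* and *permit* remain, which is why a tagger cannot rely on the word alone and has to look at the surrounding context.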


###Automatic tagging

- The tag of a word depends on the word itself and its context within a sentence
- Therefore, we work with data at the level of complete (tagged) sentences rather than independent (tagged) words

####Dataset
[Brown Corpus](http://www.helsinki.fi/varieng/CoRD/corpora/BROWN/). The *Brown University Standard Corpus of Present-Day American English* (or just Brown Corpus) was the first computer-readable general corpus of texts prepared for linguistic research on modern English. It was compiled by W. Nelson Francis and Henry Kučera at Brown University in the 1960s and contains over 1 million words (500 samples of 2000+ words each) of running text of edited English prose printed in the United States during the calendar year 1961.

Loading the data:

In [5]:
from nltk.corpus import brown

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

In [6]:
sent = brown_sents[0]
sent

[u'The',
 u'Fulton',
 u'County',
 u'Grand',
 u'Jury',
 u'said',
 u'Friday',
 u'an',
 u'investigation',
 u'of',
 u"Atlanta's",
 u'recent',
 u'primary',
 u'election',
 u'produced',
 u'``',
 u'no',
 u'evidence',
 u"''",
 u'that',
 u'any',
 u'irregularities',
 u'took',
 u'place',
 u'.']

In [7]:
tagged_sent = brown_tagged_sents[0]
tagged_sent

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

####Default tagger
- Assigns the same tag to each token
- Chooses the most likely tag

In [8]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
tags[:5]

[u'AT', u'NP-TL', u'NN-TL', u'JJ-TL', u'NN-TL']

In [9]:
nltk.FreqDist(tags)

FreqDist({u'NN': 13162, u'IN': 10616, u'AT': 8893, u'NP': 6866, u',': 5133, u'NNS': 5066, u'.': 4452, u'JJ': 4392, u'CC': 2664, u'VBD': 2524, ...})

In [10]:
nltk.FreqDist(tags).max()

u'NN'

In [11]:
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(sent)

[(u'The', 'NN'),
 (u'Fulton', 'NN'),
 (u'County', 'NN'),
 (u'Grand', 'NN'),
 (u'Jury', 'NN'),
 (u'said', 'NN'),
 (u'Friday', 'NN'),
 (u'an', 'NN'),
 (u'investigation', 'NN'),
 (u'of', 'NN'),
 (u"Atlanta's", 'NN'),
 (u'recent', 'NN'),
 (u'primary', 'NN'),
 (u'election', 'NN'),
 (u'produced', 'NN'),
 (u'``', 'NN'),
 (u'no', 'NN'),
 (u'evidence', 'NN'),
 (u"''", 'NN'),
 (u'that', 'NN'),
 (u'any', 'NN'),
 (u'irregularities', 'NN'),
 (u'took', 'NN'),
 (u'place', 'NN'),
 (u'.', 'NN')]

- Unknown words will be nouns (as it happens, most new words are nouns)

Accuracy:

In [12]:
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028
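
The accuracy of a default tagger is simply the share of tokens whose correct tag happens to be the default tag. As a rough sketch in plain Python (not NLTK's implementation), the same computation can be done by hand on the single tagged sentence shown earlier; the variable names are illustrative.

```python
from collections import Counter

# First tagged sentence of the Brown news category, copied from above
tagged_sent = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
    ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
    ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
    ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
    ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'),
    ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'),
    ('.', '.'),
]

# The most frequent tag plays the role of the default tag
tag_counts = Counter(tag for _, tag in tagged_sent)
default_tag, hits = tag_counts.most_common(1)[0]

# Tagging every token with default_tag is correct exactly `hits` times
accuracy = float(hits) / len(tagged_sent)
print(default_tag, accuracy)
```

On this one sentence, 5 of the 25 tokens are tagged `NN`, so the baseline reaches 0.2; over the whole news category the share of `NN` tokens is lower, which matches the accuracy of about 0.13 reported above.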

####Regular expression tagger
- Assigns tags to tokens on the basis of matching patterns

In [13]:
patterns = [
    (r'.*ing$', 'VBG'),              # gerunds
    (r'.*ed$', 'VBD'),               # simple past
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
    (r'.*\'s$', 'NN$'),              # possessive nouns
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN'),                   # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(sent)

[(u'The', 'NN'),
 (u'Fulton', 'NN'),
 (u'County', 'NN'),
 (u'Grand', 'NN'),
 (u'Jury', 'NN'),
 (u'said', 'NN'),
 (u'Friday', 'NN'),
 (u'an', 'NN'),
 (u'investigation', 'NN'),
 (u'of', 'NN'),
 (u"Atlanta's", 'NN$'),
 (u'recent', 'NN'),
 (u'primary', 'NN'),
 (u'election', 'NN'),
 (u'produced', 'VBD'),
 (u'``', 'NN'),
 (u'no', 'NN'),
 (u'evidence', 'NN'),
 (u"''", 'NN'),
 (u'that', 'NN'),
 (u'any', 'NN'),
 (u'irregularities', 'VBZ'),
 (u'took', 'NN'),
 (u'place', 'NN'),
 (u'.', 'NN')]

- The list of regular expressions is processed in order, and the first one that matches is applied
- The final regular expression `.*` is a catch-all that tags everything as a noun (equivalent to the default tagger)

Accuracy:

In [14]:
regexp_tagger.evaluate(brown_tagged_sents)

0.20326391789486245
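
To see why the order of the patterns matters, here is a small sketch of first-match tagging in plain Python. It is not NLTK's code: `regexp_tag` is a hypothetical helper, and the decimal point in the number pattern is escaped in this sketch so that it only matches a literal dot.

```python
import re

# Patterns in the same order as above (number pattern with an escaped dot)
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),
]

def regexp_tag(word, patterns):
    """Return the tag of the first pattern that matches the word (hypothetical helper)."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return None

for word in ['irregularities', 'produced', "Atlanta's", '1961', 'The']:
    print(word, regexp_tag(word, patterns))

# Moving the plural-noun rule to the front changes the tag of 'irregularities'
reordered = [patterns[5]] + patterns[:5] + patterns[6:]
print(regexp_tag('irregularities', reordered))
```

With the original order, *irregularities* is caught by the `es` rule and tagged `VBZ`; with the plural rule first it becomes `NNS`, which is the tag it has in the corpus. The price is that every other word ending in *s* is then tagged `NNS` as well.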

####Lookup tagger
- Problem: With the previous taggers, many high-frequency words are tagged as NN even though they are not nouns
- Solution:
    - Find the hundred most frequent words and store their most likely tag
    - Use this information as the model for a *lookup* tagger (an NLTK `UnigramTagger`)
    - Tag everything else as a noun

In [15]:
brown_news_words = brown.words(categories='news')
brown_news_tagged_words = brown.tagged_words(categories='news')

In [16]:
brown_news_words[:5]

[u'The', u'Fulton', u'County', u'Grand', u'Jury']

In [17]:
brown_news_tagged_words[:5]

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL')]

In [18]:
# Count the number of times a word appears in total
fd = nltk.FreqDist(brown_news_words)
# Count the number of times a word appears under certain category
cfd = nltk.ConditionalFreqDist(brown_news_tagged_words)

In [19]:
list(fd.items())[:5]  # list() also works on Python 3, where items() is a view

[(u'stock', 21),
 (u'sunbonnet', 1),
 (u'Elevated', 1),
 (u'narcotic', 2),
 (u'four', 73)]

In [20]:
list(cfd.items())[:5]  # list() also works on Python 3, where items() is a view

[(u'stock', FreqDist({u'NN': 20, u'VB': 1})),
 (u'sunbonnet', FreqDist({u'NN': 1})),
 (u'Elevated', FreqDist({u'VBN-TL': 1})),
 (u'narcotic', FreqDist({u'JJ': 1, u'NN': 1})),
 (u'four', FreqDist({u'CD': 73}))]

In [21]:
most_common_words = [word for word, freq in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in most_common_words)
lookup_tagger = nltk.UnigramTagger(model=likely_tags, backoff=default_tagger)
lookup_tagger.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', 'NN'),
 (u'County', 'NN'),
 (u'Grand', 'NN'),
 (u'Jury', 'NN'),
 (u'said', u'VBD'),
 (u'Friday', 'NN'),
 (u'an', u'AT'),
 (u'investigation', 'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", 'NN'),
 (u'recent', 'NN'),
 (u'primary', 'NN'),
 (u'election', 'NN'),
 (u'produced', 'NN'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', 'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', 'NN'),
 (u'took', 'NN'),
 (u'place', 'NN'),
 (u'.', u'.')]

Accuracy:

In [22]:
lookup_tagger.evaluate(brown_tagged_sents)

0.5817769556656125

- The lookup tagger accuracy increases as the model size grows

<div align="center">
    <figure>
        <img src="img/lookup_tagger_accuracy.png" width="600">
        <figcaption>Lookup tagger accuracy</figcaption>
    </figure>
</div>
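
The effect of the model size can be reproduced on a tiny scale. The sketch below is plain Python with a made-up toy corpus (it does not use the Brown Corpus, and `lookup_accuracy` is a hypothetical helper); it only illustrates why knowing more of the frequent words cannot hurt a lookup tagger that falls back to `NN`.

```python
from collections import Counter, defaultdict

# Made-up toy corpus of (word, tag) pairs, for illustration only
tagged_words = ([('the', 'AT')] * 5 + [('.', '.')] * 4 + [('of', 'IN')] * 3
                + [('dog', 'NN')] * 2 + [('ran', 'VBD')])

# Word frequencies and, for each word, the frequencies of its tags
word_counts = Counter(word for word, _ in tagged_words)
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

def lookup_accuracy(model_size):
    """Accuracy of a lookup tagger that knows the `model_size` most frequent words."""
    model = {word: tag_counts[word].most_common(1)[0][0]
             for word, _ in word_counts.most_common(model_size)}
    # Words missing from the model fall back to the default tag 'NN'
    hits = sum(1 for word, tag in tagged_words if model.get(word, 'NN') == tag)
    return float(hits) / len(tagged_words)

accuracies = [lookup_accuracy(n) for n in range(5)]
print(accuracies)
```

The accuracy climbs quickly while the most frequent words are added and then flattens out, because the remaining words are rare or are already handled correctly by the default tag.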

####Unigram tagger
- Like the lookup tagger, it assigns the most likely tag to each token
- Unlike the default tagger, it must be trained before it can be used

The unigram tagger is trained by initializing it with tagged sentences. Training consists of inspecting the tag of each word and storing the most likely tag for each word in a dictionary kept inside the tagger.

In [23]:
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

Accuracy:

In [24]:
unigram_tagger.evaluate(brown_tagged_sents)

0.9349006503968017
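
The training step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation; `train_unigram` is a hypothetical helper and the training sentences are made up (they mimic the corpus, where *stock* is mostly a noun but occasionally a verb).

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Made-up training sentences, for illustration only
train = [
    [('the', 'AT'), ('stock', 'NN'), ('fell', 'VBD')],
    [('they', 'PPSS'), ('stock', 'VB'), ('the', 'AT'), ('shelves', 'NNS')],
    [('the', 'AT'), ('stock', 'NN'), ('rose', 'VBD')],
]
model = train_unigram(train)

# Known words get their most frequent tag; unknown words get None
print([(word, model.get(word)) for word in ['the', 'stock', 'market']])
```

A word that never occurred in the training data has no entry in the dictionary, so the sketch returns `None` for it.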

It is important not to test the tagger with the same data used to train it. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would be useless for tagging new text. Instead, split the data into:
- Training data (90%)
- Testing data (10%)

In [25]:
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

In [26]:
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

Accuracy:

In [27]:
unigram_tagger.evaluate(test_sents)

0.8120203329014253
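
The drop in accuracy on held-out data can be illustrated with a tagger that does nothing but memorize. The sketch below is plain Python on made-up sentences; the 75/25 split and the helper `accuracy` are assumptions chosen to keep the example tiny.

```python
# Made-up tagged sentences, for illustration only
tagged_sents = [
    [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
    [('the', 'AT'), ('election', 'NN'), ('ended', 'VBD')],
    [('the', 'AT'), ('jury', 'NN'), ('ended', 'VBD')],
    [('the', 'AT'), ('mayor', 'NN'), ('said', 'VBD')],
]

# Hold out the last 25% of the sentences (the text above holds out 10%)
size = int(len(tagged_sents) * 0.75)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

# "Training" here is pure memorization of the word/tag pairs in train_sents
model = {word: tag for sent in train_sents for word, tag in sent}

def accuracy(model, sents):
    """Share of tokens whose memorized tag equals the correct tag."""
    pairs = [pair for sent in sents for pair in sent]
    hits = sum(1 for word, tag in pairs if model.get(word) == tag)
    return float(hits) / len(pairs)

print(accuracy(model, train_sents))  # every training token was memorized
print(accuracy(model, test_sents))   # 'mayor' was never seen during training
```

The score on the training sentences is perfect by construction, so only the score on the held-out sentence says anything about how the tagger handles new text.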

####n-gram tagger
- It uses the context of the word to determine its POS tag
- The context is used for disambiguation; for example: *wind* can be a noun as in *the wind* or a verb as in *to wind*

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the POS tags of the n-1 preceding tokens:
- 1-gram (unigram) tagger:
    - current token in isolation
- 2-gram (bigram) tagger:
    - current token
    - POS tag of the preceding token
- 3-gram (trigram) tagger:
    - current token
    - POS tags of the 2 preceding tokens
- ...

- n-gram tagger:
    - current token
    - POS tags of the n-1 preceding tokens

<br/>
<div align="center">
    <figure>
        <img src="img/ngram-context.png" width="600">
        <figcaption>Trigram context</figcaption>
    </figure>
</div>
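
For the bigram case, the idea of a context can be sketched in plain Python as a dictionary keyed by the pair (tag of the preceding token, current token). This is a simplified illustration rather than NLTK's implementation: the helpers, the `START` marker and the training sentences are all made up for the example.

```python
from collections import Counter, defaultdict

START = '<s>'  # hypothetical marker for "no preceding token"

def train_bigram(tagged_sents):
    """Map each (previous tag, current word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}

def bigram_tag(model, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev_tag = [], START
    for word in words:
        prev_tag = model.get((prev_tag, word))
        tagged.append((word, prev_tag))
    return tagged

# Made-up training sentences, for illustration only
train = [
    [('the', 'AT'), ('wind', 'NN'), ('blew', 'VBD')],
    [('they', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('wind', 'VB'),
     ('the', 'AT'), ('clock', 'NN')],
]
model = train_bigram(train)

short = bigram_tag(model, ['the', 'wind'])
mixed = bigram_tag(model, ['they', 'want', 'to', 'wind', 'the', 'wind'])
print(short)
print(mixed)
```

The same word *wind* receives `NN` after an article and `VB` after *to*, because the two occurrences are looked up under different contexts.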

Example:

In [28]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

Accuracy:

In [29]:
bigram_tagger.evaluate(test_sents)

0.10276088906608193

- Problem: The bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word, it is unable to assign a tag (the tag will be **None**, that is, no tag). It then cannot tag the following word either, even if that word was seen during training, because it never saw that word preceded by a **None** tag. Consequently, the tagger fails to tag the rest of the sentence and its overall accuracy is very low.
- Name of the phenomenon: the sparse data problem.
- Reason: As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data.
- Solution: Accept a trade-off between accuracy and coverage (similar to the precision/recall trade-off): fall back on a more general tagger, such as a default tagger, when the n-gram tagger is unable to tag a word (`backoff` argument).
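
The cascade of **None** tags can be reproduced with a toy bigram tagger in plain Python. This is a sketch under simplifying assumptions, not NLTK's implementation: the helpers, the `START` marker and the single training sentence are made up.

```python
from collections import Counter, defaultdict

START = '<s>'  # hypothetical marker for "no preceding token"

def train_bigram(tagged_sents):
    """Map each (previous tag, current word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}

def bigram_tag(model, words):
    """Tag left to right without any fallback."""
    tagged, prev_tag = [], START
    for word in words:
        prev_tag = model.get((prev_tag, word))  # None for an unseen context
        tagged.append((word, prev_tag))
    return tagged

# Made-up training sentence, for illustration only
train = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('no', 'AT'), ('.', '.')]]
model = train_bigram(train)

seen = bigram_tag(model, ['the', 'jury', 'said', 'no', '.'])
unseen = bigram_tag(model, ['the', 'mayor', 'said', 'no', '.'])
print(seen)
print(unseen)
```

A single unknown word (*mayor*) is enough to leave the rest of the sentence untagged, even though *said*, *no* and the full stop all occurred in the training sentence.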

Combining taggers:
- 1) Try tagging with the n-gram tagger
- 2) If unable, try the (n-1)-gram tagger
- 3) If unable, try the (n-2)-gram tagger
- ...
- n-2) If unable, try the trigram tagger
- n-1) If unable, try the bigram tagger
- n) If unable, try the unigram tagger
- n+1) If unable, use the default tagger

Example:

In [30]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', 'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', 'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', 'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

Accuracy:

In [31]:
t2.evaluate(test_sents)

0.844911791089405
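
The backoff chain used above (bigram, then unigram, then default) can be sketched in plain Python as follows. This is a simplified illustration and not how NLTK implements `backoff` internally: the helpers, the `START` marker and the training sentence are made up for the example.

```python
from collections import Counter, defaultdict

START = '<s>'  # hypothetical marker for "no preceding token"

def most_likely(table):
    """Keep only the most frequent tag for each key."""
    return {key: tags.most_common(1)[0][0] for key, tags in table.items()}

def train(tagged_sents):
    """Count tags per word (unigram) and per (previous tag, word) context (bigram)."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_likely(uni), most_likely(bi)

def backoff_tag(words, unigram, bigram, default='NN'):
    """Try the bigram context first, then the word alone, then the default tag."""
    tagged, prev_tag = [], START
    for word in words:
        tag = bigram.get((prev_tag, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Made-up training sentence, for illustration only
train_sents = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('no', 'AT'), ('.', '.')]]
unigram, bigram = train(train_sents)

result = backoff_tag(['the', 'mayor', 'said', 'no', '.'], unigram, bigram)
print(result)
```

Because the unknown word *mayor* now receives the default tag `NN` instead of no tag, the bigram context for *said* is one that was seen in training, and the rest of the sentence is tagged normally.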

####Exercise
- Build a tagger by combining a trigram, a bigram, a unigram and a regular expression tagger (for the default case)
- Use it to tag a sentence
- Evaluate its performance

In [32]:
import nltk
from nltk.corpus import brown

patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*\'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN')
]

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

brown_sents = brown.sents(categories='news')
sent = brown_sents[0]
t3.tag(sent)

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', 'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', 'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', 'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

In [33]:
t3.evaluate(test_sents)

0.8620552177813217