<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 17. Word tagging and categorization</h2>
    <img src="img/tag.svg" width="300">
</div>

###Table of contents

- [Lexical categories](#Lexical-categories)
- [POS tagger](#POS-tagger)
- [Automatic tagging](#Automatic-tagging)
- [n-gram tagging](#n-gram-tagging)

###Lexical categories

- **nouns**: people, places, things, concepts
- **verbs**: actions
- **adjectives**: describe nouns
- **adverbs**: modify adjectives and verbs
- ...

<br/>
<div align="center">
    <figure>
        <img src="img/pos.png" width="800">
        <figcaption>Lexical categories</figcaption>
    </figure>
</div>

- These word classes are also known as parts-of-speech (POS)
- They arise from simple analysis of the distribution of words in text
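
As a rough illustration of this distributional idea, the sketch below records the neighbouring words of every word in a few sentences and compares the resulting context sets. The sentences are toy data invented for this example, and `contexts` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Toy sentences invented for illustration (not taken from a corpus).
sentences = [
    'the cat sleeps',
    'the dog sleeps',
    'a cat eats',
    'a dog eats',
]

def contexts(sentences):
    """Map each word to the set of (previous word, next word) pairs it occurs in."""
    ctx = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so the first and last words also have two neighbours.
        words = ['<s>'] + sentence.split() + ['</s>']
        for prev, word, nxt in zip(words, words[1:], words[2:]):
            ctx[word].add((prev, nxt))
    return ctx

ctx = contexts(sentences)

# 'cat' and 'dog' share all their contexts, a hint that they belong to the same class.
print(ctx['cat'] == ctx['dog'])     # True
# 'cat' and 'sleeps' share none, so they probably belong to different classes.
print(ctx['cat'] == ctx['sleeps'])  # False
```

Real distributional analysis works on much larger corpora and with softer similarity measures, but the principle is the same: words that can fill the same slots tend to share a lexical category.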

###POS tagger

The process of classifying words into their parts-of-speech and labeling them accordingly is known as *part-of-speech tagging*, *POS tagging*, or simply *tagging*.

POS tagging is the third step in the typical natural language processing (NLP) pipeline, following tokenization.

<br/>
<div align="center">
    <figure>
        <img src="img/nlp-pipeline.png" width="600">
        <figcaption>NLP pipeline</figcaption>
    </figure>
</div>

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word. Tagging raw text therefore takes two steps:
1. Tokenization
2. Tagging

Note: import the `nltk` package

In [5]:
import nltk

Then run the following code once (use an IPython shell rather than an IPython notebook)...
```
nltk.download()
```
... choose `d) Download` in the downloader and then enter `all` as the identifier. This downloads all the corpora and data needed to work with `nltk`.

Example 1:

In [6]:
text = 'And now for something completely different'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

<table align="left">
    <caption>Meaning of abbreviations (I)</caption>
    <thead>
        <tr><th>Abbreviation</th><th>Lexical category</th></tr>
    </thead>
    <tbody>
        <tr><td>CC</td><td>coordinating conjunction</td></tr>
        <tr><td>RB</td><td>adverb</td></tr>
        <tr><td>IN</td><td>preposition</td></tr>
        <tr><td>NN</td><td>noun</td></tr>
        <tr><td>JJ</td><td>adjective</td></tr>
    </tbody>
</table>
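
To make tagged output easier to read, the abbreviations can be replaced by their glosses. The sketch below is a minimal illustration: the dictionary only covers the tags in the table above, the tagged list is copied from Example 1, and `gloss` is a hypothetical helper rather than an NLTK function.

```python
# Glosses copied from the table above (only the tags seen in Example 1).
TAG_GLOSSES = {
    'CC': 'coordinating conjunction',
    'RB': 'adverb',
    'IN': 'preposition',
    'NN': 'noun',
    'JJ': 'adjective',
}

# Tagged output copied from Example 1.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

def gloss(tagged_tokens, glosses=TAG_GLOSSES):
    """Replace each tag by its gloss; tags missing from the dictionary are kept as-is."""
    return [(word, glosses.get(tag, tag)) for word, tag in tagged_tokens]

for word, category in gloss(tagged):
    print(f'{word:<12}{category}')
```

Keeping unknown tags unchanged means the helper still works on sentences whose tags are not in the table.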

Example 2:

In [7]:
text = 'They refuse to permit us to obtain the refuse permit'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

<table align="left">
    <caption>Meaning of abbreviations (II)</caption>
    <thead>
        <tr><th>Abbreviation</th><th>Lexical category</th></tr>
    </thead>
    <tbody>
        <tr><td>PRP</td><td>personal pronoun</td></tr>
        <tr><td>VBP</td><td>verb, present tense (not third person singular)</td></tr>
        <tr><td>TO</td><td><em>to</em> (preposition or infinitive marker)</td></tr>
        <tr><td>VB</td><td>verb, base form</td></tr>
        <tr><td>DT</td><td>determiner</td></tr>
    </tbody>
</table>

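Notice that *refuse* and *permit* each appear twice with different tags: *refuse* as a present tense verb (VBP) and as a noun (NN), *permit* as a base form verb (VB) and as a noun (NN). The tagger uses the surrounding words to decide between them. NLTK provides documentation for each tag, which can be queried using the function `nltk.help.upenn_tagset(tag)`. For example: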

In [8]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
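
The ambiguity seen in Example 2 can also be found programmatically. The sketch below groups the tags assigned to each word and keeps the words that received more than one distinct tag. The tagged list is copied from the Example 2 output so the block runs without NLTK, and `ambiguous_words` is a hypothetical helper, not an NLTK function.

```python
from collections import defaultdict

# Tagged output copied from Example 2.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

def ambiguous_words(tagged_tokens):
    """Return {word: [distinct tags in order of first appearance]} for words with several tags."""
    tags_by_word = defaultdict(list)
    for word, tag in tagged_tokens:
        key = word.lower()  # treat 'They' and 'they' as the same word
        if tag not in tags_by_word[key]:
            tags_by_word[key].append(tag)
    return {word: tags for word, tags in tags_by_word.items() if len(tags) > 1}

ambiguous = ambiguous_words(tagged)
print(ambiguous)
# {'refuse': ['VBP', 'NN'], 'permit': ['VB', 'NN']}
```

Note that *to* occurs twice but always with the tag TO, so it is not reported.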


###Automatic tagging