<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 17. Word tagging and categorization</h2>
    <img src="img/tag.svg" width="300">
</div>

###Table of contents

- [Lexical categories](#Lexical-categories)
- [POS tagger](#POS-tagger)
- [Automatic tagging](#Automatic-tagging)

###Lexical categories

- **nouns**: people, places, things, concepts
- **verbs**: actions
- **adjectives**: describe nouns
- **adverbs**: modify adjectives and verbs
- ...

<br/>
<div align="center">
    <figure>
        <img src="img/pos.png" width="800">
        <figcaption>Lexical categories</figcaption>
    </figure>
</div>

- These word classes are also known as parts-of-speech (POS)
- They arise from simple analysis of the distribution of words in text

###POS tagger

The process of classifying words into their parts-of-speech and labeling them accordingly is known as *part-of-speech tagging*, *POS tagging*, or simply *tagging*.

POS tagging is the third step in the typical natural language processing (NLP) pipeline, following tokenization.

<br/>
<div align="center">
    <figure>
        <img src="img/nlp-pipeline.png" width="600">
        <figcaption>NLP pipeline</figcaption>
    </figure>
</div>

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word. It works in two steps:
1. Tokenization
2. Tagging

First, import the `nltk` package:

In [5]:
import nltk

Then run the following code once (from an IPython shell rather than an IPython notebook):
```
nltk.download()
```
In the downloader, choose `d) Download` and then enter `all` as the identifier. This downloads all the corpora and data needed to work with `nltk`.

Example 1:

In [6]:
text = 'And now for something completely different'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

<table align="left">
    <caption>Meaning of abbreviations (I)</caption>
    <thead>
        <th>Abbreviation</th><th>Lexical category</th>
    </thead>
    <tbody>
        <tr><td>CC</td><td>coordinating conjunction</td></tr>
        <tr><td>RB</td><td>adverb</td></tr>
        <tr><td>IN</td><td>preposition</td></tr>
        <tr><td>NN</td><td>noun</td></tr>
        <tr><td>JJ</td><td>adjective</td></tr>
    </tbody>
</table>
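
The two steps, tokenization followed by tagging, can be sketched without NLTK. This is only an illustration under stated assumptions: `toy_tokenize`, `toy_tag`, and `TOY_LEXICON` are hypothetical names, the lexicon simply copies the tags from Example 1, and the fallback to `NN` is an arbitrary choice. `nltk.pos_tag` relies on a trained model rather than a lookup table.

```python
import re

# Hypothetical toy lexicon copied from the output of Example 1.
# A real tagger learns this information from annotated text.
TOY_LEXICON = {
    'and': 'CC', 'now': 'RB', 'for': 'IN',
    'something': 'NN', 'completely': 'RB', 'different': 'JJ',
}


def toy_tokenize(text):
    """Step 1: split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_tag(tokens, default='NN'):
    """Step 2: attach one tag to each token (default tag if unknown)."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default)) for tok in tokens]


tokens = toy_tokenize('And now for something completely different')
tagged = toy_tag(tokens)
print(tagged)
```

Because the lookup ignores context, a table like this cannot tell apart two uses of the same word, which is the problem Example 2 illustrates.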

Example 2:

In [7]:
text = 'They refuse to permit us to obtain the refuse permit'
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

<table align="left">
    <caption>Meaning of abbreviations (II)</caption>
    <thead>
        <th>Abbreviation</th><th>Lexical category</th>
    </thead>
    <tbody>
        <tr><td>PRP</td><td>personal pronoun</td></tr>
        <tr><td>VBP</td><td>verb, present tense, not 3rd person singular</td></tr>
        <tr><td>TO</td><td><em>to</em> (preposition or infinitive marker)</td></tr>
        <tr><td>VB</td><td>verb, base form</td></tr>
        <tr><td>DT</td><td>determiner</td></tr>
    </tbody>
</table>

Notice that *refuse* and *permit* each appear both as a verb (VBP and VB, respectively) and as a noun (NN). NLTK provides documentation for each tag, which can be queried with the function `nltk.help.upenn_tagset(tag)`. For example:

In [8]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
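
The ambiguity of *refuse* and *permit* in Example 2 can be made explicit by collecting the set of tags seen for each word. The following is a minimal sketch using only the standard library, with the tagged output of Example 2 copied in by hand; `tags_by_word` and `ambiguous` are illustrative names, not part of NLTK.

```python
from collections import defaultdict

# Tagged output of Example 2, copied by hand.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Map each (lowercased) word to the set of tags it received.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that received more than one distinct tag.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```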


###Automatic tagging

- The tag of a word depends on the word itself and its context within a sentence
- Therefore, we work with data at the level of complete (tagged) sentences rather than independent (tagged) words
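
To see why sentence boundaries matter, the sketch below pairs each word with the tag of the word before it, restarting at every sentence. It reuses the tagged outputs of Examples 1 and 2 as a two-sentence dataset; the helper `with_previous_tag` and the start marker `'<s>'` are assumptions made for illustration.

```python
# Two tagged sentences, copied from Examples 1 and 2.
tagged_sents = [
    [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
     ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')],
    [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
     ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
     ('refuse', 'NN'), ('permit', 'NN')],
]


def with_previous_tag(tagged_sent, start='<s>'):
    """Return (previous_tag, word, tag) triples for one sentence."""
    triples = []
    prev = start  # hypothetical marker for "beginning of sentence"
    for word, tag in tagged_sent:
        triples.append((prev, word, tag))
        prev = tag
    return triples


# Context is computed per sentence, so it never leaks across boundaries.
contexts = [c for sent in tagged_sents for c in with_previous_tag(sent)]

# The same word gets different tags depending on the preceding tag.
refuse_contexts = [(prev, tag) for prev, word, tag in contexts
                   if word == 'refuse']
print(refuse_contexts)  # [('PRP', 'VBP'), ('DT', 'NN')]
```

Had the two sentences been flattened into one list of words, *They* would appear to follow an adjective (`JJ`) instead of starting a sentence.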

####Dataset
[Brown Corpus](http://www.helsinki.fi/varieng/CoRD/corpora/BROWN/). The *Brown University Standard Corpus of Present-Day American English* (or just Brown Corpus) was the first computer-readable general corpus of texts prepared for linguistic research on modern English. It was compiled by W. Nelson Francis and Henry Kučera at Brown University in the 1960s and contains over 1 million words (500 samples of 2000+ words each) of running text of edited English prose printed in the United States during the calendar year 1961.

Loading the data:

In [20]:
from nltk.corpus import brown

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

In [16]:
brown_sents[0]

[u'The',
 u'Fulton',
 u'County',
 u'Grand',
 u'Jury',
 u'said',
 u'Friday',
 u'an',
 u'investigation',
 u'of',
 u"Atlanta's",
 u'recent',
 u'primary',
 u'election',
 u'produced',
 u'``',
 u'no',
 u'evidence',
 u"''",
 u'that',
 u'any',
 u'irregularities',
 u'took',
 u'place',
 u'.']

In [17]:
brown_tagged_sents[0]

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBD'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'CS'),
 (u'any', u'DTI'),
 (u'irregularities', u'NNS'),
 (u'took', u'VBD'),
 (u'place', u'NN'),
 (u'.', u'.')]

####Default tagger
- Assigns the same tag to each token
- Chooses the most likely tag, that is, the most frequent tag in the training data

In [22]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
tags[:10]

[u'AT',
 u'NP-TL',
 u'NN-TL',
 u'JJ-TL',
 u'NN-TL',
 u'VBD',
 u'NR',
 u'AT',
 u'NN',
 u'IN']

In [24]:
nltk.FreqDist(tags)

FreqDist({u'NN': 13162, u'IN': 10616, u'AT': 8893, u'NP': 6866, u',': 5133, u'NNS': 5066, u'.': 4452, u'JJ': 4392, u'CC': 2664, u'VBD': 2524, ...})

In [25]:
nltk.FreqDist(tags).max()

u'NN'

In [28]:
text = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]

- Every word, including words the tagger has never seen, is tagged as a noun (as it happens, most new words are nouns)

Accuracy:

In [29]:
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028
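
What the default tagger and its accuracy score do can be reproduced by hand on a small sample. The sketch below is written for Python 3, uses only the standard library, and takes the first tagged Brown sentence shown earlier as its entire dataset, so the resulting number differs from the score on the full news category. The variable names are illustrative; recent NLTK releases appear to expose the same score as `accuracy`, with `evaluate` kept as a deprecated alias.

```python
from collections import Counter

# First tagged sentence of the Brown news category, copied by hand.
gold_sent = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
    ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
    ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'),
    ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
    ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'),
    ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"),
    ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'),
    ('took', 'VBD'), ('place', 'NN'), ('.', '.'),
]

# 1. Find the most frequent tag in the sample.
tag_counts = Counter(tag for _, tag in gold_sent)
default_tag = tag_counts.most_common(1)[0][0]

# 2. Tag every token with that single tag.
predicted = [(word, default_tag) for word, _ in gold_sent]

# 3. Accuracy = fraction of tokens whose predicted tag equals the gold tag.
correct = sum(1 for (_, gold), (_, pred) in zip(gold_sent, predicted)
              if gold == pred)
accuracy = float(correct) / len(gold_sent)
print(default_tag, accuracy)  # NN 0.2
```

Only exact matches count: `NN-TL` and `NNS` are different tags from `NN`, which is one reason the score stays low.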

####Regular expression tagger

