This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week9` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit assignments is via JupyterHub, so you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

# Problem 9.1. NLP: Basic Concepts.

In this problem, we explore the basic concepts of part-of-speech (POS) tagging.

In [None]:
import re
import requests
import nltk
import pprint

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.corpus import treebank

from nose.tools import assert_equal, assert_is_instance, assert_true

We use _Alice's Adventures in Wonderland_ by Lewis Carroll, freely available from _Project Gutenberg_.

In [None]:
resp = requests.get('http://www.gutenberg.org/cache/epub/11/pg11.txt')
text = resp.text
print(text[:1000])

assert_is_instance(text, str)
assert_equal(len(text), 167516)

## Tokenize

- Tokenize the text by words by using `word_tokenize()`.
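
To make the idea of word tokenization concrete, here is a rough regex-based sketch on a toy sentence. It is only an approximation for illustration and is not the algorithm behind `word_tokenize()`; the name `rough_tokenize` and the pattern `TOKEN_PATTERN` are assumptions introduced here.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a possessive 's,
# a run of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")


def rough_tokenize(text):
    """Split text into word-like tokens (illustration only)."""
    return TOKEN_PATTERN.findall(text)


toy_tokens = rough_tokenize("Alice's Adventures in Wonderland, by Lewis Carroll.")
print(toy_tokens)
```

Note how the possessive `'s` and the punctuation marks become separate tokens, which is the same pattern visible in the expected tokens checked by the tests in this section.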

In [None]:
def tokenize(text):
    '''
    Tokenizes the text by words.
    
    Parameters
    ----------
    text: A string.
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    
    return tokens

In [None]:
word_tokens = tokenize(text)
print('{0} words in text'.format(len(word_tokens)))
print(40*'-')
print(word_tokens[:13])

In [None]:
assert_is_instance(word_tokens, list)
assert_true(all(isinstance(t, str) for t in word_tokens))
assert_equal(len(word_tokens), 36719)
assert_equal(word_tokens[:5], ['\ufeffProject', 'Gutenberg', "'s", 'Alice', "'s"])
assert_equal(word_tokens[-5:], ['hear', 'about', 'new', 'eBooks', '.'])

## Collocations

- Build bigram collocations by using the pointwise mutual information (PMI). Return the best 10 bigrams.
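
As background, PMI compares how often two words occur together with how often they would co-occur if they were independent: $\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$. The following is a minimal pure-Python sketch on toy data, assuming probabilities are estimated from raw counts divided by the number of tokens; the helper `pmi_scores` is hypothetical, and NLTK's exact counting conventions may differ.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent token pair by pointwise mutual information."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in pair_counts.items():
        # log2( P(w1, w2) / (P(w1) * P(w2)) ) written in terms of raw counts
        scores[(w1, w2)] = math.log2(
            pair_count * n / (word_counts[w1] * word_counts[w2])
        )
    return scores


toy_tokens = ['the', 'mock', 'turtle', 'and', 'the',
              'mock', 'turtle', 'and', 'the', 'cat']
scores = pmi_scores(toy_tokens)

# Rank by descending score; break ties alphabetically for a stable order.
ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
print(ranked[:3])
```

Under this estimate, a pair whose two words each occur exactly once receives the largest possible score, which is one reason PMI rankings tend to be dominated by rare word pairs.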

In [None]:
def find_best_bigrams(tokens):
    '''
    Builds collocations by using the pointwise mutual information (PMI).
    
    Parameters
    ----------
    tokens: A list of strings.
    
    Returns
    -------
    A list of tuples of (str, str).
    '''
    
    # YOUR CODE HERE
    
    return bigrams

In [None]:
top_bigrams = find_best_bigrams(word_tokens)

print('Best {0} bi-grams in text (WP Tokenizer)'.format(10))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(top_bigrams)

In [None]:
assert_equal(
    top_bigrams,
    [('#', '11'), ("'Cheshire", 'Puss'), ("'IT", 'DOES'), ("'ORANGE", 'MARMALADE'), ("'Ou", 'est'),
     ("'Rule", 'Forty-two'), ("'Seven", 'jogged'), ("'With", 'extras'), ("'any", 'shrimp'), ("'than", 'waste')]
)

## DefaultTagger

- Use `DefaultTagger` to associate a tag of our choosing (the `tag` parameter) with words.

In [None]:
def tag_words(words, tag):
    '''
    Associates a tag with words.
    
    Parameters
    ----------
    words: A list of strings.
    tag: A str.
    
    Returns
    -------
    A list of tuples of (str, str)
    '''
    
    # YOUR CODE HERE
    
    return tags

In [None]:
tags = tag_words(word_tokens, 'INFO')
print('Tagged text (WP Tokenizer)')
print(50*'-')
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)
pp.pprint(tags[:15])

In [None]:
assert_is_instance(tags, list)
assert_equal(len(tags), 36719)
for i, t in enumerate(tags):
    assert_is_instance(t, tuple)
    assert_equal(len(t), 2)
    assert_is_instance(t[0], str)
    assert_is_instance(t[1], str)
    assert_equal(t[0], word_tokens[i])
    assert_equal(t[1], 'INFO')

## Part of Speech Tagging

- Use a `PerceptronTagger` to create part-of-speech (PoS) tags.

In [None]:
def tag_pos(words):
    '''
    Creates Part of Speech tags.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of tuples of (str, str)
    '''
    
    # YOUR CODE HERE
    
    return pos_tags

In [None]:
pos_tags = tag_pos(word_tokens)

print('PoS tagged text (WP Tokenizer/Perceptron Tagger)')
print(60*'-')

ppf.pprint(pos_tags[:15])

In [None]:
assert_equal(len(pos_tags), len(word_tokens))
assert_equal(
    pos_tags[:15],
    [('\ufeffProject', 'JJ'), ('Gutenberg', 'NNP'),
     ("'s", 'POS'), ('Alice', 'NNP'),
     ("'s", 'POS'), ('Adventures', 'NNS'),
     ('in', 'IN'), ('Wonderland', 'NNP'),
     (',', ','), ('by', 'IN'),
     ('Lewis', 'NNP'), ('Carroll', 'NNP'),
     ('This', 'DT'), ('eBook', 'NN'),
     ('is', 'VBZ')]
    )

## Penn Treebank

- Tag the tokens in `words` by using a `UnigramTagger` trained on the tagged sentences of the Penn Treebank corpus. Words that do not appear in the training data receive the tag `None`.
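
The idea behind a unigram tagger can be sketched in a few lines of plain Python: for each word, remember the tag it carried most often in the training sentences, and return `None` for words that were never seen. This is an illustration on made-up training data, not NLTK's implementation; `train_unigram_model`, `tag_with_model`, and `toy_train` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}


def tag_with_model(model, words):
    """Tag each word independently; unseen words get None."""
    return [(word, model.get(word)) for word in words]


# Made-up training sentences (an assumption for this sketch).
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('here', 'RB')],
    [('the', 'DT'), ('run', 'NN'), ('is', 'VBZ'), ('long', 'JJ')],
    [('they', 'PRP'), ('run', 'VBP')],
    [('we', 'PRP'), ('run', 'VBP')],
]
model = train_unigram_model(toy_train)
toy_result = tag_with_model(model, ['the', 'run', 'is', 'Wonderland'])
print(toy_result)
```

Here `run` is tagged `VBP` because that tag is more frequent than `NN` in the toy data, and `Wonderland` is tagged `None` because it never appears in training. The same effect explains the `None` tags in the expected output of the tests for this part.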

In [None]:
def tag_penn(words):
    '''
    Tags words by using a UnigramTagger trained on Penn Treebank tagged sentences.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of tuples of (str, str or None)
    '''
    
    # YOUR CODE HERE
    
    return tags

In [None]:
b_tags = tag_penn(word_tokens)

print('Penn Treebank tagged text (WP Tokenizer)')
print(60*'-')

ppf.pprint(b_tags[:15])

In [None]:
assert_equal(len(b_tags), len(word_tokens))
assert_equal(
    b_tags[:15],
    [('\ufeffProject', None), ('Gutenberg', None),
     ("'s", 'POS'), ('Alice', None),
     ("'s", 'POS'), ('Adventures', None),
     ('in', 'IN'), ('Wonderland', None),
     (',', ','), ('by', 'IN'),
     ('Lewis', 'NNP'), ('Carroll', None),
     ('This', 'DT'), ('eBook', None),
     ('is', 'VBZ')]
    )

## Linking Taggers

- Link the Penn Treebank Corpus tagger with our earlier Default tagger.
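
Linking taggers means that when the first tagger has no answer for a word, the question is passed on to a fallback tagger. NLTK taggers expose this through a `backoff` argument. The sketch below imitates that chain in plain Python with toy classes; `ConstantTagger`, `LookupTagger`, and `toy_table` are hypothetical names introduced only for illustration.

```python
class ConstantTagger:
    """Assigns the same tag to every word (stand-in for a default tagger)."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag


class LookupTagger:
    """Looks a word up in a table and defers to a backoff tagger on a miss."""

    def __init__(self, table, backoff=None):
        self.table = table
        self.backoff = backoff

    def tag_word(self, word):
        tag = self.table.get(word)
        if tag is None and self.backoff is not None:
            # The primary tagger has no answer, so ask the fallback.
            return self.backoff.tag_word(word)
        return tag

    def tag_words(self, words):
        return [(word, self.tag_word(word)) for word in words]


# Toy lookup table (an assumption for this sketch).
toy_table = {'in': 'IN', 'is': 'VBZ', 'This': 'DT'}
linked = LookupTagger(toy_table, backoff=ConstantTagger('INFO'))
toy_linked = linked.tag_words(['This', 'eBook', 'is', 'in', 'Wonderland'])
print(toy_linked)
```

Every word missing from the table ends up with the fallback tag instead of `None`, which is the behavior the tests for this part check.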

In [None]:
def tag_linked(words, default_tag='INFO'):
    '''
    Tags words by using a UnigramTagger trained on Penn Treebank tagged sentences.
    Uses DefaultTagger to assign "default_tag" to any element missed by the Penn Treebank tagger.
    
    Parameters
    ----------
    words: A list of strings.
    default_tag: A str. The tag assigned to words the Penn Treebank tagger cannot tag.
    
    Returns
    -------
    A list of tuples of (str, str)
    '''
    
    # YOUR CODE HERE
    
    return tags

In [None]:
linked_tags = tag_linked(word_tokens)
print('Penn Treebank tagged text (WP Tokenizer/Linked Tagger)')
print(60*'-')

ppf.pprint(linked_tags[:15])

In [None]:
assert_equal(len(linked_tags), len(word_tokens))
assert_equal(
    linked_tags[:15],
    [('\ufeffProject', 'INFO'), ('Gutenberg', 'INFO'),
     ("'s", 'POS'), ('Alice', 'INFO'),
     ("'s", 'POS'), ('Adventures', 'INFO'),
     ('in', 'IN'), ('Wonderland', 'INFO'),
     (',', ','), ('by', 'IN'),
     ('Lewis', 'NNP'), ('Carroll', 'INFO'),
     ('This', 'DT'), ('eBook', 'INFO'),
     ('is', 'VBZ')]
    )

## Tagged Text Extraction

- Use regular expressions to restrict tokens in the text to Nouns, Verbs, Adjectives, and Adverbs. Return a tuple of PoS tags and the extracted terms.
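
As an illustration of filtering by tag with a regular expression, the sketch below keeps only words whose tag starts with a noun, verb, adjective, or adverb prefix. It assumes Penn Treebank style tags (`NN*`, `VB*`, `JJ*`, `RB*`) and runs on a hand-written toy list; the pattern `CONTENT_TAG` and the helper `keep_content_words` are hypothetical, and this particular pattern may not reproduce the exact counts expected by the tests.

```python
import re

# Assumed tag prefixes: NN (nouns), VB (verbs), JJ (adjectives), RB (adverbs).
CONTENT_TAG = re.compile(r'^(?:NN|VB|JJ|RB)')


def keep_content_words(tagged):
    """Return the words whose tag matches one of the content-word prefixes."""
    return [word for word, tag in tagged
            if tag is not None and CONTENT_TAG.match(tag)]


# Hand-written toy input (an assumption for this sketch).
toy_tagged = [
    ('Alice', 'NNP'), ("'s", 'POS'), ('Adventures', 'NNS'), ('in', 'IN'),
    ('Wonderland', 'NNP'), (',', ','), ('is', 'VBZ'), ('almost', 'RB'),
    ('free', 'JJ'), ('Missing', None),
]
toy_terms = keep_content_words(toy_tagged)
print(toy_terms)
```

The possessive marker, preposition, punctuation, and untagged word are dropped, while nouns, verbs, adjectives, and adverbs are kept.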

In [None]:
def extract_tags(words):
    '''
    Restricts tokens in the text to Nouns, Verbs, Adjectives, and Adverbs.
        
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A tuple of (pos_tags, terms)
    pos_tags: A list of tuples of (str, str).
    terms: A list of strings.
           Terms extracted with regex.
           Nouns, verbs, adjectives, or adverbs.
    '''
    
    # YOUR CODE HERE
    
    return pos_tags, terms

In [None]:
pos_tags, terms = extract_tags(word_tokens)

print('POS tagged text (WP Tokenizer)')
print(60*'-')
pp.pprint(pos_tags[:15])
print(60*'-')
print('POS tagged text (WP Tokenizer/RegEx applied)')
print(60*'-')
pp.pprint(terms[:15])

In [None]:
assert_equal(len(pos_tags), len(word_tokens))
assert_equal(
    pos_tags[:15],
    [('\ufeffProject', 'JJ'), ('Gutenberg', 'NNP'), ("'s", 'POS'),
     ('Alice', 'NNP'), ("'s", 'POS'), ('Adventures', 'NNS'), ('in', 'IN'),
     ('Wonderland', 'NNP'), (',', ','), ('by', 'IN'), ('Lewis', 'NNP'),
     ('Carroll', 'NNP'), ('This', 'DT'), ('eBook', 'NN'), ('is', 'VBZ')]
    )
assert_true(len(terms) in (16447, 16448))
assert_equal(
    terms[:15],
    ['\ufeffProject', 'Gutenberg', 'Alice', 'Adventures', 'Wonderland', 'Lewis',
     'Carroll', 'eBook', 'is', 'use', 'anyone', 'anywhere', 'cost', 'almost',
     'restrictions']
    )