# Lab 3 - Natural Language Processing with NLTK


## Due: Thursday, January 25, 2018,  11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> Lab 3:
* Your notebook, named si330-lab3-YOUR_UNIQUE_NAME.ipynb
* The HTML file, named si330-lab3-YOUR_UNIQUE_NAME.html

### Name:  YOUR NAME GOES HERE
### Uniqname: YOUR UNIQNAME GOES HERE
### People you worked with: [if you didn't work with anyone else write "I worked by myself" here].


## Objectives
After completing this Lab, you should know how to use NLTK to:
* Normalize and tokenize your text data
* Tag the parts of speech in a sentence


## Installing NLTK

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in Python.

You will install a package directly from Jupyter Notebooks.

<b>Make sure you are in the SI 330 environment when you run your Jupyter notebook. In your Jupyter notebook, run the following command:</b>

In [None]:
# First run this cell
import sys
!conda install --yes --prefix {sys.prefix} nltk

In [8]:
import nltk, re
from collections import defaultdict

NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/

In the next code chunk, you will install the data.

In [None]:
# You can remove this cell once you've installed the corpora
nltk.download('popular')

## Background

NLTK's corpora include texts from Project Gutenberg. In today's lab we will work with the text of Shakespeare's Julius Caesar. The cell below lists the books available in this corpus.

In [None]:
# Texts present in the Gutenberg Corpora
for i in nltk.corpus.gutenberg.fileids():
    print(i)

Now let's import Julius Caesar and save it in a variable.
### <font color="magenta">Print the first 1000 characters to see what the text looks like.</font> 

In [40]:
# We want to get Julius Caesar as raw text. 
# There are other ways in which you could load text from this corpus, but we will use raw text
caesar = nltk.corpus.gutenberg.raw('shakespeare-caesar.txt')

# Print the first 1000 characters of Julius Caesar. Why are we printing characters?

## Normalization and Tokenization

Next, we will normalize and tokenize the text from the play. We will use the <b>```RegexpTokenizer```</b> from the <b>```nltk.tokenize```</b> package. This will allow us to write our own regular expression and tokenize the text. You only want the words, so write your regular expression accordingly.
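
To get a feel for pattern-based tokenization before writing your own expression, here is a minimal sketch. It uses the standard library's ```re.findall``` as a stand-in, on the assumption that ```RegexpTokenizer(pattern).tokenize(text)``` in its default mode returns the matches of the pattern in much the same way. The sample string and the digit pattern are made up for illustration and are deliberately not the pattern this exercise asks for.

```python
import re

# Illustrative sample text (not taken from the play).
sample = "Act 1, Scene 2: Caesar enters at 9 with 12 Senators."

# Normalization: lower-casing makes "Caesar" and "caesar" the same token.
normalized = sample.lower()

# This pattern keeps only runs of digits; everything that does not match is dropped.
number_tokens = re.findall(r'\d+', normalized)

print(normalized)
print(number_tokens)
```

The same idea applies to words: the pattern decides what counts as a token, and anything it does not match (punctuation, whitespace) never reaches the token list.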

### <font color="magenta">Write code to normalize the text, then tokenize the text into words using regex.</font>

In [None]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# normalize the text by converting it to lower case

tokenizer = RegexpTokenizer(r'') # Fill in with the right regular expression.

word_tokens = tokenizer.tokenize(None) # Pass the normalized text to this method.

## Types, tokens and type-token ratio
A useful measure to calculate is the type-token ratio (TTR). To compute it, we need the total number of word types (the unique words in the text) and the total number of word tokens (every occurrence of a word, repeats included).
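
As a small worked example of the definition, the sketch below computes the ratio for a hand-written token list. The list and the variable names are illustrative assumptions and are not part of the lab template.

```python
# Toy token list (illustrative; not from the play).
toy_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']

num_tokens = len(toy_tokens)       # every occurrence counts
num_types = len(set(toy_tokens))   # a set keeps one copy of each unique word

toy_ttr = num_types / num_tokens
print(num_types, num_tokens, toy_ttr)  # prints: 6 8 0.75
```

A ratio near 1 means few words are repeated; a lower ratio means the same words recur often.
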
### <font color="magenta">Calculate total number of word types, word tokens, and type-token ratio for the text.</font>

In [None]:
# Write code to calculate the number of types, tokens and then
# divide the number of types by the number of tokens.  
# Note that there are multiple ways to do this

type_token_ratio = None # Calculate the type-token ratio

## Bigrams
Bigrams are sequences of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.

Here, you will retrieve all the bigrams from the text, store them in a dictionary, and count the number of times each bigram occurs. NLTK makes this easy by providing a ```bigrams()``` function. You can get a list of bigrams with the following code:

```
>>> list_of_bigrams = list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
>>> print(list_of_bigrams)
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

Wrapping the call in ```list()``` lets us print the results. If you omit ```list()``` from the above statement, you get a generator, which is useful if you want to iterate over the bigrams (which is what you want to do below), but not so useful if you want to print them.
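
One consequence of working with a generator is that it can be traversed only once. The sketch below illustrates this with the standard library's ```zip```, used here as an assumed stand-in that yields the same adjacent pairs as ```nltk.bigrams()``` for a list of tokens; the variable names are illustrative.

```python
tokens = ['more', 'is', 'said', 'than', 'done']

# zip pairs each token with its successor and, like a generator, is lazy.
pairs = zip(tokens, tokens[1:])

first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to iterate over

print(first_pass)
print(second_pass)          # prints: []
```

If you need to walk over the bigrams more than once, create the generator again or keep the pairs in a list.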

### <font color="magenta">Calculate the bigram counts (a bigram is two words occurring one after the other) and store them in a dictionary that maps each bigram to the number of times it has occurred.</font>

In [None]:
# Create a bigram generator using the bigrams() function and iterate over your bigrams 
# to store the count of each bigram in a dictionary.
# Your keys should be the bigrams (2-tuples). Your values should be the number of occurrences of each bigram.
bigram_counts = None

# Use sorted() to order the bigrams by count, most common first (descending order)
sorted_bigram_counts = sorted()

# Print out the 20 most commonly used bigrams

## Part of Speech Tagging
Next, we will use Part of Speech tagging to tag the words from the play. We need to pass the tokenized words to the <b>```pos_tag```</b> function, which returns a list of (token, tag) tuples. The first lines in the cell below import the necessary functions.

### <font color="magenta">Pass the word tokens into pos_tag.</font>

In [10]:
# We will use these libraries for Part of Speech tagging
from nltk.tag import pos_tag
%matplotlib inline
import matplotlib.pyplot as plt

word_tags = pos_tag(None) # change 'None' to your list of word tokens

Next, we will calculate the frequency distribution of the part-of-speech tags, that is, how many tokens received each tag. We will use <b>```nltk.FreqDist()```</b>.

In [None]:
fd = nltk.FreqDist(tag for (word, tag) in word_tags)
fd.plot()
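
To see what the frequency distribution holds before plotting it, here is a minimal sketch on hand-written data. It relies on the fact that ```FreqDist``` is a subclass of ```collections.Counter```, so ```Counter``` is used as a stand-in. The (word, tag) pairs below are illustrative assumptions in the shape that ```pos_tag``` returns; they are not actual tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs; the tags are illustrative, not tagger output.
toy_tags = [
    ('the', 'DT'), ('noble', 'JJ'), ('brutus', 'NN'),
    ('told', 'VBD'), ('the', 'DT'), ('truth', 'NN'),
]

# Same generator expression as in the cell above: keep the tag, ignore the word.
tag_counts = Counter(tag for (word, tag) in toy_tags)

print(tag_counts['DT'], tag_counts['NN'], tag_counts['JJ'])  # prints: 2 2 1
```

Each key is a tag and each value is the number of tokens carrying that tag, which is what the plot displays.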

## Tokenization vs. regex

Get the names of all the characters (cast members, for clarity) from the play. Cast members are the characters who have lines. This can be done using either NLTK's <b>```RegexpTokenizer```</b> or <b>```re.findall```</b>. First try <b>```RegexpTokenizer```</b>. Print the set of character names, so that no name appears twice. If you have time and energy, write code to do the same thing using re.findall().

<b>Note: we will be using the raw text, stored in the variable ```caesar``` for this, and not your tokenized words.</b>

Consider the following excerpt:
```
  Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
  Car. Why Sir, a Carpenter
```

There are two cast members in the above text: Flauius and Car. Don't worry about the fact that Car is an abbreviation. You should notice that the cast member names are preceded by a variable number of spaces at the beginning of a line, followed by a single word, followed by a period.

You will call the tokenizer's ```tokenize()``` method on the raw text and store the output in ```cast_member_tokens```.

### <font color="magenta">Write the regular expression to tokenize and return only the cast member names and print out the names of the cast members and the number of different cast members.</font>

In [None]:
tokenizer = RegexpTokenizer(r'') #Fill in with the right regular expression.

# You will need to make some changes to this function
cast_member_tokens = tokenizer.tokenize(None) # This should give you the tokens
print(...) # You will need to print the types (not the tokens) to get the unique cast member names