# Lab 3 - Natural Language Processing with NLTK


## Due: Thursday, January 25, 2018,  11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> Lab 3:
Your Notebook, named si330-lab3-YOUR_UNIQUE_NAME.ipynb and
the HTML file, named si330-lab3-YOUR_UNIQUE_NAME.html

### Name:  YOUR NAME GOES HERE
### Uniqname: YOUR UNIQNAME GOES HERE
### People you worked with: [if you didn't work with anyone else write "I worked by myself" here].


## Objectives
After completing this Lab, you should know how to use NLTK to:
* Normalize and tokenize your text data
* Tag the parts of speech in a sentence


## Part A - Installing NLTK

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in Python.

You will install the package directly from your Jupyter notebook.
<font color="magenta"><b>Make sure you are in the SI 330 environment when you run your Jupyter notebook. In your Jupyter notebook, run the following command:</b></font>

In [None]:
# First run this cell
# import sys
# !conda install --yes --prefix {sys.prefix} nltk

NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/

In the next code chunk, you will install the data.

In [None]:
# nltk.download()

## Background

NLTK's corpora include texts from Project Gutenberg. In today's lab we will work with the text of Shakespeare's Julius Caesar. The chunk below lists the books available in this corpus.

In [21]:
import nltk, re
from collections import defaultdict

# List the texts available in the Gutenberg corpus
for fileid in nltk.corpus.gutenberg.fileids():
    print(fileid)

austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt


Now, we will import Julius Caesar and save it in a variable. <font color="magenta">Print and see what it looks like.</font> 

In [18]:
# Load Julius Caesar as raw text. There are other ways to load text from this corpus.
caesar = nltk.corpus.gutenberg.raw('shakespeare-caesar.txt')

# Print the first 1000 characters of Julius Caesar.
print(caesar[:1000])

[The Tragedie of Julius Caesar by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Flauius, Murellus, and certaine Commoners ouer the Stage.

  Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
  Car. Why Sir, a Carpenter

   Mur. Where is thy Leather Apron, and thy Rule?
What dost thou with thy best Apparrell on?
You sir, what Trade are you?
  Cobl. Truely Sir, in respect of a fine Workman, I am
but as you would say, a Cobler

   Mur. But what Trade art thou? Answer me directly

   Cob. A Trade Sir, that I hope I may vse, with a safe
Conscience, which is indeed Sir, a Mender of bad soules

   Fla. What Trade thou knaue? Thou naughty knaue,
what Trade?
  Cobl. Nay I beseech you Sir, be not out with me: yet
if you be out Sir, I can mend you

   Mur. What mean'st thou by that? Mend mee, thou
sawcy Fellow?

Next, we will normalize and tokenize the text of the play. We will use the <b>```RegexpTokenizer```</b> from the <b>```nltk.tokenize```</b> package, which lets us tokenize the text with our own regular expression. You only want the words, so write your regular expression accordingly.

<font color="magenta">Write the code to normalize the text, then tokenize it using a regular expression.</font>

In [78]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')  # Fill in with the right regular expression.

# Normalize by lowercasing, then tokenize.
word_tokens = tokenizer.tokenize(caesar.lower())
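
To see roughly what the `\w+` pattern does to the text, the sketch below applies the same pattern with the standard library's `re.findall` to one line from the excerpt above. The `sample` string and the alternative `apostrophe_pattern` are assumptions made for illustration only; they are not part of the lab solution, and `RegexpTokenizer` may differ in details.

```python
import re

# One line from the excerpt printed earlier, used as a small sample.
sample = "Mur. What mean'st thou by that? Mend mee, thou"

# Normalize by lowercasing, then keep runs of word characters.
simple_tokens = re.findall(r"\w+", sample.lower())
print(simple_tokens)

# Illustrative alternative (an assumption): keep an apostrophe inside a word.
apostrophe_pattern = r"[a-z]+(?:'[a-z]+)?"
apostrophe_tokens = re.findall(apostrophe_pattern, sample.lower())
print(apostrophe_tokens)
```

Note that `\w+` splits `mean'st` into `mean` and `st`, which changes the type and token counts computed later.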

A useful measure is the type-token ratio (TTR): the number of distinct word types divided by the total number of word tokens. To compute it, we need both counts.
<font color="magenta">Calculate the total number of word types, the total number of word tokens, and the type-token ratio for the text.</font>

In [79]:
diction_words_caesar = defaultdict(int)
for word in word_tokens:
    diction_words_caesar[word] += 1

# sorted_words_caesar = sorted(diction_words_caesar.items(), key = lambda x: x[1], reverse = True)
# for k, v in sorted_words_caesar[:20]:
#     print(k, v)

# Types are the distinct words; tokens are all word occurrences.
type_token_ratio = len(diction_words_caesar) / len(word_tokens)
print(type_token_ratio)

0.1446366118909596
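
To make the ratio concrete, here is a minimal sketch on a toy token list. The `toy_tokens` list is an assumption for illustration and is not taken from the play; `collections.Counter` is used here in place of the `defaultdict` above.

```python
from collections import Counter

# A toy token list (an assumption for illustration, not the play's text).
toy_tokens = "to be or not to be that is the question".split()

# Counter maps each distinct word type to its number of occurrences.
toy_counts = Counter(toy_tokens)

num_types = len(toy_counts)   # distinct words
num_tokens = len(toy_tokens)  # all word occurrences
toy_ttr = num_types / num_tokens
print(num_types, num_tokens, toy_ttr)
```

Here 8 types over 10 tokens gives a ratio of 0.8. Longer texts repeat words more often, so their ratio tends to be much lower, as with the value above.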


A bigram (or digram) is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of the bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition.

Here, you will retrieve all the bigrams from the text and count the number of times each bigram occurs.

<font color="magenta">Calculate the bigrams - two words occurring one after the other - and store them in a dictionary, along with the number of times each has occurred.</font>

In [91]:
bigrams = defaultdict(int)
for i in range(len(word_tokens) - 1):
    bigrams[(word_tokens[i], word_tokens[i+1])] += 1

sorted_bigram_counts = sorted(bigrams.items(), key = lambda x: x[1], reverse = True)
for k, v in sorted_bigram_counts[:20]:
    print(k, v)

('i', 'will') 50
('i', 'am') 48
('my', 'lord') 40
('in', 'the') 40
('it', 'is') 37
('i', 'haue') 36
('to', 'the') 34
('that', 'i') 31
('i', 'do') 31
('of', 'the') 24
('and', 'i') 24
('all', 'the') 23
('you', 'are') 22
('he', 'is') 21
('i', 'know') 21
('to', 'day') 20
('my', 'selfe') 20
('is', 'a') 19
('there', 'is') 19
('cassi', 'i') 18
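
The index-based loop above can also be written by pairing each token with its successor. The sketch below is one possible alternative, assuming a toy token list (`toy_tokens` is made up for illustration) and using `zip` with `collections.Counter`.

```python
from collections import Counter

# A toy token list (an assumption for illustration, not the play's text).
toy_tokens = "i will go and i will stay".split()

# zip pairs each token with the one that follows it, yielding the bigrams.
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

# Show the most frequent bigram and its count.
print(toy_bigrams.most_common(1))
```

A list of n tokens always yields n - 1 bigrams, which is why the loop above stops at `len(word_tokens) - 1`.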


In [55]:
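from nltk.tag import pos_tag

# Tag each distinct word type; pos_tag expects a list of tokens.
word_tags = pos_tag(list(diction_words_caesar))

# Count how many word types received each part-of-speech tag.
nltk.FreqDist(tag for (word, tag) in word_tags)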

FreqDist({'CC': 8,
          'CD': 10,
          'DT': 14,
          'FW': 7,
          'IN': 52,
          'JJ': 564,
          'JJR': 12,
          'JJS': 27,
          'MD': 11,
          'NN': 1042,
          'NNP': 3,
          'NNS': 428,
          'PRP': 10,
          'PRP$': 6,
          'RB': 149,
          'RBR': 6,
          'RBS': 1,
          'RP': 4,
          'TO': 1,
          'VB': 46,
          'VBD': 133,
          'VBG': 122,
          'VBN': 48,
          'VBP': 242,
          'VBZ': 64,
          'WDT': 1,
          'WP': 3,
          'WP$': 1,
          'WRB': 4})
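
The tag count above works on `(word, tag)` pairs. As a rough sketch of that counting step alone, the example below uses a few hand-written pairs with Penn Treebank-style tags. These pairs are assumptions for illustration, not output from `pos_tag`, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

# Hand-written (word, tag) pairs: illustrative only, not tagger output.
toy_tagged = [
    ("the", "DT"), ("noble", "JJ"), ("brutus", "NN"), ("is", "VBZ"),
    ("an", "DT"), ("honourable", "JJ"), ("man", "NN"),
]

# Count how often each tag occurs, ignoring the words themselves.
toy_tag_counts = Counter(tag for _, tag in toy_tagged)
print(sorted(toy_tag_counts.items()))
```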

Get the names of all the characters (that is, the cast members) from the play. Cast members are the ones with speaking lines. You can use either <b>```nltk's RegexpTokenizer```</b> or <b>```re.findall```</b>, as you've learned previously. (Make sure no character name appears twice.)

In [66]:
tokenizer = RegexpTokenizer(r'([A-Z][a-z]{2,3}\.)')  # Fill in with the right regular expression.

# Use a separate name so the word_tokens list from earlier is not overwritten.
speaker_tokens = set(tokenizer.tokenize(caesar))
print(speaker_tokens, len(speaker_tokens))

{'Ple.', 'Cic.', 'Cask.', 'Cin.', 'Dec.', 'Octa.', 'Fla.', 'King.', 'Treb.', 'Cato.', 'Art.', 'Cobl.', 'Bed.', 'Ant.', 'Cai.', 'Army.', 'Ser.', 'Mur.', 'Dard.', 'Rome.', 'Tit.', 'Clau.', 'Cas.', 'Calp.', 'Mess.', 'Pind.', 'Caes.', 'Met.', 'Oct.', 'Drum.', 'Sold.', 'Song.', 'Clit.', 'Vol.', 'Luc.', 'Cly.', 'Sir.', 'Poet.', 'Bru.', 'Cyn.', 'Var.', 'Deci.', 'Oath.', 'Cal.', 'Por.', 'Pub.', 'All.', 'Wife.', 'Stra.', 'Cob.', 'Lep.', 'Car.', 'Cass.', 'Pin.', 'Dyes.', 'Both.', 'Lord.', 'Brut.', 'Mes.'} 59


In [92]:
characters = set(re.findall(r'([A-Z][a-z]{2,3}\.)', caesar))  # Fill in with the right regular expression.

print(characters)

{'Ple.', 'Cic.', 'Cask.', 'Cin.', 'Dec.', 'Octa.', 'Fla.', 'King.', 'Treb.', 'Cato.', 'Art.', 'Cobl.', 'Bed.', 'Ant.', 'Cai.', 'Army.', 'Ser.', 'Mur.', 'Dard.', 'Rome.', 'Tit.', 'Clau.', 'Cas.', 'Calp.', 'Mess.', 'Pind.', 'Caes.', 'Met.', 'Oct.', 'Drum.', 'Sold.', 'Song.', 'Clit.', 'Vol.', 'Luc.', 'Cly.', 'Sir.', 'Poet.', 'Bru.', 'Cyn.', 'Var.', 'Deci.', 'Oath.', 'Cal.', 'Por.', 'Pub.', 'All.', 'Wife.', 'Stra.', 'Cob.', 'Lep.', 'Car.', 'Cass.', 'Pin.', 'Dyes.', 'Both.', 'Lord.', 'Brut.', 'Mes.'}
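
The pattern above also picks up ordinary words that end a sentence, such as 'Rome.', 'King.', and 'Sir.'. One possible refinement is sketched below. It assumes that speaker labels are indented at the start of a line, as they are in the excerpt printed earlier; that assumption may not hold for every line of the play. The `sample` string reuses lines from that excerpt, and `speaker_pattern` is a hypothetical name.

```python
import re

# A few lines from the excerpt printed earlier.
sample = (
    "  Flauius. Hence: home you idle Creatures, get you home:\n"
    "Is this a Holiday? What, know you not\n"
    "  Car. Why Sir, a Carpenter\n"
    "\n"
    "   Mur. Where is thy Leather Apron, and thy Rule?\n"
    "You sir, what Trade are you?\n"
    "  Cobl. Truely Sir, in respect of a fine Workman, I am\n"
)

# Assumption: a speaker label is a capitalized word followed by a period,
# preceded only by spaces or tabs at the start of a line.
speaker_pattern = r"^[ \t]+([A-Z][a-z]+)\."

# re.MULTILINE makes ^ match at the start of every line; set() removes repeats.
speakers = sorted(set(re.findall(speaker_pattern, sample, flags=re.MULTILINE)))
print(speakers)
```

Lines that continue a speech start at the left margin in this excerpt, so words inside them are not matched.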
