### Getting Started in Python and NLTK

Start by typing a couple of examples of arithmetic into the Python interpreter. For example:
<br>```>>> 1 + 2```

In [1]:
1+2

3

Note that if you want to type in a string of text, you surround the string with quotes.
<br>```>>> 'hello' ```

In [2]:
"hello"

'hello'

In programming, when we have a value of some type (like the number 3 or the string
‘hello’), we can save that value by assigning it to a variable.
<br> ```>>> num = 1 + 2 ```
<br> ``` >>> num ```
<br>In this example, the name of the variable is “num” and its value is 3.

In [3]:
num = 1 + 2
num

3

Next, you use the Python “import” statement to load the data used in the book examples
into the Python environment:
<br> ``` >>> from nltk.book import * ```


In [4]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


This command loaded 9 of the text examples available from the corpora package (only a
small number of them!). It has used the variable names text1 through text9 for these
examples, and already assigned them values. If you type the variable name, you get a
description of the text:
<br> ``` >>> text1 ```

In [5]:
text1

<Text: Moby Dick by Herman Melville 1851>

The variables sent1 through sent9 each hold the first sentence of the corresponding text,
as a list of tokens.
<br> ``` >>> sent1 ```

In [6]:
sent1

['Call', 'me', 'Ishmael', '.']

### Searching Text

The text data structure has a number of functions to operate on text. One is called
“concordance”, and it will search for any word that you give to the function and show you
the occurrences and some surrounding context.
<br> ``` >>> text1.concordance("monstrous") ```

In [7]:
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
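To see what a concordance does under the hood, here is a minimal pure-Python sketch of a keyword-in-context search over a list of tokens. It is not NLTK's implementation: the function name `simple_concordance`, its `width` parameter, and the sample sentence are all made up for illustration.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    hits = []
    target = word.lower()  # match case-insensitively
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

# Hypothetical sample sentence, already split into tokens
tokens = "the whale was of a most monstrous size and a monstrous bulk".split()
for line in simple_concordance(tokens, "monstrous"):
    print(line)
```

Each printed line shows one match with its neighbouring tokens, which is the same idea as the fixed-width display above.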


Observe the use of the arrow keys with the enter key to select and modify previous lines in
Python, and try a similar example.
<br> ``` >>> text2.concordance("affection") ```

In [8]:
text2.concordance("affection")

Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This 
 opinion . 

Another function is “similar”, which finds words that are used in the same contexts as
the one given, where a context is the word before and the word after.
<br> ``` >>> text1.similar("monstrous") ```

In [9]:
text1.similar("monstrous")

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless


We can use this to compare how the same word is used differently in other texts.
<br> ``` >>> text2.similar("monstrous") ```

In [10]:
text2.similar("monstrous")

very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
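The notion of “same context” can be sketched in a few lines of plain Python. The version below is a simplification and not NLTK's algorithm: it only checks whether two words share at least one (word before, word after) pair and does no ranking. The name `similar_words` and the sample tokens are assumptions for illustration.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Return words that share at least one (previous, next) context with `word`."""
    contexts = defaultdict(set)  # word -> set of (previous token, next token) pairs
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts.get(word, set())
    return sorted(w for w, ctx in contexts.items() if w != word and ctx & target)

# Hypothetical sample: "monstrous" and "curious" both occur between "a" and "size"
tokens = "a monstrous size a curious size a very good day a monstrous fable".split()
print(similar_words(tokens, "monstrous"))
```

Here “curious” is reported because it appears between “a” and “size”, just as “monstrous” does.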


### Counting Vocabulary

Each text from the book has already been separated into a list of tokens; tokenization is one of
the first NLP processing steps. The tokens usually consist of words and all the punctuation and other
symbols occurring in the text. To further investigate text, we can count the occurrences of
words.
We start by using the Python length function, “len”, to tell us how many things are in a list.
(Strictly speaking, each text variable is an object of type nltk.text.Text, which contains the
text string and some other functions, but we’re trying not to explain much programming
here.)
<br> ``` >>> len(text3) ```
<br> ``` >>> len(text4) ```

In [11]:
len(text3)

44764

In [12]:
len(text4)

149797

Now this is the total number of tokens, and we might also want to find out how many
unique words there are, not counting repetitions. The Python “set” function removes the
repetitions, and we can apply the “sorted” function to that, returning the resulting sorted
list of tokens. If we type the following, lots of words will flash by on the screen.
<br> ``` >>> sorted(set(text3)) ```

In [13]:
sorted(set(text3))

['!',
 "'",
 '(',
 ')',
 ',',
 ',)',
 '.',
 '.)',
 ':',
 ';',
 ';)',
 '?',
 '?)',
 'A',
 'Abel',
 'Abelmizraim',
 'Abidah',
 'Abide',
 'Abimael',
 'Abimelech',
 'Abr',
 'Abrah',
 'Abraham',
 'Abram',
 'Accad',
 'Achbor',
 'Adah',
 'Adam',
 'Adbeel',
 'Admah',
 'Adullamite',
 'After',
 'Aholibamah',
 'Ahuzzath',
 'Ajah',
 'Akan',
 'All',
 'Allonbachuth',
 'Almighty',
 'Almodad',
 'Also',
 'Alvah',
 'Alvan',
 'Am',
 'Amal',
 'Amalek',
 'Amalekites',
 'Ammon',
 'Amorite',
 'Amorites',
 'Amraphel',
 'An',
 'Anah',
 'Anamim',
 'And',
 'Aner',
 'Angel',
 'Appoint',
 'Aram',
 'Aran',
 'Ararat',
 'Arbah',
 'Ard',
 'Are',
 'Areli',
 'Arioch',
 'Arise',
 'Arkite',
 'Arodi',
 'Arphaxad',
 'Art',
 'Arvadite',
 'As',
 'Asenath',
 'Ashbel',
 'Asher',
 'Ashkenaz',
 'Ashteroth',
 'Ask',
 'Asshur',
 'Asshurim',
 'Assyr',
 'Assyria',
 'At',
 'Atad',
 'Avith',
 'Baalhanan',
 'Babel',
 'Bashemath',
 'Be',
 'Because',
 'Becher',
 'Bedad',
 'Beeri',
 'Beerlahairoi',
 'Beersheba',
 'Behold',
 'Bela',
 'Belah

Or we can just find the length of that list.
<br> ``` >>> len(sorted(set(text3))) ```

In [14]:
len(sorted(set(text3)))

2789

Or we can print just the first 30 words in the sorted list:
<br> ``` >>> sorted(set(text3))[:30] ```

In [15]:
sorted(set(text3))[:30]

['!',
 "'",
 '(',
 ')',
 ',',
 ',)',
 '.',
 '.)',
 ':',
 ';',
 ';)',
 '?',
 '?)',
 'A',
 'Abel',
 'Abelmizraim',
 'Abidah',
 'Abide',
 'Abimael',
 'Abimelech',
 'Abr',
 'Abrah',
 'Abraham',
 'Abram',
 'Accad',
 'Achbor',
 'Adah',
 'Adam',
 'Adbeel',
 'Admah']
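The same three steps (count, de-duplicate and sort, slice) work on any Python list, so they can be tried without loading a corpus. The short token list below is made up for illustration.

```python
# Hypothetical token list; in the lab this would be text3
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

print(len(tokens))           # total number of tokens, repetitions included
vocab = sorted(set(tokens))  # unique tokens, in sorted order
print(len(vocab))            # number of distinct tokens
print(vocab[:3])             # slice: just the first three entries
```

Note that punctuation and capitalized words sort before lowercase words, which is why the listings above begin with symbols followed by capitalized names.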

Now let’s compute the ratio of the total number of tokens to the number of unique tokens,
which gives the average number of times each distinct token is used. In Python 3 the “/”
operator already uses real arithmetic (aka floating point); the import below is needed only
in Python 2, where “/” on two integers performs integer division. Then we divide to get the ratio.
<br> ``` >>> from __future__ import division ```
<br> ``` >>> len(text3) / len(set(text3)) ``` <br>
(On average, each word is used about 16 times.)

In [16]:
from __future__ import division

In [17]:
len(text3) / len(set(text3))

16.050197203298673
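If we expect to compute this ratio for several texts, it can be wrapped in a small function. This is a sketch for Python 3; the function name `average_repetitions` and the sample list are assumptions, not part of NLTK.

```python
def average_repetitions(tokens):
    """Ratio of total tokens to distinct tokens (average uses per distinct token)."""
    return len(tokens) / len(set(tokens))

# Hypothetical sample: 6 tokens, 4 of them distinct
sample = ["to", "be", "or", "not", "to", "be"]
print(average_repetitions(sample))  # 6 / 4 = 1.5
```

With the book data loaded, the call would be `average_repetitions(text3)`.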

Now let’s search for and count occurrences of particular words and compare that to the
total number of words.
<br> ``` >>> text3.count("smote") ```

In [18]:
text3.count("smote")

5

Compute the fraction of the number of occurrences of the word compared with the total
number of words and then multiply by 100 to get a percentage.
<br> ``` >>> 100 * text3.count('smote') / len(text3) ```

In [19]:
100 * text3.count('smote') / len(text3)

0.01116968992940756

How does this compare with a more common word, such as the word “a”?
<br> ``` >>> 100 * text3.count('a') / len(text3)```

In [20]:
100 * text3.count('a') / len(text3)

0.7640067911714771
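The percentage calculation can also be packaged as a helper so it is not retyped for every word. The name `percentage` and the sample list are assumptions for illustration; Python lists have a “count” method that can stand in for the text's “count” here.

```python
def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

# Hypothetical sample: "a" occurs 3 times among 8 tokens
sample = ["a", "rose", "is", "a", "rose", "is", "a", "rose"]
print(percentage(sample.count("a"), len(sample)))  # 37.5

# The figures reported above for text3: 5 occurrences of "smote" in 44764 tokens
print(percentage(5, 44764))
```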

### Try it Out:

1. How many times does the word “lol” occur in text5? What is the percentage of its
occurrences in the text? [Warning: text5 is uncensored chat]

In [21]:
#How many times does the word “lol” occur in text5
text5.count("lol")

704

In [22]:
#What is the percentage of its occurrences in the text
100 * (text5.count("lol") / len(text5))

1.5640968673628082

Think of another word to find occurrences and get the number of occurrences and its
percentage in the text. Save the word, the number of occurrences and its percentage in the
text to post at the end of class.

In [23]:
num = text5.count("book")
print(num)
pre =  100 * (text5.count("book") / len(text5))
print(pre)

open('result.txt','w').write('Number of time that word \"book\" has appeared {} , percentage {}%'.format(num,pre))

9
0.019995556542990445


81

### Processing Text

In the first part of this lab, we counted words from text that had already been tokenized, i.e.
separated into words. Now we’ll look at some text examples that we will need to tokenize.
In addition to the examples that we imported for the NLTK book above, the NLTK has a
number of other corpora, described in Chapter 2. In order to see these, type in

<br> ``` >>> import nltk ```

In [24]:
import nltk

You can then view some books obtained from the Gutenberg online book project:
<br> ``` >>> nltk.corpus.gutenberg.fileids() ```

In [25]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

For purposes of this lab, we will work with the first book, Jane Austen’s “Emma”. First, we
save the first fileid (number 0 in the list) into a variable named file1 so that we can reuse it:
<br> ``` >>> file1 = nltk.corpus.gutenberg.fileids()[0] ```
<br> ``` >>> file1 ```

In [26]:
file1 = nltk.corpus.gutenberg.fileids()[0]

In [27]:
file1

'austen-emma.txt'

We can get the original text, using the raw function:
<br> ``` >>> emmatext = nltk.corpus.gutenberg.raw(file1) ```
<br> ``` >>> len(emmatext) ```

In [28]:
emmatext = nltk.corpus.gutenberg.raw(file1)

In [29]:
len(emmatext)

887071

Since this is quite long, we can view just part of it, e.g. the first 120 characters:
<br> ``` >>> emmatext[:120] ```

In [30]:
emmatext[:120]

'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nan'

NLTK has several tokenizers available to break the raw text into tokens; we will use one
that separates by white space and also by special characters (punctuation):
<br> ``` >>> emmatokens = nltk.wordpunct_tokenize(emmatext) ```
<br> ``` >>> len(emmatokens) ```
<br> ``` >>> emmatokens[:50] ```

In [31]:
emmatokens = nltk.wordpunct_tokenize(emmatext)

In [32]:
len(emmatokens)

192427

In [33]:
emmatokens[:50]

['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty',
 '-',
 'one',
 'years',
 'in',
 'the']
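NLTK documents “wordpunct_tokenize” as splitting text with the regular expression `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. The sketch below applies that pattern with the standard “re” module so the behaviour can be explored without NLTK; the function name `simple_wordpunct` and the sample strings are made up for illustration.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters and runs of punctuation; drop whitespace."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_wordpunct("nearly twenty-one years; and, yes!"))
# Adjacent punctuation marks stay together as a single token
print(simple_wordpunct('Oh!"--'))
```

The second call shows why multi-character punctuation tokens can turn up in a vocabulary built from these tokens.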

We probably want to use the lowercase versions of the words:
<br> ``` >>> emmawords = [w.lower() for w in emmatokens] ```
<br> ``` >>> emmawords[:50] ```
<br> ``` >>> len(emmawords) ```

In [34]:
emmawords = [w.lower() for w in emmatokens]

In [35]:
emmawords[:50]

['[',
 'emma',
 'by',
 'jane',
 'austen',
 '1816',
 ']',
 'volume',
 'i',
 'chapter',
 'i',
 'emma',
 'woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty',
 '-',
 'one',
 'years',
 'in',
 'the']

In [36]:
len(emmawords)

192427
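Lowercasing leaves the number of tokens unchanged, as the two identical lengths show, but it does shrink the vocabulary, because Python compares strings case-sensitively. A small sketch with a made-up token list:

```python
# Hypothetical tokens with mixed capitalization
tokens = ["Emma", "emma", "The", "the", "THE", "and"]
lowered = [w.lower() for w in tokens]

print(len(tokens), len(lowered))            # same number of tokens
print(len(set(tokens)), len(set(lowered)))  # fewer distinct tokens after lowercasing
```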

We can further view the words by getting the unique words and sorting them:
<br> ``` >>> emmavocab = sorted(set(emmawords))```
<br> ``` >>> emmavocab[:50]```
<br>

The sorted vocabulary begins with punctuation tokens, so we will probably want to get rid of these
special characters: Regular Expressions to the Rescue! (as in the xkcd comic). We’ll work on that later.
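Until then, one simple alternative that needs no regular expressions is to keep only tokens made up entirely of letters, using the string method “isalpha”. This is just one possible approach, not necessarily the one the lab uses later, and the sample list is made up for illustration.

```python
# Hypothetical lowercased tokens, including punctuation and a number
words = ["emma", ",", "woodhouse", '!"--', "1816", "twenty", "-", "one", "."]

# Keep only purely alphabetic tokens, then de-duplicate and sort
alpha_vocab = sorted({w for w in words if w.isalpha()})
print(alpha_vocab)
```

Note that this filter also drops numbers such as “1816”, which may or may not be what we want.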

In [37]:
emmavocab = sorted(set(emmawords))

In [38]:
emmavocab[:50]

['!',
 '!"',
 '!"--',
 "!'",
 "!'--",
 '!)--',
 '!--',
 '!--"',
 '!--(',
 '!--`',
 '"',
 '"\'',
 '"--',
 '"`',
 '&',
 "'",
 "'--",
 "';",
 '(',
 ')',
 '),',
 ')--',
 ').',
 ').--',
 ');--',
 ',',
 ',"',
 ',"--',
 ",'",
 ',\'"',
 ',)',
 ',--',
 ',--"',
 '-',
 '--',
 '--"',
 '--(',
 '--,',
 '----',
 '----------,',
 "--------.'",
 '--.',
 '--."',
 "--.'",
 '--:',
 '--`',
 '.',
 '."',
 '."--',
 ".'"]