# NLTK Examples

The following is a general walkthrough of chapter 1 of the NLTK book. The exercises are not identical to the original assignments, but they are similar.

In [31]:
from __future__ import division
import nltk
import numpy as np
import matplotlib
from nltk import *

# Loading Text Data to Use
* To load data to use, we'll first import the texts.

In [23]:
# Load all the texts so that we can use them in the future
from nltk.book import *

* Once the texts have been imported, we can analyze each one however we see fit.

In [4]:
# View possible texts
print("text1:", text1)
print("text2: ", text2)
print("text9: ", text9)

('text1:', <Text: Moby Dick by Herman Melville 1851>)
('text2: ', <Text: Sense and Sensibility by Jane Austen 1811>)
('text9: ', <Text: The Man Who Was Thursday by G . K . Chesterton 1908>)


# Searching Text
 
## Concordance
* Shows every occurrence of a given word, together with some context.

In [5]:
# View every occurrence of "monstrous" in Moby Dick
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u


In [6]:
# Search the Inaugural Address Corpus for "terror"
print(text4)
text4.concordance("terror")

<Text: Inaugural Address Corpus>
Displaying 8 of 8 matches:
menaces , by fraud or violence , by terror , intrigue , or venality , the Gove
ameless , unreasoning , unjustified terror which paralyzes needed efforts to c
ublic seemed frozen by a fatalistic terror , we proved that this is not true .
 to alter that uncertain balance of terror that stays the hand of mankind ' s 
eans freeing all Americans from the terror of runaway living costs . All must 
still . They fuel the fanaticism of terror . And they torment the lives of mil
d maintain a strong defense against terror and destruction . Our children will
k to advance their aims by inducing terror and slaughtering innocents , we say
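
To make the idea concrete, here is a minimal sketch of a concordance in plain Python. It is not how NLTK implements `concordance`; the function name, the `width` parameter, and the toy sentence are assumptions made for illustration. Like the output above, matching ignores case.

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of word."""
    word = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            # Up to `width` tokens on each side of the match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy token list (hypothetical, not taken from the corpus)
tokens = "the whale was of a most monstrous size and a monstrous bulk".split()
for left, tok, right in concordance(tokens, "monstrous"):
    print("%s [%s] %s" % (left, tok, right))
```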


## Similar
* Finds other words that appear in a similar range of contexts as a given word.

In [7]:
# Find words similar in context to "monstrous" in Moby Dick
text1.similar("monstrous")

imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate


In [8]:
# Find words similar in context to "monstrous" in Sense and Sensibility
text2.similar("monstrous")

very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast


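Clearly, the same word appears in different contexts in different texts: in Moby Dick, "monstrous" patterns with adjectives such as "horrible" and "lamentable", while in Sense and Sensibility it patterns with intensifiers such as "very" and "exceedingly".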
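
One way to think about "similar in context" is to record, for every word, the pairs of neighbouring words it appears between, and then rank other words by how many of those pairs they share with the target. The sketch below does this in plain Python on a toy token list. The scoring is an assumption made for illustration and may differ from what NLTK's `similar` actually computes.

```python
from collections import Counter, defaultdict

def similar(tokens, target):
    """Rank words by the number of (previous, next) contexts shared with target."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    target_ctx = ctx.get(target, set())
    scores = Counter()
    for word, seen in ctx.items():
        shared = len(seen & target_ctx)
        if word != target and shared:
            scores[word] = shared
    return [w for w, _ in scores.most_common()]

# Toy token list (hypothetical)
tokens = "a monstrous size a vast size a monstrous bulk a vast bulk a great size".split()
print(similar(tokens, "monstrous"))
```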

## common_contexts
* Allows for the examination of just the contexts that are shared by two or more words.

In [9]:
# as expected, the following result yields no common contexts
text1.common_contexts(["bad","good"])

No common contexts were found


In [10]:
text3.common_contexts(["pray","be"])

shall_for


In [11]:
# We can easily find common_contexts between words. 
# First - select a word and find similar words contextually.
text4.similar("power")

government freedom peace life country people war duty liberty congress
all union time justice duties prosperity confidence administration
right laws


In [44]:
# Then, view their common contexts
text4.common_contexts(["power","peace"])

of_, of_and that_and of_. its_, their_and the_which and_, the_of of_in
of_is the_that with_. in_and of_to for_, our_. and_of to_, of_that
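
Each entry in the output above has the form `previous_next`: the word before and the word after the target. A minimal plain-Python sketch of the same idea, under the assumption that a context is just that pair of neighbours (the helper names and the toy sentence are hypothetical):

```python
from collections import defaultdict

def contexts(tokens):
    """Map each word to the set of (previous, next) pairs it appears between."""
    table = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        table[word].add((prev, nxt))
    return table

def common_contexts(tokens, words):
    """Return the contexts shared by every word in words, sorted."""
    table = contexts(tokens)
    shared = set.intersection(*(table[w] for w in words))
    return sorted(shared)

tokens = "the power of the people and the peace of the nation".split()
for prev, nxt in common_contexts(tokens, ["power", "peace"]):
    print("%s_%s" % (prev, nxt))
```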


## dispersion_plot
* Plots the location of each occurrence of a word in the text, measured as the number of words from the beginning.

In [None]:
# Let's view where the words 'whale', 'monstrous', 'life', and 'death' occur across Moby Dick
text1.dispersion_plot(["whale","monstrous","life","death"])
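
The data behind such a plot is simply the list of positions (word offsets) at which each word occurs. The sketch below computes those offsets and draws a crude one-line text version of the plot. It does not use NLTK or matplotlib; the helper names and the toy token list are assumptions for illustration.

```python
def offsets(tokens, word):
    """Return the positions (counted from 0) at which word occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def dispersion_row(tokens, word):
    """Draw one row of a text dispersion plot: '|' marks an occurrence."""
    return "".join("|" if tok == word else "." for tok in tokens)

tokens = "call me ishmael the whale the sea the whale".split()
for word in ["whale", "sea"]:
    print("%-6s %s %s" % (word, dispersion_row(tokens, word), offsets(tokens, word)))
```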

## generate
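* Generates random text in the style of the given text. In the NLTK version used for this walkthrough, `Text` has no `generate` method, so the call below raises an error.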

In [72]:
text1.generate()

AttributeError: 'Text' object has no attribute 'generate'

# Counting Vocabulary

In [None]:
# To count the number of words and punctuation symbols in the text, simply call len()
len(text1)

### *Token*
A **token** is an important term that comes up frequently in NLP. It is simply a sequence of characters treated as a group. For example, _Jared_, _:_, _is_, and _:.(_ are all tokens.
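
As a rough illustration, a string can be split into word and punctuation tokens with a regular expression. This is only a sketch: the pattern below is an assumption, and the texts loaded from `nltk.book` were tokenized by their own rules, which may differ.

```python
import re

def tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(tokenize("Jared: is here!"))
```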

### Count unique words in corpus

In [8]:
# Number of unique words in Moby Dick
len(set(text1))

19317

In [15]:
# List the unique words in Moby Dick in sorted order
sorted(set(text1))

[u'!',
 u'!"',
 u'!"--',
 u"!'",
 u'!\'"',
 u'!)',
 u'!)"',
 u'!*',
 u'!--',
 u'!--"',
 u"!--'",
 u'"',
 u'"\'',
 u'"--',
 u'"...',
 u'";',
 u'$',
 u'&',
 u"'",
 u"',",
 u"',--",
 u"'-",
 u"'--",
 u"';",
 u'(',
 u')',
 u'),',
 u')--',
 u').',
 u').--',
 u'):',
 u');',
 u');--',
 u'*',
 u',',
 u',"',
 u',"--',
 u",'",
 u",'--",
 u',)',
 u',*',
 u',--',
 u',--"',
 u",--'",
 u'-',
 u'--',
 u'--"',
 u"--'",
 u'--\'"',
 u'--(',
 u'---"',
 u'---,',
 u'.',
 u'."',
 u'."*',
 u'."--',
 u".'",
 u'.\'"',
 u'.)',
 u'.*',
 u'.*--',
 u'.,',
 u'.--',
 u'.--"',
 u'...',
 u'....',
 u'.]',
 u'000',
 u'1',
 u'10',
 u'100',
 u'101',
 u'102',
 u'103',
 u'104',
 u'105',
 u'106',
 u'107',
 u'108',
 u'109',
 u'11',
 u'110',
 u'111',
 u'112',
 u'113',
 u'114',
 u'115',
 u'116',
 u'117',
 u'118',
 u'119',
 u'12',
 u'120',
 u'121',
 u'122',
 u'123',
 u'124',
 u'125',
 u'126',
 u'127',
 u'128',
 u'129',
 u'13',
 u'130',
 u'131',
 u'132',
 u'133',
 u'134',
 u'135',
 u'14',
 u'144',
 u'1492',
 u'15',
 u'150',
 u'15

### Analyzing Particular Tokens

In [25]:
# Define functions to analyze tokens
# (these rely on true division, enabled above by `from __future__ import division`)

def lexical_diversity(text):
    # Computes the lexical diversity of a text: the ratio of unique tokens to total tokens
    # text: a sequence of tokens (e.g. an nltk Text)
    return len(set(text)) / len(text)

def percentage(count, total):
    # Computes count as a percentage of total
    return 100 * count / total

def word_percentage(word, corpus):
    # Computes the percentage of tokens in corpus that are equal to word
    # word: the token to count
    # corpus: a sequence of tokens (e.g. an nltk Text)
    return percentage(corpus.count(word), len(corpus))

In [17]:
# Fraction of the tokens in Moby Dick that are "whale"
text1.count("whale") / len(text1)

0.003473673313677301

In [21]:
# Proportion of text taken up by specific word 
100 * text1.count("whale")/len(text1)

0.3473673313677301
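
As a cross-check, the percentage can be recomputed by hand from two figures reported elsewhere in this walkthrough: 906 occurrences of "whale" and 260819 tokens in total. A minimal sketch in plain Python 3, where `/` is already true division:

```python
count = 906      # occurrences of "whale" reported by the frequency distribution
total = 260819   # total number of tokens reported for Moby Dick

proportion = count / total       # fraction of all tokens
percent = 100 * count / total    # the same figure as a percentage
print(proportion, percent)
```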

In [26]:
# Count percentage of "lol" use in text5
word_percentage("lol", text5)

1.5640968673628082

In [42]:
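# Lexical diversity of texts
# Only the value of the last expression in a cell is displayed,
# so the output below is the lexical diversity of text2
lexical_diversity(text1)
lexical_diversity(text2)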

0.04826383002768831

Python list indices start at zero. The moment Python accesses the content of a list from the computer's memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.
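
A short sketch of zero-based indexing on a small list of tokens (the list itself is a made-up example):

```python
sent = ["Call", "me", "Ishmael", "."]

first = sent[0]      # zero steps forward: the first element
third = sent[2]      # two steps forward: the third element
last = sent[-1]      # negative indices count from the end
middle = sent[1:3]   # a slice includes index 1 and stops before index 3

print(first, third, last, middle)
```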

# Computing with Language: Simple Statistics

### Frequency

In [85]:
# Create a frequency distribution of the words in Moby Dick
fdist1 = FreqDist(text1)
print(fdist1)

<FreqDist with 19317 samples and 260819 outcomes>


In [90]:
# Find the 10 most common words in the frequency distribution
fdist1.most_common(10)

[(u',', 18713),
 (u'the', 13721),
 (u'.', 6862),
 (u'of', 6536),
 (u'and', 6024),
 (u'a', 4569),
 (u'to', 4542),
 (u';', 4072),
 (u'in', 3916),
 (u'that', 2982)]

In [92]:
# Count the number of times "whale" appears
print(fdist1['whale'])

906
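
A frequency distribution behaves much like a dictionary that maps each token to its count. The sketch below reproduces the basic operations with `collections.Counter` from the standard library on a toy token list; it is an illustration of the idea rather than a substitute for `FreqDist`.

```python
from collections import Counter

tokens = "the whale , the sea , and the ship".split()
fdist = Counter(tokens)

top_two = fdist.most_common(2)                  # most frequent tokens first
whale_count = fdist["whale"]                    # count of a single token
missing_count = fdist["harpoon"]                # unseen tokens count as 0
the_freq = fdist["the"] / sum(fdist.values())   # relative frequency

print(top_two, whale_count, missing_count, the_freq)
```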


In [None]:
fdist1.plot(40)

* **Hapaxes**: Words that occur only once in the text.

In [None]:
fdist1.hapaxes()
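
A hapax is a token that occurs exactly once. A minimal sketch of how such tokens can be picked out of a table of counts, using `collections.Counter` and a toy token list (both are assumptions for illustration, not NLTK's implementation):

```python
from collections import Counter

def hapaxes(tokens):
    """Return the tokens that occur exactly once, in sorted order."""
    counts = Counter(tokens)
    return sorted(tok for tok, n in counts.items() if n == 1)

tokens = "the whale , the sea , and the ship".split()
print(hapaxes(tokens))
```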

In [15]:
# View very long words in Sense and Sensibility
V = set(text2)
long_words = [w for w in V if len(w) > 15]
long_words

[u'incomprehensible',
 u'disqualifications',
 u'disinterestedness',
 u'companionableness']

In [18]:
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 10)

[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u'........',
 u'actually',
 u'everyone',
 u'listening',
 u'seriously',
 u'something']

## Collocations and Bigrams

* **Collocation**: A sequence of words that occur together unusually often. Thus _red wine_ is a collocation, whereas _the wine_ is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, _maroon wine_ sounds odd.

In [32]:
list(bigrams(['more','is','said','than','done']))

[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

In [33]:
text5.collocations()

wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART;
cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys
wanna; song lasts; last night; ACTION sits; -...)...- S.M.R.; Lime
Player; Player 12%; dont know; lez gurls; long time
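
Bigrams are just adjacent pairs of tokens, and the simplest way to look for collocation candidates is to count them. The sketch below does that in plain Python. Note the assumption: it ranks by raw counts only, whereas a real collocation finder also accounts for how frequent each individual word is, so frequent pairs such as _the wine_ would need to be filtered out separately.

```python
from collections import Counter

def simple_bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

print(simple_bigrams(["more", "is", "said", "than", "done"]))

# Count the bigrams in a toy token list (hypothetical)
tokens = "red wine and red wine with the wine".split()
bigram_counts = Counter(simple_bigrams(tokens))
print(bigram_counts.most_common(1))
```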


# Counting Other Things

In [35]:
# Create Frequency Distribution of word length
fdist_word1 = FreqDist([len(w) for w in text2])
fdist_word1

Counter({1: 23009,
         2: 24826,
         3: 28839,
         4: 21352,
         5: 11438,
         6: 9507,
         7: 8158,
         8: 5676,
         9: 3736,
         10: 2596,
         11: 1278,
         12: 711,
         13: 334,
         14: 87,
         15: 24,
         16: 2,
         17: 3})

In [40]:
fdist_word1.freq(3)

0.20369977962366503

In [39]:
fdist_word1.most_common()

[(3, 28839),
 (2, 24826),
 (1, 23009),
 (4, 21352),
 (5, 11438),
 (6, 9507),
 (7, 8158),
 (8, 5676),
 (9, 3736),
 (10, 2596),
 (11, 1278),
 (12, 711),
 (13, 334),
 (14, 87),
 (15, 24),
 (17, 3),
 (16, 2)]
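
The relative frequency returned by `freq` is a count divided by the total number of outcomes. As a sanity check, the sketch below recomputes the value for word length 3 from the counts listed above, using plain Python 3 and no NLTK:

```python
# Word-length counts copied from the output above
length_counts = {
    1: 23009, 2: 24826, 3: 28839, 4: 21352, 5: 11438, 6: 9507,
    7: 8158, 8: 5676, 9: 3736, 10: 2596, 11: 1278, 12: 711,
    13: 334, 14: 87, 15: 24, 16: 2, 17: 3,
}

total = sum(length_counts.values())   # total number of tokens
freq_3 = length_counts[3] / total     # share of tokens with length 3
most_common_length = max(length_counts, key=length_counts.get)

print(total, freq_3, most_common_length)
```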