# 2. Accessing Text Corpora and Lexical Resources

In [17]:
from __future__ import division  # __future__ imports must come before any other statement
import nltk

#### Key Terms:

**Corpora** (singular: *corpus*): large bodies of linguistic data

**Text Corpus**: a large body of text

## 1.   Accessing Text Corpora



### 1.1   Gutenberg Corpus

In [4]:
nltk.corpus.gutenberg.fileids()

[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [12]:
# Load the tokens of "Emma" by Jane Austen
emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')

# Wrap the tokens in an nltk.Text object so we can call its .concordance() method
emma = nltk.Text(emma_words)
# Show occurrences of "surprize" together with their surrounding context
emma.concordance("surprize")

Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the my
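
To make the idea of a concordance concrete, here is a minimal plain-Python sketch that lists each occurrence of a word with a few tokens of context on either side. It is not how `nltk.Text.concordance` is implemented; the helper name `concordance_lines`, its `width` parameter, and the toy token list are assumptions made for illustration only.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side.

    `tokens` is expected to be a plain list of strings; matching ignores case.
    """
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# Hypothetical toy token list standing in for list(gutenberg.words(...))
sample_tokens = "You surprize me ! It was an agreeable surprize to her .".split()
matches = concordance_lines(sample_tokens, "surprize")
for line in matches:
    print(line)
```

With a real corpus you would pass something like `list(gutenberg.words('austen-emma.txt'))` as `tokens`.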

In [28]:
# the above is cumbersome, so we'll import gutenberg directly
from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words('austen-emma.txt')
emma

[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]

#### Create basic statistics for each gutenberg text

In [30]:
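# For each Gutenberg text, compute three ratios:
#   average word length (characters per token; num_chars also counts whitespace),
#   average sentence length (tokens per sentence),
#   lexical density (tokens per distinct lowercased word)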
for text in gutenberg.fileids():
    num_chars = len(gutenberg.raw(text)) 
    num_words = len(gutenberg.words(text))
    num_sent = len(gutenberg.sents(text))
    num_vocab = len(set(w.lower() for w in gutenberg.words(text)))
    print "Average Word Length: ", round(num_chars/num_words),"Average Sentence Length", round(num_words/num_sent), "Lexical Density: ", num_words/num_vocab,"Text: ", text

Average Word Length:  5.0 Average Sentence Length 25.0 Lexical Density:  26.2019335512 Text:  austen-emma.txt
Average Word Length:  5.0 Average Sentence Length 26.0 Lexical Density:  16.8245072836 Text:  austen-persuasion.txt
Average Word Length:  5.0 Average Sentence Length 28.0 Lexical Density:  22.1108855224 Text:  austen-sense.txt
Average Word Length:  4.0 Average Sentence Length 34.0 Lexical Density:  79.1614318164 Text:  bible-kjv.txt
Average Word Length:  5.0 Average Sentence Length 19.0 Lexical Density:  5.44234527687 Text:  blake-poems.txt
Average Word Length:  4.0 Average Sentence Length 19.0 Lexical Density:  14.102284264 Text:  bryant-stories.txt
Average Word Length:  4.0 Average Sentence Length 18.0 Lexical Density:  12.1635663887 Text:  burgess-busterbrown.txt
Average Word Length:  4.0 Average Sentence Length 20.0 Lexical Density:  12.940060698 Text:  carroll-alice.txt
Average Word Length:  5.0 Average Sentence Length 20.0 Lexical Density:  11.6371925615 Text:  chesterton

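* Average word length looks like a general property of English: it stays at 4 or 5 for every author here. Because `num_chars` also counts whitespace, these figures overstate the true average by roughly one character.
* Average sentence length and lexical density appear to be author-dependent. "Lexical density" here means tokens per distinct word, i.e. how many times each vocabulary item is used on average (the NLTK book calls this score lexical diversity).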
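
The three ratios can be checked by hand on a toy example. The sketch below assumes a hypothetical helper `text_stats` and made-up data standing in for `gutenberg.raw()` and `gutenberg.sents()`; it also computes the mean token length without whitespace, to show how much the raw character count inflates the "average word length".

```python
def text_stats(raw, sents):
    """Return (chars per token, tokens per sentence, tokens per vocabulary item)."""
    words = [w for sent in sents for w in sent]
    vocab = set(w.lower() for w in words)
    # float() keeps the division exact on both Python 2 and Python 3
    return (float(len(raw)) / len(words),
            float(len(words)) / len(sents),
            float(len(words)) / len(vocab))

# Hypothetical toy data: 27 characters, 8 tokens, 2 sentences, 5 distinct words
raw = "The cat sat . The dog sat ."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
avg_word, avg_sent, tokens_per_type = text_stats(raw, sents)

# Mean token length with the separating whitespace excluded, for comparison
all_words = [w for s in sents for w in s]
true_avg_word = float(sum(len(w) for w in all_words)) / len(all_words)

print(avg_word, avg_sent, tokens_per_type, true_avg_word)
```

On this sample the character-based figure is 3.375 while the whitespace-free mean is 2.5, the same kind of gap that affects the corpus figures above.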

In [37]:
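# Find the longest sentence(s) in each text
for text in gutenberg.fileids():
    text_sents = gutenberg.sents(text)
    # Length, in tokens, of the longest sentence in this text
    longest_len = max(len(s) for s in text_sents)
    # Keep every sentence that reaches that length (there may be ties)
    longest_sents = [s for s in text_sents if len(s) == longest_len]
    print longest_sents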

[[u'While', u'he', u'lived', u',', u'it', u'must', u'be', u'only', u'an', u'engagement', u';', u'but', u'she', u'flattered', u'herself', u',', u'that', u'if', u'divested', u'of', u'the', u'danger', u'of', u'drawing', u'her', u'away', u',', u'it', u'might', u'become', u'an', u'increase', u'of', u'comfort', u'to', u'him', u'.--', u'How', u'to', u'do', u'her', u'best', u'by', u'Harriet', u',', u'was', u'of', u'more', u'difficult', u'decision', u';--', u'how', u'to', u'spare', u'her', u'from', u'any', u'unnecessary', u'pain', u';', u'how', u'to', u'make', u'her', u'any', u'possible', u'atonement', u';', u'how', u'to', u'appear', u'least', u'her', u'enemy', u'?--', u'On', u'these', u'subjects', u',', u'her', u'perplexity', u'and', u'distress', u'were', u'very', u'great', u'--', u'and', u'her', u'mind', u'had', u'to', u'pass', u'again', u'and', u'again', u'through', u'every', u'bitter', u'reproach', u'and', u'sorrowful', u'regret', u'that', u'had', u'ever', u'surrounded', u'it', u'.--', u'Sh
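
Raw token lists like the one above are hard to read. As a sketch only, the hypothetical helper `longest_sentences` below returns the maximum length together with every sentence that reaches it, and the loop joins the tokens back into a string for display. The toy sentences are made up and stand in for `gutenberg.sents(...)`.

```python
def longest_sentences(sents):
    """Return (length, sentences) for the longest tokenized sentence(s); ties are all kept."""
    sents = list(sents)
    longest_len = max(len(s) for s in sents)
    return longest_len, [s for s in sents if len(s) == longest_len]

# Hypothetical toy sentences standing in for gutenberg.sents(...)
toy_sents = [["Call", "me", "Ishmael", "."],
             ["Whoa", "there", "!"],
             ["I", "am", "here", "."]]
length, longest = longest_sentences(toy_sents)
for sent in longest:
    print("%d tokens: %s" % (length, " ".join(sent)))
```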

### 1.2   Web and Chat Text

In [38]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
(u'grail.txt', u'SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop', '...')
(u'overheard.txt', u'White guy: So, do you have any plans for this evening?\nAsian girl', '...')
(u'pirates.txt', u"PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr", '...')
(u'singles.txt', u'25 SEXY MALE, seeks attrac older single lady, for discreet encoun', '...')
(u'wine.txt', u'Lovely delicate, fragrant Rhone wine. Polished leather and strawb', '...')


In [44]:
# NPS Chat corpus: instant-messaging chat sessions, originally collected for
# research on automatic detection of Internet predators
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
chatroom[123]

[u'i',
 u'do',
 u"n't",
 u'want',
 u'hot',
 u'pics',
 u'of',
 u'a',
 u'female',
 u',',
 u'I',
 u'can',
 u'look',
 u'in',
 u'a',
 u'mirror',
 u'.']
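
The file id used above appears to encode metadata: `10-19-20s_706posts.xml` reads as 706 posts from the 20s chatroom gathered on 10/19. The sketch below pulls those fields apart with a regular expression. The pattern and the helper name `parse_chat_fileid` are assumptions inferred from this one file name, not part of the NLTK API, so other file ids in the corpus should be checked before relying on it.

```python
import re

# Assumed layout: <month>-<day>-<chatroom>_<count>posts.xml
FILEID_PATTERN = re.compile(r'^(\d{2})-(\d{2})-(.+)_(\d+)posts\.xml$')

def parse_chat_fileid(fileid):
    """Split a chat file id into month, day, chatroom label, and post count."""
    match = FILEID_PATTERN.match(fileid)
    if match is None:
        raise ValueError("unexpected file id: %r" % fileid)
    month, day, room, count = match.groups()
    return {'month': int(month), 'day': int(day),
            'chatroom': room, 'num_posts': int(count)}

info = parse_chat_fileid('10-19-20s_706posts.xml')
print(info)
```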