# 2. Accessing Text Corpora and Lexical Resources

## Goals

1. What are some useful text corpora and lexical resources, and how can we access them with Python?
2. Which Python constructs are most helpful for this work?
3. How do we avoid repeating ourselves when writing Python code?

### 1. Accessing text corpora

Aren't we lucky? NLTK provides a small sample of Project Gutenberg texts.

In [1]:
# Let's import like an expert
from nltk.corpus import gutenberg


In [2]:
# import nltk

In [3]:
# Get the file ids so we can reference a book in the corpus
# nltk.corpus.gutenberg.fileids()

gutenberg.fileids()

For my tastes, Persuasion is the best Austen novel, so we'll use it here instead of Emma, which is the text the NLTK book uses in its example.

In [4]:
# Count the tokens (words and punctuation marks) in 'austen-persuasion.txt'
persuasion_words = gutenberg.words('austen-persuasion.txt')
len(persuasion_words)

98171

Next, we'll write a small program that loops over `gutenberg.fileids()` and prints three statistics for each text: average word length (characters per word), average sentence length (words per sentence), and lexical diversity (how many times each vocabulary item appears on average).

In [5]:
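for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # every character, spaces included
    num_words = len(gutenberg.words(fileid))  # tokens, punctuation included
    num_sents = len(gutenberg.sents(fileid))
    # Lowercase before deduplicating so 'The' and 'the' count as one item
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # Columns: chars per word, words per sentence, words per vocabulary item
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)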

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 19 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
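
One way to avoid repeating this calculation is to wrap it in a function. The sketch below is only an illustration: `text_stats` is a hypothetical helper name, not part of NLTK, and the toy data stands in for the results of `gutenberg.raw()`, `gutenberg.words()` and `gutenberg.sents()` so that the block runs without downloading a corpus.

```python
def text_stats(raw, words, sents):
    """Return (chars per word, words per sentence, words per vocabulary item)."""
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    # Lowercase before deduplicating so 'The' and 'the' count as one item.
    num_vocab = len(set(w.lower() for w in words))
    return (round(num_chars / num_words),
            round(num_words / num_sents),
            round(num_words / num_vocab))

# Hand-made toy data (an assumption for illustration, not corpus output).
toy_raw = "The cat sat. The dog sat."
toy_sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
toy_words = [w for s in toy_sents for w in s]

stats = text_stats(toy_raw, toy_words, toy_sents)
print(stats)
```

With the corpus installed, the same function could be called inside the loop as `text_stats(gutenberg.raw(fileid), gutenberg.words(fileid), gutenberg.sents(fileid))`.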


Observe that average word length appears to be a general property of English: the rounded value is 4 or 5 for every text. (The true average is about one character lower, since the `num_chars` variable also counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.

Notice that `gutenberg.raw()` returns the text as a single string, so taking `len()` of it counts every character, including spaces.
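
To see how much the spaces matter, here is a minimal sketch that computes the average both ways on one sentence. The regular-expression tokenizer is a crude stand-in chosen for illustration; it is an assumption, not how the NLTK corpus reader tokenizes.

```python
import re

sample = "It is a truth universally acknowledged."

# Crude tokenizer (an assumption): runs of word characters,
# or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sample)

# Same calculation as num_chars / num_words: spaces are counted.
chars_per_token = len(sample) / len(tokens)

# Count only the characters that belong to tokens: spaces are excluded.
letters_per_token = sum(len(t) for t in tokens) / len(tokens)

print(round(chars_per_token, 2), round(letters_per_token, 2))
```

The first figure is always at least as large as the second, because every space in the raw string is added to the numerator without adding a token to the denominator.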

Now let's take a look at `gutenberg.sents()`, which returns each sentence as a list of tokens.

In [16]:
persuasion_sentences = gutenberg.sents('austen-persuasion.txt')
persuasion_sentences[10]

['"',
 'Heir',
 'presumptive',
 ',',
 'William',
 'Walter',
 'Elliot',
 ',',
 'Esq',
 '.,',
 'great',
 'grandson',
 'of',
 'the',
 'second',
 'Sir',
 'Walter',
 '."']

In [18]:
# Length, in tokens, of the longest sentence
longest_len = max(len(s) for s in persuasion_sentences)

In [21]:
# Collect every sentence of that length (there may be ties), hence a list of lists
longest_sentence = [s for s in persuasion_sentences if len(s) == longest_len]
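
If a single longest sentence is enough, the built-in `max()` with `key=len` is an alternative that needs only one pass over the data. The sketch below uses a small hand-made list (an assumption for illustration) in place of `persuasion_sentences`.

```python
# Toy tokenized sentences standing in for gutenberg.sents(...)
toy_sentences = [
    ["Short", "."],
    ["A", "somewhat", "longer", "sentence", "."],
    ["Mid", "length", "."],
]

# max() compares the sentences by their length and returns the sentence itself.
longest = max(toy_sentences, key=len)
print(longest)
```

Unlike the list comprehension, `max()` returns only the first sentence it finds when several share the maximum length, so the two approaches differ when there are ties.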

In [22]:
longest_sentence

[['For',
  ',',
  'though',
  'shy',
  ',',
  'he',
  'did',
  'not',
  'seem',
  'reserved',
  ';',
  'it',
  'had',
  'rather',
  'the',
  'appearance',
  'of',
  'feelings',
  'glad',
  'to',
  'burst',
  'their',
  'usual',
  'restraints',
  ';',
  'and',
  'having',
  'talked',
  'of',
  'poetry',
  ',',
  'the',
  'richness',
  'of',
  'the',
  'present',
  'age',
  ',',
  'and',
  'gone',
  'through',
  'a',
  'brief',
  'comparison',
  'of',
  'opinion',
  'as',
  'to',
  'the',
  'first',
  '-',
  'rate',
  'poets',
  ',',
  'trying',
  'to',
  'ascertain',
  'whether',
  'Marmion',
  'or',
  'The',
  'Lady',
  'of',
  'the',
  'Lake',
  'were',
  'to',
  'be',
  'preferred',
  ',',
  'and',
  'how',
  'ranked',
  'the',
  'Giaour',
  'and',
  'The',
  'Bride',
  'of',
  'Abydos',
  ';',
  'and',
  'moreover',
  ',',
  'how',
  'the',
  'Giaour',
  'was',
  'to',
  'be',
  'pronounced',
  ',',
  'he',
  'showed',
  'himself',
  'so',
  'intimately',
  'acquainted',
  'with',
 