# Text Processing

We will use the NLTK library to illustrate basic text processing tasks such as tokenization, lemmatization, and stop-word removal.


In [1]:
import nltk

## Text Corpora

A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.  

Let's start by loading the Gutenberg corpus. NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by calling nltk.corpus.gutenberg.fileids() to list the file identifiers in this corpus:

[Reference](https://www.sketchengine.eu/gutenberg-corpora-2020/)

In [2]:
nltk.download('gutenberg')
nltk.download('punkt')
from nltk.corpus import gutenberg
gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /Users/pmui/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/pmui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The first text is Emma by Jane Austen. How many words does it contain? Note that words() returns tokens, so punctuation marks are counted as well:

In [3]:
emma = gutenberg.words('austen-emma.txt')
len(emma)

192427
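To make the difference between tokens and words concrete, here is a minimal sketch on a small hand-written token list in the style that words() returns. The list is an illustrative assumption, not data read from the corpus, and filtering with str.isalpha() is just one simple way to skip punctuation.

```python
# Hand-written tokens in the style returned by gutenberg.words();
# this is illustrative data, not read from the corpus.
tokens = ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', '.']

num_tokens = len(tokens)                           # punctuation tokens are included
num_alpha = sum(1 for t in tokens if t.isalpha())  # alphabetic tokens only

print(num_tokens, num_alpha)
```

On the real text, the same filter applied to emma would give a word count somewhat below the token count shown above.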

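Let's print summary statistics for every text in the Gutenberg corpus by looping over the file identifiers listed earlier. For each text we compute three ratios: average word length (characters per word), average sentence length (words per sentence), and the average number of times each vocabulary item appears in the text (words per distinct lowercased word).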

In [11]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))      # characters, including whitespace
    num_words = len(gutenberg.words(fileid))    # tokens, including punctuation
    num_sents = len(gutenberg.sents(fileid))    # sentences
    # distinct tokens after case folding
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print("chars/word, words/sent, words/vocab")
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

chars/word, words/sent, words/vocab
5 25 26 austen-emma.txt
chars/word, words/sent, words/vocab
5 26 17 austen-persuasion.txt
chars/word, words/sent, words/vocab
5 28 22 austen-sense.txt
chars/word, words/sent, words/vocab
4 34 79 bible-kjv.txt
chars/word, words/sent, words/vocab
5 19 5 blake-poems.txt
chars/word, words/sent, words/vocab
4 19 14 bryant-stories.txt
chars/word, words/sent, words/vocab
4 18 12 burgess-busterbrown.txt
chars/word, words/sent, words/vocab
4 20 13 carroll-alice.txt
chars/word, words/sent, words/vocab
5 20 12 chesterton-ball.txt
chars/word, words/sent, words/vocab
5 23 11 chesterton-brown.txt
chars/word, words/sent, words/vocab
5 19 11 chesterton-thursday.txt
chars/word, words/sent, words/vocab
4 21 25 edgeworth-parents.txt
chars/word, words/sent, words/vocab
5 26 15 melville-moby_dick.txt
chars/word, words/sent, words/vocab
5 52 11 milton-paradise.txt
chars/word, words/sent, words/vocab
4 12 9 shakespeare-caesar.txt
chars/word, words/sent, words/vocab
4 12 8 
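To see what the three ratios measure without downloading any corpus, the sketch below recomputes them on a tiny made-up text. The variables raw and sents are hypothetical stand-ins for what gutenberg.raw() and gutenberg.sents() would return; they are not corpus data.

```python
# Hypothetical stand-ins for gutenberg.raw(fileid) and gutenberg.sents(fileid).
raw = "The cat sat. The cat ran away."
sents = [['The', 'cat', 'sat', '.'], ['The', 'cat', 'ran', 'away', '.']]
words = [w for s in sents for w in s]   # flatten sentences into one token list

num_chars = len(raw)                                # includes spaces and punctuation
num_words = len(words)                              # includes punctuation tokens
num_sents = len(sents)
num_vocab = len(set(w.lower() for w in words))      # distinct case-folded tokens

avg_word_len = num_chars / num_words     # chars/word
avg_sent_len = num_words / num_sents     # words/sent
avg_reuse = num_words / num_vocab        # words/vocab

print(f"{avg_word_len:.2f} {avg_sent_len:.2f} {avg_reuse:.2f}")
```

Because num_chars counts whitespace, the chars/word figure overstates the true average word length. Also note that Python 3's round() rounds exact halves to the nearest even integer, so round(4.5) evaluates to 4, which matters when reading the rounded table above.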

The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces and line breaks between words.

In [17]:
len(gutenberg.raw('blake-poems.txt'))

38153
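As a rough illustration of what a raw character count is made of, the sketch below splits the length of a short string into letters, whitespace, and everything else. The two-line string is a hand-typed sample used as an assumption here, not text loaded through gutenberg.raw().

```python
# Hand-typed two-line sample; a stand-in for the string gutenberg.raw() returns.
raw_text = "Piping down the valleys wild,\nPiping songs of pleasant glee,"

total_chars = len(raw_text)
letters = sum(1 for c in raw_text if c.isalpha())
whitespace = sum(1 for c in raw_text if c.isspace())   # spaces and the newline
other = total_chars - letters - whitespace             # punctuation here

print(total_chars, letters, whitespace, other)
```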

The sents() function divides the text up into its sentences, where each sentence is a list of words:

In [12]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

In [13]:
macbeth_sentences[1116]

['Double',
 ',',
 'double',
 ',',
 'toile',
 'and',
 'trouble',
 ';',
 'Fire',
 'burne',
 ',',
 'and',
 'Cauldron',
 'bubble']

In [15]:
longest_len = max(len(s) for s in macbeth_sentences)
longest_len

158
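The longest sentence itself can also be fetched in one pass with max() and key=len. The sketch below shows this on a small hand-written list of tokenized sentences (an assumption for illustration, not the output of gutenberg.sents()). Note that max() returns only the first sentence of maximal length, whereas a filter on the length returns every tie.

```python
# Hand-written tokenized sentences; a stand-in for gutenberg.sents(...).
sentences = [
    ['Actus', 'Primus', '.'],
    ['When', 'shall', 'we', 'three', 'meet', 'againe', '?'],
    ['Thunder', 'and', 'Lightning', '.'],
]

longest = max(sentences, key=len)      # first sentence with the most tokens
longest_len = len(longest)

# Filtering by length keeps all sentences that tie for the maximum.
all_longest = [s for s in sentences if len(s) == longest_len]

print(longest_len, ' '.join(longest))
```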

In [16]:
[s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  '(',
  'Worthie',
  'to',
  'be',
  'a',
  'Rebell',
  ',',
  'for',
  'to',
  'that',
  'The',
  'multiplying',
  'Villanies',
  'of',
  'Nature',
  'Doe',
  'swarme',
  'vpon',
  'him',
  ')',
  'from',
  'the',
  'Westerne',
  'Isles',
  'Of',
  'Kernes',
  'and',
  'Gallowgrosses',
  'is',
  'supply',
  "'",
  'd',
  ',',
  'And',
  'Fortune',
  'on',
  'his',
  'damned',
  'Quarry',
  'smiling',
  ',',
  'Shew',
  "'",
  'd',
  'like',
  'a',
  'Rebells',
  'Whore',
  ':',
  'but',
  'all',
  "'",
  's',
  'too',
  'weake',
  ':',
  'For',
  'braue',
  'Macbeth',
  '(',
  'well',
  'hee',
  'deserues',
  'that',
  'Name',
  ')',
  'Disdayning',
  'Fortune',
  ',',
  'with',
  'his',
  'brandisht',
  'Steele',
  ',',
  'Which',
  'smoak',
  "'",
  'd',
 

## Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, etc.  A complete list of genres for the Brown Corpus can be found at: http://icame.uib.no/brown/bcm-los.html.

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

In [19]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

In [28]:
news_text = brown.words(categories='news')

# let's find the frequency of words within a text
news_dist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', news_dist[m], end=' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 
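nltk.FreqDist is a subclass of Python's collections.Counter, so the counting step can be sketched with the standard library alone. The token list below is made up for illustration; it also shows how lowercasing merges capitalized and lowercase forms into one count.

```python
from collections import Counter

# Made-up tokens; a stand-in for brown.words(categories='news').
tokens = ['The', 'jury', 'said', 'it', 'will', 'meet', '.',
          'Will', 'may', 'attend', ',', 'but', 'May', 'is', 'busy', '.']

folded = Counter(w.lower() for w in tokens)   # case-insensitive counts
exact = Counter(tokens)                       # case-sensitive counts

modals = ['can', 'could', 'may', 'might', 'must', 'will']
counts = {m: folded[m] for m in modals}       # missing keys count as 0

for m in modals:
    print(m + ':', counts[m], end=' ')
print()
print('may (case-sensitive):', exact['may'])
```

In this toy list, 'may' is counted twice after lowercasing but only once when case is kept, because 'May' is a different token.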

In [31]:
five_w = ['what', 'when', 'where', 'who', 'why']
for f in five_w:
    print(f + ':', news_dist[f], end=' ')

what: 95 when: 169 where: 59 who: 268 why: 14 

In [30]:
fiction_text = brown.words(categories='fiction')
fiction_dist = nltk.FreqDist(w.lower() for w in fiction_text)
for m in modals:
    print(m + ':', fiction_dist[m], end=' ')

can: 39 could: 168 may: 10 might: 44 must: 55 will: 56 

In [32]:
for f in five_w:
    print(f + ':', fiction_dist[f], end=' ')

what: 186 when: 192 where: 89 who: 112 why: 42 

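We would like to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. Unlike the FreqDist examples above, the words here are not lowercased, so the counts are case-sensitive and can come out lower: for example, 'may' in news drops from 93 to 66 because capitalized occurrences such as 'May' are counted separately.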

In [33]:
cfd = nltk.ConditionalFreqDist((genre, word)
                                for genre in brown.categories()
                                for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

In [34]:
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 


In [35]:
cfd.tabulate(conditions=genres, samples=five_w)

                 what  when where   who   why 
           news    76   128    58   268     9 
       religion    64    53    20   100    14 
        hobbies    78   119    72   103    10 
science_fiction    27    21    10    13     4 
        romance   121   126    54    89    34 
          humor    36    52    15    48     9 
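Conceptually, a conditional frequency distribution is one frequency distribution per condition, built from (condition, sample) pairs. The sketch below mimics that idea with a defaultdict of Counter objects; it is a simplified stand-in and not NLTK's implementation, and the pairs are made up rather than taken from the Brown Corpus.

```python
from collections import Counter, defaultdict

# Made-up (genre, word) pairs; the real ones come from brown.words(categories=genre).
pairs = [('news', 'will'), ('news', 'can'), ('news', 'will'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'will')]

cfd = defaultdict(Counter)            # one Counter per condition (genre)
for genre, word in pairs:
    cfd[genre][word] += 1

conditions = ['news', 'romance']
samples = ['can', 'could', 'will']
rows = {g: [cfd[g][w] for w in samples] for g in conditions}

# Print a small table similar in layout to cfd.tabulate().
print(' ' * 10, *(f"{w:>5}" for w in samples))
for g in conditions:
    print(f"{g:>10}", *(f"{n:>5}" for n in rows[g]))
```

Each row is read like the tables above: the condition on the left, then the count of each requested sample under that condition.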
