In [None]:
%pprint
import nltk
import re
nltk.download() # click on 'all' when prompted in GUI

### Playing with text in NLTK

Load some example texts from NLTK.

In [None]:
from nltk.book import *

`text1` for example is Moby Dick.

In [None]:
text1

And `text2` is Sense and Sensibility.

In [None]:
text2

We can look at how long these texts are.

In [None]:
len(text1)

In [None]:
len(text2)

Or how many unique words they contain.

In [None]:
len(set(text1))

In [None]:
len(set(text2))
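
The ratio of unique words to total words is often called lexical diversity. The sketch below shows the idea in plain Python on a made-up token list (`toy_tokens` is hypothetical, not one of the NLTK texts). It also shows that `set` treats `'The'` and `'the'` as different items and counts punctuation tokens, so some normalisation may be wanted first.

```python
# Hypothetical token list standing in for an NLTK Text, which behaves like a list of tokens.
toy_tokens = ['The', 'whale', 'saw', 'the', 'ship', ',', 'and',
              'the', 'ship', 'saw', 'the', 'whale', '.']

total = len(toy_tokens)          # all tokens, punctuation included
unique = len(set(toy_tokens))    # distinct tokens; 'The' and 'the' count separately

# Normalise case and drop punctuation before counting distinct words.
words = [t.lower() for t in toy_tokens if t.isalpha()]
unique_words = len(set(words))

lexical_diversity = unique_words / len(words)
print(total, unique, unique_words, round(lexical_diversity, 2))
```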

We can also look at the most common words in these texts.

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist2 = FreqDist(text2)

Unsurprisingly, the most common words are not the most interesting ones.

Commonly used words that add little to the meaning of a text are called *stop words*. When working with language, they are often removed at the beginning of the process.

In [None]:
fdist1.most_common(10)

In [None]:
fdist2.most_common(10)
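
Stop word removal can be sketched with `collections.Counter`, which `FreqDist` builds on in NLTK 3. The token list and the tiny stop list below are made up for illustration. NLTK ships fuller lists in `nltk.corpus.stopwords`, which this sketch deliberately does not use so that it runs without any downloaded data.

```python
from collections import Counter

# Hypothetical tokens and a small hand-written stop list (assumptions for illustration).
toy_tokens = ['the', 'whale', 'and', 'the', 'sea', ',',
              'the', 'whale', 'of', 'the', 'sea', '.']
toy_stop_words = {'the', 'and', 'of', 'a', 'in', 'to'}

counts_all = Counter(toy_tokens)

# Keep alphabetic tokens that are not in the stop list.
counts_content = Counter(
    t for t in toy_tokens
    if t.isalpha() and t.lower() not in toy_stop_words
)

print(counts_all.most_common(2))
print(counts_content.most_common(2))
```

Once the stop words and punctuation are filtered out, the content words rise to the top of the ranking.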

There are some pretty cool things you can do with the NLTK `Text` type (`text1` and `text2` here are of the type `Text`).

You can, for example, get every occurrence of a chosen word together with some context.

In [None]:
text1.concordance('monstrous')

In [None]:
text2.concordance('monstrous')
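
To see what a concordance does, here is a minimal keyword-in-context sketch in plain Python. The helper `toy_concordance` and the sample tokens are hypothetical. It measures context in tokens, whereas NLTK's `concordance` lays its output out by character width, so treat this as an illustration of the idea rather than NLTK's implementation.

```python
def toy_concordance(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context on both sides."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:                 # case-insensitive match
            left = tokens[max(0, i - width):i]    # clamp at the start of the text
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [tok] + right))
    return lines

# Hypothetical sample text.
toy_tokens = 'a most monstrous size , yet a monstrous pretty girl indeed'.split()
for line in toy_concordance(toy_tokens, 'monstrous'):
    print(line)
```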

We see that Herman Melville mostly used `monstrous` in a negative context, while Jane Austen mostly used it for (positive) emphasis.

This becomes clearer when we look at words that are used similarly to `monstrous` in the two texts.

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

We can also look at the context that is shared by two words.

In [None]:
text2.common_contexts(["monstrous", "very"])
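
Both `similar` and `common_contexts` build on the idea that words used in the same surroundings tend to play similar roles. The sketch below assumes a context is simply the pair (previous word, next word). The helper `contexts_by_word` and the sample text are hypothetical, and NLTK's own implementation may differ in its details.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) words it occurs between."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((prev.lower(), nxt.lower()))
    return ctx

# Hypothetical sample text.
toy_tokens = ('a very pretty girl and a monstrous pretty house '
              'and a very good man').split()
ctx = contexts_by_word(toy_tokens)

# Contexts in which both words appear.
shared = ctx['monstrous'] & ctx['very']
print(shared)
```

Here both words occur between `a` and `pretty`, which is the kind of overlap that suggests `monstrous` is working as an intensifier.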

Let's now try to find words that commonly co-occur in Moby Dick and in the Inaugural Address Corpus.

In [None]:
text1.collocations()

In [None]:
text4.collocations()
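
A rough approximation of collocation finding is to count adjacent word pairs (bigrams) and ignore pairs that contain a stop word. The sketch below does only that, on made-up tokens with a made-up stop list. NLTK ranks candidates with a statistical association measure rather than raw counts, so this is a simplification and not the method behind `collocations()`.

```python
from collections import Counter

# Hypothetical tokens and stop list, for illustration only.
toy_tokens = ('the sperm whale and the white whale met the sperm whale '
              'near the white whale').split()
toy_stop_words = {'the', 'and', 'near'}

# Pair every token with its successor.
bigrams = zip(toy_tokens, toy_tokens[1:])

# Keep only pairs in which neither word is a stop word.
content_bigrams = Counter(
    (w1, w2) for w1, w2 in bigrams
    if w1 not in toy_stop_words and w2 not in toy_stop_words
)
print(content_bigrams.most_common(2))
```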

### Working with text in Python

For the example texts we loaded above, `sentN` variables hold the first sentence in each text.

In [None]:
sent1

In [None]:
sent2

We can join the sentence into a string like this.

In [None]:
' '.join(sent1)

And append sentences like this. This works both when a sentence is in the list format and when it is in the string format.

In [None]:
sent1 + sent2

In [None]:
' '.join(sent1) + ' ' + ' '.join(sent2)  # add a space so the two sentences do not run together
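
The difference between concatenating lists and concatenating strings is easy to miss, so here is a small sketch. The second sentence is made up for illustration. Note that `+` on strings adds no separator of its own.

```python
first = ['Call', 'me', 'Ishmael', '.']
second = ['It', 'is', 'late', '.']    # hypothetical second sentence

tokens = first + second               # list concatenation: still a list of tokens

fused = ' '.join(first) + ' '.join(second)          # no separator between the strings
spaced = ' '.join(first) + ' ' + ' '.join(second)   # explicit separator
joined_once = ' '.join(first + second)              # join once over the combined list

print(fused)
print(spaced)
```

Joining once over the combined list gives the same result as adding the separator by hand.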

We can also append a word to a list.

In [None]:
sent1.append('Some')
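
Keep in mind that `append` changes the list in place and returns `None`, so re-running such a cell in a notebook appends the word again. A short sketch on a throwaway list (the variable names are hypothetical):

```python
sentence = ['Call', 'me', 'Ishmael', '.']

result = sentence.append('Some')   # mutates `sentence`; the return value is None
extended = sentence + ['more']     # + builds a new list and leaves `sentence` unchanged

print(result)
print(sentence)
print(extended)
```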

Let's now count the occurrences of a word in a text.

In [None]:
text1.count('whale')

And find out when the whale first appears in the text.

In [None]:
text1.index('whale')
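
Both `count` and `index` match tokens exactly, so capitalised forms are counted separately, and `index` raises `ValueError` when the token is absent. The sketch below shows this on a made-up list of tokens.

```python
# Hypothetical tokens for illustration.
toy_tokens = ['The', 'Whale', 'dived', '.', 'A', 'whale', 'rose', ',',
              'the', 'whale', '!']

exact = toy_tokens.count('whale')                               # case-sensitive count
any_case = sum(1 for t in toy_tokens if t.lower() == 'whale')   # case-insensitive count
first_exact = toy_tokens.index('whale')                         # first exact match

# index raises ValueError for a missing token, so guard the lookup.
try:
    toy_tokens.index('kraken')
    found = True
except ValueError:
    found = False

print(exact, any_case, first_exact, found)
```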

Since text here is a list of words, we can index it and access parts of text like we would with a list.

In [None]:
text1[4716:4726]

In [None]:
text1[:8]

In [None]:
text1[260796:]
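
Slicing combines naturally with `index`: look up a position, then cut a window around it. A small sketch on made-up tokens follows. Negative indices count from the end, which avoids hard-coding the length of the text.

```python
# Hypothetical tokens for illustration.
toy_tokens = 'it was a very large white whale seen far off the coast'.split()

pos = toy_tokens.index('whale')
window = toy_tokens[max(0, pos - 2):pos + 3]   # two tokens either side of the match
head = toy_tokens[:3]                          # first three tokens
tail = toy_tokens[-2:]                         # last two tokens

print(window)
print(head, tail)
```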

**List comprehension**

List comprehension is a powerful tool in general, and that holds when working with text.

In [None]:
[w for w in set(text1) if len(w) > 15]

In [None]:
sent9

In [None]:
[w for w in sent9 if w.startswith('a')]

In [None]:
[w for w in sent9 if w.endswith('d')]

In [None]:
[w for w in sent9 if w.isupper()]

In [None]:
[w for w in sent9 if w.istitle()]

In [None]:
[w for w in sent9 if len(w) > 4]

In [None]:
[len(w) for w in sent9]

In [None]:
[w.upper() for w in sent9]
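
Filtering and transforming can be combined in a single comprehension. The sketch below uses made-up tokens to build a lowercased, de-duplicated vocabulary and to apply two conditions at once.

```python
# Hypothetical tokens for illustration.
toy_tokens = ['The', 'Leviathan', 'swam', ',', 'and', 'THE', 'leviathan', 'dived', '.']

# Filter (alphabetic tokens only) and transform (lowercase) in one pass,
# then de-duplicate with set() and sort for a stable listing.
vocab = sorted(set(w.lower() for w in toy_tokens if w.isalpha()))

# Two conditions combined with `and`.
long_title_words = [w for w in toy_tokens if w.istitle() and len(w) > 4]

print(vocab)
print(long_title_words)
```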

### NLTK Text Corpora

NLTK provides access to a number of text corpora, which can be used once their data has been downloaded and the corresponding corpus reader has been imported. Some contain only text, while others carry additional data such as the categories the texts have been grouped into, or part-of-speech tags.

*Gutenberg Corpus*

NLTK includes a small selection of texts from the [Project Gutenberg](http://www.gutenberg.org/) electronic text archive.

In [None]:
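from nltk.corpus import gutenberg  # already in scope after `from nltk.book import *`; explicit for clarity
gutenberg.fileids()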

Let's load the text of Emma by Jane Austen.

In [None]:
emma = gutenberg.raw('austen-emma.txt')

In [None]:
emma

We can also load the text already broken down into words or sentences.

In [None]:
emma_words = gutenberg.words('austen-emma.txt')

In [None]:
emma_words

In [None]:
emma_sentences = gutenberg.sents('austen-emma.txt')

In [None]:
emma_sentences

In [None]:
emma_sentences[10]
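
The three views relate to each other in a simple way: `raw` is one string, `words` is a flat list of tokens, and `sents` is a list of token lists. The sketch below mimics this on a made-up string using naive regular expressions. These are stand-ins chosen for illustration and not the tokenisation NLTK's corpus readers actually use.

```python
import re

toy_raw = "Emma was clever. She was rich, too. Was she happy?"   # hypothetical text

# Naive sentence split: break after ., ? or ! when followed by whitespace.
toy_sent_strings = re.split(r'(?<=[.?!])\s+', toy_raw)

# Naive word split: runs of word characters, or single punctuation marks.
toy_sents = [re.findall(r"\w+|[^\w\s]", s) for s in toy_sent_strings]

# Flatten the sentences into one token list.
toy_words = [tok for sent in toy_sents for tok in sent]

avg_sent_len = len(toy_words) / len(toy_sents)   # tokens per sentence
print(toy_sents[1])
print(len(toy_words), len(toy_sents), round(avg_sent_len, 2))
```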

*Brown Corpus*

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre.

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.words(categories = 'religion')

In [None]:
brown.sents(categories = 'humor')

*Reuters Corpus*

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into training and test sets. Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other, simply because a news story often covers multiple topics.

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.categories()

In [None]:
reuters.raw(categories = 'potato')

In [None]:
reuters.words(categories = 'coconut-oil')

In [None]:
reuters.sents(categories = 'sugar')
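
Overlapping categories mean that the mapping between documents and topics is many-to-many. The sketch below models that with made-up document ids and topic labels, and inverts the mapping to get the documents per topic. With the real corpus, `reuters.categories(fileid)` and `reuters.fileids(category)` answer the same two questions.

```python
from collections import defaultdict

# Hypothetical document ids and topic labels, for illustration only.
toy_doc_topics = {
    'doc1': ['sugar'],
    'doc2': ['sugar', 'coconut-oil'],
    'doc3': ['potato'],
}

# Invert the mapping: topic -> documents.
# A document with two topics appears under both of them.
toy_topic_docs = defaultdict(list)
for doc_id, topics in toy_doc_topics.items():
    for topic in topics:
        toy_topic_docs[topic].append(doc_id)

print(dict(toy_topic_docs))
```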

*Inaugural Address Corpus*

The Inaugural Address Corpus is actually a collection of 55 texts, one for each presidential address.

In [None]:
from nltk.corpus import inaugural

In [None]:
inaugural.fileids()

NLTK's `ConditionalFreqDist` lets us explore text in interesting ways.

1) How does the use of modals differ between texts of various categories?

In [None]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories = genre))

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
cfd.tabulate(conditions = brown.categories(), samples = modals)
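
Conceptually, a conditional frequency distribution is one frequency counter per condition. The sketch below rebuilds that structure from `collections` on made-up (genre, word) pairs. It illustrates the data structure only and is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical (genre, word) pairs standing in for the Brown Corpus stream.
toy_pairs = [
    ('news', 'will'), ('news', 'can'), ('news', 'will'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

toy_cfd = defaultdict(Counter)    # condition -> Counter of samples
for genre, word in toy_pairs:
    toy_cfd[genre][word] += 1

toy_modals = ['can', 'could', 'will']
for genre in sorted(toy_cfd):
    row = [toy_cfd[genre][m] for m in toy_modals]   # a missing sample counts as 0
    print(genre, row)
```

Because words are counted exactly as they appear, a sentence-initial `Can` would be tallied separately from `can` unless the words are lowercased first.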

2) How did the use of certain words in inaugural addresses change throughout history?

In [None]:
cfd = nltk.ConditionalFreqDist(
           (w, fileid[:4])
           for fileid in inaugural.fileids()
           for word in inaugural.words(fileid)
           for w in ['america', 'freedom', 'war']
           if word.lower().startswith(w))

In [None]:
cfd.plot()
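
The comprehension above takes the year from the first four characters of each file id and matches words by prefix. The sketch below walks through the same logic with explicit loops on a made-up miniature corpus. The file ids and words are hypothetical and only imitate the `YYYY-Name.txt` shape of the real ids.

```python
from collections import Counter, defaultdict

# Hypothetical documents keyed by file ids shaped like 'YYYY-Name.txt'.
toy_docs = {
    '1801-Alpha.txt': ['Freedom', 'and', 'war', 'America'],
    '1805-Beta.txt': ['American', 'freedoms', 'Americans', 'warmth'],
}
toy_targets = ['america', 'freedom', 'war']

toy_cfd = defaultdict(Counter)    # target prefix -> Counter of years
for fileid, words in toy_docs.items():
    year = fileid[:4]             # the first four characters hold the year
    for word in words:
        for target in toy_targets:
            if word.lower().startswith(target):
                toy_cfd[target][year] += 1

print({t: dict(c) for t, c in toy_cfd.items()})
```

Prefix matching is deliberately loose: it catches `American` and `freedoms`, but it also counts `warmth` under `war`, so the resulting curves are best read as rough trends.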