
#### Relevant readings

 [NLTK Book, Chapter 2, Section 1](https://www.nltk.org/book/ch02.html)

# NLTK Text Corpora

The first section of Chapter 2 of the NLTK book teaches you how to explore the built-in corpora provided by NLTK. It is important to note that some of the examples used in Chapter 1 were pedagogical; the authors provided us with corpora and texts that had already been pre-processed in various ways, or otherwise simplified. While a corpus *is* a large collection of language and documents, it usually also contains metadata telling the user about different categories *within* the corpus. Categories can be anything: genre, speaker, task, etc.

In this notebook, we will explore some of the other corpus resources available through the NLTK.

We will also look at how you can load your own data into Colab to create your own corpus using NLTK functions.

Run the following cell to load in the required resources for this notebook.



In [None]:
# import the NLTK library
import nltk

# download resources for this notebook (takes a bit)
nltk_resources = ['gutenberg', 'punkt_tab', 'brown', 'state_union']

nltk.download(nltk_resources)

## The Gutenberg Corpus

A good example of a corpus with different categories is the Gutenberg corpus used by NLTK, which is a collection of different public domain books. [Project Gutenberg](https://www.gutenberg.org/) is a website containing thousands of free eBooks, and is named after [Johannes Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg), associated with the development of the printing press.

<img src = https://i.imgur.com/skJBrKl.png height = '200'>


The Gutenberg corpus is part of the `nltk.corpus` module, which provides several built-in methods for working with text data. You can see a complete list of the methods here: [NLTK corpus package](https://www.nltk.org/api/nltk.corpus.html). You can also see a breakdown in Table 1.3 in Chapter 2 of the NLTK book.

Please note that NLTK includes just a small set of books from Project Gutenberg.  We access the gutenberg data using `nltk.corpus.gutenberg` followed by different NLTK functions and methods.

To see the list of all the files, we use the `.fileids()` method. The filenames take the form `author-title.txt`.

In [None]:
# inspect the different files in the gutenberg corpus - have you read any of these books?
nltk.corpus.gutenberg.fileids()

Note that each file id carries some metadata: it contains both the author and the name of the text. So, in this corpus, there are different texts grouped by different authors. As such, *authors* represent the *categories* of this corpus.
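
As a minimal sketch of how that metadata could be pulled out of the file ids, the cell below splits a few ids into author and title and groups the titles by author. The four ids are a hand-picked sample in the `author-title.txt` form, and `split_fileid` is a hypothetical helper written for this illustration, not part of NLTK.

```python
from collections import defaultdict

# A hand-picked sample of file ids in the `author-title.txt` form.
# In the notebook, the full list comes from nltk.corpus.gutenberg.fileids().
sample_fileids = [
    'austen-emma.txt',
    'austen-persuasion.txt',
    'shakespeare-hamlet.txt',
    'shakespeare-macbeth.txt',
]

def split_fileid(fileid):
    """Split 'author-title.txt' into (author, title). Hypothetical helper."""
    stem = fileid.rsplit('.', 1)[0]      # drop the '.txt' extension
    author, title = stem.split('-', 1)   # split on the first hyphen only
    return author, title

# group the titles under each author
titles_by_author = defaultdict(list)
for fileid in sample_fileids:
    author, title = split_fileid(fileid)
    titles_by_author[author].append(title)

for author, titles in titles_by_author.items():
    print(author, titles)
```

Splitting on the first hyphen only is a design choice: it keeps a title intact even if the title itself contains a hyphen.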


Chapter 2 introduces you to the methods associated with the NLTK corpus class somewhat implicitly, but it is good to look through all the possibilities. Perhaps the most basic possibility is to use the `.raw()` method to obtain a view of the raw text file.

Note that we need to input the fileid of the text we are interested in.


In [None]:
# we can select a single book using the book's name and the format we want, such as words or sentences.
# We use `raw` to get the raw text file (as a string)
macbeth = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')

# look at the entire text file.
macbeth

There are different methods for splitting the corpus into words and sentences:

In [None]:
# the .words method returns words as tokens
macbeth_words = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
macbeth_words[0:10]

In [None]:
# could we do this on our own?
# why do we get different results when we manually split the text?
# do you think the .words is using .split(), or nltk.word_tokenize()?
nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt').split()[:10]
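
To make the comparison concrete, here is a small sketch on a made-up string. It contrasts plain whitespace splitting with a simple regular-expression tokenizer that separates punctuation from words. The pattern is an assumption chosen for illustration; it is not presented as the tokenizer that `.words()` or `nltk.word_tokenize()` uses.

```python
import re

# a made-up example string, not taken from the corpus
text = "[A short example, by no one in particular.]"

# str.split() only breaks on whitespace, so punctuation stays glued to words
whitespace_tokens = text.split()

# illustrative pattern: a run of word characters, or any single character
# that is neither a word character nor whitespace (i.e. one punctuation mark)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Compare the first and last items of each list: that is where the two approaches disagree most visibly.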

In [None]:
# use .sents to get the sentences
macbeth_sents = nltk.corpus.gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sents[0:2]

Chapter 2 also discusses how to import Python modules under a shorter name, which saves typing. This helps explain why many Python scripts and examples start with lines such as `from x import y as z`: this just means import something and give it a shorter name.

For the current example, we can import `gutenberg` directly, and thus avoid typing the `nltk.corpus` bit before it. You can do the same thing with other functions from other packages and modules.

In [None]:
# import the gutenberg package directly
from nltk.corpus import gutenberg

# now you can access `gutenberg` without needing to type `nltk.corpus`
gutenberg.fileids()

## Looping through Gutenberg

Chapter 2 demonstrates the different methods NLTK provides for a raw text by asking you to think about how the following loop works.

First, look at how the loop works. It iterates over `gutenberg.fileids()`, which is just a list of the different text file names. In the body of the loop, the fileid is used to access the specific text; this is why the variable `fileid` is placed inside the brackets for `gutenberg.raw()` and the other methods (`.words()`, `.sents()`).

Examine the print statement and the output - do you understand how they have controlled the output using this `print()` statement? You might find it useful to include some comments in the code cell below, explaining what each line is doing.

The book also claims there are some patterns related to average word length, average sentence length, and lexical diversity for different authors. Do you see these patterns? For instance - who has the longest sentences? The shortest?

In [None]:
# Can you add comments to this code explaining what each line is doing?
for fileid in gutenberg.fileids():
  num_chars = len(gutenberg.raw(fileid))
  num_words = len(gutenberg.words(fileid))
  num_sents = len(gutenberg.sents(fileid))
  num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
  print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
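
It can help to work through the three ratios by hand on a text small enough to count yourself. The sketch below uses a tiny made-up "text" (two sentences, already tokenized) in place of a Gutenberg file; the variable names are chosen for this illustration only.

```python
# A tiny hand-written text: the raw string, and the same text as a list of
# sentences, each sentence being a list of tokens (punctuation included).
raw = 'The cat sat. The dog sat too.'
sents = [
    ['The', 'cat', 'sat', '.'],
    ['The', 'dog', 'sat', 'too', '.'],
]

# flatten the sentences into one list of tokens
words = [w for sent in sents for w in sent]

# the vocabulary: distinct tokens after lowercasing
vocab = set(w.lower() for w in words)

# characters per token (the raw text includes spaces, which inflates this)
avg_word_len = len(raw) / len(words)
# tokens per sentence
avg_sent_len = len(words) / len(sents)
# average number of times each vocabulary item is used
words_per_type = len(words) / len(vocab)

print(f'{avg_word_len:.2f} {avg_sent_len:.2f} {words_per_type:.2f}')
```

Note that punctuation marks count as tokens here, just as they do in the lists returned by `.words()` and `.sents()`.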

## The Brown Corpus

You should read through the explanation of various corpora in Chapter 2. One of the more well-known corpora is the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus), a collection of written American English that has been used in numerous studies of language, either for checking word use and collocations or for compiling word frequency statistics. As Chapter 2 explains, the Brown Corpus is an ideal example of how one can use different categories in a corpus to test questions about differences in language use.

We load Brown in a similar manner to Gutenberg. The Brown corpus has a method that Gutenberg did not have: `.categories()`. This method lists the different genres or classifications of texts in the Brown corpus.

In [None]:
# load in Brown and look at the categories
from nltk.corpus import brown

brown.categories()

You can select a specific subsection of the brown corpus using `categories = ` when accessing the Brown text as raw, words, sentences, etc.

In [None]:
# look at a sentence from one of the texts labelled as "humor" in the Brown corpus
brown.sents(categories = 'humor')[80]

### Comparing language among different genres

The different genres or categories in Brown provide a means to compare different writing styles. The NLTK book provides an analysis of [modal verbs](https://en.wikipedia.org/wiki/Modal_verb) as an example.

Modal verbs are auxiliary verbs which express a level of certainty, possibility, or urgency for a main verb. In English, these are words such as *must*, *will*, *could*, etc.

In the following example, the authors of NLTK wrote code to create a frequency distribution of the words in the `news` category of Brown using `nltk.FreqDist()`, and then report the counts of the modal verbs.

Note how they do this. First they define a list of modal words, so the program has the targets it is trying to find. Then they save the words of the `news` category of the Brown corpus to a new variable `news_text`. Then they create a frequency distribution from a generator expression which lowercases each word (i.e., pre-processes it). The distribution counts *every* word in `news_text`; the loop at the end then looks up and prints the count for each modal only, so any word *not* in the list of modals is never reported.

Run the cell and then ask yourself, what do you think about the frequency of modal verbs in the `news` category - does it make sense that `will` is the most frequent modal verb for news?


In [None]:
# create a frequency distribution for specific modal verbs

# define list of modal verbs
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# create an object of words which occur in the news category of brown
news_text = brown.words(categories = 'news')

# create a frequency distribution - does it make sense for there to be a .lower() here?
fdist = nltk.FreqDist(w.lower() for w in news_text)

# loop through each modal and print the fdist
for m in modals:
  print(m + ':', fdist[m], end = ' ') # the end argument replaces the default newline which comes at the end of a print statement
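
There are two ways to arrive at these counts: count every word and look up the modals afterwards, or filter inside the generator expression so that only modals are counted. The sketch below compares the two on a short made-up token list. It uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` (which, in recent NLTK versions, builds on `Counter`), so treat it as an illustration under that assumption rather than the book's own code.

```python
from collections import Counter

modals = ['can', 'could', 'may', 'might', 'must', 'will']

# a short made-up token list standing in for brown.words(categories='news')
tokens = ['Will', 'it', 'rain', '?', 'It', 'may', ',', 'but', 'we', 'will', 'go', '.']

# Option 1: count every lowercased token, then look up the modals
all_counts = Counter(w.lower() for w in tokens)

# Option 2: filter inside the generator so only modals are counted
modal_counts = Counter(w.lower() for w in tokens if w.lower() in modals)

# both options report the same number for every modal;
# a modal that never occurs simply has a count of 0
for m in modals:
    print(m + ':', all_counts[m], modal_counts[m], end=' ')
```

The first option keeps counts for every word, which is handy if you later want to ask about other words; the second stores only what you asked for.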

## **Your Turn**

Have a play with the Brown corpus: explore the different categories and make sure you are comfortable loading them into Python.

## State of the Union Corpus

Another corpus which provides us with some interesting data for various comparisons is a set of speeches given by US presidents over the years. Each year, the sitting president gives a "state of the union" speech which explains how everything they have done is good and how everything their opponents want to do is bad. Because most US Presidents serve four or eight consecutive years in office, this provides a neat way to compare the speech of different US Presidents over the years.

We load in the corpus just like Gutenberg and Brown:



In [None]:
# Load in the state_union corpus
from nltk.corpus import state_union

# fileids shows us the different files
state_union.fileids()

Much like Gutenberg, we can access various properties of specific speeches by using the `.raw()`, `.words()`, or `.sents()` methods with a specific fileid in the brackets:

In [None]:
# Raw text of the 1945 speech
state_union.raw('1945-Truman.txt')

In [None]:
# tokenized version (truncated output)
state_union.words('1945-Truman.txt')

In [None]:
# sentences (truncated output)
state_union.sents('1945-Truman.txt')

We could ask a variety of questions about the nature of State of the Union addresses as they have occurred over time.

- Have they changed in length?
- What words are similar among all presidents?
- What words are unique to different presidents?
- Which president is the most lexically diverse?
- and so on...

Do you have an idea about how to do this? It probably involves some sort of looping over the fileids and then performing a function on each one. For example, the for loop below reports the most frequent word from each speech. Run the cell and then inspect the results. What could you do to get more interesting results? And what does this output say about the distribution of words in the English language?


In [None]:
for fileid in state_union.fileids():
  most_frequent_word = nltk.FreqDist(state_union.words(fileid)).most_common(1)
  print(f'{fileid} most frequent word is: \t {most_frequent_word}')