## Getting Started with NLTK

The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds).

In [1]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Searching Text


There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

In [5]:
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
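
To make the idea concrete, here is a minimal pure-Python sketch of a concordance view. It is not NLTK's implementation: the function name concordance_lines, the width parameter, and the toy token list are assumptions made for illustration. Like the output above, it matches case-insensitively, so both monstrous and Monstrous are found.

```python
def concordance_lines(tokens, word, width=3):
    """Return one line per occurrence of `word`, with `width` tokens of context per side.

    Hypothetical helper for illustration; NLTK's own concordance works on
    character widths and is implemented differently.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Bracket the matched token so it stands out in the line.
            lines.append((left + " [" + token + "] " + right).strip())
    return lines


# Toy token list (an assumption, not the real Moby Dick text).
tokens = "the whale was of a most monstrous size and the Monstrous Pictures of Whales".split()
for line in concordance_lines(tokens, "monstrous"):
    print(line)
# of a most [monstrous] size and the
# size and the [Monstrous] Pictures of Whales
```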


### Your Turn: Try searching for other words.


A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts? We can find out by calling similar on the text in question, with the relevant word in parentheses:

In [6]:
text1.similar("monstrous")

imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate


In [7]:
text2.similar("monstrous")

very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast


Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
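
One way to picture "a similar range of contexts" is to treat the context of each occurrence as the pair (previous word, next word) and rank other words by how many such pairs they share with the target. The sketch below does exactly that on a toy token list. The function name similar_words and the scoring rule are assumptions for illustration; NLTK's similar may weight and filter contexts differently.

```python
from collections import defaultdict


def similar_words(tokens, word):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    lowered = [t.lower() for t in tokens]
    target = word.lower()

    # Map each word to the set of (previous word, next word) pairs it occurs between.
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))

    target_contexts = contexts.get(target, set())
    shared = {}
    for other, ctx in contexts.items():
        overlap = len(ctx & target_contexts)
        if other != target and overlap:
            shared[other] = overlap

    # Most shared contexts first; ties broken alphabetically.
    return sorted(shared, key=lambda w: (-shared[w], w))


# Toy token list (an assumption, not taken from either novel).
tokens = ("a monstrous size . a great size . the monstrous pictures . "
          "the curious pictures . a small dog").split()
print(similar_words(tokens, "monstrous"))
# ['curious', 'great']
```

Here great shares the context a ___ size and curious shares the ___ pictures, while small shares neither and is left out.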

The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words in square brackets as well as parentheses, and separate them with a comma:

In [8]:
text2.common_contexts(["monstrous", "very"])

a_pretty is_pretty a_lucky am_glad be_glad
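
Each item in this output joins the word before and the word after the gap with an underscore. A rough sketch of how such shared contexts could be computed is shown here; shared_contexts is a hypothetical helper, the token list is invented, and NLTK's actual common_contexts may differ in detail.

```python
def shared_contexts(tokens, words):
    """Return the 'left_right' contexts in which every word in `words` occurs."""
    lowered = [t.lower() for t in tokens]
    per_word = []
    for word in words:
        w = word.lower()
        # Collect the (previous, next) pair around every occurrence of this word.
        per_word.append({
            (lowered[i - 1], lowered[i + 1])
            for i in range(1, len(lowered) - 1)
            if lowered[i] == w
        })
    if not per_word:
        return []
    common = set.intersection(*per_word)
    return sorted(left + "_" + right for left, right in common)


# Toy token list (an assumption, not the text of Sense and Sensibility).
tokens = ("she is a monstrous pretty girl and a very pretty one ; "
          "I am monstrous glad and I am very glad").split()
print(shared_contexts(tokens, ["monstrous", "very"]))
# ['a_pretty', 'am_glad']
```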


## Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text in a variety of useful ways.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:

In [9]:
len(text3)

44764

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters, such as hairy, his, or :), that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four distinct vocabulary items in this phrase.

How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command: set(text3). When you do this, many screens of words will fly past. Now try the following:

In [10]:
sorted(set(text3))

[u'!',
 u"'",
 u'(',
 u')',
 u',',
 u',)',
 u'.',
 u'.)',
 u':',
 u';',
 u';)',
 u'?',
 u'?)',
 u'A',
 u'Abel',
 u'Abelmizraim',
 u'Abidah',
 u'Abide',
 u'Abimael',
 u'Abimelech',
 u'Abr',
 u'Abrah',
 u'Abraham',
 u'Abram',
 u'Accad',
 u'Achbor',
 u'Adah',
 u'Adam',
 u'Adbeel',
 u'Admah',
 u'Adullamite',
 u'After',
 u'Aholibamah',
 u'Ahuzzath',
 u'Ajah',
 u'Akan',
 u'All',
 u'Allonbachuth',
 u'Almighty',
 u'Almodad',
 u'Also',
 u'Alvah',
 u'Alvan',
 u'Am',
 u'Amal',
 u'Amalek',
 u'Amalekites',
 u'Ammon',
 u'Amorite',
 u'Amorites',
 u'Amraphel',
 u'An',
 u'Anah',
 u'Anamim',
 u'And',
 u'Aner',
 u'Angel',
 u'Appoint',
 u'Aram',
 u'Aran',
 u'Ararat',
 u'Arbah',
 u'Ard',
 u'Are',
 u'Areli',
 u'Arioch',
 u'Arise',
 u'Arkite',
 u'Arodi',
 u'Arphaxad',
 u'Art',
 u'Arvadite',
 u'As',
 u'Asenath',
 u'Ashbel',
 u'Asher',
 u'Ashkenaz',
 u'Ashteroth',
 u'Ask',
 u'Asshur',
 u'Asshurim',
 u'Assyr',
 u'Assyria',
 u'At',
 u'Atad',
 u'Avith',
 u'Baalhanan',
 u'Babel',
 u'Bashemath',
 u'Be',
 u'Because'

In [11]:
len(set(text3))

2789

By wrapping sorted() around the Python expression set(text3), we obtain a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. All capitalized words precede lowercase words, because Python's default string ordering places uppercase letters before lowercase ones. We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use len to obtain this number. Although it has 44,764 tokens, this book has only 2,789 distinct words, or "word types." A word type is the form or spelling of the word independently of its specific occurrences in a text; that is, the word considered as a unique item of vocabulary. Our count of 2,789 items will include punctuation symbols, so we will generally call these unique items types instead of word types.
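
The same distinction between tokens and types can be checked on the short phrase used earlier, without loading any corpus. This sketch uses only built-in Python; splitting on whitespace is a simplifying assumption that stands in for real tokenization.

```python
from collections import Counter

# Whitespace splitting stands in for a real tokenizer here.
tokens = "to be or not to be".split()
types = set(tokens)      # duplicates collapse in a set
counts = Counter(tokens)  # occurrences of each type

print(len(tokens))    # 6
print(len(types))     # 4
print(sorted(types))  # ['be', 'not', 'or', 'to']
print(counts["to"], counts["be"])  # 2 2

# Punctuation sorts first, then capitalized words, then lowercase words.
mixed = ["be", "Abel", "!", "a"]
print(sorted(mixed))  # ['!', 'Abel', 'a', 'be']
```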

Now, let's calculate a measure of the lexical richness of the text. The next example shows us that the number of distinct words is just 6% of the total number of words, or equivalently that each word is used 16 times on average. (If you are using Python 2, remember to start with from __future__ import division so that / performs true division rather than integer division.)

In [13]:
# Only needed on Python 2; has no effect on Python 3
from __future__ import division

len(set(text3)) / len(text3)

0.06230453042623537
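
Since this ratio is useful for any text, it can be wrapped in small helper functions. The names lexical_diversity and mean_uses_per_type below are assumptions chosen for illustration, and the Genesis figures are simply the two counts reported above.

```python
def lexical_diversity(tokens):
    """Fraction of tokens that are distinct types (hypothetical helper name)."""
    # float() keeps this a true division on Python 2 as well as Python 3.
    # An empty token list raises ZeroDivisionError.
    return len(set(tokens)) / float(len(tokens))


def mean_uses_per_type(tokens):
    """Average number of times each distinct type is used."""
    return len(tokens) / float(len(set(tokens)))


tokens = "to be or not to be".split()
print(lexical_diversity(tokens))   # 4 types over 6 tokens
print(mean_uses_per_type(tokens))  # 1.5

# The same arithmetic using the counts reported for Genesis above.
genesis_tokens, genesis_types = 44764, 2789
print(genesis_types / float(genesis_tokens))  # roughly 0.0623, about 6%
print(genesis_tokens / float(genesis_types))  # roughly 16.05 uses per type
```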