# Ch.1 Exercises

The following exercises are from **Chapter 1: Language Processing and Python** of the book *Natural Language Processing with Python — Analyzing Text with the Natural Language Toolkit* by Steven Bird, Ewan Klein, and Edward Loper.

[[Read Now]](http://www.nltk.org/book/ch01.html)

In [1]:
import nltk

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


**Exercise 1**

☼ Try using the Python interpreter as a calculator, and typing expressions like `12 / (4 + 1)`.

In [2]:
35 + (26 / 2)

48.0

**Exercise 2**

☼ Given an alphabet of 26 letters, there are 26 to the power 10, or `26 ** 10`, ten-letter strings we can form. That works out to `141167095653376`. How many hundred-letter strings are possible?

In [3]:
26 ** 100

3142930641582938830174357788501626427282669988762475256374173175398995908420104023465432599069702289330964075081611719197835869803511992549376

**Exercise 3**

☼ The Python multiplication operation can be applied to lists. What happens when you type  `['Monty', 'Python'] * 20`, or `3 * sent1`?

In [4]:
# creates a list with 'Monty' and 'Python' in it 20 times each
['Monty', 'Python'] * 20

['Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python',
 'Monty',
 'Python']

In [5]:
# builds one list that repeats the words of sentence 1 three times
3 * sent1

['Call',
 'me',
 'Ishmael',
 '.',
 'Call',
 'me',
 'Ishmael',
 '.',
 'Call',
 'me',
 'Ishmael',
 '.']

**Exercise 4**

☼ Review [1](http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words) on computing with language. How many words are there in `text2`? How many distinct words are there?

In [6]:
# number of words in text2
len(text2)

141576

In [7]:
# number of distinct words in text2
len(set(text2))

6833

In [8]:
# number of distinct alphabet words in text2
len(set([w for w in text2 if w.isalpha()]))

6713

**Exercise 5**

☼ Compare the lexical diversity scores for humor and romance fiction in [1.1](http://www.nltk.org/book/ch01.html#tab-brown-types). Which genre is more lexically diverse?

Humor has a lexical diversity score of `0.231`, whereas romance scores `0.121`. Because distinct types make up a larger share of the humor tokens, humor is the more lexically diverse genre. In other words, a humorous text in this sample tends to use a wider range of words for its length than a romance text does.
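
To make the score concrete, here is a minimal sketch of the types-to-tokens ratio behind it. The function name `lexical_diversity` and the two toy token lists are my own illustrations, not data from the Brown Corpus.

```python
def lexical_diversity(tokens):
    """Return the share of tokens that are distinct types."""
    return len(set(tokens)) / len(tokens)

# Two hypothetical token lists of equal length
varied = ['the', 'cat', 'sat', 'on', 'a', 'mat']
repetitive = ['the', 'cat', 'the', 'cat', 'the', 'cat']

print(lexical_diversity(varied))      # every token is a new type
print(lexical_diversity(repetitive))  # only 2 types among 6 tokens
```

The more a text repeats itself, the lower the ratio, which is the sense in which romance comes out less diverse than humor.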

**Exercise 6**

☼ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, and Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?

**Dispersion plots** show the *locations* of chosen words in a text: how far from the beginning each occurrence falls.

In [9]:
# produce a dispersion plot

text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])

<Figure size 640x480 with 1 Axes>

In general, it seems that the female characters are the main protagonists of the novel, whereas the males might be supporting characters. In particular, Elinor seems to be the heroine, as her name appears most often and is dispersed quite evenly throughout the novel. Marianne appears as well, though slightly less often.

If we were to identify couples, I would guess that the names of those in the couple would appear together. Thus, we can draw lines up from the men's dense areas to the women's, and use that to determine which pairs partner together. Willoughby's high density areas seem to line up perfectly with Marianne's. In contrast, Marianne's sparse areas match up with areas that are dense for Edward. As it is unlikely that a "couple" would not be seen in the same context for the entire book, I conclude that Edward is partnered with Elinor, and Marianne with Willoughby.
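
The reasoning above depends on where each name occurs, so it may help to see the raw data a dispersion plot draws. This is only a sketch: `word_offsets` is a hypothetical helper of mine, and the token list is a made-up miniature, not `text2`.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Hypothetical miniature "novel"
tokens = ['Elinor', 'met', 'Edward', '.', 'Marianne', 'met',
          'Willoughby', '.', 'Elinor', 'waited', '.']

offsets = word_offsets(tokens, ['Elinor', 'Marianne', 'Edward', 'Willoughby'])
for name, positions in offsets.items():
    print(name, positions)
```

Names whose position lists cluster in the same stretches of the text are the ones that line up vertically in the plot.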

**Exercise 7**

☼ Find the collocations in `text5`.

The function `.collocations()` appears to be deprecated in the current version of `nltk`, so I followed [this answer](https://stackoverflow.com/questions/21165702/nltk-collocations-for-specific-words) to replicate the results.

In [10]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [11]:
finder = BigramCollocationFinder.from_words(text5)

# filter out bigrams that don't appear at least 5 times
finder.apply_freq_filter(5)

# return top 10 results
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('.', 'ACTION'), ('I', "'m"), ('pm', 'me'), ('wanna', 'chat'), ('PART', 'JOIN'), ('MODE', '#14-19teens'), ('+', 'o'), ('/', 'm'), ('do', "n't"), ('are', 'you')]


I find it very interesting that the above code was able to pull out "pm me" and "wanna chat", which are things one is very likely to see in a chat room. I'm also surprised to see the two halves of contractions paired up, with `I` and `'m`, as well as `do` and `n't`; the tokenizer splits contractions into separate tokens, so they show up as frequent bigrams.
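
To show what the frequency filter is doing, here is a rough standard-library sketch that counts adjacent token pairs and drops the rare ones. It is not how `BigramCollocationFinder` is implemented and it does no likelihood-ratio scoring; `frequent_bigrams` and the sample tokens are hypothetical.

```python
from collections import Counter

def frequent_bigrams(tokens, min_count):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}

# Hypothetical chat-like tokens
tokens = ['pm', 'me', 'if', 'you', 'wanna', 'chat', 'or', 'pm', 'me', 'later']

kept = frequent_bigrams(tokens, 2)
print(kept)
```

Only `('pm', 'me')` occurs twice in this sample, so it is the only pair that survives a threshold of 2.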

**Exercise 8**

☼ Consider the following Python expression: `len(set(text4))`. State the purpose of this expression. Describe the two steps involved in performing this computation.

The purpose of this expression is to find how many unique words (or *types*) there are in `text4`. The first step (the innermost parentheses) is to remove duplicate words using `set()`. The second and final step is to count how many unique words remain using `len()`.

In [12]:
len(set(text4))

9754

**Exercise 9a**

☼ Review [2](http://www.nltk.org/book/ch01.html#sec-a-closer-look-at-python-texts-as-lists-of-words) on lists and strings.

Define a string and assign it to a variable, e.g., `my_string = 'My String'` (but put something more interesting in the string). Print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the `print` statement.

In [13]:
carl_sagan = "If you wish to make an apple pie from scratch, you must first invent the universe."
carl_sagan

'If you wish to make an apple pie from scratch, you must first invent the universe.'

In [14]:
print(carl_sagan)

If you wish to make an apple pie from scratch, you must first invent the universe.


**Exercise 9b**

☼ Review [2](http://www.nltk.org/book/ch01.html#sec-a-closer-look-at-python-texts-as-lists-of-words) on lists and strings.

Try adding the string to itself using `my_string + my_string`, or multiplying it by a number, e.g., `my_string * 3`. Notice that the strings are joined together without any spaces. How could you fix this?

In [15]:
carl_sagan + carl_sagan

'If you wish to make an apple pie from scratch, you must first invent the universe.If you wish to make an apple pie from scratch, you must first invent the universe.'

In [16]:
carl_sagan*3

'If you wish to make an apple pie from scratch, you must first invent the universe.If you wish to make an apple pie from scratch, you must first invent the universe.If you wish to make an apple pie from scratch, you must first invent the universe.'

In [17]:
carl_sagan + " " + carl_sagan

'If you wish to make an apple pie from scratch, you must first invent the universe. If you wish to make an apple pie from scratch, you must first invent the universe.'

In [18]:
sentences = [carl_sagan]*3
' '.join(sentences)

'If you wish to make an apple pie from scratch, you must first invent the universe. If you wish to make an apple pie from scratch, you must first invent the universe. If you wish to make an apple pie from scratch, you must first invent the universe.'

**Exercise 10**

☼ Define a variable `my_sent` to be a list of words, using the syntax `my_sent = ["My", "sent"]` (but with your own words, or a favorite saying).

Use `' '.join(my_sent)` to convert this into a string.

Use `split()` to split the string back into the list form you had to start with.

In [19]:
vows = ["forever", "and", "always", "and", "eternity"]

In [20]:
' '.join(vows)

'forever and always and eternity'

In [21]:
' '.join(vows).split()

['forever', 'and', 'always', 'and', 'eternity']

**Exercise 11**

☼ Define several variables containing lists of words, e.g., `phrase1`, `phrase2`, and so on. Join them together in various combinations (using the plus operator) to form whole sentences. What is the relationship between `len(phrase1 + phrase2)` and `len(phrase1) + len(phrase2)`?

In [22]:
rick1 = ['wubba', 'lubba']
rick2 = ['dub']*2

' '.join(rick1 + rick2)

'wubba lubba dub dub'

In [23]:
len(rick1 + rick2) # returns the length of the concatenated list, i.e. 4

4

In [24]:
len(rick1) + len(rick2) # adds the length of each list together; in this case, will return 4

4

As for the relationship between the two previous lines of code, they are always equal: the length of a concatenated list is the sum of the lengths of the lists that were joined.

**Exercise 12**

☼ Consider the following two expressions, which have the same value. Which one will typically be more relevant in NLP? Why?

* `"Monty Python"[6:12]`

* `["Monty", "Python"][1]`

I think both could be considered relevant, but I'm going to have to go with #2. The first might be seen as more relevant because it is a string, and NLP has at its core unstructured text as data. However, the "processing" portion of NLP has as *its* core functions such as tokenization, which will break strings like #1 down into lists more like #2. Thus, in an NLP application, we might be working with lists of tokenized words rather than trying to slice-and-dice full strings, so #2 should be more relevant in practice.

**Exercise 13**

☼ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does `sent1[2][2]` do? Why? Experiment with other index values.

`sent1[2][2]` applies two index operations in turn. `sent1` is a list of word strings, so `sent1[2]` is its third word, `'Ishmael'`; indexing that string with `[2]` then returns its third character, `'h'`.

In [25]:
sent1[2][2]

'h'

In [26]:
sent1[-1][-1]

'.'

**Exercise 14**

☼ The first sentence of `text3` is provided to you in the variable `sent3`. The index of *the* in `sent3` is 1, because `sent3[1]` gives us `'the'`. What are the indexes of the two other occurrences of this word in `sent3`?

In [27]:
for index, word in enumerate(sent3):
    if word == 'the':
        print("'the' at: sent3[" + str(index) + "]")

'the' at: sent3[1]
'the' at: sent3[5]
'the' at: sent3[8]


**Exercise 15**

☼ Review the discussion of conditionals in [4](http://www.nltk.org/book/ch01.html#sec-making-decisions). Find all words in the Chat Corpus (`text5`) starting with the letter *b*. Show them in alphabetical order.

In [28]:
sorted(set([w for w in text5 if w.startswith('b')]))

['b',
 'b-day',
 'b/c',
 'b4',
 'babay',
 'babble',
 'babblein',
 'babe',
 'babes',
 'babi',
 'babies',
 'babiess',
 'baby',
 'babycakeses',
 'bachelorette',
 'back',
 'backatchya',
 'backfrontsidewaysandallaroundtheworld',
 'backroom',
 'backup',
 'bacl',
 'bad',
 'bag',
 'bagel',
 'bagels',
 'bahahahaa',
 'bak',
 'baked',
 'balad',
 'balance',
 'balck',
 'ball',
 'ballin',
 'balls',
 'ban',
 'band',
 'bandito',
 'bandsaw',
 'banjoes',
 'banned',
 'baord',
 'bar',
 'barbie',
 'bare',
 'barely',
 'bares',
 'barfights',
 'barks',
 'barn',
 'barrel',
 'base',
 'bases',
 'basically',
 'basket',
 'battery',
 'bay',
 'bbbbbyyyyyyyeeeeeeeee',
 'bbiam',
 'bbl',
 'bbs',
 'bc',
 'be',
 'beach',
 'beachhhh',
 'beam',
 'beams',
 'beanbag',
 'beans',
 'bear',
 'bears',
 'beat',
 'beaten',
 'beatles',
 'beats',
 'beattles',
 'beautiful',
 'because',
 'beckley',
 'become',
 'bed',
 'bedford',
 'bedroom',
 'beeeeehave',
 'beeehave',
 'been',
 'beer',
 'before',
 'beg',
 'begin',
 'behave',
 'behind',

**Exercise 16**

☼ Type the expression `list(range(10))` at the interpreter prompt. Now try `list(range(10, 20))`,  `list(range(10, 20, 2))`, and `list(range(20, 10, -2))`. We will see a variety of uses for this built-in function in later chapters.

In [29]:
list(range(10)) # returns 0~9

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [30]:
list(range(10,20)) # returns 10~19

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [31]:
list(range(10,20,2)) # skip counts by 2 from 10 to 19

[10, 12, 14, 16, 18]

In [32]:
list(range(20,10,-2)) # skip counts backwards by 2 from 20 to 11

[20, 18, 16, 14, 12]

**Exercise 17**

◑ Use `text9.index()` to find the index of the word *sunset*. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.

In [33]:
text9.index("sunset")

629

In [34]:
' '.join(text9[621:644])

'THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .'
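
The exercise asks for trial and error, but the search can also be roughed out in code. The sketch below assumes a sentence ends at `.`, `?`, or `!`, which is a naive rule that headings and abbreviations would break; `sentence_span` and the sample tokens are hypothetical.

```python
def sentence_span(tokens, index, enders=('.', '?', '!')):
    """Return (start, end) so that tokens[start:end] is the sentence around tokens[index]."""
    start = index
    # Walk left until the previous token ends a sentence
    while start > 0 and tokens[start - 1] not in enders:
        start -= 1
    end = index
    # Walk right until the current token ends the sentence
    while end < len(tokens) - 1 and tokens[end] not in enders:
        end += 1
    return start, end + 1

# Hypothetical token list
tokens = ['It', 'rained', '.', 'The', 'sunset', 'was', 'red', '.', 'We', 'left', '.']

start, end = sentence_span(tokens, tokens.index('sunset'))
print(' '.join(tokens[start:end]))
```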

**Exercise 18**

◑ Using list addition, and the `set` and `sorted` operations, compute the vocabulary of the sentences `sent1 ... sent8`.

In [35]:
sorted(set(sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8))

['!',
 ',',
 '-',
 '.',
 '1',
 '25',
 '29',
 '61',
 ':',
 'ARTHUR',
 'Call',
 'Citizens',
 'Dashwood',
 'Fellow',
 'God',
 'House',
 'I',
 'In',
 'Ishmael',
 'JOIN',
 'KING',
 'MALE',
 'Nov.',
 'PMing',
 'Pierre',
 'Representatives',
 'SCENE',
 'SEXY',
 'Senate',
 'Sussex',
 'The',
 'Vinken',
 'Whoa',
 '[',
 ']',
 'a',
 'and',
 'as',
 'attrac',
 'been',
 'beginning',
 'board',
 'clop',
 'created',
 'director',
 'discreet',
 'earth',
 'encounters',
 'family',
 'for',
 'had',
 'have',
 'heaven',
 'in',
 'join',
 'lady',
 'lol',
 'long',
 'me',
 'nonexecutive',
 'of',
 'old',
 'older',
 'people',
 'problem',
 'seeks',
 'settled',
 'single',
 'the',
 'there',
 'to',
 'will',
 'wind',
 'with',
 'years']

**Exercise 19**

◑ What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?
 	
```python
>>> sorted(set(w.lower() for w in text1))

>>> sorted(w.lower() for w in set(text1))
```

The two expressions differ in the order of deduplication and lowercasing, and the second produces the longer list. Expression 1 lowercases every token first, so `set()` then collapses variants such as `This` and `this` into one entry. Expression 2 calls `set()` on the original tokens, where `This` and `this` count as different elements because string comparison in Python is case-sensitive; lowercasing afterwards turns them into duplicates that stay in the list. The same holds for any text: the second list is never shorter than the first, and the two are the same length only when no two tokens differ solely by case. Even the Chat Corpus, where most tokens are already lowercase, shows a gap.

In [36]:
len(sorted(set(w.lower() for w in text5)))

5441

In [37]:
len(sorted(w.lower() for w in set(text5)))

6066
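
A toy example makes the mechanism easier to see than the corpus counts do. The token list here is made up for illustration.

```python
# Hypothetical tokens with several case variants
tokens = ['This', 'this', 'THIS', 'whale', 'Whale', 'sea']

# Lowercase first, then deduplicate: case variants collapse
lower_then_set = sorted(set(w.lower() for w in tokens))

# Deduplicate first, then lowercase: case variants survive as duplicates
set_then_lower = sorted(w.lower() for w in set(tokens))

print(lower_then_set)
print(set_then_lower)
```

The first list has three entries, while the second keeps all six because `set()` saw six distinct strings before any lowercasing happened.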

**Exercise 20**

◑ What is the difference between the following two tests: `w.isupper()` and `not w.islower()`?

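They are not equivalent. `w.isupper()` is `True` only when `w` contains at least one cased character and every cased character is uppercase (for example `'KING'`). `not w.islower()` is `True` for any string that is not entirely lowercase, which includes uppercase words but also title-case or mixed-case words such as `'Arthur'` and tokens with no letters at all, such as `'1851'` or `'.'`.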
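
The two tests can be compared directly on a few sample strings. The samples below are my own picks, chosen to cover uppercase, title case, lowercase, digits, and punctuation.

```python
samples = ['ARTHUR', 'Arthur', 'arthur', '1851', '!']

# Collect (isupper, not islower) for each sample
results = [(w.isupper(), not w.islower()) for w in samples]

for w, (is_upper, not_lower) in zip(samples, results):
    print(repr(w), 'isupper:', is_upper, 'not islower:', not_lower)
```

Only `'ARTHUR'` passes both tests and only `'arthur'` fails both; the title-case word and the tokens without letters pass `not w.islower()` while failing `w.isupper()`.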

**Exercise 21**

◑ Write the slice expression that extracts the last two words of `text2`.

In [38]:
text2[-2:]

['THE', 'END']

**Exercise 22**

◑ Find all the four-letter words in the Chat Corpus (`text5`). With the help of a frequency distribution (`FreqDist`), show these words in decreasing order of frequency.

In [39]:
FreqDist([w for w in text5 if len(w) == 4]).most_common(10)

[('JOIN', 1021),
 ('PART', 1016),
 ('that', 274),
 ('what', 183),
 ('here', 181),
 ('....', 170),
 ('have', 164),
 ('like', 156),
 ('with', 152),
 ('chat', 142)]
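
The cell above shows only the ten most frequent words. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` as a stand-in on made-up tokens; calling `most_common()` with no argument lists every entry in decreasing order of frequency.

```python
from collections import Counter

# Hypothetical chat-like tokens
tokens = ['JOIN', 'that', 'PART', 'JOIN', 'hi', 'that', 'JOIN', 'chat']

four_letter = Counter(w for w in tokens if len(w) == 4)

# No argument: every four-letter word, most frequent first
for word, count in four_letter.most_common():
    print(word, count)
```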

**Exercise 23**

◑ Review the discussion of looping with conditions in [4](http://www.nltk.org/book/ch01.html#sec-making-decisions). Use a combination of `for` and `if` statements to loop over the words of the movie script for *Monty Python and the Holy Grail* (`text6`) and `print` all the uppercase words, one per line.

In [40]:
for word in text6:
    if word.istitle():
        print(word)

Whoa
Halt
Who
It
I
Arthur
Uther
Pendragon
Camelot
King
Britons
Saxons
England
Pull
I
Patsy
We
Camelot
I
What
Ridden
Yes
You
What
You
So
We
Mercea
Where
We
Found
In
Mercea
The
What
Well
The
Are
Not
They
What
A
It
It
It
A
Well
Will
Arthur
Court
Camelot
Listen
In
Please
Am
I
I
It
African
Oh
African
European
That
Oh
I
Will
Camelot
But
African
Oh
So
Wait
Supposing
No
Well
They
What
Well
Bring
Bring
Bring
Bring
Bring
Bring
Bring
Bring
Ninepence
Bring
Bring
Bring
Bring
Here
Ninepence
I
What
Nothing
Here
I
Ere
He
Yes
I
He
Well
He
I
No
You
Oh
I
It
I
Oh
I
I
Well
I
Well
He
No
I
Robinson
They
Well
Thursday
I
I
You
Look
I
I
Ah
Not
See
Thursday
Right
All
Who
I
Must
Why
He
King
Arthur
King
Arthur
Old
Man
Man
Sorry
What
I
I
I
I
Well
I
Man
Well
Dennis
Well
I
Dennis
Well
I
What
I
Well
I
Oh
And
By
By
If
Dennis
Oh
How
How
I
Arthur
King
Britons
Who
King
The
Britons
Who
Britons
Well
We
Britons
I
I
I
You
We
A
Oh
That
If
Please
I
Who
No
Then
We
What
I
We
We
Yes
But
Yes
I
By
Be
But
Be
I
Order
Who
Heh
I
Well
I


Back
Riiight
Come
Run
Run
Pull
My
Come
Back
Back
Right
Come
Everything
All
That
Just
Christ
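
Note that the loop above tests `istitle()`, so it prints title-cased words such as `Whoa`. If "uppercase" is read as fully uppercase, `isupper()` would be the test to use. A minimal sketch on a made-up token list (not the actual contents of `text6`):

```python
# Hypothetical sample tokens, not taken from text6
tokens = ['KING', 'ARTHUR', ':', 'Whoa', 'there', '!', 'I', 'am', 'Arthur']

uppercase_words = []
for word in tokens:
    if word.isupper():
        uppercase_words.append(word)
        print(word)
```

A one-letter word like `I` satisfies both `isupper()` and `istitle()`, which is why it appears under either test.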


**Exercise 24**

◑ Write expressions for finding all words in `text6` that meet the conditions listed below. The result should be in the form of a list of words: `['word1', 'word2', ...]`.

* Ending in *ize*
* Containing the letter *z*
* Containing the sequence of letters *pt*
* Having all lowercase letters except for an initial capital (i.e., `titlecase`)

In [41]:
candidates = []

for w in text6:
    if w.islower() or w.istitle():
        if w.endswith('ize') or 'z' in w or 'pt' in w:
            candidates.append(w)

candidates

['empty',
 'zone',
 'aptly',
 'amazes',
 'Thpppppt',
 'Thppt',
 'Thppt',
 'empty',
 'Fetchez',
 'Fetchez',
 'Thppppt',
 'temptress',
 'temptation',
 'ptoo',
 'zoop',
 'zoo',
 'zhiv',
 'frozen',
 'zoosh',
 'Chapter',
 'excepting',
 'Thpppt']
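
The cell above folds all four conditions into one filter. Read as four separate tasks, the exercise could instead be answered with one list comprehension per condition. Here is a sketch on a made-up token list; the variable names are mine, and swapping `tokens` for `text6` should give the per-condition results.

```python
# Hypothetical sample tokens
tokens = ['Fetchez', 'zone', 'empty', 'Arthur', 'KING', 'recognize', 'aptly', 'the']

ends_in_ize = [w for w in tokens if w.endswith('ize')]
contains_z = [w for w in tokens if 'z' in w]
contains_pt = [w for w in tokens if 'pt' in w]
titlecase = [w for w in tokens if w.istitle()]

print(ends_in_ize)
print(contains_z)
print(contains_pt)
print(titlecase)
```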

**Exercise 25**

◑ Define `sent` to be the list of words `['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']`. Now write code to perform the following tasks:

* Print all words beginning with *sh*
* Print all words longer than four characters

In [42]:
sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

In [43]:
print([w for w in sent if w.startswith("sh")])

['she', 'shells', 'shore']


In [44]:
print([w for w in sent if len(w) > 4])

['sells', 'shells', 'shore']


**Exercise 26**

◑ What does the following Python code do? `sum(len(w) for w in text1)` 

Can you use it to work out the average word length of a text?

It sums the number of characters in every token of `text1`, which is the total length the text would have if all its tokens (words and punctuation) were placed end to end with no spaces. Dividing that sum by the number of tokens, `len(text1)`, gives the average word length.

In [45]:
sum(len(w) for w in text1) / len(text1)

3.830411128023649

**Exercise 27**

◑ Define a function called `vocab_size(text)` that has a single parameter for the text, and which returns the vocabulary size of the text.

In [46]:
def vocab_size(text):
    return len(set(text))

In [47]:
vocab_size(text1)

19317

In [48]:
vocab_size(text5)

6066

**Exercise 28**

◑ Define a function `percent(word, text)` that calculates how often a given word occurs in a text, and expresses the result as a percentage.

In [49]:
def percent(word, text):
    return 100*(text.count(word) / len(text))

In [50]:
percent('God', text3)

0.5160396747386293

In [51]:
percent('America', text4)

0.13174597728754248

In [52]:
percent('nation', text4)

0.16125158678423165

In [53]:
percent('I', text5)

1.2797156187513885

**Exercise 29**

◑ We have been using sets to store vocabularies. Try the following Python expression: `set(sent3) < set(text1)`. Experiment with this using different arguments to `set()`. What does it do? Can you think of a practical application for this?

In [54]:
set(sent3) < set(text1)

True

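The `<` operator on sets tests for a proper subset: `set(sent3) < set(text1)` is `True` when every distinct token of `sent3` also occurs in `text1` and `text1` contains at least one token that `sent3` does not. The calls to `set()` only reduce each argument to its distinct tokens. A practical application is vocabulary checking, for instance testing whether every word of a new sentence is covered by a known vocabulary or word list, and using set difference to pull out the words that are not.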
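
Here is a small sketch of the set comparison operators on made-up data. The `vocabulary` set and both sentences are hypothetical stand-ins for a real word list and real input.

```python
# Hypothetical known vocabulary and two candidate sentences
vocabulary = {'In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth'}
covered = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven']
uncovered = ['the', 'whale', 'created', 'waves']

print(set(covered) < vocabulary)    # proper subset: all words known, vocabulary is larger
print(vocabulary < vocabulary)      # a set is never a proper subset of itself
print(vocabulary <= vocabulary)     # but it is a (non-proper) subset of itself
print(set(uncovered) < vocabulary)  # fails because some words are unknown

# Set difference lists the out-of-vocabulary words
unknown = set(uncovered) - vocabulary
print(sorted(unknown))
```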