# Counting "[Ww]hales?" in *Moby Dick*

Just for fun, we will count how many times the word *whale* or *Whale*, or the plural of either, occurs in the book *Moby Dick*. We'll use two methods to do this. There are plenty of other ways to do it; some might be worse, and some are surely better.

In [1]:
import urllib.request as ur
import re
import string
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

# The 'punkt' tokenizer data used by word_tokenize. This will download it, if you don't have it already.
_ = nltk.download('punkt', quiet=True)

## Get the *Moby Dick* text

Get the text of *Moby Dick* from Project Gutenberg. We'll read the whole file in as one big (multi-line) byte string. The file has Project Gutenberg's boilerplate at the start and at the end, and we'll strip that out. Using a regular expression, we'll capture everything from "Call me Ishmael" onwards (any characters, including line breaks) up to "End of Project Gutenberg's Moby Dick", and then decode the result as UTF-8. We'll call the resulting string `mobydick`.

In [2]:
fulltext = ur.urlopen("http://www.gutenberg.org/ebooks/2701.txt.utf-8").read()

mobydick = re.search(b"(Call me Ishmael.*)End of Project Gutenberg's Moby Dick.*$",
                     fulltext, re.S).groups()[0].strip().decode('utf-8')
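
The role of the `re.S` flag in the search above can be seen in a minimal sketch. The `sample` bytes below are a made-up stand-in for the downloaded file, not the real Gutenberg text.

```python
import re

# Hypothetical stand-in for the downloaded bytes (not the real file).
sample = (b"Header boilerplate\n"
          b"Call me Ishmael. Some years ago...\n"
          b"...and found another orphan.\n"
          b"End of Project Gutenberg's Moby Dick\n"
          b"Footer boilerplate\n")

pattern = b"(Call me Ishmael.*)End of Project Gutenberg's Moby Dick.*$"

# Without re.S, "." does not match a line break, so the pattern
# cannot stretch across lines and the search fails.
no_flag = re.search(pattern, sample)

# With re.S, "." also matches "\n", so the group captures the whole body.
with_flag = re.search(pattern, sample, re.S)
body = with_flag.group(1).strip().decode('utf-8')

print(no_flag)
print(body)
```

Because the pattern and the input are both bytes, the captured group is bytes too, which is why it is decoded afterwards.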

## Define some regular expressions

We can define a regular expression that matches all and only the following strings: *whale*, *whales*, *Whale* and *Whales*.

In [3]:
whale_regex = re.compile('^[Ww]hales?$')
is_whale_word = lambda word: bool(whale_regex.match(word))
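
As a quick check of what this pattern accepts, here is a small sketch. The `candidates` list is made up for illustration.

```python
import re

whale_regex = re.compile('^[Ww]hales?$')

# Made-up candidate tokens, some of which should be rejected.
candidates = ['whale', 'Whales', 'WHALE', 'whaler', "whale's", 'narwhale']

# Keep only the tokens that the anchored pattern matches in full.
matches = [word for word in candidates if whale_regex.match(word)]
print(matches)  # ['whale', 'Whales']
```

The `^` and `$` anchors are what exclude words like *whaler* and the possessive *whale's*, which merely contain a whale word.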

We'll also make two more regular expressions: one that matches any single punctuation character, and one that matches non-empty strings consisting of nothing but punctuation characters.

In [4]:
punctuation_regex = re.compile('[%s]' % re.escape(string.punctuation))
punctuation_string_regex = re.compile('^[%s]+$' % re.escape(string.punctuation))
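
To see how these two patterns behave, here is a minimal sketch on made-up strings. The first pattern is used to delete punctuation, and the second to recognise tokens that are punctuation only.

```python
import re
import string

punctuation_regex = re.compile('[%s]' % re.escape(string.punctuation))
punctuation_string_regex = re.compile('^[%s]+$' % re.escape(string.punctuation))

# Deleting every punctuation character from a string.
stripped = punctuation_regex.sub('', "whale's--tail!")
print(stripped)  # whalestail

# Testing whether a token consists of punctuation only.
for token in ['--', '.', "'s", 'whale']:
    print(repr(token), bool(punctuation_string_regex.match(token)))
```

Note that `'s` is not a punctuation-only string, because it contains the letter *s*.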

## Some helper functions

We'll count the whale words by tokenizing the text and then finding the whale words using the regular expression above.

For convenience, we'll define a function that collects the whale words in a list of words, a general tokenize-and-count function, and a function to summarize our results.

In [5]:
def get_whale_words(words):
    "Return a list of all words that match the whale_regex"
    return [word for word in words if is_whale_word(word)]

def tokenize_and_count(text, tokenizer):
    '''Tokenize `text` with `tokenizer` and return a dictionary holding the
    list of all words ('all_words') and the list of whale words ('whale_words').'''
    
    results = dict(all_words = tokenizer(text))
    results['whale_words'] = get_whale_words(results['all_words'])
    
    return results

def summarize_results(results):
    """Return a string summarizing the results of the whale word search."""
    
    whale_words, all_words = (results['whale_words'], results['all_words'])
    
    # `word` rather than `string`, to avoid confusion with the string module.
    count_breakdown = "\n".join(['%s: %s' % (word, count)
                           for word, count in Counter(whale_words).items()])
    
    summary = "There are %d whale words in %d words. This is a ratio of %2.5f." % (
        len(whale_words), len(all_words), len(whale_words)/len(all_words)
    )

    return "\n".join([summary,
                      "The matching words breakdown as follows:",
                      count_breakdown])
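
The breakdown in `summarize_results` comes from `collections.Counter`. As a small sketch on a made-up word list (not the book), `Counter.most_common()` is one way to list the counts from largest to smallest, which the function above does not attempt to do.

```python
from collections import Counter

# Made-up list of matched words, for illustration only.
whale_words = ['whale', 'Whale', 'whale', 'whales', 'whale']

counts = Counter(whale_words)

# most_common() yields (word, count) pairs, largest count first.
for word, count in counts.most_common():
    print('%s: %s' % (word, count))
```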


## Tokenize using *Natural Language Toolkit*

We'll tokenize `mobydick` using the *Natural Language Toolkit*. This uses a built-in tokenizer to return a list of what it defines as "words", which includes things like "." and "--". We can filter out these punctuation-only tokens using the `punctuation_string_regex` defined above.

In [6]:
def nltk_tokenize_and_count(text):
    
    nltk_tokenizer = lambda text: [word for word in word_tokenize(text)
                                   if not punctuation_string_regex.match(word)]

    return tokenize_and_count(text, nltk_tokenizer)

In [7]:
nltk_results = nltk_tokenize_and_count(mobydick)

In [8]:
print(summarize_results(nltk_results))

There are 1242 whale words in 212164 words. This is a ratio of 0.00585.
The matching words breakdown as follows:
whale: 741
whales: 221
Whales: 23
Whale: 257


## Tokenize the old school way

The old Unix way to tokenize is to remove the punctuation characters and then split the remaining string on whitespace and line breaks.

In [9]:
def oldschool_tokenize_and_count(text):
    
    old_school_tokenizer = lambda text: punctuation_regex.sub('', text).split()

    return tokenize_and_count(text, old_school_tokenizer)
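
Here is a minimal sketch of what this tokenizer does to a made-up sentence. The helper name `old_school_tokens` is introduced here for illustration only.

```python
import re
import string

punctuation_regex = re.compile('[%s]' % re.escape(string.punctuation))

def old_school_tokens(text):
    """Delete punctuation characters, then split on any whitespace."""
    return punctuation_regex.sub('', text).split()

# Made-up sentence with a possessive, a "--" dash and a hyphenated word.
tokens = old_school_tokens("A whale's tail--a well-known\nsight.")
print(tokens)
# ['A', 'whales', 'taila', 'wellknown', 'sight']
```

Two side effects are visible here. The possessive "whale's" becomes "whales", and words joined by punctuation with no surrounding spaces are fused into a single token.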

In [10]:
oldschool_results = oldschool_tokenize_and_count(mobydick)

In [11]:
print(summarize_results(oldschool_results))

There are 1240 whale words in 208420 words. This is a ratio of 0.00595.
The matching words breakdown as follows:
whale: 663
whales: 296
Whales: 66
Whale: 215


## Which method is better?

The total counts from the two methods are quite close, but the breakdowns are clearly different. We see, for example, more *whales* and *Whales* counts with the old school method than with nltk. Why is this? Without looking into it more carefully, I presume it has to do with possessive forms like "whale's". The old school tokenizer strips out the "'" and leaves the word as "whales", while the nltk tokenizer treats "whale's" as two separate tokens, "whale" and "'s", which is arguably the correct analysis.

Let's look at a simple example for comparison. Here it is easy to see how the discrepancies arise.

In [12]:
text = """
Whales are big fish. Actually, a whale is not a fish at all. 
A whale's tail is the tail of a whale.
"""

In [13]:
print(summarize_results(nltk_tokenize_and_count(text)))

There are 4 whale words in 23 words. This is a ratio of 0.17391.
The matching words breakdown as follows:
whale: 3
Whales: 1


In [14]:
print(summarize_results(oldschool_tokenize_and_count(text)))

There are 4 whale words in 22 words. This is a ratio of 0.18182.
The matching words breakdown as follows:
whale: 2
whales: 1
Whales: 1
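
One rough way to test the possessive explanation more directly is to count the possessive forms in the raw text with a regular expression. This is only a sketch: `possessive_regex` and `count_possessives` are names introduced here, and it assumes the text uses straight ASCII apostrophes.

```python
import re

# Assumption: possessives are written with a straight ASCII apostrophe.
possessive_regex = re.compile(r"\b[Ww]hale's\b")

def count_possessives(text):
    """Count occurrences of "whale's" or "Whale's" in the raw text."""
    return len(possessive_regex.findall(text))

sample_text = """
Whales are big fish. Actually, a whale is not a fish at all.
A whale's tail is the tail of a whale.
"""

print(count_possessives(sample_text))  # 1
```

In the small example, this count of 1 matches the one extra *whales* that the old school method reports. Applying `count_possessives` to `mobydick` would show how much of the difference in the book this effect accounts for.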
