# Who had the larger vocabulary, Shakespeare or Dickens?

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. 

In [1]:
import string

In [2]:
def word_getter(file):
    """Convert a text file into a list of lowercase words without punctuation"""
    words = []  # holds the list of words in the file
    # translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    with open(file, 'r') as f:  # read-only access is enough
        text = f.read()
    for w in text.split():  # split on whitespace
        w = w.lower().translate(table)  # lowercase, then remove punctuation
        words.append(w)
    return words

In [3]:
word_getter('words.txt')

['this',
 'is',
 'a',
 'line',
 'of',
 'words',
 'this',
 'is',
 'a',
 'second',
 'line',
 'of',
 'words',
 'this',
 'is',
 'a',
 'third',
 'line',
 'of',
 'words']
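
Note that `str.translate` deletes punctuation everywhere in a token, so "don't" becomes "dont" and a token made only of punctuation becomes an empty string. Below is a minimal sketch of an alternative, assuming we only want to trim punctuation from the ends of each word. `clean_words` is a hypothetical helper, not part of the solution above.

```python
import string

def clean_words(text):
    """Split text on whitespace, trim surrounding punctuation, lowercase."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words

sample = "Don't stop -- (believing)!"
print(clean_words(sample))  # ["don't", 'stop', 'believing']
```

Whether to keep the apostrophe is a judgment call; it changes which tokens count as the same word.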

Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format.

Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before.

Then modify the program to count the total number of words in the book, and the number of times each word is used.

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary? 

In [13]:
def word_getter(file):
    """Convert a text file into a list of lowercase words without punctuation"""
    words = []  # holds the list of words in the file
    # translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    with open(file, 'r') as f:  # read-only access is enough
        text = f.read()
    for w in text.split():  # split on whitespace
        w = w.lower().translate(table)  # lowercase, then remove punctuation
        words.append(w)
    return words

def count_words(words):
    """Return the total number of words, counting repeats"""
    return len(words)

def count_unique(words):
    """Return a Counter mapping each word to the number of times it occurs"""
    from collections import Counter
    return Counter(words)

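def top_n_words(counted, n=20):
    """Return the n most common (word, count) pairs from a Counter"""
    try:
        return counted.most_common(n)
    except AttributeError:
        # fail with a clear message instead of an unbound-variable error
        raise TypeError('counted is type {}, should be collections.Counter'.format(type(counted)))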

def words_comparer(list1, list2):
    """return the words from list2 that are not in list1"""
    known = set(list1)  # set lookups are much faster than scanning a list
    return [x for x in list2 if x not in known]

def analyze(file):
    """Return the total word count and a Counter of word frequencies for a file"""
    words = word_getter(file)
    n_words = count_words(words)
    word_counts = count_unique(words)
    return n_words, word_counts
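
The exercise also asks to skip the header, which `word_getter` does not do, so any front matter in the file is counted along with the text. Below is a sketch of one way to handle it, assuming the file contains a marker line beginning with `*** START OF`. That marker is common in Project Gutenberg plain-text files, but check your own copy. `skip_header` and its default marker are illustrative, not part of the solution above.

```python
def skip_header(lines, marker='*** START OF'):
    """Return the lines that follow the first line starting with marker.

    If no line starts with marker, return all lines unchanged.
    """
    lines = list(lines)
    for i, line in enumerate(lines):
        if line.startswith(marker):
            return lines[i + 1:]
    return lines

# Made-up sample standing in for the first lines of a downloaded book
sample = [
    'Title: Example\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'First line of the book.\n',
]
print(skip_header(sample))  # ['First line of the book.\n']
```

Inside `word_getter`, the remaining lines could be joined with `''.join(...)` before calling `split()`.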

In [5]:
shakes_n, shakes_count = analyze('macbeth.txt')

In [6]:
dickens_n, dickens_count = analyze('olivertwist.txt')

In [7]:
print('Macbeth contains {} words. Oliver Twist contains {} words'.format(shakes_n, dickens_n))

Macbeth contains 17738 words. Oliver Twist contains 160915 words
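
The exercise asks for the number of different words, which is the number of keys in the `Counter` returned by `count_unique`. The sketch below uses a toy word list (not the real books) to show that count. It also shows the ratio of distinct words to total words, one possible way to compare texts of very different lengths. `vocabulary_size` and `type_token_ratio` are hypothetical helpers.

```python
from collections import Counter

def vocabulary_size(counted):
    """Number of distinct words in a Counter of word frequencies."""
    return len(counted)

def type_token_ratio(counted):
    """Distinct words divided by total words (0.0 for an empty Counter)."""
    total = sum(counted.values())
    return len(counted) / total if total else 0.0

# Toy data for illustration only
counted = Counter(['the', 'cat', 'saw', 'the', 'dog'])
print(vocabulary_size(counted))   # 4
print(type_token_ratio(counted))  # 0.8
```

With the real data this would be `vocabulary_size(shakes_count)` and `vocabulary_size(dickens_count)`. Because Oliver Twist is about nine times longer than Macbeth, a larger raw count for Dickens would not by itself show a larger vocabulary.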


Modify the program from the previous exercise to print the 20 most frequently used words in the book.

In [8]:
top_n_words(shakes_count)

[('the', 647),
 ('and', 545),
 ('to', 383),
 ('of', 338),
 ('i', 331),
 ('a', 239),
 ('that', 227),
 ('my', 203),
 ('you', 203),
 ('in', 199),
 ('is', 180),
 ('not', 165),
 ('it', 161),
 ('with', 153),
 ('his', 146),
 ('be', 137),
 ('macb', 137),
 ('your', 126),
 ('our', 123),
 ('haue', 122)]

Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

In [12]:
word_list = word_getter('lotsofwords.txt')

In [15]:
macbeth_not_in_list = words_comparer(word_list, word_getter('macbeth.txt'))

In [17]:
print('There are {} words in Macbeth that are not in the big word list.'.format(count_words(macbeth_not_in_list)))

There are 4561 words in Macbeth that are not in the big word list.


In [21]:
print('{} of these words are unique'.format(count_words(set(macbeth_not_in_list))))

1571 of these words are unique
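
The unique words can also be computed directly with a set difference, which avoids building the intermediate list. Below is a sketch with toy data. `unknown_words` is a hypothetical helper, not part of the solution above.

```python
def unknown_words(book_words, word_list):
    """Return the set of distinct words in book_words absent from word_list."""
    return set(book_words) - set(word_list)

# Toy data for illustration only
book = ['haue', 'the', 'vpon', 'haue', 'king']
known = ['the', 'king', 'queen']
print(sorted(unknown_words(book, known)))  # ['haue', 'vpon']
```

The trade-off is that a set discards the frequencies, so the list returned by `words_comparer` is still needed for the top-20 count.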


We can look at the top 20 most frequent words that are not in the big word list.

In [23]:
top_n_words(count_unique(macbeth_not_in_list))

[('i', 331),
 ('a', 239),
 ('macb', 137),
 ('haue', 122),
 ('macbeth', 62),
 ('vpon', 58),
 ('macd', 58),
 ('vs', 55),
 ('rosse', 49),
 ('ile', 35),
 ('feare', 35),
 ('banquo', 34),
 ('selfe', 32),
 ('1', 32),
 ('exeunt', 30),
 ('speake', 29),
 ('lenox', 28),
 ('vp', 26),
 ('th', 26),
 ('mal', 25)]

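The most frequent word is 'i', which really should be in the word list, as should 'a', the second most frequent. 'macb' and 'macd' look like typos, but they are more likely abbreviated speaker labels for Macbeth and Macduff. 'macbeth' itself is a proper noun, so it is not too surprising that it is missing from the word list. Most of the rest appear to be archaic spellings ('haue', 'vpon', 'vs', 'feare', 'selfe', 'speake'), character names ('rosse', 'banquo', 'lenox'), or stage directions ('exeunt') rather than typos or truly obscure words. The exception is '1', which is a number and does not belong in the big word list.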