# Books

## Attribution

The data for this project come from
[Project Gutenberg](https://www.gutenberg.org/):
[_The Importance of Being Earnest: A Trivial Comedy for Serious People_ by Oscar Wilde](https://www.gutenberg.org/cache/epub/844/pg844.txt)
and
[_Macbeth_ by William Shakespeare](https://www.gutenberg.org/cache/epub/2264/pg2264.txt).
Both are in the public domain, and both are encoded as UTF-8.

## Investigation

__Who has the most to say in _The Importance of Being Earnest_?__

My all-time favorite play is Oscar Wilde's _The Importance of Being Earnest: A
Trivial Comedy for Serious People_. In this investigation, we would like to
determine which character has the most dialogue, measured by word count.

_Hypothesis:_ Not sure. Jack is the protagonist, but Algernon is a chronic yapper.

Later, I will apply the same method to another one of my favorite plays, the
tragedy of _Macbeth_ by William Shakespeare.

_Hypothesis:_ Definitely Macbeth himself.

### Parsing

To implement the parser, we begin by reading each line from the file:

In [1]:
with open('data/pg844.txt', encoding='utf-8') as stream:
    for line in stream:
        print(line)

The Project Gutenberg eBook of The Importance of Being Earnest: A Trivial Comedy for Serious People

    

This ebook is for the use of anyone anywhere in the United States and

most other parts of the world at no cost and with almost no restrictions

whatsoever. You may copy it, give it away or re-use it under the terms

of the Project Gutenberg License included with this ebook or online

at www.gutenberg.org. If you are not located in the United States,

you will have to check the laws of the country where you are located

before using this eBook.



Title: The Importance of Being Earnest: A Trivial Comedy for Serious People



Author: Oscar Wilde



Release date: March 1, 1997 [eBook #844]

                Most recently updated: February 13, 2021



Language: English



Credits: David Price





*** START OF THE PROJECT GUTENBERG EBOOK THE IMPORTANCE OF BEING EARNEST: A TRIVIAL COMEDY FOR SERIOUS PEOPLE ***









The Importance of Being Earnest



A Trivial Comedy for Serious Peo

After observing the output, we notice that each dialogue section begins with the character's name in uppercase letters, followed by a full stop. We can restrict our output to only these sections.
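The heuristic described here can be written as a small predicate and tried on a few lines. This is only a sketch: the name `is_speaker_heading` and the sample strings are made up for illustration, and the rule itself is simply "uppercase and ends with a full stop".

```python
def is_speaker_heading(line):
    """Return True if a line looks like a speaker heading such as 'ALGERNON.'."""
    line = line.strip()
    # str.isupper() is False for strings with no cased characters,
    # so blank lines never match.
    return line.endswith('.') and line.isupper()

# Illustrative sample lines, not taken from the data file.
for sample in ['ALGERNON.', 'LADY BRACKNELL.', 'Good morning.', 'FIRST ACT', '']:
    print(repr(sample), is_speaker_heading(sample))
```

Any other all-caps line that ends with a full stop would also match, so the rule is a heuristic rather than a guarantee.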

In [2]:
with open('data/pg844.txt', encoding='utf-8') as stream:
    for line in stream:
        line = line.strip()
        if line.endswith('.') and line.isupper():
            print(line, end = ' ')

ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. LANE. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. LANE. LADY BRACKNELL. ALGERNON. LADY BRACKNELL. ALGERNON. GWENDOLEN. JACK. GWENDOLEN. LADY BRACKNELL. ALGERNON. LADY BRACKNELL. GWENDOLEN. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LAN

There are some extraneous sections at the end of this output. We can ignore them by terminating the parser after detecting the end of the source text.

In [3]:
with open('data/pg844.txt', encoding='utf-8') as stream:
    for line in stream:
        line = line.strip()
        
        if line.startswith('*** END OF THE PROJECT GUTENBERG EBOOK'):
            break
            
        if line.endswith('.') and line.isupper():    
            print(line, end = ' ')

ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. LANE. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. JACK. ALGERNON. LANE. LADY BRACKNELL. ALGERNON. LADY BRACKNELL. ALGERNON. GWENDOLEN. JACK. GWENDOLEN. LADY BRACKNELL. ALGERNON. LADY BRACKNELL. GWENDOLEN. ALGERNON. LANE. ALGERNON. LANE. ALGERNON. LAN
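The footer check can be paired with the matching `*** START OF THE PROJECT GUTENBERG EBOOK` marker seen in the first output, so that the header is skipped as well. The following is a minimal sketch of that idea, not part of the cells in this notebook: `body_lines` and the `sample` list are hypothetical names and data used only for illustration.

```python
START_MARKER = '*** START OF THE PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THE PROJECT GUTENBERG EBOOK'

def body_lines(lines):
    """Yield stripped lines that fall between the START and END markers."""
    in_body = False
    for line in lines:
        line = line.strip()
        if line.startswith(END_MARKER):
            # Everything after the footer marker is license text.
            return
        if in_body:
            yield line
        elif line.startswith(START_MARKER):
            # The body begins on the line after the header marker.
            in_body = True

# Made-up sample standing in for an open file.
sample = [
    'Title: An Example\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n',
    'ALGERNON.\n',
    'Some dialogue.\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n',
    'LICENSE TEXT.\n',
]
print(list(body_lines(sample)))
```

Because it accepts any iterable of lines, the same generator could be handed an open file object in place of the list.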

Now we can collect the words in each dialogue block and report how many each block contains:

In [4]:
def process_dialogue(character, words):
    if not character or not len(words):
        return
    
    print(len(words), "words. ", end = "")
    
with open('data/pg844.txt', encoding='utf-8') as stream:
    character = None
    words = []
    
    for line in stream:
        line = line.strip()
        
        if line.startswith('*** END OF THE PROJECT GUTENBERG EBOOK'):
            break
            
        if line.endswith('.') and line.isupper():
            process_dialogue(character, words)

            character = line
            words = []
            
            print(line, end = ' ')
            continue

        words += line.split(' ')

    process_dialogue(character, words)

ALGERNON. 9 words. LANE. 9 words. ALGERNON. 37 words. LANE. 3 words. ALGERNON. 18 words. LANE. 8 words. ALGERNON. 48 words. LANE. 8 words. ALGERNON. 20 words. LANE. 28 words. ALGERNON. 9 words. LANE. 42 words. ALGERNON. 15 words. LANE. 16 words. ALGERNON. 12 words. LANE. 7 words. ALGERNON. 43 words. LANE. 11 words. ALGERNON. 13 words. JACK. 16 words. ALGERNON. 25 words. JACK. 9 words. ALGERNON. 8 words. JACK. 27 words. ALGERNON. 8 words. JACK. 5 words. ALGERNON. 9 words. JACK. 9 words. ALGERNON. 22 words. JACK. 27 words. ALGERNON. 7 words. JACK. 4 words. ALGERNON. 20 words. JACK. 5 words. ALGERNON. 25 words. JACK. 18 words. ALGERNON. 16 words. JACK. 6 words. ALGERNON. 61 words. JACK. 23 words. ALGERNON. 45 words. JACK. 10 words. ALGERNON. 34 words. JACK. 16 words. ALGERNON. 43 words. JACK. 8 words. ALGERNON. 19 words. JACK. 5 words. ALGERNON. 32 words. JACK. 3 words. ALGERNON. 30 words. JACK. 28 words. ALGERNON. 18 words. LANE. 6 words. JACK. 43 words. ALGERNON. 17 words. JACK. 33 word

Finally, we can count, aggregate, and summarize the data. We also introduce logic to exclude the stage instructions (enclosed in brackets) from the word count. And, if multiple characters are speaking at the same time, we want to "double-count" their dialogue, once for each speaker.
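The bracket-skipping rule can be looked at in isolation before reading the full cell. The sketch below is a variation, not the code used for the counts in this document: the helper name `count_spoken_words` is hypothetical, it carries the "inside a stage direction" state from one line to the next, and it uses `split()` with no argument so that blank lines contribute zero words.

```python
def count_spoken_words(line, in_direction=False):
    """Count words on one line, skipping [bracketed] stage directions.

    Returns (count, in_direction) so the caller can pass the bracket
    state on to the next line.
    """
    count = 0
    # split() with no argument never yields empty strings.
    for word in line.split():
        if in_direction:
            # Still inside a direction; leave it once a word holds the closer.
            if ']' in word:
                in_direction = False
            continue
        if word.startswith('['):
            # A direction opens; it may also close within the same word.
            in_direction = ']' not in word
            continue
        count += 1
    return count, in_direction

# Made-up lines for illustration.
print(count_spoken_words('How are you? [Sits down.] Very well.'))
print(count_spoken_words('Yes. [Goes over to the'))
print(count_spoken_words('sofa.] Thank you.', in_direction=True))
```

Returning the flag alongside the count keeps the helper free of global state, at the cost of the caller having to thread it through the loop.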

In [19]:
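def process_dialogue(counts, characters, words):
    # Nothing to record before the first speaker or for an empty speech.
    if not characters or not words:
        return

    # A heading may name several speakers separated by commas;
    # credit the full word count to each of them.
    for character in characters.split(','):
        character = character.strip('. ')

        if character in counts:
            counts[character] = counts[character] + words
        else:
            counts[character] = words

def process_play(path):
    counts = {}

    with open(path, encoding='utf-8') as stream:
        character = None
        words = 0

        for line in stream:
            line = line.strip()

            # Stop at the Project Gutenberg footer.
            if line.startswith('*** END OF THE PROJECT GUTENBERG EBOOK'):
                break

            # An uppercase line ending in a full stop starts a new speech.
            if line.endswith('.') and line.isupper():
                process_dialogue(counts, character, words)

                character = line
                words = 0

                continue

            # Skip words inside [stage directions]; the flag resets on every line.
            bracketed = False

            for word in line.split(' '):
                if word.endswith(']'):
                    bracketed = False
                    continue

                if bracketed:
                    continue

                if word.startswith('['):
                    bracketed = True
                    continue

                words += 1

        # Record the final speech, which has no heading after it.
        process_dialogue(counts, character, words)

    # Print the characters from most to fewest words.
    for character in sorted(counts, key = lambda x: -counts[x]):
        print(character, counts[character], "words.")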

process_play('data/pg844.txt')

JACK 4293 words.
ALGERNON 4271 words.
LADY BRACKNELL 3040 words.
CECILY 2982 words.
GWENDOLEN 2352 words.
MISS PRISM 1014 words.
CHASUBLE 823 words.
LANE 202 words.
MERRIMAN 184 words.


These results settle the question: Jack has the most dialogue in _The Importance of Being Earnest_. Algernon, however, is _very_ close behind: the two word counts differ by about one-half of one percent.

In [16]:
process_play('data/pg1533.txt')

MACBETH 5744 words.
LADY MACBETH 2031 words.
MALCOLM 1589 words.
MACDUFF 1256 words.
ROSS 947 words.
BANQUO 857 words.
DUNCAN 568 words.
LENNOX 544 words.
FIRST WITCH 407 words.
LADY MACDUFF 390 words.
DOCTOR 378 words.
PORTER 323 words.
HECATE 289 words.
SIWARD 253 words.
SOLDIER 239 words.
GENTLEWOMAN 220 words.
ALL 205 words.
LORD 188 words.
MESSENGER 183 words.
SON 182 words.
FIRST MURDERER 172 words.
SECOND WITCH 146 words.
THIRD WITCH 141 words.
ANGUS 141 words.
OLD MAN 106 words.
SECOND MURDERER 92 words.
APPARITION 76 words.
MENTEITH 76 words.
CAITHNESS 76 words.
DONALBAIN 59 words.
THIRD MURDERER 56 words.
YOUNG SIWARD 55 words.
MURDERER 47 words.
SEYTON 38 words.
SERVANT 37 words.
FLEANCE 17 words.
LORDS 16 words.
BOTH MURDERERS 10 words.
SOLDIERS 5 words.


As predicted, Macbeth takes the lion's share, with Lady Macbeth speaking only about thirty-five percent as many words as her husband. She should have died hereafter, there would have been a time for more words (for her to say).
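
The two comparisons quoted in this investigation can be checked with a few lines of arithmetic. The word counts below are copied from the outputs above; the helper name `percent` is just an illustrative choice.

```python
# Word counts copied from the outputs above.
earnest = {'JACK': 4293, 'ALGERNON': 4271}
macbeth = {'MACBETH': 5744, 'LADY MACBETH': 2031}

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

# Gap between Jack and Algernon, relative to Jack's total.
gap = percent(earnest['JACK'] - earnest['ALGERNON'], earnest['JACK'])
# Lady Macbeth's word count relative to Macbeth's.
share = percent(macbeth['LADY MACBETH'], macbeth['MACBETH'])

print(f"Jack leads Algernon by {gap:.2f}% of his own total.")
print(f"Lady Macbeth speaks {share:.1f}% as many words as Macbeth.")
```

With these counts the gap comes to roughly 0.51% and the share to roughly 35.4%.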