In [406]:
import urllib.request

def fetch(url):
    # Open the URL and close the connection once the body has been read
    with urllib.request.urlopen(url) as response:
        return response.read()

In [407]:
macbethUrl = "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbethSource = fetch(macbethUrl)
macbethSource[0:80]

b'\xef\xbb\xbfThis Etext file is presented by Project Gutenberg, in\r\ncooperation with World'
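The output above is a bytes object, and its first three bytes (\xef\xbb\xbf) are a UTF-8 byte-order mark. As a minimal sketch, assuming the file really is UTF-8 encoded (the mark suggests so, but I have not verified the whole file), the bytes could be decoded to a string like this. The sample value is just the first line of the output above, not a fresh download.

```python
# Sample bytes copied from the start of the output above; this stands in
# for macbethSource so the sketch runs without downloading anything.
sample = b'\xef\xbb\xbfThis Etext file is presented by Project Gutenberg, in\r\n'

# 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if present.
decoded = sample.decode('utf-8-sig')

# Plain 'utf-8' would keep the mark as the character '\ufeff'.
with_mark = sample.decode('utf-8')

print(repr(decoded[:10]))
print(repr(with_mark[:1]))
```

In the notebook itself, BeautifulSoup takes care of the decoding, so this step is only needed if the text is handled without it.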

I followed the directions given in the Jupyter notebook. I need BeautifulSoup to extract just the text and exclude all markup, so I took the function given in the notebook and tweaked it slightly to return soup.text. This seems to have worked.

In [408]:
from bs4 import BeautifulSoup as bs
def extract(source):
    soup = bs(source) 
    return soup.text
macbethSoup = extract(macbethSource)

macbethSoup[0:1000]

'This Etext file is presented by Project Gutenberg, in\r\ncooperation with World Library, Inc., from their Library of the\r\nFuture and Shakespeare CDROMS.  Project Gutenberg often releases\r\nEtexts that are NOT placed in the Public Domain!!\r\n\r\n*This Etext has certain copyright implications you should read!*\r\n\r\n<>\r\n\r\n*Project Gutenberg is proud to cooperate with The World Library*\r\nin the presentation of The Complete Works of William Shakespeare\r\nfor your reading for education and entertainment.  HOWEVER, THIS\r\nIS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY\r\nOF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY\r\nBE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!!\r\nTO GIVE IT AWAY TO ANYONE YOU LIKE, BUT NO CHARGES ARE ALLOWED!!\r\n\r\n\r\n**Welcome To The World of Free Plain Vanilla Electronic Texts**\r\n\r\n**Etexts Readable By Both Humans and By Computers, Since 1971**\r\n\r\n*These Etexts Prepared By Hundreds of Volu

Now I will work with the regex commands. First I need to understand what this function is actually doing. We import the re module and store a regex in the variable directions. Then we define a function called clean(text). Inside it, the variable lines holds the result of a regular expression substitution that replaces every carriage return ('\r') in the text with a newline ('\n'); split then divides the text at the newlines. The function returns each x in lines only when x does not match the regex. Because re.match only checks the beginning of a string, this keeps everything except the lines that begin with the regex. Okay. Now I can move forward.

In [409]:
import re
directions = r'Etext'

def clean(text):
    lines = re.sub('\r', "\n", text).split("\n")
    return [x for x in lines if not re.match(directions, x)]

macbeth = clean(macbethSoup)
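To check my reading of the filter in clean, here is a minimal sketch comparing re.match with re.search on a few lines. The sample list is made up for illustration (it is not the full Gutenberg file), and kept_with_match and kept_with_search are names I am introducing here.

```python
import re

directions = r'Etext'

# Small hypothetical sample, not the real file.
sample_lines = [
    'This Etext file is presented by Project Gutenberg, in',
    'Etexts Readable By Both Humans and By Computers',
    'ACT I. SCENE I.',
]

# re.match only tests the start of each string, so a line that merely
# contains 'Etext' further along is kept.
kept_with_match = [x for x in sample_lines if not re.match(directions, x)]

# re.search tests every position, so any line containing 'Etext' is dropped.
kept_with_search = [x for x in sample_lines if not re.search(directions, x)]

print(kept_with_match)
print(kept_with_search)
```

This would explain why lines such as 'This Etext file is presented...' can still appear after clean has run: they contain the word but do not start with it.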

Okay, so I am starting with the function given in the Jupyter notebook for the Godfather text. I need to split my text into lines, and because this file uses '\r' carriage returns I have to handle those when splitting. I still need to figure out how to fit my regexes into this function, but I'm not sure one function can handle all of them, so I need to break it down to understand it better. I can see below that my text is now split into lines, with an extra empty string after each line because every '\r\n' pair turns into two newlines. I'm going to work with the split text and apply each regex as its own step to clean the text.

In [410]:
macbeth[0:100]

['This Etext file is presented by Project Gutenberg, in',
 '',
 'cooperation with World Library, Inc., from their Library of the',
 '',
 'Future and Shakespeare CDROMS.  Project Gutenberg often releases',
 '',
 '',
 '',
 '',
 '*This Etext has certain copyright implications you should read!*',
 '',
 '',
 '',
 '<>',
 '',
 '',
 '',
 '*Project Gutenberg is proud to cooperate with The World Library*',
 '',
 'in the presentation of The Complete Works of William Shakespeare',
 '',
 'for your reading for education and entertainment.  HOWEVER, THIS',
 '',
 'IS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY',
 '',
 'OF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY',
 '',
 'BE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!!',
 '',
 'TO GIVE IT AWAY TO ANYONE YOU LIKE, BUT NO CHARGES ARE ALLOWED!!',
 '',
 '',
 '',
 '',
 '',
 '**Welcome To The World of Free Plain Vanilla Electronic Texts**',
 '',
 '',
 '',
 '**Etexts Readable By Both Humans and By Comput
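The empty string after every line in the output above comes from the '\r\n' line endings: replacing '\r' with '\n' leaves two newlines per line break. A minimal sketch of the difference, using a made-up sample string, with str.splitlines() as one possible alternative (not what the notebook's function does):

```python
import re

# Hypothetical sample with Windows-style '\r\n' line endings.
sample = 'first line\r\nsecond line\r\n\r\nthird line'

# The approach used in clean(): each '\r\n' becomes '\n\n', so every
# line break produces an extra empty string.
doubled = re.sub('\r', '\n', sample).split('\n')

# str.splitlines() treats '\r\n' as a single line break.
single = sample.splitlines()

print(doubled)
print(single)
```

I am keeping the notebook's version in the steps that follow. Switching to splitlines() would change the line numbers reported later, so those would need to be looked up again.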

So, first I want to get rid of the metadata at the beginning and end of the text. To do this, I need to find where the play begins and ends. I already know the markers from doing this in Atom. I'm writing it as a for loop so I can understand it better. I use re.match to find the line where I know the actual play begins, and enumerate so that I know that line's index and can create a new version of the text starting there.

In [411]:
for num, x in enumerate(macbeth):
    if re.match(r'SCENE:', x):  # re.match already anchors at the start of the line
        print(x)
        print(num)

SCENE: Scotland and England
499


In [412]:
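for num, x in enumerate(macbeth):
    if re.match(r'SCENE:', x):  # line where the play begins
        print("Start Line: ", num)
        beginningline = num
    if re.search(r'-THE END-', x):  # end-of-play marker
        print("End Line: ", num)
        endline = num

# Keep the lines from the start marker up to, but not including, the end marker
step2 = macbeth[beginningline:endline]
step2[0:50]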

Start Line:  499
End Line:  6243


['SCENE: Scotland and England',
 '',
 '',
 '',
 '',
 '',
 'ACT I. SCENE I.',
 '',
 'A desert place. Thunder and lightning.',
 '',
 '',
 '',
 'Enter three Witches.',
 '',
 '',
 '',
 '  FIRST WITCH. When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 "  SECOND WITCH. When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 '  THIRD WITCH. That will be ere the set of sun.',
 '',
 '  FIRST WITCH. Where the place?',
 '',
 '  SECOND WITCH. Upon the heath.',
 '',
 '  THIRD WITCH. There to meet with Macbeth.',
 '',
 '  FIRST WITCH. I come, Graymalkin.',
 '',
 '  ALL. Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.                Exeunt.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'SCENE II.',
 '']
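The loop above can also be wrapped in a small helper. This is only a sketch: slice_between is a name I made up, and the sample list is a stand-in for the real text. Unlike the loop, which remembers the last matching line, it stops at the first start marker and the first end marker after it.

```python
import re

def slice_between(lines, start_pattern, end_pattern):
    """Return lines from the first start match up to, not including, the next end match."""
    # next() raises StopIteration if a marker is never found.
    start = next(i for i, x in enumerate(lines) if re.match(start_pattern, x))
    end = next(i for i, x in enumerate(lines)
               if i > start and re.search(end_pattern, x))
    return lines[start:end]

# Hypothetical sample standing in for the list called macbeth.
sample = ['header', 'SCENE: Scotland and England', '',
          'ACT I. SCENE I.', '-THE END-', 'footer']

body = slice_between(sample, r'SCENE:', r'-THE END-')
print(body)
```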

Now I have only the text of the play, so I can work through the regular expressions. I will begin with the one that finds all left-justified text and drops it. I'm trying a simple expression to see if it works: the list comprehension goes through each line in step2 and keeps every line except those that start with a non-whitespace character (the left-justified lines). Empty lines are kept, since they contain no character for the expression to match.

In [413]:
step3 = [x for x in step2 if not re.match(r'^\S+', x)]


step2[0:50]


['SCENE: Scotland and England',
 '',
 '',
 '',
 '',
 '',
 'ACT I. SCENE I.',
 '',
 'A desert place. Thunder and lightning.',
 '',
 '',
 '',
 'Enter three Witches.',
 '',
 '',
 '',
 '  FIRST WITCH. When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 "  SECOND WITCH. When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 '  THIRD WITCH. That will be ere the set of sun.',
 '',
 '  FIRST WITCH. Where the place?',
 '',
 '  SECOND WITCH. Upon the heath.',
 '',
 '  THIRD WITCH. There to meet with Macbeth.',
 '',
 '  FIRST WITCH. I come, Graymalkin.',
 '',
 '  ALL. Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.                Exeunt.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'SCENE II.',
 '']

Okay, so above is the old text and below is the new. Comparing them, the expression did what I wanted: it got rid of all the left-justified text.

In [414]:
step3[0:50]

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '  FIRST WITCH. When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 "  SECOND WITCH. When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 '  THIRD WITCH. That will be ere the set of sun.',
 '',
 '  FIRST WITCH. Where the place?',
 '',
 '  SECOND WITCH. Upon the heath.',
 '',
 '  THIRD WITCH. There to meet with Macbeth.',
 '',
 '  FIRST WITCH. I come, Graymalkin.',
 '',
 '  ALL. Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.                Exeunt.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
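The output above still has a lot of empty strings, because the filter only removes lines that start with a non-whitespace character. A minimal sketch on a few sample lines, with an optional second pass (my own addition, not part of the notebook) that also drops the blank lines:

```python
import re

# Sample lines in the same shape as step2 above.
sample = [
    'ACT I. SCENE I.',
    '',
    'Enter three Witches.',
    '  FIRST WITCH. When shall we three meet again?',
    '    In thunder, lightning, or in rain?',
]

# Same filter as step3: drop lines whose first character is not whitespace.
kept = [x for x in sample if not re.match(r'\S', x)]

# Optional extra pass: drop lines that are empty or only whitespace.
non_blank = [x for x in kept if x.strip()]

print(kept)
print(non_blank)
```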

So now I want to remove all the capitalized character names. For this, I need to find capital letters at the beginning of the line, so I will use the same regular expression I used in Atom: starting at the beginning of the line, find 1 to 4 whitespace characters, then any number of capital letters and spaces, and finally the period, escaped as \.

In [27]:
step4 = [re.sub(r'(^\s{1,4})[A-Z\s]*\.', "", x) for x in step3]
step4[0:50]


['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 ' When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 " When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 ' That will be ere the set of sun.',
 '',
 ' Where the place?',
 '',
 ' Upon the heath.',
 '',
 ' There to meet with Macbeth.',
 '',
 ' I come, Graymalkin.',
 '',
 ' Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.                Exeunt.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
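To see why the speaker lines change while the indented continuation lines do not, here is a minimal sketch that applies the same pattern to three lines taken from the output above. The name speaker is one I am introducing for the compiled pattern.

```python
import re

# Same pattern as step4, compiled once: 1 to 4 leading whitespace
# characters, then capitals and spaces, then a literal period.
speaker = re.compile(r'^\s{1,4}[A-Z\s]*\.')

sample = [
    '  FIRST WITCH. When shall we three meet again?',
    '    In thunder, lightning, or in rain?',
    '  ALL. Paddock calls. Anon!',
]

stripped = [speaker.sub('', x) for x in sample]
print(stripped)
```

The continuation line is untouched because after its capital 'I' comes a lowercase 'n' rather than a period, so the pattern never completes. The single leading space on the other lines is the space that followed the period.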

Okay, this looks good! The only things left are the asides in brackets and the right-justified exits. I will start with the right-justified exits. I will use re.sub to replace a run of 8 or more whitespace characters, together with everything after it on the line, with an empty string (which catches all the right-justified text).

In [28]:
step5 = [re.sub(r'\s{8,}.*', '', x) for x in step4]
step5[0:100]

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 ' When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 " When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 ' That will be ere the set of sun.',
 '',
 ' Where the place?',
 '',
 ' Upon the heath.',
 '',
 ' There to meet with Macbeth.',
 '',
 ' I come, Graymalkin.',
 '',
 ' Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 ' What bloody man is that? He can report,',
 '',
 '    As seemeth by his plight, of the revolt',
 '',
 '    The newest state.',
 '',
 ' This is the sergeant',
 '',
 '    Who like a good and hardy soldier fought',
 '',
 "    'Gainst my captivity. Hail, brave friend!",
 '',
 '    Say to the King the knowledge of the broil',
 '',
 '    As thou didst leave it.',
 '',
 ' Doubtful it stood,',
 '',
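A minimal sketch of what this substitution does, on two lines from the play plus one made-up line. The last sample line is hypothetical: I have not checked whether the text contains lines indented by 8 or more spaces, but if it does, the pattern would blank them completely.

```python
import re

sample = [
    '    Hover through the fog and filthy air.                Exeunt.',
    '    Fair is foul, and foul is fair.',
    '                    Exit.',  # hypothetical deeply indented line
]

# Same substitution as step5: a run of 8+ whitespace characters and
# everything after it on the line is replaced with an empty string.
trimmed = [re.sub(r'\s{8,}.*', '', x) for x in sample]
print(trimmed)
```

The 4-space indent at the start of a verse line is too short to trigger the pattern, so only the right-justified part is removed.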


Now all I need is to get rid of the asides in brackets. I will use the same substitution method as above, with a regex that finds bracketed text at the beginning of a line.

In [29]:
step6 = [re.sub(r'^\s*\[.*\]', '', x) for x in step5]
step6[0:100]

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 ' When shall we three meet again?',
 '',
 '    In thunder, lightning, or in rain?',
 '',
 " When the hurlyburly's done,",
 '',
 "    When the battle's lost and won.",
 '',
 ' That will be ere the set of sun.',
 '',
 ' Where the place?',
 '',
 ' Upon the heath.',
 '',
 ' There to meet with Macbeth.',
 '',
 ' I come, Graymalkin.',
 '',
 ' Paddock calls. Anon!',
 '',
 '    Fair is foul, and foul is fair.',
 '',
 '    Hover through the fog and filthy air.',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 ' What bloody man is that? He can report,',
 '',
 '    As seemeth by his plight, of the revolt',
 '',
 '    The newest state.',
 '',
 ' This is the sergeant',
 '',
 '    Who like a good and hardy soldier fought',
 '',
 "    'Gainst my captivity. Hail, brave friend!",
 '',
 '    Say to the King the knowledge of the broil',
 '',
 '    As thou didst leave it.',
 '',
 ' Doubtful it stood,',
 '',
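One thing to keep in mind is that this pattern only removes brackets at the start of a line. As a sketch, here it is next to a non-greedy variant that matches brackets anywhere in the line. The variant is my own assumption about what might be wanted, and both sample lines are made up for illustration.

```python
import re

# Hypothetical sample lines, not quoted from the file.
sample = [
    ' [Aside.] Two truths are told,',
    '    Some spoken text. [Aside.] More text,',
]

# Same pattern as step6: anchored at the start of the line.
anchored = [re.sub(r'^\s*\[.*\]', '', x) for x in sample]

# Variant: non-greedy and unanchored, so each bracketed span is removed
# wherever it sits in the line.
anywhere = [re.sub(r'\[.*?\]', '', x) for x in sample]

print(anchored)
print(anywhere)
```

The variant leaves a double space where the brackets used to be, so it may need a follow-up step to tidy the whitespace.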


Great! I think I've got it! The biggest challenges were understanding the line-break structure and integrating Python syntax into my understanding of regex. Combining the two is hard, but I think I have at least a basic understanding now.