Lists of titles derived from the works of William Shakespeare are housed at **Titles from Shakespeare**: http://www.barbarapaul.com/shake.html

The website above links to 37 subpages, one for each of Shakespeare's works; each subpage lists the title, author, and quoted passage. The exception is Hamlet, whose page is split into five subpages, one per act.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests, re
import os

pd.set_option('display.max_colwidth', None)

Scraping the 37 urls from the main site:

In [2]:
url = 'http://www.barbarapaul.com/shake.html'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

In [3]:
links = {}

for link in soup.find_all('a', style="text-decoration:none;"):
    href = link.get('href')
    abb = href.replace('http://www.barbarapaul.com/shake/', '').replace('.html', '')
    links[abb] = href
    
assert len(links) == 37

In [4]:
#hamlet.html contains 5 subpages. Remove hamlet and add the five subpages

links.pop('hamlet')

links['hamlet1'] = 'http://www.barbarapaul.com/shake/hamlet1.html'
links['hamlet2'] = 'http://www.barbarapaul.com/shake/hamlet2.html'
links['hamlet3'] = 'http://www.barbarapaul.com/shake/hamlet3.html'
links['hamlet4'] = 'http://www.barbarapaul.com/shake/hamlet4.html'
links['hamlet5'] = 'http://www.barbarapaul.com/shake/hamlet5.html'
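
The five Hamlet entries follow a single pattern, so they could also be generated in a loop. This is a minimal sketch, assuming the `hamlet1` to `hamlet5` naming seen above; `base` and the one-entry `links` dictionary are stand-ins for illustration, not the scraped data.

```python
# Stand-in for the scraped dictionary; only the Hamlet entry matters here.
base = 'http://www.barbarapaul.com/shake/'
links = {'hamlet': base + 'hamlet.html'}

# Drop the parent page (no error if it has already been removed) ...
links.pop('hamlet', None)

# ... and add one entry per act, hamlet1 through hamlet5.
for act in range(1, 6):
    key = f'hamlet{act}'
    links[key] = f'{base}{key}.html'

print(sorted(links))
```

Passing `None` as the default to `pop` makes the cell safe to re-run, which the bare `links.pop('hamlet')` is not.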

In [5]:
print('Number of plays in dictionary:', len(links))
for key, value in links.items():
    print(key, ':', value)

Number of plays in dictionary: 41
allswell : http://www.barbarapaul.com/shake/allswell.html
antony : http://www.barbarapaul.com/shake/antony.html
ayli : http://www.barbarapaul.com/shake/ayli.html
errors : http://www.barbarapaul.com/shake/errors.html
corio : http://www.barbarapaul.com/shake/corio.html
cymb : http://www.barbarapaul.com/shake/cymb.html
1henryiv : http://www.barbarapaul.com/shake/1henryiv.html
2henryiv : http://www.barbarapaul.com/shake/2henryiv.html
henryv : http://www.barbarapaul.com/shake/henryv.html
1henryvi : http://www.barbarapaul.com/shake/1henryvi.html
2henryvi : http://www.barbarapaul.com/shake/2henryvi.html
3henryvi : http://www.barbarapaul.com/shake/3henryvi.html
julius : http://www.barbarapaul.com/shake/julius.html
john : http://www.barbarapaul.com/shake/john.html
lear : http://www.barbarapaul.com/shake/lear.html
labours : http://www.barbarapaul.com/shake/labours.html
macbeth : http://www.barbarapaul.com/shake/macbeth.html
m4m : http://www.barbarapaul.com/shake

Using the `links` dictionary and Beautiful Soup, I extracted the author, the title, and the Shakespearean play each title was derived from, and created a dataframe for each page. I then merged the 41 dataframes into one large dataframe.

In [6]:
big_df = pd.DataFrame(columns = ['Play', 'Author', 'Title'])

for key, value in links.items():
    url = value
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly, as in In [2]
    
    shake_titles = []

    for item in soup.find_all('li'):
        title = item.text
        title = re.sub(' +', ' ', title) 
        shake_titles.append(title.replace('\n', ''))
        
    title_df = pd.DataFrame(shake_titles, columns=['Play'])

    new = title_df['Play'].str.split(": ", n=1, expand=True)
    title_df['Author'] = new[0]
    title_df['Title'] = new[1]
    title_df['Play'] = key  # label every row with the play abbreviation
        
    print(key, title_df.shape)
    big_df = pd.concat([big_df, title_df])  # DataFrame.append was removed in pandas 2.0

allswell (37, 3)
antony (18, 3)
ayli (129, 3)
errors (6, 3)
corio (6, 3)
cymb (3, 3)
1henryiv (18, 3)
2henryiv (14, 3)
henryv (34, 3)
1henryvi (3, 3)
2henryvi (13, 3)
3henryvi (17, 3)
julius (86, 3)
john (44, 3)
lear (250, 3)
labours (9, 3)
macbeth (289, 3)
m4m (11, 3)
merchant (123, 3)
wives (15, 3)
mnd (71, 3)
muchado (37, 3)
othello (90, 3)
pericles (2, 3)
richii (102, 3)
richiii (43, 3)
romeo (79, 3)
shrew (16, 3)
tempest (111, 3)
timon (12, 3)
titus (12, 3)
troilus (12, 3)
12n (86, 3)
2gents (25, 3)
winter (25, 3)
sonnets (66, 3)
hamlet1 (195, 3)
hamlet2 (105, 3)
hamlet3 (252, 3)
hamlet4 (54, 3)
hamlet5 (64, 3)


In [7]:
big_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2584 entries, 0 to 63
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Play    2584 non-null   object
 1   Author  2584 non-null   object
 2   Title   2577 non-null   object
dtypes: object(3)
memory usage: 80.8+ KB


In [8]:
big_df.sample(10)

Unnamed: 0,Play,Author,Title
49,hamlet1,Nick O'Donahoe,Too Too Solid Flesh
43,hamlet3,Huia Mase,Outrageous Fortune
214,macbeth,B. Pinkerton,This Petty Pace
233,lear,Rodney Atkinson,Europe's Full Circle: Corporate Elites and the New Fascism
9,richii,Denisa Newborough,Fire in My Blood
43,ayli,Michael Bender,All the World's a Stage
44,tempest,Margaret Paul Joseph,Caliban in Exile: The Outsider in Caribbean Fiction
216,macbeth,Pamela Gray Ahearn,All Our Yesterdays
82,richii,Kenneth Lindley,Of Graves and Epitaphs
120,merchant,Norah Kelly,On Such a Night


In [9]:
big_dfv2 = big_df.reset_index()
big_dfv2.drop(columns=['index'], inplace=True)

In [10]:
no_title = big_dfv2['Title'].isna()
big_dfv2[no_title]

Unnamed: 0,Play,Author,Title
343,julius,"William Saffire, comp.:Lend Me Your Ears",
1086,merchant,The Gentler Virtues in Greek Literature,
1980,hamlet1,Power of Imagery for Personal Enrichment,
2074,hamlet1,"Violence, Horror & Sensationalism in Stories for Children",
2107,hamlet1,New Environments and Why They Happen,
2134,hamlet2,Sydney Sipho Sepamla:Children of the Earth,
2357,hamlet3,David Christie Murray & Henry Herman:One Traveller Returns,


Data cleaning notes:
1. The author:title for `big_dfv2.loc[2357]` was not split properly. Correct this entry.
2. The author:title for `big_dfv2.loc[2134]` was not split properly. Correct this entry.
3. `big_dfv2.loc[2107]` was improperly added as a separate list item. Partial non-Shakespearean title: delete.
4. `big_dfv2.loc[2074]` was improperly added as a separate list item. Partial non-Shakespearean title: delete.
5. `big_dfv2.loc[1980]` a) was improperly added as a separate list item and belongs to `big_dfv2.loc[1979]`, and b) this book is duplicated at `big_dfv2.loc[2000]`. Delete these two entries.
6. `big_dfv2.loc[1086]` was improperly added as a separate list item. Partial non-Shakespearean title: delete.
7. The author:title for `big_dfv2.loc[343]` was not split properly. Correct this entry.
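
Three of the rows above failed because the colon has no trailing space, so splitting on `": "` finds nothing to split on. A more tolerant split could look like the sketch below, which uses plain `re.split` on two entries from the table; the helper name `split_author_title` is hypothetical and not part of the notebook.

```python
import re

def split_author_title(entry):
    """Split 'Author: Title' on the first colon, with or without a space."""
    parts = re.split(r':\s*', entry, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    # No colon at all: keep the text and flag the title as missing.
    return entry.strip(), None

print(split_author_title('Sydney Sipho Sepamla:Children of the Earth'))
print(split_author_title('New Environments and Why They Happen'))
```

Stray fragments with no colon still come back with a missing title, so they would still need to be reviewed and dropped by hand.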

In [11]:
# Use a single .loc[row, column] call; chained indexing such as
# .iloc[row]['col'] = ... may write to a temporary copy instead of the dataframe.
big_dfv2.loc[2357, 'Title'] = 'One Traveller Returns'
big_dfv2.loc[2357, 'Author'] = 'David Christie Murray & Henry Herman'

big_dfv2.loc[2134, 'Author'] = 'Sydney Sipho Sepamla'
big_dfv2.loc[2134, 'Title'] = 'Children of the Earth'

big_dfv2.loc[343, 'Author'] = 'William Saffire'
big_dfv2.loc[343, 'Title'] = 'Lend Me Your Ears'

In [12]:
big_dfv2 = big_dfv2.drop(index=[2107, 2074, 1980, 1979, 1086])

In [13]:
big_dfv2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2579 entries, 0 to 2583
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Play    2579 non-null   object
 1   Author  2579 non-null   object
 2   Title   2579 non-null   object
dtypes: object(3)
memory usage: 80.6+ KB


<div class="alert alert-block alert-info">
Rename Shakespeare's plays to match the plays from the Kaggle/Wikipedia dataset and consolidate the five Hamlet subpages.
</div>
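
The renaming is a many-to-one mapping: the five Hamlet labels collapse into one and the two `henry iv` parts into one, which is how 41 labels become 36. A minimal sketch of that behaviour with `Series.replace`, using a hand-picked subset of the labels rather than the full column:

```python
import pandas as pd

# A few abbreviations as they appear in the `Play` column.
plays = pd.Series(['hamlet1', 'hamlet5', '1henryiv', '2henryiv', 'macbeth'])

# Keys match whole cell values; anything not listed is left unchanged.
renaming = {
    'hamlet1': 'hamlet',
    'hamlet5': 'hamlet',
    '1henryiv': 'henry iv',
    '2henryiv': 'henry iv',
}

renamed = plays.replace(renaming)
print(renamed.unique().tolist())
```

Labels that already match the target dataset, such as `macbeth`, do not need an entry in the dictionary.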

In [14]:
big_dfv3 = big_dfv2.copy()  # copy, so later edits do not also change big_dfv2
plays = big_dfv3.Play.unique()
print(len(plays), plays)

41 ['allswell' 'antony' 'ayli' 'errors' 'corio' 'cymb' '1henryiv' '2henryiv'
 'henryv' '1henryvi' '2henryvi' '3henryvi' 'julius' 'john' 'lear'
 'labours' 'macbeth' 'm4m' 'merchant' 'wives' 'mnd' 'muchado' 'othello'
 'pericles' 'richii' 'richiii' 'romeo' 'shrew' 'tempest' 'timon' 'titus'
 'troilus' '12n' '2gents' 'winter' 'sonnets' 'hamlet1' 'hamlet2' 'hamlet3'
 'hamlet4' 'hamlet5']


In [15]:
renaming = {
    'allswell': 'alls well that ends well',
    'antony': 'antony and cleopatra',
    'ayli': 'as you like it',
    'errors': 'comedy of errors',
    'corio': 'coriolanus',
    'cymb': 'cymbeline',
    '1henryiv': 'henry iv',
    '2henryiv': 'henry iv',
    'henryv': 'henry v',
    '1henryvi': 'henry vi part 1',
    '2henryvi': 'henry vi part 2',
    '3henryvi': 'henry vi part 3',
    'julius': 'julius caesar',
    'john': 'king john',
    'lear': 'king lear',
    'labours': 'loves labours lost',
    'm4m': 'measure for measure',
    'merchant': 'merchant of venice',
    'wives': 'merry wives of windsor',
    'mnd': 'midsummer nights dream',
    'muchado': 'much ado about nothing',
    'richii': 'richard ii',
    'richiii': 'richard iii',
    'romeo': 'romeo and juliet',
    'shrew': 'taming of the shrew',
    'timon': 'timon of athens',
    'titus': 'titus andronicus',
    'troilus': 'troilus and cressida',
    '12n': 'twelfth night',
    '2gents': 'two gentlemen of verona',
    'winter': 'winters tale',
    'hamlet1': 'hamlet',
    'hamlet2': 'hamlet',
    'hamlet3': 'hamlet',
    'hamlet4': 'hamlet',
    'hamlet5': 'hamlet',
}
big_dfv3.Play = big_dfv3.Play.replace(renaming)

In [16]:
plays = big_dfv3.Play.unique()
print(len(plays), plays)

36 ['alls well that ends well' 'antony and cleopatra' 'as you like it'
 'comedy of errors' 'coriolanus' 'cymbeline' 'henry iv' 'henry v'
 'henry vi part 1' 'henry vi part 2' 'henry vi part 3' 'julius caesar'
 'king john' 'king lear' 'loves labours lost' 'macbeth'
 'measure for measure' 'merchant of venice' 'merry wives of windsor'
 'midsummer nights dream' 'much ado about nothing' 'othello' 'pericles'
 'richard ii' 'richard iii' 'romeo and juliet' 'taming of the shrew'
 'tempest' 'timon of athens' 'titus andronicus' 'troilus and cressida'
 'twelfth night' 'two gentlemen of verona' 'winters tale' 'sonnets'
 'hamlet']


<div class="alert alert-block alert-info">
Many of the book titles are compound and contain a subtitle. I split the `Title` column on the first colon or semicolon and created two new columns: `Primary Title` and `Subtitle`.
</div>
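
The effect of splitting on the first colon or semicolon can be seen on single strings. The sketch below uses plain `re.split` instead of the pandas string method, and the helper name `split_subtitle` is hypothetical. The last example reproduces a pitfall visible in the sample output: a colon inside parentheses also triggers the split.

```python
import re

def split_subtitle(title):
    """Split on the first ':' or ';' and return (primary, subtitle)."""
    parts = re.split(r'[:;]\s*', title, maxsplit=1)
    primary = parts[0]
    subtitle = parts[1] if len(parts) == 2 else None
    return primary, subtitle

print(split_subtitle("Europe's Full Circle: Corporate Elites and the New Fascism"))
print(split_subtitle('The Hamlet Trap'))
print(split_subtitle('Black Hamlet (Parallax: Re-Visions of Culture and Society)'))
```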

In [17]:
sub = big_dfv3['Title'].str.split(": |; |:|;", n = 1, expand = True)
big_dfv3['Primary Title']= sub[0]
big_dfv3['Subtitle']= sub[1] 

In [18]:
big_dfv3.sample(10)

Unnamed: 0,Play,Author,Title,Primary Title,Subtitle
1914,hamlet,Kate Wilhelm,The Hamlet Trap,The Hamlet Trap,
1923,hamlet,Wulf Sachs et alia,Black Hamlet (Parallax: Re-Visions of Culture and Society),Black Hamlet (Parallax,Re-Visions of Culture and Society)
1023,merchant of venice,Manning Coles,All That Glitters,All That Glitters,
805,macbeth,August Derleth,"Night's Yawning Peal, a Ghostly Company","Night's Yawning Peal, a Ghostly Company",
59,as you like it,Margaret Peterson,Love Is Enough,Love Is Enough,
2424,hamlet,Joan Howes,A Cry of Players,A Cry of Players,
185,comedy of errors,Virgil Burnett,A Comedy of Eros,A Comedy of Eros,
451,king lear,Diana L. Paxson,The Serpent's Tooth,The Serpent's Tooth,
1160,midsummer nights dream,Alicen White,Nor Spell Nor Charm,Nor Spell Nor Charm,
984,measure for measure,Charles Hooper,Brief Authority,Brief Authority,


<div class="alert alert-block alert-info">
    <b>Additional Considerations:</b> While the split successfully parses out many primary titles, there are still instances of Shakespearean titles showing up in the Subtitle column, or in the Primary Title column in (parentheses) or separated by a comma. Additionally, many of the titles are a play on words, such as <i>A Comedy of Eros</i> or <i>The Seven Ages of Woman</i>. Finally, what about articles? Is <i>The Web of Life</i> the same as <i>Web of Life</i>, and <i>A Mingled Yarn</i> the same as <i>Mingled Yarn</i>?
</div>
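
If titles that differ only by a leading article were treated as the same, one possible normalization is sketched below. Whether to merge them is a judgment call the notebook leaves open; this only shows the mechanics, and `strip_leading_article` is a hypothetical helper.

```python
import re

def strip_leading_article(title):
    """Remove one leading 'a', 'an', or 'the' (case-insensitive)."""
    return re.sub(r'^(?:a|an|the)\s+', '', title.strip(), flags=re.IGNORECASE)

for t in ['The Web of Life', 'Web of Life', 'A Mingled Yarn', 'Mingled Yarn']:
    print(repr(t), '->', repr(strip_leading_article(t)))
```

Because the pattern requires whitespace after the article, a title such as `All Our Yesterdays` is left untouched.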

In [19]:
titles_df = big_dfv3.drop(columns=['Title'])

In [20]:
titles_df.sample(5)

Unnamed: 0,Play,Author,Primary Title,Subtitle
1883,sonnets,Brad Steiger,Inside Heaven's Gate,The UFO Cult Leaders Tell Their Story in Their Own Words
432,king lear,Eileen Winwood,Words of Love,
2300,hamlet,Junerwanda Michaels,What Dreams May Come,
2099,hamlet,"James Ursini and Alain Silver, eds.",More Things Than Are Dreamt Of,Masterpieces of Supernatural Horror
2281,hamlet,Mary Luytens,Perchance To Dream,


In [21]:
dups = titles_df.duplicated()
titles_df[dups]

Unnamed: 0,Play,Author,Primary Title,Subtitle
569,king lear,Alicia Brandon,Full Circle,
604,king lear,"Edward Haskell, ed.",Full Circle,The Moral Force of Unified Science
1135,midsummer nights dream,Robert F. Baylus,A Midsummer Night's Murder,
1846,winters tale,Isaac Asimov,The Gods Themselves,
1960,hamlet,Samuel Rogers,Less Than Kind,
2558,hamlet,Augusto Monterroso,The Rest Is Silence,


In [22]:
print('Initial title size: ',titles_df.shape[0])

titles_df.drop(index=[2558, 1960, 1846, 1135, 604, 569], inplace=True)
titles_df.shape[0]

Initial title size:  2579


2573

<div class="alert alert-block alert-info">
Investigating titles used more than once. Duplicated entries (rows) have been deleted above.
</div>

In [23]:
duplicated_titles = {}

for title in titles_df['Primary Title']:
    if title in duplicated_titles:
        duplicated_titles[title] += 1
    else:
        duplicated_titles[title] = 1

In [24]:
print('Number of duplicated titles: ', len({k for k, v in duplicated_titles.items() if v >= 2}))
print('Number of unique titles: ',len(duplicated_titles))
print('Initial title size: ',titles_df.shape[0])

Number of duplicated titles:  330
Number of unique titles:  1430
Initial title size:  2573


In [25]:
print({k for k, v in duplicated_titles.items() if v >= 2})

{'I Could a Tale Unfold', "Passion's Slave", 'Not of Woman Born', 'Call Back Yesterday', 'SWAK', 'Goodnight, Desdemona (Good Morning, Juliet)', 'Rue with a Difference', 'He Should Have Died Hereafter', 'The Better Part of Valor', 'Enter Two Murderers', "On Fortune's Wheel", "All the World's a Stage", 'The Whirligig of Time', 'Merely Players', 'Enter Murderers', 'Lend Me Your Ears', 'Flame of Love', 'Shylock', 'Too Like the Lightning', 'To Prove a Villain', 'Salad Days', 'Living Light', 'Cauldron Bubble', 'How Like an Angel', 'To Be or Not To Be', "A Midsummer Night's Murder", 'How Like a God', 'This Mortal Coil', 'The Shadow of His Wings', 'The Queen Is Dead', 'Some Must Watch', 'Tomorrow and Tomorrow and Tomorrow', 'Whirligig of Time', 'Shame the Devil', 'This Above All', 'Good Night, Sweet Prince', 'An Ear to the Ground', "Fortune's Wheel", 'Forms of Things Unknown', 'Uncertain Glory', 'Exeunt Murderers', 'Come Away, Death', "Play's the Thing", 'The Glass of Fashion', 'Such Sweet Thu

<div class="alert alert-block alert-info">
Does capitalization affect the duplicates? What about punctuation? I must be careful about punctuation at this point, as the large data set of lines from Shakespeare's plays still contains punctuation (and capitalization, for that matter).
</div>
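
One way to test both questions without altering the stored titles is to build a separate comparison key and count on that. A minimal sketch with `collections.Counter`; the spelling variants in `titles` are made up for illustration and `comparison_key` is a hypothetical helper.

```python
import string
from collections import Counter

def comparison_key(title):
    """Lowercase and drop ASCII punctuation; used only for comparing."""
    no_punct = title.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(no_punct.lower().split())

# Hypothetical variants of two titles, differing in case or punctuation.
titles = ['To Be or Not To Be', 'To be or not to be',
          'Come Away, Death', 'Come Away Death']

exact = Counter(titles)
normalized = Counter(comparison_key(t) for t in titles)

print(sum(1 for n in exact.values() if n >= 2))       # repeated verbatim
print(sum(1 for n in normalized.values() if n >= 2))  # repeated after normalizing
```

Note that `string.punctuation` includes the apostrophe, so possessives lose it under this key.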

In [26]:
titles_df2 = titles_df.copy()  # copy, so lowercasing does not also change titles_df
titles_df2['Primary Title'] = titles_df2['Primary Title'].str.lower()

titles_df2.sample(5)

Unnamed: 0,Play,Author,Primary Title,Subtitle
2100,hamlet,Philip K. Dick,time out of joint,
725,macbeth,Gary Russell,dr. who,Instruments of Darkness
229,henry iv,Paul Haggard,dead is the door-nail,
86,as you like it,Charles Earle Funk,thereby hangs a tale,
1000,merchant of venice,David Polowe,shylock,A One-act Comedy in Verse


In [27]:
duplicated_titles2 = {}

for title in titles_df2['Primary Title']:
    if title in duplicated_titles2:
        duplicated_titles2[title] += 1
    else:
        duplicated_titles2[title] = 1


print('Number of lower case duplicated titles: ', len({k for k, v in duplicated_titles2.items() if v >= 2}))
print('Number of original duplicated titles: ', len({k for k, v in duplicated_titles.items() if v >= 2}))

Number of lower case duplicated titles:  332
Number of original duplicated titles:  330


<div class="alert alert-block alert-info">
I'm just going to bite the bullet here and create a unique list of book titles (lower case) and play of origin.
</div>
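
The unique list with an occurrence count can also be built in one pass. The sketch below is an alternative to the `value_counts` plus `merge` route, assuming the same `Play` and `Title` columns; the rows are hand-made and the counts are illustrative, not taken from the data.

```python
import pandas as pd

# Hand-made rows; the counts are illustrative only.
df = pd.DataFrame({
    'Play':  ['hamlet', 'hamlet', 'macbeth', 'king lear'],
    'Title': ['mortal coil', 'mortal coil', 'mortal coil', 'full circle'],
})

# Count how often each title occurs across all plays ...
df['Occurences'] = df.groupby('Title')['Title'].transform('size')

# ... then keep one row per (Play, Title) pair.
unique_rows = df.drop_duplicates().reset_index(drop=True)
print(unique_rows)
```

A title attributed to more than one play keeps one row per play, so the unique list can be longer than the number of distinct titles.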


In [28]:
titles_df3 = titles_df2.drop(columns=['Author', 'Subtitle'])
titles_df3.rename(columns={'Primary Title': 'Title'}, inplace=True)

In [29]:
titles_df3.sample(5)

Unnamed: 0,Play,Title
537,king lear,god's spies
755,macbeth,sleep no more
1332,richard ii,fire in the blood
2480,hamlet,me darlin' dublin's dead and gone
1298,othello,sweet revenge


In [30]:
titles_vc = titles_df3.Title.value_counts()

In [31]:
df_titles_vc = pd.DataFrame(titles_vc)
df_titles_vc = df_titles_vc.reset_index()
df_titles_vc.columns = ['Title', 'Occurences']

In [32]:
df_titles_vc.sample(5)

Unnamed: 0,Title,Occurences
868,the better part of valour,1
1161,romeo and -- jane,1
938,a nasty piece of work,1
725,all that glisters,1
79,lend me your ears,5


In [33]:
titles_df4 = pd.merge(titles_df3, df_titles_vc, on='Title')
titles_df4.sample(5)

Unnamed: 0,Play,Title,Occurences
1434,richard ii,the hollow crown,8
254,henry v,when the blast of war blows,1
159,as you like it,one man in his time,12
2387,hamlet,further country matters,1
2213,hamlet,to be or not to be married,1


In [34]:
uniq_titles_df = titles_df4.drop_duplicates(keep='first')
print(uniq_titles_df.shape)

(1461, 3)


<div class="alert alert-block alert-info">
Using fuzzywuzzy to see whether some of these "unique" titles have typos and/or can otherwise be binned with another set of titles. After reviewing the output, I think proofing by eye and hand-editing would take far too much time at the moment. </div>


In [35]:
uniq_titleist = uniq_titles_df['Title'].unique().tolist()
sorted(uniq_titleist)[:10]


['"better angels" of capitalism',
 '"to be or not to be, that is the question"...for ulster',
 '"to be or not to be,"  happiness or misery? "that is the question." being four lectures on the functions and disorders of the  nervous system and the reproductive organs',
 '"to do" or "to make," that\'s the question',
 "'pale cast...'",
 '...and all our yesterdays',
 '100 more tales from  all our yesterdays',
 '2001 nights',
 'a beast with two backs',
 'a better world']

In [36]:
from fuzzywuzzy import process, fuzz

score_sort = [(x,) + i
             for x in uniq_titleist 
             for i in process.extract(x, uniq_titleist, scorer=fuzz.token_sort_ratio)]

similarity_sort = pd.DataFrame(score_sort, columns=['title_sort','match_sort','score_sort'])
similarity_sort.head()



Unnamed: 0,title_sort,match_sort,score_sort
0,all's well that ends well,all's well that ends well,100
1,all's well that ends well,all's well that ends wrong,86
2,all's well that ends well,jing tao hai land de ren sheng (all's well that ends well),62
3,all's well that ends well,all that glitters in water,59
4,all's well that ends well,all's well,57


In [37]:
possible_matches = similarity_sort[(similarity_sort.score_sort < 100) & (similarity_sort.score_sort >=95)]
possible_matches.sample(20)


Unnamed: 0,title_sort,match_sort,score_sort
726,alarum and excursion,alarums and excursions,95
3516,better angel,better angels,96
2772,all that glitters #2,all that glitters #3,95
3451,murder's out of tune,murder out of tune,95
2802,all that glitters #8,all that glitters #2,95
566,the better part of valour,the better part of valor,98
2506,walking shadows,walking shadow,97
2774,all that glitters #2,all that glitters #5,95
2771,all that glitters #2,all that glitters #1,95
3526,better angels,better angel,96
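
The sample above lists some pairs twice, once in each direction (for example `better angel` / `better angels`). Comparing each unordered pair only once avoids that. The sketch below uses the standard library's `difflib` as a stand-in scorer, so its scores will not match fuzzywuzzy's `token_sort_ratio` exactly; the short `titles` list is taken from the sample output.

```python
from difflib import SequenceMatcher
from itertools import combinations

titles = ['walking shadows', 'walking shadow', 'better angel',
          'better angels', 'sleep no more']

def similarity(a, b):
    """Return a 0-100 similarity score from difflib (not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# combinations() yields each unordered pair once, so (a, b) and (b, a)
# cannot both appear, and no title is compared with itself.
close_pairs = [(a, b, similarity(a, b))
               for a, b in combinations(titles, 2)
               if 95 <= similarity(a, b) < 100]

for pair in close_pairs:
    print(pair)
```

This roughly halves the number of candidate pairs to proof by eye, though the number of comparisons still grows quadratically with the number of titles.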


<div class="alert alert-block alert-info">
I have also created a second list of the subtitles to see if any Shakespearean titles ended up in there.
</div>


In [38]:
subtitle_df = titles_df2[['Play', 'Subtitle']]
subtitle_df.shape

(2573, 2)

In [39]:
subtitle_df = subtitle_df.dropna()
subtitle_df.shape

(641, 2)

In [40]:
subtitle_df['Subtitle'] = subtitle_df['Subtitle'].str.lower()

In [41]:
colons = subtitle_df['Subtitle'].str.contains(':|;')
subtitle_df[colons]

Unnamed: 0,Play,Subtitle
9,alls well that ends well,"the indiaman's daughter; or, all's well that ends well. a tale of boston and our own times"
22,alls well that ends well,life force: the energetic constitution of man and the neuro-endocrine connection
142,as you like it,the seven ethical ages of western man. william albert levi (value inquiry book series; 27)
435,king lear,the romance continues: expressions of affection and desire
613,king lear,phase ii manual: a support group guidebook guidebook for battered women
619,king lear,"kingship in the german epic: alexanderlied, rolandslied, ""spielmannsepen"""
653,king lear,full circle: a legacy of metalwork
1093,merchant of venice,confession is good for your soul: receiving the mercy of a forgiving god
1143,midsummer nights dream,midsummer night's babelpoul anderson: a midsummer tempest
1439,richard iii,this sun of york: a biography of edward iv


In [42]:
dups2 = subtitle_df.Subtitle.duplicated()
subtitle_df[dups2]

Unnamed: 0,Play,Subtitle
141,as you like it,an autobiography
591,king lear,poems
592,king lear,a novel
609,king lear,an autobiography
711,macbeth,a collection of poems
1083,merchant of venice,an autobiography
1236,othello,a novel
1561,taming of the shrew,sealed with a kiss
1562,taming of the shrew,sealed with a loving kiss -- writing of love
1579,tempest,a novel


<div class="alert alert-block alert-info">
Enough splitting! The subtitle list is messy. I can always come back and dig for more examples if I need to, so for now I am going to hold off on incorporating the subtitle list.
</div>

In [43]:
datapath = '../data'

In [44]:
datapath_uniq_titles_df = os.path.join(datapath, 'uniq_titles.csv')
uniq_titles_df.to_csv(datapath_uniq_titles_df, index=False)

In [45]:
datapath_possible_matches = os.path.join(datapath, 'possible_matches.csv')
possible_matches.to_csv(datapath_possible_matches, index=False)

In [46]:
datapath_subtitle_df = os.path.join(datapath, 'subtitles.csv')
subtitle_df.to_csv(datapath_subtitle_df, index=False)
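
The `to_csv` calls above assume that `../data` already exists. A minimal sketch of creating the output directory first; a temporary directory stands in for `../data` so the example has no side effects, and the one-row dataframe is a placeholder.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Play': ['hamlet'], 'Title': ['mortal coil']})

# A temporary directory stands in for '../data'.
with tempfile.TemporaryDirectory() as tmp:
    datapath = os.path.join(tmp, 'data')
    os.makedirs(datapath, exist_ok=True)   # no error if it already exists
    out_file = os.path.join(datapath, 'uniq_titles.csv')
    df.to_csv(out_file, index=False)
    written = os.path.exists(out_file)

print(written)
```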