Lists of titles derived from the works of William Shakespeare are housed at **Titles from Shakespeare**: http://www.barbarapaul.com/shake.html

The website above links to 37 subpages, one for each of Shakespeare's works, and each subpage lists the title, author, and quoted passage. The exception is Hamlet, whose page is split into five subpages, one per act.

In [22]:
import pandas as pd
from bs4 import BeautifulSoup
import requests, re

pd.set_option('display.max_colwidth', None)

Scraping the 37 URLs from the main site:

In [23]:
url = 'http://www.barbarapaul.com/shake.html'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

In [24]:
links = {}

for link in soup.find_all('a', style="text-decoration:none;"):
    href = link.get('href')
    abb = href.replace('http://www.barbarapaul.com/shake/', '').replace('.html', '')
    links[abb] = href
    
assert len(links) == 37
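
The chained `replace` calls above assume that every `href` starts with the same absolute prefix. As a hedged alternative, the abbreviation can be read from the URL path itself. This is a minimal sketch; the helper name `abbreviation_from_url` is hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def abbreviation_from_url(href: str) -> str:
    """Return the page name without directory or extension (hypothetical helper)."""
    path = urlparse(href).path          # e.g. '/shake/allswell.html'
    return PurePosixPath(path).stem     # e.g. 'allswell'


example = abbreviation_from_url('http://www.barbarapaul.com/shake/allswell.html')
print(example)
```

Because only the path component is inspected, the same helper would also work if a link were written relative to the site root.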

In [25]:
# hamlet.html links to five subpages (one per act). Remove 'hamlet' and add the five subpages.

links.pop('hamlet')

links['hamlet1'] = 'http://www.barbarapaul.com/shake/hamlet1.html'
links['hamlet2'] = 'http://www.barbarapaul.com/shake/hamlet2.html'
links['hamlet3'] = 'http://www.barbarapaul.com/shake/hamlet3.html'
links['hamlet4'] = 'http://www.barbarapaul.com/shake/hamlet4.html'
links['hamlet5'] = 'http://www.barbarapaul.com/shake/hamlet5.html'
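
The five Hamlet entries follow a single pattern, so they can also be generated rather than typed out. A minimal sketch, assuming the act pages are numbered 1 through 5 as in the cell above; the small `links` dictionary here is a stand-in for the scraped one.

```python
base = 'http://www.barbarapaul.com/shake/'

# Stand-in for the scraped dictionary; only the 'hamlet' entry matters here.
links = {'hamlet': base + 'hamlet.html', 'lear': base + 'lear.html'}

# Drop the umbrella page (no error if it is already gone) ...
links.pop('hamlet', None)

# ... and add one entry per act, hamlet1 through hamlet5.
links.update({f'hamlet{act}': f'{base}hamlet{act}.html' for act in range(1, 6)})

print(sorted(links))
```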

In [26]:
print('Number of plays in dictionary:', len(links))
for key, value in links.items():
    print(key, ':', value)

Number of plays in dictionary: 41
allswell : http://www.barbarapaul.com/shake/allswell.html
antony : http://www.barbarapaul.com/shake/antony.html
ayli : http://www.barbarapaul.com/shake/ayli.html
errors : http://www.barbarapaul.com/shake/errors.html
corio : http://www.barbarapaul.com/shake/corio.html
cymb : http://www.barbarapaul.com/shake/cymb.html
1henryiv : http://www.barbarapaul.com/shake/1henryiv.html
2henryiv : http://www.barbarapaul.com/shake/2henryiv.html
henryv : http://www.barbarapaul.com/shake/henryv.html
1henryvi : http://www.barbarapaul.com/shake/1henryvi.html
2henryvi : http://www.barbarapaul.com/shake/2henryvi.html
3henryvi : http://www.barbarapaul.com/shake/3henryvi.html
julius : http://www.barbarapaul.com/shake/julius.html
john : http://www.barbarapaul.com/shake/john.html
lear : http://www.barbarapaul.com/shake/lear.html
labours : http://www.barbarapaul.com/shake/labours.html
macbeth : http://www.barbarapaul.com/shake/macbeth.html
m4m : http://www.barbarapaul.com/shake

Using the `links` dictionary and Beautiful Soup, I extracted the author and title of each work, along with the Shakespearean play it was derived from, and created a data frame for each page. I then combined the 41 data frames into one large data frame.

In [27]:
frames = []

for key, url in links.items():
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Each <li> holds one "Author: Title" entry; collapse runs of spaces and drop newlines
    shake_titles = []
    for item in soup.find_all('li'):
        title = re.sub(' +', ' ', item.text)
        shake_titles.append(title.replace('\n', ''))

    title_df = pd.DataFrame(shake_titles, columns=['Play'])

    # Split on the first ": " only, so colons inside a title are preserved
    new = title_df['Play'].str.split(": ", n=1, expand=True)
    title_df['Author'] = new[0]
    title_df['Title'] = new[1]
    title_df['Play'] = key  # label every row with the play abbreviation

    print(key, title_df.shape)
    frames.append(title_df)

# DataFrame.append was removed in pandas 2.0; concatenate the per-page frames instead
big_df = pd.concat(frames)

allswell (37, 3)
antony (18, 3)
ayli (129, 3)
errors (6, 3)
corio (6, 3)
cymb (3, 3)
1henryiv (18, 3)
2henryiv (14, 3)
henryv (34, 3)
1henryvi (3, 3)
2henryvi (13, 3)
3henryvi (17, 3)
julius (86, 3)
john (44, 3)
lear (250, 3)
labours (9, 3)
macbeth (289, 3)
m4m (11, 3)
merchant (123, 3)
wives (15, 3)
mnd (71, 3)
muchado (37, 3)
othello (90, 3)
pericles (2, 3)
richii (102, 3)
richiii (43, 3)
romeo (79, 3)
shrew (16, 3)
tempest (111, 3)
timon (12, 3)
titus (12, 3)
troilus (12, 3)
12n (86, 3)
2gents (25, 3)
winter (25, 3)
sonnets (66, 3)
hamlet1 (195, 3)
hamlet2 (105, 3)
hamlet3 (252, 3)
hamlet4 (54, 3)
hamlet5 (64, 3)
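
The loop above collapses runs of spaces and then deletes newlines. If a line break falls directly after the colon in the page source, deleting it leaves `Author:Title` with no space, which the `": "` split cannot separate. That is one possible cause of unsplit rows, not a confirmed one. A minimal sketch of a more tolerant normalisation follows; `normalise_item` is a hypothetical helper, and the raw string is made up to show the effect.

```python
import re


def normalise_item(text: str) -> str:
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


# A made-up raw string with a line break directly after the colon.
raw = 'Frederick May:\nThe Rest   Is Silence\n'

cleaned = normalise_item(raw)
print(cleaned)

# Deleting the newline outright glues the two halves together instead:
glued = re.sub(' +', ' ', raw).replace('\n', '')
print(glued)
```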


In [28]:
big_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2584 entries, 0 to 63
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Play    2584 non-null   object
 1   Author  2584 non-null   object
 2   Title   2577 non-null   object
dtypes: object(3)
memory usage: 80.8+ KB


In [31]:
big_df.sample(10)

Unnamed: 0,Play,Author,Title
9,m4m,Rebecca Pitts,Brief Authority: Fragments of One Woman's Testament
168,hamlet1,Fiona Sinclair,Most Unnatural Murder
79,lear,Norman S. Barrett,Monsters of the Deep
83,ayli,Maurice Druon,The Seven Ages of Paris
277,macbeth,M. C. Varley,A Charmed Life
12,hamlet1,Saul Landau,My Dad Was Not Hamlet
100,merchant,Eugene England,The Quality of Mercy: Personal Essays on Mormon Experience
28,allswell,Emma Erichson,"The Waif; or, The Web of Life"
34,hamlet5,Frederick May,The Rest Is Silence
35,lear,Doris G. Bargen,A Woman's Weapon: Spirit Possession in The Tale of Genji


In [40]:
# Renumber the rows 0..n-1 and discard the old per-page index
big_dfv2 = big_df.reset_index(drop=True)

In [41]:
no_title = big_dfv2['Title'].isna()
big_dfv2[no_title]

Unnamed: 0,Play,Author,Title
343,julius,"William Saffire, comp.:Lend Me Your Ears",
1086,merchant,The Gentler Virtues in Greek Literature,
1980,hamlet1,Power of Imagery for Personal Enrichment,
2074,hamlet1,"Violence, Horror & Sensationalism in Stories for Children",
2107,hamlet1,New Environments and Why They Happen,
2134,hamlet2,Sydney Sipho Sepamla:Children of the Earth,
2357,hamlet3,David Christie Murray & Henry Herman:One Traveller Returns,


Data cleaning notes:
1. The author:title for `big_dfv2.iloc[2357]` was not split properly. Correct this entry.
2. The author:title for `big_dfv2.iloc[2134]` was not split properly. Correct this entry.
3. `big_dfv2.loc[2107]` was improperly added as a separate list item. It is a partial, non-Shakespearean title: delete.
4. `big_dfv2.loc[2074]` was improperly added as a separate list item. It is a partial, non-Shakespearean title: delete.
5. `big_dfv2.loc[1980]` a) was improperly added as a separate list item and belongs to `big_dfv2.loc[1979]`, and b) duplicates the book at `big_dfv2.iloc[2000]`. Delete both entries (1979 and 1980).
6. `big_dfv2.loc[1086]` was improperly added as a separate list item. It is a partial, non-Shakespearean title: delete.
7. The author:title for `big_dfv2.iloc[343]` was not split properly. Correct this entry.
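
The unsplit rows in notes 1, 2, and 7 all have a colon with no space after it. A minimal sketch of a split that tolerates this, using only the standard library; `split_author_title` is a hypothetical helper, and it would still mis-split an author name or a title fragment that itself contains a colon.

```python
import re


def split_author_title(entry: str):
    """Split "Author: Title" on the first colon, with or without a following space.

    Hypothetical helper; returns (entry, None) when there is no colon at all.
    """
    parts = re.split(r':\s*', entry, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    return entry.strip(), None


print(split_author_title('Sydney Sipho Sepamla:Children of the Earth'))
print(split_author_title('Frederick May: The Rest Is Silence'))
print(split_author_title('New Environments and Why They Happen'))
```

Entries without any colon, such as the title fragments in notes 3 to 6, come back with `None` as the title, so they can still be found with `isna()` and reviewed by hand.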

In [48]:
# Assign with a single .loc[row, column] call; chained indexing such as
# df.iloc[i]['Title'] = ... may write to a temporary copy instead of the frame.
big_dfv2.loc[2357, 'Title'] = 'One Traveller Returns'
big_dfv2.loc[2357, 'Author'] = 'David Christie Murray & Henry Herman'

big_dfv2.loc[2134, 'Author'] = 'Sydney Sipho Sepamla'
big_dfv2.loc[2134, 'Title'] = 'Children of the Earth'

big_dfv2.loc[343, 'Author'] = 'William Saffire'
big_dfv2.loc[343, 'Title'] = 'Lend Me Your Ears'
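
Corrections like these are safest when the row and column are given in one indexing call, because a two-step lookup first produces an intermediate row object that may be a copy. A minimal sketch on a toy frame, assuming pandas is installed; the index labels here are arbitrary and do not match the notebook's row numbers.

```python
import pandas as pd

# Toy frame reproducing one unsplit row from the notebook.
df = pd.DataFrame(
    {
        'Play': ['hamlet2', 'hamlet5'],
        'Author': ['Sydney Sipho Sepamla:Children of the Earth', 'Frederick May'],
        'Title': [None, 'The Rest Is Silence'],
    }
)

# One indexing call with (row label, column labels) writes directly into df.
df.loc[0, ['Author', 'Title']] = ['Sydney Sipho Sepamla', 'Children of the Earth']

print(df)
```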

In [50]:
big_dfv2 = big_dfv2.drop(index=[2107, 2074, 1980, 1979, 1086])

In [52]:
big_dfv2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2579 entries, 0 to 2583
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Play    2579 non-null   object
 1   Author  2579 non-null   object
 2   Title   2579 non-null   object
dtypes: object(3)
memory usage: 80.6+ KB
