In [2]:
import pandas as pd
import collections
import fuzzywuzzy
from fuzzywuzzy import process
import pprint

Using data from Wikipedia, I've generate two dictionaries for the categorization of Shakespeare's plays. The first dictionary `play_cat_strict` uses the traditional categorization of plays into comedy, history, and tragedy. The second dictionary `play_cat_modern` has been reorganized to better reflect the tone of the play. Histories have been collapsed into the tragedy category and tragicomedies have been grouped as romances. 

In [3]:
play_cat_strict = {'Comedy' : ['The Tempest',  'The Two Gentlemen of Verona',  'The Merry Wives of Windsor',  'Measure for Measure',  'The Comedy of Errors',  'Much Ado About Nothing',  'Loves Labours Lost',  'A Midsummer Nights Dream',  'The Merchant of Venice',  'As You Like It',  'The Taming of the Shrew',  'Alls Well That Ends Well',  'Twelfth Night',  'The Winters Tale',  'Pericles, Prince of Tyre',  'The Two Noble Kinsmen'], 'History' : ['King John', 'Edward III',  'Richard II',  'Henry IV, Part 1',  'Henry IV, Part 2',  'Henry V',  'Henry VI, Part 1',  'Henry VI, Part 2',  'Henry VI, Part 3',  'Richard III',  'Henry VIII'], 'Tragedy' : ['Troilus and Cressida', 'Coriolanus', 'Titus Andronicus',  'Romeo and Juliet',  'Timon of Athens',  'Julius Caesar',  'Macbeth',  'Hamlet',  'King Lear',  'Othello',  'Antony and Cleopatra',  'Cymbeline']}
print('Number of plays per category in the strict dataset')
for cat, plays in play_cat_strict.items():
    print(cat, ':', len(plays))

play_cat_modern = {'Comedy' : ['The Two Gentlemen of Verona',  'The Merry Wives of Windsor',  'Measure for Measure',  'The Comedy of Errors',  'Much Ado About Nothing',  'Loves Labours Lost',  'A Midsummer Nights Dream',  'The Merchant of Venice',  'As You Like It',  'The Taming of the Shrew',  'Alls Well That Ends Well',  'Twelfth Night'], 'Romance' : ['Pericles, Prince of Tyre',  'Cymbeline',  'The Winters Tale',  'The Tempest',  'The Two Noble Kinsmen'],  'Tragedy' : ['Troilus and Cressida', 'Coriolanus', 'Titus Andronicus',  'Romeo and Juliet',  'Timon of Athens',  'Julius Caesar',  'Macbeth',  'Hamlet',  'King Lear',  'Othello',  'Antony and Cleopatra', 'King John', 'Edward III',  'Richard II',  'Henry IV, Part 1',  'Henry IV, Part 2',  'Henry V',  'Henry VI, Part 1',  'Henry VI, Part 2',  'Henry VI, Part 3',  'Richard III',  'Henry VIII']}
print('Number of plays per category in the modern dataset')
for cat, plays in play_cat_modern.items():
    print(cat, ':', len(plays))

Number of plays per category in the strict dataset
Comedy : 16
History : 11
Tragedy : 12
Number of plays per category in the modern dataset
Comedy : 12
Romance : 5
Tragedy : 22


I am grouping like this to keep my options open during sentiment analysis. There may be discernable differences in tone between comedies and tragedies (light/dark, happy/sad, fun/serious). Simimaly, the histories category may be more neutral (strict dataset) or appropriately binned as a tragedy (modern). Finally, the romance plays are are different from the rest in that they are a mixture of comedy and tradegy, and the comingling may muddy sentiment analysis of the strict dataset.

In [4]:
wiki_plays = collections.OrderedDict()

for cat, plays in play_cat_strict.items():
    for i in range(len(plays)):
        wiki_plays[plays[i]] = []
        wiki_plays[plays[i]].append(cat)

for cat, plays in play_cat_modern.items():
    for i in range(len(plays)):
        wiki_plays[plays[i]].append(cat)

In [5]:
print(wiki_plays['The Tempest'])

['Comedy', 'Romance']


In [6]:
wiki_df = pd.DataFrame.from_dict(wiki_plays, orient='index', columns=['Strict', 'Modern'])
wiki_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39 entries, The Tempest to Cymbeline
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Strict  39 non-null     object
 1   Modern  39 non-null     object
dtypes: object(2)
memory usage: 936.0+ bytes


In [7]:
wiki_df = wiki_df.reset_index()
wiki_df = wiki_df.rename(columns={'index':'Play'})
wiki_df.sample(5)

Unnamed: 0,Play,Strict,Modern
6,Loves Labours Lost,Comedy,Comedy
10,The Taming of the Shrew,Comedy,Comedy
26,Henry VIII,History,Tragedy
19,"Henry IV, Part 1",History,Tragedy
5,Much Ado About Nothing,Comedy,Comedy


Shalespeare's plays have been generously textised and are as a csv on Kaggle: https://www.kaggle.com/kingburrito666/shakespeare-plays

In [8]:
shakespeare_plays = pd.read_csv('../data/Shakespeare_data.csv')

In [9]:
shakespeare_plays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111396 entries, 0 to 111395
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Dataline          111396 non-null  int64  
 1   Play              111396 non-null  object 
 2   PlayerLinenumber  111393 non-null  float64
 3   ActSceneLine      105153 non-null  object 
 4   Player            111389 non-null  object 
 5   PlayerLine        111396 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 5.1+ MB


Compare play names in `wiki_plays` with `shakespeare_plays`

In [22]:
shakespeare_plays.sample(5)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
35552,35553,Hamlet,45.0,4.5.151,KING CLAUDIUS,If you desire to know the certainty
5205,5206,"Henry VI, Part 1",2.0,4.2.20,General,And strong enough to issue out and fight:
54571,54572,Loves Labours Lost,93.0,4.3.361,BIRON,"They are the books, the arts, the academes,"
28440,28441,Coriolanus,38.0,5.6.150,CORIOLANUS,"With six Aufidiuses, or more, his tribe,"
11624,11625,"Henry VI, Part 3",42.0,4.1.124,CLARENCE,"You that love me and Warwick, follow me."


In [10]:
shake_uniq = shakespeare_plays.Play.unique()
wiki_uniq = wiki_df.Play.unique()

In [11]:
print('Number of plays in kaggle file:', len(shake_uniq))
print('Number of plays in wiki categeory files:', len(wiki_uniq))

Number of plays in kaggle file: 36
Number of plays in wiki categeory files: 39


In [12]:
inconsistent = set(wiki_df.Play).difference(shakespeare_plays.Play)
print(len(inconsistent), inconsistent)

20 {'The Winters Tale', 'The Two Noble Kinsmen', 'Henry IV, Part 2', 'Henry IV, Part 1', 'A Midsummer Nights Dream', 'Edward III', 'Henry VI, Part 2', 'The Merry Wives of Windsor', 'Pericles, Prince of Tyre', 'The Two Gentlemen of Verona', 'Alls Well That Ends Well', 'Much Ado About Nothing', 'As You Like It', 'Henry VI, Part 3', 'Macbeth', 'The Comedy of Errors', 'The Taming of the Shrew', 'The Merchant of Venice', 'Henry VI, Part 1', 'Measure for Measure'}


Three of Shakespeare's plays are not prepresented in the kaggle dataset, and there is a namining discrepency between 20 plays. Investigating further...

In [13]:
shake_uniq.sort()
shake_uniq

array(['A Comedy of Errors', 'A Midsummer nights dream', 'A Winters Tale',
       'Alls well that ends well', 'Antony and Cleopatra',
       'As you like it', 'Coriolanus', 'Cymbeline', 'Hamlet', 'Henry IV',
       'Henry V', 'Henry VI Part 1', 'Henry VI Part 2', 'Henry VI Part 3',
       'Henry VIII', 'Julius Caesar', 'King John', 'King Lear',
       'Loves Labours Lost', 'Measure for measure', 'Merchant of Venice',
       'Merry Wives of Windsor', 'Much Ado about nothing', 'Othello',
       'Pericles', 'Richard II', 'Richard III', 'Romeo and Juliet',
       'Taming of the Shrew', 'The Tempest', 'Timon of Athens',
       'Titus Andronicus', 'Troilus and Cressida', 'Twelfth Night',
       'Two Gentlemen of Verona', 'macbeth'], dtype=object)

In [14]:
wiki_uniq.sort()
wiki_uniq

array(['A Midsummer Nights Dream', 'Alls Well That Ends Well',
       'Antony and Cleopatra', 'As You Like It', 'Coriolanus',
       'Cymbeline', 'Edward III', 'Hamlet', 'Henry IV, Part 1',
       'Henry IV, Part 2', 'Henry V', 'Henry VI, Part 1',
       'Henry VI, Part 2', 'Henry VI, Part 3', 'Henry VIII',
       'Julius Caesar', 'King John', 'King Lear', 'Loves Labours Lost',
       'Macbeth', 'Measure for Measure', 'Much Ado About Nothing',
       'Othello', 'Pericles, Prince of Tyre', 'Richard II', 'Richard III',
       'Romeo and Juliet', 'The Comedy of Errors',
       'The Merchant of Venice', 'The Merry Wives of Windsor',
       'The Taming of the Shrew', 'The Tempest',
       'The Two Gentlemen of Verona', 'The Two Noble Kinsmen',
       'The Winters Tale', 'Timon of Athens', 'Titus Andronicus',
       'Troilus and Cressida', 'Twelfth Night'], dtype=object)

In [15]:
fuzzy_scores = {}

for proper_name in inconsistent:
    fuzzy_scores[proper_name] = fuzzywuzzy.process.extract(proper_name, shake_uniq, limit=3, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

In [16]:
pprint.pprint(fuzzy_scores)

{'A Midsummer Nights Dream': [('A Midsummer nights dream', 100),
                              ('Much Ado about nothing', 43),
                              ('A Winters Tale', 42)],
 'Alls Well That Ends Well': [('Alls well that ends well', 100),
                              ('As you like it', 37),
                              ('Henry VI Part 1', 36)],
 'As You Like It': [('As you like it', 100),
                    ('Romeo and Juliet', 47),
                    ('Julius Caesar', 44)],
 'Edward III': [('Henry VIII', 60), ('Henry IV', 44), ('Julius Caesar', 43)],
 'Henry IV, Part 1': [('Henry VI Part 1', 80),
                      ('Henry VI Part 2', 73),
                      ('Henry VI Part 3', 73)],
 'Henry IV, Part 2': [('Henry VI Part 2', 80),
                      ('Henry VI Part 1', 73),
                      ('Henry VI Part 3', 73)],
 'Henry VI, Part 1': [('Henry VI Part 1', 100),
                      ('Henry VI Part 2', 93),
                      ('Henry VI Part 3', 93)],
 'H

'Henry IV, Part 2', 'The Two Noble Kinsmen', and 'Edward III' are completely missing from the `shakespeare_plays` dataset. The kaggle discussion seems to indicate that Part 1 and Part 2 have been collapesed into a single play. It looks like articles ("The" and "A") have been stripped from some of the plays in `shakespeare_plays`; however, there are also several instances of an incorrect article. There is also inconsistent capitalization in that dataset as well. I can use fuzzywuzzy scores to safely replace all the 100 score matches. The next closest score of 93 with start changing around the Henry IV and Henry the VI datasets.

In [18]:
def replace_col_matches(df, column, string_to_match, min_ratio = 100):
    """This function gets the top 10 closest matches to our input string a list of unique strings and 
    replaces all rows with matches with a fuzzy ratio > 90 matches with the input matches
    """
    strings = df[column].unique()
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=1, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
    rows_with_matches = df[column].isin(close_matches)
    df.loc[rows_with_matches, column] = string_to_match

In [19]:
remakespeare_plays = shakespeare_plays
type(remakespeare_plays)

pandas.core.frame.DataFrame

In [20]:
for proper_name in inconsistent:
    replace_col_matches(remakespeare_plays, 'Play', proper_name)

In [47]:
print(set(wiki_df.Play).difference(remakespeare_plays.Play))


{'The Winters Tale', 'The Two Noble Kinsmen', 'Henry IV, Part 2', 'Henry IV, Part 1', 'Edward III', 'The Merry Wives of Windsor', 'Pericles, Prince of Tyre', 'The Two Gentlemen of Verona', 'The Comedy of Errors', 'The Taming of the Shrew', 'The Merchant of Venice'}


In [48]:
print(set(remakespeare_plays.Play).difference(wiki_df.Play))


{'Henry IV', 'Taming of the Shrew', 'A Winters Tale', 'Pericles', 'A Comedy of Errors', 'Merchant of Venice', 'Merry Wives of Windsor', 'Two Gentlemen of Verona'}


As much as it pains me, I think it is prudent to remove the articles 'The' and 'A', as this seems to be an area of frequent error and/or omission

In [50]:
renaming = {'A Winters Tale':'Winters Tale', 'Pericles': 'Pericles, Prince of Tyre', 'A Comedy of Errors': 'Comedy of Errors', 'A Midsummer Nights Dream': 'Midsummer Nights Dream', 'The Tempest':'Tempest' }
remakespeare_plays.Play = remakespeare_plays.Play.replace(renaming)
remakespeare_plays.Play.unique()

array(['Henry IV', 'Henry VI, Part 1', 'Henry VI, Part 2',
       'Henry VI, Part 3', 'Alls Well That Ends Well', 'As You Like It',
       'Antony and Cleopatra', 'Comedy of Errors', 'Coriolanus',
       'Cymbeline', 'Hamlet', 'Henry V', 'Henry VIII', 'King John',
       'Julius Caesar', 'King Lear', 'Loves Labours Lost', 'Macbeth',
       'Measure for Measure', 'Merchant of Venice',
       'Merry Wives of Windsor', 'Midsummer Nights Dream',
       'Much Ado About Nothing', 'Othello', 'Pericles, Prince of Tyre',
       'Richard II', 'Richard III', 'Romeo and Juliet',
       'Taming of the Shrew', 'Tempest', 'Timon of Athens',
       'Titus Andronicus', 'Troilus and Cressida', 'Twelfth Night',
       'Two Gentlemen of Verona', 'Winters Tale'], dtype=object)

In [51]:
shake_df = pd.merge(remakespeare_plays, wiki_df,how='left', on='Play')
shake_df.sample(5)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Strict,Modern
74037,74038,Othello,126.0,3.3.391,OTHELLO,"Farewell the plumed troop, and the big wars,",Tragedy,Tragedy
95032,95033,Timon of Athens,26.0,,FLAVIUS,Exit,Tragedy,Tragedy
98465,98466,Titus Andronicus,1.0,4.4.15,SATURNINUS,"This to Apollo, this to the god of war,",Tragedy,Tragedy
49998,49999,King Lear,2.0,1.5.6,KENT,your letter.,Tragedy,Tragedy
8544,8545,"Henry VI, Part 2",39.0,4.2.65,CADE,"all shall eat and drink on my score, and I will",History,Tragedy


## Data Cleaning in Progress...