This analysis checks whether all the features are in fact as strongly correlated as Iarkho claims. We use the only data for individual plays that Iarkho provides in his article: those for Shakespeare, Kleist, and Friedrich Schiller's romantic plays.

In [1]:
import pandas as pd
import numpy as np

In [2]:
def sigma_iarkho(variants, weights):
    """
    Calculate the standard range following Iarkho's procedure.
    Parameters:
        variants - a list of distinct variants in ascending order, e.g. [1, 2, 3, 4, 5]
        weights - a list of weights corresponding to these variants, e.g. [20, 32, 18, 9, 1]
    Returns:
        sigma - standard range per Iarkho
    """
    # weighted mean of the variants
    weighted_mean_variants = np.average(variants, weights=weights)
    # squared deviation of each variant from the weighted mean
    differences_squared = [(variant - weighted_mean_variants)**2 for variant in variants]
    # weighted mean of the squared deviations, then its square root
    weighted_mean_difference = np.average(differences_squared, weights=weights)
    sigma = weighted_mean_difference**0.5

    return sigma
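As a sanity check on the function above, the following minimal sketch compares it with NumPy's population standard deviation computed on the expanded observations. It assumes the weights are integer counts; the example values are the ones from the docstring, and the names `expanded`, `sigma_weighted`, and `sigma_expanded` are introduced here for illustration only.

```python
import numpy as np

def sigma_iarkho(variants, weights):
    # same computation as in the cell above: weighted standard deviation
    weighted_mean = np.average(variants, weights=weights)
    squared_diffs = [(v - weighted_mean) ** 2 for v in variants]
    return np.average(squared_diffs, weights=weights) ** 0.5

variants = [1, 2, 3, 4, 5]
weights = [20, 32, 18, 9, 1]  # example counts from the docstring

# Repeat each variant as many times as its count, then take the
# population standard deviation (ddof=0 is NumPy's default).
expanded = np.repeat(variants, weights)
sigma_weighted = sigma_iarkho(variants, weights)
sigma_expanded = expanded.std()

print(sigma_weighted, sigma_expanded)
```

If the two printed values agree, the procedure amounts to a population (not sample) standard deviation of the distribution described by the counts.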

## Case 1: Shakespeare's Tragedies

In [3]:
plays = ['Troilus and Cressida', 'Coriolanus', 'Titus Andronicus', 'Romeo and Juliet', 'Timon of Athens',
       'Julius Caesar', 'Macbeth', 'Hamlet', 'King Lear', 'Othello']

shakespeare = pd.DataFrame(plays, columns=['title'])
shakespeare = pd.concat([shakespeare, pd.DataFrame(np.zeros((10, 10)))], axis=1)
columns= ['title'] + [i for i in range(1, 11)]
shakespeare.columns = columns

In [4]:
# this data comes from Iarkho's article table VIII (page 571)
shakespeare.iloc[0, 1:] = [34, 45, 23, 13, 5, 0, 0, 1, 0, 1]
shakespeare.iloc[1, 1:] = [12, 38, 30, 10, 7, 2, 2, 1, 0, 0]
shakespeare.iloc[2, 1:] = [16, 19, 21, 9, 2, 2, 0, 0, 0, 0]
shakespeare.iloc[3, 1:] = [22, 54, 18, 7, 3, 0, 0, 0, 0, 0]
shakespeare.iloc[4, 1:] = [15, 31, 19, 11, 3, 2, 2, 1, 0, 0]
shakespeare.iloc[5, 1:] = [14, 29, 16, 10, 5, 1, 2, 0, 0, 0]
shakespeare.iloc[6, 1:] = [21, 40, 18, 9, 3, 1, 1, 0, 0, 0]
shakespeare.iloc[7, 1:] = [21, 51, 25, 4, 2, 3, 0, 0, 0, 0]
shakespeare.iloc[8, 1:] = [13, 48, 26, 12, 5, 1, 0, 0, 0, 0]
shakespeare.iloc[9, 1:] = [17, 43, 24, 11, 4, 0, 0, 0, 0, 0]

In [5]:
sigmas = []
for num in range(shakespeare.shape[0]):
    sigma = sigma_iarkho(shakespeare.columns[1:11].tolist(), shakespeare.iloc[num, 1:11].tolist())
    sigmas.append(sigma)

In [6]:
# this data comes from DraCor
shakespeare['num_dramatic_characters'] = [34, 67, 27, 38, 68, 51, 45, 38, 33, 28]

In [7]:
shakespeare['sigma'] = sigmas

In [8]:
# this data also comes from Iarkho's article, the same table
shakespeare['mobility_coefficient'] = [122, 102, 69, 104, 83, 78, 93, 106, 105, 99]

In [9]:
shakespeare['perc_non_duologues'] = round((1 - (shakespeare[2] / shakespeare['mobility_coefficient'])) * 100, 3)

In [10]:
shakespeare

Unnamed: 0,title,1,2,3,4,5,6,7,8,9,10,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
0,Troilus and Cressida,34.0,45.0,23.0,13.0,5.0,0.0,0.0,1.0,0.0,1.0,34,1.397102,122,63.115
1,Coriolanus,12.0,38.0,30.0,10.0,7.0,2.0,2.0,1.0,0.0,0.0,67,1.384298,102,62.745
2,Titus Andronicus,16.0,19.0,21.0,9.0,2.0,2.0,0.0,0.0,0.0,0.0,27,1.222728,69,72.464
3,Romeo and Juliet,22.0,54.0,18.0,7.0,3.0,0.0,0.0,0.0,0.0,0.0,38,0.938128,104,48.077
4,Timon of Athens,15.0,31.0,19.0,11.0,3.0,2.0,2.0,1.0,0.0,0.0,68,1.470178,83,62.651
5,Julius Caesar,14.0,29.0,16.0,10.0,5.0,1.0,2.0,0.0,0.0,0.0,51,1.382736,78,62.821
6,Macbeth,21.0,40.0,18.0,9.0,3.0,1.0,1.0,0.0,0.0,0.0,45,1.18816,93,56.989
7,Hamlet,21.0,51.0,25.0,4.0,2.0,3.0,0.0,0.0,0.0,0.0,38,1.061813,106,51.887
8,King Lear,13.0,48.0,26.0,12.0,5.0,1.0,0.0,0.0,0.0,0.0,33,1.060698,105,54.286
9,Othello,17.0,43.0,24.0,11.0,4.0,0.0,0.0,0.0,0.0,0.0,28,1.025041,99,56.566


In [11]:
shakespeare.iloc[:, 11:].corr()

Unnamed: 0,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
num_dramatic_characters,1.0,0.660024,-0.171564,0.13811
sigma,0.660024,1.0,-0.24364,0.71269
mobility_coefficient,-0.171564,-0.24364,1.0,-0.560857
perc_non_duologues,0.13811,0.71269,-0.560857,1.0


### Summary:
We do not observe positive correlations of 0.9 or higher between the four features. Some features are positively correlated, e.g., standard range (sigma) and number of dramatic characters (0.66), and percentage of non-duologues and sigma (0.71). Others are negatively correlated, e.g., mobility coefficient and percentage of non-duologues (-0.56), mobility coefficient and number of dramatic characters (-0.17), and mobility coefficient and sigma (-0.24).
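The check against the 0.9 threshold can also be done programmatically rather than by reading the matrix. The sketch below is only an illustration: it rebuilds the four Shakespeare feature columns from the rounded values printed in the table above, and `pairs_at_or_above` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Feature values copied from the printed Shakespeare table (rounded as shown).
features = pd.DataFrame({
    "num_dramatic_characters": [34, 67, 27, 38, 68, 51, 45, 38, 33, 28],
    "sigma": [1.397102, 1.384298, 1.222728, 0.938128, 1.470178,
              1.382736, 1.18816, 1.061813, 1.060698, 1.025041],
    "mobility_coefficient": [122, 102, 69, 104, 83, 78, 93, 106, 105, 99],
    "perc_non_duologues": [63.115, 62.745, 72.464, 48.077, 62.651,
                           62.821, 56.989, 51.887, 54.286, 56.566],
})

def pairs_at_or_above(corr, threshold):
    """Hypothetical helper: list each distinct feature pair whose
    correlation is at or above the threshold."""
    cols = list(corr.columns)
    return [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] >= threshold
    ]

corr = features.corr()  # Pearson correlation, the pandas default
strong_pairs = pairs_at_or_above(corr, 0.9)
print(strong_pairs)
```

An empty list means that no pair of features reaches the threshold; the same helper could be applied to the other authors' tables.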

## Case 2: Kleist

In [12]:
plays = ['Die Familie Schroffenstein', 'Das Käthchen von Heilbronn oder die Feuerprobe',
        'Die Hermannsschlacht', 'Prinz Friedrich von Homburg']

kleist = pd.DataFrame(plays, columns=['title'])
kleist= pd.concat([kleist, pd.DataFrame(np.zeros((4, 11)))], axis=1)
columns= ['title'] + [i for i in range(1, 12)]
kleist.columns = columns

In [13]:
# this data comes from Iarkho's article table XIII (page 575)
kleist.iloc[0, 1:] = [13, 40, 17, 2, 2, 2, 0, 0, 0, 0, 0]
kleist.iloc[1, 1:] = [6, 25, 16, 8, 7, 1, 5, 1, 0, 0, 0]
kleist.iloc[2, 1:] = [10, 20, 12, 9, 5, 4, 6, 1, 0, 1, 0]
kleist.iloc[3, 1:] = [6, 15, 11, 6, 1, 5, 1, 2, 1, 0, 0]

In [14]:
sigmas = []
for num in range(kleist.shape[0]):
    sigma = sigma_iarkho(kleist.columns[1:].tolist(), kleist.iloc[num, 1:].tolist())
    sigmas.append(sigma)

In [15]:
# this data comes from DraCor
kleist['num_dramatic_characters'] = [29, 61, 84, 43]

In [16]:
kleist['sigma'] = sigmas

In [17]:
# this data also comes from Iarkho's article, the same table
kleist['mobility_coefficient'] = [76, 70, 68, 48]

In [18]:
kleist['perc_non_duologues'] = round((1 - (kleist[2] / kleist['mobility_coefficient'])) * 100, 3)

In [19]:
kleist

Unnamed: 0,title,1,2,3,4,5,6,7,8,9,10,11,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
0,Die Familie Schroffenstein,13.0,40.0,17.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,29,1.036388,76,47.368
1,Das Käthchen von Heilbronn oder die Feuerprobe,6.0,25.0,16.0,8.0,7.0,1.0,5.0,1.0,0.0,0.0,0.0,61,1.687768,70,64.286
2,Die Hermannsschlacht,10.0,20.0,12.0,9.0,5.0,4.0,6.0,1.0,0.0,1.0,0.0,84,2.044647,68,70.588
3,Prinz Friedrich von Homburg,6.0,15.0,11.0,6.0,1.0,5.0,1.0,2.0,1.0,0.0,0.0,43,1.993043,48,68.75


In [20]:
kleist.iloc[:, 12:].corr()

Unnamed: 0,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
num_dramatic_characters,1.0,0.714814,0.042125,0.74755
sigma,0.714814,1.0,-0.66797,0.994588
mobility_coefficient,0.042125,-0.66797,1.0,-0.621712
perc_non_duologues,0.74755,0.994588,-0.621712,1.0


### Summary:

The only two features with an extremely high positive correlation are **percentage of non-duologues** and **sigma** (0.995). The other correlations range from -0.67 to 0.75.

## Case 3: Schiller's Romantic plays

In [21]:
plays = ['Die Räuber', 'Die Verschwörung des Fiesco zu Genua',
        'Die Piccolomini', 'Wallensteins Tod',
        'Die Jungfrau von Orleans', 'Wilhelm Tell']

schiller = pd.DataFrame(plays, columns=['title'])
schiller = pd.concat([schiller, pd.DataFrame(np.zeros((6, 17)))], axis=1)


columns= ['title'] + [i for i in range(1, 18)]
schiller.columns = columns

In [22]:
# this data comes from Iarkho's article table XXII (page 583)
schiller.iloc[0, 1:] = [23, 26, 18, 2, 5, 3, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
schiller.iloc[1, 1:] = [17, 37, 18, 9, 5, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
schiller.iloc[2, 1:] = [2, 17, 9, 3, 3, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
schiller.iloc[3, 1:] = [17, 38, 23, 14, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
schiller.iloc[4, 1:] = [12, 27, 16, 15, 6, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
schiller.iloc[5, 1:] = [8, 17, 14, 11, 6, 7, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

In [23]:
sigmas = []
for num in range(schiller.shape[0]):
    sigma = sigma_iarkho(schiller.columns[1:].tolist(), schiller.iloc[num, 1:].tolist())
    sigmas.append(sigma)

In [24]:
# this data comes from DraCor
schiller['num_dramatic_characters'] = [26, 36, 28, 33, 47, 67]

In [25]:
schiller['sigma'] = sigmas

In [26]:
# this data also comes from Iarkho's article, the same table
schiller['mobility_coefficient'] = [73, 90, 38, 97, 80, 67]

In [27]:
schiller['perc_non_duologues'] = round((1 - (schiller[2] / schiller['mobility_coefficient'])) * 100, 3)

In [28]:
schiller

Unnamed: 0,title,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
0,Die Räuber,23.0,26.0,18.0,2.0,5.0,3.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,26,1.702204,73,64.384
1,Die Verschwörung des Fiesco zu Genua,17.0,37.0,18.0,9.0,5.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,36,1.366079,90,58.889
2,Die Piccolomini,2.0,17.0,9.0,3.0,3.0,1.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,28,1.595744,38,55.263
3,Wallensteins Tod,17.0,38.0,23.0,14.0,3.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,33,1.149565,97,60.825
4,Die Jungfrau von Orleans,12.0,27.0,16.0,15.0,6.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,47,1.426315,80,66.25
5,Wilhelm Tell,8.0,17.0,14.0,11.0,6.0,7.0,2.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,67,2.532392,67,74.627


In [29]:
schiller.iloc[:, 18:].corr()

Unnamed: 0,num_dramatic_characters,sigma,mobility_coefficient,perc_non_duologues
num_dramatic_characters,1.0,0.721848,0.046561,0.852803
sigma,0.721848,1.0,-0.431327,0.756643
mobility_coefficient,0.046561,-0.431327,1.0,0.140871
perc_non_duologues,0.852803,0.756643,0.140871,1.0


### Summary:
In Schiller's tragedies, some features have strong positive correlations, e.g., **number of dramatic characters** and **sigma** (0.72), and **number of dramatic characters** and **percentage of non-duologues** (0.85). No pair of features has a positive correlation of 0.9 or higher. There are also features with a negative correlation, e.g., mobility coefficient and sigma (-0.43), or with weak correlations (mobility coefficient and percentage of non-duologues, 0.14).


## Conclusion:
We have examined the five-act tragedies written by Shakespeare, Kleist, and Schiller. We took the data from Iarkho, calculated sigmas, and took the number of dramatic characters from DraCor.

Across all examined cases, we found a very high positive correlation (0.995) only between two features, **percentage of non-duologues** and **sigma**, and only in the tragedies of Kleist. In Shakespeare's and Schiller's tragedies, these features had a strong positive correlation, but not above 0.9 (0.71 in Shakespeare and 0.76 in Schiller).

Other features were consistently negatively correlated, e.g., **mobility coefficient** and **sigma**:
- Shakespeare: -0.24
- Kleist: -0.67
- Schiller: -0.43

We conclude that reducing the four features to one dimension by taking their mean, as Iarkho suggests, would be a mistake.

One guess is that Iarkho arrived at extremely high positive correlations by calculating correlations on aggregated feature values, e.g., by author, and may have run into Simpson's paradox, which was not known at the time.
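To illustrate how aggregation could produce such an effect, here is a minimal sketch on purely synthetic data. The authors "A", "B", "C" and the features `x` and `y` are made up for the illustration and have nothing to do with Iarkho's tables; the sketch shows the mechanism only and is not evidence about what Iarkho actually did.

```python
import numpy as np
import pandas as pd

# Synthetic data: within each author, x and y move in opposite directions,
# while the author-level centers of x and y rise together.
offsets = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
frames = []
for author, center in [("A", 0.0), ("B", 10.0), ("C", 20.0)]:
    frames.append(pd.DataFrame({
        "author": author,
        "x": center + offsets,
        "y": center - offsets,
    }))
toy = pd.concat(frames, ignore_index=True)

# Correlation inside each author: perfectly negative by construction.
within = {name: group["x"].corr(group["y"])
          for name, group in toy.groupby("author")}

# Correlation of the per-author means: perfectly positive by construction.
means = toy.groupby("author")[["x", "y"]].mean()
between = means["x"].corr(means["y"])

# Correlation over all plays pooled together, ignoring the author.
pooled = toy["x"].corr(toy["y"])

print(within, between, pooled)
```

In this toy setup the play-level relationship within every author is negative, yet both the aggregated means and the pooled data suggest a strong positive relationship.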