## Exploration of Individual Russian Comedies and Authors

In this analysis, we will explore the Russian comedies that represent the minimum and the maximum of each feature, as well as the comedies that are the closest to the mean. Additionally, we will analyze the speech distribution of each playwright. Finally, we will generate open-form scores for each playwright, which will help us determine how experimental each was in the history of the Russian four- and five-act comedy in verse.

To account for the different number of acts (4 vs. 5), we will multiply the number of dramatic characters and the mobility coefficient of the four-act comedies by 5/4 and round the result to the nearest integer.
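
The 5/4 rescaling described above can be sketched on a toy table. The two plays and their values below are hypothetical and are not taken from the corpus; only the column names match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical four-act plays; the numbers are illustrative, not corpus values.
four_act_example = pd.DataFrame(
    {
        'num_present_characters': [12, 13],
        'mobility_coefficient': [40, 33],
    }
)

# Rescale to a five-act equivalent and round to the nearest integer.
scaled_example = (four_act_example * 5 / 4).round(0)
print(scaled_example)
```

Five-act comedies are left untouched, so after this step both groups should be on a comparable five-act scale.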

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
from os import listdir

In [2]:
def summary_features(df, feature):
    print('Mean, standard deviation, median, min and max values for the period:')
    display(pd.DataFrame(df[feature].describe()[['mean', 'std', '50%','min', 'max']]).round(2))
    print('Period Max:')
    display(pd.DataFrame(df[df[feature] == df[feature].max()][['last_name', 
                                                               'first_name', 
                                                               'title', 
                                                               'creation_date', feature]]).round(2))
    print('Period Min:')
    display(pd.DataFrame(df[df[feature] == df[feature].min()][['last_name', 
                                                               'first_name', 
                                                               'title', 
                                                               'creation_date', 
                                                               feature]]).round(2))
    print('The closest to the mean:')
    df_copy = df.copy()
    df_copy['diff_with_mean'] = df_copy[feature].apply(lambda x: np.absolute(x - df_copy[feature].mean()))
    display(pd.DataFrame(df_copy[df_copy['diff_with_mean'] == df_copy['diff_with_mean'].min()][['last_name', 
                                                                                   'first_name', 
                                                                                   'title', 
                                                                                   'creation_date', 
                                                                                   feature]]).round(2))

In [3]:
def coefficient_unused_dramatic_characters(data):
    total_present = 0
    total_non_speakers = 0
    for act in data['play_summary'].keys():
        for scene in data['play_summary'][act].keys():
            # identify the raw number of non-speaking dramatic characters
            num_non_speakers = len([item for item in data['play_summary'][act][scene].items() 
                                if (item[1] == 0  or item[1] == 'non_speaking') and item[0] not in ['num_utterances',
                                                                   'num_speakers',
                                                                   'perc_non_speakers']])
            total_non_speakers += num_non_speakers
            # calculate the total number of dramatic characters
            total_present += (data['play_summary'][act][scene]['num_speakers'] + num_non_speakers)
    coefficient_unused = (total_non_speakers / total_present ) * 100        
    
    return coefficient_unused
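
To make the calculation above concrete, here is a minimal sketch on a toy play. The dictionary layout (character names mapped to utterance counts or `'non_speaking'`, plus three per-scene summary keys) is an assumption inferred from how the function reads its input; the character names and counts are hypothetical.

```python
# Hypothetical play summary; the layout is inferred from how the function reads it.
toy_play = {
    'play_summary': {
        'act_1': {
            'scene_1': {'ANNA': 3, 'BORIS': 0,
                        'num_utterances': 3, 'num_speakers': 1, 'perc_non_speakers': 50.0},
            'scene_2': {'ANNA': 2, 'BORIS': 4, 'SERVANT': 'non_speaking',
                        'num_utterances': 6, 'num_speakers': 2, 'perc_non_speakers': 33.33},
        }
    }
}

summary_keys = {'num_utterances', 'num_speakers', 'perc_non_speakers'}
total_present = 0
total_non_speakers = 0
for act in toy_play['play_summary'].values():
    for scene in act.values():
        # characters who are on stage but have no utterances in this scene
        non_speakers = sum(
            1 for name, count in scene.items()
            if name not in summary_keys and count in (0, 'non_speaking')
        )
        total_non_speakers += non_speakers
        total_present += scene['num_speakers'] + non_speakers

# share of silent character appearances among all appearances, in percent
toy_coefficient = total_non_speakers / total_present * 100
print(round(toy_coefficient, 2))
```

In this toy case two of the five character appearances are silent, so the coefficient is 40%.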

In [4]:
def get_data(input_directory):
    all_files = [f for f in listdir(input_directory) if f.count('.json') > 0]
    dfs = []
    for file in all_files:
        with open(input_directory + '/' + file) as json_file:
            data = json.load(json_file)
            not_used = coefficient_unused_dramatic_characters(data)
            df = pd.DataFrame([not_used], columns=['coefficient_unused'], index=[file.replace('.json','')])
            dfs.append(df)
            
    features_df = pd.concat(dfs, axis=0, sort=False).round(2)
    
    return features_df

In [5]:
def make_list(row):
    speech_dist = []
    for value in row[1:-1].split('\n '):
        speech_dist.append([int(num) for num in re.findall('[0-9]+', value)])
        
    return speech_dist

In [6]:
def speech_distribution_by_period(period_df):
    all_distributions = []
    for row in period_df['speech_distribution']:
        speech_dist_df = pd.DataFrame(row).T
        # rename columns to make sure they start with 1 and not 0
        speech_dist_df.columns = speech_dist_df.iloc[0, :]
        # no need to include the variants as a row - they will be column names
        only_counts_df = pd.DataFrame(speech_dist_df.iloc[1, :])
        only_counts_df.columns = ['raw_numbers']
        only_counts_df['percentage'] = only_counts_df['raw_numbers'] / only_counts_df.sum().values[0]
        all_distributions.append(round(only_counts_df['percentage'], 4))
    period_df_dist = pd.concat(all_distributions, axis=1).fillna(0)
    # take the mean for each period
    mean_per_type = pd.DataFrame(period_df_dist.mean(axis=1)).T 
    mean_per_type.index.name = 'number_of_speakers'
    mean_per_type = (mean_per_type * 100).round(2)
        
    return mean_per_type

In [7]:
def sigma_iarkho(df):
    """
    Calculate the standard range (sigma) following Iarkho's procedure.
    Parameters:
        df  - a dataframe where columns are variants, i.e., the distinct numbers of speakers in ascending order,
              e.g. [1, 2, 3, 4, 5], and values are the weights corresponding to these variants, i.e.,
              the number of scenes, e.g. [20, 32, 18, 9, 1]
    Returns:
        sigma - standard range per Iarkho
    """
    weighted_mean_variants = np.average(df.columns.tolist(), weights=df.values[0])
    differences_squared = [(variant - weighted_mean_variants)**2 for variant in df.columns]
    weighted_mean_difference = np.average(differences_squared, weights=df.values[0])
    sigma = round(weighted_mean_difference**0.5, 2)

    return sigma
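
As a worked example, the same steps can be applied directly to the variants and weights quoted in the docstring. This is only a sketch of the arithmetic (a weighted mean, then the square root of the weighted mean of squared deviations); the variable names are illustrative.

```python
import numpy as np

variants = [1, 2, 3, 4, 5]      # number of speakers in a scene
weights = [20, 32, 18, 9, 1]    # number of scenes with that many speakers

# weighted mean of the variants
weighted_mean = np.average(variants, weights=weights)
# squared deviation of each variant from the weighted mean
squared_deviations = [(v - weighted_mean) ** 2 for v in variants]
# sigma: square root of the weighted mean squared deviation
sigma_example = round(np.average(squared_deviations, weights=weights) ** 0.5, 2)
print(weighted_mean, sigma_example)
```

For these inputs the weighted mean is 2.2375 speakers per scene and sigma rounds to 0.99.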

In [8]:
def sigma_summary(df, playwrights_lst):
    sigmas = []
    for playwright in playwrights_lst:
        selection = df[(df.last_name == playwright[0]) & (df.first_name == playwright[1])].copy()
        sigma = selection.pipe(speech_distribution_by_period).pipe(sigma_iarkho)
        sigmas.append(sigma)
        
    summary = pd.DataFrame(sigmas, columns=['sigma_iarkho'])
    summary['z_score'] = (summary['sigma_iarkho'] - df['sigma_iarkho'].mean()) / df['sigma_iarkho'].std()
    summary.index = playwrights_lst
    
    return summary

In [9]:
def authors_data(data_df, feature):
    overall_mean = round(data_df[feature].mean(), 2)
    overall_std = round(data_df[feature].std(), 2)
    statistics = ['mean'] 
    all_authors = pd.DataFrame(data_df.groupby(['last_name', 'first_name'])[feature].mean())
    all_authors.columns= ['mean']
    all_authors['z_score'] = (all_authors['mean'] - overall_mean) / overall_std
    
    return  all_authors

In [10]:
def playwrights_place(df, with_z_score=True):
    if with_z_score:
        column = 'z_score'
        sigma_col = column
    else:
        column = ['mean']
        sigma_col = 'sigma_iarkho'
    summary = pd.DataFrame(authors_data(df, 'num_present_characters')[column])
    summary.columns = ['num_present_characters']
    # make sure the order of the playwrights is the same
    
    ind = summary.index
    summary['mobility_coefficient'] = authors_data(df, 'mobility_coefficient', 
                                                        ).loc[ind, column]
    summary['sigma_iarkho'] = sigma_summary(df, ind)[sigma_col]
    summary['polylogues'] = authors_data(df, 'percentage_polylogues', 
                                                         ).loc[ind, column]
    summary['monologues'] = authors_data(df, 'percentage_monologues', 
                                                         ).loc[ind, column]
    summary = summary.round(2)
    if with_z_score:
        summary['monologues'] = summary['monologues'].apply(lambda x: -x)
        summary['open_form_score'] = round(summary.apply(lambda x: x.mean(), axis=1), 2)
        summary = summary.sort_values(by='open_form_score', ascending=False)
        
    return summary
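
As implemented in `playwrights_place`, the open-form score averages five z-scores per playwright, with the sign of the monologue z-score flipped so that a lower share of monologues raises the score. A minimal sketch with made-up z-scores for two hypothetical playwrights (these are not values from the corpus):

```python
import pandas as pd

# Hypothetical z-scores; the playwrights and numbers are invented for illustration.
z_scores = pd.DataFrame(
    {
        'num_present_characters': [1.0, -0.5],
        'mobility_coefficient': [0.5, -1.0],
        'sigma_iarkho': [2.0, -0.5],
        'polylogues': [1.0, -1.0],
        'monologues': [-1.5, 1.0],
    },
    index=['Playwright A', 'Playwright B'],
)

# Fewer monologues count as a more open form, so flip the sign.
z_scores['monologues'] = -z_scores['monologues']

# The open-form score is the row-wise mean of the five z-scores.
open_form_score = z_scores.mean(axis=1).round(2)
print(open_form_score)
```

Under this scheme a positive score indicates a playwright whose features sit above the period average in the direction of open form, and a negative score indicates the opposite.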

In [11]:
comedies = pd.read_csv('../Russian_Comedies/Data/Comedies_Raw_Data.csv')
# sort by creation date
comedies_sorted = comedies.sort_values(by='creation_date').copy()
# select only original comedies (exclude translations and adaptations)
original_comedies = comedies_sorted[(comedies_sorted['translation/adaptation'] == 0)].copy()

# rename the columns 
original_comedies = original_comedies.rename(columns={'stage_directions_frequency': 'frequency',
                                                   'average_length_of_stage_direction': 'average_length',
                                                   'degree_of_verse_prose_interaction': 'verse_prose_interaction',
                                                   'num_scenes_iarkho': 'mobility_coefficient', 
                                                   'percentage_non_duologues': 'percentage_non_dialogues',
                                                   'percentage_above_two_speakers': 'percentage_polylogues',
                                                    'percentage_scenes_with_discontinuous_change_characters': 'discontinuous_scenes'})

In [12]:
# calculate the coefficient of non-used dramatic characters
unused_coefficient = get_data('../Russian_Comedies/Play_Jsons/')
unused_coefficient['index'] = unused_coefficient.index.tolist()
original_comedies = original_comedies.merge(unused_coefficient, on='index')

In [13]:
original_comedies['last_name'] = original_comedies['last_name'].str.strip()
original_comedies['speech_distribution'] = original_comedies['speech_distribution'].apply(make_list)

In [14]:
four_act = original_comedies[original_comedies.num_acts == 4].copy()
five_act = original_comedies[original_comedies.num_acts == 5].copy()
four_act['num_present_characters'] = round(four_act['num_present_characters'] * 5/4, 0)
four_act['mobility_coefficient'] = round(four_act['mobility_coefficient'] * 5/4, 0)

In [15]:
combined_df = pd.concat([four_act, five_act])
combined_df = combined_df.sort_values(by='creation_date')

## Part 1. Iarkho's Original Features

### The Number of Dramatic Characters

In [16]:
summary_features(combined_df, 'num_present_characters')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,num_present_characters
mean,17.0
std,8.89
50%,14.0
min,8.0
max,42.0


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,num_present_characters
14,Griboedov,Aleksandr,Gore ot uma,1824,42.0


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,num_present_characters
0,Nikolev,Nikolai,Samoliubivyi stikhotvorets,1775,8.0
2,Efim’ev,Dmitrii,Prestupnik ot igry ili bratom prodannaia sestra,1788,8.0


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,num_present_characters
6,Golitsyn,Aleksei,Novye chudaki ili Prozhekter,1797,17.0


### The Mobility Coefficient

In [17]:
summary_features(combined_df, 'mobility_coefficient')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,mobility_coefficient
mean,61.43
std,18.14
50%,59.0
min,41.0
max,111.0


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,mobility_coefficient
20,Grigor’ev,Petr,Zhiteiiskaia shkola,1849,111.0


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,mobility_coefficient
13,Kokoshkin,Fedor,"Vospitalie, ili vot pridanoe",1824,41.0


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,mobility_coefficient
3,Kniazhnin,Iakov,Chudaki,1790,60.0
4,Klushin,Aleksandr,Smekh i gore,1792,60.0
9,Shakhovskoi,Aleksandr,"Urok koketkam, ili lipetskie vody",1815,60.0


### The Standard Range of the Number of Speaking Characters (Sigma)

In [18]:
summary_features(combined_df, 'sigma_iarkho')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,sigma_iarkho
mean,1.54
std,0.52
50%,1.48
min,0.74
max,2.77


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,sigma_iarkho
5,Kapnist,Vasilii,Iabeda,1794,2.77


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,sigma_iarkho
19,Krol’,Nikolai,Komediia iz sovremennoi zhizni,1849,0.74


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,sigma_iarkho
16,Zagoskin,Mikhail,Blagorodnyi teatr,1828,1.57


### The Percentage of Polylogues

In [19]:
summary_features(combined_df, 'percentage_polylogues')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_polylogues
mean,38.51
std,13.86
50%,39.39
min,15.38
max,61.22


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_polylogues
12,Shakhovskoi,Aleksandr,Pustodumy,1819,61.22


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_polylogues
2,Efim’ev,Dmitrii,Prestupnik ot igry ili bratom prodannaia sestra,1788,15.38


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_polylogues
3,Kniazhnin,Iakov,Chudaki,1790,38.33


### The Percentage of Monologues

In [20]:
summary_features(combined_df, 'percentage_monologues')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_monologues
mean,22.03
std,8.79
50%,21.21
min,6.12
max,42.31


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_monologues
2,Efim’ev,Dmitrii,Prestupnik ot igry ili bratom prodannaia sestra,1788,42.31


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_monologues
12,Shakhovskoi,Aleksandr,Pustodumy,1819,6.12


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_monologues
20,Grigor’ev,Petr,Zhiteiiskaia shkola,1849,22.52


### Speech Distribution for Each Playwright

#### Nikolai Krol' (1823 - 1871)

In [21]:
krol = speech_distribution_by_period(combined_df[combined_df.last_name == 'Krol’'])
display(krol)
print('The standard range of the number of speaking characters:', sigma_iarkho(krol))

number_of_speakers,1,2,3,4
0,33.96,49.06,15.09,1.89


The standard range of the number of speaking characters: 0.74


#### Rafail Zotov (1796 - 1871)

In [22]:
zotov = speech_distribution_by_period(combined_df[combined_df.last_name == 'Zotov'])
display(zotov)
print('The standard range of the number of speaking characters:', sigma_iarkho(zotov))

number_of_speakers,1,2,3,4,5
0,15.38,57.69,21.15,3.85,1.92


The standard range of the number of speaking characters: 0.81


#### Dmitrii Efim'ev (1768 - 1804)

In [23]:
efimev = speech_distribution_by_period(combined_df[combined_df.last_name == 'Efim’ev'])
display(efimev)
print('The standard range of the number of speaking characters:', sigma_iarkho(efimev))

number_of_speakers,1,2,3,4,5
0,42.31,42.31,11.54,1.92,1.92


The standard range of the number of speaking characters: 0.86


#### Nikolai Nikolev (1758 - 1815)

In [24]:
nikolev = speech_distribution_by_period(combined_df[combined_df.last_name == 'Nikolev'])
display(nikolev)
print('The standard range of the number of speaking characters:', sigma_iarkho(nikolev))

number_of_speakers,1,2,3,4,5,6
0,31.11,51.11,4.44,8.89,2.22,2.22


The standard range of the number of speaking characters: 1.12


#### Iakov Kniazhnin (1740 - 1791)

In [25]:
kniazhnin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Kniazhnin'])
display(kniazhnin)
print('The standard range of the number of speaking characters:', sigma_iarkho(kniazhnin))

number_of_speakers,1,2,3,4,5,6,8
0,15.61,39.78,22.72,13.11,4.39,3.49,0.91


The standard range of the number of speaking characters: 1.32


#### Nikolai Seliavin (1774 - 1833)

In [26]:
seliavin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Seliavin'])
display(seliavin)
print('The standard range of the number of speaking characters:', sigma_iarkho(seliavin))

number_of_speakers,1,2,3,4,5,7,9
0,25.42,50.85,13.56,3.39,3.39,1.69,1.69


The standard range of the number of speaking characters: 1.42


#### Boris Fedorov (1794 - 1875)

In [27]:
fedorov = speech_distribution_by_period(combined_df[combined_df.last_name == 'Fedorov'])
display(fedorov)
print('The standard range of the number of speaking characters:', sigma_iarkho(fedorov))

number_of_speakers,0,1,2,3,4,5,6,7
0,1.19,27.38,36.9,15.48,13.1,1.19,1.19,3.57


The standard range of the number of speaking characters: 1.43


#### Mikhail Zagoskin (1789 - 1852)

In [28]:
zagoskin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Zagoskin'])
display(zagoskin)
print('The standard range of the number of speaking characters:', sigma_iarkho(zagoskin))

number_of_speakers,1,2,3,4,5,6,7
0,20.09,32.39,18.58,16.19,8.1,3.14,1.52


The standard range of the number of speaking characters: 1.44


Note on Fedorov's distribution above: his comedy *Chudnyia vstrechi* indeed has some scenes with no speaking characters, which is why its table includes a column for 0 speakers.

#### Aleksandr Klushin (1763 - 1804)

In [29]:
klushin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Klushin'])
display(klushin)
print('The standard range of the number of speaking characters:', sigma_iarkho(klushin))

number_of_speakers,1,2,3,4,5,6,7,8
0,20.0,33.33,26.67,11.67,1.67,3.33,1.67,1.67


The standard range of the number of speaking characters: 1.48


#### Petr Grigor’ev (1807 - 1854)

In [30]:
grigorev = speech_distribution_by_period(combined_df[combined_df.last_name == 'Grigor’ev'])
display(grigorev)
print('The standard range of the number of speaking characters:', sigma_iarkho(grigorev))

number_of_speakers,0,1,2,3,4,5,6,7,9
0,1.8,22.52,33.33,19.82,11.71,8.11,0.9,0.9,0.9


The standard range of the number of speaking characters: 1.48


#### Aleksandr Shakhovskoi (1777 - 1846)

In [31]:
shakhovskoi = speech_distribution_by_period(combined_df[combined_df.last_name == 'Shakhovskoi'])
display(shakhovskoi)
print('The standard range of the number of speaking characters:', sigma_iarkho(shakhovskoi))

number_of_speakers,1,2,3,4,5,6,7,9,10
0,12.22,35.49,23.45,19.74,4.54,1.86,1.02,0.84,0.84


The standard range of the number of speaking characters: 1.5


#### Vasilii Golovin (1776 - 1831)

In [32]:
golovin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Golovin'])
display(golovin)
print('The standard range of the number of speaking characters:', sigma_iarkho(golovin))

number_of_speakers,1,2,3,4,5,6,7,9
0,28.57,46.03,12.7,1.59,4.76,3.17,1.59,1.59


The standard range of the number of speaking characters: 1.57


#### Aleksandr Sobolev

In [33]:
sobolev = speech_distribution_by_period(combined_df[combined_df.last_name == 'Sobolev'])
display(sobolev)
print('The standard range of the number of speaking characters:', sigma_iarkho(sobolev)) 

number_of_speakers,1,2,3,4,5,6,9
0,13.04,28.26,19.57,19.57,8.7,8.7,2.17


The standard range of the number of speaking characters: 1.69


#### Fedor Kokoshkin (1773 - 1838)

In [34]:
kokoshkin = speech_distribution_by_period(combined_df[combined_df.last_name == 'Kokoshkin'])
display(kokoshkin)
print('The standard range of the number of speaking characters:', sigma_iarkho(kokoshkin))

number_of_speakers,1,2,3,4,6,7,9
0,24.24,36.36,18.18,9.09,6.06,3.03,3.03


The standard range of the number of speaking characters: 1.86


#### Aleksei Golitsyn (1767 - 1800)

In [35]:
golitsyn = speech_distribution_by_period(combined_df[combined_df.last_name == 'Golitsyn'])
display(golitsyn)
print('The standard range of the number of speaking characters:', sigma_iarkho(golitsyn))

number_of_speakers,1,2,3,4,5,9
0,29.79,34.04,19.15,8.51,4.26,4.26


The standard range of the number of speaking characters: 1.75


#### Anonymous 

In [36]:
anonymous = speech_distribution_by_period(combined_df[combined_df.last_name == 'Unknown'])
display(anonymous)
print('The standard range of the number of speaking characters:', sigma_iarkho(anonymous))

number_of_speakers,1,2,3,4,5,6,7,9,13
0,20.0,35.56,15.56,13.33,4.44,2.22,4.44,2.22,2.22


The standard range of the number of speaking characters: 2.32


#### Aleksandr Griboedov (1795 - 1829)

In [37]:
griboedov = speech_distribution_by_period(combined_df[combined_df.last_name == 'Griboedov'])
display(griboedov)
print('The standard range of the number of speaking characters:', sigma_iarkho(griboedov))

number_of_speakers,1,2,3,4,5,9,10,19
0,26.67,41.33,18.67,6.67,1.33,2.67,1.33,1.33


The standard range of the number of speaking characters: 2.54


#### Vasilii Kapnist (1758 - 1823)

In [38]:
kapnist = speech_distribution_by_period(combined_df[combined_df.last_name == 'Kapnist'])
display(kapnist)
print('The standard range of the number of speaking characters:', sigma_iarkho(kapnist))

number_of_speakers,1,2,3,4,5,6,7,8,9,10,12
0,6.38,34.04,23.4,8.51,2.13,2.13,8.51,4.26,4.26,4.26,2.13


The standard range of the number of speaking characters: 2.77


### Observations:
- The playwright with the narrowest range of speaking characters was Nikolai Krol': no scene in his comedy has more than 4 speakers.
- The playwright with the widest range was Aleksandr Griboedov, whose comedy includes a scene with 19 speaking characters.

## Part 2. Stage Directions

### Stage Directions Frequency

In [39]:
summary_features(combined_df, 'frequency')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,frequency
mean,18.58
std,5.19
50%,17.3
min,5.23
max,29.57


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,frequency
20,Grigor’ev,Petr,Zhiteiiskaia shkola,1849,29.57


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,frequency
10,Sobolev,Aleksandr,Tri zhenikha ili liubov‘ nyneshniago sveta,1817,5.23


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,frequency
16,Zagoskin,Mikhail,Blagorodnyi teatr,1828,18.56


### The Average Length of Stage Directions

In [40]:
summary_features(combined_df, 'average_length')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,average_length
mean,2.69
std,0.74
50%,2.71
min,1.09
max,3.94


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,average_length
14,Griboedov,Aleksandr,Gore ot uma,1824,3.94


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,average_length
7,Seliavin,Nikolai,Zhenikhi ili pobezhdennyi predrassudok,1806,1.09


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,average_length
0,Nikolev,Nikolai,Samoliubivyi stikhotvorets,1775,2.71


### The Degree of Verse and Prose Interaction

In [41]:
summary_features(combined_df, 'verse_prose_interaction')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,verse_prose_interaction
mean,7.67
std,3.18
50%,7.74
min,1.08
max,14.22


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,verse_prose_interaction
9,Shakhovskoi,Aleksandr,"Urok koketkam, ili lipetskie vody",1815,14.22


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,verse_prose_interaction
10,Sobolev,Aleksandr,Tri zhenikha ili liubov‘ nyneshniago sveta,1817,1.08


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,verse_prose_interaction
11,Fedorov,Boris,Chudnyia vstrechi,1818,7.74


## Part 3. Verse Features

### The Percentage of Scenes With Split Verse Lines

In [42]:
summary_features(combined_df, 'percentage_scene_split_verse')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_scene_split_verse
mean,30.71
std,13.74
50%,30.44
min,3.33
max,56.0


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_verse
12,Shakhovskoi,Aleksandr,Pustodumy,1819,56.0


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_verse
4,Klushin,Aleksandr,Smekh i gore,1792,3.33


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_verse
0,Nikolev,Nikolai,Samoliubivyi stikhotvorets,1775,30.44


### The Percentage of Scenes With Split Rhymes

In [43]:
summary_features(combined_df, 'percentage_scene_split_rhymes')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_scene_split_rhymes
mean,39.44
std,15.16
50%,36.74
min,6.67
max,67.31


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_rhymes
18,Zotov,Rafail,Novaia shkola muzhei,1842,67.31


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_rhymes
4,Klushin,Aleksandr,Smekh i gore,1792,6.67


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scene_split_rhymes
8,Unknown,Unknown,V sem''e ne bez uroda,1813,40.0


### The Percentage of Open Scenes

In [44]:
summary_features(combined_df, 'percentage_open_scenes')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_open_scenes
mean,55.6
std,18.23
50%,54.76
min,6.67
max,85.0


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_open_scenes
9,Shakhovskoi,Aleksandr,"Urok koketkam, ili lipetskie vody",1815,85.0


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_open_scenes
4,Klushin,Aleksandr,Smekh i gore,1792,6.67


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_open_scenes
11,Fedorov,Boris,Chudnyia vstrechi,1818,54.76


### The Percentage of Scenes With Split Verse Lines and Rhymes

In [45]:
summary_features(combined_df, 'percentage_scenes_rhymes_split_verse')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,percentage_scenes_rhymes_split_verse
mean,14.55
std,8.95
50%,12.7
min,3.33
max,38.0


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scenes_rhymes_split_verse
12,Shakhovskoi,Aleksandr,Pustodumy,1819,38.0


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scenes_rhymes_split_verse
4,Klushin,Aleksandr,Smekh i gore,1792,3.33


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,percentage_scenes_rhymes_split_verse
1,Kniazhnin,Iakov,Khvastun,1785,14.54


## Part 4. Other Features

### The Coefficient of Unused Dramatic Characters

In [46]:
summary_features(combined_df, 'coefficient_unused')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,coefficient_unused
mean,23.37
std,11.11
50%,24.62
min,5.0
max,45.48


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,coefficient_unused
16,Zagoskin,Mikhail,Blagorodnyi teatr,1828,45.48


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,coefficient_unused
7,Seliavin,Nikolai,Zhenikhi ili pobezhdennyi predrassudok,1806,5.0


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,coefficient_unused
9,Shakhovskoi,Aleksandr,"Urok koketkam, ili lipetskie vody",1815,22.43


### The Percentage of Discontinuous Scenes

In [47]:
summary_features(combined_df, 'discontinuous_scenes')

Mean, standard deviation, median, min and max values for the period:


Unnamed: 0,discontinuous_scenes
mean,6.38
std,4.16
50%,6.38
min,1.52
max,17.02


Period Max:


Unnamed: 0,last_name,first_name,title,creation_date,discontinuous_scenes
6,Golitsyn,Aleksei,Novye chudaki ili Prozhekter,1797,17.02


Period Min:


Unnamed: 0,last_name,first_name,title,creation_date,discontinuous_scenes
16,Zagoskin,Mikhail,Blagorodnyi teatr,1828,1.52


The closest to the mean:


Unnamed: 0,last_name,first_name,title,creation_date,discontinuous_scenes
5,Kapnist,Vasilii,Iabeda,1794,6.38


### Summary:
1. The first five-act verse comedy in the history of Russian literature, Nikolai Nikolev's *Samoliubivyi stikhotvorets* (1775), had the minimum number of dramatic characters (8) and a low mobility coefficient (45, against a period minimum of 41). It was the closest to the mean based on the average length of a stage direction (2.71) and the percentage of scenes with split verse lines (30.44%).


2. Iakov Kniazhnin's *Khvastun* (1785) was the closest to the mean based on the percentage of scenes with split verse lines and rhymes (14.54%). Its coefficient of unused dramatic characters (18.85) was below the period mean (23.37).


3. Dmitrii Efim’ev's *Prestupnik ot igry ili bratom prodannaia sestra* (1788) had the maximum percentage of monologues (42.31%), the minimum number of dramatic characters (8), and the minimum percentage of polylogues (15.38%).


4. Iakov Kniazhnin's *Chudaki* (1790) was the closest to the mean based on the mobility coefficient (60) and the percentage of polylogues (38.33%).

5. Aleksandr Klushin's *Smekh i gore* (1792) had the minimum percentage of scenes with split verse lines (3.33%), the minimum percentage of scenes with split rhymes (6.67%), the minimum percentage of open scenes (6.67%), and the minimum percentage of scenes with split verse lines and rhymes (3.33%). It was also the closest to the mean based on the mobility coefficient (60).


6. Vasilii Kapnist's *Iabeda* (1794) had the maximum sigma (2.77) and was the closest to the mean based on the percentage of discontinuous scenes (6.38%).


7. Aleksei Golitsyn's *Novye chudaki ili Prozhekter* (1797) had the maximum percentage of discontinuous scenes (17.02%).


8. Nikolai Seliavin's *Zhenikhi ili pobezhdennyi predrassudok* (1806) had the minimum average length of a stage direction (1.09) and the minimum coefficient of unused dramatic characters (5.0).


9. *V sem'e ne bez uroda* (1813) by an anonymous author was the closest to the mean based on the percentage of scenes with split rhymes.


10. Aleksandr Shakhovskoi's *Urok koketkam, ili lipetskie vody* (1815) had the highest percentage of open scenes (85%) and the maximum degree of verse and prose interaction (14.22). It was also the closest to the mean based on the mobility coefficient (60) and the coefficient of unused dramatic characters (22.43).


11. Aleksandr Sobolev's *Tri zhenikha ili liubov’ nyneshniago sveta* (1817) had the minimum frequency of stage directions (5.23) and the minimum degree of verse and prose interaction (1.08).


12. Boris Fedorov's *Chudnyia vstrechi* (1818) was the closest to the central tendency based on the degree of verse and prose interaction (7.74) and the percentage of open scenes (54.76%).


13. Aleksandr Shakhovskoi's *Pustodumy* (1819) had the maximum percentage of polylogues (61.22%), the maximum percentage of scenes with split verse lines (56%), and the maximum percentage of scenes with split verse lines and rhymes (38%); it also had the minimum percentage of monologues (6.12%). 

14. Fedor Kokoshkin's *Vospitanie, ili vot pridanoe* (1824) had the minimum mobility coefficient of 41.


15. Aleksandr Griboedov's *Gore ot uma* (1824) had the maximum number of dramatic characters (42) and the maximum average length of a stage direction.


16. Mikhail Zagoskin's *Blagorodnyi teatr* (1828) had the maximum coefficient of unused dramatic characters (45.48%) and the minimum percentage of discontinuous scenes (1.52%). It was the closest to the mean based on the standard range of the number of speaking characters (1.57) and the frequency of stage directions.


17. Rafail Zotov's *Novaia shkola muzhei* (1842) had the maximum percentage of scenes with split rhymes (67.31%).

 
18. Nikolai Krol’s *Komediia iz sovremennoi zhizni* (1849) had the minimum standard range of the number of speaking characters (0.74).


19. Petr Grigor’ev's *Zhiteiskaia shkola* (1849) had the maximum mobility coefficient (111) and the maximum frequency of stage directions (29.57). It was the closest to the mean based on the percentage of monologues (22.52%).



## Open-Form Scores

We would also like to place each author's comedic style in the context of the history of the Russian comedy in verse. As with the French comedians, we will use the following features:

- the number of dramatic characters
- the mean mobility coefficient
- the standard range of the number of speaking characters (sigma)
- the mean percentage of polylogues
- the mean percentage of monologues. 


Open-form scores:
For every feature, we calculate the z-score $z = (x - u)/s$, where $x$ is the playwright's mean value of the feature, $u$ is the mean of the feature, and $s$ is the standard deviation of the feature. For the percentage of monologues, we reverse the sign (i.e., we use $-z$), since a lower percentage of monologues indicates a more open form.
The open-form score is the mean of the five resulting z-scores.

For example, the z-score for the number of dramatic characters in Aleksandr Griboedov's *Gore ot uma* is (42 - 17.0) / 8.89 ≈ 2.81.
Repeating this calculation for the other features gives z-scores of 1.80 (mobility coefficient), 1.91 (sigma), and -0.47 (percentage of polylogues), plus a sign-reversed z-score of -0.53 for the percentage of monologues. His open-form score is therefore (2.81 + 1.80 + 1.91 - 0.47 - 0.53) / 5 ≈ 1.10. Open-form scores can be positive or negative: a large positive score indicates a more open form, whereas a large negative score indicates a more closed form.
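The scoring procedure described above can be sketched in plain Python. This is only an illustration, not the notebook's `playwrights_place` implementation: the helper name `open_form_scores` is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator is used. The three demo rows are taken from the raw-number table, so their scores will not match the full-corpus results.

```python
from statistics import mean, stdev

FEATURES = ["num_present_characters", "mobility_coefficient",
            "sigma_iarkho", "polylogues", "monologues"]
REVERSED = {"monologues"}  # lower values mean a more open form


def open_form_scores(playwright_means):
    """Hypothetical helper: map each playwright to the mean of their z-scores.

    playwright_means: dict of name -> dict of feature -> mean value.
    Assumption: z-scores use the sample standard deviation.
    """
    z_by_name = {name: [] for name in playwright_means}
    for feature in FEATURES:
        values = [row[feature] for row in playwright_means.values()]
        u, s = mean(values), stdev(values)
        sign = -1.0 if feature in REVERSED else 1.0
        for name, row in playwright_means.items():
            z_by_name[name].append(sign * (row[feature] - u) / s)
    return {name: mean(zs) for name, zs in z_by_name.items()}


# Worked check with the figures quoted in the text for Griboedov.
z_characters = (42 - 17.0) / 8.89
griboedov_score = mean([2.81, 1.80, 1.91, -0.47, -0.53])
print(round(z_characters, 2), round(griboedov_score, 2))

# Demo on three rows only (illustrative; not the full corpus).
demo = {
    "Griboedov": dict(zip(FEATURES, [42.0, 94.0, 2.54, 32.0, 26.67])),
    "Kapnist": dict(zip(FEATURES, [15.0, 47.0, 2.77, 59.57, 6.38])),
    "Klushin": dict(zip(FEATURES, [9.0, 60.0, 1.48, 46.67, 20.0])),
}
demo_scores = open_form_scores(demo)
print(demo_scores)
```

Because z-scores for each feature sum to zero across the group being compared, the open-form scores of that group also average to zero; a playwright's score is therefore always relative to the other playwrights included in the calculation.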

In [48]:
results_with_open_form = playwrights_place(combined_df, with_z_score=True)

In [49]:
results_raw = playwrights_place(combined_df, with_z_score=False).loc[results_with_open_form.index, :]
results_raw['open_form_score'] = results_with_open_form.open_form_score.tolist()

### Raw Numbers

In [50]:
results_raw

last_name,first_name,num_present_characters,mobility_coefficient,sigma_iarkho,polylogues,monologues,open_form_score
Griboedov,Aleksandr,42.0,94.0,2.54,32.0,26.67,1.1
Kapnist,Vasilii,15.0,47.0,2.77,59.57,6.38,0.93
Grigor’ev,Petr,21.0,111.0,1.48,42.34,22.52,0.66
Zagoskin,Mikhail,29.5,77.0,1.44,47.52,20.09,0.59
Unknown,Unknown,19.0,45.0,2.32,44.44,20.0,0.29
Shakhovskoi,Aleksandr,13.0,54.5,1.5,52.28,12.22,0.24
Sobolev,Aleksandr,9.0,46.0,1.69,58.7,13.04,0.2
Fedorov,Boris,24.0,84.0,1.43,34.52,27.38,0.18
Kniazhnin,Iakov,14.5,57.5,1.32,44.62,15.61,0.05
Klushin,Aleksandr,9.0,60.0,1.48,46.67,20.0,-0.05


### Z-Scores and Open-Form Scores

In [51]:
results_with_open_form 

last_name,first_name,num_present_characters,mobility_coefficient,sigma_iarkho,polylogues,monologues,open_form_score
Griboedov,Aleksandr,2.81,1.8,1.91,-0.47,-0.53,1.1
Kapnist,Vasilii,-0.22,-0.8,2.35,1.52,1.78,0.93
Grigor’ev,Petr,0.45,2.73,-0.11,0.28,-0.06,0.66
Zagoskin,Mikhail,1.41,0.86,-0.19,0.65,0.22,0.59
Unknown,Unknown,0.22,-0.91,1.49,0.43,0.23,0.29
Shakhovskoi,Aleksandr,-0.45,-0.38,-0.07,0.99,1.12,0.24
Sobolev,Aleksandr,-0.9,-0.85,0.29,1.46,1.02,0.2
Fedorov,Boris,0.79,1.24,-0.21,-0.29,-0.61,0.18
Kniazhnin,Iakov,-0.28,-0.22,-0.42,0.44,0.73,0.05
Klushin,Aleksandr,-0.9,-0.08,-0.11,0.59,0.23,-0.05
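As a quick sanity check, the displayed open-form scores can be recomputed as the row means of the five z-score columns. The sketch below copies the ten rows shown above by hand; it assumes, as the values suggest (Kapnist's low monologue percentage yields a positive entry), that the `monologues` column is already sign-reversed. Small gaps are expected because the displayed z-scores are rounded to two decimals.

```python
# (last_name, [five z-scores as displayed], displayed open_form_score)
ROWS = [
    ("Griboedov",   [2.81, 1.80, 1.91, -0.47, -0.53], 1.10),
    ("Kapnist",     [-0.22, -0.80, 2.35, 1.52, 1.78], 0.93),
    ("Grigor'ev",   [0.45, 2.73, -0.11, 0.28, -0.06], 0.66),
    ("Zagoskin",    [1.41, 0.86, -0.19, 0.65, 0.22], 0.59),
    ("Unknown",     [0.22, -0.91, 1.49, 0.43, 0.23], 0.29),
    ("Shakhovskoi", [-0.45, -0.38, -0.07, 0.99, 1.12], 0.24),
    ("Sobolev",     [-0.90, -0.85, 0.29, 1.46, 1.02], 0.20),
    ("Fedorov",     [0.79, 1.24, -0.21, -0.29, -0.61], 0.18),
    ("Kniazhnin",   [-0.28, -0.22, -0.42, 0.44, 0.73], 0.05),
    ("Klushin",     [-0.90, -0.08, -0.11, 0.59, 0.23], -0.05),
]


def row_mean(z_scores):
    """Mean of the z-scores in one row."""
    return sum(z_scores) / len(z_scores)


for name, z_scores, shown in ROWS:
    print(f"{name:12s} recomputed {row_mean(z_scores):+.3f}  displayed {shown:+.2f}")

# Largest gap between a recomputed row mean and the displayed score.
max_gap = max(abs(row_mean(z) - shown) for _, z, shown in ROWS)
positive_shown = sum(1 for _, _, shown in ROWS if shown > 0)
```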


In [52]:
results_with_open_form[results_with_open_form.open_form_score > 0].shape[0]

9

In [53]:
results_with_open_form.shape

(18, 6)

### Conclusion:
- Aleksandr Griboedov (as represented by *Gore ot uma*) was the most experimental playwright in the history of the Russian five-act and four-act comedies (open-form score of 1.10). His comedy *Gore ot uma* (1824) had the maximum number of dramatic characters (42), the maximum average length of a stage direction, and the highest observed number of speaking characters (19).
- Vasilii Kapnist was in second place with an open-form score of 0.93. His comedy *Iabeda* (1794) had the highest observed sigma (2.77).
- Dmitrii Efim'ev had the lowest open-form score (-1.36). His comedy *Prestupnik ot igry ili bratom prodannaia sestra* (1788) had the highest percentage of monologues (42.31%), the minimum number of dramatic characters (8), and the minimum percentage of polylogues (15.38%).
- Authors with positive and with negative open-form scores co-existed during both the tentative Period One (1775 to 1794) and the tentative Period Two (1795 to 1849).
- Half of the Russian comedians (9 of 18) had positive open-form scores, i.e., wrote in an open style, while the other half wrote in a closed style.