# *To catch a protagonist* in DraCor
**by Ingo Börner**

The paper [*To catch a Protagonist: Quantitative Dominance Relations in German-Language Drama (1730-1930)*](https://dh2018.adho.org/to-catch-a-protagonist-quantitative-dominance-relations-in-german-language-drama-1730-1930/) ({cite:t}`fischer_catch_2018`) describes an algorithm that identifies the quantitatively dominant characters of a play based on a set of network-based and count-based measures:

> In order to systematically describe the extent of this deviation, we calculate eight values for each character of the 465 dramas of our corpus, three count-based measures (number of scenes a character appears in, number of speech acts, number of spoken words) and five network-related measures (degree, weighted degree, betweenness centrality, closeness centrality, eigenvector centrality). For each measurement a ranking is created. The rankings are then merged into two meta-rankings: one count-based and one network-based. The two meta-rankings are then combined into an overall ranking.

The original algorithm was implemented in the tool [Dramavis](https://github.com/lehkost/dramavis) by Christopher Kittel. *Dramavis* operates on XML ["zwischenformat"](https://dlina.github.io/Introducing-Our-Zwischenformat) files created in the [DLINA](https://dlina.github.io) project. 

The following notebook adapts the code of the respective modules to work with data returned by the [DraCor API](https://dracor.org/doc/api). The aim is to recreate the `*_chars.csv` files that were used in the study. These files can be found in the folder [`allmetrics`](https://github.com/dlina/catch-protagonist/tree/main/data/allmetrics) of the [repository](https://github.com/dlina/catch-protagonist) on GitHub.

The implementation will be tested with the play *Emilia Galotti*. The original algorithm operated on the corresponding [LINA](https://dlina.github.io/linas/88) and produced the file [`88_Emilia Galotti_chars.csv`](https://github.com/dlina/catch-protagonist/blob/main/data/allmetrics/88_Emilia%20Galotti_chars.csv) as output. In DraCor the play can be accessed [here](https://dracor.org/ger/lessing-emilia-galotti).


## Step 1. Get the basic measures

We need to get the following basic measures on characters:

**Network measures**

* betweenness
* degree
* closeness
* ~~closeness corrected~~
* weighted degree
* eigenvector centrality

**Count-based measures**

* frequency/appearances
* number of *speech acts*
* number of *words*

### Network and count-based metrics via DraCor API

The Python package [`requests`](https://docs.python-requests.org/en/latest/) and the standard-library module [`json`](https://docs.python.org/3/library/json.html#module-json) will be used to query the API and parse the response:

In [None]:
# if not installed, uncomment the following line and run the cell:
# !pip install requests

In [None]:
import requests
import json

In [None]:
# set corpus and playname
corpusname = "ger"
playname = "lessing-emilia-galotti"

In [None]:
# base url of the DraCor-API
api_base = "https://dracor.org/api/v1/"

To retrieve the network data and the amount of speech for the individual characters, the endpoint `/corpora/{corpusname}/plays/{playname}/characters` is used as follows:

In [None]:
# send a request to the endpoint and parse results
request_url = api_base + "corpora/" + corpusname + "/plays/" + playname + "/characters"
print(request_url)
r = requests.get(request_url)
# stop early with an HTTPError if the API did not answer with a success status
r.raise_for_status()
character_data = json.loads(r.text)

The API function returns data on the characters, including the network and count-based metrics:

In [None]:
character_data

The response is a list of dictionaries, one per character. A single entry looks like this:
```
{'betweenness': 0.24696969696969698,
  'closeness': 0.8,
  'degree': 9,
  'eigenvector': 0.44898463593218985,
  'gender': 'MALE',
  'id': 'marinelli',
  'isGroup': False,
  'name': 'Marinelli',
  'numOfScenes': 19,
  'numOfSpeechActs': 221,
  'numOfWords': 4343,
  'weightedDegree': 30}
```

This response does not tell us anything about the network metrics of the play as a whole, though. To retrieve this information, we would have to use the endpoint `/corpora/{corpusname}/plays/{playname}/metrics`. Its response also includes a field with the key `numConnectedComponents`, which tells us whether the network consists of several sub-networks. This could be relevant, because some network metrics, e.g. the *closeness*, can be calculated differently in that case.
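
A minimal sketch of how the parsed response of the metrics endpoint might be inspected. The helper name `is_connected` and the example dictionary are made up for illustration; only the key `numConnectedComponents` is taken from the description above. The response itself could be fetched and parsed in the same way as `character_data`.

```python
def is_connected(play_metrics: dict) -> bool:
    """Return True if the network of the play consists of a single component."""
    # 'numConnectedComponents' is the field mentioned above; a missing key
    # raises a KeyError instead of silently assuming a connected network.
    return play_metrics["numConnectedComponents"] == 1


# Illustrative stand-in for the parsed JSON response (the value is invented)
example_metrics = {"numConnectedComponents": 2}

if not is_connected(example_metrics):
    print("Several sub-networks: closeness values should be read with care.")
```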

## Preparation: Get the metrics and construct a pandas data frame

In the Dramavis implementation an object of the class `DramaAnalyzer` is created, which contains the information on characters in a pandas data frame. We will create the same data structure to be able to use the same methods for calculating means and ranking.

The columns in the [table](https://raw.githubusercontent.com/dlina/catch-protagonist/main/data/allmetrics/88_Emilia%20Galotti_chars.csv) are:

`name,betweenness,degree,closeness,closeness_corrected,strength,eigenvector_centrality,avg_distance,avg_distance_corrected,frequency,speech_acts,words,lines,chars ...`

We will not include all columns, but only the ones that are relevant for the rankings:

`name`,`betweenness`,`degree`,`closeness`,~~closeness_corrected~~,`strength`,`eigenvector_centrality`,~~avg_distance,avg_distance_corrected~~,`frequency`,`speech_acts`,`words`,~~lines,chars~~ ...

The following columns are renamed to follow the DraCor conventions of the [API output](https://dracor.org/api/corpora/ger/play/lessing-emilia-galotti/cast):

* `name` → `id`; later this will be used to construct URIs
* `strength` → `weightedDegree`
* `eigenvector_centrality` → `eigenvector`
* `frequency` → `numOfScenes`
* `speech_acts` → `numOfSpeechActs`
* `words` → `numOfWords`
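
The renaming listed above can be written down as a lookup table, which is handy when the original Dramavis CSV is compared to the DraCor-based data frame. This is only a sketch: the names `DRAMAVIS_TO_DRACOR` and `to_dracor_names` are chosen here for illustration and are not part of Dramavis or the DraCor API.

```python
# Mapping from the column names in the Dramavis CSV to the DraCor field names
DRAMAVIS_TO_DRACOR = {
    "name": "id",
    "strength": "weightedDegree",
    "eigenvector_centrality": "eigenvector",
    "frequency": "numOfScenes",
    "speech_acts": "numOfSpeechActs",
    "words": "numOfWords",
}


def to_dracor_names(columns):
    """Translate Dramavis column names; names without a mapping are kept."""
    return [DRAMAVIS_TO_DRACOR.get(col, col) for col in columns]


# the columns of the original table that are relevant for the rankings
dramavis_header = ["name", "betweenness", "degree", "closeness", "strength",
                   "eigenvector_centrality", "frequency", "speech_acts", "words"]
print(to_dracor_names(dramavis_header))
```

The same dictionary could also be passed to `DataFrame.rename(columns=...)` once the original CSV has been loaded with pandas.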



The package [`pandas`](https://pandas.pydata.org/) is used to handle the data as a dataframe. Therefore we need to import the package.

In [None]:
# if not installed, uncomment the following line and run the cell:
# !pip install pandas

In [None]:
import pandas as pd

First, we need to transform the parsed JSON API response into a list of lists, which is then turned into the data frame `df`.

In [None]:
# columns
cols = ["id","betweenness","degree","closeness","weightedDegree","eigenvector","numOfScenes","numOfSpeechActs","numOfWords"]

# prepare the data for the data frame
df_data = []
for character in character_data:
    row = []
    for key in cols:
        row.append(character[key])
    df_data.append(row)

# construct the data frame
df = pd.DataFrame(df_data, columns = cols)

#turn the column "id" to the index
df = df.set_index('id')
#output
df
        

We can now query the data, e.g. output the values of a single character by requesting a row by its index value, which is the `id` of the character. 

In [None]:
# get the values of a single character
df.loc["marinelli"]

## Step 2. Calculate the ranks

In Dramavis the function [`get_character_ranks`](https://github.com/lehkost/dramavis/blob/v0.4/dramalyzer.py#L282-L288) creates the rankings of the count-based and network-based measures. We will adapt this function to operate on the created data frame and rename the columns:

In [None]:
metrics_to_rank = ['degree', 'closeness', 'betweenness', 'weightedDegree', 'eigenvector', 'numOfScenes', 'numOfSpeechActs', 'numOfWords']
for metric in metrics_to_rank:
    df[metric + "_rank"] = df[metric].rank(method='min', ascending=False)
df
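
To make the effect of the two `rank` arguments visible, here is a small sketch with invented toy values (the series `toy` is not taken from the play). With `ascending=False` the highest value receives rank 1, and with `method='min'` tied characters share the lowest rank of their group, so the next rank is skipped.

```python
import pandas as pd

# Invented toy values: characters a and b tie for the highest value
toy = pd.Series({"a": 10, "b": 10, "c": 5})

# ascending=False: the highest value gets rank 1
# method="min": tied values all receive the lowest rank of their group
toy_ranks = toy.rank(method="min", ascending=False)
print(toy_ranks.to_dict())  # {'a': 1.0, 'b': 1.0, 'c': 3.0}
```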
    

## Step 3. Rank on average and standard deviation of the individual rankings

In Dramavis the individual rankings are then used for the calculation of an average ranking and the standard deviation, which are then also ranked. This is done by the function [`get_centrality_ranks`](https://github.com/lehkost/dramavis/blob/v0.4/dramalyzer.py#L305-L310).

The following columns will be added to the data frame:

* (1) `centrality_rank_avg`: The average of all rankings
* (2) `centrality_rank_std`: Standard deviation of the rankings
* (3) `centrality_rank_avg_rank`: A ranking is created from the average of all rankings (1)
* (4) `centrality_rank_std_rank`: A ranking is created from the standard deviation of all rankings (2)

The following Dramavis code is adapted accordingly to operate on the data frame:

In [None]:
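# select the rank columns of the individual metrics explicitly, so that
# re-running this cell does not pick up the "*_rank" columns added below
ranks = [metric + "_rank" for metric in metrics_to_rank]
df['centrality_rank_avg'] = df[ranks].sum(axis=1)/len(ranks)
# the division by len(ranks) only rescales the values and does not affect the ranking below
df['centrality_rank_std'] = df[ranks].std(axis=1)/len(ranks)
# lower average or standard deviation means a better (lower) rank
for metric in ['centrality_rank_avg', 'centrality_rank_std']:
    df[metric + "_rank"] = df[metric].rank(method='min', ascending=True)
df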

Based on the calculation of `centrality_rank_avg_rank`, the "central" characters can already be queried as follows:

In [None]:
df[df["centrality_rank_avg_rank"] == 1].index.tolist()

## Additional Step: Create Rankings and combined rankings of network-based and count-based metrics separately

In addition to a ranking that combines all metrics and rankings derived thereof, the function [`get_structural_ranking_measures`](https://github.com/lehkost/dramavis/blob/v0.4/dramalyzer.py#L318-L330) treats network-based and count-based values separately and only then aggregates them to a combined overall ranking.

The function adds the following columns to the data frame:

* (1) `avg_graph_rank`: a ranking based on the rankings of the network-values (*degree*, *closeness*, *betweenness*, *strength* or *weightedDegree* and *eigenvector centrality* or *eigenvector*)
* (2) `avg_content_rank`: a ranking based on the rankings of the count-based values (*frequency* or *numOfScenes*, *speech acts* and *words*)
* (3) `overall_avg`: the two rankings (1+2) are combined by calculating the mean
* (4) `overall_avg_rank`: based on the overall average (3) a ranking is created

The following code is adapted accordingly to operate on the dataframe. The ranking stability measures are not implemented here.

In [None]:
#renamed the columns to match the DraCor values here:
graph_ranks = ['degree_rank', 'closeness_rank', 'betweenness_rank', 'weightedDegree_rank', 'eigenvector_rank']
content_ranks = ['numOfScenes_rank', 'numOfSpeechActs_rank', 'numOfWords_rank']
avg_graph_rank = df[graph_ranks].mean(axis=1).rank(method='min')
avg_content_rank = df[content_ranks].mean(axis=1).rank(method='min')
df["avg_graph_rank"] = avg_graph_rank
df["avg_content_rank"] = avg_content_rank
df["overall_avg"] = df[["avg_graph_rank", "avg_content_rank"]].mean(axis=1)
df["overall_avg_rank"] = df["overall_avg"].rank(method='min')
df

Based on the calculation of `overall_avg_rank`, the "central" characters can be queried as follows:

In [None]:
df[df["overall_avg_rank"] == 1].index.tolist()
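
To apply the two-stage ranking to other plays, the steps could be bundled into a single function. The following is a sketch under assumptions, not the Dramavis implementation: the function name `overall_ranking` and the toy data frame are invented for illustration, and the input is expected to have the DraCor column names used in this notebook.

```python
import pandas as pd

GRAPH_METRICS = ["degree", "closeness", "betweenness", "weightedDegree", "eigenvector"]
CONTENT_METRICS = ["numOfScenes", "numOfSpeechActs", "numOfWords"]


def overall_ranking(metrics: pd.DataFrame) -> pd.Series:
    """Combine network-based and count-based rankings into one overall rank."""
    # rank each metric individually: the highest value gets rank 1
    ranks = metrics[GRAPH_METRICS + CONTENT_METRICS].rank(method="min", ascending=False)
    # build the two meta-rankings from the mean of the individual ranks
    graph_rank = ranks[GRAPH_METRICS].mean(axis=1).rank(method="min")
    content_rank = ranks[CONTENT_METRICS].mean(axis=1).rank(method="min")
    # average the two meta-rankings and rank the result
    return ((graph_rank + content_rank) / 2).rank(method="min")


# Invented toy values for three characters, for illustration only
toy = pd.DataFrame(
    {
        "degree": [3, 2, 1],
        "closeness": [0.9, 0.7, 0.5],
        "betweenness": [0.4, 0.1, 0.0],
        "weightedDegree": [9, 5, 2],
        "eigenvector": [0.6, 0.5, 0.2],
        "numOfScenes": [10, 12, 3],
        "numOfSpeechActs": [80, 90, 10],
        "numOfWords": [900, 1500, 100],
    },
    index=["a", "b", "c"],
)
toy_overall = overall_ranking(toy)
print(toy_overall[toy_overall == 1].index.tolist())  # ['a', 'b']
```

In the toy data, character `a` leads every network-based metric and `b` leads every count-based metric, so both end up with the same overall average and share rank 1. This is why the query returns a list rather than a single character.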