In [1]:
%load_ext lab_black

# Explore `data` (WIP)

The preprocessed data, derived from the raw data, is provided in CSV files. 
There are three types of files:
- `*.cast.csv`
<br/>
The characters and character groups in the cast list of the play, plus information on whether they are part of some other character group (e.g., in _A Midsummer Night's Dream_, _Titania_'s fairies constitute one group of characters, and each named fairy is a member of that group).
- `*.raw.csv`
<br/>
A bare-bones, unaggregated representation of the raw XML data in tabular form, with redundant elements filtered out and additional derived attributes annotated.
- `*.agg.csv`
<br/>
An aggregated representation of the raw XML data in tabular form, derived from `*.raw.csv`, containing only derived attributes. 
<br/>
This is the basis of all our graph representations.
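
To work with more than one play, it can help to group the CSV files by play and by file type. The following is a minimal sketch, assuming every file is named `<play>.<kind>.csv` with `<kind>` being `agg`, `cast`, or `raw`; the helper `group_by_play` and the file names are hypothetical and not part of the repository.

```python
from collections import defaultdict
from pathlib import Path

KINDS = {"agg", "cast", "raw"}


def group_by_play(paths):
    """Map each play prefix to its files, keyed by kind ('agg', 'cast', 'raw')."""
    plays = defaultdict(dict)
    for path in map(Path, paths):
        # "some-play.cast.csv" has suffixes [".cast", ".csv"]
        suffixes = path.suffixes
        if len(suffixes) < 2 or suffixes[-1] != ".csv":
            continue
        kind = suffixes[-2].lstrip(".")
        if kind not in KINDS:
            continue
        # Strip the trailing ".<kind>.csv" to recover the play prefix.
        play = path.name[: -len("".join(suffixes[-2:]))]
        plays[play][kind] = path
    return dict(plays)


# Hypothetical file names, for illustration only.
example = group_by_play(
    [
        "data/play-a.agg.csv",
        "data/play-a.cast.csv",
        "data/play-a.raw.csv",
        "data/notes.txt",
    ]
)
```

With real data, the list passed to `group_by_play` would be the result of globbing `DATA_PATH`.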

In [3]:
from glob import glob

import pandas as pd

from statics import DATA_PATH

In [4]:
data_files = sorted(glob(f"{DATA_PATH}/*.csv"))

In [5]:
# Relies on alphabetical order: for the first play, agg < cast < raw.
agg, cast, raw = data_files[:3]

## Characters (`*.cast.csv`)

A `*.cast.csv` file provides a quick overview of the characters in a play. We provide it for convenience.

In [6]:
cast_df = pd.read_csv(cast)
cast_df.head()

Unnamed: 0,xml:id,corresp
0,ATTENDANTS.0.1_MND,#ATTENDANTS_MND
1,ATTENDANTS.0.2_MND,#ATTENDANTS_MND
2,ATTENDANTS_MND,
3,Bottom_MND,
4,Demetrius_MND,
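
The `corresp` column can be used to recover group membership. The sketch below uses toy data mirroring the rows shown above and assumes that `corresp` holds a single `#`-prefixed `xml:id` of the group a character belongs to; entries with several references would need extra handling.

```python
import pandas as pd

# Toy data mirroring the head of the cast file shown above.
cast_df = pd.DataFrame(
    {
        "xml:id": [
            "ATTENDANTS.0.1_MND",
            "ATTENDANTS.0.2_MND",
            "ATTENDANTS_MND",
            "Bottom_MND",
            "Demetrius_MND",
        ],
        "corresp": ["#ATTENDANTS_MND", "#ATTENDANTS_MND", None, None, None],
    }
)

# Keep only characters that point to a group, then collect members per group.
members = cast_df.dropna(subset=["corresp"])
group_members = (
    members.assign(group=members["corresp"].str.lstrip("#"))
    .groupby("group")["xml:id"]
    .apply(list)
    .to_dict()
)
```

Here `group_members` maps each group's `xml:id` to the list of its members.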


## Unaggregated tabular representation of the raw data (`*.raw.csv`)

A `*.raw.csv` file has one column for each attribute name that occurs in the descendants of the `<body>` tag of the raw XML, plus columns for the tag name (`tag`) and the text of leaf elements (`text`). 
It is almost as verbose as the raw XML, but allows for faster exploration via the `pd.DataFrame` API.

In [7]:
raw_df = pd.read_csv(raw, low_memory=False)
raw_df.head()

Unnamed: 0,tag,type,n,text,xml:id,who,lemma,ana,part,rendition,prev,act,scene,onstage,stagegroup_raw,speaker
0,div,act,1,,,,,,,,,1,0,,0,
1,div,scene,1,,,,,,,,,1,1,,0,
2,stage,entrance,SD 1.1.0,,stg-0000,"{'#Theseus_MND', '#Philostrate_MND', '#ATTENDA...",,,,,,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,1,
3,w,,SD 1.1.0,Enter,fs-mnd-0000070,,,,,,,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,1,
4,c,,,,,,,,,,,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,1,


For example, out of curiosity, we can inspect the instances of rare attributes, such as `part`, `rendition`, or `prev`:

In [20]:
raw_df.query("not part.isna()")

Unnamed: 0,tag,type,n,text,xml:id,who,lemma,ana,part,rendition,prev,act,scene,onstage,stagegroup_raw,speaker
184,l,,1.1.11,,ftln-0011,,,,I,,,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,1,#Hippolyta_MND
192,l,,1.1.12,,ftln-0012,,,,F,,,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,1,#Theseus_MND
892,l,,1.1.54,,ftln-0054,,,,I,,,1,1,#ATTENDANTS_MND #Demetrius_MND #Egeus_MND #Her...,3,#Hermia_MND
900,l,,1.1.55,,ftln-0055,,,,F,,,1,1,#ATTENDANTS_MND #Demetrius_MND #Egeus_MND #Her...,3,#Theseus_MND
2859,l,,1.1.170,,ftln-0170,,,,I,,,1,1,#Hermia_MND #Lysander_MND,4,#Lysander_MND
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32212,l,,5.1.86,,ftln-1864,,,,I,,,5,1,#ATTENDANTS_MND #Demetrius_MND #Helena_MND #He...,89,#Philostrate_MND
32222,l,,5.1.87,,ftln-1865,,,,F,,,5,1,#ATTENDANTS_MND #Demetrius_MND #Helena_MND #He...,89,#Theseus_MND
34016,l,,5.1.189,,ftln-1967,,,,I,,,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #He...,95,#Bottom_MND
34346,l,,5.1.206,,ftln-1984,,,,I,,,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,#Bottom_MND


In [18]:
raw_df.query("not rendition.isna()")

Unnamed: 0,tag,type,n,text,xml:id,who,lemma,ana,part,rendition,prev,act,scene,onstage,stagegroup_raw,speaker
5960,stage,business,SD 1.2.95,,stg-0352,{'#Quince_MND'},,,,inline,,1,2,#Bottom_MND #Flute_MND #Quince_MND #Snout_MND ...,9,#Quince_MND
11090,stage,business,SD 2.2.8.1,,stg-0648.1,{'#FAIRIES.TITANIA_MND'},,,,centered,,2,2,#FAIRIES.TITANIA_MND #Titania_MND,20,
15284,l,,3.1.83,,ftln-0886,,,,,align{w0142380},,3,1,#Bottom_MND #Flute_MND #Quince_MND #RobinGoodf...,36,#Bottom_MND
16138,stage,delivery,SD 3.1.127,,stg-0930,,,,,indentProse,,3,1,#Bottom_MND,45,#Bottom_MND
25880,l,,3.2.490,,ftln-1501,,,,,indent,,3,2,#Demetrius_MND #Helena_MND #Hermia_MND #Lysand...,68,#RobinGoodfellow_MND
25889,l,,3.2.491,,ftln-1502,,,,,indent,,3,2,#Demetrius_MND #Helena_MND #Hermia_MND #Lysand...,68,#RobinGoodfellow_MND
26236,stage,exit,SD 4.1.17,,stg-1521,{'#FAIRIES.TITANIA.Cobweb_MND'},,,,inline,,4,1,#Bottom_MND #FAIRIES.TITANIA.Mote_MND #FAIRIES...,72,#Bottom_MND


In [19]:
raw_df.query("not prev.isna()")

Unnamed: 0,tag,type,n,text,xml:id,who,lemma,ana,part,rendition,prev,act,scene,onstage,stagegroup_raw,speaker
34356,lg,quatrain,,,stz-1985,,,,,,stz-1984,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,
34383,lg,quatrain,,,stz-1986,,,,,,stz-1985,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,
34429,lg,couplet,,,stz-1988,,,,,,stz-1987,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,
34480,lg,couplet,,,stz-1990,,,,,,stz-1989,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,
34533,lg,couplet,,,stz-1992,,,,,,stz-1991,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,
34588,lg,couplet,,,stz-1994,,,,,,stz-1993,5,1,#ATTENDANTS_MND #Bottom_MND #Demetrius_MND #Fl...,96,


This allows us to make the _informed_ choice to ignore the `part`, `rendition`, and `prev` attributes going forward.
<br/>
We do the same for `lemma` and `ana`, which are of linguistic interest but out of scope for our current graph representations.
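
A quick way to see how rare each attribute is, without inspecting the columns one by one, is to count the non-null values per column. This is a minimal sketch on toy data; the rows are made up and only imitate the shape of `raw_df`.

```python
import pandas as pd

# Toy stand-in for raw_df; the values are illustrative only.
raw_df = pd.DataFrame(
    {
        "tag": ["l", "l", "stage", "lg", "w"],
        "part": ["I", "F", None, None, None],
        "rendition": [None, None, "inline", None, None],
        "prev": [None, None, None, "stz-1984", None],
    }
)

# Number of rows in which each column is populated, rarest first.
non_null = raw_df.notna().sum().sort_values()

# Distribution of values within one rare column.
part_values = raw_df["part"].value_counts()
```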

Consequently, the columns from `*.raw.csv` that are immediately taken from the raw XML data _and_ that we care about are:
- `tag`: used to identify elements of interest (e.g., we don't care about whitespace and punctuation, represented by `c` and `pc` tags)
- `type`: used to identify the start of acts, the start of scenes, and instances of stage directions we process (`entrance` and `exit`)
- `n`: used to identify groups of words that belong together (e.g., words spoken in the same line, or words contained in the same stage direction)
- `xml:id`: unique identifiers, always helpful where present (e.g., the `div` and `c` rows in the output above have none)
- `who`: used to identify characters entering, exiting, or speaking
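
The selection described in this list can be sketched as follows. The data is a toy stand-in for `raw_df` with made-up rows, and the constant names `COLUMNS_OF_INTEREST` and `IGNORED_TAGS` are introduced here for illustration only.

```python
import pandas as pd

COLUMNS_OF_INTEREST = ["tag", "type", "n", "xml:id", "who"]
IGNORED_TAGS = {"c", "pc"}  # whitespace and punctuation

# Toy stand-in for raw_df; the values are illustrative only.
raw_df = pd.DataFrame(
    {
        "tag": ["stage", "w", "c", "w", "pc"],
        "type": ["entrance", None, None, None, None],
        "n": ["SD 1.1.0", "SD 1.1.0", None, "SD 1.1.0", None],
        "text": [None, "Enter", None, "Theseus", ","],
        "xml:id": ["id-0", "id-1", None, "id-2", "id-3"],
        "who": ["{'#Theseus_MND'}", None, None, None, None],
        "lemma": [None, "enter", None, "Theseus", None],
    }
)

# Drop ignored tags and keep only the columns taken directly from the XML.
core_df = raw_df.loc[~raw_df["tag"].isin(IGNORED_TAGS), COLUMNS_OF_INTEREST]
```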

Using the data in these columns, we derive the following additional columns:
- `act`: the act in which the tag occurs
- `scene`: the scene in which the tag occurs
- `onstage`: who (we think) is on stage when the tag occurs (derived by walking through the play, managing state following the stage directions, and performing some tricks to deal with the incompleteness of stage directions)
- `stagegroup_raw`: a running count of the changes to `onstage` up to the point where the tag occurs
- `speaker`: who speaks when the tag occurs (derived from the `who` attribute of `sp` tags)
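
To illustrate the idea behind `onstage`, `stagegroup_raw`, and `speaker`, here is a deliberately simplified sketch of the state tracking. It is not the actual preprocessing code: it assumes complete stage directions, assumes that `who` is stored as the string form of a Python set (as the output above suggests), and omits the tricks needed for incomplete stage directions. The helpers `parse_who` and `track_stage` and the toy data are hypothetical.

```python
import ast

import pandas as pd

# Toy events; character ids are made up.
events = pd.DataFrame(
    {
        "tag": ["stage", "sp", "stage", "sp", "stage"],
        "type": ["entrance", None, "exit", None, "entrance"],
        "who": ["{'#A', '#B'}", "{'#A'}", "{'#B'}", "{'#A'}", "{'#C'}"],
    }
)


def parse_who(value):
    """Turn a string such as "{'#A', '#B'}" into a set; missing values become an empty set."""
    if pd.isna(value):
        return set()
    return set(ast.literal_eval(value))


def track_stage(df):
    """Walk through the rows in order, tracking who is on stage and who speaks."""
    onstage, n_changes, speaker = set(), 0, None
    rows = []
    for tag, kind, who in zip(df["tag"], df["type"], df["who"]):
        who = parse_who(who)
        if tag == "stage" and kind in {"entrance", "exit"}:
            updated = onstage | who if kind == "entrance" else onstage - who
            if updated != onstage:
                # Count only directions that actually change the stage.
                onstage, n_changes = updated, n_changes + 1
        elif tag == "sp":
            speaker = " ".join(sorted(who)) or None
        rows.append((" ".join(sorted(onstage)), n_changes, speaker))
    derived = pd.DataFrame(
        rows, columns=["onstage", "stagegroup_raw", "speaker"], index=df.index
    )
    return pd.concat([df, derived], axis=1)


tracked = track_stage(events)
```

In this sketch the speaker simply carries over until the next `sp` tag; the real data resets it in some places, as the outputs above show.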

## Aggregated tabular representation of the raw data (`*.agg.csv`)

In [8]:
agg_df = pd.read_csv(agg)
agg_df.head()

Unnamed: 0,act,scene,stagegroup,stagegroup_raw,setting,onstage,speaker,n_lines,n_tokens
0,1,1,1,1,1,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,#Theseus_MND,6,43
1,1,1,1,1,2,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,#Hippolyta_MND,5,35
2,1,1,1,1,3,#ATTENDANTS_MND #Hippolyta_MND #Philostrate_MN...,#Theseus_MND,5,30
3,1,1,2,2,4,#ATTENDANTS_MND #Hippolyta_MND #Theseus_MND,#Theseus_MND,4,29
4,1,1,3,3,5,#ATTENDANTS_MND #Demetrius_MND #Egeus_MND #Her...,#Egeus_MND,1,6
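
As a first look at the aggregated data, one can total the speech volume per speaker. This is a minimal sketch on toy rows mirroring the head shown above, assuming that `n_lines` and `n_tokens` count the lines and tokens of each row's speech; the `onstage` column is omitted for brevity.

```python
import pandas as pd

# Toy rows mirroring the head of the aggregated file shown above.
agg_df = pd.DataFrame(
    {
        "speaker": [
            "#Theseus_MND",
            "#Hippolyta_MND",
            "#Theseus_MND",
            "#Theseus_MND",
            "#Egeus_MND",
        ],
        "n_lines": [6, 5, 5, 4, 1],
        "n_tokens": [43, 35, 30, 29, 6],
    }
)

# Total lines and tokens per speaker, most talkative first.
per_speaker = (
    agg_df.groupby("speaker")[["n_lines", "n_tokens"]]
    .sum()
    .sort_values("n_tokens", ascending=False)
)
```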
