# Datanotes (please ignore)

speaker - directed hyperedge part 1 (source: who speaks)
onstage - directed hyperedge part 2 (target: who is on stage)
index - edge ids at highest reasonable granularity (I think)

granularities:
- act, scene
- act, scene, stagegroup (unweighted)
- act, scene, stagegroup (weighted by speech)
- act, scene, stagegroup, speaker
- act, scene, stagegroup, speaker, directed (speaker -> onstage)
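
A minimal sketch of how the granularities above could be derived from the same rows. It uses plain Python rather than pandas so it runs on its own. The toy rows and the helper names `edges_at` and `directed_edges` are hypothetical. The column names follow the ones used in the code further down. Treating the full `onstage` set as the target of a directed edge is an assumption.

```python
from collections import defaultdict

# Hypothetical toy rows mimicking the columns of the *agg.csv files
rows = [
    {"act": 1, "scene": 1, "stagegroup": 1, "speaker": "A", "onstage": "A B", "n_tokens": 10},
    {"act": 1, "scene": 1, "stagegroup": 1, "speaker": "B", "onstage": "A B", "n_tokens": 5},
    {"act": 1, "scene": 1, "stagegroup": 2, "speaker": "A", "onstage": "A B C", "n_tokens": 7},
    {"act": 1, "scene": 2, "stagegroup": 1, "speaker": "C", "onstage": "C D", "n_tokens": 3},
]

def edges_at(rows, keys):
    """Per key: union of onstage characters (the edge) and summed n_tokens (its weight)."""
    members = defaultdict(set)
    weights = defaultdict(int)
    for row in rows:
        key = tuple(row[k] for k in keys)
        members[key].update(row["onstage"].split())
        weights[key] += row["n_tokens"]
    return {key: (tuple(sorted(members[key])), weights[key]) for key in members}

def directed_edges(rows):
    """Finest granularity: one (speaker, onstage, n_tokens) triple per row.

    Assumption: the target is the full onstage set, speaker included.
    """
    return [(row["speaker"], tuple(sorted(row["onstage"].split())), row["n_tokens"])
            for row in rows]

act_scene = edges_at(rows, ["act", "scene"])
stagegroup = edges_at(rows, ["act", "scene", "stagegroup"])
by_speaker = edges_at(rows, ["act", "scene", "stagegroup", "speaker"])
directed = directed_edges(rows)
```

Dropping the weight from each value gives the unweighted variant. Keeping it gives the variant weighted by speech.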

what we are currently lacking:
- principled treatment of speech lacking `who` annotations (mostly, if not only, songs)
- principled treatment of asides and other communication modifiers
- principled treatment of unannotated deaths
- principled treatment of prologues, epilogues, and inductions

stage direction types:
- 'business': character actions, e.g., sleeping, waking, doing stuff (includes who)
- 'delivery': asides, characters speaking in others' voices, other modifications to who hears what, and how
- 'entrance': characters entering (includes who)
- 'exit': characters leaving (includes who)
- 'mixed': container for stage directions composed of multiple other types
- 'modifier': characters appearing to be other characters
- 'sound': music etc.

### NB: There are some inconsistencies and weirdnesses even in this data - these are my WIP notes on what I observed

- MND: redundant castGroup wrapper for first castGroup
- MND: weird encoding of stage directions with multiple types (e.g., business and sound)
- MND: nested encoding of mixed stage directions

# Generating hypergraph representations

In [None]:
from glob import glob
import pandas as pd

In [None]:
from hyperbard.preprocessing import get_filename_base

In [None]:
import hypernetx as hnx
import random
random.seed(1234)

In [None]:
import matplotlib.pyplot as plt
from itertools import product

In [None]:
from matplotlib import cm

In [None]:
datapath = "../data"
files = sorted(glob(f"{datapath}/*agg.csv"))

### Different hypergraph representations

In [None]:
for file in files:
    file_short = get_filename_base(file).split("_")[0]
    print(file_short)

    df = pd.read_csv(file)
    base_name = df.at[0,"speaker"].split("_")[-1]
    
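    # basic hypergraph: one edge per (act, scene), containing every character on stage in that scene
    edges = []
    for _, group in df.groupby(["act", "scene", "onstage"]).agg(dict(n_tokens="sum")).reset_index().groupby(["act","scene"]):
        # onstage is a space-separated string of character ids; take the union over the scene
        joined_group = tuple(sorted(set(" ".join(group.onstage).split())))
        edges.append(joined_group)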
    H = hnx.Hypergraph(edges)
    hdf = H.dataframe()
    hdf.columns = hdf.columns.map(int)
    hdf[sorted(hdf.columns)].to_csv(f"../hypergraphs/{file_short}_act-scene.csv")
    
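    # draw
    # seed the layout for reproducibility; must be defined before the first hnx.draw call
    layout_kwargs = {'layout_kwargs': {'seed': 1234}}
    hnx.draw(H,
             node_labels={n.uid:n.uid.split("_")[0][1:] for n in H.nodes()},
             # node radius grows with the number of tokens spoken by the character
             node_radius=dict((df.groupby(["speaker"]).agg({"n_tokens":"sum"}) / 2000 + 1).n_tokens.items()),
             with_edge_labels=False, edges_kwargs=dict(edgecolors=[cm.viridis_r(x/len(H)) for x in range(len(H))]),
             **layout_kwargs
            )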
    plt.suptitle(file_short, y=1.0)
    plt.tight_layout()
    plt.savefig(f"../graphics/{file_short}_act-scene.pdf")
    plt.close()
    
    # more fine-grained hypergraphs: one hypergraph per (act, scene), one edge per stagegroup
    df_grouped = df.groupby(["stagegroup", "onstage", "act", "scene"]).agg({"n_tokens":"sum"}).reset_index()
    df_grouped.onstage = df_grouped.onstage.map(lambda x: tuple(x.split()))
    
    Hs = {(act,scene): hnx.Hypergraph(dict(df_grouped.query("act == @act and scene == @scene").onstage))
          for (act,scene) in set(zip(df.act, df.scene))}

    layout_kwargs = {'layout_kwargs': {'seed': 1234}}
    n_acts = len(set(tup[0] for tup in Hs))
    n_scenes = len(set(tup[1] for tup in Hs))
    # squeeze=False keeps ax two-dimensional even if there is only one act or one scene
    fig, ax = plt.subplots(n_scenes, n_acts, figsize=(n_acts*5,n_scenes*5), squeeze=False)
    for act, scene in sorted(Hs):
        tax = ax[scene-1][act-1]
        tax.set_title(f"Act {act}, Scene {scene}")
        H_scene = Hs[(act,scene)]
        hnx.draw(H_scene, ax=tax,
                 node_labels={n.uid:n.uid.split("_")[0][1:] for n in H_scene.nodes()},
                 node_radius=dict((df.query("act == @act and scene == @scene").groupby(["speaker"]).agg({"n_tokens":"sum"}) / 250 + 1).n_tokens.items()),
                 # colour edges by their position within this scene's hypergraph, not the act-scene hypergraph H
                 with_edge_labels=False, edges_kwargs=dict(edgecolors=[cm.viridis_r(x/len(H_scene)) for x in range(len(H_scene))]), **layout_kwargs)
    for x,y in product(list(range(n_scenes)), list(range(n_acts))):
        ax[x][y].axis('off')
    plt.suptitle(file_short, y=1.0)
    plt.tight_layout()
    plt.savefig(f"../graphics/{file_short}_act-scene-stagegroup.pdf")
    plt.close()