(chapter-1-3)=
# 1.3 Using DraCor: Four Showcases

In this section we illustrate possible uses of DraCor as a prototype of a *Programmable Corpus Environment* by showcasing four examples which are included in this Jupyter Notebook[^source_of_notebook]:

[^source_of_notebook]: This Jupyter Notebook is based on CLS INFRA Deliverable D7.1 {cite:p}`boerner_2023_report`. The showcases have been adapted to the latest major DraCor API version 1.0 and use Python whenever possible.

* first, a basic one-click download of modeled text data;
* second, an approach to geo-based visualization of corpus metadata using Linked Open Data;
* third, an API-based approach to standardized extraction of specific textual data across different corpora; and
* fourth, a method-based approach based on Social Network Analysis metrics.

(section-1-3-1)=
## 1.3.1 Showcase 1: One-Click Download of Modeled Text Data

When considering an ecosystem for Computational Literary Studies, one usually thinks of
applications that operate at a relatively sophisticated technical or computational level. But
actually, it is central to the development of the discipline and its infrastructures that it remains
accessible even to novices and beginners. It is also for this reason that an essential component
of DraCor is a user-friendly front-end that, on the one hand, provides a set of
services with an easily accessible graphical user interface (GUI) and, on the other hand, allows
access to the corpora data with as little technical expertise as possible. The frontend with its
graphical user interface thus also assumes didactic functions: Texts can be easily navigated
through various tabs and viewed in different shapes and modes of modeling.

For example, any play from the corpora contained in DraCor can be displayed in a text
view that does not differ significantly from classic ways of displaying texts in e-readers or web
browsers (see {numref}`fulltext-view`). While texts appear in such a full-text view as conventional epistemic
objects of literary studies (ready for close reading), after a tab change (see {numref}`download-tab`), differently
modeled derivations of these full texts can be downloaded with one click for “distant
reading” (Moretti 2013).

% Figure is rendered in the HTML output here

```{figure} ./images/fulltext-view.png
---
width: 600px
name: fulltext-view
---
Full text view of a DraCor play in the front-end
```

% Figure is rendered in the HTML output here

```{figure} ./images/download-tab.png
---
width: 600px
name: download-tab
---
Download options for a DraCor play in the front-end
```

This allows DraCor to introduce the different epistemic and technical manifestations of text in CLS in an easily accessible way. Furthermore, it is possible to work with these modeled text data immediately, which allows a quick introduction to the methods and tools of CLS.

For this showcase, we use the data of a co-occurrence network (i.e. a network of characters connected via their co-presence on the stage), downloading it in the XML-based GEXF format, which can be opened with open source programming libraries such as [networkx](https://networkx.org) {cite:p}`hagberg_2008_networkx` or the widely used open source desktop software [Gephi](https://gephi.org) {cite:p}`bastian_2009_gephi`. With a few clicks after the download, it is thus possible to create a network graph (see {numref}`gephi-oneclick-networkgraph`) that allows the literary text to be viewed in an entirely different modeling mode, predestined for distant reading {cite:p}`moretti_2011_network`.

% Figure is rendered in the HTML output here

```{figure} ./images/gephi-oneclick-networkgraph.png
---
width: 600px
name: gephi-oneclick-networkgraph
---
Network visualization with Gephi, based on a DraCor one-click download
```
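
As a programmatic counterpart to the Gephi workflow, the following minimal sketch shows how a GEXF file can be read with networkx. So that it runs without a download, it first writes a small, hypothetical co-occurrence network (the character names and weights are invented) to an in-memory GEXF file. With a real DraCor download you would instead pass the path of the downloaded file to `nx.read_gexf`.

```python
import io

import networkx as nx

# Hypothetical toy network standing in for a GEXF file downloaded from DraCor.
toy = nx.Graph()
toy.add_edge("char_a", "char_b", weight=3)
toy.add_edge("char_b", "char_c", weight=1)
toy.add_edge("char_a", "char_c", weight=2)
toy.add_edge("char_c", "char_d", weight=1)

# Serialize to GEXF in memory; a real download would already be a .gexf file on disk.
buffer = io.BytesIO()
nx.write_gexf(toy, buffer)
buffer.seek(0)

# Read the GEXF data back, as one would do with the downloaded file.
graph = nx.read_gexf(buffer)

print(graph.number_of_nodes(), "characters,", graph.number_of_edges(), "co-occurrence edges")

# List characters by the number of other characters they share the stage with.
for name, degree in sorted(graph.degree(), key=lambda pair: pair[1], reverse=True):
    print(name, degree)
```

The degree used here is only one of many measures that can be computed once the graph is loaded.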

In Session 5 on Wednesday we will demonstrate how to do network analysis with DraCor in Gephi.

(section-1-3-2)=
## 1.3.2 Showcase 2: Geo-Mapping Locations of First Performances

DraCor's beginner-friendly front-end is, however, just one way to access and use the data in this
Programmable Corpus. More commonly, computational literary scholars will access the data
either directly in the form of the TEI-XML[^tei] or via the various APIs and API endpoints. In the
following showcase, the focus is on DraCor’s SPARQL[^sparql] endpoint.

[^tei]: TEI-XML is a type of the XML format that complies with the standards defined by the “Text Encoding Initiative (TEI)”, cf. [https://tei-c.org](https://tei-c.org)

[^sparql]: [SPARQL Specification](https://www.w3.org/TR/sparql11-query). A tutorial on how to use SPARQL to query Wikidata can be accessed at [https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial)

During the homogenization of metadata that theater plays undergo as part of the
integration into the DraCor environment, [Wikidata identifiers](https://www.wikidata.org/wiki/Wikidata:Identifiers) (entity IDs) for both authors and
individual works are typically included in the metadata of each play encoded in the `<teiHeader>`.
For example, for the German-language bourgeois tragedy “Emilia Galotti” by Gotthold Ephraim
Lessing this data is available in DraCor (`<idno @type="wikidata">`). First for the author:[^author_wikidata_encoding_example]


[^author_wikidata_encoding_example]: [Source of the cited code snippet](https://github.com/dracor-org/gerdracor/blob/3dc874101e2d10d687510aeb5ff8a907331843c1/tei/lessing-emilia-galotti.xml#L10-L18)


```
<author>
    <persName>
        <forename>Gotthold</forename>
        <forename>Ephraim</forename>
        <surname>Lessing</surname>
    </persName>
    <idno type="wikidata">Q34628</idno>
    <idno type="pnd">118572121</idno>
</author>
```

Then for the individual work:[^link_to_wikidata_example]

```
<listRelation>
    <relation name="wikidata" active="https://dracor.org/entity/ger000088"
        passive="http://www.wikidata.org/entity/Q782653"/>
</listRelation>
```

[^link_to_wikidata_example]: [Source of the cited code snippet](https://github.com/dracor-org/gerdracor/blob/3dc874101e2d10d687510aeb5ff8a907331843c1/tei/lessing-emilia-galotti.xml#L122-L124)

Thanks to this metadata, the plays in DraCor can be linked to further information from, among others, [Wikidata](https://www.wikidata.org)[^dracor_wikidata_property]. They are thus embedded in the wider ecosystem of Linked Open Data and benefit from the often crowd-based data enrichment projects on the World Wide Web. For example, numerous Wikidata entries on plays contain information about the “location of first performance”.[^location_of_first_performance_property] In the case of Lessing's "Emilia Galotti", this location is the "Hagenmarkt-Theater", which also has a [Wikidata entry](https://www.wikidata.org/wiki/Q1270860). This entry in turn records a "coordinate location",[^coordinates_property] which provides the corresponding geodata (52°16'1.9" N, 10°31'28.9" E). This embedding of DraCor plays in the Linked Open Data Cloud makes it possible, for example, to run SPARQL queries across entire corpora.
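
The Wikidata query service returns such coordinates as WKT literals of the form `Point(longitude latitude)`, with the longitude first. As a small worked example, the following sketch converts the degrees, minutes and seconds quoted above into decimal degrees and assembles such a literal. The helper name `dms_to_decimal` is our own and not part of any library.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are expressed as negative values.
    return -value if hemisphere in ("S", "W") else value


# Coordinates of the Hagenmarkt-Theater as quoted in the text.
latitude = dms_to_decimal(52, 16, 1.9, "N")
longitude = dms_to_decimal(10, 31, 28.9, "E")

# WKT points list the longitude (x) before the latitude (y).
wkt_point = f"Point({longitude:.6f} {latitude:.6f})"
print(wkt_point)
```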

[^dracor_wikidata_property]: On Wikidata, DraCor IDs are recorded with the DraCor property `P12233`, cf. [https://www.wikidata.org/wiki/Property:P12233](https://www.wikidata.org/wiki/Property:P12233). The DraCor API provides a ["Mix'n'match" endpoint](https://dracor.org/doc/api#/wikidata/wikidata-mixnmatch) to allow the Wikidata system to harvest DraCor identifiers, cf. [https://meta.wikimedia.org/wiki/Mix'n'match/Import](https://meta.wikimedia.org/wiki/Mix'n'match/Import). If you want to learn more about DraCor and Wikidata, ask about it in the "Open Topics" session on Wednesday.

[^location_of_first_performance_property]: Wikidata Property `P4647`, see [https://www.wikidata.org/wiki/Property:P4647](https://www.wikidata.org/wiki/Property:P4647) for more information.

[^coordinates_property]: Wikidata Property `P625`, see [https://www.wikidata.org/wiki/Property:P625](https://www.wikidata.org/wiki/Property:P625) for more information.

In the following we will use the SPARQL endpoint of the staging DraCor System at [https://staging.dracor.org/sparql](https://staging.dracor.org/sparql)[^sparql_intro]. We will send a SPARQL query using the Python package `SPARQLWrapper` which provides a convenient way to combine Python and SPARQL. The query we are going to use will retrieve plays contained in the German Drama Corpus (GerDraCor) that are linked to a Wikidata entity. For each such play it will query Wikidata and retrieve the location of the first performance (in most cases a theatre) and its coordinates. We then simplify the returned data structure and turn it into a [GeoPandas](https://geopandas.org) dataframe which will be used to display a map.

[^sparql_intro]: For a (slightly outdated) introduction to SPARQL and LOD with DraCor see [this notebook](https://github.com/dracor-org/dracor-notebooks/blob/lod-intro/lod-intro/lod-intro.ipynb). The RDF serialization of the DraCor data is currently under development. Meanwhile, the SPARQL interface of the production instance of DraCor at [dracor.org](https://dracor.org) is deactivated. For testing purposes the staging instance of DraCor can be used. The data is modeled according to an old version of the [DraCor Ontology](https://vowl.acdh.oeaw.ac.at/#iri=https://raw.githubusercontent.com/dracor-org/dracor-schema/ontology/ontology/dracor-ontology.xml).

```python
# Import packages (these are pre-installed in the Docker container)

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import geopandas as gpd
from shapely import wkt
```

After importing the needed Python packages we assign the federated SPARQL query to a variable:

```python
# Federated queries (i.e. queries that combine information from multiple SPARQL endpoints) tend to be slow.
# The query below frequently times out if the LIMIT clause is set high enough to cover the whole corpus
# (GerDraCor contains roughly 700 plays). You can increase the limit, but you may have to run the cell
# several times until it succeeds.

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX urn: <http://fliqz.com/>
PREFIX dracon:<http://dracor.org/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT * FROM <urn:x-arq:UnionGraph> WHERE {
  ?play dracon:in_corpus <https://dracor.org/ger> ;
        owl:sameAs ?wd .
  
  SERVICE <https://query.wikidata.org/sparql> {
  ?wd wdt:P4647 ?location ;
      rdfs:label ?playLabel .
  
  ?location wdt:P625 ?coords ;
         rdfs:label ?locationLabel .

    FILTER (lang(?locationLabel) = "en")
    FILTER (lang(?playLabel) = "de")
  }
}
LIMIT 200
"""
```

You can also test the [SPARQL query](https://staging.dracor.org/sparql#query=PREFIX%20owl%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0APREFIX%20urn%3A%20%3Chttp%3A%2F%2Ffliqz.com%2F%3E%0APREFIX%20dracon%3A%3Chttp%3A%2F%2Fdracor.org%2Fontology%23%3E%0APREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20*%20FROM%20%3Curn%3Ax-arq%3AUnionGraph%3E%20WHERE%20%7B%0A%20%20%3Fplay%20dracon%3Ain_corpus%20%3Chttps%3A%2F%2Fdracor.org%2Fger%3E%20%3B%0A%20%20%20%20%20%20%20%20owl%3AsameAs%20%3Fwd%20.%0A%20%20%0A%20%20SERVICE%20%3Chttps%3A%2F%2Fquery.wikidata.org%2Fsparql%3E%20%7B%0A%20%20%3Fwd%20wdt%3AP4647%20%3Flocation%20%3B%0A%20%20%20%20%20%20rdfs%3Alabel%20%3FplayLabel%20.%0A%20%20%0A%20%20%3Flocation%20wdt%3AP625%20%3Fcoords%20%3B%0A%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlocationLabel%20.%0A%0A%20%20%20%20FILTER%20(lang(%3FlocationLabel)%20%3D%20%22en%22)%0A%20%20%20%20FILTER%20(lang(%3FplayLabel)%20%3D%20%22de%22)%0A%20%20%7D%0A%7D%0ALIMIT%20800&endpoint=https%3A%2F%2Fstaging.dracor.org%2Ffuseki%2Fsparql&requestMethod=POST&tabTitle=Location%20of%20first%20performances%20(GerDraCor)&headers=%7B%7D&contentTypeConstruct=application%2Fn-triples%2C*%2F*%3Bq%3D0.9&contentTypeSelect=application%2Fsparql-results%2Bjson%2C*%2F*%3Bq%3D0.9&outputFormat=table) in the [YASGUI](https://yasgui.triply.cc)-based DraCor SPARQL interface on the staging server. The following cell contains the Python code that sends the query:

```python
%%time

# Send the SPARQL query to the staging.dracor.org server.
# It returns the SPARQL results (including the "bindings") as JSON, which is then converted to a Python-native data structure

sparql = SPARQLWrapper("https://staging.dracor.org/fuseki/sparql")
sparql.setReturnFormat(JSON)
sparql.addExtraURITag("timeout","120000")
sparql.setQuery(query)
results = sparql.queryAndConvert()
```

The results are returned in the [SPARQL Query Results JSON Format](https://www.w3.org/TR/sparql12-results-json) and are simplified in the following cell. The simplified data is then turned into a GeoPandas dataframe, which can be explored as an interactive map.

```python
# Create a Pandas Dataframe/Geopandas Dataframe and explore the map
simple_results = []
for binding in results["results"]["bindings"]:
    item = {}
    for key in binding.keys():
        item[key] = binding[key]["value"]
    simple_results.append(item)
df = pd.DataFrame(simple_results)
df['geometry'] = gpd.GeoSeries.from_wkt(df['coords'])
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
gdf.explore()
```
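
To see what the simplification step does without contacting the endpoint, here is a minimal sketch that applies the same flattening to a hand-written response. The sample is an assumption for illustration: it follows the general shape of the SPARQL Query Results JSON Format and reuses the identifiers quoted earlier in this section, with the coordinates rounded from the values given there, but it is not actual output of the staging server.

```python
# Hypothetical single-row response, shaped like SPARQL Query Results JSON.
sample_results = {
    "head": {"vars": ["play", "wd", "location", "coords"]},
    "results": {
        "bindings": [
            {
                "play": {"type": "uri", "value": "https://dracor.org/entity/ger000088"},
                "wd": {"type": "uri", "value": "http://www.wikidata.org/entity/Q782653"},
                "location": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1270860"},
                "coords": {
                    "type": "literal",
                    "datatype": "http://www.opengis.net/ont/geosparql#wktLiteral",
                    "value": "Point(10.524694 52.267194)",
                },
            }
        ]
    },
}


def flatten_bindings(results):
    """Keep only the plain value of each bound variable, one dict per result row."""
    return [
        {variable: cell["value"] for variable, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


rows = flatten_bindings(sample_results)
print(rows[0]["coords"])
```

Each flattened row is an ordinary dictionary, which is why the list can be handed directly to `pd.DataFrame`.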

(section-1-3-3)=
## 1.3.3 Showcase 3: Extracting Stage Directions for NLP

While the previous showcase uses the SPARQL endpoint of our Programmable Corpora prototype DraCor, the following showcase uses the custom-developed DraCor API. In addition to the TEI-encoded plays, DraCor provides, among other things, various XQuery-based extractor functions, which make it possible to retrieve specific and standardized text segments via the DraCor API and to use them as input for, for example, Natural Language Processing (NLP) pipelines. In this way, the TEI-based structure of the data in DraCor can be used to address specific research questions, as in our next showcase.

Again, the homogenization of the drama corpora in DraCor serves as a starting point for our showcase. During this homogenization, all plays are systematically structured in such a way that the speaker's text can be consistently differentiated from the stage directions. For this purpose, the corresponding TEI elements are used, where `<stage>`[^tei-stage-element] distinctly tags the text of the stage directions.

[^tei-stage-element]: Cf. [Documentation of the element `<stage>`](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-stage.html).

The following TEI snippet from Lessing’s “Emilia Galotti” exemplifies the data structure.

```
<sp who="#appiani">
    <speaker>APPIANI</speaker>
    <stage>tritt tiefsinnig, mit vor sich hingeschlagnen Augen herein, und
    kömmt ihnen näher, ohne sie zu erblicken; bis Emilia ihm entgegen
    springt.</stage>
    <p>Ah, meine Teuerste! – Ich war mir Sie in dem Vorzimmer nicht vermutend.</p>
</sp>
```

Via the DraCor API it is now possible to get all stage directions of a play with the corresponding
request URL: [https://dracor.org/api/v1/corpora/ger/plays/lessing-emilia-galotti/stage-directions](https://dracor.org/api/v1/corpora/ger/plays/lessing-emilia-galotti/stage-directions). 

In Session 5 on Tuesday we will explain how to use the DraCor API in Python. For demonstration purposes the following cell uses a "shortcut" function[^api_get] to request the data from the API and assign it to a variable. We then print an excerpt (some 500 characters) of the returned text data.

[^api_get]: There are several ways to retrieve data from the web in Python. A generic way using the Python package `requests` will be explained in Session 5 on Tuesday. The function `api_get` is imported from the `stabledracor` package, which we will use in Session 6 on Wednesday. And, of course, there is the Python package [pydracor](https://github.com/dracor-org/pydracor), which provides even more convenient access to the DraCor API.

```python
from stabledracor.client import api_get

emilia_galotti_stage_directions = api_get(
    corpusname="ger", 
    playname="lessing-emilia-galotti", 
    method="stage-directions", parse_json=False)

print(f"{emilia_galotti_stage_directions[0:500]} ...")
```

```
Die Szene, ein Kabinett des Prinzen.
an einem Arbeitstische, voller Briefschaften und Papiere, deren einige er durchläuft.
Indem er noch eine von den Bittschriften aufschlägt, und nach dem unterschriebnen Namen sieht.
Er lieset.
Er unterschreibt und klingelt; worauf ein Kammerdiener hereintritt.
Der Kammerdiener geht ab.
welcher wieder herein tritt.
Der Kammerdiener geht ab.
Bitter, indem er den Brief in die Hand nimmt.
Und ihn wieder wegwirft.
der nochmals herein tritt.
Steht auf.
Conti. Der Pr ...
```


At the same time, it is possible to retrieve all the spoken texts of the plays via another endpoint. For
Lessing’s “Emilia Galotti”, the corresponding request URL would be:
https://dracor.org/api/v1/corpora/ger/plays/lessing-emilia-galotti/spoken-text.
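
Both request URLs follow the same pattern, which the following sketch makes explicit using only the Python standard library. The helper names `play_endpoint_url` and `fetch_text` are our own and not part of any DraCor package, and the sketch assumes that the endpoints answer with plain text, as in the output shown above. The download calls are commented out so that the cell also runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://dracor.org/api/v1"


def play_endpoint_url(corpus, play, endpoint):
    """Build the request URL for a play-level endpoint such as 'spoken-text'."""
    return f"{API_BASE}/corpora/{quote(corpus)}/plays/{quote(play)}/{quote(endpoint)}"


def fetch_text(url, timeout=30):
    """Download a plain-text API response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


stage_url = play_endpoint_url("ger", "lessing-emilia-galotti", "stage-directions")
spoken_url = play_endpoint_url("ger", "lessing-emilia-galotti", "spoken-text")
print(stage_url)
print(spoken_url)

# Uncomment to download the texts:
# stage_directions = fetch_text(stage_url)
# spoken_text = fetch_text(spoken_url)
```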

The obtained text can now be further processed in various ways. In an early showcase, {cite:ts}`trilcke_2020_opening` performed sentence splitting on the text data for all plays in the German-language drama corpus GerDraCor[^gerdracor_used_for_stage_direction_paper] and then compared the average sentence lengths of the stage directions with those of the speaker text. The result showed that sentences in the speaker text were longer on average, but that at the same time the sentence lengths gradually converge over time (see {numref}`sentence-length-datagraph`), a development that, as the authors have suggested, can be explained in the context of research debates about the epification of drama in the 19th century (cf. {cite:p}`trilcke_2020_opening`).

[^gerdracor_used_for_stage_direction_paper]: Regarding the exact data used for the study, the authors report on their corpus: "Of the 474 plays available in GerDraCor, we removed librettos and 3 plays without SD, which yields a corpus of 384 plays that are pre-processed using the DramaNLP package." {cite:p}`trilcke_2020_opening` The tool [DramaNLP](https://github.com/quadrama/DramaNLP) was developed in the context of the [QuaDramA project](https://quadrama.github.io), as was the R package [DramaAnalysis](https://github.com/quadrama/DramaAnalysis), which was used for the analysis of the data.

% Figure is rendered in the HTML output here

```{figure} ./images/sentence-length-datagraph.png
---
width: 600px
name: sentence-length-datagraph
---
Mean Sentence Length in Stage Directions and Spoken Text in GerDraCor plays visualized with Datagraph (cf. {cite:p}`trilcke_2020_opening`)
```

In the following we use the NLP package [spaCy](https://spacy.io) to demonstrate how the API response containing the text of the stage directions can be split into sentence tokens with the [Sentencizer](https://spacy.io/api/sentencizer) component:

```python
from spacy.lang.de import German

nlp = German()
nlp.add_pipe("sentencizer")

# Pass the downloaded stage directions to the NLP pipeline to perform the sentence splitting
doc = nlp(emilia_galotti_stage_directions)
```

We can list the first ten sentences to get a feeling for whether the splitting is acceptable[^hint_on_newline]:

[^hint_on_newline]: Some pre-processing of the API result might be necessary to achieve better results, e.g. there are some newline characters `\n` that could be removed depending on how they are handled in the further processing.

```python
list(doc.sents)[:10]
```

```
[Die Szene, ein Kabinett des Prinzen.,
 
 an einem Arbeitstische, voller Briefschaften und Papiere, deren einige er durchläuft.,
 
 Indem er noch eine von den Bittschriften aufschlägt, und nach dem unterschriebnen Namen sieht.,
 
 Er lieset.,
 
 Er unterschreibt und klingelt; worauf ein Kammerdiener hereintritt.,
 
 Der Kammerdiener geht ab.,
 
 welcher wieder herein tritt.,
 
 Der Kammerdiener geht ab.,
 
 Bitter, indem er den Brief in die Hand nimmt.,
 
 Und ihn wieder wegwirft.]
```

In the above-mentioned study, sentence length is measured in tokens. If we wanted to perform a rock-solid repetition of the study[^repeading_research_outlook], we would need to look into how the tokens were created, but this goes beyond this short tutorial. In our quick spaCy re-implementation, the pipeline has already performed a tokenization. If we want to inspect the tokens of the first sentence, we can output them as shown in the next cell:

[^repeading_research_outlook]: On Wednesday, in his lightning talk, Christof Schöch will go into "Repeating Research" in more detail and introduce his conceptual framework.

```python
for token in list(doc.sents)[0]:
    print(token)

print("\n\n---")
print(f"The first sentence consists of {len(list(doc.sents)[0])} tokens.")
```

```
Die
Szene
,
ein
Kabinett
des
Prinzen
.


---
The first sentence consists of 8 tokens.
```


```python
sentences = []
token_count = []
for sentence in doc.sents:
    # Collect the sentence text and its length in tokens
    sentences.append(sentence.text)
    token_count.append(len(sentence))

sentences
```

```
['Die Szene, ein Kabinett des Prinzen.',
 '\nan einem Arbeitstische, voller Briefschaften und Papiere, deren einige er durchläuft.',
 '\nIndem er noch eine von den Bittschriften aufschlägt, und nach dem unterschriebnen Namen sieht.',
 '\nEr lieset.',
 '\nEr unterschreibt und klingelt; worauf ein Kammerdiener hereintritt.',
 '\nDer Kammerdiener geht ab.',
 '\nwelcher wieder herein tritt.',
 '\nDer Kammerdiener geht ab.',
 '\nBitter, indem er den Brief in die Hand nimmt.',
 '\nUnd ihn wieder wegwirft.',
 '\nder nochmals herein tritt.',
 '\nSteht auf.',
 '\nConti.',
 'Der Prinz.',
 '\nDer Prinz.',
 'Conti, mit den Gemälden, wovon er das eine verwandt gegen einen Stuhl lehnet.',
 '\nindem er das andere zurecht stellet.',
 '\nnach einer kurzen Betrachtung.',
 '\netwas ärgerlich.',
 '\nindem er es holt, und noch verkehrt in der Hand hält.',
 '\nMit dem Finger auf die Stirne.',
 '\nMit dem Finger auf das Herz.',
 '\nIndem der Maler das Bild umwendet.',
 '\nindem er sich zu fassen sucht, aber 
```
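
As a rough illustration of how a mean sentence length in tokens could be derived from such a pipeline, the following sketch processes three of the stage directions shown above. It is only an approximation of the measure used in the study: here every non-whitespace spaCy token counts (punctuation included), and the newline tokens that separate the stage directions are skipped. How the original study defined its tokens is not covered by this tutorial.

```python
from statistics import mean

from spacy.lang.de import German

nlp = German()
nlp.add_pipe("sentencizer")

# Three stage directions quoted from the API output above, joined by newlines.
sample = (
    "Die Szene, ein Kabinett des Prinzen.\n"
    "Er lieset.\n"
    "Und ihn wieder wegwirft."
)
doc = nlp(sample)

# Count tokens per sentence, skipping whitespace tokens such as the "\n" separators.
token_counts = [
    sum(1 for token in sentence if not token.is_space)
    for sentence in doc.sents
]

print(token_counts)
print(f"Mean sentence length: {mean(token_counts):.2f} tokens")
```

Applied to the stage directions and the spoken text of each play, the same calculation would yield the two values per play that a comparison like the one in the figure above needs.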

(section-1-3-4)=
## 1.3.4 Showcase 4: Plotting Network Measures for Thousands of Plays

Connected to the DraCor corpora are various microservices that, following the principle of
“method as a microservice”, apply specific methods of CLS to the text data in the drama corpora.
The outputs of these microservices are made available, in the form of metrics, via the DraCor
API. Some of these microservice-based and thus research-driven API functions rely on methods
from Social Network Analysis. Again based on the homogenized TEI structure of the plays in
DraCor, in particular the semi-automated speaker identification, a dedicated microservice first
automatically constructs network graphs, to which a number of algorithms from Network
Analysis are then applied.

The technical capability to retrieve metrics for the texts from several different drama
corpora via one single API enables the use of standardized analyses for comparative literary
studies, as Trilcke et al. (in press) have shown applying the concept of “Small World” to almost
3,000 dramas of European literature.

Analyzing plays with reference to the “Small World” concept requires the calculation of the
network metric of “Average Path Length”. This metric can, as outlined, be retrieved via the
DraCor API. For our final showcase, we pull from the DraCor API a file of aggregate metrics,
including “Average Path Length”. For example, the request URL for the German-language corpus
GerDraCor in this case is: https://dracor.org/api/v1/corpora/ger/metadata/csv
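
To make the metric more tangible, the following minimal sketch builds a co-presence network from a hypothetical scene segmentation and computes its average path length with networkx. The scenes and character labels are invented, and the sketch is not the implementation used by the DraCor microservice. It only illustrates the general idea that characters who speak in the same scene are connected and that the metric averages the shortest path lengths over all pairs of characters.

```python
from itertools import combinations

import networkx as nx

# Hypothetical segmentation: each inner list names the characters speaking in one scene.
scenes = [
    ["A", "B"],
    ["B", "C", "D"],
    ["C", "E"],
]

graph = nx.Graph()
for speakers in scenes:
    # Connect every pair of characters that share a scene.
    graph.add_edges_from(combinations(speakers, 2))

# Mean number of edges on the shortest path between all pairs of characters.
# This requires a connected graph; networkx raises an error otherwise.
average_path_length = nx.average_shortest_path_length(graph)
print(round(average_path_length, 2))
```

In this toy network the ten pairs of characters are separated by a total of sixteen steps, which gives an average of 1.6.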

After collecting the corresponding data for all DraCor corpora via the API, we filter the data
based on the metadata for plays published between 1500 and 1900. In a final step, we plot the
“Average Path Length” for the remaining 2,622 plays as a chart (Fig. 06). With just a
few clicks, this takes a decisive step towards a fully-fledged “distant reading” study. At the
same time, it becomes clear what is still missing (and what an elaborated study would have to
provide): an interpretation of the data that works out the meaning of such plots.

Fig. 06: Average Path Length for 2,622 plays in DraCor visualized with Datagraph
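
The filtering step described above can be sketched with pandas. The example works on a small inline table instead of the downloaded files, so all rows are invented, and the column names `name`, `yearNormalized` and `averagePathLength` are assumptions about the layout of the metadata CSV that should be checked against an actual download.

```python
import io

import pandas as pd

# Hypothetical excerpt; column names and values are assumptions for illustration.
sample_csv = io.StringIO(
    "name,yearNormalized,averagePathLength\n"
    "play-a,1480,1.90\n"
    "play-b,1650,1.75\n"
    "play-c,1772,1.60\n"
    "play-d,1890,2.10\n"
    "play-e,1925,2.40\n"
)
metadata = pd.read_csv(sample_csv)

# Keep plays dated between 1500 and 1900 (inclusive) and sort them chronologically.
in_range = metadata["yearNormalized"].between(1500, 1900)
selection = metadata[in_range].sort_values("yearNormalized")
print(selection[["name", "yearNormalized", "averagePathLength"]])

# With matplotlib installed, a scatter plot could be drawn like this:
# selection.plot.scatter(x="yearNormalized", y="averagePathLength")
```

For the real data, the per-corpus CSV files would first be read and concatenated into one dataframe before applying the same filter.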