(section-6-3)=
# 6.3 Reproducible Research with DraCor and Docker

In {ref}`section-1-3-3` we outlined how the study "Opening the Stage – A Quantitative Look at Stage Directions in German Drama" {cite:p}`trilcke_2020_opening` could be re-implemented in Python. Although {cite:t}`trilcke_2020_opening` indicate the data on which their results are based, it is not straightforward to repeat the study without first reconstructing their corpus. Regarding the composition of the dataset, the authors note:

> Of the 474 plays available in GerDraCor, we removed librettos and 3 plays without SD, which yields a corpus of 384 plays that are pre-processed using the DramaNLP package. {cite:p}`trilcke_2020_opening`

From this description it is not self-evident which version of GerDraCor was used. The only clue that may help to identify the version is the number of plays (“474”) that GerDraCor contained at the time the corpus was analyzed.

In our recent report ["On Versioning Living and Programmable Corpora"](https://versioning-living-corpora.clsinfra.io/3-2_gerdracor_corpus_archeology.html) {cite:p}`boerner_2024_versioning-living-corpora` we conducted an in-depth analysis of the genesis of a DraCor corpus (namely “GerDraCor”), taking the Git commit history as a basis. Based on the commits to the repository we can show how the corpus "grew" over time. In the following code cells we plot the development of the number of plays included in the GerDraCor corpus over time ({numref}`fig_num_corpus_documents`).

In [None]:
# This is needed to re-use outputs of code in the markdown cells. 
# This cell is removed in the rendered report
from myst_nb import glue

In [None]:
# Import packages pandas and matplotlib for data analysis
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# The following code uses custom functions developed for the above-mentioned report

from customutils import GitHubRepo

# Because it takes quite some time to download and prepare the commit history for
# the analysis we use pre-downloaded and processed data
gerdracor = GitHubRepo(repository_name="gerdracor", 
                  import_commit_list="gerdracor_commits/gerdracor_commits.json",
                  import_commit_details="gerdracor_commits/gerdracor_commits_detailed.json",
                  import_data_folder_objects="gerdracor_commits/gerdracor_data_folder_objects.json",
                  import_corpus_versions="gerdracor_commits/gerdracor_corpus_versions.json")

# Create a pandas DataFrame containing the information about implicit versions of GerDraCor
# that are identified by the commit hashes on GitHub; we include the number of plays
# (column "document_count") at a certain date (column "date_from").
corpus_num_of_docs_df = gerdracor.get_corpus_versions_as_df(columns=["date_from","id","document_count"]).set_index("date_from")

# Based on this information we create a line plot
# Create the figure together with its axes and draw on the axes directly
fig_num_corpus_documents, ax = plt.subplots()

ax.set_ylabel("Documents")
ax.set_xlabel("Date")

ax.plot(corpus_num_of_docs_df["document_count"])

glue("fig_num_corpus_documents", fig_num_corpus_documents, display=False)

```{glue:figure} fig_num_corpus_documents
---
figwidth: 800px
name: fig_num_corpus_documents
---
Development of the number of documents across all versions of GerDraCor
```

We can filter the pandas DataFrame `corpus_num_of_docs_df` for all versions that contain 474 plays:

In [None]:
# Filter the dataframe on versions that have exactly 474 plays
corpus_num_of_docs_df[corpus_num_of_docs_df["document_count"] == 474]

In [None]:
glue("cnt_versions_gerdracor_474", len(corpus_num_of_docs_df[corpus_num_of_docs_df["document_count"] == 474]))

Filtering the DataFrame shows that there are {glue}`cnt_versions_gerdracor_474` versions that consist of 474 files. This example clearly illustrates that stating the number of plays alone is not sufficient to identify, and thus to reproduce, the exact data.
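The same kind of filter can also tell us which period the matching versions cover. The following is a minimal sketch on invented toy data, not on the real commit history: the frame `toy_versions_df`, its commit ids, dates and counts are all hypothetical stand-ins that only mirror the column layout of `corpus_num_of_docs_df`.

```python
import pandas as pd

# Toy stand-in for corpus_num_of_docs_df; all ids, dates and counts are invented
toy_versions_df = pd.DataFrame(
    {
        "date_from": pd.to_datetime(
            ["2021-01-05", "2021-02-01", "2021-03-10", "2021-04-20", "2021-05-02"]
        ),
        "id": ["a1", "b2", "c3", "d4", "e5"],
        "document_count": [10, 12, 12, 12, 13],
    }
).set_index("date_from")

# How many versions share each document count?
versions_per_count = toy_versions_df.groupby("document_count")["id"].count()

# Select the versions with the reported count and read off the period they cover
matching = toy_versions_df[toy_versions_df["document_count"] == 12]
first_seen = matching.index.min()
last_seen = matching.index.max()

print(versions_per_count)
print(f"{len(matching)} versions between {first_seen.date()} and {last_seen.date()}")
```

In this toy case a reported count of 12 narrows the candidates down to three commits within a bounded period, but it cannot single out one of them.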

Still, we can look into these versions of the corpus to find out which plays were already available at that time. The corpus archeology script provides a function for this:

In [None]:
# Retrieve the playnames from the analyzed commit history in the given date range
playnames = gerdracor.get_plays_in_corpus_versions_in_date_range(date_start="2019-09-03", date_end="2019-12-20")

# Check that there really are 474 distinct playnames
assert len(playnames) == 474
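Counting distinct names across several versions is informative because a play that was swapped in or out between two versions with the same document count would enlarge the combined set of names. The following is a minimal sketch of that reasoning with invented play names; `plays_by_version` and both helper functions are hypothetical and not part of the corpus archeology script.

```python
# Hypothetical play names per corpus version (invented for illustration)
plays_by_version = {
    "version-a": {"play-one", "play-two", "play-three"},
    "version-b": {"play-one", "play-two", "play-three"},
    "version-c": {"play-one", "play-two", "play-four"},
}


def all_versions_share_plays(plays_by_version):
    """Return True if every version contains exactly the same set of play names."""
    play_sets = list(plays_by_version.values())
    return all(play_set == play_sets[0] for play_set in play_sets)


def count_distinct_plays(plays_by_version):
    """Return the number of distinct play names across all versions."""
    return len(set().union(*plays_by_version.values()))


# Every version holds three plays, yet the combined set is larger
print(all_versions_share_plays(plays_by_version))
print(count_distinct_plays(plays_by_version))
```

Identical file names still do not guarantee identical file contents, so a check like this only confirms which plays were present, not the state of their encoding.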

We can now use these playname identifiers to set up a custom local DraCor corpus containing only these plays:

In [None]:
%%bash

# Stop and remove all Docker containers to avoid conflicts 
# especially regarding ports in the next section
# This cell does not show up in the final rendering of the report
# If you want to use the Docker containers above and play around with them, 
# the following commands should NOT be run

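# Only call docker stop/rm if there are containers; both commands fail without arguments
containers=$(docker ps -a -q)
if [ -n "$containers" ]; then
    docker stop $containers
    docker rm $containers
fi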

In [None]:
from stabledracor.client import StableDraCor
dracor = StableDraCor()
dracor.run()