# Introduction

This notebook introduces some of the functionality of the *stable-dracor-client* and the *dracor-sandbox* Docker container. It is assumed that the infrastructure has been started with the docker compose file `compose.yml` (`docker compose up`) and that this notebook is run in the bundled Jupyter Lab instance at http://localhost:8888.

## DraCor in Docker and the *dracor-sandbox* Docker container

There are several possibilities to set up a local DraCor infrastructure. 

### DraCor Docker containers running on the local machine
One way would be to simply clone the repository of the DraCor API eXist-DB application (https://github.com/dracor-org/dracor-api) and use the provided docker compose file [`compose.yml`](https://github.com/dracor-org/dracor-api/blob/main/compose.yml) to start a local Docker-based DraCor system. If everything went well, issuing the command `docker ps` in a terminal will show multiple running Docker containers.

(TODO: add screenshot here)

With this approach the environment in which the local infrastructure runs can vary. Depending on the setup of the host machine (operating system, version of the Docker daemon installed, ...) there might be some issues, e.g. problems connecting to `localhost` on some Windows installations. The *stable-dracor-client* can be used with this setup as well, but in the following we assume that the *Docker in Docker* approach to setting up the local infrastructure is used.

### Docker in Docker
Another way to set up a local DraCor infrastructure using Docker is to make use of a technique called *Docker in Docker*. This means that a container that has Docker installed is running on the host machine. This Docker container is then used to start additional containers inside this controlled environment. This way the environment in which the DraCor services will be executed can be prepared in advance to make sure that everything works as expected.

When issuing the command `docker compose up` in the root directory of the cloned stable-dracor repository only a single Docker container `dracor-sandbox` will be started. It not only has the Docker daemon installed in it but also features a (this) Jupyter Lab instance (running on http://localhost:8888) that can be used to prepare custom corpora or run Python scripts.

When starting the `dracor-sandbox` container there are no Docker containers running. To test this, it is possible to "enter" the container from the host machine with the command `docker exec -it dracor-sandbox /bin/bash`. In the interactive shell that is now executed inside the container the command `docker ps` will return an empty list:

```
194dc81eb130:/home/dracor# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
```

The `dracor-sandbox` container can be left with the command `exit`.

There are also two folders mounted to the container: `import` and `export`. They also show up in the *File Browser* in Jupyter Lab. These folders provide an easy way to get data into and out of the `dracor-sandbox` container. A use case would be to add locally available XML files. When they are copied into the `import` folder on the host machine, they become available to the container. The `export` folder works the same way in the other direction. Files that should remain available on the host machine even when the Docker container is no longer running can be stored in this folder. This could, for example, be the results of an analysis or a *manifest* that describes the composition of the custom corpora (see Section X).

Having only one Docker container with the DraCor system components inside also makes it possible to "freeze" not only the whole local DraCor system with the corpora set up (as we prototyped for the paper [Detecting Small Worlds in a Corpus of Thousands of Theater Plays](https://github.com/dracor-org/small-world-paper/tree/publication-version)) but also the environment that was used to assemble it (this Jupyter Lab instance with the Python notebooks).

## Installing the client and prerequisites

### Installation

Within this Jupyter Lab instance the client has already been installed. If the client should be used standalone it can be installed from within the root directory of the stable-dracor repository (containing the necessary `pyproject.toml` file) with the command `pip install .` (the dot is necessary!).

### Activate Logging
It is recommended to activate logging in the notebook. This can be done by importing the library `logging` and setting the log level. For normal use the log level `INFO` is sufficient, but if something does not work as expected, the log level should be set to `DEBUG` to see additional messages which might help to identify the problem.
The following cell activates logging for this notebook:

In [1]:
import logging
#logging.basicConfig(level=logging.DEBUG)
logging.basicConfig(level=logging.INFO)
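
As a minimal sketch, the two log levels can also be switched with a small helper instead of commenting lines in and out. The function name `configure_logging` and its parameter `debug` are assumptions made for this illustration and are not part of the client. The argument `force=True` (available from Python 3.8) replaces handlers installed by an earlier call, which matters when the cell is run more than once.

```python
import logging


def configure_logging(debug: bool = False) -> int:
    """Set the root log level to DEBUG or INFO and return the chosen level."""
    level = logging.DEBUG if debug else logging.INFO
    # force=True re-configures the root logger even if it already has handlers
    logging.basicConfig(level=level, force=True)
    return level


# Normal use: INFO. Call configure_logging(debug=True) when troubleshooting.
configure_logging(debug=False)
```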

### Importing the client 
After the client has been installed it can be imported with the following command:

In [2]:
from stabledracor.client import StableDraCor

## Setting up Docker containers with the *stable-dracor-client*

The following section explains how to set up the Docker containers providing the DraCor services of a local DraCor infrastructure. 

### First step: Creating an instance of the StableDraCor class
To use the client it is necessary to create an instance of the class `StableDraCor` that has been imported to the notebook with the previous command. The most basic command just instantiates an object without any additional arguments:

In [3]:
dracor = StableDraCor()

INFO:root:Initialized new StableDraCor instance: 'None' (ID: ba2d7c1a-96b0-4edb-aa4b-72e8d8e5011c).
INFO:root:Docker is available.


If logging is activated some warnings will appear: We find out that there is no API currently available under the default endpoint url and that no DraCor Docker containers can be found. 

Normally, the output of the client can be trusted in this regard, but to verify that there are no running containers it is possible to directly query the Docker daemon running in the `dracor-sandbox` container. The command to list running containers is `docker ps`. In the following cell the command is prefixed with a `!`, which is a way to execute shell commands directly from within a Jupyter notebook.

In [4]:
!docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


### GitHub Access Token
The log output from above contains the warning that a "Personal GitHub Access Token" has not been supplied. Providing a token is recommended because for some operations, e.g. adding a corpus from a repository on GitHub, the client relies on the GitHub API. Unauthenticated calls to this API are subject to rate limiting, which means that only a limited number of calls can be sent without authenticating to GitHub. The [FAQ](03_faq.ipynb#Why-do-I-need-a-GitHub-Access-Token-and-how-can-I-get-one?) contains instructions on how to generate such a token and explains how to make it available to a notebook or script as an environment variable upon the creation of a `dracor-sandbox` Docker container. Although it is possible to add the value of the token directly in the script, this is not considered good practice because an access token is like a password. Especially if a notebook is shared, this means of "hiding" the token should be used.

The code in the following cell uses the library `os` to read the environment variable `GITHUB_TOKEN` and stores its value in the variable `github_token` to use it later when initializing the client.

In [5]:
import os
github_token = os.environ.get("GITHUB_TOKEN")

In [6]:
# Optional: Check if the Token is available (see FAQ Notebook)
assert github_token is not None, "It is recommended to use a GitHub Personal Access Token. See FAQ Notebook for details."
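
The assertion above stops the notebook when no token is set. If the notebook should continue without a token, a softer variant is to log a warning instead. The following is only a sketch: the helper `read_github_token` and its parameter `env` are hypothetical names introduced here, not part of the client.

```python
import logging
import os
from typing import Mapping, Optional


def read_github_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value of GITHUB_TOKEN, or None (with a warning) if it is unset or empty."""
    token = env.get("GITHUB_TOKEN")
    if not token:
        logging.warning(
            "GITHUB_TOKEN is not set; calls to the GitHub API will be rate limited."
        )
        return None
    return token


github_token = read_github_token()
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.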

If the container has been started without setting a token with the `.env` file as described in the [FAQ](03_faq.ipynb#Why-do-I-need-a-GitHub-Access-Token-and-how-can-I-get-one?), the following command can be used to create it (substitute `value` with your token, of course):

`export GITHUB_TOKEN=value`

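It would be possible to issue this command from within the notebook using the prefix `!`, but a command run this way is executed in a temporary subshell, so the variable would not be visible to the Python code in the notebook. The IPython magic `%env GITHUB_TOKEN=value` sets the variable for the running kernel instead. In either case the cell containing the value of the token must be removed afterwards for security reasons.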

Another option would be to enter the `dracor-sandbox` container (see [FAQ](03_faq.ipynb#How-can-I-get-into-the-dracor-sandbox-Docker-container?)) or use the Terminal of Jupyter Lab (in the *Launcher* at http://localhost:8888 click on *Terminal*). The shell command `printenv` can be used to list all set variables, `echo $GITHUB_TOKEN` will output the value of the environment variable.

An instance of the client can be created by supplying the token as the argument `github_access_token`:

In [7]:
dracor = StableDraCor(github_access_token=github_token)

INFO:root:Initialized new StableDraCor instance: 'None' (ID: 03466582-0107-481b-bafe-566c7e70c828).
INFO:root:Docker is available.


### Attaching Metadata to the *stable-dracor* instance
It is possible to attach metadata to the stable-dracor system. Currently, adding a `name` and a `description` is supported. In addition, every time a *stable-dracor* instance is initialized, a universally unique identifier (UUID) is generated, so that each instance can be identified unambiguously. This information (`id`, `name` and `description`) is also included in the so-called manifest describing a stable-dracor instance. In the following cell we create an instance once more, provide a GitHub token and attach metadata:

In [8]:
dracor = StableDraCor(
    name="my-stable-dracor",
    description="DraCor system created with the introduction notebook to showcase the features of the stable-dracor-client.",
    github_access_token=github_token)

INFO:root:Initialized new StableDraCor instance: 'my-stable-dracor' (ID: 7f4f9ec9-40b2-4b92-8f33-5ef83714a12b).
INFO:root:Docker is available.


### Additional options when creating a stable-dracor instance
There are some options to change the behaviour of the client instance, e.g. explicitly set a different base url of the API. For details see the [FAQ](03_faq.ipynb#How-can-I-change-the-URL-of-the-API-used).

### Second step: Starting the Docker containers
The second step to set up a DraCor system with the *stable-dracor-client* is to start the Docker containers of the system components: eXist-DB with pre-installed application ("DraCor API"), Metrics Service, Frontend, Triple Store. When using the `dracor-sandbox` these services will run inside the "outer" Docker container.

The most basic way to start these containers is to call the method `run` without any additional arguments: `dracor.run()`. In this case the client will fetch the [default configuration](https://raw.githubusercontent.com/dracor-org/stabledracor/master/configurations/compose.fullstack.empty.yml) which is defined in `compose.fullstack.empty.yml` in the stable-dracor repository on GitHub. 

A configuration file, which is basically a docker-compose file, will specify the Docker images and tags of the components to be used, e.g. currently the default configuration uses the api image `dracor/dracor-api:v0.90.1-local` ([Link](https://github.com/dracor-org/stabledracor/blob/2dc461e6f3d8106f5291ba0b1f6779b7adb52c5d/configurations/compose.fullstack.empty.yml#L8)). 
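
An image reference such as `dracor/dracor-api:v0.90.1-local` consists of a repository name and a tag. As a small sketch of how the two parts can be separated in Python (the helper `split_image_reference` is a hypothetical name, and references pinned by digest with `@sha256:...` are deliberately not handled):

```python
from typing import Tuple


def split_image_reference(reference: str) -> Tuple[str, str]:
    """Split 'repository:tag' into its parts; the tag defaults to 'latest'."""
    name, separator, tag = reference.rpartition(":")
    # A colon that belongs to a registry port (e.g. 'localhost:5000/image')
    # is not a tag separator, so such a reference has no explicit tag.
    if not separator or "/" in tag:
        return reference, "latest"
    return name, tag


repository, tag = split_image_reference("dracor/dracor-api:v0.90.1-local")
```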

The images of all DraCor components can be found on [DockerHub](https://hub.docker.com/u/dracor). For example the available images of the DraCor eXist-DB and the application powering the API can be found [here](https://hub.docker.com/r/dracor/dracor-api/tags). The image, that is currently used in the default configuration can be viewed [here](https://hub.docker.com/layers/dracor/dracor-api/v0.90.1-local/images/sha256-e2af569e41398b2b5b527dbb21ada90dea467568ce901b2607b1cc41a6743a75?context=explore). 

Executing `run()` without explicitly specifying a configuration can be considered a fall-back. It is possible to pass a path to a different docker-compose file when starting the client as the keyword argument `compose_file`. Another option is to use `url` and point to a location from which a docker-compose file can be downloaded.

**TODO**: Add configurations to dracor-sandbox container! `!ls ../configurations`! Explain which configuration to use. Difficult because of ARM/AMD architecture?

In [9]:
dracor.run()

INFO:root:Fetched default compose file (configuration) from https://raw.githubusercontent.com/dracor-org/stabledracor/master/configurations/compose.fullstack.empty.yml.
 fuseki Pulling 
 metrics Pulling 
 frontend Pulling 
 api Pulling 
 6d588874b473 Pulling fs layer 
 79c6f3c25c4d Pulling fs layer 
 2e95d5881799 Pulling fs layer 
 c494480ca267 Pulling fs layer 
 221ce4f52e61 Pulling fs layer 
 4f4fb700ef54 Pulling fs layer 
 95b30520563c Pulling fs layer 
 10f6182d9824 Pulling fs layer 
 97b4d79b8743 Pulling fs layer 
 01655217cf18 Pulling fs layer 
 5322f71909aa Pulling fs layer 
 ad9a27c51771 Pulling fs layer 
 aa345ce91a97 Pulling fs layer 
 c494480ca267 Waiting 
 221ce4f52e61 Waiting 
 4f4fb700ef54 Waiting 
 95b30520563c Waiting 
 10f6182d9824 Waiting 
 97b4d79b8743 Waiting 
 01655217cf18 Waiting 
 ad9a27c51771 Waiting 
 aa345ce91a97 Waiting 
 5322f71909aa Waiting 
 a9fe95647e78 Pulling fs layer 
 4015b6e8cc8d Pulling fs layer 
 0e86b181efa0 Pulling fs layer 
 94abd992e68d Pulling

True

### Listing running containers
It is possible to check if Docker containers are currently running with the already familiar command `docker ps`. This should now list four running containers.

In [10]:
!docker ps

CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS                             PORTS                    NAMES
ac2e4e6d8a73   dracor/dracor-frontend:v1.6.0-dirty   "/docker-entrypoint.…"   16 seconds ago   Up 15 seconds                      0.0.0.0:8088->80/tcp     my-stable-dracor-frontend-1
8c9975f92468   dracor/dracor-api:v0.90.1-local       "./entrypoint.sh"        16 seconds ago   Up 15 seconds (health: starting)   0.0.0.0:8080->8080/tcp   my-stable-dracor-api-1
3d8cc36cdf62   dracor/dracor-metrics:v1.2.0          "pipenv run hug -f m…"   16 seconds ago   Up 15 seconds                      0.0.0.0:8030->8030/tcp   my-stable-dracor-metrics-1
35802186a396   dracor/dracor-fuseki:v1.0.0           "/usr/bin/tini -- /e…"   16 seconds ago   Up 15 seconds                      0.0.0.0:3030->3030/tcp   my-stable-dracor-fuseki-1


The *stable-dracor-client* provides a method (`list_docker_containers`) that gets the same result from the docker daemon and turns it into a data structure native to Python:

In [11]:
dracor.list_docker_containers()

[{'Command': '"/docker-entrypoint.…"',
  'CreatedAt': '2023-11-23 13:25:14 +0000 UTC',
  'ID': 'ac2e4e6d8a73',
  'Image': 'dracor/dracor-frontend:v1.6.0-dirty',
  'Labels': 'com.docker.compose.depends_on=api:service_started:false,com.docker.compose.image=sha256:29b9e97eb7e327121e46916d743b02bb8964329715c61b632e44256cf0c54a39,com.docker.compose.oneoff=False,com.docker.compose.project=my-stable-dracor,com.docker.compose.service=frontend,maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>,com.docker.compose.config-hash=b7a27a200311b8bcdb1845cb14b127be75bcd50748ce9ef423c7c3902c6e0bdc,com.docker.compose.container-number=1,com.docker.compose.project.config_files=-,com.docker.compose.project.working_dir=/home/dracor/notebooks,com.docker.compose.version=2.23.0',
  'LocalVolumes': '0',
  'Mounts': '',
  'Names': 'my-stable-dracor-frontend-1',
  'Networks': 'my-stable-dracor_default',
  'Ports': '0.0.0.0:8088->80/tcp',
  'RunningFor': '17 seconds ago',
  'Size': '406B (virtual 67.9MB)',

It is possible to operate on the returned list, e.g. count the number of containers:

In [12]:
print(f"There are {len(dracor.list_docker_containers())} running Docker containers.")

There are 4 running Docker containers.
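
Because the result is a list of dictionaries, other fields can be evaluated in the same way. The following sketch extracts the published host ports from the `Ports` field. It works on `sample_containers`, an abbreviated copy of the output shown above that keeps only the keys `Names`, `Image` and `Ports`; the helper `host_ports` is a hypothetical name. In the notebook the sample list could be replaced by the return value of `dracor.list_docker_containers()`.

```python
import re

# Abbreviated copy of the container listing shown above
sample_containers = [
    {"Names": "my-stable-dracor-frontend-1",
     "Image": "dracor/dracor-frontend:v1.6.0-dirty",
     "Ports": "0.0.0.0:8088->80/tcp"},
    {"Names": "my-stable-dracor-api-1",
     "Image": "dracor/dracor-api:v0.90.1-local",
     "Ports": "0.0.0.0:8080->8080/tcp"},
    {"Names": "my-stable-dracor-metrics-1",
     "Image": "dracor/dracor-metrics:v1.2.0",
     "Ports": "0.0.0.0:8030->8030/tcp"},
    {"Names": "my-stable-dracor-fuseki-1",
     "Image": "dracor/dracor-fuseki:v1.0.0",
     "Ports": "0.0.0.0:3030->3030/tcp"},
]


def host_ports(containers):
    """Map each container name to the list of host ports it publishes."""
    result = {}
    for container in containers:
        # In '0.0.0.0:8088->80/tcp' the host port is the number before '->'
        ports = re.findall(r":(\d+)->", container.get("Ports", ""))
        result[container["Names"]] = [int(port) for port in ports]
    return result


ports_by_container = host_ports(sample_containers)
```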


### Listing available images
When the infrastructure was run the client downloaded the images specified in the configuration file into the container. These can be listed with the Docker command `docker images`:

In [13]:
!docker images

REPOSITORY               TAG             IMAGE ID       CREATED        SIZE
dracor/dracor-api        v0.90.1-local   171df59ae0ab   5 months ago   367MB
dracor/dracor-metrics    v1.2.0          1806ba4d7047   5 months ago   944MB
dracor/dracor-frontend   v1.6.0-dirty    29b9e97eb7e3   5 months ago   67.9MB
dracor/dracor-fuseki     v1.0.0          8063e90771d2   5 months ago   294MB


The client provides the method `list_docker_images` to retrieve this listing as a data structure that is easier to process with Python:

In [14]:
dracor.list_docker_images()

[{'Containers': 'N/A',
  'CreatedAt': '2023-06-07 09:28:11 +0000 UTC',
  'CreatedSince': '5 months ago',
  'Digest': '<none>',
  'ID': '171df59ae0ab',
  'Repository': 'dracor/dracor-api',
  'SharedSize': 'N/A',
  'Size': '367MB',
  'Tag': 'v0.90.1-local',
  'UniqueSize': 'N/A',
  'VirtualSize': '367.3MB'},
 {'Containers': 'N/A',
  'CreatedAt': '2023-06-07 08:24:13 +0000 UTC',
  'CreatedSince': '5 months ago',
  'Digest': '<none>',
  'ID': '1806ba4d7047',
  'Repository': 'dracor/dracor-metrics',
  'SharedSize': 'N/A',
  'Size': '944MB',
  'Tag': 'v1.2.0',
  'UniqueSize': 'N/A',
  'VirtualSize': '943.6MB'},
 {'Containers': 'N/A',
  'CreatedAt': '2023-06-06 07:48:38 +0000 UTC',
  'CreatedSince': '5 months ago',
  'Digest': '<none>',
  'ID': '29b9e97eb7e3',
  'Repository': 'dracor/dracor-frontend',
  'SharedSize': 'N/A',
  'Size': '67.9MB',
  'Tag': 'v1.6.0-dirty',
  'UniqueSize': 'N/A',
  'VirtualSize': '67.88MB'},
 {'Containers': 'N/A',
  'CreatedAt': '2023-06-06 07:46:38 +0000 UTC',
  '
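
As an illustration of processing this structure, the following sketch adds up the sizes of the downloaded images. The list `sample_images` reuses the repository names, tags and sizes from the `docker images` listing above and keeps only the keys `Repository`, `Tag` and `Size`; the helper `size_in_mb` and the set of supported units are assumptions made for this example.

```python
import re

# Values taken from the `docker images` listing above
sample_images = [
    {"Repository": "dracor/dracor-api", "Tag": "v0.90.1-local", "Size": "367MB"},
    {"Repository": "dracor/dracor-metrics", "Tag": "v1.2.0", "Size": "944MB"},
    {"Repository": "dracor/dracor-frontend", "Tag": "v1.6.0-dirty", "Size": "67.9MB"},
    {"Repository": "dracor/dracor-fuseki", "Tag": "v1.0.0", "Size": "294MB"},
]

# Conversion factors to megabytes for the units handled in this sketch
UNIT_FACTORS_MB = {"B": 1e-6, "kB": 1e-3, "MB": 1.0, "GB": 1e3}


def size_in_mb(size: str) -> float:
    """Convert a size string such as '67.9MB' to a number of megabytes."""
    match = re.fullmatch(r"([\d.]+)\s*(B|kB|MB|GB)", size)
    if match is None:
        raise ValueError(f"Unrecognised size string: {size!r}")
    value, unit = match.groups()
    return float(value) * UNIT_FACTORS_MB[unit]


total_mb = sum(size_in_mb(image["Size"]) for image in sample_images)
print(f"The images take up about {total_mb:.1f} MB.")
```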

### Getting API info
A quick way to test if a connection to the API can be established is calling the method `get_api_info`. This will render the response of the `/info` endpoint http://localhost:8088/api/info of the DraCor API.

In [15]:
dracor.get_api_info()

{'name': 'DraCor API',
 'version': '0.90.1-2-g19a3f46-dirty',
 'status': 'beta',
 'existdb': '6.0.1',
 'base': 'http://localhost:8088/api'}
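
The returned dictionary can be used for simple checks, for example to make sure that the API has at least a certain version. The sketch below works on `sample_info`, a copy of the response shown above; the helper `parse_api_version` is a hypothetical name, and it ignores everything after the third number of the version string.

```python
import re
from typing import Tuple

# Copy of the response of the /info endpoint shown above
sample_info = {
    "name": "DraCor API",
    "version": "0.90.1-2-g19a3f46-dirty",
    "status": "beta",
    "existdb": "6.0.1",
    "base": "http://localhost:8088/api",
}


def parse_api_version(info: dict) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from the 'version' field of the API info."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", info["version"])
    if match is None:
        raise ValueError(f"Unexpected version string: {info['version']!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


api_version = parse_api_version(sample_info)
# Tuples compare element by element, so this checks for "at least 0.90.0"
is_recent_enough = api_version >= (0, 90, 0)
```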

### Accessing the front-end
The DraCor frontend running inside the `dracor-sandbox` can be easily accessed from outside the container: Pointing the browser to http://localhost:8088 and/or at http://127.0.0.1:8088 should show the frontend. No corpora have been loaded yet.

The other components of the DraCor system, e.g. the Triple Store, cannot be reached from outside the `dracor-sandbox`. See [FAQ](03_faq.ipynb#How-can-I-access-other-services-than-the-frontend-and-the-API-from-outside-the-dracor-sandbox?) on how to map the necessary ports to access the other services if needed.

## Documentation of the system and its components in the *manifest*
When setting up a local DraCor infrastructure with the *stable-dracor-client* the system tries to 'document' itself, which means that the client can generate a data structure, the *manifest*, that contains information on the system's components and the composition of all corpora loaded.

The objective of the manifest is to provide a means to fully describe a local DraCor system in such a way, that, by only relying on the manifest, the system can be re-created at some later stage. 

In the following section only the `system` and the `services` parts of the manifest are explained. The `corpora` part will be introduced at a later stage when corpora have been added to the system.

To output the manifest use the method `get_manifest`.

In [16]:
dracor.get_manifest()

{'version': 'v1',
 'system': {'id': '7f4f9ec9-40b2-4b92-8f33-5ef83714a12b',
  'name': 'my-stable-dracor',
  'description': 'DraCor system created with the introduction notebook to showcase the features of the stable-dracor-client.',
  'timestamp': '2023-11-23T13:25:31.445797'},
 'services': {'api': {'container': '8c9975f92468',
   'image': 'dracor/dracor-api:v0.90.1-local',
   'version': '0.90.1-2-g19a3f46-dirty',
   'existdb': '6.0.1'},
  'frontend': {'container': 'ac2e4e6d8a73',
   'image': 'dracor/dracor-frontend:v1.6.0-dirty'},
  'metrics': {'container': '3d8cc36cdf62',
   'image': 'dracor/dracor-metrics:v1.2.0'},
  'triplestore': {'container': '35802186a396',
   'image': 'dracor/dracor-fuseki:v1.0.0'}},
 'corpora': {}}

The field `version` defines the version of the manifest specification, which, in the current state of development will be `v1`.

The field `system` contains the metadata provided when initializing a new instance (see section [Attaching Metadata ...](#Attaching-Metadata-to-the-stable-dracor-instance)). Additionally, there is a field `timestamp` that contains the date and time at which the system was described, i.e. the point in time when the manifest was generated by calling the method.

The field `services` contains information on the individual system components. At the very least it allows identifying the Docker image (`image`) the container was created from. In the following cell we request the manifest and query for the image of the api service:

In [17]:
print(f"The API container is based on the Docker image {dracor.get_manifest()['services']['api']['image']}.")

The API container is based on the Docker image dracor/dracor-api:v0.90.1-local.
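
Since the manifest is meant to describe the system for later re-creation, it is useful to keep a copy outside the running container, for example in the `export` folder. The following is a minimal sketch that writes a manifest to a JSON file and reads it back. It assumes that the manifest is a plain, JSON-serializable dictionary (as the output above suggests); the helpers `save_manifest` and `load_manifest`, the file name `manifest.json` and the abbreviated `sample_manifest` are introduced here for illustration only, and a temporary directory stands in for the real `export` folder.

```python
import json
import tempfile
from pathlib import Path

# Abbreviated copy of the manifest shown above
sample_manifest = {
    "version": "v1",
    "system": {
        "id": "7f4f9ec9-40b2-4b92-8f33-5ef83714a12b",
        "name": "my-stable-dracor",
    },
    "services": {"api": {"image": "dracor/dracor-api:v0.90.1-local"}},
    "corpora": {},
}


def save_manifest(manifest: dict, path: Path) -> Path:
    """Write the manifest as pretty-printed JSON and return the file path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def load_manifest(path: Path) -> dict:
    """Read a manifest that was stored with save_manifest."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = save_manifest(sample_manifest, Path(tmp) / "export" / "manifest.json")
    restored = load_manifest(target)
```

In the notebook, `sample_manifest` would be replaced by the return value of `dracor.get_manifest()` and the target path by a location inside the mounted `export` folder.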


## Loading Corpora
The following section introduces the methods to add corpora to the now running local DraCor infrastructure.

### Copying an existing corpus
A quick way to load a corpus into the local DraCor infrastructure is to copy an existing corpus, which is done with the method `copy_corpus`. In the next cell the *Tatar Drama Corpus* (TatDraCor) from the live production instance at https://dracor.org is copied. It is used for demonstration purposes here because it contains only three plays. Feel free to copy a different corpus by changing the value of the keyword argument `source_corpusname` in the following cell.

In [18]:
dracor.copy_corpus(source_corpusname="tat")

INFO:root:Successfully created corpus tat. All metadata is available. Plays have not been added yet.
INFO:root:Added contents of corpus tat from https://dracor.org/api/. 3 plays were added.
INFO:root:Copying tat (as tat) was successful. Plays (that were not excluded) were also copied entirely.


True

If everything went well and the method returned the value `True` the corpus should be displayed on the frontend at http://localhost:8088. 

A corpus can also be copied from a different source. The URL of the respective DraCor system must be provided as argument `source_api_url`. In the following cell the *Dutch Drama Corpus*, which is currently in development and therefore not available from the production instance at https://dracor.org is added. The corpus is already included in the staging system at http://staging.dracor.org. We need to set the value of `source_api_url` to `http://staging.dracor.org/api/`:

In [19]:
dracor.copy_corpus(source_corpusname="dutch", source_api_url="http://staging.dracor.org/api/")

INFO:root:Added contents of corpus dutch from http://staging.dracor.org/api/. 1 plays were added.
INFO:root:Copying dutch (as dutch) was successful. Plays (that were not excluded) were also copied entirely.


True

When copying it is possible to change some aspects of the corpus. In the following cell we copy the "Bashkir Drama Corpus", which, in the production instance on https://dracor.org, is identified by the corpusname `bash`. 

The corpus on dracor.org contains 3 plays, of which we will import only two by excluding the play "Аҡ билеттәр" by the author Шагит Худайбердин (Shagit Khudayberdin). The other two plays are both by the author Мостай Кәрим (Mustai Karim), thus the resulting local corpus will only include plays by a single author. The identifier `playname` of the play to exclude (`khudayberdin-aq-bilettar`) must be passed as keyword argument `exclude` in the form of a list because `exclude` can be used to skip multiple plays as well.

We also change the metadata of the corpus. Because the local corpus will be a single author corpus by Mustai Karim we will call it "KarimDraCor" and change the description accordingly. This can be achieved by creating a dictionary `karim_meta` containing the new metadata fields.

In [20]:
karim_meta = dict(
    name="kar",
    title="Mustai Karim Drama Corpus",
    description="Corpus of plays by Mustai Karim derived from the Bashkir Drama Corpus (BashDraCor)."
)

dracor.copy_corpus(source_corpusname="bash", exclude=["khudayberdin-aq-bilettar"], metadata=karim_meta)

INFO:root:Added contents of corpus bash from https://dracor.org/api/. 2 plays were added.
INFO:root:Copying kar (as kar) was successful. Plays (that were not excluded) were also copied entirely.


True

### Documentation of copied corpus in the manifest
The manifest documents the constitution of added corpora. As explained in the section on the [manifest as a documentation of the system components](#Documentation-of-the-system-and-its-components-in-the-manifest), the manifest can be output with the method `get_manifest`. Loaded corpora are documented in the field `corpora`. If you followed the notebook to this point the infrastructure contains three corpora with the names `tat`, `dutch`, `kar`.

In [21]:
# uncomment this to see the whole manifest:
# dracor.get_manifest()
# the following line gets the section `corpora`
dracor.get_manifest()["corpora"]

{'tat': {'corpusname': 'tat',
  'timestamp': '2023-11-23T13:25:32.435087',
  'sources': {'tat': {'type': 'api',
    'corpusname': 'tat',
    'url': 'https://dracor.org/api/corpora/tat',
    'timestamp': '2023-11-23T13:25:32.435093',
    'num_of_plays': 3}},
  'num_of_plays': 3},
 'dutch': {'corpusname': 'dutch',
  'timestamp': '2023-11-23T13:25:37.631544',
  'sources': {'dutch': {'type': 'api',
    'corpusname': 'dutch',
    'url': 'http://staging.dracor.org/api/corpora/dutch',
    'timestamp': '2023-11-23T13:25:37.631550',
    'num_of_plays': 1}},
  'num_of_plays': 1},
 'kar': {'corpusname': 'kar',
  'timestamp': '2023-11-23T13:25:41.428662',
  'sources': {'bash': {'type': 'api',
    'corpusname': 'bash',
    'url': 'https://dracor.org/api/corpora/bash',
    'timestamp': '2023-11-23T13:25:41.428668',
    'exclude': {'type': 'slug', 'ids': ['khudayberdin-aq-bilettar']},
    'num_of_plays': 2}},
  'num_of_plays': 2}}

In [22]:
# get the keys of the corpora dictionaries
list(dracor.get_manifest()["corpora"].keys())

['tat', 'dutch', 'kar']

For example, the corpus `tat`, which was copied directly from the production instance, is described as follows:

In [23]:
dracor.get_manifest()["corpora"]["tat"]

{'corpusname': 'tat',
 'timestamp': '2023-11-23T13:25:32.435087',
 'sources': {'tat': {'type': 'api',
   'corpusname': 'tat',
   'url': 'https://dracor.org/api/corpora/tat',
   'timestamp': '2023-11-23T13:25:32.435093',
   'num_of_plays': 3}},
 'num_of_plays': 3}

The field `sources` contains the source the corpus was derived from: It is retrieved via an API (`type` = `api`) from the location (`url`) `https://dracor.org/api/corpora/tat`, which is the live production instance of DraCor. 

The second corpus (`dutch`) was copied from the DraCor staging instance at http://staging.dracor.org, as is documented in the respective part of the manifest.

In [24]:
dracor.get_manifest()["corpora"]["dutch"]["sources"]

{'dutch': {'type': 'api',
  'corpusname': 'dutch',
  'url': 'http://staging.dracor.org/api/corpora/dutch',
  'timestamp': '2023-11-23T13:25:37.631550',
  'num_of_plays': 1}}

The field `timestamp` contains the date and time when the corpus was copied; the value of the field `num_of_plays` is the number of plays that were copied from the source corpus.

In case of the third corpus that was added the manifest contains information about the excluded plays. The field `exclude` provides the information that the plays with the ids (`ids`; the `type` of the identifiers is `slug`, meaning "playname" consisting of author and title) were not copied from the source corpus with the identifier `bash` at the url `https://dracor.org/api/corpora/bash`:

In [25]:
dracor.get_manifest()["corpora"]["kar"]

{'corpusname': 'kar',
 'timestamp': '2023-11-23T13:25:41.428662',
 'sources': {'bash': {'type': 'api',
   'corpusname': 'bash',
   'url': 'https://dracor.org/api/corpora/bash',
   'timestamp': '2023-11-23T13:25:41.428668',
   'exclude': {'type': 'slug', 'ids': ['khudayberdin-aq-bilettar']},
   'num_of_plays': 2}},
 'num_of_plays': 2}
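
The `corpora` part of the manifest can be condensed into a short overview of where each corpus came from and which plays were left out. The sketch below works on `sample_corpora`, an abbreviated copy of two entries from the manifest shown above; the helper `summarize_sources` is a hypothetical name.

```python
# Abbreviated copy of two entries of the `corpora` section shown above
sample_corpora = {
    "tat": {
        "corpusname": "tat",
        "sources": {"tat": {"type": "api",
                            "url": "https://dracor.org/api/corpora/tat",
                            "num_of_plays": 3}},
        "num_of_plays": 3,
    },
    "kar": {
        "corpusname": "kar",
        "sources": {"bash": {"type": "api",
                             "url": "https://dracor.org/api/corpora/bash",
                             "exclude": {"type": "slug",
                                         "ids": ["khudayberdin-aq-bilettar"]},
                             "num_of_plays": 2}},
        "num_of_plays": 2,
    },
}


def summarize_sources(corpora: dict) -> list:
    """Return one (corpus, source type, source url, excluded ids) tuple per source."""
    rows = []
    for corpusname, corpus in corpora.items():
        for source in corpus["sources"].values():
            # The field `exclude` is only present if plays were skipped
            excluded = source.get("exclude", {}).get("ids", [])
            rows.append((corpusname, source["type"], source["url"], excluded))
    return rows


for row in summarize_sources(sample_corpora):
    print(row)
```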

Bear in mind that the corpora published on the DraCor platform (production and staging) are so-called "living corpora". This means that plays are still being added to some of them and that the encoding can change. Although the manifest records when a corpus was copied and how many plays were available at that point in time, in most cases it will not be possible to re-create the exact same composition of this corpus at some later point in time. When the copy mechanism is used, the manifest alone is therefore not a sufficient source to reproduce the contents of the system. If reproducibility is the goal, the following method of adding data should be used.

### Adding a corpus from a GitHub Repository
Although copying from a running DraCor instance is a convenient way to quickly get data into a local instance, it is not the best approach if the contents of a corpus should be transparent and traceable. Therefore the *stable-dracor-client* provides the method `add_corpus_from_repo` that retrieves data from a repository on GitHub. In most cases corpora that are published on DraCor have their designated data repositories on GitHub. They are listed on the [page of the dracor-org organization on GitHub](https://github.com/orgs/dracor-org/repositories) (bear in mind that not all of these repositories are corpora).

**TODO**: add a method to retrieve repositories that contain a `corpus.xml` file --> corpus repository. BUT: Not all have corpus.xml (which is bad practise).

To add a corpus directly from GitHub the method expects the name of the repository as the keyword argument `repository_name`. The code in the following cell adds the data of the Spanish Drama Corpus from the repository https://github.com/dracor-org/spandracor. The repository name is `spandracor`:

In [26]:
dracor.add_corpus_from_repo(repository_name="spandracor")

INFO:root:Successfully created corpus span.
INFO:root:Play 'clarin-teresa' retrieved from 'https://raw.githubusercontent.com/dracor-org/spandracor/184ebf975ad9cd674ff37cab44a181fa7ed8d85f/tei/clarin-teresa.xml' has been successfully added to corpus 'span'. Checked and found local play data.
INFO:root:Play 'dicenta-juan-jose' retrieved from 'https://raw.githubusercontent.com/dracor-org/spandracor/184ebf975ad9cd674ff37cab44a181fa7ed8d85f/tei/dicenta-juan-jose.xml' has been successfully added to corpus 'span'. Checked and found local play data.
INFO:root:Play 'echegaray-arrastrarse' retrieved from 'https://raw.githubusercontent.com/dracor-org/spandracor/184ebf975ad9cd674ff37cab44a181fa7ed8d85f/tei/echegaray-arrastrarse.xml' has been successfully added to corpus 'span'. Checked and found local play data.
INFO:root:Play 'echegaray-mancha' retrieved from 'https://raw.githubusercontent.com/dracor-org/spandracor/184ebf975ad9cd674ff37cab44a181fa7ed8d85f/tei/echegaray-mancha.xml' has been succes

True

The corpus should be available in the local instance at http://localhost:8088/span. 

When we output the manifest we see that the `type` of the source is `repository` (in case of copying it was `api`, see [previous section](#Documentation-of-copied-corpus-in-the-manifest)) and the URL of the repository is included as `url`. In addition to a `timestamp` that contains the date and time the process was initiated, the manifest contains the field `commit`. When calling the method as in the previous cell the client will fetch the data represented by the most recent [commit](https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/about-commits). A commit represents the state of the data at a given point in time. This means that if we know the commit (and the repository is still there, of course), we can retrieve the data in precisely the state it was in when it was committed.

In [27]:
dracor.get_manifest()["corpora"]["span"]

{'corpusname': 'span',
 'timestamp': '2023-11-23T13:25:48.287503',
 'sources': {'span': {'type': 'repository',
   'corpusname': 'span',
   'url': 'https://github.com/dracor-org/spandracor',
   'commit': '184ebf975ad9cd674ff37cab44a181fa7ed8d85f',
   'timestamp': '2023-11-23T13:25:48.287508',
   'num_of_plays': 25}},
 'num_of_plays': 25}
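
The log output above suggests how the commit pins the data: every play is fetched from a raw-file URL that embeds the commit ID. The following is a minimal sketch that rebuilds such a URL from the manifest entry. The helper `raw_play_url` is hypothetical and not part of the *stable-dracor-client*; the URL pattern and the `tei` folder are inferred from the log lines, and `play_name` stands for the file name without its extension.

```python
from urllib.parse import urlparse

# Source entry copied from the manifest output above (only the fields used here).
span_source = {
    "type": "repository",
    "corpusname": "span",
    "url": "https://github.com/dracor-org/spandracor",
    "commit": "184ebf975ad9cd674ff37cab44a181fa7ed8d85f",
}


def raw_play_url(source, play_name, data_folder="tei"):
    """Build the raw-file URL of a play at the commit recorded in the manifest.

    Hypothetical helper: the URL pattern is inferred from the log output,
    not taken from the client's source code.
    """
    # The path of the repository URL has the form "/<owner>/<repository>".
    owner, repository = urlparse(source["url"]).path.strip("/").split("/")
    return (
        f"https://raw.githubusercontent.com/{owner}/{repository}/"
        f"{source['commit']}/{data_folder}/{play_name}.xml"
    )


url = raw_play_url(span_source, "clarin-teresa")
print(url)
```

Because the commit ID is part of the URL, the same request should return the same file for as long as the repository and the commit exist.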

To add a corpus in a given state represented by a commit, provide the commit ID as the keyword argument `commit`. In the following cell we add the *Roman Drama Corpus* from its GitHub repository in the state it was in in January 2001, which can be identified by the commit `952ae76d8b9d51725b652b8c2c5d6538c592abd6` (see [this commit on GitHub](https://github.com/dracor-org/romdracor/commit/952ae76d8b9d51725b652b8c2c5d6538c592abd6)).

In [28]:
dracor.add_corpus_from_repo(repository_name="romdracor", commit="952ae76d8b9d51725b652b8c2c5d6538c592abd6")

INFO:root:Successfully created corpus rom.
INFO:root:Play 'plautus-amphitruo' retrieved from 'https://raw.githubusercontent.com/dracor-org/romdracor/952ae76d8b9d51725b652b8c2c5d6538c592abd6/tei/plautus-amphitruo.xml' has been successfully added to corpus 'rom'. Checked and found local play data.
INFO:root:Play 'plautus-asinaria' retrieved from 'https://raw.githubusercontent.com/dracor-org/romdracor/952ae76d8b9d51725b652b8c2c5d6538c592abd6/tei/plautus-asinaria.xml' has been successfully added to corpus 'rom'. Checked and found local play data.
INFO:root:Play 'plautus-aulularia' retrieved from 'https://raw.githubusercontent.com/dracor-org/romdracor/952ae76d8b9d51725b652b8c2c5d6538c592abd6/tei/plautus-aulularia.xml' has been successfully added to corpus 'rom'. Checked and found local play data.
INFO:root:Play 'plautus-bacchides' retrieved from 'https://raw.githubusercontent.com/dracor-org/romdracor/952ae76d8b9d51725b652b8c2c5d6538c592abd6/tei/plautus-bacchides.xml' has been successfully a

True

By default, the client assumes that the data is published under the dracor-org organization on GitHub, but this behavior can be changed, e.g. by explicitly setting the "owner" of the repository with the keyword argument `repository_owner`. In the next cell we will add data from a fork of the *Shakespeare Drama Corpus* that does not contain linguistic markup.

In the version of ShakeDraCor in the dracor-org organization, Hamlet's famous lines are encoded as follows:
```
<sp xml:id="sp-1762" who="#Hamlet_Ham">
            <speaker xml:id="spk-1762">
              <w xml:id="fs-ham-0271840">HAMLET</w>
            </speaker>
            <l xml:id="ftln-1762" n="3.1.64">
              <w xml:id="fs-ham-0271850" n="3.1.64" lemma="to" ana="#acp-cs">To</w>
<c> </c>
              <w xml:id="fs-ham-0271870" n="3.1.64" lemma="be" ana="#vvi">be</w>
<c> </c>
              <w xml:id="fs-ham-0271890" n="3.1.64" lemma="or" ana="#cc">or</w>
<c> </c>
              <w xml:id="fs-ham-0271910" n="3.1.64" lemma="not" ana="#xx">not</w>
<c> </c>
              <w xml:id="fs-ham-0271930" n="3.1.64" lemma="to" ana="#acp-cs">to</w>
<c> </c>
              <w xml:id="fs-ham-0271950" n="3.1.64" lemma="be" ana="#vvi">be</w>
              <pc xml:id="fs-ham-0271960" n="3.1.64">—</pc>
[...]

```
[*Hamlet* in the shakedracor repository of dracor-org](https://github.com/dracor-org/shakedracor/blob/main/tei/hamlet.xml)

The following passage is taken from the [forked repository](https://github.com/ingoboerner/shakedracor) with linguistic markup removed:

```
<sp xml:id="sp-1762" who="#Hamlet_Ham">
                  <speaker xml:id="spk-1762">HAMLET </speaker>
                  <l xml:id="ftln-1762" n="3.1.64">To be or not to be— that is the question: </l>
...
```
[*Hamlet* in the fork of the shakedracor repository](https://github.com/ingoboerner/shakedracor/blob/main/tei/hamlet.xml#L3803-L3805)
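
To make the difference between the two encodings concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It parses a shortened excerpt of the annotated verse line (the first three words only), joins the token text into plain text and reads the `lemma` attributes that the fork no longer carries. The excerpt has no namespace declaration; a real TEI file declares the TEI namespace, so the tag names would need to be qualified there.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the annotated verse line shown above (first three words).
annotated_line = """<l xml:id="ftln-1762" n="3.1.64">
  <w xml:id="fs-ham-0271850" n="3.1.64" lemma="to" ana="#acp-cs">To</w>
  <c> </c>
  <w xml:id="fs-ham-0271870" n="3.1.64" lemma="be" ana="#vvi">be</w>
  <c> </c>
  <w xml:id="fs-ham-0271890" n="3.1.64" lemma="or" ana="#cc">or</w>
</l>"""

line = ET.fromstring(annotated_line)

# Join the text of the token elements (<w> words, <c> spaces, <pc> punctuation).
# Whitespace between the elements is layout only and is not included.
plain_text = "".join(token.text for token in line if token.tag in {"w", "c", "pc"})

# The linguistic annotation lives in attributes of the <w> elements.
lemmas = [word.get("lemma") for word in line.iter("w")]

print(plain_text)
print(lemmas)
```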

By setting the keyword argument `repository_owner` to the username of the owner of the fork (`ingoboerner`), the client will fetch data from this user instead of the default `dracor-org` organization.

In [30]:
dracor.add_corpus_from_repo(repository_owner="ingoboerner", repository_name="shakedracor")

INFO:root:Successfully created corpus shake.
INFO:root:Play 'a-midsummer-night-s-dream' retrieved from 'https://raw.githubusercontent.com/ingoboerner/shakedracor/3a420de7d253a505d1d3b8225e6bb6659577d82f/tei/a-midsummer-night-s-dream.xml' has been successfully added to corpus 'shake'. Checked and found local play data.
INFO:root:Play 'all-s-well-that-ends-well' retrieved from 'https://raw.githubusercontent.com/ingoboerner/shakedracor/3a420de7d253a505d1d3b8225e6bb6659577d82f/tei/all-s-well-that-ends-well.xml' has been successfully added to corpus 'shake'. Checked and found local play data.
INFO:root:Play 'antony-and-cleopatra' retrieved from 'https://raw.githubusercontent.com/ingoboerner/shakedracor/3a420de7d253a505d1d3b8225e6bb6659577d82f/tei/antony-and-cleopatra.xml' has been successfully added to corpus 'shake'. Checked and found local play data.
INFO:root:Play 'as-you-like-it' retrieved from 'https://raw.githubusercontent.com/ingoboerner/shakedracor/3a420de7d253a505d1d3b8225e6bb66595

True

The source of the corpus will be reflected in the manifest:

In [34]:
dracor.get_manifest()["corpora"]["shake"]

{'corpusname': 'shake',
 'timestamp': '2023-11-23T13:29:52.227215',
 'sources': {'shake': {'type': 'repository',
   'corpusname': 'shake',
   'url': 'https://github.com/ingoboerner/shakedracor',
   'commit': '3a420de7d253a505d1d3b8225e6bb6659577d82f',
   'timestamp': '2023-11-23T13:29:52.227220',
   'num_of_plays': 37}},
 'num_of_plays': 37}
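
Since every corpus entry in the manifest records its source, the provenance of all loaded corpora can be summarized in a few lines. The sketch below works on a plain dictionary copied from the outputs above, reduced to the fields it uses; `describe_sources` is a hypothetical helper, not a method of the client. Sources of type `api` may not carry a `commit` field, which is why the lookup uses `.get()`.

```python
# Excerpt of the manifest, copied from the outputs above (only the fields used here).
manifest_corpora = {
    "span": {
        "sources": {
            "span": {
                "type": "repository",
                "url": "https://github.com/dracor-org/spandracor",
                "commit": "184ebf975ad9cd674ff37cab44a181fa7ed8d85f",
                "num_of_plays": 25,
            }
        }
    },
    "shake": {
        "sources": {
            "shake": {
                "type": "repository",
                "url": "https://github.com/ingoboerner/shakedracor",
                "commit": "3a420de7d253a505d1d3b8225e6bb6659577d82f",
                "num_of_plays": 37,
            }
        }
    },
}


def describe_sources(corpora):
    """Return one human-readable provenance line per source of each corpus."""
    lines = []
    for corpus_name, corpus in corpora.items():
        for source in corpus["sources"].values():
            # Abbreviate the commit ID to seven characters for display only.
            commit = source.get("commit", "")[:7] or "n/a"
            lines.append(
                f"{corpus_name}: {source['num_of_plays']} plays "
                f"from {source['url']} (commit {commit})"
            )
    return lines


summary = describe_sources(manifest_corpora)
print("\n".join(summary))
```

In the notebook, the same function could be applied to `dracor.get_manifest()["corpora"]`, provided all entries have the structure shown above.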

A corpus repository on GitHub should include at least a TEI-XML file `corpus.xml` that provides the metadata of the corpus. Optionally, but highly recommended, a corpus repository should also contain a `README.md` file presenting this information in a more human-readable form. The `README.md` is ignored by the *stable-dracor-client*. If no further parameters are set, the client looks for the `corpus.xml` and uses it as the source of the corpus metadata. If no `corpus.xml` can be found, the import fails. Although each corpus should contain such a file, this is sometimes not the case (especially for relatively new corpora still in development). It is still possible to import such a corpus if the flag `use_metadata_of_corpus_xml` is set to `False`. In this case the corpus is imported with some boilerplate metadata.

To illustrate this, we can look at the *Czech Drama Corpus* in the state it was in in early October 2023 (commit: `e7595849d47c5d426d95d6af3f59542f2982e8b0`). If you open the following link, GitHub will return an error message because at that time the corpus repository did not contain a `corpus.xml` file: https://github.com/dracor-org/czedracor/blob/e7595849d47c5d426d95d6af3f59542f2982e8b0/corpus.xml

Trying to load this state of the corpus will result in an error. To see the error uncomment the code in the following cell and run it.

In [36]:
# TODO: this should provide a better error message!
# Uncomment to see the error

# dracor.add_corpus_from_repo(repository_name="czdracor", commit="e7595849d47c5d426d95d6af3f59542f2982e8b0")

In [39]:
dracor.add_corpus_from_repo(
    repository_name="czdracor", 
    commit="e7595849d47c5d426d95d6af3f59542f2982e8b0", 
    use_metadata_of_corpus_xml=False
    )

INFO:root:Successfully created corpus czdracor.
INFO:root:Play 'capek-j-gassirova-loutna' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-j-gassirova-loutna.xml' has been successfully added to corpus 'czdracor'. Checked and found local play data.
INFO:root:Play 'capek-j-zeme-mnoha-jmen' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-j-zeme-mnoha-jmen.xml' has been successfully added to corpus 'czdracor'. Checked and found local play data.
INFO:root:Play 'capek-k-j-adam-stvoritel' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-k+j-adam-stvoritel.xml' has been successfully added to corpus 'czdracor'. Checked and found local play data.
INFO:root:Play 'capek-k-j-lasky-hra-osudna' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af

True
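
Whether the flag is needed can, in principle, be checked before the import by asking for `corpus.xml` at the commit in question. The following sketch uses only the standard library. Both helpers are hypothetical and not part of the client; the raw-file URL pattern is inferred from the log output, and it is an assumption that the server answers `HEAD` requests for such files. The check itself is only defined here, not executed, because it needs network access.

```python
import urllib.error
import urllib.request


def raw_corpus_xml_url(owner, repository, commit):
    """Build the assumed raw-file URL of corpus.xml at a given commit."""
    return f"https://raw.githubusercontent.com/{owner}/{repository}/{commit}/corpus.xml"


def corpus_xml_exists(url, timeout=10):
    """Return True if the URL answers with status 200, False on an HTTP error.

    Network problems (e.g. no connection) are not caught and raise URLError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False


url = raw_corpus_xml_url(
    "dracor-org", "czdracor", "e7595849d47c5d426d95d6af3f59542f2982e8b0"
)
print(url)
# With network access, the result could decide the value of the flag:
# use_metadata_of_corpus_xml = corpus_xml_exists(url)
```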

Another way would be to explicitly provide metadata in the keyword argument `corpus_metadata`. Still, if there is no `corpus.xml` the keyword argument `use_metadata_of_corpus_xml` must be set to `False`.

In [43]:
old_cz_meta = {
    "name" : "oldcz",
    "title" : "Older CzDraCor Version",
    "description" : "Version of CzDraCor as it was in October 2023."
}

dracor.add_corpus_from_repo(
    repository_name="czdracor", 
    commit="e7595849d47c5d426d95d6af3f59542f2982e8b0", 
    corpus_metadata=old_cz_meta,
    use_metadata_of_corpus_xml=False
    )

#TODO: There is a stupid error I still need to fix

INFO:root:Play 'capek-j-gassirova-loutna' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-j-gassirova-loutna.xml' has been successfully added to corpus 'oldcz'. Checked and found local play data.
INFO:root:Play 'capek-j-zeme-mnoha-jmen' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-j-zeme-mnoha-jmen.xml' has been successfully added to corpus 'oldcz'. Checked and found local play data.
INFO:root:Play 'capek-k-j-adam-stvoritel' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-k+j-adam-stvoritel.xml' has been successfully added to corpus 'oldcz'. Checked and found local play data.
INFO:root:Play 'capek-k-j-lasky-hra-osudna' retrieved from 'https://raw.githubusercontent.com/dracor-org/czdracor/e7595849d47c5d426d95d6af3f59542f2982e8b0/tei/capek-k+j-lasky-hra-osudna.xml' has 

UnboundLocalError: cannot access local variable 'source_name' where it is not associated with a value

The files of the individual plays are normally stored in a folder named `tei`. If the folder name is different (which should not be the case for the more "official" DraCor corpora), the name of the folder can be provided as the keyword argument `repository_data_folder`. Please keep in mind that the client can only handle data folders that are direct subfolders of the repository root. It is not possible to import from deeply nested folders. Although the client tries to allow some flexibility in setting up the corpus repository, it is generally a good idea to stick to the DraCor conventions (see also [Report on Programmable Corpora](https://doi.org/10.5281/zenodo.7664964), Section 5.2.1 GitHub Repositories, p. 26).

Finally, if the data is not hosted on GitHub but on some other Git-based system (e.g. GitLab), it is possible to change the base URL by providing it as the keyword argument `repository_base_url` (default is: `"github.com"`).
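
The repository-related keyword arguments introduced in this section can be collected in one place. The sketch below builds a dictionary of keyword arguments and leaves out every option that equals the default described in the text. The helper `repo_import_kwargs` and the example repository (owner, name, folder and host) are invented for illustration, and the defaults are taken from the prose above, not from the client's source code.

```python
# Defaults as described in the text above. They are assumptions about the
# documented behavior and are not read from the client's source code.
DOCUMENTED_DEFAULTS = {
    "repository_owner": "dracor-org",
    "repository_data_folder": "tei",
    "repository_base_url": "github.com",
}


def repo_import_kwargs(repository_name, **options):
    """Return keyword arguments, leaving out options that equal the defaults."""
    kwargs = {"repository_name": repository_name}
    for key, value in options.items():
        if DOCUMENTED_DEFAULTS.get(key) != value:
            kwargs[key] = value
    return kwargs


# Hypothetical repository: owner, name, data folder and host are invented.
custom_kwargs = repo_import_kwargs(
    "example-corpus",
    repository_owner="example-user",
    repository_data_folder="xml",
    repository_base_url="gitlab.com",
)
print(custom_kwargs)

# Options that equal the documented defaults are dropped.
default_kwargs = repo_import_kwargs(
    "spandracor", repository_owner="dracor-org", repository_data_folder="tei"
)
print(default_kwargs)
```

In the notebook, such a dictionary could then be unpacked into the call, e.g. `dracor.add_corpus_from_repo(**custom_kwargs)`.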