# Brieven van Hooft - Notebook

## Introduction

This notebook provides access to the linguistic and socio-linguistic
annotations that were added to the letters of P.C. Hooft in a 2017 annotation
project by Marjo van Koppen and Marijn Schraagen.

The letters come from "De briefwisseling van Pieter Corneliszoon Hooft", edited
by H.W. van Tricht et al., as published by the DBNL in the following three
parts:

* [Part 1](https://www.dbnl.org/tekst/hoof001hwva02_01/)
* [Part 2](https://www.dbnl.org/tekst/hoof001hwva03_01/)
* [Part 3](https://www.dbnl.org/tekst/hoof001hwva04_01/)

License information for these works can be found
[here](https://www.dbnl.org/titels/gebruiksvoorwaarden.php?id=hoof001hwva03).
We did not receive the rights to publish the editorial parts of the texts that
are not from the 17th century. They are still available in this notebook
because they are downloaded from DBNL directly, but republishing them is not
permitted.

The annotations were initially published in a combination of FoLiA XML and
other stand-off formats. In 2024, they were re-aligned with the original
DBNL sources and published as a [STAM](https://annotation.github.io/stam) model.

This notebook provides search and visualisation functionality on this STAM
model. We will guide you through several examples. All code in this notebook
can be executed and, if needed, modified to your liking.

## Setup

### Installing software

In [1]:
!pip install stam tabulate

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
    python-xyz', where xyz is the package you are trying to
    install.

    If you wish to install a non-Arch-packaged Python package,
    create a virtual environment using 'python -m venv path/to/venv'.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip.

    If you wish to install a non-Arch packaged Python application,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. Make sure you have python-pipx
    installed via pacman.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-s

After installation we import this library:

In [2]:
from stam import *
from tabulate import tabulate
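
If one of these imports fails, for example because the installation step was
blocked by the system's package manager, it helps to know which package is
missing before going further. The following is a minimal sketch that uses only
the Python standard library; the helper name `missing_packages` is our own
invention for this illustration and is not part of STAM.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# The two packages this notebook imports:
missing = missing_packages(["stam", "tabulate"])
if missing:
    print("Please install first:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If packages are reported missing, installing them inside a virtual environment
is usually the safest route.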

### Obtaining the data

We first obtain the data by downloading the original texts of the three books
from DBNL and the STAM model from Zenodo. The latter may take a while; please
wait until the cell reports that it is done:

In [3]:
import os.path
from urllib.request import urlretrieve

print("Downloading data...")
if not os.path.exists("hoof001hwva02.txt"):
    urlretrieve("https://www.dbnl.org/nieuws/text.php?id=hoof001hwva02","hoof001hwva02.txt")
if not os.path.exists("hoof001hwva03.txt"):
    urlretrieve("https://www.dbnl.org/nieuws/text.php?id=hoof001hwva03","hoof001hwva03.txt")
if not os.path.exists("hoof001hwva04.txt"):
    urlretrieve("https://www.dbnl.org/nieuws/text.php?id=hoof001hwva04","hoof001hwva04.txt")
if not os.path.exists("hoof001hwva.output.store.stam.json"):
    #TODO: adapt link to Zenodo before final publication
    urlretrieve("https://download.anaproy.nl/hoof001hwva.output.store.stam.json","hoof001hwva.output.store.stam.json")
print("Done!")

Done!
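
The download cell repeats the same "skip if the file already exists" pattern
four times. As a sketch of how that pattern could be factored out, the helper
below wraps it in one function. The names `fetch_if_missing`, `fetch_dbnl_texts`
and `DBNL_PARTS` are hypothetical and only serve this illustration; the URLs
are the DBNL ones used above. The `retrieve` parameter is an assumption added
so the helper can be tried without network access.

```python
import os.path
from urllib.request import urlretrieve

# Identifiers of the three DBNL parts used in this notebook.
DBNL_PARTS = ("hoof001hwva02", "hoof001hwva03", "hoof001hwva04")

def fetch_if_missing(url, filename, retrieve=urlretrieve):
    """Download `url` to `filename` unless the file already exists.

    Returns True if a download was performed, False if the file was
    already present. `retrieve` can be replaced by a stand-in function.
    """
    if os.path.exists(filename):
        return False
    retrieve(url, filename)
    return True

def fetch_dbnl_texts(retrieve=urlretrieve):
    """Fetch the three plain-text files, skipping those already present."""
    return {
        part: fetch_if_missing(
            f"https://www.dbnl.org/nieuws/text.php?id={part}",
            f"{part}.txt",
            retrieve,
        )
        for part in DBNL_PARTS
    }
```

Calling `fetch_dbnl_texts()` would then return a dictionary that tells you, per
part, whether the file was downloaded in this run or was already on disk.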


### Loading the data

Finally, we will load the texts and all annotations into memory:

In [4]:
store = AnnotationStore(file="hoof001hwva.output.store.stam.json")

## Data exploration

### Vocabularies

Before we get to the actual texts and annotations, we first want to give some
insight into the vocabularies that are used in this project. Understanding and
exploring the vocabularies is important to be able to make sensible queries
later on.

Vocabularies used by the annotations are grouped into so-called **annotation data
sets**. Within these sets, **keys** are defined. Notable keys in this project are the following:

| Set | Key | Explanation |
| --- | --- | ----------- |
| `https://w3id.org/folia/v2/` | `elementtype` | Indicates the type of FoLiA element of this annotation (e.g. `s` (sentence), `w` (word), `pos`, `lemma`) |
| `gustave-pos` | `class` | The Part-of-Speech tag, manually assigned by the annotator, according to the CGN tagset and an extension thereof |
| `gustave-lemma` | `class` | The lemma, manually assigned by the annotator |
| `http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn` | `class` | The Part-of-Speech tag, automatically annotated by Frog, according to the CGN tagset |
| `http://ilk.uvt.nl/folia/sets/frog-mblem-nl` | `class` | The lemma, automatically annotated by Frog |
| `https://w3id.org/folia/v2/` | `confidence` | The confidence value that was assigned to the annotation (a value between 0 and 1, occurs with automatic annotations by Frog) | 
| `brieven-van-hooft-metadata` |  `dbnl_id` | The full letter identifier as assigned by the DBNL. You will find this key and others in this set on annotations of letters as a whole. |
| `brieven-van-hooft-metadata` |  `dated` | The date of a letter |
| `brieven-van-hooft-metadata` |  `recipient` | The name of the recipient of a letter |
| `brieven-van-hooft-metadata` |  `letter_id` | The letter sequence number (not necessarily entirely numerical) |
| `brieven-van-hooft-metadata` |  `invididual` | `True` if the recipient is an individual, `False` if it's an organization or group  |
| `brieven-van-hooft-metadata` |  `gender` | The gender of the recipient: `male` or `female` (not much space for gender fluidity in the 17th century) |
| `brieven-van-hooft-metadata` |  `function` | Occupation of the recipient, type of organisation of the recipient or type of personal relation to the recipient. Free value. |
| `brieven-van-hooft-metadata` |  `literary` | `True` if the recipient is a literary author, `False` otherwise |
| `brieven-van-hooft-categories` |  `function` | Function of the letter (closed vocabulary). You will find this key and others in this set on annotations of letters as a whole. |
| `brieven-van-hooft-categories` |  `topic` | Topic of the letter (closed vocabulary) |
| `brieven-van-hooft-categories` |  `business` | `True` if it's a business letter, `False` if it's a personal letter |
| `brieven-van-hooft-categories` |  `accompanying` | `True` if it's an accompanying letter, `False` if it's an independent letter |
| `brieven-van-hooft-categories` |  `part` | This key is found on annotations that identify *parts* of letters. Values come from a closed vocabulary containing `greeting`, `opening`, `narratio`, `closing`, `finalgreeting` |

If you want to see what keys exist in a particular set, adapt and run the following code:

In [None]:
dataset = store.dataset("brieven-van-hooft-metadata")
for k in sorted(str(x) for x in dataset.keys()):
    print(k)

Keys in turn may be associated with one or more **values**. Such a key/value
pair is then called **annotation data**.
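
To make the relation between sets, keys and values concrete, here is a small
plain-Python illustration. It does not use the STAM library and is not how
STAM stores things internally; the `AnnotationDatum` tuple is a name we made up
for this sketch. The set names, key names and values are taken from the table
above, but the particular combination of items is only an example.

```python
from collections import defaultdict, namedtuple

# One item of annotation data: a value for a key, where the key lives in a set.
AnnotationDatum = namedtuple("AnnotationDatum", ["set_id", "key", "value"])

# Illustrative items only.
items = [
    AnnotationDatum("brieven-van-hooft-categories", "part", "greeting"),
    AnnotationDatum("brieven-van-hooft-categories", "part", "closing"),
    AnnotationDatum("brieven-van-hooft-categories", "business", True),
    AnnotationDatum("brieven-van-hooft-metadata", "gender", "female"),
]

# Collect the distinct values found under each (set, key) pair.
values_by_key = defaultdict(set)
for item in items:
    values_by_key[(item.set_id, item.key)].add(item.value)

for set_id, key in sorted(values_by_key):
    print(set_id, key, sorted(values_by_key[(set_id, key)], key=str))
```

The same key name can occur in different sets (for instance `function` in both
the metadata and the categories set), which is why a key is always identified
together with its set.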

If you want to explore all the values (annotation data) that exist for a given
set and key, then you can adapt and run the following code. This code by
default shows all the manually annotated Part-of-Speech tags that occur in the
data, together with the number of annotations in which each tag occurs:

In [8]:
#first we set the dataset and the key we want to query, you can adapt this to query for other sets and keys:
dataset = store.dataset("gustave-pos")
key = dataset.key("class")

#then we obtain the data, the frequency, sort it, and render it, don't worry if you don't understand this part entirely
tabulate(sorted(((str(data), data.annotations_len()) for data in key.data())), tablefmt="html")

Or if you want this sorted by frequency instead, you can adapt and run the following:

In [None]:
#don't worry if you don't understand this line
tabulate(sorted(((str(data), data.annotations_len()) for data in key.data()), key=lambda x: -1 * x[1]), tablefmt="html")
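
The difference between the two sort orders is easier to see on toy data. The
sketch below uses only the standard library and a handful of invented tag
observations (they are not taken from the corpus). It sorts the same
(label, count) pairs once alphabetically and once by descending frequency,
using the same negated-count key as the cell above.

```python
from collections import Counter

# Invented tag observations, purely for illustration.
observed = ["N", "WW", "N", "LID", "N", "WW"]
counts = Counter(observed)

# Alphabetical by label: tuples are compared on their first element first.
by_label = sorted(counts.items())

# By descending frequency: negate the count so the largest comes first.
by_frequency = sorted(counts.items(), key=lambda x: -1 * x[1])

print(by_label)
print(by_frequency)
```

Passing `reverse=True` with `key=lambda x: x[1]` would give the same descending
order; negating the count is simply the variant used in this notebook.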

To give an impression of the metadata available in this project, we will give an overview of the metadata for each letter:

In [None]:
dataset = store.dataset("brieven-van-hooft-metadata")
key = dataset.key("dbnl_id")

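#sort on the identifier string only, so the data objects themselves never need to be compared
for dbnl_id, data in sorted(((str(data), data) for data in key.data()), key=lambda x: x[0]):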
    annotation = next(data.annotations())
    print(dbnl_id)