Practicals 1: Exploring the digital chant ecosystem
===================================================

Getting our hands dirty, we will take a look around the digital chant ecosystem: the Cantus Database and Cantus Index. Then we start working with the chant data they contain: we look at some basic statistics of the dataset and visualise it. This will all happen in a Google Colab environment, so there is no need to install anything! All you need is an internet connection; no programming or math background is required (but it may be a good idea to do a Python tutorial ahead of time).

Installing requirements
-----------------------

We need some libraries that are not installed by default.

In [None]:
!pip install scipy
!pip install matplotlib
!pip install copia
!pip install pycantus

Collecting copia
  Downloading copia-0.1.4-py3-none-any.whl.metadata (2.0 kB)
Collecting matplotlib>=3.3.2 (from copia)
  Downloading matplotlib-3.10.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (11 kB)
Collecting seaborn>=0.11.0 (from copia)
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting tqdm>=4.48.0 (from copia)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting pytest>=6.2.2 (from copia)
  Downloading pytest-8.3.5-py3-none-any.whl.metadata (7.6 kB)
Collecting contourpy>=1.0.1 (from matplotlib>=3.3.2->copia)
  Downloading contourpy-1.3.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib>=3.3.2->copia)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib>=3.3.2->copia)
  Downloading fonttools-4.57.0-cp311-cp311-macosx_10_9_universal2.whl.metadata (102 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib>=3.3.2->copia)
  Using cached kiwisol

Getting data
------------

We work with the CantusCorpus v0.2 dataset, which is derived from the Cantus Database. It contains records of almost 500,000 instances of chants, hundreds of records of manuscripts (sources), and a number of auxiliary files that encode controlled vocabularies for certain data.

CantusCorpus is somewhat dated (2020), and we are working on a more comprehensive dataset. Its advantage is that previous results for this data exist, and this stability is especially valuable in the context of learning.

In [1]:
!wget https://ufallab.ms.mff.cuni.cz/~hajicj/public_html/DH-Latvia_2025/data/cantuscorpus-v0.2.zip

--2025-05-03 18:22:39--  https://ufallab.ms.mff.cuni.cz/~hajicj/public_html/DH-Latvia_2025/data/cantuscorpus-v0.2.zip
Resolving ufallab.ms.mff.cuni.cz (ufallab.ms.mff.cuni.cz)... 195.113.18.181
Connecting to ufallab.ms.mff.cuni.cz (ufallab.ms.mff.cuni.cz)|195.113.18.181|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51685924 (49M) [application/zip]
Saving to: ‘cantuscorpus-v0.2.zip’


2025-05-03 18:22:43 (17.4 MB/s) - ‘cantuscorpus-v0.2.zip’ saved [51685924/51685924]



In [2]:
!unzip cantuscorpus-v0.2.zip

Archive:  cantuscorpus-v0.2.zip
   creating: cantuscorpus-v0.2/
   creating: cantuscorpus-v0.2/csv/
  inflating: cantuscorpus-v0.2/csv/office.csv  
  inflating: cantuscorpus-v0.2/csv/chant-demo-sample.csv  
  inflating: cantuscorpus-v0.2/csv/indexer.csv  
  inflating: cantuscorpus-v0.2/csv/chant.csv  
  inflating: cantuscorpus-v0.2/csv/feast.csv  
  inflating: cantuscorpus-v0.2/csv/century.csv  
  inflating: cantuscorpus-v0.2/csv/orig_id.csv  
  inflating: cantuscorpus-v0.2/csv/notation.csv  
  inflating: cantuscorpus-v0.2/csv/siglum.csv  
  inflating: cantuscorpus-v0.2/csv/.feast.csv.swp  
  inflating: cantuscorpus-v0.2/csv/provenance.csv  
  inflating: cantuscorpus-v0.2/csv/genre.csv  
  inflating: cantuscorpus-v0.2/csv/source.csv  
  inflating: cantuscorpus-v0.2/.DS_Store  
   creating: cantuscorpus-v0.2/antiphons/
  inflating: cantuscorpus-v0.2/antiphons/test-chants.csv  
  inflating: cantuscorpus-v0.2/antiphons/gregobase-chantstrings.csv  
  inflating: cantuscorpus-v0.2/antiphons/

Loading CantusCorpus
--------------------

The dataset is in a CSV format, which is just a very simple way to store a table: each line of the file is a table row, and a "separator" character is defined -- usually a comma (CSV stands for comma-separated values), sometimes a tab (then it can be called TSV, tab-separated values). The file also usually has a header row that contains the names of the table columns.
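
To make this concrete, here is a minimal sketch that parses the same tiny table once as CSV and once as TSV, using Python's built-in `csv` module. The sample rows and the helper name `read_table` are made up for illustration; they are not taken from CantusCorpus.

```python
import csv
import io

# Two tiny made-up tables with identical content:
# one comma-separated, one tab-separated. The first line is the header row.
CSV_TEXT = 'id,incipit,mode\n1,Ave Maria,1\n2,Salve Regina,5\n'
TSV_TEXT = 'id\tincipit\tmode\n1\tAve Maria\t1\n2\tSalve Regina\t5\n'


def read_table(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    # io.StringIO lets us treat a string as if it were an open file.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_table(CSV_TEXT, ',')
tsv_rows = read_table(TSV_TEXT, '\t')

print(csv_rows[0])
print('Same table either way:', csv_rows == tsv_rows)
```

Note that every value comes back as a string, including the numbers: a CSV file carries no type information, so any conversion is up to us.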

Let's load all the chants first.

First, we set up the path into the dataset.

In [3]:
import os
import logging

# This is the directory into which we downloaded CantusCorpus
DATA_ROOT = 'cantuscorpus-v0.2'

# The 'chant.csv' file is where the chant data is.
DATASET_CSV_NAME = os.path.join(DATA_ROOT, 'csv', 'chant.csv')

# It is pretty good practice to check that you got key file paths right -- that the files actually exist.
if not os.path.isfile(DATASET_CSV_NAME):
    raise ValueError('CantusCorpus dataset CSV file not found at path {}. Is the DATA_ROOT set correctly?'
                     ' Current location: {}'.format(DATASET_CSV_NAME, os.getcwd()))


Now we know what file we are loading from, so we could just go ahead and use the above definition of a CSV file to read it and parse the individual lines.

But instead, we first make a small digression.

The plain-text nature of CSVs is great for interoperability and portability, and most programming languages have built-in or standard-library support for dealing with this type of file. That makes CSVs an excellent choice for storing and disseminating research data in repositories such as the Open Science Framework or DARIAH-related infrastructure (e.g., LINDAT).

The minimal structure, however, comes at a cost. What happens when a text field, e.g. a manuscript description, contains the separator character (a comma)? The obvious solution is quoting. But what if the field contains a quote? These issues are what the libraries for handling CSV files resolve, at least to the extent to which it is possible. But when encoding data into CSV files, one should be aware of these issues, and ideally have a round-trip test to make sure that there is a method to load the data properly, then store it back as a CSV file, and that the resulting file has exactly the same contents as the original file.
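
A round-trip test of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: the rows are made up, the helper names `write_csv` and `read_csv` are hypothetical, and the test starts from in-memory rows rather than from an existing file. The idea is the same, though: write, read back, write again, and compare.

```python
import csv
import io

# Made-up rows with awkward content: a comma, quotes, and a newline inside fields.
FIELDNAMES = ['siglum', 'description']
ROWS = [
    {'siglum': 'A-1', 'description': 'Antiphoner, incomplete'},
    {'siglum': 'B-2', 'description': 'The so-called "winter part"'},
    {'siglum': 'C-3', 'description': 'First line\nsecond line'},
]


def write_csv(rows, fieldnames):
    """Serialise rows to CSV text using the csv module's default settings."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text):
    """Parse CSV text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text, newline='')))


first_pass = write_csv(ROWS, FIELDNAMES)
loaded = read_csv(first_pass)
second_pass = write_csv(loaded, FIELDNAMES)

print('Data survived the round trip:', loaded == ROWS)
print('Files are identical:', first_pass == second_pass)
```

If either comparison fails, the reading and writing parameters do not match each other, and it is better to find that out on three rows than on half a million.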

So we actually want to use the appropriate library, and define the various parameters to load the file safely and properly.

In [27]:
# This is the library that works with CSVs in Python.
import csv

# Here we define the CSV loading parameters.
# As with the file paths, we use capital letters to clearly indicate which values
# are defined by us.
CSV_DELIMITER = ','
CSV_DOUBLEQUOTE = False
CSV_QUOTECHAR = '"'
CSV_DIALECT = 'unix'
CSV_QUOTING = csv.QUOTE_MINIMAL
# The dialect is a named bundle of default values for such parameters;
# the values we pass explicitly take precedence over it.
# Some of these apply when reading a CSV file, some apply when writing it back.
# The possible values of the quoting policy are defined and documented by the 'csv' library.

# Here is where we store all the chants, as a list.
chants = []

# Now we actually load them!
# This can take a while, as there are almost half a million of them.
# Opening the file with newline='' is what the 'csv' documentation recommends.
with open(DATASET_CSV_NAME, newline='') as fh:
    reader = csv.DictReader(fh,
                            delimiter=CSV_DELIMITER,
                            doublequote=CSV_DOUBLEQUOTE,
                            quotechar=CSV_QUOTECHAR,
                            dialect=CSV_DIALECT,
                            quoting=CSV_QUOTING)

    # At this point, we could just load the data, but because we want
    # to be very sure that we have been able to load the data properly,
    # we wrap it with this 'try-except' construction. This means:
    # If an error happens during the 'try' block, jump into the 'except'
    # block to deal with it. In this case, we simply want to know which row
    # the error occurred in, so that we know which line of the data file
    # we should check for errors.
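    try:
        for i, row in enumerate(reader):
            chants.append(row)
    except Exception:
        # The error is raised while reading the *next* row, so the problem
        # is in the row right after the last one that was loaded successfully.
        print('Error after {} successfully loaded rows!\nLast good row: {}'.format(
            len(chants), chants[-1] if chants else None))
        raise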

# We also want to remember what the field names are, to be able to quickly check
# what information is available for each chant.
fieldnames = reader.fieldnames
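
To see the 'try-except' pattern actually catch something, here is a small self-contained sketch on deliberately broken, made-up data. It differs from the cell above in a few ways that are choices of this example, not of the dataset: it uses the plain `csv.reader` with `strict=True` (so that malformed input raises `csv.Error` instead of being silently accepted), and it reports the position through the reader's `line_num` attribute. The helper name `load_rows` is hypothetical.

```python
import csv
import io

# Made-up data: line 3 has stray text after a closing quote.
BROKEN_CSV = (
    'id,incipit\n'
    '1,"Dixerunt impii"\n'
    '2,"Ave" maria\n'
    '3,"Salve regina"\n'
)


def load_rows(text):
    """Return (rows, error_line); error_line is None if everything parsed."""
    rows = []
    reader = csv.reader(io.StringIO(text, newline=''), strict=True)
    header = next(reader)
    try:
        for values in reader:
            rows.append(dict(zip(header, values)))
    except csv.Error as err:
        # line_num counts the lines read from the input so far,
        # so it points at the line on which parsing failed.
        print('Problem on line {}: {}'.format(reader.line_num, err))
        return rows, reader.line_num
    return rows, None


rows, error_line = load_rows(BROKEN_CSV)
print('Rows loaded before the problem:', len(rows))
print('Line to inspect:', error_line)
```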


Let's check how many chants exactly we got:

In [28]:
n_chants = len(chants)
print('Loaded {} chants.'.format(n_chants))

Loaded 497071 chants.


And what information is available for each of the chants:

In [29]:
print(fieldnames)

['id', 'incipit', 'cantus_id', 'mode', 'finalis', 'differentia', 'siglum', 'position', 'folio', 'sequence', 'marginalia', 'cao_concordances', 'feast_id', 'genre_id', 'office_id', 'source_id', 'melody_id', 'drupal_path', 'full_text', 'full_text_manuscript', 'volpiano', 'notes']


In [30]:
# Let's print it out a little better...
print('\n'.join(fieldnames))

id
incipit
cantus_id
mode
finalis
differentia
siglum
position
folio
sequence
marginalia
cao_concordances
feast_id
genre_id
office_id
source_id
melody_id
drupal_path
full_text
full_text_manuscript
volpiano
notes


So what does a chant record look like?

In [31]:
print(chants[0])

{'id': 'chant_000001', 'incipit': '#ne positi certamen habuistis mercedem', 'cantus_id': '007590.1', 'mode': '', 'finalis': '', 'differentia': '', 'siglum': 'US-HA Rauner Codex MS 003203', 'position': '', 'folio': '066r', 'sequence': '0.0', 'marginalia': '', 'cao_concordances': '', 'feast_id': 'feast_0227', 'genre_id': 'genre_r', 'office_id': 'office_m', 'source_id': 'source_519', 'melody_id': '', 'drupal_path': 'http://cantus.uwaterloo.ca/chant/693173/', 'full_text': '#ne positi certamen habuistis mercedem laboris ego reddam vobis', 'full_text_manuscript': '#ne positi certamen habuistis | Mercedem laboris ego reddam vobis', 'volpiano': '', 'notes': ''}


In [32]:
# This is not really readable. Let's define a function for this:
def print_chant(chant, fieldnames=fieldnames):
  for f in fieldnames:

    # Check that the field name that we expect is actually available in the chant record
    if f not in chant:
      # This would be a basic error message:
      #raise ValueError('Chant does not contain field name {}!'.format(f))
      # This is a better error message, because it shows both the problematic field name
      # and the fields that the chant record does have, so that you can straight away
      # start diagnosing where the problem is.
      raise ValueError('Chant does not contain field name {}!\n'
                       'Available fields: {}'.format(f, list(chant.keys())))
    value = chant[f]
    print('{}: {}'.format(f, value))

In [33]:
print_chant(chants[0])

id: chant_000001
incipit: #ne positi certamen habuistis mercedem
cantus_id: 007590.1
mode: 
finalis: 
differentia: 
siglum: US-HA Rauner Codex MS 003203
position: 
folio: 066r
sequence: 0.0
marginalia: 
cao_concordances: 
feast_id: feast_0227
genre_id: genre_r
office_id: office_m
source_id: source_519
melody_id: 
drupal_path: http://cantus.uwaterloo.ca/chant/693173/
full_text: #ne positi certamen habuistis mercedem laboris ego reddam vobis
full_text_manuscript: #ne positi certamen habuistis | Mercedem laboris ego reddam vobis
volpiano: 
notes: 


In [34]:
# And another chant:
print_chant(chants[123000])

id: chant_123001
incipit: Dixerunt impii apud se non
cantus_id: 006464
mode: 
finalis: 
differentia: 
siglum: CZ Pu VI.E.4c
position: 1.2
folio: 230r
sequence: 1.0
marginalia: 
cao_concordances: CGBEMVHRDFSL
feast_id: feast_0643
genre_id: genre_r
office_id: office_m
source_id: 
melody_id: 
drupal_path: http://cantus.uwaterloo.ca/chant/464369/
full_text: Dixerunt impii apud se non recte cogitantes circumveniamus justum quoniam contrarius est operibus nostris promittit se scientiam dei habere filium dei se nominat et gloriatur patrem se habere deum videamus si sermones illius veri sint et si est verus filius dei liberet illum de manibus nostris morte turpissima condempnemus eum
full_text_manuscript: 
volpiano: 
notes: 
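
Both records above leave several fields empty (`mode`, `volpiano` and `notes`, for instance). One possible way to check how widespread this is would be to count, for each field, how many records have a non-empty value. The sketch below does this on three abbreviated, made-up records; the helper name `count_filled_fields` is hypothetical, and the same function could be tried on the full `chants` list.

```python
from collections import Counter

# Three abbreviated, made-up records that mimic the structure printed above.
SAMPLE_CHANTS = [
    {'id': 'a', 'mode': '', 'volpiano': '', 'full_text': 'lorem ipsum'},
    {'id': 'b', 'mode': '1', 'volpiano': '', 'full_text': 'dolor sit amet'},
    {'id': 'c', 'mode': '', 'volpiano': '', 'full_text': ''},
]


def count_filled_fields(records):
    """Count, for every field name, how many records have a non-empty value."""
    filled = Counter()
    for record in records:
        for field, value in record.items():
            # Values that are empty or contain only whitespace count as empty.
            if value and value.strip():
                filled[field] += 1
    return filled


filled = count_filled_fields(SAMPLE_CHANTS)
for field in SAMPLE_CHANTS[0]:
    # A Counter returns 0 for fields that were never filled in.
    print('{}: {} of {}'.format(field, filled[field], len(SAMPLE_CHANTS)))
```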


Notice that when we click the `drupal_path` link, it does not work. This is because the Cantus Database changed its URL in 2023, and the `cantus.uwaterloo.ca` server has stopped redirecting, too. So, to make it easier for us to check things, we will change these values in our chants.

In [35]:
for c in chants:
  old_drupal_path = c['drupal_path']
  fixed_drupal_path = old_drupal_path.replace('http://cantus.uwaterloo.ca', 'https://cantusdatabase.org').rstrip('/')
  c['drupal_path'] = fixed_drupal_path
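
A rewrite like the one above can also be wrapped in a small function and tried on a few hand-written URLs. The sketch below is one way to do that; the function name `fix_drupal_path` is hypothetical, and unlike the loop above it only touches values that start with the old address and leaves everything else exactly as it was.

```python
OLD_PREFIX = 'http://cantus.uwaterloo.ca'
NEW_PREFIX = 'https://cantusdatabase.org'


def fix_drupal_path(url):
    """Rewrite an old Cantus Database URL; leave anything else untouched."""
    if not url.startswith(OLD_PREFIX):
        return url
    # Swap the prefix and drop the trailing slash, as in the loop above.
    return (NEW_PREFIX + url[len(OLD_PREFIX):]).rstrip('/')


examples = [
    'http://cantus.uwaterloo.ca/chant/464369/',  # old form, as stored in the dataset
    'https://cantusdatabase.org/chant/464369',   # already fixed
    '',                                          # missing value
]
for url in examples:
    print(repr(url), '->', repr(fix_drupal_path(url)))
```

Because already-fixed and empty values pass through unchanged, applying the function twice gives the same result as applying it once, which makes it safe to re-run the cell.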

Now the URLs should work.

In [36]:
print_chant(chants[123000])

id: chant_123001
incipit: Dixerunt impii apud se non
cantus_id: 006464
mode: 
finalis: 
differentia: 
siglum: CZ Pu VI.E.4c
position: 1.2
folio: 230r
sequence: 1.0
marginalia: 
cao_concordances: CGBEMVHRDFSL
feast_id: feast_0643
genre_id: genre_r
office_id: office_m
source_id: 
melody_id: 
drupal_path: https://cantusdatabase.org/chant/464369
full_text: Dixerunt impii apud se non recte cogitantes circumveniamus justum quoniam contrarius est operibus nostris promittit se scientiam dei habere filium dei se nominat et gloriatur patrem se habere deum videamus si sermones illius veri sint et si est verus filius dei liberet illum de manibus nostris morte turpissima condempnemus eum
full_text_manuscript: 
volpiano: 
notes: 


Issues such as this are why we should care very much about persistence and long-term storage in the digital humanities. The Cantus Database and its whole network get almost everything right, but problems like this one are still inevitable, because the digital environment evolves very quickly.