<img src="https://stijl.kuleuven.be/releases/latest/img/svg/logo.svg" alt="KU Leuven">

# **Exploring *CroALa* and Marko Marulić**

An exam project for Scripting Languages.

<div class="alert alert-block alert-danger">

<b> THIS NOTEBOOK IS A WORK IN PROGRESS </b>

</div>

This is a Jupyter notebook for the exam project for the [Scripting Languages \[G0W95B\]](https://onderwijsaanbod.kuleuven.be/2025/syllabi/e/G0W95BE.html) course of the [Digital Humanities](https://www.kuleuven.be/programmes/master-digital-humanities) programme at [KU Leuven](https://www.kuleuven.be/english/kuleuven/).

Author:
<br> Petar Soldo
<br> r1076709
<br> [petar.soldo@student.kuleuven.be](mailto:petar.soldo@student.kuleuven.be)

## About the project

The main goal of the project is to perform an "exploratory analysis of data" and "to (...) independently apply the programming techniques explored during the course".

For this purpose, *CroALa* was chosen as a dataset to be analyzed. 

Concretely, the project has two objectives:

1. Perform a short analysis of the documents in _CroALa_ based on their metadata.
2. Perform a short text analysis of selected works by Marko Marulić.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/da/Marko_Marulic_bust_-_lighting_fix.jpg/250px-Marko_Marulic_bust_-_lighting_fix.jpg" alt="Marko Marulić">

*Bust of Marko Marulić of Split, Croatian poet, by Ivan Meštrović*

*DIREKTOR (talk · contribs), CC0, via Wikimedia Commons*

## *CroALa*

### About *CroALa*

TBA

### What do I want to do?

I want to build a table with the metadata from the XML files in the repository.

To do this I will write a small piece of code that (i) opens every file and extracts the data we need into a dictionary, (ii) appends the dictionary to a list, (iii) turns the list into a *pandas* dataframe and (iv) exports the dataframe to a CSV file.

The data I want in my table is:
- name of the file
- title of the work
- name of the first author
- date related to the first author
- all mentioned authors and appurtenant dates
- editors of the edition
- languages attributed to the document
- date(s) of creation
- place(s) of creation
- typus
- genres

### Retrieving metadata about the documents

**NOTE**: This part of the project was largely based on the blog post [*Parsing TEI XML documents with Python*](https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/) by Maximilian Konzack (2019).

First, let's import all the libraries we need.

In [1]:
from bs4 import BeautifulSoup 
# The CSS selectors need Soup Sieve, which pip normally installs together with BeautifulSoup
import lxml  # not called directly; BeautifulSoup's 'xml' parser depends on it
from glob import glob
from os.path import basename
from pprint import pprint
import pandas as pd
import re

We define two functions:
1. *read_tei* for reading the files and
2. *e2t*, short for *element to text*, for extracting the text from an XML element.

Both of these functions were slightly adapted from the aforementioned project: https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/.

In [2]:
# Take an XML file and return it as a BeautifulSoup object
def read_tei(tei_file):
    with open(tei_file, 'r', encoding = 'utf-8') as tei:
        soup = BeautifulSoup(tei, 'xml')
        return soup
# We can use this new object to navigate the XML document

In [3]:
# Take an XML element and return just its contents
def e2t(elem, default=''):
    if elem:
        return re.sub(r'\s+', ' ', elem.getText(strip=True)) 
        # The regular expression collapses the frequently occurring runs of whitespace
        # and unusual line breaks that strip=True does not remove
    else:
        return default
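
To make the whitespace step in `e2t` concrete, here is a minimal sketch of what the regular expression does on its own. The helper name `normalize_ws` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def normalize_ws(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text)

# Hypothetical text node, wrapped and indented the way pretty-printed XML often is
raw = "Carmina occasionalia\n        e codice  Traguriensi"
print(normalize_ws(raw))
```

`strip=True` only trims the ends of each piece of text, so whitespace in the middle of a text node, like the line break above, would otherwise survive.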

### Extracting the elements

We can briefly demonstrate how this works. We will choose the first document from the `txts` directory.

Let's extract some simple metadata from it.

In [4]:
#This loads the document:
document = read_tei("txts/aa-vv-carm-occ-vd.xml")

#This finds the title:
print(e2t(document.find("title")))

#This finds the (first) author:
print(e2t(document.find("author")))

#This finds the creation date:
print(e2t(document.select_one("profileDesc creation date")))

Carmina occasionalia e codice Traguriensi Variorum Dalmaticorum, versio electronica
Auctores varii
1565-1650


Sometimes there is more than one piece of information we want to extract from an XML element. In this case the `select` method returns a list-like `ResultSet` containing all matching elements. It is important to note that the `getText()` method does not work on such a list, so if we wish to extract the text from multiple elements, we must iterate through them.

In [5]:
for author in document.select("titleStmt author"):
    print (e2t(author))

Auctores varii
Grauisius, Iacobus
Mladinić, Sebastijan1561/1563 - 1620-21
Mazarelli, Valerio
Statilić, Marinc. 1650
Pridojević, Ivanc. 1600
Vranius
Gaudentius
Matthaeus Desseus Ragusinus
Michael Racetinus


At first I thought of using the `find` and `find_all` methods, but `select` and `select_one` allow defining a path by simply writing the element names, separated by whitespace. Expressing the same path with `find/find_all` would require chaining the methods.

Thus the selection of all the `author` elements above, using `find_all` would look like this: `document.find("titleStmt").find_all("author")`. The `select(_one)` methods also have a nicer way to access an attribute value.

I thus find the CSS selector methods (`select/select_one`) much more elegant and sufficiently useful for this part of the project.

More about this issue:
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property
- https://stackoverflow.com/a/38033910
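
The comparison above can be tried out on a small example. The fragment below is made up, with element names mirroring those used in this notebook, and the sketch only illustrates the two styles side by side.

```python
from bs4 import BeautifulSoup

# Made-up TEI header fragment; element names mirror those used in this notebook
snippet = """<teiHeader>
  <fileDesc><titleStmt>
    <title>Example title</title>
    <author><persName>First, Author</persName></author>
    <author><persName>Second, Author</persName></author>
  </titleStmt></fileDesc>
  <profileDesc><textClass>
    <keywords scheme="typus"><term>poesis</term></keywords>
    <keywords scheme="genre"><term>poesis - elegia</term></keywords>
  </textClass></profileDesc>
</teiHeader>"""
doc = BeautifulSoup(snippet, "xml")

# The same path written as one CSS selector and as chained find calls
via_select = [a.get_text(strip=True) for a in doc.select("titleStmt author")]
via_find = [a.get_text(strip=True) for a in doc.find("titleStmt").find_all("author")]
print(via_select == via_find)

# Filtering on an attribute value: selector syntax vs. an attrs dictionary
typus_select = doc.select_one("keywords[scheme=typus] term")
typus_find = doc.find("keywords", attrs={"scheme": "typus"}).find("term")
print(typus_select.get_text(), typus_find.get_text())
```

Both styles return the same elements here; the selector version simply keeps the whole path in one string.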

### Checking how elements behave

Before we go on to extracting the metadata, we should first check what some of our elements look like. I was forced to perform this check *after* I started building a table, because I realized I was not extracting the information I wanted, or thought I was extracting. However, for the sake of clarity, I will explain what I checked for and how I did it before going on :)

The elements which needed inspecting were `author` and `creation`. They sometimes encode information in different ways, i.e. by using different tags.

We will first inspect the `author` tag. We make use of the `name` attribute to find the names of the child elements of the `author` element.

In [6]:
tagnames = []
# Iterate through documents and save all tag names to the tagnames list.
for filename in glob("txts/*.xml"):
    doc = read_tei(filename)
    if doc.select_one("titleStmt author"):
        for element in doc.select_one("titleStmt author"):
            tagnames.append(element.name)
    else:
        tagnames.append('')
# Make a set to leave only unique values
tagset = set(tagnames)
print(tagset)

{None, '', 'persName', 'date', 'ref', 'placeName', 'orgName'}
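
Two entries in this set are not tag names. The empty string comes from the `else` branch, i.e. from files without an `author` element. `None` most likely appears because iterating over an element also yields its text nodes, which have no tag name. A minimal sketch on a made-up `author` element shows the difference, and one possible way to keep only the child elements:

```python
from bs4 import BeautifulSoup

# Made-up author element with text between its child elements
snippet = "<author><persName>Bunić, Jakov</persName>, <date>1469-1534</date></author>"
author = BeautifulSoup(snippet, "xml").find("author")

# Iterating over a tag yields text nodes as well as elements
all_children = {child.name for child in author}

# find_all(True, recursive=False) keeps only the direct child elements
element_children = {child.name for child in author.find_all(True, recursive=False)}

print(all_children)
print(element_children)
```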


We now have a list (a set, to be more precise) of all tags which appear as children of the `author` tag. Next, we inspect which files these tags appear in and what they contain.

In [7]:
# Create lists to store the elements we want to inspect
ref= []
placeName = []
orgName = []
for file in glob("txts/*.xml"):
    doc = read_tei(file)
    # Find the elements and store them in a list, together with the filename
    if doc.select_one("titleStmt author ref"):
        ref.append([file, doc.select_one("titleStmt author ref")])
    if doc.select_one("titleStmt author placeName"):
        placeName.append([file, doc.select_one("titleStmt author placeName")])
    if doc.select_one("titleStmt author orgName"):
        orgName.append([file, doc.select_one("titleStmt author orgName")])

We can now see what type of data these tags contain and in which files we can find them. This can then be used to inspect the files themselves and decide how to extract the data.

In [8]:
pprint(ref[:3])
pprint(placeName[:3])
pprint(orgName[:3])

[['txts\\andreis-f-1529-02-15.xml',
  <ref target="http://www.wikidata.org/entity/Q16115490" type="wikidata">Andreis, Franjo Trankvil</ref>],
 ['txts\\cikulin-if-ideae.xml',
  <ref target="http://www.wikidata.org/entity/Q860595">Čikulin, Ivan Franjo</ref>],
 ['txts\\donat-mandel-sissiensis.xml',
  <ref target="donat01">Donati, Ivan</ref>]]
[['txts\\barletius-scodrensi-obsidione-1504.xml',
  <placeName>Skadar</placeName>],
 ['txts\\barletius-vita-castrioti-1508.xml', <placeName>Skadar</placeName>],
 ['txts\\goineo-gb-situistriae-1543.xml', <placeName>Piran</placeName>]]
[['txts\\aa-vv-carm-occ-vd.xml',
  <orgName ref="#varii-1650">Auctores varii</orgName>],
 ['txts\\aa-vv-epigr-mulla.xml',
  <orgName ref="#varii-1552">Auctores varii</orgName>],
 ['txts\\aa-vv-epigr-tres.xml',
  <orgName ref="#varii-1600">Auctores varii</orgName>]]


We repeat the process with the `creation` element.

In [9]:
tagnames_2 = []

for filename in glob("txts/*.xml"):
    doc = read_tei(filename)
    if doc.select_one("profileDesc creation"):
        for element in doc.select_one("profileDesc creation"):
            tagnames_2.append(element.name)
    else:
        tagnames_2.append('')
tagset_2 = set(tagnames_2)

address_c = []
placeName_c = []

for file in glob("txts/*.xml"):
    doc = read_tei(file)
    if doc.select_one("profileDesc creation address"):
        address_c.append([file, doc.select_one("profileDesc creation address")])
    if doc.select_one("profileDesc creation placeName"):
        placeName_c.append([file, doc.select_one("profileDesc creation placeName")])

In [10]:
pprint(tagset_2)
pprint(address_c[:3])
pprint(placeName_c[:3])

{None, 'date', 'placeName', 'address'}
[['txts\\aa-vv-carmina-vgc.xml',
  <address>
<addrLine>Romae</addrLine>
<addrLine>Ragusae</addrLine>
</address>],
 ['txts\\andreis-f-philos.xml',
  <address>
<addrLine>Cracoviae</addrLine>
<addrLine>Posnaniae</addrLine>
</address>],
 ['txts\\baricev-aa-epist-penzel.xml',
  <address>
<addrLine>Zagrabiae</addrLine>
</address>]]
[['txts\\aa-vv-supetarski.xml', <placeName>Split</placeName>],
 ['txts\\adam-parisius-vaticanum-officium-1059.xml',
  <placeName ref="http://www.wikidata.org/entity/Q1663">Split</placeName>],
 ['txts\\adam-radauanus-traditio.xml',
  <placeName ref="http://www.wikidata.org/entity/Q396372">Nin</placeName>]]


This could all be done in roughly a quarter of the time by placing all the queries under one "read file" loop, since each file is currently parsed four times. The steps were kept separate here to give a better idea of what the code does. I will probably join them in a later version of this notebook.
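
As a rough idea of what such a joined version could look like, here is a sketch that parses each file once and runs all the `select_one` queries on it. The function name `collect_elements` and the `SELECTORS` list are hypothetical and not part of the notebook.

```python
from collections import defaultdict
from glob import glob

from bs4 import BeautifulSoup

# Selectors inspected in the cells above
SELECTORS = [
    "titleStmt author ref",
    "titleStmt author placeName",
    "titleStmt author orgName",
    "profileDesc creation address",
    "profileDesc creation placeName",
]

def collect_elements(paths, selectors):
    """Parse each file once and store [path, element] for every selector that matches."""
    found = defaultdict(list)
    for path in paths:
        with open(path, "r", encoding="utf-8") as tei:
            doc = BeautifulSoup(tei, "xml")
        for selector in selectors:
            element = doc.select_one(selector)
            if element is not None:
                found[selector].append([path, element])
    return found

found = collect_elements(glob("txts/*.xml"), SELECTORS)
for selector, hits in found.items():
    print(selector, len(hits))
```

The result is a dictionary keyed by selector, so `found["titleStmt author ref"][:3]` would correspond to `ref[:3]` above.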

### Making a table out of the metadata

We can finally make a table out of the metadata. We initialize a list `croala_data`, to which we append one dictionary of metadata values for every XML file in our `txts` directory. This list of dictionaries is then used to create a pandas dataframe. In the end, we dump it to a CSV file.

In [11]:
croala_data = []

for filename in glob("txts/*.xml"):
    doc = read_tei(filename)
    
    # Extract titles
    titles = e2t(doc.select_one("titleStmt title"))

    # Extract first author name (stays empty if there is no author,
    # or if the author has neither an orgName nor a persName child)
    first_author = ''
    if doc.select_one("titleStmt author"):
        if doc.select_one("titleStmt author").find("orgName"):
            first_author = e2t(doc.select_one("titleStmt author orgName"))
        elif doc.select_one("titleStmt author").find("persName"):
            first_author = e2t(doc.select_one("titleStmt author persName"))

    # Extract first author date
    if doc.select_one("titleStmt author"):
        first_author_date = e2t(doc.select_one("titleStmt author").find("date"))
    else:
        first_author_date = ''
    
    # Extract all authors
    all_authors = [(re.sub(r'\s+', ' ', a.get_text(separator = ", ", strip=True))) for a in doc.select("author")]

    # Extract editors
    editors = [e2t(e) for e in doc.select("titleStmt editor persName ref")]

    # Extract language
    language = [l["ident"] for l in doc.select("profileDesc langUsage language")]

    # Extract place
    if doc.select_one("profileDesc creation"):
        if doc.select("profileDesc creation placeName"):
            place = [e2t(pl) for pl in doc.select("profileDesc creation placeName")]
        elif doc.select_one("profileDesc creation address"):
            place = [e2t(add) for add in doc.select("profileDesc creation addrLine")]
        else:
            place = ''
    else:
        place = ''

    # Extract date
    date = [e2t(d) for d in doc.select("profileDesc creation date")]

    # Extract typus
    typus = e2t(doc.select_one("textClass keywords[scheme=typus] term"))

    # Extract genres
    genres = [e2t(g) for g in doc.select("textClass keywords[scheme=genre] term")]

    # Append as a dictionary
    croala_data.append({
        "filename": basename(filename),
        "titles": titles,
        "first_author": first_author,
        "first_author_date": first_author_date,
        "all_authors": all_authors,
        "editors": editors,
        "language": language,
        "date": date,
        "place": place,
        "typus": typus,
        "genres": genres
    })

# Convert to a dataframe
croala_df = pd.DataFrame(croala_data)
croala_df.head()

| | filename | titles | first_author | first_author_date | all_authors | editors | language | date | place | typus | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aa-vv-carm-occ-vd.xml | Carmina occasionalia e codice Traguriensi Vari... | Auctores varii | | [Auctores varii, Grauisius, Iacobus, Mladinić,... | [Neven Jovanović] | [lat] | [1565-1650] | | poesis | [poesis - epigramma, poesis - elegia] |
| 1 | aa-vv-carmina-vgc.xml | Carmina minora ex libro De vita et gestis Chri... | Bunić, Jakov | 1469-1534 | [Bunić, Jakov, 1469-1534, Caluus, Hieronymus, ... | [Neven Jovanović] | [lat] | [a. 1502--1526] | [Romae, Ragusae] | poesis | [poesis - carmen, poesis - epigramma, poesis -... |
| 2 | aa-vv-epigr-mulla.xml | Ad clarissimum uirum dominum Benedictum de Mul... | Auctores varii | | [Auctores varii, Martinčić, Jerolim, Alberti, ... | [Neven Jovanović] | [lat] | [1549-1552] | | poesis | [poesis - epigramma, poesis - encomium] |
| 3 | aa-vv-epigr-natal.xml | Epigrammata in codice Natalis, versio electronica | Kabalin, Grgur | | [Kabalin, Grgur, Tolimerić, Ilija, m. 1537?] | [Miroslav Marcovich] | [lat] | [post 1536] | | poesis | [poesis - carmen, poesis - elegia, poesis - ep... |
| 4 | aa-vv-epigr-tres.xml | Tres invicem epigrammata, versio electronica | Auctores varii | | [Auctores varii, Kabalin, Grgur, Chrysogonus] | [Neven Jovanović] | [lat] | [c. 1600.] | | poesis | [poesis - epigramma] |


In [13]:
#Export to a csv file
croala_df.to_csv("croala_metadata.csv", index = False, encoding = 'utf-8')
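
One thing to keep in mind when the file is loaded again: CSV has no list type, so list-valued columns such as `language` or `place` are written as their string representation. The sketch below shows one possible way to restore them on reading. The helper `parse_list` and the two sample rows are made up for illustration, and an in-memory buffer stands in for the real file.

```python
import ast
import io

import pandas as pd

def parse_list(cell):
    """Turn a cell such as "['lat']" back into a list; empty cells become []."""
    return ast.literal_eval(cell) if cell else []

# Two made-up rows shaped like the metadata table
df = pd.DataFrame([
    {"filename": "a.xml", "language": ["lat"], "place": ["Romae", "Ragusae"]},
    {"filename": "b.xml", "language": ["lat", "grc"], "place": ""},
])

buffer = io.StringIO()  # stands in for croala_metadata.csv
df.to_csv(buffer, index=False)
buffer.seek(0)

# Apply the helper to the list-valued columns while reading
restored = pd.read_csv(buffer, converters={"language": parse_list, "place": parse_list})
print(restored.loc[0, "place"])
print(restored.loc[1, "place"])
```

The guard for empty cells matters for `place`, which holds an empty string whenever no place of creation was found.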