**Creating a DataFrame out of the XML version of the Ritter catalogues**

This script creates a DataFrame object and exports it as a .csv file.

It is designed for the XML edition of the Ritter catalogues created via NER by Alexia Schneider, in which the named entities are tagged as '<persName>', '<date>' and '<placeName>' (the element names that the parsing code searches for).

It expects the following folder structure: root folder > one subfolder per volume, with 'A121078' in its name > a direct subfolder named 'xml_w_ner' > XML files whose names end with the three-character page number (e.g. 'A121078-2001.xml').
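The layout can be checked before running the script. The following is a minimal sketch, assuming the folder structure described above; `find_ritter_xml` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def find_ritter_xml(root):
    """List the XML files stored under <root>/<...A121078...>/xml_w_ner/.

    Hypothetical helper: it only lists files, it does not parse them.
    """
    root = Path(root)
    # One glob pattern encodes the whole expected layout:
    # volume folder with 'A121078' in its name > 'xml_w_ner' > *.xml
    return sorted(
        p for p in root.glob("*A121078*/xml_w_ner/*.xml") if p.is_file()
    )
```

Calling `find_ritter_xml(path)` on the data folder should return one entry per page file; an empty list suggests that the folder names do not match the expected layout.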

The pandas DataFrame makes it possible to handle this amount of data with statistical methods, for instance to explore the distribution of the years mentioned in the catalogues.
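As an illustration of such an exploration, here is a minimal sketch that counts four-digit years in a column of date lists. The toy data and the year pattern (1400 to 1999) are assumptions made for the example; they are not taken from the real catalogue.

```python
import pandas as pd

# Toy stand-in for the 'dates' column: each cell holds a list of strings.
toy = pd.DataFrame(
    {"dates": [["1524", "1524", "XVe siècle"], ["1589", "1535"], []]},
    index=["a-001", "a-002", "a-003"],
)

# One row per mention; empty lists become NaN and are dropped.
mentions = toy["dates"].explode().dropna().astype(str)

# Keep only mentions containing a four-digit year (assumed range 1400-1999).
years = mentions.str.extract(r"\b(1[4-9]\d{2})\b")[0].dropna().astype(int)

# Number of mentions per year, in chronological order.
year_counts = years.value_counts().sort_index()
print(year_counts)
```

Dates written as centuries or in roman numerals (e.g. 'XVe siècle', 'M.D.LXXVII') are ignored by this pattern and would need separate handling.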

In [8]:
#installing required modules
%pip install bs4
%pip install lxml
%pip install requests

Note: you may need to restart the kernel to use updated packages.

In [1]:
#importing required modules
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
from pathlib import Path

In [30]:
# tiny helper to normalize a path (expand '~' and resolve it to an absolute path) before it is used in the functions below:

def safe_path(raw_path: str) -> str:
    return str(Path(raw_path).expanduser().resolve())

In [19]:
# define the folder path
path = r'C:\Users\pbrusa\Documents\EDA_incunables\Ritter_data'

In [20]:
# create an empty dataframe with the desired columns
frame_noindex = pd.DataFrame(columns = ['names','dates','places', 'volume', 'page'], dtype = object)
# set the desired column as index
frame = frame_noindex.set_index(['page'])
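A possible alternative to filling an empty frame row by row is to collect one record per page and build the frame in a single call. This is only a sketch of that design choice, not the method used in this notebook; the records and the name `alt_frame` are hypothetical.

```python
import pandas as pd

# Hypothetical records, shaped like the rows this script produces.
rows = [
    {"page": "1-001", "names": [], "dates": ["XVe siècle"],
     "places": ["ALSACE"], "volume": "1"},
    {"page": "1-003", "names": ["H. HEITZ"], "dates": [],
     "places": ["STRASBOURG"], "volume": "1"},
]

# Build the frame once from the collected records, then index it by page.
alt_frame = pd.DataFrame(
    rows, columns=["page", "names", "dates", "places", "volume"]
).set_index("page")
```

Building the frame once avoids repeated row insertion, which tends to get slower as the frame grows.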

In [21]:
# define the ritterframe function to create a DataFrame out of the Ritter data folder
# takes a raw string with a path and runs parsevol (see below) on every subfolder with 'A121078' in its name
# if it contains Alexia Schneider's XML + NER data (i.e. an 'xml_w_ner' subfolder)
def ritterframe(path) :
    # normalize path for python
    good_path = safe_path(path)

    # go through the folder
    for i in os.listdir(good_path) :
        folder = os.path.join(good_path, i)
        # test the subfolder name only, not the whole path
        if os.path.isdir(folder) and 'A121078' in i :
            directory_path = os.path.join(folder, 'xml_w_ner')
            # skip volumes that have no 'xml_w_ner' subfolder
            if os.path.isdir(directory_path) :
                parsevol(directory_path, folder)

In [26]:
# define the function parsevol
# it extracts the data of one volume (i.e. a folder containing xml files) into the dataframe:
# one row per page (index: volume-page), with the columns names, dates, places, volume

def parsevol(directory_path, folder) :
    # normalize directory path using safe_path
    clean_path = safe_path(directory_path)
    # determine the volume from the folder name
    # take the last component of 'folder' (the last component of directory_path would always be 'xml_w_ner')
    last_folder = folder.split(os.sep)[-1]
    # find the signature; it starts with 'A'
    code_start = last_folder.find('A')
    suffix = last_folder[code_start:]
    # keep everything after the first hyphen, without surrounding whitespace
    volume_raw = suffix.split('-', maxsplit=1)[-1].strip()
    # some volume labels carry a roman numeral after a space (e.g. '7 IV'):
    # split on whitespace; the first token is the number, the second may contain roman numerals
    tokens = volume_raw.split()
    # the first token is the volume number (kept as a string)
    left   = tokens[0]
    # keep only the roman-numeral characters of the second token (if any)
    roman  = "".join(ch for ch in tokens[1] if ch in "IVXLCDM") if len(tokens) > 1 else ""
    # final volume label, e.g. '1' or '7 IV'
    volume_number = left if not roman else f"{left} {roman}"
    
    # loop over the files in the directory
    # only .xml files are parsed, so stray CSV files in the folder are skipped
    for file in os.listdir(clean_path) :
        # full path of the file
        file_path = os.path.join(clean_path, file)
        if os.path.isfile(file_path) and file.lower().endswith('.xml') :
            # open, read and soup the file
            with open(file_path, 'r', encoding='utf8') as xml :
                contents = xml.read()
            soup = BeautifulSoup(contents, 'xml')
            # parse for names
            names = soup.find_all('persName')
            # parse for dates
            dates = soup.find_all('date')
            # parse for places
            places = soup.find_all('placeName')

            # file name without its extension
            file_name, _ = os.path.splitext(os.path.basename(file))

            # create a reference id for the index row:
            # the last 3 characters of the file name are the page number
            page_number = file_name[-3:]
            # prefix it with the volume
            reference = volume_number + '-' + page_number

            # add a new row to the dataframe, with an empty list for each data column
            frame.loc[reference] = [[], [], [], volume_number]
            # fill the lists stored in the new row
            for name in names :
                frame.loc[reference, 'names'].append(name.text)
            for date in dates :
                frame.loc[reference, 'dates'].append(date.text)
            for place in places :
                frame.loc[reference, 'places'].append(place.text)
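The naming logic of parsevol can be tried out on its own, without any files. The sketch below mirrors the volume and page parsing in two hypothetical helpers (`volume_label`, `page_reference`); the folder names in the examples are assumptions about the naming scheme, and the file name is the example given in the introduction.

```python
def volume_label(folder_name):
    """Mirror the volume parsing of parsevol on a bare folder name."""
    # The signature starts at the first 'A'.
    suffix = folder_name[folder_name.find("A"):]
    # Everything after the first hyphen is the volume part.
    volume_raw = suffix.split("-", maxsplit=1)[-1].strip()
    tokens = volume_raw.split()
    left = tokens[0]
    # Keep only roman-numeral characters of the second token, if any.
    roman = "".join(ch for ch in tokens[1] if ch in "IVXLCDM") if len(tokens) > 1 else ""
    return left if not roman else f"{left} {roman}"


def page_reference(volume, file_name):
    """Build the index key: volume label + last 3 characters of the file stem."""
    stem = file_name.rsplit(".", 1)[0]
    return f"{volume}-{stem[-3:]}"


print(volume_label("A121078-1"))               # 1
print(volume_label("A121078-7 IV"))            # 7 IV
print(page_reference("2", "A121078-2001.xml")) # 2-001
```

Because the page number is taken as the last three characters, file names with longer page numbers would be truncated; this matches the behaviour of parsevol as written.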


In [31]:
ritterframe(path)
frame

page,names,dates,places,volume
1-001,[],"[AUX, XVe XVIe SIÈCLES, XVe siècle, XVIe s.,...",[ALSACE],1
1-002,[],[],[],1
1-003,[H. HEITZ],"[XVe, XVI SIÈCLES]","[ALSACE, STRASBOURG, STRASBOURG]",1
1-004,[],[],[],1
1-005,[],[],[Hain],1
...,...,...,...,...
7 IV-547,"[Bernard Jobin, Bernard Jobin, Mathias, Jacque...","[M.D.LXXVII, M.D.LXXXIX, 1589, 1535]","[Villequiers, Königs]",7 IV
7 IV-548,"[Wolfgang Köpfel, Jean Schwan, Christian, Thi...","[1524, 1524, 1524, 1524, 1592]","[Strasbourg, Strasbourg, Strasbourg, Strasbour...",7 IV
7 IV-549,[BONTEMPS],[1960],[],7 IV
7 IV-550,[],[],[],7 IV


In [33]:
# save the dataframe into the desired path
frame.to_csv(r'C:\Users\pbrusa\Documents\EDA_incunables\Ritter_data\Ritter_dataframe.csv')
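Note that a CSV file stores each list as its text representation, so the cells come back as strings when the file is read again. The following sketch, run on toy data with an in-memory buffer instead of the real file, shows one way to restore the lists with `ast.literal_eval`; the variable names are hypothetical.

```python
import ast
import io

import pandas as pd

# Toy frame shaped like the exported one (illustrative values only).
toy = pd.DataFrame(
    {"names": [["H. HEITZ"], []], "volume": ["1", "1"]},
    index=pd.Index(["1-003", "1-004"], name="page"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read everything back as text: the list cells are now plain strings.
raw = pd.read_csv(buffer, index_col="page", dtype=str)

# Convert the text representation back into Python lists.
restored = raw.copy()
restored["names"] = restored["names"].apply(ast.literal_eval)
```

The same conversion would have to be applied to each list column (names, dates, places) of the exported file.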