# 1.2 Import

## XML

### `BeautifulSoup`

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the task.

It can parse HTML and XML, including ill-formed or broken documents, which is very useful for legacy XML or even SGML data.
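
As a small illustration of that tolerance, the sketch below feeds `BeautifulSoup` a fragment whose `<b>` tag is never closed. The markup is made up for this example, and Python's built-in `html.parser` is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the <b> element is never closed
broken = "<page><line>Hello <b>world</line></page>"

# html.parser ships with Python; BeautifulSoup closes the dangling <b>
# itself when it meets </line>
soup = BeautifulSoup(broken, "html.parser")

line = soup.find("line")
print(line.get_text())  # the text of the line is still recoverable
```

A strict XML parser would typically reject this input outright, which is why a lenient parser is handy for legacy data.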

In [11]:
import os
import bs4
from bs4 import BeautifulSoup

In [3]:
data_folder = '../data/altoxml/'

In [28]:
# Let's collect the paths of the XML files.
# We keep only file names ending in `.xml`, which also
# skips stray files such as `.DS_Store` (on macOS).

xml_files = sorted(
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
)

In [6]:
xml_files

['../data/altoxml/27971740_1890-04-01_38_077_0_001.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_002.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_003.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_004.xml']

In [30]:
# Prefixing a line in a code cell with `!`
# tells Jupyter to run it as a shell command.
# Here we use the command `head` to peek at the first 100 lines
# of our XML file.

!head -n 100 ../data/altoxml/27971740_1890-04-01_38_077_0_001.xml

﻿<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
    <Description>
        <MeasurementUnit>pixel</MeasurementUnit>
        <OCRProcessing ID="IdOcr">
            <ocrProcessingStep>
                <processingDateTime>2014-10-22</processingDateTime>
                <processingSoftware>
                    <softwareCreator>ABBYY</softwareCreator>
                    <softwareName>ABBYY FineReader Engine</softwareName>
                    <softwareVersion>11</softwareVersion>
                </processingSoftware>
            </ocrProcessingStep>
        </OCRProcessing>
    </Description>
    <Styles>
        <TextStyle ID="font0" FONTFAMILY="Times New Roman" FONTSIZE="7" />
        <TextStyle ID="font1

In [14]:
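# 'html.parser' is Python's built-in parser; it lowercases tag and
# attribute names, hence the lowercase names used in the cells below
with open(xml_files[0], 'r', encoding='utf-8-sig') as inpfile:
    xml_doc = BeautifulSoup(inpfile, 'html.parser')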

In [20]:
textline_elements = xml_doc.find_all('textline')

In [26]:
# In ALTO, HPOS is the horizontal (x) position and VPOS the vertical (y) one
x = textline_elements[0].get('hpos')
y = textline_elements[0].get('vpos')
w = textline_elements[0].get('width')
h = textline_elements[0].get('height')

In [27]:
print(
    f'The coordinates of the first line are: {x} (x), {y} (y), {h} (height), {w} (width)'
)

The coordinates of the first line are: 242 (x), 466 (y), 86 (height), 611 (width)


Other operations worth knowing:

- how to find elements by ID
- how to get the value of an attribute
- how to navigate to children, parents, etc.
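
The three topics listed above can be sketched on a tiny ALTO-like fragment. The fragment and its ID values are invented for illustration; `html.parser` is used here, so tag and attribute names are matched in lower case.

```python
from bs4 import BeautifulSoup

# Invented ALTO-like fragment, not taken from the data folder
snippet = """
<TextBlock ID="block1">
  <TextLine ID="line1" HPOS="10" VPOS="20">
    <String ID="word1" CONTENT="Hello" />
    <String ID="word2" CONTENT="world" />
  </TextLine>
</TextBlock>
"""
soup = BeautifulSoup(snippet, "html.parser")

# 1. Find an element by ID: attributes can be passed to find() as keywords
word = soup.find(id="word2")

# 2. Read attribute values, with get() or dictionary-style access
content = word.get("content")
missing = word.get("style")      # get() returns None for absent attributes
same_content = word["content"]   # raises KeyError if the attribute is absent

# 3. Navigate the tree: parent, descendants, siblings
line = word.parent                               # the enclosing <textline>
words = line.find_all("string")                  # all <string> elements inside it
previous = word.find_previous_sibling("string")  # the word just before

print(content, line.get("id"), [w["content"] for w in words], previous["id"])
```

Using `get()` is the safer choice when an attribute may be missing, since it returns `None` instead of raising an error.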

### From XML to Dataframe

Let's now try to put all these things together to solve a real problem that you have already encountered, i.e. **turning a bunch of XML files into processable text**. Why can this be useful?

In [29]:
def parse_alto(filepath):
    """Convert each file to: fulltext (list of lines), wordcount, filename (id)"""
    pass
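
One possible way to fill in this stub is sketched below. It assumes that every `TextLine` contains `String` elements holding their text in a `CONTENT` attribute (the usual ALTO layout, though it is not visible in the excerpt shown earlier), and that files are read with `html.parser`, which lowercases tag and attribute names. The name `parse_alto_sketch`, the dictionary keys, and the sample document are choices made for this example only.

```python
import os
import tempfile
from bs4 import BeautifulSoup


def parse_alto_sketch(filepath):
    """Return a dict with the file's id, its lines of text, and a word count."""
    # utf-8-sig drops a leading byte order mark if the file has one
    with open(filepath, encoding="utf-8-sig") as inpfile:
        soup = BeautifulSoup(inpfile, "html.parser")

    lines = []
    for textline in soup.find_all("textline"):
        # Assumption: each word is a <String CONTENT="..."> element
        words = [s.get("content", "") for s in textline.find_all("string")]
        lines.append(" ".join(words))

    return {
        "filename": os.path.basename(filepath),
        "fulltext": lines,
        "wordcount": sum(len(line.split()) for line in lines),
    }


# Usage on an invented two-line document written to a temporary file
sample = """<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine><String CONTENT="Hello" /><SP /><String CONTENT="world" /></TextLine>
<TextLine><String CONTENT="Again" /></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    record = parse_alto_sketch(path)

print(record)
```

A list of such dictionaries, one per path in `xml_files`, can then be passed to `pandas.DataFrame` to obtain one row per file.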

## JSON