# 1.2 Import

One "Swiss Army knife" tool (or library) for each format:
- XML: `BeautifulSoup`
- JSON: `jq`
- CSV: `csvkit`

<img src='https://cdn-images-1.medium.com/max/1600/1*Emm10TxVEOvWqwF9oPJb1w.jpeg' width='300px'>

## XML

### `BeautifulSoup`

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the task.

It can parse HTML and XML, including ill-formed or broken XML documents (very useful for legacy XML or even SGML data).
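
As a small illustration of that tolerance, here is a minimal sketch that feeds a deliberately broken snippet to `BeautifulSoup`. The snippet and its tag names are made up for this example, and `html.parser` (the parser bundled with Python) is chosen only so that no extra dependency is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical broken snippet: the <line> element is never closed.
broken_xml = "<page><line>Hello world</page>"

# "html.parser" ships with Python; it is lenient and repairs the tree
# instead of raising an error on the missing closing tag.
soup = BeautifulSoup(broken_xml, "html.parser")

line = soup.find("line")
print(line.get_text())
```

A strict XML parser would typically reject this input, whereas here the text inside `<line>` is still reachable.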

In [4]:
import os
import bs4
from bs4 import BeautifulSoup

In [5]:
data_folder = '../data/altoxml/'

In [6]:
# let's get the paths of the XML files
# we keep only files whose name ends with `.xml`
# this is useful to ignore e.g. `.DS_Store` files (under macOS)

xml_files = [
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
]

In [7]:
xml_files

['../data/altoxml/27971740_1890-04-01_38_077_0_003.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_002.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_001.xml',
 '../data/altoxml/27971740_1890-04-01_38_077_0_004.xml']
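
Two details of file listing are easy to miss: a substring test such as `".xml" in name` also accepts names like `notes.xml.bak`, and `os.listdir` returns files in arbitrary order. The sketch below compares a substring test with a suffix-based `pathlib` glob on a throwaway folder; all file names in it are hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Build a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["page_002.xml", "page_001.xml", "notes.xml.bak", ".DS_Store"]:
        Path(tmp, name).touch()

    # Substring test: also picks up "notes.xml.bak".
    substring_match = sorted(f for f in os.listdir(tmp) if ".xml" in f)

    # Suffix test via glob: only names that end with ".xml".
    suffix_match = sorted(p.name for p in Path(tmp).glob("*.xml"))

print(substring_match)
print(suffix_match)
```

Sorting the result makes the order of files reproducible across machines.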

In [8]:
# prefixing a line in a code cell with `!`
# tells Jupyter to execute it as a shell command.
# Here we use the command `head` to peek at the first 50 lines
# of our XML file.

!head -n 50 ../data/altoxml/27971740_1890-04-01_38_077_0_001.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
    <Description>
        <MeasurementUnit>pixel</MeasurementUnit>
        <OCRProcessing ID="IdOcr">
            <ocrProcessingStep>
                <processingDateTime>2014-10-22</processingDateTime>
                <processingSoftware>
                    <softwareCreator>ABBYY</softwareCreator>
                    <softwareName>ABBYY FineReader Engine</softwareName>
                    <softwareVersion>11</softwareVersion>
                </processingSoftware>
            </ocrProcessingStep>
        </OCRProcessing>
    </Description>
    <Styles>
        <TextStyle ID="font0" FONTFAMILY="Times New Roman" FONTSIZE="7" />
        <TextStyle ID="font1"

In [9]:
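with open(xml_files[0], 'r') as inpfile:
    # name the parser explicitly: without it, bs4 picks whichever
    # parser is installed and results may differ between machines
    xml_doc = BeautifulSoup(inpfile, 'lxml')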

In [10]:
xml_doc

<html><body><p>﻿<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
<description>
<measurementunit>pixel</measurementunit>
<ocrprocessing id="IdOcr">
<ocrprocessingstep>
<processingdatetime>2014-10-22</processingdatetime>
<processingsoftware>
<softwarecreator>ABBYY</softwarecreator>
<softwarename>ABBYY FineReader Engine</softwarename>
<softwareversion>11</softwareversion>
</processingsoftware>
</ocrprocessingstep>
</ocrprocessing>
</description>
<styles>
<textstyle fontfamily="Courier New" fontsize="9" id="font0"></textstyle>
<textstyle fontfamily="Times New Roman" fontsize="5" id="font1"></textstyle>
<textstyle fontfamily="Times New Roman" fontsize="8" id="font2"></textstyle>
<textstyle fontfamily="Times New Roman" fontsi
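
Note that the tag and attribute names in this output are all lowercase, although the file itself uses names like `TextStyle` and `ID`. This is what HTML-oriented parsers do, and it is why the searches in this notebook use lowercase names. A minimal sketch of the effect, using a made-up snippet and Python's bundled `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical ALTO-like snippet with mixed-case names.
snippet = '<TextBlock ID="Page1_Block1"><TextLine HPOS="242" VPOS="466"/></TextBlock>'
soup = BeautifulSoup(snippet, "html.parser")

# Names are lowercased by the HTML parser ...
block = soup.find("textblock")
print(block.name, block.attrs)

# ... so a search with the original casing finds nothing.
print(soup.find("TextBlock"))
```

If case must be preserved, Beautiful Soup can also be asked for an XML parser with `BeautifulSoup(markup, "xml")`, which requires `lxml`; the lowercase searches in this notebook would then need the original casing.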

### Finding elements

Finding the `<textblock>` element with `@id` = `Page1_Block1`:

In [11]:
xml_doc.find_all?

In [15]:
target_element = xml_doc.find_all(
    'textblock',
    attrs={'id': 'Page1_Block1'}
)

In [17]:
# by definition, there should exist exactly one element
# with a given ID within the same document
assert len(target_element) == 1
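
When exactly one match is expected, `find` can be more convenient than `find_all`: it returns the first matching tag directly, or `None` when nothing matches. A minimal sketch on a made-up snippet (the IDs are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two text blocks.
snippet = """
<page>
  <textblock id="Page1_Block1"></textblock>
  <textblock id="Page1_Block2"></textblock>
</page>
"""
soup = BeautifulSoup(snippet, "html.parser")

# find_all always returns a list-like ResultSet, even for one match.
matches = soup.find_all("textblock", attrs={"id": "Page1_Block1"})

# find returns the tag itself ...
first = soup.find("textblock", attrs={"id": "Page1_Block1"})

# ... or None when there is no match, so check before using it.
missing = soup.find("textblock", attrs={"id": "no_such_id"})
```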

The same search logic applies to *any* XML attribute. 

Here we search for all `<composedblock>` with `@type` = `container`:

In [18]:
composed_blocks = xml_doc.find_all(
    'composedblock',
    {'type': 'container'}
)
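
The attribute filter accepts more than a single string. The sketch below, on a made-up snippet, shows three variants: an exact value, `True` for "the attribute is present", and a list for "any of these values".

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three blocks, one of them without a type.
snippet = """
<page>
  <composedblock id="b1" type="container"></composedblock>
  <composedblock id="b2" type="illustration"></composedblock>
  <composedblock id="b3"></composedblock>
</page>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Exact value.
containers = soup.find_all("composedblock", {"type": "container"})

# Attribute present, whatever its value.
typed = soup.find_all("composedblock", {"type": True})

# Any of several values.
either = soup.find_all("composedblock", {"type": ["container", "illustration"]})

print([b["id"] for b in containers])
print([b["id"] for b in typed])
print([b["id"] for b in either])
```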

Finding all XML elements with a given name:

In [20]:
textline_elements = xml_doc.find_all('textline')

In [26]:
x = textline_elements[0].get('hpos')  # HPOS: horizontal offset, i.e. x
y = textline_elements[0].get('vpos')  # VPOS: vertical offset, i.e. y
w = textline_elements[0].get('width')
h = textline_elements[0].get('height')

In [27]:
print(
    f'The coordinates of the first line are: {x} (x), {y} (y), {h} (height), {w} (width)'
)

The coordinates of the first line are: 242 (x), 466 (y), 86 (height), 611 (width)
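
Attribute values returned by `.get()` are strings, so they need converting before any arithmetic. In ALTO, `HPOS` is the horizontal offset (x) and `VPOS` the vertical offset (y). Here is a minimal sketch with a hypothetical helper `bounding_box`; it assumes integer pixel values, as in the output above, and would need `float` for files that store fractional coordinates.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet mirroring the attributes used above.
snippet = '<textline hpos="242" vpos="466" width="611" height="86"></textline>'
line = BeautifulSoup(snippet, "html.parser").find("textline")


def bounding_box(tag):
    """Return (x, y, width, height) as integers (hypothetical helper)."""
    return tuple(int(tag.get(name)) for name in ("hpos", "vpos", "width", "height"))


raw = line.get("hpos")          # a string, not a number
x, y, w, h = bounding_box(line)
right, bottom = x + w, y + h    # opposite corner of the box

print(type(raw).__name__, (x, y, w, h), (right, bottom))
```

`.get()` returns `None` for a missing attribute, so `int()` would raise a `TypeError` on incomplete elements; real data may need a guard for that.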


### Navigating the XML tree

In [21]:
el = xml_doc.find('styles')

In [27]:
for child in el.children:
    print(type(child), child.name)

<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> textstyle
<class 'bs4.element.NavigableString'> None
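
The `NavigableString` entries in this output are the line breaks between the elements: whitespace counts as a child node. To iterate over element children only, one option is to filter on the node type, another is a non-recursive `find_all`. A minimal sketch on a made-up `<styles>` snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: two elements separated by line breaks.
snippet = """<styles>
<textstyle id="font0"></textstyle>
<textstyle id="font1"></textstyle>
</styles>"""
el = BeautifulSoup(snippet, "html.parser").find("styles")

# Every child, including the "\n" strings between the tags.
all_children = list(el.children)

# Option 1: keep only Tag instances.
tag_children = [child for child in el.children if isinstance(child, Tag)]

# Option 2: match any tag (True), but only among direct children.
direct_tags = el.find_all(True, recursive=False)

print(len(all_children), [t["id"] for t in tag_children])
```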


In [29]:
parent = el.parent

In [30]:
el.previous_sibling

'\n'

In [31]:
el.next_sibling

'\n'
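
Both siblings here are just the line breaks surrounding the element. To reach the neighbouring elements instead, `find_previous_sibling` and `find_next_sibling` skip over strings when given a tag filter. A minimal sketch on a simplified, made-up document skeleton:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified skeleton of an ALTO-like document.
snippet = """<alto>
<description></description>
<styles></styles>
<layout></layout>
</alto>"""
el = BeautifulSoup(snippet, "html.parser").find("styles")

# The immediate sibling is the whitespace between the tags.
print(repr(el.next_sibling))

# Passing True matches any tag, so whitespace strings are skipped.
prev_tag = el.find_previous_sibling(True)
next_tag = el.find_next_sibling(True)

print(prev_tag.name, next_tag.name, el.parent.name)
```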

## [Exercise] From XML to dictionary

Let's now put all of this together to solve a real problem that you have already encountered, i.e. **turning a bunch of XML files into processable data**. Why can this be useful?

(This exercise will take around 20-30 minutes to complete).

In [32]:
import pandas as pd

In [3]:
def parse_alto(filepath):
    """
    Convert each file to a dictionary with the
    following keys: fulltext (list of lines), wordcount, filename.
    """
    parsed_data = {}
    
    # add here your solution
    # you'll need to parse the xml elements
    # containing the information you are interested in
    
    # HINT: you may want to split the parsing of individual
    # XML elements into dedicated functions that get called from
    # `parse_alto()`
    
    return parsed_data
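
As a possible starting point for the hint above, here is a sketch of one dedicated helper. It assumes that each word of a line is stored in the `content` attribute of a `<string>` element (ALTO's `String`/`CONTENT`, lowercased by the HTML parser); check this against your own files before relying on it. The name `line_to_text` and the snippet are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical line with two words separated by a space element.
snippet = """
<textline>
  <string content="Hello"></string><sp></sp><string content="world"></string>
</textline>
"""
line = BeautifulSoup(snippet, "html.parser").find("textline")


def line_to_text(textline):
    """Join the `content` attribute of every <string> descendant (hypothetical helper)."""
    words = [s.get("content", "") for s in textline.find_all("string")]
    return " ".join(words)


print(line_to_text(line))
```

A `parse_alto()` built on such a helper could call it once per `<textline>` and derive the word count from the resulting lines.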

In [34]:
# once your function is in place, you should be
# able to execute this cell, which applies your function
# to all Alto files.

data = [
    parse_alto(xml_file)
    for xml_file in xml_files
]

df = pd.DataFrame(data)

In [35]:
df.head()

0
1
2
3
