# Pandas & BeautifulSoup 

**TL;DR**

In this notebook we "meet" two highly useful Python libraries: `pandas` and `BeautifulSoup` (BS).

## Imports

In [61]:
import os
import pandas as pd
import bs4
from bs4 import BeautifulSoup

## BeautifulSoup: a quick introduction

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the job.

It can parse HTML and XML, including ill-formed or broken XML documents (very useful for legacy XML or even SGML data).
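
To make the tolerance for broken markup concrete, here is a minimal sketch. The snippet and its tag names (`record`, `label`, `hi`) are invented for illustration, and the built-in `html.parser` backend is named explicitly so that the example does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# hypothetical, deliberately broken snippet: <hi> is never closed
broken = "<record><label>Broken <hi>markup</label></record>"

soup = BeautifulSoup(broken, "html.parser")

# a usable tree is still built: the dangling <hi> is closed
# when its parent <label> ends
print(soup.hi.get_text())
print(soup.label.get_text())
```

A strict XML parser would normally reject this input; here the text of both elements remains reachable.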

### Open an XML file with BS

In [5]:
data_folder = 'data/'

# collect the paths of the XML files in the data folder;
# keeping only names that end in `.xml` skips other files,
# e.g. `.DS_Store` (on macOS)

xml_files = [
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
]
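
The same filtering can also be sketched with `pathlib`. The helper name `list_xml_files` is an assumption made for this example; sorting is added because `os.listdir` returns entries in arbitrary order, so an expression like `xml_files[0]` may otherwise differ between machines.

```python
from pathlib import Path


def list_xml_files(folder):
    """Return the sorted paths of the `.xml` files directly inside `folder`."""
    # glob("*.xml") only matches names that end in `.xml`
    return sorted(str(path) for path in Path(folder).glob("*.xml"))


# usage, assuming the same folder as above:
# xml_files = list_xml_files("data/")
```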

In [6]:
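# no parser is named, so BeautifulSoup falls back to an HTML parser:
# the document gets wrapped in <html><body> and names are lowercased
with open(xml_files[0], 'r', encoding='utf-8') as inpfile:
    xml_doc = BeautifulSoup(inpfile)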

In [12]:
xml_files[0]

'data/igcyr024200.xml'

In [10]:
xml_doc

<?xml version="1.0" encoding="UTF-8"?><?xml-model href="http://www.stoa.org/epidoc/schema/8.23/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://www.stoa.org/epidoc/schema/8.23/tei-epidoc.rng" schematypens="http://purl.oclc.org/dsdl/schematron"?><html><body><tei xml:lang="en" xmlns="http://www.tei-c.org/ns/1.0">
<teiheader>
<filedesc>
<titlestmt>
<title><rs cert="low" type="textType">Private honors</rs> or <rs cert="low" type="textType">epitaph</rs></title>
<editor>Inscriptions of Greek Cyrenaica</editor>
</titlestmt>
<publicationstmt>
<authority></authority>
<idno type="filename">IGCyr024200</idno>
<availability>
<p><ref target="https://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attributions-NonCommercial 4.0 International</ref> License.</p> <p>All citation, reuse or distribution of this work must contain a link back to DOI: <ref target="http://doi.org/10.6092/UNIBO/IGCYRGVCYR">http://doi.org/10.6092/UNIBO/IGCYRGVCYR</ref> and 

### Finding elements by attribute

In [15]:
target_element = xml_doc.find_all(
    attrs={'xml:id': 'representation'}
)

In [16]:
target_element

[<category xml:id="representation">
 <catdesc>Digitized other representations</catdesc>
 </category>]

In [17]:
assert len(target_element) == 1

### Finding elements by name
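
A minimal sketch of looking elements up by tag name. It runs on a short inline snippet modelled on the header shown earlier rather than on a file, and names `html.parser` explicitly so that it is self-contained; `find` returns the first match (or `None`), while `find_all` returns every match.

```python
from bs4 import BeautifulSoup

# inline stand-in for a TEI header fragment (illustrative only)
snippet = (
    "<filedesc>"
    "<editor>Inscriptions of Greek Cyrenaica</editor>"
    '<idno type="filename">IGCyr024200</idno>'
    "</filedesc>"
)
soup = BeautifulSoup(snippet, "html.parser")

first_idno = soup.find("idno")     # first matching element, or None
all_idnos = soup.find_all("idno")  # list-like result with all matches
missing = soup.find("respstmt")    # no such element in the snippet

print(first_idno.text, len(all_idnos), missing)
```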

### Navigating the XML tree

Let's navigate the `edition` section of an EpiDoc TEI file a bit.

First, we isolate this section, which sits in a `<div>` with `@type=edition`:

In [32]:
edition = xml_doc.find_all(
    'div',
    attrs={'type': 'edition'}
)[0]

In [39]:
for child in edition.children:
    print(f"Element type: {type(child)}, element name: {child.name}, element content: '{child}'")

Element type: <class 'bs4.element.NavigableString'>, element name: None, element content: '
'
Element type: <class 'bs4.element.Tag'>, element name: ab, element content: '<ab>
<lb n="1"></lb><persname key="" type="attested"><name nymref="#Ξεναρίστα">Ξεναρίστα</name>
<persname key="" type="attested"><name nymref="#Πρατομήδης" type="patronymic">Πρατομήδευς</name></persname></persname>
</ab>'
Element type: <class 'bs4.element.NavigableString'>, element name: None, element content: '
'
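
As the output shows, the whitespace between elements comes back as `NavigableString` children. When only the element children matter, they can be filtered out. The sketch below uses a small inline snippet (an assumption, standing in for the real file) and shows two equivalent ways of doing so.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# inline stand-in for the edition section (illustrative only)
snippet = '<div type="edition">\n<ab>\n<lb n="1"></lb><name>Ξεναρίστα</name>\n</ab>\n</div>'
soup = BeautifulSoup(snippet, "html.parser")
edition = soup.find("div", attrs={"type": "edition"})

all_children = list(edition.children)

# option 1: keep only children that are tags
tag_children = [child for child in edition.children if isinstance(child, Tag)]

# option 2: find_all(True) matches any tag; recursive=False stays at depth 1
direct_tags = edition.find_all(True, recursive=False)

print(len(all_children), [tag.name for tag in tag_children])
```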


In [58]:
for i, persname in enumerate(edition.find_all('name')):
    # note that element and attribute names get lowercased by the HTML parser
    # (hence `nymref` rather than TEI's `nymRef`)
    print(i + 1, persname.text.replace('\n', ' '), persname.get('nymref'))

1 Ξεναρίστα #Ξεναρίστα
2 Πρατομήδευς #Πρατομήδης
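
The lowercasing is easy to trip over, so here is a small sketch of its effect, using an inline snippet and the built-in `html.parser` (both are assumptions made to keep the example self-contained). Lookups with the original TEI camelCase spelling silently return `None`.

```python
from bs4 import BeautifulSoup

snippet = '<persName type="attested"><name nymRef="#Ξεναρίστα">Ξεναρίστα</name></persName>'
soup = BeautifulSoup(snippet, "html.parser")

name = soup.find("name")

print(soup.find("persName"))       # camelCase lookup finds nothing
print(soup.find("persname").name)  # the lowercased name does
print(name.get("nymRef"))          # same story for attributes
print(name.get("nymref"))
```

BeautifulSoup also documents an `"xml"` parser (backed by `lxml`) that preserves case. Switching to it would change the element and attribute names used throughout this notebook.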


In [59]:
type(persname)

bs4.element.Tag

## XML data → `DataFrame`

### Function definitions

In [65]:
def read_xml(path):
    # parse one file with BeautifulSoup's default (HTML) parser
    with open(path, 'r', encoding='utf-8') as inpfile:
        return BeautifulSoup(inpfile)

In [66]:
def find_name_element(doc: BeautifulSoup):
    return doc.find_all('name')

In [67]:
def parse_name_element(element: bs4.element.Tag):
    return {
        "surface": element.text,
        "id": element.get('nymref'),
        "type": element.get('type')
    }

### Easy version

In [93]:
names = [
    (file, parse_name_element(name))
    for file in xml_files
    for name in find_name_element(read_xml(file))
]

In [94]:
len(names)

33

In [76]:
# `names` holds (file, name_dict) pairs; only the dicts go into this table
names_df = pd.DataFrame([name for _, name in names]).set_index('id', drop=False)

### Advanced version

We want to extract all names from the TEI files while keeping the provenance of each name (i.e. the path of the file where it was found).

In [103]:
dfs = []

for file, name_elements in names:
    df = pd.DataFrame([name_elements]).set_index('id', drop=False)
    df['file'] = file
    dfs.append(df)

In [105]:
names_df = pd.concat(dfs)

In [106]:
names_df.shape

(33, 4)
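
Building one `DataFrame` per name and concatenating works, but the provenance can also be stored directly in each record, so that a single `DataFrame` is built in one pass. The sketch below is one possible alternative, not a change to the approach above: the `documents` mapping and its two file paths are hypothetical in-memory stand-ins for the TEI files.

```python
import pandas as pd
from bs4 import BeautifulSoup

# hypothetical stand-ins for the TEI files (path -> content)
documents = {
    "data/a.xml": '<ab><name nymRef="#Ξεναρίστα">Ξεναρίστα</name></ab>',
    "data/b.xml": '<ab><name nymRef="Κομνηνός" type="surname">Κομνηνοῦ</name></ab>',
}

records = [
    {
        "surface": name.text,
        "id": name.get("nymref"),
        "type": name.get("type"),
        "file": path,  # provenance of the name
    }
    for path, content in documents.items()
    for name in BeautifulSoup(content, "html.parser").find_all("name")
]

records_df = pd.DataFrame(records).set_index("id", drop=False)
print(records_df.shape)
```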

In [107]:
names_df.head()

| id | surface | id | type | file |
|---|---|---|---|---|
| #Ξεναρίστα | Ξεναρίστα | #Ξεναρίστα | | data/igcyr024200.xml |
| #Πρατομήδης | Πρατομήδευς | #Πρατομήδης | patronymic | data/igcyr024200.xml |
| Ἰσαάκιος | Ἰσαάκου | Ἰσαάκιος | | data/iospe-5.14.xml |
| Ἰσαάκιος | Ἰσακίου | Ἰσαάκιος | | data/iospe-5.11.xml |
| Κομνηνός | Κομνηνοῦ | Κομνηνός | surname | data/iospe-5.11.xml |


### Data exploration

In [78]:
names_df.head(10)

| id | surface | id | type |
|---|---|---|---|
| #Ξεναρίστα | Ξεναρίστα | #Ξεναρίστα | |
| #Πρατομήδης | Πρατομήδευς | #Πρατομήδης | patronymic |
| Ἰσαάκιος | Ἰσαάκου | Ἰσαάκιος | |
| Ἰσαάκιος | Ἰσακίου | Ἰσαάκιος | |
| Κομνηνός | Κομνηνοῦ | Κομνηνός | surname |
| Αἰκατερίνη | Αἰκατερίνης | Αἰκατερίνη | |
| Λέων | Λέοντος | Λέων | |
| Ἀλιάτης | Ἀλιάτου | Ἀλιάτης | surname |
| | μαχ\n ος | | |
| Χριστός | Χριστέ | Χριστός | |
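
One surface form above still contains a line break followed by a space. A minimal sketch of normalising such whitespace in a column of strings follows. The padded third value is invented for illustration, and whether a mid-word break should become a space or be removed altogether is an editorial decision that this sketch does not settle.

```python
import pandas as pd

# the second value mirrors the row above; the padded third one is hypothetical
surfaces = pd.Series(["Ξεναρίστα", "μαχ\n ος", " Λέοντος "])

# collapse every run of whitespace into one space, then trim both ends
cleaned = surfaces.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned.tolist())
```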


In [79]:
names_df.type.value_counts()

patronymic    3
surname       2
Name: type, dtype: int64
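
Only 5 of the 33 names appear in this count because `value_counts()` leaves out missing values by default. Passing `dropna=False` makes them visible, as in this small sketch on made-up data:

```python
import pandas as pd

# made-up values; None marks a name without a `type` attribute
types = pd.Series(["patronymic", None, "surname", None, "patronymic"], name="type")

counted = types.value_counts()                    # missing values are skipped
counted_with_na = types.value_counts(dropna=False)  # missing values get a row

print(counted.sum(), counted_with_na.sum())
```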

In [81]:
names_df.id.value_counts()

Χριστός        3
#Ἀκέσανδρος    2
Ἀπόλλων        2
Ὀκτάβιος       2
Ἰσαάκιος       2
Φλάβιος        1
Νίγερ          1
Εὐκλείδας      1
Αἰκατερίνη     1
Ἰησοῦς         1
Τραιανός       1
Λέων           1
Ἁδριανός       1
#Πρατομήδης    1
Τρυφ-          1
#Τίμαρχος      1
Μᾶρκος         1
Καῖσαρ         1
 Ἀλιάτης       1
Φαυ-           1
Κομνηνός       1
#Θεύχρηστος    1
Σεβαστός       1
#Ξεναρίστα     1
Κυρά           1
Πόπλιος        1
Name: id, dtype: int64
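
The ids in this list follow two conventions: some carry a leading `#` and others do not, and one of them appears to start with a space. If ids are to be compared across files, it may help to normalise them first. The sketch below works on made-up values and assumes that `#X` and `X` refer to the same name, which would need checking against the actual data.

```python
import pandas as pd

# made-up ids that imitate the two conventions seen above
ids = pd.Series(["#Ἀκέσανδρος", "Ἀκέσανδρος", " Ἀλιάτης", "Χριστός"])

# trim surrounding whitespace, then drop any leading '#'
normalized = ids.str.strip().str.lstrip("#")

print(normalized.value_counts())
```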

In [89]:
names_df.surface.value_counts().mean()

1.03125

In [90]:
names_df.surface.value_counts().median()

1.0

In [91]:
names_df.id.value_counts().mean()

1.2307692307692308

In [92]:
names_df.id.value_counts().median()

1.0
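
The mean of `value_counts()` is simply the number of non-missing values divided by the number of distinct values: 33 / 32 for `surface` and 32 / 26 for `id` in the outputs above. The sketch below checks this relation on made-up data and shows that `describe()` returns the mean, the median (`50%`) and other summary statistics in one call.

```python
import pandas as pd

# made-up ids: three distinct values, six occurrences in total
ids = pd.Series(["Χριστός", "Χριστός", "Χριστός", "Λέων", "Κυρά", "Κυρά"])

counts = ids.value_counts()
summary = counts.describe()

# mean frequency == non-missing values / distinct values
ratio = ids.count() / ids.nunique()

print(summary[["count", "mean", "50%", "max"]])
print(ratio)
```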