# Pandas & BeautifulSoup 

**TL;DR**

In this notebook we are going to "meet" two highly useful Python libraries: `pandas` and `BeautifulSoup` (BS).

## Imports

In [1]:
import os
import pandas as pd
import bs4
from bs4 import BeautifulSoup

## BeautifulSoup: a quick introduction

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the task.

It can parse HTML and XML, including ill-formed or broken documents (very useful for legacy XML or even SGML data).
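
As a small illustration of this tolerance, the sketch below feeds a deliberately unclosed fragment to `BeautifulSoup`. The fragment is made up for this example, and Python's built-in `html.parser` is used so that no extra dependency is assumed.

```python
from bs4 import BeautifulSoup

# A made-up fragment with two unclosed elements (<p> and <b>)
broken = "<p>An unclosed <b>fragment"

soup = BeautifulSoup(broken, "html.parser")

# The parser closes the dangling tags, so the tree can be queried as usual
bold = soup.find("b")
print(bold.text)
print(soup)
```

How a broken document gets repaired depends on the parser, so results may differ between `html.parser`, `lxml` and `html5lib`.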

### Open an XML file with BS

In [2]:
data_folder = 'data/'

# collect the paths of the XML files in the data folder;
# keeping only files with the `.xml` extension skips other files,
# e.g. `.DS_Store` (on macOS)

xml_files = [
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
]
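
An alternative sketch of the same listing step with `pathlib`. The helper name `list_xml_files` is hypothetical and not part of the notebook. Sorting the result gives a reproducible order, which `os.listdir` does not guarantee.

```python
from pathlib import Path
from typing import List


def list_xml_files(folder: str) -> List[str]:
    """Return the sorted paths of the `.xml` files found directly in `folder`."""
    # glob("*.xml") only matches names ending in `.xml`,
    # so files such as `.DS_Store` are skipped
    return sorted(str(path) for path in Path(folder).glob("*.xml"))
```

Calling `list_xml_files('data/')` would then play the role of the list comprehension above.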

In [3]:
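with open(xml_files[0], 'r', encoding='utf-8') as inpfile:
    # name the parser explicitly: the output below (content wrapped in
    # <html><body>, tag names lowercased) is what the `lxml` HTML parser yields
    xml_doc = BeautifulSoup(inpfile, 'lxml')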

In [4]:
xml_files[0]

'data/ircyr-P.304.xml'

In [5]:
xml_doc

<?xml version="1.0" encoding="UTF-8"?><?xml-model href="http://www.stoa.org/epidoc/schema/dev/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://www.stoa.org/epidoc/schema/dev/ircyr-checking.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?><?xml-model href="http://www.stoa.org/epidoc/schema/dev/tei-epidoc.rng" schematypens="http://purl.oclc.org/dsdl/schematron"?><html><body><tei xml:id="P40900" xml:lang="en" xmlns="http://www.tei-c.org/ns/1.0">
<teiheader>
<filedesc>
<titlestmt>
<title><rs type="textType">Funerary</rs> inscription</title>
<editor>Joyce M. Reynolds</editor>
</titlestmt>
<publicationstmt>
<authority>Centre for Computing in the Humanities, King's College London</authority>
<idno type="filename">P.304</idno>
<idno type="ircyr2012">P40900</idno>
<idno type="JMR">529b</idno>
<idno type="Excel">430</idno>
<availability>
<p>Creative Commons licence Attribution UK 2.0
                    (<ref>http://creativecommons.org/license

### Finding elements by attribute

In [6]:
target_element = xml_doc.find_all(
    attrs={'xml:id': 'representation'}
)

In [7]:
target_element

[<category xml:id="representation"><catdesc>Digitized other representations</catdesc></category>]

In [8]:
assert len(target_element) == 1

### Finding elements by name
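
A minimal sketch of lookup by element name: `find` returns the first match (or `None`), while `find_all` returns every match. The snippet is made up, loosely modelled on the header shown earlier, and `html.parser` is assumed so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up snippet, loosely modelled on an EpiDoc header
snippet = """
<publicationstmt>
  <idno type="filename">P.304</idno>
  <idno type="JMR">529b</idno>
</publicationstmt>
"""

doc = BeautifulSoup(snippet, "html.parser")

first_idno = doc.find("idno")     # first matching element, or None
all_idnos = doc.find_all("idno")  # list-like collection of all matches

print(first_idno.text)
print([idno["type"] for idno in all_idnos])
```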

### Navigating the XML tree

Let's navigate the `edition` section of an EpiDoc TEI file a bit.

First, we isolate this element, which is a `<div>` with `@type="edition"`:

In [9]:
edition = xml_doc.find_all(
    'div',
    attrs={'type': 'edition'}
)[0]

In [10]:
for child in edition.children:
    print(f"Element type: {type(child)}, element name: {child.name}, element content: '{child}'")

Element type: <class 'bs4.element.NavigableString'>, element name: None, element content: '
'
Element type: <class 'bs4.element.Tag'>, element name: ab, element content: '<ab>
<lb n="0"></lb><gap extent="unknown" reason="lost" unit="line"></gap>
<lb n="1"></lb><gap extent="unknown" reason="lost" unit="character"></gap> <persname type="attested"><name><seg part="F">μαχ
                    <lb break="no" n="2"></lb><supplied reason="lost">ο</supplied>ς</seg></name></persname>
</ab>'
Element type: <class 'bs4.element.NavigableString'>, element name: None, element content: '
'
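
As the output shows, `.children` yields whitespace-only text nodes alongside real elements. A small sketch of how one might keep only the elements, using a made-up `<div>` and assuming `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up snippet with the same shape as the edition <div> above
snippet = """<div type="edition">
<ab><lb n="1"></lb>text</ab>
</div>"""

div = BeautifulSoup(snippet, "html.parser").find("div")

# Keep only real elements; the newlines around <ab> are NavigableString objects
element_children = [child for child in div.children if isinstance(child, Tag)]

print([child.name for child in element_children])
```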


In [11]:
for i, persname in enumerate(edition.find_all('name')):
    # note that the HTML parser lowercases element and attribute names,
    # hence `nymref` rather than `nymRef`
    print(i + 1, persname.text.replace('\n', ' '), persname.get('nymref'))

1 μαχ                     ος None
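
The lowercasing mentioned in the comment can be checked on a made-up element. This sketch assumes `html.parser`; the attribute value is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up element using the camel-cased EpiDoc names `persName` and `nymRef`
snippet = '<persName type="attested"><name nymRef="Ἰσαάκιος">Ἰσαάκου</name></persName>'

doc = BeautifulSoup(snippet, "html.parser")
name = doc.find("name")

print(name.get("nymref"))  # lowercased key: the value is found
print(name.get("nymRef"))  # original spelling: not found, so None
print(name.parent.name)    # the element name is lowercased too
```

With the `xml` parser (which requires `lxml`) the original casing should be preserved, so the lookup keys would need to be adapted accordingly.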


In [12]:
type(persname)

bs4.element.Tag

## XML data → `DataFrame`

**Why?**

When working with data, it's often very useful to compute some statistics about them. If you are working with a corpus of texts encoded in TEI/XML, you'll have to extract information from the XML files to be able to compute the stats.

**How?**

...

### Easy version

We want to parse all EpiDoc files contained in `data/` and extract all names (`<name>`). 

For each name we retain the following information:
- surface form (the textual content of the XML element)
- identifier (contained in `@nymRef`)
- type (contained in `@type`)

#### Function definitions

In [26]:
from typing import List

In [25]:
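def read_xml(path) -> BeautifulSoup:
    # name the parser explicitly; `lxml` matches the lowercased
    # element names seen in the outputs of this notebook
    with open(path, 'r', encoding='utf-8') as inpfile:
        return BeautifulSoup(inpfile, 'lxml')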

In [27]:
def find_name_element(doc: BeautifulSoup) -> List:
    return doc.find_all('name')

In [24]:
def parse_name_element(element: bs4.element.Tag) -> dict:
    return {
        "surface": element.text,
        "id": element.get('nymref'),
        "type": element.get('type')
    }

In [19]:
names = [
    parse_name_element(name)
    for file in xml_files
    for name in find_name_element(read_xml(file))
]

In [20]:
len(names)

33

In [21]:
names_df = pd.DataFrame(names).set_index('id', drop=False)

In [22]:
names_df.head()

| id (index) | surface | id | type |
|---|---|---|---|
|  | μαχ\n ος |  |  |
| Εὐκλείδας | Εὐκλείδα | Εὐκλείδας |  |
| Τρυφ- | Τρυφ | Τρυφ- |  |
| Ἰσαάκιος | Ἰσαάκου | Ἰσαάκιος |  |
| #Ἀκέσανδρος | Ἀκέσανδρον | #Ἀκέσανδρος |  |
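
The first surface form still carries the newline and indentation of the source file. A possible clean-up, sketched with a hypothetical helper `normalize_whitespace` that is not part of the notebook:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("μαχ\n                    ος"))
```

Whether the two parts should be joined with a space or without one is an editorial decision: here the line break is marked `break="no"`, which suggests a single word split across two lines.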


### Advanced version

We want to extract all names from the TEI files while keeping the provenance of each name (i.e. the path of the file where it was found).

In [103]:
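dfs = []

# `names` from the easy version holds plain dicts without provenance,
# so we walk through the files again and keep the path of each one
for file in xml_files:
    for name in find_name_element(read_xml(file)):
        df = pd.DataFrame([parse_name_element(name)]).set_index('id', drop=False)
        df['file'] = file
        dfs.append(df)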

In [105]:
names_df = pd.concat(dfs)

In [106]:
names_df.shape

(33, 4)
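
An alternative sketch, not the notebook's own approach: attach the file path to each record first, then build a single `DataFrame` without `pd.concat`. The records and file names below are invented stand-ins for the output of `parse_name_element`.

```python
import pandas as pd

# Toy records standing in for the output of `parse_name_element`
# (values and file names are invented for illustration)
records_by_file = {
    "data/a.xml": [{"surface": "Λέοντος", "id": "Λέων", "type": None}],
    "data/b.xml": [
        {"surface": "Ἰσαάκου", "id": "Ἰσαάκιος", "type": None},
        {"surface": "Κομνηνοῦ", "id": "Κομνηνός", "type": "surname"},
    ],
}

# Attach the file path to each record, then build the DataFrame in one go
records = [
    {**record, "file": path}
    for path, file_records in records_by_file.items()
    for record in file_records
]
toy_df = pd.DataFrame(records).set_index("id", drop=False)

print(toy_df.shape)
```

Building one `DataFrame` from a list of dicts is usually cheaper than concatenating many one-row frames, although for 33 names the difference is negligible.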

In [107]:
names_df.head()

| id (index) | surface | id | type | file |
|---|---|---|---|---|
| #Ξεναρίστα | Ξεναρίστα | #Ξεναρίστα |  | data/igcyr024200.xml |
| #Πρατομήδης | Πρατομήδευς | #Πρατομήδης | patronymic | data/igcyr024200.xml |
| Ἰσαάκιος | Ἰσαάκου | Ἰσαάκιος |  | data/iospe-5.14.xml |
| Ἰσαάκιος | Ἰσακίου | Ἰσαάκιος |  | data/iospe-5.11.xml |
| Κομνηνός | Κομνηνοῦ | Κομνηνός | surname | data/iospe-5.11.xml |


### Data exploration

In [78]:
names_df.head(10)

| id (index) | surface | id | type |
|---|---|---|---|
| #Ξεναρίστα | Ξεναρίστα | #Ξεναρίστα |  |
| #Πρατομήδης | Πρατομήδευς | #Πρατομήδης | patronymic |
| Ἰσαάκιος | Ἰσαάκου | Ἰσαάκιος |  |
| Ἰσαάκιος | Ἰσακίου | Ἰσαάκιος |  |
| Κομνηνός | Κομνηνοῦ | Κομνηνός | surname |
| Αἰκατερίνη | Αἰκατερίνης | Αἰκατερίνη |  |
| Λέων | Λέοντος | Λέων |  |
| Ἀλιάτης | Ἀλιάτου | Ἀλιάτης | surname |
|  | μαχ\n ος |  |  |
| Χριστός | Χριστέ | Χριστός |  |


In [79]:
names_df.type.value_counts()

patronymic    3
surname       2
Name: type, dtype: int64
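
The counts above add up to 5 although the `DataFrame` has 33 rows, because `value_counts` skips missing values by default. A small sketch on a toy column (invented values) shows how to make the missing values visible:

```python
import pandas as pd

# Toy column mimicking `names_df.type`: most names carry no `@type`
types = pd.Series(["patronymic", None, "surname", None, None], name="type")

print(types.value_counts())              # missing values are left out
print(types.value_counts(dropna=False))  # missing values get their own row
print(types.isna().sum())                # number of names without a type
```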

In [81]:
names_df.id.value_counts()

Χριστός        3
#Ἀκέσανδρος    2
Ἀπόλλων        2
Ὀκτάβιος       2
Ἰσαάκιος       2
Φλάβιος        1
Νίγερ          1
Εὐκλείδας      1
Αἰκατερίνη     1
Ἰησοῦς         1
Τραιανός       1
Λέων           1
Ἁδριανός       1
#Πρατομήδης    1
Τρυφ-          1
#Τίμαρχος      1
Μᾶρκος         1
Καῖσαρ         1
 Ἀλιάτης       1
Φαυ-           1
Κομνηνός       1
#Θεύχρηστος    1
Σεβαστός       1
#Ξεναρίστα     1
Κυρά           1
Πόπλιος        1
Name: id, dtype: int64

In [89]:
names_df.surface.value_counts().mean()

1.03125

In [90]:
names_df.surface.value_counts().median()

1.0

In [91]:
names_df.id.value_counts().mean()

1.2307692307692308

In [92]:
names_df.id.value_counts().median()

1.0
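
The four cells above each compute one statistic of a frequency distribution. As a sketch on toy identifiers (invented), several statistics can be computed in one call with `Series.agg`:

```python
import pandas as pd

# Toy identifiers (invented): "a" occurs twice, "b" and "c" once
ids = pd.Series(["a", "a", "b", "c"], name="id")

counts = ids.value_counts()

# Compute several statistics of the frequency distribution in one call
summary = counts.agg(["mean", "median", "max"])
print(summary)
```

Applied to the notebook's data, the same pattern would read `names_df.id.value_counts().agg(["mean", "median", "max"])`.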