# Pandas & BeautifulSoup 

**TL;DR**

- In this notebook we are going to see how to combine two highly useful Python libraries: `pandas` and `BeautifulSoup` (BS).
- To demonstrate this, we will write a Python program that **computes statistics** about the **frequency of names** found in a small set of EpiDoc TEI/XML documents.

## Imports

In [None]:
import os
import pandas as pd
import bs4
from bs4 import BeautifulSoup

## BeautifulSoup: a quick introduction

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the task.

It can parse HTML, XML, as well as ill-formed or broken XML documents (very useful for legacy XML or even SGML data).
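As a small illustration of this tolerance, the sketch below parses a fragment whose `<name>` element is never closed. The fragment is invented for the example, and `html.parser` (bundled with Python) is our choice of parser here, not necessarily the one used in the rest of the notebook.

```python
from bs4 import BeautifulSoup

# A deliberately broken fragment: <name> is opened but never closed.
broken = "<ab><persName><name type='patronymic'>Philon</persName></ab>"

# Assumption: we name the parser bundled with Python explicitly.
soup = BeautifulSoup(broken, "html.parser")

# The parser closes <name> for us when it meets </persName>,
# so the element can still be queried as usual.
name = soup.find("name")
print(name.get("type"))
print(name.text)
```

No exception is raised: the missing closing tag is repaired silently, which is convenient for legacy data but also means that encoding errors can go unnoticed.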

### Open an XML file with BS

In [None]:
data_folder = 'data/'

# let's get the paths of the XML files
# we keep only files with the `.xml` extension;
# this also skips e.g. `.DS_Store` files (on macOS)

xml_files = [
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
]

In [None]:
# we specify the UTF-8 encoding of the XML file we are about to open:
# omitting it raises an exception on Windows, while it works fine on Unix systems,
# see https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters
# no parser is named here, so BeautifulSoup falls back to an installed HTML
# parser, which lowercases element and attribute names
with open(xml_files[0], 'r', encoding='utf-8') as inpfile:
    xml_doc = BeautifulSoup(inpfile)

In [None]:
xml_files[0]

In [None]:
xml_doc

### Finding elements by id

In [None]:
target_element = xml_doc.find_all(
    attrs={'xml:id': 'representation'}
)

In [None]:
target_element

In [None]:
# by definition, there should exist exactly one element
# with a given ID within the same document
assert len(target_element) == 1
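To try this without any files at hand, here is a minimal sketch on an inline snippet. The snippet and its ids are made up for illustration, and `html.parser` is named explicitly as an assumption. It also shows `find()`, the single-result counterpart of `find_all()`.

```python
from bs4 import BeautifulSoup

# Invented snippet: two <div> elements with distinct xml:id values.
snippet = """
<div type="commentary">
  <div xml:id="representation" type="figure">A relief.</div>
  <div xml:id="other" type="edition">Some text.</div>
</div>
"""
doc = BeautifulSoup(snippet, "html.parser")

# find_all() always returns a list-like result, even for a single match
matches = doc.find_all(attrs={"xml:id": "representation"})
assert len(matches) == 1

# find() returns the first match directly, or None when nothing matches
element = doc.find(attrs={"xml:id": "representation"})
missing = doc.find(attrs={"xml:id": "does-not-exist"})
print(element.get("type"), element.text)
print(missing)
```

Since an id is expected to be unique, `find()` saves the `[0]` indexing, but it will not tell you if a document wrongly contains duplicates; the `assert` on `find_all()` does.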

### Finding elements by other attributes

The same search logic applies to any XML attribute.

Here we search for all `<name>` elements with `@type="patronymic"`:

In [None]:
xml_doc.find_all(
    'name',
    attrs={'type': 'patronymic'}
)

### Finding elements by name

In [None]:
xml_doc.find_all(
    'persname'
)

In [None]:
xml_doc.find_all(
    'bibl'
)
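Note that the queries above use `persname` rather than TEI's `persName`. A minimal sketch of why, assuming an HTML parser (here `html.parser`, named explicitly) and an invented snippet: HTML parsers lowercase element and attribute names, and searches are case-sensitive.

```python
from bs4 import BeautifulSoup

# Invented snippet using TEI-style mixed-case names.
snippet = (
    '<persName>'
    '<name type="given" nymRef="#dion">Dion</name> '
    '<name type="patronymic" nymRef="#philon">Philonos</name>'
    '</persName>'
)
doc = BeautifulSoup(snippet, "html.parser")

# Element names were lowercased at parse time, so only the
# lowercase query matches.
mixed_case_hits = doc.find_all("persName")
lower_case_hits = doc.find_all("persname")
print(len(mixed_case_hits), len(lower_case_hits))

# The same holds for attribute names.
patronymic = doc.find("name", attrs={"type": "patronymic"})
print(patronymic.get("nymRef"), patronymic.get("nymref"))
```

If the original casing matters, BeautifulSoup can also be asked for an XML parser (`"xml"`, which requires `lxml`); the queries in this notebook would then need the mixed-case names instead.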

### Navigating the XML tree

So far we have seen how to process all elements matching a given query, no matter where they are found in the document. But in other cases, it's desirable to navigate through the hierarchical structure of a document.

Let's navigate the `edition` section of an EpiDoc TEI file.

First, we isolate this section, which is a `<div>` with `@type=edition`:

In [None]:
edition = xml_doc.find_all(
    'div',
    attrs={'type': 'edition'}
)[0]

In [None]:
for child in edition.children:
    print(f"Element type: {type(child)}, element name: {child.name}, element content: '{child}'")

In [None]:
for i, persname in enumerate(edition.find_all('name')):
    # note that element and attribute names get lowercased (`nymRef` becomes `nymref`)
    print(i + 1, persname.text.replace('\n', ' '), persname.get('nymref'))

In [None]:
type(persname)
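When iterating over `.children`, not every child is an element: the whitespace between tags shows up as text nodes whose `name` is `None`. A minimal sketch on an invented snippet (parser named explicitly as an assumption) that keeps only the `Tag` children and walks back up with `.parent`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented snippet: an edition <div> with two <ab> children.
snippet = """<div type="edition">
<ab>line one</ab>
<ab>line two</ab>
</div>"""
doc = BeautifulSoup(snippet, "html.parser")
edition = doc.find("div", attrs={"type": "edition"})

# .children yields tags as well as the newline strings between them
all_children = list(edition.children)
tag_children = [child for child in all_children if isinstance(child, Tag)]
print(len(all_children), len(tag_children))

# .parent moves one level up in the tree
first_ab = tag_children[0]
print(first_ab.name, first_ab.parent.get("type"))
```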

## XML data → `DataFrame`

**Why?**

When working with data, it's often very useful to compute some statistics about it. If you are working with a corpus of texts encoded in TEI/XML, you'll have to extract information from the XML files to be able to compute the stats.

**How?**

To do this, we combine the two libraries we've encountered in this session: `pandas` and `BeautifulSoup`.

### Easy version

We want to parse all EpiDoc files contained in `data/` and extract all names (`<name>`). 

For each name we retain the following information:
- surface form (the textual content of the XML element)
- identifier (contained in `@nymRef`)
- type (contained in `@type`)

#### Function definitions

To keep the notebook tidy, each step of the program is wrapped in a function.

These are the functions we will need:

In [None]:
# don't worry about this, it's just to add the type hints
# to each function declaration
from typing import List


def fetch_input_filenames(data_folder: str) -> List[str]:
    """Returns the paths of all XML files found in `data_folder`."""
    # we keep only files with the `.xml` extension;
    # this also skips e.g. `.DS_Store` files (on macOS)
    return [
        os.path.join(data_folder, file)
        for file in os.listdir(data_folder)
        if file.endswith(".xml")
    ]


def read_xml(path: str) -> BeautifulSoup:
    """Reads the input XML file into a `BeautifulSoup` document."""
    with open(path, 'r', encoding='utf-8') as inpfile:
        return BeautifulSoup(inpfile)
    
def find_name_element(doc: BeautifulSoup) -> List:
    """Extracts all `<name>` elements from an XML document."""
    return doc.find_all('name')


def parse_name_element(element: bs4.element.Tag) -> dict:
    """Transforms a `<name>` element into a dictionary."""
    assert element.name == 'name'
    return {
        "surface": element.text,
        "id": element.get('nymref'),
        "type": element.get('type')
    }
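Surface forms taken straight from `element.text` keep the line breaks and indentation of the source file. If that gets in the way when counting, one option is to normalise whitespace first. A minimal sketch: `normalize_surface` is a hypothetical helper, not part of the notebook, and the snippet is invented.

```python
from bs4 import BeautifulSoup


def normalize_surface(text: str) -> str:
    """Collapses runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


# Invented snippet: the name is split over two indented lines.
snippet = '<name type="given" nymRef="#philon">Aurelios\n      Philon</name>'
element = BeautifulSoup(snippet, "html.parser").find("name")

raw = element.text
clean = normalize_surface(raw)
print(repr(raw))
print(repr(clean))
```

Whether to normalise is a judgement call: it makes `value_counts()` on surface forms more meaningful, but it discards layout information that may matter for other analyses.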

At this point, we are ready to do the following:
- we iterate through all the files in the directory `data/*.xml` (`line 3`)
- for each file, we iterate through its `<name>` elements
- for each of these elements, we parse some information out of it and store it in a dictionary
- finally, all these new dictionaries are stored in a list.

This syntactic construct is called a **list comprehension**. It's very powerful (if a bit scary at first), as it lets you write complex sequences of processing steps in a very compact fashion.

In [None]:
names = [
    parse_name_element(name)
    for file in fetch_input_filenames('data/')
    for name in find_name_element(read_xml(file))
]
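If the nested comprehension looks opaque, it helps to see that it is equivalent to two nested `for` loops, with the `for` clauses read from top to bottom. A minimal sketch with toy data: the `files` dictionary is a made-up stand-in for the XML files and the names they contain.

```python
# toy stand-in: file path -> names found in that file
files = {
    "data/a.xml": ["Philon", "Dion"],
    "data/b.xml": ["Apollonios"],
}

# comprehension: the first `for` is the outer loop, the second the inner one
flat = [
    name.upper()
    for path in files
    for name in files[path]
]

# the same logic written as explicit loops
unrolled = []
for path in files:
    for name in files[path]:
        unrolled.append(name.upper())

print(flat)  # ['PHILON', 'DION', 'APOLLONIOS']
print(flat == unrolled)  # True
```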

In [None]:
len(names)

In [None]:
names_df = pd.DataFrame(names).set_index('id', drop=False)

In [None]:
names_df.head()

### Advanced version

We want to extract all names from the TEI files while keeping the provenance of each name (i.e. the path of the file where it was found).

The logic is the same as in the code above, except for `line 2`. Instead of throwing all the names together, we create one tuple per name, containing the file path as its first element and the parsed name (a dictionary) as its second.

In [None]:
names = [
    (file, parse_name_element(name))
    for file in fetch_input_filenames('data/')
    for name in find_name_element(read_xml(file))
]

At this point, we need to *inject* the filename information into each name, before creating the final dataframe.

We do this in two steps:
1. we create a list of single-row dataframes, one per name (each containing the parsed name + the file path)
2. we concatenate all the dataframes in the list into a new one.

In [None]:
dfs = []

for file, parsed_name in names:
    # `parsed_name` is a single dictionary, hence the one-item list
    df = pd.DataFrame([parsed_name]).set_index('id', drop=False)
    df['file'] = file
    dfs.append(df)

In [None]:
names_df = pd.concat(dfs)

In [None]:
names_df.shape

In [None]:
names_df.head()
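Building one dataframe per name and concatenating them works, but it is not the only route. As an alternative sketch (not the approach used above), the file path can be merged into each dictionary first, so that a single `DataFrame` is created in one go. The `pairs` list is toy data standing in for the `(file, parsed name)` tuples.

```python
import pandas as pd

# toy stand-in for the (file, parsed name) tuples built above
pairs = [
    ("data/a.xml", {"surface": "Philon", "id": "#philon", "type": "given"}),
    ("data/a.xml", {"surface": "Dionos", "id": "#dion", "type": "patronymic"}),
    ("data/b.xml", {"surface": "Philonos", "id": "#philon", "type": "patronymic"}),
]

# copy each dictionary and add the file path under a new "file" key
rows = [{**name, "file": file} for file, name in pairs]

# one DataFrame for all rows, indexed by id as in the rest of the notebook
toy_df = pd.DataFrame(rows).set_index("id", drop=False)
print(toy_df.shape)
print(toy_df["file"].tolist())
```

On a handful of files the difference is negligible; on a large corpus, creating one dataframe instead of thousands of single-row ones is likely to be noticeably faster.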

## Data exploration

Let's now see how `pandas` can be used to explore this data.

Think of `pandas` as a very powerful spreadsheet that you can program yourself to answer your burning questions about any dataset.

In [None]:
names_df.head(10)

How many names do we have for each type?

In [None]:
names_df.type.value_counts()

**Q**: Do you notice anything special about the counts above?

How many times does each name occur?

In [None]:
names_df.id.value_counts()

Not very informative, but we can even plot the name frequency:

In [None]:
names_df.id.value_counts().plot(kind='bar', figsize=(6,4))

In [None]:
# most frequently occurring
names_df.id.value_counts().max()

In [None]:
# least frequently occurring
names_df.id.value_counts().min()
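To make explicit what these calls return, here is a minimal sketch on a toy `Series` of invented ids: `value_counts()` yields a new `Series` whose index holds the distinct values and whose values are their counts, sorted from most to least frequent.

```python
import pandas as pd

# toy ids: "#philon" occurs three times, the others once each
ids = pd.Series(["#philon", "#dion", "#philon", "#apollonios", "#philon"])

counts = ids.value_counts()

# the first entry is the most frequent id together with its count
most_frequent_id = counts.index[0]
print(most_frequent_id, counts.iloc[0])

# max/min/mean operate on the counts, not on the ids themselves
print(counts.max(), counts.min(), counts.mean())
```

To find out *which* id is the most frequent, rather than how often it occurs, `counts.idxmax()` returns the index label of the largest count.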

Nearly all name surface forms are unique:

In [None]:
names_df.surface.value_counts().mean()

If we look at name ids, we can see that on average each name occurs roughly 1.2 times:

In [None]:
names_df.id.value_counts().mean()

In [None]:
names_df.id.value_counts().median()

And 75% of the names occur only once:

In [None]:
names_df.id.value_counts().describe()
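The claim about 75% is read off the `75%` row of `describe()`: if that quartile equals 1, then at least three quarters of the ids occur exactly once. A minimal sketch with invented counts (not the notebook's data) that checks this reading against a direct computation:

```python
import pandas as pd

# invented occurrence counts for ten ids: eight occur once
counts = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 5])

summary = counts.describe()
print(summary["mean"], summary["50%"], summary["75%"])

# direct check: share of ids that occur exactly once
share_once = (counts == 1).mean()
print(share_once)
```

Here the mean (1.5) is pulled up by a few frequent ids, while the median and the third quartile stay at 1, which is the typical shape of name frequency distributions in small corpora.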

# Exercise


- You are asked to write a simple Python program by modifying the code provided in the notebook `Pandas_BeautifulSoup.ipynb`, section "XML data → `DataFrame`". The current code looks for `<name>` elements and creates a `DataFrame` out of them. For the exercise, you are asked to do something similar for a different set of TEI/EpiDoc elements of your choice.
- These are the steps to follow:
    1) identify one or more TEI elements of interest (lemmata, variants, bibliographic elements, metadata, etc.);
    2) specify what information you want to retain from them, and extract it from the XML (via `BeautifulSoup`) by modifying the code provided;
    3) convert it to a `pandas.DataFrame` and explore some statistics (for example by using `value_counts()`).


In [None]:
# put your code here