# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts to be machine actionable and an organization that oversees the TEI standards.

As a standard, TEI gives humanities scholars a uniform way to encode literary and documentary texts so that they are machine actionable for display, searching, or processing. In practice, TEI is a set of tags built on top of standard XML.
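Because TEI is ordinary XML, any XML parser can read it. The sketch below uses Python's standard library on a tiny fragment invented for illustration (it is not taken from any real edition). The only TEI-specific details are the TEI namespace and the `note` element with its `place` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment, made up for this example.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>A sentence of running text.<note place="bottom">A footnote.</note></p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(sample_tei)

# TEI elements live in the TEI namespace, so ElementTree needs a prefix mapping.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
notes = root.findall(".//tei:note", ns)

for note in notes:
    print(note.get("place"), "->", note.text)
```

The tags carry the scholarly meaning (this span is a note, and it sits at the bottom of the page), while the XML syntax makes that meaning available to software.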

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What is the TEI? (Women Writers Project)](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)

## Parsing TEI

### Set up

In [None]:
# ! pip install beautifulsoup4

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
xml_str = response.text

### Parse TEI

In [3]:
# use BeautifulSoup to create an xml object
# the "xml" parser requires the lxml package (pip install lxml)
xml = BeautifulSoup(xml_str, "xml")

In [4]:
# find all footnotes
footnotes = xml.find_all('note', attrs={'place': 'bottom'})
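To see what this kind of `find_all` call selects, here is a minimal sketch on a small sample invented for illustration (it is not taken from gibbon.xml). It assumes that notes may carry different `place` values. It also uses Python's built-in `html.parser` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, made up for this example.
sample = """
<div type="chapter">
  <p>First claim.<note place="bottom">Source for the first claim.</note></p>
  <p>Second claim.<note place="margin">A marginal gloss.</note></p>
  <p>Third claim.<note place="bottom">Source for the third claim.</note></p>
</div>
"""

# The built-in "html.parser" keeps this sketch dependency-free.
sample_soup = BeautifulSoup(sample, "html.parser")

# Every <note> element, regardless of attributes.
all_notes = sample_soup.find_all("note")

# Only <note> elements whose place attribute equals "bottom".
bottom_notes = sample_soup.find_all("note", attrs={"place": "bottom"})

print(len(all_notes), len(bottom_notes))
print([note.get_text() for note in bottom_notes])
```

The `attrs` filter is what separates footnotes from other kinds of notes, such as marginal glosses, when an edition encodes both.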

In [15]:
# remove extra space: collapse every run of whitespace (newlines, tabs,
# repeated spaces) into a single space so that words are not joined together
def remove_extra_space(text):
    return " ".join(text.split())
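Text pulled out of indented XML usually carries newlines and runs of spaces from the source file's layout. The sketch below compares two ways of cleaning a made-up sample string. The helper names `naive_strip` and `collapse_whitespace` are hypothetical and exist only for this comparison.

```python
import re

def naive_strip(text):
    """Delete newlines and pairs of spaces outright."""
    return text.replace("\n", "").replace("  ", "")

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up footnote text with layout whitespace, as it might come out of an XML file.
raw = "Tacitus,\n        Annals,  book i.\n      "

naive_result = naive_strip(raw)
collapsed_result = collapse_whitespace(raw)

print(repr(naive_result))      # 'Tacitus,Annals,book i.'
print(repr(collapsed_result))  # 'Tacitus, Annals, book i.'
```

Deleting whitespace can fuse neighboring words, while collapsing it to a single space keeps word boundaries intact.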

In [32]:
# prepare data for dataframe: one dict per footnote, numbered from 1
processed_footnotes = []
for i, footnote in enumerate(footnotes, start=1):
    processed_footnotes.append({
        "number": f"footnote {i}",
        "text": remove_extra_space(footnote.get_text()),
    })

In [34]:
# convert to dataframe
df = pd.DataFrame(processed_footnotes)

In [36]:
# save the footnotes to a CSV file, without the dataframe index column
file_name = "gibbon_footnotes.csv"
df.to_csv(file_name, index=False)