# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts so that they are machine-actionable and the organization that maintains that standard.

As a standard, TEI gives humanities scholars a uniform way to encode literary and documentary texts so that they are machine-actionable for display, searching, or processing. TEI is a vocabulary of tags built on top of basic XML.

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What is the TEI? (Women Writers Project)](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)


## Parsing TEI

In [1]:
# imports
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
# load xml file
with open('gibbon.xml', encoding='utf8', mode='r') as f:
    tei_string = f.read()

In [3]:
# just to show that file loaded properly
tei_string[:500]

'<TEI xmlns="http://www.tei-c.org/ns/1.0">\n    <teiHeader>\n        <fileDesc>\n            <titleStmt>\n                <title>The history of the decline and fall of the Roman Empire: By Edward Gibbon, Esq; ... [pt.2]</title>\n                <author>Gibbon, Edward, 1737-1794.</author>\n            </titleStmt>\n            <extent>511 600dpi bitonal TIFF page images and SGML/XML encoded text</extent>\n            <publicationStmt>\n                <publisher>University of Michigan Library</publisher>\n '

In [4]:
# use BeautifulSoup to instantiate tei object
# TEI is XML, so name the XML parser explicitly (requires the lxml package)
tei_object = BeautifulSoup(tei_string, 'xml')

In [5]:
# find all margin notes
margin_notes = tei_object.find_all('note', attrs={'place':'margin'})
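
The same lookup can be sketched without BeautifulSoup, using the standard library's `xml.etree.ElementTree`. This is an alternative technique, not what the notebook itself uses, and the `sample_tei` fragment is a made-up stand-in for `gibbon.xml`. Because TEI elements live in the `http://www.tei-c.org/ns/1.0` namespace, ElementTree needs that namespace spelled out in the search path.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment standing in for gibbon.xml (not taken from the real file).
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>Some running text.
        <note place="margin">A. D. 268.</note>
        <note place="foot">A footnote, not a margin note.</note>
        <note place="margin">Death of Aureolus.</note>
      </p>
    </body>
  </text>
</TEI>"""

# Map a prefix of our choosing to the TEI namespace declared on the root element.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(sample_tei)

# ".//" searches all descendants; the predicate keeps only notes placed in the margin.
sample_margin_notes = root.findall(".//tei:note[@place='margin']", ns)

# itertext() gathers the text of the element and of any child elements it contains.
sample_texts = ["".join(note.itertext()) for note in sample_margin_notes]
print(len(sample_margin_notes), sample_texts)
```

The footnote is skipped because its `place` attribute is not `margin`, which mirrors the `attrs={'place':'margin'}` filter in the cell above.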

In [7]:
# just to see how many margin notes there are
len(margin_notes)

402

In [17]:
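# clean margin notes: delete runs of two or more spaces/newlines
# (indentation left over from the XML layout) and add each note to a list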
clean_margin_notes = []
for margin_note in margin_notes:
    margin_note_text = re.sub(r'[\ \n]{2,}', '', margin_note.text)
    clean_margin_notes.append(margin_note_text)
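
If a note's text wraps across lines inside the XML, deleting the whitespace run outright can glue two words together. One possible variation, offered as a sketch and not as a replacement for the cell above, collapses every run of whitespace to a single space. The helper name `normalize_whitespace` and the sample string are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and drops empties.
    return " ".join(text.split())


# Hypothetical note text with a line break and indentation in the middle.
raw_note = "\n    Character and elevation\n        of the emperor Claudius.\n  "

# The substitution used in the notebook cell, applied to the same string.
deleted = re.sub(r'[\ \n]{2,}', '', raw_note)

# The variation: whitespace runs become single spaces instead of disappearing.
collapsed = normalize_whitespace(raw_note)

print(repr(deleted))
print(repr(collapsed))
```

Whether this matters depends on how the notes are laid out in the source file; if no note contains an internal line break, both approaches give the same result.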

In [18]:
# just to see how many clean margin notes there are
len(clean_margin_notes)

402

In [19]:
# convert list to dataframe
df = pd.DataFrame(clean_margin_notes)

In [20]:
# check dataframe
df.head()

Unnamed: 0,0
0,"Aureolus invades Italy, is defeated and beſieg..."
1,A. D. 268.
2,A. D. 268. March 20. Death of Gallienus.
3,Character and elevation of the emperor Claudius.
4,Death of Aureolus.


In [21]:
# save dataframe as csv
df.to_csv('margin_notes.csv')
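
By default `to_csv` also writes the dataframe index as an extra, unnamed first column, and a dataframe built from a bare list gets a column labelled `0`. A minimal sketch of one way to get a tidier file, assuming a column name of `margin_note` (an illustrative choice, not the notebook's) and two sample notes in place of the real list:

```python
import io

import pandas as pd

# Two sample notes standing in for the full clean_margin_notes list.
sample_notes = ["A. D. 268.", "Death of Aureolus."]

# Name the column explicitly instead of accepting the default label 0.
sample_df = pd.DataFrame(sample_notes, columns=["margin_note"])

# Write to an in-memory buffer here; pass a filename instead to write to disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # index=False omits the row-number column

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Keeping the index is harmless if the row numbers are useful later; dropping it simply avoids an `Unnamed: 0` column when the CSV is read back with `pd.read_csv`.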