# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts so that they are machine actionable and the organization that maintains that standard.

As a standard, TEI gives humanities scholars a uniform way to encode literary and documentary texts so that they can be displayed, searched, or processed by machine. In practice, TEI is a vocabulary of tags built on top of basic XML.
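
Because TEI is built on XML, a general-purpose XML parser can read it. The sketch below uses Python's standard-library `xml.etree.ElementTree` on a made-up fragment (the title is hypothetical, and the fragment is not a complete or valid TEI file) to show that TEI elements live in the TEI namespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal TEI-style fragment for illustration only.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Sample Title</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

# TEI elements are namespaced, so searches need a prefix-to-URI mapping.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(sample_tei)
title = root.find(".//tei:title", ns)

print(root.tag)    # the tag name includes the namespace URI in braces
print(title.text)
```

The rest of this notebook uses Beautiful Soup instead, which is more forgiving but handles namespaces and tag case differently.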

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What is the TEI from the Women Writers Project](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)


## Parsing TEI

In [1]:
# imports
from bs4 import BeautifulSoup
import re
import pandas as pd

In [3]:
# load xml file
with open('gibbon.xml', encoding = 'utf8', mode = 'r') as f:
    tei_string = f.read()

In [5]:
tei_string[:100]

'<TEI xmlns="http://www.tei-c.org/ns/1.0">\n    <teiHeader>\n        <fileDesc>\n            <titleStmt>'

In [7]:
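# use BeautifulSoup to instantiate the TEI object
# naming the parser explicitly avoids a warning; 'lxml' is an HTML parser,
# so the document is wrapped in <html><body> and tag names are lowercased
tei_object = BeautifulSoup(tei_string, 'lxml')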

In [8]:
tei_object

<html><body><tei xmlns="http://www.tei-c.org/ns/1.0">
<teiheader>
<filedesc>
<titlestmt>
<title>The history of the decline and fall of the Roman Empire: By Edward Gibbon, Esq; ... [pt.2]</title>
<author>Gibbon, Edward, 1737-1794.</author>
</titlestmt>
<extent>511 600dpi bitonal TIFF page images and SGML/XML encoded text</extent>
<publicationstmt>
<publisher>University of Michigan Library</publisher>
<pubplace>Ann Arbor, Michigan</pubplace>
<date when="2007-10">2007 October</date>
<idno type="DLPS">004848826</idno>
<idno type="ESTC">T78366</idno>
<idno type="DOCNO">CW100611167</idno>
<idno type="TCP">K065082.002</idno>
<idno type="GALEDOCNO">CW3300409632</idno>
<idno type="CONTENTSET">ECHG</idno>
<idno type="IMAGESETID">0034700102</idno>
<availability>
<p>
                        This keyboarded and encoded edition of the work described above is co-owned by the institutions providing financial support to the Early English Books Online Text Creation Partnership. This Phase I text is avai
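
The output above shows `teiHeader`, `fileDesc`, and `pubPlace` in lowercase, which affects later searches. As a minimal sketch on a made-up fragment (not taken from `gibbon.xml`), and assuming Python's built-in `html.parser` behaves like the parser used above in this respect, a search with the original mixed-case name finds nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; not taken from gibbon.xml.
sample = "<TEI><teiHeader><pubPlace>Ann Arbor</pubPlace></teiHeader></TEI>"

# HTML parsers lowercase tag names while building the tree.
soup = BeautifulSoup(sample, "html.parser")

mixed_case_hit = soup.find("pubPlace")  # no match: the stored name is lowercase
lower_case_hit = soup.find("pubplace")  # matches

print(mixed_case_hit)
print(lower_case_hit.text)
```

If the original casing matters, Beautiful Soup also accepts `'xml'` as the parser name, which requires the `lxml` package and keeps tag names as written.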

In [9]:
# find all margin notes
margin_notes = tei_object.find_all('note', attrs = {'place': 'margin'})

In [10]:
len(margin_notes)

402
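
To make the attribute filter concrete, here is a small sketch on a made-up fragment (the notes and their `place` values are illustrative, not taken from `gibbon.xml`). It shows that `attrs = {'place': 'margin'}` keeps only the notes whose `place` attribute equals `margin`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two margin notes and one footnote.
sample = """
<div>
  <p>Some text<note place="margin">A. D. 268.</note> continues
  <note place="foot">A footnote.</note>
  <note place="margin">Death of Aureolus.</note></p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Without a filter, every <note> is returned.
all_notes = soup.find_all("note")

# With the attrs filter, only notes with place="margin" are returned.
margin_only = soup.find_all("note", attrs={"place": "margin"})

print(len(all_notes), len(margin_only))
print([note.text for note in margin_only])
```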

In [15]:
# clean margin notes and add to a list
clean_margin_notes = []
for margin_note in margin_notes:
    # remove every run of two or more spaces/newlines
    clean_note = re.sub(r'[\ \n]{2,}', '', margin_note.text)
    clean_margin_notes.append(clean_note)

In [16]:
len(clean_margin_notes)

402
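
One thing to watch for in the cleaning step: the pattern deletes each run of two or more spaces or newlines outright, so a note whose text wraps across indented lines would have two words joined together. The sketch below uses a made-up note string to compare that behavior with an alternative (an assumption about what is wanted, not the method used above) that collapses all whitespace to single spaces:

```python
import re

# Hypothetical note text with a line break and indentation between words.
raw_note = "\n      Death of\n      Gallienus.\n   "

# The pattern from the cleaning step: whitespace runs are removed entirely.
removed = re.sub(r'[\ \n]{2,}', '', raw_note)

# Alternative: split on any whitespace and rejoin with single spaces.
collapsed = ' '.join(raw_note.split())

print(repr(removed))
print(repr(collapsed))
```

Whether this matters depends on how the notes are laid out in the source file, so it is worth spot-checking a few cleaned notes against the original XML.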

In [17]:
# convert list to dataframe
df = pd.DataFrame(clean_margin_notes)

In [18]:
# check dataframe
df.head()

| | 0 |
|---|---|
| 0 | Aureolus invades Italy, is defeated and beſieg... |
| 1 | A. D. 268. |
| 2 | A. D. 268. March 20. Death of Gallienus. |
| 3 | Character and elevation of the emperor Claudius. |
| 4 | Death of Aureolus. |


In [19]:
# save dataframe as csv
df.to_csv('margin_notes.csv')
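
As written, the CSV will contain a column named `0` plus an unnamed index column. A possible variation, sketched here with a made-up two-item list and a hypothetical column name `margin_note`, is to name the column and pass `index=False` so only the notes are written. The example writes to an in-memory buffer rather than a file so the result can be inspected directly:

```python
import io

import pandas as pd

# Hypothetical stand-in for clean_margin_notes.
notes = ["A. D. 268.", "Death of Aureolus."]

# Passing a dict gives the column a descriptive name instead of 0.
df = pd.DataFrame({"margin_note": notes})

# index=False leaves the row index out of the CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The same two arguments work when writing to a path such as `'margin_notes.csv'`.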