# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts to be machine actionable and an organization that oversees the TEI standards. 

As a standard, TEI gives humanities scholars a uniform way to encode literary and documentary texts so that they are machine actionable for display, searching, or processing. TEI is a set of tags that piggyback on basic XML.
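Because TEI is ordinary XML with an agreed vocabulary of tags, any XML tool can read it. The following is a minimal sketch using only Python's standard library. The fragment is hypothetical and heavily trimmed (it is not taken from `gibbon.xml` and would not validate against the full TEI schema); the names `sample_tei` and `TEI_NS` are introduced here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed TEI fragment for illustration only.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Document</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Body text <note place="margin">A marginal note</note> continues.</p>
    </body>
  </text>
</TEI>"""

# TEI elements live in the TEI namespace, so lookups need a prefix mapping.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(sample_tei)

# The header carries metadata such as the title.
title = root.find(".//tei:title", TEI_NS).text

# The body carries the encoded text, including notes marked by placement.
notes = [n.text for n in root.findall(".//tei:note[@place='margin']", TEI_NS)]

print(title)
print(notes)
```

The same tags and attributes (`note`, `place="margin"`) are what the parsing steps later in this notebook look for.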

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What Is the TEI? (Women Writers Project)](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)


## Parsing TEI

In [None]:
# imports
from bs4 import BeautifulSoup
import re
import pandas as pd

In [None]:
# load xml file
with open('gibbon.xml', encoding='utf8', mode='r') as f:
    tei_string = f.read()

In [None]:
# use BeautifulSoup to instantiate the TEI object
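# name the parser explicitly; 'xml' requires the lxml package and keeps tag names case-sensitive
tei_object = BeautifulSoup(tei_string, 'xml')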

In [None]:
# find all margin notes
margin_notes = tei_object.find_all('note', attrs={'place': 'margin'})
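To see what the `attrs` filter does, it may help to run the same kind of query on a tiny inline sample. This is a sketch under assumptions: the markup below is invented for illustration, and it uses Beautiful Soup's built-in `html.parser` only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with notes in two different placements.
sample = """<body>
  <p>First paragraph.<note place="margin">Margin note one</note></p>
  <p>Second paragraph.<note place="foot">A footnote</note></p>
  <p>Third paragraph.<note place="margin">Margin note two</note></p>
</body>"""

soup = BeautifulSoup(sample, "html.parser")

# Without a filter, every <note> element is returned.
all_notes = soup.find_all("note")

# With attrs, only notes whose place attribute equals "margin" are returned.
margin_only = soup.find_all("note", attrs={"place": "margin"})
margin_texts = [note.get_text() for note in margin_only]

print(len(all_notes))
print(margin_texts)
```

Changing the value in `attrs` (for example to `"foot"`) would select a different class of notes with the same call.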

In [None]:
# clean margin notes and add to a list
clean_margin_notes = []
for margin_note in margin_notes:
    # collapse each run of whitespace into one space so words split across lines stay separated
    clean_note = re.sub(r'\s+', ' ', margin_note.text).strip()
    clean_margin_notes.append(clean_note)
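Note text extracted from XML usually carries the line breaks and indentation of the source file, so how whitespace is cleaned matters. The sketch below contrasts two options on an invented string: deleting whitespace runs outright versus collapsing them into a single space. The helper name `normalize_whitespace` is hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Invented example of a note that wraps onto an indented second line.
raw = "The reign of\n            Augustus.  "

# Deleting runs of two or more spaces/newlines glues neighbouring words together.
deleted = re.sub(r"[ \n]{2,}", "", raw)

# Collapsing runs into a single space keeps the words apart.
collapsed = normalize_whitespace(raw)

print(repr(deleted))    # 'The reign ofAugustus.'
print(repr(collapsed))  # 'The reign of Augustus.'
```

Collapsing is generally the safer choice when a note may span more than one line in the XML source.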

In [None]:
# convert list to dataframe, naming the single column
margin_notes_df = pd.DataFrame(clean_margin_notes, columns=['margin_note'])

In [None]:
# check dataframe
margin_notes_df.head()

In [None]:
# save dataframe as csv
margin_notes_df.to_csv('margin_notes.csv')
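A quick way to confirm that the export worked is to read the file back and compare. The following sketch uses invented notes and a temporary directory so that it does not touch `margin_notes.csv`; the column name `margin_note` and the choice of `index=False` are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Invented notes standing in for the cleaned list.
notes = ["Margin note one", "Margin note two"]
df = pd.DataFrame({"margin_note": notes})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "margin_notes_sample.csv")

    # index=False leaves the row numbers out of the file.
    df.to_csv(csv_path, index=False, encoding="utf8")

    # Read the file back to check that nothing was lost.
    reloaded = pd.read_csv(csv_path, encoding="utf8")

print(reloaded)
```

If the index is written (the default), `pd.read_csv` will show it as an extra unnamed column unless `index_col=0` is passed when reading.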