# Parsing a TEI Document - Homework

## Directions

Parse the TEI edition of Gibbon's _Decline and Fall_ and extract all the **marginal notes** (the XML file is provided).
1. Extract all marginal notes
2. Remove extraneous whitespace
3. Place marginal notes in a dataframe
4. Save the dataframe as a CSV file


## Hint

Here is what a marginal note looks like in the XML document:

`<note place="margin">A. D. 268. March 20. Death of Gallienus.</note>`

These differ from the footnotes we saw in class in two ways: (a) they do not have numbers and (b) the whitespace is different. You are free to accommodate those differences however you like.
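
To make the difference from footnotes concrete, here is a minimal sketch that filters on the `place` attribute. The fragment is made up for illustration: the `place="foot"` value and the `n="1"` attribute are assumptions about how a footnote might be encoded, not copied from the provided file, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; place="foot" and n="1" are assumptions,
# not taken from gibbon.xml.
sample = """
<p>Some running text.
  <note place="margin">A. D. 268. March 20. Death of Gallienus.</note>
  <note place="foot" n="1">A hypothetical footnote.</note>
</p>
"""

soup = BeautifulSoup(sample, "html.parser")

# Keep only <note> elements whose place attribute is exactly "margin".
margin_notes = soup.find_all("note", attrs={"place": "margin"})

print(len(margin_notes))
print(margin_notes[0].get_text())
print(margin_notes[0].get("n"))  # marginal notes carry no number attribute here
```

Because the filter matches on the attribute value, the numbered footnote is skipped and only the marginal note comes back.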

### Set up

In [None]:
# ! pip install beautifulsoup4

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
xml_str = response.text


### Parse TEI

In [4]:
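# use BeautifulSoup to create a parsed object from the XML string
# (naming the parser explicitly avoids the "no parser specified" warning)
xml = BeautifulSoup(xml_str, 'html.parser')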

In [8]:
# find all marginal notes
marginal_notes = xml.find_all('note', attrs={'place': 'margin'})
marginal_notes[0]

<note place="margin">
                        Aureolus invades Italy, is defeated and be
                        <g ref="char:EOLhyphen"></g>
                        ſieged at Milan.
                    </note>

In [9]:
# remove extra space (if needed)
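def remove_extra_space(text):
  """Strip line breaks and indentation from a note's text."""
  text = text.replace('\n', '')  # drop line breaks
  text = text.replace('  ', '')  # drop pairs of spaces (the indentation)
  return text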
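
The function above removes spaces two at a time, so it relies on the indentation always having an even number of spaces; a run of three or five would leave one stray space behind. A possible alternative, sketched here with the standard library's `re` module, collapses any run of whitespace to a single space. The name `collapse_whitespace` and the sample string are hypothetical.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical input with odd-length runs of spaces and embedded line breaks.
raw = "\n     A. D. 268.\n   March 20.   Death of Gallienus.\n  "
print(collapse_whitespace(raw))
```

One trade-off to be aware of: where a line break falls in the middle of a word, as with the `EOLhyphen` marker in the first note, this version would leave a space ("be ſieged") rather than rejoining the word, so those breaks would need separate handling.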

In [11]:
# prepare data for dataframe
processed_marginal_notes = []
# number the notes from 1, since marginal notes carry no numbers of their own
for i, marginal_note in enumerate(marginal_notes, start=1):
  processed_marginal_notes.append({
    "number": f'marginal note {i}',
    "text": remove_extra_space(marginal_note.text),
  })

In [12]:
processed_marginal_notes[0]

{'number': 'marginal note 1',
 'text': 'Aureolus invades Italy, is defeated and beſieged at Milan.'}

In [14]:
# convert to dataframe
df = pd.DataFrame(processed_marginal_notes)

In [15]:
# save dataframe as csv
file_name = 'gibbon_marginal_notes_HW.csv'
df.to_csv(file_name, index=False)
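
To confirm the file was written correctly, one option is to read it back and compare it with the original rows. This is a sketch on a one-row stand-in dataframe written to a temporary directory; the file name `marginal_notes_check.csv` is hypothetical, and the explicit `encoding="utf-8"` is there because the notes contain non-ASCII characters such as the long s (ſ).

```python
import os
import tempfile

import pandas as pd

# Stand-in row; the notebook builds the real rows from gibbon.xml.
rows = [
    {"number": "marginal note 1",
     "text": "Aureolus invades Italy, is defeated and beſieged at Milan."},
]
df_check = pd.DataFrame(rows)

# Write to a temporary directory so the check leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "marginal_notes_check.csv")  # hypothetical name
    df_check.to_csv(path, index=False, encoding="utf-8")
    df_back = pd.read_csv(path, encoding="utf-8")

# The rows read back should match the rows that were written.
print(df_back.to_dict("records") == rows)
```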