# Parsing a TEI Document - Homework

## Directions

Parse the TEI of Gibbon's _Decline and Fall_ to extract all the **marginal notes**. (An XML file is provided.)
1. Extract all marginal notes
2. Remove extraneous whitespace
3. Place the marginal notes in a dataframe
4. Save the dataframe as a CSV file


## Hint

Here is a snippet of what a marginal note in the xml document looks like:

`<note place="margin">A. D. 268. March 20. Death of Gallienus.</note>`

These differ from the footnotes we saw in class in that (a) they do not have numbers and (b) the whitespace is different. You are free to accommodate that however you like.
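One possible way to handle the whitespace difference is sketched below. The sample string and the helper name `collapse_whitespace` are hypothetical, made up for illustration; the actual layout of the notes in the file may differ. The sketch also shows why simply deleting newlines can be risky: words on adjacent lines end up fused together.

```python
# Hypothetical sample: a marginal note whose text is broken across lines
# and padded with indentation, as can happen in pretty-printed XML.
raw_note = "A. D. 268.\n        March 20.\nDeath of Gallienus."

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return " ".join(text.split())

# Deleting newlines outright fuses "March 20." and "Death" into one token.
naive = raw_note.replace("\n", "")
cleaned = collapse_whitespace(raw_note)

print(repr(naive))
print(repr(cleaned))
```

Calling `split()` with no arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces with a single space normalizes the text in one step.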

### Set up

In [None]:
# ! pip install beautifulsoup4 lxml

In [15]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [16]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
xml_str = response.text

### Parse TEI

In [17]:
# use BeautifulSoup to create an xml object
# name the parser explicitly; the "xml" parser requires the lxml package
xml = BeautifulSoup(xml_str, "xml")

In [18]:
# find all marginal notes
marginal_notes = xml.find_all('note', attrs={'place': 'margin'})
marginal_notes[100].text # test

'Spectacles of Rome.'
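To see what the attribute filter does in isolation, here is a minimal sketch on a tiny hand-written fragment. It deliberately swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages. The fragment, including the `place="foot"` footnote markup, is an assumption for illustration and is not taken from the Gibbon file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one footnote and two marginal notes.
sample = """<p>
  <note place="foot" n="1">Hypothetical footnote text.</note>
  <note place="margin">A. D. 268. March 20. Death of Gallienus.</note>
  <note place="margin">Death of Aureolus.</note>
</p>"""

root = ET.fromstring(sample)

# The predicate keeps only <note> elements whose place attribute is "margin".
margin_notes = root.findall(".//note[@place='margin']")
texts = [note.text for note in margin_notes]

print(len(margin_notes))
print(texts)
```

The footnote is skipped because its `place` value does not match. Note that if a real TEI file declares a namespace, `ElementTree` would need the namespace included in the tag name, which is one reason the BeautifulSoup call above is more convenient here.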

In [19]:
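# remove extra space (if needed)
def remove_extra_space(text):
    # split on any run of whitespace (spaces, tabs, newlines) and rejoin
    # with single spaces, so words on adjacent lines are not fused together
    return ' '.join(text.split())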

In [21]:
# prepare data for dataframe: one dict per note, numbered from 1
processed_notes = []
for i, note in enumerate(marginal_notes, start=1):
    processed_notes.append({
        "number": f"marginal note {i}",
        "text": remove_extra_space(note.text),
    })

In [22]:
processed_notes[100] # test

{'number': 'marginal note 101', 'text': 'Spectacles of Rome.'}

In [23]:
# convert to dataframe
df = pd.DataFrame(processed_notes)
df.head() # test

| | number | text |
|---|---|---|
| 0 | marginal note 1 | Aureolus invades Italy, is defeated and beſieg... |
| 1 | marginal note 2 | A. D. 268. |
| 2 | marginal note 3 | A. D. 268. March 20. Death of Gallienus. |
| 3 | marginal note 4 | Character and elevation of the emperor Claudius. |
| 4 | marginal note 5 | Death of Aureolus. |


In [25]:
# save dataframe as csv
file_name = "gibbon_marginal_notes.csv"
df.to_csv(file_name, index=False)
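Some note texts contain commas, so it can be worth checking that they survive the trip through CSV. The sketch below uses the standard library's `csv` module on two hypothetical rows (the first text is shortened for illustration) and an in-memory buffer instead of a file. By default, pandas' `to_csv` uses the same minimal quoting rule, so the saved file should behave similarly.

```python
import csv
import io

# Hypothetical rows mirroring the dataframe's two columns.
rows = [
    {"number": "marginal note 1", "text": "Aureolus invades Italy, is defeated."},
    {"number": "marginal note 2", "text": "A. D. 268."},
]

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["number", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the CSV text back and compare it with the original rows.
read_back = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

print(csv_text)
print(read_back == rows)
```

The field containing a comma is wrapped in double quotes when written, which is what lets the reader put it back into a single `text` value.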