# Convert the lemmatized Latin NH XML file to CSV

The XML file (`urn_cts_latinLit_phi0978.phi001.perseus-lat2.xml`) contains the lemmatized version of the Natural History (NH) from Thibault Clérice's project (https://zenodo.org/records/4731513).

In the file, the text is divided into passages. Each passage is an `<ab>` element whose `type` attribute gives its level; the NH uses the types `book` and `chapter`.

Tokens are in `<w>` elements. Each `<w>` element also carries these attributes:

- `n`: the book.chapter reference
- `lemma`: the lemma
- `pos`: the part of speech
- `msd`: the morpho-syntactic description

In the resulting CSV file, the `index` column gives the zero-based position of each token within its book.chapter passage.
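The structure described above can be illustrated on a tiny sample. The sketch below is not the notebook's own code: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra dependencies, and `SAMPLE_XML`, `extract_rows`, and all attribute values in the sample are hypothetical. The sample has no XML namespace; a namespaced file would need qualified tag names with ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-chapter sample; tokens and attribute values are illustrative only.
SAMPLE_XML = """<text>
  <ab type="chapter" n="1.1">
    <w n="1.1" lemma="mundus" pos="NOMcom" msd="Case=Acc|Numb=Sing">mundum</w>
    <w n="1.1" lemma="et" pos="CONcoo" msd="MORPH=empty">et</w>
  </ab>
  <ab type="chapter" n="1.2">
    <w n="1.2" lemma="hic" pos="PROdem" msd="Case=Acc|Numb=Sing">hoc</w>
  </ab>
</text>"""


def extract_rows(xml_string):
    """Return one dict per token, with the index restarting in each chapter."""
    root = ET.fromstring(xml_string)
    rows = []
    for passage in root.iter("ab"):
        if passage.get("type") != "chapter":
            continue  # skip passages of any other type, e.g. "book"
        for index, w in enumerate(passage.iter("w")):
            rows.append({
                "reference": w.get("n"),
                "index": index,
                "token": w.text,
                "lemma": w.get("lemma"),
                "msd": w.get("msd"),
                "pos": w.get("pos"),
            })
    return rows


rows = extract_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the CSV, and the `index` value starts again at 0 when a new chapter begins.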

In [9]:
import pandas as pd
from bs4 import BeautifulSoup

In [10]:
#path = "/Users/u0154817/Documents/Data Extraction/urn_cts_latinLit_phi0978.phi001.perseus-lat2.xml"
# XML file downloaded from: https://github.com/lascivaroma/latin-lemmatized-texts/blob/0.1.2/lemmatized/xml/urn%3Acts%3AlatinLit%3Aphi0978.phi001.perseus-lat2.xml
path = "data/input/urn_cts_latinLit_phi0978.phi001.perseus-lat2.xml"

In [11]:
with open(path, encoding='utf-8') as xml_file:  # close the file once parsing is done
    soup = BeautifulSoup(xml_file, features="lxml")

In [12]:
chapters = soup.find_all("ab", type="chapter")

In [13]:
reference_column = []
index_column = []
token_column = []
lemma_column = []
msd_column = []
pos_column = []

In [14]:
for chapter in chapters: ## for each book.chapter
    
    w_tags = chapter.find_all("w") ## find all w tags
    
    for index, w_tag in enumerate(w_tags):
        
        reference = w_tag.get("n") ## get the book.chapter reference
        reference_column.append(reference)
        
        index_column.append(index)
        
        token = w_tag.get_text() ## get the token
        token_column.append(token)
                
        lemma = w_tag.get("lemma") ## get the lemma
        lemma_column.append(lemma)
        
        msd = w_tag.get('msd') ## get the msd
        msd_column.append(msd)
        
        pos = w_tag.get('pos') ## get the pos
        pos_column.append(pos)

In [15]:
data = {'reference': reference_column, 'index': index_column, 'token': token_column, 'lemma': lemma_column, "msd": msd_column, 'pos' : pos_column}
xml_Clerice_to_csv = pd.DataFrame(data)

In [16]:
xml_Clerice_to_csv.to_csv('data/intermediate/xml_Clerice_to_csv.csv', index=False)
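
One possible sanity check on the exported file is to confirm that, for every reference, the token indices run from 0 upward without gaps. The sketch below assumes that each reference corresponds to exactly one chapter passage, which may not hold for every file. It uses only the standard library, and `SAMPLE_CSV` and `references_with_index_gaps` are hypothetical names with illustrative data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt with the same columns as the exported CSV.
SAMPLE_CSV = """reference,index,token,lemma,msd,pos
1.1,0,mundum,mundus,Case=Acc|Numb=Sing,NOMcom
1.1,1,et,et,MORPH=empty,CONcoo
1.2,0,hoc,hic,Case=Acc|Numb=Sing,PROdem
"""


def references_with_index_gaps(csv_file):
    """Return the references whose indices are not exactly 0, 1, 2, ... in order."""
    indices = defaultdict(list)
    for row in csv.DictReader(csv_file):
        indices[row["reference"]].append(int(row["index"]))
    return [ref for ref, seen in indices.items() if seen != list(range(len(seen)))]


bad_references = references_with_index_gaps(io.StringIO(SAMPLE_CSV))
print(bad_references)
```

To run the same check on the real output, the `io.StringIO` object could be replaced by the file opened with `open('data/intermediate/xml_Clerice_to_csv.csv', newline='', encoding='utf-8')`.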