# Formatting ETCSL TEI XML files
## Introduction

The Electronic Text Corpus of Sumerian Literature ([ETCSL](http://etcsl.orinst.ox.ac.uk), 1998-2006) provides editions and translations of some 400 Sumerian literary texts. The goal of this Notebook is to format the ETCSL data in such a way that the (lemmatized) texts are made available for computational text analysis. In order to make the data compatible with output scraped from [ORACC](http://oracc.org), the Notebook ETCSL-to-EPSD2 should be run after running the current scraper.

For most purposes you do not need to run this scraper, because the final output is made available to you. However, if you need output in a different format or if you wish to know how the output was produced, you may read, adapt, and run this Notebook.

The original [ETCSL](http://etcsl.orinst.ox.ac.uk) files in TEI XML are available upon request from the [Oxford Text Archive](http://ota.ox.ac.uk/desc/2518). Note the following description on the OTA site:

> ## The Electronic Text Corpus of Sumerian Literature. Revised edition.

> ### Editor	
> Cunningham, Graham (ed.); Ebeling, Jarle (ed.); Black, Jeremy (deceased) (ed.); Flückiger-Hawker, Esther (ed.); Robson, Eleanor (ed.); Taylor, Jon (ed.); Zólyomi, Gábor (ed.)

> ### Availability	
> Use of this resource is restricted in some manner. Usually this means that it is available for non-commercial use only with prior permission of the depositor and on condition that this header is included in its entirety with any copy distributed.

The [manual](http://etcsl.orinst.ox.ac.uk/edition2/etcslmanual.php) of the ETCSL project explains in full detail the editorial principles and the technical details. According to the manual the ETCSL data are freely available and the XML source files can be downloaded.

The TEI XML source files were sent to me by the Oxford Text Archive upon request September 3rd 2015. Any (non-commercial) re-use of the data produced in this Notebook should reproduce the header quoted above ('Editor' and 'Availability') and is understood to be licensed under a [Creative Commons Share Alike](http://creativecommons.org/licenses/by-nc-sa/4.0/) license.



# The Scraper

This scraper expects the following files:

1. Directory `Input`
  * `etcsl.txt`: a list of ETCSL text numbers
2. Directory `etcsl/transliterations/`
  * This directory should contain the ETCSL TEI XML files.
3. Directory `equivalencies`
  * `equivalencies.json`: the lookup tables used below, including HTML entities with their Unicode equivalents (`ampersands`), ASCII-to-Unicode transliteration conventions (`ascii_unicode`), ETCSL version names with their abbreviated forms (`versions`), and the ETCSL-to-ORACC lemma equivalencies (`suxwords`, `emesalwords`, `propernouns`)

The output is saved in the `Output` directory as a single file, `alltexts.csv`.

## 1. Setting Up
First import the proper packages: 

- re: Regular Expressions
- ElementTree: read and analyze an XML file as an ordered tree
- StringIO (from io): enable treating strings as files (used for ElementTree)
- os: enable Python to perform basic Operating System functions (such as making a directory)
- time: allows the program to 'sleep' for a brief period
- json: read the equivalency tables stored in JSON format
- pandas: collect the parsed words in a DataFrame and save them as a CSV file
- tqdm: creates a progress bar

If you installed Python 3 and Jupyter by installing the [Anaconda Navigator](https://www.continuum.io/downloads), then most of these packages should already be installed, with the exception of tqdm. The first line in the cell below installs tqdm. It needs to be installed just once; after installing it you may disable that line by putting a # in front of it.

In [1]:
#! pip install tqdm
import re
import xml.etree.ElementTree as ET
from io import StringIO
import os
import time
import json
import pandas as pd
import tqdm
#import ipywidgets as widgets

#from ipywidgets import Checkbox, interactive
#from IPython.display import display
#from tqdm import *



## 1.a Load Equivalencies 

In [2]:
with open("equivalencies/equivalencies.json") as f:
    eq = json.load(f)
equiv = eq["suxwords"]
equiv.extend(eq["emesalwords"])
equiv.extend(eq["propernouns"])

## 2. Text Preparation 1: HTML-entities
The ETCSL TEI XML files are written in ASCII and represent special characters (such as š or ī) by a sequence of characters that begins with & and ends with ; (e.g. &c; represents š). These so-called HTML entities are used in translation, bibliography, and introductory text, but not in the transliteration of the Sumerian text itself (see below). The entities are for the most part project-specific and are declared and described elsewhere in the ETCSL file set. The ElementTree package cannot deal with these entities and thus we have to replace them with the actual (unicode) character that they represent, before feeding the data to ElementTree. 

Each entity is listed with its corresponding Unicode character (or expression) in the `ampersands` table of `equivalencies/equivalencies.json`. A few examples (entity, followed by its replacement):

    &aacute; á
    &aleph; ʾ
    &amacr; ā
    &ance; {anše}
    etc.

The table is loaded together with the other equivalencies (section 1.a) into the dictionary `eq`. In `eq["ampersands"]` each of the HTML entities is a key, with its Unicode equivalent as value. The function `ampersands()` uses this dictionary for a search-replace action.

The function `ampersands()` is called in `parsetext()` before the ElementTree is built. Note that the .xml files themselves are not changed by this process (or by any other process in this Notebook).

In [3]:
def ampersands(x):
    for amp in eq["ampersands"]:
        x = x.replace(amp, eq["ampersands"][amp])
    return x
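
As a minimal illustration of this search-and-replace step, the sketch below applies the same loop to a toy entity table. The entries are taken from the examples quoted above; the names `toy_entities` and `replace_entities` and the sample string are assumptions made for this example only, and the real table is much longer.

```python
# Toy entity table (hypothetical name); the real table lives in equivalencies.json.
toy_entities = {"&aacute;": "á", "&amacr;": "ā", "&ance;": "{anše}"}

def replace_entities(text, table):
    """Replace every entity key found in `text` by its Unicode value."""
    for entity, char in table.items():
        text = text.replace(entity, char)
    return text

# Invented sample string, not taken from an ETCSL file.
sample = "<title>&ance; and &amacr;</title>"
converted = replace_entities(sample, toy_entities)
print(converted)
```

Because the replacement happens on the raw string, the result can be handed to ElementTree without declaring the project-specific entities.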

## 4. Text Preparation 3: Removing 'Secondary Text' and/or 'Additional Text'

The ETCSL web pages include variants, indicated as '(1 ms. has instead: )', with the variant text enclosed in curly brackets. Two types of variants are distinguished: 'additional text' and 'secondary text'. 'Additional text' refers to a line that appears in a minority of sources (often in only one). 'Secondary text' refers to variant words or variant lines that are found in a minority of sources. The function `remove_extra()` may remove the words of 'secondary text' and/or 'additional text' before the text is parsed by ElementTree. 

In ETCSL TEI XML secondary/additional text is introduced by a tag of the type:
```xml
<addSpan to="c141.v11" type="secondary"/>
```
or
```xml
<addSpan to="c141.v11" type="additional"/>
```

The number c141 represents the text number in ETCSL (in this case Inana's Descent, text c.1.4.1). The return to the primary text is indicated by a tag of the type:
```xml
<anchor id="c141.v11"/>
```
Note that the `id` attribute in the `anchor` tag is identical to the `to` attribute in the `addSpan` tag.

The function `remove_extra()` uses regular expressions to identify and remove the Sumerian words and lines between those tags. The DOTALL flag (`re.DOTALL`) allows the search in the regular expression to continue over multiple lines. Note that the function does not simply erase everything between the `<addSpan>` and the `<anchor>` tags identified with the regular expression. Instead, it identifies words and lines between those tags, and eliminates those. This is necessary because on occasion there may be tags that begin before the `<addSpan>` tag and end with the secondary region. Erasing the whole region would invalidate the XML structure and make it impossible to scan the data with ElementTree.

The function `remove_extra()` is called by the function `parsetext()` (see below, section 10). The third argument (`which`) is either `additional` (to remove additional text) or `secondary` (to remove secondary text). By default, `parsetext()` removes neither.

In [4]:
def remove_extra(xmltext, textid, which):
    textid = textid.replace('.', '')
    find = re.compile('(<addSpan to=("' + textid + '.v[0-9]{1,3}") type="' + which + '"/>.*?<anchor id=\\2/>)', re.DOTALL)
    word = re.compile('<w .*?</w>', re.DOTALL) # identify a single word in "secondary text"
    line = re.compile('<l .*?</l>', re.DOTALL) # identify an entire line of "secondary text"
    sec = re.findall(find, xmltext) # make a list of "secondary text" passages
    sec = [second[0] for second in sec] #findall creates a list of tuples; take the first of each tuple. The first
                                        # element in the tuple is the actual (secondary) text, 
                                        # the second is an id number.
    noword = [re.sub(word, '', instance) for instance in sec] # remove the single secondary words from each "secondary" passage
    noline = [re.sub(line, '', instance) for instance in noword] #remove entire secondary lines from each "secondary" passage
    for search, repl in zip(sec, noline):
        xmltext = xmltext.replace(search, repl)
    return xmltext
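
The effect of this approach can be seen on a small invented fragment. The sketch below is a compact variant of the same idea, not the function above: it uses `re.sub` with a callback instead of `findall` plus `str.replace`, and it escapes the dot before `v`. The name `strip_variants` and the fragment are assumptions made for illustration.

```python
import re

def strip_variants(xmltext, textid, which):
    """Remove <w> and <l> elements between matching addSpan/anchor tags."""
    prefix = re.escape(textid.replace('.', ''))
    find = re.compile(
        r'<addSpan to=("' + prefix + r'\.v[0-9]{1,3}") type="' + which +
        r'"/>.*?<anchor id=\1/>', re.DOTALL)

    def clean(match):
        # Drop single words first, then entire lines, inside the matched passage.
        passage = re.sub(r'<w .*?</w>', '', match.group(0), flags=re.DOTALL)
        return re.sub(r'<l .*?</l>', '', passage, flags=re.DOTALL)

    return find.sub(clean, xmltext)

# Invented fragment: word "b" is secondary text, words "a" and "c" are primary.
fragment = ('<l n="1"><w form="a">a</w>'
            '<addSpan to="c141.v1" type="secondary"/>'
            '<w form="b">b</w>'
            '<anchor id="c141.v1"/>'
            '<w form="c">c</w></l>')
result = strip_variants(fragment, 'c.1.4.1', 'secondary')
print(result)
```

The `<addSpan>` and `<anchor>` tags themselves stay in place, so the enclosing `<l>` element remains well-formed and can still be parsed.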

## 5. Text Preparation 4: Gaps
In the XML, gaps in the text are indicated as follows:

```xml
<gap extent="8 lines missing"/>
```
In order to keep the gap at the right place in the text, we will make it into a line with the `<l>` tag.

In [5]:
def gaps(xmltext):
    xmltext = xmltext.replace('<gap extent', '<l extent')
    return xmltext
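
A short sketch of what this substitution achieves, on an invented fragment: after the replacement the gap is returned by the same iteration that visits ordinary lines, and it can be recognized by its `extent` attribute.

```python
import xml.etree.ElementTree as ET

# Invented fragment with one gap between two (empty) lines.
fragment = '<body><l n="1"/><gap extent="8 lines missing"/><l n="10"/></body>'
fragment = fragment.replace('<gap extent', '<l extent')

body = ET.fromstring(fragment)
# Gaps now appear in document order among the <l> elements.
extents = [lnode.get('extent') for lnode in body.iter('l')]
print(extents)  # [None, '8 lines missing', None]
```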

## 3. Text Preparation 2: Transliteration Conventions

Transliteration of Sumerian text in ETCSL TEI XML files uses **c** for **š**, **j** for **ŋ** and regular numbers for index numbers. The function `tounicode()` replaces each of those. For example **cag4** is replaced by **šag₄**. This function is called in the function `getword()` to format citation forms and forms (transliteration). The function `tounicode` uses the dictionary `ascii_unicode` which is stored in the `equivalencies.json` file.

In [6]:
def tounicode(x):
    for char in eq["ascii_unicode"]:
        x = x.replace(char, eq["ascii_unicode"][char])
    return x
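
The example from the paragraph above can be reproduced with a three-entry toy table. The names `toy_map` and `to_unicode` are assumptions for this sketch; the actual `ascii_unicode` table presumably covers all index digits and possibly more.

```python
# Assumed subset of the ascii_unicode table, based on the conventions described above.
toy_map = {"c": "š", "j": "ŋ", "4": "₄"}

def to_unicode(text, table):
    """Blind character-by-character substitution, as in tounicode()."""
    for ascii_char, uni_char in table.items():
        text = text.replace(ascii_char, uni_char)
    return text

print(to_unicode("cag4", toy_map))  # šag₄
```

Because the substitution is blind, it is only safe for transliteration. This is why `getword()` applies it to forms and citation forms but skips words whose part of speech is `NU`, where digits are actual numbers rather than sign indices.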

## Replace ETCSL by ORACC Lemmatization
For every word, once `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) have been pulled out of the XML file, the word is run through the ETCSL/ORACC equivalence lists to match it with the ORACC/epsd2 standards. Where an entry lists a second ORACC lemma (`oracc2`), it is stored as `cf2`, `gw2`, and `pos2`.

In [7]:
def etcsl_to_oracc(word):
    lemma = {key:word[key] for key in ['cf', 'gw', 'pos']}
    for entry in equiv:
        if lemma == entry["etcsl"]:
            word['cf'] = entry["oracc"]["cf"]
            word["gw"] = entry["oracc"]["gw"]
            word["pos"] = entry["oracc"]["pos"]
            if "oracc2" in entry:
                word["cf2"] = entry["oracc2"]["cf"]
                word["gw2"] = entry["oracc2"]["gw"]
                word["pos2"] = entry["oracc2"]["pos"]
    return word
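
The function above scans the full equivalence list for every word. As a hedged alternative, the sketch below builds a dictionary keyed on the `(cf, gw, pos)` triple once and then looks each word up directly. The entries in `toy_equiv` are placeholders in the same shape as the `equiv` list, not real lemmas, and `lemma_key` and `to_oracc` are hypothetical names.

```python
# Placeholder entries in the same shape as `equiv`; values are not real lemmas.
toy_equiv = [
    {"etcsl": {"cf": "lemma1", "gw": "gloss1", "pos": "N"},
     "oracc": {"cf": "lemmaA", "gw": "glossA", "pos": "N"}},
    {"etcsl": {"cf": "lemma2", "gw": "gloss2", "pos": "V"},
     "oracc": {"cf": "lemmaB", "gw": "glossB", "pos": "V/t"},
     "oracc2": {"cf": "lemmaC", "gw": "glossC", "pos": "N"}},
]

def lemma_key(d):
    """The triple that identifies a lemma."""
    return (d["cf"], d["gw"], d["pos"])

# Build the index once; a later duplicate overrides an earlier one, as in the loop above.
index = {lemma_key(entry["etcsl"]): entry for entry in toy_equiv}

def to_oracc(word, index):
    entry = index.get(lemma_key(word))
    if entry is None:
        return word                      # no equivalence recorded: keep the ETCSL lemma
    word = dict(word)                    # work on a copy
    for key in ("cf", "gw", "pos"):
        word[key] = entry["oracc"][key]
        if "oracc2" in entry:
            word[key + "2"] = entry["oracc2"][key]
    return word

converted = to_oracc({"cf": "lemma2", "gw": "gloss2", "pos": "V", "form": "form2"}, index)
print(converted)
```

With several thousand equivalence entries and a corpus of this size, a dictionary lookup avoids repeating the linear scan for each word; whether that matters in practice depends on the size of the lists.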

## 6. Formatting Words

A word in the ETCSL files is represented by a number of nodes in the XML tree that identify the form (transliteration), citation form, guide word, part of speech, etc. The function `getword()` formats the word as closely as possible to the ORACC conventions. Three different types of words are treated in three different ways: Proper Nouns, Sumerian words and Emesal words.

In ETCSL **proper nouns** are nouns that are qualified by a 'type' (Divine Name, Personal Name, Geographical Name, etc.; abbreviated as DN, PN, GN, etc.). In ORACC a word has a single POS; for proper nouns this is DN, PN, GN, etc., so what is 'type' in ETCSL becomes POS in ORACC. ORACC proper nouns usually do not have a guide word (only a number to enable disambiguation of namesakes). The ETCSL guide words ('label') for names come pretty close to ORACC citation forms. Names are therefore formatted differently from other nouns.

**Sumerian words** are treated in basically the same way in ETCSL and ORACC, but the citation forms and guide words are often different. Transformation of citation forms and guide words to ORACC (epsd2) standards takes place in the Notebook ETCSL-to-EPSD2. This harmonization process uses a set of dictionaries (prepared by Niek Veldhuis and Terri Tanaka) that record ETCSL to EPSD2 equivalencies.

**Emesal words** in ETCSL use their Sumerian equivalents as citation form ('lemma'), adding a separate node ('emesal') for the Emesal form proper. This Emesal form is the one that is used as citation form in the output.

Guide words need removal of commas and spaces. Removal of commas will allow the output files to be read as Comma Separated Value (csv) files, which is an efficient input format for processes in Python and R. In the output file commas separate different fields from each other (text ID, text name, line number and text). Spaces need to be removed because standard tokenizers will understand spaces as word dividers. 

In [8]:
def getword(node, meta_d):
    word = {key:meta_d[key] for key in meta_d} # store all meta data in meta_d in the word dictionary
    word["cf"] = node.get('lemma').replace('Xbr', '(X)')
    word["gw"] = node.get('label')
    if node.get('pos'):
        word["pos"] = node.get('pos')
    else:
        word["pos"] = 'NA'
        word["gw"] = 'NA'
    form = node.get('form').replace('Xbr', '(X)')
    word["form"] = form
    if node.get('emesal'):
        word["cf"] = node.get('emesal')
        word["lang"] = "sux-x-emesal"
    else:
        word["lang"] = "sux"
    if word["pos"] != 'NU':
        word["cf"] = tounicode(word["cf"])
        word["form"] = tounicode(word["form"])
    if node.get('type') and word["pos"] == 'N':
        if node.get('type') != 'ideophone':
            word["pos"] = node.get('type')
            word["cf"] = node.get('label')
            word["gw"] = '1'

    word["gw"] = word["gw"].replace(",", ";") #remove commas from guide words (replace by semicolon) to prevent
                                            #problems with processing of the csv format
    word["gw"] = word["gw"].replace(" ", "-") #remove spaces from guide words (replace by hyphen). Spaces
                                            #create problems with tokenizers in computational text analysis.
    word = etcsl_to_oracc(word)   
    return word
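
To make the treatment of proper nouns and guide words concrete, here is a simplified sketch that works on two invented `<w>` nodes. The attribute names (`form`, `lemma`, `pos`, `type`, `label`) are the ones read by `getword()`; the attribute values and the name `sketch_word` are assumptions for illustration, and Emesal handling, Unicode conversion, and metadata are left out.

```python
import xml.etree.ElementTree as ET

def sketch_word(node):
    """Simplified version of the attribute handling described above."""
    pos = node.get("pos")
    word = {
        "cf": node.get("lemma"),
        "gw": node.get("label") if pos else "NA",
        "pos": pos if pos else "NA",
        "form": node.get("form"),
    }
    name_type = node.get("type")
    if name_type and name_type != "ideophone" and word["pos"] == "N":
        word["pos"] = name_type          # ETCSL 'type' becomes the ORACC POS
        word["cf"] = node.get("label")   # the ETCSL label serves as citation form
        word["gw"] = "1"                 # names get a number instead of a guide word
    # Commas would break the CSV format; spaces would confuse tokenizers.
    word["gw"] = word["gw"].replace(",", ";").replace(" ", "-")
    return word

# Invented nodes with placeholder values.
name_node = ET.fromstring('<w form="form1" lemma="lemma1" pos="N" type="DN" label="Name One">form1</w>')
verb_node = ET.fromstring('<w form="form2" lemma="lemma2" pos="V" label="to go, to walk">form2</w>')
name_word = sketch_word(name_node)
verb_word = sketch_word(verb_node)
print(name_word)
print(verb_word)
```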

## 7. Formatting Lines

Each line consists of a series of words. The function `getline()` iterates over a line, taking one word at a time. The words and their various features (language, citation form, guide word, part of speech and form) are retrieved by calling the function `getword()`, which returns a dictionary. These dictionaries are collected in a list, which `getline()` returns. A word with a second ORACC lemma (`cf2`, `gw2`, `pos2`) is split into two entries, and a gap is returned as a single entry with an `extent` field.

The function `getword()` will supply 'NA' as Part of Speech (and as guide word) to each word that has no POS tag already.

In [9]:
def getline(lnode, meta_d):
    meta_d["line_ref"] += 1
    if "extent" in lnode.attrib:
        line = {key:meta_d[key] for key in ["id_text", "text_name", "version", "line_ref"]}
        line["extent"] = lnode.get("extent")
        line = [line]
        return line
    wordsinline = [] #initialize list for the words in this line
    for node in lnode.iter('w'):
        word = getword(node, meta_d)
        if "cf2" in word:
            word2 = {key:word[key] for key in ["id_text", "text_name","version", "line_ref", "line_no",
                                               "form", "lang"]}
            word2["cf"] = word["cf2"]
            word2["gw"] = word["gw2"]
            word2["pos"] = word["pos2"]            
            word1 = {key:word[key] for key in ["id_text", "text_name","version", "line_ref", "line_no",
                                               "form", "lang", "cf", "gw", "pos"]}
            wordsinline.extend([word1, word2])
        else:
            wordsinline.append(word)
    return wordsinline

## 8. Sections

Some compositions are divided into **sections**. That is the case, in particular, when a composition has gaps of unknown length. 

The function `getsection()` is called by `getversion()` and receives two arguments: `tree` (an ElementTree object) and `meta_d` (a dictionary with the metadata collected so far: text ID, text name, version name, and a running line counter). The function checks to see whether a sub-division into sections is present. If so, it iterates over these sections, prefixing the section label to each line number. Each section (or, if there are no sections, the composition/version as a whole) consists of a series of lines. The function `getline()` is called to request the contents of each line. The function returns a list with one dictionary for each word (or gap).

In [10]:
def getsection(tree, meta_d):
    linesinsection = []
    sections = tree.find('.//div1')
    if sections != None: # if the text is not divided into sections - skip to else:
        for snode in tree.iter('div1'):
            section = snode.get('n')
            for lnode in snode.iter('l'):
                if "n" in lnode.attrib:
                    line = section + lnode.get('n')
                    meta_d["line_no"] = line
                line = getline(lnode, meta_d)
                linesinsection.extend(line)
    else:
        for lnode in tree.iter('l'):
            if "n" in lnode.attrib:
                line_no = lnode.get('n')
                meta_d["line_no"] = line_no
            line = getline(lnode, meta_d)
            linesinsection.extend(line)
    return linesinsection
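
The way section labels and line numbers are combined can be sketched on an invented fragment. The element and attribute names (`div1`, `l`, `n`) are the ones used by `getsection()`; the fragment itself is made up.

```python
import xml.etree.ElementTree as ET

# Invented fragment: two sections, each restarting its line count.
fragment = """<body>
  <div1 n="A"><l n="1"/><l n="2"/></div1>
  <div1 n="B"><l n="1"/></div1>
</body>"""
body = ET.fromstring(fragment)

# Prefix each line number with the label of the section that contains it.
labels = [snode.get('n') + lnode.get('n')
          for snode in body.iter('div1')
          for lnode in snode.iter('l')]
print(labels)  # ['A1', 'A2', 'B1']
```

Since line numbers restart in each section, the combined label (such as `A1`) is what keeps the lines of one composition distinct.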

## 9. Versions

In some cases an ETCSL file contains different versions of the same composition. The versions may be distinguished as 'Version A' vs. 'Version B' or may indicate the provenance of the version ('A version from Urim' vs. 'A version from Nibru'). In the edition of the proverbs the same mechanism is used to distinguish between numerous tablets (often lentils) that contain just one proverb, or a few, and are collected in the files "Proverbs from Susa," "Proverbs from Nibru," etc. (ETCSL c.6.2.1 - c.6.2.5).

The function `getversion()` is called by the function `parsetext()` and receives two arguments: `tree` (an ElementTree object) and `meta_d` (a dictionary with the text ID, the text name, and a running line counter). The function checks to see if versions are available in the file that is being parsed. If so, the function iterates over these versions, storing the abbreviated version name in `meta_d`. If there are no versions, the version name is left empty. The parsing process is continued by calling `getsection()` to see if the composition/version is further divided into sections.

In [11]:
def getversion(tree, meta_d):
    sectionsinversion = []
    versions = tree.find('.//head')
    if versions != None: # if the text is not divided into versions - skip 'getversion()':
        for vnode in tree.iter('body'):
            version = vnode.find('head').text
            version = eq["versions"][version]
            meta_d["version"] = version
            section = getsection(vnode, meta_d)
            sectionsinversion.extend(section)
    else:
        meta_d["version"] = ''
        section = getsection(tree, meta_d)
        sectionsinversion.extend(section)
    return sectionsinversion
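
A minimal sketch of the version logic, assuming (as the function above does) that each version sits in its own `<body>` element introduced by a `<head>`. The fragment and the abbreviations in `toy_versions` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical abbreviations; the real ones are in the `versions` table.
toy_versions = {"Version A": "A", "Version B": "B"}

# Invented fragment with two versions of different length.
fragment = """<text>
  <body><head>Version A</head><l n="1"/></body>
  <body><head>Version B</head><l n="1"/><l n="2"/></body>
</text>"""
root = ET.fromstring(fragment)

lines_per_version = {}
for vnode in root.iter('body'):
    head = vnode.find('head')
    # Without a <head> the version name stays empty.
    version = toy_versions[head.text] if head is not None else ''
    lines_per_version[version] = len(list(vnode.iter('l')))
print(lines_per_version)  # {'A': 1, 'B': 2}
```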


## 10. Parse a Text

The function `parsetext()` takes one XML file (a composition in ETCSL) and parses it, calling a variety of functions defined above. The function returns a list of dictionaries, one for each lemmatized word (or gap), with the text ID, text name, version label (where applicable), and line number (including section label, where applicable).

The parsing is done by the ElementTree (ET) package. ET.parse expects a file, but instead it receives a variable here (`xmltext`). The function `StringIO()` allows a string to be treated as a file.

In [12]:
def parsetext(textid, flags={"secondary":False, "additional":False}):
    meta_d = {"id_text": textid, "line_ref" : 0}
    with open('etcsl/transliterations/' + textid + '.xml') as f:
        xmltext = f.read()
    xmltext = ampersands(xmltext)          #replace HTML entities by Unicode equivalents
    if flags["secondary"] == True:
        xmltext = remove_extra(xmltext, textid, "secondary")   # take out secondary text
    if flags["additional"] == True:
        xmltext = remove_extra(xmltext, textid, "additional")   # take out additional text
    xmltext = gaps(xmltext)
    
    tree = ET.parse(StringIO(xmltext))
    name = tree.find('.//title').text
    foreign = tree.find('.//title/foreign') #some titles have children with <foreign> tag for Sumerian words
    if foreign != None:
        name = name + foreign.text + foreign.tail
    name = name.replace(' -- a composite transliteration', '')
    name = name.replace(',', '')
    meta_d["text_name"] = name
    
    parsed = getversion(tree, meta_d)

    return parsed
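
The `StringIO` step can be tried in isolation. The sketch below parses an invented XML string in the two ways ElementTree offers; the string and its title are made up, and only the clean-up of the title suffix follows the function above.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Invented XML string standing in for a pre-processed ETCSL file.
xmltext = '<TEI><title>A toy title -- a composite transliteration</title><l n="1"/></TEI>'

# ET.parse() wants a file (or file-like object) and returns an ElementTree.
tree = ET.parse(StringIO(xmltext))
name = tree.find('.//title').text.replace(' -- a composite transliteration', '')

# ET.fromstring() accepts the string directly, but returns the root Element instead.
root = ET.fromstring(xmltext)
print(name, root.tag)
```

Either route works for the purposes of this Notebook, since `find()` and `iter()` are available on both the tree and its elements.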

## 11. Main Process

The code below opens the file `etcsl.txt` (in the directory `Input`), which contains all the numbers of ETCSL compositions (such as c.1.1.4). For each such number the corresponding XML file is opened and its content is processed by the function `parsetext()`, which returns a list of word dictionaries. The lists for all compositions are combined in a single pandas DataFrame, which is saved in the `Output` directory as `alltexts.csv`. The abbreviated version names used by `getversion()` (see above, 9. Versions) come from the `versions` table in `equivalencies.json`.

In [13]:
with open("Input/etcsl.txt", "r") as f:
    textlist = f.read().splitlines()
if not os.path.exists('Output'):
    os.mkdir('Output')

alltexts = []
for eachtextid in tqdm.tqdm(textlist):
    parsed = parsetext(eachtextid)
    alltexts.extend(parsed)

df = pd.DataFrame(alltexts)
df = df.fillna('')
with open('Output/alltexts.csv', 'w', encoding='utf-8') as w:  # same directory name as created above
    df.to_csv(w)

100%|██████████| 394/394 [01:28<00:00,  2.50it/s]
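
The rows collected above are not uniform: word rows carry `cf`, `gw`, `pos`, and `form`, while gap rows carry only `extent`, which is why the empty cells are filled with `''` before saving. The sketch below shows the same idea with the standard-library `csv` module as a stand-in for pandas (a different tool, used here only to keep the example dependency-free). The rows and the text ID `c.0.0.0` are invented.

```python
import csv
from io import StringIO

# Invented rows: one word row and one gap row with different keys.
rows = [
    {"id_text": "c.0.0.0", "line_ref": 1, "form": "form1", "cf": "lemma1"},
    {"id_text": "c.0.0.0", "line_ref": 2, "extent": "8 lines missing"},
]

# The union of all keys becomes the header; missing cells are written as ''.
fields = sorted({key for row in rows for key in row})
buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, restval='')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```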


In [20]:
extent = set(df["extent"])
extent

{'',
 '1 line fragmentary',
 '1 line missing',
 '10 lines missing',
 '11 lines missing',
 '12 lines fragmentary',
 '12 lines missing',
 '12 lines missing or fragmentary',
 '13 lines missing',
 '14 lines missing',
 '15 lines missing',
 '16 lines missing',
 '2 lines fragmentary',
 '2 lines missing',
 '20 lines missing',
 '200-300 lines missing',
 '21 lines missing',
 '25 lines missing',
 '29 lines missing',
 '3 lines fragmentary',
 '3 lines missing',
 '31 lines missing',
 '35 lines missing',
 '37 lines missing',
 '4 lines fragmentary',
 '4 lines missing',
 '4 lines missing or fragmentary',
 '5 lines missing',
 '6 lines fragmentary',
 '6 lines missing',
 '7 lines missing',
 '70 lines missing',
 '8 lines fragmentary',
 '8 lines missing',
 '9 lines fragmentary',
 '9 lines missing',
 'approx. 1 line missing',
 'approx. 10 lines fragmentary or missing',
 'approx. 10 lines missing',
 'approx. 10-15 lines missing',
 'approx. 11 lines missing',
 'approx. 13 lines missing',
 'approx. 14 lines mis

In [23]:
df1 = df[df["form"].str.contains(r'\(X\)')]  # raw string: \( and \) are regex escapes

In [24]:
df1

Unnamed: 0,cf,extent,form,gw,id_text,lang,line_no,line_ref,pos,text_name,version
517,(X),,(X),,c.0.2.02,sux,53,50,,OB catalogue in the Louvre (L),
519,(X),,(X),,c.0.2.02,sux,53,50,,OB catalogue in the Louvre (L),
563,(X),,(X),,c.0.2.02,sux,67,64,,OB catalogue in the Louvre (L),
946,(X),,(X),,c.0.2.07,sux,A4,4,,OB catalogue possibly from Zimbir (B1),
1028,(X),,(X),,c.0.2.07,sux,B8,24,,OB catalogue possibly from Zimbir (B1),
1043,(X),,(X),,c.0.2.07,sux,B12,28,,OB catalogue possibly from Zimbir (B1),
1212,(X),,(X),,c.0.2.08,sux,D13,46,,OB catalogue from Nibru (N4),
1218,(X),,(X),,c.0.2.08,sux,E1,49,,OB catalogue from Nibru (N4),
1320,(X),,(X),,c.0.2.11,sux,13,13,,OB catalogue at Andrews University (B4),
1325,(X),,(X),,c.0.2.11,sux,14,14,,OB catalogue at Andrews University (B4),
