# 2.2 Parsing ETCSL
## 2.2.1. Introduction

The Electronic Text Corpus of Sumerian Literature ([ETCSL](http://etcsl.orinst.ox.ac.uk)) provides editions and translations of almost 400 Sumerian literary texts, mostly from the Old Babylonian period (around 1800 BCE). The project was founded by Jeremy Black (Oxford University), who sadly passed away in 2004; it was active from 1998 to 2006, when it was archived. Information about the project, its stages, products and collaborators may be found in the project's [About](http://etcsl.orinst.ox.ac.uk/edition2/general.php) page. At the time of its inception [ETCSL](http://etcsl.orinst.ox.ac.uk) was a pioneering effort - the first large digital project in Assyriology, using well-structured data according to the standards and best practices of the time. [ETCSL](http://etcsl.orinst.ox.ac.uk) allows for various kinds of searches in Sumerian and in English translation and provides lemmatization for each individual word. Numerous scholars contributed data sets to the [ETCSL](http://etcsl.orinst.ox.ac.uk) project (see [Acknowledgements](http://etcsl.orinst.ox.ac.uk/edition2/credits.php#ack)). The availability of [ETCSL](http://etcsl.orinst.ox.ac.uk) has fundamentally altered the study of Sumerian literature and has made this literature available for undergraduate teaching.

The original [ETCSL](http://etcsl.orinst.ox.ac.uk) files in TEI XML are stored in the [Oxford Text Archive](http://hdl.handle.net/20.500.12024/2518) from where they can be downloaded as a ZIP file under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License ([by-nc-sa 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/)). The copyright holders are Jeremy Black, Graham Cunningham, Jarle Ebeling, Esther Flückiger-Hawker, Eleanor Robson, Jon Taylor, and Gábor Zólyomi.

The [Oxford Text Archive](http://hdl.handle.net/20.500.12024/2518) page offers the following description:

> The Electronic Text Corpus of Sumerian Literature (ETCSL) comprises transliterations and English translations of 394 compositions attested on sources dating to the period from approximately 2100 to 1700 BCE. The compositions are divided into seven categories: ancient literary catalogues; narrative compositions; royal praise poetry and hymns to deities on behalf of rulers; literary letters and letter-prayers; divine and temple hymns; proverbs and proverb collections; and a more general category including compositions such as debates, dialogues and riddles. The numbering of the compositions within the corpus follows Miguel Civil's unpublished catalogue of Sumerian literature (etcslfullcat.html).Files with an initial c are composite transliterations (a reconstructed text editorially assembled from the extant exemplars but including substantive variants) in which the cuneiform signs are represented in the Roman alphabet. Files with an initial t are translations. The composite files include full references for the cuneiform sources and author-date references for the secondary sources (detailed in bibliography.xml). The composite and translation files are in XML and have been annotated according to the TEI guidelines. In terms of linguistic information, each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation.

Since [ETCSL](http://etcsl.orinst.ox.ac.uk) is an archival site, the editions are not updated to reflect new text finds or new insights into the Sumerian language. Many of the [ETCSL](http://etcsl.orinst.ox.ac.uk) editions were based on standard print editions that themselves may have been 10 or 20 years old by the time they were digitized. Any computational analysis of the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus will have to deal with the fact that:

- the text may not represent the latest standard
- the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus is extensive - but does not cover all of Sumerian literature known today

In terms of data acquisition, one way to deal with these limitations is to make the [ETCSL](http://etcsl.orinst.ox.ac.uk) data as compatible as possible with the data standards of the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)). [ORACC](http://oracc.org) is an active project where new or updated editions can be produced. If the two are compatible, so that [ETCSL](http://etcsl.orinst.ox.ac.uk) and [ORACC](http://oracc.org) data may be freely mixed and matched, then the [ETCSL](http://etcsl.orinst.ox.ac.uk) data set can effectively be updated and expanded.

The [ETCSL](http://etcsl.orinst.ox.ac.uk) text corpus was one of the core data sets for the development of [ePSD1](http://psd.museum.upenn.edu/epsd1/index.html) and [ePSD2][], and the ePSD version of the [ETCSL](http://etcsl.orinst.ox.ac.uk) data forms the core of the literary corpus collected in [ePSD2/literary](http://oracc.org/epsd2/literary). In order to harvest the [ETCSL](http://etcsl.orinst.ox.ac.uk) data for [ePSD2][] the lemmatization was adapted to [ORACC](http://oracc.org) standards. As a result, the [ePSD2/literary](http://oracc.org/epsd2/literary) version of the [ETCSL](http://etcsl.orinst.ox.ac.uk) dataset is fully compatible with any [ORACC](http://oracc.org) dataset and can be parsed with the ORACC parser, discussed in section 2.1. However, [ePSD2/literary](http://oracc.org/epsd2/literary) is not identical with the [ETCSL](http://etcsl.orinst.ox.ac.uk) dataset. Several compositions have been replaced by more recent editions (for instance the Sumerian disputations edited in the [ORACC](http://oracc.org) project [Datenbank Sumerischer Streitliteratur](http://oracc.org/dsst)); a significant number of texts that were not available in [ETCSL](http://etcsl.orinst.ox.ac.uk) have been added (many of them published after 2006); and the Gudea Cylinders have been moved to [epsd2/royal](http://oracc.org/epsd2/royal).

For some applications, therefore, parsing the original [ETCSL](http://etcsl.orinst.ox.ac.uk) XML TEI files has become redundant. However, any data transformation implies choices and it is hard to know what the needs will be of future computational approaches to the [ETCSL](http://etcsl.orinst.ox.ac.uk) dataset. The reason to include and discuss the [ETCSL](http://etcsl.orinst.ox.ac.uk) parser here is, first, to offer users the opportunity to work with the original data set. The various transformations included in the current parser may be adapted and adjusted to reflect the preferences and research questions of the user. As a concrete example of choices to be made, [ETCSL](http://etcsl.orinst.ox.ac.uk) distinguishes between main text, secondary text, and additional text, to reflect different types of variants between manuscripts (see below [2.2.4](#2.2.4-Pre-Processing:-Additional-Text-and-Secondary-Text)). The [ePSD2/literary](http://oracc.org/epsd2/literary) data set does not include this distinction. The output of the current parser will indicate for each word whether it is "secondary" or "additional" (according to [ETCSL](http://etcsl.orinst.ox.ac.uk) criteria) and offer the possibility to include such words or exclude them from the analysis. Similarly, the translations are not included in the [ePSD2/literary](http://oracc.org/epsd2/literary) dataset, nor are they considered by the present parser. Translation data are, however, available in the [ETCSL](http://etcsl.orinst.ox.ac.uk) XML TEI file set and the XML of the transcription files marks the beginning and end of translation paragraphs. Such data, therefore, is available and one may well imagine research questions for which the translation files are relevant (e.g. translation alignment). Although the present code does not deal with translation, one may use the same techniques and the same approach exemplified here to retrieve such data.

In order to achieve compatibility between [ETCSL](http://etcsl.orinst.ox.ac.uk) and [ORACC](http://oracc.org) the code uses a number of equivalence dictionaries that enable replacement of characters, words, or names. These equivalence dictionaries are stored in JSON format (for JSON see section 2.1.1) in the file `equivalencies.json` in the directory `equivalencies`.

### 2.2.1.1 XML
The [ETCSL](http://etcsl.orinst.ox.ac.uk) files as distributed by the [Oxford Text Archive](http://hdl.handle.net/20.500.12024/2518) are encoded in a dialect of `XML` (Extensible Markup Language) that is referred to as `TEI` (Text Encoding Initiative). In this encoding each word (in transliteration) is an *element* that is surrounded by `<w>` and `</w>` tags. Inside the start-tag the word may receive several attributes, encoded as name/value pairs, as in the following random examples:

```xml
<w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>
<w form="e2-jar8-bi" lemma="e2-jar8" pos="N" label="wall">e2-jar8-bi</w>
<w form="ickila-bi" lemma="ickila" pos="N" label="shell"><term id="c1813.t1">ickila</term><gloss lang="sux" target="c1813.t1">la</gloss>-bi</w>
```

The `form` attribute is the full form of the word, including morphology, but omitting flags (such as question marks), indication of breakage, or glosses. The `lemma` attribute is the form minus morphology (similar to `Citation Form` in [ORACC](http://oracc.org)). Some lemmas may be spelled in more than one way in Sumerian; the `lemma` attribute will use a standard spelling (note, for instance, that the `lemma` of "ti-a" is "te"). The `lemma` in [ETCSL](http://etcsl.orinst.ox.ac.uk) (unlike `Citation Form` in [ORACC](http://oracc.org)) uses actual transliteration with hyphens and sign index numbers (as in `lemma = "e2-jar8"`, where the corresponding [ORACC](http://oracc.org) `Citation Form` is [egar](http://oracc.org/epsd2/o0026723)).

:::{note}
| Project  | transliteration  | dictionary form   | Part of Speech    | Meaning |
| --- | --- | --- | --- | ---- |
| ETCSL | form "e2-jar8-bi" | lemma "e2-jar8"| pos "N" | label "wall" |
| ORACC | form "e₂-gar₈-bi" | citation form "egar" | pos "N" | guide word "wall"|
:::


The `label` attribute gives a general indication of the meaning of the Sumerian word but is not context-sensitive. That is, the `label` of "lugal" is always "king", even if in context the word means "owner". The `pos` attribute gives the Part of Speech, but again the attribute is not context-sensitive. Where a verb (such as sag₉, to be good) is used as an adjective the `pos` is still "V" (for verb). Together `lemma`, `label`, and `pos` define a Sumerian lemma (dictionary entry).

In parsing the [ETCSL](http://etcsl.orinst.ox.ac.uk) files we will be looking for the `<w>` and `</w>` tags to isolate words and their attributes. Higher level tags identify lines (`<l>` and `</l>`), versions, secondary text (found only in a minority of sources), etcetera.

The [ETCSL](http://etcsl.orinst.ox.ac.uk) file set includes the file [etcslmanual.html](http://etcsl.orinst.ox.ac.uk/edition2/etcslmanual.php) with explanations of the tags, their attributes, and their proper usage.

The goal of the parsing process is to get as much information as possible out of the `XML` tree in a format that is as close as possible to the output of the [ORACC](http://oracc.org/) parser. The output of the parser is a word-by-word (or rather lemma-by-lemma) representation of the entire [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus. For most computational projects it will be necessary to group words into lines or compositions, or to separate out a particular group of compositions. The data is structured in such a way that this can be achieved with standard functions of the Python library `pandas`.
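Grouping words into lines might look as follows. This is a minimal sketch, assuming the parser output has been loaded into a `pandas` DataFrame with the columns `id_text`, `id_line`, and `form` (names that appear as dictionary keys later in this section); the three rows are invented for illustration.

```python
import pandas as pd

# Toy rows standing in for parser output; the values are invented.
words = pd.DataFrame([
    {"id_text": "c.1.4.1", "id_line": 1, "form": "an"},
    {"id_text": "c.1.4.1", "id_line": 1, "form": "gal-ta"},
    {"id_text": "c.1.4.1", "id_line": 2, "form": "ki"},
])

# One row per line: join the word forms of each line with spaces.
lines = (
    words.groupby(["id_text", "id_line"])["form"]
    .agg(" ".join)
    .reset_index()
)
```

Grouping on `id_text` alone would, in the same way, yield one row per composition.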

### 2.2.1.2 Parsing XML: Xpath, and lxml

:::{margin}
For proper introductions to `Xpath` and `lxml` see the [Wikipedia](https://en.wikipedia.org/wiki/XPath) article on `Xpath` and the homepage of the [`lxml`](https://lxml.de/) library, respectively.
:::

There are several Python libraries specifically for parsing `XML`, among them the popular `ElementTree` from the standard library (its C-accelerated twin `cElementTree` was removed in Python 3.9). The library `lxml` is largely compatible with `ElementTree` but differs from it in its full support of `Xpath` (version 1.0). `Xpath` is a language for finding and retrieving elements and attributes in XML trees. `Xpath` is not a program or a library, but a set of specifications that is implemented in a variety of software packages in different programming languages.

`Xpath` uses the forward slash to describe a path through the hierarchy of the `XML` tree. The expression `"/body/l/w"` refers to all the `w` (word) elements that are children of `l` (line) elements that are children of the `body` element in the top level of `XML` hierarchy.
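The difference between an absolute path and a search at any depth can be tried out on a small tree. The sketch below uses an invented, simplified document (not an actual ETCSL file) in which one line is nested inside a `div` element.

```python
from lxml import etree

# Toy document; the structure is simplified and not taken from an ETCSL file.
xml = """<body>
  <l n="1"><w form="lugal" label="king">lugal</w><w form="e2" label="house">e2</w></l>
  <div><l n="2"><w form="an" label="heaven">an</w></l></div>
</body>"""
tree = etree.fromstring(xml)

# Absolute path: only <w> elements directly under an <l> directly under the root <body>.
direct = tree.xpath('/body/l/w')

# '//' searches at any depth, so the <w> nested inside <div> is found as well.
anywhere = tree.xpath('//w')
```

Here `direct` holds two elements and `anywhere` holds three.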

The expression `'//w'` means: all the `w` nodes, wherever they are in the hierarchy of the `XML` tree. The expression may be used to create a list of all the `w` nodes with all of their associated attributes. The attributes of a node are addressed with the `@` sign, so that `//w/@label` refers to the `label` attributes of all the `w` nodes at any level in the hierarchy.

```python
words = tree.xpath('//w')
labels = tree.xpath('//w/@label')
```

Predicates are put between square brackets and describe conditions for filtering a node set. The expression  `//w[@emesal]` will return all the `w` elements that have an attribute `emesal`. 
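A predicate can test for the presence of an attribute or for a particular value. The following sketch runs both kinds of predicate on an invented line; the attribute values are for illustration only.

```python
from lxml import etree

# Invented sample line; attribute values are for illustration only.
xml = """<l>
  <w form="e-ne-ej3" lemma="inim" pos="N" label="word" emesal="e-ne-ej3">e-ne-ej3</w>
  <w form="lugal" lemma="lugal" pos="N" label="king">lugal</w>
  <w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>
</l>"""
line = etree.fromstring(xml)

emesal_words = line.xpath('//w[@emesal]')         # <w> elements that have an emesal attribute
verbs = line.xpath('//w[@pos="V"]')               # <w> elements whose pos equals "V"
verb_labels = line.xpath('//w[@pos="V"]/@label')  # label attributes of the filtered nodes
```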

`Xpath` also defines a number of functions. An important function is `string()`, which returns the string value of a node or an attribute. Once all `w` nodes are collected in the list `words` (with the code above) one may extract the transliteration and Guide Word (`label` in [ETCSL](http://etcsl.orinst.ox.ac.uk)) of each word as follows:

```python
form_l = []
gw_l = []
for node in words:
    form = node.xpath('string(.)') 
    form_l.append(form)
    gw = node.xpath('string(@label)')
    gw_l.append(gw)
```

The dot, the argument to the `string()` function in `node.xpath('string(.)')`, refers to the current node.
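The string value of an element includes the text of all its descendants, which matters when a `w` element contains other elements. The sketch below applies `string()` to the third sample word from section 2.2.1.1, which has a `term` and a `gloss` inside the `w` element.

```python
from lxml import etree

# The third sample word from section 2.2.1.1, with <term> and <gloss> inside <w>.
xml = ('<w form="ickila-bi" lemma="ickila" pos="N" label="shell">'
       '<term id="c1813.t1">ickila</term>'
       '<gloss lang="sux" target="c1813.t1">la</gloss>-bi</w>')
node = etree.fromstring(xml)

text_value = node.xpath('string(.)')      # all text inside <w>, gloss included
form_value = node.xpath('string(@form)')  # the form attribute only
missing = node.xpath('string(@emesal)')   # absent attribute: empty string, no error
```

In this case `string(.)` yields `ickilala-bi` (the gloss text is included), whereas the `form` attribute yields `ickila-bi`.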

### 2.2.1.3 Input and Output

The parser expects the following files and directories:

1. Directory `etcsl/transliterations/`  
   This directory should contain the [ETCSL](http://etcsl.orinst.ox.ac.uk) `TEI XML` transliteration files. The files may be downloaded from the [Oxford Text Archive](http://hdl.handle.net/20.500.12024/2518). The files are found in the file `etcsl.zip`, in the directory `transliterations`.
2. Directory `equivalencies`  
   This directory should contain the file `equivalencies.json`, a set of equivalence dictionaries used at various places in the parser.  

The output is saved in the `output` directory as a single `.csv` file.

## 2.2.2 Setting Up
### 2.2.2.1 Load Libraries
Before running this cell you may need to install the package `lxml` (for parsing `XML`) by running 
```python
%conda install lxml
```
:::{margin}
For proper installation of packages for a Jupyter Notebook see [1.4.3 install_packages](../1_Preliminaries/1_Introduction.ipynb).
:::

```python
import re
from lxml import etree
import os
import json
import pandas as pd
from tqdm.auto import tqdm
os.makedirs('output', exist_ok = True)
```

### 2.2.2.2 Load Equivalencies 
The file `equivalencies.json` contains a number of dictionaries that will be used to search and replace at various places in this notebook. The dictionaries are:
- `suxwords`: Sumerian words (Citation Form, GuideWord, and Part of Speech) in [ETCSL](http://etcsl.orinst.ox.ac.uk) format and their [ORACC](http://oracc.org) counterparts.
- `emesalwords`: idem for Emesal words
- `propernouns`: idem for proper nouns
- `ampersands`: HTML entities (such as `&aacute;`) and their Unicode counterparts (`á`; see section 2.2.3).
- `versions`: [ETCSL](http://etcsl.orinst.ox.ac.uk) version names and (abbreviated) equivalents

The `equivalencies.json` file is loaded with the `json` library. The dictionaries `suxwords`, `emesalwords` and `propernouns` (which, together, contain the entire [ETCSL](http://etcsl.orinst.ox.ac.uk) vocabulary) are concatenated into a single dictionary.

```python
with open("equivalencies/equivalencies.json", encoding="utf-8") as f:
    eq = json.load(f)
equiv = eq["suxwords"]
equiv.update(eq["emesalwords"])
equiv.update(eq["propernouns"])
```
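The shape of the combined dictionary can be illustrated with a toy version. In the sketch below the entries are invented stand-ins, not taken from `equivalencies.json`; the key format `cf[gw]pos` and the value keys `cf`, `gw`, and `pos` are inferred from the function `etcsl_to_oracc()` in section 2.2.6.

```python
# Invented stand-ins for three of the dictionaries in equivalencies.json.
eq = {
    "suxwords": {"e₂-ŋar₈[wall]N": {"cf": "egar", "gw": "wall", "pos": "N"}},
    "emesalwords": {},
    "propernouns": {"Enlil[1]DN": {"cf": "Enlil", "gw": "1", "pos": "DN"}},
}

# Build a new dictionary rather than updating eq["suxwords"] in place.
equiv = {**eq["suxwords"], **eq["emesalwords"], **eq["propernouns"]}
```

Note that in the cell above `equiv` is another name for `eq["suxwords"]`, so the `update()` calls enlarge that dictionary as well. The variant sketched here leaves `eq` untouched, which only matters if `eq["suxwords"]` is needed separately later on.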

## 2.2.3 Preprocessing: HTML-entities
Before the XML files can be parsed, it is necessary to remove character sequences that are not allowed in XML proper (so-called HTML entities). 

In non-transliteration contexts (bibliographies, composition titles, etc.) [ETCSL](https://etcsl.orinst.ox.ac.uk/) uses so-called HTML entities to represent non-ASCII characters such as á, ü, or š. These entities are encoded with an opening ampersand (`&`) and a closing semicolon (`;`). For instance, `&C;` represents the character `Š`. The HTML entities are for the most part project-specific and are declared in the file `etcsl-sux.ent`, which is part of the file package and is used by the [ETCSL](https://etcsl.orinst.ox.ac.uk/) project in the process of validating and parsing the XML for on-line publication.

For purposes of data acquisition these entities need to be resolved, because XML parsers will not recognize these sequences as valid XML. 

The key `ampersands` in the file `equivalencies.json` has as its value a dictionary, listing all the HTML entities that appear in the [ETCSL](https://etcsl.orinst.ox.ac.uk/) files with their Unicode counterparts:

```json
{ 
 "&C;": "Š",
 "&Ccedil;": "Ç",
 "&Eacute;": "É",
 "&G;": "Ŋ",
 "&H;": "H",
 "&Imacr;": "Î",
 "&X;" : "X",
 "&aacute;": "á"
}
```

etc.

This dictionary is used to replace each HTML entity with its Unicode counterpart in the entire corpus. The function `ampersands()` is called in the function `parsetext()` immediately after opening the file of one of the compositions in [ETCSL](https://etcsl.orinst.ox.ac.uk/).  

:::{margin}
The expression `[^;]+` means: a sequence of one or more (`+`) characters, except the semicolon. The symbol `^` is the negation symbol in regular expressions. The expression `&[^;]+;` therefore captures a sequence of any length that begins with an ampersand and ends with a semicolon. There are many introductions for regular expressions on the web, for instance [regular-expressions.info](https://www.regular-expressions.info/), or [An Introduction to Regular Expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) by Thomas Nield.
:::

The function `ampersands()` uses the `sub()` function from the `re` (Regular Expressions) module. The arguments of this function are `sub(find_what, replace_with, text)`. In this case, the `find_what` is the compiled regular expression `amp`, matching all character sequences that begin with & and end with a semicolon (;). This regular expression is defined in the main process (see section 2.2.12) as follows:

```python
amp = re.compile(r'&[^;]+;')
```

The `replace_with` argument is a temporary `lambda` function that uses the `ampersands` dictionary to find the Unicode counterpart of the HTML entity. The dictionary is queried with the `get()` function (`m.group(0)` represents the match of the regular expression `amp`). The `get()` function allows a fall-back argument, to be returned in case the dictionary does not have the key that was requested. This second argument is the actual regular expression match, so that in those cases where the dictionary does not contain the match it is replaced by itself.

The [ETCSL](http://etcsl.orinst.ox.ac.uk) `TEI XML` files are written in ASCII, so every special character (such as š or ī) is represented by an entity of this kind (e.g. `&c;` represents `š`). The `lxml` library cannot deal with these entities and thus we have to replace them with the actual (Unicode) characters that they represent before feeding the data to the `etree` module.

The function `ampersands()` is called in `parsetext()` (see section 2.2.11) before the `etree` is built. It takes its replacements from the dictionary `ampersands` in the file `equivalencies.json`, which was loaded above (section 2.2.2.2).

```python
def ampersands(string):    
    string = re.sub(amp, lambda m: 
               eq["ampersands"].get(m.group(0), m.group(0)), string)
    return string
```
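The replacement step can be tried out in isolation. The sketch below is a self-contained variant of the function; it uses only two entries copied from the dictionary excerpt above, and the name `resolve_entities` and the test string are made up for this illustration.

```python
import re

# Two entries copied from the excerpt above; the real dictionary is much longer.
entities = {"&C;": "Š", "&aacute;": "á"}
amp = re.compile(r'&[^;]+;')

def resolve_entities(text):
    # Entities missing from the dictionary fall back to the matched text itself.
    return amp.sub(lambda m: entities.get(m.group(0), m.group(0)), text)

resolved = resolve_entities("&C;ulgi &aacute; &amp;")
```

The entity `&amp;` is not in the toy dictionary and is therefore left as it is, which is the purpose of the fall-back argument of `get()`.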

## 2.2.4 Marking 'Secondary Text' and/or 'Additional Text'

In order to be able to preserve the [ETCSL](http://etcsl.orinst.ox.ac.uk) distinctions between main text (the default), secondary text, and additional text, such information needs to be added as an attribute to each `w` node (word node). This must take place in pre-processing, before the `XML` file is parsed.

[ETCSL](http://etcsl.orinst.ox.ac.uk) transliterations represent composite texts, put together (in most cases) from multiple exemplars. The editions include substantive variants, which are marked either as "additional" or as "secondary". Additional text consists of words or lines that are *added* to the text in a minority of sources. In the opening passage of [Inana's Descent to the Netherworld](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.1.4.1&display=Crit&charenc=gcirc#), for instance, there is a list of temples that Inana leaves behind. One exemplar extends this list with eight more temples; in the composite text these lines are marked as "additional" and numbered lines 13A-13H. Secondary text, on the other hand, is variant text (words or lines) that is found in a minority of sources *instead of* the primary text. An example in [Inana's Descent to the Netherworld](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.1.4.1&display=Crit&charenc=gcirc#) is lines 30-31, which are replaced by 30A-31A in one manuscript (text and translation from [ETCSL](http://etcsl.orinst.ox.ac.uk)):

| line | text                                       | translation                                                  |
| ---- | ------------------------------------------ | ------------------------------------------------------------ |
| 30   | sukkal e-ne-eĝ₃ sag₉-sag₉-ga-ĝu₁₀          | my minister who speaks fair words,                           |
| 31   | ra-gaba e-ne-eĝ₃ ge-en-gen₆-na-ĝu₁₀        | my escort who speaks trustworthy words                       |
| 30A  | \[na\] ga-e-de₅ na de₅-ĝu₁₀ /ḫe₂\\-\[dab₅\]    | I am going to give you instructions: my instructions must be followed; |
| 31A  | \[inim\] ga-ra-ab-dug₄ ĝizzal \[ḫe₂-em-ši-ak\] | I am going to say something to you: it must be observed      |

"Secondary text" and "additional text" can also consist of a single word and there are even cases of "additional text" within "additional text" (an additional word within an additional line).

In [ETCSL](http://etcsl.orinst.ox.ac.uk) TEI XML secondary/additional text is introduced by a tag of the type:

```xml
<addSpan to="c141.v11" type="secondary"/>
```

or

```xml
<addSpan to="c141.v11" type="additional"/>
```

The number c141 represents the text number in [ETCSL](http://etcsl.orinst.ox.ac.uk) (in this case [Inana's Descent to the Netherworld](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.1.4.1&display=Crit&charenc=gcirc#), text c.1.4.1). The return to the primary text is indicated by a tag of the type:

```xml
<anchor id="c141.v11"/>
```

Note that the `id` attribute in the `anchor` tag is identical to the `to` attribute in the `addSpan` tag.

We can collect all the `w` tags (words) between `addSpan` and its corresponding `anchor` tag with the following `xpath` expression:

```python
secondary = tree.xpath('//w[preceding::addSpan[@type="secondary"]/@to = following::anchor/@id]')
```

In the expression `preceding` and `following` are so-called `axes` (plural of `axis`) which describe the relationship of an element to another element in the tree. The expression means: get all `w` tags that are preceded by an `addSpan` tag and followed by an `anchor` tag. The `addSpan` tag has to have an attribute `type` with value `secondary` , and the value of the `to` attribute of this `addSpan` tag is to be equal to the `id` attribute of the following `anchor` tag.

Once we have collected all the "secondary" `w` tags this way, we can add a new attribute to each of these words in the following way:

```python
for word in secondary:
    word.attrib["status"] = "secondary"
```

In the process of parsing we can retrieve this new `status` attribute to mark all of these words as `secondary`.

Since we can do exactly the same for "additional text", we can slightly adapt the above expression for use in the function `mark_extra()`.

The function `mark_extra()` is called twice by the function `parsetext()` (see below, section 2.2.11), once for "additional" and once for "secondary" text, indicated by the `which` argument. 

```python
def mark_extra(tree, which):
    extra = tree.xpath(f'//w[preceding::addSpan[@type="{which}"]/@to = following::anchor/@id]')
    for word in extra:
        word.attrib["status"] = which
    return tree
```
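The effect of the function can be checked on a small tree. The following sketch repeats the function so that it runs on its own; the document and the identifier `c000.v1` are invented and only imitate the `addSpan`/`anchor` pattern described above.

```python
from lxml import etree

# Invented document: line 1A is enclosed by an addSpan/anchor pair.
xml = """<body>
  <l n="1"><w form="a">a</w></l>
  <addSpan to="c000.v1" type="secondary"/>
  <l n="1A"><w form="b">b</w></l>
  <anchor id="c000.v1"/>
  <l n="2"><w form="c">c</w></l>
</body>"""
tree = etree.fromstring(xml)

def mark_extra(tree, which):
    extra = tree.xpath(f'//w[preceding::addSpan[@type="{which}"]/@to = following::anchor/@id]')
    for word in extra:
        word.attrib["status"] = which
    return tree

mark_extra(tree, "additional")
mark_extra(tree, "secondary")

# Collect (form, status) pairs; words in the main text have no status attribute.
status = [(w.get("form"), w.get("status")) for w in tree.xpath('//w')]
```

Only the word between the `addSpan` and its `anchor` receives the attribute: the first word has no preceding `addSpan`, and the last word has no following `anchor`.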

## 2.2.5 Transliteration Conventions

Transliteration of Sumerian text in [ETCSL](http://etcsl.orinst.ox.ac.uk) `TEI XML` files uses **c** for **š**, **j** for **ŋ** and regular numbers for index numbers. The function `tounicode()` replaces each of those. For example **cag4** is replaced by **šag₄**. This function is called in the function `getword()` to format `Citation Forms` and `Forms` (transliteration). The function `tounicode()` uses the translation tables `transind` (for index numbers) and `transcj` (for c and j), defined in the main process. The `translate()` function replaces individual characters from a string, according to the table.

In order to replace regular numbers with index numbers the function uses a [regular expression](https://www.regular-expressions.info/) to select only those single or double digit numbers that are preceded by a letter (leaving alone the "7" in 7-ta-am3). The regex `ind` is compiled in the main process.

```python
def tounicode(string):
    string = re.sub(ind, lambda m: m.group().translate(transind), string)
    string = string.translate(transcj)
    return string
```
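Since `ind`, `transind`, and `transcj` are defined only in the main process, the following sketch supplies stand-in definitions so that the function can be tried out on its own. These three definitions are assumptions made for this illustration and may differ from the ones used in the main process.

```python
import re

# Assumed stand-ins for the objects defined in the main process.
ind = re.compile(r'[a-zA-Z][0-9]{1,2}')   # a letter followed by one or two digits
transind = str.maketrans('0123456789', '₀₁₂₃₄₅₆₇₈₉')
transcj = str.maketrans('cjCJ', 'šŋŠŊ')

def tounicode(string):
    # Digits are translated only inside a match, i.e. directly after a letter.
    string = re.sub(ind, lambda m: m.group().translate(transind), string)
    string = string.translate(transcj)
    return string

converted = [tounicode(s) for s in ("cag4", "e2-jar8-bi", "7-ta-am3")]
```

The match includes the preceding letter, but `transind` maps digits only, so the letter passes through unchanged. The initial "7" in the last example is not preceded by a letter and therefore stays a regular number.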

## 2.2.6 Replace [ETCSL](http://etcsl.orinst.ox.ac.uk) by [ORACC](http://oracc.org) Lemmatization
For every word, once `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) have been pulled out of the [ETCSL](http://etcsl.orinst.ox.ac.uk) `XML` file, they are combined into a lemma and run through the ETCSL/ORACC equivalence lists to match it with the [ORACC](http://oracc.org)/[ePSD2](http://oracc.org/epsd2) standards. The equivalence lists are stored in the file `equivalencies.json`, which was loaded above (section 2.2.2.2).

The function `etcsl_to_oracc()` is called by the function `getword()`.

```python
def etcsl_to_oracc(word):
    lemma = f"{word['cf']}[{word['gw']}]{word['pos']}"
    if lemma in equiv:
        word['cf'] = equiv[lemma]["cf"]
        word["gw"] = equiv[lemma]["gw"]
        word["pos"] = equiv[lemma]["pos"]
        alltexts.append(word)
        if "cf2" in equiv[lemma]: # if an ETCSL word is replaced by two ORACC words
            word2 = word.copy()
            word2["cf"] = equiv[lemma]["cf2"]
            word2["gw"] = equiv[lemma]["gw2"]
            word2["pos"] = equiv[lemma]["pos2"]
            alltexts.append(word2)
    else: # word not found in equiv
        alltexts.append(word)
    return
```
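The lookup logic can be illustrated without the global list `alltexts`. The sketch below is a variant, not the notebook's function: the hypothetical `map_lemma()` returns its result as a list instead of appending to `alltexts`, and both entries in the toy `equiv` dictionary are invented (the second one only serves to show the `cf2`/`gw2`/`pos2` case).

```python
# Invented equivalence entries in the shape the function expects.
equiv = {
    "e₂-ŋar₈[wall]N": {"cf": "egar", "gw": "wall", "pos": "N"},
    # Hypothetical entry that maps one ETCSL word to two ORACC lemmas.
    "ab-cd[example]N": {"cf": "ab", "gw": "first", "pos": "N",
                        "cf2": "cd", "gw2": "second", "pos2": "N"},
}

def map_lemma(word):
    """Return one or two ORACC-style copies of an ETCSL word dictionary."""
    key = f"{word['cf']}[{word['gw']}]{word['pos']}"
    entry = equiv.get(key)
    if entry is None:              # not in the table: keep the word unchanged
        return [dict(word)]
    result = [{**word, "cf": entry["cf"], "gw": entry["gw"], "pos": entry["pos"]}]
    if "cf2" in entry:             # one ETCSL word becomes two ORACC words
        result.append({**word, "cf": entry["cf2"], "gw": entry["gw2"],
                       "pos": entry["pos2"]})
    return result

rows = map_lemma({"form": "e₂-ŋar₈-bi", "cf": "e₂-ŋar₈", "gw": "wall", "pos": "N"})
split = map_lemma({"form": "ab-cd", "cf": "ab-cd", "gw": "example", "pos": "N"})
```

In both cases the `form` (and any meta-data in the dictionary) is carried over unchanged; only `cf`, `gw`, and `pos` are replaced.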

## 2.2.7 Formatting Words

A word in the [ETCSL](http://etcsl.orinst.ox.ac.uk) files is represented by a `<w>` node in the `XML` tree with a number of attributes that identify the `form` (transliteration), `citation form`, `guide word`, `part of speech`, etc. The function `getword()` formats the word as closely as possible to the [ORACC](http://oracc.org) conventions. Three different types of words are treated in three different ways: Proper Nouns, Sumerian words and Emesal words.

In [ETCSL](http://etcsl.orinst.ox.ac.uk) **proper nouns** are nouns (`pos` = "N"), which are qualified by an additional attribute `type` (Divine Name, Personal Name, Geographical Name, etc.; abbreviated as DN, PN, GN, etc.). In [ORACC](http://oracc.org) a word has a single `pos`; for proper nouns this is DN, PN, GN, etc. - so what is `type` in [ETCSL](http://etcsl.orinst.ox.ac.uk) becomes `pos` in [ORACC](http://oracc.org). [ORACC](http://oracc.org) proper nouns usually do not have a guide word (only a number to enable disambiguation of namesakes). The [ETCSL](http://etcsl.orinst.ox.ac.uk) guide words (`label`) for names come pretty close to [ORACC](http://oracc.org) citation forms. Proper nouns are therefore formatted differently from other nouns.

**Sumerian words** are essentially treated in the same way in [ETCSL](http://etcsl.orinst.ox.ac.uk) and [ORACC](http://oracc.org), but the `citation forms` and `guide words` are often different. Transformation of citation forms and guide words to [ORACC](http://oracc.org)/[epsd2](http://oracc.org/epsd2/sux) standards takes place in the function `etcsl_to_oracc()` (see above, section 2.2.6).

**Emesal words** in [ETCSL](http://etcsl.orinst.ox.ac.uk) use their Sumerian equivalents as `citation form` (attribute `lemma`), adding a separate attribute (`emesal`) for the Emesal form proper. This Emesal form is the one that is used as `citation form` in the output.

The function `getword()` uses the dictionary `meta_d`, which has collected all the meta-data (text ID, composition name, version, line number, etc.) of this particular word. It produces the dictionary `word`, which is sent to the function `etcsl_to_oracc()`.

```python
def getword(node):
    word = {key:meta_d[key] for key in meta_d} # copy all meta data from meta_d into the word dictionary
    if node.tag == 'gloss': # these are Akkadian glosses which are not lemmatized
        form = node.xpath('string(.)')
        form = form.replace('\n', ' ').strip() # occasionally an Akkadian gloss may consist of multiple lines
        word["form"] = tounicode(form) # check - is this needed?
        word["lang"] = node.xpath("string(@lang)")
        alltexts.append(word)
        return
    
    # xpath('@lemma') returns a list; the Xpath function string() turns it into a single string
    word["cf"] = node.xpath('string(@lemma)')
    word["cf"] = word["cf"].replace('Xbr', '(X)')
    word["gw"] = node.xpath('string(@label)')

    if len(node.xpath('@pos')) > 0:
        word["pos"] = node.xpath('string(@pos)')
    else:         # if a word is not lemmatized (because it is broken or unknown) add pos = NA and gw = NA
        word["pos"] = 'NA'
        word["gw"] = 'NA'

    form = node.xpath('string(@form)')
    word["form"] = form.replace('Xbr', '(X)')
    
    if len(node.xpath('@emesal')) > 0:
        word["cf"] = node.xpath('string(@emesal)')
        word["lang"] = "sux-x-emesal"
    else:
        word["lang"] = "sux"

    exception = ["unclear", "Mountain-of-cedar-felling", "Six-headed Wild Ram", 
                 "The-enemy-cannot-escape", "Field constellation", 
                 "White Substance", "Chariot constellation", 
                 "Crushes-a-myriad", "Copper"]
    
    if len(node.xpath('@type')) > 0 and word["pos"] == 'N': # special case: Proper Nouns
        if node.xpath('string(@type)') != 'ideophone':  # special case in the special case: skip ideophones
            word["pos"] = node.xpath('string(@type)')
            word["gw"] = '1'
            if node.xpath('string(@label)') not in exception:
                word["cf"] = node.xpath('string(@label)')
    if len(node.xpath('@status')) > 0:
        word['status'] = node.xpath('string(@status)')
    
    word["cf"] = tounicode(word["cf"])
    word["form"] = tounicode(word["form"])
    etcsl_to_oracc(word)

    return
```
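The three kinds of words can be compared side by side with a reduced version of this logic. The sketch below is a simplification under stated assumptions: the hypothetical `lemma_fields()` leaves out the meta-data, the `exception` list, the `status` attribute, and the calls to `tounicode()` and `etcsl_to_oracc()`, and the three sample words are invented.

```python
from lxml import etree

def lemma_fields(node):
    """Simplified version of the attribute handling in getword()."""
    word = {
        "cf": node.xpath('string(@lemma)'),
        "gw": node.xpath('string(@label)'),
        "form": node.xpath('string(@form)'),
        "lang": "sux",
    }
    if node.get("pos") is not None:
        word["pos"] = node.get("pos")
    else:                          # unlemmatized word
        word["pos"] = 'NA'
        word["gw"] = 'NA'
    if node.get("emesal") is not None:   # Emesal form becomes the citation form
        word["cf"] = node.get("emesal")
        word["lang"] = "sux-x-emesal"
    word_type = node.get("type")
    if word_type is not None and word["pos"] == 'N' and word_type != 'ideophone':
        word["pos"] = word_type          # proper noun: type becomes pos
        word["gw"] = '1'
        word["cf"] = node.get("label")   # label becomes the citation form
    return word

# Invented sample words: an Emesal word, a divine name, and an unlemmatized word.
emesal = lemma_fields(etree.fromstring(
    '<w form="e-ne-ej3" lemma="inim" pos="N" label="word" emesal="e-ne-ej3">e-ne-ej3</w>'))
name = lemma_fields(etree.fromstring(
    '<w form="en-lil2" lemma="en-lil2" pos="N" type="DN" label="Enlil">en-lil2</w>'))
broken = lemma_fields(etree.fromstring('<w form="X">X</w>'))
```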

## 2.2.8 Formatting Lines

A line may either be an actual line (in Sumerian and/or Akkadian) or a gap (a portion of text lost). Both receive a line reference. A line reference is an integer that is used to keep lines (and gaps) in their proper order.

A gap of one or more lines in the composite text, due to damage to the original cuneiform tablet, is encoded as follows:

```xml
<gap extent="8 lines missing"/>
```

To process this information and keep it in the right place in the data, we parse the `gap` tags together with the `l` (line) tags and treat each gap as a line. In [ORACC](http://oracc.org) gaps are described with the fields `extent` (a number, or `n` for unknown) and `scope` (line, column, obverse, etc.). [ORACC](http://oracc.org) uses a restricted vocabulary for these fields, but [ETCSL](https://etcsl.orinst.ox.ac.uk/) does not. The code currently does not try to make the [ETCSL](https://etcsl.orinst.ox.ac.uk/) encoding of gaps compatible with the [ORACC](http://oracc.org) encoding.
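Should such a conversion be wanted later, it could start from a small helper like the one below. This is only a sketch under stated assumptions: the name `split_extent()` is hypothetical and not part of the notebook, the pattern only recognizes strings shaped like `8 lines missing`, and the returned `scope` is simply the singular noun found in the string, not a value checked against the [ORACC](http://oracc.org) vocabulary.

```python
import re

# Recognizes strings such as "8 lines missing" or "1 line missing".
EXTENT_RE = re.compile(r'^\s*(\d+)\s+([a-z]+?)s?\s+missing\s*$')

def split_extent(extent):
    """Split an ETCSL extent string into ORACC-like fields (hypothetical helper)."""
    m = EXTENT_RE.match(extent)
    if m:
        return {"extent": m.group(1), "scope": m.group(2), "raw": extent}
    # Any other wording: mark the extent as unknown ('n') and keep the raw text.
    return {"extent": "n", "scope": "", "raw": extent}

print(split_extent("8 lines missing"))
print(split_extent("unknown no. of lines missing"))
```

Keeping the raw string alongside the parsed fields means that no information from the ETCSL encoding is lost when the pattern does not match.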

The function `getline()` is called by `getsection()`. If the argument of `getline()` is an actual line (not a gap) it calls `getword()` for every individual word in that line.

In [10]:
def getline(lnode):
    meta_d["id_line"] += 1
    if lnode.tag == 'gap':
        line = {key:meta_d[key] for key in ["id_text", "text_name", "version", "id_line"]}
        line["extent"] = lnode.xpath("string(@extent)")
        alltexts.append(line)
        return
    
    for node in lnode.xpath('.//w|.//gloss[@lang="akk"]'):
                        # get <w> nodes and <gloss> nodes, but only Akkadian glosses
        getword(node)
    return

## 2.2.9 Sections

Some [ETCSL](http://etcsl.orinst.ox.ac.uk) compositions are divided into **sections**. That is the case, in particular, when a composition has gaps of unknown length. 

The function `getsection()` is called by `getversion()` and receives one argument: `tree` (the `etree` object representing one version of the composition). The function updates `meta_d`, a dictionary of metadata. It checks whether a subdivision into sections is present and, if so, iterates over these sections. Each section (or, if there are no sections, the composition/version as a whole) consists of a series of lines and/or gaps. The function `getline()` is called to process each line or gap. 
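The interplay between the running line counter and the human-readable label can be illustrated on a toy fragment. The sketch below is not the notebook's code: the XML sample is invented, the function name `line_labels()` is hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml`. Because `ElementTree` does not support the XPath union `.//l|.//gap`, `iter()` with a tag filter stands in for it.

```python
import xml.etree.ElementTree as ET

# Invented sample: two sections, one of them containing a gap.
SAMPLE = """<body>
  <div1 n="A">
    <l n="1"><w form="lugal"/></l>
    <gap extent="3 lines missing"/>
    <l n="5"><w form="e2"/></l>
  </div1>
  <div1 n="B">
    <l n="1"><w form="an"/></l>
  </div1>
</body>"""

def line_labels(tree):
    """Return (id_line, tag, label) for every line or gap, in document order."""
    rows = []
    id_line = 0
    sections = tree.findall('.//div1')
    # Without <div1> nodes, treat the whole tree as one unnamed section.
    units = [(s.get('n', ''), s) for s in sections] or [('', tree)]
    for section, node in units:
        for child in node.iter():              # iter() walks in document order
            if child.tag not in ('l', 'gap'):
                continue
            id_line += 1                       # gaps advance the counter too
            label = (section + child.get('n', '')) if child.tag == 'l' else ''
            rows.append((id_line, child.tag, label))
    return rows

rows = line_labels(ET.fromstring(SAMPLE))
print(rows)
```

The counter increases for gaps as well as lines, which is what keeps gaps in their proper place when the data are sorted by line reference.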

In [11]:
def getsection(tree):
    sections = tree.xpath('.//div1')
    
    if len(sections) > 0: # if the text is not divided into sections - skip to else:
        for snode in sections:
            section = snode.xpath('string(@n)')
            for lnode in snode.xpath('.//l|.//gap'):
                if lnode.tag == 'l':
                    line = section + lnode.xpath('string(@n)')
                    meta_d["label"] = line   # "label" is the human-legible line label (section + line number)
                getline(lnode)

    else:
        for lnode in tree.xpath('.//l|.//gap'):
            if lnode.tag == 'l':
                line_no = lnode.xpath('string(@n)')
                meta_d["label"] = line_no
            getline(lnode)
    return

## 2.2.10 Versions

In some cases an [ETCSL](http://etcsl.orinst.ox.ac.uk) file contains different versions of the same composition. The versions may be distinguished as 'Version A' vs. 'Version B' or may indicate the provenance of the version ('A version from Urim' vs. 'A version from Nibru'). In the edition of the proverbs the same mechanism is used to distinguish between numerous tablets (often lentils) that contain just one proverb, or a few, and are collected in the files "Proverbs from Susa," "Proverbs from Nibru," etc. ([ETCSL](http://etcsl.orinst.ox.ac.uk) c.6.2.1 - c.6.2.5).

The function `getversion()` is called by the function `parsetext()` and receives one argument: `tree` (the `etree` object). The function updates `meta_d`, a dictionary of metadata. It checks whether the file being parsed contains versions. If so, the function iterates over these versions, adding each version name to the `meta_d` dictionary. If there are no versions, the version name is left empty. Parsing continues by calling `getsection()` to see whether the composition/version is further divided into sections.
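The version check rests on one observation: a `body` element that has a `head` child marks a version. A minimal sketch of that test follows, under several assumptions: the XML sample is invented, `VERSION_NAMES` is a hypothetical stand-in for the equivalency table `eq["versions"]` (whose actual contents are not shown here), and the standard library's `xml.etree.ElementTree` is used instead of `lxml`.

```python
import xml.etree.ElementTree as ET

# Invented sample with two versions of one composition.
SAMPLE = """<text>
  <body><head>A version from Urim</head><l n="1"/></body>
  <body><head>A version from Nibru</head><l n="1"/><l n="2"/></body>
</text>"""

# Hypothetical stand-in for eq["versions"]: long heading -> short version name.
VERSION_NAMES = {"A version from Urim": "Urim", "A version from Nibru": "Nibru"}

def version_names(tree):
    """Return (version_name, node) pairs; one unnamed pair if there are no versions."""
    versions = tree.findall('.//body[head]')   # <body> elements with a <head> child
    if not versions:
        return [('', tree)]
    pairs = []
    for vnode in versions:
        heading = vnode.findtext('head')
        # Unlike a plain dictionary lookup, fall back to the raw heading if unmapped.
        pairs.append((VERSION_NAMES.get(heading, heading), vnode))
    return pairs

for name, vnode in version_names(ET.fromstring(SAMPLE)):
    print(name, len(vnode.findall('.//l')))
```

The fallback to the raw heading is a design choice of this sketch only; a direct lookup raises a `KeyError` for an unknown heading, which can be useful for spotting gaps in the equivalency table.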

In [12]:
def getversion(tree):
    versions = tree.xpath('.//body[child::head]')

    if len(versions) > 0: # if the text is not divided into versions - skip to else:
        for vnode in versions:
            version = vnode.xpath('string(head)')
            version = eq["versions"][version]
            meta_d["version"] = version
            getsection(vnode)

    else:
        meta_d["version"] = ''
        getsection(tree)
    return

## 2.2.11 Parse a Text

The function `parsetext()` takes one XML file (a composition in [ETCSL](http://etcsl.orinst.ox.ac.uk)) and parses it, calling the functions defined above. 

The parsing is done by the `etree` package in the `lxml` library. Before the file can be parsed properly the so-called HTML entities need to be replaced by their Unicode equivalents. This is done by calling the `ampersands()` function (see above, section 3: Preprocessing).
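Why the replacement has to happen before parsing can be seen with a toy example. The fragment below is invented, and `html.unescape()` from the standard library is used purely as a stand-in for `ampersands()`; it is not the notebook's implementation. The parser here is `xml.etree.ElementTree`, which, like other XML parsers without a DTD, rejects entity names it does not know.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical fragment containing an HTML entity that XML does not define.
raw = '<l n="1"><w>&scaron;ag4</w></l>'

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False          # the parser stops at the undefined entity

# Stand-in for ampersands(): replace the entity by its Unicode character first.
clean = html.unescape(raw)
text = ET.fromstring(clean).find('w').text

print(parsed_raw, text)
```

Note that `html.unescape()` also decodes `&amp;` and `&lt;`, which could corrupt real XML; this is one reason a dedicated function restricted to the entities actually used in the corpus is the safer choice.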

In [13]:
def parsetext(file):
    with open(f'etcsl/transliterations/{file}') as f:
        xmltext = f.read()
    xmltext = ampersands(xmltext)          #replace HTML entities by Unicode equivalents
    
    tree = etree.fromstring(xmltext)
    
    tree = mark_extra(tree, "additional") # mark additional words with attribute status = 'additional'
    tree = mark_extra(tree, "secondary")  # mark secondary words with attribute status = 'secondary'
    name = tree.xpath('string(//title)')
    name = name.replace(' -- a composite transliteration', '')
    name = name.replace(',', '')
    meta_d["id_text"] =  file[:-4]
    meta_d["text_name"] = name
    meta_d["id_line"] = 0
    getversion(tree)

    return

## 2.2.12 Main Process

The list `alltexts` is created as an empty list. It will be filled with dictionaries, each dictionary representing one word form.

The variable `textlist` is a list of all the `XML` files with [ETCSL](http://etcsl.orinst.ox.ac.uk) compositions in the directory `etcsl/transliterations`. Each file is sent as an argument to the function `parsetext()`. 

The dictionary `meta_d` is created as an empty dictionary. On each level of analysis the dictionary is updated with meta-data, such as text ID, version name, line number, etc.

Finally, the list `alltexts` is transformed into a `pandas` DataFrame. Because gap rows and word rows carry different keys, some cells have no value (`NaN`); these are replaced by empty strings. 
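The same cell also compiles the regular expression `ind` and the translation tables `transind` and `transcj`, which are used by `tounicode()`. How such tables can be applied is sketched below. The function name `tounicode_sketch()` is hypothetical, and the actual `tounicode()` defined earlier in the notebook may differ in detail; only the regex and the two tables are taken from the cell.

```python
import re

# Regex and translation tables as defined in the main cell.
ind = re.compile(r'[a-zŋḫṣšṭA-ZŊḪṢŠṬ][0-9x]{1,2}')    # sign index preceded by a letter
transind = str.maketrans('0123456789x', '₀₁₂₃₄₅₆₇₈₉ₓ')  # index numbers -> subscripts
transcj = str.maketrans('cjCJ', 'šŋŠŊ')                  # c > š and j > ŋ

def tounicode_sketch(string):
    """Convert ASCII transliteration conventions to Unicode (illustrative only)."""
    # Subscript only the index part of each match; the leading letter is kept as is.
    string = ind.sub(lambda m: m.group()[0] + m.group()[1:].translate(transind), string)
    return string.translate(transcj)

for form in ['cag4', 'e2-jal', 'du11', 'dub-saj-ta']:
    print(form, '>', tounicode_sketch(form))
```

Requiring a letter before the digits keeps free-standing numbers (for instance in numerical notations) from being turned into subscripts.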

In [14]:
textlist = os.listdir('etcsl/transliterations')
textlist.sort()

amp = re.compile(r'&[^;]+;') #regex for HTML entities, used in ampersands()

asccj, unicj = 'cjCJ', 'šŋŠŊ'
transcj = str.maketrans(asccj, unicj) # translation table for c > š and j > ŋ

ind = re.compile(r'[a-zŋḫṣšṭA-ZŊḪṢŠṬ][0-9x]{1,2}') #regex for sign index nos preceded by a letter
ascind, uniind = '0123456789x', '₀₁₂₃₄₅₆₇₈₉ₓ'
transind = str.maketrans(ascind, uniind) # translation table for index numbers
# regex ind and the translation tables transind and transcj are used in tounicode()

alltexts = []
files = tqdm(textlist)
for file in files:
    files.set_description(f'ETCSL {file[2:-4]}')
    meta_d = {}
    parsetext(file)

df = pd.DataFrame(alltexts).fillna('')


In [15]:
df

| | id_text | text_name | id_line | version | label | cf | gw | pos | form | lang | extent | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c.0.1.1 | Ur III catalogue from Nibru (N1) | 1 | | 1 | dubsaŋ | first | AJ | dub-saŋ-ta | sux | | |
| 1 | c.0.1.1 | Ur III catalogue from Nibru (N1) | 2 | | 2 | Enki | 1 | DN | {d}en-ki | sux | | |
| 2 | c.0.1.1 | Ur III catalogue from Nibru (N1) | 2 | | 2 | unu | dwelling | N | unu₂ | sux | | |
| 3 | c.0.1.1 | Ur III catalogue from Nibru (N1) | 2 | | 2 | gal | big | V/i | gal | sux | | |
| 4 | c.0.1.1 | Ur III catalogue from Nibru (N1) | 2 | | 2 | ed | ascend | V/i | im-ed₃ | sux | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 170851 | c.6.2.5 | Proverbs: of unknown provenance | 211 | YBC 9912 | 1 | haš | thigh | N | haš₂ | sux | | |
| 170852 | c.6.2.5 | Proverbs: of unknown provenance | 211 | YBC 9912 | 1 | gid | long | V/i | ba-ra-an-gid₂-nam | sux | | |
| 170853 | c.6.2.5 | Proverbs: of unknown provenance | 211 | YBC 9912 | 1 | lu | person | N | lu₂ | sux | | |
| 170854 | c.6.2.5 | Proverbs: of unknown provenance | 211 | YBC 9912 | 1 | ŋeši | sesame | N | še-ŋiš-i₃ | sux | | |


## 2.2.13 Save as CSV
The DataFrame is saved as a `CSV` file named `alltexts.csv` in the directory `output`.

In [None]:
with open('output/alltexts.csv', 'w', encoding="utf-8") as w:
    df.to_csv(w, index=False)