# BDTNS to ORACC
Niek Veldhuis, UC Berkeley

The goal of this Notebook is to provide a script that transforms output from [BDTNS](http://bdtns.filol.csic.es/) into valid [ORACC](http://oracc.org) format. The [BDTNS](http://bdtns.filol.csic.es/) output chosen is HTML, the format in which sign index numbers are explicitly indicated. The [BDTNS](http://bdtns.filol.csic.es/) fields included are:
* BDTNS number
* CDLI number
* Object (Tablet or Envelope)
* Surface (obverse, reverse, seal)
* Column number
* Line number
* Text

The input for this notebook is located in the directory `../raw-data`. Output goes to the directory `../output`.

Current output produces a separate file for each text, and assumes that the [ORACC](http://oracc.org) project is called 'garshana'. In order to change the project name go to the last cell in this notebook and change the variable `project`.

This Notebook is written in Python 3.5, using Pandas 0.19.1.

In [1]:
import pandas as pd
import re
pd.__version__

'0.19.1'

In [2]:
texts = pd.read_csv('../raw-data/Texts_Garshana.tab', 
                    delimiter = '\t', encoding = "utf8", 
                    header = None, 
                    names=['bdtns', 'cdli','object', 'surface', 'column'
                           , 'line_no', 'text'])
texts = texts.fillna('')

# Text Cleaning
Transform the HTML codes into the appropriate symbols. All `<sup>` `</sup>` pairs are replaced by curly brackets, the notation for determinatives. This also puts question marks, exclamation marks, and half brackets in curly brackets. Those cases are listed in `exceptions`, where the curly brackets are removed again. Multiple exclamation marks and question marks are reduced to a single one. There are currently no instances of !? or ?!.

In [BDTNS](http://bdtns.filol.csic.es/) rulings on a tablet are indicated as `========`. Such rulings are sometimes given a separate line number, sometimes not. If there is a line number it is replaced by `($single ruling$)`; if there is no line number the regular [ORACC](http://oracc.org) convention `$ single ruling` is used. 

In [3]:
HTML = {'<simbolo>&#60</simbolo>': '<',
        '<simbolo>&#62</simbolo>': '>',
        '<sup>': '{',
        '</sup>': '}'
       , '---+': '($blank$)', '===+' : 'single ruling'}
for symbol in HTML:
    texts['text'] = texts['text'].str.replace(symbol, HTML[symbol])
exceptions = {'{!+}': '!', '{\?+}': '?', '{⌉}': '⸣', '{⌈}': '⸢'}
for symbol in exceptions:
    texts['text'] = texts['text'].str.replace(symbol, exceptions[symbol])
#following code takes care of 'single ruling' distinguishing whether or not
#there is a line number
texts['text'] = [texts.loc[i, 'text'].replace('single ruling', '($single ruling$)')
                if texts.loc[i, 'line_no'] else
                texts.loc[i, 'text'].replace('single ruling', '$ single ruling')
                for i in range(len(texts))]
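
The effect of the cell above can be checked on a few individual strings. The following is a minimal sketch that uses plain `re` instead of Pandas; the helper name `clean_html` and the sample strings are made up for illustration and are not part of the notebook. The `<simbolo>` entries are left out for brevity.

```python
import re

def clean_html(text, has_line_no):
    """Hypothetical helper: apply the cleaning steps to a single string."""
    # tags, blank space and rulings, in the same order as the HTML dictionary
    for pattern, repl in [(r'<sup>', '{'), (r'</sup>', '}'),
                          (r'---+', '($blank$)'), (r'===+', 'single ruling')]:
        text = re.sub(pattern, repl, text)
    # flags and half brackets that ended up inside curly brackets
    for pattern, repl in [(r'\{!+\}', '!'), (r'\{\?+\}', '?'),
                          (r'\{⌉\}', '⸣'), (r'\{⌈\}', '⸢')]:
        text = re.sub(pattern, repl, text)
    # rulings: inline comment if the line is numbered, $-line otherwise
    ruling = '($single ruling$)' if has_line_no else '$ single ruling'
    return text.replace('single ruling', ruling)

print(clean_html('<sup>d</sup>šara₂', True))   # {d}šara₂
print(clean_html('ba-zi<sup>??</sup>', True))  # ba-zi?
print(clean_html('========', False))           # $ single ruling
print(clean_html('========', True))            # ($single ruling$)
```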

# Seal
In [BDTNS](http://bdtns.filol.csic.es/) the placement of a seal in the text is usually indicated with `# (Seal)` (without line number). Occasionally, such lines do have a line number and/or contain additional information. The proper way to do this in [ORACC](http://oracc.org) is `$ seal 1` (crossreferencing the `@seal 1` line). For the time being there are too many variants of the `Seal` lines in [BDTNS](http://bdtns.filol.csic.es/) to change those into proper (strict) \$-lines. If there is a line number the `seal` remark is put between `($...$)`. Otherwise, the \$ (...) convention is used for a non-strict \$-line. 

In [4]:
for i in range(len(texts)):
    if 'seal' in texts.loc[i, 'text'].lower():
        # drop parentheses and the BDTNS '#' marker, then normalise case
        seal_remark = re.sub(r'[()#]', '', texts.loc[i, 'text']).strip().lower()
        if texts.loc[i,'line_no']:
            # numbered line: keep the remark inline
            texts.loc[i,'text'] = '($' + seal_remark + '$)'
        else:
            # unnumbered line: non-strict $-line
            texts.loc[i,'text'] = '$ (' + seal_remark + ')'
            texts.loc[i,'text'] = texts.loc[i,'text'].replace('$ (seal illegible)', '$ seal illegible')
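
The two cases (with and without line number) can be illustrated on single lines. This is a sketch only; `seal_line` is a hypothetical helper that repeats the logic of the loop above for one string, and the line number `'5'` is an invented example.

```python
import re

def seal_line(text, line_no):
    """Hypothetical helper: normalise one BDTNS seal remark."""
    if 'seal' not in text.lower():
        return text
    # drop parentheses and '#', trim, lowercase
    remark = re.sub(r'[()#]', '', text).strip().lower()
    if line_no:
        return '($' + remark + '$)'
    line = '$ (' + remark + ')'
    # 'seal illegible' is rewritten without parentheses
    return line.replace('$ (seal illegible)', '$ seal illegible')

print(seal_line(' # (Seal)', ''))          # $ (seal)
print(seal_line(' # Seal illegible', ''))  # $ seal illegible
print(seal_line(' # (Seal)', '5'))         # ($seal$)
```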


# Index Numbers
Index numbers for signs (as in `du₃` or `sig₄`) are written HTML style: `du<sub>3</sub>` in the BDTNS output. Occasionally, a `</sub>` is found after a number (a real number, not an index number) without opening tag (doing nothing in HTML) - those are removed separately in the next cell. All possible indexes (including `ₓ`) are listed in the dictionary `indexes`. The code iterates through the dictionary, replacing each number with its index counterpart if the number is preceded by `<sub>` and followed by `</sub>`.

In [5]:
indexes = {'x': 'ₓ', '1': '₁', '2': '₂', '3': '₃', '4': '₄', '5': '₅',
          '6': '₆', '7': '₇', '8': '₈', '9': '₉', '0': '₀', 
           '10': '₁₀', '11': '₁₁', '12': '₁₂', '13': '₁₃', '14': '₁₄', 
           '15': '₁₅','16': '₁₆', '17': '₁₇', '18': '₁₈', '19': '₁₉',
          '20': '₂₀', '21': '₂₁', '22': '₂₂', '23': '₂₃', '24': '₂₄', 
           '25': '₂₅','26': '₂₆', '27': '₂₇', '28': '₂₈', '29': '₂₉',
           '30': '₃₀', '31': '₃₁', '32': '₃₂', '33': '₃₃', '34': '₃₄', 
           '35': '₃₅','36': '₃₆', '37': '₃₇', '38': '₃₈', '39': '₃₉'}
for index in indexes:
    texts['text'] = texts['text'].str.replace('<sub>' + index + '</sub>', indexes[index])
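
As an alternative to listing every index in a dictionary, the same substitution could be written with a single regular expression and a translation table. This is a sketch under that assumption, not the approach used in the cell above; the names `SUBSCRIPTS` and `subscript_indexes` are invented here. Unlike the dictionary, it accepts any number of digits between the tags.

```python
import re

# Map ASCII digits and 'x' to their Unicode subscript counterparts.
SUBSCRIPTS = str.maketrans('0123456789x', '₀₁₂₃₄₅₆₇₈₉ₓ')

def subscript_indexes(text):
    """Replace <sub>N</sub> (N = digits or 'x') by subscript characters."""
    return re.sub(r'<sub>([0-9]+|x)</sub>',
                  lambda m: m.group(1).translate(SUBSCRIPTS), text)

print(subscript_indexes('du<sub>3</sub>'))           # du₃
print(subscript_indexes('ku<sub>x</sub>'))           # kuₓ
print(subscript_indexes('gur<sub>11</sub>'))         # gur₁₁
# a stray closing tag without opening tag is left for a later step
print(subscript_indexes('5</sub> gin<sub>2</sub>'))  # 5</sub> gin₂
```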

# Other Convention Issues

The list `replacements` holds a number of sequences that are not allowed in [ORACC](http://oracc.org), such as an opening square bracket followed by a space (or a space followed by a closing square bracket). The notation `[...] #` is used in [BDTNS](http://bdtns.filol.csic.es/) before descriptions of breakage (such as `[...] # (beginning lost)`). The [ORACC](http://oracc.org) convention is `$`. Some so-called `$-lines` are called `strict $-lines` in [ORACC](http://oracc.org) (meaning that they follow a restricted vocabulary). In such cases they should not use parens (namely `(` and `)`). Those are removed, at least for one subset of strict `$-lines`, with the help of the dictionary `dollar_lines`. 

`replacements` is a list (rather than a dictionary), so that the order in which the script iterates through it is predictable. For instance, editorial notes in BDTNS are marked by ` # ` at the end of a line. This is replaced by a Newline + # (the note convention in ORACC), but this has to be done *after* the replacement of `[...] # (beginning lost)` etc. by a $-line.

In [6]:
replacements = [['</sub>',''], ['\[\.\.\.\] # *', '$ '], ['\[ +', '['], [' +]', ']'],
                ['(?<=[0-9n])\. ', ' '], ['(?<=[0-9n])\.-', '-'], ['(?<=[0-9n])\.\]', ']'],
                ['\(!\)', '!'], ['\[\.\.\]', '[...]'],
               [' \.\.\]', ' ...]'], ['\.\.\.([a-zšA-ZŠ])', '... \\1'], ['([a-zšA-ZŠ₀-₉])\.\.\.', '\\1 ...'], 
                ['x\[' , 'x ['], ['\]x', '] x'], ['\[\?\]', ' [x?]'], [' \(\?\)' , ' '],
               ['\[\.\.\.\]\?' , '[...]'], [' # ', '\n# '], ['-\]', ']-'], [' -', ' '], ['- ', ' '], ['KWU_', 'KWU']] 
for string in replacements:
    texts['text'] = texts['text'].str.replace(string[0], string[1])
dollar_lines = {'\$ *\(beginning lost\)': '$ beginning broken', '\$ *\(rest lost\)': '$ rest broken',
               '\$ *\(beginning of the col\. lost\)': '$ beginning of column broken', 
                '\$ *\(rest of the col\. lost\)': '$ rest of column broken'}
for string in dollar_lines:
    texts['text'] = texts['text'].str.replace(string, dollar_lines[string])  
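
Why the order matters can be seen by applying three of the rules above in two different orders. The sketch below uses plain `re`; the helper `apply_in_order` and the variable names are made up for this illustration.

```python
import re

def apply_in_order(text, rules):
    """Apply (pattern, replacement) pairs one after the other."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

breakage = (r'\[\.\.\.\] # *', '$ ')                      # [...] #  ->  $
note = (' # ', '\n# ')                                    # editorial note
strict = (r'\$ *\(beginning lost\)', '$ beginning broken')

line = '[...] # (beginning lost)'
good = apply_in_order(line, [breakage, note, strict])
bad = apply_in_order(line, [note, breakage, strict])
print(repr(good))  # '$ beginning broken'
print(repr(bad))   # '[...]\n# (beginning lost)'
```

In the second order the note rule consumes the ` # ` first, so the breakage rule no longer matches and the line never becomes a $-line.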

# Corrections and x-Readings
Editorial corrections are indicated with an exclamation mark. They may or may not be explained (by providing the actual sign in the text) at the end of the line, in the form:
> ha-bu₃!-da (=KA)

The corresponding convention in [ORACC](http://oracc.org) is:
> ha-bu₃!(KA)-da

Similarly, x-readings, such as `kuₓ` are identified at the end of the line in [BDTNS](http://bdtns.filol.csic.es/), but immediately after the x-reading in [ORACC](http://oracc.org).
> BDTNS: mu-kuₓ lugal-la (=DU)

> ORACC: mu-kuₓ(DU) lugal-la

The regex in the replace function below searches for these patterns and reorders them according to [ORACC](http://oracc.org) standards.

In some cases the (=SIGN) explanations in [BDTNS](http://bdtns.filol.csic.es/) do not explicitly refer to an element in the line, as in:
> {giš}kiri₆ zabalam₄{ki} gub-ba (=MUŠ₃.ZA.UNUG)

where (=MUŠ₃.ZA.UNUG) refers to zabalam₄, but without explicit reference. Those cases are put on a separate line and made into a footnote.
> {giš}kiri₆ zabalam₄{ki} gub-ba
#note: (=MUŠ₃.ZA.UNUG)

In [7]:
texts['text'] = texts['text'].str.replace('([ₓ\!])(.*?)(\(=)(.*?\))', '\\1(\\4\\2')
texts['text'] = texts['text'].str.replace('(\(=.*?\))', '\n#note: \\1')
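
The two substitutions can be tried on the three example lines quoted above. This sketch applies the same patterns with `re.sub` to single strings; the function name `move_sign_names` is invented for the illustration. Note that the reordering leaves a trailing space, which is stripped when the output files are written.

```python
import re

def move_sign_names(text):
    """Apply the two regexes from the cell above to one string."""
    # move (=SIGN) directly behind the first ! or ₓ in the line
    text = re.sub(r'([ₓ!])(.*?)(\(=)(.*?\))', r'\1(\4\2', text)
    # any (=SIGN) still left has no explicit referent: make it a note
    text = re.sub(r'(\(=.*?\))', '\n#note: \\1', text)
    return text

print(repr(move_sign_names('ha-bu₃!-da (=KA)')))
# 'ha-bu₃!(KA)-da '
print(repr(move_sign_names('mu-kuₓ lugal-la (=DU)')))
# 'mu-kuₓ(DU) lugal-la '
print(repr(move_sign_names('{giš}kiri₆ zabalam₄{ki} gub-ba (=MUŠ₃.ZA.UNUG)')))
# '{giš}kiri₆ zabalam₄{ki} gub-ba \n#note: (=MUŠ₃.ZA.UNUG)'
```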

# Compound Signs
Compound signs are surrounded by pipes in [ORACC](http://oracc.org), as in `girimₓ(|A.BU.HA.DU|)`. The code looks for sequences of capital letters, index numbers, square brackets and half brackets, with signs separated by dots or 'times' signs. The code will find instances like `MUŠ₃.ZA.UNUG`, `⸢MUŠ₃.ZA⸣.UNUG` or `M[UŠ₃.ZA.U]NUG` and replace those by `|MUŠ₃.ZA.UNUG|`, `|⸢MUŠ₃.ZA⸣.UNUG|` and `|M[UŠ₃.ZA.U]NUG|` respectively. If one of the square brackets occurs before or after the compound sign, the result will be wrong: `[e₂ HAR.HAR]` yields `[e₂ |HAR.HAR]|` and `e₂ HAR.[HAR sumun]` yields `e₂ |HAR.[HAR| sumun]`. Both of these will elicit error messages in [ORACC](http://oracc.org) (correct is `[e₂ |HAR.HAR|]` and `e₂ |HAR.[HAR]| [sumun]`, respectively).

In [8]:
texts['text'] = [re.sub('((([\[\]⸢⸣A-ZŠ₀-₉ₓ]+)[\.×])+[\[\]⸢⸣A-ZŠ₀-₉ₓ]+)', '|\\1|', text) for text in texts['text']]
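
The behaviour described above, including the two problematic bracket cases, can be reproduced on single strings. A small sketch with the same pattern; the function name `pipe_compounds` is made up for this example.

```python
import re

COMPOUND = r'((([\[\]⸢⸣A-ZŠ₀-₉ₓ]+)[\.×])+[\[\]⸢⸣A-ZŠ₀-₉ₓ]+)'

def pipe_compounds(text):
    """Surround dot- or ×-separated sign sequences with pipes."""
    return re.sub(COMPOUND, r'|\1|', text)

print(pipe_compounds('MUŠ₃.ZA.UNUG'))        # |MUŠ₃.ZA.UNUG|
print(pipe_compounds('girimₓ(A.BU.HA.DU)'))  # girimₓ(|A.BU.HA.DU|)
# brackets at the edge of the compound end up on the wrong side of the pipe
print(pipe_compounds('[e₂ HAR.HAR]'))        # [e₂ |HAR.HAR]|
print(pipe_compounds('e₂ HAR.[HAR sumun]'))  # e₂ |HAR.[HAR| sumun]
```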

# Name Capitalization
In [BDTNS](http://bdtns.filol.csic.es/)-style transliteration, names (proper names, city names, etc.) are capitalized, as in `{d}Inana`, or `Šu-{d}Suen`. This is not allowed in [ORACC](http://oracc.org) (capitalization of proper nouns is used only in the GuideWord in lemmatization). The first line in the code looks for a sequence of a capitalized letter, followed by 0 or 1 square brackets (`[` or `]`), followed by a lowercase letter (`Šulgi`, `Š[ulgi]`, `[Š]ulgi`, or `⸢Šulgi⸣`) and calls a function to lowercase the group. The second line looks for a single (capitalized) vowel, preceded by a word boundary (this includes `[`, `⸢`, `}` and `-`) and followed by zero or more index numbers, followed by 0 or 1 closing square bracket (or half-bracket), followed by a dash. The same function is called to lowercase the resulting group. This second regex finds cases such as `A-a-kal-la`, `⸢A⸣-a-kal-la`, `[E₂]-Šu-Suen`, or `Arad-A-a`.

For the use of a function in the `replace()` function in order to change the case of a regex group, see this message in [Stack Overflow](http://stackoverflow.com/questions/4145451/using-a-regular-expression-to-replace-upper-case-repeated-letters-in-python-with?noredirect=1&lq=1).

In [9]:
lower = lambda pat: pat.group(1).lower()
texts['text'] = [re.sub(r'((?<=\b)[A-ZŠ][\[\]]?[a-zš])', lower , text) for text in texts['text']]
texts['text'] = [re.sub(r'((?<=\b)[AEIU][₀-₉]*[\]⸣]?-)', lower, text) for text in texts['text']]
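
A quick check of both regexes on the names mentioned above. The sketch wraps the two substitutions in a function (`lowercase_names`, a name chosen for this example only) so that they can be applied to single strings.

```python
import re

lower = lambda pat: pat.group(1).lower()

def lowercase_names(text):
    """Apply both capitalization regexes from the cell above."""
    # capital + optional square bracket + lowercase letter
    text = re.sub(r'((?<=\b)[A-ZŠ][\[\]]?[a-zš])', lower, text)
    # single capital vowel + optional index + optional bracket + dash
    text = re.sub(r'((?<=\b)[AEIU][₀-₉]*[\]⸣]?-)', lower, text)
    return text

for name in ['{d}Inana', 'Šu-{d}Suen', '[Š]ulgi',
             'A-a-kal-la', '[E₂]-Šu-Suen', 'Arad-A-a']:
    print(name, '->', lowercase_names(name))
```

Sign names in capitals (such as `KA`) are not affected, because neither pattern matches a capital that is followed by another capital.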

# Line Breaks
To indicate a line break within a line of text, the [ORACC](http://oracc.org) convention is a semicolon. The semicolon is placed *before* a connecting dash. The [BDTNS](http://bdtns.filol.csic.es) convention for this is a slash (`/`), which is placed *after* a connecting dash. The forward slash is also used in fractions (such as `5/6`), where it should not be replaced. Occasionally, the slash in [BDTNS](http://bdtns.filol.csic.es/) is used immediately before a sign (without space or dash), which is not allowed for the semicolon in [ORACC](http://oracc.org). The third line in the code below looks for a semicolon followed by anything but a space, a dash, or an opening curly bracket (using a lookahead with a negated character class). If found, it replaces the semicolon by a semicolon plus space.

Problems arise where a line break comes immediately after a determinative before a word, as in `{d}/Inana` (BDTNS style). ORACC currently does not have a proper way of doing that.

In [10]:
texts['text'] = texts['text'].str.replace('(?<![0-9])/', ';')
texts['text'] = texts['text'].str.replace('-;', ';-')
texts['text'] = [re.sub(r';(?=[^- \{])', '; ', text) for text in texts['text']]
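
The three steps can be followed on individual strings. A minimal sketch with the same patterns; the function name `convert_line_breaks` and the sample strings are invented for the illustration.

```python
import re

def convert_line_breaks(text):
    """Turn BDTNS slashes into ORACC semicolons (fractions are kept)."""
    text = re.sub(r'(?<![0-9])/', ';', text)   # slash not preceded by a digit
    text = text.replace('-;', ';-')            # semicolon goes before the dash
    text = re.sub(r';(?=[^- \{])', '; ', text) # add a space where needed
    return text

print(convert_line_breaks('lu₂-/kal-la'))  # lu₂;-kal-la
print(convert_line_breaks('5/6 ma-na'))    # 5/6 ma-na
print(convert_line_breaks('ab-ba /dumu'))  # ab-ba ; dumu
```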

# Objects
[BDTNS](http://bdtns.filol.csic.es) distinguishes between `Tabl.` and `Env.`. The corresponding [ORACC](http://oracc.org) conventions are `@tablet` and `@envelope`.

In [11]:
objects = [['Tabl\.+Env\.' , '@tablet + envelope'], ['Env\.' , '@envelope'], ['Tabl\.' , '@tablet']]
# 'obj' rather than 'object', to avoid shadowing the Python built-in
for obj in objects:
    texts['object'] = texts['object'].str.replace(obj[0], obj[1])

# Surface terminology
Replace `r.` with `@reverse` etc. In [ORACC](http://oracc.org) the surface designation `@seal` is always followed by a number. In [BDTNS](http://bdtns.filol.csic.es) this is only done if there is more than one seal. Simple `Seal` is therefore replaced by `Seal 1`.

In [12]:
texts['surface'] = ['Seal 1' if surface.strip().lower() == 'seal' else surface for surface in texts['surface']]
    
surface = {'r\.': '@reverse', 'up\.ed\.': '@top', 'lo\.ed\.': '@bottom', 'le\.ed\.': '@left', 
           'l\.ed\.': '@left', 'Seal': '@seal'}
for term in surface:
    texts['surface'] = texts['surface'].str.replace(term, surface[term])

# Column Numbers
Change from roman to arabic numbers.

In [13]:
column = {'i' :'1', 'ii': '2', 'iii': '3', 'iv': '4', 'v': '5', 'vi': '6', 'vii': '7', 'viii': '8',
         'ix': '9', 'x': '10', 'xi': '11', 'xii': '12', 'xiii': '13', 'xiv': '14', 'xv': '15', 'xvi': '16',
         'xvii': '17', 'xviii': '18', 'xix': '19', 'xx': '20', 'xxi': '21', 'xxii': '22', 'xxiii': '23', 'xxiv': '24'}
for x in column:
    texts['column'] = texts['column'].str.replace('\\b'+x+'\\b', column[x])
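
A hand-typed lookup table is easy to get wrong (a repeated key silently overwrites an earlier entry). As a possible alternative, the value could be computed from the numeral. The following is a sketch only, assuming that column fields contain lowercase roman numerals; `ROMAN_VALUES`, `roman_to_arabic` and `convert_column` are names invented here.

```python
import re

ROMAN_VALUES = {'i': 1, 'v': 5, 'x': 10, 'l': 50}

def roman_to_arabic(numeral):
    """Convert a lowercase roman numeral to an arabic number (as string)."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    # a smaller value before a larger one is subtracted (iv = 4, ix = 9)
    for value, following in zip(values, values[1:] + [0]):
        total += -value if value < following else value
    return str(total)

def convert_column(field):
    """Replace every stand-alone roman numeral in a column field."""
    return re.sub(r'\b[ivxl]+\b', lambda m: roman_to_arabic(m.group(0)), field)

print(convert_column('iv'))     # 4
print(convert_column('xiii'))   # 13
print(convert_column('xxiii'))  # 23
print(convert_column('xxiv'))   # 24
```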

# ID_TEXT
Create a field `id_text`, which equals the [CDLI](http://cdli.ucla.edu) number if available, otherwise the [BDTNS](http://bdtns.filol.csic.es/) number preceded by X. Note that the [BDTNS](http://bdtns.filol.csic.es/) number is an integer; it has to be transformed into a string.

In [14]:
texts['id_text'] = [texts.loc[i, 'cdli'] if not texts.loc[i, 'cdli'] == 
                    '' else 'X' + str(texts.loc[i, 'bdtns']) for i in range(len(texts))]
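
The rule for a single row, as a sketch. The helper name `make_id_text` is hypothetical, and the BDTNS number `12345` is an invented value used only to show the fallback.

```python
def make_id_text(cdli, bdtns):
    """Use the CDLI number if there is one, else 'X' + BDTNS number."""
    return cdli if cdli else 'X' + str(bdtns)

print(make_id_text('P332336', 12345))  # P332336
print(make_id_text('', 12345))         # X12345
```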

# Create Dictionary for Publication Details.
This dictionary is derived from a second file. The dictionary `id_publication` has the CDLI P-number (or, if absent, a number derived from the BDTNS number) as key, and a publication abbreviation as value.

In [15]:
publications = pd.read_csv('../raw-data/Editions_Garshana.tab', 
                    delimiter = '\t', encoding = "utf8", 
                    header = None, 
                    names=['bdtns', 'cdli', 'publication'])
publications = publications.fillna('')
publications['id_text'] = [publications.loc[i, 'cdli'] if publications.loc[i, 'cdli'] 
                           else 'X' + str(publications.loc[i, 'bdtns']) for i in range(len(publications))]
publications = publications.drop_duplicates(subset = 'id_text')
id_publication = {publications.loc[i, 'id_text'] : publications.loc[i, 'publication'] 
                  for i in range(len(publications))}

# Create Output Directory
For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist)

In [None]:
import errno
import os
try:
    os.mkdir('../output')
except OSError as exc:
    # ignore "directory already exists"; re-raise anything else
    if exc.errno != errno.EEXIST:
        raise
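
In Python 3.2 and later the same effect can be had with `os.makedirs` and its `exist_ok` argument. The sketch below demonstrates this in a temporary directory, so that it can be run without touching `../output`; the variable names are chosen for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, 'output')
    os.makedirs(target, exist_ok=True)   # creates the directory
    os.makedirs(target, exist_ok=True)   # second call: no error raised
    created = os.path.isdir(target)

print(created)  # True
```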

# Create Output Files
Add all the data elements in the appropriate order; strip leading and trailing spaces from each field. Each text is saved as a separate `.atf` file in the directory `../output`. The flag `start_text` is used to put the `@obverse` line in the appropriate place.
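
Each file opens with an &-line, the project line and the protocol lines. As an illustration of what that header looks like, here is a sketch that builds it for one text. The function `atf_header` is a hypothetical helper, and `'Publication 1'` is a placeholder, not an actual publication abbreviation from the data.

```python
def atf_header(id_text, publication, project='garshana'):
    """Hypothetical helper: the lines that open each .atf file."""
    and_line = '&' + id_text + ' = ' + publication + '\n'
    protocols = '#atf: use unicode\n#atf: use legacy\n#atf: use math\n\n'
    return and_line + '#project: ' + project + '\n' + protocols

header = atf_header('P332336', 'Publication 1')
print(header)
```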

In [None]:
id_text = ""
start_text = False
for i in range(len(texts)):
    if not texts.loc[i, 'id_text'] == id_text:
        if id_text:
            filename = '../output/' + id_text + '.atf'
            with open(filename, 'w', encoding = 'utf8') as f:
                f.write(atf)
        id_text = texts.loc[i, 'id_text']
        print("Processing " + id_text)
        and_line = '&' + id_text + ' = ' +  id_publication[id_text] + '\n'
        project = 'garshana'
        protocols =  '#atf: use unicode\n#atf: use legacy\n#atf: use math\n\n'
        atf = and_line + '#project: ' + project + '\n' + protocols
        start_text = True
    else:
        if texts.loc[i, 'object']:
            atf = atf + texts.loc[i, 'object'].strip() + '\n'
            if not texts.loc[i, 'object'].startswith('@seal'):
                start_text = True
        if start_text == True:
            atf = atf + '@obverse\n'
            start_text = False
        if texts.loc[i, 'surface']:
            atf = atf + texts.loc[i, 'surface'].strip() + '\n'
        if texts.loc[i, 'column']:
            atf = atf + '@column ' + texts.loc[i, 'column'].strip() + '\n'
        if texts.loc[i, 'line_no']:
            atf = atf + texts.loc[i, 'line_no'].strip() + '. '
            # do not add a newline after line_no!
        if texts.loc[i, 'text']:
            atf = atf + texts.loc[i, 'text'].strip() + '\n'

# now save the last text
filename = '../output/' + id_text + '.atf'
with open(filename, 'w', encoding = 'utf8') as f:
    f.write(atf)

Processing P332336
Processing P332510
Processing P332511
Processing P329447
Processing P322812
Processing P323842
Processing P322569
Processing P322924
Processing P322729
Processing P322469
Processing P322730
Processing P322736
Processing P324421
Processing P329371
Processing P324460
Processing P322731
Processing P329377
Processing P322732
Processing P329448
Processing P322737
Processing P323245
Processing P322753
Processing P322825
Processing P329389
Processing P322920
Processing P329889
Processing P329443
Processing P329871
Processing P322479
Processing P324901
Processing P322734
Processing P322735
Processing P325874
Processing P325896
Processing P329379
Processing P322742
Processing P322927
Processing P322573
Processing P322752
Processing P329383
Processing P329402
Processing P325943
Processing P332467
Processing P322751
Processing P322921
Processing P322780
Processing P322741
Processing P322650
Processing P329385
Processing P332468
Processing P332469
Processing P322928
Processing P