This notebook transforms the main part of Delnero's transliteration file of all Old Babylonian Emesal Liturgies into ORACC-compatible ATF. A few preparatory steps were carried out beforehand: adjusting all special characters (š, Š, ŋ, ṣ, Ṣ, ṭ, Ṭ), replacing ḫ with h, and replacing all superscripts and subscripts with the appropriate annotation.

The script tries to recognize the start of a new text (separated from the preceding text by a line of hyphens, ----) and to match the museum number in its header to a P number.

The section in Delnero's file that deals with Uru'ama'irabi has been skipped: multiple tablets are edited there in a score-like fashion that is too complex to disentangle.

The script needs a copy of the full CDLI catalog in CSV format (downloaded from https://github.com/cdli-gh/data) and a file called Delnero_main.txt. The latter is, in essence, Delnero's transliteration file, with special characters etc. adjusted (as explained above) and without the section on Uru'ama'irabi.

In [None]:
import pandas as pd
import re

In [None]:
cat = pd.read_csv('cdli_cat.csv', encoding='utf8', sep=',', low_memory=False).fillna('')
cat

Add a second museum_no column and a second designation column (museum_no2 and designation2). They differ from the original museum_no and designation columns only in that all leading zeros are removed. Adjust the field id_text so that it becomes a string consisting of a P followed by six digits.

In [None]:
cat['museum_no2'] = [re.sub(' 0+', ' ', no) for no in cat.museum_no]
cat['designation2'] = [re.sub(' 0+', ' ', des) for des in cat.designation]
cat['id_text'] = ['P' + str(idtext).zfill(6) for idtext in cat['id_text']]

In [None]:
cat[['designation', 'id_text', 'museum_no', 'museum_no2', 'designation2']][:10]
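
To make the normalization concrete, here is a minimal sketch that applies the same two operations to a few made-up values. The sample museum numbers and IDs are illustrative assumptions, not rows from the CDLI catalog.

```python
import re

# Hypothetical sample values; not taken from the real catalog.
sample_museum_nos = ['BM 023456', 'NBC 01313', 'UM 29-16-007']
sample_ids = [297213, 1234]

# Same substitution as above: drop zeros that directly follow a space.
normalized = [re.sub(' 0+', ' ', no) for no in sample_museum_nos]
# Same padding as above: 'P' followed by six digits.
p_numbers = ['P' + str(i).zfill(6) for i in sample_ids]

print(normalized)
print(p_numbers)
```

Note that zeros following a hyphen (as in the last sample value) are left alone, because the pattern only matches zeros that come directly after a space.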

Create a dictionary with museum number and designation as key and P number (id_text) as value. If a tablet has multiple museum numbers (joins), make each of those museum numbers into a key with the same P number as value. This dictionary is used to recognize the museum numbers and other text designations in Delnero's file.

In [None]:
dict_musno = dict(zip(cat['museum_no2'], cat['id_text']))
dict_desig = dict(zip(cat['designation2'], cat['id_text']))
dict_musno.update(dict_desig)
# Register every fragment of a join under the P number of the joined tablet.
for mus, idtext in zip(cat['museum_no2'], cat['id_text']):
    if '+' in mus:
        for ms in mus.split('+'):
            dict_musno[ms.strip()] = idtext
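
The handling of joins can be illustrated with a toy example. The sketch below uses a plain list of pairs instead of the catalog DataFrame, and the museum numbers and P numbers in it are invented for illustration.

```python
# Hypothetical two-entry catalog: (museum number, P number).
toy_catalog = [
    ('BM 23456', 'P000001'),
    ('CBS 1234 + N 5678', 'P000002'),
]

lookup = dict(toy_catalog)
for mus, idtext in toy_catalog:
    if '+' in mus:
        # Each fragment of a join points to the P number of the whole tablet.
        for part in mus.split('+'):
            lookup[part.strip()] = idtext

print(lookup)
```

After this step, both the full join string and each individual fragment can be looked up, so a header that cites only one fragment still resolves to the right P number.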

In [None]:
with open('Delnero_main.txt', 'r', encoding='utf8') as d:
    lit = d.read().splitlines()

The variable musnos is a compiled regular expression that finds museum numbers such as 'BM 23456' or 'UM 29-16-123', or text designations such as 'PRAK B 52', in the header of a text edition in Delnero's file. In the header, such a number is preceded by one or two asterisks, optionally followed by a closing parenthesis and/or a space.
The variable linenos is a compiled regular expression that finds line numbers in the edition. A line number begins with a digit and ends with a colon. Occasionally, a line number may be followed by an alternative line number (probably from an earlier publication), such as "15' (90):".

In [None]:
# Raw strings keep the backslash escapes intact for the regex engine.
musnos = re.compile(r'^\*{1,2}\)? ?(([(A-Z]+ ){1,2}[0-9\-]+)')
linenos = re.compile(r'(^[0-9].{0,3}( .{0,5})?):')
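
As a quick check of what the two patterns capture, the sketch below runs them on a few sample lines. The header and text lines are made up for illustration and are not quoted from Delnero's file.

```python
import re

# Raw-string versions of the two patterns defined above.
musnos = re.compile(r'^\*{1,2}\)? ?(([(A-Z]+ ){1,2}[0-9\-]+)')
linenos = re.compile(r'(^[0-9].{0,3}( .{0,5})?):')

# Hypothetical header lines.
headers = ['* BM 23456', '**) UM 29-16-123', '* PRAK B 52 (copy)']
# findall returns one tuple per match; the first element is the outer group.
found = [musnos.findall(h)[0][0] for h in headers]

# Hypothetical text lines, one with an alternative line number.
text_lines = ["1: a-ab-ba", "15' (90): e2-e"]
numbers = [linenos.findall(t)[0][0] for t in text_lines]

print(found)
print(numbers)
```

Because both patterns contain nested groups, `findall` returns tuples, which is why the main loop indexes the result with `[0][0]`.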

The main part of the script recognizes the beginning of a new text, adds the P number of that text, and deals with \@-lines (indicating columns and surfaces) and so-called \\$-lines (for single and double rulings, and for broken passages). Text lines are recognized by a line number at the beginning of the line. All collation marks are removed from text lines (there are too many of them), and several other formatting tasks are carried out on those lines. All other lines (those not recognized as headers, \@-lines, \\$-lines, or text lines) are treated as comment lines (no distinction is made between comments and translations).

In [None]:
newtext = ['&P297213 = NBC 01313\n#project: obel\n#atf: use unicode']
recognized = 0
newtextalert = False
Xno = 700001
for line in lit:
    characters = list(set(line.strip()))
    m_l = re.findall(musnos, line)
    l_l = re.findall(linenos, line)
    if line.strip() == '':
        continue
    if len(characters) == 1 and characters[0] == '-':
        newtextalert = True
        continue
    if newtextalert:
        if m_l:
            m = m_l[0][0]
            m = m.replace('CNMA', 'NMC')
            m = m.replace('MIO', "Ist Ni")
            if '-' in m:
                m_split = m.split('-')
                if m_split[-1].isdigit():
                    m_split[-1] = m_split[-1].zfill(3)
                    m = '-'.join(m_split)
            p = dict_musno.get(m, m)
            if p == m:
                p = 'X' + str(Xno)
                Xno += 1
            else:
                recognized +=1
            line = f"&{p} = {m}\n#project: obel\n#atf: use unicode"
            print(f"{p} = {m}")
            newtextalert = False
            newtext.append(line)
            continue
    if l_l:
        l = l_l[0][0] + ':'
        l_new = l.replace(' ', '_') + '.'
        l_new = l_new.replace(':', '')
        line = line.replace(l, l_new)
        line = line.replace('*', '')
        line = line.replace('>>', '')
        line = line.replace('>', '#')
        line = line.replace('<', '')
        line = line.replace('. . .', '...')
        line = line.replace('b.s.', '($blank space$)')
    elif line.strip().startswith('----'):
        if 'double' in line.lower():
            line = '$ double ruling'
        elif 'single' in line.lower():
            line = '$ single ruling'
        else:
            line = '#' + line
    elif 'rest of column broken' in line.lower():
        line = '$ rest of column broken'
    elif 'beginning of column broken' in line.lower():
        line = '$ beginning of column broken'
    elif 'rest of obverse broken' in line.lower():
        line = '$ rest of obverse broken'
    elif 'beginning of obverse broken' in line.lower():
        line = '$ beginning of obverse broken'
    elif 'rest of reverse broken' in line.lower():
        line = '$ rest of reverse broken'
    elif 'beginning of reverse broken' in line.lower():
        line = '$ beginning of reverse broken'
    elif line.lower().startswith('obv.'):
        line = "@obverse"
    elif line.lower().startswith("rev."):
        line = "@reverse"
    elif line.lower().startswith("col."):
        # Replace the first four characters so that 'Col.' is handled as well as 'col.'.
        line = '@column' + line[4:]
        line = line.replace('ix', '9')
        line = line.replace('x', '10')
        line = line.replace('viii', '8')
        line = line.replace('vii', '7')
        line = line.replace('vi', '6')
        line = line.replace('iv', '4')
        line = line.replace('v', '5')
        line = line.replace('iii', '3')
        line = line.replace('ii', '2')
        line = line.replace('i', '1')
    else:
        line = '#' + line
    newtext.append(line)
print(f"{recognized} texts recognized; {Xno - 700001} texts not recognized.")
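
The replacements applied to text lines can be tried out in isolation. The following is a minimal sketch that wraps them in a hypothetical helper, `convert_text_line` (not part of the original script), and runs it on an invented line.

```python
import re

linenos = re.compile(r'(^[0-9].{0,3}( .{0,5})?):')

def convert_text_line(line):
    """Apply the text-line replacements from the loop above to one line.

    Hypothetical helper; returns the line unchanged if no line number is found.
    """
    found = linenos.findall(line)
    if not found:
        return line
    label = found[0][0] + ':'
    # "15' (90):" becomes "15'_(90)."
    new_label = label.replace(' ', '_').replace(':', '') + '.'
    line = line.replace(label, new_label)
    # Same replacements, in the same order, as in the main loop.
    for old, new in [('*', ''), ('>>', ''), ('>', '#'), ('<', ''),
                     ('. . .', '...'), ('b.s.', '($blank space$)')]:
        line = line.replace(old, new)
    return line

print(convert_text_line("15' (90): *e2 <nu>-un . . ."))
```

The order of the replacements matters: `>>` has to be removed before `>` is turned into `#`, otherwise every `>>` would end up as `##`.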

In [None]:
newtext
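
The chain of string replacements used for column lines covers columns i through x; a numeral such as xi would come out as 101. If higher column numbers or capitalized headers occur, one possible alternative is to parse the numeral explicitly. The sketch below is an assumption-laden illustration, not the method used above: `roman_to_int` and `convert_column_line` are hypothetical helpers, and anything following the numeral on the line is discarded.

```python
import re

ROMAN = {'i': 1, 'v': 5, 'x': 10}

def roman_to_int(numeral):
    """Convert a lowercase Roman numeral built from i, v and x to an integer."""
    values = [ROMAN[ch] for ch in numeral]
    total = 0
    for idx, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4, ix = 9).
        if idx + 1 < len(values) and value < values[idx + 1]:
            total -= value
        else:
            total += value
    return total

def convert_column_line(line):
    """Turn e.g. 'col. iv' into '@column 4'; other lines are returned unchanged."""
    match = re.match(r'col\.\s*([ivx]+)\b', line, flags=re.IGNORECASE)
    if match is None:
        return line
    return f"@column {roman_to_int(match.group(1).lower())}"

for sample in ['col. iv', 'Col. xii', 'col. ix', 'obv.']:
    print(convert_column_line(sample))
```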