# Data Acquisition BDTNS

Search results in [BDTNS](http://bdtns.filol.csic.es/) can be downloaded with the "Export" button in the left pane. Searching for the empty string selects all texts currently available. The export creates two files: one with transliterations and one with metadata. This notebook deals mainly with the transliteration file. Its name is "query_text_" followed by a date and an additional number. Move the file to a convenient location and make sure that the variable `file` corresponds to your file name and file location.

In [1]:
import pandas as pd
import regex

In [2]:
file = 'data/query_text_19_05_15-020951.txt'
with open(file, mode = 'r', encoding = 'utf8') as f:
    bdtns = f.read().splitlines()

# Remove empty lines and editorial remarks

In [3]:
bdtns = [line for line in bdtns if len(line.strip()) > 0]
bdtns = [line.split("#")[0] for line in bdtns]  # remove editorial remarks - everything after #

# Make OGSL compliant
[OGSL](http://oracc.org/ogsl) is the ORACC Global Sign List, which lists for each sign its possible readings. [OGSL](http://oracc.org/ogsl) compliance opens the possibility to search or compare by sign *name* rather than sign value. For instance, one may search for the sequence "aga₃ kug-sig₁₇" (golden tiara) and find a line reading "gin₂ ku₃-GI". [ORACC](http://oracc.org) projects are [OGSL](http://oracc.org/ogsl) compliant by definition (a project will not build if it does not follow [OGSL](http://oracc.org/ogsl) rules) and [CDLI](http://cdli.ucla.edu) data can easily be transformed to [OGSL](http://oracc.org/ogsl) compliance. Transforming [BDTNS](http://bdtns.filol.csic.es) data to [OGSL](http://oracc.org/ogsl) compliance therefore also makes it possible to compare the editions of the same text in the three projects.

The main steps towards [OGSL](http://oracc.org/ogsl) compliance are: 

- replace regular numbers by index numbers in sign values
- add sign explication to x-values

We will deal with the index numbers after creating the DataFrame. This will make it easier to apply the search/replace *only* on the transliteration (and not, for instance, on line or text numbers).

# Dealing with x-values
Sign readings that have not (yet) received a commonly accepted index receive the ₓ index, followed by the sign name, as in ziₓ(SIG₇). In BDTNS the subscripted ₓ is represented by a capital X and the sign name is given at the end of the line, as in

> o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)

Two steps are needed: the capital X must be replaced by the index ₓ, and "SIG₇" must be moved to the place immediately after the index, between brackets.

The second step can be done with a regular expression that matches the X, everything following the X (Group 1), and everything between the round brackets (Group 2). This match is then reordered as X(Group 2)Group 1.

This will result in: 
> o. 2     gi ziX(SIG7)-a 12 sar-⌈ta⌉
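The reordering can be tried on the example line in isolation. This is a minimal sketch, not the notebook's own cell: it uses the standard-library `re` module (the pattern needs no `regex`-only features), and the function name `move_explication` is made up for the illustration.

```python
import re

def move_explication(line):
    """Attach a trailing '(=SIGN)' explication to the first capital X in the line."""
    # X, then Group 1 (everything up to '(='), then Group 2 (the sign name)
    pattern = r'X([^=]*)\(=([^)]*)\)(?:\s|$)'
    # reorder as X(Group 2)Group 1 and drop the trailing space left behind
    return re.sub(pattern, r'X(\2)\1', line).rstrip()

line = "o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)"
print(move_explication(line))
# o. 2     gi ziX(SIG7)-a 12 sar-⌈ta⌉
```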

In practice, however, there are quite a few exceptions where the pattern presented above is not followed. Examples include

> 18 gin₂ nagga mu-kuX gibil (=AN.NA) (=DU)

Here, (=AN.NA) explains the infrequent sign "nagga". However, since "nagga" is not followed by X the regular expression will not recognize that and this will result in

> 18 gin₂ nagga mu-kuX(AN.NA) gibil (=DU)

Another exception is reduplicated "gurₓ-gurₓ", which is represented thus: 
> 6.0.0 še ur₅-ra še gurX-gurX-ta su-ga (=ŠE.KIN.ŠE.KIN)

resulting in: 
> 6.0.0 še ur₅-ra še gurX(ŠE.KIN.ŠE.KIN)-gurX-ta su-ga

Finally, a few x-values are usually not resolved this way, for instance: 
> 1 sila₄ Ur-nigarₓ{gar} \[...\]

with no explanation of the nigarₓ - presumably because it is unambiguous.
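The first two exceptions can be reproduced with the same pattern. The sketch below uses the standard-library `re` module and a made-up helper name. It only illustrates why the naive approach misassigns the explication.

```python
import re

PATTERN = r'X([^=]*)\(=([^)]*)\)(?:\s|$)'

def move_explication(line):
    # hypothetical helper: attach the first '(=...)' to the first capital X
    return re.sub(PATTERN, r'X(\2)\1', line).rstrip()

# (=AN.NA) belongs to "nagga" but is attached to kuX; (=DU) is left behind
print(move_explication("18 gin₂ nagga mu-kuX gibil (=AN.NA) (=DU)"))
# 18 gin₂ nagga mu-kuX(AN.NA) gibil (=DU)

# only the first gurX receives the explication
print(move_explication("6.0.0 še ur₅-ra še gurX-gurX-ta su-ga (=ŠE.KIN.ŠE.KIN)"))
# 6.0.0 še ur₅-ra še gurX(ŠE.KIN.ŠE.KIN)-gurX-ta su-ga
```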

For these reasons we will approach the x-values in three different ways:

- x-values that are frequent and resolve unambiguously are resolved with a simple search and replace, replacing, e.g. ziₓ with ziₓ(SIG₇), without paying attention to the [BDTNS](http://bdtns.filol.csic.es) sign explication (=SIG7). 
- mu-kuX(DU) is treated separately, because mu-kuX is very frequent, but kuX may also be resolved as kuX(LIL) or kuₓ(KWU147) (for the verb "to enter")
- x-values that do not resolve unambiguously (muruₓ, ušurₓ, ummuₓ, and several others) are resolved by moving the [BDTNS](http://bdtns.filol.csic.es) sign explanation after the X sign between brackets, as explained above.

# Step 1: Unambiguous and frequent x-values

# REDO Text
Some x-values are always resolved in the same way. Thus, ziX is always ziₓ(SIG₇), hirinX is always hirinₓ(KWU318), and gurX is always gurₓ(|ŠE.KIN|). In some cases, x-values have been assigned an index number in [OGSL](http://oracc.org/ogsl). In those cases (nigarₓ = nigar; nemurₓ(PIRIG.TUR) = nemur₂; and nagX(GAZ) = nag₃) the appropriate index number is added and the sign explication is discarded.

This step is particularly important for frequently attested x-values such as ziX, or gurX. Such values may occasionally have no sign explication, or are used in combination with other x-values in the same line.

The substitution is done with a regular expression that tests for word boundaries (represented by '\\b'). By doing so, 'subX' will not match 'munsubX' and 'girX' will not match 'gigirX'.
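The effect of the word boundaries can be checked directly. This sketch uses the standard-library `re` module, where `\b` behaves the same way for these strings. The helper name `has_xvalue` and the word 'subX-ba' are made up for the illustration.

```python
import re

def has_xvalue(value, word):
    # whole-word, case-insensitive test for an x-value inside a word
    pattern = r'\b' + re.escape(value.lower()) + r'\b'
    return re.search(pattern, word.lower()) is not None

print(has_xvalue('subX', 'munsubX'))  # False: no boundary between 'mun' and 'sub'
print(has_xvalue('subX', 'subX-ba'))  # True: the hyphen counts as a boundary
print(has_xvalue('girX', 'gigirX'))   # False
```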

# mu-kuₓ(DU)
# REDO text
This is a complex case, because mu-kuX ("delivery") is very frequent, may well be used in combination with other sign readings that receive explication, and because kuX is not unambiguous (kuX is also used for kuₓ(LIL) = "to enter"). We want to replace kuX by kuₓ(DU), but only in the form mu-kuX (potentially with suffixes). The form mu-kuX may be represented as \<mu\>-kuₓ or \[mu\]-kuX or mu?-kuX, etc.

We use the `translate()` function to ignore the flags while searching for mu-kuX. If a line contains mu-kuX (or mu?-kuX, etc), the line is split into words and each word is tested for the presence of mu-kuX. If that is the case, the X in the original word is replaced by ₓ(DU).
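The idea of testing a flag-free copy while editing the original word can be sketched as follows. The helper name `resolve_mu_kux` and the word 'ba-an-kuX' are made up, and the test is simplified to a substring check rather than the word-boundary regular expression.

```python
flags = "][!?<>⸢⸣⌈⌉*"
table = str.maketrans(dict.fromkeys(flags))  # each flag maps to None, i.e. is deleted

def resolve_mu_kux(word):
    # test the flag-free form, but edit the original word so the flags survive
    if "mu-kux" in word.translate(table).lower():
        return word.replace("X", "ₓ(DU)")
    return word

for w in ["mu-kuX", "[mu]-kuX", "mu?-kuX-ra", "ba-an-kuX"]:
    print(w, "->", resolve_mu_kux(w))
# mu-kuX -> mu-kuₓ(DU)
# [mu]-kuX -> [mu]-kuₓ(DU)
# mu?-kuX-ra -> mu?-kuₓ(DU)-ra
# ba-an-kuX -> ba-an-kuX
```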

In [4]:
xvalues = {'nagX' : '₃', 'nigarX' : '', 'nemurX' : '₂', 'pešX' : '₁₄',
        'ziX' : 'ₓ(SIG₇)' ,  'urubX' : 'ₓ(|URU×KAR₂|)',
        'alX' : 'ₓ(TUR₃)' , 'sagX' : 'ₓ(|ŠE.KIN|)', 'zahX' : 'ₓ(ŠEŠ)', 
        'šaganX' : 'ₓ(|AMA.GAN|)', 'sullimX' : 'ₓ(EN)',
        'gurX' : 'ₓ(|ŠE.KIN|)', 'niginX' : 'ₓ(|LAL₂×SAR|)',
        'gišbunX' : 'ₓ(|KI.BI|)',  
        'girX' : 'ₓ(GI)', 'gigirX' : 'ₓ(|LAGAB×MU|)',
        'tubaX' : 'ₓ(TUG₂)', 'giparX' : 'ₓ(KISAL)',
        'zabalamX' : 'ₓ(|MUŠ₃.TE.UNUG|)', 'duruX' : 'ₓ(U₃)',
        'dagX' : 'ₓ(KWU844)', 'hirinX' : 'ₓ(KWU318)', 
        'bulugX' : 'ₓ(|ŠIM×KUŠU₂|)', 'subX' : 'ₓ(|DU.DU|)',
        'munsubX' : 'ₓ(|PA.USAN|)', 'kurunX' : 'ₓ(|DIN.KAŠ|)', 'mu-kuX' : 'ₓ(DU)', 
        'zahdaX' : 'ₓ(|ŠAH₂.NE.TUR|)', 'šuX' : 'ₓ(TAG)', 'ulušinX' : 'ₓ(|BI.AŠ|)', 
        'durunX' : 'ₓ(|KU.KU|)'}

In [5]:
flags = "][!?<>⸢⸣⌈⌉*"
table = str.maketrans(dict.fromkeys(flags))  # maps every flag to None, i.e. deletes it
for value, replacement in xvalues.items():
    # whole-word match, so that 'subX' does not match 'munsubX'
    pattern = regex.compile(r'\b' + regex.escape(value.lower()) + r'\b')
    for x, line in enumerate(bdtns):
        if not pattern.search(line.translate(table).lower()):
            continue
        o = line.split("     ")  # split line no from text
        if len(o) < 2:  # no separator between line no and text: leave the line alone
            continue
        m = o[1].split()
        for i, w in enumerate(m):
            # test the word without flags, but replace in the original word
            if pattern.search(w.translate(table).lower()):
                m[i] = w.replace("X", replacement)
        o[1] = " ".join(m)
        bdtns[x] = "     ".join(o)  # put line no and text back together

# Ambiguous x-values
For ambiguous x-values the only choice is to copy the [BDTNS](http://bdtns.filol.csic.es) sign explication, found at the end of the line, to the x-value. Rather than doing that across the board (for all x-values), the code below lists some important x-values that are resolved variously. For instance, ummu₃ is |A.EDIN.LAL|, but the sign complex has many variants, all rendered ummuX: EDIN.A.SU, A.EDIN, A.EDIN.A.LAL, EDIN, etc. In a second step, compound signs (such as EDIN.A.SU) are provided with pipes, as is done in [OGSL](http://oracc.org/ogsl).

The first regular expression used here contains the following elements: 
- 'X': the capital X that marks the x-value. The variable x ('BuranunX', 'SamanX', etc.) is only used to select the lines to which the expression is applied.
- '(\[^=\]\*)': This is the first group (indicated by the round brackets); it matches 0 or more characters that are not the equal sign.
- '\\(=': This matches the literal opening bracket (the preceding backslash is an escape, indicating that this is not the beginning of a new group) followed by an equal sign.
- "([^)]\*)": This is the second group, matching 0 or more characters that are not a closing bracket.
- '\)': This matches the literal closing bracket.
- '(?:\s|\\$)': This is a non-capturing group (indicated by (?:)) which requires that the closing bracket is followed by either a whitespace character (\s) or the end of the string (\\$).


In [6]:
xvalues2 = ['BuranunX', 'SamanX', 'muruX', 'asalX', 'buruX', 'ummuX', 'ugnimX', 'šitaX', 'ušurX']
for x in xvalues2: 
    for i, line in enumerate(bdtns): 
        if x.lower() in line.translate(table).lower(): 
            # move the sign explication (=...) to the X and replace X by ₓ
            bdtns[i] = regex.sub(r'X([^=]*)\(=([^\)]*)\)(?:\s|$)', r'ₓ(\2)\1', line)
            # put pipes around compound signs (sign names that contain . or ×)
            bdtns[i] = regex.sub(r'(?<=ₓ)\(([^\.×\)]*[\.×][^\)]*)\)', r'(|\1|)', bdtns[i])
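The two substitutions can be tried on single lines. This is a sketch using the standard-library `re` module and a made-up function name. The first two lines are invented, using variants of ummuX named in the text above. The third is a line from the data in which the explication lacks the equal sign.

```python
import re

def resolve_ambiguous(line):
    # step 1: move the '(=...)' explication to the capital X and subscript it
    line = re.sub(r'X([^=]*)\(=([^)]*)\)(?:\s|$)', r'ₓ(\2)\1', line)
    # step 2: wrap compound signs (containing '.' or '×') in pipes
    line = re.sub(r'(?<=ₓ)\(([^.×)]*[.×][^)]*)\)', r'(|\1|)', line)
    return line.rstrip()

print(resolve_ambiguous("ummuX (=EDIN.A.SU)"))            # ummuₓ(|EDIN.A.SU|)
print(resolve_ambiguous("ummuX (=EDIN)"))                 # ummuₓ(EDIN)
print(resolve_ambiguous("dumu Lugal-ušurX (LAL2.TUG2)"))  # unchanged: no '(='
```

The last case shows why such a line remains untouched: without "(=" the first pattern never matches.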

# Errors
Inevitably, each of the steps in dealing with x-values may introduce its own errors. It is likely, moreover, that there are more x-values not treated here, or that there will be more x-values in a future version of the [BDTNS](http://bdtns.filol.csic.es) data. The code above can be used to deal with those new x-values in one of the ways explored. 

# Remove Sign Explications
We can now remove the remaining sign explications by splitting each line on the string "(=" and keeping only what precedes it.

In [7]:
bdtns = [line.split("(=")[0] for line in bdtns]

# Make DataFrame

In [8]:
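l = []  # rows of [bdtns_no, line_label, text]
bdtns_no = ""
for line in bdtns: 
    if line[:6].isdigit():  # header line: starts with the six-digit BDTNS text number
        bdtns_no = line[:6]
        continue
    else: 
        # text line: split the line label from the transliteration
        li = line.split("     ")[:2]
        li = [bdtns_no] + li
        l.append(li)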

In [9]:
columns = ["bdtns_no", "line_label", "text"]
df = pd.DataFrame(l, columns=columns).fillna("")
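The row-building loop can be tried on a few lines without pandas. This is a sketch under one assumption: a header line is represented here by the bare six-digit text number, whereas the real export may carry further fields on that line. The function name `to_rows` is made up.

```python
def to_rows(lines):
    rows, bdtns_no = [], ""
    for line in lines:
        if line[:6].isdigit():  # header line: remember the text number
            bdtns_no = line[:6]
            continue
        # text line: line label and transliteration are separated by five spaces
        rows.append([bdtns_no] + line.split("     ")[:2])
    return rows

sample = ["015821",  # assumed header layout
          "o. 4     kišib Lugal-ušurX-ra",
          "s. 1     Lugal-⌈ušurX⌉"]
for row in to_rows(sample):
    print(row)
# ['015821', 'o. 4', 'kišib Lugal-ušurX-ra']
# ['015821', 's. 1', 'Lugal-⌈ušurX⌉']
```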

# Index Numbers in Sign Values
The numbers 0-9 are replaced by the (Unicode) index numbers ₀-₉. In [BDTNS](http://bdtns.filol.csic.es) output the x-index (which is used if a sign value has not yet received a conventional index number) is represented by a capital X. This capital X is replaced by ₓ.

The replacement should only take place if the number (or capital X) is immediately preceded by a letter character (represented by '\p{L}'). In order to take care of double-digit indexes (as in **du₁₁**), the replacement is run a second time, now with the condition that the character to be replaced is immediately preceded by a character in the range ₀-₉.

In [18]:
nos = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'X']
indexes = ['₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉', 'ₓ']
ind_d = dict(zip(nos, indexes))  # '0' -> '₀', ..., 'X' -> 'ₓ'
# first pass: a number (or X) immediately preceded by a letter
for no in nos: 
    df["text"] = df["text"].apply(lambda x: regex.sub(r'(?<=\p{L})' + no, ind_d[no], x))
# second pass: a number immediately preceded by an index number (double-digit indexes)
for no in nos: 
    df["text"] = df["text"].apply(lambda x: regex.sub(r'(?<=[₀-₉])' + no, ind_d[no], x))
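As an alternative to the two passes above, a whole run of digits can be converted at once with `str.translate()`. This is a sketch, not the notebook's method. It uses the standard-library `re` module, where `[^\W\d_]` (a word character that is neither a digit nor an underscore) approximates `\p{L}`. The function name `subscript_indexes` is made up.

```python
import re

SUBSCRIPTS = str.maketrans("0123456789X", "₀₁₂₃₄₅₆₇₈₉ₓ")
# a run of digits, or a single capital X, directly preceded by a letter
INDEX = re.compile(r'(?<=[^\W\d_])(?:[0-9]+|X)')

def subscript_indexes(text):
    return INDEX.sub(lambda m: m.group().translate(SUBSCRIPTS), text)

print(subscript_indexes("[sa2]-⌈du11⌉ {d}En-ki u3"))  # [sa₂]-⌈du₁₁⌉ {d}En-ki u₃
print(subscript_indexes("0.0.1 en-numX-Eš4-tar2"))    # 0.0.1 en-numₓ-Eš₄-tar₂
print(subscript_indexes("KA×X? la2?-la2?"))           # KA×X? la₂?-la₂?
```

Numbers that are not preceded by a letter (such as "0.0.1") and the X in "KA×X" are left alone, because × is not a letter.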

# Check for Remaining x-values

In [13]:
df[df["text"].str.contains("X")]

| | bdtns_no | line_label | text |
|---|---|---|---|
| 1203 | 038652 | r. 29 | [sa2]-⌈du11⌉ {d}En-ki u3 {d}Uš-KA×X?-limmu2 |
| 1701 | 038660 | r. 22' | KA×X? la2?-la2? ⌈ša3⌉ bala-a |
| 3064 | 038727 | o. 7' | 0.0.1 en-numX-Eš4-tar2 |
| 4200 | 038754 | r. 8 | giri3 bad3-HI×X dumu LAGAB-ba-⌈x⌉-gu2 |
| 4380 | 038763 | o. 5 | [...]-{d}Nanna mu / ⌈lugal⌉ PU3 KA×X ⌈x⌉ [...] |
| 13362 | 158017 | o. 2 | še kinX-da |
| 13460 | 158025 | s. 2 | dumu Lugal-ušurX (LAL2.TUG2) |
| 20752 | 166596 | o. 1 | 1zahdaX{zah}-munus giš-gi |
| 22082 | 015821 | o. 4 | kišib Lugal-ušurX-ra |
| 22086 | 015821 | s. 1 | Lugal-⌈ušurX⌉ |
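The check above also returns lines in which X is a sign name inside a compound (KA×X, HI×X) rather than an unresolved index. A narrower test, sketched here with the standard-library `re` module on four of the lines listed, only accepts a capital X that directly follows a letter. The variable names are made up.

```python
import re

remaining = ["KA×X? la2?-la2? ⌈ša3⌉ bala-a",
             "še kinX-da",
             "giri3 bad3-HI×X dumu LAGAB-ba-⌈x⌉-gu2",
             "Lugal-⌈ušurX⌉"]

# capital X directly preceded by a letter; × is not a letter, so KA×X is skipped
unresolved = [t for t in remaining if re.search(r'(?<=[^\W\d_])X', t)]
print(unresolved)
# ['še kinX-da', 'Lugal-⌈ušurX⌉']
```

The same pattern should also work as the argument of `df["text"].str.contains()`, which treats its argument as a regular expression by default.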
