# Data Acquisition BDTNS

Search results in [BDTNS](http://bdtns.filol.csic.es/) can be downloaded with the "Export" button in the left pane of the search results page. Searching for the empty string selects all texts currently available. The export produces two files: one with transliterations and one with metadata. This notebook deals mainly with the transliteration file. Its name is "query_text_" followed by a date and an additional number. Move the file to a convenient location and make sure that the variable `file` holds its name and location.

In [1]:
import pandas as pd
import re

In [2]:
file = 'data/query_text_19_05_15-020951.txt'
with open(file, mode = 'r', encoding = 'utf8') as f:
    bdtns = f.read().splitlines()

Remove empty lines

In [3]:
bdtns = [line for line in bdtns if len(line.strip()) > 0]

In [4]:
bdtns = [line.split("#")[0] for line in bdtns]  # remove editorial remarks - everything after #

# Dealing with x-values
Sign readings that have not (yet) received a commonly accepted index receive the ₓ index, followed by the sign name, as in ziₓ(SIG₇). In BDTNS the subscripted ₓ is represented by a capital X and the sign name is given at the end of the line, as in:

> o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)

This can be resolved in two steps. First, the capital X (when preceded by a letter from the Akkadian or Sumerian alphabet) is replaced by the subscript ₓ. Second, a regular expression matches the ₓ, everything between the ₓ and the opening bracket (Group 1), and everything between the equal sign and the closing bracket (Group 2). The match is then reordered as ₓ(Group 2)Group 1. In detail:

- `ₓ` This matches one subscripted x (Unicode U+2093).
- `([^=]*)` This is a group (indicated by the brackets) that matches 0 or more characters that are not the equal sign.
- `\(=` This matches the literal opening bracket (the preceding backslash is an escape, indicating that this is not the beginning of a new group) followed by an equal sign.
- `([^)]*)` This is the second group, matching 0 or more characters that are not a closing bracket.
- `\)` This matches the literal closing bracket.
- `(?:\s|$)` This is a non-capturing group (indicated by `(?:`) that requires the closing bracket to be followed by either a whitespace character (`\s`) or the end of the string (`$`). It is not a look-ahead: any whitespace it matches is consumed.

In our example above the uppercase letter X has been replaced by a subscripted ₓ, so that the string now reads:

> o. 2     gi ziₓ-a 12 sar-⌈ta⌉ (=SIG7)

In this string

- ₓ matches "ₓ"
- ([^=]*) matches "-a 12 sar-⌈ta⌉ " 
- \(=  matches "(="
- ([^)]*) matches "SIG7"
- \) matches ")"
- (?:\s|$)  matches because this is the end of the string
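
The two steps described above can be tried out on the example line in isolation. The following is a minimal sketch; the names `LETTERS` and `resolve_x_value` are made up for this illustration and are not part of the notebook.

```python
import re

# Letters that may precede the capital X (same set as in the notebook's pattern).
LETTERS = "abdegŋhḫijklmnpqrstuwzšṣṭABDEGŊHḪIJKLMNPQRSTUWZŠṢṬ"

def resolve_x_value(line):
    """Replace capital X by ₓ and move the trailing (=SIGN) next to the ₓ."""
    line = re.sub(rf"([{LETTERS}])X", r"\1ₓ", line)
    # Group 1: text between ₓ and "(="; Group 2: the sign name.
    return re.sub(r"ₓ([^=]*)\(=([^)]*)\)(?:\s|$)", r"ₓ(\2)\1", line)

example = "o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)"
# Group 1 ends in a space, so the result carries a trailing space; strip it for display.
print(resolve_x_value(example).rstrip())
# → o. 2     gi ziₓ(SIG7)-a 12 sar-⌈ta⌉
```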

Since a few lines include more than one x-value, the same substitution may have to be run twice on such lines; this still needs testing (see the TODO list below).

# TODO

- Test running the `re.sub` twice.
- Add documentation for some bad replacements (hirinₓ, nigarₓ, etc.).
- Think about how to deal better with the gurₓ-gurₓ issue.
- Deal with index numbers.

In [5]:
# Raw strings keep the regex backslashes intact and avoid invalid-escape warnings.
bdtns = [re.sub(r'([abdegŋhḫijklmnpqrstuwzšṣṭABDEGŊHḪIJKLMNPQRSTUWZŠṢṬ])X', r'\1ₓ', line) for line in bdtns]
bdtns = [re.sub(r'ₓ([^=]*)\(=([^\)]*)\)(?:\s|$)', r'ₓ(\2)\1', line) for line in bdtns]

# Adding Pipes
If an x-value is a compound consisting of multiple signs, we need pipes to indicate that. The following regular expression (which builds on the pattern above) takes care of this, so that we get gurₓ(|ŠE.KIN|). The pipes bring the transliteration closer to the conventions of the ORACC Global Sign List ([OGSL](http://oracc.org/ogsl)), which makes more complex manipulations of the data possible.

In [7]:
bdtns = [re.sub(r'ₓ\(([^\.×\)]*[\.×][^\)]*)\)', r'ₓ(|\1|)', line) for line in bdtns]
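
As a quick check of the pipe substitution, the sketch below applies the same pattern to one compound and one simple sign name. The function name `add_pipes` and the two sample strings are invented for this illustration.

```python
import re

def add_pipes(line):
    """Wrap a sign name after ₓ in pipes if it contains '.' or '×' (a compound)."""
    return re.sub(r"ₓ\(([^.×)]*[.×][^)]*)\)", r"ₓ(|\1|)", line)

print(add_pipes("gurₓ(ŠE.KIN)-a"))  # compound: pipes are added
# → gurₓ(|ŠE.KIN|)-a
print(add_pipes("ziₓ(SIG7)-a"))     # simple sign: left unchanged
# → ziₓ(SIG7)-a
```

Note that the substitution is not idempotent: a sign name that already has pipes still contains a period and would receive a second pair, so the step should be run only once.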

# Other Cleaning
The code above fails in a number of cases, for various reasons. Some of these cases are easily repaired; others are more complex.

1. gurₓ-gurₓ
> o. 2     gurX-gurX-<de3> (=ŠE.KIN.ŠE.KIN)

This line will be rendered as
> o. 2     gurₓ(|ŠE.KIN.ŠE.KIN|)-gurₓ-<de3>

which should be

> o. 2     gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-<de3>

Since this is a very common expression, it makes sense to address this specifically.

2. nigarₓ, nemurₓ(|PIRIG.TUR|), and nagₓ(GAZ)
These sign values have received index numbers in OGSL, namely nigar (that is, nigar₁), nemur₂, and nag₃. The value nigarₓ is usually not explained (no sign name is given for it).

In [11]:
bdtns = [line.replace("gurₓ(|ŠE.KIN.ŠE.KIN|)-gurₓ", "gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)") for line in bdtns]
ogsl_repl = {"nigarₓ": "nigar", "nemurₓ": "nemur₂", "nagₓ": "nag₃"}
# Optional sign name in brackets after the value, e.g. "(GAZ)" or "(|PIRIG.TUR|)".
brackets = r"\(?[^-){ ]*\)?"
for sign, replacement in ogsl_repl.items():
    bdtns = [re.sub(sign + brackets, replacement, line) for line in bdtns]
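
The effect of these two repairs can be checked on single strings. This is only a sketch: the wrapper `apply_repairs` and the short sample strings are made up for illustration, while the replacement table and the bracket pattern are those of the cell above.

```python
import re

ogsl_repl = {"nigarₓ": "nigar", "nemurₓ": "nemur₂", "nagₓ": "nag₃"}
brackets = r"\(?[^-){ ]*\)?"  # optional "(SIGN)" directly after the value

def apply_repairs(line):
    """Split the doubled gurₓ sign name, then replace x-values that OGSL has indexed."""
    line = line.replace("gurₓ(|ŠE.KIN.ŠE.KIN|)-gurₓ", "gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)")
    for sign, replacement in ogsl_repl.items():
        line = re.sub(sign + brackets, replacement, line)
    return line

print(apply_repairs("gurₓ(|ŠE.KIN.ŠE.KIN|)-gurₓ-<de3>"))
# → gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-<de3>
print(apply_repairs("nagₓ(GAZ)-a"))
# → nag₃-a
```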

# Index numbers
In BDTNS all sign index numbers are represented by regular (non-subscript) digits. A number that is immediately preceded by a letter that is valid in Sumerian or Akkadian is replaced by the corresponding subscript index number.

In [None]:
numbers_index = {
    '10': '₁₀', '11': '₁₁', '12': '₁₂', '13': '₁₃', '14': '₁₄',
    '15': '₁₅', '16': '₁₆', '17': '₁₇', '18': '₁₈', '19': '₁₉',
    '20': '₂₀', '21': '₂₁', '22': '₂₂', '23': '₂₃', '24': '₂₄',
    '25': '₂₅', '26': '₂₆', '27': '₂₇', '28': '₂₈', '29': '₂₉',
    '30': '₃₀', '31': '₃₁', '32': '₃₂', '33': '₃₃', '34': '₃₄',
    '35': '₃₅', '36': '₃₆', '37': '₃₇', '38': '₃₈', '39': '₃₉',
    '0': '₀', '1': '₁', '2': '₂', '3': '₃', '4': '₄',
    '5': '₅', '6': '₆', '7': '₇', '8': '₈', '9': '₉',
}
# Two-digit keys come first so that, e.g., 30 is evaluated before 3.
# Plain dicts keep insertion order from Python 3.7 on; for older versions this should be
# a list of tuples made into an ordered dictionary (collections.OrderedDict).
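
The notebook does not yet apply this replacement to the data. One possible way to do it, sketched here as an assumption rather than as the notebook's method, is to convert every run of digits that directly follows a letter in one pass with `str.translate`; this sidesteps the question of evaluation order and does not use the dictionary. The names `LETTERS`, `TO_SUBSCRIPT`, and `index_numbers` are hypothetical.

```python
import re

# Letters valid in Sumerian or Akkadian transliteration (same set as in the X pattern).
LETTERS = "abdegŋhḫijklmnpqrstuwzšṣṭABDEGŊHḪIJKLMNPQRSTUWZŠṢṬ"
TO_SUBSCRIPT = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def index_numbers(text):
    """Turn digits that directly follow a letter into subscript index numbers."""
    return re.sub(rf"(?<=[{LETTERS}])\d+",
                  lambda m: m.group(0).translate(TO_SUBSCRIPT),
                  text)

print(index_numbers("5 sila3 kaš 3 sila3 zi3"))
# → 5 sila₃ kaš 3 sila₃ zi₃
```

Numbers that stand on their own, such as quantities, are not preceded by a letter and are therefore left as they are.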

In [13]:
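l = []  # rows of [bdtns_no, line_label, text]
bdtns_no = ""
for line in bdtns:
    if line[:6].isdigit():
        # A line starting with six digits opens a new text; remember its BDTNS number.
        bdtns_no = line[:6]
        continue
    # Line label and text are separated by five spaces.
    li = line.split("     ")[:2]
    l.append([bdtns_no] + li)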

In [14]:
columns = ["bdtns_no", "line_label", "text"]
df = pd.DataFrame(l, columns=columns).fillna("")

In [15]:
df

Unnamed: 0,bdtns_no,line_label,text
0,021035,o. 1,5 sila3 kaš 3 sila3 zi3
1,021035,o. 2,1 i3 a2-GAM
2,021035,o. 3,Lu2-Ma2-gan-na lu2-{giš}tukul-gu-<la>
3,021035,o. 4,0.0.1 kaš 5 sila3 zi3
4,021035,o. 5,1 i3 a2-GAM
5,021035,o. 6,da-da sukkal ša3 giš-/kin-ti-da gen-na
6,021035,o. 8,3 sila3 kaš 2 sila3 zi3
7,021035,o. 9,1 i3 a2-GAM
8,021035,o. 10,En-u2-mi-i3-li2
9,021035,o. 11,ma2 giš-še3 gen-na


In [16]:
pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is no longer accepted by recent pandas
df.loc[df["text"].str.contains("ₓ.*ₓ")]

Unnamed: 0,bdtns_no,line_label,text
5626,158584,o. 6,3.0.0 gur še gurₓ(|ŠE.KIN.ŠE.K[IN]|)-gu[rₓ-a]
5628,158584,o. 8,16.0.0 gur še gurₓ(|ŠE.KIN.[ŠE.KIN]|)-[gurₓ-a]
5633,158584,o. 13,28.0.0 gur še gu[rₓ(|ŠE.KI[N.ŠE.KIN]|)-gurₓ-a]
13976,158061,o. 2,gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-<de3>
14945,158140,o. 1,66.2.0 ga[n2] gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-[dam/de3]
14952,158140,r. 4,še-bi gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-d[am]
108633,030236,o. 2,še gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-⌈dam⌉
108635,030236,o. 4,gurₓ(|ŠE.KIN|)-gurₓ(|ŠE.KIN|)-dam
122438,035009,o. 17,[360] ⌈sar⌉ {u2}hirinₓ(SIG7)-na ziₓ 12 sar-ta
122454,035010,o. 1,600 sar {u2}hirinₓ(SIG7)-[na] / ziₓ-a 10 ⌈sar⌉-[ta]

