# 2.4 Data Acquisition BDTNS

The goal of this notebook is to transform [BDTNS](http://bdtns.filol.csic.es/) data into a structured format that clearly distinguishes between text and non-text (such as line numbers) and that, for the text part, follows as much as possible the standards of the Oracc Global Sign List ([OGSL](http://oracc.org/ogsl)). Adherence to the [OGSL](http://oracc.org/ogsl) standard makes it possible to transform a line of text into a sequence of sign names or Unicode codepoints. This can be used to track transliteration inconsistencies, for instance in the rendering of names (U₂-da-mi-ša-ru-um vs. U₂-ta₂-mi-ša-ru-um), to build a search engine that finds a sign sequence regardless of the actual transliteration, or to compare editions of the same text in [BDTNS](http://bdtns.filol.csic.es/), [CDLI](http://cdli.ucla.edu), and [ePSD2](http://oracc.org/epsd2/admin/u3adm).

The search engine will be built in a separate notebook - primarily as a case study of the potential of this approach.

On the [BDTNS](http://bdtns.filol.csic.es/) website, search results can be downloaded with the "Export" button in the left pane. Searching for the empty string will select all texts currently available. The export creates two files: one with transliterations and one with meta-data. The transliteration file is discussed here most extensively. The file name is "query_text_" followed by a date and an additional number. Move the file to the `data` directory of this chapter and make sure that the variable `file` corresponds to your file name and file location.

This notebook uses regular expressions through the standard `re` module. The third-party `regex` module is an alternative with more extensive support for Unicode, in particular in classifying characters as letters ('a', 'b', 'š', 'ṭ') versus non-letters ('\[', '6', or '~'). If you prefer it, the `regex` module needs to be installed before it can be imported (for installing modules, see Chapter 1).

This notebook also uses the `pandas` integration of the `tqdm` package to show a progress bar. This is important because the transformation of (currently) more than 1.1 million lines of text in [BDTNS](http://bdtns.filol.csic.es) takes some time. The pandas/tqdm integration is initialized with `tqdm_notebook.pandas()`, immediately after importing `tqdm_notebook`. This enables the functions `progress_apply()` and `progress_map()`, which can be used instead of the standard `apply()` and `map()` functions in `pandas`.

In [1]:
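import os
import re
import sys

import pandas as pd
from tqdm.notebook import tqdm_notebook  # public import path; tqdm._tqdm_notebook is deprecated

tqdm_notebook.pandas()  # enable progress_apply() and progress_map()

# make the shared utils module (helper functions such as make_dirs) importable
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *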

# Create Directories

In [2]:
directories = ['output']
make_dirs(directories)

# Read the File as a List
The [BDTNS](http://bdtns.filol.csic.es/) export file is read as a list (each line is one element in the list) with the `splitlines()` function. 

[BDTNS](http://bdtns.filol.csic.es/) uses so-called "vertical tabs" (represented by `^K`, `\v`, or `\x0b`, depending on your editor) as a new line within a single text, whereas the standard newline (`\n`) character is used to separate one text from the next. The Python `readlines()` function (which reads a file line-by-line, producing a list) keeps the vertical tabs and will result in a format where each document is represented as a single line. Instead, therefore, we read the entire file with the `read()` function and then split the document into lines with `splitlines()`. The function `splitlines()` takes the vertical tab (as well as the regular newline character) as a line separator; this will result in a line-by-line representation of the document in the form of a list (called `bdtns`).

In [3]:
file = 'query_text_19_05_15-020951.txt'
with open('data/' + file, mode = 'r', encoding = 'utf8') as f:
    bdtns = f.read().splitlines()
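
The difference between `readlines()` and `splitlines()` can be checked on a small string. The sample below is a hypothetical miniature of the export file (its exact layout is an assumption), with vertical tabs inside a document and a newline between documents; it is only a sketch of the behavior described above.

```python
import io

# Hypothetical miniature export: "\x0b" (vertical tab) separates lines inside
# a document, "\n" ends the document. The layout is an assumption.
raw = ("021035\tAAA 1 78 24\t\x0b     \x0bo. 1     5 sila3 kaš"
       "\x0bo. 2     1 i3 a2-GAM\n197624\tsecond text\t\n")

# readlines() breaks on "\n" only: each document stays on a single line.
by_readlines = io.StringIO(raw).readlines()

# splitlines() also treats the vertical tab as a line boundary.
by_splitlines = raw.splitlines()

print(len(by_readlines))    # 2
print(len(by_splitlines))   # 5
```

The whitespace-only element produced by `splitlines()` corresponds to the kind of line that is removed in the next step.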

# Inspect the Result

In [4]:
bdtns[:25]

["197624\tÀ l'école des scribes, p. 104\t",
 '021035\tAAA 1 78 24 = MVN 05 249 = CDLI P114469\t',
 '     ',
 'o. 1     5 sila3 kaš 3 sila3 zi3',
 'o. 2     1 i3 a2-GAM',
 'o. 3     Lu2-Ma2-gan-na lu2-{giš}tukul-gu-<la>',
 'o. 4     0.0.1 kaš 5 sila3 zi3',
 'o. 5     1 i3 a2-GAM',
 'o. 6     da-da sukkal ša3 giš-/kin-ti-da gen-na',
 'o. 8     3 sila3 kaš 2 sila3 zi3',
 'o. 9     1 i3 a2-GAM',
 'o. 10     En-u2-mi-i3-li2',
 'o. 11     ma2 giš-še3 gen-na',
 'o. 12     3 sila3 kaš 2 sila3 zi3',
 'r. 13     1 i3 a2-GAM',
 'r. 14     Ar-ši-ah lu2-kas4',
 'r. 15     0.0.2 5 sila3 kaš sig5 lugal',
 'r. 16     0.1.4 kaš gen',
 'r. 17     0.0.3 zi3-gu',
 'r. 18     dam ensi2 Šušin{ki}',
 'r. 19     Lu2-{d}Nin-gir2-su šu-i maškim?',
 'r. 20     3 sila3 kaš 2 sila3 zi3',
 'r. 21     1 i3 a2-GAM',
 'r. 22     Hu-nu-NE-a',
 'r. 23     dam ensi2 Šušin{ki}-še3 gen-na']

# Remove empty lines
Empty lines (or lines filled with spaces or tabs only) will cause trouble downstream and are removed.

In [5]:
bdtns = [line for line in bdtns if len(line.strip()) > 0]

# Make DataFrame
The code looks for lines that begin with 6 digits (as in '145342'). Those numbers are the [BDTNS](http://bdtns.filol.csic.es/) text ID numbers and mark the beginning of a new document. If such a line is found, the number is stored in the variable `id_text`. Other data available in such lines (such as publication and [CDLI](http://cdli.ucla.edu) P-number) are omitted, because they are better derived from the meta-data file (with the [BDTNS](http://bdtns.filol.csic.es) number as key).

If the line does not start with 6 digits, it is a transliteration line. The line is split to separate line number, transliteration text, and various types of comments. The variables `id_text` (the [BDTNS](http://bdtns.filol.csic.es) number) and `id_line` are added to each line as a separate field. The field `id_line` is a running number (an integer) that starts at 1 for each text. In the current notebook the variable is not used, but it can be used to keep or restore the lines of a text in proper order. The variable `line_label` is the human-readable line number in the format 'r ii 7' (for reverse column 2 line 7).

In the [BDTNS](http://bdtns.filol.csic.es/) output, line numbers are separated from transliteration text by 5 spaces. Editorial remarks are introduced by the hash mark (#). Finally, sign names (qualifying x-values, rare signs, and, occasionally, readings with exclamation mark) are marked by "(=" (as in '(=SIG7)'). The five spaces are replaced by '#' and '(=' is replaced by '#(=' so that we can split each line on the hash mark. Each `replace()` affects only the first occurrence, and `split()` splits at most twice, so that we end up with three columns, representing line label, text, and comments (to which the `id_text` and `id_line` are added). The column `comments` will include both editorial comments and sign explications of x-values.
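
To make the replace-and-split step concrete, here is a minimal sketch applied to a single line. The function name `split_line` is hypothetical and used for illustration only; the sample line is the ziX example discussed later in this notebook.

```python
def split_line(line):
    """Split one transliteration line into [line_label, text, comments]."""
    # Mark the first "(=" and the first run of five spaces with "#".
    marked = line.strip().replace("(=", "#(=", 1).replace("     ", "#", 1)
    # Split on "#" at most twice, giving at most three fields.
    return marked.split("#", 2)

fields = split_line("o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)")
print(fields)
```

Note that the text field keeps the space that preceded "(=", so it may end in trailing whitespace. A line without a sign name or editorial remark yields only two fields.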

In [6]:
l = []
id_line = 0
id_text = ""
for line in tqdm_notebook(bdtns): 
    if line[:6].isdigit(): 
        id_text = line[:6]
        id_line = 0
        continue
    else: 
        li = line.strip()
        li = li.replace("(=", "#(=", 1).replace('     ', '#', 1)
        id_line += 1
        li_l = li.split('#', 2)
        li_l = [id_text, id_line] + li_l
        l.append(li_l)

In [7]:
columns = ["id_text", "id_line", "line_label", "text", "comments"]
df = pd.DataFrame(l, columns=columns).fillna("")

# Inspect the Result

In [8]:
df

| | id_text | id_line | line_label | text | comments |
|---|---|---|---|---|---|
| 0 | 021035 | 1 | o. 1 | `5 sila3 kaš 3 sila3 zi3` | |
| 1 | 021035 | 2 | o. 2 | `1 i3 a2-GAM` | |
| 2 | 021035 | 3 | o. 3 | `Lu2-Ma2-gan-na lu2-{giš}tukul-gu-<la>` | |
| 3 | 021035 | 4 | o. 4 | `0.0.1 kaš 5 sila3 zi3` | |
| 4 | 021035 | 5 | o. 5 | `1 i3 a2-GAM` | |
| 5 | 021035 | 6 | o. 6 | `da-da sukkal ša3 giš-/kin-ti-da gen-na` | |
| 6 | 021035 | 7 | o. 8 | `3 sila3 kaš 2 sila3 zi3` | |
| 7 | 021035 | 8 | o. 9 | `1 i3 a2-GAM` | |
| 8 | 021035 | 9 | o. 10 | `En-u2-mi-i3-li2` | |
| 9 | 021035 | 10 | o. 11 | `ma2 giš-še3 gen-na` | |


# Make OGSL compliant
[OGSL](http://oracc.org/ogsl) is the ORACC Global Sign List, which lists for each sign its possible readings. [OGSL](http://oracc.org/ogsl) compliance opens the possibility to search or compare by sign *name* rather than sign value. For instance, one may search for the sequence "aga₃ kug-sig₁₇" (golden tiara) and find a line reading "gin₂ ku₃-GI".

The main steps towards [OGSL](http://oracc.org/ogsl) compliance are: 

- add sign names to x-values
- replace regular numbers with index numbers in sign values

# Dealing with x-values
In [OGSL](http://oracc.org/ogsl), a sign reading that does not (yet) have a commonly accepted index number receives the ₓ index, followed by the sign name, as in ziₓ(SIG₇). In the [BDTNS](http://bdtns.filol.csic.es) export file the subscripted ₓ is represented by a capital X and the sign name is given at the end of the line, as in

> o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)

In the DataFrame, the sign names are now found in the column `comments`. The most straightforward solution is to replace every capital X with an ₓ plus what is found in the comments column (minus the equal sign).

In our example, this would result in: 
> o. 2     gi ziₓ(SIG₇)-a 12 sar-⌈ta⌉

In practice, however, there are quite a few exceptions to the pattern described above. Examples include: 

> 18 gin2 nagga mu-kuX gibil (=AN.NA) (=DU)

Here (=AN.NA) explains the infrequent sign "nagga". However, since "nagga" is not followed by X the script will return

> 18 gin₂ nagga mu-kuₓ(AN.NA) gibil (=DU)

Another exception is reduplicated "gurₓ-gurₓ" which is represented thus: 
> 6.0.0 še ur5-ra še gurX-gurX-ta su-ga (=ŠE.KIN.ŠE.KIN)

resulting in: 
> 6.0.0 še ur₅-ra še gurₓ(ŠE.KIN.ŠE.KIN)-gurₓ-ta su-ga

Finally, a few x-values are usually not resolved this way, for instance: 
> 1 sila4 Ur-nigarX{gar}

with no sign name for nigarX - presumably because it is unambiguous.

For these reasons we will approach the x-values in two different ways:

- x-values that resolve unambiguously are resolved with a search and replace, using a dictionary - replacing, e.g., ziₓ with ziₓ(IGI@g). This process does not pay attention to the [BDTNS](http://bdtns.filol.csic.es) sign explication (=SIG7). A special case in this category is mu-kuX ("delivery"), which is very frequent and should be resolved as mu-kuₓ(DU). However, kuX by itself may also be resolved as kuₓ(LIL) or kuₓ(KWU147) (for the verb "to enter").
- x-values that do not resolve unambiguously (muruₓ, ušurₓ, ummuₓ, and several others) are resolved by moving the [BDTNS](http://bdtns.filol.csic.es) sign name (in the `comments` column) after the X sign between brackets, as discussed above.

Both these steps are included in a single function (`ogsl_v()`) that is applied to every row of the DataFrame. In addition, this function will replace index numbers (such as the 7 in sig7) with Unicode index numbers (sig₇).

# Step 1: Unambiguous x-values

Some x-values are always resolved in the same way. Thus, ziX is always ziₓ(IGI@g), hirinX is always hirinₓ(KWU318), and gurX is always gurₓ(|ŠE.KIN|). In some cases, x-values have been assigned an index number in [OGSL](http://oracc.org/ogsl). In those cases (nigarₓ = nigar; nemurₓ(PIRIG.TUR) = nemur₂; nagₓ(GAZ) = nag₃; and pešₓ(ŠU.PEŠ5) = peš₁₄) the appropriate index number is added and the sign name is ignored.

A dictionary of such unambiguous x-values (`xvalues`) has as its key the x-value as represented in [BDTNS](http://bdtns.filol.csic.es) in lower case ('zix') and as its value the index ₓ plus the appropriate sign name ('ₓ(IGI@g)') in [OGSL](http://oracc.org/ogsl) format.

The substitution is done with a somewhat complex [regular expression](https://www.regular-expressions.info/) that looks as follows:

```python
row['text'] = re.sub(xv, lambda m: m.group()[:-1] + xvalues.get(m.group().translate(table).lower(), 'X'), row['text'])
```
The `sub()` function of the `re` library has the general form `re.sub(search_pattern, replace, string)`. Instead of a replace string, one may also give a function (in this case an anonymous `lambda` function) that returns the `replace` string. In this case the lambda function accesses the dictionary `xvalues` to see if the match that was found in the search pattern is present among the keys. The basic format of that command is `xvalues.get(m.group())`, where `m.group()` represents the current match of the search pattern. The search pattern, `xv` (to be explained in more detail below), may match `zahX`, `NigarX`, or `[bu]lugX` - in other words, the match may include capitals (as in `NigarX`) or brackets and flags (as in `[bu]lugX`). In order to find that match in the dictionary it is lowercased and "translated" by means of a table. The function `translate()` translates individual characters into other characters, according to a translation pattern in a table. In this case, the characters representing flags and brackets are all translated to `None`, which means, in practice, that they are removed. The matches `zahX`, `NigarX`, and `[bu]lugX`, therefore, will be looked up in the dictionary as `zahx`, `nigarx`, and `bulugx` - and each of those is indeed a key in `xvalues`. The `get()` function optionally takes a fall-back value, which is returned when the key is not found; in this case the fall-back is 'X'.

If a match is found, say `[bu]lugX`, the key `bulugx` is found in the dictionary `xvalues`, returning `ₓ(|ŠIM×KUŠU₂|)`. The return value of the lambda function is the search match (`[bu]lugX`) minus the last character (`[bu]lug`) plus the value that was returned from the dictionary, resulting in `[bu]lugₓ(|ŠIM×KUŠU₂|)`. The brackets are removed only for the dictionary lookup; they are preserved in the text. If the search pattern returns a match that is not found in the dictionary (for instance `ušurX`), the return value of the lambda function is, again, the search match (`ušurX`), minus the last character (`ušur`) plus 'X', the fallback return of the `get()` function, resulting in `ušurX`. In other words - in those cases the search match is replaced by itself and nothing changes.

The search pattern is a compiled regex (compiled expressions are faster than expressions that need to be interpreted every time), `xv`, which is defined as
```python
xv = re.compile(r'[\w' + re.escape(flags) + ']+X')
```
This matches any sequence of one or more (`+`) word-characters (`\w`; this includes regular letters as well as š, ṣ, and ṭ, the digits 0-9 and the underscore) and/or flags (such as square brackets etc.), followed by a capital X. This regex will match `ziX`, `zahX`, or `ušurX`, but also `[za]hX`, etc. It does not match KA×X or simply X, because the character × (for 'times') and the space are neither word characters nor flags.
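
The lookup described above can be tried out on a few strings. The sketch below uses only a three-entry excerpt of the `xvalues` dictionary, and the wrapper function `resolve` is a hypothetical name introduced for illustration.

```python
import re

flags = "][!?<>⸢⸣⌈⌉*/"
table = str.maketrans(dict.fromkeys(flags))   # maps every flag to None
xv = re.compile(r'[\w' + re.escape(flags) + r']+X')

# Excerpt of the dictionary of unambiguous x-values.
xvalues = {'zix': 'ₓ(IGI@g)', 'bulugx': 'ₓ(|ŠIM×KUŠU₂|)', 'nigarx': ''}

def resolve(text):
    # Strip the final X from the match and append the dictionary value;
    # fall back to 'X' (no change) when the key is unknown.
    return xv.sub(
        lambda m: m.group()[:-1]
        + xvalues.get(m.group().translate(table).lower(), 'X'),
        text)

print(resolve('gi ziX-a 12 sar-⌈ta⌉'))      # gi ziₓ(IGI@g)-a 12 sar-⌈ta⌉
print(resolve('[bu]lugX'))                   # [bu]lugₓ(|ŠIM×KUŠU₂|)
print(resolve('1 sila4 Ur-nigarX{gar}'))     # 1 sila4 Ur-nigar{gar}
print(resolve('ušurX KA×X'))                 # ušurX KA×X
```

The last call shows both kinds of non-replacement: `ušurX` matches the pattern but is not a key in the dictionary, while `KA×X` does not match the pattern at all.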

Special case: **mu-kuX**. There are multiple possible solutions for **kuX**, including kuₓ(LIL) or kuₓ(KWU147), but the very frequent form **mu-kuX** is always to be resolved **mu-kuₓ(DU)**. The regular expression `xv` in the preceding does not allow hyphens and thus it will never find the key `mu-kuX` in the dictionary `xvalues`. However, this expression (meaning 'delivery') is so frequent that it makes sense to deal with it separately, rather than depend on the sign names in the `comments`. The expression **mu-kuX** therefore, has its own line in the function.
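
The following sketch illustrates why **mu-kuX** needs separate treatment: `xv` stops at the hyphen and finds only `kuX`. It also shows one possible way to target the X of mu-kuX alone, using a lookbehind. The pattern name `mu_ku_x` and the sample line are assumptions made for this illustration, and the simplified pattern ignores flags or brackets inside mu-kuX.

```python
import re

flags = "][!?<>⸢⸣⌈⌉*/"
xv = re.compile(r'[\w' + re.escape(flags) + r']+X')

# Hypothetical pattern: a capital X immediately preceded by "mu-ku".
mu_ku_x = re.compile(r'(?<=mu-ku)X')

line = '1 udu KA×X mu-kuX lugal'   # invented sample line

# The hyphen is neither a word character nor a flag, so xv sees only "kuX".
found = xv.findall(line)
print(found)      # ['kuX']

# Only the X that belongs to mu-kuX is replaced; KA×X is left alone.
resolved = mu_ku_x.sub('ₓ(DU)', line)
print(resolved)   # 1 udu KA×X mu-kuₓ(DU) lugal
```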

# Step 2: Remaining x-values
For the remaining x-values (many of them ambiguous) we will copy the [BDTNS](http://bdtns.filol.csic.es) sign name, found in the `comments` column, to the x-value. For instance, **ummu₃** is |A.EDIN.LAL|, but the sign complex has many variants, all rendered **ummuX**: EDIN.A.SU, A.EDIN, A.EDIN.A.LAL, EDIN, etc. Because the sign name is copied as it stands, the code will result in ummuₓ(|EDIN.A.SU|), ummuₓ(|A.EDIN|), ummuₓ(|A.EDIN.A.LAL|), ummuₓ(EDIN), etc. Compound signs are put between pipes (|A.EDIN|), according to [OGSL](http://oracc.org/ogsl) conventions.

In this step the code will naively replace the capital X by the index ₓ, followed by the first word in the `comments` column. This will result in errors if there are more such x-values in a single line - but because we have already dealt with many such values in the preceding, that risk is not very high. The code will test that the capital X does in fact follow a sign reading (as in ziX), and is not an illegible sign (as in KA×X, or simply X). This is done with a [regular expression](https://www.regular-expressions.info/) using a so-called "positive lookbehind" (?<=), to see if the preceding character is a letter. The regular expression for a capital 'X' preceded by any letter valid in Sumerian or Akkadian, is compiled in the variable `lettersX` in order to speed up the process.
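
A minimal sketch of this step, taking the text and the comments as two plain strings. The function name `attach_sign_name` is hypothetical; the letter class is the one used in this notebook.

```python
import re

letters = r'[a-zŋṣšṭA-ZŊṢŠṬ]'
# Capital X, but only when the preceding character is a letter.
lettersX = re.compile(r'(?<=' + letters + r')X')

def attach_sign_name(text, comments):
    if not comments.startswith('(='):
        return text                              # no sign name available
    sign_name = comments[2:].split(')')[0]       # '(=SIG7) ...' -> 'SIG7'
    if re.search(r'[.×+]', sign_name):           # compound sign: add pipes
        sign_name = '|' + sign_name + '|'
    replacement = 'ₓ(' + sign_name + ')'
    return lettersX.sub(lambda m: replacement, text)

print(attach_sign_name('gi ziX-a 12 sar-⌈ta⌉', '(=SIG7)'))  # gi ziₓ(SIG7)-a 12 sar-⌈ta⌉
print(attach_sign_name('ummuX', '(=A.EDIN)'))                # ummuₓ(|A.EDIN|)
print(attach_sign_name('KA×X', '(=A.EDIN)'))                 # KA×X
```

In the last call the X follows ×, which is not a letter, so the lookbehind fails and the illegible sign is left untouched. The index number in SIG7 is still a regular digit at this point; it is converted in the next step.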

# Step 3: Index Numbers
In a third step all sign reading index numbers (as in 'du11') are replaced by Unicode index numbers ('du₁₁'). Regular numbers that express quantities should not be affected. This is done, again, with a look-behind regular expression that identifies the character immediately before the number as a letter. If such a match is found, the number is replaced by the corresponding index number, looked up in the dictionary `ind_d`. This uses essentially the same technique as described above for unambiguous x-values.
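
The effect of the lookbehind can be sketched as follows. Note that this illustration deviates from the notebook's own code: instead of the dictionary `ind_d` it uses a character translation table (the hypothetical name `subscripts`), so it converts any run of digits after a letter and does not handle a remaining capital X.

```python
import re

letters = r'[a-zŋṣšṭA-ZŊṢŠṬ]'
# One or more digits, but only when directly preceded by a letter.
index_no = re.compile(r'(?<=' + letters + r')\d+')

# Assumption for this sketch: a translation table instead of ind_d.
subscripts = str.maketrans('0123456789', '₀₁₂₃₄₅₆₇₈₉')

def index_numbers(text):
    return index_no.sub(lambda m: m.group().translate(subscripts), text)

print(index_numbers('5 sila3 kaš 3 sila3 zi3'))   # 5 sila₃ kaš 3 sila₃ zi₃
print(index_numbers('du11-ga 0.0.1 kaš'))          # du₁₁-ga 0.0.1 kaš
```

Quantities such as 5, 3, and 0.0.1 are preceded by a space, a period, or the start of the line, so they remain regular numbers.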

# Errors
Inevitably, each of the steps in dealing with x-values may introduce its own errors. It is likely, moreover, that there are more x-values not treated here, or that there will be more x-values in a future version of the [BDTNS](http://bdtns.filol.csic.es) data. The dictionary of x-values below can be adapted to deal with those situations. 

In [9]:
xvalues = {'nagx' : '₃', 'nigarx' : '', 'nemurx' : '₂', 'pešx' : '₁₄', 'urubx' : '', 
        'tubax' : '₄', 'niginx' : '₈', 'šux' : '₁₄', 
        'alx' : 'ₓ(|NUN.LAGAR|)' , 'bulugx' : 'ₓ(|ŠIM×KUŠU₂|)', 'dagx' : 'ₓ(KWU844)', 
        'durux' : 'ₓ(|IGI.DIB|)', 'durunx' : 'ₓ(|KU.KU|)', 
        'gigirx' : 'ₓ(|LAGAB×MU|)', 'giparx' : 'ₓ(KISAL)', 'girx' : 'ₓ(GI)', 
        'gišbunx' : 'ₓ(|KI.BI|)', 'gurx' : 'ₓ(|ŠE.KIN|)', 
        'hirinx' : 'ₓ(KWU318)', 'kurunx' : 'ₓ(|DIN.BI|)',
        'mu-kux' : 'ₓ(DU)', 'munsubx' : 'ₓ(|PA.GU₂×NUN|)',  
        'sagx' : 'ₓ(|ŠE.KIN|)', 'subx' : 'ₓ(|DU.DU|)', 
        'sullimx' : 'ₓ(EN)', 'šaganx' : 'ₓ(|GA×AN.GAN|)', 
        'ulušinx' : 'ₓ(|BI.ZIZ₂|)', 'zabalamx' : 'ₓ(|MUŠ₃.TE.AB@g|)', 
        'zahx' : 'ₓ(ŠEŠ)', 'zahdax' : 'ₓ(|DUN.NE.TUR|)',  
        'zix' : 'ₓ(IGI@g)'}

In [10]:
flags = "][!?<>⸢⸣⌈⌉*/"
table = str.maketrans(dict.fromkeys(flags))
xv = re.compile(r'[\w' + re.escape(flags) + ']+X')

In [11]:
nos = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
       '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 
       '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', 
       '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', 'X']
indexes = ['₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉',
           '₁₀', '₁₁', '₁₂', '₁₃', '₁₄', '₁₅', '₁₆', '₁₇', '₁₈', '₁₉',
           '₂₀', '₂₁', '₂₂', '₂₃', '₂₄', '₂₅', '₂₆', '₂₇', '₂₈', '₂₉',
           '₃₀', '₃₁', '₃₂', '₃₃', '₃₄', '₃₅', '₃₆', '₃₇', '₃₈', '₃₉', 'ₓ']
ind_d = dict(zip(nos, indexes))

In [12]:
letters = r'[a-zŋṣšṭA-ZŊṢŠṬ]'
lettersX = re.compile(r'(?<=' + letters + r')X') # capital X preceded by a letter
lettersNo = re.compile(r'(?<=' + letters + r')(\d+|X)') # any sequence of digits, or X, preceded by a letter

In [13]:
# mu-kuX, allowing flags or brackets between its characters
fl = '[' + re.escape(flags) + ']*'
mu_kux = re.compile('(' + fl.join('mu-ku') + fl + ')X')

def ogsl_v(row):
    # 1. deal with unambiguous x-values, listed in the dictionary xvalues.
    text = re.sub(xv, lambda m: m.group()[:-1] + xvalues.get(m.group().translate(table).lower(), 'X'), row['text'])
    # special case mu-kuX: replace only the X that belongs to mu-kuX,
    # not every X in the line (such as the X in KA×X)
    text = mu_kux.sub(lambda m: m.group(1) + xvalues['mu-kux'], text)
    # 2. deal with remaining x-values
    if row["comments"][:2] == "(=": 
        sign_name = row["comments"][2:]  # remove (=  from (=SIG7)
        sign_name = sign_name.split(')')[0] # remove ) and anything following
        if re.search(r'\.|×|\+', sign_name): # if sign_name contains either . or × or +
            sign_name = 'ₓ(|' + sign_name + '|)'  # add pipes if necessary
        else: 
            sign_name = 'ₓ(' + sign_name + ')'
        ogsl_valid = re.sub(lettersX, sign_name, text)
    else:
        ogsl_valid = text
    # 3. deal with index numbers
    ogsl_valid = re.sub(lettersNo, lambda m: ind_d.get(m.group(), m.group()), ogsl_valid)
    return ogsl_valid

# Apply the Function
The `ogsl_v` function is now applied to each row (`axis = 1`) in the DataFrame. The DataFrame currently has more than 1.1 million rows (lines) and the function may take a few minutes. For that reason a progress bar has been added. The progress bar is part of the `tqdm` library. It is initialized with the line
```python
tqdm_notebook.pandas()
```
Instead of the regular `apply()` function from the `pandas` library we may now use `progress_apply()` to do the same thing as `apply()`, but with a progress bar.

In [14]:
tqdm_notebook.pandas(desc="Progress")
df["text"] = df.progress_apply(ogsl_v, axis = 1) 

# Check for Remaining x-values

In [15]:
df[df.text.str.contains('X')]

| | id_text | id_line | line_label | text | comments |
|---|---|---|---|---|---|
| 1203 | 038652 | 31 | r. 29 | `[sa₂]-⌈du₁₁⌉ {d}En-ki u₃ {d}Uš-KA×X?-limmu₂` | |
| 1701 | 038660 | 25 | r. 22' | `KA×X? la₂?-la₂? ⌈ša₃⌉ bala-a` | |
| 4200 | 038754 | 8 | r. 8 | `giri₃ bad₃-HI×X dumu LAGAB-ba-⌈x⌉-gu₂` | |
| 4380 | 038763 | 5 | o. 5 | `[...]-{d}Nanna mu / ⌈lugal⌉ PU₃ KA×X ⌈x⌉ [...]` | |
| 27306 | 002492 | 13 | r. 13 | `ša₃ e₂ KA×X-⌈x⌉-[...]` | |
| 29168 | 023466 | 4 | o. 4 | `GIŠ.X.KI//NA-i₃-sa₆ sukkal gaba-aš` | |
| 56326 | 028750 | 5 | o. 5 | `6.0.0 gur šuku Puzur₄-{d}GIŠ×X ra₂-gaba` | |
| 56690 | 029061 | 5 | o. 5 | `{dug}KISIM₅×X x` | |
| 67424 | 029089 | 6 | o. 6 | `Ba-za lu₂-KA×X` | |
| 75213 | 034930 | 49 | o.iii 7 | `a-ša₃ A.LAGAB×X.TUR` | |


In [23]:
pickled = "bdtns.p"
# to_pickle() accepts a file path directly
df.to_pickle('output/' + pickled)

In [24]:
json_file = 'bdtns.json'  # avoid shadowing the name of the standard json module
with open('output/' + json_file, 'w', encoding='utf-8') as w: 
    df.to_json(w, orient='records', force_ascii=False)