# 2.4 Data Acquisition BDTNS

The goal of this notebook is to transform [BDTNS](http://bdtns.filol.csic.es/) data into a structured format that clearly distinguishes between text and non-text (such as line numbers) and that, for the text part, follows as much as possible the standards of the Oracc Global Sign List ([OGSL](http://oracc.org/ogsl)). Adherence to the [OGSL](http://oracc.org/ogsl) standard makes it possible to transform a line of text into a sequence of sign names or Unicode codepoints. This can be used to track transliteration inconsistencies, for instance in the rendering of names (U₂-da-mi-ša-ru-um vs. U₂-ta₂-mi-ša-ru-um), to build a search engine that finds a sign sequence regardless of the actual transliteration, or to compare editions of the same text in [BDTNS](http://bdtns.filol.csic.es/), [CDLI](http://cdli.ucla.edu), and [ePSD2](http://oracc.org/epsd2/admin/u3adm).

The search engine will be built in a separate notebook - primarily as a case study of the potential of this approach.

On the [BDTNS](http://bdtns.filol.csic.es/) website, search results can be downloaded with the "Export" button in the left pane. Searching for the empty string will select all texts currently available. The export creates two files: one with transliterations and one with meta-data. The transliteration file is discussed here most extensively. The file name is "query_text_" followed by a date and an additional number. Move the file to the `data` directory of this chapter and make sure that the variable `file` corresponds to your file name and file location.

This notebook uses the `regex` module for regular expressions, rather than the standard `re` module. The `regex` module has more extensive support for Unicode, in particular in classifying characters as letters ('a', 'b', 'š', 'ṭ') versus non-letters ('\[', '6', or '~'). The `regex` module needs to be installed before it can be imported (for installing modules, see Chapter 1).
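The letter versus non-letter distinction can be checked without installing anything. The sketch below uses the standard `unicodedata` module to show which characters fall in a Unicode letter category (a category code starting with `L`, which is what `\p{L}` stands for in `regex`). It then reports whether the standard `re` module on your installation accepts `\p{L}`; historically it has raised an error. The helper name `is_letter` is made up for this illustration.

```python
import re
import unicodedata

def is_letter(ch):
    """Return True if the single character ch is in a Unicode letter category."""
    return unicodedata.category(ch).startswith("L")

letters = ["a", "b", "\u0161", "\u1e6d"]   # 'a', 'b', 'š', 'ṭ'
non_letters = ["[", "6", "~"]

print([is_letter(ch) for ch in letters])
print([is_letter(ch) for ch in non_letters])

# Does the standard library accept the \p{L} shorthand?
try:
    re.compile(r"\p{L}")
    re_accepts_p = True
except re.error:
    re_accepts_p = False
print("re accepts \\p{L}:", re_accepts_p)
```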

This notebook also uses the `pandas` integration of the `tqdm` package to show a progress bar. This is important because the transformation of (currently) more than 1.1 million lines of text in [BDTNS](http://bdtns.filol.csic.es) takes some time. The pandas/tqdm integration is initialized with `tqdm.pandas()`, immediately after importing `tqdm`. This makes the functions `progress_apply()` and `progress_map()` available as alternatives to the standard `apply()` and `map()` functions in `pandas`.

In [2]:
import pandas as pd
import regex
from tqdm.auto import tqdm
tqdm.pandas()
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

# Create Directories

In [3]:
directories = ['output']
make_dirs(directories)

# Read the File as a List
The [BDTNS](http://bdtns.filol.csic.es/) export file is read as a list (each line is one element in the list) with the `splitlines()` function. 

[BDTNS](http://bdtns.filol.csic.es/) uses so-called "vertical tabs" (represented by `^K`, `\v`, or `\x0b`, depending on your editor) as a new line within a single text, whereas the standard newline (`\n`) character is used to separate one text from the next. The Python `readlines()` function (which reads a file line-by-line, producing a list) keeps the vertical tabs and will result in a format where each document is represented as a single line. Instead, therefore, we read the entire file with the `read()` function and then split the document into lines with `splitlines()`. The function `splitlines()` takes the vertical tab (as well as the regular newline character) as a line separator; this will result in a line-by-line representation of the document in the form of a list (called `bdtns`).
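The difference between the two approaches can be seen on a small in-memory sample. The sample string below is invented for the illustration (it is not taken from the export file); `io.StringIO` stands in for the opened file.

```python
import io

# Two made-up "texts": lines within a text are separated by a vertical tab (\x0b),
# texts are separated from each other by a regular newline (\n).
sample = (
    "000001\tText A\t\x0bo. 1     1 udu\x0bo. 2     2 sila4\n"
    "000002\tText B\t\x0bo. 1     3 gud\n"
)

# readlines() only breaks on the newline, so each text stays on one line.
as_readlines = io.StringIO(sample).readlines()

# read() followed by splitlines() also breaks on the vertical tab.
as_splitlines = io.StringIO(sample).read().splitlines()

print(len(as_readlines))
print(as_splitlines)
```

With `readlines()` the vertical tabs remain embedded in the two long strings; with `splitlines()` every header and every transliteration line becomes its own list element.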

In [4]:
file = 'query_text_19_05_15-020951.txt'
with open('data/' + file, mode = 'r', encoding = 'utf8') as f:
    bdtns = f.read().splitlines()

# Inspect the Result

In [5]:
bdtns[:25]

["197624\tÀ l'école des scribes, p. 104\t",
 '021035\tAAA 1 78 24 = MVN 05 249 = CDLI P114469\t',
 '     ',
 'o. 1     5 sila3 kaš 3 sila3 zi3',
 'o. 2     1 i3 a2-GAM',
 'o. 3     Lu2-Ma2-gan-na lu2-{giš}tukul-gu-<la>',
 'o. 4     0.0.1 kaš 5 sila3 zi3',
 'o. 5     1 i3 a2-GAM',
 'o. 6     da-da sukkal ša3 giš-/kin-ti-da gen-na',
 'o. 8     3 sila3 kaš 2 sila3 zi3',
 'o. 9     1 i3 a2-GAM',
 'o. 10     En-u2-mi-i3-li2',
 'o. 11     ma2 giš-še3 gen-na',
 'o. 12     3 sila3 kaš 2 sila3 zi3',
 'r. 13     1 i3 a2-GAM',
 'r. 14     Ar-ši-ah lu2-kas4',
 'r. 15     0.0.2 5 sila3 kaš sig5 lugal',
 'r. 16     0.1.4 kaš gen',
 'r. 17     0.0.3 zi3-gu',
 'r. 18     dam ensi2 Šušin{ki}',
 'r. 19     Lu2-{d}Nin-gir2-su šu-i maškim?',
 'r. 20     3 sila3 kaš 2 sila3 zi3',
 'r. 21     1 i3 a2-GAM',
 'r. 22     Hu-nu-NE-a',
 'r. 23     dam ensi2 Šušin{ki}-še3 gen-na']

# Remove empty lines
Empty lines (or lines filled with spaces or tabs only) will cause trouble downstream and are removed.

In [6]:
bdtns = [line for line in bdtns if len(line.strip()) > 0]

# Make DataFrame
The code looks for lines that begin with 6 digits (as in '145342'). Those numbers are the [BDTNS](http://bdtns.filol.csic.es/) text ID numbers and mark the beginning of a new document. If such a line is found, the number is stored in the variable `bdtns_no`. Other data available in such lines (such as publication and [CDLI](http://cdli.ucla.edu) P-number) are omitted, because they are better derived from the meta-data file (with the [BDTNS](http://bdtns.filol.csic.es) number as key).

If the line does not start with 6 digits, it is a transliteration line. The line is split to separate line number, transliteration text, and various types of comments. The variable `bdtns_no` is added to each line as a separate field.

In the [BDTNS](http://bdtns.filol.csic.es/) output, line numbers are separated from transliteration text by 5 spaces. Editorial remarks are introduced by the hash mark (#). Finally, sign names (qualifying x-values, rare signs, and, occasionally, readings with exclamation mark) are marked by "(=" (as in '(=SIG7)'). The first run of five spaces is replaced by '#' and the first '(=' is replaced by '#(=' so that we can split each line on the hash mark. Each `replace()` is performed only once per line and `split()` splits at most twice, so that we end up with three columns, representing line label, text, and comments (to which the `bdtns_no` is added in first position). The column `comments` will include both editorial comments and sign explications of x-values.
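To see what the replace-then-split step does to a single line, the same three calls can be wrapped in a small helper and tried on two lines. The helper name `split_line` is hypothetical and exists only for this illustration; the five-space separator is built explicitly so that it cannot be miscounted.

```python
SEP = " " * 5  # five spaces between line label and transliteration

def split_line(line):
    """Split one export line into [line_label, text, comments] (comments optional)."""
    li = line.strip()
    li = li.replace("(=", "#(=", 1).replace(SEP, "#", 1)
    return li.split("#", 2)

with_sign_name = split_line("o. 2" + SEP + "gi ziX-a 12 sar-⌈ta⌉ (=SIG7)")
without_comment = split_line("o. 4" + SEP + "0.0.1 kaš 5 sila3 zi3")

print(with_sign_name)
print(without_comment)
```

A line without a comment yields only two elements; when the list of lists is turned into a DataFrame the missing third field becomes a missing value, which `fillna("")` turns into an empty string. Note also that the text field keeps the space that preceded "(=".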

In [7]:
l = []
bdtns_no = ""
for line in tqdm(bdtns):  # tqdm was imported with `from tqdm.auto import tqdm`
    if line[:6].isdigit(): 
        bdtns_no = line[:6]
        continue
    else: 
        li = line.strip()
        li = li.replace("(=", "#(=", 1).replace('     ', '#', 1)
        li_l = li.split('#', 2)
        li_l = [bdtns_no] + li_l
        l.append(li_l)

100%|████████████████████████████| 1251607/1251607 [00:09<00:00, 132276.46it/s]


In [None]:
columns = ["bdtns_no", "line_label", "text", "comments"]
df = pd.DataFrame(l, columns=columns).fillna("")

# Inspect the Result

In [None]:
df

# Make OGSL compliant
[OGSL](http://oracc.org/ogsl) is the ORACC Global Sign List, which lists for each sign its possible readings. [OGSL](http://oracc.org/ogsl) compliance opens the possibility to search or compare by sign *name* rather than sign value. For instance, one may search for the sequence "aga₃ kug-sig₁₇" (golden tiara) and find a line reading "gin₂ ku₃-GI".
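The idea of searching by sign name can be sketched with a tiny hand-written lookup table. The dictionary `toy_signs` below is an assumption made for this illustration: it covers only the values in the example and is not derived from the actual OGSL data, which a real implementation would load instead. The function name `sign_names` is likewise hypothetical.

```python
import re

# Toy value-to-sign-name table, covering only the example above.
toy_signs = {
    "aga₃": "TUN₃", "gin₂": "TUN₃",
    "kug": "KUG", "ku₃": "KUG",
    "sig₁₇": "GI", "gi": "GI",
}

def sign_names(line):
    """Turn a transliterated line into a list of sign names (toy version)."""
    tokens = re.split(r"[-\s]+", line.strip())   # split on hyphens and spaces
    return [toy_signs[token.lower()] for token in tokens]

query = sign_names("aga₃ kug-sig₁₇")
found = sign_names("gin₂ ku₃-GI")
print(query, found, query == found)
```

Because both transliterations reduce to the same sequence of sign names, a comparison at the level of sign names finds the match that a comparison of the transliterations would miss.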

The main steps towards [OGSL](http://oracc.org/ogsl) compliance are: 

- replace regular numbers with index numbers in sign values
- add sign names to x-values


# Dealing with x-values
In [OGSL](http://oracc.org/ogsl), a sign reading that has not (yet) received a commonly accepted index receives the ₓ index, followed by the sign name, as in ziₓ(SIG₇). In the [BDTNS](http://bdtns.filol.csic.es) export file the subscripted ₓ is represented by a capital X and the sign name is given at the end of the line, as in

> o. 2     gi ziX-a 12 sar-⌈ta⌉ (=SIG7)

Because of the way the DataFrame has been produced, the sign names are now found in the column `comments`. The most straightforward solution is to replace every capital X with an ₓ plus what is found in the comments column (minus the equal sign).

This will result in: 
> o. 2     gi ziₓ(SIG₇)-a 12 sar-⌈ta⌉

In practice, however, there are quite a few exceptions to the pattern described above. Examples include: 

> 18 gin2 nagga mu-kuX gibil (=AN.NA) (=DU)

Here (=AN.NA) explains the infrequent sign "nagga". However, since "nagga" is not followed by X the script will return

> 18 gin₂ nagga mu-kuₓ(AN.NA) gibil (=DU)

Another exception is reduplicated "gurₓ-gurₓ" which is represented thus: 
> 6.0.0 še ur5-ra še gurX-gurX-ta su-ga (=ŠE.KIN.ŠE.KIN)

resulting in: 
> 6.0.0 še ur₅-ra še gurₓ(ŠE.KIN.ŠE.KIN)-gurₓ-ta su-ga

Finally, a few x-values are usually not resolved this way, for instance: 
> 1 sila4 Ur-nigarX{gar} \[...\]

with no sign name for nigarX - presumably because it is unambiguous.

For these reasons we will approach the x-values in two different ways:

- x-values that resolve unambiguously are resolved with a simple search and replace, replacing, e.g. ziₓ with ziₓ(IGI@g), without paying attention to the [BDTNS](http://bdtns.filol.csic.es) sign explication (=SIG7). A special case in this category is mu-kuX ("delivery"), which is very frequent and should be resolved as mu-kuₓ(DU). However, kuX by itself may also be resolved as kuₓ(LIL) or kuₓ(KWU147) (for the verb "to enter").
- x-values that do not resolve unambiguously (muruₓ, ušurₓ, ummuₓ, and several others) are resolved by moving the [BDTNS](http://bdtns.filol.csic.es) sign name (in the `comments` column) after the X sign between brackets, as discussed above.

Both these steps are included in a single function (`ogsl_v()`) that is applied to every row of the DataFrame. In addition, this function will replace index numbers (such as the 7 in sig7) with Unicode index numbers (sig₇).

# Step 1: Unambiguous x-values

Some x-values are always resolved in the same way. Thus, ziX is always ziₓ(IGI@g), hirinX is always hirinₓ(KWU318), and gurX is always gurₓ(|ŠE.KIN|). In some cases, x-values have been assigned an index number in [OGSL](http://oracc.org/ogsl). In those cases (nigarₓ = nigar; nemurₓ(PIRIG.TUR) = nemur₂; nagₓ(GAZ) = nag₃; and pešₓ(ŠU.PEŠ5) = peš₁₄) the appropriate index number is added and the sign name is ignored.

A dictionary of such unambiguous x-values (`xvalues`) has as its key the x-value as represented in [BDTNS](http://bdtns.filol.csic.es) ('ziX') and as its value the index ₓ plus the appropriate sign name ('ₓ(IGI@g)') in [OGSL](http://oracc.org/ogsl) format.

For various reasons the substitution cannot be done with the general `replace()` function. First, instead of 'ziX' we may encounter '\[z\]iX' or some other usage of flags that indicate certainty or breakage. Second, 'subX' should not match 'munsubX'. Third, not every X indicates an x-value: X is also used (less frequently) for illegible signs, as in KA×X, or simply X.

As a first step, a line is split into words with the `split()` function (by default this will split on whitespace). A for-loop then iterates through the `xvalues` dictionary. First, it tests whether the sought-for x-value is present in the line; if not, we go to the next x-value in the dictionary. Because of possible flags, the test is not done on the actual line, but on a "translated" version of the line. The function `translate()` takes a table and translates characters into other characters. In the present case, each flag character is mapped to `None`, which means that it is removed. Note that the flags are not removed from the actual transliteration; they are only removed in the `if` statement to test for the presence of the target string. Both the target and the line are lowercased for this test. The search for the target word is done with a regular expression that surrounds the word with `\b` on both sides (written `'\\b'` in a regular Python string). The `\b` represents the "word boundary". Note that in regular expressions the hyphen and the various brackets count as word boundaries, so that `\bgurX\b` matches 'in-gurX'.

If the target word is present in the line, the same test is executed on each (translated) word of the line and once the word is found the uppercase X is replaced by the corresponding value in the dictionary `xvalues`. Since only the uppercase X is replaced, all flags and brackets that may have been present in the original text (as in '\[z\]iX?') are preserved (resulting in '\[z\]iₓ(IGI@g)?')

After the script has iterated through all the items in the `xvalues` dictionary the full line is put together again (with the `join()` function) and the second part of the function will look at remaining x-values.
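The test-then-replace logic of Step 1 can be tried in isolation. The sketch below uses the standard `re` module, which is sufficient here because `\b` is all that is needed; the helper name `contains_value` is made up for the illustration, and `re.escape()` is an extra precaution that the notebook's own function does not use.

```python
import re

flags = "][!?<>⸢⸣⌈⌉*"
table = str.maketrans(dict.fromkeys(flags))   # every flag character maps to None

def contains_value(value, text):
    """Test for an x-value as a whole word, ignoring flags and case."""
    pattern = r"\b" + re.escape(value.lower()) + r"\b"
    return bool(re.search(pattern, text.translate(table).lower()))

print(contains_value("ziX", "gi [z]iX?-a 12 sar-⌈ta⌉"))   # flags are ignored in the test
print(contains_value("gurX", "in-gurX"))                   # hyphen is a word boundary
print(contains_value("subX", "munsubX"))                   # no boundary inside munsubX

# The replacement itself is done on the original word, so flags survive.
word = "[z]iX?"
resolved = word.replace("X", "ₓ(IGI@g)")
print(resolved)
```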

# Step 2: Remaining x-values
For the remaining x-values (many of them ambiguous) we will copy the [BDTNS](http://bdtns.filol.csic.es) sign name, found in the `comments` column, to the x-value. For instance, **ummu₃** is |A.EDIN.LAL|, but the sign complex has many variants, all rendered **ummuX**: EDIN.A.SU, A.EDIN, A.EDIN.A.LAL, EDIN, etc. The code will result in ummuₓ(|EDIN.A.SU|), ummuₓ(|A.EDIN|), ummuₓ(|A.EDIN.A.LAL|), ummuₓ(EDIN), etc. Compound signs are put between pipes (|EDIN.A.SU|), according to [OGSL](http://oracc.org/ogsl) conventions.

In this step the code will naively replace the capital X by the index ₓ, followed by the first sign name in the `comments` column. This will result in errors if there is more than one such x-value in a single line, but because we have already dealt with many such values in the preceding step, that risk is not very high. The code will test that the capital X does in fact follow a sign reading (as in ziX), and is not an illegible sign (as in KA×X, or simply X). This is done with a [regular expression](https://www.regular-expressions.info/) using a so-called "positive lookbehind", `(?<=...)`, to see if the preceding character is a letter. The sequence `\p{L}` indicates any letter in any Unicode alphabet, but excludes numbers, punctuation, etc. This notation is not available in the `re` library (the most commonly used library for regular expressions), but it can be used in the `regex` library, which replaces `re` in cases where more expansive Unicode support is necessary.
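Step 2 can be sketched on single strings as follows. So that the example runs without extra installs, it uses the standard `re` module with the character class `[^\W\d_]` as a rough stand-in for `\p{L}`; this is an approximation chosen for the illustration, not what the notebook's function uses. The helper names `wrap_sign_name` and `resolve` are hypothetical.

```python
import re

# Rough stand-in for (?<=\p{L})X: an X preceded by a word character
# that is neither a decimal digit nor an underscore.
LETTER_X = re.compile(r"(?<=[^\W\d_])X")

def wrap_sign_name(comment):
    """Turn '(=SIG7)' into 'ₓ(SIG7)'; compound names get pipes."""
    name = comment[2:].split(")")[0]
    if "." in name or "×" in name:
        return "ₓ(|" + name + "|)"
    return "ₓ(" + name + ")"

def resolve(text, comment):
    """Copy the sign name from the comment to every X that follows a letter."""
    if not comment.startswith("(="):
        return text
    replacement = wrap_sign_name(comment)
    return LETTER_X.sub(lambda match: replacement, text)

print(resolve("gi ziX-a 12 sar-⌈ta⌉", "(=SIG7)"))
print(resolve("ummuX", "(=A.EDIN)"))
print(resolve("KA×X", "(=EDIN)"))     # X after × is left alone
print(resolve("1 udu X", "(=EDIN)"))  # a free-standing X is left alone
```

The index number inside the sign name (the 7 in SIG7) is still a regular digit at this point; it is converted in the next step.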

# Step 3: Index Numbers
In a third step all sign reading index numbers (as in 'du11') are replaced by Unicode index numbers ('du₁₁'). Regular numbers that express quantities should not be affected. This is done, again, with a look-behind regular expression that identifies the character immediately before the number as a (Unicode) letter. If such a match is found, the number is replaced by the index number. The process needs to be done twice, in order to take care of double digit index numbers. In the second round, the look-behind regular expression looks for index numbers in the range [₀-₉].
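The rule behind Step 3 (a digit becomes an index number only when it follows a letter or another index number) can also be written as a single pass over the characters. The sketch below is a dependency-free reformulation for illustration, not the two-round regular-expression approach used in the notebook's function; it relies on `unicodedata` to recognise letters, and the name `index_numbers` is made up. Unlike the notebook's function it leaves a capital X untouched.

```python
import unicodedata

DIGITS = "0123456789"
SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉"
TO_SUBSCRIPT = str.maketrans(DIGITS, SUBSCRIPTS)

def index_numbers(text):
    """Replace digits that follow a letter (or an index digit) with subscripts."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        after_letter = prev != "" and unicodedata.category(prev).startswith("L")
        after_index = prev != "" and prev in SUBSCRIPTS
        if ch in DIGITS and (after_letter or after_index):
            out.append(ch.translate(TO_SUBSCRIPT))
        else:
            out.append(ch)
    return "".join(out)

print(index_numbers("5 sila3 kaš 3 sila3 zi3"))
print(index_numbers("du11-ga"))
print(index_numbers("0.0.1 kaš 5 sila3 zi3"))
```

Quantities such as 5, 3, and 0.0.1 are preceded by a space, a period, or the start of the line, so they are not touched.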

# Errors
Inevitably, each of the steps in dealing with x-values may introduce its own errors. It is likely, moreover, that there are more x-values not treated here, or that there will be more x-values in a future version of the [BDTNS](http://bdtns.filol.csic.es) data. The dictionary of x-values below can be adapted to deal with those situations. 

In [None]:
xvalues = {'nagX' : '₃', 'nigarX' : '', 'nemurX' : '₂', 'pešX' : '₁₄', 'urubX' : '', 
        'tubaX' : '₄', 'niginX' : '₈', 'šuX' : '₁₄', 
        'alX' : 'ₓ(|NUN.LAGAR|)' , 'bulugX' : 'ₓ(|ŠIM×KUŠU₂|)', 'dagX' : 'ₓ(KWU844)', 
        'duruX' : 'ₓ(|IGI.DIB|)', 'durunX' : 'ₓ(|KU.KU|)', 
        'gigirX' : 'ₓ(|LAGAB×MU|)','giparX' : 'ₓ(KISAL)', 'girX' : 'ₓ(GI)', 
        'gišbunX' : 'ₓ(|KI.BI|)', 'gurX' : 'ₓ(|ŠE.KIN|)', 
        'hirinX' : 'ₓ(KWU318)', 'kurunX' : 'ₓ(|DIN.BI|)',
        'mu-kuX' : 'ₓ(DU)', 'munsubX' : 'ₓ(|PA.GU₂×NUN|)',  
        'sagX' : 'ₓ(|ŠE.KIN|)', 'subX' : 'ₓ(|DU.DU|)', 
        'sullimX' : 'ₓ(EN)', 'šaganX' : 'ₓ(|GA×AN.GAN|)', 
        'ulušinX' : 'ₓ(|BI.ZIZ₂|)', 'zabalamX' : 'ₓ(|MUŠ₃.TE.AB@g|)', 
        'zahX' : 'ₓ(ŠEŠ)', 'zahdaX' : 'ₓ(|DUN.NE.TUR|)',  
        'ziX' : 'ₓ(IGI@g)'}

In [None]:
flags = "][!?<>⸢⸣⌈⌉*"
table = str.maketrans(dict.fromkeys(flags))

In [None]:
nos = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'X']
indexes = ['₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉', 'ₓ']
ind_d = dict(zip(nos, indexes))

In [None]:
def ogsl_v(row):
    # 1. deal with unambiguous x-values, listed in the dictionary xvalues.
    m = row['text'].split()  # split line into words
    for value in xvalues: 
        if not regex.findall('\\b' + value.lower() +'\\b', row['text'].translate(table).lower()): 
            continue
        else: 
            for i, w in enumerate(m):
                if not regex.findall('\\b' + value.lower() +'\\b', w.translate(table).lower()): 
                    continue
                else: 
                    m[i] = w.replace("X", xvalues[value])
    row['text'] = ' '.join(m)
    
    # 2. deal with remaining x-values
    if row["comments"][:2] == "(=": 
        sign_name = row["comments"][2:]  # remove (=  from (=SIG7)
        sign_name = sign_name.split(')')[0] #remove ) and anything following
        if '.' in sign_name or '×' in sign_name: 
            sign_name = 'ₓ(|' + sign_name + '|)'  # add pipes if necessary
        else: 
            sign_name = 'ₓ(' + sign_name + ')'
        ogsl_valid = regex.sub(r'(?<=[\p{L}])X', sign_name, row['text'])  # raw string avoids an invalid-escape warning
    else:
        ogsl_valid = row["text"]
    
    # 3 deal with index numbers
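    for no in ind_d:  # first round: digit (or X) directly after a letter
        ogsl_valid = regex.sub(r'(?<=[\p{L}])' + no, ind_d[no], ogsl_valid)
    for no in ind_d:  # second round: second digit of a two-digit index
        ogsl_valid = regex.sub('(?<=[₀-₉])' + no, ind_d[no], ogsl_valid)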
    return ogsl_valid

# Apply the Function
The `ogsl_v` function is now applied to each row in the DataFrame. The DataFrame currently has more than 1.1 million rows (lines) and the function is quite involved, so this will take some time. For that reason a progress bar has been added. The progress bar is part of the `tqdm` library. It is initialized with the line
```python
tqdm.pandas()
```
Instead of the regular `apply()` function from the `pandas` library we may now use `progress_apply()` to do the same thing as `apply()`, but with a progress bar.

In [None]:
from tqdm.auto import tqdm
tqdm.pandas(desc="Progress")
df["text"] = df.progress_apply(ogsl_v, axis = 1) 

# Check for Remaining x-values

In [None]:
df[df["text"].str.contains("X")]

In [None]:
pickled = "bdtns.p"
with open('output/' + pickled, 'wb') as w:
    df.to_pickle(w)