# Search BDTNS by Sign
The goal of this notebook is to search the [BDTNS](http://bdtns.filol.csic.es) data by sign, irrespective of reading. For instance, the sign NE may be read bi₂, ne, izi, šeŋ₆, kum₂, lam₂, zah₂, etc. It is easy to search for transliteration (and/or metadata) on the [BDTNS](http://bdtns.filol.csic.es) search page, but there is currently no way to search for a sequence of signs. Such a search is useful in two situations in particular.

1. Sumerological transliteration conventions may differ quite substantially between different schools. Thus, lu₂ kin-gi₄-a, {lu₂}kin-gi₄-a, lu₂ kiŋ₂-gi₄-a and {lu₂}kiŋ₂-gi₄-a all represent the same sequence of signs and the same word (meaning 'messenger'), but without knowledge of the particular set of conventions used it may be difficult to guess which search will yield the desired results. In the sign search one may enter sign readings according to any convention recognized by the ORACC Global Sign List ([OGSL](http://oracc.org/ogsl)).

2. In some cases the correct reading and interpretation of a sign sequence may be ambiguous, and the ambiguity may have been resolved in different ways throughout the database. The names lugal-mudra₅, lugal-suluhi₂ and lugal-siki-su₁₃ all represent the same sign sequence. Which of these is correct is not entirely clear (although the third seems unlikely) and, depending on the research question, may even be unimportant (for instance for social network analysis). In the sign search one may enter any of these forms and the results will include all of them.
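
The first situation can be illustrated with a minimal sketch. The small `toy_readings` table and the helper `to_sign_names()` are hypothetical and hand-written for this illustration only; the notebook itself derives the full table from OGSL further down.

```python
# Toy excerpt of a reading -> sign-name table (an assumption for illustration;
# the real table is built from OGSL later in this notebook).
toy_readings = {"lu₂": "LU₂", "kin": "KIN", "kiŋ₂": "KIN", "gi₄": "GI₄", "a": "A"}

def to_sign_names(translit):
    """Replace determinative brackets and hyphens by spaces, then map readings to sign names."""
    for ch in "{}-":
        translit = translit.replace(ch, " ")
    return " ".join(toy_readings.get(tok, tok.upper()) for tok in translit.split())

variants = ["lu₂ kin-gi₄-a", "{lu₂}kin-gi₄-a", "lu₂ kiŋ₂-gi₄-a", "{lu₂}kiŋ₂-gi₄-a"]
sequences = {to_sign_names(v) for v in variants}
print(sequences)  # all four spellings collapse to a single sign sequence
```

Because all four spellings reduce to the same string of sign names, a search on that string finds every variant.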

In [1]:
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()
import os
import sys
import re
import pickle
import zipfile
import json
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. The directories are created with the function `make_dirs()` from the `utils` module. 
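
The source of `make_dirs()` is not shown in this notebook. The following is only a sketch of what such a helper might do, assuming it creates each directory if it does not yet exist; the name `make_dirs_sketch` and the `base` parameter are hypothetical.

```python
import os
import tempfile

def make_dirs_sketch(directories, base="."):
    """Create each directory under `base`; existing directories are left untouched."""
    for d in directories:
        os.makedirs(os.path.join(base, d), exist_ok=True)

# Demonstrate in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    make_dirs_sketch(["jsonzip", "output"], base=tmp)
    make_dirs_sketch(["jsonzip", "output"], base=tmp)  # calling it twice is harmless
    created = sorted(os.listdir(tmp))
print(created)
```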

In [2]:
directories = ['jsonzip', 'output']
make_dirs(directories)

## 1 Download the ZIP file
The sign search uses the ORACC Global Sign List [OGSL](http://oracc.org/ogsl), available in JSON format at http://build-oracc.museum.upenn.edu/json/ogsl.zip. The function `oracc_download()` from the `utils` module downloads the JSON file in ZIP format. The function expects a list as its sole argument.
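
The implementation of `oracc_download()` is not shown here. As a rough sketch of the idea, the code below builds the ZIP URL from a project name and saves the file with the standard library. The helper names `zip_url` and `download_zip_sketch` are hypothetical, and the URL pattern is an assumption generalized from the single OGSL address given above.

```python
import os
import urllib.request

BASE_URL = "http://build-oracc.museum.upenn.edu/json/"  # taken from the address above

def zip_url(project):
    """Build the download address for a project (pattern assumed from the OGSL example)."""
    return f"{BASE_URL}{project}.zip"

def download_zip_sketch(project, target_dir="jsonzip"):
    """Download the ZIP for one project into `target_dir` and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, f"{project}.zip")
    urllib.request.urlretrieve(zip_url(project), target)
    return target

print(zip_url("ogsl"))  # download_zip_sketch("ogsl") would fetch this file
```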

In [4]:
project = ["ogsl"] # oracc_download() expects a list
p = oracc_download(project)

Saving http://build-oracc.museum.upenn.edu/json/ogsl.zip as jsonzip/ogsl.zip



## 2 The `parsejson()` Function
The function iterates through the JSON object and fills the dictionary `d2`. Each reading listed in [OGSL](http://oracc.org/ogsl) becomes a key; the value is the name of the sign that has that reading. For instance
```python
{'u₄' : 'UD', 'ud' : 'UD', 'babbar' : 'UD'}
```
etc.
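
To make the structure concrete, here is a small self-contained sketch that applies the same idea to a hand-made JSON fragment. The `toy_json` data and the function name `readings_to_names` are assumptions for illustration; only the `signs` and `values` keys are taken from the code in this notebook.

```python
# Hand-made fragment that mimics the layout used below: sign names are keys under
# "signs", and each sign may carry a list of readings under "values".
toy_json = {
    "signs": {
        "UD": {"values": ["u₄", "ud", "babbar"]},
        "GAR": {"values": ["ninda", "gar", "nig₂"]},
        "SIGN_WITHOUT_VALUES": {},  # hypothetical entry without readings
    }
}

def readings_to_names(data):
    """Return a dictionary that maps each reading to the name of its sign."""
    lookup = {}
    for name, entry in data["signs"].items():
        for reading in entry.get("values", []):  # skip signs without readings
            lookup[reading] = name
    return lookup

lookup = readings_to_names(toy_json)
print(lookup["babbar"], lookup["ninda"], lookup["nig₂"])
```

Returning the dictionary, rather than filling a global variable, is a design choice of this sketch; the notebook's own function writes to `d2`.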

In [30]:
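def parsejson(data_json):
    """Add each reading in OGSL to the global dictionary d2, with the sign name as value."""
    for key, value in data_json["signs"].items():  # key is the sign name
        if "values" in value:  # not every sign has known readings
            for n in value["values"]:
                d2[n] = key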

## 3 Main Process
In the main process the file `ogsl-sl.json` is extracted from the ZIP file and parsed with the `json.loads()` function, which returns a Python dictionary. This dictionary is passed to the `parsejson()` function defined above.
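
The read-and-parse step can be tried without downloading anything. The sketch below builds a tiny ZIP archive in memory (its content is made up for the illustration) and reads it back with the same `zipfile` and `json` calls.

```python
import io
import json
import zipfile

# Build a small in-memory archive that stands in for jsonzip/ogsl.zip (toy content).
buffer = io.BytesIO()
toy = {"signs": {"UD": {"values": ["u₄", "ud"]}}}
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("ogsl/ogsl-sl.json", json.dumps(toy, ensure_ascii=False))

# Read one member as bytes, decode it, and parse the JSON text into a dictionary.
with zipfile.ZipFile(buffer) as z:
    raw = z.read("ogsl/ogsl-sl.json").decode("utf-8")
data = json.loads(raw)
print(type(data).__name__, list(data["signs"]))
```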

In [31]:
d2 = {}
file = "jsonzip/ogsl.zip"
z = zipfile.ZipFile(file) 
filename = "ogsl/ogsl-sl.json"
signlist = z.read(filename).decode('utf-8')
data_json = json.loads(signlist)                # make it into a json object (essentially a dictionary)
parsejson(data_json)  

## 4 Inspect the Results in a DataFrame
This DataFrame is only for inspection - it is not otherwise used in the code below.

In [39]:
ogsl = pd.DataFrame.from_dict(d2, orient='index', columns = ["Name"]).sort_values(by = 'Name')
ogsl

|  | Name |
|---|---|
| ban₂@c | 1(BAN₂) |
| sutu | 1(BAN₂) |
| banda₂ | 1(BAN₂) |
| 1(ban₂@c) | 1(BAN₂) |
| 1(ban₂) | 1(BAN₂) |
| ban₂ | 1(BAN₂) |
| ban₂@v | 1(BAN₂) |
| 1(eše₃@c) | 1(EŠE₃) |
| 1(eše₃) | 1(EŠE₃) |
| eše₃@c | 1(EŠE₃) |


## 5 Open BDTNS Data
We can now open the DataFrame with the [BDTNS](http://bdtns.filol.csic.es) transliterations. This DataFrame was pickled in the notebook [2_4_Data_Acquisition_BDTNS.ipynb]. It has five fields: `id_text` (the [BDTNS](http://bdtns.filol.csic.es) number of a document), `id_line` (a continuous line numbering that starts at 1 for each new document; integer), `label` (the regular, human-readable [BDTNS](http://bdtns.filol.csic.es) line number), `text` (the transliteration of the line) and `comments` (any comments added to the line in [BDTNS](http://bdtns.filol.csic.es)).

In [42]:
file = 'output/bdtns.p'
bdtns = pd.read_pickle(file)
bdtns

|  | id_text | id_line | label | text | comments |
|---|---|---|---|---|---|
| 0 | 021035 | 1 | o. 1 | 5 sila₃ kaš 3 sila₃ zi₃ |  |
| 1 | 021035 | 2 | o. 2 | 1 i₃ a₂-GAM |  |
| 2 | 021035 | 3 | o. 3 | Lu₂-Ma₂-gan-na lu₂-{giš}tukul-gu-<la> |  |
| 3 | 021035 | 4 | o. 4 | 0.0.1 kaš 5 sila₃ zi₃ |  |
| 4 | 021035 | 5 | o. 5 | 1 i₃ a₂-GAM |  |
| 5 | 021035 | 6 | o. 6 | da-da sukkal ša₃ giš-/kin-ti-da gen-na |  |
| 6 | 021035 | 7 | o. 8 | 3 sila₃ kaš 2 sila₃ zi₃ |  |
| 7 | 021035 | 8 | o. 9 | 1 i₃ a₂-GAM |  |
| 8 | 021035 | 9 | o. 10 | En-u₂-mi-i₃-li₂ |  |
| 9 | 021035 | 10 | o. 11 | ma₂ giš-še₃ gen-na |  |


## 6 Tokenizing Signs
In order to search by sign, we need to tokenize the signs in the transliteration column (`text`) and to ignore elements such as question marks or (half-)brackets. The first step is to define the different types of separators, operators, and flags that may be present in the text or in a sign name.

The most common separators are space and hyphen. Curly brackets are placed around determinatives (semantic classifiers), as in {d}En-lil₂ ("the god Enlil"). Curly brackets and hyphens will be replaced by spaces. The separators in `separators2` are used in compound signs, as in |SI.A| or |ŠU+NIGIN|. Operators, finally, are also used in compound signs and indicate how the signs are written in relation to each other (on top of each other, one inside the other, etc.).

Compound signs that represent a sequence of simple signs (|SI.A| for **dirig** or |A.TU.GAB.LIŠ| for **asal₂**) will be decomposed into their component signs. Compound signs of the type |KA×GAR| (for **gu₇**) are not analyzed, but their component parts are aligned with [OGSL](http://oracc.org/ogsl) practices (that is, |KA×NINDA| will be rewritten as |KA×GAR|, because in OGSL GAR is the name of the sign that can be read **ninda** or **gar**).
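
The treatment of compound signs can be sketched in isolation. The function `resolve_compound` and the small `toy_names` table are hypothetical and serve only to illustrate the two cases described above; the notebook's own tokenizer follows further down.

```python
# Toy reading -> sign-name table (an assumption for illustration only).
toy_names = {"ninda": "GAR", "gar": "GAR", "ka": "KA"}

def resolve_compound(sign):
    """Split sequence compounds into their parts; normalize the parts of × compounds."""
    inner = sign.strip("|")
    if "." in inner or "+" in inner:  # simple sequence of signs: decompose
        return inner.replace(".", " ").replace("+", " ").split()
    if "×" in inner:  # one sign written inside another: keep as a unit
        parts = [toy_names.get(p.lower(), p) for p in inner.split("×")]
        return ["|" + "×".join(parts) + "|"]
    return [sign]

print(resolve_compound("|SI.A|"))
print(resolve_compound("|ŠU+NIGIN|"))
print(resolve_compound("|KA×NINDA|"))
```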

Finally the flags include various characters that may appear in the transliteration but will be ignored in the search. A search for `ninda`, therefore, will find `ninda`, `[nin]da`, `ninda?`, etc., as well as `gar`, `⸢gar⸣`, `gar!`, etc. (but not `nagar`, see below).

The variable `table` represents a table in which each character in `flags` corresponds to `None`. This is used by the `translate()` method; see below.
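
As a short illustration of how such a translation table behaves, the sketch below builds the table from the same flag characters and applies it to a few example strings.

```python
flags = "][!?<>⸢⸣⌈⌉*/"
# dict.fromkeys() maps every flag character to None; translate() deletes such characters.
table = str.maketrans(dict.fromkeys(flags))

examples = ["[nin]da", "ninda?", "⸢gar⸣", "gar!"]
cleaned = [e.translate(table) for e in examples]
print(cleaned)
```

Deleting all flags in a single `translate()` call avoids a chain of `replace()` calls, one for each character.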

In [43]:
separators = ['{', '}', '-']
separators2 = ['.', '+', '|']  # used in compound signs
operators = ['&', '%', '@', '×']
flags = "][!?<>⸢⸣⌈⌉*/"
table = str.maketrans(dict.fromkeys(flags))

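## 7 Tokenizing Signs 2
The function `signs()` converts one line of transliteration into a sequence of sign names. The result is stored in the new column `sign_names`.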

In [51]:
def signs(row):
    """Convert one line of transliteration into a string of sign names."""
    row_l = []
    row = row.translate(table).lower()  # remove flags, half brackets, square brackets
    for s in separators:  # replace separators by spaces, so that the row can be split into signs
        row = row.replace(s, ' ').strip()
    s_l = row.split()
    s_l = [d2.get(sign, sign) for sign in s_l]  # replace each reading by its sign name
    # Now take care of some special situations: signs with qualifiers, compound signs.
    for sign in s_l:
        if sign[-1] == ')' and '(' in sign:  # qualified sign - get only the qualifier
            sign = sign.split('(')[1][:-1]
        if '.' in sign or '+' in sign:  # sequence of simple signs - decompose
            for s in separators2:
                sign = sign.replace(s, ' ').strip()
            row_l.extend(sign.split())
            continue
        if '×' in sign:  # align the component parts with OGSL sign names
            sign = sign.replace('|', '')  # temporarily remove pipes - if present
            sign_l = sign.split('×')
            sign_l = [d2.get(sign, sign) for sign in sign_l]
            sign = '|' + '×'.join(sign_l) + '|'
        row_l.append(sign)
    sign_names = [d2.get(sign, sign) for sign in row_l]
    return ' '.join(sign_names).upper()

In [52]:
bdtns["sign_names"] = bdtns["text"].progress_apply(signs)



## The Search Function
The search function takes as input any style of transliteration recognized in [OGSL](http://oracc.org/ogsl), in upper or lower case. Signs may be connected with hyphens or spaces; determinatives may be written between curly brackets ({d}En-ki) or on the line (d-nin-gisz-zi-da). Shin may be represented by š, c, or sz, and sign index numbers may be written on the line or with Unicode subscript numbers ('e₂' and 'e2' are equivalent, but 'é' will yield no results). '{d}Nin-giš-zi-da-ke₄', 'd-nin-ŋeš-zi-da-ke₄', or 'AN nin gisz ZI da ke4' will all return the same results.
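
The normalization of the search string can be sketched separately from the search itself. The helper name `normalize_query` is hypothetical; the character mapping follows the conventions listed above (sz and c for š, digits for index numbers, \* for ×, brackets and hyphens as separators).

```python
# Map plain digits to subscripts, separators to spaces, c to š and * to ×.
subscripts = "₀₁₂₃₄₅₆₇₈₉"
mapping = {str(i): subscripts[i] for i in range(10)}
mapping.update({"x": "ₓ", "{": " ", "}": " ", "-": " ", "c": "š", "*": "×"})
tab = str.maketrans(mapping)

def normalize_query(query):
    """Lower-case the query, unify spellings and split it into sign readings."""
    return query.lower().replace("sz", "š").translate(tab).split()

print(normalize_query("{d}Nin-gisz-zi-da-ke4"))
print(normalize_query("AN nin gisz ZI da ke4"))
```

The two queries still differ in their first token (`d` versus `an`); they only become identical once each reading is replaced by its sign name.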

The search engine will find any matching sequence of signs, independent of the transliteration, thus 'nig2 sig' will also find 'ninda sig'.
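
Matching must respect sign boundaries, so that a search for GAR does not hit NAGAR. One way to express this, shown here as a sketch and not necessarily the pattern used in this notebook, is to require whitespace or the edge of the string on both sides of the query. The function name `contains_sequence` is hypothetical.

```python
import re

def contains_sequence(sign_names, query):
    """True if `query` occurs in `sign_names` as a run of complete, space-separated signs."""
    # (?<!\S) and (?!\S): no non-space character directly before or after the match.
    pattern = r"(?<!\S)" + re.escape(query) + r"(?!\S)"
    return re.search(pattern, sign_names) is not None

print(contains_sequence("1(DIŠ) GAR SIG", "GAR SIG"))   # complete signs
print(contains_sequence("NAGAR SIG", "GAR SIG"))        # GAR is only part of NAGAR
print(contains_sequence("|KA×GAR| A", "|KA×GAR|"))      # compound sign between pipes

# A word boundary (\b) is not suitable for compound signs: there is no word
# boundary between the start of the string (or a space) and a pipe.
word_boundary_hit = re.search(r"\b" + re.escape("|KA×GAR|") + r"\b", "|KA×GAR| A")
```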

The search results are listed in a DataFrame. If there are 25 results or fewer, the DataFrame provides links to the [BDTNS](http://bdtns.filol.csic.es) pages of the matching texts.

In [None]:
num = '0123456789x{}-c*'
ind = '₀₁₂₃₄₅₆₇₈₉ₓ   š×'
tab = str.maketrans(num, ind)
anchor = '<a href="http://bdtns.filol.csic.es/{}" target="_blank">{}</a>'

In [None]:
def search(search):
    """Return the lines of the bdtns DataFrame that contain the requested sign sequence."""
    search = search.lower().replace('sz', 'š').translate(tab).strip()
    search_l = search.split()
    search_l = [d2.get(s, s) for s in search_l]  # replace each reading by its sign name
    row_l = []
    for sign in search_l:
        if '.' in sign or '+' in sign:  # sequence of simple signs - decompose
            for s in separators2:
                sign = sign.replace(s, ' ').strip()
            row_l.extend(sign.split())
        elif '×' in sign:  # align the component parts with OGSL sign names
            sign_l = sign.replace('|', '').split('×')
            sign_l = [d2.get(sign, sign) for sign in sign_l]
            sign = '|' + '×'.join(sign_l) + '|'
            row_l.append(sign)
        else:
            row_l.append(sign)
    search_l = [re.escape(s.upper()) for s in row_l]
    sign_seq = ' '.join(search_l)
    show = ['id_text', 'label', 'text']
    # Match complete signs only. Whitespace lookarounds are used instead of \b, because
    # \b does not match next to the pipes of compound signs such as |KA×GAR|.
    pattern = r'(?<!\S)' + sign_seq + r'(?!\S)'
    results = bdtns[show].loc[bdtns['sign_names'].str.contains(pattern, regex=True)].copy()
    print(sign_seq)
    print(str(len(results)) + ' hits')
    if len(results) <= 25:  # add links only for 25 hits or fewer
        results['id_text'] = [anchor.format(val, val) for val in results['id_text']]
        results = results.style
    return results

## Search Instructions
Search for a sequence of sign values in any transliteration system recognized by [OGSL](http://oracc.org/ogsl). Thus, sugal₇, sukkal, or luh, in upper or lower case, will all return the same results.

Shin may be represented by š, c, or sz, in upper or lower case.

Sign indexes may be written with regular numbers or with subscript numbers (sig7 or sig₇).

Compound signs (such as diri) are resolved into their component signs if the compound represents a simple sequence of signs. Thus diri is resolved as SI A, but gu₇ is resolved as |KA×GAR|.

To search for a compound sign by sign name, enter it between pipes (|). The "times" sign may be represented by \* (enter |UR₂×A| or |UR₂\*A|).

In [None]:
s = input()

In [None]:
search(s)

In [None]:
bdtns[bdtns['text'].str.contains('esir₂')]

In [None]:
s in bdtns.iloc[195]['sign_names']

In [None]:
s

In [None]:
bdtns[bdtns["text"].str.contains('diri')]

In [None]:
bdtns[bdtns['sign_names'].str.contains(r'SI\.A')]

In [None]:
bdtns[bdtns['sign_names'].str.contains('A₂ SAL.KUR KA', regex=False)]

In [None]:
show = ['id_text', 'label', 'text']
bdtns[show]

In [None]:
bdtns

In [None]:
%%timeit
search('diri-ga')

In [None]:
d2['diri']