# Build Sign Search for BDTNS
The goal of this notebook is to search the [BDTNS](http://bdtns.filol.csic.es) data by signs, irrespective of their reading. For instance, the sign NE may be read bi₂, ne, izi, šeŋ₆, kum₂, lam₂, zah₂, etc. It is easy to search for transliterations (and/or metadata) on the [BDTNS](http://bdtns.filol.csic.es) search page, but there is currently no way to search for a sequence of signs. Such a search is useful in two situations in particular.

1. Sumerological transliteration conventions may differ quite substantially between different schools. Thus, lu₂ kin-gi₄-a, {lu₂}kin-gi₄-a, lu₂ kiŋ₂-gi₄-a and {lu₂}kiŋ₂-gi₄-a all represent the same sequence of signs and the same word (meaning 'messenger'), but without knowledge of the particular set of conventions used it may be difficult to guess which search will yield the desired results. In the sign search one may enter sign readings according to any convention recognized by the ORACC Global Sign List ([OGSL](http://oracc.org/ogsl)).

2. In some cases the correct reading and interpretation of a sign sequence may be ambiguous, and the ambiguity may have been resolved in different ways throughout the database. The names lugal-mudra₅, lugal-suluhi₂ and lugal-siki-su₁₃ all represent the same sign sequence. Which of these is correct is not entirely clear (although the third seems unlikely) and, depending on the research question, may even be unimportant (for instance for an SNA analysis). In the sign search one may enter any of these forms and the results will include all of them.

In [None]:
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()
import os
import sys
import re
import pickle
import zipfile
import json
from ipywidgets import interact, interact_manual
import ipywidgets as widgets
from IPython.display import display, clear_output
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. The directories are created with the function `make_dirs()` from the `utils` module. 

In [None]:
directories = ['jsonzip', 'output']
make_dirs(directories)
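
The implementation of `make_dirs()` lives in the `utils` module and is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical function `make_dirs_sketch()` below creates each directory only if it does not exist yet. The function name and the `base` parameter are assumptions made for illustration.

```python
import os
import tempfile

def make_dirs_sketch(directories, base="."):
    """Create each directory under base unless it already exists (illustrative)."""
    for d in directories:
        os.makedirs(os.path.join(base, d), exist_ok=True)

# Demonstrate in a throwaway location so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    make_dirs_sketch(['jsonzip', 'output'], base=tmp)
    make_dirs_sketch(['jsonzip', 'output'], base=tmp)  # second call changes nothing
    created = sorted(os.listdir(tmp))
print(created)
```

Because of `exist_ok=True`, a helper of this kind can be called at the top of every notebook without raising an error when the directories are already present.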

## 1 Download the ZIP file
The sign search uses the ORACC Global Sign List [OGSL](http://oracc.org/ogsl), available in JSON format at http://build-oracc.museum.upenn.edu/json/ogsl.zip. The function `oracc_download()` from the `utils` module downloads the JSON file in ZIP format. The function expects a list as its sole argument.

In [None]:
project = ["ogsl"] # oracc_download() expects a list
p = oracc_download(project)

## 2 The `parsejson()` function
The function iterates through the JSON object and fills the dictionary `d2`. Each reading listed in [OGSL](http://oracc.org/ogsl) is a key; its value is the name of the sign that carries that reading. For instance
```python
{'u₄' : 'UD', 'ud' : 'UD', 'babbar' : 'UD'}
```
etc.

In [None]:
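def parsejson(data_json):
    """Fill the global dictionary d2 with reading -> sign name pairs."""
    for key, value in data_json["signs"].items():  # key is the sign name
        if "values" in value:                      # skip signs without readings
            for n in value["values"]:
                d2[n] = key
    return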
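
To make the shape of the result concrete, here is a minimal sketch on a hand-made fragment that mimics the structure `parsejson()` expects: a top-level `"signs"` object whose entries may carry a `"values"` list. The toy data (including the made-up entry `X1`) and the name `readings_to_names()` are illustrative assumptions, not part of OGSL or of this notebook. Unlike `parsejson()`, the sketch returns a new dictionary instead of filling the global `d2`.

```python
def readings_to_names(sign_list):
    """Map every reading to its sign name (hypothetical helper)."""
    mapping = {}
    for name, entry in sign_list["signs"].items():
        # Entries without a "values" list have no readings and are skipped
        for reading in entry.get("values", []):
            mapping[reading] = name
    return mapping

# Hand-made toy fragment, not real OGSL data
toy = {"signs": {"UD": {"values": ["u₄", "ud", "babbar"]},
                 "GAR": {"values": ["ninda", "gar", "niŋ₂"]},
                 "X1": {}}}
toy_dict = readings_to_names(toy)
print(toy_dict["babbar"], toy_dict["ninda"])
```

Returning the dictionary keeps the helper independent of global state, which may be convenient for testing; the notebook itself relies on the global `d2` in later cells.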

## 3 Process the JSON
In the main process the file `ogsl-sl.json` is extracted from the zip and made into a JSON object (with the `json.loads()` function). This object is sent to the `parsejson()` function defined above.

In [None]:
d2 = {}
file = "jsonzip/ogsl.zip"
z = zipfile.ZipFile(file) 
filename = "ogsl/ogsl-sl.json"
signlist = z.read(filename).decode('utf-8')
data_json = json.loads(signlist)                # make it into a json object (essentially a dictionary)
parsejson(data_json)  
with open('output/ogsl_dict.p', 'wb') as p:
    pickle.dump(d2, p)  
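
The read-decode-parse sequence can be tried out without downloading anything. The sketch below builds a tiny ZIP archive in memory and reads one member back in the same way; the member name `toy/toy-sl.json` and its content are invented for the example. It also uses `with` blocks so that the archive is closed automatically, which is one possible variation on the cell above.

```python
import io
import json
import zipfile

# Build a tiny ZIP archive in memory; member name and content are made up
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy/toy-sl.json",
                     json.dumps({"signs": {"UD": {"values": ["ud"]}}}))

# Read the member back: bytes -> str -> Python dictionary
with zipfile.ZipFile(buffer) as archive:
    raw = archive.read("toy/toy-sl.json").decode("utf-8")
toy_json = json.loads(raw)
print(toy_json["signs"]["UD"]["values"])
```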

## 4 Inspect the Results in Dataframe
This DataFrame is only for inspection - it is not otherwise used in the code below.

In [None]:
ogsl = pd.DataFrame.from_dict(d2, orient='index', columns = ["Name"]).sort_values(by = 'Name')
ogsl[1000:1025]

## 5 Open BDTNS Data
We can now open the dataframe with the [BDTNS](http://bdtns.filol.csic.es) transliterations. This dataframe was pickled in the notebook `2_4_Data_Acquisition_BDTNS.ipynb`. The dataframe has five fields: `id_text` (the [BDTNS](http://bdtns.filol.csic.es) number of a document), `id_line` (a continuous line numbering, stored as an integer, that starts at 1 for each new document), `label` (the regular, human-legible [BDTNS](http://bdtns.filol.csic.es) line number), `text` (the transliteration of the line) and `comments` (any comments added to the line in [BDTNS](http://bdtns.filol.csic.es)).

In [None]:
file = 'output/bdtns.p'
bdtns = pd.read_pickle(file)
bdtns

## 6 Tokenizing Signs
In order to search by sign, we need to tokenize the signs in the transliteration column (`text`) and ignore elements such as question marks or (half-)brackets. The first step is to define the different types of separators, operators, and flags that may be present in the text or in a sign name. The most common separators are space and hyphen. Curly brackets are placed around determinatives (semantic classifiers), as in {d}En-lil₂ ("the god Enlil"). Curly brackets and hyphens will be replaced by spaces. The separators in `separators2` are used in compound signs, as in |SI.A| or |ŠU+NIGIN|. Operators, finally, are also used in compound signs and indicate how the signs are written in relation to each other (on top of each other, one inside the other, etc.). Compound signs that represent a sequence of simple signs (|SI.A| for **dirig** or |A.TU.GAB.LIŠ| for **asal₂**) will be decomposed into their component signs. Compound signs of the type |KA×GAR| (for **gu₇**) are not analyzed, but their component parts are aligned with [OGSL](http://oracc.org/ogsl) practices (that is, |KA×NINDA| will be rewritten as |KA×GAR|, because in OGSL GAR is the name of the sign that can be read **ninda** or **gar**).

Finally the flags include various characters that may appear in the transliteration but will be ignored in the search. A search for `ninda`, therefore, will find `ninda`, `[nin]da`, `ninda?`, etc., as well as `gar`, `⸢gar⸣`, `gar!`, etc. (but not `nagar`, see below).

The variable `table` represents a table in which each character in `flags` corresponds to `None`. This is used by the `translate()` method; see below.
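
As a small illustration of how such a translation table behaves, the sketch below applies it to a few transliterations of the kind mentioned above. The example strings are chosen for illustration only.

```python
flags = "][!?<>⸢⸣⌈⌉*/"
table = str.maketrans(dict.fromkeys(flags))  # every flag character maps to None

examples = ["[nin]da", "ninda?", "⸢gar⸣", "gar!"]
cleaned = [s.translate(table) for s in examples]
print(cleaned)  # ['ninda', 'ninda', 'gar', 'gar']
```

Characters mapped to `None` are simply deleted by `translate()`, so damaged or uncertain readings collapse onto the plain sign value before the lookup in the sign list.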

In [None]:
separators = ['{', '}', '-']
separators2 = ['.', '+', '|']  # used in compound signs
operators = ['&', '%', '@', '×']
flags = "][!?<>⸢⸣⌈⌉*/"
table = str.maketrans(dict.fromkeys(flags))

In [None]:
def signs(row):  
    row_l = []
    sign_names = []
    sign_sequence = ''
    row = row.translate(table).lower()  # remove flags, half brackets, square brackets.
    row = row.replace('...', 'x')
    for s in separators: # first split row into signs   
        row = row.replace(s, ' ').strip()
    s_l = row.split()
    s_l = [d2.get(sign, sign) for sign in s_l]
    # Now take care of some special situations: signs with qualifiers, compound signs.
    for sign in s_l:
        if sign[-1] == ')' and '(' in sign: # qualified sign - get only the qualifier
            sign = sign.split('(')[1][:-1]
            sign = d2.get(sign, sign)
        if '.' in sign or '+' in sign: 
            for s in separators2:
                sign = sign.replace(s, ' ').strip() 
            sign_l = sign.split()
            row_l.extend(sign_l)
            continue
        if '×' in sign: # compound sign of the type |KA×GAR|
            sign_l = sign.replace('|', '').split('×')
            #replace individual signs of the compound by OGSL names
            sign_l = [d2.get(sign, sign) for sign in sign_l] 
            sign = '|' +'×'.join(sign_l) + '|'
        row_l.append(sign)
    # add a space before and after each line so that every sign name is enclosed in spaces
    return ' ' + ' '.join(row_l).upper() + ' ' 
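
The effect of the two central steps in `signs()`, decomposing sequence compounds and padding the result with spaces, can be seen in a stripped-down sketch. The dictionary `toy_d2` and the function `toy_tokenize()` are hypothetical stand-ins written for this illustration; they leave out flags, qualified signs, and compounds with `×`.

```python
# Toy reading -> sign name dictionary (hand-made; the real data comes from OGSL)
toy_d2 = {"ninda": "GAR", "gar": "GAR", "nagar": "NAGAR", "diri": "|SI.A|"}

def toy_tokenize(line):
    """Simplified stand-in for signs(): no flags, qualifiers or × compounds."""
    names = []
    for reading in line.lower().replace("-", " ").split():
        name = toy_d2.get(reading, reading)
        # decompose sequence compounds such as |SI.A| into SI and A
        for sep in ".+|":
            name = name.replace(sep, " ")
        names.extend(name.split())
    # pad with spaces so that every sign name is delimited on both sides
    return " " + " ".join(names).upper() + " "

print(repr(toy_tokenize("ninda diri")))   # ' GAR SI A '
print(" GAR " in toy_tokenize("nagar"))   # False
```

Because every sign name is surrounded by spaces, a search for ` GAR ` cannot accidentally match inside ` NAGAR `, which is why a search for `ninda` does not return `nagar`.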

In [None]:
bdtns["sign_names"] = bdtns["text"].progress_apply(signs)
bdtns.to_pickle('output/bdtns_tokenized.p')

## 7 The Search Function
The search function takes as input any style of transliteration recognized in [OGSL](http://oracc.org/ogsl) in upper or lower case (see the search instructions below).

The search engine will find any matching sequence of signs, independent of the transliteration, thus 'nig2 sig' will also find 'ninda sig'.

The search results are listed in a DataFrame, limited to the number of hits set by `maxhits`. If the `links` option is enabled, the `id_text` column provides links to the [BDTNS](http://bdtns.filol.csic.es) pages of the matching texts.

In [None]:
digi = '0123456789x'
inde = '₀₁₂₃₄₅₆₇₈₉ₓ'
char1 = '{}-cjĝ*'
char2 = '   šŋŋ×'
index = str.maketrans(digi, inde)
char = str.maketrans(char1, char2)
ind = re.compile(r'[a-zŋḫṣšṭA-ZŊḪṢŠṬ][0-9x]{1,2}') 
anchor = '<a href="http://bdtns.filol.csic.es/{}" target="_blank">{}</a>'
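
The translation tables and the regular expression defined above normalize the search string before it is looked up in the sign list. The sketch below repeats those definitions so that it runs on its own and wraps the normalization in a hypothetical helper, `normalize()`, which is not part of the notebook.

```python
import re

digi = '0123456789x'
inde = '₀₁₂₃₄₅₆₇₈₉ₓ'
index = str.maketrans(digi, inde)                  # digits -> subscript digits
char = str.maketrans('{}-cjĝ*', '   šŋŋ×')         # separators and alternative spellings
ind = re.compile(r'[a-zŋḫṣšṭA-ZŊḪṢŠṬ][0-9x]{1,2}')  # a letter followed by an index

def normalize(query):
    """Hypothetical helper: lower-case, unify spellings, convert sign indexes."""
    query = query.lower().replace('sz', 'š').translate(char).strip()
    return re.sub(ind, lambda m: m.group().translate(index), query)

for q in ['sig7', '{gesz}taskarin', 'KU6-x-MUSZEN']:
    print(q, '->', normalize(q))
```

The regular expression only converts digits (or `x`) that directly follow a letter, so a free-standing `x` used as a wildcard is left untouched.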

In [None]:
def search(search, maxhits, links): 
    search = search.lower().replace('sz', 'š').translate(char).strip()
    search = re.sub(ind, lambda m: m.group().translate(index), search)
    search_l = search.split()
    search_l = [d2.get(s,s) for s in search_l]
    row_l = []
    for sign in search_l: 
        if '.' in sign or '+' in sign: 
            for s in separators2:
                sign = sign.replace(s, ' ').strip() 
                sign_l = sign.split()
            row_l.extend(sign_l)
        elif '×' in sign:
            sign_l = sign.replace('|', '').split('×')
            sign_l = [d2.get(sign, sign) for sign in sign_l]
            sign = '|' + '×'.join(sign_l) + '|'
            row_l.append(sign)
        else: 
            row_l.append(sign)
    #row_l = [re.escape(s) for s in row_l]
    signs = ' '.join(row_l).upper()
    signs_esc = re.escape(' ' + signs + ' ') # add space before and after the search so that each sign representation is enclosed in spaces
    signs_esc = signs_esc.replace(r'\ X\ ', r'(?:\ [^ ]+)*\ ')  # X is a wildcard: zero or more signs
    show = ['id_text', 'label', 'text']
    #results = bdtns[show].loc[bdtns['sign_names'].str.contains('(?:(?<=\s)|(?<=^))'+signs+'(?=\s|$)', regex=True)].copy()
    #results = bdtns[show].loc[bdtns['sign_names'].str.contains(signs_esc, regex=True)].copy()
    results = bdtns.loc[bdtns['sign_names'].str.contains(signs_esc, regex=True), show].copy()
    hits = len(results)
    if maxhits > hits: 
        maxhits = hits
    print(signs)
    print(str(hits) + ' hits; ' + str(maxhits) + ' displayed')
    results = results[:maxhits]
    if links:
        results['id_text'] = [anchor.format(val,val) for val in results['id_text']]
        results = results.style
    return results
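
The wildcard handling in `search()` can be examined in isolation. The sketch below rebuilds the regular expression for a sequence of sign names in a hypothetical helper, `build_pattern()`, and tests it against a few hand-written, space-padded lines. It assumes, as the search function does, that `re.escape()` escapes spaces.

```python
import re

def build_pattern(sign_names):
    """Hypothetical helper: turn 'HA X HU' into the regex used for matching."""
    escaped = re.escape(' ' + sign_names + ' ')   # spaces are escaped as '\ '
    # X stands for zero or more whole signs between its neighbours
    return escaped.replace(r'\ X\ ', r'(?:\ [^ ]+)*\ ')

pattern = build_pattern('HA X HU')
lines = [' HA HU ', ' HA A HU ', ' HU HA ', ' HA HUL ']
print([bool(re.search(pattern, line)) for line in lines])
```

The first two lines match (zero signs and one sign between HA and HU); the last two do not, because the order is reversed or because HUL is a different sign name.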

## 8 Search Instructions
Search for a sequence of sign values in any transliteration system recognized by [OGSL](http://oracc.org/ogsl). Thus, sugal₇, sukkal, or luh, in upper or lower case will all return the same results.

Determinatives (semantic classifiers) may be entered between curly brackets or as regular signs. Thus, gesz taskarin, gesz-taskarin, {gesz}taskarin, and {ŋeš}tug₂ will all yield the same results. 

Signs may be connected with spaces or hyphens.

The Shin may be represented by š, c, or sz in upper or lower case; nasal g may be represented as j, ŋ, or ĝ.

Sign indexes may be represented by regular numbers or by index numbers (sig₇ or sig7).

Compound signs (such as diri) are resolved into their component signs if the compound represents a simple sequence of signs. Thus diri is resolved as SI A, but gu₇ is resolved as |KA×GAR|.

To search for a compound sign by sign name, enter it between pipes (|). The "times" sign may be represented by \* (enter |UR₂×A| or |UR₂\*A|, but not |URxA|).

Wildcard: x or X represents any number of signs in between (e.g. ku6-x-muszen will find all places where HA is followed by HU with zero or more signs in between).

In [None]:
# Creating an interface
button = widgets.Button(description='Search')
text = widgets.Text(
       value='',
       description='', )
maxhits = widgets.IntSlider(
        value=25,
        min=25,
        max=100000,
        step=25,
        description='Max hits:')
links = widgets.Checkbox(
    value=True,
    description='Display Links')
out = widgets.Output()
def on_button_clicked(_):
    # link the search function to the output widget
    with out:
        # what happens when the button is pressed
        clear_output()
        display(search(text.value, maxhits.value, links.value))
            
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
line = widgets.HBox([text, maxhits])
widgets.VBox([line,links,button,out])

## 9 Alternative Interface
The following alternative interface is much simpler in its coding (essentially letting the `@interact` line do all the work). To be useful, this interface requires a fairly fast machine because the search updates live while you type. The interface uses the same search function as above, so search instructions and results are the same.

In [None]:
@interact
def q(Search = '', maxhits = 25, links = True): 
    return search(Search, maxhits, links)