# Represent ORACC Texts in UTF-8 Cuneiform
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract Akkadian text in lemmatizations, transliterations and cuneiform from one or more [ORACC](http://oracc.org) projects. 

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import os
import sys
import pickle
import re
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. If they do not exist they are created, else: do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
directories = ['jsonzip', 'output', 'corpus']
make_dirs(directories)

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately, they are not included in the main project. For instance:

blms, akklove, cams/anzu, cams/barutu, cams/gkab, cams/selbi, ccpo, cmawro, cmawro/cmawr2, cmawro/maqlu, dcclt, dcclt/nineveh, dcclt/signlists, glass, hbtin, riao, ribo/babylon0, ribo/babylon1, ribo/babylon2, ribo/babylon3, ribo/babylon4, ribo/babylon5, ribo/babylon6, ribo/babylon7, ribo/babylon8, ribo/babylon10, rimanum, rinap/rinap1, rinap/rinap3, rinap/rinap4, rinap/rinap5, saao/saa01, saao/saa02, saao/saa03, saao/saa04, saao/saa05, saao/saa06, saao/saa07, saao/saa08, saao/saa09, saao/saa10, saao/saa11, saao/saa12, saao/saa13, saao/saa14, saao/saa15, saao/saa16, saao/saa17, saao/saa18, saao/saa19, saao/saa20, saao/saa21, suhu

In [3]:
projects = input('Project(s): ').lower()

Project(s): blms, akklove, cams/anzu, cams/barutu, cams/gkab, cams/selbi, ccpo, cmawro, cmawro/cmawr2, cmawro/maqlu, dcclt, dcclt/nineveh, dcclt/signlists, glass, hbtin, riao, ribo/babylon0, ribo/babylon1, ribo/babylon2, ribo/babylon3, ribo/babylon4, ribo/babylon5, ribo/babylon6, ribo/babylon7, ribo/babylon8, ribo/babylon10, rimanum, rinap/rinap1, rinap/rinap3, rinap/rinap4, rinap/rinap5, saao/saa01, saao/saa02, saao/saa03, saao/saa04, saao/saa05, saao/saa06, saao/saa07, saao/saa08, saao/saa09, saao/saa10, saao/saa11, saao/saa12, saao/saa13, saao/saa14, saao/saa15, saao/saa16, saao/saa17, saao/saa18, saao/saa19, saao/saa20, saao/saa21, suhu


## 1.2 Split the List of Projects and Download the ZIP files.
Split the list of projects and create a list of project names, using the `format_project_list()` and `oracc_download()` functions in the `utils` module. The code of these functions is discussed in more detail in 2.1.0. Download ORACC JSON Files.

In [4]:
p = format_project_list(projects)
oracc_download(p)

## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered. For nore information about the data hierarchy in the [ORACC](http://oracc.org) `json` files, see [ORACC Open Data](http://oracc.museum.upenn.edu/doc/opendata/index.html).

The argument of the `parsejson()` function is a `JSON` object, a dictionary that initially contains the entire contents of the original JSON file. The code takes the key `cdl` which itself contains an array (a list) of `JSON` objects. Iterating through these objects, if an object contains another `cdl` node, the function calls itself with this object as first argument. This way the function digs deeper and deeper into the `JSON` tree, until it does not encounter a `cdl` key anymore. Here we are at the level of individual words. The code checks for a key `f`, if it exists the signs are in the node `gdl` within the `f` node. 

In [7]:
def parsejson_signs(text):
    for JSONobject in text["cdl"]:
        field = ''
        if "cdl" in JSONobject: 
            parsejson_signs(JSONobject)
        if "type" in JSONobject and JSONobject["type"] == "field-start":
            field = JSONobject["subtype"]
        if "f" in JSONobject and not field in ['sg', 'pr']: # skip the fields "sign" and "pronunciation"
                                # in lexical texts
            lang = JSONobject["f"]["lang"]     #[:3] == "akk": #only Akkadian and Akkadian dialects
            word = JSONobject["f"]
            f = word["form"]
            if "sexified" in word["gdl"][0]:
                f = word["gdl"][0]["sexified"]
            if "cf" in word:
                if 'pos' in word:  #for some reason some words appear without pos. Provisionally treated as Noun
                    lemm = word["cf"] + '[' + word["gw"] + "]" + word["pos"]
                else:
                    lemm = word["cf"] + '[' + word["gw"] + "]N"
                lemm = lemm.replace(' ', '-') # remove commas and spaces from lemm
                lemm = lemm.replace(',', '')
            else:
                lemm = word["form"] # if word is unlemmatized
            all_.append(f)
            lemm_.append(lemm)
            lang_.append(lang)
    return

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip`, named PROJECT.zip. 

Each of these files is extracted from the `zip` file and read with the command `json.loads()`, which reads the json data and transforms it into a Python dictionary (a sequence of keys and values).

This dictionary, which is called `text` is now sent to the `parsejson()` function. The function adds signs to the `sign_l` list.

In [8]:
all_ = []
lemm_ = []
ids_ = []
lang_ = []
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except:
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']                                                                                                  #that holds all the P, Q, and X numbers.
    for filename in tqdm.tqdm(files):                            #iterate over the file names
        id_no = filename[-13:-5]
        if id_no in ids_ and not "X" in id_no: # Check if P/Q number is already in there
            continue        # a text may appear in multiple projects
        id_text = project + id_no # id_text is, for instance, blms/P414332
        try:
            text = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            all_.append('Start'+id_text)
            lemm_.append('Start'+id_text)   # to keep all_ and lemm_ same length
            lang_.append('Start'+id_text)
            parsejson_signs(data_json)
            ids_.append(id_no)
            #print(filename)
        except:
            print(id_text + ' is not available or not complete')

  9%|███▋                                     | 36/395 [00:00<00:04, 86.55it/s]

blms/P357131 is not available or not complete


 31%|████████████▎                           | 122/395 [00:01<00:03, 87.08it/s]

blms/P384976 is not available or not complete


 64%|████████████████████████▉              | 252/395 [00:02<00:01, 117.33it/s]

blms/P384968 is not available or not complete


100%|███████████████████████████████████████| 395/395 [00:03<00:00, 114.55it/s]
100%|█████████████████████████████████████████| 32/32 [00:00<00:00, 109.58it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 19.87it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 22.93it/s]
 45%|██████████████████▏                     | 266/585 [00:04<00:07, 43.20it/s]

cams/gkab/P363695 is not available or not complete


100%|████████████████████████████████████████| 585/585 [00:09<00:00, 63.00it/s]
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 124.99it/s]
100%|███████████████████████████████████████| 205/205 [00:01<00:00, 103.05it/s]
0it [00:00, ?it/s]
  0%|                                                   | 0/82 [00:00<?, ?it/s]

cmawro/cmawr2/Q005082 is not available or not complete
cmawro/cmawr2/Q005112 is not available or not complete
cmawro/cmawr2/Q005074 is not available or not complete
cmawro/cmawr2/Q005087 is not available or not complete


  5%|██                                         | 4/82 [00:00<00:01, 39.92it/s]

cmawro/cmawr2/Q005093 is not available or not complete
cmawro/cmawr2/Q005073 is not available or not complete
cmawro/cmawr2/Q005064 is not available or not complete
cmawro/cmawr2/Q005043 is not available or not complete


 10%|████▏                                      | 8/82 [00:00<00:01, 39.12it/s]

cmawro/cmawr2/Q005038 is not available or not complete
cmawro/cmawr2/Q005066 is not available or not complete
cmawro/cmawr2/Q005100 is not available or not complete
cmawro/cmawr2/Q005067 is not available or not complete
cmawro/cmawr2/Q005104 is not available or not complete
cmawro/cmawr2/Q005105 is not available or not complete
cmawro/cmawr2/Q005070 is not available or not complete


 18%|███████▋                                  | 15/82 [00:00<00:01, 41.19it/s]

cmawro/cmawr2/Q005040 is not available or not complete
cmawro/cmawr2/Q005921 is not available or not complete
cmawro/cmawr2/Q005041 is not available or not complete
cmawro/cmawr2/Q005099 is not available or not complete
cmawro/cmawr2/Q005063 is not available or not complete
cmawro/cmawr2/Q005039 is not available or not complete


 26%|██████████▊                               | 21/82 [00:00<00:01, 41.33it/s]

cmawro/cmawr2/Q005068 is not available or not complete
cmawro/cmawr2/Q005065 is not available or not complete
cmawro/cmawr2/Q005052 is not available or not complete
cmawro/cmawr2/Q005114 is not available or not complete


 30%|████████████▊                             | 25/82 [00:00<00:01, 38.75it/s]

cmawro/cmawr2/Q005042 is not available or not complete
cmawro/cmawr2/Q005110 is not available or not complete
cmawro/cmawr2/Q005091 is not available or not complete
cmawro/cmawr2/Q005048 is not available or not complete
cmawro/cmawr2/Q005107 is not available or not complete
cmawro/cmawr2/Q005079 is not available or not complete
cmawro/cmawr2/Q005072 is not available or not complete


 40%|████████████████▉                         | 33/82 [00:00<00:01, 42.87it/s]

cmawro/cmawr2/Q005108 is not available or not complete
cmawro/cmawr2/Q005051 is not available or not complete
cmawro/cmawr2/Q005095 is not available or not complete
cmawro/cmawr2/Q005050 is not available or not complete
cmawro/cmawr2/Q005075 is not available or not complete
cmawro/cmawr2/Q005085 is not available or not complete
cmawro/cmawr2/Q005097 is not available or not complete
cmawro/cmawr2/Q005086 is not available or not complete


 51%|█████████████████████▌                    | 42/82 [00:00<00:00, 50.01it/s]

cmawro/cmawr2/Q005096 is not available or not complete
cmawro/cmawr2/Q005094 is not available or not complete
cmawro/cmawr2/Q005092 is not available or not complete
cmawro/cmawr2/Q005078 is not available or not complete
cmawro/cmawr2/Q005083 is not available or not complete


 59%|████████████████████████▌                 | 48/82 [00:01<00:00, 46.47it/s]

cmawro/cmawr2/Q005037 is not available or not complete
cmawro/cmawr2/Q005045 is not available or not complete
cmawro/cmawr2/Q005106 is not available or not complete
cmawro/cmawr2/Q005055 is not available or not complete
cmawro/cmawr2/Q005049 is not available or not complete
cmawro/cmawr2/Q005062 is not available or not complete
cmawro/cmawr2/Q005109 is not available or not complete


 67%|████████████████████████████▏             | 55/82 [00:01<00:00, 50.65it/s]

cmawro/cmawr2/Q005089 is not available or not complete
cmawro/cmawr2/Q005103 is not available or not complete
cmawro/cmawr2/Q005036 is not available or not complete
cmawro/cmawr2/Q005069 is not available or not complete
cmawro/cmawr2/Q005101 is not available or not complete


 74%|███████████████████████████████▏          | 61/82 [00:01<00:00, 49.85it/s]

cmawro/cmawr2/Q005058 is not available or not complete
cmawro/cmawr2/Q005056 is not available or not complete
cmawro/cmawr2/Q005077 is not available or not complete
cmawro/cmawr2/Q005111 is not available or not complete
cmawro/cmawr2/Q005102 is not available or not complete
cmawro/cmawr2/Q005098 is not available or not complete
cmawro/cmawr2/Q005046 is not available or not complete
cmawro/cmawr2/Q005076 is not available or not complete
cmawro/cmawr2/Q005059 is not available or not complete
cmawro/cmawr2/Q005084 is not available or not complete


 87%|████████████████████████████████████▎     | 71/82 [00:01<00:00, 58.31it/s]

cmawro/cmawr2/Q005081 is not available or not complete
cmawro/cmawr2/Q005053 is not available or not complete
cmawro/cmawr2/Q005047 is not available or not complete
cmawro/cmawr2/Q005088 is not available or not complete
cmawro/cmawr2/Q005057 is not available or not complete
cmawro/cmawr2/Q005044 is not available or not complete


 95%|███████████████████████████████████████▉  | 78/82 [00:01<00:00, 57.88it/s]

cmawro/cmawr2/Q005113 is not available or not complete
cmawro/cmawr2/Q005054 is not available or not complete
cmawro/cmawr2/Q005071 is not available or not complete
cmawro/cmawr2/Q005080 is not available or not complete


100%|██████████████████████████████████████████| 82/82 [00:01<00:00, 52.83it/s]
  0%|                                                    | 0/9 [00:00<?, ?it/s]

cmawro/maqlu/Q002709 is not available or not complete
cmawro/maqlu/Q002707 is not available or not complete


 22%|█████████▊                                  | 2/9 [00:00<00:00,  7.98it/s]

cmawro/maqlu/Q002713 is not available or not complete
cmawro/maqlu/Q002708 is not available or not complete


 44%|███████████████████▌                        | 4/9 [00:00<00:00,  9.38it/s]

cmawro/maqlu/Q002710 is not available or not complete
cmawro/maqlu/Q002712 is not available or not complete
cmawro/maqlu/Q002711 is not available or not complete


 78%|██████████████████████████████████▏         | 7/9 [00:00<00:00, 11.12it/s]

cmawro/maqlu/Q002705 is not available or not complete
cmawro/maqlu/Q002706 is not available or not complete


100%|████████████████████████████████████████████| 9/9 [00:00<00:00, 11.58it/s]
  1%|▎                                      | 29/4305 [00:00<00:15, 277.77it/s]

dcclt/P429699 is not available or not complete


  1%|▌                                      | 59/4305 [00:00<00:15, 273.36it/s]

dcclt/Q000086 is not available or not complete


  2%|▉                                     | 101/4305 [00:00<00:19, 214.33it/s]

dcclt/P274552 is not available or not complete


  3%|█▏                                    | 131/4305 [00:00<00:18, 222.71it/s]

dcclt/P349776 is not available or not complete
dcclt/P450974 is not available or not complete


  4%|█▎                                    | 151/4305 [00:00<00:19, 213.98it/s]

dcclt/P450780 is not available or not complete
dcclt/P381847 is not available or not complete


  4%|█▌                                    | 170/4305 [00:00<00:28, 147.56it/s]

dcclt/P365407 is not available or not complete


  5%|█▋                                    | 197/4305 [00:01<00:25, 164.31it/s]

dcclt/P370336 is not available or not complete


  5%|█▉                                    | 222/4305 [00:01<00:22, 182.34it/s]

dcclt/P450812 is not available or not complete
dcclt/P274484 is not available or not complete
dcclt/P347739 is not available or not complete


  6%|██▏                                   | 242/4305 [00:01<00:26, 154.34it/s]

dcclt/P381792 is not available or not complete


  6%|██▎                                   | 260/4305 [00:01<00:27, 146.15it/s]

dcclt/P345997 is not available or not complete
dcclt/P274488 is not available or not complete


  7%|██▊                                   | 313/4305 [00:01<00:24, 163.44it/s]

dcclt/P227761 is not available or not complete


  9%|███▎                                  | 380/4305 [00:02<00:25, 154.85it/s]

dcclt/P388211 is not available or not complete
dcclt/P247815 is not available or not complete
dcclt/X600019 is not available or not complete


  9%|███▌                                  | 404/4305 [00:02<00:24, 160.24it/s]

dcclt/P450800 is not available or not complete


 10%|███▋                                  | 423/4305 [00:02<00:29, 130.51it/s]

dcclt/P347726 is not available or not complete


 10%|███▉                                  | 445/4305 [00:02<00:29, 132.53it/s]

dcclt/P450934 is not available or not complete
dcclt/P385957 is not available or not complete


 11%|████▎                                 | 485/4305 [00:03<00:31, 121.88it/s]

dcclt/X800152 is not available or not complete


 12%|████▌                                 | 512/4305 [00:03<00:25, 145.89it/s]

dcclt/P349912 is not available or not complete


 12%|████▋                                 | 536/4305 [00:03<00:24, 156.05it/s]

dcclt/P285564 is not available or not complete
dcclt/P381829 is not available or not complete


 13%|█████                                 | 574/4305 [00:03<00:23, 160.44it/s]

dcclt/P271881 is not available or not complete


 14%|█████▎                                | 598/4305 [00:03<00:21, 173.67it/s]

dcclt/P274483 is not available or not complete
dcclt/P381848 is not available or not complete
dcclt/P349476 is not available or not complete
dcclt/P348984 is not available or not complete


 15%|█████▋                                | 638/4305 [00:03<00:17, 206.21it/s]

dcclt/X800084 is not available or not complete
dcclt/P264083 is not available or not complete


 16%|█████▉                                | 674/4305 [00:03<00:15, 232.31it/s]

dcclt/P347836 is not available or not complete
dcclt/X800153 is not available or not complete
dcclt/P282430 is not available or not complete


 16%|██████▏                               | 701/4305 [00:04<00:18, 198.67it/s]

dcclt/P385890 is not available or not complete
dcclt/P274553 is not available or not complete
dcclt/P282499 is not available or not complete
dcclt/P382247 is not available or not complete


 18%|██████▋                               | 763/4305 [00:04<00:22, 156.74it/s]

dcclt/P346075 is not available or not complete


 19%|███████                               | 797/4305 [00:04<00:18, 186.72it/s]

dcclt/P370325 is not available or not complete
dcclt/P274543 is not available or not complete


 19%|███████▎                              | 824/4305 [00:04<00:18, 191.32it/s]

dcclt/P450914 is not available or not complete


 20%|███████▍                              | 849/4305 [00:04<00:18, 184.81it/s]

dcclt/P348545 is not available or not complete
dcclt/P450756 is not available or not complete


 20%|███████▋                              | 872/4305 [00:05<00:18, 183.60it/s]

dcclt/P274563 is not available or not complete


 21%|███████▉                              | 893/4305 [00:05<00:22, 154.90it/s]

dcclt/P271710 is not available or not complete
dcclt/X800083 is not available or not complete
dcclt/P271742 is not available or not complete


 21%|████████                              | 920/4305 [00:05<00:19, 173.77it/s]

dcclt/P450941 is not available or not complete
dcclt/P230597 is not available or not complete
dcclt/Q000107 is not available or not complete
dcclt/P349850 is not available or not complete


 22%|████████▎                             | 940/4305 [00:05<00:19, 176.48it/s]

dcclt/Q003224 is not available or not complete


 22%|████████▍                             | 962/4305 [00:05<00:18, 183.39it/s]

dcclt/P345993 is not available or not complete


 23%|████████▋                             | 987/4305 [00:05<00:16, 195.31it/s]

dcclt/P346054 is not available or not complete


 24%|████████▊                            | 1027/4305 [00:05<00:19, 170.41it/s]

dcclt/Q000106 is not available or not complete


 25%|█████████▏                           | 1067/4305 [00:06<00:22, 142.15it/s]

dcclt/Q000831 is not available or not complete
dcclt/P381807 is not available or not complete
dcclt/P450763 is not available or not complete
dcclt/P386385 is not available or not complete


 25%|█████████▎                           | 1083/4305 [00:06<00:24, 129.88it/s]

dcclt/P274557 is not available or not complete
dcclt/P275016 is not available or not complete


 26%|█████████▌                           | 1110/4305 [00:06<00:20, 153.83it/s]

dcclt/P283795 is not available or not complete


 26%|█████████▋                           | 1129/4305 [00:06<00:21, 145.86it/s]

dcclt/P397726 is not available or not complete
dcclt/X600004 is not available or not complete
dcclt/P450813 is not available or not complete


 27%|█████████▊                           | 1148/4305 [00:06<00:24, 126.93it/s]

dcclt/P349952 is not available or not complete


 28%|██████████▏                          | 1187/4305 [00:06<00:19, 158.16it/s]

dcclt/P347779 is not available or not complete


 28%|██████████▍                          | 1216/4305 [00:07<00:18, 166.42it/s]

dcclt/P347126 is not available or not complete


 29%|██████████▋                          | 1237/4305 [00:07<00:23, 129.04it/s]

dcclt/P346053 is not available or not complete


 29%|██████████▊                          | 1254/4305 [00:07<00:22, 135.11it/s]

dcclt/P274549 is not available or not complete
dcclt/P381862 is not available or not complete
dcclt/P429487 is not available or not complete


 30%|███████████▏                         | 1304/4305 [00:07<00:21, 141.69it/s]

dcclt/P347745 is not available or not complete
dcclt/P347786 is not available or not complete


 31%|███████████▌                         | 1350/4305 [00:07<00:17, 171.84it/s]

dcclt/P349851 is not available or not complete
dcclt/P349917 is not available or not complete


 32%|███████████▊                         | 1378/4305 [00:08<00:15, 192.75it/s]

dcclt/P499089 is not available or not complete
dcclt/P394157 is not available or not complete


 33%|████████████                         | 1403/4305 [00:08<00:14, 206.45it/s]

dcclt/Q000205 is not available or not complete


 34%|████████████▍                        | 1452/4305 [00:08<00:11, 248.30it/s]

dcclt/P347742 is not available or not complete


 34%|████████████▋                        | 1483/4305 [00:08<00:10, 264.06it/s]

dcclt/P429485 is not available or not complete
dcclt/P451233 is not available or not complete
dcclt/P450804 is not available or not complete


 36%|█████████████▎                       | 1545/4305 [00:08<00:10, 262.29it/s]

dcclt/P347757 is not available or not complete


 37%|█████████████▌                       | 1574/4305 [00:08<00:15, 173.23it/s]

dcclt/P274562 is not available or not complete


 37%|█████████████▋                       | 1598/4305 [00:08<00:14, 187.32it/s]

dcclt/P450830 is not available or not complete
dcclt/P274545 is not available or not complete
dcclt/P345986 is not available or not complete


 38%|██████████████                       | 1641/4305 [00:09<00:12, 216.38it/s]

dcclt/P450754 is not available or not complete
dcclt/P272319 is not available or not complete
dcclt/P274491 is not available or not complete


 39%|██████████████▍                      | 1676/4305 [00:09<00:12, 218.63it/s]

dcclt/P349867 is not available or not complete


 40%|██████████████▋                      | 1716/4305 [00:09<00:10, 253.04it/s]

dcclt/P370359 is not available or not complete
dcclt/P349843 is not available or not complete


 42%|███████████████▎                     | 1788/4305 [00:09<00:08, 285.84it/s]

dcclt/P451215 is not available or not complete
dcclt/P271567 is not available or not complete


 42%|███████████████▋                     | 1821/4305 [00:09<00:09, 275.28it/s]

dcclt/P381795 is not available or not complete
dcclt/Q000093 is not available or not complete
dcclt/X999901 is not available or not complete


 43%|███████████████▉                     | 1852/4305 [00:09<00:08, 274.85it/s]

dcclt/P370356 is not available or not complete


 44%|████████████████▏                    | 1882/4305 [00:10<00:11, 213.27it/s]

dcclt/Q002278 is not available or not complete
dcclt/P451153 is not available or not complete


 44%|████████████████▍                    | 1907/4305 [00:10<00:16, 147.72it/s]

dcclt/P349948 is not available or not complete


 45%|████████████████▋                    | 1946/4305 [00:10<00:15, 152.27it/s]

dcclt/P349388 is not available or not complete
dcclt/P349858 is not available or not complete


 46%|████████████████▉                    | 1964/4305 [00:10<00:15, 154.47it/s]

dcclt/Q000091 is not available or not complete
dcclt/P381832 is not available or not complete
dcclt/P370365 is not available or not complete


 46%|█████████████████                    | 1982/4305 [00:10<00:15, 145.52it/s]

dcclt/X000181 is not available or not complete
dcclt/P370352 is not available or not complete


 47%|█████████████████▌                   | 2041/4305 [00:11<00:11, 190.58it/s]

dcclt/P381786 is not available or not complete


 48%|█████████████████▋                   | 2063/4305 [00:11<00:16, 136.47it/s]

dcclt/P451012 is not available or not complete


 49%|██████████████████                   | 2095/4305 [00:11<00:13, 164.42it/s]

dcclt/P451071 is not available or not complete
dcclt/P347731 is not available or not complete


 49%|██████████████████▏                  | 2117/4305 [00:11<00:13, 162.02it/s]

dcclt/P346045 is not available or not complete


 50%|██████████████████▌                  | 2158/4305 [00:11<00:17, 123.09it/s]

dcclt/P271724 is not available or not complete
dcclt/P499094 is not available or not complete
dcclt/Q003221 is not available or not complete


 51%|██████████████████▊                  | 2189/4305 [00:12<00:17, 122.84it/s]

dcclt/Q000110 is not available or not complete


 51%|███████████████████                  | 2214/4305 [00:12<00:14, 142.18it/s]

dcclt/P271551 is not available or not complete


 52%|███████████████████▏                 | 2231/4305 [00:12<00:14, 140.34it/s]

dcclt/P479337 is not available or not complete


 52%|███████████████████▎                 | 2248/4305 [00:12<00:15, 136.73it/s]

dcclt/P345975 is not available or not complete
dcclt/P274536 is not available or not complete


 53%|███████████████████▌                 | 2281/4305 [00:12<00:12, 164.87it/s]

dcclt/P347772 is not available or not complete
dcclt/P349913 is not available or not complete
dcclt/P387757 is not available or not complete


 53%|███████████████████▊                 | 2301/4305 [00:12<00:13, 148.46it/s]

dcclt/P424414 is not available or not complete
dcclt/P282498 is not available or not complete


 54%|████████████████████                 | 2337/4305 [00:13<00:13, 150.92it/s]

dcclt/Q000083 is not available or not complete


 55%|████████████████████▏                | 2354/4305 [00:13<00:12, 152.73it/s]

dcclt/P346017 is not available or not complete


 55%|████████████████████▌                | 2387/4305 [00:13<00:11, 172.75it/s]

dcclt/P450953 is not available or not complete


 56%|████████████████████▊                | 2422/4305 [00:13<00:09, 200.47it/s]

dcclt/P271899 is not available or not complete


 57%|█████████████████████                | 2455/4305 [00:13<00:08, 222.09it/s]

dcclt/P347753 is not available or not complete
dcclt/P450777 is not available or not complete
dcclt/P370343 is not available or not complete


 58%|█████████████████████▍               | 2492/4305 [00:13<00:07, 252.35it/s]

dcclt/P349934 is not available or not complete
dcclt/P282496 is not available or not complete


 59%|█████████████████████▊               | 2537/4305 [00:13<00:06, 289.86it/s]

dcclt/Q000314 is not available or not complete
dcclt/P247819 is not available or not complete
dcclt/P429521 is not available or not complete


 61%|██████████████████████▌              | 2622/4305 [00:13<00:05, 325.59it/s]

dcclt/P370332 is not available or not complete
dcclt/P347832 is not available or not complete


 63%|███████████████████████▏             | 2692/4305 [00:14<00:08, 192.73it/s]

dcclt/X800046 is not available or not complete
dcclt/P370334 is not available or not complete


 63%|███████████████████████▍             | 2727/4305 [00:14<00:07, 218.91it/s]

dcclt/P363629 is not available or not complete
dcclt/X999902 is not available or not complete


 64%|███████████████████████▋             | 2755/4305 [00:14<00:07, 203.67it/s]

dcclt/P282337 is not available or not complete
dcclt/P349857 is not available or not complete
dcclt/P368989 is not available or not complete


 65%|████████████████████████             | 2804/4305 [00:15<00:08, 173.48it/s]

dcclt/P381811 is not available or not complete
dcclt/P348779 is not available or not complete


 66%|████████████████████████▍            | 2846/4305 [00:15<00:08, 171.24it/s]

dcclt/P247830 is not available or not complete


 67%|████████████████████████▊            | 2883/4305 [00:15<00:09, 151.70it/s]

dcclt/P349893 is not available or not complete
dcclt/P451229 is not available or not complete
dcclt/P345999 is not available or not complete
dcclt/P423635 is not available or not complete


 67%|████████████████████████▉            | 2900/4305 [00:15<00:10, 130.42it/s]

dcclt/P282493 is not available or not complete


 68%|█████████████████████████            | 2915/4305 [00:15<00:10, 129.41it/s]

dcclt/P450749 is not available or not complete


 68%|█████████████████████████▏           | 2929/4305 [00:15<00:10, 127.70it/s]

dcclt/P247811 is not available or not complete


 69%|█████████████████████████▍           | 2957/4305 [00:16<00:08, 151.11it/s]

dcclt/P338313 is not available or not complete


 69%|█████████████████████████▌           | 2975/4305 [00:16<00:10, 131.36it/s]

dcclt/X000011 is not available or not complete
dcclt/P451001 is not available or not complete


 71%|██████████████████████████▎          | 3059/4305 [00:16<00:06, 194.10it/s]

dcclt/P256648 is not available or not complete
dcclt/P370379 is not available or not complete
dcclt/P387478 is not available or not complete


 72%|██████████████████████████▍          | 3082/4305 [00:16<00:07, 172.87it/s]

dcclt/P349911 is not available or not complete
dcclt/P370405 is not available or not complete
dcclt/P388202 is not available or not complete


 72%|██████████████████████████▋          | 3102/4305 [00:16<00:07, 150.98it/s]

dcclt/P278015 is not available or not complete
dcclt/P381750 is not available or not complete


 73%|██████████████████████████▉          | 3127/4305 [00:17<00:07, 168.16it/s]

dcclt/X800063 is not available or not complete


 73%|███████████████████████████          | 3146/4305 [00:17<00:06, 172.64it/s]

dcclt/P228096 is not available or not complete
dcclt/P271320 is not available or not complete


 74%|███████████████████████████▎         | 3183/4305 [00:17<00:05, 204.85it/s]

dcclt/P450857 is not available or not complete


 75%|███████████████████████████▌         | 3208/4305 [00:17<00:06, 174.20it/s]

dcclt/P349905 is not available or not complete
dcclt/P247857 is not available or not complete


 75%|███████████████████████████▊         | 3229/4305 [00:17<00:07, 152.34it/s]

dcclt/P381851 is not available or not complete


 75%|███████████████████████████▉         | 3248/4305 [00:17<00:09, 113.70it/s]

dcclt/P373780 is not available or not complete


 76%|████████████████████████████         | 3267/4305 [00:17<00:08, 125.88it/s]

dcclt/P228044 is not available or not complete


 77%|████████████████████████████▎        | 3297/4305 [00:18<00:06, 151.45it/s]

dcclt/P347733 is not available or not complete
dcclt/P451018 is not available or not complete
dcclt/P347758 is not available or not complete


 78%|████████████████████████████▋        | 3341/4305 [00:18<00:05, 173.99it/s]

dcclt/Q000074 is not available or not complete


 78%|████████████████████████████▉        | 3361/4305 [00:18<00:07, 124.79it/s]

dcclt/P451706 is not available or not complete


 78%|█████████████████████████████        | 3378/4305 [00:18<00:06, 134.26it/s]

dcclt/X800010 is not available or not complete
dcclt/P285556 is not available or not complete


 79%|█████████████████████████████▏       | 3399/4305 [00:18<00:06, 150.48it/s]

dcclt/P282501 is not available or not complete
dcclt/Q000316 is not available or not complete
dcclt/P365392 is not available or not complete


 80%|█████████████████████████████▌       | 3442/4305 [00:19<00:06, 132.21it/s]

dcclt/P347729 is not available or not complete


 81%|█████████████████████████████▉       | 3480/4305 [00:19<00:06, 135.92it/s]

dcclt/P370376 is not available or not complete


 81%|██████████████████████████████       | 3496/4305 [00:19<00:05, 139.73it/s]

dcclt/P332941 is not available or not complete
dcclt/Q000264 is not available or not complete


 82%|██████████████████████████████▎      | 3530/4305 [00:19<00:06, 124.88it/s]

dcclt/P274504 is not available or not complete
dcclt/P373927 is not available or not complete
dcclt/P247828 is not available or not complete


 82%|██████████████████████████████▌      | 3550/4305 [00:19<00:05, 140.15it/s]

dcclt/P347741 is not available or not complete
dcclt/P450801 is not available or not complete
dcclt/P282487 is not available or not complete


 83%|██████████████████████████████▋      | 3574/4305 [00:20<00:04, 157.61it/s]

dcclt/P450753 is not available or not complete
dcclt/P349937 is not available or not complete


 83%|██████████████████████████████▊      | 3592/4305 [00:20<00:04, 150.57it/s]

dcclt/Q000832 is not available or not complete


 84%|███████████████████████████████      | 3613/4305 [00:20<00:05, 120.72it/s]

dcclt/P349868 is not available or not complete


 84%|███████████████████████████████▎     | 3636/4305 [00:20<00:04, 138.45it/s]

dcclt/P363708 is not available or not complete
dcclt/P451221 is not available or not complete


 85%|███████████████████████████████▍     | 3653/4305 [00:20<00:05, 128.06it/s]

dcclt/P348656 is not available or not complete
dcclt/P370377 is not available or not complete


 85%|███████████████████████████████▌     | 3668/4305 [00:20<00:05, 108.20it/s]

dcclt/P381755 is not available or not complete


 86%|███████████████████████████████▊     | 3703/4305 [00:21<00:05, 112.00it/s]

dcclt/P348714 is not available or not complete


 87%|████████████████████████████████▎    | 3754/4305 [00:21<00:04, 123.23it/s]

dcclt/P349852 is not available or not complete
dcclt/P370327 is not available or not complete


 88%|████████████████████████████████▍    | 3779/4305 [00:21<00:03, 137.25it/s]

dcclt/P373929 is not available or not complete
dcclt/P451049 is not available or not complete


 88%|████████████████████████████████▌    | 3795/4305 [00:21<00:04, 103.36it/s]

dcclt/P349909 is not available or not complete


 90%|█████████████████████████████████▏   | 3867/4305 [00:22<00:02, 162.42it/s]

dcclt/P450779 is not available or not complete
dcclt/P451529 is not available or not complete
dcclt/P369429 is not available or not complete
dcclt/P450852 is not available or not complete


 90%|█████████████████████████████████▍   | 3892/4305 [00:22<00:03, 108.26it/s]

dcclt/P382242 is not available or not complete


 93%|██████████████████████████████████▍  | 4003/4305 [00:22<00:01, 195.57it/s]

dcclt/P349915 is not available or not complete
dcclt/P381865 is not available or not complete
dcclt/P385917 is not available or not complete


 94%|██████████████████████████████████▋  | 4033/4305 [00:23<00:01, 179.55it/s]

dcclt/P381781 is not available or not complete
dcclt/P451035 is not available or not complete
dcclt/P370363 is not available or not complete
dcclt/P247827 is not available or not complete


 95%|███████████████████████████████████▎ | 4110/4305 [00:23<00:01, 181.44it/s]

dcclt/P239209 is not available or not complete
dcclt/P282500 is not available or not complete


 97%|███████████████████████████████████▊ | 4174/4305 [00:23<00:00, 147.66it/s]

dcclt/P271733 is not available or not complete


 97%|████████████████████████████████████ | 4191/4305 [00:24<00:00, 139.01it/s]

dcclt/P451157 is not available or not complete


 98%|████████████████████████████████████▎| 4222/4305 [00:24<00:00, 117.30it/s]

dcclt/P346027 is not available or not complete
dcclt/P347787 is not available or not complete
dcclt/P228105 is not available or not complete
dcclt/P388213 is not available or not complete


100%|████████████████████████████████████▊| 4290/4305 [00:24<00:00, 161.79it/s]

dcclt/Q000109 is not available or not complete


100%|█████████████████████████████████████| 4305/4305 [00:24<00:00, 173.28it/s]
  2%|▋                                        | 12/664 [00:00<00:21, 30.77it/s]

dcclt/nineveh/P370417 is not available or not complete


  3%|█▍                                       | 23/664 [00:00<00:17, 37.23it/s]

dcclt/nineveh/P382647 is not available or not complete
dcclt/nineveh/P393814 is not available or not complete


  5%|██                                       | 34/664 [00:00<00:13, 45.91it/s]

dcclt/nineveh/P401791 is not available or not complete
dcclt/nineveh/P382643 is not available or not complete
dcclt/nineveh/P394149 is not available or not complete


  7%|██▊                                      | 46/664 [00:00<00:10, 56.22it/s]

dcclt/nineveh/P365275 is not available or not complete


  9%|███▊                                     | 61/664 [00:00<00:08, 68.33it/s]

dcclt/nineveh/P398414 is not available or not complete


 11%|████▍                                    | 72/664 [00:00<00:07, 75.47it/s]

dcclt/nineveh/P395481 is not available or not complete


 12%|█████                                    | 82/664 [00:01<00:10, 55.67it/s]

dcclt/nineveh/P365272 is not available or not complete


 14%|█████▋                                   | 92/664 [00:01<00:08, 63.74it/s]

dcclt/nineveh/P382648 is not available or not complete


 16%|██████▎                                 | 105/664 [00:01<00:07, 72.53it/s]

dcclt/nineveh/P386433 is not available or not complete


 19%|███████▍                                | 123/664 [00:01<00:06, 84.26it/s]

dcclt/nineveh/P365421 is not available or not complete
dcclt/nineveh/P395491 is not available or not complete
dcclt/nineveh/P395540 is not available or not complete
dcclt/nineveh/P385988 is not available or not complete


 20%|████████                                | 134/664 [00:01<00:07, 71.84it/s]

dcclt/nineveh/P346068 is not available or not complete
dcclt/nineveh/P395527 is not available or not complete


 23%|█████████▍                              | 156/664 [00:01<00:06, 84.04it/s]

dcclt/nineveh/P365387 is not available or not complete
dcclt/nineveh/P365398 is not available or not complete
dcclt/nineveh/P400632 is not available or not complete
dcclt/nineveh/P397726 is not available or not complete
dcclt/nineveh/P423632 is not available or not complete


 26%|██████████▍                             | 173/664 [00:01<00:05, 93.85it/s]

dcclt/nineveh/P397258 is not available or not complete


 30%|███████████▋                           | 198/664 [00:02<00:04, 114.04it/s]

dcclt/nineveh/P238446 is not available or not complete
dcclt/nineveh/P395719 is not available or not complete


 32%|████████████▊                           | 213/664 [00:02<00:05, 78.06it/s]

dcclt/nineveh/P393788 is not available or not complete


 35%|██████████████                          | 233/664 [00:02<00:04, 93.76it/s]

dcclt/nineveh/P395526 is not available or not complete
dcclt/nineveh/P395632 is not available or not complete


 38%|██████████████▋                        | 250/664 [00:02<00:03, 106.46it/s]

dcclt/nineveh/P365274 is not available or not complete


 40%|███████████████▋                       | 267/664 [00:02<00:03, 116.89it/s]

dcclt/nineveh/P365399 is not available or not complete
dcclt/nineveh/P289805 is not available or not complete


 42%|████████████████▌                      | 282/664 [00:02<00:03, 109.79it/s]

dcclt/nineveh/P423623 is not available or not complete
dcclt/nineveh/P238140 is not available or not complete


 44%|█████████████████▎                     | 295/664 [00:03<00:03, 107.50it/s]

dcclt/nineveh/P365314 is not available or not complete
dcclt/nineveh/P385899 is not available or not complete
dcclt/nineveh/P365385 is not available or not complete


 46%|██████████████████                     | 308/664 [00:03<00:03, 100.41it/s]

dcclt/nineveh/P393806 is not available or not complete
dcclt/nineveh/P365236 is not available or not complete


 48%|███████████████████▎                    | 320/664 [00:03<00:03, 88.99it/s]

dcclt/nineveh/P395652 is not available or not complete


 50%|███████████████████▉                    | 331/664 [00:03<00:03, 91.48it/s]

dcclt/nineveh/P365317 is not available or not complete
dcclt/nineveh/P365316 is not available or not complete


 54%|█████████████████████▌                  | 358/664 [00:03<00:03, 91.73it/s]

dcclt/nineveh/P400375 is not available or not complete


 59%|███████████████████████                | 392/664 [00:03<00:02, 101.50it/s]

dcclt/nineveh/P373795 is not available or not complete


 64%|█████████████████████████▊              | 428/664 [00:04<00:02, 80.98it/s]

dcclt/nineveh/P382585 is not available or not complete
dcclt/nineveh/P386429 is not available or not complete


 66%|██████████████████████████▍             | 439/664 [00:04<00:02, 86.40it/s]

dcclt/nineveh/P373894 is not available or not complete
dcclt/nineveh/P365245 is not available or not complete
dcclt/nineveh/P423635 is not available or not complete


 68%|███████████████████████████             | 450/664 [00:04<00:02, 85.54it/s]

dcclt/nineveh/P373851 is not available or not complete


 71%|████████████████████████████▍           | 472/664 [00:04<00:02, 89.59it/s]

dcclt/nineveh/P370415 is not available or not complete


 73%|█████████████████████████████▏          | 484/664 [00:05<00:02, 80.27it/s]

dcclt/nineveh/P399148 is not available or not complete
dcclt/nineveh/P365391 is not available or not complete
dcclt/nineveh/P373903 is not available or not complete


 75%|██████████████████████████████          | 499/664 [00:05<00:01, 91.30it/s]

dcclt/nineveh/P346056 is not available or not complete
dcclt/nineveh/P395476 is not available or not complete
dcclt/nineveh/P346052 is not available or not complete


 79%|██████████████████████████████▊        | 525/664 [00:05<00:01, 110.67it/s]

dcclt/nineveh/P395629 is not available or not complete
dcclt/nineveh/P423633 is not available or not complete


 82%|███████████████████████████████▊       | 542/664 [00:05<00:01, 121.18it/s]

dcclt/nineveh/P382606 is not available or not complete
dcclt/nineveh/P382646 is not available or not complete


 85%|█████████████████████████████████      | 562/664 [00:05<00:00, 126.62it/s]

dcclt/nineveh/P373913 is not available or not complete
dcclt/nineveh/P373867 is not available or not complete


 88%|██████████████████████████████████▍    | 587/664 [00:05<00:00, 146.23it/s]

dcclt/nineveh/P397568 is not available or not complete
dcclt/nineveh/P365320 is not available or not complete
dcclt/nineveh/P385911 is not available or not complete
dcclt/nineveh/P423640 is not available or not complete


 91%|███████████████████████████████████▍   | 604/664 [00:05<00:00, 123.60it/s]

dcclt/nineveh/P393770 is not available or not complete
dcclt/nineveh/P395535 is not available or not complete


 97%|█████████████████████████████████████▋ | 641/664 [00:06<00:00, 135.65it/s]

dcclt/nineveh/P395490 is not available or not complete


100%|███████████████████████████████████████| 664/664 [00:06<00:00, 151.33it/s]
  0%|                                                  | 0/305 [00:00<?, ?it/s]

dcclt/signlists/Q000154 is not available or not complete


  9%|███▍                                     | 26/305 [00:00<00:02, 99.15it/s]

dcclt/signlists/Q000153 is not available or not complete


 12%|████▊                                    | 36/305 [00:00<00:02, 93.39it/s]

dcclt/signlists/Q000159 is not available or not complete


 57%|██████████████████████▉                 | 175/305 [00:01<00:01, 92.81it/s]

dcclt/signlists/P370411 is not available or not complete


 62%|████████████████████████▋               | 188/305 [00:01<00:01, 79.10it/s]

dcclt/signlists/X003931 is not available or not complete


 68%|███████████████████████████             | 206/305 [00:01<00:01, 94.94it/s]

dcclt/signlists/X003934 is not available or not complete
dcclt/signlists/Q000155 is not available or not complete


 89%|██████████████████████████████████▋    | 271/305 [00:02<00:00, 132.24it/s]

dcclt/signlists/P257722 is not available or not complete


 94%|████████████████████████████████████▌  | 286/305 [00:02<00:00, 100.02it/s]

dcclt/signlists/P467315 is not available or not complete


100%|████████████████████████████████████████| 305/305 [00:03<00:00, 99.78it/s]
100%|██████████████████████████████████████████| 20/20 [00:00<00:00, 67.34it/s]
100%|████████████████████████████████████████| 485/485 [00:13<00:00, 36.70it/s]
100%|███████████████████████████████████████| 883/883 [00:06<00:00, 142.17it/s]
100%|██████████████████████████████████████████| 51/51 [00:00<00:00, 91.26it/s]
100%|█████████████████████████████████████████| 82/82 [00:00<00:00, 378.56it/s]
 29%|████████████▏                             | 11/38 [00:00<00:00, 95.64it/s]

ribo/babylon2/Q006275 is not available or not complete


100%|█████████████████████████████████████████| 38/38 [00:00<00:00, 112.08it/s]
100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 129.02it/s]
100%|███████████████████████████████████████████| 6/6 [00:00<00:00, 260.84it/s]
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 199.99it/s]
 34%|█████████████▉                           | 43/126 [00:00<00:01, 81.92it/s]

ribo/babylon6/Q003344 is not available or not complete


100%|████████████████████████████████████████| 126/126 [00:01<00:00, 94.28it/s]
100%|█████████████████████████████████████████| 30/30 [00:00<00:00, 127.12it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 64.10it/s]
100%|████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
100%|███████████████████████████████████████| 378/378 [00:01<00:00, 245.81it/s]
  0%|                                                   | 0/92 [00:00<?, ?it/s]

rinap/rinap1/Q003630 is not available or not complete


 33%|█████████████▋                            | 30/92 [00:00<00:01, 57.33it/s]

rinap/rinap1/Q003624 is not available or not complete
rinap/rinap1/Q003626 is not available or not complete


 49%|████████████████████▌                     | 45/92 [00:00<00:00, 67.02it/s]

rinap/rinap1/Q003625 is not available or not complete
rinap/rinap1/Q003634 is not available or not complete
rinap/rinap1/Q003633 is not available or not complete


 64%|██████████████████████████▉               | 59/92 [00:00<00:00, 64.22it/s]

rinap/rinap1/Q003629 is not available or not complete
rinap/rinap1/Q003627 is not available or not complete
rinap/rinap1/Q003628 is not available or not complete


 84%|███████████████████████████████████▏      | 77/92 [00:01<00:00, 76.75it/s]

rinap/rinap1/Q003622 is not available or not complete
rinap/rinap1/Q003623 is not available or not complete


100%|██████████████████████████████████████████| 92/92 [00:01<00:00, 84.15it/s]
 26%|██████████▌                              | 67/261 [00:01<00:05, 38.31it/s]

rinap/rinap3/Q004016 is not available or not complete


 69%|███████████████████████████▍            | 179/261 [00:02<00:01, 67.50it/s]

rinap/rinap3/Q003971 is not available or not complete


100%|████████████████████████████████████████| 261/261 [00:03<00:00, 73.74it/s]
 41%|████████████████▊                        | 75/183 [00:00<00:02, 37.63it/s]

rinap/rinap4/Q003344 is not available or not complete


100%|████████████████████████████████████████| 183/183 [00:02<00:00, 85.26it/s]
100%|████████████████████████████████████████| 140/140 [00:03<00:00, 35.68it/s]
100%|███████████████████████████████████████| 264/264 [00:01<00:00, 138.70it/s]
100%|██████████████████████████████████████████| 15/15 [00:00<00:00, 29.05it/s]
100%|██████████████████████████████████████████| 52/52 [00:00<00:00, 72.60it/s]
 32%|████████████▉                           | 115/354 [00:01<00:02, 88.30it/s]

saao/saa04/P336097 is not available or not complete
saao/saa04/P336343 is not available or not complete


 65%|█████████████████████████▏             | 229/354 [00:02<00:00, 140.94it/s]

saao/saa04/P237370 is not available or not complete


 77%|█████████████████████████████▊         | 271/354 [00:02<00:00, 154.50it/s]

saao/saa04/P336332 is not available or not complete


100%|███████████████████████████████████████| 354/354 [00:02<00:00, 130.11it/s]
100%|███████████████████████████████████████| 300/300 [00:01<00:00, 163.77it/s]
  4%|█▌                                      | 14/350 [00:00<00:02, 129.86it/s]

saao/saa06/P335176 is not available or not complete


  8%|███▏                                    | 28/350 [00:00<00:02, 131.25it/s]

saao/saa06/P335202 is not available or not complete


 12%|████▋                                   | 41/350 [00:00<00:02, 125.12it/s]

saao/saa06/P335322 is not available or not complete
saao/saa06/P335204 is not available or not complete


 15%|██████▏                                 | 54/350 [00:00<00:02, 122.89it/s]

saao/saa06/P335279 is not available or not complete


 26%|██████████▍                             | 91/350 [00:00<00:02, 120.43it/s]

saao/saa06/P335372 is not available or not complete


 29%|███████████▍                           | 103/350 [00:00<00:02, 113.73it/s]

saao/saa06/P335192 is not available or not complete


 69%|██████████████████████████▊            | 241/350 [00:02<00:00, 117.27it/s]

saao/saa06/P335226 is not available or not complete


100%|███████████████████████████████████████| 350/350 [00:03<00:00, 116.34it/s]
 33%|█████████████▎                          | 73/219 [00:00<00:01, 105.80it/s]

saao/saa07/P335792 is not available or not complete


100%|███████████████████████████████████████| 219/219 [00:01<00:00, 109.54it/s]
100%|███████████████████████████████████████| 568/568 [00:02<00:00, 212.61it/s]
100%|██████████████████████████████████████████| 11/11 [00:00<00:00, 53.65it/s]
100%|███████████████████████████████████████| 389/389 [00:03<00:00, 108.20it/s]
100%|███████████████████████████████████████| 234/234 [00:01<00:00, 187.16it/s]
 13%|█████▌                                    | 13/98 [00:00<00:01, 82.38it/s]

saao/saa12/P235242 is not available or not complete


 21%|█████████                                 | 21/98 [00:00<00:00, 78.48it/s]

saao/saa12/P285576 is not available or not complete


100%|██████████████████████████████████████████| 98/98 [00:01<00:00, 71.07it/s]
100%|███████████████████████████████████████| 210/210 [00:01<00:00, 155.85it/s]
  0%|                                                  | 0/479 [00:00<?, ?it/s]

saao/saa14/P335530 is not available or not complete


  8%|███▎                                    | 40/479 [00:00<00:02, 197.80it/s]

saao/saa14/P335415 is not available or not complete


 13%|█████                                   | 60/479 [00:00<00:02, 197.27it/s]

saao/saa14/P335587 is not available or not complete
saao/saa14/P335263 is not available or not complete


 16%|██████▍                                 | 77/479 [00:00<00:02, 185.13it/s]

saao/saa14/P335079 is not available or not complete


 19%|███████▋                                | 92/479 [00:00<00:02, 171.78it/s]

saao/saa14/P335107 is not available or not complete
saao/saa14/P335214 is not available or not complete


 27%|██████████▌                            | 130/479 [00:00<00:02, 172.13it/s]

saao/saa14/P335271 is not available or not complete


 32%|████████████▌                          | 155/479 [00:00<00:01, 185.56it/s]

saao/saa14/P334977 is not available or not complete
saao/saa14/P334991 is not available or not complete
saao/saa14/P335305 is not available or not complete


 37%|██████████████▍                        | 177/479 [00:00<00:01, 189.08it/s]

saao/saa14/P337155 is not available or not complete


 42%|████████████████▎                      | 200/479 [00:01<00:01, 198.19it/s]

saao/saa14/P336196 is not available or not complete


 46%|█████████████████▉                     | 220/479 [00:01<00:01, 195.80it/s]

saao/saa14/P335038 is not available or not complete
saao/saa14/P335943 is not available or not complete
saao/saa14/P335574 is not available or not complete


 50%|███████████████████▌                   | 240/479 [00:01<00:01, 188.67it/s]

saao/saa14/P336029 is not available or not complete
saao/saa14/P335257 is not available or not complete


 54%|█████████████████████                  | 259/479 [00:01<00:01, 185.63it/s]

saao/saa14/P335331 is not available or not complete
saao/saa14/P335525 is not available or not complete


 64%|████████████████████████▊              | 305/479 [00:01<00:00, 196.43it/s]

saao/saa14/P335197 is not available or not complete
saao/saa14/P335081 is not available or not complete
saao/saa14/P224949 is not available or not complete


 72%|████████████████████████████           | 344/479 [00:01<00:00, 178.75it/s]

saao/saa14/P336247 is not available or not complete
saao/saa14/P335459 is not available or not complete


 79%|██████████████████████████████▉        | 380/479 [00:02<00:00, 150.67it/s]

saao/saa14/P335080 is not available or not complete


 83%|█████████████████████████████████       | 396/479 [00:02<00:00, 85.22it/s]

saao/saa14/P335489 is not available or not complete
saao/saa14/P335539 is not available or not complete


 85%|██████████████████████████████████▏     | 409/479 [00:02<00:00, 94.63it/s]

saao/saa14/P335180 is not available or not complete
saao/saa14/P336194 is not available or not complete


 89%|██████████████████████████████████▌    | 424/479 [00:02<00:00, 100.71it/s]

saao/saa14/P335537 is not available or not complete


 95%|█████████████████████████████████████▏ | 457/479 [00:02<00:00, 120.68it/s]

saao/saa14/P335196 is not available or not complete
saao/saa14/P335154 is not available or not complete


100%|███████████████████████████████████████| 479/479 [00:03<00:00, 155.75it/s]
100%|███████████████████████████████████████| 389/389 [00:02<00:00, 141.83it/s]
100%|███████████████████████████████████████| 246/246 [00:01<00:00, 128.71it/s]
100%|████████████████████████████████████████| 207/207 [00:02<00:00, 97.06it/s]
100%|███████████████████████████████████████| 204/204 [00:01<00:00, 105.76it/s]
100%|███████████████████████████████████████| 229/229 [00:02<00:00, 102.73it/s]
100%|██████████████████████████████████████████| 55/55 [00:02<00:00, 27.26it/s]


jsonzip/saao-saa21.zip does not exist or is not a proper ZIP file


100%|██████████████████████████████████████████| 33/33 [00:00<00:00, 81.80it/s]


## 3 Data Structuring
### 3.1 Transform the Data into a DataFrame


In [9]:
words_l = []
separators = ['{', '}', '-']
separators2 = ['.', '+', '|']
operators = ['&', '%', '@', '×']
for e in tqdm.tqdm(all_):
    word = []
    if '1(šar₂{gal})' in e: # this cheating but it seems to work (appears in SKL 38)
            e = e.replace('1(šar₂{gal})', '1(šar₂)-gal')
    for s in separators: # first split word into signs   
        e = e.replace(s, ' ').strip()
    s_l = e.split()
    for sign in s_l:
        if sign[0].isdigit(): # 1(geš₂), 2(DIŠ), etc.
            sign = sign.lower()
        elif sign[-1] == ')': # qualified sign - get only the qualifier
            stack = []  # |GIŠ×(GIŠ%GIŠ)|(LAK277) becomes LAK277
            ind = {}    # LAK277(|GIŠ×(GIŠ%GIŠ)|) becomes |GIŠ×(GIŠ%GIŠ)|
            for i, c in reversed(list(enumerate(sign))):
                if c == ')':
                    stack.append(i)
                if c == '(':
                    ind[stack.pop()] = i   # find the opening parens that belongs to the closing parens at position -1    
            start = ind[len(sign)-1]   # this line fails on 1(šar₂{gal}) in SKL.
            t = sign[start+1:-1]
            if t.isupper(): #leave 1(diš) etc. alone
                sign = t
            
        if '|' in sign:  # separate |DU.DU| and |DU+DU| into its components but not |DU&DU|
                        # and also not |DU.DU&DU|
            flag = False
            for o in operators:
                if o in sign:
                    flag = True
            if not flag:
                for s in separators2:
                    sign = sign.replace(s, ' ').strip() 
                sign_l = sign.split()
                word.extend(sign_l)
                continue
        elif "+" in sign:  # + as marker of gloss
            sign = sign.replace('+', ' ').strip()
            sign_l = sign.split()
            word.extend(sign_l)
            continue
        word.append(sign)
    words_l.append(word)           

100%|█████████████████████████████| 1541908/1541908 [00:18<00:00, 83506.24it/s]


In [10]:
with open("output/ogsl.p", "rb") as f:
    o = pd.read_pickle(f)

In [11]:
val = list(o["value"])
utf = list(o["utf8"])
names = list(o["name"])

In [12]:
d = dict(zip(names, utf))
d2 = dict(zip(val,names))

In [13]:
names_l = []
utf8_l = []
for w in tqdm.tqdm(words_l):
    seq = [d2[s.lower()] if s.lower() in d2 else s for s in w]
    names_l.append(seq)
    utf8 = [d[n] if n in d else n for n in seq]
    utf8_l.append(''.join(utf8))

100%|████████████████████████████| 1541908/1541908 [00:14<00:00, 104965.49it/s]


In [14]:
df = pd.DataFrame({"transliteration":all_, "words":words_l, "names":names_l, "utf-8":utf8_l, "lemm" : lemm_, "lang" : lang_})
df

Unnamed: 0,transliteration,words,names,utf-8,lemm,lang
0,Startblms/P414332,[Startblms/P414332],[Startblms/P414332],Startblms/P414332,Startblms/P414332,Startblms/P414332
1,x-x,"[x, x]","[X, X]",XX,x-x,sux
2,dam-ŋu₁₀,"[dam, ŋu₁₀]","[DAM, MU]",𒁮𒈬,dam[spouse]N,sux
3,mu-ni-ib₂-be₂,"[mu, ni, ib₂, be₂]","[MU, NI, TUM, BI]",𒈬𒉌𒌈𒁉,e[speak]V/t,sux
4,x-ri,"[x, ri]","[X, RI]",X𒊑,x-ri,akk-x-stdbab
5,mu-ti-ma,"[mu, ti, ma]","[MU, TI, MA]",𒈬𒋾𒈠,mutu[husband]N,akk-x-stdbab
6,i-qab-bi,"[i, qab, bi]","[I, GABA, BI]",𒄿𒃮𒁉,qabû[say]V,akk-x-stdbab
7,x-x,"[x, x]","[X, X]",XX,x-x,sux
8,dumu-ŋu₁₀,"[dumu, ŋu₁₀]","[TUR, MU]",𒌉𒈬,dumu[child]N,sux
9,mu-ni-ib₂-be₂,"[mu, ni, ib₂, be₂]","[MU, NI, TUM, BI]",𒈬𒉌𒌈𒁉,e[speak]V/t,sux


In [15]:
with open("corpus/all_df.p", "wb") as w:
    pickle.dump(df, w)

# Save as Text
Save three different representations of the Akkadian text. Each representation is saved in a separate text file:
- in transliteration        ===> akk_tl.txt
- in lemmatized format   ===> akk_lemm.txt
- in cuneiform (utf-8)    ===> akk_utf8.txt

In [17]:
df_akk = df[df["lang"].str[:3].isin(["akk", "Sta"])]
rep_d = {"akk_utf8": "utf-8", "akk_lemm": "lemm", "akk_tl" : "transliteration"}
for rep in rep_d:
    text = ' '.join(df_akk[rep_d[rep]]).strip()
    text = text.replace(' Start', '\n').strip()
    text = text.replace('Start', '')
    text = re.sub(r'\n+', '\n', text)
    file = "corpus/" + rep + ".txt"
    with open(file, 'w', encoding="utf-8") as w:
        w.write(text)