# Extract Signs from ORACC JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract signs from the texts of one or more projects.

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os
import pickle

## 0 Create Directories, if Necessary
The script needs two directories, `jsonzip` and `output`. Each is created if it does not yet exist; an existing directory is left untouched.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # ignore "already exists"; re-raise anything else
            raise
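
On Python 3.2 and later the same result can be obtained without the `errno` check, because `os.makedirs()` accepts an `exist_ok` flag. The sketch below is an alternative to the cell above, not the notebook's own code; it works inside a temporary directory, an assumption made only so that the example leaves no files behind.

```python
import os
import tempfile

# Work inside a throwaway directory so the example has no side effects.
base = tempfile.mkdtemp()

for d in ['jsonzip', 'output']:
    path = os.path.join(base, d)
    os.makedirs(path, exist_ok=True)   # create the directory if it is missing
    os.makedirs(path, exist_ok=True)   # a second call is a no-op, not an error

created = sorted(os.listdir(base))
print(created)
```

In the notebook itself the `base` prefix would be dropped, so that the directories are created next to the notebook.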

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately; they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [3]:
projects = input('Project(s): ').lower()

Project(s): epsd2/admin/u3adm


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [4]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`

## 1.3 Download the ZIP files
For each project in the list download all the `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 
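
The naming rule can be sketched as a small helper. The function name `zip_location` is hypothetical and introduced only for illustration; the base URL and the `jsonzip` directory are the ones named in this notebook.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"   # URL given in the text above

def zip_location(project):
    """Return (url, local_path) for a project or subproject name."""
    name = project.strip().lower().replace('/', '-')   # saao/saa01 -> saao-saa01
    return BASE_URL + name + ".zip", "jsonzip/" + name + ".zip"

for proj in ["dcclt", "cams/gkab", "epsd2/admin/u3adm"]:
    print(zip_location(proj))
```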

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time, so the file is written to disk in chunks; the `iter_content()` function in the `requests` library takes care of that.

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section [2.1 The `parsejson()` function](#head21).

In [None]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, stream=True)   # stream=True defers the body, so iter_content() really downloads chunk by chunk
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")

## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered. For more information about the data hierarchy in the [ORACC](http://oracc.org) `json` files, see [ORACC Open Data](http://oracc.museum.upenn.edu/doc/opendata/index.html).

The argument of the `parsejson()` function is a `JSON` object, a dictionary that initially holds the entire contents of the original JSON file. The code takes the key `cdl`, whose value is an array (a list) of `JSON` objects. While iterating through these objects, the function calls itself on any object that contains another `cdl` node. In this way it digs deeper and deeper into the `JSON` tree until it no longer encounters a `cdl` key; at that point it has reached the level of individual words. The code then checks for a key `f`. If it exists, the signs are in the node `gdl` within the `f` node, and each sign is appended to the list `signs_l`, which is defined outside the function.

In [5]:
def parsejson(text):
    analyze = False
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if "f" in JSONobject:
            id_word = JSONobject["ref"]
            for sign in JSONobject["f"]["gdl"]:
                signs = sign
                if "c" in sign:  # DIRI sign of unknown reading, like |KI.AN|
                    signs["v"] = sign['c']
                elif "seq" in sign: 
                    if "sexified" in sign: # add (diš) etc. to numbers
                        signs["v"] = sign["sexified"]
                    elif "form" in sign:  # fully qualified numbers
                        signs["form"] = sign["form"]
                    else: # determinatives
                        for s in sign["seq"]:
                            if "v" in s:
                                signs["v"] = s["v"]
                                signs["id"] = s["id"]
                elif "qualified" in sign: # like kuₓ(DU)
                    for s in sign["qualified"]:
                        if "s" in s:
                            signs["s"] = s["s"]
                        if "c" in s: # like gilgamešₓ(|BIL₃.GA.MES|)
                            compl = s["c"] 
                            if "seq" in s:
                                compound = []
                                analyze = True
                                for e in s["seq"]:
                                    if "o" in e and not e["o"] == "beside":
                                        analyze = False # leave signs like dulₓ(URxA) alone
                                        signs["s"] = compl
                                        break
                                    if "s" in e:
                                        comp = signs.copy()
                                        comp["s"] = e["s"]
                                        comp["id_word"] = id_word
                                        #print(comp["s"])
                                        compound.append(comp.copy())
                                if analyze:
                                    signs_l.extend(compound)
                elif "group" in sign: # ligatures
                    ligature = []
                    analyze = True
                    for c in sign["group"]:
                        if "v" in c:
                            lig = signs.copy()
                            lig["v"] = c["v"]
                            lig["id_word"] = id_word
                            ligature.append(lig.copy())
                    signs_l.extend(ligature)
                if not analyze:
                    signs["id_word"] = id_word
                    signs_l.append(signs)
                else:
                    analyze = False
    return
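
To see the recursion in isolation, the sketch below walks a hand-made, heavily simplified structure and collects one value per sign. This is not real ORACC data: only the keys `cdl`, `f`, `ref`, `gdl`, and `v` mirror what the function above expects, and the function name `collect_values` and the sample IDs are illustrative assumptions.

```python
# A toy document: `cdl` nodes nest, and only the innermost ones carry an `f` node.
toy = {
    "cdl": [
        {"cdl": [
            {"cdl": [
                {"ref": "P000001.1.1", "f": {"gdl": [{"v": "udu"}]}},
                {"ref": "P000001.1.2", "f": {"gdl": [{"v": "šu"}, {"v": "gid₂"}]}},
            ]},
        ]},
        {"note": "a node without `cdl` or `f` is simply skipped"},
    ]
}

def collect_values(node, out):
    """Recursively descend through `cdl` lists, appending (word id, value) pairs to `out`."""
    for child in node["cdl"]:
        if "cdl" in child:              # a deeper level: recurse
            collect_values(child, out)
        if "f" in child:                # word level: read the signs in `gdl`
            for sign in child["f"]["gdl"]:
                if "v" in sign:
                    out.append((child["ref"], sign["v"]))
    return out

pairs = collect_values(toy, [])
print(pairs)
```

Unlike `parsejson()`, this sketch passes the result list as an argument instead of relying on a global, which makes it easier to test on its own. It also ignores every special case (compound signs, numbers, qualified readings, ligatures) that the full function handles.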

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip`, named PROJECT.zip. 

Each of these files is extracted from the `zip` file and read with the command `json.loads()`, which reads the json data and transforms it into a Python dictionary (a sequence of keys and values).

This dictionary, called `data_json` in the code below, is now sent to the `parsejson()` function (where it is received as the parameter `text`). The function adds signs to the `signs_l` list.

In [6]:
signs_l = []                          # parsejson() appends every sign to this list
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the files in the corpusjson directory, which holds all the P, Q, and X numbers
    files = [name for name in files if "corpusjson" in name and name.endswith('.json')]
    for filename in files:                   # iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        try:
            text = z.read(filename).decode('utf-8')   # read and decode the json file of one particular text
            data_json = json.loads(text)              # make it into a json object (essentially a dictionary)
            parsejson(data_json)                      # and send to the parsejson() function
        except Exception:                             # report the text and move on to the next one
            print(id_text + ' is not available or not complete')

epsd2/admin/u3adm/P511905 is not available or not complete
epsd2/admin/u3adm/P511471 is not available or not complete
epsd2/admin/u3adm/P109084 is not available or not complete
epsd2/admin/u3adm/P511973 is not available or not complete
epsd2/admin/u3adm/P504596 is not available or not complete
epsd2/admin/u3adm/P414560 is not available or not complete
epsd2/admin/u3adm/P512114 is not available or not complete
epsd2/admin/u3adm/P109115 is not available or not complete
epsd2/admin/u3adm/P511467 is not available or not complete
epsd2/admin/u3adm/P511979 is not available or not complete
epsd2/admin/u3adm/P105380 is not available or not complete
epsd2/admin/u3adm/P512156 is not available or not complete
epsd2/admin/u3adm/P414575 is not available or not complete
epsd2/admin/u3adm/P497673 is not available or not complete
epsd2/admin/u3adm/P476069 is not available or not complete
epsd2/admin/u3adm/P474548 is not available or not complete
epsd2/admin/u3adm/P474558 is not available or not complete
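
The file selection and the error handling can be tried out without downloading anything. The sketch below builds a small ZIP archive in memory; its file names and contents are made up for illustration and only imitate the `corpusjson` layout described above.

```python
import io
import json
import zipfile

# Build a small ZIP in memory; the file names imitate the layout described above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("blms/corpusjson/P414332.json", json.dumps({"cdl": []}))
    z.writestr("blms/corpusjson/P000000.json", "{ not valid json")
    z.writestr("blms/metadata.json", json.dumps({"project": "blms"}))

project = "blms"
parsed, failed = [], []
with zipfile.ZipFile(buffer) as z:
    # keep only the JSON files inside the corpusjson directory
    names = [n for n in z.namelist() if "corpusjson" in n and n.endswith(".json")]
    for name in names:
        id_text = project + name[-13:-5]      # the 8 characters before ".json", e.g. "/P414332"
        try:
            json.loads(z.read(name).decode("utf-8"))
            parsed.append(id_text)
        except json.JSONDecodeError:          # malformed file: record it and continue
            failed.append(id_text)

print(parsed, failed)
```

The slice `[-13:-5]` assumes that every file name ends in a slash, a seven-character text ID, and `.json`; file names of a different length would produce a wrong `id_text`.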

## 3 Data Structuring
### 3.1 Transform the Data into a DataFrame


In [44]:
sign_df = pd.DataFrame(signs_l)
sign_df = sign_df.fillna('')      # replace NaN (Not a Number) with empty string
sign_df

Unnamed: 0,break,breakEnd,breakStart,c,delim,det,form,gdl_collated,gdl_remarked,gdl_type,...,pos,q,qualified,queried,role,s,seq,statusStart,v,x
0,,,,,,,3(diš),,,,...,,,,,,,"[{'r': '3'}, {'v': 'diš'}]",,,
1,,,,,,,,,,,...,,,,,,,,,udu,
2,,,,,,,6(diš),,,,...,,,,,,,"[{'r': '6'}, {'v': 'diš'}]",,,
3,,,,,,,,,,,...,,,,,,,,,u₈,
4,,,,,,,2(diš),,,,...,,,,,,,"[{'r': '2'}, {'v': 'diš'}]",,,
5,,,,,-,,,,,,...,,,,,,,,,maš₂,
6,,,,,,,,,,,...,,,,,,,,,gal,
7,,,,,-,,,,,,...,,,,,,,,,šu,
8,,,,,,,,,,,...,,,,,,,,,gid₂,
9,,,,,-,,,,,,...,,,,,,,,,e₂,


In [45]:
sign_df["sign"] = sign_df[["v", "s", "form", "x"]].apply(lambda r: ''.join(r), axis=1)

In [46]:
sign_df["sign"] = ["xxx" if sign == "ellipsis" else sign for sign in sign_df["sign"]]

In [47]:
sign_df2 = sign_df[["id", "id_word", "sign"]].copy()

In [49]:
sign_df2['id_line'] = [int(wordid.split('.')[1]) for wordid in sign_df2['id_word']]
sign_df2["id_text"] = [wordid[:7] for wordid in sign_df2["id_word"]]
sign_df2["id_word"] = [int(wordid.split('.')[2]) for wordid in sign_df2["id_word"]]
sign_df2["id"] = [i[:i.rfind(".")] for i in sign_df2["id"]]
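
The cell above relies on the IDs being dot-separated. A minimal sketch of that decomposition follows, assuming word IDs of the form `TEXT.LINE.WORD` and sign IDs with one further component; the sample IDs and the helper names `split_word_id` and `parent_id` are made up for illustration.

```python
def split_word_id(word_id):
    """Split an ID of the form TEXT.LINE.WORD into its three parts."""
    id_text, line, word = word_id.split('.')
    return id_text, int(line), int(word)

def parent_id(sign_id):
    """Drop the last dot-separated component of a sign ID."""
    return sign_id[:sign_id.rfind('.')]

print(split_word_id("P100001.3.1"))
print(parent_id("P100001.3.1.0"))
```

Converting the line and word numbers to integers matters for the later `groupby`, which sorts its keys: as strings, line `10` would sort before line `2`.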

In [50]:
with open("output/ogsl.p", "rb") as f:   # `f`, so that the project list `p` is not overwritten
    o = pickle.load(f)

In [51]:
val = list(o["value"])
utf = list(o["utf8"])
names = list(o["name"])

In [52]:
d = dict(zip(names, utf))
d2 = dict(zip(val,names))

In [53]:
sign_df2["name"] = [d2[sign.lower()] if sign.lower() in d2 else sign for sign in sign_df2["sign"]]

In [54]:
sign_df2["utf8"] = [d[name] if name in d else name for name in sign_df2["name"]]
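
The two lookups form a chain: sign value to sign name, then sign name to cuneiform character, with the input passed through unchanged whenever a lookup fails. The sketch below shows this with a tiny hand-picked excerpt instead of the full tables from `output/ogsl.p`; the helper name `to_cuneiform` is an assumption.

```python
# Tiny hand-picked excerpt; the real tables come from output/ogsl.p.
value_to_name = {"a₂": "A₂", "bi": "BI", "u₄": "UD"}
name_to_utf8 = {"A₂": "𒀉", "BI": "𒁉", "UD": "𒌓"}

def to_cuneiform(sign):
    """Look up value -> sign name -> character, falling back to the input at each step."""
    name = value_to_name.get(sign.lower(), sign)
    return name_to_utf8.get(name, name)

result = [to_cuneiform(s) for s in ["a₂", "bi", "U₄", "x"]]
print(result)
```

`dict.get(key, default)` expresses the same fallback as the `if ... in ... else` expressions in the cells above.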

In [55]:
sign_df3 = sign_df2.groupby(["id_text", "id_line", "id_word", "id"]).agg({"utf8": ''.join}).reset_index()

In [38]:
sign_df4 = sign_df3.groupby("id_text").agg({"utf8": ' '.join}).reset_index()

In [39]:
with open("output/ur3.p", "wb") as f:   # `f`, so that the project list `p` is not overwritten
    pickle.dump(sign_df4, f)

In [40]:
text = ' '.join(sign_df4["utf8"])
with open("output/ur3.txt", 'w', encoding="utf8") as t:
    t.write(text)

# Test

In [None]:
import jsonpath_rw as jp

all_ = []
form_path = jp.parse("$..f.form")     # JSONPath: the `form` of every `f` node, at any depth
for project in ["epsd2/admin/u3adm"]:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the files in the corpusjson directory, which holds all the P, Q, and X numbers
    files = [name for name in files if "corpusjson" in name and name.endswith('.json')]
    for filename in tqdm.tqdm(files):        # iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        try:
            text = z.read(filename).decode('utf-8')   # read and decode the json file of one particular text
            data_json = json.loads(text)              # make it into a json object (essentially a dictionary)
            s_l = [m.value for m in form_path.find(data_json)]   # collect the form of every word
            all_.extend(s_l)
        except Exception:                             # report the text and move on to the next one
            print(id_text + ' is not available or not complete')


In [58]:
file = "jsonzip/epsd2/admin/u3adm/corpusjson/P100001.json"
with open(file, "r", encoding="utf8") as r:   # ORACC JSON is UTF-8; do not rely on the platform default
    j = json.load(r)

In [61]:
import jsonpath_rw as jp

In [62]:
p = jp.parse("$..f.form")
l = p.find(j)
s_l = [s.value for s in l]

In [70]:
words_l = []
separators = ['|', '.', '{', '}', '-']
for e in s_l:
    for s in separators:
        e = e.replace(s, ' ').strip()
    l = e.split()
    words_l.append(l)           
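
The same splitting can be written with a translation table, which replaces all separators in one pass. This is a sketch of an equivalent approach, not the notebook's own code; the helper name `split_form` is an assumption, and the sample forms are taken from the output table at the end of this notebook.

```python
SEPARATORS = "|.{}-"                                    # same five characters as in the cell above
TABLE = str.maketrans(dict.fromkeys(SEPARATORS, " "))   # map each separator to a space

def split_form(form):
    """Return the list of sign readings in a transliterated word form."""
    return form.translate(TABLE).split()

for form in ["a₂-bi", "2/3(diš)-kam", "u₄"]:
    print(form, split_form(form))
```

As in the cell above, `|` and `.` are treated as separators, so a compound sign name written between vertical bars is broken up into its components.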

In [None]:
with open("output/ogsl.p", "rb") as f:   # `f`, so that the JSONPath expression `p` is not overwritten
    o = pickle.load(f)

In [None]:
val = list(o["value"])
utf = list(o["utf8"])
names = list(o["name"])

In [None]:
d = dict(zip(names, utf))
d2 = dict(zip(val,names))

In [76]:
names_l = []
utf8_l = []
for w in words_l:
    seq = [d2[s.lower()] if s.lower() in d2 else s for s in w]
    names_l.append(seq)
    utf8 = [d[n] if n in d else n for n in seq]
    utf8_l.append(''.join(utf8))

Unnamed: 0,frag,names,signs,utf-8
0,a₂-bi,"[A₂, BI]","[a₂, bi]",𒀉𒁉
1,u₄,[UD],[u₄],𒌓
2,x,[X],[x],X
3,5(u),[5(U)],[5(u)],𒐐
4,4(diš),[4(DIŠ)],[4(diš)],𒐉
5,2/3(diš)-kam,"[ŠANABI, |HI×BAD|]","[2/3(diš), kam]",𒑛𒄰
6,8(geš₂),[8(GEŠ₂)],[8(geš₂)],𒐜
7,3(u),[|U.U.U|],[3(u)],𒌍
8,5(diš),[5(DIŠ)],[5(diš)],𒐊
9,guruš,[GURUŠ],[guruš],𒄨
