# Extract Lemmatization from JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more projects. The resulting `csv` (Comma Separated Values) file is named `parsed.csv` and has two fields: a Text ID (e.g. `dcclt/Q000039`) and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts).

The output of the Basic Parser contains *only* text IDs and lemmas. This format is useful for so-called [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) techniques such as word clouds or [topic modeling](https://en.wikipedia.org/wiki/Topic_model). The `JSON` files, however, contain a wealth of other data, including language (Sumerian, Akkadian, Emesal, etc.), orthographic form, morphology (currently only for Sumerian and Emesal), line numbers, breakage, meta-data, etc. These data are not extracted in the parser demonstrated here, but the example code should provide a model for how any type of data can be extracted.
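
As a minimal sketch of what a bag-of-words representation looks like, the following counts lemma frequencies in a single document string. The sample string is made up for illustration, in the `cf[gw]pos` format described above; it is not taken from an actual ORACC text.

```python
from collections import Counter

# Hypothetical lemma string for one document, in the format produced by this parser.
doc = "šarru[king]N ana[to]PRP šarru[king]N awīlu[man]N"

# Lemmas are separated by spaces, so a plain split() tokenizes the document.
bag = Counter(doc.split())
print(bag.most_common(2))
```

Word order is discarded in such a representation; only the frequency of each lemma is kept.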

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately; they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [2]:
projects = input('Project(s): ').lower()

Project(s): rimanum,glass


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.


In [3]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`

## 1.3 Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [4]:
try:
    os.mkdir('jsonzip')
except OSError as exc:
    if exc.errno != errno.EEXIST:   # re-raise anything other than "directory already exists"
        raise

## 1.4 Download the ZIP files
For each project in the list download all the `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 
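
The naming rule can be sketched as a small helper. The names `BASE_URL` and `zip_url` are introduced here for illustration only; they do not appear elsewhere in the notebook.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"   # base URL given in the text

def zip_url(project):
    """Return the download URL for a project or subproject name (hypothetical helper)."""
    # A subproject such as cams/gkab is stored as cams-gkab.zip
    return BASE_URL + project.replace("/", "-") + ".zip"

print(zip_url("dcclt"))
print(zip_url("cams/gkab"))
```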

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time, and it may be necessary to download the file in chunks. The `iter_content()` method of a `requests` response writes the file chunk by chunk; passing `stream=True` to `requests.get()` keeps the whole file from being loaded into memory first.

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section [2.1 The `parsejson()` function](#head21).

In [5]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')       # subprojects: saao/saa01 becomes saao-saa01
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, stream=True)        # stream=True: do not load the whole file into memory
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")

  0%|                                                    | 0/2 [00:00<?, ?it/s]

Downloading http://build-oracc.museum.upenn.edu/json/rimanum.zip saving as jsonzip/rimanum.zip


 50%|██████████████████████                      | 1/2 [00:03<00:03,  3.58s/it]

Downloading http://build-oracc.museum.upenn.edu/json/glass.zip saving as jsonzip/glass.zip


100%|████████████████████████████████████████████| 2/2 [00:05<00:00,  3.04s/it]


## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered. For more information about the data hierarchy in the [ORACC](http://oracc.org) `json` files, see [ORACC Open Data](http://oracc.museum.upenn.edu/doc/opendata/index.html).

The first argument of the `parsejson()` function is a `JSON` object, a dictionary that initially contains the entire contents of the original `JSON` file. The code takes the key `cdl`, whose value is an array (a list) of `JSON` objects. Iterating through these objects, the function calls itself whenever an object contains another `cdl` node, passing that object as first argument. In this way the function digs deeper and deeper into the `JSON` tree until it no longer encounters a `cdl` key. At that point we are at the level of individual words. The code checks for a key `f`; if it exists, the value of that key is appended to the list `lemm_l`. This list is defined outside the function.

The field `id_text` is derived from the input list. It consists of a project abbreviation, such as `blms` or `cams/gkab` plus a text ID, in the format `cams/gkab/P338616` or `dcclt/Q000039`. The `id_text` is added as a second argument to the function and is added to the lemmatization data of every word.

In [6]:
def parsejson(text, id_text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, id_text)
        if "f" in JSONobject:
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text
            lemm_l.append(lemm)
    return
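
To see the recursion at work without downloading anything, here is a minimal sketch on a hand-made dictionary. The structure of `toy_text` is a drastically simplified assumption, not a real ORACC file, and the function `collect_lemmas` is a hypothetical variant of `parsejson()` that receives the output list as an argument instead of relying on a list defined outside the function.

```python
def collect_lemmas(node, id_text, out):
    """Recursively walk `cdl` nodes and append every `f` dictionary to `out`."""
    for child in node.get("cdl", []):
        if "cdl" in child:                    # a deeper level: recurse
            collect_lemmas(child, id_text, out)
        if "f" in child:                      # word level: keep the lemmatization data
            lemm = dict(child["f"])           # copy, so the input is not modified
            lemm["id_text"] = id_text
            out.append(lemm)
    return out

# Invented toy data: one outer node that holds two words and one non-word node.
toy_text = {
    "cdl": [
        {"cdl": [
            {"label": "o 1"},                 # no `f` key and no `cdl` key: skipped
            {"f": {"form": "LU₂", "cf": "awīlu", "gw": "man", "pos": "N"}},
            {"f": {"form": "x"}},             # unlemmatized word: only `form`
        ]},
    ]
}

words = collect_lemmas(toy_text, "toy/P000001", [])
for w in words:
    print(w)
```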

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip`, named PROJECT.zip. The `zip` file contains a file that is called `corpus.json` that contains a full list of all the text IDs available in that corpus (P, Q, and X numbers) under the key `members`. This list is used to identify the files that contain the text data and that will be parsed. The `zip` file contains a directory `corpusjson` that holds the text files - each one is called `P######.json` (or `Q######.json` or `X######.json`).

Each of these files is extracted from the `zip` file and read with the command `json.loads()`, which reads the json data and transforms it into a Python dictionary (a collection of keys and values).

This dictionary (`data_json` in the code below) is now sent to the `parsejson()` function, with the text ID as second argument. The function adds lemmata to the `lemm_l` list.

In [10]:
lemm_l = []
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)               # create a ZipFile object
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    corpus = z.read(project + "/corpus.json").decode("utf-8")   # read and decode the file corpus.json
    cat = json.loads(corpus)["members"]         # select the key "members", which holds all the P, Q, and X numbers
    for item in cat:                            # iterate over the P, Q, and X numbers
        id_text = project + '/' + item
        try:
            file = project + "/corpusjson/" + item + '.json'
            text = z.read(file).decode('utf-8')     # read and decode the json file of one particular text
            data_json = json.loads(text)            # make it into a dictionary
            parsejson(data_json, id_text)           # and send to the parsejson() function
        except (KeyError, ValueError):              # file missing from the ZIP, or invalid/incomplete JSON
            print(id_text + ' is not available or not complete')
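
The file layout described above can be tried out on a tiny archive built in memory. This is only a sketch: the project name `toy`, the text ID, and the file contents are invented, and `members` is assumed here to be a dictionary keyed by text ID (iterating over a list of IDs would work the same way).

```python
import io
import json
import zipfile

# Build a small ZIP archive in memory that mimics the layout described above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("toy/corpus.json", json.dumps({"members": {"P000001": {}}}))
    z.writestr("toy/corpusjson/P000001.json", json.dumps({"cdl": []}))

with zipfile.ZipFile(buffer) as z:
    # corpus.json lists the text IDs under the key "members"
    members = json.loads(z.read("toy/corpus.json").decode("utf-8"))["members"]
    ids = list(members)
    # each text lives in corpusjson/<ID>.json
    text = json.loads(z.read("toy/corpusjson/" + ids[0] + ".json").decode("utf-8"))
    # asking for a file that is not in the archive raises KeyError
    try:
        z.read("toy/corpusjson/P999999.json")
        missing = False
    except KeyError:
        missing = True

print(ids, text, missing)
```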

In [11]:
word_df = pd.DataFrame(lemm_l).fillna('')
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense
0,,,,,,2(BARIG),"[{'n': 'n', 'form': '2(BARIG)', 'id': 'P405287...",,rimanum/P405287,akk-x-oldbab,,,,n,
1,,qēmu,,,N,ZI₃,"[{'s': 'ZI₃', 'id': 'P405287.3.2.0', 'logolang...",flour,rimanum/P405287,akk-x-oldbab,,qēmu,,N,flour
2,,US₂,,,AJ,US₂,"[{'ho': '1', 's': 'US₂', 'logolang': 'sux', 'r...",lesser quality,rimanum/P405287,akk-x-oldbab,,US₂,,AJ,lesser quality
3,,ana,,,PRP,a-na,"[{'ho': '1', 'break': 'damaged', 'v': 'a', 'id...",to,rimanum/P405287,akk-x-oldbab,,ana,,PRP,for
4,,tākultu,,,N,GEŠBUN,"[{'s': 'GEŠBUN', 'logolang': 'sux', 'role': 'l...",(cultic) meal,rimanum/P405287,akk-x-oldbab,,tākulti,,N,allocation
5,,awīlu,,,N,LU₂,"[{'s': 'LU₂', 'id': 'P405287.5.1.0', 'logolang...",man,rimanum/P405287,akk-x-oldbab,,awīl,,N,man
6,,Kisura,,,SN,KI.SUR.RA{ki},"[{'gdl_type': 'logo', 'group': [{'delim': '.',...",1,rimanum/P405287,akk-x-oldbab,,Kisura,,SN,1
7,,awīlu,,,N,LU₂,"[{'s': 'LU₂', 'id': 'P405287.6.1.0', 'logolang...",man,rimanum/P405287,akk-x-oldbab,,awīl,,N,man
8,,Gutû,,,GN,gu-ti-um,"[{'delim': '-', 'v': 'gu', 'id': 'P405287.6.2....",1,rimanum/P405287,akk-x-oldbab,,Gutûm,,GN,1
9,,u,,,CNJ,u₃,"[{'ho': '1', 'break': 'damaged', 'v': 'u₃', 'i...",and,rimanum/P405287,akk-x-oldbab,,u,,CNJ,and


## 3.1 Create a `lemma` column
The following code combines the `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) columns to create a new `lemma` column with the format `cf[gw]pos`, for instance `šarru[king]N` or `lugal[king]N`. Unlemmatized words do not have `cf`, `gw`, or `pos` - they only have `form` (the transliteration). The function therefore has a condition: if `cf` is empty, the format should be `form[NA]NA`. Alternatively, one may leave out non-lemmatized words altogether and create the `lemma` column by simply adding up `cf`, `gw`, and `pos`, as follows:
```python
word_df = word_df[word_df['cf'] != ''].copy()   # throw out rows with empty CF; copy() avoids a SettingWithCopyWarning
word_df['lemma'] = word_df['cf'] + '[' + word_df['gw'] + ']' + word_df['pos']
```

In [12]:
word_df["lemma"] = word_df.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) if r["cf"] != '' 
                                 else r['form'] + '[NA]NA', axis=1)
word_df[['id_text', 'lemma']]

Unnamed: 0,id_text,lemma
0,rimanum/P405287,2(BARIG)[NA]NA
1,rimanum/P405287,qēmu[flour]N
2,rimanum/P405287,US₂[lesser quality]AJ
3,rimanum/P405287,ana[to]PRP
4,rimanum/P405287,tākultu[(cultic) meal]N
5,rimanum/P405287,awīlu[man]N
6,rimanum/P405287,Kisura[1]SN
7,rimanum/P405287,awīlu[man]N
8,rimanum/P405287,Gutû[1]GN
9,rimanum/P405287,u[and]CNJ
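
The conditional in the cell above can also be written without `apply()`. The following is a sketch of one alternative, using `Series.where()` on a two-row toy DataFrame whose values are modeled on the output shown above; the variable names `toy`, `lemmatized`, and `fallback` are introduced here for illustration.

```python
import pandas as pd

# Toy data: one lemmatized word and one unlemmatized word (empty cf and gw).
toy = pd.DataFrame({
    "cf":   ["qēmu", ""],
    "gw":   ["flour", ""],
    "pos":  ["N", "n"],
    "form": ["ZI₃", "2(BARIG)"],
})

lemmatized = toy["cf"] + "[" + toy["gw"] + "]" + toy["pos"]   # cf[gw]pos
fallback = toy["form"] + "[NA]NA"                              # form[NA]NA

# where() keeps `lemmatized` in rows where the condition is True, else takes `fallback`
toy["lemma"] = lemmatized.where(toy["cf"] != "", fallback)
print(toy[["form", "lemma"]])
```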


## 3.2 Remove Spaces and Commas from the Lemma
Spaces and commas in the Guide Word may cause trouble: spaces interfere with tokenization in computational methods, and commas interfere with the Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed.

In [13]:
word_df.loc[:,'lemma'] = [x.replace(' ', '-') for x in word_df.loc[:,'lemma']]
word_df.loc[:,'lemma'] = [x.replace(',', '') for x in word_df.loc[:,'lemma']]
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense,lemma
0,,,,,,2(BARIG),"[{'n': 'n', 'form': '2(BARIG)', 'id': 'P405287...",,rimanum/P405287,akk-x-oldbab,,,,n,,2(BARIG)[NA]NA
1,,qēmu,,,N,ZI₃,"[{'s': 'ZI₃', 'id': 'P405287.3.2.0', 'logolang...",flour,rimanum/P405287,akk-x-oldbab,,qēmu,,N,flour,qēmu[flour]N
2,,US₂,,,AJ,US₂,"[{'ho': '1', 's': 'US₂', 'logolang': 'sux', 'r...",lesser quality,rimanum/P405287,akk-x-oldbab,,US₂,,AJ,lesser quality,US₂[lesser-quality]AJ
3,,ana,,,PRP,a-na,"[{'ho': '1', 'break': 'damaged', 'v': 'a', 'id...",to,rimanum/P405287,akk-x-oldbab,,ana,,PRP,for,ana[to]PRP
4,,tākultu,,,N,GEŠBUN,"[{'s': 'GEŠBUN', 'logolang': 'sux', 'role': 'l...",(cultic) meal,rimanum/P405287,akk-x-oldbab,,tākulti,,N,allocation,tākultu[(cultic)-meal]N
5,,awīlu,,,N,LU₂,"[{'s': 'LU₂', 'id': 'P405287.5.1.0', 'logolang...",man,rimanum/P405287,akk-x-oldbab,,awīl,,N,man,awīlu[man]N
6,,Kisura,,,SN,KI.SUR.RA{ki},"[{'gdl_type': 'logo', 'group': [{'delim': '.',...",1,rimanum/P405287,akk-x-oldbab,,Kisura,,SN,1,Kisura[1]SN
7,,awīlu,,,N,LU₂,"[{'s': 'LU₂', 'id': 'P405287.6.1.0', 'logolang...",man,rimanum/P405287,akk-x-oldbab,,awīl,,N,man,awīlu[man]N
8,,Gutû,,,GN,gu-ti-um,"[{'delim': '-', 'v': 'gu', 'id': 'P405287.6.2....",1,rimanum/P405287,akk-x-oldbab,,Gutûm,,GN,1,Gutû[1]GN
9,,u,,,CNJ,u₃,"[{'ho': '1', 'break': 'damaged', 'v': 'u₃', 'i...",and,rimanum/P405287,akk-x-oldbab,,u,,CNJ,and,u[and]CNJ
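
The same replacements can be expressed with the pandas string accessor. This sketch uses three sample lemmas; the first two come from the output above, the third (`x[a, b]N`) is invented to show what happens to a comma.

```python
import pandas as pd

lemmas = pd.Series(["US₂[lesser quality]AJ", "tākultu[(cultic) meal]N", "x[a, b]N"])

# regex=False treats the pattern as a literal string
cleaned = (lemmas.str.replace(" ", "-", regex=False)
                 .str.replace(",", "", regex=False))
print(cleaned.tolist())
```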


## 3.3 Group by Textid
Get all the lemmas that belong to a single text in one row (one row = one document). The `agg()` (aggregate) function, which works on the result of a `groupby()` process, aggregates columns of the original dataframe. The function takes a dictionary in which the keys are column names and the values are functions to be used in the aggregation process. The example below has only one such function (`' '.join` will join all entries in the column `lemma` with a space in between); one may specify (the same or different) functions for different columns, for instance:
> `word_df = word_df.groupby("id_text").agg({"lemma": ' '.join, "base": ' '.join})`

In [14]:
word_df = word_df.groupby("id_text").agg({"lemma": ' '.join})
word_df.reset_index()

Unnamed: 0,id_text,lemma
0,glass/P282518,x[NA]NA ahāzu[take]V x[NA]NA nabalkutu[cross-o...
1,glass/P282519,x[NA]NA ruqqû[process-oil]V lu-x-x[NA]NA x[NA]...
2,glass/P282520,šumma[if]MOD šamnu[oil]N ša[of]DET asanītu[(an...
3,glass/P282611,kalû[totality]N x[NA]NA ša[of]DET x[NA]NA mû[w...
4,glass/P282617,šumma[if]MOD 2(BAN₂)[NA]NA šamnu[oil]N qanû[re...
5,glass/P282618,x[NA]NA ina[in]PRP 2[NA]NA ūmu[day]N ina[in]PR...
6,glass/P393786,inūma[when]SBJ uššu[foundation(s)]N kūru[kiln]...
7,glass/P394484,inūma[when]SBJ uššu[foundation(s)]N kūru[kiln]...
8,glass/P395291,x[NA]NA x-ša-x[NA]NA x[NA]NA ina[in]PRP utūnu[...
9,glass/P395468,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA zakû[p...
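
A minimal sketch of the aggregation on invented data, with a different column (`form`) aggregated alongside `lemma`. The text IDs and words are made up for illustration.

```python
import pandas as pd

# Toy word-level data: two words in the first text, one in the second.
toy = pd.DataFrame({
    "id_text": ["toy/P000001", "toy/P000001", "toy/P000002"],
    "lemma":   ["šumma[if]MOD", "šamnu[oil]N", "ūmu[day]N"],
    "form":    ["šum-ma", "I₃", "U₄"],
})

# One row per text; the entries of each listed column are joined with a space.
docs = toy.groupby("id_text").agg({"lemma": " ".join, "form": " ".join}).reset_index()
print(docs)
```

Columns that are not named in the dictionary are dropped from the result.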


## 4 Save Results in CSV file
The output file is called `parsed.csv` and is saved in the directory `output`. On most computers, `csv` files open automatically in Excel. Excel does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding="utf-8"` to `encoding="utf-16"`. For usage in computational text analysis applications `utf-8` is usually preferred.

(Alternatively, use the instructions [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0) to import a `utf-8` file into Excel).

In [15]:
savefile = 'parsed.csv'
os.makedirs('output', exist_ok=True)      # create the directory `output` if it does not exist yet
with open('output/' + savefile, 'w', encoding="utf-8") as w:
    word_df.to_csv(w)
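
To check that a file written this way can be read back, here is a sketch of a round trip through a temporary directory. The one-row DataFrame is invented and stands in for the grouped `word_df`, which has `id_text` as its index.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the grouped dataframe: id_text is the index, lemma the only column.
docs = pd.DataFrame(
    {"lemma": ["šumma[if]MOD šamnu[oil]N"]},
    index=pd.Index(["toy/P000001"], name="id_text"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parsed.csv")
    docs.to_csv(path, encoding="utf-8")                       # the index is written as the first column
    restored = pd.read_csv(path, index_col="id_text", encoding="utf-8")

print(restored)
```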