# Extract Lemmatization from JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more texts. The resulting `csv` (Comma Separated Values) file has two fields: a Text ID (e.g. `dcclt/Q000039`) and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts).

The `JSON` files are available in a zipped format on [ORACC](http://oracc.org). They must be downloaded from [ORACC](http://oracc.org) and put in a directory called `jsonzip`. You may use the script [download_ORACC-JSON.ipynb](download_ORACC-JSON.ipynb) or download the files by hand.

The output of the Basic Parser does not recognize lines (all words/lemmas in a text are listed sequentially) and does not indicate breakage. These issues are of little consequence for [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) techniques such as [topic modeling](https://en.wikipedia.org/wiki/Topic_model), but may become serious limitations for other types of computational approaches. The Advanced Parser will introduce techniques to extract such data.

[ORACC](http://oracc.org) `JSON` files also include data on the sign level (reading, sign name, Unicode number, function, etc.). These data are not extracted by the parsers demonstrated here, but the example code should provide a model for how any type of data can be extracted.

In [1]:
import pandas as pd
import zipfile
import json

## 1.1 Input List of Text IDs
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421

The list should be created with a plain text editor such as TextEdit or Emacs, and the filename should end in `.txt`.
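
The ID format described above can be checked before parsing. The following is a minimal sketch, not part of the original notebook: the function name `is_valid_text_id` and the regular expression are assumptions, and the pattern presumes that project and subproject names consist of lowercase letters and digits only.

```python
import re

# Assumed pattern: one or more lowercase project/subproject segments,
# then P, Q, or X followed by exactly six digits.
TEXT_ID = re.compile(r"^[a-z0-9]+(?:/[a-z0-9]+)*/[PQX]\d{6}$")

def is_valid_text_id(text_id):
    """Return True if text_id looks like 'PROJECT/P######' (hypothetical helper)."""
    return TEXT_ID.match(text_id) is not None

for candidate in ["dcclt/P117395", "rinap/rinap1/Q003421", "dcclt/117395"]:
    print(candidate, is_valid_text_id(candidate))
```

This check is stricter than the parser below, which lowercases the project name and uppercases the ID number itself.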

In [2]:
filename = input('Filename: ')

Filename: test.txt


## 1.2 Open the List of Text IDs and Remove Spaces and Empty Lines
The list of text IDs is read; accidental spaces at the beginning or end of each line are removed, as are blank lines. The Advanced Parser can read a partial text when the beginning and the end of the desired segment are indicated in the list of text IDs, in the format `PROJECT/TEXTID start line - stop line`. Since the Basic Parser only parses full texts, we strip everything after the first space.


In [3]:
textids = 'text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.readlines()
pqxnos = [x.strip() for x in pqxnos]        # strip spaces left and right
pqxnos = [x for x in pqxnos if not x == ""] # strip empty lines
pqxnos = [x.split()[0] for x in pqxnos]     # strip everything after the first space
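
To see what the three cleaning steps do, here is a small self-contained sketch that applies them to an invented list of lines instead of a file. The sample lines and the variable names `raw_lines` and `cleaned` are made up for illustration.

```python
# Invented sample input: stray spaces, blank lines, and one line with a
# line range of the kind the Advanced Parser understands.
raw_lines = ["  dcclt/Q000039 \n", "\n", "blms/P282597 12 - 20\n", "   \n"]

cleaned = [line.strip() for line in raw_lines]    # trim surrounding whitespace
cleaned = [line for line in cleaned if line]      # drop lines that are now empty
cleaned = [line.split()[0] for line in cleaned]   # keep only the text ID

print(cleaned)  # ['dcclt/Q000039', 'blms/P282597']
```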

## 2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered. For more information about the data hierarchy in the [ORACC](http://oracc.org) `json` files, see [ORACC Open Data](http://oracc.museum.upenn.edu/doc/opendata/index.html).

The first argument of the `parsejson()` function is a `JSON` object, a dictionary that initially contains the entire contents of the original JSON file. The code takes the key `cdl`, whose value is an array (a list) of `JSON` objects. The function iterates through these objects; if an object contains another `cdl` node, the function calls itself with this object as first argument. In this way the function digs deeper and deeper into the `JSON` tree, until it no longer encounters a `cdl` key. At that point we are at the level of individual words. The code checks for a key `f`; if it exists, the value of that key is appended to the list `lemm_l`. This list is defined outside of the function proper.

The field `id_text` is derived from the input list. It consists of a project abbreviation, such as `blms` or `cams/gkab` plus a text ID, in the format `cams/gkab/P338616` or `dcclt/Q000039`. The `id_text` is added as a second argument to the function and is added to the lemmatization data of every word.

In [4]:
def parsejson(text, id_text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, id_text)
        if "f" in JSONobject:
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text
            lemm_l.append(lemm)
    return
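
The recursion can be tried out on a miniature, invented document. The sketch below is a hypothetical variant named `collect_lemmas` that returns a list instead of appending to the global `lemm_l`. The toy dictionary only imitates the nesting of `cdl` and `f` keys; real ORACC files carry many more keys.

```python
def collect_lemmas(node, id_text):
    """Recursively collect the 'f' dictionaries below node (hypothetical variant of parsejson)."""
    found = []
    for child in node.get("cdl", []):
        if "cdl" in child:    # a deeper level: recurse
            found.extend(collect_lemmas(child, id_text))
        if "f" in child:      # word level: keep a copy of the lemmatization data
            found.append({**child["f"], "id_text": id_text})
    return found

# Invented miniature document with three levels of nesting.
toy = {
    "cdl": [
        {"cdl": [
            {"cdl": [
                {"f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
                {"f": {"form": "x"}},
            ]},
        ]},
    ]
}

words = collect_lemmas(toy, "test/P000001")
print([w["form"] for w in words])  # ['lugal', 'x']
```

Returning a list makes the function easier to test in isolation; the notebook's version, which fills a shared list, gives the same rows for the same input.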

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will read the list of text IDs identified above. Each text ID (e.g. `dcclt/Q000039`) is split into the ID number (`Q000039`) and the project designation (`dcclt`). The project designation is used to identify the right ZIP file in the directory `jsonzip`. The ID number is used to find the pertinent `json` file inside the ZIP.

The command `json.loads()` reads the json data and transforms it into a Python dictionary (a sequence of keys and values).

This dictionary (`data_json`, which the function receives as its parameter `text`) is now passed to the `parsejson()` function, with the text ID as second argument. The function adds lemmata to the `lemm_l` list.
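
The way a text ID is turned into a ZIP file name and a path inside that ZIP can be shown in isolation. This is a sketch that mirrors the slicing used in the cell below; the function name `locate` is hypothetical.

```python
def locate(id_text):
    """Return (zip path, member path) for a text ID (hypothetical helper)."""
    project = id_text[:-8].lower()   # everything before the final '/P######'
    pqx = id_text[-7:].upper()       # the letter P, Q, or X plus six digits
    zip_path = "jsonzip/" + project.replace("/", "-") + ".zip"
    member = project + "/corpusjson/" + pqx + ".json"
    return zip_path, member

print(locate("saao/saa04/P237549"))
# ('jsonzip/saao-saa04.zip', 'saao/saa04/corpusjson/P237549.json')
```

The slicing relies on the ID number being exactly seven characters long, so a malformed line in the input list yields a wrong project name rather than an immediate error.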

In [5]:
lemm_l = []
for id_text in pqxnos:
    project = id_text[:-8].lower()   # project (and subproject) designation
    pqx = id_text[-7:].upper()       # P, Q, or X plus six digits
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    try:
        print("parsing " + id_text)
        # z.read() raises KeyError if the text is not in the ZIP
        data = z.read(project + "/corpusjson/" + pqx + ".json")
        data_json = json.loads(data)
        parsejson(data_json, id_text)
    except (KeyError, ValueError):
        print(id_text + ' is not available or not complete')

parsing dcclt/Q000057
parsing blms/P282597
parsing blms/P274259
parsing saao/saa04/P237549
jsonzip/saao-saa21.zip does not exist or is not a proper ZIP file


In [6]:
word_df = pd.DataFrame(lemm_l).fillna('')
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense
0,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.1.1.0'}]",exceed,dcclt/Q000057,sux,~,,dirig,V/i,exceed
1,,,,,,x,"[{'x': 'ellipsis', 'id': 'Q000057.1.2.0', 'bre...",,dcclt/Q000057,sux,,,,,
2,,,,,,|SI.A|,"[{'c': '|SI.A|', 'id': 'Q000057.1.3.0', 'seq':...",,dcclt/Q000057,sux,,,,,
3,,watru,,,AJ,wa-at-ru-um,"[{'v': 'wa', 'id': 'Q000057.1.4.0', 'delim': '...",huge,dcclt/Q000057,akk-x-oldbab,,watrum,,AJ,greater
4,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.2.1.0'}]",exceed,dcclt/Q000057,sux,~,,dirig,V/i,exceed
5,,šūturu,,,AJ,šu-tu-ru-um,"[{'v': 'šu', 'id': 'Q000057.2.2.0', 'delim': '...",very great,dcclt/Q000057,akk-x-oldbab,,šūturum,,AJ,supreme
6,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.3.1.0'}]",float,dcclt/Q000057,sux,~,,dirig,V/i,"to float, glide (along/down)"
7,,neqelpû,,,N,ni-qe₃-el-pu-um,"[{'v': 'ni', 'id': 'Q000057.3.2.0', 'delim': '...",float,dcclt/Q000057,akk-x-oldbab,,niqelpûm,,V,floating
8,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.4.1.0'}]",fall,dcclt/Q000057,sux,~,,dirig,V/i,be distressed
9,,ašāšu,,,N,a-ša-šum,"[{'v': 'a', 'id': 'Q000057.4.2.0', 'delim': '-...",be(come) distressed,dcclt/Q000057,akk-x-oldbab,,ašāšum,,V,be(com)ing distressed


## 3.1 Create a `lemma` column
The following code combines the `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) columns to create a new `lemma` column with the format `cf[gw]pos`, for instance `šarru[king]N` or `lugal[king]N`. Unlemmatized words do not have `cf`, `gw`, or `pos`; they only have `form` (the transliteration). The function therefore has a condition: if `cf` is empty, the format should be `form[NA]NA`. Alternatively, one may leave out non-lemmatized words altogether and create the `lemma` column by simply concatenating `cf`, `gw`, and `pos`, as follows:

> `word_df = word_df[word_df['cf'] != '']`

> `word_df['lemma'] = word_df['cf'] + '[' + word_df['gw'] + ']' + word_df['pos']`
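
The rule for building a lemma can also be written as a plain function on a single word, which makes the condition easy to test. This is an illustrative sketch; the name `make_lemma` and the sample dictionaries are assumptions, modeled on the rows shown above.

```python
def make_lemma(word):
    """Build 'cf[gw]pos', or 'form[NA]NA' when the word has no citation form."""
    if word.get("cf", ""):
        return word["cf"] + "[" + word.get("gw", "") + "]" + word.get("pos", "")
    return word.get("form", "") + "[NA]NA"

words = [
    {"form": "dirig", "cf": "dirig", "gw": "exceed", "pos": "V/i"},  # lemmatized
    {"form": "x", "cf": "", "gw": "", "pos": ""},                    # unlemmatized
]
print([make_lemma(w) for w in words])  # ['dirig[exceed]V/i', 'x[NA]NA']
```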

In [7]:
word_df["lemma"] = word_df.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) if r["cf"] != '' 
                                 else r['form'] + '[NA]NA', axis=1)
word_df[['id_text', 'lemma']]

Unnamed: 0,id_text,lemma
0,dcclt/Q000057,dirig[exceed]V/i
1,dcclt/Q000057,x[NA]NA
2,dcclt/Q000057,|SI.A|[NA]NA
3,dcclt/Q000057,watru[huge]AJ
4,dcclt/Q000057,dirig[exceed]V/i
5,dcclt/Q000057,šūturu[very great]AJ
6,dcclt/Q000057,dirig[float]V/i
7,dcclt/Q000057,neqelpû[float]V
8,dcclt/Q000057,dirig[fall]V/i
9,dcclt/Q000057,ašāšu[be(come) distressed]V


## 3.2 Remove Spaces and Commas from the Lemma
Spaces and commas in the Guide Word may cause trouble during tokenization, or when the data are saved in Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed.
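
As a quick check of the two replacements, here is a sketch on single strings. The helper name `clean_lemma` is hypothetical, and the second sample lemma is invented to show a comma.

```python
def clean_lemma(lemma):
    """Replace spaces with hyphens, then delete commas."""
    return lemma.replace(" ", "-").replace(",", "")

print(clean_lemma("šūturu[very great]AJ"))       # šūturu[very-great]AJ
print(clean_lemma("dirig[to float, glide]V/i"))  # dirig[to-float-glide]V/i
```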

In [8]:
word_df.loc[:,'lemma'] = [x.replace(' ', '-') for x in word_df.loc[:,'lemma']]
word_df.loc[:,'lemma'] = [x.replace(',', '') for x in word_df.loc[:,'lemma']]
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense,lemma
0,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.1.1.0'}]",exceed,dcclt/Q000057,sux,~,,dirig,V/i,exceed,dirig[exceed]V/i
1,,,,,,x,"[{'x': 'ellipsis', 'id': 'Q000057.1.2.0', 'bre...",,dcclt/Q000057,sux,,,,,,x[NA]NA
2,,,,,,|SI.A|,"[{'c': '|SI.A|', 'id': 'Q000057.1.3.0', 'seq':...",,dcclt/Q000057,sux,,,,,,|SI.A|[NA]NA
3,,watru,,,AJ,wa-at-ru-um,"[{'v': 'wa', 'id': 'Q000057.1.4.0', 'delim': '...",huge,dcclt/Q000057,akk-x-oldbab,,watrum,,AJ,greater,watru[huge]AJ
4,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.2.1.0'}]",exceed,dcclt/Q000057,sux,~,,dirig,V/i,exceed,dirig[exceed]V/i
5,,šūturu,,,AJ,šu-tu-ru-um,"[{'v': 'šu', 'id': 'Q000057.2.2.0', 'delim': '...",very great,dcclt/Q000057,akk-x-oldbab,,šūturum,,AJ,supreme,šūturu[very-great]AJ
6,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.3.1.0'}]",float,dcclt/Q000057,sux,~,,dirig,V/i,"to float, glide (along/down)",dirig[float]V/i
7,,neqelpû,,,N,ni-qe₃-el-pu-um,"[{'v': 'ni', 'id': 'Q000057.3.2.0', 'delim': '...",float,dcclt/Q000057,akk-x-oldbab,,niqelpûm,,V,floating,neqelpû[float]V
8,dirig,dirig,,,V/i,dirig,"[{'v': 'dirig', 'id': 'Q000057.4.1.0'}]",fall,dcclt/Q000057,sux,~,,dirig,V/i,be distressed,dirig[fall]V/i
9,,ašāšu,,,N,a-ša-šum,"[{'v': 'a', 'id': 'Q000057.4.2.0', 'delim': '-...",be(come) distressed,dcclt/Q000057,akk-x-oldbab,,ašāšum,,V,be(com)ing distressed,ašāšu[be(come)-distressed]V


## 3.3 Group by Textid
Get all the lemmas that belong to a single text in one row (one row = one document). The `agg()` (aggregate) function, which works on the result of a `groupby()` operation, aggregates columns of the original dataframe. It takes a dictionary in which the keys are column names and the values are the functions to be used in the aggregation. The example below has only one such function (`' '.join` joins all entries in the column `lemma` with a space in between); one may specify the same or different functions for different columns, for instance:
> `word_df = word_df.groupby("id_text").agg({"lemma": ' '.join, "base": ' '.join})`
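
To see what the grouping step computes, the same idea can be sketched in plain Python, without pandas. The sample rows are invented; sorting the keys imitates the default behavior of `groupby()`, which orders the groups by key.

```python
from collections import defaultdict

# Invented (id_text, lemma) pairs, one per word, in document order.
rows = [
    ("dcclt/Q000057", "dirig[exceed]V/i"),
    ("blms/P274259", "isqu[share]N"),
    ("dcclt/Q000057", "watru[huge]AJ"),
]

grouped = defaultdict(list)
for id_text, lemma in rows:
    grouped[id_text].append(lemma)   # collect lemmas per text, keeping their order

# One string per document, keys in sorted order.
documents = {id_text: " ".join(lemmas) for id_text, lemmas in sorted(grouped.items())}
print(documents)
```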

In [9]:
word_df = word_df.groupby("id_text").agg({"lemma": ' '.join})
word_df

id_text,lemma
blms/P274259,x-x-šu₂[NA]NA x[NA]NA isqu[share]N x[NA]NA x[N...
blms/P282597,{d}utu-gin₇[NA]NA e₃-ta[NA]NA uru₂-zu[NA]NA e-...
dcclt/Q000057,dirig[exceed]V/i x[NA]NA |SI.A|[NA]NA watru[hu...
saao/saa04/P237549,{d}UTU[NA]NA EN[NA]NA GAL-u₂[NA]NA ša₂[NA]NA a...


## 4 Save Results in CSV file
The `.csv` file has the same name as the list of text IDs that was used at the beginning of this notebook. On most computers, `csv` files open automatically in Excel. This spreadsheet program does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding='utf-8'` to `encoding='utf-16'`. For use in computational text analysis applications, `utf-8` is usually preferred.

In [10]:
savefile = filename[:-3] + 'csv'
# Pass the path rather than an open text handle, so that pandas applies the encoding itself.
word_df.to_csv('output/' + savefile, encoding='utf-8')
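
The saved file can be read back for bag-of-words processing with the standard library alone. The sketch below uses an invented in-memory stand-in for the file (in practice one would open the file in `output/` with `encoding='utf-8'`) and assumes the two columns `id_text` and `lemma` produced above.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the saved CSV file.
sample = io.StringIO(
    "id_text,lemma\n"
    "dcclt/Q000057,dirig[exceed]V/i x[NA]NA watru[huge]AJ dirig[exceed]V/i\n"
)

counts = {}
for row in csv.DictReader(sample):
    # Lemmas are separated by spaces, so split() yields one token per word.
    counts[row["id_text"]] = Counter(row["lemma"].split())

print(counts["dcclt/Q000057"].most_common(1))  # [('dirig[exceed]V/i', 2)]
```

Because spaces inside Guide Words were replaced by hyphens earlier, splitting on whitespace is enough to recover the individual lemmas.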