# Extract Lemmatization from JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more texts. The resulting `csv` (Comma Separated Values) file has two fields: a Text ID (e.g. `dcclt/Q000039`) and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts).

The Basic Parser downloads one `json` file at a time from the [ORACC](http://oracc.org) server, parses it, and then continues with the next file. This method is not ideal for large amounts of data (for instance if you intend to parse an entire project, or several projects at once). While going through a long list of files, the connection with the server may be reset (for instance when the computer goes to sleep, or during a brief period of bad connectivity), and the script will fail. The Advanced Parser will introduce a more robust way of doing this.

The output of the Basic Parser does not recognize lines (all words/lemmas in a text are listed sequentially) and does not indicate breakage. These issues are of little consequence for [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) techniques such as [topic modeling](https://en.wikipedia.org/wiki/Topic_model), but may become serious limitations for other types of computational approaches. The Advanced Parser will introduce techniques to extract such data.

[ORACC](http://oracc.org) `JSON` files also include data on the sign level (reading, sign name, unicode number, function, etc). These data are not extracted in the parsers demonstrated here, but the example code should provide a model for how any type of data can be extracted.

In [1]:
import requests
import pandas as pd

## 1.1 Input List of Text IDs
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421
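
The format described above can be checked before any file is requested. The following is a minimal sketch, not part of the original notebook: the name `is_valid_text_id` and the regular expression are assumptions, and the pattern presumes that project and subproject names consist of lowercase letters and digits only.

```python
import re

# Assumed pattern: one or more lowercase project/subproject segments,
# followed by P, Q, or X and exactly six digits.
ID_PATTERN = re.compile(r"[a-z0-9]+(?:/[a-z0-9]+)*/[PQX]\d{6}")

def is_valid_text_id(text_id):
    """Return True if text_id looks like 'project/P######' (hypothetical helper)."""
    return ID_PATTERN.fullmatch(text_id) is not None

examples = ["dcclt/P117395", "etcsri/Q001203", "rinap/rinap1/Q003421",
            "P117395", "dcclt/P1173"]
validity = {text_id: is_valid_text_id(text_id) for text_id in examples}
```

The last two examples (no project prefix, too few digits) are rejected; the three IDs from the list above are accepted.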

The list should be created with a plain text editor such as TextEdit or Emacs, and the filename should end in `.txt`.

The P, Q, and X numbers available in a project are listed in the project's `corpus.json` at the address `http://oracc.org/[PROJECT]/corpus.json` (for instance http://oracc.org/saao/saa01/corpus.json).

In [2]:
filename = input('Filename: ')

Filename: test.txt


## 1.2 Open the List of Text IDs and Remove Spaces and Empty Lines


In [3]:
textids = 'text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.readlines()
pqxnos = [x.strip() for x in pqxnos]        # strip spaces left and right
pqxnos = [x for x in pqxnos if not x == ""] # strip empty lines
pqxnos = [x.split()[0] for x in pqxnos]     # strip everything after the first space

## 2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered.

The first argument of `parsejson()` is a `JSON` object, a dictionary that initially contains the entire text data; the second argument is the text ID. The code takes the key `cdl`, which contains an array (a list) of `JSON` objects. While iterating through these objects, the function calls itself whenever an object contains another `cdl` node, with this object as argument. This way the function digs deeper and deeper into the `JSON` tree, until it no longer encounters a `cdl` key. Here we are at the level of individual words. The code checks for a key `f`; if it exists, its value (a dictionary with the lemmatization data) is appended to the list `lemm_l`. This list is defined outside of the function proper. The key `f` is always accompanied by a key `ref`, which gives a word ID in the format `textID.lineID.wordID` (for instance `P282597.2.1`). The function does not read this `ref`; instead, the text ID passed as the second argument is added to each entry as the field `id_text`.

In [4]:
def parsejson(text, id_text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, id_text)
        if "f" in JSONobject:
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text
            lemm_l.append(lemm)
    return
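
To see the recursion at work without downloading anything, one can run it on a small hand-made dictionary. The sketch below is an illustration only: `toy_text` is a strongly simplified, hypothetical stand-in for a real ORACC file, and `collect_lemmas` is a variant of `parsejson()` that returns its results instead of appending to the global list `lemm_l`.

```python
def collect_lemmas(node, id_text, found=None):
    """Recursively collect the 'f' dictionaries below a 'cdl' node (hypothetical variant)."""
    if found is None:
        found = []
    for child in node.get("cdl", []):
        if "cdl" in child:               # another level of the hierarchy
            collect_lemmas(child, id_text, found)
        if "f" in child:                 # word level: lemmatization data
            entry = dict(child["f"])     # copy, so the input is not modified
            entry["id_text"] = id_text
            found.append(entry)
    return found

# Invented, simplified structure: two words at different depths of the tree.
toy_text = {
    "cdl": [
        {"node": "c", "cdl": [
            {"node": "d", "type": "line-start"},
            {"node": "l", "ref": "X000001.1.1",
             "f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
            {"node": "c", "cdl": [
                {"node": "l", "ref": "X000001.1.2", "f": {"form": "x"}},
            ]},
        ]},
    ]
}

toy_lemmas = collect_lemmas(toy_text, "test/X000001")
```

Both words are found, although the second one sits one level deeper than the first. Returning the list makes the function easier to test; the notebook's version with a global list works the same way otherwise.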

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will read the list of text IDs identified above. Each text ID (e.g. `dcclt/Q000039`) is split into the ID number (`Q000039`) and the project designation (`dcclt`). These elements are used to build the `URL` for the `json` file on the [ORACC](http://oracc.org) server.

The `json` file is downloaded with the `requests` library and transformed into a Python object (a dictionary) with the `json()` method of the response object returned by that library.

This dictionary, which is called `r` in the code below (and `text` inside the function), is now sent to the `parsejson()` function. The function adds lemmata to the `lemm_l` list.

In [5]:
lemm_l = []
for id_text in pqxnos:
    project = id_text[:-8].lower()
    pqx = id_text[-7:].upper()
    url = "http://oracc.org/" + project + '/corpusjson/' + pqx + '.json'
    print("parsing " + id_text)
    try:
        # download and decode inside the try block, so that a missing or
        # malformed file is reported instead of stopping the loop
        r = requests.get(url).json()
        parsejson(r, id_text)
    except Exception:
        print(id_text + ' is not available or not complete')

parsing blms/P282597
parsing blms/P274259
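
The splitting of a text ID into project and ID number can also be written without counting characters. The following sketch assumes a hypothetical helper `build_url` and uses `rsplit()` on the last slash; it builds the same `URL` as the cell above, but does not download anything.

```python
def build_url(id_text):
    """Build the corpusjson URL for a text ID such as 'rinap/rinap1/Q003421' (hypothetical helper)."""
    # Everything before the last slash is the project (and subproject);
    # everything after it is the P, Q, or X number.
    project, pqx = id_text.rsplit("/", 1)
    return "http://oracc.org/" + project.lower() + "/corpusjson/" + pqx.upper() + ".json"

url = build_url("rinap/rinap1/Q003421")
```

Slicing with `[:-8]` and `[-7:]` relies on the ID number being exactly seven characters long; splitting on the slash does not.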


In [6]:
word_df = pd.DataFrame(lemm_l).fillna('')
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense
0,,,,,,{d}utu-gin₇,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",,blms/P282597,sux,,,,,
1,,,,,,e₃-ta,"[{'v': 'e₃', 'id': 'P282597.2.2.0', 'break': '...",,blms/P282597,sux,,,,,
2,,,,,,uru₂-zu,"[{'v': 'uru₂', 'id': 'P282597.2.3.0', 'break':...",,blms/P282597,sux,,,,,
3,,,,,,e-NE,"[{'v': 'e', 'id': 'P282597.2.4.0', 'break': 'd...",,blms/P282597,sux,,,,,
4,,,,,,x,"[{'x': 'ellipsis', 'id': 'P282597.2.5.0', 'bre...",,blms/P282597,sux,,,,,
5,,kīma,,,PRP,ki-ma,"[{'v': 'ki', 'id': 'P282597.3.1.0', 'break': '...",like,blms/P282597,akk-x-stdbab,,kīma,,PRP,like
6,,Šamaš,,,DN,{d}ša₂-maš,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",1,blms/P282597,akk-x-stdbab,,Šamaš,,DN,1
7,,waṣû,,,V,ṣa-am-ma,"[{'v': 'ṣa', 'id': 'P282597.3.3.0', 'delim': '...",go out,blms/P282597,akk-x-stdbab,,ṣâmma,,V,go out
8,,ālu,,,N,IRI-ka,"[{'s': 'IRI', 'id': 'P282597.3.4.0', 'role': '...",city,blms/P282597,akk-x-stdbab,,ālka,,N,city
9,,hiāṭu,,,V,hi-i-ṭi,"[{'v': 'hi', 'id': 'P282597.3.5.0', 'delim': '...",supervise,blms/P282597,akk-x-stdbab,,hīṭi,,V,check


## 3.1 Create a `lemma` column
The following code combines the `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) columns to create a new `lemma` column with the format `cf[gw]pos`, for instance `šarru[king]N` or `lugal[king]N`. Unlemmatized words do not have `cf`, `gw`, or `pos` - they only have `form` (the transliteration). The function therefore has a condition: if `cf` is empty, the format should be `form[NA]NA`. Alternatively, one may leave out non-lemmatized words altogether and create the `lemma` column by simply adding up `cf`, `gw`, and `pos`, as follows:

> `word_df = word_df[word_df['cf'] != '']`

> `word_df['lemma'] = word_df['cf'] + '[' + word_df['gw'] + ']' + word_df['pos']`

In [7]:
word_df["lemma"] = word_df.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) if r["cf"] != '' 
                                 else r['form'] + '[NA]NA', axis=1)
word_df[['id_text', 'lemma']]

Unnamed: 0,id_text,lemma
0,blms/P282597,{d}utu-gin₇[NA]NA
1,blms/P282597,e₃-ta[NA]NA
2,blms/P282597,uru₂-zu[NA]NA
3,blms/P282597,e-NE[NA]NA
4,blms/P282597,x[NA]NA
5,blms/P282597,kīma[like]PRP
6,blms/P282597,Šamaš[1]DN
7,blms/P282597,waṣû[go out]V
8,blms/P282597,ālu[city]N
9,blms/P282597,hiāṭu[supervise]V
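
The same condition can be expressed on whole columns rather than row by row. This is a sketch of one possible alternative using the pandas method `Series.where()`, run on a small invented dataframe (`toy_df`); it is not the approach taken in the cell above.

```python
import pandas as pd

# Invented two-row example: one unlemmatized word, one lemmatized word.
toy_df = pd.DataFrame({
    "form": ["e₃-ta", "ki-ma"],
    "cf":   ["", "kīma"],
    "gw":   ["", "like"],
    "pos":  ["", "PRP"],
})

lemmatized = toy_df["cf"] + "[" + toy_df["gw"] + "]" + toy_df["pos"]
fallback = toy_df["form"] + "[NA]NA"

# Keep the lemmatized string where `cf` is not empty, otherwise use the fallback.
toy_df["lemma"] = lemmatized.where(toy_df["cf"] != "", fallback)
```

On large dataframes, column-wise operations of this kind tend to be faster than `apply()` with `axis=1`, which calls a Python function for every row.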


## 3.2 Remove Spaces and Commas from the Lemma
Spaces and commas in the Guide Word may cause trouble during tokenization, or when the data are saved in Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed.

In [8]:
word_df.loc[:,'lemma'] = [x.replace(' ', '-') for x in word_df.loc[:,'lemma']]
word_df.loc[:,'lemma'] = [x.replace(',', '') for x in word_df.loc[:,'lemma']]
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense,lemma
0,,,,,,{d}utu-gin₇,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",,blms/P282597,sux,,,,,,{d}utu-gin₇[NA]NA
1,,,,,,e₃-ta,"[{'v': 'e₃', 'id': 'P282597.2.2.0', 'break': '...",,blms/P282597,sux,,,,,,e₃-ta[NA]NA
2,,,,,,uru₂-zu,"[{'v': 'uru₂', 'id': 'P282597.2.3.0', 'break':...",,blms/P282597,sux,,,,,,uru₂-zu[NA]NA
3,,,,,,e-NE,"[{'v': 'e', 'id': 'P282597.2.4.0', 'break': 'd...",,blms/P282597,sux,,,,,,e-NE[NA]NA
4,,,,,,x,"[{'x': 'ellipsis', 'id': 'P282597.2.5.0', 'bre...",,blms/P282597,sux,,,,,,x[NA]NA
5,,kīma,,,PRP,ki-ma,"[{'v': 'ki', 'id': 'P282597.3.1.0', 'break': '...",like,blms/P282597,akk-x-stdbab,,kīma,,PRP,like,kīma[like]PRP
6,,Šamaš,,,DN,{d}ša₂-maš,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",1,blms/P282597,akk-x-stdbab,,Šamaš,,DN,1,Šamaš[1]DN
7,,waṣû,,,V,ṣa-am-ma,"[{'v': 'ṣa', 'id': 'P282597.3.3.0', 'delim': '...",go out,blms/P282597,akk-x-stdbab,,ṣâmma,,V,go out,waṣû[go-out]V
8,,ālu,,,N,IRI-ka,"[{'s': 'IRI', 'id': 'P282597.3.4.0', 'role': '...",city,blms/P282597,akk-x-stdbab,,ālka,,N,city,ālu[city]N
9,,hiāṭu,,,V,hi-i-ṭi,"[{'v': 'hi', 'id': 'P282597.3.5.0', 'delim': '...",supervise,blms/P282597,akk-x-stdbab,,hīṭi,,V,check,hiāṭu[supervise]V
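
The two replacements can also be chained with the pandas string accessor. The sketch below runs on a small invented `Series`; the third entry, with a comma in its guide word, is a made-up example rather than a real lemma.

```python
import pandas as pd

# Toy data; the last lemma is hypothetical and only serves to show comma removal.
lemmas = pd.Series(["waṣû[go out]V", "ālu[city]N", "example[one, two]N"])

# regex=False treats the search strings as literal text.
cleaned = (lemmas.str.replace(" ", "-", regex=False)
                 .str.replace(",", "", regex=False))
```

The result is equivalent to the two list comprehensions above.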


## 3.3 Group by Text ID
Get all the lemmas that belong to a single text in one row (one row = one document). The `agg()` (aggregate) function, which works on the result of a `groupby()` process, aggregates columns of the original dataframe. The function takes a dictionary in which the keys are column names and the values are functions to be used in the aggregation process. The example below has only one such function (`' '.join` will join all entries in the column `lemma` with a space in between); one may specify (the same or different) functions for different columns, for instance:
> `word_df = word_df.groupby("id_text").agg({"lemma": ' '.join, "base": ' '.join})`

In [9]:
word_df = word_df.groupby("id_text").agg({"lemma": ' '.join})
word_df

Unnamed: 0_level_0,lemma
id_text,Unnamed: 1_level_1
blms/P274259,x-x-šu₂[NA]NA x[NA]NA isqu[share]N x[NA]NA x[N...
blms/P282597,{d}utu-gin₇[NA]NA e₃-ta[NA]NA uru₂-zu[NA]NA e-...
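
By default `groupby()` sorts the groups by their key, and the key becomes the index of the result, as the output above shows. The following sketch, on an invented dataframe (`toy_df`), keeps the texts in the order of their first appearance with `sort=False` and turns `id_text` back into a regular column with `reset_index()`. Whether this is desirable depends on how the output will be used.

```python
import pandas as pd

# Invented example: two texts, with the rows of the first text not adjacent.
toy_df = pd.DataFrame({
    "id_text": ["test/X000002", "test/X000001", "test/X000002"],
    "lemma":   ["lugal[king]N", "kīma[like]PRP", "ālu[city]N"],
})

docs = (toy_df.groupby("id_text", sort=False)   # keep order of first appearance
              .agg({"lemma": " ".join})         # one string of lemmas per text
              .reset_index())                   # id_text as a column again
```

Within each text the lemmas stay in their original order, which matters if the output is later used for anything other than bag-of-words methods.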


## 4 Save Results in CSV file
The `.csv` file has the same name as the list of text IDs that was used at the beginning of this notebook. On most computers, `csv` files open automatically in Excel. This spreadsheet program does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding='utf-8'` to `encoding='utf-16'`. For use in computational text analysis applications, `utf-8` is usually preferred.

In [10]:
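savefile = filename[:-3] + 'csv'
# the encoding is set when the file is opened; `to_csv()` does not apply its
# own `encoding` argument when it receives an open text file handle
with open('output/' + savefile, 'w', encoding='utf-8') as w:
    word_df.to_csv(w)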
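
The slice `filename[:-3]` assumes that the list of text IDs has a three-letter extension. A slightly more general way to derive the output name is sketched below with `os.path.splitext()` from the standard library; the helper name `csv_name` is hypothetical.

```python
import os

def csv_name(list_filename):
    """Replace the extension of the ID list's filename with '.csv' (hypothetical helper)."""
    stem, _extension = os.path.splitext(list_filename)
    return stem + ".csv"

savefile_example = csv_name("test.txt")
```

This also gives a sensible result for names with a longer extension or with no extension at all.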