# Extract Lemmatization from JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more texts. The resulting `csv` (Comma Separated Values) file has two fields: a Text ID (e.g. `dcclt/Q000039`) and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts).

The Basic Parser downloads one `json` file at a time from the [ORACC](http://oracc.org) server, parses it, and then continues with the next file. This method is not ideal for large amounts of data (for instance, if you intend to parse an entire project, or several projects at once). While the script works through a long list of files, the connection with the server may be reset (for instance, when the computer goes to sleep or during a brief period of bad connectivity), and the script will fail. The Advanced Parser will introduce a better way of handling large numbers of files.

The output of the Basic Parser does not recognize lines (all words/lemmas in a text are listed sequentially) and does not indicate breakage. These issues are of little consequence for [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) techniques such as [topic modeling](https://en.wikipedia.org/wiki/Topic_model), but may become serious limitations for other types of computational approaches. The Advanced Parser will introduce techniques to extract such data.

[ORACC](http://oracc.org) `json` files also include data on the sign level (reading, sign name, unicode number, function, etc). These data are not extracted in the parsers demonstrated here, but the example code should provide a model for how any type of data can be extracted.

In [1]:
import requests
import pandas as pd

## 1.1 Input List of Text IDs
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421
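
Before downloading anything, it can help to check that each line of the list follows the format described above. The following is a minimal sketch, not part of the original notebook: the helper name `is_valid_text_id` is hypothetical, and the pattern assumes that project and subproject abbreviations consist of lowercase letters and digits only.

```python
import re

# Assumption: project/subproject names are lowercase letters and digits,
# followed by P, Q, or X and exactly six digits.
TEXT_ID = re.compile(r"^[a-z0-9]+(?:/[a-z0-9]+)*/[PQX]\d{6}$")

def is_valid_text_id(text_id):
    """Return True if text_id looks like 'PROJECT/P######' (hypothetical helper)."""
    return TEXT_ID.match(text_id) is not None

candidates = ["dcclt/P117395", "rinap/rinap1/Q003421", "dcclt/117395", "P117395"]
valid = [t for t in candidates if is_valid_text_id(t)]
print(valid)
```

The notebook itself normalizes case when it builds the URL, so this check is stricter than what the parser requires.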

The list should be created with a plain text editor such as Textedit or Emacs, and the filename should end in `.txt`.

The P, Q, and X numbers available in a project are listed in the project's `corpus.json` at the address `http://oracc.org/[PROJECT]/corpus.json` (for instance http://oracc.org/saao/saa01/corpus.json).

In [2]:
filename = input('Filename: ')

Filename: ob_lists_wood.txt


## 1.2 Open the List of Text IDs and Remove Spaces and Empty Lines


In [3]:
textids = 'text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.readlines()
pqxnos = [x.strip() for x in pqxnos]        # strip spaces left and right
pqxnos = [x for x in pqxnos if not x == ""] # strip empty lines

## 2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered.

The first argument of `parsejson()` is a dictionary (derived from `JSON`), the second a list of lemmatization data that is being built while the function iterates through the dictionary. When initially called, the second argument is supposed to be empty. For the code to run properly, the default value of the second argument must be `None`; see the [blog post](http://effbot.org/zone/default-values.htm) by Fredrik Lundh.
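
The reason for the `None` default can be seen in a small experiment. The two functions below are illustrative only (the names `collect_bad` and `collect_good` are not part of the notebook): a list used as a default value is created once and shared between calls, whereas the `None` pattern creates a fresh list each time.

```python
def collect_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket

def collect_good(item, bucket=None):
    # A fresh list is created on each call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first_bad = collect_bad("a")
second_bad = collect_bad("b")    # still holds "a" from the previous call
first_good = collect_good("a")
second_good = collect_good("b")  # starts from an empty list

print(second_bad)   # ['a', 'b']
print(second_good)  # ['b']
```

With a list as default, lemmas from one text would carry over into the result for the next text.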

In [4]:
def parsejson(text, lemm_l=None):
    if lemm_l is None:
        lemm_l = []
    for dictionary in text["cdl"]:
        if "cdl" in dictionary: 
            parsejson(dictionary, lemm_l)
        elif "f" in dictionary:
            lemm_l.append(dictionary["f"])
    return lemm_l
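
To see the recursion at work without contacting the server, the function can be run on a small hand-made dictionary. The sketch below repeats the function so that it runs on its own; the `sample` data is invented for illustration and is far simpler than a real [ORACC](http://oracc.org) file.

```python
def parsejson(text, lemm_l=None):
    if lemm_l is None:
        lemm_l = []
    for dictionary in text["cdl"]:
        if "cdl" in dictionary:          # a higher node: go one level deeper
            parsejson(dictionary, lemm_l)
        elif "f" in dictionary:          # a word node: keep its lemmatization
            lemm_l.append(dictionary["f"])
    return lemm_l

# Invented sample data: nested "cdl" lists with word data under "f".
sample = {
    "cdl": [
        {"cdl": [
            {"type": "line-start"},      # neither "cdl" nor "f": skipped
            {"f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
            {"cdl": [
                {"f": {"form": "x"}},    # an unlemmatized word
            ]},
        ]},
    ]
}

lemmas = parsejson(sample)
print([entry["form"] for entry in lemmas])
```

The result is a flat list of dictionaries in text order, regardless of how deeply each word was nested.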

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will read the list of text IDs identified above. Each text ID (e.g. `dcclt/Q000039`) is split into the ID number (`Q000039`) and the project designation (`dcclt`). These elements are used to build the `URL` for the `json` file on the [ORACC](http://oracc.org) server.

The `json` file is downloaded with the `requests` library and transformed into a Python object (a dictionary) with the `json()` function of this same library.

This dictionary, called `text`, is now sent to the `parsejson()` function. The function returns a list of lemmata (`lemm_text`). The text ID is added to each entry, and the entries are appended to the list `word_l`.
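
The way a text ID is split and turned into a URL can be checked without any download. In the sketch below the slicing is wrapped in a helper named `build_url`; that name is an assumption made for illustration, while the slicing and the URL pattern are those used in this notebook.

```python
def build_url(id_text):
    """Build the corpusjson URL for a text ID (hypothetical helper)."""
    project = id_text[:-8].lower()   # drop the final '/' plus the 7-character ID
    pqx = id_text[-7:].upper()       # the P, Q, or X number itself
    return "http://oracc.org/" + project + "/corpusjson/" + pqx + ".json"

url_simple = build_url("dcclt/Q000039")
url_sub = build_url("rinap/rinap1/Q003421")
print(url_simple)
print(url_sub)
```

Because the slices count from the end of the string, the same code handles IDs with and without a subproject.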

In [5]:
word_l = []
for id_text in pqxnos:
    project = id_text[:-8].lower()   # everything before '/P######'
    pqx = id_text[-7:].upper()       # the P, Q, or X number itself
    url = "http://oracc.org/" + project + '/corpusjson/' + pqx + '.json'
    r = requests.get(url)
    try:
        text = r.json()              # fails if the response is not valid JSON
        print("parsing " + id_text)
        lemm_text = parsejson(text)
        for lemm in lemm_text:
            lemm['textid'] = id_text
        word_l.extend(lemm_text)
    except Exception:                # a bare except would also trap KeyboardInterrupt
        print(id_text + ' is not available or not complete')

parsing dcclt/Q000039
parsing dcclt/P117395
parsing dcclt/P117404
parsing dcclt/P128345
parsing dcclt/P224980
parsing dcclt/P224986
parsing dcclt/P224994
parsing dcclt/P224996
parsing dcclt/P225006
parsing dcclt/P225023
parsing dcclt/P225033
parsing dcclt/P225059
parsing dcclt/P225062
parsing dcclt/P225065
parsing dcclt/P225086
parsing dcclt/P225109
parsing dcclt/P225126
parsing dcclt/P225132
parsing dcclt/P229426
parsing dcclt/P230069
parsing dcclt/P235262
parsing dcclt/P247543
parsing dcclt/P247864
parsing dcclt/P249383
parsing dcclt/P250361
parsing dcclt/P250362
parsing dcclt/P250363
parsing dcclt/P250364
parsing dcclt/P250368
parsing dcclt/P250369
parsing dcclt/P250370
parsing dcclt/P250371
parsing dcclt/P251495
parsing dcclt/P251649
parsing dcclt/P251686
parsing dcclt/P251778
parsing dcclt/P251867
parsing dcclt/P253228
parsing dcclt/P253230
parsing dcclt/P253233
parsing dcclt/P253238
parsing dcclt/P253245
parsing dcclt/P253863
parsing dcclt/P253866
parsing dcclt/P253874
parsing dc

In [6]:
word_df = pd.DataFrame(word_l).fillna('')
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,lang,morph,norm,norm0,pos,sense,textid
0,{ŋeš}taškarin,taškarin,,,N,{ŋeš}taškarin,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",boxwood,sux,~,,taškarin,N,"box tree, boxwood",dcclt/Q000039
1,{ŋeš}esi,esi,,,N,{ŋeš}esi,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,esi,N,ebony,dcclt/Q000039
2,ŋeš-nu₁₁,ŋešnu,,,N,{ŋeš}nu₁₁,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,ŋešnu,N,tree,dcclt/Q000039
3,{ŋeš}ha-lu-ub₂,halub,,,N,{ŋeš}ha-lu-ub₂,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,halub,N,tree,dcclt/Q000039
4,{ŋeš}šag₄-kal,šagkal,,,N,{ŋeš}šag₄-kal,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,šagkal,N,tree,dcclt/Q000039
5,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.6.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039
6,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.7.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039
7,babbar,babbar,,,V/i,babbar,"[{'v': 'babbar', 'id': 'Q000039.7.2.0'}]",white,sux,~,,babbar,V/i,(to be) white,dcclt/Q000039
8,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.8.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039
9,giggi,giggi,,,V/i,giggi,"[{'v': 'giggi', 'id': 'Q000039.8.2.0'}]",black,sux,~,,giggi,V/i,(to be) black,dcclt/Q000039


## 3.1 Create a `lemma` column
The following code combines the `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) columns to create a new `lemma` column with the format `cf[gw]pos`, for instance `šarru[king]N` or `lugal[king]N`. Unlemmatized words do not have `cf`, `gw`, or `pos`; they only have `form` (the transliteration). The code therefore includes a condition: if `cf` is empty, the format is `form[NA]NA`. Alternatively, one may leave out non-lemmatized words altogether and create the `lemma` column by simply concatenating `cf`, `gw`, and `pos`, as follows:

> `word_df = word_df[word_df['cf'] != '']`

> `word_df['lemma'] = word_df['cf'] + '[' + word_df['gw'] + ']' + word_df['pos']`

In [7]:
word_df["lemma"] = word_df.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) if r["cf"] != '' 
                                 else r['form'] + '[NA]NA', axis=1)
word_df[['textid', 'lemma']]

Unnamed: 0,textid,lemma
0,dcclt/Q000039,taškarin[boxwood]N
1,dcclt/Q000039,esi[tree]N
2,dcclt/Q000039,ŋešnu[tree]N
3,dcclt/Q000039,halub[tree]N
4,dcclt/Q000039,šagkal[tree]N
5,dcclt/Q000039,ŋešgana[tree]N
6,dcclt/Q000039,ŋešgana[tree]N
7,dcclt/Q000039,babbar[white]V/i
8,dcclt/Q000039,ŋešgana[tree]N
9,dcclt/Q000039,giggi[black]V/i
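
The condition used for the `lemma` column can be tried out on a toy DataFrame. This is a minimal sketch with invented rows; the named function `make_lemma` is a hypothetical stand-in for the `lambda` in the cell above and is meant to do the same thing.

```python
import pandas as pd

# Invented rows: one lemmatized word and one unlemmatized word.
toy = pd.DataFrame([
    {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"},
    {"form": "{ŋeš}x-x", "cf": "", "gw": "", "pos": ""},
])

def make_lemma(row):
    # Lemmatized words get cf[gw]pos; all others fall back to form[NA]NA.
    if row["cf"] != "":
        return row["cf"] + "[" + row["gw"] + "]" + row["pos"]
    return row["form"] + "[NA]NA"

toy["lemma"] = toy.apply(make_lemma, axis=1)
print(toy["lemma"].tolist())
```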


## 3.2 Remove Spaces and Commas from the Lemma
Spaces and commas in the Guide Word may cause trouble during tokenization, or when the data are saved in Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed.
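
The same replacement can also be written with the string methods of a `pandas` Series. The sketch below is an alternative formulation, not the code used in this notebook, and the two lemmas are invented to show a space and a comma inside the brackets.

```python
import pandas as pd

# Invented lemmas with a space and a comma in the bracketed part.
lemmas = pd.Series(["babbar[(to be) white]V/i", "taškarin[box tree, boxwood]N"])

# regex=False treats the search strings as literal text.
cleaned = (lemmas.str.replace(" ", "-", regex=False)
                 .str.replace(",", "", regex=False))
print(cleaned.tolist())
```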

In [8]:
word_df.loc[:,'lemma'] = [x.replace(' ', '-') for x in word_df.loc[:,'lemma']]
word_df.loc[:,'lemma'] = [x.replace(',', '') for x in word_df.loc[:,'lemma']]
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,lang,morph,norm,norm0,pos,sense,textid,lemma
0,{ŋeš}taškarin,taškarin,,,N,{ŋeš}taškarin,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",boxwood,sux,~,,taškarin,N,"box tree, boxwood",dcclt/Q000039,taškarin[boxwood]N
1,{ŋeš}esi,esi,,,N,{ŋeš}esi,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,esi,N,ebony,dcclt/Q000039,esi[tree]N
2,ŋeš-nu₁₁,ŋešnu,,,N,{ŋeš}nu₁₁,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,ŋešnu,N,tree,dcclt/Q000039,ŋešnu[tree]N
3,{ŋeš}ha-lu-ub₂,halub,,,N,{ŋeš}ha-lu-ub₂,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,halub,N,tree,dcclt/Q000039,halub[tree]N
4,{ŋeš}šag₄-kal,šagkal,,,N,{ŋeš}šag₄-kal,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",tree,sux,~,,šagkal,N,tree,dcclt/Q000039,šagkal[tree]N
5,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.6.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039,ŋešgana[tree]N
6,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.7.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039,ŋešgana[tree]N
7,babbar,babbar,,,V/i,babbar,"[{'v': 'babbar', 'id': 'Q000039.7.2.0'}]",white,sux,~,,babbar,V/i,(to be) white,dcclt/Q000039,babbar[white]V/i
8,ŋeš-kin₂,ŋešgana,,,N,ŋeš-kin₂,"[{'v': 'ŋeš', 'id': 'Q000039.8.1.0', 'delim': ...",tree,sux,~,,ŋešgana,N,tree,dcclt/Q000039,ŋešgana[tree]N
9,giggi,giggi,,,V/i,giggi,"[{'v': 'giggi', 'id': 'Q000039.8.2.0'}]",black,sux,~,,giggi,V/i,(to be) black,dcclt/Q000039,giggi[black]V/i


## 3.3 Group by Textid
Get all the lemmas that belong to a single text in one row (one row = one document). The `agg()` (aggregate) function, which works on the result of a `groupby()` operation, aggregates columns of the original dataframe. The function takes a dictionary in which the keys are column names and the values are the functions to be used in the aggregation. The example below has only one such function (`' '.join` joins all entries in the column `lemma` with a space in between); one may specify the same or different functions for different columns, for instance:

> `word_df = word_df.groupby("textid").agg({"lemma": ' '.join, "base": ' '.join})`

In [9]:
word_df = word_df.groupby("textid").agg({"lemma": ' '.join})
word_df

Unnamed: 0_level_0,lemma
textid,Unnamed: 1_level_1
dcclt/P117395,ŋešed[key]N pakud[~tree]N raba[clamp]N
dcclt/P117404,ig[door]N eren[cedar]N ig[door]N dib[board]N i...
dcclt/P128345,garig[comb]N siki[hair]N garig[comb]N siki-sik...
dcclt/P224980,gigir[chariot]N e[house]N gigir[chariot]N e[ho...
dcclt/P224986,guza[chair]N anše[equid]N guza[chair]N kaskal[...
dcclt/P224994,{ŋeš}x-x[NA]NA {ŋeš}SI-x[NA]NA {ŋeš}šu-x[NA]NA
dcclt/P224996,guza[chair]N guza[chair]N gid[long]V/i guza[ch...
dcclt/P225006,ig[door]N suku[pole]N ig[door]N zara[pivot]N i...
dcclt/P225023,{ŋeš}x[NA]NA emerah[jug]N x[NA]NA emerah[jug]N...
dcclt/P225033,targul[pole]N targul[pole]N garig[comb]N
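
The effect of `groupby()` followed by `agg()` is easiest to see on a few rows. The following sketch uses invented text IDs and lemmas and aggregates two columns at once.

```python
import pandas as pd

# Invented data: two words in one text, one word in another.
toy = pd.DataFrame({
    "textid": ["a/P000001", "a/P000001", "a/P000002"],
    "lemma": ["ig[door]N", "eren[cedar]N", "guza[chair]N"],
    "base": ["ig", "eren", "guza"],
})

# One row per text; entries are joined in their original order.
grouped = toy.groupby("textid").agg({"lemma": " ".join, "base": " ".join})
print(grouped.loc["a/P000001", "lemma"])
```

After grouping, `textid` becomes the index of the dataframe rather than a regular column.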


## 4 Save Results in CSV file
The `.csv` file has the same name as the list of text IDs that was used at the beginning of this notebook, with the extension `.csv`. On most computers, `csv` files open automatically in Excel. This spreadsheet program does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding='utf-8'` to `encoding='utf-16'`. For use in computational text analysis applications, `utf-8` is usually preferred.
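
A quick way to confirm that special characters survive the round trip is to write a small dataframe and read it back. The sketch below uses an invented one-row dataframe and a temporary directory, so it does not touch the `output` folder.

```python
import os
import tempfile
import pandas as pd

# Invented one-row dataframe, indexed by text ID like the grouped result.
toy = pd.DataFrame(
    {"lemma": ["ŋešnu[tree]N šagkal[tree]N"]},
    index=pd.Index(["dcclt/Q000039"], name="textid"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, encoding="utf-8")    # write with an explicit encoding
    restored = pd.read_csv(path, encoding="utf-8", index_col="textid")

print(restored.loc["dcclt/Q000039", "lemma"])
```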

In [10]:
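savefile = filename[:-3] + 'csv'
# The encoding is set when the file is opened: to_csv() does not apply
# its own encoding argument to a file object that is already open in text mode.
with open('output/' + savefile, 'w', encoding='utf-8') as w:
    word_df.to_csv(w)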