# Extract Lemmatization from JSON: Extended Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more projects. The resulting `csv` (Comma Separated Values) file is named `parsed.csv` and has one row per line of text: a Text ID (e.g. `dcclt/Q000039`), a line ID, a line label, and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts), plus the breakage fields `extent`, `scope`, and `state`.

The output of the Extended Parser thus contains text IDs, line IDs, lemmas, and (potentially) other data. The first few code blocks are identical to those of the Basic Parser.

In [None]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. If they do not exist, they are created; otherwise nothing happens.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [None]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # re-raise anything other than "directory already exists"
            raise
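
As an alternative to the `try`/`except` pattern above, `os.makedirs()` accepts `exist_ok=True`, which creates a directory when it is missing and does nothing otherwise. The sketch below shows this; the temporary base directory is an assumption made only so that the example does not touch your working directory.

```python
import os
import tempfile

base = tempfile.mkdtemp()                 # throwaway location, for illustration only
created = []
for d in ['jsonzip', 'output']:
    path = os.path.join(base, d)
    os.makedirs(path, exist_ok=True)      # creates the directory
    os.makedirs(path, exist_ok=True)      # a second call is harmless
    created.append(path)
```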

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately, because they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [None]:
projects = input('Project(s): ').lower()

## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [None]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
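
To see what the two steps above produce, here is a small sketch with a hard-coded input string instead of `input()`. The extra `if x.strip()` filter, which drops empty entries left by a trailing comma, is an addition for illustration and is not part of the cell above.

```python
# Simulated user input with stray spaces, mixed case, and a trailing comma
projects = ' SAAO/saa01, saao/saa02 ,blms, '.lower()

# Split at each comma, strip surrounding spaces, and skip empty entries
p = [x.strip() for x in projects.split(',') if x.strip()]
print(p)   # ['saao/saa01', 'saao/saa02', 'blms']
```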

## 1.3 Download the ZIP files
For each project in the list download all the `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` function in the `requests` library takes care of that.

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section [2.1 The `parsejson()` function](#head21).

In [None]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, stream=True)   # stream=True defers the body so iter_content() can fetch it in chunks
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")
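
The naming rule for subprojects (slash becomes hyphen) can be isolated in a small helper. This is only a sketch: `zip_location()` is a hypothetical name, and the function builds strings without contacting the server.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"   # same address as in the cell above

def zip_location(project):
    """Return (url, local_path) for a project name such as 'cams/gkab'."""
    name = project.replace('/', '-') + '.zip'   # cams/gkab -> cams-gkab.zip
    return BASE_URL + name, 'jsonzip/' + name

url, path = zip_location('cams/gkab')
print(url)    # http://build-oracc.museum.upenn.edu/json/cams-gkab.zip
print(path)   # jsonzip/cams-gkab.zip
```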

## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function is essentially identical with that function in `First_JSON_parser.ipynb`, but it fetches more data. The field `word_id` consists of three parts, namely a text ID, line ID, and word ID, in the format `Q000039.76.2` meaning: the second word in line 76 of text object `Q000039`. Note that `76` is not a line number strictly speaking but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The `word_id` field allows us to put lines together in the proper order.

The field `label` is a human-legible label that refers to a line or another part of the text; it may look like `o i 23` (obverse column 1 line 23) or `r v 23'` (reverse column 5 line 23 prime). The `label` field is used in online [ORACC](http://oracc.org) editions to indicate line numbers.

The fields `extent`, `scope`, and `state` give metatextual data about the condition of the object; they capture the number of broken lines or columns and similar information. 


In [None]:
def parsejson(text, parameters):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, parameters)
        if "label" in JSONobject:
            parameters["label"] = JSONobject['label']
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = parameters["label"]
            lemma["id_text"] = parameters["id_text"]
            lemm_l.append(lemma)
        if "strict" in JSONobject and JSONobject["strict"] == "1":
            lemma = {key: JSONobject[key] for key in parameters["dollar_keys"]}
            lemma["id_word"] = JSONobject["ref"] + ".0"
            lemma["id_text"] = parameters["id_text"]
            lemm_l.append(lemma)
    return
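
The recursion is easier to follow on a toy example. The sketch below is a simplified variant of the function above: it returns its results instead of appending to a global list, and it leaves out the breakage branch. The name `collect_lemmas()` and the toy data are invented for illustration; only the keys `cdl`, `label`, `ref`, and `f` are taken from the code above.

```python
def collect_lemmas(node, id_text, state=None, out=None):
    """Walk nested "cdl" lists and collect one dictionary per word."""
    if state is None:
        state = {"label": None}      # remembers the most recent line label
    if out is None:
        out = []
    for obj in node.get("cdl", []):
        if "cdl" in obj:             # nested chunk: recurse
            collect_lemmas(obj, id_text, state, out)
        if "label" in obj:           # start of a new line
            state["label"] = obj["label"]
        if "f" in obj:               # a word with lemmatization data
            word = dict(obj["f"])    # copy, so the input is not modified
            word["id_word"] = obj["ref"]
            word["label"] = state["label"]
            word["id_text"] = id_text
            out.append(word)
    return out

# Invented toy data: two lines with one word each
toy_text = {"cdl": [{"cdl": [
    {"label": "o 1"},
    {"ref": "Q000039.1.1", "f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
    {"label": "o 2"},
    {"ref": "Q000039.2.1", "f": {"form": "x", "pos": "u"}},
]}]}

rows = collect_lemmas(toy_text, "dcclt/Q000039")
```

Each entry in `rows` carries the label of the line it belongs to, because the label is stored when the line starts and reused for every following word.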

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell iterates through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip` and is named `PROJECT.zip`. The `zip` file contains a file called `corpus.json` that holds a full list of all the text IDs available in that corpus (P, Q, and X numbers) under the key `members`. The `zip` file also contains a directory `corpusjson` that holds the text files; each one is called `P######.json` (or `Q######.json` or `X######.json`). The code below identifies the files to parse by selecting every name in the `zip` that includes `corpusjson` and ends in `.json`.

Each of these files is extracted from the `zip` file and read with the command `json.loads()`, which reads the json data and transforms it into a Python dictionary (a collection of keys and values).

This dictionary, which is called `data_json`, is now sent to the `parsejson()` function, together with the `parameters` dictionary that holds the current text ID. The function adds lemmata to the `lemm_l` list.

In [None]:
lemm_l = []
parameters = {"label": None, "id_text": None, "dollar_keys" : ["extent", "scope", "state"]}
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):   # file missing, unreadable, or not a ZIP
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    files = [name for name in files if "corpusjson" in name and name.endswith('.json')]   # keep only the text files in corpusjson
    for filename in files:                            #iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        parameters["id_text"] = id_text
        try:
            text = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            parsejson(data_json, parameters)               # and send to the parsejson() function
        except Exception:   # report the problem and continue with the next text
            print(id_text + ' is not available or not complete')
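
The file selection and the construction of `id_text` can be tried out without downloading anything. The sketch below builds a tiny ZIP archive in memory; the file contents are invented placeholders, not real ORACC data.

```python
import io
import json
import zipfile

# Build a toy archive that mimics the layout described above
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('dcclt/corpus.json', json.dumps({"members": {}}))        # placeholder content
    z.writestr('dcclt/corpusjson/Q000039.json', json.dumps({"cdl": []}))
    z.writestr('dcclt/corpusjson/P414332.json', json.dumps({"cdl": []}))

project = 'dcclt'
with zipfile.ZipFile(buffer) as z:
    # keep only the text files inside the corpusjson directory
    names = [n for n in z.namelist() if 'corpusjson' in n and n.endswith('.json')]
    # the last 13 characters are '/X######.json'; drop the '.json' part
    ids = [project + n[-13:-5] for n in names]
    texts = {i: json.loads(z.read(n).decode('utf-8')) for i, n in zip(ids, names)}

print(ids)   # ['dcclt/Q000039', 'dcclt/P414332']
```

Note that `corpus.json` is skipped, because its name does not include the string `corpusjson`.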

## 3 Data Structuring
### 3.1 Transform the Data into a DataFrame
The `lemm_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame. If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Where such fields are referenced in the code below (sections 3 and 4), the code may fail and you may need to take out some lines.

In [None]:
words = pd.DataFrame(lemm_l)
words = words.fillna('')   # replace NaN (Not a Number) with empty string
words
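
Instead of taking out lines by hand, one possible workaround is to add any missing column as an empty column before continuing. The sketch below uses invented toy records, and the list `expected` is an assumption about which columns the later cells need.

```python
import pandas as pd

# Toy records: no breakage information, and the second word is unlemmatized
lemm_l = [
    {"id_text": "blms/P414332", "id_word": "P414332.3.1", "label": "o 3",
     "form": "šar-ru", "cf": "šarru", "gw": "king", "pos": "N", "norm": "šarru"},
    {"id_text": "blms/P414332", "id_word": "P414332.3.2", "label": "o 3",
     "form": "x", "pos": "X"},
]
words = pd.DataFrame(lemm_l)

# Columns referenced later in the notebook (assumed list)
expected = ["cf", "gw", "sense", "pos", "form", "norm", "label", "extent", "scope", "state"]
for col in expected:
    if col not in words.columns:
        words[col] = ''          # add the missing column as empty strings
words = words.fillna('')         # replace NaN with empty string
```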

### 3.2 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble for tokenization in computational methods, or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [None]:
findreplace = {' ' : '-', ',' : ''}
words = words.replace({'gw' : findreplace, 'sense' : findreplace}, regex=True)
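
The effect of the replacement can be checked on a toy DataFrame. The sketch below uses `Series.str.replace()` rather than `DataFrame.replace()`, a different but equivalent route for plain string columns; the Guide Word and Sense values are invented.

```python
import pandas as pd

words = pd.DataFrame({"gw": ["king", "to be, exist"],
                      "sense": ["king", "be present"]})

for col in ["gw", "sense"]:
    words[col] = (words[col]
                  .str.replace(' ', '-', regex=False)   # spaces become hyphens
                  .str.replace(',', '', regex=False))   # commas are dropped

print(words["gw"].tolist())      # ['king', 'to-be-exist']
print(words["sense"].tolist())   # ['king', 'be-present']
```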

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (integer line reference, taken from the middle part of id_word; this column is added in section 3.3.2)
* id_text (project name plus P, Q, or X number, for instance blms/P414332)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

### 3.3 Manipulate for Analysis on Line Level
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as analyzed in the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project) one may need to create lemmas and combine these into lines by using the `id_line` variable.

#### 3.3.1 Create Lemma Column
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N` (Akkadian). Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and `NA` as POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

In [None]:
words["lemma"] = words.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
words['lemma'] = [lemma if lemma != '[NA]NA' else '' for lemma in words['lemma']]   # kick out empty forms
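
The three possible outcomes (a full lemma, a place-holder, or an empty string) can be seen in the following sketch. It uses a named function, `make_lemma()`, which is a hypothetical helper that combines the two steps above into one; the toy rows are invented.

```python
import pandas as pd

words = pd.DataFrame({
    "cf":   ["lugal", "", ""],
    "gw":   ["king", "", ""],
    "pos":  ["N", "u", ""],
    "form": ["lugal", "i-bu-x", ""],   # the last row has no form at all
})

def make_lemma(row):
    if row["cf"] != '':                       # lemmatized word
        return row["cf"] + '[' + row["gw"] + ']' + row["pos"]
    if row["form"] != '':                     # unlemmatized or broken word
        return row["form"] + '[NA]NA'
    return ''                                 # nothing to represent

words["lemma"] = words.apply(make_lemma, axis=1)
print(words["lemma"].tolist())   # ['lugal[king]N', 'i-bu-x[NA]NA', '']
```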

#### 3.3.2 Group by Line
In the `words` DataFrame each word has a separate row. In order to change this to a line-by-line representation we use the Pandas `.groupby` function, using the `id_text`, `id_line` and `label` fields as the grouping keys.

The field `id_line` is created by splitting `id_word` into three elements. The format of `id_word` is `IDtext.line.word`. The middle part, `id_line` is made into an integer so that it can be used to put the lines into their proper order (note that `id_line` is an abstract reference number that indicates the sequence of lines in a text object; `label` is a human-readable line number in the format `o ii 3`: obverse column 2, line 3). 

The fields that are aggregated are `lemma`, `extent`, `scope`, and `state`. The fields `extent`, `scope`, and `state` represent data on the number of broken lines. For instance, the notation `4 lines missing` in the [ORACC](http://oracc.org) edition will result in `extent = "4"`, `scope = "line"`, `state = "missing"` (note that the value of `extent` is a string and will be `"n"` if the number of missing lines or columns is unknown).

In [None]:
words['id_line'] = [int(wordid.split('.')[1]) for wordid in words['id_word']]

In [None]:
lines = words.groupby(['id_text', 'id_line', 'label']).agg({
        'lemma': ' '.join,
        'extent': ''.join, 
        'scope': ''.join,
        'state': ''.join
    }).reset_index()
lines
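
The following sketch, on invented toy data, shows why `id_line` is converted to an integer: as strings, `"10"` would sort before `"2"`, so line 10 would end up ahead of line 2. Only the `lemma` column is aggregated here, to keep the example short.

```python
import pandas as pd

words = pd.DataFrame({
    "id_text": ["dcclt/Q000039"] * 3,
    "id_word": ["Q000039.10.1", "Q000039.2.1", "Q000039.2.2"],
    "label":   ["o 10", "o 2", "o 2"],
    "lemma":   ["e[house]N", "lugal[king]N", "gal[big]V/i"],
})

# middle part of id_word, as an integer, gives the line order
words["id_line"] = [int(w.split('.')[1]) for w in words["id_word"]]

lines = (words.groupby(["id_text", "id_line", "label"])
              .agg({"lemma": ' '.join})       # join the lemmas of one line
              .reset_index())

print(lines["id_line"].tolist())   # [2, 10]
print(lines["lemma"].tolist())     # ['lugal[king]N gal[big]V/i', 'e[house]N']
```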

#### 3.3.3 Alternative: Texts in Normalized Transcription
This code essentially follows the pattern of the preceding section, but groups words by text rather than by line. Before grouping, we need to create a dummy for words that have not been normalized, using the field `form`.

In [None]:
words["norm1"] = words.apply(lambda r: (r["norm"]) if r["norm"] != '' else r['form'], axis=1)

In [None]:
texts_norm = words.groupby('id_text').agg({
        'norm1': ' '.join,
    }).reset_index()
texts_norm
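
A toy example of the fallback from `norm` to `form`. The sketch uses `Series.where()`, a vectorized alternative to the `apply()` call above; the data are invented.

```python
import pandas as pd

words = pd.DataFrame({
    "id_text": ["blms/P414332"] * 3,
    "norm":    ["šarru", "", "rabû"],
    "form":    ["LUGAL", "x", "GAL"],
})

# keep norm where it is not empty, otherwise fall back on form
words["norm1"] = words["norm"].where(words["norm"] != '', words["form"])

texts_norm = words.groupby("id_text").agg({"norm1": ' '.join}).reset_index()
print(texts_norm.loc[0, "norm1"])   # šarru x rabû
```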

## 4 Save Results in CSV file
The line-by-line output file is called `parsed.csv` and is placed in the directory `output`. On most computers, `csv` files open automatically in Excel, which does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding="utf-8"` to `encoding="utf-16"`. For usage in computational text analysis applications `utf-8` is usually preferred. The last cell in this section writes the normalized transcription of each text to a separate file in `output`, named after its P, Q, or X number.

(Alternatively, use the instructions [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0) to import a `utf-8` file into Excel).

In [None]:
savefile =  'parsed.csv'
with open('output/' + savefile, 'w', encoding="utf-8") as w:
    lines.to_csv(w, index=False)
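
One further option, not used in this notebook, is Python's `utf-8-sig` codec, which writes a byte order mark at the start of the file. Recent versions of Excel generally use this mark to recognize `utf-8`, but whether that holds on your system is an assumption to verify. The sketch writes to a temporary directory and uses an invented one-row DataFrame.

```python
import os
import tempfile
import pandas as pd

lines = pd.DataFrame({"id_text": ["dcclt/Q000039"], "lemma": ["šarru[king]N"]})

path = os.path.join(tempfile.mkdtemp(), 'parsed.csv')    # throwaway location
lines.to_csv(path, index=False, encoding='utf-8-sig')    # utf-8 with byte order mark

with open(path, 'rb') as f:
    head = f.read(3)             # the first three bytes are the byte order mark

check = pd.read_csv(path, encoding='utf-8-sig')          # read back without the mark
```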

In [None]:
for idx, Q in enumerate(texts_norm["id_text"]):
    savefile = Q[-7:] + '.txt'   # for instance P414332.txt
    with open('output/' + savefile, 'w', encoding="utf-8") as w:
        texts_norm.iloc[idx].to_csv(w, index = False)