# Second Parser: Lines and Breaks
by Niek Veldhuis
UC Berkeley

July 2017


# Introduction

The three main differences between `First_JSON_parser.ipynb` and the current notebook are

- the ability to parse an entire corpus
- recognizing lines
- including breaks (as in "3 lines broken").

Although these features somewhat complicate the code, the basic techniques used are the same.

The resulting data file may include various elements of the ORACC data structure. The current code will output a file with the following fields: 

* id_text
* id_line
* label
* lemma (a sequence of lemmas in a line)
* extent
* scope
* state

The fields `extent`, `scope`, and `state` capture the number of missing lines or columns.

The selection of fields may be adjusted with standard `Pandas` functions.

## Notes

This notebook is written for **Python 3.5** with **Pandas 0.19** and **requests 2.12.4**.


## Licensing
This notebook may be downloaded, used, adapted and distributed without restrictions ([CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/)).

In [1]:
import pandas as pd   
import requests
import zipfile
import tqdm
import numpy as np
import json

# Input List of Text IDs or a project abbreviation
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421
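
The code further down separates the project from the text number by slicing: the last seven characters are taken to be the P, Q, or X number, and everything before the final slash is the project. A minimal sketch of that split (the helper name `split_text_id` is hypothetical and not part of the notebook):

```python
def split_text_id(text_id):
    """Split 'PROJECT[/SUBPROJECT]/P######' into (project, text number)."""
    text_id = text_id.strip()
    project = text_id[:-8].lower()   # everything before the final '/'
    number = text_id[-7:].upper()    # the letter plus six digits
    return project, number

for example in ["dcclt/P117395", "etcsri/Q001203", "rinap/rinap1/Q003421"]:
    print(split_text_id(example))
```

The slicing only works if every ID ends in exactly one letter followed by six digits, which is why trailing text is stripped from each line of the list before it is used.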

The list should be created with a plain text editor such as TextEdit or Emacs, and the filename should end in `.txt`.

Alternatively, one may enter the name (abbreviation) of a project or sub-project in [ORACC](http://oracc.org) and pull all the lemmatized data from that project. Note that the script will not automatically pull data from subprojects; they have to be requested separately. Examples:
* saao/saa01
* aemw/amarna
* rimanum

In [2]:
name = input('Filename or project abbreviation: ')

Filename or project abbreviation: saao/sargonletters


In [None]:
if name[-4:] == '.txt':
    textids = 'text_ids/' + name
    with open(textids, 'r') as f:
        pqxnos = f.readlines()
    pqxnos = [x.strip() for x in pqxnos]  # strip spaces left and right
    pqxnos = [x for x in pqxnos if not x == ""] # strip empty lines
    pqxnos = [x.split()[0] for x in pqxnos] # strip everything after first space
#    pqxnos = [x[-7:].upper() for x in pqxnos]
    projects = [x[:-8].lower() for x in pqxnos]
    projects = list(set(projects))
else:
    name = name.strip().lower()
    projects = [name]
    url = "http://oracc.org/" + name + "/corpus.json"
    r = requests.get(url)
    corpus = r.json()
    pqxnos = list(corpus["members"].keys())
    pqxnos = [name + '/' + no for no in pqxnos]

## 1.2 Create Download Directory and JSON directory
For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist)

In [None]:
import errno
import os
for directory in ('jsonzip', 'json'):
    try:
        os.mkdir(directory)
    except OSError as exc:
        # ignore "directory already exists"; re-raise anything else
        if exc.errno != errno.EEXIST:
            raise
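
On Python 3.2 and later the same effect can be had with `os.makedirs(..., exist_ok=True)`. A minimal sketch, run inside a temporary directory (an assumption made for illustration only) so that it does not touch the working directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    for directory in ("jsonzip", "json"):
        path = os.path.join(base, directory)
        os.makedirs(path, exist_ok=True)   # creates the directory
        os.makedirs(path, exist_ok=True)   # second call is a no-op, no error
    created = sorted(os.listdir(base))

print(created)
```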

## 1.3 Download `json.zip`
For each project from which files are to be processed, download the entire project (all the JSON files) from `https://github.com/oracc/json`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`).

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to split the download into chunks. The `iter_content()` method in the `requests` library takes care of that.

Although downloading the entire zip file is time consuming, it will make processing the individual files much more efficient and the code is less likely to break due to interruption in connectivity.

## Note:
It may be better to download the `zip` file from [ORACC](http://oracc.org), where it is available as `http://oracc.org/[PROJECT]/json.zip`. This version is updated whenever a project is updated. At the time of writing, however, that file does not seem to be accessible.

In [None]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(projects):
    project = project.replace('/', '-')
    url = "https://raw.github.com/oracc/json/master/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    print("Downloading " + url + " saving as " + file)
    r = requests.get(url, stream=True)  # stream, so that iter_content() really downloads in chunks
    with open(file, 'wb') as f:
        for c in r.iter_content(chunk_size=CHUNK):
            f.write(c)

## 1.4 Extract JSON files from `json.zip`
Extract the texts listed in the list of text IDs from the `json.zip`. All files are extracted to a directory called `json/[PROJECT]/corpusjson` (for instance `json/dcclt/corpusjson`). If the file belongs to a subproject the directory is called `json/[PROJECT]/[SUBPROJECT]/corpusjson`.

In [None]:
target_dir = 'json'
files_l = []
for no in tqdm.tqdm(pqxnos):
    project = no[:-8].lower()
    pno = no[-7:].upper()
    zip_file = "jsonzip/" + project.replace('/', '-') + ".zip"
    with zipfile.ZipFile(zip_file,"r") as zip_ref:
        file = project + '/corpusjson/' + pno + '.json'
        try:
            zip_ref.extract(file, target_dir)
            files_l.append(file)
        except KeyError:  # the archive has no member with this name
            print(no + ' is not available')
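
When a requested member is not in the archive, `zipfile` raises a `KeyError`, which is what makes the "not available" message possible. A minimal sketch with a tiny archive built in memory; the project name `demo` and the file content are made up for illustration, and `getinfo()` is used instead of `extract()` only to avoid writing to disk (`extract()` looks the member up in the same way):

```python
import io
import zipfile

# Build a tiny archive in memory; project name and content are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("demo/corpusjson/P000001.json", '{"cdl": []}')

found, missing = [], []
with zipfile.ZipFile(buffer, "r") as z:
    for member in ["demo/corpusjson/P000001.json", "demo/corpusjson/P000002.json"]:
        try:
            z.getinfo(member)      # raises KeyError if the member is absent
            found.append(member)
        except KeyError:
            missing.append(member)

print(found, missing)
```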

In [None]:
files_l

## 1.5 Parse JSON files
The `parsejson()` function is essentially identical to the function of that name in `First_JSON_parser.ipynb`, but it fetches more data. The field `id_word` consists of three parts, namely a text ID, line ID, and word ID, in the format `Q000039.76.2` meaning: the second word in line 76 of text object `Q000039`. Note that `76` is not a line number strictly speaking but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The `id_word` field allows us to put lines together in the proper order.

The field `label` is a human-legible label that refers to a line or another part of the text; it may look like `o i 23` (obverse column 1 line 23) or `r v 23'` (reverse column 5 line 23 prime). The `label` field is used in online [ORACC](http://oracc.org) editions to indicate line numbers.

The fields `extent`, `scope`, and `state` give metatextual data about the condition of the object; they capture the number of broken lines or columns and similar information. 



In [None]:
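def parsejson(text, parameters):
    """Walk the nested "cdl" lists recursively and append one dictionary
    per word or break notation to the global list lemm_l."""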
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, parameters)
        if "label" in JSONobject:
            parameters["label"] = JSONobject['label']
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = parameters["label"]
            lemma["id_text"] = parameters["id_text"]
            lemm_l.append(lemma)
        if "strict" in JSONobject and JSONobject["strict"] == "1":
            lemma = {key: JSONobject[key] for key in parameters["dollar_keys"]}
            lemma["id_word"] = JSONobject["ref"] + ".0"
            lemma["id_text"] = parameters["id_text"]
            lemm_l.append(lemma)
    return
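
To see what the recursion produces, the following sketch runs the same steps on a hand-made toy structure. The function `collect` is a hypothetical variant of `parsejson()` that takes the result list as an argument and copies the `f` dictionary instead of modifying it; the toy input is invented and much simpler than a real ORACC file.

```python
def collect(node, state, rows):
    """Walk nested 'cdl' lists, following the same steps as parsejson()."""
    for obj in node["cdl"]:
        if "cdl" in obj:                      # nested chunk: recurse
            collect(obj, state, rows)
        if "label" in obj:                    # remember the current line label
            state["label"] = obj["label"]
        if "f" in obj:                        # a word with its lemmatization
            row = dict(obj["f"])
            row["id_word"] = obj["ref"]
            row["label"] = state["label"]
            row["id_text"] = state["id_text"]
            rows.append(row)
        if obj.get("strict") == "1":          # a break such as "3 lines missing"
            row = {key: obj[key] for key in state["dollar_keys"]}
            row["id_word"] = obj["ref"] + ".0"
            row["id_text"] = state["id_text"]
            rows.append(row)

# Hand-made toy input; real ORACC files carry many more keys.
toy = {"cdl": [
    {"cdl": [
        {"label": "o 1"},
        {"ref": "P000001.3.1",
         "f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
        {"ref": "P000001.4", "strict": "1",
         "extent": "3", "scope": "line", "state": "missing"},
    ]}
]}

state = {"label": None, "id_text": "demo/P000001",
         "dollar_keys": ["extent", "scope", "state"]}
rows = []
collect(toy, state, rows)
for row in rows:
    print(row)
```

The break receives the artificial word number `0`, so that its `id_word` has the same three-part shape as that of a real word.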

## 1.6 Call the Parser Function for Each Textid

In [None]:
lemm_l = []
parameters = {"label": None, "id_text": None, "dollar_keys" : ["extent", "scope", "state"]}
for file in tqdm.tqdm(files_l):
    parameters["id_text"] = file.replace('corpusjson/', '')[:-5]
    with open("json/" + file) as data_file:
        text = json.load(data_file)
    try:
        parsejson(text, parameters)
    except Exception:
        print(file + ' is not available or not complete')

## 2 Data Structuring
### 2.1 Transform the Data into a DataFrame
The `lemm_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame. If a text has no breakage information in the form of `$ 1 line broken` (etc.), the fields `extent`, `scope`, and `state` do not exist. Since such fields are referenced in the code below, each of them should be checked for and created as an empty column if necessary. Note that the next cell, as it stands, only replaces missing values by empty strings; it does not add missing columns.

In [None]:
words = pd.DataFrame(lemm_l)
words = words.fillna('') # replace Missing Values by empty string
words
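
One way to guard against missing columns is sketched here on a toy DataFrame. The list `needed` is an assumption, compiled from the columns that the cells below refer to; the data are made up.

```python
import pandas as pd

# Toy rows without any breakage information (made-up data).
rows = [{"id_word": "P000001.3.1", "form": "lugal",
         "cf": "lugal", "gw": "king", "pos": "N"}]
demo = pd.DataFrame(rows)

# Assumed list of columns that later cells rely on.
needed = ["cf", "gw", "sense", "pos", "form", "label", "extent", "scope", "state"]
for column in needed:
    if column not in demo.columns:
        demo[column] = ""          # create an empty column
demo = demo.fillna("")             # replace missing values by empty strings
print(sorted(demo.columns))
```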

### 2.2 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble during tokenization or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [None]:
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (line reference within the text; added in section 3.2 from the middle part of `id_word`)
* id_text (project abbreviation followed by the six-digit P, Q, or X number, as in `dcclt/P117395`)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

# 3. Manipulate for Analysis on Line level
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as analyzed in the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project) one may need to create lemmas and combine these into lines by using the `id_line` variable.

## 3.1 Create Lemmas and Adjust Bases
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N` (Akkadian). Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and `NA` as POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

In [None]:
words["lemma"] = words.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
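
The two statements above amount to a three-way decision per row. A minimal sketch of that decision as a plain function (the name `make_lemma` is hypothetical):

```python
def make_lemma(cf, gw, pos, form):
    """Return an ORACC-style lemma, or a placeholder built on the transliteration."""
    if cf != "":
        return cf + "[" + gw + "]" + pos   # lemmatized word
    if form != "":
        return form + "[NA]NA"             # unlemmatized or unlemmatizable word
    return ""                              # no word at all, e.g. a break notation

print(make_lemma("lugal", "king", "N", "lugal"))
print(make_lemma("", "", "X", "i-bu-x"))
print(make_lemma("", "", "", ""))
```

Rows that represent breaks have neither a Citation Form nor a transliteration, which is why their lemma is left empty rather than set to `[NA]NA`.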

## 3.2 Group by Line
In the `words` DataFrame each word has a separate row. To change this to a line-by-line representation we use the Pandas `.groupby()` function, with the `id_text`, `id_line`, and `label` fields as grouping keys.

The field `id_line` is created by splitting `id_word` into three elements. The format of `id_word` is `IDtext.line.word`. The middle part, `id_line` is made into an integer so that it can be used to put the lines into their proper order (note that `id_line` is an abstract reference number that indicates the sequence of lines in a text object; `label` is a human-readable line number in the format `o ii 3`: obverse column 2, line 3). 

The fields that are aggregated are `lemma`, `extent`, `scope`, and `state`. The fields `extent`, `scope`, and `state` represent data on the number of broken lines. For instance, the notation `4 lines missing` in the [ORACC](http://oracc.org) edition will result in `extent = "4"`, `scope = "line"`, `state = "missing"` (note that the value of `extent` is a string and will be `"n"` if the number of missing lines or columns is unknown).

In [None]:
#words['id_line'] = [wordid[:wordid.rfind('.')+1] for wordid in words['id_word']]
words['id_line'] = [int(wordid.split('.')[1]) for wordid in words['id_word']]
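
The conversion to an integer matters for ordering: as strings, line references sort character by character. A small illustration with made-up word IDs:

```python
word_ids = ["Q000039.9.1", "Q000039.76.2", "Q000039.10.1"]

as_text = sorted(w.split(".")[1] for w in word_ids)
as_int = sorted(int(w.split(".")[1]) for w in word_ids)

print(as_text)   # string comparison: ['10', '76', '9']
print(as_int)    # numeric comparison: [9, 10, 76]
```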

In [None]:
lines = words.groupby([words['id_text'], words['id_line'], words['label']]).agg({
        'lemma': ' '.join,
        'extent': ''.join, 
        'scope': ''.join,
        'state': ''.join
    }).reset_index()
lines        
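
The effect of the aggregation can be seen on a toy DataFrame with two words in one line, followed by a break. The data are made up, and the names `demo` and `demo_lines` are used only for this illustration.

```python
import pandas as pd

demo = pd.DataFrame([
    {"id_text": "demo/P000001", "id_line": 3, "label": "o 1",
     "lemma": "lugal[king]N", "extent": "", "scope": "", "state": ""},
    {"id_text": "demo/P000001", "id_line": 3, "label": "o 1",
     "lemma": "gal[big]V/i", "extent": "", "scope": "", "state": ""},
    {"id_text": "demo/P000001", "id_line": 4, "label": "",
     "lemma": "", "extent": "3", "scope": "line", "state": "missing"},
])

# One row per (text, line, label); lemmas are joined with a space.
demo_lines = demo.groupby(["id_text", "id_line", "label"]).agg({
    "lemma": " ".join,
    "extent": "".join,
    "scope": "".join,
    "state": "".join,
}).reset_index()
print(demo_lines)
```

The two words of line 3 end up in a single row, and the break keeps a row of its own with an empty `lemma` and filled-in `extent`, `scope`, and `state`.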

## 3.3 Save in CSV Format

In [None]:
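# drop the '.txt' extension, or turn a project path such as 'saao/saa01' into 'saao-saa01'
filename = name[:-4] if name.endswith('.txt') else name.replace('/', '-')
os.makedirs('output', exist_ok=True)  # make sure the output directory exists
lines.to_csv('output/' + filename + '.csv', encoding='utf8')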