# Third Parser: Selecting a Passage
by Niek Veldhuis
UC Berkeley

July 2017

The main difference between this parser and `Second_JSON_parser.ipynb` is the ability to define a region within a tablet to parse by giving a start label and/or a stop label. This may be used to omit colophons or to select one exercise among several on a school tablet.

This requires some extra parsing of the list of text IDs where labels are now optional. The list of text IDs may look like this:

- dcclt/Q000039
- dcclt/P247864 - r i 16
- dcclt/P228683 r 1 - r 4
- dcclt/P230849 r 1 -

The first will parse the entirety of `Q000039`; the second will take everything of `P247864` up to and including `r i 16`. The third will only take `r 1` to (and including) `r 4` of `P228683`. The fourth will parse everything of `P230849` starting from `r 1`. The format of the labels depends on the structure of the text object and the way it has been edited. The labels must follow exactly the form they have in the online [ORACC](http://oracc.org) edition.
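
The four forms above can be split into a text ID, a start label, and a stop label. The following is a minimal sketch of that step; the helper name `split_id_line` is hypothetical and not part of this notebook, and it assumes that the first hyphen on a line separates the two labels.

```python
def split_id_line(line):
    """Split one line of the ID list into (text_id, start_label, stop_label).

    Hypothetical helper for illustration; an empty string means 'no label'.
    """
    # everything before the first space is the text ID, the rest is the region
    text_id, _, region = line.strip().partition(' ')
    # the first hyphen separates the start label from the stop label
    start, _, stop = region.partition('-')
    return text_id, start.strip(), stop.strip()

examples = [
    'dcclt/Q000039',
    'dcclt/P247864 - r i 16',
    'dcclt/P228683 r 1 - r 4',
    'dcclt/P230849 r 1 -',
]
for line in examples:
    print(split_id_line(line))
```

An empty start label means "begin at the first line" and an empty stop label means "continue to the end of the text".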

## Licensing
This notebook may be downloaded, used and adapted without any restrictions.

In [None]:
import pandas as pd   
import requests
import zipfile
import tqdm
import numpy as np

In [None]:
name = input('Filename: ')

In [None]:
with open('text_ids/' + name, 'r') as f:
    pqxnos = f.read().splitlines()
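pqxnos = [no.strip() for no in pqxnos if no.strip()]   # strip spaces left and right; skip empty lines
nos_labels = [no.split(' ', 1) if " " in no else [no, '-'] for no in pqxnos] # isolate text ID
for label in nos_labels:    # separate start label and stop label
    # pad with '' so that there are always two elements, even if the hyphen is missing
    label[1] = (label[1].split('-', 1) + [''])[:2]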

## 1.5 Parse JSON files
The `parsejson()` below includes `id_word` which has the form `TextID.LineID.WordID` - in other words, line and text ID can be derived from it.

`parsejson()` takes a Boolean as its second argument. If `True`, the parser starts with the first word; if `False`, it starts when it reaches `startlabel`. The parser stops after the line labeled `endlabel`. The current `label`, `startlabel`, and `endlabel` are stored in the dictionary `labels`, outside of the function.

The list `dollar_keys` (also outside of the function) stores the relevant field names when capturing line breaks etc.

In [None]:
def parsejson(text, keep=False):
    # On the first (top-level) call `keep` seeds the shared state. The state is
    # stored in `labels` so that a change made inside a nested chunk is not lost
    # when the recursive call returns.
    labels.setdefault("keep", keep)
    for dictionary in text["cdl"]:
        if "cdl" in dictionary:
            parsejson(dictionary)
        if "label" in dictionary:
            labels["label"] = dictionary["label"]
        # an empty startlabel or endlabel means "no limit" and is never matched
        at_end = labels["endlabel"] != "" and labels["label"] == labels["endlabel"]
        if labels["startlabel"] != "" and labels["label"] == labels["startlabel"]:
            labels["keep"] = True
        if at_end:
            labels["keep"] = False
        if labels["keep"] or at_end:   # "or at_end" ensures that the line corresponding
            if "f" in dictionary:      # to the endlabel is included.
                lemma = dictionary["f"]
                lemma["id_word"] = dictionary["ref"]
                lemma["label"] = labels["label"]
                lemm_l.append(lemma)
            if "strict" in dictionary and dictionary["strict"] == "1":
                lemma = {key: dictionary.get(key, "") for key in dollar_keys}
                lemma["id_word"] = dictionary["ref"] + ".0"
                lemm_l.append(lemma)
    return lemm_l
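
A toy example may help to show how a region is selected from nested `cdl` lists. The sketch below is not the notebook's parser: the functions `collect_region` and `select` are hypothetical, the `toy` structure is a heavily simplified stand-in for an ORACC JSON file, and the region is closed when the first label after the stop label appears. The state is kept in a dictionary that is passed along, so that a change made inside a nested chunk remains visible after the recursive call returns.

```python
def collect_region(node, state, out):
    """Walk a nested 'cdl' structure and keep the words between two labels."""
    for item in node["cdl"]:
        if "cdl" in item:                      # a chunk: descend into it
            collect_region(item, state, out)
            continue
        if "label" in item:                    # a new line starts here
            if state["stop"] and state["label"] == state["stop"]:
                state["keep"] = False          # the stop line has just ended
            state["label"] = item["label"]
            if state["label"] == state["start"]:
                state["keep"] = True
        if "f" in item and state["keep"]:      # a word inside the region
            out.append((state["label"], item["f"]["form"]))

def select(tree, start="", stop=""):
    """Return (label, form) pairs; empty labels mean 'no limit'."""
    state = {"start": start, "stop": stop, "label": "", "keep": start == ""}
    out = []
    collect_region(tree, state, out)
    return out

# invented data: two chunks of two lines each, one word per line
toy = {"cdl": [
    {"cdl": [{"label": "o 1"}, {"f": {"form": "a"}},
             {"label": "o 2"}, {"f": {"form": "b"}}]},
    {"cdl": [{"label": "o 3"}, {"f": {"form": "c"}},
             {"label": "o 4"}, {"f": {"form": "d"}}]},
]}

print(select(toy, "o 2", "o 3"))   # [('o 2', 'b'), ('o 3', 'c')]
```

The selected region crosses the boundary between the two chunks, which is the case where a plain Boolean passed by value would be reset.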

## 1.6 Call the Parser for each Text ID

In [None]:
lemm_l = []
dollar_keys = ["extent", "scope", "state"]
for pqx in tqdm.tqdm(nos_labels):
    project = pqx[0][:-8]
    textid = pqx[0][-7:]
    labels = {"startlabel":pqx[1][0].strip(), "endlabel":pqx[1][1].strip(), "label":""}
    if labels["startlabel"] == "":
        keep = True
    else:
        keep = False
    url = "http://oracc.museum.upenn.edu/" + project + "/corpusjson/" + textid + ".json"
    try:
        r = requests.get(url).json()   # raises an error if the file is missing or is not valid JSON
        lemm_l = parsejson(r, keep)
    except Exception:
        print(url + ' is not available or not complete')

## 1.7 Transform the Data into a DataFrame
The `lemm_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame.  If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Since such fields are referenced in the code below (sections 2-4) the next cell will check for the existence of each column and create an empty column if necessary.

In [None]:
words = pd.DataFrame(lemm_l)
fields = ['base', 'cf', 'cont', 'epos', 'extent', 'form', 'gw', 'id_word',
          'label', 'lang', 'morph', 'norm', 'norm0', 'pos', 'scope', 'sense', 'sig']
for field in fields:
    if field not in words.columns:
        words[field] = ''
words = words.fillna('') # replace Missing Values by empty string
words.head(100)

## 1.8 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in tokenization, or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [None]:
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

# 3. Manipulate for Analysis on Line level (e.g. phylogenetics)
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as analyzed in the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project) one may need to create lemmas and combine these into lines by using the `id_line` variable.

## 3.1 Create Lemmas and Adjust Bases
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N` (Akkadian). Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

For Sumerian projects each lemmatized word has a `base` (the word without morphology). For non-lemmatized words a place-holder base is created that consists of the transliteration of the word. If you are not working with Sumerian data, the `base` column can be ignored.
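
The lemma format described above can be illustrated on two invented rows, one lemmatized and one not. This is only a sketch of the string format; the DataFrame `toy` and its contents are made up for the example.

```python
import pandas as pd

# invented data: one lemmatized word and one broken, unlemmatizable word
toy = pd.DataFrame({
    "cf":   ["lugal", ""],
    "gw":   ["king", ""],
    "pos":  ["N", "u"],
    "form": ["lugal-e", "i-bu-x"],
})

# Citation Form, GuideWord and POS if the word is lemmatized;
# otherwise the transliteration with 'NA' as GuideWord and POS
toy["lemma"] = [
    f"{cf}[{gw}]{pos}" if cf != "" else f"{form}[NA]NA"
    for cf, gw, pos, form in zip(toy["cf"], toy["gw"], toy["pos"], toy["form"])
]
print(toy["lemma"].tolist())   # ['lugal[king]N', 'i-bu-x[NA]NA']
```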

In [None]:
words["lemma"] = words.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = words.apply(lambda r: r["base"] if r["base"] != '' or r['label'] == '' else r['form'], axis=1)

## 3.2 Group by Line
In the `words` dataframe each word has a separate row. In order to change this into a line-by-line representation we use the Pandas `.groupby` function, using the `id_line`, `id_text` and `label` fields as arguments. `label` is a human-readable line number in the format `o ii 3`: obverse column 2, line 3. The field `id_line` is an integer that is created from `id_word`, which has the format `IDText.IDLine.IDWord`, for instance `P296528.23.1`. The field `id_text` is also derived from `id_word`.

The fields that are aggregated are `lemma`, `base`, `extent`, and `scope`. The fields `extent` and `scope` represent data on the number of broken lines. If you work with Akkadian data you want to leave out the field `base`.

## Note:
Make sure that code handles `id_line` that includes `l`.
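
The derivation of `id_text` and `id_line` from `id_word` can be tried out on a few IDs before it is applied to the full data set. In the sketch below the third ID, whose line part is not an integer, is invented for illustration; the exact form of such IDs in ORACC is an assumption here. `pd.to_numeric` with `errors="coerce"` turns such values into `NaN` instead of raising an error, which makes the affected rows easy to find.

```python
import pandas as pd

# the last ID is invented to show a line part that is not an integer
ids = pd.Series(["P296528.23.1", "P296528.23.2", "P296528.l5.1"])

# split TextID.LineID.WordID into three columns
parts = ids.str.split(".", expand=True)
parts.columns = ["id_text", "id_line", "word_no"]

# non-numeric line IDs become NaN rather than raising a ValueError
parts["id_line_num"] = pd.to_numeric(parts["id_line"], errors="coerce")
print(parts[parts["id_line_num"].isna()])   # rows that need special handling
```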

In [None]:
words['id_line'] = [int(wordid[wordid.find('.')+1:wordid.rfind('.')]) for wordid in words['id_word']]
words['id_text'] = [wordid[:7] for wordid in words['id_word']]

In [None]:
lines = words.groupby([words['id_text'], words['id_line'], words['label']]).agg({
        'lemma': ' '.join,
        'base': ' '.join,
        'extent': ''.join, 
        'scope': ''.join
    }).reset_index()
lines        

Note that the new `id_line` field is not a line number in the traditional sense of the word (this is `label`) but a number used to organize lines in the appropriate order.
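
The grouping step can be illustrated on a few invented rows. The sketch below uses a made-up DataFrame `toy` with only the `lemma` column aggregated; it shows how the words of one line are joined into a single string.

```python
import pandas as pd

# invented data: two words on line 1, one word on line 2
toy = pd.DataFrame({
    "id_text": ["P000001"] * 3,
    "id_line": [1, 1, 2],
    "label":   ["o 1", "o 1", "o 2"],
    "lemma":   ["lugal[king]N", "gal[big]V/i", "x[NA]NA"],
})

# one row per line; the lemmas of each line are joined with spaces
toy_lines = (toy.groupby(["id_text", "id_line", "label"])
                .agg({"lemma": " ".join})
                .reset_index())
print(toy_lines["lemma"].tolist())   # ['lugal[king]N gal[big]V/i', 'x[NA]NA']
```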

## 3.3 Save in CSV Format

In [None]:
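filename = name[:-4]    # drop the extension (assumes a three-letter extension such as .txt)
# pass the path rather than an open file handle, so that the encoding argument takes effect
lines.to_csv('output/' + filename + '.csv', encoding='utf8', index=False)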