# Third Parser: Selecting a Passage
by Niek Veldhuis
UC Berkeley

July 2017

The main difference between this parser and `Second_JSON_parser.ipynb` is the ability to define a region within a tablet to parse by giving a start label and/or a stop label. This may be used to omit colophons or to select one exercise among several on a school tablet.

This requires some extra parsing of the list of text IDs where labels are now optional. The list of text IDs may look like this:

- dcclt/Q000039
- dcclt/P247864 - r i 16
- dcclt/P228683 r 1 - r 4
- dcclt/P230849 r 1 -

The first will parse the entirety of `Q000039`; the second will take everything of `P247864` up to and including `r i 16`. The third will only take `r 1` to (and including) `r 4` of `P228683`. The fourth will parse everything of `P230849` starting from `r 1`. The format of the labels depends on the structure of the text object and the way it has been edited. The labels must follow exactly the form they have in the online [ORACC](http://oracc.org) edition.
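The four forms above can be split into a text ID, a start label, and a stop label. The following is a minimal sketch, not the code used further down in this notebook; the helper name `split_text_id()` is hypothetical, and it assumes that a label never contains a hyphen itself.

```python
def split_text_id(line):
    """Split one line of the text ID list into (text_id, start_label, stop_label).

    Missing labels are returned as empty strings.
    Assumption: labels do not contain a hyphen.
    """
    line = line.strip()
    text_id, _, rest = line.partition(' ')   # the text ID ends at the first space
    start, _, stop = rest.partition('-')     # the labels are separated by a hyphen
    return text_id, start.strip(), stop.strip()


examples = [
    'dcclt/Q000039',
    'dcclt/P247864 - r i 16',
    'dcclt/P228683 r 1 - r 4',
    'dcclt/P230849 r 1 -',
]
parsed = [split_text_id(line) for line in examples]
for item in parsed:
    print(item)
```

An empty start label stands for "from the beginning" and an empty stop label for "to the end".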

## Licensing
This notebook may be downloaded, used and adapted without any restrictions.

In [1]:
import pandas as pd   
import requests
import zipfile
import tqdm
import numpy as np

In [2]:
name = input('Filename: ')

Filename: Q39_par.txt


In [3]:
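with open('text_ids/' + name, 'r', encoding='utf8') as f:
    pqxnos = f.read().splitlines()
pqxnos = [no.strip() for no in pqxnos if no.strip()]  # strip whitespace, skip blank lines
# Split each entry into [text ID, labels]; an entry without labels gets '-'.
nos_labels = [no.split(' ', 1) if " " in no else [no, '-'] for no in pqxnos]

# Split the labels on the hyphen into [start label, stop label].
for label in nos_labels:
    label[1] = label[1].split('-')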

## 1.5 Parse JSON files
*Note: this should be replaced by some version of `parsejson()` in `First_JSON_parser`.*

The `parsejson()` function below includes `id_word`, which has the form `TextID.LineID.WordID`; in other words, line and text ID can be derived from it.

For the caller function make sure that texts from different projects will be parsed correctly (does pqxnos include a project?).

========================

The function `oraccjasonparser()` takes one argument (the **ID** or **P/Q/X-number** of the `.json` file). It looks for the prefix `textid` to retrieve the six-digit P, Q, or X number of the text artifact. Parsing the file sequentially, the code looks for the places where a line starts (`'.type' = 'line-start'`) and where a word starts (`'.node' = 'l'`, where `l` is for "lemma"). At each level the code will retrieve the relevant data and create a list where each entry is a dictionary that represents a single word.

Words not only include lemmatized words, but also unlemmatized and unlemmatizable words (such as breaks).

The dictionary includes the keys `id_line` and `id_word` that allow the user to reassemble words and lines in order.
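The walk described above can be tried out on a toy input. This is only a sketch: `toy_text` is an invented and heavily simplified stand-in for an ORACC JSON file (it uses only the keys `cdl`, `label`, `ref`, and `f`), and the generator `iter_words()` is a hypothetical helper, not the `parsejson()` function of this notebook.

```python
def iter_words(node, state=None):
    """Yield one dictionary per word, tagged with the label of its line.

    `state` carries the current line label through the recursion, so a
    label that was set in one chunk is still known in the next one.
    """
    if state is None:
        state = {'label': ''}
    for item in node.get('cdl', []):
        if 'label' in item:            # a line start: remember its label
            state['label'] = item['label']
        if 'f' in item:                # a word: copy its data and add the IDs
            word = dict(item['f'])
            word['id_word'] = item['ref']
            word['label'] = state['label']
            yield word
        if 'cdl' in item:              # a nested chunk: descend into it
            yield from iter_words(item, state)


# Toy input, loosely modeled on an ORACC JSON file (an assumption).
toy_text = {'cdl': [{'node': 'c', 'cdl': [
    {'node': 'd', 'type': 'line-start', 'label': 'o 1'},
    {'node': 'l', 'ref': 'X000001.1.1', 'f': {'form': 'lugal', 'cf': 'lugal'}},
    {'node': 'd', 'type': 'line-start', 'label': 'o 2'},
    {'node': 'l', 'ref': 'X000001.2.1', 'f': {'form': 'e₂', 'cf': 'e'}},
    {'node': 'l', 'ref': 'X000001.2.2', 'f': {'form': 'gal', 'cf': 'gal'}},
]}]}

words_demo = list(iter_words(toy_text))

# id_word has the form TextID.LineID.WordID, so the line ID is everything
# before the last period.
line_ids = [w['id_word'].rsplit('.', 1)[0] for w in words_demo]
print(line_ids)
```

Because every word keeps its `id_word`, the words can always be regrouped into lines and put back in their original order.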

In [4]:
# This parsejson() takes as second argument a list [startlabel, endlabel].
def parsejson(text, labels=("", ""), lemm_l=None):
    """Collect the words of `text` from startlabel up to and including endlabel.

    An empty startlabel means "from the beginning"; an empty endlabel means
    "to the end". Note that `label` and `keep` are local to each recursive
    call, so the passage is selected within a single `cdl` list.
    """
    label = ""
    dollar_keys = ["extent", "scope", "state"]
    startlabel = labels[0].strip()
    endlabel = labels[1].strip()
    keep = startlabel == ""  # without a start label, keep from the first line
    if lemm_l is None:
        lemm_l = []
    for dictionary in text["cdl"]:
        if "cdl" in dictionary:
            parsejson(dictionary, labels, lemm_l)
        if "label" in dictionary:
            label = dictionary["label"]
        if label == startlabel:
            keep = True
        # An empty endlabel must not match the (empty) initial label.
        at_end = endlabel != "" and label == endlabel
        if at_end:
            keep = False
        if keep or at_end:  # "or at_end" ensures that the line
            if "f" in dictionary:  # corresponding to the endlabel is included.
                lemma = dictionary["f"]
                lemma["id_word"] = dictionary["ref"]
                lemma["label"] = label
                lemm_l.append(lemma)
            if "strict" in dictionary and dictionary["strict"] == "1":
                # $-line (breakage information); it receives no label
                lemma = {key: dictionary[key] for key in dollar_keys}
                lemma["id_word"] = dictionary["ref"] + ".0"
                lemm_l.append(lemma)
    return lemm_l

In [5]:
word_l = []
for pqx in tqdm.tqdm(nos_labels):
    project = pqx[0][:-8]  # everything before the slash and the 7-character text ID
    textid = pqx[0][-7:]   # P, Q, or X number
    labels = pqx[1]
    url = "http://oracc.org/" + project + "/corpusjson/" + textid + ".json"
    try:
        r = requests.get(url).json()  # inside the try, so a failed download is reported
        word_l.extend(parsejson(r, labels))
    except Exception:
        print(url + ' is not available or not complete')

100%|██████████| 138/138 [02:01<00:00,  1.44s/it]


## 1.7 Transform the Data into a DataFrame
The `word_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame.  If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Since such fields are referenced in the code below (sections 2-4) the next cell will check for the existence of each column and create an empty column if necessary.
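A minimal sketch of this check on a toy DataFrame; the two rows and the shortened column list `needed` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame([
    {'form': 'šar-ru', 'cf': 'šarru', 'lang': 'akk'},
    {'form': 'x', 'lang': 'akk'},           # unlemmatized word: no 'cf'
])
needed = ['base', 'cf', 'extent', 'form', 'lang', 'scope', 'state']
for column in needed:
    if column not in toy.columns:
        toy[column] = ''                     # create the missing column
toy = toy.fillna('')                         # missing values become ''
print(toy.columns.tolist())
```

After this step a reference such as `toy['base']` no longer raises a `KeyError`, even though the toy data contain no Sumerian.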

In [8]:
words = pd.DataFrame(word_l)
fields = ['base', 'cf', 'cont', 'epos', 'extent', 'form', 'gw', 'id_line', 'id_text', 'id_word',
          'label', 'lang', 'morph', 'norm', 'norm0', 'pos', 'scope', 'sense', 'sig', 'state']
for field in fields:
    if field not in words.columns:
        words[field] = ''
words = words.fillna('') # replace Missing Values by empty string
words.head(100)

|  | base | cf | cont | delim | epos | extent | form | gdl | gw | id_word | ... | morph | norm | norm0 | pos | scope | sense | state | id_line | id_text | sig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {ŋeš}taškarin | taškarin |  |  | N |  | {ŋeš}taškarin | [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... | boxwood | Q000039.1.1 | ... | ~ |  | taškarin | N |  | box tree, boxwood |  |  |  |  |
| 1 | {ŋeš}esi | esi |  |  | N |  | {ŋeš}esi | [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... | tree | Q000039.2.1 | ... | ~ |  | esi | N |  | ebony |  |  |  |  |
| 2 | ŋeš-nu₁₁ | ŋešnu |  |  | N |  | {ŋeš}nu₁₁ | [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... | tree | Q000039.3.1 | ... | ~ |  | ŋešnu | N |  | tree |  |  |  |  |
| 3 | {ŋeš}ha-lu-ub₂ | halub |  |  | N |  | {ŋeš}ha-lu-ub₂ | [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... | tree | Q000039.4.1 | ... | ~ |  | halub | N |  | tree |  |  |  |  |
| 4 | {ŋeš}šag₄-kal | šagkal |  |  | N |  | {ŋeš}šag₄-kal | [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... | tree | Q000039.5.1 | ... | ~ |  | šagkal | N |  | tree |  |  |  |  |
| 5 | ŋeš-kin₂ | ŋešgana |  |  | N |  | ŋeš-kin₂ | [{'v': 'ŋeš', 'id': 'Q000039.6.1.0', 'delim': ... | tree | Q000039.6.1 | ... | ~ |  | ŋešgana | N |  | tree |  |  |  |  |
| 6 | ŋeš-kin₂ | ŋešgana |  |  | N |  | ŋeš-kin₂ | [{'v': 'ŋeš', 'id': 'Q000039.7.1.0', 'delim': ... | tree | Q000039.7.1 | ... | ~ |  | ŋešgana | N |  | tree |  |  |  |  |
| 7 | babbar | babbar |  |  | V/i |  | babbar | [{'v': 'babbar', 'id': 'Q000039.7.2.0'}] | white | Q000039.7.2 | ... | ~ |  | babbar | V/i |  | (to be) white |  |  |  |  |
| 8 | ŋeš-kin₂ | ŋešgana |  |  | N |  | ŋeš-kin₂ | [{'v': 'ŋeš', 'id': 'Q000039.8.1.0', 'delim': ... | tree | Q000039.8.1 | ... | ~ |  | ŋešgana | N |  | tree |  |  |  |  |
| 9 | giggi | giggi |  |  | V/i |  | giggi | [{'v': 'giggi', 'id': 'Q000039.8.2.0'}] | black | Q000039.8.2 | ... | ~ |  | giggi | V/i |  | (to be) black |  |  |  |  |


## 1.8 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in computational methods such as tokenization, or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.
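The effect of the replacement can be seen on two invented rows. This sketch uses the vectorized Pandas `.str.replace()` method as an alternative to a list comprehension; the DataFrame `demo` is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'sense': ['box tree, boxwood', '(to be) white'],
                     'gw': ['boxwood', 'white']})
for column in ['sense', 'gw']:
    demo[column] = (demo[column]
                    .str.replace(',', '', regex=False)    # commas are removed
                    .str.replace(' ', '-', regex=False))  # spaces become hyphens
print(demo['sense'].tolist())
```

The order of the two replacements does not matter here, because neither replacement introduces a comma or a space.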

In [9]:
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc.)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

# 3. Manipulate for Analysis on Line Level (e.g. phylogenetics)
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as analyzed in the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project) one may need to create lemmas and combine these into lines by using the `id_line` variable.

## 3.1 Create Lemmas and Adjust Bases
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N` (Akkadian). Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

For Sumerian projects each lemmatized word has a `base` (the word without morphology). For non-lemmatized words a place-holder base is created that consists of the transliteration of the word. If you are not working with Sumerian data, the `base` column can be ignored.
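The rule for building a lemma can be written as a small stand-alone function. This is a sketch with a hypothetical name, `make_lemma()`; it works on single values rather than on DataFrame rows.

```python
def make_lemma(cf, gw, pos, form):
    """Return an ORACC-style lemma, or a place-holder for unlemmatized words."""
    if cf:                       # lemmatized: CitationForm[GuideWord]POS
        return f'{cf}[{gw}]{pos}'
    if form:                     # unlemmatized: transliteration with NA
        return f'{form}[NA]NA'
    return ''                    # no word at all (e.g. a line that records a break)


print(make_lemma('lugal', 'king', 'N', 'lugal'))   # lugal[king]N
print(make_lemma('', '', 'X', 'i-bu-x'))           # i-bu-x[NA]NA
```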

In [10]:
words["lemma"] = words.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = words.apply(lambda r: r["base"] if r["base"] != '' or r['label'] == '' else r['form'], axis=1)

## 3.2 Group by Line
In the `words` DataFrame each word has a separate row. In order to change this to a line-by-line representation we use the Pandas `.groupby` function, using the `id_line` and `label` fields as arguments (`id_line` has an abstract number that indicates the sequence of lines in a text object; `label` is a human-readable line number in the format `o ii 3`: obverse column 2, line 3). The fields that are aggregated are `lemma`, `base`, `extent`, and `scope`. The fields `extent` and `scope` represent data on the number of broken lines. If you work with Akkadian data you want to leave out the field `base`.

In [11]:
words['id_line'] = [wordid[:wordid.rfind('.')] for wordid in words['id_word']]

In [12]:
lines = words.groupby([words['id_line'], words['label']]).agg({
        'lemma': ' '.join,
        'base': ' '.join,
        'extent': ''.join, 
        'scope': ''.join
    }).reset_index()
lines        

|  | id_line | label | lemma | base | extent | scope |
|---|---|---|---|---|---|---|
| 0 | P117395.2 | o 1 | ŋešed[key]N | {ŋeš}e₃-a |  |  |
| 1 | P117395.3 | o 2 | pakud[~tree]N | {ŋeš}pa-kud |  |  |
| 2 | P117395.4 | o 3 | raba[clamp]N | {ŋeš}raba |  |  |
| 3 | P117404.2 | o 1 | ig[door]N eren[cedar]N | {ŋeš}ig {ŋeš}eren |  |  |
| 4 | P117404.3 | o 2 | ig[door]N dib[board]N | {ŋeš}ig dib |  |  |
| 5 | P117404.4 | o 3 | ig[door]N i[oil]N | {ŋeš}ig i₃ |  |  |
| 6 | P128345.2 | o 1 | garig[comb]N siki[hair]N | {ŋeš}ga-rig₂ siki |  |  |
| 7 | P128345.3 | o 2 | garig[comb]N siki-siki[NA]NA | {ŋeš}ga-rig₂ siki-siki |  |  |
| 8 | P128345.4 | o 3 | garig[comb]N saŋdu[head]N | {ŋeš}ga-rig₂ saŋ-du |  |  |
| 9 | P224980.4 | o i 1 | gigir[chariot]N | {ŋeš}gigir |  |  |


Note that `id_line` is a string variable and therefore does not give the lines in the right order. We should split `id_line` into two variables: `id_text` (the first 7 characters; we lost the old `id_text` column in the `.groupby` function above) and a new `line` variable, which is a number. 
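A small worked example shows why sorting the strings is not enough. The IDs below are invented, and the sketch assumes that the part after the last period is a plain number.

```python
line_ids = ['P117395.10', 'P117395.2', 'P117395.9']

# Sorted as strings, '10' comes before '2' because '1' < '2'.
as_strings = sorted(line_ids)


def sort_key(id_line):
    """Split an id_line into (text ID, line number as an integer)."""
    id_text, _, line = id_line.rpartition('.')
    return id_text, int(line)


# Sorted on (text ID, integer line number), the lines are in reading order.
as_numbers = sorted(line_ids, key=sort_key)

print(as_strings)
print(as_numbers)
```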

In [13]:
lines['id_text'] = lines['id_line'].str[:7] # id_text was lost in the grouping above and is recreated
lines['line'] = [line[line.rfind('.')+1:] for line in lines['id_line']] #create a line number for sorting
lines['line'] = [x.replace('l', '') for x in lines['line']]
lines['line'] = [int(x) if not x == '' else np.nan for x in lines['line']]
lines = lines.sort_values(['id_text', 'line']).reset_index(drop=True)
lines.head(100)

|  | id_line | label | lemma | base | extent | scope | id_text | line |
|---|---|---|---|---|---|---|---|---|
| 0 | P117395.2 | o 1 | ŋešed[key]N | {ŋeš}e₃-a |  |  | P117395 | 2 |
| 1 | P117395.3 | o 2 | pakud[~tree]N | {ŋeš}pa-kud |  |  | P117395 | 3 |
| 2 | P117395.4 | o 3 | raba[clamp]N | {ŋeš}raba |  |  | P117395 | 4 |
| 3 | P117404.2 | o 1 | ig[door]N eren[cedar]N | {ŋeš}ig {ŋeš}eren |  |  | P117404 | 2 |
| 4 | P117404.3 | o 2 | ig[door]N dib[board]N | {ŋeš}ig dib |  |  | P117404 | 3 |
| 5 | P117404.4 | o 3 | ig[door]N i[oil]N | {ŋeš}ig i₃ |  |  | P117404 | 4 |
| 6 | P128345.2 | o 1 | garig[comb]N siki[hair]N | {ŋeš}ga-rig₂ siki |  |  | P128345 | 2 |
| 7 | P128345.3 | o 2 | garig[comb]N siki-siki[NA]NA | {ŋeš}ga-rig₂ siki-siki |  |  | P128345 | 3 |
| 8 | P128345.4 | o 3 | garig[comb]N saŋdu[head]N | {ŋeš}ga-rig₂ saŋ-du |  |  | P128345 | 4 |
| 9 | P224980.4 | o i 1 | gigir[chariot]N | {ŋeš}gigir |  |  | P224980 | 4 |


Note that the new `line` field is not a line number in the traditional sense of the word (this is `label`) but a number used to organize lines in the appropriate order.

## 3.3 Save in CSV Format

In [None]:
filename = name[:-4]  # `name` was entered at the top of the notebook; drop the '.txt' extension
with open('output/' + filename + '.csv', 'w', encoding='utf8') as w:
    lines.to_csv(w)