# Importing ORACC Data from corpus.json
by Niek Veldhuis

February 2017

# TODO
* check that COFs are treated properly
* check that lines that continue into the next line (as in bilinguals) are captured completely.
* look for `'$ n lines broken'` lines. 

A '$' line has 'node' = 'd' with 'type' = 'nonx' and 'strict' = '1'. It has three content nodes: 'extent', 'scope' and 'state'.

# Introduction

The purpose of this code is to download [ORACC](http://oracc.org) JSON files that contain textual data and to produce a `.csv` file with the relevant data for use in computational text analysis. This replaces scraping the published `html`. The JSON files contain all the transliteration and lemmatization data of an ORACC project (metadata are made available in a separate `.json` file).

The resulting data file may include various elements of the ORACC data structure. The current code will output a file with the following fields: 
* textid
* line number
* lemmatization
* bases

The selection of fields may be adjusted with standard `Pandas` functions.
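
As a minimal sketch of such an adjustment, a subset of columns can be selected and renamed. The two-row DataFrame is made up for illustration; its column names follow the parser output further down, and the new names `textid` and `line_number` are only suggestions.

```python
import pandas as pd

# Hypothetical two-word sample; real data come from the parser below.
words = pd.DataFrame([
    {'id_text': 'Q000039', 'id_line': 'Q000039.1', 'label': '1',
     'cf': 'taškarin', 'gw': 'boxwood', 'pos': 'N', 'base': '{ŋeš}taškarin'},
    {'id_text': 'Q000039', 'id_line': 'Q000039.2', 'label': '2',
     'cf': 'esi', 'gw': 'tree', 'pos': 'N', 'base': '{ŋeš}esi'},
])

# Keep only the fields of interest and give them more descriptive names.
selection = words[['id_text', 'label', 'cf', 'base']].rename(
    columns={'id_text': 'textid', 'label': 'line_number'})
```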

## Notes
The current version of the script works with the `ijson` library. Documentation for [ijson](https://www.dataquest.io/blog/python-json-tutorial/), unfortunately, is extremely brief. This notebook is written for exploratory purposes, using a list of 106 P, Q, X numbers (in `ob_lists_wood.txt`). Downloading the `.json` files takes between 30 and 45 seconds. The rest of the script is reasonably fast. With larger lists of text IDs the script will obviously take longer. 

This notebook is written for **Python 3.5** with **Pandas 0.19** and **ijson 2.3**.

In [1]:
import pandas as pd
import ijson
import urllib.request
import re
import tqdm

In [2]:
textids = 'text_ids/ob_lists_wood.txt'
with open(textids, 'r') as f:
    pnos = f.readlines()
pnos = [x.strip() for x in pnos]
pnos = [x[-7:] for x in pnos]
pnos[:5]

['Q000039', 'P117395', 'P117404', 'P128345', 'P224980']
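
The slice `x[-7:]` silently accepts any line, including malformed ones. The following sketch, which is not part of the original notebook, adds a format check so that bad entries are reported before any download starts. The helper `clean_ids` and the `dcclt/` prefix in the sample input are assumptions made for illustration.

```python
import re

# P, Q, or X followed by exactly six digits.
ID_PATTERN = re.compile(r'^[PQX]\d{6}$')

def clean_ids(raw_lines):
    """Return (valid, rejected) lists of text IDs from raw input lines."""
    valid, rejected = [], []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        candidate = line[-7:]  # same slice as in the cell above
        (valid if ID_PATTERN.match(candidate) else rejected).append(candidate)
    return valid, rejected

# Hypothetical input lines; the 'dcclt/' prefix is an assumption about the file layout.
valid, rejected = clean_ids(['dcclt/Q000039\n', 'dcclt/P117395\n', '\n', 'not-an-id\n'])
```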

# Parse
The function `oraccjasonparser()` takes one argument (the **url** of the `.json` file). It looks for the prefix `textid` to retrieve the six-digit P, Q, or X number of the text artifact. Parsing the file sequentially, the code looks for the places where a line starts (`'.type' = 'line-start'`) and where a word starts (`'.node' = 'l'`). At each level the code retrieves the relevant data, building a list in which each entry is a dictionary that represents a single word.

Words include not only lemmatized words, but also unlemmatized and unlemmatizable words (such as breaks).

The dictionary includes the keys `id_line` and `id_word` that allow the user to reassemble words and lines in order.

In [3]:
def oraccjasonparser(url):
    d = urllib.request.urlopen(url)
    parser = ijson.parse(d)
    word_l = []
    word_d = {}
    line_start = False
    word_start = False
    for prefix, event, value in parser:
        if prefix == 'textid':
            id_text = value
#            print("parsing " + value)
        if prefix.endswith('.type'):
            if value == 'line-start':
                line_start = True
            else:
                line_start = False
        if line_start:
            if prefix.endswith('.ref') and not word_start:
                id_line = value
            if prefix.endswith('.label'):
                label = value
        if prefix.endswith('node'):
            if value == 'l':
                word_start = True
                if not word_d == {}:
                    word_l.append(word_d)
                word_d = {}
                word_d['id_text'] = id_text
                word_d['id_line'] = id_line
                word_d['label'] = label
            else:
                word_start = False
        if word_start:
            if prefix.endswith('.ref'):
                word_d['id_word'] = value
            if prefix.endswith('.sig'):
                word_d['signature'] = value
            if '.f.' in prefix:
                category = re.sub(r'.*\.', '', prefix) # get element after the last dot of the prefix
                word_d[category] = value
    if word_d:  # do not append an empty dictionary when no word was found
        word_l.append(word_d)
    return word_l

# Call the Parser Function for Each Textid

In [4]:
url_prefix = "http://oracc.museum.upenn.edu/dcclt/corpusjson/"
word_l = []
for id_text in tqdm.tqdm(pnos):
    url = url_prefix + id_text + '.json'
    word_l.extend(oraccjasonparser(url))

100%|██████████| 106/106 [00:40<00:00,  3.33it/s]
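
The loop above stops at the first text that cannot be downloaded. A possible variant, sketched here under the assumption that failed texts should be skipped and reported, collects the failures instead. The names `parse_all` and `fake_parse` are hypothetical; in the notebook the `parse` argument would be `oraccjasonparser`.

```python
import urllib.error

def parse_all(text_ids, parse, url_prefix):
    """Parse every text; return (rows, failed) instead of stopping on the first error."""
    rows, failed = [], []
    for id_text in text_ids:
        url = url_prefix + id_text + '.json'
        try:
            rows.extend(parse(url))
        except urllib.error.URLError as err:  # also covers HTTPError, e.g. 404
            failed.append((id_text, str(err)))
    return rows, failed

# Stand-in parser for demonstration only; it makes no network request.
def fake_parse(url):
    if 'X999999' in url:
        raise urllib.error.URLError('not found')
    return [{'id_text': url[-12:-5]}]

rows, failed = parse_all(['Q000039', 'X999999'], fake_parse, 'http://example.org/')
```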


# Transform the Data into a DataFrame

In [5]:
words = pd.DataFrame(word_l)
words.head()

Unnamed: 0,base,cf,cont,epos,form,gw,id_line,id_text,id_word,label,lang,morph,norm,norm0,pos,sense,signature
0,{ŋeš}taškarin,taškarin,,N,{ŋeš}taškarin,boxwood,Q000039.1,Q000039,Q000039.1.1,1,sux,~,,taškarin,N,"box tree, boxwood",@dcclt%sux:{ŋeš}taškarin=taškarin[boxwood//box...
1,{ŋeš}esi,esi,,N,{ŋeš}esi,tree,Q000039.2,Q000039,Q000039.2.1,2,sux,~,,esi,N,ebony,@dcclt%sux:{ŋeš}esi=esi[tree//ebony]N'N$esi/{ŋ...
2,ŋeš-nu₁₁,ŋešnu,,N,{ŋeš}nu₁₁,tree,Q000039.3,Q000039,Q000039.3.1,3,sux,~,,ŋešnu,N,a tree,@dcclt%sux:{ŋeš}nu₁₁=ŋešnu[tree//a tree]N'N$ŋe...
3,{ŋeš}ha-lu-ub₂,halub,,N,{ŋeš}ha-lu-ub₂,tree,Q000039.4,Q000039,Q000039.4.1,4,sux,~,,halub,N,tree,@dcclt%sux:{ŋeš}ha-lu-ub₂=halub[tree//tree]N'N...
4,{ŋeš}šag₄-kal,šagkal,,N,{ŋeš}šag₄-kal,tree,Q000039.5,Q000039,Q000039.5.1,5,sux,~,,šagkal,N,tree,@dcclt%sux:{ŋeš}šag₄-kal=šagkal[tree//tree]N'N...


# Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in tokenization or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [6]:
words = words.fillna('') # first replace Missing Values by empty string
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]
words.head()

Unnamed: 0,base,cf,cont,epos,form,gw,id_line,id_text,id_word,label,lang,morph,norm,norm0,pos,sense,signature
0,{ŋeš}taškarin,taškarin,,N,{ŋeš}taškarin,boxwood,Q000039.1,Q000039,Q000039.1.1,1,sux,~,,taškarin,N,box-tree-boxwood,@dcclt%sux:{ŋeš}taškarin=taškarin[boxwood//box...
1,{ŋeš}esi,esi,,N,{ŋeš}esi,tree,Q000039.2,Q000039,Q000039.2.1,2,sux,~,,esi,N,ebony,@dcclt%sux:{ŋeš}esi=esi[tree//ebony]N'N$esi/{ŋ...
2,ŋeš-nu₁₁,ŋešnu,,N,{ŋeš}nu₁₁,tree,Q000039.3,Q000039,Q000039.3.1,3,sux,~,,ŋešnu,N,a-tree,@dcclt%sux:{ŋeš}nu₁₁=ŋešnu[tree//a tree]N'N$ŋe...
3,{ŋeš}ha-lu-ub₂,halub,,N,{ŋeš}ha-lu-ub₂,tree,Q000039.4,Q000039,Q000039.4.1,4,sux,~,,halub,N,tree,@dcclt%sux:{ŋeš}ha-lu-ub₂=halub[tree//tree]N'N...
4,{ŋeš}šag₄-kal,šagkal,,N,{ŋeš}šag₄-kal,tree,Q000039.5,Q000039,Q000039.5.1,5,sux,~,,šagkal,N,tree,@dcclt%sux:{ŋeš}šag₄-kal=šagkal[tree//tree]N'N...
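
The same cleanup can also be written with the vectorized `.str` accessor of Pandas. This is an alternative sketch on a made-up sample, not the notebook's own code; the result should be identical because removing commas and hyphenating spaces do not interfere with each other.

```python
import pandas as pd

# Hypothetical sample, including a missing value as produced by pd.DataFrame(word_l).
words = pd.DataFrame({'gw': ['boxwood', 'tree', None],
                      'sense': ['box tree, boxwood', 'a tree', None]})

words = words.fillna('')
for col in ['gw', 'sense']:
    # Drop commas first, then turn the remaining spaces into hyphens.
    words[col] = words[col].str.replace(',', '').str.replace(' ', '-')
```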


The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* signature (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').
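
The `pos` conventions just described make it easy to see how much of a corpus is lemmatized. The sketch below uses a made-up three-word sample; the status labels are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical sample: one lemmatized, one unlemmatized, one broken word.
words = pd.DataFrame({'form': ['lugal', 'x-na', 'x'],
                      'cf':   ['lugal', '', ''],
                      'pos':  ['N', 'X', 'u']})

# Any pos other than 'X' or 'u' is treated as a lemmatized word.
status = words['pos'].map({'X': 'unlemmatized', 'u': 'unlemmatizable'}).fillna('lemmatized')
counts = status.value_counts()
```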

# Manipulate
The columns may be manipulated with standard Pandas methods to create the desired output. By way of example, the following code creates a column `lemma` with the format **cf[gw]pos** (for instance **lugal[king]N**). For words that have no lemmatization, `lemma` equals `form`. Only Sumerian words are kept (so `lang` can be omitted). In addition to `lemma`, the column `base` is preserved; words that have no base take `form` followed by the marker `[NA]NA` in its place. Finally, lemmas and bases are concatenated into lines.

## Remove Non-Sumerian Words

In [7]:
words = words.loc[words['lang'].str[:3] == 'sux'].reset_index()

## Create Lemma Column and Adjust Base

In [8]:
# lemma: cf[gw]pos for lemmatized words; the bare form otherwise
words['lemma'] = [cf + '[' + gw + ']' + pos if cf != '' else form
                  for cf, gw, pos, form
                  in zip(words['cf'], words['gw'], words['pos'], words['form'])]
# base: the base if there is one; otherwise the form followed by the marker [NA]NA
words['base'] = [base if base != '' else form + '[NA]NA'
                 for base, form in zip(words['base'], words['form'])]
lemmas = words[['lemma', 'base', 'id_text', 'id_line', 'id_word', 'label']]
lemmas.head()

Unnamed: 0,lemma,base,id_text,id_line,id_word,label
0,taškarin[boxwood]N,{ŋeš}taškarin,Q000039,Q000039.1,Q000039.1.1,1
1,esi[tree]N,{ŋeš}esi,Q000039,Q000039.2,Q000039.2.1,2
2,ŋešnu[tree]N,ŋeš-nu₁₁,Q000039,Q000039.3,Q000039.3.1,3
3,halub[tree]N,{ŋeš}ha-lu-ub₂,Q000039,Q000039.4,Q000039.4.1,4
4,šagkal[tree]N,{ŋeš}šag₄-kal,Q000039,Q000039.5,Q000039.5.1,5
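
For larger corpora the same two columns can be built without a Python-level loop. The following is a sketch with `Series.where`, run on a made-up two-word sample; it is meant to reproduce the behaviour of the cell above, not to replace it.

```python
import pandas as pd

# Hypothetical sample: one lemmatized and one unlemmatized word.
words = pd.DataFrame({'cf': ['lugal', ''], 'gw': ['king', ''],
                      'pos': ['N', 'X'], 'form': ['lugal', 'x-na'],
                      'base': ['lugal', '']})

lemmatized = words['cf'] != ''
# Build cf[gw]pos for every row, then fall back to the form where cf is empty.
words['lemma'] = (words['cf'] + '[' + words['gw'] + ']' + words['pos']).where(
    lemmatized, words['form'])
# Keep the base where there is one; otherwise use the form plus the [NA]NA marker.
words['base'] = words['base'].where(words['base'] != '', words['form'] + '[NA]NA')
```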


## Group by Line

In [9]:
lines = words.groupby([lemmas['id_line'], lemmas['label']]).agg({'lemma': ' '.join, 'base': ' '.join}).reset_index()

In [10]:
lines[lines['id_line'].str[:7] == 'P370399']

Unnamed: 0,id_line,label,base,lemma
2695,P370399.10,o i 7',x-na[NA]NA ŋešnimbar,x-na ŋešnimbar[palm]N
2696,P370399.100,o iii 39',x[NA]NA apin,x apin[plow]N
2697,P370399.101,o iii 40',x[NA]NA apin,x apin[plow]N
2698,P370399.104,o iv 1',{ŋeš}RI[NA]NA x[NA]NA,{ŋeš}RI x
2699,P370399.105,o iv 2',ŋeš-dal-dal x[NA]NA,ŋešdal[crossbar]N x
2700,P370399.106,o iv 3',{ŋeš}mar-gid₂-da x[NA]NA,margida[cart]N x
2701,P370399.107,o iv 4',{ŋeš}gag-sila₃,gagsila[harness]N
2702,P370399.108,o iv 5',{ŋeš}za-ra gag-sal₄,zara[pivot]N gagsila[harness]N
2703,P370399.109,o iv 6',{ŋeš}gag za-ra gag-sal₄,gag[nail]N zara[pivot]N gagsila[harness]N
2704,P370399.11,o i 8',x-x-da-ŋešnimbar[NA]NA,x-x-da-ŋešnimbar
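
In the output above, `P370399.10` and `P370399.100` come before `P370399.104`, and `P370399.11` comes last, because `groupby` sorts the line IDs as strings. If the original line order matters, one option (sketched here on made-up data) is to pass `sort=False`, which keeps the groups in order of first appearance.

```python
import pandas as pd

# Hypothetical sample in document order: line 2 precedes line 10.
lemmas = pd.DataFrame({
    'id_line': ['P000001.2', 'P000001.2', 'P000001.10'],
    'label':   ['o 2', 'o 2', 'o 10'],
    'lemma':   ['gag[nail]N', 'zara[pivot]N', 'apin[plow]N'],
    'base':    ['{ŋeš}gag', 'za-ra', 'apin'],
})

# sort=False keeps groups in order of first appearance rather than sorting the keys.
lines = (lemmas.groupby(['id_line', 'label'], sort=False)
               .agg({'lemma': ' '.join, 'base': ' '.join})
               .reset_index())
```

This relies on the words arriving in document order, which is the case when they are appended sequentially by the parser.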


## Save in CSV Format

In [11]:
with open('output/obwood.csv', 'w', encoding='utf8') as w:  # set the encoding on the file handle
    lines.to_csv(w)
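
A quick way to confirm that the special characters survive the round trip is to read the file back. The sketch below writes a one-line sample to a temporary directory (an assumption made so that it runs without an `output` folder) and reloads it.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-line sample in the shape of the `lines` DataFrame.
lines = pd.DataFrame({'id_line': ['Q000039.1'], 'label': ['1'],
                      'base': ['{ŋeš}taškarin'], 'lemma': ['taškarin[boxwood]N']})

# Write to a temporary directory so the sketch does not depend on an 'output' folder.
path = os.path.join(tempfile.mkdtemp(), 'obwood.csv')
lines.to_csv(path, encoding='utf8', index=False)

# dtype=str keeps labels such as '1' as text instead of converting them to integers.
restored = pd.read_csv(path, encoding='utf8', dtype=str)
```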