# Importing ORACC Data from corpus.json
by Niek Veldhuis
UC Berkeley

February-May 2017

# TODO
* check that COFs are treated properly
* check that lines that continue into the next line (as in bilinguals) are captured completely. Such lines are indicated in the JSON by the addition of 'l' (lower case L) to the reference (.ref).


# Introduction

The purpose of this code is to download [ORACC](http://oracc.org) JSON files that contain textual data and to produce a `.csv` file with the relevant data for use in computational text analysis. This replaces scraping the published `html` (see the [Scrape-ORACC](https://github.com/niekveldhuis/Digital-Assyriology/tree/master/Scrape-Oracc) repo). The JSON files contain all the transliteration and lemmatization data of an ORACC project as well as metadata. For an introduction to the various ORACC JSON files see the [ORACC Open Data](http://oracc.org/doc/opendata) page.

The resulting data file may include various elements of the ORACC data structure. The current code will output a file with the following fields: 

* id_line
* label
* lemma
* base
* extent
* scope

The fields `extent` and `scope` capture the number of missing lines or columns.

The selection of fields may be adjusted with standard `Pandas` functions.

## Notes
The current version of the script works with the `ijson` library. Documentation for [ijson](https://www.dataquest.io/blog/python-json-tutorial/) is, unfortunately, extremely brief. It is likely that in the near future, to save space, [ORACC](http://oracc.org) will no longer make individual `.json` files available, but only `json.zip`, a compressed file that includes all the `.json` files belonging to a single project. For that reason the current code first downloads `json.zip` and then extracts the relevant files (at the time of writing it is still possible to download individual files directly from the [ORACC](http://oracc.org) server).

This notebook is written for **Python 3.5** with **Pandas 0.19** and **ijson 2.3**.

The notebook was written for the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project with Erin Becker of [Data Carpentry](http://www.datacarpentry.org). The particular data selection and data manipulation performed in this notebook are inspired by the needs of that project (for instance, non-Sumerian words are filtered out). It should be fairly easy to adapt the notebook to the purposes of any other project that wishes to use [ORACC](http://oracc.org) data.

## Licensing
This notebook may be downloaded, used and adapted without any restrictions.

In [1]:
import pandas as pd   
import ijson
import urllib.request
import zipfile
import re
import tqdm

# Input List of Text IDs
List the text IDs (P, Q, and X numbers) in a file in the directory `text_ids`. Each ID consists of P, Q, or X followed by six digits, preceded by a project abbreviation, in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421

The list should be created with a plain-text editor such as TextEdit or Emacs, with one ID per line, and the filename should end in `.txt`.
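A malformed line in the list only shows up later, when a download or extraction fails. The following minimal sketch checks each line against the format described above before anything is downloaded. The function `split_text_id` and the regular expression are illustrative and not part of the notebook; the pattern assumes that project and subproject names consist of lowercase letters and digits only.

```python
import re

# Assumed pattern: one or more project/subproject names (lowercase letters
# and digits) separated by '/', then P, Q, or X followed by exactly six digits.
ID_PATTERN = re.compile(r'^([a-z0-9]+(?:/[a-z0-9]+)*)/([PQX]\d{6})$')

def split_text_id(line):
    """Return (project, text_id) for a well-formed line, or None otherwise."""
    match = ID_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

# The last example is deliberately too short and is rejected.
for line in ['dcclt/P117395', 'rinap/rinap1/Q003421', 'dcclt/P1173']:
    print(line, '->', split_text_id(line))
```

Lines for which the function returns `None` can be reported and corrected before the rest of the notebook is run.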

In [2]:
filename = input('Filename: ')

Filename: saa_all.txt


In [3]:
textids = 'text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.readlines()
pqxnos = [x.strip() for x in pqxnos]
projects = [x[:-8] for x in pqxnos]
projects = list(set(projects))
pqxnos[:5], projects

(['saao/saa01/P224485',
  'saao/saa01/P313915',
  'saao/saa01/P313876',
  'saao/saa01/P314243',
  'saao/saa01/P334194'],
 ['saao/saa18',
  'saao/saa05',
  'saao/saa17',
  'saao/saa06',
  'saao/saa13',
  'saao/saa16',
  'saao/saa01',
  'saao/saa14',
  'saao/saa19',
  'saao/saa15',
  'saao/saa10'])

# Create Download Directory and JSON directory
For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist)

In [4]:
import errno
import os
for directory in ('jsonzip', 'json'):
    try:
        os.mkdir(directory)
    except OSError as exc:
        if exc.errno != errno.EEXIST:  # ignore "already exists", re-raise anything else
            raise

# Download `json.zip`
For each project from which files are to be processed, download the entire project (all the JSON files) from `http://oracc.museum.upenn.edu/PROJECT/json.zip`. For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `json.zip` may be 25 MB or more. Downloading may take some time and it may be necessary to chunk the downloading process. For the chunking code see [this page](https://www.smallsurething.com/how-to-read-a-file-properly-in-python/).

Although downloading the entire zip file is time-consuming, it makes processing the individual files much more efficient, and the code is less likely to break because of an interrupted connection.
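If a download is interrupted, the partial file stays on disk. As a rough check, the sketch below uses the standard `zipfile` module to list archives that are missing, are not valid zip files, or contain a member that fails its CRC check. The helper `broken_archives` is hypothetical and not part of the notebook.

```python
import zipfile

def broken_archives(paths):
    """Return the paths that are missing, not zip files, or fail the CRC check."""
    bad = []
    for path in paths:
        # is_zipfile() returns False for missing files and for files that
        # lack a zip end-of-archive record (typical of truncated downloads).
        if not zipfile.is_zipfile(path):
            bad.append(path)
            continue
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            if archive.testzip() is not None:
                bad.append(path)
    return bad

# Example call; the path follows the naming scheme used in this notebook.
print(broken_archives(['jsonzip/saao_saa01_json.zip']))
```

Any archive reported here can simply be downloaded again before the extraction step.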

In [5]:
for project in tqdm.tqdm(projects):
    url = "http://oracc.museum.upenn.edu/" + project + "/json.zip"
    file = 'jsonzip/' + project.replace('/', '_') + '_json.zip'
    print("Downloading " + url + "; saving as " + file)
    response = urllib.request.urlopen(url)
    CHUNK = 16 * 1024
    with open(file, 'wb') as f:
        for chunk in iter(lambda: response.read(CHUNK), b''):
            f.write(chunk)

  0%|          | 0/11 [00:00<?, ?it/s]

Downloading http://oracc.museum.upenn.edu/saao/saa18/json.zip; saving as jsonzip/saao_saa18_json.zip


  9%|▉         | 1/11 [00:06<01:07,  6.75s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa05/json.zip; saving as jsonzip/saao_saa05_json.zip


 18%|█▊        | 2/11 [00:17<01:12,  8.01s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa17/json.zip; saving as jsonzip/saao_saa17_json.zip


 27%|██▋       | 3/11 [00:26<01:05,  8.18s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa06/json.zip; saving as jsonzip/saao_saa06_json.zip


 36%|███▋      | 4/11 [00:39<01:07,  9.58s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa13/json.zip; saving as jsonzip/saao_saa13_json.zip


 45%|████▌     | 5/11 [00:50<01:01, 10.24s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa16/json.zip; saving as jsonzip/saao_saa16_json.zip


 55%|█████▍    | 6/11 [01:04<00:56, 11.28s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa01/json.zip; saving as jsonzip/saao_saa01_json.zip


 64%|██████▎   | 7/11 [01:18<00:48, 12.21s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa14/json.zip; saving as jsonzip/saao_saa14_json.zip


 73%|███████▎  | 8/11 [01:36<00:41, 13.85s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa19/json.zip; saving as jsonzip/saao_saa19_json.zip


 82%|████████▏ | 9/11 [01:47<00:26, 13.06s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa15/json.zip; saving as jsonzip/saao_saa15_json.zip


 91%|█████████ | 10/11 [02:05<00:14, 14.44s/it]

Downloading http://oracc.museum.upenn.edu/saao/saa10/json.zip; saving as jsonzip/saao_saa10_json.zip


100%|██████████| 11/11 [02:25<00:00, 16.07s/it]


# Extract
Extract the texts listed in the list of text IDs from the `json.zip` files. All files are extracted to the directory `json/corpusjson`. If the list of text IDs includes the same P number more than once (e.g. if editions of the same text exist in multiple projects), the file will be overwritten and only one instance of that P number will be available.
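Because a repeated P number is silently overwritten, it can help to list such cases before extracting. The sketch below is an illustration only (the helper `find_duplicates` is not part of the notebook). It splits each entry the same way the notebook does (`no[:-8]` and `no[-7:]`) and reports every text ID that is requested from more than one project.

```python
from collections import defaultdict

def find_duplicates(pqxnos):
    """Map each text ID requested from more than one project to those projects."""
    seen = defaultdict(list)
    for no in pqxnos:
        project, text_id = no[:-8], no[-7:]
        if project not in seen[text_id]:
            seen[text_id].append(project)
    return {text_id: projects
            for text_id, projects in seen.items() if len(projects) > 1}

# Illustrative input: the same P number listed under two projects.
sample = ['saao/saa01/P224485', 'saao/saa19/P224485', 'saao/saa01/P313915']
print(find_duplicates(sample))
# → {'P224485': ['saao/saa01', 'saao/saa19']}
```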

In [6]:
target_dir = 'json'
for no in tqdm.tqdm(pqxnos):
    project = no[:-8]
    pno = no[-7:]
    zip_file = "jsonzip/" + project.replace('/', '_') + "_json.zip"
    with zipfile.ZipFile(zip_file,"r") as zip_ref:
        try:
            file = 'corpusjson/' + pno + '.json'
            zip_ref.extract(file, target_dir)
        except KeyError:  # raised by extract() when the file is not in the archive
            print(no + ' is not available')

 98%|█████████▊| 3191/3253 [00:21<00:00, 142.19it/s]

saao/saa19/P224485 is not available


100%|██████████| 3253/3253 [00:21<00:00, 146.83it/s]


# Parse
The function `oraccjsonparser()` takes one argument (the **ID** or **P/Q/X-number** of the `.json` file). It looks for the prefix `textid` to retrieve the six-digit P, Q, or X number of the text artifact. Parsing the file sequentially, the code looks for the places where a line starts (`'.type' = 'line-start'`) and where a word starts (`'.node' = 'l'`, where `l` is for "lemma"). At each level the code retrieves the relevant data and creates a list in which each entry is a dictionary that represents a single word.

Words not only include lemmatized words, but also unlemmatized and unlemmatizable words (such as breaks).

The dictionary includes the keys `id_line` and `id_word` that allow the user to reassemble words and lines in order.
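As an illustration of how these keys can be used, the sketch below puts a shuffled list of word dictionaries back into reading order and groups the forms by line. It assumes that `id_word` has the shape `TEXT.LINE.WORD` with numeric line and word parts, as in `P224485.2.1`. Entries without `id_word` (such as $-lines) are skipped. The helper names and the sample $-line entry are hypothetical.

```python
from itertools import groupby

def word_key(word):
    """Sort key: text ID, then the numeric line and word parts of id_word."""
    text, line, position = word['id_word'].split('.')
    return text, int(line), int(position)

def lines_in_order(words):
    """Return a list of (id_line, [forms]) pairs in reading order."""
    ordered = sorted((w for w in words if 'id_word' in w), key=word_key)
    return [(id_line, [w.get('form', '') for w in group])
            for id_line, group in groupby(ordered, key=lambda w: w['id_line'])]

sample = [
    {'id_line': 'P224485.3', 'id_word': 'P224485.3.1', 'form': 'šul-mu'},
    {'id_line': 'P224485.2', 'id_word': 'P224485.2.2', 'form': 'LUGAL'},
    {'id_line': 'P224485.2', 'id_word': 'P224485.2.1', 'form': 'a-bat'},
    {'id_line': 'P224485.1', 'extent': '3'},  # hypothetical $-line entry, skipped
]
print(lines_in_order(sample))
# → [('P224485.2', ['a-bat', 'LUGAL']), ('P224485.3', ['šul-mu'])]
```

Sorting on the numeric parts rather than on the raw strings matters because, as strings, `P224485.10` sorts before `P224485.2`.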

In [7]:
def oraccjsonparser(file):
    filename = 'json/corpusjson/' + file +'.json'
    with open(filename, 'r') as d:
        parser = ijson.parse(d)
        word_l = []
        word_d = {}
        line_start = False
        word_start = False
        nonx = False
        for prefix, event, value in parser:
            if prefix == 'textid':
                id_text = value
#            print("parsing " + value)
            if prefix.endswith('.type'):
                if value == 'line-start':
                    line_start = True
                else:
                    line_start = False
            if line_start:
                if prefix.endswith('.ref') and not word_start:
                    id_line = value # id_line is a reference number for a line
                                    # that includes the id_text (e.g. P123456.49)
                if prefix.endswith('.label'):
                    label = value   # label is a human-readable line number of the format
                                    # o ii 24' (obverse column 2 line 24')
            if prefix.endswith('node'):
                if value == 'l':
                    word_start = True
                    if not word_d == {}:
                        word_l.append(word_d) # append the previous word to the list
                    word_d = {}               # and start a new dictionary
                    word_d['id_text'] = id_text # provide each word with appropriate 
                    word_d['id_line'] = id_line # text and line-ID
                    word_d['label'] = label     # and the line label.
                else:
                    word_start = False
            if word_start:
                if prefix.endswith('.ref'):
                    word_d['id_word'] = value
                if prefix.endswith('.sig'):
                    word_d['signature'] = value
                if '.f.' in prefix:
                    category = re.sub(r'.*\.', '', prefix) # get element after the last dot of the prefix
                    word_d[category] = value # copy each element into the dictionary
            if prefix.endswith('.type'):
                if value == 'nonx':
                    nonx = True
                else:
                    nonx = False
            if nonx:                         # this captures so-called $-lines with information
                if prefix.endswith('.ref'):  # about number of broken lines/columns.
                    id_line = value          # $-lines have their own id_line.
                if prefix.endswith('.strict'):
                    if value == '1':           # select only 'strict' $ lines
                        if not word_d == {}:
                            word_l.append(word_d)
                        word_d = {}
                        word_d['id_line'] = id_line
                        word_d['id_text'] = id_text
                    else:
                        nonx = False
                if prefix.endswith('.extent'): # capture the three elements of strict $ lines
                    word_d['extent'] = value   # namely extent, scope, and state.
                if prefix.endswith('.scope'):
                    word_d['scope'] = value
                if prefix.endswith('.state'):
                    word_d['state'] = value

    word_l.append(word_d)  # make sure that the last word is captured, too.
    return(word_l) # return a list of dictionaries, where each entry (dictionary) in
                   # the list represents a word.

# Call the Parser Function for Each Textid

In [8]:
word_l = []
for no in tqdm.tqdm(pqxnos):
    id_text = no[-7:]
    try:
        word_l.extend(oraccjsonparser(id_text))
    except Exception:  # missing file or incomplete JSON; do not swallow KeyboardInterrupt
        print(no + ' not available or not complete')

 19%|█▉        | 623/3253 [00:18<01:04, 40.54it/s]

saao/saa06/P335202 not available or not complete


 21%|██        | 672/3253 [00:19<01:11, 36.24it/s]

saao/saa06/P335176 not available or not complete


 24%|██▍       | 776/3253 [00:22<00:56, 43.67it/s]

saao/saa06/P335322 not available or not complete


 27%|██▋       | 890/3253 [00:25<01:37, 24.22it/s]

saao/saa06/P335372 not available or not complete


 47%|████▋     | 1527/3253 [00:44<00:38, 45.17it/s]

saao/saa14/P335197 not available or not complete
saao/saa14/P335196 not available or not complete
saao/saa14/P335180 not available or not complete


 47%|████▋     | 1534/3253 [00:44<00:35, 48.92it/s]

saao/saa14/P335154 not available or not complete


 47%|████▋     | 1540/3253 [00:45<00:42, 40.51it/s]

saao/saa14/P335587 not available or not complete
saao/saa14/P335263 not available or not complete
saao/saa14/P335539 not available or not complete


 48%|████▊     | 1550/3253 [00:45<00:55, 30.93it/s]

saao/saa14/P335537 not available or not complete
saao/saa14/P335305 not available or not complete
saao/saa14/P335257 not available or not complete


 49%|████▊     | 1580/3253 [00:46<00:39, 42.65it/s]

saao/saa14/P335079 not available or not complete


 49%|████▉     | 1591/3253 [00:46<00:38, 43.04it/s]

saao/saa14/P335038 not available or not complete
saao/saa14/P334977 not available or not complete
saao/saa14/P335415 not available or not complete


 49%|████▉     | 1601/3253 [00:46<00:33, 49.93it/s]

saao/saa14/P335080 not available or not complete
saao/saa14/P335081 not available or not complete


 50%|████▉     | 1624/3253 [00:46<00:34, 46.81it/s]

saao/saa14/P335489 not available or not complete
saao/saa14/P334991 not available or not complete
saao/saa14/P336194 not available or not complete


 52%|█████▏    | 1676/3253 [00:48<00:42, 36.98it/s]

saao/saa14/P335943 not available or not complete


 53%|█████▎    | 1725/3253 [00:49<00:39, 38.95it/s]

saao/saa14/P335530 not available or not complete
saao/saa14/P335459 not available or not complete


 54%|█████▎    | 1745/3253 [00:49<00:31, 47.18it/s]

saao/saa14/P335107 not available or not complete


 56%|█████▌    | 1826/3253 [00:51<00:23, 60.41it/s]

saao/saa14/P335525 not available or not complete
saao/saa14/P335574 not available or not complete


 58%|█████▊    | 1897/3253 [00:52<00:16, 83.54it/s]

saao/saa14/P336029 not available or not complete
saao/saa14/P336196 not available or not complete


 60%|█████▉    | 1939/3253 [00:53<00:25, 51.47it/s]

saao/saa14/P224949 not available or not complete


100%|██████████| 3253/3253 [01:34<00:00, 34.54it/s]


# Transform the Data into a DataFrame

In [9]:
words = pd.DataFrame(word_l)
words

Unnamed: 0,cf,epos,extent,form,gw,id_line,id_text,id_word,label,lang,norm,pos,scope,sense,signature,state
0,awātu,N,,a-bat,word,P224485.2,P224485,P224485.2.1,o 1,akk-x-neoass,abat,N,,word,@saao/saa01%akk-x-neoass:a-bat=awātu[word//wor...,
1,šarru,N,,LUGAL,king,P224485.2,P224485,P224485.2.2,o 1,akk-x-neoass,šarri,N,,king,@saao/saa01%akk-x-neoass:LUGAL=šarru[king//kin...,
2,ana,PRP,,a-na,to,P224485.2,P224485,P224485.2.3,o 1,akk-x-neoass,ana,PRP,,to,@saao/saa01%akk-x-neoass:a-na=ana[to//to]PRP'P...,
3,Aššur-šarru-uṣur,PN,,{1}aš-šur-MAN-PAB,1,P224485.2,P224485,P224485.2.4,o 1,akk-x-neoass,Aššur-šarru-uṣur,PN,,1,@saao/saa01%akk-x-neoass:{1}aš-šur-MAN-PAB=Ašš...,
4,šulmu,N,,šul-mu,completeness,P224485.2,P224485,P224485.2.5,o 1,akk-x-neoass,šulmu,N,,health,@saao/saa01%akk-x-neoass:šul-mu=šulmu[complete...,
5,yâšim,IP,,ia-a-ši,to me,P224485.2,P224485,P224485.2.6,o 1,akk-x-neoass,ayāši,IP,,me,@saao/saa01%akk-x-neoass:ia-a-ši=yâšim[to me//...,
6,šulmu,N,,šul-mu,completeness,P224485.3,P224485,P224485.3.1,o 2,akk-x-neoass,šulmu,N,,health,@saao/saa01%akk-x-neoass:šul-mu=šulmu[complete...,
7,ana,PRP,,a-na,to,P224485.3,P224485,P224485.3.2,o 2,akk-x-neoass,ana,PRP,,to,@saao/saa01%akk-x-neoass:a-na=ana[to//to]PRP'P...,
8,Mat-Aššur,GN,,KUR-aš-šur{KI},Assyria,P224485.3,P224485,P224485.3.3,o 2,akk-x-neoass,Mat-Aššur,GN,,Assyria,@saao/saa01%akk-x-neoass:KUR-aš-šur{KI}=Mat-Aš...,
9,libbu,N,,ŠA₃-ka,interior,P224485.3,P224485,P224485.3.4,o 2,akk-x-neoass,libbaka,N,,mood,@saao/saa01%akk-x-neoass:ŠA₃-ka=libbu[interior...,


# Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble during tokenization or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [10]:
words = words.fillna('') # first replace Missing Values by empty string
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* signature (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

In [11]:
#words[words['label'] == '']

# Manipulate
The columns may be manipulated with standard Pandas methods to create the desired output. By way of example, the following code creates a column `lemma` with the format **cf[gw]pos** (for instance **lugal[king]N**). For words that have no lemmatization, `lemma` consists of the `form` followed by **[NA]NA**. Only Sumerian words are kept (and thus `lang` can be omitted). In addition to the column `lemma`, the column `base` is preserved; words that have no lemmatization take `form` as their base. Lemmas and bases are then concatenated into lines.
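The lemma rule can be written out for a single word as follows. This is only a sketch to make the format concrete; the function `make_lemma` is hypothetical and operates on plain strings rather than on the DataFrame.

```python
def make_lemma(cf, gw, pos, form):
    """Build a lemma string from the fields of one word.

    Lemmatized words (non-empty citation form) become cf[gw]pos.
    Unlemmatized words fall back on form[NA]NA.
    Entries with neither (e.g. damage notes) get an empty lemma.
    """
    if cf:
        return cf + '[' + gw + ']' + pos
    if form:
        return form + '[NA]NA'
    return ''

print(make_lemma('lugal', 'king', 'N', 'lugal'))  # → lugal[king]N
print(make_lemma('', '', 'X', 'x-ra'))            # → x-ra[NA]NA
```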

## Remove Non-Sumerian Words

In [12]:
lang = ['sux', ''] # note that 'lang' is empty in entries that indicate damage
words = words.loc[words['lang'].str[:3].isin(lang)].reset_index()

## Create Lemma Column and Adjust Base

In [13]:
words['lemma'] = words['cf'] # first element of lemma is the citation form
# Build cf[gw]pos for lemmatized words; unlemmatized words become form[NA]NA.
words['lemma'] = [words['lemma'][i] + '[' + words['gw'][i] + ']' + words['pos'][i] 
                  if not words['lemma'][i] == '' 
                  else words['form'][i] +'[NA]NA' for i in range(len(words))]
# Entries without a form (e.g. damage notes) get an empty lemma.
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = [words['base'][i] if not words['base'][i] == '' 
                 or words['label'][i] == '' else words['form'][i] 
                 for i in range(len(words))]
#lemmas = words[['lemma', 'id_text', 'id_line', 'id_word', 'label']]#, 'extent', 'scope']]
lemmas = words[['lemma', 'base', 'id_text', 'id_line', 'id_word', 'label', 'extent', 'scope']]
lemmas.head()

## Group by Line

In [None]:
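# sort=False keeps the lines in order of appearance; the default string sort
# would place e.g. P224485.10 before P224485.2.
lines = words.groupby([words['id_line'], words['label']], sort=False).agg({
        'lemma': ' '.join,
        'base': ' '.join,
        'extent': ''.join, 
        'scope': ''.join
    }).reset_index()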
        

In [None]:
lines = lines[['id_line', 'label', 'lemma', 'base', 'extent', 'scope']]
lines

## Save in CSV Format

In [None]:
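filename = filename[:-4]  # strip the '.txt' extension
# Set the encoding on the file handle; to_csv does not apply its own
# encoding argument when it is given an already opened text file.
with open('output/' + filename + '.csv', 'w', encoding='utf8') as w:
    lines.to_csv(w)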