# Importing ORACC Data from corpus.json
by Niek Veldhuis
UC Berkeley

February-May 2017

# TODO
* check that COFs are treated properly
* check that lines that continue into the next line (as in bilinguals) are captured completely. Such lines are indicated in the JSON by the addition of 'l' (lower case L) to the reference (.ref).


# Introduction

The purpose of this code is to download [ORACC](http://oracc.org) JSON files that contain textual data and to produce a `.csv` file with the relevant data for use in computational text analysis. This replaces scraping the published `html` (see the [Scrape-ORACC](https://github.com/niekveldhuis/Digital-Assyriology/tree/master/Scrape-Oracc) repo). The JSON files contain all the transliteration and lemmatization data of an ORACC project as well as metadata. For an introduction to the various ORACC JSON files see the [ORACC Open Data](http://oracc.org/doc/opendata) page.

The resulting data file may include various elements of the ORACC data structure. The current code will output a file with the following fields: 

* id_line
* label
* lemma
* base
* extent
* scope

The fields `extent` and `scope` capture the number of missing lines or columns.

The selection of fields may be adjusted with standard `Pandas` functions.

## Notes
The current version of the script works with the `ijson` library. Documentation for [ijson](https://www.dataquest.io/blog/python-json-tutorial/), unfortunately, is extremely brief. It is likely that in the near future, because of considerations of space, [ORACC](http://oracc.org) will no longer make available individual `.json` files, but only the file `json.zip`, a compressed file that includes all the `.json` files that belong to a single project. For that reason the current code will first download the `json.zip` and then extract the relevant files (at the moment of writing this note it is still possible to download individual files directly from the [ORACC](http://oracc.org) server).

This notebook is written for **Python 3.5** with **Pandas 0.19** and **ijson 2.3**.

The first section of this notebook will download and parse data from any [ORACC](http://oracc.org) project (or combination of projects). For most purposes, the number of data elements extracted will be too large and it will be necessary to select and manipulate the data set. The second section of the notebook selects only proper nouns (personal names, royal names, geographical names, etc.) and prepares the data for use in Social Network Analysis software. This is only one example of how the data may be used; other uses could include topic modeling, word2vec, etc. Each of those analyses requires a specific data format and therefore specific data manipulation. Further examples of how such manipulation might work are foreseen for a later version of this notebook.

The initial version of this notebook was written for the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project with Erin Becker of [Data Carpentry](http://www.datacarpentry.org). 

## Licensing
This notebook may be downloaded, used and adapted without any restrictions.

In [1]:
import pandas as pd   
import ijson
import urllib.request
import zipfile
import re
import tqdm
import numpy as np

# 1. Download and Parse `json.zip`

## 1.1 Input List of Text IDs
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421

The list should be created with a plain text editor such as TextEdit or Emacs, and the filename should end in `.txt`.

The P, Q, and X numbers available in a project are listed in the project's `json.zip` (see below) in the directory `corpusjson`.
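
The ID format described above can be checked before any downloading takes place. The following is a minimal sketch and not part of the original workflow; the function name `split_text_id` and the regular expression are assumptions derived from the format 'PROJECT/P######' given above.

```python
import re

# Assumed pattern: one or more path segments (project and optional
# subprojects), a slash, then P, Q, or X followed by six digits.
TEXT_ID = re.compile(r'^(?P<project>.+)/(?P<pqx>[PQX]\d{6})$')

def split_text_id(text_id):
    """Return (project, pqx_number), or None if the ID is malformed."""
    match = TEXT_ID.match(text_id.strip())
    if match is None:
        return None
    return match.group('project'), match.group('pqx')

# The last ID lacks the letter P and is reported as None.
ids = ['dcclt/P117395', 'rinap/rinap1/Q003421', 'dcclt/117395']
for text_id in ids:
    print(text_id, '->', split_text_id(text_id))
```

A check of this kind catches typing errors in the list early, before they surface as "not available" messages during extraction.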

In [2]:
filename = input('Filename or project abbreviation: ')

Filename or project abbreviation: saa_test.txt


In [6]:
textids = 'text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.readlines()
pqxnos = [x.strip() for x in pqxnos]        # strip spaces left and right
pqxnos = [x for x in pqxnos if not x == ""] # strip empty lines
projects = [x[:-8] for x in pqxnos]
projects = list(set(projects))
pqxnos[:5], projects

(['saao/saa01/P224485',
  'saao/saa01/P313915',
  'saao/saa01/P313876',
  'saao/saa01/P314243',
  'saao/saa01/P334194'],
 ['saao/saa01'])

## 1.2 Create Download Directory and JSON directory
For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist)

In [7]:
import errno
import os
# 'output' is included because the CSV files in sections 2-4 are saved there
for directory in ['jsonzip', 'json', 'output']:
    try:
        os.mkdir(directory)
    except OSError as exc:
        if exc.errno != errno.EEXIST:  # ignore "already exists", re-raise anything else
            raise

## 1.3 Download `json.zip`
For each project from which files are to be processed download the entire project (all the json files) in `https://github.com/oracc/json`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to chunk the downloading process. For the chunking code see [this page](https://www.smallsurething.com/how-to-read-a-file-properly-in-python/).

Although downloading the entire zip file is time consuming, it will make processing the individual files much more efficient and the code is less likely to break due to interruption in connectivity.
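
The chunking idea can be tried out without a network connection. The sketch below is only an illustration: the helper name `copy_in_chunks` is hypothetical, and an in-memory `io.BytesIO` object stands in for the server response.

```python
import io

CHUNK = 16 * 1024  # number of bytes read per iteration

def copy_in_chunks(source, target, chunk_size=CHUNK):
    """Copy a binary stream to a binary file object, chunk by chunk.

    Returns the total number of bytes written.
    """
    total = 0
    # iter() with a sentinel keeps calling source.read() until it
    # returns b'', which signals the end of the stream.
    for chunk in iter(lambda: source.read(chunk_size), b''):
        target.write(chunk)
        total += len(chunk)
    return total

# Stand-in for a network response: 40,000 bytes of dummy data.
source = io.BytesIO(b'x' * 40000)
target = io.BytesIO()
print(copy_in_chunks(source, target))
```

Because only one chunk is held in memory at a time, the same loop should work for archives of any size.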

In [8]:
for project in tqdm.tqdm(projects):
    project = project.replace('/', '-')
    url = "https://raw.github.com/oracc/json/master/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    print("Downloading " + url + " saving as " + file)
    response = urllib.request.urlopen(url)
    CHUNK = 16 * 1024
    with open(file, 'wb') as f:
        for chunk in iter(lambda: response.read(CHUNK), b''):
            f.write(chunk)

  0%|          | 0/1 [00:00<?, ?it/s]

Downloading https://raw.github.com/oracc/json/master/saao-saa01.zip saving as jsonzip/saao-saa01.zip


100%|██████████| 1/1 [00:04<00:00,  4.17s/it]


## 1.4 Extract JSON files from `json.zip`
Extract the texts listed in the list of text IDs from the `json.zip`. All files are extracted to a directory called `json/[PROJECT]/corpusjson` (for instance `json/dcclt/corpusjson`). If the file belongs to a subproject the directory is called `json/[PROJECT]/[SUBPROJECT]/corpusjson`.
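
The extraction step can be rehearsed on a tiny archive built on the fly. This is a sketch under stated assumptions: the helper `extract_member`, the project name `demo`, and the text number `P000001` are all made up for illustration. It relies on the fact that `ZipFile.extract` raises `KeyError` when the requested member is not in the archive.

```python
import os
import tempfile
import zipfile

def extract_member(zip_path, member, target_dir):
    """Extract one member; return its path, or None if it is absent."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        try:
            return zip_ref.extract(member, target_dir)
        except KeyError:  # raised when the archive has no such member
            return None

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'demo.zip')
    # Build a tiny archive that mimics the layout of a project zip.
    with zipfile.ZipFile(zip_path, 'w') as z:
        z.writestr('demo/corpusjson/P000001.json', '{"textid": "P000001"}')
    found = extract_member(zip_path, 'demo/corpusjson/P000001.json', tmp)
    missing = extract_member(zip_path, 'demo/corpusjson/P999999.json', tmp)
    print(found is not None, missing)
```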

In [9]:
target_dir = 'json'
for no in tqdm.tqdm(pqxnos):
    project = no[:-8]
    pno = no[-7:]
    zip_file = "jsonzip/" + project.replace('/', '-') + ".zip"
    with zipfile.ZipFile(zip_file,"r") as zip_ref:
        file = project + '/corpusjson/' + pno + '.json'
        try:
            zip_ref.extract(file, target_dir)
        except KeyError:  # raised when the file is not in the archive
            print(no + ' is not available')

100%|██████████| 12/12 [00:00<00:00, 103.45it/s]


## 1.5 Parse JSON files
The function `oraccjsonparser()` takes one argument (the **ID** or **P/Q/X-number** of the `.json` file). It looks for the prefix `textid` to retrieve the six-digit P, Q, or X number of the text artifact. Parsing the file sequentially, the code looks for the places where a line starts (`'.type' = 'line-start'`) and where a word starts (`'.node' = 'l'`, where `l` is for "lemma"). At each level the code will retrieve the relevant data and create a list where each entry is a dictionary that represents a single word.

Words not only include lemmatized words, but also unlemmatized and unlemmatizable words (such as breaks).

The dictionary includes the keys `id_line` and `id_word` that allow the user to reassemble words and lines in order.
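
How the two keys allow reassembly can be illustrated in plain Python. The sketch below assumes that `id_word` has the shape `TEXT.LINE.WORD` with purely numeric line and word parts, as in `P224485.2.1`; references that carry an extra `l` (see the TODO list) and rows for $-lines, which have no `id_word`, would need additional handling. The helper name `word_sort_key` and the sample rows are made up.

```python
def word_sort_key(id_word):
    """Turn 'P224485.10.3' into ('P224485', 10, 3) for sorting."""
    id_text, line, word = id_word.split('.')
    return id_text, int(line), int(word)

# Made-up rows in the shape the parser returns (only two keys shown).
rows = [
    {'id_word': 'P224485.10.1', 'form': 'c'},
    {'id_word': 'P224485.2.2', 'form': 'b'},
    {'id_word': 'P224485.2.1', 'form': 'a'},
]
rows.sort(key=lambda row: word_sort_key(row['id_word']))
print([row['form'] for row in rows])  # ['a', 'b', 'c']
```

Converting the line and word parts to integers matters: sorted as strings, line 10 would come before line 2.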

In [24]:
def oraccjsonparser(text_id):
    project = text_id[:-8]
    PQXno = text_id[-7:]  # P, Q, or X plus six digits, without the preceding slash
    filename = 'json/' + project + '/corpusjson/' + PQXno +'.json'
    lemma_fields = ["base", "cf", "cont", "epos", "frag", "form", "gw", "inst", "lang", "morph", "norm", 
              "norm0", "pos", "sense", "sig"] # these are the fields that constitute a lemma
                                            # it does not include the gdl fields that define a sign
    with open(filename, 'r') as d:
        parser = ijson.parse(d)
        word_l = []
        word_d = {}
        line_start = False
        word_start = False
        nonx = False
        for prefix, event, value in parser:
            if prefix == 'textid':
                id_text = value
            if prefix.endswith('.type'):
                if value == 'line-start':
                    line_start = True
                else:
                    line_start = False
            if line_start:
                if prefix.endswith('.ref') and not word_start:
                    id_line = value # id_line is a reference number for a line
                                    # that includes the id_text (e.g. P123456.49)
                if prefix.endswith('.label'):
                    label = value   # label is a human-readable line number of the format
                                    # o ii 24' (obverse column 2 line 24')
            if prefix.endswith('node'):
                if value == 'l':
                    word_start = True
                    if not word_d == {}:
                        word_l.append(word_d) # append the previous word to the list
                    word_d = {}               # and start a new dictionary
                    word_d['gdl_utf8'] = ""
                    word_d['id_text'] = id_text # provide each word with appropriate 
                    word_d['id_line'] = id_line # text and line-ID
                    word_d['label'] = label     # and the line label.
                else:
                    word_start = False
            if word_start:
                if prefix.endswith('.ref'):
                    word_d['id_word'] = value
#                if prefix.endswith('.sig'):
#                    word_d['signature'] = value
#                if '.f.' in prefix:
                category = re.sub(r'.*\.', '', prefix) # get element after the last dot of the prefix
                if category in lemma_fields and not "gdl" in prefix:  # lemma-fields is a list of relevant field names
                    word_d[category] = value # copy each element into the dictionary
                if prefix.endswith('gdl_utf8'):
                    word_d['gdl_utf8'] = word_d['gdl_utf8'] + value 
            if prefix.endswith('.type'):
                if value == 'nonx':
                    nonx = True
                else:
                    nonx = False
            if nonx:                         # this captures so-called $-lines with information
                if prefix.endswith('.ref'):  # about number of broken lines/columns.
                    id_line = value          # $-lines have their own id_line.
                if prefix.endswith('.strict'):
                    if value == '1':           # select only 'strict' $ lines
                        if not word_d == {}:
                            word_l.append(word_d)
                        word_d = {}
                        word_d['id_line'] = id_line
                        word_d['id_text'] = id_text
                    else:
                        nonx = False
                if prefix.endswith('.extent'): # capture the three elements of strict $ lines
                    word_d['extent'] = value   # namely extent, scope, and state.
                if prefix.endswith('.scope'):
                    word_d['scope'] = value
                if prefix.endswith('.state'):
                    word_d['state'] = value

    word_l.append(word_d)  # make sure that the last word is captured, too.
    return(word_l) # return a list of dictionaries, where each entry (dictionary) in
                   # the list represents a word.

## 1.6 Call the Parser Function for Each Textid

In [25]:
word_l = []
for id_text in tqdm.tqdm(pqxnos):
    try:
        word_l.extend(oraccjsonparser(id_text))
    except Exception:
        print(id_text + ' is not available or not complete')

100%|██████████| 12/12 [00:01<00:00,  5.34it/s]


## 1.7 Transform the Data into a DataFrame
The `word_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame.  If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Since such fields are referenced in the code below (sections 2-4) the next cell will check for the existence of each column and create an empty column if necessary.

In [26]:
words = pd.DataFrame(word_l)
fields = ['base', 'cf', 'cont', 'epos', 'extent', 'form', 'gw', 'id_line', 'id_text', 'id_word',
          'label', 'lang', 'morph', 'norm', 'norm0', 'pos', 'scope', 'sense', 'sig']
for field in fields:
    if not field in words.columns:
        words[field] = ''
words = words.fillna('') # replace Missing Values by empty string
words.head()

Unnamed: 0,cf,epos,extent,form,frag,gdl_utf8,gw,id_line,id_text,id_word,...,norm,pos,scope,sense,sig,state,base,cont,morph,norm0
0,awātu,N,,a-bat,⸢a⸣-bat,𒀀𒁁,word,P224485.2,P224485,P224485.2.1,...,abat,N,,word,@saao/saa01%akk-x-neoass:a-bat=awātu[word//wor...,,,,,
1,šarru,N,,LUGAL,LUGAL,𒈗,king,P224485.2,P224485,P224485.2.2,...,šarri,N,,king,@saao/saa01%akk-x-neoass:LUGAL=šarru[king//kin...,,,,,
2,ana,PRP,,a-na,a-na\t,𒀀𒈾,to,P224485.2,P224485,P224485.2.3,...,ana,PRP,,to,@saao/saa01%akk-x-neoass:a-na=ana[to//to]PRP'P...,,,,,
3,Aššur-šarru-uṣur,PN,,{1}aš-šur-MAN-PAB,{1}aš-šur-MAN-⸢PAB,𒁹𒀸𒋩𒌋𒌋𒉽,1,P224485.2,P224485,P224485.2.4,...,Aššur-šarru-uṣur,PN,,1,@saao/saa01%akk-x-neoass:{1}aš-šur-MAN-PAB=Ašš...,,,,,
4,šulmu,N,,šul-mu,šul⸣-mu,𒂄𒈬,completeness,P224485.2,P224485,P224485.2.5,...,šulmu,N,,health,@saao/saa01%akk-x-neoass:šul-mu=šulmu[complete...,,,,,


## 1.8 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in tokenization or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed.

In [27]:
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

# 2. Manipulate for SNA
The columns of the `words` DataFrame may be manipulated with standard Pandas methods to create the desired output. By way of example, the following code will select proper nouns only and create two `.csv` files (`edges.csv` and `nodes.csv`) that may be ingested by Social Network Analysis (SNA) software. The column names follow the conventions used in [Gephi](https://gephi.org/).

## 2.1 Select Proper Nouns
First list all Part of Speech tags currently available in the corpus.

In [28]:
pos = list(set(words['pos']))
pos

['',
 'XP',
 'QP',
 'N',
 'REL',
 'AJ',
 'IP',
 'NU',
 'AV',
 'n',
 'DN',
 'CNJ',
 'V',
 'u',
 'DET',
 'DP',
 'GN',
 'WN',
 'PRP',
 'EN',
 'PN',
 'MOD']

Then list the tags that are relevant in the list `pos` and use that list to select the rows of the DataFrame that contain proper nouns.  

In [29]:
pos = ['CN', 'DN', 'EN', 'FN', 'GN', 'PN', 'NN', 'RN', 'SN', 'TN', 'WN'] # what is 'NN'?
proper_nouns = words.loc[words['pos'].isin(pos)].reset_index(drop=True)
proper_nouns.head()

Unnamed: 0,cf,epos,extent,form,frag,gdl_utf8,gw,id_line,id_text,id_word,...,norm,pos,scope,sense,sig,state,base,cont,morph,norm0
0,Aššur-šarru-uṣur,PN,,{1}aš-šur-MAN-PAB,{1}aš-šur-MAN-⸢PAB,𒁹𒀸𒋩𒌋𒌋𒉽,1,P224485.2,P224485,P224485.2.4,...,Aššur-šarru-uṣur,PN,,1,@saao/saa01%akk-x-neoass:{1}aš-šur-MAN-PAB=Ašš...,,,,,
1,Mat-Aššur,GN,,KUR-aš-šur{KI},KUR-aš-šur{ki},𒆳𒀸𒋩𒆠,Assyria,P224485.3,P224485,P224485.3.3,...,Mat-Aššur,GN,,Assyria,@saao/saa01%akk-x-neoass:KUR-aš-šur{KI}=Mat-Aš...,,,,,
2,Mita,PN,,{1}me-ta-a,{1}me-ta-a,𒁹𒈨𒋫𒀀,Midas,P224485.4,P224485,P224485.4.6,...,Meta,PN,,Midas,@saao/saa01%akk-x-neoass:{1}me-ta-a=Mita[Midas...,,,,,
3,Muskaya,EN,,{KUR}mus-ka-a.a,{kur}mus-ka-a.a,𒆳𒈲𒅗𒀀𒀀,Phrygian,P224485.5,P224485,P224485.5.1,...,Muskaya,EN,,Phrygian,@saao/saa01%akk-x-neoass:{KUR}mus-ka-a.a=Muska...,,,,,
4,Quwaya,EN,,{KUR}qu-u-a.a,{kur}qu-u-a.⸢a,𒆳𒄣𒌋𒀀𒀀,from-Quwe,P224485.6,P224485,P224485.6.1,...,Quwaya,EN,,from-Quwe,@saao/saa01%akk-x-neoass:{KUR}qu-u-a.a=Quwaya[...,,,,,


## 2.2 Keep Norm, Pos, and id_text
Now select the relevant columns.

In [30]:
proper_nouns = proper_nouns[['norm', 'pos', 'id_text']].drop_duplicates()
proper_nouns = proper_nouns[proper_nouns['norm'] != ''].reset_index(drop=True)
proper_nouns.head()

Unnamed: 0,norm,pos,id_text
0,Aššur-šarru-uṣur,PN,P224485
1,Mat-Aššur,GN,P224485
2,Meta,PN,P224485
3,Muskaya,EN,P224485
4,Quwaya,EN,P224485


## 2.3 Create Edge List
The edge list contains the columns `source` and `target` and combines all proper nouns that appear in a single text as source-target pairs. All edges are considered `undirected`.

In [31]:
edges = []
for i in tqdm.tqdm(range(len(proper_nouns))):
    for j in range(i+1, len(proper_nouns)):
        if proper_nouns['id_text'][i] == proper_nouns['id_text'][j]:
            edge = [proper_nouns['norm'][i], proper_nouns['norm'][j]]
            edges.append(edge)
        else:
            break
edges_df = pd.DataFrame(edges)
edges_df.columns = ['source', 'target']
edges_df['type'] = 'undirected'
edges_df.head()

100%|██████████| 66/66 [00:00<00:00, 2296.30it/s]


Unnamed: 0,source,target,type
0,Aššur-šarru-uṣur,Mat-Aššur,undirected
1,Aššur-šarru-uṣur,Meta,undirected
2,Aššur-šarru-uṣur,Muskaya,undirected
3,Aššur-šarru-uṣur,Quwaya,undirected
4,Aššur-šarru-uṣur,Urik,undirected


In [32]:
with open("output/edges.csv", 'w', encoding='utf8') as f:  # names contain non-ASCII characters
    edges_df.to_csv(f, index=False)
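
The pairing logic can also be expressed without index arithmetic. The following is an alternative sketch, not the method used in the cell above: it groups names by text first and then uses `itertools.combinations`, so it does not depend on the rows of one text being adjacent. The function name `cooccurrence_edges` is hypothetical, and unlike the cell above it treats a name as a duplicate regardless of its Part of Speech.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(rows):
    """rows: iterable of (name, id_text) pairs.

    Returns a list of (source, target) tuples, one for every unordered
    pair of distinct names that occur in the same text.
    """
    names_by_text = defaultdict(list)
    for name, id_text in rows:
        if name not in names_by_text[id_text]:  # keep first occurrence only
            names_by_text[id_text].append(name)
    edges = []
    for names in names_by_text.values():
        edges.extend(combinations(names, 2))    # every unordered pair once
    return edges

# Sample rows; the text numbers are only used as grouping keys here.
rows = [('Mita', 'P224485'), ('Urik', 'P224485'),
        ('Mita', 'P313425'), ('Quwaya', 'P224485')]
for source, target in cooccurrence_edges(rows):
    print(source, target)
```

A text with a single name, such as the second one in the sample, contributes no edges.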

## 2.4 Create Node List
`pn_set` contains the unique proper nouns in the entire corpus. This becomes the node list in Gephi.

In [33]:
pn_set = proper_nouns[['norm', 'pos']].drop_duplicates() # Assur DN and Assur GN are not considered duplicates!
pn_set.columns = ['Id', 'Type']
pn_set['Label'] = pn_set['Id']
pn_set.head()

Unnamed: 0,Id,Type,Label
0,Aššur-šarru-uṣur,PN,Aššur-šarru-uṣur
1,Mat-Aššur,GN,Mat-Aššur
2,Meta,PN,Meta
3,Muskaya,EN,Muskaya
4,Quwaya,EN,Quwaya


In [34]:
with open("output/nodes.csv", 'w', encoding='utf8') as f:  # names contain non-ASCII characters
    pn_set.to_csv(f, index=False)

# 3. Manipulate for Analysis on Line Level (e.g. phylogenetics)
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as analyzed in the [Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project) one may need to create lemmas and combine these into lines by using the `id_line` variable.

## 3.1 Create Lemmas and Adjust Bases
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N`. Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

For Sumerian projects each lemmatized word has a `base` (the word without morphology). For non-lemmatized words a place-holder base is created that consists of the transliteration of the word. If you are not working with Sumerian data, the `base` column can be ignored.
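
The rule for building a lemma can be written as a small function that handles one word at a time. This is a minimal sketch of the rule described above; the function name `make_lemma` is hypothetical, and empty strings stand for missing values, as in the `words` DataFrame.

```python
def make_lemma(cf, gw, pos, form):
    """Build an ORACC-style lemma string from the fields of one word."""
    if cf:                        # lemmatized: CitationForm[GuideWord]POS
        return cf + '[' + gw + ']' + pos
    if form:                      # unlemmatized: transliteration plus NA
        return form + '[NA]NA'
    return ''                     # no word at all (e.g. the row of a $-line)

print(make_lemma('šarru', 'king', 'N', 'LUGAL'))  # šarru[king]N
print(make_lemma('', '', 'u', 'i-bu-x'))          # i-bu-x[NA]NA
```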

In [35]:
words.columns

Index(['cf', 'epos', 'extent', 'form', 'frag', 'gdl_utf8', 'gw', 'id_line',
       'id_text', 'id_word', 'inst', 'label', 'lang', 'norm', 'pos', 'scope',
       'sense', 'sig', 'state', 'base', 'cont', 'morph', 'norm0'],
      dtype='object')

In [36]:
words['lemma'] = words['cf'] # first element of lemma is the citation form
words['lemma'] = [words['lemma'][i] + '[' + words['gw'][i] 
                     + ']' + words['pos'][i] 
                     if not words['lemma'][i] == '' 
                     else words['form'][i] +'[NA]NA' for i in range(len(words))]
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = [words['base'][i] if not words['base'][i] == '' 
                 or words['label'][i] == '' else words['form'][i] 
                 for i in range(len(words))]

## 3.2 Group by Line
In the `words` DataFrame each word has a separate row. In order to change this to a line-by-line representation we use the Pandas `.groupby` function, using the `id_line` and `label` fields as arguments (`id_line` has an abstract number that indicates the sequence of lines in a text object; `label` is a human-readable line number in the format `o ii 3`: obverse column 2, line 3). The fields that are aggregated are `lemma`, `base`, `extent`, and `scope`. The fields `extent` and `scope` represent data on the number of broken lines. If you work with Akkadian data you may want to leave out the field `base`.

In [37]:
lines = words.groupby([words['id_line'], words['label']]).agg({
        'lemma': ' '.join,
        'base': ' '.join,
        'extent': ''.join, 
        'scope': ''.join
    }).reset_index()
lines        

Unnamed: 0,id_line,label,lemma,base,extent,scope
0,P224485.10,o 9,Muskaya[Phrygian]EN pû[mouth]N tadānu[give]V,{KUR}mus-ka-a.a pi-i-šu₂ it-ta-an-na-na-ši,,
1,P224485.11,o 10,ana[to]PRP salmu[peaceful]AJ târu[turn]V ša[th...,a-na sa-al-mi-ni it-tu-ar ša taš-pur-an-ni,,
2,P224485.12,o 11,mā[saying]PRP balāt[without]PRP šarru[king]N b...,ma-a ba-lat LUGAL be-li₂-ia {LU₂}A-šip-ri-ia {...,,
3,P224485.13,o 12,Muskaya[Phrygian]EN lā[not]MOD šapāru[send]V ū...,{KUR}mus-ka-a.a la a-šap-par u₂-ma-a an-nu-rig,,
4,P224485.14,o 13,šapāru[send]V māru[son]N šipru[sending]N ištu[...,a-sap-rak-ka {LU₂}A-šip-ri-ka {LU₂}A-šip-ri-ka...,,
5,P224485.15,o 14,lū[may]MOD lā[not]MOD batāqu[cut-off]V dibbu[w...,lu la ta-bat-taq dib-bi DUG₃.GA-MEŠ šup-ra-aš₂...,,
6,P224485.16,o 15,kayyamānu[permanent]AJ mīnu[what?]QP ša[that]R...,ka-a.a-ma-nu mi-i-nu ša ṭe₃-en-šu₂-ni ši-mi a-...,,
7,P224485.17,o 16,ša[that]REL šapāru[send]V mā[saying]PRP kī[lik...,ša taš-pur-an-ni ma-a ki-i ša šu-u₂ ARAD-MEŠ š...,,
8,P224485.18,o 17,wabālu[bring]V mā[saying]PRP anāku[I]IP ardu[s...,u₂-še-bi-il-an-ni ma-a ana-ku ARAD-MEŠ-ni-šu₂ ...,,
9,P224485.19,o 18,wabālu[bring]V basi[soon]AV libbu[interior]N i...,še-bi-la-aš₂-šu₂ ba-si lib-bu-šu₂ is-si-ni ip-...,,


Note that `id_line` is a string variable and therefore does not give the lines in the right order. We should split `id_line` into two variables: `id_text` (the first 7 characters; we lost the old `id_text` column in the `.groupby` function above) and a new `line` variable, which is a number. 

In [38]:
lines['id_text'] = lines['id_line'].str[:7] # id_text was lost in the grouping above and is recreated
lines['line'] = [re.sub(r'.+\.', '', line) for line in lines['id_line']] #create a line number for sorting
lines['line'] = [x.replace('l', '') for x in lines['line']]
lines['line'] = [int(x) if not x == '' else np.nan for x in lines['line']]
lines = lines.sort_values(['id_text', 'line']).reset_index(drop=True)
lines.head(100)

Unnamed: 0,id_line,label,lemma,base,extent,scope,id_text,line
0,P224485.2,o 1,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...,a-bat LUGAL a-na {1}aš-šur-MAN-PAB šul-mu ia-a-ši,,,P224485,2
1,P224485.3,o 2,šulmu[completeness]N ana[to]PRP Mat-Aššur[Assy...,šul-mu a-na KUR-aš-šur{KI} ŠA₃-ka lu DUG₃.GA-ka,,,P224485,3
2,P224485.4,o 3,ša[that]REL šapāru[send]V mā[saying]PRP māru[s...,ša taš-pur-an-ni ma-a {LU₂}A-šip-ri {LU₂}A-šip...,,,P224485,4
3,P224485.5,o 4,Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N a...,{KUR}mus-ka-a.a ina UGU-hi-ia it-tal-ka ma-a 1...,,,P224485,5
4,P224485.6,o 5,Quwaya[from-Quwe]EN ša[that]REL Urik[1]PN ana[...,{KUR}qu-u-a.a ša {1}u₂-ri-ik a-na {LU₂}šap-ru-te,,,P224485,6
5,P224485.7,o 6,ana[to]PRP Urarṭu[1]GN wabālu[bring]V mā[sayin...,a-na {KUR}URI u₂-še-bi-lu-u-ni ma-a ina UGU-hi...,,,P224485,7
6,P224485.8,o 7,tarṣu[correct]AJ adanniš[very-much]AV annûri[n...,ta-ri-iṣ a-dan-niš an-nu-rig aš-šur {d}ša₂-maš EN,,,P224485,8
7,P224485.9,o 8,Nabu[1]DN ilu[god]N epēšu[do]V lā[not]MOD ina[...,{d}AG DINGIR-MEŠ-ia e-tap-šu₂ la ina ŠA₃ qa-ra...,,,P224485,9
8,P224485.10,o 9,Muskaya[Phrygian]EN pû[mouth]N tadānu[give]V,{KUR}mus-ka-a.a pi-i-šu₂ it-ta-an-na-na-ši,,,P224485,10
9,P224485.11,o 10,ana[to]PRP salmu[peaceful]AJ târu[turn]V ša[th...,a-na sa-al-mi-ni it-tu-ar ša taš-pur-an-ni,,,P224485,11


Note that the new `line` field is not a line number in the traditional sense of the word (this is `label`) but a number used to organize lines in the appropriate order.
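
The difference between string order and numeric order can be seen in a few lines of plain Python. The sketch below mirrors the steps taken above (take what follows the last dot, drop any `l`, convert to a number); the helper name `line_sort_key` is hypothetical.

```python
def line_sort_key(id_line):
    """Split 'P224485.10' into ('P224485', 10) so lines sort numerically."""
    id_text, _, line = id_line.rpartition('.')
    line = line.replace('l', '')  # drop the marker of a continued line
    return id_text, int(line)

id_lines = ['P224485.10', 'P224485.2', 'P224485.11', 'P224485.3']
print(sorted(id_lines))                     # string order: 10 and 11 precede 2
print(sorted(id_lines, key=line_sort_key))  # numeric order: 2, 3, 10, 11
```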

## 3.3 Save in CSV Format

In [39]:
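if filename.endswith('.txt'):
    filename = filename[:-4]  # drop the extension; safe to run more than once
with open('output/' + filename + '.csv', 'w', encoding='utf8') as w:
    lines.to_csv(w)  # the encoding is set on the file handle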

# 4. Manipulate for Document-level Analysis
For analyses that use the document as the unit of analysis, for instance in a Document Term Matrix, a similar type of manipulation is needed. The output may be used in Word2vec, in Topic Modeling, and in other types of algorithms. First, lemmas and bases are dealt with in the same way as in section 3.1 above.

## 4.1 Create Lemmas and Adjust Bases
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N`. Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

For Sumerian projects each lemmatized word has a `base` (the word without morphology). For non-lemmatized words a place-holder base is created that consists of the transliteration of the word. If you are not working with Sumerian data, the `base` column can be ignored.

In [40]:
words['lemma'] = words['cf'] # first element of lemma is the citation form
words['lemma'] = [words['lemma'][i] + '[' + words['gw'][i] 
                     + ']' + words['pos'][i] 
                     if not words['lemma'][i] == '' 
                     else words['form'][i] +'[NA]NA' for i in range(len(words))]
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = [words['base'][i] if not words['base'][i] == '' 
                 or words['label'][i] == '' else words['form'][i] 
                 for i in range(len(words))]

## 4.2 Group by Document
In order to group by Document we use the field `id_text` and aggregate `lemma` and `base`.

In [41]:
documents = words.groupby(words['id_text']).agg({
        'lemma': ' '.join,
        'base': ' '.join,
    }).reset_index()
documents

Unnamed: 0,id_text,lemma,base
0,P224485,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...,a-bat LUGAL a-na {1}aš-šur-MAN-PAB šul-mu ia-a...
1,P313425,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...,a-na LUGAL EN-ia ARAD-ka {1}EN-liq-bi lu DI-mu...
2,P313458,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...,a-na LUGAL EN-ia ARAD-ka {1}hu-un-ni-i lu-u šu...
3,P313644,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]N...,x x x x x x x x x x x x x x x x x x x x ša x x...
4,P313755,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA ap-ta-...,x x x x x ap-ta-x x x x x x x-u-ni u₂-ṣa-bit x...
5,P313876,x[NA]NA x[NA]NA x[NA]NA x+x-ka[NA]NA mā[saying...,x x x x+x-ka ma-a MI₂-šu₂ ša {LU₂}IGI.DUB a-di...
6,P313915,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]N...,x x x x x x x x x x+x x x x x x x x x {1}{d}x+...
7,P314001,x[NA]NA Arpadda[Arpad]GN x[NA]NA x[NA]NA x[NA]...,x {URU}ar-pad-du x x x x x x u₂-ma-a {1}ul-lu-...
8,P314243,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x-ba[NA]NA izu...,x x x x x-ba i-ti-iz x x x x-bi an-nu-rig dul-...
9,P334190,awātu[word]N šarru[king]N ana[to]PRP Ašipa[1]P...,a-bat LUGAL a-na {1}a-ši-pa-a DI-mu ia-a-ši ŠA...


## 4.3 Save in CSV Format

In [42]:
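stem = filename[:-4] if filename.endswith('.txt') else filename  # strip the extension only if it is still there
with open('output/' + stem + '_documents.csv', 'w', encoding='utf8') as w:  # the '_documents' suffix is an arbitrary choice
    documents.to_csv(w)  # save the document-level table rather than `lines`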

## 4.4 Tokenizing
Since lemmas do not contain spaces (see above, section 1.8) tokenizing is extremely easy and basically consists of splitting on spaces. Tokenized data are used in Topic Modeling, Word2vec, etc. Tokenizing is also necessary for making a Document Term Matrix.

In [43]:
documents['tokens'] = documents['lemma'].str.split()
documents

Unnamed: 0,id_text,lemma,base,tokens
0,P224485,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...,a-bat LUGAL a-na {1}aš-šur-MAN-PAB šul-mu ia-a...,"[awātu[word]N, šarru[king]N, ana[to]PRP, Aššur..."
1,P313425,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...,a-na LUGAL EN-ia ARAD-ka {1}EN-liq-bi lu DI-mu...,"[ana[to]PRP, šarru[king]N, bēlu[lord]N, ardu[s..."
2,P313458,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...,a-na LUGAL EN-ia ARAD-ka {1}hu-un-ni-i lu-u šu...,"[ana[to]PRP, šarru[king]N, bēlu[lord]N, ardu[s..."
3,P313644,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]N...,x x x x x x x x x x x x x x x x x x x x ša x x...,"[x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, ..."
4,P313755,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA ap-ta-...,x x x x x ap-ta-x x x x x x x-u-ni u₂-ṣa-bit x...,"[x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, ..."
5,P313876,x[NA]NA x[NA]NA x[NA]NA x+x-ka[NA]NA mā[saying...,x x x x+x-ka ma-a MI₂-šu₂ ša {LU₂}IGI.DUB a-di...,"[x[NA]NA, x[NA]NA, x[NA]NA, x+x-ka[NA]NA, mā[s..."
6,P313915,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]NA x[NA]N...,x x x x x x x x x x+x x x x x x x x x {1}{d}x+...,"[x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, ..."
7,P314001,x[NA]NA Arpadda[Arpad]GN x[NA]NA x[NA]NA x[NA]...,x {URU}ar-pad-du x x x x x x u₂-ma-a {1}ul-lu-...,"[x[NA]NA, Arpadda[Arpad]GN, x[NA]NA, x[NA]NA, ..."
8,P314243,x[NA]NA x[NA]NA x[NA]NA x[NA]NA x-ba[NA]NA izu...,x x x x x-ba i-ti-iz x x x x-bi an-nu-rig dul-...,"[x[NA]NA, x[NA]NA, x[NA]NA, x[NA]NA, x-ba[NA]N..."
9,P334190,awātu[word]N šarru[king]N ana[to]PRP Ašipa[1]P...,a-bat LUGAL a-na {1}a-ši-pa-a DI-mu ia-a-ši ŠA...,"[awātu[word]N, šarru[king]N, ana[to]PRP, Ašipa..."
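
How token lists lead to a Document Term Matrix can be shown with the standard library alone. This is a rough sketch for illustration, assuming each document is already a list of tokens; the function name `document_term_matrix` is hypothetical, and dedicated libraries would normally be used for larger corpora.

```python
from collections import Counter

def document_term_matrix(token_lists):
    """Return (vocabulary, rows): one row of term counts per document."""
    counts = [Counter(tokens) for tokens in token_lists]
    vocabulary = sorted(set().union(*counts))  # all distinct tokens, sorted
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Two short made-up documents built from lemmas of the kind shown above.
docs = ['ana[to]PRP šarru[king]N bēlu[lord]N',
        'awātu[word]N šarru[king]N ana[to]PRP šarru[king]N']
vocabulary, rows = document_term_matrix([d.split() for d in docs])
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one count per vocabulary entry, so all rows have the same length and can be stacked into a matrix.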
