The data acquisition process discussed in Chapter 2.1 transforms the ORACC JSON files into a DataFrame in which each row represents a single word and each column holds one of the data elements that describe its lemma. The DataFrame must then be manipulated to produce the required data format. For the current task that approach is rather inefficient. Instead, we can adapt the JSON parser so that it directly produces the data format that the Pointwise Mutual Information (PMI) process can ingest.

The data format we need is a list of lists, where each inner list represents a sequence of consecutive lemmas. We can then move a sliding window of *n* words over each sequence to determine which lemmas in it are treated as collocates.
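
As a minimal sketch of this format, assume a hypothetical toy corpus of two lemma sequences. The helper `sliding_windows` and the data are illustrative only and are not part of the notebook's code. Each inner list is windowed separately, so a window never spans two sequences, and a sequence shorter than *n* yields no windows at all.

```python
def sliding_windows(sequence, n):
    """Return every run of n consecutive items in sequence as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Hypothetical toy data in the list-of-lists format
sequences = [
    ["kugsig[gold]N", "giŋ[unit]N", "_", "har[ring]N"],
    ["še[barley]N", "lugal[king]N"],
]

# Window each sequence on its own (n = 3)
windows = [sliding_windows(seq, 3) for seq in sequences]
for seq_windows in windows:
    print(seq_windows)
```

The second sequence has only two lemmas, so with *n* = 3 it produces an empty list of windows.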

Each individual document may be regarded as a consecutive sequence of lemmas. Within a document, however, there are two types of breaks: physical breaks and logical breaks. A physical break is an actual break in a clay tablet. A logical break is a horizontal ruling, or the transition from the text of the document to the text of the seal impression. We want to prevent the sliding window from extending across such breaks.
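
The idea can be sketched as follows, assuming a simplified token stream in which a break is marked by `None`. That marker, the function name `split_at_breaks`, and the toy data are hypothetical; in the ORACC JSON a break is signalled differently. Runs of fewer than two lemmas are dropped because they cannot contain a pair.

```python
BREAK = None  # hypothetical marker for a physical or logical break

def split_at_breaks(tokens, min_len=2):
    """Split a token stream into unbroken sequences of at least min_len lemmas."""
    sequences, current = [], []
    for tok in tokens:
        if tok is BREAK:
            if len(current) >= min_len:
                sequences.append(current)
            current = []  # start a new sequence after the break
        else:
            current.append(tok)
    if len(current) >= min_len:  # flush the last run
        sequences.append(current)
    return sequences

tokens = ["a", "b", BREAK, "c", BREAK, "d", "e", "f"]
result = split_at_breaks(tokens)
print(result)
```

Because every sequence is windowed separately, no window can reach across a break.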

In addition, several types of words require special treatment:
* Unlemmatized words (words that are damaged or unknown) are replaced by an underscore ("_"). Such words do not contribute to our analysis, but they should not be removed either, because removing them would create artificial neighbors.
* Numbers are of no interest here and are removed entirely.
* Damaged personal names are lemmatized as PN with a citation form in the format Lu₂.x. They are treated the same way as unlemmatized words (replaced by an underscore).
* Year names are removed. Although year names are important for a variety of reasons, they belong to a different register of Sumerian and their collocates are of no interest here.
* Words that are not in Sumerian are removed.
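
A small sketch of why the placeholder matters, using hypothetical toy data in which `None` stands for an unlemmatized word: dropping the word makes its two neighbors adjacent, while the underscore keeps them apart.

```python
def neighbours(seq):
    """Return all pairs of directly adjacent items."""
    return list(zip(seq, seq[1:]))

# Hypothetical line: the middle word is unlemmatized (None)
raw = ["lugal[king]N", None, "e[house]N"]

with_placeholder = [w if w is not None else "_" for w in raw]
dropped = [w for w in raw if w is not None]

print(neighbours(with_placeholder))  # the two lemmas are separated by "_"
print(neighbours(dropped))           # the two lemmas become artificial neighbors
```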

In [1]:
import zipfile
import json
from tqdm.auto import tqdm  # import the callable, not the module; tqdm(...) is called below
import os
import sys
import pickle
import re
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

In [2]:
directories = ['jsonzip', 'output', 'corpus']
make_dirs(directories)

In [3]:
projects = input('Project(s): ').lower()

Project(s):  treasury


In [4]:
p = format_project_list(projects)
oracc_download(p)

Saving http://build-oracc.museum.upenn.edu/json/treasury.zip as jsonzip/treasury.zip.



['treasury']

In [6]:
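def parsejson(text):
    """Recursively walk the "cdl" nodes of one ORACC JSON text.

    Every unbroken run of two or more lemmas is appended to the
    global list lemm_; nothing is returned.
    """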
    l = []
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if JSONobject.get("state", "") in breakage:
            if len(l) > 1:
                lemm_.append(l)
            l = []
            continue
        if JSONobject.get("subtype","") in ['sg', 'pr']: # skip the fields "sign" and "pronunciation"
            continue                                     # in lexical texts
        if JSONobject.get("ftype", "") == "yn":
            continue # skip year names
        if "f" in JSONobject: 
            word = JSONobject["f"]
            if word["lang"][:3] != "sux": #only Sumerian and Emesal
                continue
            if word.get("pos", "") == "n":  # omit numbers
                continue
            if "cf" in word:
                # some words appear without a pos; these are provisionally treated as nouns
                lemm = f"{word['cf']}[{word['gw']}]{word.get('pos', 'N')}"
                lemm = lemm.replace(' ', '-') # replace spaces with hyphens
                lemm = lemm.replace(',', '')  # remove commas
            else:
                lemm = "_" # if word is unlemmatized enter a place holder
            if "x" in word.get("cf","").lower():  # partly damaged PN; enter placeholder
                lemm = "_"
            l.append(lemm)    
    if len(l) > 1:
        lemm_.append(l)
    return

In [7]:
lemm_ = []
ids_ = []
breakage = ['illegible', 'traces', 'missing', 'effaced','other', 'blank', 'ruling']
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):  # missing file or invalid ZIP
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files of the individual texts (P, Q, and X numbers)
    files = [name for name in files if "corpusjson" in name and name.endswith('.json')]
    for filename in tqdm(files, desc = project):                            #iterate over the file names
        id_no = filename[-13:-5]
        # ids_ stores project-prefixed IDs, so compare on the trailing P/Q number
        if "X" not in id_no and any(t.endswith(id_no) for t in ids_):
            continue        # a text may appear in multiple projects
        id_text = project + id_no # id_text is, for instance, blms/P414332
        ids_.append(id_text)
        try:
            text = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            parsejson(data_json)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(id_text + ' is not available or not complete')


The above results in the list of lists lemm_. Each inner list holds a sequence of consecutive lemmas in their original order. Breaks are not represented by a token of their own; instead, a text is split at each break, so that a single text may contribute several sequences. Secondly, the list ids_ holds all the text IDs in the order in which the texts were processed. Because a text may be split into several sequences, ids_ does not correspond one-to-one with lemm_.

The following is directly taken from the blog post "simple word vectors with co-occurrence pmi and svd" by Alex Klibisz.

In [17]:
from __future__ import print_function, division
from collections import Counter
from itertools import combinations
from math import log
from pprint import pformat
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from string import punctuation
from time import time
from nltk import ngrams
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print('Ready')

Ready


Since Klibisz is working with *titles*, he counts every pair of words that occurs within the same title and does not define a moving window. For running text that probably needs to change. Aleksi creates windows with the following code:
```python
wz = self.windowsize - 1
zip(*[text[i:] for i in range(1+wz*2)])
```
This zips multiple copies of the same text: the first starts at word 0, the next at word 1, and so on. Because the window is symmetric (it counts both forward and backward), `wz` is doubled, which gives a total window length of `1+wz*2` words.
An alternative is to use the NLTK function `ngrams`, which creates the same windows.
```python
from nltk import ngrams
n = 1+wz*2
windows = ngrams(text, n)
```
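
To make the relation between `windowsize` and the window length concrete, here is a minimal sketch that uses only the standard library. The function names and the toy text are hypothetical. It compares the zip construction with plain slicing; both give windows of `1+wz*2` consecutive words.

```python
def windows_by_zip(text, windowsize):
    """Windows built by zipping staggered copies of the text."""
    wz = windowsize - 1
    return list(zip(*[text[i:] for i in range(1 + wz * 2)]))

def windows_by_slice(text, windowsize):
    """The same windows built by slicing n consecutive words."""
    n = 1 + (windowsize - 1) * 2
    return [tuple(text[i:i + n]) for i in range(len(text) - n + 1)]

text = ["a", "b", "c", "d", "e"]  # hypothetical toy text

zipped = windows_by_zip(text, windowsize=2)   # window length 3
sliced = windows_by_slice(text, windowsize=2)
print(zipped)
print(zipped == sliced)
```

With `windowsize=2` each window holds three words: a central word with one neighbor on either side.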

In [18]:
# 2a. Compute unigram and bigram counts.
# A unigram is a single word (x). A bigram is a pair of words (x,y).
# Bigrams are counted for any two terms occurring in the same window.
# For example, the window "foo bar baz" has unigrams [foo, bar, baz]
# and bigrams [(bar, foo), (bar, baz), (baz, foo)]
t0 = time()
cx = Counter()
cxy = Counter()
for text in lemm_:
    cx.update(text)

    # Count all pairs of words within each window, even duplicate pairs.
    # A sequence shorter than the window yields no windows and thus no pairs.
    windows = ngrams(text, 7) # 7 is window length - needs to be changeable
    for w in windows: # iterate over the windows
        pairs = [tuple(l) for l in map(sorted, combinations(w, 2))]
        cxy.update(pairs)

#     # Alternative: count only 2-grams.
#     for x, y in zip(text[:-1], text[1:]):
#         cxy[(x, y)] += 1

#     # Alternative: count all pairs of words, but don't double count.
#     for x, y in set(map(tuple, map(sorted, combinations(text, 2)))):
#         cxy[(x,y)] += 1

print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / len(lemm_)))

0.093 seconds (0.00026 / iter)
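
One consequence of counting all pairs in every window is worth illustrating: a pair that falls inside several overlapping windows is counted once per window. The following sketch shows this on a hypothetical four-word sequence. It uses list slicing instead of NLTK to stay self-contained, and `count_pairs` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def count_pairs(sequences, n):
    """Count sorted word pairs in every window of n consecutive words."""
    cxy = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            cxy.update(tuple(sorted(pair)) for pair in combinations(window, 2))
    return cxy

# Two overlapping windows: (a, b, c) and (b, c, d)
toy_cxy = count_pairs([["a", "b", "c", "d"]], n=3)
print(toy_cxy)
```

Here the pair (b, c) occurs in both windows and receives a count of 2, whereas (a, d) never shares a window and is not counted at all.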


In [19]:
# 2b. Remove frequent and infrequent unigrams.
# Pick arbitrary occurrence count thresholds to eliminate unigrams occurring
# very frequently or infrequently. This decreases the vocab size substantially.
print('%d tokens before' % len(cx))
t0 = time()
sx = sum(cx.values())
min_count = 2
max_count = sx
for x in list(cx.keys()):
    if cx[x] < min_count or cx[x] > max_count:
        del cx[x]
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))
print('%d tokens after' % len(cx))
print('Most common:', cx.most_common()[:25])

935 tokens before
0.001 seconds (0.00000 / iter)
505 tokens after
Most common: [('_', 609), ('itud[moon]N', 283), ('kugsig[gold]N', 263), ('giŋ[unit]N', 257), ('šag[heart]N', 240), ('kugbabbar[silver]N', 239), ('eban[pair]N', 218), ('ki[place]N', 217), ('šu[hand]N', 166), ('teŋ[near]V/i', 137), ('suhub[boots]N', 132), ('zig[rise]V/i', 132), ('mana[unit]N', 129), ('kila[weight]N', 125), ('dušia[stone]N', 114), ('zabar[bronze]N', 107), ('šuniŋin[circular]N', 103), ('Puzrišdagan[1]SN', 102), ('maškim[administrator]N', 100), ('lugal[king]N', 99), ('še[barley]N', 98), ('ŋar[place]V/t', 95), ('har[ring]N', 79), ('Nibru[00]SN', 75), ('esir[shoe]N', 74)]


In [20]:
# 2c. Remove frequent and infrequent bigrams.
# Any bigram containing a unigram that was removed must now be removed.
t0 = time()
for x, y in list(cxy.keys()):
    if x not in cx or y not in cx:
        del cxy[(x, y)]
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))

0.010 seconds (0.00000 / iter)


In [21]:
# 3. Build unigram <-> index lookup.
t0 = time()
x2i, i2x = {}, {}
for i, x in enumerate(cx.keys()):
    x2i[x] = i
    i2x[i] = x
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))

0.000 seconds (0.00000 / iter)


In [22]:
# 4. Sum unigram and bigram counts for computing probabilities.
# i.e. p(x) = count(x) / sum(all counts).
t0 = time()
sx = sum(cx.values())
sxy = sum(cxy.values())
print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / (len(cx) + len(cxy))))

0.000 seconds (0.00000 / iter)


In [23]:
# 5. Accumulate data, rows, and cols to build sparse PMI matrix
# Recall from the blog post that the PMI value for a bigram with tokens (x, y) is: 
# PMI(x,y) = log(p(x,y) / p(x) / p(y)) = log(p(x,y) / (p(x) * p(y)))
# PPMI = max(log(p(x,y) / (p(x) * p(y))), 0)
# The probabilities are computed on the fly using the sums from above.
t0 = time()
pmi_samples = Counter()
data, rows, cols = [], [], []
for (x, y), n in cxy.items():
    rows.append(x2i[x])
    cols.append(x2i[y])
    data.append(max(log((n / sxy) / (cx[x] / sx) / (cx[y] / sx)), 0))
    pmi_samples[(x, y)] = data[-1]
PPMI = csc_matrix((data, (rows, cols)))
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))
print('%d non-zero elements' % PPMI.count_nonzero())
print('Sample PPMI values\n', pformat(pmi_samples.most_common()[:10]))

0.026 seconds (0.00000 / iter)
9256 non-zero elements
Sample PPMI values
 [(('barag[sack]N', 'uzud[goat]N'), 7.162393595403842),
 (('Aradnanna[1]PN', 'Na.wa.ar[00]SN'), 7.0082429155765835),
 (('anzah[glass]N', 'saŋkul[bolt]N'), 7.0082429155765835),
 (('habum[garment]N', 'namarum[garment]N'), 7.0082429155765835),
 (('niŋsua[object]N', 'saŋki[forehead]N'), 7.008242915576583),
 (('Wa.qar.šu.suen[00]PN', 'baza[dwarf]N'), 7.008242915576583),
 (('im[clay]N', 'šeŋ[rain]V/i'), 7.008242915576583),
 (('Eriš₂[00]SN', 'Ur.nin.mug[00]PN'), 7.008242915576583),
 (('Na.wa.ar[00]SN', 'ere[go]V/i'), 6.825921358782629),
 (('aŋarak[fluid]N', 'gu[eat]V/t'), 6.825921358782629)]
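
As a worked example of the formula in the cell above, the sketch below computes PPMI for made-up counts. The function name `ppmi` and all numbers are hypothetical; the arithmetic mirrors the expression used in the loop and uses the natural logarithm.

```python
from math import log

def ppmi(n_xy, sxy, n_x, n_y, sx):
    """Positive PMI from a pair count, two unigram counts, and their totals."""
    p_xy = n_xy / sxy   # probability of the pair
    p_x = n_x / sx      # probability of x
    p_y = n_y / sx      # probability of y
    return max(log(p_xy / (p_x * p_y)), 0.0)

# Pair seen 2 times out of 5 pairs; x seen 4 times, y 2 times, out of 10 words:
# p(x,y) = 0.4, p(x) * p(y) = 0.08, ratio = 5
strong = ppmi(n_xy=2, sxy=5, n_x=4, n_y=2, sx=10)

# Pair seen 1 time out of 20 pairs: ratio = 0.05 / 0.08 < 1, so PMI < 0 and PPMI = 0
weak = ppmi(n_xy=1, sxy=20, n_x=4, n_y=2, sx=10)

print(strong, weak)
```

A pair that co-occurs more often than its unigram frequencies predict gets a positive score; a pair that co-occurs less often than predicted is clipped to zero.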


In [24]:
# 6. Factorize the PPMI matrix using sparse SVD aka "learn the unigram/word vectors".
# This part replaces the stochastic gradient descent used by Word2vec
# and other related neural network formulations. We pick an arbitrary vector size k=20.
t0 = time()
U, _, _ = svds(PPMI, k=20)
print('%.3lf seconds' % (time() - t0))

0.030 seconds


In [25]:
# 7. Normalize the vectors to enable computing cosine similarity in next cell.
# If confused see: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
t0 = time()
norms = np.sqrt(np.sum(np.square(U), axis=1, keepdims=True))
U /= np.maximum(norms, 1e-7)
print('%.3lf seconds' % (time() - t0))

0.001 seconds
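
The reason for normalizing can be shown with a small sketch in plain Python (the notebook itself uses NumPy; the helper names and vectors here are hypothetical). Once each vector has length 1, the dot product of two vectors equals their cosine similarity.

```python
from math import sqrt

def normalize(v, eps=1e-7):
    """Scale v to unit length; eps guards against division by zero."""
    norm = sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

u = normalize([3.0, 4.0])   # -> [0.6, 0.8]
v = normalize([4.0, 3.0])   # -> [0.8, 0.6]

cos_uv = dot(u, v)          # cosine similarity of the two original vectors
print(cos_uv, dot(u, u))
```

A vector compared with itself gives a similarity of 1, which is why the query word itself has to be skipped when listing nearest neighbors.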


In [29]:
# 8. Show some nearest neighbor samples as a sanity-check.
# The format is <unigram> <count>: (<neighbor unigram>, <similarity>), ...
# From this we can see that the relationships make sense.
k = 5
for x in ['esir[shoe]N', 'kugsig[gold]N', 'lugal[king]N', 'Inana[1]DN']:
    dd = np.dot(U, U[x2i[x]]) # Cosine similarity for this unigram against all others.
    s = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs.
    for i in np.argpartition(-1 * dd, k + 1)[:k + 1]:
        if i2x[i] == x: continue  # skip the query word itself
        s += '(%s, %.3lf) ' % (i2x[i], dd[i])
    print('%s, %d\n %s' % (x, cx[x], s))
    print('-' * 10)

esir[shoe]N, 74
 (eban[pair]N, 0.907) (gudimba[~shoe]N, 0.900) (dušia[stone]N, 0.854) (sadab[laces]N, 0.828) (im[clay]N, 0.810) 
----------
kugsig[gold]N, 263
 (harhara[~jewelry]N, 0.832) (za.lu₂.NAM.ra[jewelry]N, 0.809) (gu[neck]N, 0.843) (sig[weak]V/i, 0.793) (penzer[genitals]N, 0.816) 
----------
lugal[king]N, 99
 (gagsum[arrow]N, 0.734) (gag[nail]N, 0.612) (hanni[~bow]AJ, 0.672) (gagsisa[arrow]N, 0.621) (kalag[strong]V/i, 0.605) 
----------
Inana[1]DN, 15
 (du[build]V/t, 0.701) (Pu.us₂[00]SN, 0.775) (de[pour]V/t, 0.633) (ugunu[decoration]N, 0.566) (igi[eye]N, 0.607) 
----------
