# 6 Ur III Commodities and Actors: Word Embeddings

The basic idea behind word embeddings is that word meaning is determined by the contexts in which the word is found. Or, in the famous quote by the linguist [J.R. Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "You shall know a word by the company it keeps." Each unique word (or lemma) in a corpus is assigned a vector in such a way that words attested in similar contexts receive similar vectors. As a result, vectors that represent words with similar meanings become neighbors: they are relatively close in the vector space.

A classic implementation of word embeddings is [word2vec](https://en.wikipedia.org/wiki/Word2vec), created in 2013 by a Google team under the direction of Tomas Mikolov ([Mikolov et al. 2013](https://arxiv.org/abs/1301.3781)). They used a large corpus of texts in English, derived from the Web, and trained their model with a neural network. They found that their technique not only assigned similar vectors to similar words, but also encoded in those vectors semantic components such as "male" or "female." The classic example is:

            king - male + female ≈ queen
            
In other words: if you subtract the vector for "male" from the vector for "king" and add the vector for "female," the nearest neighbor of that computed vector turns out to be the one for "queen".
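
The arithmetic can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for the purpose of the example (real embeddings have many more dimensions and are learned from a corpus), and the helper names `cosine` and `nearest` are hypothetical.

```python
import numpy as np

# Toy vectors, invented for illustration only; they are not trained on any corpus.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "male":   np.array([0.1, 0.9, 0.1]),
    "female": np.array([0.1, 0.1, 0.9]),
    "bread":  np.array([0.0, 0.3, 0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the word whose vector is most similar to target."""
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - male + female
target = vectors["king"] - vectors["male"] + vectors["female"]

# The words used in the computation are excluded from the search.
result = nearest(target, exclude={"king", "male", "female"})
print(result)
```

With these toy values the nearest remaining neighbor of the computed vector is "queen"; with real data the result depends on the corpus and the training parameters.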

Word2vec created a revolution in Natural Language Processing and found application in many different tasks. Researchers either use pre-trained word vectors or compute such vectors from their own data. Cuneiformists, of course, do not really have a choice: they have to create their own vectors, and here three important drawbacks of word2vec come to light. First, the algorithm works best on very large datasets; the initial implementation used a corpus of 1.6 billion words. For Akkadian or Sumerian such numbers are out of reach, the more so since we may want to build different models for different periods and/or different text types. Second, the neural network architecture behind word2vec works, but it is hard to explain why or how it works. In other words, there is something like a black box between the raw data and the resulting vectors, which may be acceptable for industry applications but is hard to deal with in a scholarly context. Third, training a model in word2vec (or any similar algorithm) takes a considerable amount of computing time. Since there are many parameters (such as the window size that defines the context of a word) and such settings may dramatically change the results, it is necessary to build and compare many models. That is doable and common in a Computer Science or NLP research setting, but rarely feasible in Assyriological research.

More recently researchers have proposed a simpler approach to training word vectors ([Moody 2017](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), and see [Levy and Goldberg 2014](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html)). This approach uses PMI (Pointwise Mutual Information) combined with SVD (Singular Value Decomposition), two well-understood and relatively straightforward processes. PMI is used to create a score for every pair of unique words in the corpus, based on their individual frequencies on the one hand and the frequency of the two words as collocates on the other. A high score means that the words occur together more often than expected given the frequency of each word individually. This results in a huge matrix of *n* rows and *n* columns, where *n* represents the number of unique words in the corpus. SVD is then used to reduce this matrix to a set of *m*-dimensional vectors (one vector for each word in the corpus), where *m* may be in the range between, say, 20 and 200.
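
The two steps can be sketched on a toy scale as follows. This is only an illustration under several assumptions: the tokens are invented, each sequence as a whole is treated as the context (no sliding window), and negative or undefined scores are set to zero (so-called positive PMI), a common simplification that is not prescribed by the description above.

```python
from itertools import combinations

import numpy as np

# Invented toy sequences; each one is treated as a single context.
sequences = [
    ["udu", "mash", "sipad"],
    ["udu", "sipad", "mash"],
    ["she", "gur", "engar"],
    ["she", "gur"],
]

vocab = sorted({w for seq in sequences for w in seq})
index = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Symmetric n x n matrix of co-occurrence counts.
counts = np.zeros((n, n))
for seq in sequences:
    for a, b in combinations(seq, 2):
        if a != b:
            counts[index[a], index[b]] += 1
            counts[index[b], index[a]] += 1

# PMI: log of observed co-occurrence over the co-occurrence expected
# from the individual frequencies of the two words.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
expected = row @ col / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(counts / expected)

# Assumption: keep only positive, finite scores (positive PMI).
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# SVD reduces the n x n matrix to one m-dimensional vector per word.
m = 2
u, s, vt = np.linalg.svd(ppmi)
embeddings = u[:, :m] * np.sqrt(s[:m])

for word in vocab:
    print(word, embeddings[index[word]].round(2))
```

On a real corpus the matrix is large and sparse, so a sparse matrix representation and a truncated SVD would normally be preferable to the dense NumPy arrays used here.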

## 6.1 Data Acquisition

The data format that we need for computing the PMI matrix is a list of lists, where each list represents a consecutive sequence of lemmas. "Consecutive" is crucial here: we do not want the last word of text A to be counted as a collocate of the first word of text B. Similarly, we should take textual breaks into account: after each break, a new sequence of consecutive words is started. Once we have created this data format we can define a sliding window of *n* words (where *n* is larger than 1 and usually between 2 and 15), so that each word inside a window is considered a collocate of all the others. That will provide us with the list of pairs of words for which PMI scores may be computed.
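
A minimal sketch of such a window, assuming invented toy tokens and a hypothetical helper `window_pairs()`: every pair of words that fits inside a window of the given size is counted once, and no pair ever crosses the boundary between two sequences.

```python
from collections import Counter

def window_pairs(sequences, window=3):
    """Count unordered collocate pairs that fit inside a window of `window` words."""
    pairs = Counter()
    for seq in sequences:
        for i, word in enumerate(seq):
            # the words after `word` that share a window with it
            for other in seq[i + 1 : i + window]:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Two invented sequences, standing for two texts (or two parts of a broken text).
text_a = ["udu", "niga", "ki", "lugal"]
text_b = ["mu", "szu"]

pairs = window_pairs([text_a, text_b], window=3)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

With a window of three, "udu" and "lugal" are too far apart to be counted as collocates, and "lugal" (the last word of the first sequence) is never paired with "mu" (the first word of the second). Counting each pair of positions once is one possible design; counting a pair once for every window in which it appears would give more weight to words that are close together.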

The data acquisition process discussed in Chapter 2.1 transforms the ORACC JSON files into a DataFrame, where each row represents a single lemma and each column holds one of the data elements that describe it (such as the language, the text ID, the Citation Form, etc.). That DataFrame contains all the information we need, and we could transform it into the desired data format (the list of lists) with the `pandas` methods discussed in previous chapters. However, this is a rather inefficient approach. We do not need most of the information in the DataFrame, and manipulating a DataFrame with `groupby()` and `aggregate()` is computationally expensive. Instead, we may parse the JSON files in such a way that we create the data format we need directly, without using a `pandas` DataFrame.

Each individual document may be regarded as a consecutive sequence of lemmas. Within a document, however, there are two types of breaks: physical breaks and logical breaks. A physical break is an actual break in a clay tablet. A logical break is a horizontal ruling, or the transition from the text of the document to the text of the seal impression. We want to prevent the sliding window from extending across such breaks.

In [None]:
import zipfile
import json
from tqdm.auto import tqdm   # import the tqdm() function itself, which is called below
import os
import sys
import pickle
import re
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

In [None]:
directories = ['jsonzip', 'output', 'corpus']
for d in directories: 
    os.makedirs(d, exist_ok = True)

In [None]:
projects = "epsd2/admin/ur3, epsd2/admin/lagash2"

In [None]:
p = format_project_list(projects)
oracc_download(p);

The `parsejson()` function below works in a way that is similar to the `parsejson()` functions we discussed in Chapter 2. Each .json file to be parsed represents a single text. The list `l` collects lemmatizations in the format CF\[GW\]POS (for instance lugal\[king\]N). When the function has gone through the entire file, the list `l` is appended to the list `lemm_l`. This will create a list of lists (`lemm_l`), where each lower-order list represents a single text.

Breaks in the text (both logical breaks and physical breaks) are marked in the JSON with a `state` node. This node has a restricted vocabulary to indicate breaks, traces, illegible lines, horizontal rulings, etc. The vocabulary that marks a logical or physical break is collected in the list `breakage`. When such a node is encountered, the list `l` is added to the list of lists `lemm_l` and `l` is emptied. Then the process resumes. This way, a tablet with breaks is chopped up into multiple lists of consecutive lemmas.

The rest of the function takes care of special situations:
* Unlemmatized words - words that are damaged or unknown should not be included. They are replaced by an underscore ("_") as placeholder. Such words do not contribute to our analysis, but should also not be removed because we do not want to create artificial neighbors.
* Damaged personal names are like unlemmatized words and are replaced by an underscore. Such names are lemmatized as PN with a citation form in the format Lu₂.x.
* Numbers are of no interest here and are entirely removed.
* Year names are removed. Year names are important for dating, for political history and for understanding the ideology of the period. They do not contribute meaningful collocates to the transactions in the documents studied here.
* Words that are not in Sumerian are removed. Note that loans from Akkadian are considered to be Sumerian words and are not removed. Occasionally, however, Ur III texts include Akkadian prepositions or fully conjugated Akkadian verbs. Such words are removed.

The process also skips lemmas that derive from the "sign" and "pronunciation" columns of lexical lists. That is not relevant in the current context, but may become relevant if you wish to use this code on a wider set of texts.
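
The rules listed above can be sketched on a flat list of word dictionaries. This is a simplified illustration, not the full parser: the function name `lemmas_from_words()` and the sample words are invented, the field names (`lang`, `pos`, `cf`, `gw`) follow the ORACC JSON as used in this chapter, and the translation table is assumed to turn spaces into hyphens and to delete commas. Year names are not covered, because they are marked on a higher-level node rather than on the word itself.

```python
# Assumption: spaces become hyphens and commas are deleted.
tr = str.maketrans(" ", "-", ",")

def lemmas_from_words(words):
    """Apply the filtering rules to a list of word dictionaries."""
    out = []
    for word in words:
        if not word.get("lang", "").startswith("sux"):
            continue                      # not Sumerian (or Emesal): removed
        if word.get("pos", "") == "n":
            continue                      # number: removed
        cf = word.get("cf", "")
        if not cf or "x" in cf.lower():
            out.append("_")               # unlemmatized or damaged: placeholder
            continue
        lemm = f"{cf}[{word.get('gw', '')}]{word.get('pos', 'N')}"
        out.append(lemm.translate(tr))
    return out

# Invented sample data for illustration.
words = [
    {"lang": "sux", "cf": "udu", "gw": "sheep", "pos": "N"},
    {"lang": "sux", "pos": "n"},                              # a number
    {"lang": "sux"},                                          # unlemmatized
    {"lang": "akk", "cf": "ana", "gw": "to", "pos": "PRP"},   # Akkadian
    {"lang": "sux", "cf": "Lu.x", "gw": "1", "pos": "PN"},    # damaged name
    {"lang": "sux", "cf": "sagi", "gw": "cup bearer", "pos": "N"},
]

result = lemmas_from_words(words)
print(result)
```

The number and the Akkadian word disappear from the sequence, whereas the unlemmatized word and the damaged name each leave a placeholder, so that the words on either side of them do not become artificial neighbors.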

The `parsejson()` function returns the list `l`, a list of consecutive lemmas; this list is then appended to the list `lemm_l`. If the function encounters a break in the text, the lemmas that have been collected in the list `l` so far are appended to `lemm_l` and the list `l` is cleared. Note that, in order to do so, we need to append a *copy* of `l` rather than `l` itself. Appending `l` stores only a reference to the same list object, so clearing `l` would also empty the list that was appended to `lemm_l`. This is done by appending `l.copy()`.
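
The difference between appending a list and appending a copy of it can be seen in a small experiment (the variable names and tokens are invented for the illustration):

```python
by_reference = []
by_copy = []

current = ["udu", "niga"]

by_reference.append(current)       # stores a reference to the same list object
by_copy.append(current.copy())     # stores an independent copy

current.clear()                    # empty the working list, as happens at a break

print(by_reference)   # the appended list has been emptied as well
print(by_copy)        # the copy still holds the lemmas
```

After `current.clear()`, `by_reference` contains a single empty list, whereas `by_copy` still contains the two lemmas.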

In [None]:
lemm_l = []
ids_ = []
breakage = ['illegible', 'traces', 'missing', 'effaced','other', 'blank', 'ruling']
tr = str.maketrans(" ", "-", ",")  # replace spaces with hyphens and delete commas

In [None]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        elif JSONobject.get("state", "") in breakage or JSONobject.get("subtype", "")[:5] in ["seal ", "envel"]:
            # at any logical or physical break, or where a seal or envelope starts
            if len(l) > 1:
                lemm_l.append(l.copy()) #append a copy of l
            l.clear()                   # and start over with an empty version of l
            continue
        elif JSONobject.get("ftype", "") == "yn": 
            continue                                     # and skip yearnames
        elif "f" in JSONobject:          # copy all the lemmatization data in the variable word
            word = JSONobject["f"]
            if word["lang"][:3] != "sux": #only Sumerian and Emesal
                continue
            if word.get("pos", "") == "n":  # omit numbers
                continue
            if "cf" in word:
                # some words appear without pos; these are provisionally treated as nouns
                lemm = f"{word['cf']}[{word['gw']}]{word.get('pos', 'N')}"
                lemm = lemm.translate(tr)  # normalize spaces and commas with the table tr
            else:
                lemm = "_" # if word is unlemmatized enter a place holder
            if "x" in word.get("cf","").lower():  # partly damaged PN; enter placeholder
                lemm = "_"
            l.append(lemm)           # append the lemmatization to the list l
    return l

In [None]:
seen_ids = set()   # text numbers that have already been processed
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (FileNotFoundError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files of individual texts (P, Q, and X numbers)
    files = [name for name in files if "corpusjson" in name and name.endswith(".json")]
    for filename in tqdm(files, desc = project):    # iterate over the file names
        id_no = filename[-13:-5]        # text number with leading slash, e.g. /P414332
        if id_no in seen_ids and "X" not in id_no:  # check if P/Q number was already processed
            continue        # a text may appear in multiple projects
        seen_ids.add(id_no)
        id_text = project + id_no # id_text is, for instance, blms/P414332
        ids_.append(id_text)
        l = []
        try:
            text = z.read(filename).decode('utf-8')     # read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            l = parsejson(data_json)
            if len(l) > 1:
                lemm_l.append(l.copy())
            l.clear()
        except Exception:
            print(id_text + ' is not available or not complete')

The above results in the list of lists `lemm_l`, which holds the individual texts. Each document is represented by one or more lists of lemmas, with the lemmas in their original order. In addition, the list `ids_` holds all the text IDs. These IDs are not used further; the text numbers are tracked during parsing to prevent duplication, which may be an issue if you derive data from more than one project.

Pickle the results for use in the next notebook.

In [None]:
with open('output/data_for_pmi.p', 'wb') as pkl_file:   # the file is closed automatically
    pickle.dump(lemm_l, pkl_file)

## Lemma - OID dictionary
Create a dictionary where each key is a lemma and the value is the corresponding ORACC Identification Number. The OID may be used to link to the page of the lemma in [ePSD](http://oracc.org/epsd2). The OIDs are found in the glossaries that are part of the ORACC JSON output.

In [None]:
x2oid = {}
gloss = ["sux" , "qpn"]
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (FileNotFoundError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    for g in gloss:
        filename = f"{project}/gloss-{g}.json"
        data = z.read(filename).decode('utf-8')         #read and decode the json file of one glossary
        data_json = json.loads(data)                # make it into a json object (essentially a dictionary)
        entries = data_json["entries"]
        x = {entry.get("headword", "") : entry.get("oid", "") for entry in entries}
        x2 = {key.translate(tr) : x.get(key) for key in x} # remove commas and spaces from keys
        x2oid.update(x2)

In [None]:
with open('output/x2oid.p', 'wb') as pkl_file:   # the file is closed automatically
    pickle.dump(x2oid, pkl_file)