# 6 Ur III Commodities and Actors: Word Embeddings

The basic idea behind word embeddings is that word meaning is determined by the contexts in which the word is found. Or, in the famous quote by the linguist [J.R. Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "You shall know a word by the company it keeps." Each unique word (or lemma) in a corpus is assigned a vector in such a way that words attested in similar contexts receive similar vectors. As a result, vectors that represent words with similar meanings become neighbors: they are relatively close in the vector space.

A classic implementation of word embeddings is [word2vec](https://en.wikipedia.org/wiki/Word2vec), created in 2013 by a Google team under the direction of Tomas Mikolov ([Mikolov et al. 2013](https://arxiv.org/abs/1301.3781)). They used a large corpus of texts in English, derived from the Web, and trained their model with a neural network. They found that their technique not only assigned similar vectors to similar words, but also encoded in those vectors semantic components such as "male" or "female." The classic example is

            king - male + female ≈ queen
            
In other words: if you subtract the vector for "male" from the vector for "king" and add the vector for "female," the nearest neighbor of that computed vector turns out to be the one for "queen".
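
The arithmetic can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for this illustration (real embeddings have many more dimensions and are learned from data), and the helper names `cosine` and `nearest` are hypothetical. Closeness is measured here with cosine similarity, which is a common choice but an assumption of this sketch.

```python
import numpy as np

# Toy vectors, invented for illustration only (not taken from any trained model).
# The three dimensions can loosely be read as: royalty, maleness, femaleness.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "male":   np.array([0.1, 0.9, 0.1]),
    "female": np.array([0.1, 0.1, 0.9]),
    "bread":  np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the word whose vector is most similar to `target`."""
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - male + female
computed = vectors["king"] - vectors["male"] + vectors["female"]

# The words used in the computation are excluded from the search
result = nearest(computed, exclude=("king", "male", "female"))
print(result)  # queen
```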

Word2vec created a revolution in Natural Language Processing and found application in many different tasks. Researchers either use pre-trained word vectors or compute such vectors from their own data. Cuneiformists, of course, do not really have a choice: they have to create their own vectors, and here three important drawbacks of word2vec come to light. First, the algorithm works best on very large datasets - the initial implementation used a corpus of 1.6 billion words. For Akkadian or Sumerian such numbers are not feasible, all the more so because we may want to build different models for different periods and/or different text types. Second, the neural network architecture behind word2vec works, but it is hard to explain why or how it works. In other words, there is something like a black box between the raw data and the resulting vectors, which may be acceptable for industry applications but is hard to deal with in a scholarly context. Third, training a model in word2vec (or any similar algorithm) takes a considerable amount of computing time. Since there are many parameters (such as the window size that defines the context of a word) and such settings may dramatically change the results, it is necessary to build and compare many models. That is doable and common in a Computer Science or NLP research setting, but hardly feasible in Assyriological research.

More recently researchers have proposed a simpler approach to training word vectors ([Moody 2017](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), and see [Levy and Goldberg 2014](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html)). This approach uses PMI (Pointwise Mutual Information) combined with SVD (Singular Value Decomposition), two well-understood and relatively straightforward processes. PMI is used to compute a score for every pair of unique words in the corpus, based on their individual frequencies on the one hand and the frequency with which the two words appear as collocates on the other. This results in a huge matrix of *n* rows and *n* columns, where *n* represents the number of unique words in the corpus. SVD is then used to reduce this matrix to a set of *m*-dimensional vectors (one vector for each word in the corpus), where *m* may be in the range between, say, 20 and 200.
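
The two steps can be sketched on a toy example with `numpy`. Everything specific here is an assumption made for illustration: the four short "texts" are invented, the window size of 2 is arbitrary, word probabilities are estimated from the row sums of the co-occurrence matrix, negative PMI scores are set to zero (so-called positive PMI), and the word vectors are taken as the left singular vectors scaled by the singular values. Other choices are possible at each of these points.

```python
import numpy as np

# Invented toy corpus in the lemma format used in this chapter
texts = [
    ["udu[sheep]n", "niga[fattened]v/i", "lugal[king]n"],
    ["gud[ox]n", "niga[fattened]v/i", "lugal[king]n"],
    ["zid[flour]n", "gur[unit]n", "ensik[ruler]n"],
    ["ninda[bread]n", "gur[unit]n", "ensik[ruler]n"],
]
window = 2  # assumed context size: up to 2 words on either side

vocab = sorted({w for t in texts for w in t})
index = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Step 1a: count how often each pair of words occurs within the window
counts = np.zeros((n, n))
for t in texts:
    for i, w in enumerate(t):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if j != i:
                counts[index[w], index[t[j]]] += 1

# Step 1b: PMI = log( p(a, b) / (p(a) * p(b)) )
total = counts.sum()
p_pair = counts / total
p_word = counts.sum(axis=1) / total
with np.errstate(divide="ignore"):       # log(0) = -inf for pairs never seen together
    pmi = np.log(p_pair / np.outer(p_word, p_word))
ppmi = np.maximum(pmi, 0)                # positive PMI: negative scores become 0

# Step 2: SVD reduces the n x n matrix to n vectors of m dimensions
m = 2
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :m] * S[:m]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sheep, ox, flour = (embeddings[index[w]] for w in ("udu[sheep]n", "gud[ox]n", "zid[flour]n"))
print(cosine(sheep, ox), cosine(sheep, flour))
```

In this toy corpus "sheep" and "ox" occur in identical contexts, so their vectors should come out much closer to each other than either is to "flour".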

## 6.1 Data Acquisition
For demonstration purposes we will use the corpus of Ur III texts, dating to the last century of the third millennium and overwhelmingly written in Sumerian. The great majority of these texts are administrative in nature and record the income and expenses of government offices. Smaller text groups include letters, contracts, and records of litigation (royal inscriptions, literary texts, and incantations are excluded). At the time of writing more than 100,000 such texts from the Ur III period are known. More than 72,000 of these are edited in the [ORACC](http://oracc.org) project [epsd2/admin/ur3](http://oracc.org/epsd2/admin/ur3). Most of these texts were originally transcribed for [CDLI](http://cdli.ucla.edu) by various contributors; the [CDLI](http://cdli.ucla.edu) editions were imported into [ORACC](http://oracc.org), where they were lemmatized for [ePSD2](http://oracc.org/epsd2) by Steve Tinney and Niek Veldhuis.

The data acquisition process is not essentially different from what was discussed in Chapter 2.1. Instead of producing a DataFrame with the transliterations, their lemmatizations, and several other pieces of information, we will directly create the data format that is used in the process of creating the vectors. For that reason, the `parsejson()` function we developed in 2.1 will receive yet another incarnation.

In [1]:
import zipfile
import json
from tqdm.auto import tqdm  # import the tqdm() function itself; it is called directly below
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

In [8]:
directories = ['jsonzip', 'output', 'corpus']
make_dirs(directories)
project = "epsd2/admin/ur3" # the function oracc_download() expects a list
oracc_download([project])

### 6.1.1 The parsejson() function
For the `parsejson()` function see Chapter 2.1. The function adds lemmatization data in the format `lugal[king]n` to the list `l`. If a word is not lemmatized (there is no citation form) it is represented by an underscore ("\_"). This is important, because the process of creating word embedding vectors uses a sliding window that moves over the text. If we simply skipped unlemmatized words, words that are in fact separated by one or more unlemmatized words would end up next to each other, creating artificial neighbors. The code skips words that belong to year names. Year names are important for many reasons, but their vocabulary does not belong to the transaction at hand. Year names may have the form "Year in which (the city) Maški was destroyed" or "Year in which the high priestess was chosen by omen." Obviously, that does not imply that the transaction in the text has anything to do with Maški or with the high priestess. If you wish to use the code below for building a vector model based on other types of text you may well want to remove those lines.

Also removed (or rather, replaced by underscores) are words (usually names) that have an "X" in the citation form. These are partly legible names that cannot be meaningfully used in the present analysis.
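
The effect of the placeholder on a sliding window can be made concrete with a small sketch. The function `context_pairs` and the three-word line are hypothetical, and the choice to let the window pass over "\_" without pairing it with anything is an assumption of this sketch, not necessarily what the downstream vector-building code does.

```python
def context_pairs(lemmas, window=1):
    """Return (word, neighbor) pairs within `window` positions.

    Placeholders ("_") occupy a position in the window but never
    become part of a pair.
    """
    pairs = []
    for i, word in enumerate(lemmas):
        if word == "_":
            continue
        for j in range(max(0, i - window), min(len(lemmas), i + window + 1)):
            if j != i and lemmas[j] != "_":
                pairs.append((word, lemmas[j]))
    return pairs

# Hypothetical line in which the middle word is not lemmatized
with_placeholder = ["udu[sheep]n", "_", "lugal[king]n"]
# The same line with the unlemmatized word simply dropped
word_skipped = ["udu[sheep]n", "lugal[king]n"]

kept = context_pairs(with_placeholder)
dropped = context_pairs(word_skipped)
print(kept)     # []
print(dropped)  # [('udu[sheep]n', 'lugal[king]n'), ('lugal[king]n', 'udu[sheep]n')]
```

With the placeholder in place, "sheep" and "king" stay two positions apart and are not paired at a window size of 1. When the unlemmatized word is dropped they become direct neighbors, which is the artifact the underscore is meant to prevent.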

In [9]:
def parsejson(text):
    # Walks the nested "cdl" nodes of one text and appends each lemma
    # to the list l, which is created in the calling cell for every text.
    for JSONobject in text["cdl"]:
        word = []
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if "f" in JSONobject:
            if "ftype" in JSONobject:
                if JSONobject["ftype"] == "yn": # exclude year names
                    continue
            # the checks below apply to every word, with or without an "ftype" key
            if JSONobject["f"]["lang"][:3] == "sux": # only Sumerian and Emesal
                word = JSONobject["f"]
            if "cf" in word:
                if "x" in word["cf"].lower(): # replace partly legible names by underscore (placeholder)
                    lemm = "_"
                else: 
                    lemm = f"{word['cf']}[{word['gw']}]{word['pos']}"
                    lemm = lemm.replace(' ', '-') # replace spaces by hyphens
                    lemm = lemm.replace(',', '')  # remove commas
            else:
                lemm = "_" # if word is unlemmatized (or not Sumerian) enter a placeholder
            lemm = lemm.lower()
            l.append(lemm)
    return
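
The behavior of the function can be checked on a small, invented piece of JSON. The sketch below is a hypothetical variant, `collect_lemmas()`, that follows the same rules but passes the output list explicitly instead of relying on a variable `l` defined elsewhere. The toy object only mimics the keys used above (`cdl`, `f`, `ftype`, `lang`, `cf`, `gw`, `pos`); it is not real ORACC data.

```python
def collect_lemmas(node, out=None):
    """Recursively collect lemmas from nested "cdl" lists (hypothetical variant)."""
    if out is None:
        out = []
    for obj in node.get("cdl", []):
        if "cdl" in obj:
            collect_lemmas(obj, out)      # descend into nested nodes
        if "f" in obj:
            if obj.get("ftype") == "yn":  # skip year names
                continue
            f = obj["f"]
            usable = (
                f.get("lang", "")[:3] == "sux"  # Sumerian and Emesal only
                and "cf" in f                   # word is lemmatized
                and "x" not in f["cf"].lower()  # name is fully legible
            )
            if usable:
                lemm = f"{f['cf']}[{f['gw']}]{f['pos']}"
                lemm = lemm.replace(" ", "-").replace(",", "")
            else:
                lemm = "_"                # placeholder keeps the position
            out.append(lemm.lower())
    return out

# Invented toy data, for illustration only
toy_text = {"cdl": [
    {"cdl": [
        {"f": {"lang": "sux", "cf": "udu", "gw": "sheep", "pos": "N"}},
        {"f": {"lang": "sux"}},                                          # unlemmatized
        {"f": {"lang": "sux", "cf": "Ur-X", "gw": "1", "pos": "PN"}},    # partly legible
        {"f": {"lang": "sux", "cf": "lugal", "gw": "king", "pos": "N"},
         "ftype": "yn"},                                                 # year name
    ]}
]}

lemmas = collect_lemmas(toy_text)
print(lemmas)  # ['udu[sheep]n', '_', '_']
```

Returning the list makes the function usable on its own; the notebook's version instead depends on `l` being reset before each call.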

In [10]:
lemm_ = []
ids_ = []
file = "jsonzip/" + project.replace("/", "-") + ".zip"
try:
    z = zipfile.ZipFile(file)       # create a Zipfile object
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files in the corpusjson directory (one file per P, Q, or X number)
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
    for filename in tqdm(files, desc = project):                            #iterate over the file names
        id_text = f"{project}{filename[-13:-5]}" # id_text is, for instance, epsd2/admin/ur3/P143238
        ids_.append(id_text)
        l = []
        try:
            text = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            #lemm_.append(f"\n{id_text}")     # new text starts on new line with text_id
            parsejson(data_json)
            lemm_.append(l)
        except:
            print(id_text + ' is not available or not complete')
except:
    print(file + " does not exist or is not a proper ZIP file")

epsd2/admin/ur3/P143238 is not available or not complete

