# 6 Ur III Commodities and Actors: Word Embeddings

The basic idea behind word embeddings is that word meaning is determined by the contexts in which the word is found. Or, in the famous quote by the linguist [J.R. Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "You shall know a word by the company it keeps." Each unique word (or lemma) in a corpus is assigned a vector in such a way that words attested in similar contexts receive similar vectors. As a result, vectors that represent words with similar meanings become neighbors: they are relatively close in the vector space.

A classic implementation of word embeddings is [word2vec](https://en.wikipedia.org/wiki/Word2vec), created in 2013 by a Google team under the direction of Tomas Mikolov ([Mikolov e.a. 2013](https://arxiv.org/abs/1301.3781)). They used a large corpus of texts in English, derived from the Web, and trained their model with a neural network. They found that their technique not only assigned similar vectors to similar weords, but also encoded in those vectors semantic components such as "male" or "female." The classic example is 

            king - male + female ≈ queen
            
In other words: if you subtract the vector for "male" from the vector for "king" and add the vector for "female," the nearest neighbor of that computed vector turns out to be the one for "queen".

Word2vec creeated a revolution in Natural Language Processing and found application in many different tasks. Researchers either use pre-defined word vectors or compute such vectors from their own data. Cuneiformists, of course, do not really have a choice. They have to create their own vectors and here three important drawbacks of word2vec come to light. First, the algorithm works best on very large datasets - the initial implementation used a corpus of 1.6 billion words. For Akkadian or Sumerian such numbers are not feasible, the more so since we may want to build different models for different periods and/or different text types. Second, the neural network architecture behind word2vec works, but it is hard to explain why it works or how. In other words, there is something like a black box between the raw data and the resulting vectors which may be OK for industry applications, but is hard to deal with in a scholarly context. Third, training a model in word2vec (or any similar algorithm) takes a considerable amount of computing time. Since there are many parameters (such as the window size that defines the context of a word) and such settings may dramatically change the results, it is necessary to build and compare many models. Building and comparing many models is doable and common in a Computer Science or NLP research setting, but in Assyriology research that is hardly feasible.

More recently researchers have proposed a simpler approach to training word vectors ([Moody 2017](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), and see [Levy and Goldberg 2014](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html)). This approach uses PMI (Pointwise Mutual Information) combined with SVD (Singular Vector Decomposition), two well-understood and relatively straightforward processes. PMI is used to create a score between every pair of unique words in the corpus based on their individual frequency on the one hand, and the frequency of the two words as collocates. This results in huge matrix of *n* rows and *n* column, where *n* represents the number of unique words in the corpus. SVD is then used to reduce this matrix to a set of *m* dimensional vectors (one vector for each word in the corpus), where *m* may be in the range between, say, 20 and 200. 

## 6.1.1 Data Acquisition
For demonstration purposes we will use the corpus of Ur III texts, dating to the last century of the third millennium and overwhelmingly written in Sumerian. The great majority of these texts are administrative in nature and record the income and expenses of government offices. Smaller text groups include letters, contracts, and records of litigation (royal inscriptions, literary texts, and incantations are excluded). At the moment of writing more than 100,000 such texts from the Ur III period are known. More than 72,000 of these are edited in the [ORACC](http://oracc.org) project [epsd2/admin/ur3](http://oracc.org/epsd2/admin/ur3). Most of these texts were originally transcribed for [CDLI](http://cdli.ucla.edu) by various contributors; the [CDLI](http://cdli.ucla.edu) editions were imported into [ORACC](http://oracc.org) where they were lemmatized for [ePSD2](http://oracc.org/epsd2) by Steve Tinney and Niek Veldhuis.

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import os
import sys
import re
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

In [2]:
directories = ['jsonzip', 'output', 'corpus']
make_dirs(directories)
project = "epsd2/admin/ur3" # the function oracc_download() expects a list
df = get_data(project)

Downloading JSON
Saving http://build-oracc.museum.upenn.edu/json/epsd2-admin-ur3.zip as jsonzip/epsd2-admin-ur3.zip.


HBox(children=(FloatProgress(value=0.0, description='epsd2/admin/ur3', max=577877494.0, style=ProgressStyle(de…

Parsing JSON


HBox(children=(FloatProgress(value=0.0, description='epsd2/admin/ur3', max=71712.0, style=ProgressStyle(descri…

epsd2/admin/ur3/P143238 is not available or not complete



### 6.1.2 Data Formatting

The `get_data()` function in `utils` returns a DataFrame in which each word (lemma) is a row. We can manipulate the dataframe with the same methods we have used in previous chapters.

> This is not the most efficient way of acquiring and formatting the data. One could adjust the `parsejson()` function discussed in Chapter 2.1 to select the data and directly format that data in a way that can be used in the next section. However, this formatting will likely be slightly different for different corpora. For the Ur III data one may want to exclude numbers and year names, but that is not a concern when dealing with literary texts. For that reason, the code in this section builds the full DataFrame and manipulates the data in that format.

The first line in the code below creates the `lemma` column in the format **lugal\[king\]N**, as we have done in other chapters. The other lines change or remove specific types of rows:
* If there is no Citation Form, replace `lemma` by underscore. This deals with unlemmatized words - words that are damaged or unknown. Such words do not contribute to our analysis, but should also not be removed because we do not want to create artificial neighbors.
* If Part of Speech is "n" (Number), remove. Numbers are of no interest here and are removed.
* If Citation Form contains an "x" or "X" replace `lemma` by underscore. This is common in damaged personal names. The word is marked as a PN in the lemmatization, but since it is only partly preserved the Citation Form looks like Lu₂.x.
* If "ftype" = "yn": remove all words that belong to year names. Although year names are important for a variety of reasons, they do not contribute 
* Use the `state` column to indicate logical breaks (indicated by horizontal rulings and the like) and physical breaks in the text.
* Include only words in Sumerian

In [3]:
physical_break = ['illegible', 'traces', 'missing', 'effaced']
logical_break = ['other', 'blank', 'ruling']
df["lemma"] = df['cf'] + '[' + df['gw'] + ']' + df['pos']
df.loc[df["cf"] == "" , 'lemma'] = "_" 
df.loc[df["cf"].str.contains("x|X"), 'lemma'] = '_'
df.loc[df["state"].isin(logical_break), 'lemma'] = "break_logical"
df.loc[df["state"].isin(physical_break), 'lemma'] = "break_physical"
df.loc[df["lemma"].str.startswith("break_"), "lang"] = "sux"
df = df.loc[df["pos"] != "n"]
df = df.loc[df["ftype"] != "yn"]
df = df.loc[df["lang"].str.startswith("sux")]

We will now create one row per text with the pandas `grouby()` and `aggregate()`functions. The entire text of the `lemma`column is put in the variable `text`as a single string, with line breaks between the individual texts. We add additional line breaks to replace "break_logical" and "break_physical". This will make sure that the moving window of the PMI process does not cross a break. Then we use`strip()` to remove white space and/or line breaks at the beginning or the end of the entire string. Finally, the string is made into a list, by splitting on line breaks.

In [9]:
df_bytext = df.groupby("id_text").agg({"lemma" : " ".join})
text = "\n".join(df_bytext["lemma"]) 
b = re.compile("break_[^ ]+")
t = re.sub(b, "\n", text).strip()
l = t.split("\n")
lemm_ = []
for text in l: 
    text = text.strip()
    t = text.split()
    lemm_.append(t)

In [10]:
from __future__ import print_function, division
from collections import Counter
from itertools import combinations
from math import log
from pprint import pformat
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from string import punctuation
from time import time
from nltk import ngrams
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print('Ready')

Ready


Since Klibisz is working with *titles* he compares all possible bigrams and does not define a moving window. That probably needs to change. Aleksi creates windows with the following code:
```python
wz = self.windowsize - 1
zip(*[text[i:] for i in range(1+wz*2)])
```
This will zip multiple versions of the same text - the first starts at word 0, the next at word 1, etc. Since the window is symmetric (counting forward and backward) take it twice .
An alternative method is to use NLTK ngrams, which will create windows.
```python
from nltk import ngrams
n = 1+wz*2
windows = ngrams(text, n)
```

In [11]:
# 2a. Compute unigram and bigram counts.
# A unigram is a single word (x). A bigram is a pair of words (x,y).
# Bigrams are counted for any two terms occurring in the same title.
# For example, the title "Foo bar baz" has unigrams [foo, bar, baz]
# and bigrams [(bar, foo), (bar, baz), (baz, foo)]
t0 = time()
cx = Counter()
cxy = Counter()
for text in lemm_:
    cx.update(text)

    # Count all pairs of words, even duplicate pairs.
    windows = ngrams(text, 7) # 7 is window length - needs to be changeable
    for w in windows: # this creates the windows
        z = [tuple(l) for l in map(sorted, combinations(w, 2))]
        cxy.update(z)

#     # Alternative: count only 2-grams.
#     for x, y in zip(text[:-1], text[1:]):
#         cxy[(x, y)] += 1

#     # Alternative: count all pairs of words, but don't double count.
#     for x, y in set(map(tuple, map(sorted, combinations(text, 2)))):
#         cxy[(x,y)] += 1

print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / len(lemm_)))

18.040 seconds (0.00015 / iter)


In [12]:
# 2b. Remove frequent and infrequent unigrams.
# Pick arbitrary occurrence count thresholds to eliminate unigrams occurring
# very frequently or infrequently. This decreases the vocab size substantially.
print('%d tokens before' % len(cx))
t0 = time()
sx = sum(cx.values())
min_count = 2
max_count = sx
for x in list(cx.keys()):
    if cx[x] < min_count or cx[x] > max_count:
        del cx[x]
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))
print('%d tokens after' % len(cx))
print('Most common:', cx.most_common()[:25])

21149 tokens before
0.012 seconds (0.00000 / iter)
10931 tokens after
Most common: [('_', 165829), ('sila[unit]N', 94854), ('giŋ[unit]N', 58564), ('itud[moon]N', 52699), ('dumu[child]N', 45917), ('gur[unit]N', 44814), ('ki[place]N', 44124), ('ud[sun]N', 42837), ('udu[sheep]N', 34937), ('šuniŋin[circular]N', 33716), ('kaš[beer]N', 33515), ('še[barley]N', 32950), ('2[00]PN', 30064), ('kišib[seal]N', 27758), ('ninda[bread]N', 27062), ('lugal[king]N', 24310), ('i[oil]N', 22746), ('gud[ox]N', 22083), ('ŋuruš[male]N', 20810), ('dubsar[scribe]N', 20173), ('šu[hand]N', 17854), ('niga[fattened]V/i', 17526), ('ugula[overseer]N', 16437), ('ŋiri[foot]N', 16015), ('sila[lamb]N', 15922)]


In [13]:
# 2c. Remove frequent and infrequent bigrams.
# Any bigram containing a unigram that was removed must now be removed.
t0 = time()
for x, y in list(cxy.keys()):
    if x not in cx or y not in cx:
        del cxy[(x, y)]
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))

0.257 seconds (0.00000 / iter)


In [14]:
# 3. Build unigram <-> index lookup.
t0 = time()
x2i, i2x = {}, {}
for i, x in enumerate(cx.keys()):
    x2i[x] = i
    i2x[i] = x
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))

0.005 seconds (0.00000 / iter)


In [15]:
# 4. Sum unigram and bigram counts for computing probabilities.
# i.e. p(x) = count(x) / sum(all counts).
t0 = time()
sx = sum(cx.values())
sxy = sum(cxy.values())
print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / (len(cx) + len(cxy))))

0.009 seconds (0.00000 / iter)


In [16]:
# 5. Accumulate data, rows, and cols to build sparse PMI matrix
# Recall from the blog post that the PMI value for a bigram with tokens (x, y) is: 
# PMI(x,y) = log(p(x,y) / p(x) / p(y)) = log(p(x,y) / (p(x) * p(y)))
# PPMI = max(log(p(x,y) / (p(x) * p(y))), 0)
# The probabilities are computed on the fly using the sums from above.
t0 = time()
pmi_samples = Counter()
data, rows, cols = [], [], []
for (x, y), n in cxy.items():
    rows.append(x2i[x])
    cols.append(x2i[y])
    data.append(max(log((n / sxy) / (cx[x] / sx) / (cx[y] / sx)), 0))
    pmi_samples[(x, y)] = data[-1]
PPMI = csc_matrix((data, (rows, cols)))
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))
print('%d non-zero elements' % PPMI.count_nonzero())
print('Sample PPMI values\n', pformat(pmi_samples.most_common()[:10]))

1.312 seconds (0.00000 / iter)
628620 non-zero elements
Sample PPMI values
 [(('ba.ba₆.ba.an.zi.ge[00]PN', 'lamma.e.silim.mu[00]PN'), 12.987668656041233),
 (('Gir₂.ba.du[00]PN', 'I.di₃.ni.šu[00]PN'), 12.987668656041233),
 (('Ur.me.me.ka.bi[00]PN', 'ahulŋal[mistreatment]N'), 12.900657279051602),
 (('Ba.ša.an.ti.ba.at[00]PN', 'Mu.lu.uš[00]PN'), 12.805347099247278),
 (('Ha.ur₂.na.ni.gi₄[00]PN', 'La.ap.hi[00]PN'), 12.805347099247278),
 (('A.za.ba.an[00]PN', 'Mi.iš.hi.ni.iš.hi[00]PN'), 12.805347099247278),
 (('A.za.ba.an[00]PN', 'Ku.uš.lil₂.li[00]PN'), 12.805347099247278),
 (('Du.ga.ma.aš.ti[00]PN', 'Mi.šu.a.bi.ir[00]PN'), 12.805347099247278),
 (('Ga.ba.ba[00]PN', 'Mu.ki.iš[00]SN'), 12.805347099247278),
 (('Geme₂.lu₅[00]PN', 'lamma.ri.sa.mu[00]PN'), 12.805347099247278)]


In [17]:
# 6. Factorize the PPMI matrix using sparse SVD aka "learn the unigram/word vectors".
# This part replaces the stochastic gradient descent used by Word2vec
# and other related neural network formulations. We pick an arbitrary vector size k=20.
t0 = time()
U, _, _ = svds(PPMI, k=20)
print('%.3lf seconds' % (time() - t0))

0.213 seconds


In [18]:
# 7. Normalize the vectors to enable computing cosine similarity in next cell.
# If confused see: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
t0 = time()
norms = np.sqrt(np.sum(np.square(U), axis=1, keepdims=True))
U /= np.maximum(norms, 1e-7)
print('%.3lf seconds' % (time() - t0))

0.002 seconds


In [19]:
# 8. Show some nearest neighbor samples as a sanity-check.
# The format is <unigram> <count>: (<neighbor unigram>, <similarity>), ...
# From this we can see that the relationships make sense.
k = 5
for x in ['suhur[carp]N', 'gigir[chariot]N', 'šah[pig]N', 'Inanak[1]DN']:
    dd = np.dot(U, U[x2i[x]]) # Cosine similarity for this unigram against all others.
    s = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs.
    for i in np.argpartition(-1 * dd, k + 1)[:k + 1]:
        if i2x[i] == x: continue
        xy = tuple(sorted((x, i2x[i])))
        s += '(%s, %.3lf) ' % (i2x[i], dd[i])
    print('%s, %d\n %s' % (x, cx[x], s))
    print('-' * 10)

suhur[carp]N, 77
 (tun[container]N, 0.929) (niŋki[fish]N, 0.942) (saŋkur[fish]N, 0.952) (niŋbuna[turtle]N, 0.924) (saŋkešed[fish]N, 0.923) 
----------
gigir[chariot]N, 295
 (dus[bathroom]N, 0.834) (guza[chair]N, 0.850) (emarru[quiver]N, 0.779) (gal[cup]N, 0.754) (banšur[table]N, 0.756) 
----------
šah[pig]N, 480
 (Šu.bar[00]PN, 0.977) (u[goose]N, 0.939) (tumgur[bird]N, 0.930) (zeda[piglet]N, 0.926) (pešgi[rodent]N, 0.926) 
----------
Inanak[1]DN, 264
 (Du.du[00]PN, 0.775) (Ka₅.a[00]PN, 0.770) (E₂.gissu[00]PN, 0.764) (Lugal.me.lam₂[00]PN, 0.742) (AŠ[00]PN, 0.720) 
----------
