The following is inspired by the blog post [simple word vectors with co-occurrence pmi and svd](https://www.kaggle.com/alexklibisz/simple-word-vectors-with-co-occurrence-pmi-and-svd) by Alex Klibisz.

In [15]:
from __future__ import print_function, division
from collections import Counter
from itertools import combinations
from math import log
from pprint import pformat
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from sklearn import preprocessing
from string import punctuation
from time import time
from nltk import ngrams
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import norm
import pandas as pd
import pickle
from tqdm.auto import tqdm
print('Ready')

Ready


First read in the list of list that was created in the previous notebook.

In [2]:
with open("output/data_for_pmi.p", "rb") as r:
    lemm_l = pickle.load(r)

In order to compute PMI we need a sliding window going over the data. There are various ways of creating such windows. 
We will use the `ngrams` function from the `nltk` package.
```python
from nltk import ngrams
n = 7
windows = ngrams(text, n)
```
This will create windows of the size `n`

For each window the function `combinations()` (from the `itertool` library) is used to produce all possible combinations of 2 words in that window. With a 7-word window length, a sequence such as en-lil₂ kur gal-ra šulgi lugal-e e₂ mu-na-du₃ ("Šulgi the king build a temple for Great Mountain Enlil") will yield the following 21 word pairs as tuples:  
* (Enlil\[1\]DN, kur\[mountain\]N)
* (Enlil\[1\]DN, gal\[big\]V/i)
* (Enlil\[1\]DN, Šulgi\[1\]RN)
* (Enlil\[1\]DN, lugal\[king\]N)
* (Enlil\[1\]DN, e\[house\]N)
* (Enlil\[1\]DN, du\[build\]V/t)
* (kur\[mountain\]N, gal\[big\]V/i)
* (kur\[mountain\]N, Šulgi\[1\]RN)
* (kur\[mountain\]N, lugal\[king\]N)
* (kur\[mountain\]N, e\[house\]N)
* (kur\[mountain\]N, du\[build\]V/t)
* (gal\[big\]V/i, Šulgi\[1\]RN)
* (gal\[big\]V/i, lugal\[king\]N)
* (gal\[big\]V/i, e\[house\]N)
* (gal\[big\]V/i, du\[build\]V/t)
* (Šulgi\[1\]RN, lugal\[king\]N)
* (Šulgi\[1\]RN, e\[house\]N)
* (Šulgi\[1\]RN, du\[build\]V/t)
* (lugal\[king\]N, e\[house\]N)
* (lugal\[king\]N, du\[build\V/t])
* (e\[house\]N, du\[build\]V/t)
Each of these word pairs is considered a collocate. By reducing the window size one may concentrate on words that occur closer together, and may be part of a more or less fixed combination (such as kur gal "great mountain"as an epithet of Enlil). Larger window sizes will consider broader thematic issues, recognizing words that are used when discussing particular themes or issues.

The `Counter()` function (from the `collections` library) is used to establish the counts of unique words (variable `cx`) as well as the counts of unique collocates (variable `cxy`). The `Counter()` is first initialized as an empty variable and then updated while iterating over the list of lists `lemm_l`.

In [16]:
cx = Counter()
cxy = Counter()
for text in tqdm(lemm_l):
    cx.update(text) # count all unique lemmas
    win = 7    # 7 is window length - needs to be changeable
    windows = ngrams(text, win) # this creates the windows
    for w in windows: 
        z = [tuple(l) for l in map(sorted, combinations(w, 2))]
        cxy.update(z)  # count all collocates (word pairs)

HBox(children=(FloatProgress(value=0.0, max=126172.0), HTML(value='')))




## Removing stop words and rare words
# TODO: edit the text below and make decisions about stop word list
The following cell may be used to remove very frequent and/or very rare lemmas, based on `min_count` and `max_count` thresholds. For now, only terms that occur only once are removed and there is no upper frequency boundary (note that `max_count` equals the total number of tokens in the corpus). For Sumerian, it is not obvious which words should count as stop-words since Sumerian has no independent prepositions, barely uses pronouns (where used, they are highly significant) and has few true function words that have little meaning in a Bag of Words approach. For the Ur III corpus, one could designate units such as **gur**, **giŋ**, and **sila**, or time indications such as **ud** (day) and **itud** (month) as stop words, since they add little indeed to the content (one might argue that **ud** is in many cases a function word meaning "when"). (Add here **ki** (place) **mu** (name), **ŋiri** (foot) and **šuniŋin** (total); perhaps **kišib** (seal) and **dumu** (child)). Very frequent items such as **udu** sheep, **še** (barley), **ninda** (bread) or **lugal** (king) should not be removed, because those are the actual subjects of these texts.

Importantly, we do remove all underscores here. Underscore are place holders that represent unlemmatized words (broken or unknown). They have been included in the process so far for the benefit of the sliding window in the previous section. Our list of collocates, therefore, has many entries such as (lugal\[king\]N, _ ) meaning that there are 7-word windows in which the word for king co-occurs with an unlemmatized word. These collocates are removed in the next cell.

In [17]:
print('%d tokens before' % len(cx))
sx = sum(cx.values())
min_count = 2
max_count = sx
for x in list(cx.keys()):
    if cx[x] < min_count or cx[x] > max_count:
        del cx[x]
del cx['_']  # underscores are place holders and may be removed now.
print('%d tokens after' % len(cx))
print('Most common:', cx.most_common()[:25])

28551 tokens before
13067 tokens after
Most common: [('sila[unit]N', 96554), ('giŋ[unit]N', 58899), ('itud[moon]N', 52132), ('dumu[child]N', 46623), ('gur[unit]N', 45371), ('ki[place]N', 44639), ('ud[sun]N', 43274), ('udu[sheep]N', 35534), ('še[barley]N', 34354), ('kaš[beer]N', 33902), ('šuniŋin[total]N', 32175), ('kišib[seal]N', 28136), ('ninda[bread]N', 27325), ('lugal[king]N', 24668), ('i[oil]N', 22847), ('gud[ox]N', 22401), ('ŋuruš[male]N', 21326), ('dubsar[scribe]N', 20452), ('šu[hand]N', 18033), ('niga[fattened]V/i', 17619), ('ugula[overseer]N', 16575), ('ŋiri[foot]N', 16113), ('sila[lamb]N', 16071), ('naŋa[potash]N', 15682), ('šum[garlic]N', 15479)]


# Clean the Collocates
Remove all bigrams that contain a lemma that has been removed in the previous step.

In [18]:
for x, y in list(cxy.keys()):
    if x not in cx or y not in cx:
        del cxy[(x, y)]

0.342 seconds (0.00000 / iter)


# Lookup Dictionaries
The following step creates two dictionaries. The dictionary `x2i` has the lemmas as keys and a counter as value. The dictionary `i2x` has that same counter as keys, and the lemmas as values. These dictionaries are used further down in the process.

In [19]:
t0 = time()
x2i, i2x = {}, {}
for i, x in enumerate(cx.keys()):
    x2i[x] = i
    i2x[i] = x
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))

0.007 seconds (0.00000 / iter)


# Compute Token and Collocate Totals

In [21]:
sx = sum(cx.values())
sxy = sum(cxy.values())

78.229 seconds (0.00010 / iter)


In the next cell the data is transformed into a (sparse) matrix, with rows and columns representing unique words, and the data in each cell representing the PMI score for the co-occurence for the two words. The variable `cxy` contains a list of all collocates in the format `'Unug[1]SN', 'šag[heart]N'): 1653`; meaning that the collocation of these two terms appears 1653 times. The dictionary `x2i` is used to translate each lemma into an index number and append the numbers to the lists `rows` and `cols`. The third element that the matrix function need is `data`. Instead of simply entering the frequency in `data` we enter the PPMI score. The formula for PMI is  $$log\frac{p(x,y)}{p(x)p(y)}$$; that is: the logarithm of the probability of encountering x *and* y, divided by the probability of x times the probability of y. The probabilities are computed by dividing the frequency of x and y by the total number of tokens and dividing the frequency of (x, y) by the total number of collocates.

One problem with PMI is that it favors low frequency terms. If term A only appears 10 times in our dataset, each collocate with any other term will appear significant, because it occurs in 10% of all the occurences of term A. We will see this effect, for instance, in the appearance of rare names. One solution is using PMI2, where the probability of the collocate is squared: PMI2 = $$log\frac{p(x,y)^2}{p(x)p(y)}$$


In [8]:
# 5. Accumulate data, rows, and cols to build sparse PMI matrix
# Recall from the blog post that the PMI value for a bigram with tokens (x, y) is: 
# PMI(x,y) = log(p(x,y) / p(x) / p(y)) = log(p(x,y) / (p(x) * p(y)))
# PPMI(x,y) = max(log(p(x,y) / (p(x) * p(y))), 0)
# The probabilities are computed on the fly using the sums from above.
t0 = time()
pmi_samples = Counter()
pmi_samples2 = Counter()
data, rows, cols = [], [], []
data2 = []
for (x, y), n in cxy.items():
    rows.append(x2i[x])
    cols.append(x2i[y])
    data.append(max(log((n / sxy) / (cx[x] / sx) / (cx[y] / sx)), 0))
    data2.append(log(((n / sxy)**2) / (cx[x] / sx) / (cx[y] / sx)))
    pmi_samples[(x, y)] = data[-1]
    pmi_samples2[(x, y)] = data2[-1]
PPMI = csc_matrix((data, (rows, cols)))
PMI2 = csc_matrix((data, (rows, cols)))
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))
print('%d non-zero elements' % PPMI.count_nonzero())
print('Sample PPMI values\n', pformat(pmi_samples.most_common(10)))
print('Sample PMI2 values\n', pformat(pmi_samples2.most_common(10)))

2.324 seconds (0.00000 / iter)
654808 non-zero elements
Sample PPMI values
 [(('Ba.ba₆.ba.an.zi.ge[00]PN', 'Lamma.e.silim.mu[00]PN'), 13.001516570612353),
 (('Gir₂.ba.du[00]PN', 'I.di₃.ni.šu[00]PN'), 13.001516570612353),
 (('Ha.šu.dan.ni[00]SN', 'Iš.me.a.li₂[00]PN'), 13.001516570612353),
 (('Bi.za.zi[00]PN', 'Si.ru.um.KAL[00]PN'), 12.914505193622722),
 (('Ki.tu[00]PN', 'Ur₂.ra.am.še.er[00]PN'), 12.819195013818398),
 (('Gababa[0]PN', 'Mukiš[0]SN'), 12.819195013818398),
 (('Lugal.da.nir.gal₂[00]PN', 'Lu₂.gu₂.edin.na[00]PN'), 12.819195013818398),
 (('Haʾurnanigi[0]PN', 'La.ap.hi[00]PN'), 12.819195013818398),
 (('Mašhuntahli[0]PN', 'MerahŠulgi[0]PN'), 12.819195013818398),
 (('Harišhuntah[0]PN', 'Mašhuntahli[0]PN'), 12.819195013818398)]
Sample PMI2 values
 [(('giŋ[unit]N', 'i[oil]N'), -1.507942856160303),
 (('giŋ[unit]N', 'šum[garlic]N'), -1.5376713118606629),
 (('gidim[ghost]N', 'ŋešanaŋ[locus]N'), -1.5954494597936943),
 (('Ba.ba₆.ba.an.zi.ge[00]PN', 'Lamma.e.silim.mu[00]PN'), -1.700809975

In [9]:
# 6. Factorize the PPMI matrix using sparse SVD aka "learn the unigram/word vectors".
# This part replaces the stochastic gradient descent used by Word2vec
# and other related neural network formulations. We pick an arbitrary vector size k=20.
t0 = time()
U, _, _ = svds(PPMI, k=50)
U2, _, _ =svds(PMI2, k=50)
print('%.3lf seconds' % (time() - t0))

1.161 seconds


In [10]:
# 7. Normalize the vectors to enable computing cosine similarity in next cell.
# If confused see: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
t0 = time()
U = preprocessing.normalize(U, norm='l2')
#U /= np.maximum(norms, 1e-7)
#U2 /= np.maximum(norms2, 1e-7)
print('%.3lf seconds' % (time() - t0))

0.005 seconds


In [13]:
# 8. Show some nearest neighbor samples as a sanity-check.
# The format is <unigram> <count>: (<neighbor unigram>, <similarity>), ...
# From this we can see that the relationships make sense.
k = 5
for x in ['esir[shoe]N', 'udu[sheep]N', 'kugsig[gold]N', 'lugal[king]N', 'Inanak[1]DN', 'Utamišaram[0]PN']:
    dd = np.dot(U, U[x2i[x]]) # Cosine similarity for this unigram against all others.
    dd2 = np.dot(U2, U2[x2i[x]])
    s = ''
    s2 = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs.
    for i in np.argpartition(-1 * dd, k + 1)[:k + 1]:
        if i2x[i] == x: continue
        xy = tuple(sorted((x, i2x[i])))
        s += '(%s, %.3lf) ' % (i2x[i], dd[i])
    print('PPMI %s, %d\n %s' % (x, cx[x], s))
    print('-' * 10)
    for i in np.argpartition(-1 * dd2, k + 1)[:k + 1]:
        if i2x[i] == x: continue
        xy = tuple(sorted((x, i2x[i])))
        s2 += '(%s, %.3lf) ' % (i2x[i], dd2[i])
    print('PMI2 %s, %d\n %s' % (x, cx[x], s2))
    print('-' * 10)

PPMI esir[shoe]N, 213
 (duksium[shield]N, 0.913) (eban[pair]N, 0.912) (gudimba[~shoe]N, 0.912) (KA.GUM×ŠE[~leather]N, 0.930) (sadab[laces]N, 0.898) 
----------
PMI2 esir[shoe]N, 213
 (babbar[white]V/i, 0.020) (eban[pair]N, 0.034) (arina[madder]N, 0.022) (dušia[stone]N, 0.021) (gigir[chariot]N, 0.017) 
----------
PPMI udu[sheep]N, 35534
 (maš[goat]N, 0.882) (sila[lamb]N, 0.905) (niga[fattened]V/i, 0.862) (uzud[goat]N, 0.858) (mašgal[goat]N, 0.848) 
----------
PMI2 udu[sheep]N, 35534
 (sukkal[secretary]N, 0.013) (lu[person]N, 0.015) (kaš[beer]N, 0.012) (ninda[bread]N, 0.011) (sila[unit]N, 0.011) (maškim[administrator]N, 0.010) 
----------
PPMI kugsig[gold]N, 1122
 (zagin[lapis]N, 0.905) (gug[carnelian]N, 0.895) (nir[stone]N, 0.899) (sua[stone]N, 0.902) (huš[reddish]V/i, 0.895) 
----------
PMI2 kugsig[gold]N, 1122
 (ellaŋ[bead]N, 0.034) (gug[carnelian]N, 0.049) (nir[stone]N, 0.037) (amašmeʾe[stone]N, 0.034) (huš[reddish]V/i, 0.032) 
----------
PPMI lugal[king]N, 24668
 (kiŋgia[messenger]N

# 6 Ur III Commodities and Actors: Word Embeddings

The basic idea behind word embeddings is that word meaning is determined by the contexts in which the word is found. Or, in the famous quote by the linguist [J.R. Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "You shall know a word by the company it keeps." Each unique word (or lemma) in a corpus is assigned a vector in such a way that words attested in similar contexts receive similar vectors. As a result, vectors that represent words with similar meanings become neighbors: they are relatively close in the vector space.

A classic implementation of word embeddings is [word2vec](https://en.wikipedia.org/wiki/Word2vec), created in 2013 by a Google team under the direction of Tomas Mikolov ([Mikolov e.a. 2013](https://arxiv.org/abs/1301.3781)). They used a large corpus of texts in English, derived from the Web, and trained their model with a neural network. They found that their technique not only assigned similar vectors to similar weords, but also encoded in those vectors semantic components such as "male" or "female." The classic example is 

            king - male + female ≈ queen
            
In other words: if you subtract the vector for "male" from the vector for "king" and add the vector for "female," the nearest neighbor of that computed vector turns out to be the one for "queen".

Word2vec creeated a revolution in Natural Language Processing and found application in many different tasks. Researchers either use pre-defined word vectors or compute such vectors from their own data. Cuneiformists, of course, do not really have a choice. They have to create their own vectors and here three important drawbacks of word2vec come to light. First, the algorithm works best on very large datasets - the initial implementation used a corpus of 1.6 billion words. For Akkadian or Sumerian such numbers are not feasible, the more so since we may want to build different models for different periods and/or different text types. Second, the neural network architecture behind word2vec works, but it is hard to explain why it works or how. In other words, there is something like a black box between the raw data and the resulting vectors which may be OK for industry applications, but is hard to deal with in a scholarly context. Third, training a model in word2vec (or any similar algorithm) takes a considerable amount of computing time. Since there are many parameters (such as the window size that defines the context of a word) and such settings may dramatically change the results, it is necessary to build and compare many models. Building and comparing many models is doable and common in a Computer Science or NLP research setting, but in Assyriology research that is hardly feasible.

More recently researchers have proposed a simpler approach to training word vectors ([Moody 2017](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), and see [Levy and Goldberg 2014](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html)). This approach uses PMI (Pointwise Mutual Information) combined with SVD (Singular Vector Decomposition), two well-understood and relatively straightforward processes. PMI is used to create a score between every pair of unique words in the corpus based on their individual frequency on the one hand, and the frequency of the two words as collocates. This results in huge matrix of *n* rows and *n* column, where *n* represents the number of unique words in the corpus. SVD is then used to reduce this matrix to a set of *m* dimensional vectors (one vector for each word in the corpus), where *m* may be in the range between, say, 20 and 200. 

In [None]:
lemm_l[:5]

In [None]:
files[:5]