# 6 Ur III Commodities and Actors: Word Embeddings

The basic idea behind word embeddings is that word meaning is determined by the contexts in which the word is found. Or, in the famous quote by the linguist [J.R. Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "You shall know a word by the company it keeps." Each unique word (or lemma) in a corpus is assigned a vector in such a way that words attested in similar contexts receive similar vectors. As a result, vectors that represent words with similar meanings become neighbors: they are relatively close in the vector space.

A classic implementation of word embeddings is [word2vec](https://en.wikipedia.org/wiki/Word2vec), created in 2013 by a Google team under the direction of Tomas Mikolov ([Mikolov et al. 2013](https://arxiv.org/abs/1301.3781)). They used a large corpus of texts in English, derived from the Web, and trained their model with a neural network. They found that their technique not only assigned similar vectors to similar words, but also encoded in those vectors semantic components such as "male" or "female." The classic example is:

            king - male + female ≈ queen
            
In other words: if you subtract the vector for "male" from the vector for "king" and add the vector for "female," the nearest neighbor of that computed vector turns out to be the one for "queen".
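
The arithmetic can be tried out on a handful of hand-made vectors. The sketch below is only an illustration: the two-dimensional vectors are invented (one dimension loosely standing for "royalty", the other for "gender") and are not derived from any corpus or from word2vec itself.

```python
import numpy as np

# Invented toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
toy_vectors = {
    "king":   np.array([0.9,  0.8]),
    "queen":  np.array([0.9, -0.8]),
    "male":   np.array([0.1,  0.9]),
    "female": np.array([0.1, -0.9]),
    "farmer": np.array([-0.7, 0.6]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - male + female
target = toy_vectors["king"] - toy_vectors["male"] + toy_vectors["female"]

# Look for the nearest neighbor among the words that were not part of the sum.
candidates = {w: v for w, v in toy_vectors.items()
              if w not in ("king", "male", "female")}
nearest = max(candidates, key=lambda w: cosine(candidates[w], target))
print(nearest)
```

With these invented numbers the nearest neighbor of the computed vector is "queen"; with real embeddings the result depends on the corpus and the training settings.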

Word2vec created a revolution in Natural Language Processing and found application in many different tasks. Researchers either use pre-defined word vectors or compute such vectors from their own data. Cuneiformists, of course, do not really have a choice. They have to create their own vectors and here three important drawbacks of word2vec come to light. First, the algorithm works best on very large datasets - the initial implementation used a corpus of 1.6 billion words. For Akkadian or Sumerian such numbers are not feasible, the more so since we may want to build different models for different periods and/or different text types. Second, the neural network architecture behind word2vec works, but it is hard to explain why it works or how. In other words, there is something like a black box between the raw data and the resulting vectors which may be OK for industry applications, but is hard to deal with in a scholarly context. Third, training a model in word2vec (or any similar algorithm) takes a considerable amount of computing time. Since there are many parameters (such as the window size that defines the context of a word) and such settings may dramatically change the results, it is necessary to build and compare many models. Building and comparing many models is doable and common in a Computer Science or NLP research setting, but in Assyriology research that is rarely feasible.

More recently, researchers have proposed a simpler approach to training word vectors ([Moody 2017](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), and see [Levy and Goldberg 2014](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html)). This approach uses PMI (Pointwise Mutual Information) combined with SVD (Singular Value Decomposition), two well-understood and relatively straightforward processes. PMI is used to create a score for every pair of unique words in the corpus, based on their individual frequencies on the one hand and the frequency of the two words as collocates on the other. A high score means that the words occur together more often than would be expected given the frequency of each word individually. This results in a huge matrix of *n* rows and *n* columns, where *n* represents the number of unique words in the corpus. SVD is then used to reduce this matrix to a set of *m*-dimensional vectors (one vector for each word in the corpus), where *m* may be in the range between, say, 20 and 200.
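
The two steps can be sketched on a toy example. The four short sequences below are invented for illustration, every pair of words within a sequence is treated as a collocate, and a dense `numpy` SVD is used for brevity (an assumption made here; a real corpus needs a sparse matrix). The variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log
import numpy as np

# Invented toy corpus; each inner list is one consecutive sequence.
toy_corpus = [
    ["udu", "mašgal", "dab"],
    ["udu", "mašgal", "kišib"],
    ["še", "gur", "kišib"],
    ["še", "gur", "dab"],
]

word_counts, pair_counts = Counter(), Counter()
for seq in toy_corpus:
    word_counts.update(seq)
    # every pair of words in the same sequence counts as a collocation
    pair_counts.update(tuple(sorted(pair)) for pair in combinations(seq, 2))
n_words, n_pairs = sum(word_counts.values()), sum(pair_counts.values())

vocab = sorted(word_counts)
index = {w: i for i, w in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)))
pmi = {}
for (x, y), n in pair_counts.items():
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    score = log((n / n_pairs) / ((word_counts[x] / n_words) * (word_counts[y] / n_words)))
    pmi[(x, y)] = score
    # keep only positive scores (PPMI) and fill the matrix symmetrically
    matrix[index[x], index[y]] = matrix[index[y], index[x]] = max(score, 0)

# Reduce the n x n matrix to m-dimensional word vectors.
m = 2
left, singular_values, _ = np.linalg.svd(matrix)
toy_word_vectors = left[:, :m]
print(pmi[("mašgal", "udu")], toy_word_vectors.shape)
```

In this toy corpus "udu" and "mašgal" occur together in both of their sequences, so their PMI score (log 6) is higher than that of "dab" and "udu" (log 3), which share only one sequence.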

## 6.1 Data Acquisition

The data format that we need for computing the PMI matrix is a list of lists, where each list represents a consecutive sequence of lemmas. "Consecutive" is crucial here: we do not want the last word of text a to be counted as a collocate of the first word of text b. Similarly, we should take textual breaks into account: after each break, we should start a new sequence of consecutive words. Once we have created this data format we can define a sliding window of *n* words (where *n* is larger than 1 and usually between 2 and 15), so that each word inside a window is considered a collocate of all the others. That will provide us with the list of pairs of words for which PMI scores should be computed.
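
As a minimal sketch of this idea, the hypothetical helper `window_pairs()` below slides a window of *n* words over each sequence separately, so that no pair ever spans two sequences. The function name and the toy data are assumptions made for illustration only.

```python
from itertools import combinations

def window_pairs(sequence, n):
    """Yield every pair of words that share a window of n consecutive words."""
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        yield from combinations(window, 2)

# Two toy sequences: "d" and "e" are adjacent in the flattened data,
# but belong to different sequences and must not become collocates.
toy_sequences = [["a", "b", "c", "d"], ["e", "f", "g"]]
toy_pairs = [pair for seq in toy_sequences for pair in window_pairs(seq, 3)]
print(toy_pairs)
```

Note that a pair which falls inside several overlapping windows, such as `("b", "c")` here, is counted once for each window, and that a sequence shorter than *n* yields no pairs at all in this sketch.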

The data acquisition process discussed in Chapter 2.1 transforms the ORACC JSON files into a DataFrame, where each row represents a single word and each column the various data elements that describe a lemma (such as the language, the text ID, the Citation Form, etc.). That DataFrame contains all the information we need and we may transform that DataFrame into the data format we need (the list of lists) with the `pandas` methods discussed in previous chapters. However, this is a rather inefficient approach. Most of the information in the DataFrame we do not need and manipulating a DataFrame with `groupby()` and `aggregate()` is computationally expensive. Instead, we may parse the JSON files in such a way that we create the data format we need directly, without using a `pandas` DataFrame.



Each individual document may be regarded as a consecutive sequence of lemmas. Within a document, however, there are two types of breaks: physical breaks and logical breaks. A physical break is an actual break in a clay tablet. A logical break is a horizontal ruling, or the transition from the text of the document to the text of the seal impression. We want to prevent the sliding window from extending over such breaks.

In addition, there are several types of words that require a special treatment.
* Unlemmatized words - words that are damaged or unknown should not be included. They are replaced by an underscore ("_"). Such words do not contribute to our analysis, but should also not be removed because we do not want to create artificial neighbors.
* Numbers are of no interest here and are entirely removed.
* Damaged personal names are lemmatized as PN with a citation form in the format Lu₂.x. Those are treated the same way as unlemmatized words (replaced by an underscore).
* Year names are removed. Although year names are important for a variety of reasons, they belong to a different register of Sumerian and their collocates are of no interest here.
* Words that are not in Sumerian are removed.
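
The effect of breaks and placeholders can be sketched with a flat toy list. The helper `split_at_breaks()` and the marker `"<break>"` are hypothetical names used only for this illustration. Dropping sequences of a single lemma is an assumption made here, on the grounds that a single lemma cannot yield collocates.

```python
def split_at_breaks(tokens, break_marker="<break>"):
    """Cut a flat list of lemmas into consecutive sequences at each break."""
    sequences, current = [], []
    for tok in tokens:
        if tok == break_marker:
            if len(current) > 1:      # a single lemma has no collocates
                sequences.append(current)
            current = []
        else:
            current.append(tok)       # "_" placeholders stay in place
    if len(current) > 1:
        sequences.append(current)
    return sequences

toy_tokens = ["udu[sheep]N", "_", "dab[seize]V/t", "<break>",
              "kišib[seal]N", "<break>",
              "Ur.šara₂[00]PN", "dubsar[scribe]N"]
toy_split = split_at_breaks(toy_tokens)
print(toy_split)
```

The placeholder `"_"` keeps `udu[sheep]N` and `dab[seize]V/t` at their original distance from each other, while the break markers ensure that no window runs from one section into the next.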

In [1]:
import zipfile
import json
from tqdm.auto import tqdm
import os
import sys
import pickle
import re
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

In [None]:
directories = ['jsonzip', 'output', 'corpus']
make_dirs(directories)

In [2]:
projects = "epsd2/admin/ur3"
p = [projects]

In [None]:
p = format_project_list(projects)
oracc_download(p)

The `parsejson()` function below works in a way that is similar to the `parsejson()` functions we discussed in Chapter 2. Each .json file to be parsed represents a single text. The list `l` collects lemmatizations in the format CF\[GW\]POS (for instance lugal\[king\]N). When the function has gone through the entire file, the list `l` is appended to the list `lemm_l` in the cell where `parsejson()` is called. This creates a list of lists (`lemm_l`), where each lower-order list represents a single text or, if the text contains breaks, a section of a text.

Breaks in the text (both logical breaks and physical breaks) are marked in the JSON with a `state` node. This node has a restricted vocabulary to indicate breaks, traces, illegible lines, horizontal rulings, etc. The vocabulary that marks a logical or physical break is assembled in the list `breakage` (defined in the cell preceding the function). When such a node is encountered, the list `l` is added to the list of lists `lemm_l` and `l` is emptied. Then the process resumes. This way, a tablet with breaks is chopped up into multiple lists of consecutive lemmas.

The rest of the function takes care of special situations:
* Unlemmatized words - words that are damaged or unknown should not be included. They are replaced by an underscore ("_"). Such words do not contribute to our analysis, but should also not be removed because we do not want to create artificial neighbors.
* Numbers are of no interest here and are entirely removed.
* Damaged personal names are lemmatized as PN with a citation form in the format Lu₂.x. Those are treated the same way as unlemmatized words (replaced by an underscore).
* Year names are removed. Year names are important for dating, for political history and for understanding the ideology of the period. They do not contribute meaningful collocates to the transactions in the documents studied here.
* Words that are not in Sumerian are removed.

The process also skips lemmas that derive from the "sign" and "pronunciation" columns of lexical lists. That is not relevant in the current chapter, but may become relevant if you wish to use this code on a wider set of texts.

The `parsejson()` function returns the list `l`, a list of consecutive lemmas; this list is then appended to the list `lemm_l`. If the function encounters a break in the text, the lemmas that have been collected in the list `l` are appended to `lemm_l` and the list `l` is cleared. Note that, in order to do so, we need to append a *copy* of `l` (rather than a reference to the same list object); otherwise clearing `l` will also clear the list that was appended to `lemm_l`. This is done by appending `l.copy()`.
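
The difference between appending the list itself and appending a copy can be seen in a few lines. The names `buffer` and `collected` are hypothetical and chosen so as not to interfere with the variables used in the notebook.

```python
collected = []
buffer = ["udu[sheep]N", "dab[seize]V/t"]

collected.append(buffer)          # appends a reference to the same list object
collected.append(buffer.copy())   # appends an independent copy

buffer.clear()                    # empties the original list in place
print(collected)
```

After `buffer.clear()` the first element of `collected` is empty, because it is the very same object as `buffer`; the copy still holds both lemmas.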

In [27]:
lemm_l = []
l = []
ids_ = []
breakage = ['illegible', 'traces', 'missing', 'effaced','other', 'blank', 'ruling']

In [28]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        elif JSONobject.get("state", "") in breakage:
            if len(l) > 1:
                lemm_l.append(l.copy()) #append a copy of l
            l.clear()
            continue
        elif JSONobject.get("subtype","") in ['sg', 'pr']: # skip the fields "sign" and "pronunciation"
            continue                                     # in lexical texts
        elif JSONobject.get("subtype", "")[:5] in ["seal ", "envel"]: # seal 1, seal 2, etc. or envelope
            if len(l) > 1:
                lemm_l.append(l.copy())
            l.clear()
            continue
        elif JSONobject.get("ftype", "") == "yn":
            continue # skip year names
        elif "f" in JSONobject: 
            word = JSONobject["f"]
            if word["lang"][:3] != "sux": #only Sumerian and Emesal
                continue
            if word.get("pos", "") == "n":  # omit numbers
                continue
            if "cf" in word:
                #for some reason some words appear without pos. Provisionally treated as Noun
                lemm = f"{word['cf']}[{word['gw']}]{word.get('pos', 'N')}"
                lemm = lemm.replace(' ', '-') # replace spaces by hyphens
                lemm = lemm.replace(',', '')  # remove commas
            else:
                lemm = "_" # if word is unlemmatized enter a place holder
            if "x" in word.get("cf","").lower():  # partly damaged PN; enter placeholder
                lemm = "_"
            l.append(lemm)
    return l

In [29]:
seen_ids = set()   # text IDs already processed (used for the duplicate check)
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files of individual texts (P, Q, and X numbers)
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
    for filename in tqdm(files, desc = project):    # iterate over the file names
        id_no = filename[-13:-5]    # slash plus text ID, for instance /P414332
        if id_no in seen_ids and not "X" in id_no: # Check if P/Q number is already in there
            continue        # a text may appear in multiple projects
        seen_ids.add(id_no)
        id_text = project + id_no # id_text is, for instance, blms/P414332
        ids_.append(id_text)
        l = []
        try:
            text = z.read(filename).decode('utf-8')     # read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            l = parsejson(data_json)
            if len(l) > 1:
                lemm_l.append(l.copy())
            l.clear()
        except Exception:
            print(id_text + ' is not available or not complete')


epsd2/admin/ur3/P143238 is not available or not complete



The above results in the list of lists lemm_l, which holds the individual texts. Each document is represented by one or more lists of lemmas, with the lemmas in the original order. Secondly, the list ids_ holds all the text IDs. These IDs are not further used, but were collected to prevent duplication, which may be an issue if you derive data from more than one project.

The following is directly taken from the blog post "simple word vectors with co-occurrence pmi and svd" by Alex Klibisz.

In [16]:
from __future__ import print_function, division
from collections import Counter
from itertools import combinations
from math import log
from pprint import pformat
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from string import punctuation
from time import time
from nltk import ngrams
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print('Ready')

Ready


Since Klibisz is working with *titles*, he compares all possible pairs of words within a title and does not define a moving window. For longer documents a moving window is more appropriate. Aleksi creates windows with the following code:
```python
wz = self.windowsize - 1
zip(*[text[i:] for i in range(1+wz*2)])
```
This zips multiple shifted versions of the same text: the first starts at word 0, the next at word 1, etc. Since the window is symmetric (counting forward and backward), `wz` is counted twice.
An alternative method is to use NLTK `ngrams`, which creates the windows directly.
```python
from nltk import ngrams
n = 7
windows = ngrams(text, n)
```
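
The zip-based approach can be tried out on a short list. The sketch below assumes a hypothetical `windowsize` of 2 (so `wz` is 1 and each window is 3 words wide) and compares the result with plain slicing; `nltk.ngrams(sample, width)` should yield the same tuples.

```python
sample = ["a", "b", "c", "d", "e"]

windowsize = 2              # hypothetical setting for this illustration
wz = windowsize - 1
width = 1 + wz * 2          # one center word plus wz words on either side

# zip stops at the shortest of the shifted lists, so only full windows are produced
zip_windows = list(zip(*[sample[i:] for i in range(width)]))

# the same windows, produced by slicing
slice_windows = [tuple(sample[i:i + width])
                 for i in range(len(sample) - width + 1)]

print(zip_windows)

# a sequence shorter than the window width yields no windows at all
short_windows = list(zip(*[["a", "b"][i:] for i in range(width)]))
```

Both methods return only complete windows, so a sequence that is shorter than the window width contributes no pairs.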

In [17]:
# 2a. Compute unigram and bigram counts.
# A unigram is a single word (x). A bigram is a pair of words (x,y).
# Bigrams are counted for any two terms occurring in the same window.
# For example, the window "foo bar baz" has unigrams [foo, bar, baz]
# and bigrams [(bar, foo), (bar, baz), (baz, foo)]
# Note: ngrams() yields no windows for a sequence shorter than the window
# length, so such sequences add to the unigram counts only.
t0 = time()
cx = Counter()
cxy = Counter()
for text in lemm_l:
    cx.update(text)

    # Count all pairs of words, even duplicate pairs.
    windows = ngrams(text, 7) # 7 is window length - needs to be changeable
    for w in windows: # this creates the windows
        z = [tuple(l) for l in map(sorted, combinations(w, 2))]
        cxy.update(z)

#     # Alternative: count only 2-grams.
#     for x, y in zip(text[:-1], text[1:]):
#         cxy[(x, y)] += 1

#     # Alternative: count all pairs of words, but don't double count.
#     for x, y in set(map(tuple, map(sorted, combinations(text, 2)))):
#         cxy[(x,y)] += 1

print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / len(lemm_l)))

42.269 seconds (0.00017 / iter)


In [18]:
# 2b. Remove frequent and infrequent unigrams.
# Pick arbitrary occurrence count thresholds to eliminate unigrams occurring
# very frequently or infrequently. This decreases the vocab size substantially.
print('%d tokens before' % len(cx))
t0 = time()
sx = sum(cx.values())
min_count = 2
max_count = sx
for x in list(cx.keys()):
    if cx[x] < min_count or cx[x] > max_count:
        del cx[x]
del cx['_']
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))
print('%d tokens after' % len(cx))
print('Most common:', cx.most_common()[:25])

21154 tokens before
0.008 seconds (0.00000 / iter)
21147 tokens after
Most common: [('sila[unit]N', 190943), ('itud[moon]N', 167604), ('dumu[child]N', 154024), ('ki[place]N', 150448), ('ud[sun]N', 147854), ('udu[sheep]N', 132822), ('giŋ[unit]N', 117280), ('dubsar[scribe]N', 102538), ('gur[unit]N', 89679), ('dab[seize]V/t', 84179), ('A.kal.la[00]PN', 73218), ('mašgal[goat]N', 71888), ('Ab.ba.sa₆.ga[00]PN', 69810), ('še[barley]N', 67822), ('šuniŋin[circular]N', 67432), ('kaš[beer]N', 67218), ('Ezemmah[1]MN', 65814), ('Ur.ge₆.par₄[00]PN', 63268), ('En.dingir.mu[00]PN', 62956), ('2[00]PN', 60922), ('kišib[seal]N', 55541), ('ninda[bread]N', 54162), ('lugal[king]N', 48617), ('i[oil]N', 45496), ('gud[ox]N', 44559)]


In [19]:
# 2c. Remove frequent and infrequent bigrams.
# Any bigram containing a unigram that was removed must now be removed.
t0 = time()
for x, y in list(cxy.keys()):
    if x not in cx or y not in cx:
        del cxy[(x, y)]
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))

0.260 seconds (0.00000 / iter)


In [20]:
# 3. Build unigram <-> index lookup.
t0 = time()
x2i, i2x = {}, {}
for i, x in enumerate(cx.keys()):
    x2i[x] = i
    i2x[i] = x
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cx)))

0.006 seconds (0.00000 / iter)


In [21]:
# 4. Sum unigram and bigram counts for computing probabilities.
# i.e. p(x) = count(x) / sum(all counts).
t0 = time()
sx = sum(cx.values())
sxy = sum(cxy.values())
print('%.3lf seconds (%.5lf / iter)' %
      (time() - t0, (time() - t0) / (len(cx) + len(cxy))))

0.011 seconds (0.00000 / iter)


In [22]:
# 5. Accumulate data, rows, and cols to build sparse PMI matrix
# Recall from the blog post that the PMI value for a bigram with tokens (x, y) is: 
# PMI(x,y) = log(p(x,y) / p(x) / p(y)) = log(p(x,y) / (p(x) * p(y)))
# PPMI(x,y) = max(log(p(x,y) / (p(x) * p(y))), 0)
# The probabilities are computed on the fly using the sums from above.
t0 = time()
pmi_samples = Counter()
data, rows, cols = [], [], []
for (x, y), n in cxy.items():
    rows.append(x2i[x])
    cols.append(x2i[y])
    data.append(max(log((n / sxy) / (cx[x] / sx) / (cx[y] / sx)), 0))
    pmi_samples[(x, y)] = data[-1]
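# Pass the shape explicitly so that the matrix always has one row and column per unigram.
PPMI = csc_matrix((data, (rows, cols)), shape=(len(x2i), len(x2i)))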
print('%.3lf seconds (%.5lf / iter)' % (time() - t0, (time() - t0) / len(cxy)))
print('%d non-zero elements' % PPMI.count_nonzero())
print('Sample PPMI values\n', pformat(pmi_samples.most_common()[:10]))

1.590 seconds (0.00000 / iter)
772137 non-zero elements
Sample PPMI values
 [(('Me.ra.iškur[00]PN', 'U₃.da.bi[00]FN'), 13.824340172249912),
 (('Kunga.ar₃[00]SN', 'Li.pi₂.it.ištar[00]PN'), 13.824340172249912),
 (('Geme₂.sal₄.tug₂.u₁₈[00]PN', 'ba.ba₆.ku₃[00]PN'), 13.824340172249912),
 (('Nin.e₂.nun[00]PN', 'Nin.ir.ra.an.gi₄[00]PN'), 13.824340172249912),
 (('Na.mu.ši.bar[00]PN', 'Nin.e.ba.zi[00]PN'), 13.824340172249912),
 (('E.pu.uq[00]PN', 'Eš₁₈.dar.ka₃.li₂.su[00]PN'), 13.824340172249912),
 (('Geme₂.ša₃.ku₃.ga[00]PN', 'Nin.kal.la.ar[00]PN'), 13.824340172249912),
 (('Di.ta.nu.sar[00]PN', 'Nam.lu₂[00]PN'), 13.824340172249912),
 (('Lu₂.zi.in.i₃.zu[00]PN', 'Nimgir.nam.gi.na[00]PN'), 13.824340172249912),
 (('A.ab.gu.la[00]PN', 'Ur.pa₅.sir₂.ra[00]PN'), 13.824340172249912)]


In [23]:
# 6. Factorize the PPMI matrix using sparse SVD aka "learn the unigram/word vectors".
# This part replaces the stochastic gradient descent used by Word2vec
# and other related neural network formulations. We pick an arbitrary vector size k=20.
t0 = time()
U, _, _ = svds(PPMI, k=20)
print('%.3lf seconds' % (time() - t0))

0.316 seconds


In [24]:
# 7. Normalize the vectors to enable computing cosine similarity in next cell.
# If confused see: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
t0 = time()
norms = np.sqrt(np.sum(np.square(U), axis=1, keepdims=True))
U /= np.maximum(norms, 1e-7)
print('%.3lf seconds' % (time() - t0))

0.002 seconds


In [25]:
# 8. Show some nearest neighbor samples as a sanity-check.
# The format is <unigram> <count>: (<neighbor unigram>, <similarity>), ...
# From this we can see that the relationships make sense.
k = 5
for x in ['esir[shoe]N', 'kugsig[gold]N', 'lugal[king]N', 'Inana[1]DN']:
    dd = np.dot(U, U[x2i[x]]) # Cosine similarity for this unigram against all others.
    s = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs.
    for i in np.argpartition(-1 * dd, k + 1)[:k + 1]:
        if i2x[i] == x: continue
        s += '(%s, %.3lf) ' % (i2x[i], dd[i])
    print('%s, %d\n %s' % (x, cx[x], s))
    print('-' * 10)

esir[shoe]N, 416
 (kila[weight]N, 0.885) (tugdua[felt]N, 0.831) (dušia[stone]N, 0.821) (eban[pair]N, 0.928) (KA.GUM×ŠE[~leather]N, 0.874) 
----------
kugsig[gold]N, 2188
 (zahum[basin]N, 0.987) (zagaltum[object]N, 0.986) (nir[stone]N, 0.986) (zagin[lapis]N, 0.986) (sua[stone]N, 0.985) 
----------
lugal[king]N, 48617
 (lu[person]N, 0.903) (muhaldim[cook]N, 0.900) (Še.eh.la.am[00]PN, 0.873) (ziga[expenditure]N, 0.850) (ragaba[rider]N, 0.849) 
----------
Inana[1]DN, 3202
 (Ninlil[1]DN, 0.941) (Nanaya[1]DN, 0.939) (An.nu.ni.tum[00]PN, 0.895) (Ištaran[1]DN, 0.894) (Ningal[1]DN, 0.887) 
----------



In [30]:
lemm_l[:5]

[['sila[unit]N',
  'giŋ[unit]N',
  'še[barley]N',
  'gur[unit]N',
  'ziznumun[seed]N',
  'gur[unit]N',
  'kibnumun[wheat]N',
  'šenumun[seed]N',
  'murgu[fodder]N',
  'sila[unit]N',
  'še[barley]N',
  'gur[unit]N',
  'a[arm]N',
  'luhuŋa[hireling]N',
  'ašag[field]N',
  'sila[unit]N',
  'še[barley]N',
  'gur[unit]N',
  'a[arm]N',
  'luhuŋa[hireling]N',
  'sahar[soil]N',
  'ki[place]N',
  '_',
  'ugu[skull]N',
  'Da.a.gi₄[00]PN',
  'ŋar[place]V/t',
  'kišib[seal]N',
  'Ur.šara₂[00]PN',
  'bisaŋdubak[archivist]N'],
 ['Ur.ge₆.par₄[00]PN', 'dubsar[scribe]N', 'dumu[child]N', 'A.kal.la[00]PN'],
 ['udu[sheep]N',
  'mašgal[goat]N',
  'ud[sun]N',
  'ki[place]N',
  'Ab.ba.sa₆.ga[00]PN',
  'En.dingir.mu[00]PN',
  'dab[seize]V/t',
  'itud[moon]N',
  'Ezemmah[1]MN'],
 ['sila[lamb]N',
  'Utu[1]DN',
  'mu.DU[delivery]N',
  'Gu₃.de₂.a[00]PN',
  'sila[lamb]N',
  'en.lil₂[00]PN',
  'sila[lamb]N',
  'Ninlil[1]DN',
  'mu.DU[delivery]N',
  'en[priest]N',
  'Inana[1]DN',
  'sila[lamb]N',
  'Ninegalak[1]DN',

In [31]:
files[:5]

['epsd2/admin/ur3/corpusjson/P113959.json',
 'epsd2/admin/ur3/corpusjson/P127505.json',
 'epsd2/admin/ur3/corpusjson/P125538.json',
 'epsd2/admin/ur3/corpusjson/P379198.json',
 'epsd2/admin/ur3/corpusjson/P111964.json']