# Sumerian and Akkadian Word Embeddings
By Niek Veldhuis, Department of Near Eastern Studies, UC Berkeley

This notebook loads Sumerian and Akkadian word embedding models and gives some hints for how to explore these models. Will someone be brave enough to try and align the two models?

The data derive from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)) created by Steve Tinney (UPenn) and directed by Jamie Novotny (Munich), Eleanor Robson (UC London), and Niek Veldhuis (UC Berkeley).

The Sumerian texts used in this model date between 2,500 BCE and 0; the corpus is dominated by a group of about 70,000 administrative texts from around 2,050 BCE dealing with animals, grain, labor, precious objects, etc. The entire corpus used here has about 4.3 million words. The Akkadian corpus ranges from ca. 2,000 BCE to 0 and consists primarily of library texts, royal inscriptions, royal correspondence, and royal administration, for a total of some 1.5 million words. Sumerian was spoken in the deep South of present-day Iraq in the third millennium BCE; afterwards it became a language of scholarship and religion. Akkadian was spoken in Babylonia and Assyria, roughly the same cultural area where Sumerian was at home.

![A Sumerian - Akkadian Bilingual Dictionary](https://cdli.ucla.edu/dl/tn_photo/P342645.jpg)
A Sumerian - Akkadian dictionary, Museum number MS 3178. Click [here](https://cdli.ucla.edu/dl/photo/P342645.jpg) for a larger picture and [here](http://oracc.org/dcclt/signlists/P342645) for a full edition and translation of the text.

## Licensing
The data in the models derive from [ORACC](http://oracc.org) and have been acquired, formatted, and processed by Niek Veldhuis, UC Berkeley, November 2018. You are free to use the models (CC0 or Public Domain), but it is appreciated if you provide a link to [ORACC](http://oracc.org) as the source of the data.

## Prerequisites
The code below requires a relatively recent version of Gensim - one that includes a FastText implementation.

In [1]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim') # this is only relevant in Windows.
warnings.filterwarnings(action='ignore', category=FutureWarning, module='gensim' ) # warning will disappear in a future version of Gensim
from gensim.models.fasttext import FastText as FT_gensim
import pickle

# Load the Models

In [2]:
model_sux = FT_gensim.load("model/model_lemm.model")
model_akk = FT_gensim.load("model/akk_model_lemm.model")

# Vocabulary
Lemmatized tokens have the form **CitationForm[GuideWord]POS** (for instance lugal[king]N). Tokens that are not lemmatized (names, broken words, unknown words) are given in transliteration (for instance ab-ba-sa₆-ga). To find possible vocabulary items you may inspect the `vocab` attribute of `model.wv` (in Gensim 4.0 and later this attribute was replaced by `key_to_index`).

In [3]:
model_sux.wv.vocab

{'1(barig@c)': <gensim.models.keyedvectors.Vocab at 0x4fb36d8>,
 'še[barley]N': <gensim.models.keyedvectors.Vocab at 0x4fb37f0>,
 'ba-lul': <gensim.models.keyedvectors.Vocab at 0x4fb3898>,
 'nagar[carpenter]N': <gensim.models.keyedvectors.Vocab at 0x4fb30b8>,
 'niŋdu[appropriate-thing]N': <gensim.models.keyedvectors.Vocab at 0x4fb3b38>,
 'aŋ[measure]V/t': <gensim.models.keyedvectors.Vocab at 0x897da20>,
 'hur-sag-še₃-mah': <gensim.models.keyedvectors.Vocab at 0x8ae37f0>,
 'saŋ.DUN₃[recorder]N': <gensim.models.keyedvectors.Vocab at 0x8ae3b38>,
 '2(iku@c)': <gensim.models.keyedvectors.Vocab at 0x8b56ef0>,
 'iku[unit]N': <gensim.models.keyedvectors.Vocab at 0x8b84b38>,
 'har-tu-{d}sud₃': <gensim.models.keyedvectors.Vocab at 0x8b84b70>,
 'nukirik[gardener]N': <gensim.models.keyedvectors.Vocab at 0x8b84ba8>,
 'me-zi-pa-e₃': <gensim.models.keyedvectors.Vocab at 0x8b84be0>,
 '1/2(iku@c)': <gensim.models.keyedvectors.Vocab at 0x8b84c18>,
 'ša₃-gu₂-ba': <gensim.models.keyedvectors.Vocab at 0x8b
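
Since the vocabulary mixes lemmatized and unlemmatized tokens, it can help to split a token into its parts before searching. The following is a minimal sketch, not part of the original notebook: the regular expression and the helper names `parse_token` and `lemmas_with_guideword` are assumptions derived from the token examples shown above.

```python
import re

# Assumed pattern for CitationForm[GuideWord]POS, based on the examples above.
LEMMA_RE = re.compile(r"^(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>\S+)$")

def parse_token(token):
    """Return (citation_form, guide_word, pos), or None for unlemmatized tokens."""
    match = LEMMA_RE.match(token)
    if match is None:
        return None
    return match.group("cf"), match.group("gw"), match.group("pos")

def lemmas_with_guideword(vocab, guide_word):
    """List the lemmatized tokens in `vocab` whose guide word equals `guide_word`."""
    found = []
    for token in vocab:
        parts = parse_token(token)
        if parts is not None and parts[1] == guide_word:
            found.append(token)
    return sorted(found)

# Hand-picked tokens from the output above, used here instead of the real model.
sample = ["udu[sheep]N", "gukkal[sheep]N", "ba-lul", "1(barig@c)", "aŋ[measure]V/t"]
print(lemmas_with_guideword(sample, "sheep"))
```

With the models loaded, something like `lemmas_with_guideword(model_sux.wv.vocab, "sheep")` should work in the same way, since iterating over the vocabulary dictionary yields the token strings.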

# Similar words

In [4]:
model_akk.wv.most_similar("šarru[king]N", topn = 2), model_sux.wv.most_similar("lugal[king]N", topn=2)

([('bēlu[lord]N', 0.9549878835678101), ('ana[to]PRP', 0.9517405033111572)],
 [('ki[place]N', 0.9306298494338989), ('kalam[land]N', 0.9235943555831909)])

# Out of Vocabulary Words
This is FastText: word vectors are built from character n-grams, so you may also query out-of-vocabulary (OOV) strings such as partial matches.

In [5]:
model_akk.wv.most_similar("palace", topn = 2), model_sux.wv.most_similar("[tiger]N", topn=2)

([('Ekallu-eššetu[New-Palace-palace-in-Aššur]ON', 0.7083825469017029),
  ('egalturrû[little-palace]N', 0.6874814033508301)],
 [('uršub[tiger]N', 0.6488356590270996),
  ('uršubkuda[wild-animal]N', 0.5293694138526917)])
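
Why does a query like "palace" find tokens that merely contain that string? A rough sketch of the subword idea is given below. It is a simplification: real FastText implementations also hash the n-grams into buckets, and the n-gram lengths 3 to 6 are the usual FastText defaults, assumed here because the notebook does not state the settings used in training. The helper names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    The lengths 3..6 are assumed defaults, not values taken from these models.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

def shared_ngrams(query, token, min_n=3, max_n=6):
    """Return the n-grams that a query string and a vocabulary token share."""
    return set(char_ngrams(query, min_n, max_n)) & set(char_ngrams(token, min_n, max_n))

# The OOV query and one of its neighbors from the output above.
overlap = shared_ngrams("palace", "egalturrû[little-palace]N")
print(sorted(overlap))
```

The more n-grams a query shares with a token, the more their subword-based vectors tend to resemble each other, which is why guide words in English can serve as a crude search key.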

# Kings and Queens
Hmmmm. You may also try the Sumerian version with lugal[king]N, munus[woman]N and nita[man]N (it doesn't fare much better).

In [6]:
model_akk.wv.most_similar(positive=["šarru[king]N", "sinništu[woman]N"], negative = ["zikaru[male]N"])

[('ana[to]PRP', 0.8608435988426208),
 ('bēlu[lord]N', 0.8355156183242798),
 ('ša[of]DET', 0.8317270874977112),
 ('ša[that]REL', 0.82342529296875),
 ('ardu[slave]N', 0.8218632936477661),
 ('ina[in]PRP', 0.8146045207977295),
 ('māru[son]N', 0.80776047706604),
 ('abu[father]N', 0.8045011758804321),
 ('ēkallu[palace]N', 0.8006858229637146),
 ('muhhu[skull]N', 0.7995380163192749)]
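
For readers unfamiliar with the `positive`/`negative` arguments: the query is, roughly, the sum of the normalized positive vectors minus the normalized negative ones, and candidates are ranked by cosine similarity with the input words excluded. The sketch below reproduces that arithmetic in plain Python on toy vectors. The vectors and English labels are invented for illustration and have nothing to do with the Akkadian model.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented 3-d vectors; axes stand for "royal", "male", "female".
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "palace": [1.0, 0.0, 0.0],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"], topn=2))
```

In the toy space "queen" comes out on top because the vectors were built that way. In a real model, very frequent function words such as ana[to]PRP can dominate the neighborhood, as in the output above.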

# Oxen and Sheep
More culturally appropriate, perhaps.

In [7]:
model_sux.wv.most_similar(positive=["gud[oxen]N", "sila[lamb]N"], negative= ["amar[calf]N"])

[('mašgal[goat]N', 0.9077289700508118),
 ('ašgar[kid]N', 0.9058449268341064),
 ('u[ewe]N', 0.8994815349578857),
 ('udu[sheep]N', 0.8990205526351929),
 ('maš[goat]N', 0.8890479803085327),
 ('nua[~animal]N', 0.8813501596450806),
 ('niga[fattened]V/i', 0.875199556350708),
 ('gukkal[sheep]N', 0.8608914613723755),
 ('aslum[sheep]N', 0.8605831861495972),
 ('mašda[gazelle]N', 0.8534128665924072)]

# The Challenge
Is it possible to align the Sumerian and the Akkadian models? The [fasttext_multilingual](https://github.com/Babylonpartners/fastText_multilingual) repo of Babylonpartners (what's in a name?) provides a discussion of the concept and also includes a [notebook](https://github.com/Babylonpartners/fastText_multilingual/blob/master/align_your_own.ipynb) that allows you to align your own vector spaces. The notebook uses a different implementation of Fasttext.
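
One common way to align two vector spaces is orthogonal Procrustes: given vectors for known translation pairs, find the rotation that maps the source vectors as closely as possible onto the target vectors. The sketch below illustrates that general technique with NumPy on random data. It is not taken from the linked notebook, the function names are hypothetical, and it assumes both models have the same vector dimension.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale every row vector to unit length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def learn_rotation(source, target):
    """Return the orthogonal W that minimizes ||source @ W - target||.

    `source` and `target` are (n_pairs, dim) arrays; row i of each holds
    the vectors of the i-th translation pair.
    """
    source = normalize_rows(source)
    target = normalize_rows(target)
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic check: rotate random vectors by a known R and try to recover it.
rng = np.random.default_rng(0)
source_vectors = rng.normal(size=(50, 5))
true_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
target_vectors = source_vectors @ true_rotation

recovered = learn_rotation(source_vectors, target_vectors)
print(np.allclose(recovered, true_rotation))
```

With real data, the rows would be the Sumerian and Akkadian vectors of known translation pairs, and `vector @ recovered` would project a Sumerian vector into the Akkadian space for a nearest-neighbor search there.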

For the alignment you need an initial bilingual dictionary. This bilingual wordlist (to avoid confusion with the Python term `dictionary`) is a list of tuples in the format
```python
[("Sumerian word", "Akkadian word"), 
 ...]
```
The wordlist has almost 800 entries and was cobbled together from the electronic Pennsylvania Sumerian Dictionary ([ePSD2](http://oracc.org/epsd2)).

In [8]:
with open("pickles/bilingual_dict.p", "rb") as f: 
    sux_akk = pickle.load(f)
sux_akk

[('a[arm]N', 'ahu[arm]N'),
 ('a[water]N', 'mû[water]N'),
 ('aʾur[limbs]N', 'mešrêtu[limbs]N'),
 ('ab[cow]N', 'arhu[cow]N'),
 ('ab[window]N', 'aptu[window]N'),
 ('aba[who?]QP', 'mannu[who?]QP'),
 ('abara[sweat]N', 'zūtu[sweat]N'),
 ('abba[father]N', 'abu[father]N'),
 ('ablal[nest]N', 'qinnu[nest]N'),
 ('abula[gate]N', 'abullu[gate]N'),
 ('ad[voice]N', 'rigmu[voice]N'),
 ('adbar[basalt]N', 'atbaru[basalt]N'),
 ('adda[father]N', 'abu[father]N'),
 ('addir[hire]N', 'igru[hire]N'),
 ('adhal[copper]N', 'erû[copper]N'),
 ('aga[tiara]N', 'agû[tiara]N'),
 ('agaʾus[soldier]N', 'rēdû[soldier]N'),
 ('agrig[steward]N', 'abarakku[steward]N'),
 ('aŋ[measure]V/t', 'madādu[measure]V'),
 ('aŋeštinak[vinegar]N', 'ṭābātu[vinegar]N'),
 ('aŋi[wave]N', 'agû[wave]N'),
 ('aŋiba[night]N', 'mūšu[night]N'),
 ('ahiaš[quickly]AV', 'zamar[quickly]AV'),
 ('ak[do]V/t', 'epēšu[do]V'),
 ('aka[fleece]N', 'itqu[fleece]N'),
 ('aku[cripple]N', 'akû[cripple]N'),
 ('al[hoe]N', 'allu[hoe]N'),
 ('alim[bison]N', 'ditānu[bison]N')
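
Before using the wordlist as anchor points, it may be worth checking which pairs are actually attested in both models. FastText returns a vector for any string, so an entry that never occurs in the training corpus would silently receive a vector made up of subword n-grams only. A minimal sketch, in which the helper name `usable_pairs` and the toy vocabularies are hypothetical:

```python
def usable_pairs(pairs, source_vocab, target_vocab):
    """Keep only the pairs whose two members both occur in their vocabulary."""
    return [(s, t) for s, t in pairs if s in source_vocab and t in target_vocab]

# Toy stand-ins for the two vocabularies and the bilingual wordlist.
toy_sux_vocab = {"a[water]N", "ab[cow]N", "al[hoe]N"}
toy_akk_vocab = {"mû[water]N", "allu[hoe]N"}
toy_pairs = [
    ("a[water]N", "mû[water]N"),
    ("ab[cow]N", "arhu[cow]N"),   # Akkadian side missing from the toy vocabulary
    ("al[hoe]N", "allu[hoe]N"),
]

kept = usable_pairs(toy_pairs, toy_sux_vocab, toy_akk_vocab)
print(len(kept), "of", len(toy_pairs), "pairs usable")
```

With the models loaded, the vocabularies would presumably be `model_sux.wv.vocab` and `model_akk.wv.vocab`, and the pairs would be `sux_akk`.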