# Sumerian and Akkadian Word Embeddings
By Niek Veldhuis, Department of Near Eastern Studies, UC Berkeley

This notebook loads Sumerian and Akkadian word embedding models and gives some hints on how to explore them. Will someone be brave enough to try to align the two models?

The data derive from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)) created by Steve Tinney (UPenn) and directed by Jamie Novotny (Munich), Eleanor Robson (UC London), Steve Tinney (UPenn), and Niek Veldhuis (UC Berkeley).

The Sumerian texts used in this model date between 2,500 BCE and 0; the corpus is dominated by a group of about 70,000 administrative texts from around 2,050 BCE dealing with animals, grain, labor, precious objects, etc. The entire corpus used here has about 4.3 million words. The Akkadian corpus ranges from ca. 2,000 BCE to 0 and consists primarily of library texts, royal inscriptions, royal correspondence, and royal administration, for a total of some 1.5 million words. Sumerian was spoken in the deep South of present-day Iraq in the third millennium BCE; afterwards it became a language of scholarship and religion. Akkadian was spoken in Babylonia and Assyria, roughly the same cultural area where Sumerian was at home.

![A Sumerian - Akkadian Bilingual Dictionary](https://cdli.ucla.edu/dl/tn_photo/P342645.jpg)
A Sumerian - Akkadian dictionary, Museum number MS 3178. Click [here](https://cdli.ucla.edu/dl/photo/P342645.jpg) for a larger picture and [here](http://oracc.org/dcclt/signlists/P342645) for a full edition and translation of the text.

## Licensing
The data in the models derive from [ORACC](http://oracc.org) and have been acquired, formatted, and processed by Niek Veldhuis, UC Berkeley, November 2018. You are free to use the models (CC0 or Public Domain), but it is appreciated if you provide a link to [ORACC](http://oracc.org) as the source of the data.

## Prerequisites
The code below requires a relatively recent version of Gensim - one that includes a FastText implementation.

In [None]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # only relevant on Windows
warnings.filterwarnings(action='ignore', category=FutureWarning, module='gensim')  # warning will disappear in a future version of Gensim
from gensim.models.fasttext import FastText as FT_gensim
import pickle

# Load the Models

In [None]:
model_sux = FT_gensim.load("model/model_lemm.model")
model_akk = FT_gensim.load("model/akk_model_lemm.model")

# Vocabulary
Lemmatized tokens have the form **CitationForm[GuideWord]POS** (for instance lugal[king]N). Tokens that are not lemmatized (names, broken words, unknown words) are given in transliteration (for instance ab-ba-sa₆-ga). To find possible vocabulary items you may inspect the `vocab` attribute of `model.wv` (Gensim 4.0 and later replace `vocab` with `key_to_index`).

In [None]:
model_sux.wv.vocab
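
When browsing the vocabulary it can help to separate lemmatized tokens from plain transliterations. The following is a minimal sketch, assuming the **CitationForm[GuideWord]POS** format described above holds for every lemmatized token; `parse_lemma` and `LEMMA_RE` are illustrative names, not part of Gensim or ORACC.

```python
import re

# Pattern for the CitationForm[GuideWord]POS format (an assumption based on
# the description in this notebook, not an official ORACC grammar).
LEMMA_RE = re.compile(r"^(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>\S+)$")


def parse_lemma(token):
    """Return (citation_form, guide_word, pos), or None if not lemmatized."""
    match = LEMMA_RE.match(token)
    if match is None:
        return None
    return match.group("cf"), match.group("gw"), match.group("pos")


# One lemmatized token and one transliterated token from the text above.
for token in ["lugal[king]N", "ab-ba-sa₆-ga"]:
    print(token, "->", parse_lemma(token))
```

Applied to the keys of the vocabulary, a helper like this would let you list, for example, all nouns or all tokens sharing a guide word.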

# Similar words

In [None]:
model_akk.wv.most_similar("šarru[king]N", topn = 2), model_sux.wv.most_similar("lugal[king]N", topn=2)

# Out of Vocabulary Words
This is FastText: word vectors are built from character n-grams, so you may also query out-of-vocabulary (OOV) strings, such as partial matches.

In [None]:
model_akk.wv.most_similar("palace", topn = 2), model_sux.wv.most_similar("[tiger]N", topn=2)
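
Partial matches work because an OOV query shares character n-grams with words that are in the vocabulary. The sketch below lists those shared n-grams for a partial query. It assumes the FastText default n-gram range of 3 to 6 characters, which may not be the setting used for these models, and it ignores the hashing of n-grams into buckets that a real implementation performs. `char_ngrams` and `shared_ngrams` are illustrative helpers.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, FastText style.

    The word is first wrapped in '<' and '>' boundary markers. The default
    range 3..6 is the FastText default (an assumption for these models).
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_ngrams(query, vocab_item, min_n=3, max_n=6):
    """Sorted n-grams that a query string has in common with a vocabulary item."""
    query_grams = set(char_ngrams(query, min_n, max_n))
    item_grams = set(char_ngrams(vocab_item, min_n, max_n))
    return sorted(query_grams & item_grams)


# The partial query "[king]N" overlaps with the full lemma "lugal[king]N".
print(shared_ngrams("[king]N", "lugal[king]N"))
```

The more n-grams a query shares with a vocabulary item, the closer their vectors tend to be, which is why guide words and fragments of lemmas can return sensible neighbors.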

# Kings and Queens
Hmmmm. You may also try the Sumerian version with lugal[king]N, munus[woman]N and nita[man]N (it doesn't fare much better).

In [None]:
model_akk.wv.most_similar(positive=["šarru[king]N", "sinništu[woman]N"], negative = ["zikaru[male]N"])

# Oxen and Sheep
More culturally appropriate, perhaps.

In [None]:
model_sux.wv.most_similar(positive=["gud[oxen]N", "sila[lamb]N"], negative= ["amar[calf]N"])
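
Both analogy queries above rest on the same arithmetic: unit-length vectors of the `positive` words are added, those of the `negative` words are subtracted, and the remaining words are ranked by cosine similarity to the result. The following is a minimal pure-Python sketch of that idea, not Gensim's actual code. The two-dimensional toy vectors are invented for illustration and have nothing to do with the real models.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    # Query words themselves are excluded from the results.
    exclude = set(positive) | set(negative)
    scores = {
        word: sum(a * b for a, b in zip(target, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


# Invented toy space: first axis "royalty", second axis "gender".
toy = {
    "king": [1.0, -1.0],
    "queen": [1.0, 1.0],
    "man": [0.0, -1.0],
    "woman": [0.0, 1.0],
    "palace": [1.0, 0.0],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

In a toy space built to have clean axes the analogy succeeds by construction; in the real models the result depends on whether the corpus encodes such regular offsets at all.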

# The Challenge
Is it possible to align the Sumerian and the Akkadian models? The [fasttext_multilingual](https://github.com/Babylonpartners/fastText_multilingual) repo of Babylonpartners (what's in a name?) provides a discussion of the concept and also includes a [notebook](https://github.com/Babylonpartners/fastText_multilingual/blob/master/align_your_own.ipynb) that allows you to align your own vector spaces. The notebook uses a different implementation of Fasttext.
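
One common way to align two vector spaces is to learn an orthogonal matrix from pairs of vectors that should correspond (the orthogonal Procrustes problem, solved with a singular value decomposition). The sketch below illustrates that general technique with NumPy on synthetic data; it is not a copy of the code in the repo mentioned above, and `learn_rotation` is an illustrative name. It assumes both spaces have the same dimensionality.

```python
import numpy as np


def learn_rotation(source, target):
    """Orthogonal W minimizing the Frobenius norm of (source @ W - target).

    `source` and `target` are (n_pairs, dim) arrays whose rows are the
    vectors of corresponding words. Rows are length-normalized first.
    """
    source = source / np.linalg.norm(source, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic check: rotate random vectors with a known orthogonal matrix
# and see whether the rotation is recovered.
rng = np.random.default_rng(0)
target_vecs = rng.normal(size=(50, 5))
true_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
source_vecs = target_vecs @ true_rotation.T

learned = learn_rotation(source_vecs, target_vecs)
print(np.allclose(learned, true_rotation))
```

With real embeddings the two spaces are not exact rotations of each other, so the learned matrix is only a best fit, and its quality depends heavily on the word pairs used as anchors.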

For the alignment you need an initial bilingual dictionary. This bilingual wordlist (to avoid confusion with the Python term `dictionary`) is a list of tuples in the format
```python
[("Sumerian word", "Akkadian word"), 
 ...]
```
The wordlist has almost 800 entries and was cobbled together from the electronic Pennsylvania Sumerian Dictionary ([ePSD2](http://oracc.org/epsd2)).

In [None]:
with open("pickles/bilingual_dict.p", "rb") as f: 
    sux_akk = pickle.load(f)
sux_akk
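
Before using the wordlist as anchors it may be worth keeping only the pairs where both words occur in the respective model vocabularies: FastText will return a vector for any string, but a vector assembled only from n-grams is probably a less reliable anchor. The following is a minimal sketch with small stand-in data. `filter_pairs` is an illustrative helper, the toy sets replace the real vocabularies, and it is an assumption that the entries in the wordlist use the same token format as the models.

```python
def filter_pairs(pairs, source_vocab, target_vocab):
    """Keep only the pairs whose words occur in both vocabularies."""
    return [
        (source_word, target_word)
        for source_word, target_word in pairs
        if source_word in source_vocab and target_word in target_vocab
    ]


# Stand-in data; with the real models one would pass the model vocabularies
# (for instance model_sux.wv.vocab and model_akk.wv.vocab) and sux_akk.
toy_pairs = [
    ("lugal[king]N", "šarru[king]N"),
    ("munus[woman]N", "sinništu[woman]N"),
    ("hypothetical[missing]N", "šarru[king]N"),
]
toy_sux_vocab = {"lugal[king]N", "munus[woman]N", "nita[man]N"}
toy_akk_vocab = {"šarru[king]N", "sinništu[woman]N", "zikaru[male]N"}

usable = filter_pairs(toy_pairs, toy_sux_vocab, toy_akk_vocab)
print(len(usable), "of", len(toy_pairs), "pairs usable")
```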