# Akkadian Embeddings: Demonstration
This demo shows some potential avenues of research using a word embedding model. The notebook uses a vanilla word2vec model, but other models (such as Aleksi's PMI-based model) can easily be substituted. The final section demonstrates a way to visualize animal words as more "sheep-like" or more "horse-like". It is inspired by a blog post by [Ben Schmidt](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html), which characterizes food words from a corpus of recipes as more "vegetable-like" or more "meat-like". That post also explains, in a non-technical way, the basics of what a word embedding model is (quite independent of the technique used to create such a model) and is highly recommended reading.

The challenge will be to build upon these examples and to find words or groups of words that provide interesting contrasts and relationships with other words.

This notebook builds upon a notebook written by me and Laura Nelson (now at Northwestern University) for a class in Computational Text Analysis at UC Berkeley.

# Corpus

The corpus used for this notebook consists of the following ORACC projects, using only the Akkadian words.

> 'adsd/adart1',
 'adsd/adart2',
 'adsd/adart3',
 'adsd/adart6',
 'aemw/amarna',
 'aemw/idrimi',
 'akklove',
 'atae/assur',
 'atae/burmarina',
 'atae/durkatlimmu',
 'atae/guzana',
 'atae/huzirina',
 'atae/imgurenlil',
 'atae/kalhu',
 'atae/mallanate',
 'atae/marqasu',
 'atae/nineveh',
 'atae/samal',
 'atae/tilbarsip',
 'bbto',
 'blms',
 'cams/anzu',
 'cams/barutu',
 'cams/etana',
 'cams/gkab',
 'cams/ludlul',
 'cams/selbi',
 'caspo',
 'caspo/akkpm',
 'ccpo',
 'cmawro/cmawr1',
 'cmawro/cmawr2',
 'cmawro/cmawr3',
 'cmawro/maqlu',
 'dcclt',
 'dcclt/nineveh',
 'dcclt/signlists',
 'dccmt',
 'glass',
 'hbtin',
 'riao',
 'rinap/rinap1',
 'rinap/rinap2',
 'rinap/rinap3',
 'rinap/rinap4',
 'rinap/rinap5',
 'saao/saa01',
 'saao/saa02',
 'saao/saa03',
 'saao/saa04',
 'saao/saa05',
 'saao/saa06',
 'saao/saa07',
 'saao/saa08',
 'saao/saa09',
 'saao/saa10',
 'saao/saa11',
 'saao/saa12',
 'saao/saa13',
 'saao/saa14',
 'saao/saa15',
 'saao/saa16',
 'saao/saa17',
 'saao/saa18',
 'saao/saa19',
 'saao/saa20',
 'saao/saa21',
 'saao/saas2',
 'suhu',
 'tcma/ali1',
 'tcma/assur',
 'tcma/barri',
 'tcma/bazmusian',
 'tcma/billa',
 'tcma/brak',
 'tcma/chuera',
 'tcma/emar',
 'tcma/fekheriye',
 'tcma/giricano',
 'tcma/hana',
 'tcma/haradum',
 'tcma/hatti',
 'tcma/kartn',
 'tcma/kulishinas',
 'tcma/laws',
 'tcma/miscellaneous',
 'tcma/nineveh',
 'tcma/nippur',
 'tcma/nuzi',
 'tcma/qitar',
 'tcma/rimah',
 'tcma/suri',
 'tcma/taban',
 'tcma/tsa1',
 'tcma/tsh1'

It may well be that a smaller (more focused) corpus yields better results, in particular when using PMI-based models, which can deal with smaller corpora.
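
As a rough illustration of what a PMI-based alternative computes, the sketch below builds a positive PMI (PPMI) matrix from windowed co-occurrence counts. It is a minimal sketch on invented toy sentences, not Aleksi's model; the function name `ppmi_matrix`, the window size, and the toy data are assumptions made for the example.

```python
import numpy as np

def ppmi_matrix(sentences, window=2):
    """Return (vocab, PPMI matrix) from symmetric co-occurrence counts."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[tok], index[sent[j]]] += 1
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    # Count expected if the two words were independent.
    expected = row * col / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0      # pairs that never co-occur get 0
    return vocab, np.maximum(pmi, 0.0)  # keep only positive associations

# Toy sentences (invented for illustration, not taken from the corpus).
toy_sentences = [
    ['immeru[sheep]N', 'puhādu[lamb]N'],
    ['sisû[horse]N', 'pēthallu[riding-horse]N'],
    ['immeru[sheep]N', 'puhādu[lamb]N'],
]
vocab, ppmi = ppmi_matrix(toy_sentences)
sheep, lamb, horse = (vocab.index(w) for w in
                      ('immeru[sheep]N', 'puhādu[lamb]N', 'sisû[horse]N'))
print(ppmi[sheep, lamb], ppmi[sheep, horse])
```

Each row of the PPMI matrix can serve as a (sparse) word vector, so the cosine-based comparisons used later in this notebook apply to it as well.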

In [None]:
# Data wrangling
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import gensim  # library needed for word2vec
# Distances and visualization
from scipy.spatial.distance import pdist, squareform, cosine
from sklearn.metrics import pairwise
from sklearn.manifold import MDS, TSNE

In [None]:
%pylab inline
matplotlib.style.use('ggplot')

# Load the data

In [None]:
data = pd.read_csv('output/parsed.csv',index_col=None, header=0)

# How many texts are included?

In [None]:
data.shape
#data.iloc[0,1]

In [None]:
#tokenize the data by splitting on white space. There is no punctuation in this text.
data['tokens'] = data['lemma'].str.split()
data['tokens'][0][:25]

# Data Cleaning
Unlemmatized (broken or unknown) words are represented as, for instance, `x-ši-ka[NA]NA`. Such tokens are essentially placeholders. One may try two different approaches:
- represent all such placeholders by NA
- eliminate all placeholders
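
The two options can be compared on a single toy line. This is a minimal sketch: the lemma string and the helper `is_placeholder` are invented for illustration and only follow the token format described above.

```python
# Toy lemma string (hypothetical, modeled on the format described above).
line = "ana[to]PRP x-ši-ka[NA]NA ēkallu[palace]N x[NA]NA"
tokens = line.split()

def is_placeholder(token):
    """True for unlemmatized tokens, which end in 'NA]NA'."""
    return token.endswith('NA]NA')

# Option 1: collapse every placeholder into the single token 'NA'.
with_na = ['NA' if is_placeholder(t) else t for t in tokens]

# Option 2: drop placeholders altogether.
without_na = [t for t in tokens if not is_placeholder(t)]

print(with_na)
print(without_na)
```

Option 1 keeps every word at its original position, so a broken passage still separates the words around it. Option 2 moves the words on either side of a break next to each other, so they may end up in the same context window.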

In [None]:
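data_NA = data.copy()
# Replace every placeholder with the single token 'NA'.
# Assigning the whole column avoids chained indexing (data_NA['tokens'][i] = ...),
# which may silently fail to update the DataFrame.
data_NA['tokens'] = data_NA['tokens'].apply(
    lambda tokens: ['NA' if token.endswith('NA]NA') else token for token in tokens])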

In [None]:
# Remove all placeholders.
data['tokens'] = data['tokens'].apply(
    lambda tokens: [token for token in tokens if not token.endswith('NA]NA')])

In [None]:
data['tokens'][0][:25]

In [None]:
data_NA['tokens'][0][:25]

In [None]:
# Fit a word2vec model on the tokenized data (skip-gram, 100 dimensions, window of 5).
# Per the Gensim docs, a fully deterministic, reproducible run requires limiting the
# model to a single worker thread (workers=1), to eliminate ordering jitter from OS
# thread scheduling. In Python 3, reproducibility between interpreter launches also
# requires setting the PYTHONHASHSEED environment variable.
# Note: in Gensim 4.0 and later, `size` and `iter` are named `vector_size` and `epochs`.

model = gensim.models.Word2Vec(data_NA['tokens'], size=100, window=5,
                               min_count=1, sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)

In [None]:
#view the 100 element vector for the word 'ēkallu[palace]N'
#each token (not document) has a 100 element vector
model.wv['ēkallu[palace]N']

In [None]:
# Find the cosine similarity (not the distance) between two given word vectors
model.wv.similarity('ēkallu[palace]N','bītu[house]N')
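
For reference, cosine similarity can be computed by hand. The sketch below uses two invented toy vectors rather than vectors from the model; the helper name `cosine_similarity` is an assumption of the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (invented for illustration).
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 1.0])

sim = cosine_similarity(u, v)
dist = 1.0 - sim  # cosine distance is one minus cosine similarity

print(sim, dist)
print(cosine_similarity(u, u))  # a vector compared with itself
```

The `cosine` function imported from `scipy.spatial.distance` at the top of this notebook returns the distance (one minus the similarity), whereas `model.wv.similarity` returns the similarity itself.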

In [None]:
#find the 10 most similar vectors to the given word vector
model.wv.most_similar('ēkallu[palace]N')

The paradigmatic example of what word2vec can do is the so-called analogy task: 
> man : woman = king : ?

As it turns out, our model is not very good with that question (perhaps because kings appear so much more frequently than queens), but it does OK with some animal analogies. 

In [None]:
# For analogies, use both positive and negative vectors. The target word, in this case, is ox:
# if sheep - lamb = ox - calf, then ox = sheep + calf - lamb
model.wv.most_similar(positive=['immeru[sheep]N', 'būru[(bull)-calf]N'], negative=['puhādu[lamb]N'])
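
The arithmetic behind the analogy can be sketched with a handful of hand-made vectors. Everything here is an assumption for illustration: the three dimensions (roughly "ovine", "bovine", "adult"), the toy vectors, and the helper `analogy`. Gensim's `most_similar` additionally unit-normalizes the input vectors and averages them, which this simplified version skips.

```python
import numpy as np

# Hand-made toy vectors (not from the model): [ovine, bovine, adult].
toy = {
    'sheep': np.array([1.0, 0.0, 1.0]),
    'lamb':  np.array([1.0, 0.0, 0.0]),
    'ox':    np.array([0.0, 1.0, 1.0]),
    'calf':  np.array([0.0, 1.0, 0.0]),
    'horse': np.array([0.5, 0.5, 1.0]),
}

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Like most_similar, leave the input words out of the results.
    candidates = [w for w in vectors if w not in positive + negative]
    scored = [(w, cos(target, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

result = analogy(positive=['sheep', 'calf'], negative=['lamb'], vectors=toy)
print(result)
```

In this toy space the target vector coincides exactly with "ox". In a trained model the match is only approximate, which is why the ranked list of neighbors is inspected instead of a single answer.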

# Visualize the 50 most similar vectors to two words meaning 'bad'

In [None]:
model.wv.most_similar(['lemnu[bad]AJ', 'masku[bad]AJ'])

In [None]:
# Graph for the fifty most similar tokens
bad_tokens = [token for token,weight in model.wv.most_similar(['lemnu[bad]AJ', 'masku[bad]AJ'], topn=50)]
vectors = [model.wv[word] for word in bad_tokens]
dist = pdist(vectors, metric='cosine')
dist_matrix = squareform(dist)
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=1, color='b')
for i in range(len(vectors)):
    ax.annotate(bad_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [None]:
#try to find different senses of the word bad by removing the vector for 'an evil demon'
model.wv.most_similar(positive=['masku[bad]AJ','lemnu[bad]AJ'], negative=['utukku[(an-evil-demon)]N'])

In [None]:
#remove more vectors to get at different senses of the word 'bad'
model.wv.most_similar(positive=['masku[bad]AJ','lemnu[bad]AJ'], negative=['utukku[(an-evil-demon)]N','dipalû[distortion-of-justice]N'])

# Define a list of animals and animal words.
List animal words and words associated with these animals. One could also derive a list of animal words from [Ura 14](http://oracc.org/dcclt/Q000089) and [15](http://oracc.org/dcclt/Q000090).

In [None]:
animals = ['sisû[horse]N', 'immeru[sheep]N', 'imēru[donkey]N', 'alpu[ox]N', 
           'pīru[elephant]N', 'yābilu[ram]N', 'udru[Bactrian-camel]N', 'damdāmu[(a-kind-of-mule)]N'
           ,'atānu[she-ass]N', 'būru[(bull)-calf]N', 'tuānu[(a-breed-of-horse)]N', 'agālu[donkey]N'
          , 'šullāmu[(a-type-of-horse)]N', 'sugullu[herd]N', 'anāqāte[she-camels]N',
          'gurrutu[ewe]N', 'irginu[(a-breed-or-colour-of-horse)]N', 
           'huzīru[pig]N', 'pēthallu[riding-horse]N', 'puhādu[lamb]N']
animal_words = model.wv.most_similar(animals, topn=10)
animal_words = [word for word, similarity in animal_words]
animal_words

# Horses and Sheep
The animal vocabulary may be divided into "horse vocabulary" (horses were used for war and often received from foreign countries) and "sheep vocabulary". Sheep are domestic animals kept for meat and wool, and are (relatively) close to other such animals (ox, calf) and to words that have to do with wool production. All animal words collected above are plotted on a graph: the X-axis represents the cosine similarity to "horse", the Y-axis the cosine similarity to "sheep".

In [None]:
animals_assoc = animals + animal_words
animals_assoc = list(set(animals_assoc)) # remove duplicates
x = [model.wv.similarity('sisû[horse]N', word) for word in animals_assoc]
y = [model.wv.similarity('immeru[sheep]N', word) for word in animals_assoc]  

In [None]:
from bokeh.io import show
from bokeh.models import (BoxSelectTool, Circle, HoverTool,
                          Plot, BoxZoomTool, ResetTool, SaveTool)
from bokeh.plotting import figure, output_notebook
from bokeh.models import TextInput, Button, ColumnDataSource, LabelSet
output_notebook()

In [None]:
# Use a new name so that the DataFrame `data` loaded above is not overwritten.
plot_data = {'x_values' : x, 'y_values' : y, 'labels': animals_assoc}
source = ColumnDataSource(data=plot_data)
# Written against an older Bokeh: in Bokeh 3.0 and later, `plot_width`/`plot_height`
# are named `width`/`height`, and `render_mode` is no longer accepted.
p = figure(
        plot_width=800, plot_height=1000,
        tools="tap,pan,wheel_zoom,box_zoom,reset,save")
p.circle(x='x_values', y='y_values', source=source)
p.add_tools(HoverTool(
        tooltips=[
            ("", "@labels"),
            ("sheep", "@y_values"),
            ("horse", "@x_values")]
        ))
labels = LabelSet(
            x='x_values',
            y='y_values',
            text='labels',
            level='glyph',
            x_offset=5,
            y_offset=5,
            source=source,
            render_mode='canvas'
            )
p.add_layout(labels)
p.line([0, 1], [0, 1], color = "red")
show(p)

The Bokeh plot above has various tools that may be selected with the symbols to the right of the graph: Pan, Zoom, Reset, Save. It also uses tooltips to show the cosine similarity of a word with sheep and horse, respectively. A Bokeh graph can also link to websites (an ORACC edition or a glossary entry, for instance).

The words immeru\[sheep\]N and sisû\[horse\]N are to be found at (0.3167, 1) and (1, 0.3167), respectively, because immeru\[sheep\]N and sisû\[horse\]N have a cosine similarity of 0.3167 (in the current model). A word compared with itself has a cosine similarity of 1 (the maximum value).

The red diagonal line separates the words that are more sheep-like (above the line) from those that are more horse-like (below the line).
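
The same split can be expressed without a plot by comparing the two coordinates of each word. In this minimal sketch only the coordinates of immeru\[sheep\]N and sisû\[horse\]N come from the text above; `word_a` and `word_b` and their scores are invented placeholders, and the helper `side_of_diagonal` is an assumption of the example.

```python
# (similarity to horse, similarity to sheep), i.e. (x, y) as in the plot.
scores = {
    'immeru[sheep]N': (0.3167, 1.0),
    'sisû[horse]N': (1.0, 0.3167),
    'word_a': (0.62, 0.35),  # hypothetical value, not model output
    'word_b': (0.28, 0.71),  # hypothetical value, not model output
}

def side_of_diagonal(x, y):
    """Classify a point by its position relative to the line y = x."""
    if y > x:
        return 'sheep-like'
    if y < x:
        return 'horse-like'
    return 'on the diagonal'

sides = {word: side_of_diagonal(x, y) for word, (x, y) in scores.items()}
for word, side in sides.items():
    print(word, side)
```

With the real model, `scores` could be built from the `x`, `y`, and `animals_assoc` lists computed above.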

Bokeh plots can also be saved as independent HTML files that can be displayed on a website, preserving the interactive tools. Such files are useful for presenting research results. I highly recommend using Bokeh or a similar visualization library that can create interactive plots.
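
Saving a plot as a standalone HTML file might look like the sketch below. It draws a small toy figure instead of the plot above so that it runs on its own; the file name `animals_demo.html` and the variable names are assumptions of the example.

```python
import os
import tempfile

from bokeh.plotting import figure, save
from bokeh.resources import CDN

# Toy figure with the two reference points discussed above.
p_demo = figure(title="Toy scatter plot")
p_demo.scatter([0.3167, 1.0], [1.0, 0.3167])

# Hypothetical output location; any writable path works.
html_path = os.path.join(tempfile.gettempdir(), "animals_demo.html")

# Write a self-contained page that loads BokehJS from the CDN.
save(p_demo, filename=html_path, resources=CDN, title="Sheep vs. horse")
print(os.path.exists(html_path))
```

The resulting file can be opened in a browser or placed on a web server, and the pan, zoom, and hover tools keep working without a running Python session.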

In [None]:
model.wv.similarity('immeru[sheep]N', 'sisû[horse]N')