# Word2Vec on the Akkadian ORACC corpus

This lesson is designed to explore features of word embeddings produced through the word2vec model.

The primary corpus we use is the [Akkadian ORACC corpus](https://github.com/niekveldhuis/Word2vec), put together by Professor Niek Veldhuis, UC Berkeley Near Eastern Studies.

At the end we'll also look at a <a href="http://ryanheuser.org/word-vectors-1/">Word2Vec model trained on the ECCO-TCP corpus</a> of 2,350 eighteenth-century literary texts, made available by Ryan Heuser. (Note that I have cut the number of terms in the model by half in order to conserve memory.)

### Learning Goals
* Learn the intuition behind word embedding models (WEM)
* Learn how to implement a WEM using the gensim implementation of word2vec
* Use this method to explore a corpus that is completely unknown to most of you
* Think through how visualization of WEM might help you explore your corpus
* Implement text analysis on a non-English language

### Agenda
<ol>
<li>Import & Pre-Processing</li>
<li>Word2Vec</li>
<ol><li>Training</li>
<li>Embeddings</li>
<li>Visualization</li>
</ol>
<li>Saving/Loading Models</li>
</ol>

### Further Resources

For further background on Word2Vec's mechanics, I suggest this <a href="https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html">brief tutorial</a> by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."

See also Ben Schmidt's blog posts [here](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html) and [here](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html).

## 0. Prep

Install a new package, and import necessary packages.

In [None]:
#Install a package that is not in the Anaconda distribution
#To do this we'll use pip install
!pip3 install gensim

In [None]:
#import the necessary libraries

#Data Wrangling
import pandas
import numpy as np
import os

import gensim #library needed for word2vec

#for visualization
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise
from sklearn.manifold import MDS, TSNE

In [None]:
#Visualization parameters
%pylab inline
matplotlib.style.use('ggplot')

## 1. Import and Pre-Processing

### Corpus Description

The corpus description can be found [here](https://github.com/niekveldhuis/Word2vec).

### Import Data

Read in all of the .csv files in the folder `../data/oracc/`, do some pre-processing on them, and concatenate them into a single pandas DataFrame.

In [None]:
#read in all the data, with some cleaning
#I won't explain this code, but challenge yourself to understand it
path ='../data/oracc/' # indicate the local path where files are stored
allFiles = os.listdir(path) #save the list of filenames into a variable
print(allFiles)

In [None]:
list_ = []
files_ = []
for file_ in allFiles:
    filename = path+file_ #add the relative path name to the filename
    df = pandas.read_csv(filename,index_col=None, header=0)
    df['id_text'] = [file_[7:-4].replace('_', '/') + '/' + text for text in df['id_text']]
    df['lemma'] = [lemma.replace('$', '') for lemma in df['lemma']]
    list_.append(df)
    files_.append(file_[7:-4].replace('_', '/'))
data = pandas.concat(list_).reset_index(drop=True)
#view the data
data

In [None]:
#Number of rows
data.shape

In [None]:
#View the first text
data.iloc[0,1]

### Pre-Processing

Word2Vec learns about the relationships among words by observing them in context. This means that we want to split our texts into word-units. In this corpus there is no punctuation, and thus nothing resembling a sentence. In other texts we would want to maintain sentence boundaries as well, since otherwise the last words of one sentence would be treated as context for the first words of the next.

You can split your text into sentences using `nltk.tokenize.sent_tokenize()`.

For today, we'll tokenize our text by splitting on white space.

In [None]:
#tokenize the data by splitting on white space. There is no punctuation in this text.
data['tokens'] = data['lemma'].str.split()
data['tokens'][0]
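For a corpus that does have punctuation, the sentence-then-token split described above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for `nltk.tokenize.sent_tokenize()`: the helper names `split_sentences` and `tokenize` are made up for this example, and the regular expressions are naive assumptions that will stumble on abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Lowercase the sentence and keep only runs of word characters."""
    return re.findall(r'\w+', sentence.lower())

text = "Call me Ishmael. Some years ago I went to sea."
# One list of tokens per sentence: the shape Word2Vec expects as input
sentences = [tokenize(sentence) for sentence in split_sentences(text)]
print(sentences)
```

Keeping one token list per sentence means that context windows never reach across a sentence boundary.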

### Data Cleaning
Unlemmatized (broken or unknown) words are represented as, for instance, `x-ši-ka[NA]NA`. Such tokens are essentially placeholders. One may try two different approaches:
- represent all such placeholders by NA
- eliminate all placeholders

In [None]:
data_NA = data.copy()
data_NA['tokens'] = data_NA['tokens'].apply(lambda x: [token if not token.endswith('NA]NA') else 'NA' for token in x])

In [None]:
data['tokens'] = data['tokens'].apply(lambda x: [token for token in x if not token.endswith('NA]NA')])

In [None]:
data['tokens'][0]

In [None]:
data_NA['tokens'][0]
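The difference between the two approaches is easiest to see on a single short line. The token list below is a made-up toy example (only the placeholder pattern `NA]NA` comes from the corpus), and `is_placeholder` is a hypothetical helper name.

```python
# Toy line with two broken or unknown words (illustrative, not from the corpus)
tokens = ['šarru[king]N', 'x-ši-ka[NA]NA', 'ēkallu[palace]N', 'x[NA]NA']

def is_placeholder(token):
    """Unlemmatized words end in 'NA]NA'."""
    return token.endswith('NA]NA')

# Approach 1: keep the position of each placeholder, but collapse it to 'NA'
replaced = ['NA' if is_placeholder(token) else token for token in tokens]

# Approach 2: drop placeholders entirely
dropped = [token for token in tokens if not is_placeholder(token)]

print(replaced)
print(dropped)
```

Replacing keeps the original distances between words, so a context window still "sees" the gap. Dropping makes words adjacent that were separated by a break in the original text, which may or may not be what you want.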

## 2. Word2Vec

### Word Embedding
Word2Vec is one of the most prominent word embedding algorithms. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is flanked on either side by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice that “me” falls between pairs of words similar to those that surround “her,” “him,” or “them.” Of course, the computer goes through the same process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call","Ishmael"). CBOW does the opposite, taking the context words ("Call","Ishmael") as a single input and trying to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
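To make the input/output difference concrete, the sketch below lists the training examples each flavor would derive from the Moby Dick sentence. It is a simplification under stated assumptions: the function names are made up, the window is fixed, and real implementations also vary the window size and subsample frequent words.

```python
def skipgram_pairs(tokens, window):
    """One (input word, context word to predict) pair per neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """One (all context words, word to predict) pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            pairs.append((context, target))
    return pairs

sentence = ['Call', 'me', 'Ishmael']
sg_examples = skipgram_pairs(sentence, window=1)
cbow_examples = cbow_pairs(sentence, window=1)
print(sg_examples)    # Skip-Gram: "me" is asked to predict "Call" and "Ishmael" separately
print(cbow_examples)  # CBOW: "Call" and "Ishmael" together are asked to predict "me"
```

Skip-Gram produces more, smaller training examples per word, which is one intuition for why it is slower but can do better on rare words.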

### Word2Vec Features
<ul>
<li>Size: Number of dimensions for word embedding model</li>
<li>Window: Number of context words to observe in each direction</li>
<li>min_count: Minimum frequency for words included in model</li>
<li>sg (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram</li>
<li>Alpha: Learning rate (initial); prevents model from over-correcting, enables finer tuning</li>
<li>Iterations: Number of passes through dataset</li>
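<li>Batch Size (batch_words): Target number of words in each batch of examples handed to a worker thread</li>
<li>Workers: Number of worker threads used for training; a single worker (together with a fixed seed) makes runs reproducible</li>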
</ul>

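Note: The script below passes each argument explicitly. Most values match the gensim defaults; `min_count=1`, `sg=1`, and `workers=1` do not (the defaults are 5, 0, and 3).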

### Training, or fitting

In [None]:
# Argument names follow gensim < 4.0; from 4.0 on, `size` is `vector_size` and `iter` is `epochs`
model = gensim.models.Word2Vec(data['tokens'], size=100, window=5,
                               min_count=1, sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)
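One way to keep a training call like this working across gensim releases is to choose the keyword names from the installed version. The sketch below is a hypothetical helper (`word2vec_kwargs` is not part of gensim) and assumes the only renames that matter here are `size` to `vector_size` and `iter` to `epochs`.

```python
def word2vec_kwargs(gensim_version, dims=100, passes=5):
    """Build Word2Vec keyword arguments with the names this gensim version expects."""
    major = int(gensim_version.split('.')[0])
    if major >= 4:
        dims_name, passes_name = 'vector_size', 'epochs'
    else:
        dims_name, passes_name = 'size', 'iter'
    return {
        dims_name: dims,
        passes_name: passes,
        'window': 5,
        'min_count': 1,
        'sg': 1,
        'alpha': 0.025,
        'batch_words': 10000,
        'workers': 1,
    }

print(sorted(word2vec_kwargs('3.8.3')))
print(sorted(word2vec_kwargs('4.3.2')))

# Possible usage in this notebook (needs gensim and the data loaded above):
# model = gensim.models.Word2Vec(data['tokens'], **word2vec_kwargs(gensim.__version__))
```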

### Embeddings

In [None]:
# Return dense word vector for the word 'ēkallu[palace]N'
#each token (not document) has a 100 element vector
model['ēkallu[palace]N']

### Vector-Space Operations

#### Similarity
Since words are represented as dense vectors, we can ask how similar words' meanings are based on their cosine similarity (the cosine of the angle between the two vectors, which is close to 1 when they point in the same direction). gensim has a few out-of-the-box functions that enable different kinds of comparisons.

In [None]:
# Find the cosine similarity between two given word vectors
model.similarity('ēkallu[palace]N','bītu[house]N')
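What the similarity score measures can be sketched by hand. The three-dimensional vectors below are invented for illustration (real vectors in this model have 100 dimensions), and `cosine_similarity` is our own helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors
palace = [0.9, 0.1, 0.3]
house = [0.8, 0.2, 0.4]
sheep = [0.1, 0.9, 0.2]

print(cosine_similarity(palace, house))  # close to 1: similar direction
print(cosine_similarity(palace, sheep))  # much lower: different direction
```

Cosine distance is simply 1 minus this value, which is what the `cosine` function imported from `scipy.spatial.distance` returns.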

In [None]:
#Find the 10 most similar vectors to the given word vector, using cosine similarity
model.most_similar('ēkallu[palace]N')

In [None]:
model.similarity('immeru[sheep]N','puhādu[lamb]N')

In [None]:
model.similarity('arhu[cow]N','būru[(bull)-calf]N')

In [None]:
##EX: find the most similar words to cow and sheep. Do they make sense?
model.most_similar('arhu[cow]N')

In [None]:
model.most_similar('immeru[sheep]N')

### Multiple Valences

A word embedding may encode both primary and secondary meanings that are present at the same time. In order to identify secondary meanings in a word, we can subtract the vectors of primary (or simply unwanted) meanings. For example, we may wish to remove the sense of <em>river bank</em> from the word <em>bank</em>. This would be written mathematically as <em>BANK - RIVER</em>, which in <em>gensim</em>'s interface lists <em>BANK</em> as a positive meaning and <em>RIVER</em> as a negative one.
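A rough sketch of what a query with positive and negative words does, assuming the query vector is the sum of the unit-length positive vectors minus the unit-length negative ones, with all other words ranked by cosine similarity to it. The two-dimensional vectors (think "money-ness" and "water-ness") are invented, the function is our own, and gensim's implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar(vectors, positive, negative=()):
    """Rank the remaining words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + a for q, a in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - a for q, a in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up toy space: first axis "money-ness", second axis "water-ness"
toy = {
    'bank': [1.0, 1.0],
    'river': [0.0, 1.0],
    'money': [1.0, 0.1],
    'stream': [0.2, 1.0],
}

print(most_similar(toy, positive=['bank']))
print(most_similar(toy, positive=['bank'], negative=['river']))
```

In this toy space, "stream" is the nearest neighbor of "bank" on its own, while subtracting "river" pushes "money" to the top.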

We'll try to find different meanings of the words 'bad' and 'good' in our corpus.

In [None]:
model.most_similar(['masku[bad]AJ','lemnu[bad]AJ'])

This seems to capture 'bad' in the magical, sorcery sense, and perhaps as injustice. Let's remove those vectors from the vector space.

In [None]:
#remove more vectors to get at different senses of the word 'bad'
model.most_similar(positive=['masku[bad]AJ','lemnu[bad]AJ'], negative=['utukku[(an-evil-demon)]N','dipalû[distortion-of-justice]N'])

This gets at a slightly different sense of the word 'bad', relating to battle and military force.

In [None]:
## EX. Use the most_similar method to find the tokens nearest to 'good' in our model.
##The strings for good are 'damqu[good]AJ' and 'ṭābu[good]AJ'.
print(model.most_similar(['damqu[good]AJ', 'ṭābu[good]AJ']))
print()
## EX. Remove the vector 'hadû[joyful]AJ' from the 'good' vector.
## What alternative meaning of 'good' comes through?
print(model.most_similar(positive=['damqu[good]AJ', 'ṭābu[good]AJ'], negative=['hadû[joyful]AJ']))

### Analogy
Analogies are rendered as simple mathematical operations in vector space. For example, the canonical word2vec analogy <em>MAN is to KING as WOMAN is to ??</em> is rendered as <em>KING - MAN + WOMAN</em>. In the gensim interface, we designate <em>KING</em> and <em>WOMAN</em> as positive terms and <em>MAN</em> as a negative term, since it is subtracted from those.

We'll try this with the analogy Cow::Calf as Sheep::?? (the word we are looking for is lamb).

In [None]:
model.most_similar(positive=['immeru[sheep]N', 'būru[(bull)-calf]N'], negative=['arhu[cow]N'])
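Why this arithmetic can work is easiest to see in an idealized space where each dimension has a clean meaning. The vectors below are invented (dimensions: bovine, ovine, young); a trained model will only ever approximate this, so the query above returns nearest neighbors rather than an exact match.

```python
# Made-up vectors with dimensions (bovine, ovine, young)
cow = [1.0, 0.0, 0.0]
calf = [1.0, 0.0, 1.0]
sheep = [0.0, 1.0, 0.0]
lamb = [0.0, 1.0, 1.0]

# CALF - COW isolates "young"; adding SHEEP gives "young sheep"
result = [c - k + s for c, k, s in zip(calf, cow, sheep)]
print(result)
print(result == lamb)
```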

### Creating a binary: Horses and Sheep

Ben Schmidt found the meat/vegetable binary to be a useful one to look for in a vector space. We can find an analogous binary here.

The animal vocabulary may be divided into 'horse vocabulary' (used for war and often received from foreign countries) and 'sheep vocabulary'. Sheep are domestic animals held for meat and wool and are (relatively) close to other such animals (ox, calf) and words that have to do with wool production.

In [None]:
animals = ['sisû[horse]N', 'immeru[sheep]N', 'imēru[donkey]N', 'alpu[ox]N', 'littu[cow]N', 
           'pīru[elephant]N', 'yābilu[ram]N', 'udru[Bactrian-camel]N', 'damdāmu[(a-kind-of-mule)]N'
           ,'atānu[she-ass]N', 'būru[(bull)-calf]N', 'tuānu[(a-breed-of-horse)]N', 'agālu[donkey]N'
          , 'šullāmu[(a-type-of-horse)]N', 'sugullu[herd]N', 'naṣmadu[harness]N', 'ṣamādu[team]N'
          ,'harbu[plough]N', 'Parsuaya[from-Parsua]EN', 'šulušīu[three-year-old]AJ', 'kīṣu[flayed]AJ'
          ,'bitrumu[very-colourful]AJ', 'buqūmu[plucking]N', 'anāqāte[she-camels]N',
           'udukiutukku[(a-kind-of-sacrificial-sheep)]N', 'maḫirtu[(a-bone-of-the-leg)]N', 'Muṣuraya[Egyptian]EN',
          'gurrutu[ewe]N', 'irginu[(a-breed-or-colour-of-horse)]N', 'ṣummudu[equipped]AJ', 'qummānu[(a-sheep)]N',
           'baqmu[plucked]AJ', 'huzīru[pig]N', 'surrudu[packed-up]AJ', 'pēthallu[riding-horse]N', 'nāmurtu[audience-gift]N', 
           'Manna[Mannea]GN', 'puhādu[lamb]N']
animal_words = model.most_similar(animals, topn=100)
animal_words = [word for word, similarity in animal_words]
animal_words

### Visualization

We can visualize this 'sheep'/'horse' binary by plotting each word's similarity to these two words on the same graph. This is similar to the 'meat'/'vegetable' binary graphed by Ben Schmidt.

In [None]:
x = [model.similarity('sisû[horse]N', word) for word in animals]
y = [model.similarity('immeru[sheep]N', word) for word in animals]

Add an array with relative count frequencies for each word to scale the size of each node based on the relative frequency in the text.

Thanks to classmate Richard Doan for this code.

In [None]:
#Create a count dictionary
counts = {}
for sentence in data['tokens']:
    for word in sentence:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1


In [None]:
#Create an array for the size, based on the relative count
sizes = []
for animal in animals:
    sizes.append(counts[animal])

sizes = list(map(lambda x: x / max(sizes), sizes))
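The same counting and scaling can be sketched more compactly with `collections.Counter`. This is an alternative under stated assumptions, not the notebook's method: `relative_sizes` and its `max_area` parameter are hypothetical, and the scaling factor is there because matplotlib interprets scatter sizes as marker areas in points squared, so values between 0 and 1 produce very small markers.

```python
from collections import Counter

def relative_sizes(tokenized_texts, words, max_area=2000):
    """Marker area per word, proportional to its count; the most frequent word gets max_area."""
    counts = Counter(token for text in tokenized_texts for token in text)
    top = max(counts[word] for word in words)
    if top == 0:
        return [0.0 for _ in words]
    # Counter returns 0 for unseen words instead of raising KeyError
    return [max_area * counts[word] / top for word in words]

# Toy input standing in for data['tokens'] and the animals list
toy_texts = [
    ['immeru[sheep]N', 'sisû[horse]N', 'immeru[sheep]N'],
    ['immeru[sheep]N', 'alpu[ox]N', 'immeru[sheep]N'],
]
toy_words = ['immeru[sheep]N', 'sisû[horse]N', 'alpu[ox]N', 'pīru[elephant]N']
print(relative_sizes(toy_texts, toy_words))
```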

In [None]:
import matplotlib
matplotlib.rc('font', family='Arial')


_, ax = plt.subplots(figsize=(20,20))
ax.scatter(x, y, sizes, alpha=1, color='b')
for i in range(len(animals)):
    ax.annotate(animals[i], (x[i], y[i]))
ax.set_xlim(.25, 1.1)
ax.set_ylim(.4, 1.1)
plt.plot([0, 1], [0, 1], linestyle='--');

### Q. What kinds of semantic relationships exist in the diagram above?
####    Are there any words that seem out of place?

## 3. Saving/Loading Models

In [None]:
# Save current model for later use

model.wv.save_word2vec_format('../data/word2vec.oracc.txt')

In [None]:
# Load up models from disk

# Model trained on Eighteenth Century Collections Online corpus (~2500 texts)
# Made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/

ecco_model = gensim.models.KeyedVectors.load_word2vec_format('../data/word2vec.ECCO-TCP.txt')
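It can help to know what is inside such a file. In its plain-text form, the word2vec format starts with a header line giving the vocabulary size and the number of dimensions, followed by one line per word with its vector. The sketch below writes and reads that layout by hand; the function names are made up, the vectors are toy values, and it assumes tokens contain no spaces.

```python
import io

def write_word2vec_text(vectors, handle):
    """Write a header line, then one 'word v1 v2 ...' line per word."""
    dims = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dims}\n")
    for word, vector in vectors.items():
        handle.write(word + ' ' + ' '.join(repr(x) for x in vector) + '\n')

def read_word2vec_text(handle):
    """Parse the layout written above back into a dict of word -> vector."""
    n_words, dims = map(int, handle.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = handle.readline().rstrip('\n').split(' ')
        vector = [float(x) for x in parts[1:]]
        if len(vector) != dims:
            raise ValueError(f"expected {dims} values for {parts[0]!r}")
        vectors[parts[0]] = vector
    return vectors

# Toy vectors; an in-memory buffer stands in for a file on disk
toy = {'ēkallu[palace]N': [0.25, -0.5], 'bītu[house]N': [0.125, 1.0]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(restored == toy)
```

Note that this format stores only the word vectors, not the rest of the training state, so a model loaded this way can be queried but not trained further.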

In [None]:
# Can we get the currency sense of the word bank in Ryan Heuser's model?

ecco_model.most_similar(positive=['bank'], negative=['river'])

In [None]:
## EX. Heuser's blog post explores an analogy in eighteenth-century thought that
##     RICHES are to VIRTUE what LEARNING is to GENIUS.
## Reproduce this analogy using his trained word2vec model

##  Q. How might we compare word2vec models more generally?
ecco_model.most_similar(positive=['virtue', 'learning'], negative=['riches'])

## 4. Open Questions
At this point, we have seen a number of mathematical operations that we may use to explore word2vec's word embeddings. These enable us to answer a set of new, interesting questions dealing with semantics, yet there are many other questions that remain unanswered.

For example:
<ol>
<li>How to compare word usages in different texts (within the same model)?</li>
<li>How to compare word meanings in different models, or to compare whole models?</li>
<li>What about the space “in between” words?</li>
<li>Do we agree with the Distributional Hypothesis that words with the same contexts share their meanings?</li>
<ol><li>If not, then what information do we think is encoded in a word’s context?</li></ol>
<li>What good, humanistic research questions do analogies shed light on?</li>
<ol><li>shades of meaning?</li><li>context similarity?</li></ol>
</ol>

With the time remaining, play around with either of these two word2vec models, or begin to implement it on your own corpus.