# Word2Vec Tutorial Notes & Modifications <a id=top> </a>
`MMV | 12/4 | w266 Final Project: Crosslingual Word Embeddings`   


The code in this notebook follows [this tutorial](http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/), which is based on the [TensorFlow tutorial code](https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py). I will first apply the basic Word2Vec algorithm to a sample of our data (Wikipedia dumps in English). Then I'll examine different ways of visualizing the resulting embeddings. Finally, I will explore what it might look like to make [Duong et al.'s modifications](https://arxiv.org/pdf/1606.09403.pdf) to train crosslingual embeddings.

# Embeddings Overview 

__Basic Idea__: start with a 1-hot vector, pass it through a linear activation layer, then into a softmax, and optimize for the probability of nearby words (Skip-gram) or the center word (CBOW). The 'embeddings' are the parameters of the linear activation, which transform the 1-hot vector of size $|V|$ into an embedding of size $N$:
$$\text{Weight Matrix:}\qquad W \in \mathbb{R}^{|V|\times N}$$
$$\text{Bias (?):}\qquad b \in \mathbb{R}^{N}$$
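
To make the "1-hot vector times weight matrix" idea concrete, here is a minimal NumPy sketch. The toy sizes and variable names are illustrative assumptions, not values from this project. It shows that the matrix product just selects one row of $W$, which is why the TensorFlow code later uses an embedding lookup rather than an explicit multiplication.

```python
import numpy as np

V, N = 6, 3  # toy vocabulary size and embedding size (illustrative only)
rng = np.random.RandomState(0)
W = rng.uniform(-1.0, 1.0, size=(V, N))  # weight (embedding) matrix

word_id = 4
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# Multiplying a 1-hot row vector by W picks out a single row of W ...
via_matmul = one_hot.dot(W)
# ... so a direct row lookup gives the same embedding without the multiply.
via_lookup = W[word_id]

print(np.allclose(via_matmul, via_lookup))
```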

__Key Modifications:__ 
* Duong et al. use a CBOW-style algorithm but substitute a word's translation at training time so that they learn embeddings for the target language word based on the source language context. (see section 4.1)
* As a result, instead of a single weight matrix, they use a concatenation of two (see section 4 intro):
$$\text{Context Matrix:}\qquad W \in \mathbb{R}^{|V|\times N}$$
$$\text{Embedding Matrix:}\qquad U \in \mathbb{R}^{|V|\times N}$$
* Since normalizing the softmax is costly, they instead optimize for a _log-pseudo likelihood_ by learning to differentiate data from negative examples selected from a noise distribution (following Mikolov 2013, see section 3). Note that the TF tutorial shows how to do this with 'noise contrastive estimation'.
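
As a rough illustration of the last point, the sketch below computes a negative-sampling loss for a single (context, target) pair in NumPy. It follows the general form popularized by Mikolov et al. (2013) and is not a transcription of Duong et al.'s exact objective. The function name, toy sizes, and the uniform noise draw are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, U, target_id, noise_ids):
    """Negative log pseudo-likelihood for one (context, target) pair.

    h         : context representation, shape (N,)
    U         : output embedding matrix, shape (V, N)
    target_id : index of the observed word
    noise_ids : indices of words drawn from the noise distribution
    """
    pos_score = U[target_id].dot(h)
    neg_scores = U[noise_ids].dot(h)
    # Raise the observed word's score and lower the noise words' scores.
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))

rng = np.random.RandomState(0)
V, N, k = 10, 4, 3  # toy sizes; k is the number of noise samples
U = rng.normal(scale=0.1, size=(V, N))
h = rng.normal(scale=0.1, size=N)
noise_ids = rng.randint(0, V, size=k)  # uniform noise, for illustration only

loss = negative_sampling_loss(h, U, target_id=2, noise_ids=noise_ids)
print(loss)
```

Only `k + 1` rows of `U` are touched per example, instead of all $|V|$ rows as in the full softmax.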

# Notebook Setup

In [30]:
# general imports
from __future__ import print_function
import os
import math
import random
import numpy as np
import collections
import datetime as dt
import matplotlib.pyplot as plt
import tensorflow as tf

# tell matplotlib not to open a new window
%matplotlib inline

# automatically reload modules 
%reload_ext autoreload
%autoreload 2

# custom imports - see APPENDIX 
import helperfunc

In [2]:
# filepaths
BASE = '/home/mmillervedam/Data'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
DPATH = '/home/mmillervedam/ProjectRepo/XlingualEmb/data/dicts/en.es.panlex.all.processed'

In [3]:
# globals
VOCAB_SIZE = 5000

# Data Load & Tokenize

In [4]:
# Tokenizer preserves order (see code in Appendix)
en_raw = helperfunc.read_data(FPATH_EN)
es_raw = helperfunc.read_data(FPATH_ES)

In [5]:
# take a look
print(en_raw[:10])
print(es_raw[:10])

['[[12]]', 'Anarchism', 'is', 'often', 'defined', 'as', 'a', 'political', 'philosophy', 'which']
['[[7]]', 'El', 'Principado', 'de', 'Andorra', '(', 'en', 'catal\xc3\xa1n', ':', 'Principat']


__`NOTE!`__ We'll need to prepend 'en' and 'es' to each token before training crosslingual versions.   
__`QUESTIONS:`__ Do we deal with special characters? Punctuation?
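
One possible way to do the prefixing mentioned in the note is sketched below. The helper name `add_language_prefix` and the underscore separator are assumptions for illustration; the dictionary file at `DPATH` may use a different convention, so check it before committing to a format.

```python
def add_language_prefix(tokens, lang):
    """Return tokens tagged with a language code, e.g. 'de' -> 'es_de'."""
    return ['%s_%s' % (lang, tok) for tok in tokens]

en_sample = ['Anarchism', 'is', 'often', 'defined']
es_sample = ['El', 'Principado', 'de', 'Andorra']
print(add_language_prefix(en_sample, 'en'))
print(add_language_prefix(es_sample, 'es'))
```

Tagging tokens this way keeps words that are spelled the same in both languages (for example 'a' or 'no') as separate vocabulary entries.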

# Build Vocabulary

In [6]:
# Dataset Builder indexes by count (see code in Appendix)
en_data, en_counts, en_dict, en_index = helperfunc.build_dataset(en_raw, VOCAB_SIZE)
es_data, es_counts, es_dict, es_index = helperfunc.build_dataset(es_raw, VOCAB_SIZE)

In [7]:
#del en_raw  # Uncomment to reduce memory.
print("ENGLISH:")
print('Most common words (+UNK):\n', en_counts[:5])
print('Sample data:\n',' '.join(['%s(%s)'%(en_index[i],i) for i in en_data[:10]]))

ENGLISH:
Most common words (+UNK):
 [['UNK', 35112], ('the', 14841), (',', 14135), ('.', 9672), ('of', 8627)]
Sample data:
 UNK(0) Anarchism(1959) is(9) often(92) defined(571) as(11) a(8) political(226) philosophy(301) which(26)


In [70]:
# del es_raw  # Uncomment to reduce memory.
print("SPANISH:")
print('Most common words (+UNK):\n', es_counts[:5])
print('Sample data:\n',' '.join(['%s(%s)'%(es_index[i],i) for i in es_data[:10]]))

SPANISH:
Most common words (+UNK):
 [['UNK', 40501], ('de', 16422), (',', 14864), ('la', 9002), ('.', 8578)]
Sample data:
 UNK(0) El(27) Principado(1076) de(1) Andorra(160) ((14) en(6) catalán(1381) :(32) UNK(0)


# Generate Batched Data

In [8]:
#################### PARAMETERS ####################
batch_size = 8 # Number of inputs to process at once.
num_skips = 2 # How many times to reuse an input to generate a context.
skip_window = 2 # How many words to consider left and right.
data_index = 0  # -see note below-

__`NOTE:`__ The TF tutorial declares `data_index` as a global inside the `generate_batch` function, so the `data_index` set in the cell above is a separate variable, and the English and Spanish calls below share the single cursor that lives in `helperfunc`. Double check you're getting the expected behavior b/c we're doubling up on languages. 
> `UPDATE`: OK - it looks like this is because the `generate_batch` function is called repeatedly to window over the data. I'll figure out how to handle the global indexer when I get to the TensorFlow portion of the code.
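
One way to avoid sharing a cursor between languages is to keep the position on an object instead of in a module global. The class below is a hypothetical sketch (the name `SkipGramBatcher` is not part of `helperfunc`). It mirrors the windowing logic of `generate_batch`, but returns plain lists and uses `random.sample` to pick distinct context positions instead of the rejection loop.

```python
import collections
import random

class SkipGramBatcher(object):
    """Skip-gram batch generator that keeps its own cursor per corpus."""

    def __init__(self, data, batch_size, num_skips, skip_window):
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        self.data = data
        self.batch_size = batch_size
        self.num_skips = num_skips
        self.skip_window = skip_window
        self.data_index = 0  # per-instance replacement for the module global

    def _advance(self, window):
        window.append(self.data[self.data_index])
        self.data_index = (self.data_index + 1) % len(self.data)

    def next_batch(self):
        span = 2 * self.skip_window + 1  # [ skip_window center skip_window ]
        window = collections.deque(maxlen=span)
        for _ in range(span):
            self._advance(window)
        batch, context = [], []
        center = self.skip_window
        positions = [p for p in range(span) if p != center]
        for _ in range(self.batch_size // self.num_skips):
            # pick num_skips distinct context positions around the center word
            for target in random.sample(positions, self.num_skips):
                batch.append(window[center])
                context.append([window[target]])
            self._advance(window)
        # step back so the next batch does not skip words
        self.data_index = (self.data_index - span) % len(self.data)
        return batch, context

# toy corpora of distinct ids so the two streams are easy to tell apart
en_batcher = SkipGramBatcher(list(range(100, 120)), 8, 2, 2)
es_batcher = SkipGramBatcher(list(range(200, 220)), 8, 2, 2)
en_b, en_c = en_batcher.next_batch()
es_b, es_c = es_batcher.next_batch()
print(en_b)
print(es_b)
```

Each batcher starts at the beginning of its own corpus, regardless of how many batches the other one has produced. The lists can be wrapped in `np.array(...)` before being fed to TensorFlow.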

In [9]:
############## ENGLISH BATCHES & CONTEXT #################
# batch = list of text segments represented by their indices
# contexts = corresponding skip_gram context set indices
en_batch, en_context = helperfunc.generate_batch(en_data, 
                                                 batch_size, 
                                                 num_skips, 
                                                 skip_window)

In [92]:
# take a look
print('RAW BATCH:', en_batch)
print('RAW CONTEXT:', en_context.squeeze())
print("Decoded:")
for i in range(8):
    print("   ", en_batch[i], en_index[en_batch[i]],
        '->', en_context[i, 0], en_index[en_context[i, 0]])

RAW BATCH: [  9   9  92  92 571 571  11  11]
RAW CONTEXT: [   0 1959    9  571    8    9   92    8]
Decoded:
    9 is -> 0 UNK
    9 is -> 1959 Anarchism
    92 often -> 9 is
    92 often -> 571 defined
    571 defined -> 8 a
    571 defined -> 9 is
    11 as -> 92 often
    11 as -> 8 a


In [10]:
############## SPANISH BATCHES & CONTEXT #################
# batch = list of text segments represented by their indices
# contexts = corresponding skip_gram context set indices
es_batch, es_context = helperfunc.generate_batch(es_data, 
                                                 batch_size, 
                                                 num_skips, 
                                                 skip_window)

In [11]:
# take a look
print('RAW BATCH:', es_batch)
print('RAW CONTEXT:', es_context.squeeze())
print("Decoded:")
for i in range(8):
    print("   ", es_batch[i], es_index[es_batch[i]],
        '->', es_context[i, 0], es_index[es_context[i, 0]])

RAW BATCH: [   6    6 1381 1381   32   32    0    0]
RAW CONTEXT: [ 32 160  14   6   0   6  32  31]
Decoded:
    6 en -> 32 :
    6 en -> 160 Andorra
    1381 catalán -> 14 (
    1381 catalán -> 6 en
    32 : -> 0 UNK
    32 : -> 6 en
    0 UNK -> 32 :
    0 UNK -> 31 )


__`NOTE:`__ To implement Duong et al.'s work we'd perform the word substitution at this stage, replacing the words in the batch with the index of their translation. In fact, we'd probably do so using a dictionary of indices for the vocab. 
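
A minimal sketch of what that index-level substitution could look like is given below. It assumes a single joint vocabulary with language-prefixed tokens. The function names and the toy dictionary are hypothetical, and the uniform random choice among candidate translations is only a placeholder for the selection procedure the paper describes in section 4.1.

```python
import random

def build_index_translations(joint_dict, word_pairs):
    """Map vocabulary indices to lists of translation indices.

    joint_dict : word -> index for a joint (prefixed) vocabulary
    word_pairs : iterable of (source_word, target_word) tuples
    """
    table = {}
    for src_word, tgt_word in word_pairs:
        # keep only pairs where both words made it into the vocabulary
        if src_word in joint_dict and tgt_word in joint_dict:
            table.setdefault(joint_dict[src_word], []).append(joint_dict[tgt_word])
    return table

def substitute_translations(batch, table, rng=random):
    """Swap each index that has a translation for one of its candidates."""
    return [rng.choice(table[idx]) if idx in table else idx for idx in batch]

# toy joint vocabulary and bilingual dictionary (illustrative only)
joint_dict = {'UNK': 0, 'en_the': 1, 'en_house': 2, 'es_la': 3, 'es_casa': 4}
pairs = [('en_house', 'es_casa'), ('en_the', 'es_la'), ('en_cat', 'es_gato')]

table = build_index_translations(joint_dict, pairs)
print(substitute_translations([1, 2, 0], table))
```

Indices without a dictionary entry (here `UNK`) pass through unchanged, so the batch keeps its shape.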

# TensorFlow Model w/ full softmax (slow!)

__Step 1:__ Set up the model graph.

In [13]:
# recall that we set the vocabulary size at the top of the NB
print(VOCAB_SIZE)

5000


In [14]:
# additional model parameters
batch_size = 128 # Number of inputs to process at once.
embedding_size = 128 # Hidden layer representation size
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a context.

In [15]:
# initialize the TF graph
graph = tf.Graph()

In [16]:
##################### DATA PLACEHOLDERS ####################
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_context = tf.placeholder(tf.int32, shape=[batch_size, 1])
    train_one_hot = tf.one_hot(train_context, VOCAB_SIZE)

In [17]:
#################### INPUT(EMBEDDING)LAYER #################
with graph.as_default():
    embeddings = tf.Variable(tf.random_uniform([VOCAB_SIZE, 
                                                embedding_size],
                                               -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

In [21]:
######################## HIDDEN LAYER ######################
with graph.as_default():
    weights = tf.Variable(tf.truncated_normal([VOCAB_SIZE, embedding_size],
                              stddev=1.0 / math.sqrt(embedding_size)))
    biases = tf.Variable(tf.zeros([VOCAB_SIZE]))
    hidden_out = tf.matmul(embed, tf.transpose(weights)) + biases

__`NOTE:`__ If we're going to set up experiments/comparisons between different embedding training methods (e.g. Duong's word2vec modification vs the post-training aligned word vectors referenced in the Babylon Repo), we'll want to fix the embedding size across the multiple models. Maybe even fix the initialization for the weights? No, in this case the weights are irrelevant across models b/c they'll be optimizing different things. Presumably part of what we're interested in is comparing speed to train in concert w/ efficacy on the translation task, and random initialization always raises the question of 'did we just get lucky'.

In [22]:
########################## TRAIN OP ########################
with graph.as_default():
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=hidden_out, 
                                                labels=train_one_hot))
    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cross_entropy)

__Step 2:__ Set up a validation set: a randomly chosen set of words used to track our progress as we train. By construction we'll pick words from the 100 most frequent in the vocabulary, then use cosine similarity to find their nearest neighbors in the embedding matrix.
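
For intuition, the nearest-neighbor lookup can be sketched in plain NumPy before building it into the graph. The function name and the toy embedding matrix are assumptions for illustration. The logic (normalize the rows, take dot products, sort by descending similarity, drop the query word itself) is the same idea the TensorFlow ops below express.

```python
import numpy as np

def nearest_neighbors(embeddings, query_ids, top_k):
    """Indices of the top_k most cosine-similar rows for each query row."""
    norms = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norms
    # dot products of unit vectors are cosine similarities
    sim = normalized[query_ids].dot(normalized.T)  # (len(query_ids), V)
    # sort by descending similarity; the first hit is normally the query itself
    return np.argsort(-sim, axis=1)[:, 1:top_k + 1]

# toy 2-d embeddings: row 1 points almost the same way as row 0
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
print(nearest_neighbors(toy, [0], top_k=2))
```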

In [23]:
###################### VALIDATION EXAMPLES #################
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

with graph.as_default():
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

In [24]:
##################### SIMILARITY CALCULATION ################
with graph.as_default():
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

In [25]:
# Variable initializer
with graph.as_default():
    init = tf.global_variables_initializer()

__Step 3:__ Run the model & track progress by examining the matches for words in our validation set.

In [34]:
from helperfunc import generate_batch
helperfunc.data_index = 0 # reset the batch cursor that generate_batch keeps in helperfunc

def run(graph, num_steps):
    """Runner code for word2vec TF model w/ full softmax"""
    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')

        average_loss = 0
        for step in range(num_steps):
            batch_inputs, batch_context = generate_batch(en_data,
                                                         batch_size, 
                                                         num_skips, 
                                                         skip_window)
            feed_dict = {train_inputs: batch_inputs, 
                         train_context: batch_context}

            # We perform one update step by evaluating the optimizer op 
            _, loss_val = session.run([optimizer, cross_entropy], 
                                      feed_dict=feed_dict)
            average_loss += loss_val

            if step % 100 == 0:
                if step > 0:
                    average_loss /= 100
                # The average loss is an estimate of the loss over the last 100 batches.
                print('Average loss at step ', step, ': ', average_loss)
                average_loss = 0

            # Note that this is expensive (~20% slowdown if computed every 500 steps)
            if step % 500 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = en_index[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log_str = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = en_index[nearest[k]]
                        log_str = '%s %s,' % (log_str, close_word)
                    print(log_str)

Runner Call

In [None]:
num_steps = 10001
softmax_start_time = dt.datetime.now()
run(graph, num_steps=num_steps)
softmax_end_time = dt.datetime.now()
print("Softmax method took {} seconds to run 10000 iterations".format((softmax_end_time-softmax_start_time).total_seconds()))

In [41]:
# NOTE: output from ^^ saved to:
path = 'wtv_output/en_smalldata_10Kiter_fullsfmx.txt'
!tail -n 1 {path}

Softmax method took 1461.520475 seconds to run 10000 iterations

In [45]:
# take a look at loss
!grep 'Average' {path} | tail

Average loss at step  9100 :  5.23669617653
Average loss at step  9200 :  5.23754267216
Average loss at step  9300 :  5.30736485481
Average loss at step  9400 :  5.27600327492
Average loss at step  9500 :  5.2527682209
Average loss at step  9600 :  5.25820608616
Average loss at step  9700 :  5.34852351189
Average loss at step  9800 :  5.43696550369
Average loss at step  9900 :  5.37404325485
Average loss at step  10000 :  5.33349477768


In [59]:
# take a look at NN for 'them'
!grep 'Nearest to them:' {path} |tail

Nearest to them: chose, origin, lines, Delos, lighter, young, Nixon, remove,
Nearest to them: chose, origin, lines, Delos, lighter, young, remove, Nixon,
Nearest to them: chose, origin, lines, Delos, lighter, young, Nixon, shadow,
Nearest to them: chose, origin, lines, Delos, lighter, young, shadow, remove,
Nearest to them: chose, origin, lines, Delos, lighter, young, shadow, remove,
Nearest to them: chose, origin, lines, Delos, lighter, young, Alexander, shadow,
Nearest to them: chose, lines, origin, Delos, lighter, young, Alexander, shadow,
Nearest to them: chose, origin, lines, Delos, lighter, young, Alexander, shadow,
Nearest to them: chose, origin, lines, Delos, lighter, young, Alexander, shadow,
Nearest to them: chose, origin, lines, Delos, lighter, young, Alexander, shadow,


__`NOTE:`__ The data ^^ are undoubtedly too small... 'Alabama' shouldn't appear in the top 100 words. However, I'll wait to look at larger data with the sampling method, which is much more efficient.

# TensorFlow Model w/ NCE (faster)

We'll write it as a class this time for ease of calling later.

In [None]:
# Helper function
def with_self_graph(function):
    """Decorator-foo borrowed from w266 a4."""
    def wrapper(self, *args, **kwargs):
        with self.graph.as_default():
            return function(self, *args, **kwargs)
    return wrapper

In [None]:
class Word2Vec(object):
    # This code was adapted from:
    # SOURCE: https://github.com/tensorflow/tensorflow
    #         /blob/r1.2/tensorflow/examples/tutorials
    #         /word2vec/word2vec_basic.py
    
    def __init__(self, graph=None, *args, **kwargs):
        """
        Args:
          V: vocabulary size
          H: embedding size
          
        Kwargs (reset defaults w/ feed_dict):
          softmax_ns = 64
          learning_rate = 1.0
        """
        # Set TensorFlow graph. All TF code will work on this graph.
        self.graph = graph or tf.Graph()
        self.SetParams(*args, **kwargs)
        
    @with_self_graph
    def SetParams(self, V, H, softmax_ns=64, learning_rate=1.0):
        # Model structure.
        self.V = V
        self.H = H

        # Training hyperparameters
        with tf.name_scope("Training_Parameters"):
            # Number of samples for sampled softmax.
            self.softmax_ns = softmax_ns
            # Learning Rate
            self.learning_rate = learning_rate
            #self.learning_rate_ = tf.placeholder(tf.float32, [], name="learning_rate")
            
        

# Appendix & Supplemental Code

__`helperfunc.py`__

In [74]:
%%writefile helperfunc.py
#!/usr/bin/env python
"""
Helper Functions for implementing Word2Vec in Python.

Most of the functions in this file come from the Official 
Tensorflow Docs and are made available via the word2vec
tutorial at: https://github.com/tensorflow/tensorflow/blob
/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
As such, this code is protected by their license, see ^^.

As noted, some of these helper functions were written or modified
by the authors of adventuresinmachinelearning.com as part of their
word-2-vec tutorial which closely follows the Tensorflow code.

I have also modified some to suit our use case.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import random
import numpy as np
import tensorflow as tf


def read_data(filename):
    """
    Extract the file as a list of words.
    NOTE: this is modified from the original function in the TF
    tutorial, which expected a zipped input file.
    """
    with open(filename) as f:
        data = tf.compat.as_str(f.read()).split()
    return data


def build_dataset(words, n_words):
    """
    Process raw inputs into a dataset.
    Creates vocabulary from top n words indexed by rank.
    """
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

data_index = 0
def generate_batch(data, batch_size, num_skips, skip_window):
    """
    Function to generate a training batch for the skip-gram model.
    NOTE: this was modified from original function in TF  
    tutorial by adventuresinML tutorial - mostly just renamed.
    """
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    context = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window input_word skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # input word at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # this is the input word
            context[i * num_skips + j, 0] = buffer[target]  # these are the context words
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, context


Overwriting helperfunc.py
