# A short & practical introduction to TensorFlow!

Part 3

The goal of this assignment is to train a Word2Vec skip-gram model over [Text8](http://mattmahoney.net/dc/textdata) data.

This is a personal wrap-up of all the material provided by [Google's Deep Learning course on Udacity](https://www.udacity.com/course/deep-learning--ud730), so all credit goes to them. 

Author: Pablo M. Olmos (olmos@tsc.uc3m.es)

Date: March 2017

## Word Embeddings using the Word2Vec skip-gram model

The following [link](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) gives a very simple explanation of the model.

In [None]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

import preprocessing


In [None]:
# Let's check which version of TensorFlow is installed. The provided scripts should run with TF 1.0 and above.

print(tf.__version__)

Download the data from the source website if necessary.

In [None]:
# Change the path according to the folder where you saved the provided dataset.
filename = preprocessing.maybe_download('../../DataSets/textWordEmbeddings/text8.zip', 31344016)

Read the data into a string

In [None]:
words = preprocessing.read_data(filename)
print('Data size %d' % len(words))
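
The `preprocessing` module is not listed in this notebook. As a rough sketch only, assuming the archive holds a single whitespace-separated text file, a reader along these lines would return the corpus as a list of word strings. The name `read_words_from_zip` is hypothetical and is not the author's helper.

```python
import zipfile


def read_words_from_zip(zip_path):
    """Return the first file in a zip archive as a list of words.

    Hypothetical stand-in for preprocessing.read_data; assumes the archive
    contains one UTF-8 text file with whitespace-separated tokens.
    """
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]
        text = archive.read(first_member).decode('utf-8')
    return text.split()
```

Splitting on whitespace is enough here only under the assumption that the corpus is already cleaned (lower-case, no punctuation); raw text would need a real tokenizer.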

In [None]:
type(words)

In [None]:
print(words[0:20])

Build the dictionary and replace rare words with the UNK token.

In [None]:
vocabulary_size = 50000

data, count, dictionary, reverse_dictionary = preprocessing.build_dataset(vocabulary_size,words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.
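
Since `preprocessing.build_dataset` is not shown, here is a minimal sketch of what such a function might do, inferred from the four values it returns. It assumes that UNK takes ID 0 and that the remaining IDs follow descending word frequency; `build_dataset_sketch` is a hypothetical name, not the author's code.

```python
import collections


def build_dataset_sketch(vocabulary_size, words):
    """Map words to integer IDs, sending rare words to the UNK token (ID 0).

    Hypothetical stand-in for preprocessing.build_dataset. Assumes UNK
    takes ID 0 and the remaining IDs follow descending frequency.
    """
    count = [['UNK', -1]]
    # Keep only the (vocabulary_size - 1) most frequent words.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    # Words outside the vocabulary are mapped to ID 0 (UNK).
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)  # number of tokens replaced by UNK
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary
```

With this layout, `data` is the corpus as a list of IDs, `dictionary` maps word to ID, and `reverse_dictionary` maps ID back to word.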

Let's display the internal variables to better understand their structure:

In [None]:
print(data[:10])
print(count[:10])

In [None]:
print(list(dictionary.items())[:10])
print(list(reverse_dictionary.items())[:10])

In [None]:
print('The index of the word crafty is %d\n' %(dictionary['crafty']))
print('The word corresponding to the index 875 is %s\n' %(reverse_dictionary[875]))
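
Indexing `dictionary` directly raises a `KeyError` for any word that did not make it into the vocabulary. A small sketch of a safer lookup follows, assuming that UNK has ID 0; the helper names and the toy vocabulary are illustrative only.

```python
def word_to_id(word, dictionary, unk_id=0):
    """Return the ID of `word`, or `unk_id` if it is out of vocabulary."""
    return dictionary.get(word, unk_id)


def ids_to_words(ids, reverse_dictionary):
    """Translate a sequence of IDs back to words."""
    return [reverse_dictionary[i] for i in ids]


# Toy vocabulary for illustration only; the real one has 50000 entries.
toy_dictionary = {'UNK': 0, 'the': 1, 'of': 2}
toy_reverse = {index: word for word, index in toy_dictionary.items()}

print(word_to_id('of', toy_dictionary))      # in vocabulary
print(word_to_id('crafty', toy_dictionary))  # falls back to the UNK ID
print(ids_to_words([1, 0, 2], toy_reverse))
```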

Function to generate a training batch for the skip-gram model.

In [None]:
data_index = 0

# Description of the batch generator in the preprocessing module, kept here for reference.
"""Generate a batch of data for training.
    Args:
        batch_size: Number of samples to generate in the batch.
        
        skip_window: How many words to consider around the target word,
            left and right. With skip_window=2, in the sentence above for
            "consider" we'll build the window
            [words, to, consider, around, the].
            
        num_skips: How many times to reuse an input to generate a label.
        
            For skip-gram, we map target word to adjacent words in the window
            around it. This parameter says how many adjacent word mappings to
            add to the batch for each target word. Naturally it can't be more
            than skip_window * 2.
            
    Returns:
        batch, labels - ndarrays with IDs.
        batch: Row vector of size batch_size containing target words.
        labels:
            Column vector of size batch_size containing a randomly selected
            adjacent word for every target word in 'batch'.
    """


print('data:', [reverse_dictionary[di] for di in data[:32]])

for num_skips, skip_window in [(2, 4)]:
    data_index = 0
    batch, labels = preprocessing.generate_batch(data, data_index, batch_size=16, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(16)])
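
The batch generator itself lives in the `preprocessing` module and is not shown. The sketch below follows the description above under stated assumptions: it slides a window of `2 * skip_window + 1` words over `data`, and for each centre word it samples `num_skips` distinct context words. `generate_batch_sketch` is a hypothetical name; the author's implementation may differ, in particular in how `data_index` is advanced between calls.

```python
import collections
import random

import numpy as np


def generate_batch_sketch(data, data_index, batch_size, num_skips, skip_window):
    """Build one skip-gram batch starting at position data_index.

    Hypothetical stand-in for preprocessing.generate_batch.
    Returns (batch, labels) as int32 arrays of shape (batch_size,)
    and (batch_size, 1).
    """
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1  # [skip_window words, target, skip_window words]
    batch = np.zeros(batch_size, dtype=np.int32)
    labels = np.zeros((batch_size, 1), dtype=np.int32)

    # Sliding window over the data, wrapping around at the end.
    window = collections.deque(maxlen=span)
    for _ in range(span):
        window.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    context_positions = [p for p in range(span) if p != skip_window]
    for i in range(batch_size // num_skips):
        # Pick num_skips distinct context words for the centre word.
        chosen = random.sample(context_positions, num_skips)
        for j, position in enumerate(chosen):
            batch[i * num_skips + j] = window[skip_window]
            labels[i * num_skips + j, 0] = window[position]
        window.append(data[data_index])  # slide the window by one word
        data_index = (data_index + 1) % len(data)
    return batch, labels
```

With `skip_window=1` and `num_skips=2`, every centre word appears twice in `batch`, paired once with its left neighbour and once with its right neighbour (in random order).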

## Using the above data set, now we train a skip-gram model!

The [tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) linked above gives a very simple explanation of the model.

In [None]:
batch_size = 32
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 32 # Random set of words to evaluate similarity on.
valid_window = 200 # Only pick samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
    # Variables.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Model.
    # Look up embeddings for inputs. No one-hot encoding of the input is needed:
    # embedding_lookup selects the rows of `embeddings` directly by word ID. :)
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                                     labels=train_labels, inputs=embed,
                                                     num_sampled=num_sampled, num_classes=vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
    # Compute the similarity between the validation examples and all embeddings.
    # We use the cosine similarity:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
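
To see why the input needs no one-hot encoding, the following NumPy sketch compares plain row indexing with the equivalent one-hot matrix product. The sizes and names are toy values chosen for illustration, not the notebook's 50000 x 128 matrix.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embed_dim = 6, 3          # toy sizes, not the notebook's 50000 x 128
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_dim))
word_ids = np.array([4, 0, 4, 2])     # a toy batch of word IDs

# Lookup: plain row indexing, no one-hot vectors are ever built.
looked_up = embedding_matrix[word_ids]

# Equivalent but wasteful: build one-hot rows and multiply.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot.dot(embedding_matrix)

print(np.allclose(looked_up, via_matmul))  # prints True
```

The one-hot route allocates a `batch_size x vocabulary_size` matrix only to select a handful of rows, which is why the lookup is preferred.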

In [None]:
num_steps = 100001
data_index = 0

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):

        batch_data, batch_labels = preprocessing.generate_batch(data, data_index, batch_size, num_skips, skip_window)
        data_index = (data_index + batch_size) % len(data)
        
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
            
        # Note that this is expensive (~20% slowdown if computed every 500 steps).
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    
    final_embeddings = normalized_embeddings.eval()
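
The nearest-neighbour step inside the training loop can be reproduced in plain NumPy. The sketch below uses a made-up set of four unit-norm vectors and made-up word labels, purely to show how negating the similarities and calling `argsort` ranks the neighbours.

```python
import numpy as np

# Toy unit-norm embeddings for four "words" (rows); illustration only.
toy_words = ['king', 'queen', 'apple', 'pear']
toy_embeddings = np.array([[1.0, 0.0],
                           [0.8, 0.6],
                           [0.0, 1.0],
                           [0.6, 0.8]])

# For unit-norm rows, the dot product equals the cosine similarity.
toy_sim = toy_embeddings.dot(toy_embeddings.T)


def nearest_words(query_row, top_k):
    """Return the top_k words most similar to the word at query_row."""
    # Sort by descending similarity; position 0 is the query word itself
    # (similarity 1), so it is skipped.
    order = (-toy_sim[query_row, :]).argsort()[1:top_k + 1]
    return [toy_words[i] for i in order]


print(nearest_words(0, 2))
print(nearest_words(2, 2))
```

Skipping position 0 relies on each word being its own closest match, which holds as long as no two rows are identical.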

This is what an embedding looks like:

In [None]:
print(final_embeddings[2,:])

The embeddings have unit norm!

In [None]:
print(np.sum(np.square(final_embeddings[40000,:])))
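
The unit norm follows from dividing each row by its Euclidean norm in the graph. A short NumPy sketch of the same normalization, using a small random matrix as a stand-in for `embeddings`, checks every row at once rather than a single one.

```python
import numpy as np

rng = np.random.RandomState(1)
raw = rng.uniform(-1.0, 1.0, size=(5, 4))   # toy stand-in for `embeddings`

# Same recipe as in the graph: divide each row by its Euclidean norm.
row_norms = np.sqrt(np.sum(np.square(raw), axis=1, keepdims=True))
normalized = raw / row_norms

# Every row now has squared norm 1 (up to floating-point error).
squared_norms = np.sum(np.square(normalized), axis=1)
print(np.allclose(squared_norms, 1.0))  # prints True
```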

Now we project the embedding vectors into a 2-dimensional space using [t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf).

We use the [scikit-learn implementation of t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

In [None]:
num_points = 20

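# Recent scikit-learn releases require perplexity < number of samples, so lower
# perplexity or raise num_points if fitting raises a ValueError.
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=500)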

two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

Let's visualize the result.

In [None]:
def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(15,15))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)