# Skip-gram word2vec

In this notebook, I'll lead you through using TensorFlow to implement the word2vec algorithm using the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy when dealing with things like machine translation.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)

## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient, you'll have one element set to 1 and the other 50,000 set to 0. The matrix multiplication going into the first hidden layer will have almost all of the resulting values be zero. This a huge waste of computation. 

![one-hot encodings](assets/one_hot_encoding.png)

To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding the index of the "on" input unit.

![lookup](assets/lookup_matrix.png)

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.

<img src='assets/tokenize_lookup.png' width=500>
 
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix as well.

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.



## Word2Vec

The word2vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as "black", "white", and "red" will have vectors near each other. There are two architectures for implementing word2vec, CBOW (Continuous Bag-Of-Words) and Skip-gram.

<img src="assets/word2vec_architectures.png" width="500">

In this implementation, we'll be using the skip-gram architecture because it performs better than CBOW. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.

First up, importing packages.

In [1]:
import time

import numpy as np
import tensorflow as tf

import utils

Load the [text8 dataset](http://mattmahoney.net/dc/textdata.html), a file of cleaned up Wikipedia articles from Matt Mahoney. The next cell will download the data set to the `data` folder. Then you can extract it and delete the archive file to save storage space.

In [2]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile

dataset_folder_path = 'data'
dataset_filename = 'text8.zip'
dataset_name = 'Text8 Dataset'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(dataset_filename):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc=dataset_name) as pbar:
        urlretrieve(
            'http://mattmahoney.net/dc/text8.zip',
            dataset_filename,
            pbar.hook)

if not isdir(dataset_folder_path):
    with zipfile.ZipFile(dataset_filename) as zip_ref:
        zip_ref.extractall(dataset_folder_path)
        
with open('data/text8') as f:
    text = f.read()

Text8 Dataset: 31.4MB [00:08, 3.88MB/s]                            


## Preprocessing

Here I'm fixing up the text to make training easier. This comes from the `utils` module I wrote. The `preprocess` function coverts any punctuation into tokens, so a period is changed to ` <PERIOD> `. In this data set, there aren't any periods, but it will help in other NLP problems. I'm also removing all words that show up five or fewer times in the dataset. This will greatly reduce issues due to noise in the data and improve the quality of the vector representations. If you want to write your own functions for this stuff, go for it.

In [3]:
words = utils.preprocess(text)
print(words[:30])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']


In [4]:
print("Total words: {}".format(len(words)))
print("Unique words: {}".format(len(set(words))))

Total words: 16680599
Unique words: 63641


And here I'm creating dictionaries to convert words to integers and backwards, integers to words. The integers are assigned in descending frequency order, so the most frequent word ("the") is given the integer 0 and the next most frequent is 1 and so on. The words are converted to integers and stored in the list `int_words`.

In [5]:
vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

## Subsampling

Words that show up often such as "the", "of", and "for" don't provide much context to the nearby words. If we discard some of them, we can remove some of the noise from our data and in return get faster training and better representations. This process is called subsampling by Mikolov. For each word $w_i$ in the training set, we'll discard it with probability given by 

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

I'm going to leave this up to you as an exercise. This is more of a programming challenge, than about deep learning specifically. But, being able to prepare your data for your network is an important skill to have. Check out my solution to see how I did it.

> **Exercise:** Implement subsampling for the words in `int_words`. That is, go through `int_words` and discard each word given the probablility $P(w_i)$ shown above. Note that $P(w_i)$ is the probability that a word is discarded. Assign the subsampled data to `train_words`.

In [6]:
## Your code here
import random
import numpy as np
from collections import Counter

threshold = 1e-5
word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {word: count / total_count for word, count in word_counts.items()}
word_pdrop = {word: np.sqrt(threshold/freq) for word, freq in word_freqs.items()}
train_words = [word for word in int_words if word_pdrop[word] < random.random()]
print(len(int_words), len(train_words))

16680599 12053690


## Making batches

Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to grab all the words in a window around that word, with size $C$. 

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $< 1; C >$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

> **Exercise:** Implement a function `get_target` that receives a list of words, an index, and a window size, then returns a list of words in the window around the index. Make sure to use the algorithm described above, where you choose a random number of words from the window.

In [7]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    
    # Your code here
    r = np.random.randint(1, window_size + 1)
    start = idx - r if (idx - r) > 0 else 0
    stop = idx + r
    target_words = set(words[start:idx] + words[idx + 1: stop + 1])
    
    return target_words

Here's a function that returns batches for our network. The idea is that it grabs `batch_size` words from a words list. Then for each of those words, it gets the target words in the window. I haven't found a way to pass in a random number of target words and get it to work with the architecture, so I make one row per input-target pair. This is a generator function by the way, helps save memory.

In [8]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
    

## Building the graph

From [Chris McCormick's blog](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), we can see the general structure of our network.
![embedding_network](./assets/skip_gram_net_arch.png)

The input words are passed in as integers. This will go into a hidden layer of linear units, then into a softmax layer. We'll use the softmax layer to make a prediction like normal.

The idea here is to train the hidden layer weight matrix to find efficient representations for our words. We can discard the softmax layer becuase we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in other networks we build from the dataset.

I'm going to have you build the graph in stages now. First off, creating the `inputs` and `labels` placeholders like normal.

> **Exercise:** Assign `inputs` and `labels` using `tf.placeholder`. We're going to be passing in integers, so set the data types to `tf.int32`. The batches we're passing in will have varying sizes, so set the batch sizes to [`None`]. To make things work later, you'll need to set the second dimension of `labels` to `None` or `1`.

In [9]:
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')

## Embedding



The embedding matrix has a size of the number of words by the number of units in the hidden layer. So, if you have 10,000 words and 300 hidden units, the matrix will have size $10,000 \times 300$. Remember that we're using tokenized data for our inputs, usually as integers, where the number of tokens is the number of words in our vocabulary.


> **Exercise:** Tensorflow provides a convenient function [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) that does this lookup for us. You pass in the embedding matrix and a tensor of integers, then it returns rows in the matrix corresponding to those integers. Below, set the number of embedding features you'll use (200 is a good start), create the embedding matrix variable, and use `tf.nn.embedding_lookup` to get the embedding tensors. For the embedding matrix, I suggest you initialize it with a uniform random numbers between -1 and 1 using [tf.random_uniform](https://www.tensorflow.org/api_docs/python/tf/random_uniform).

In [10]:
n_vocab = len(int_to_vocab)
n_embedding =  200 # Number of embedding features 
with train_graph.as_default():
    embedding = tf.random_uniform([n_vocab, n_embedding], minval=-1, maxval=1, dtype=tf.float32, name='embedding')
    embed = tf.nn.embedding_lookup(embedding, inputs) # use tf.nn.embedding_lookup to get the hidden layer output

## Negative sampling



For every example we give the network, we train it using the output from the softmax layer. That means for each input, we're making very small changes to millions of weights even though we only have one true example. This makes training the network very inefficient. We can approximate the loss from the softmax layer by only updating a small subset of all the weights at once. We'll update the weights for the correct label, but only a small number of incorrect labels. This is called ["negative sampling"](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). Tensorflow has a convenient function to do this, [`tf.nn.sampled_softmax_loss`](https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss).

> **Exercise:** Below, create weights and biases for the softmax layer. Then, use [`tf.nn.sampled_softmax_loss`](https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss) to calculate the loss. Be sure to read the documentation to figure out how it works.

In [11]:
# Number of negative labels to sample
n_sampled = 100
with train_graph.as_default():
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1)) # create softmax weight matrix here
    softmax_b = tf.Variable(tf.zeros(n_vocab)) # create softmax biases here
    
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, labels, embed, n_sampled, n_vocab)
    
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)

## Validation

This code is from Thushan Ganegedara's implementation. Here we're going to choose a few common words and few uncommon words. Then, we'll print out the closest words to them. It's a nice way to check that our embedding table is grouping together words with similar semantic meanings.

In [12]:
with train_graph.as_default():
    ## From Thushan Ganegedara's implementation
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100
    # pick 8 samples from (0,100) and (1000,1100) each ranges. lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, 
                               random.sample(range(1000,1000+valid_window), valid_size//2))

    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
    normalized_embedding = embedding / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embedding, valid_dataset)
    similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embedding))

In [13]:
# If the checkpoints directory doesn't exist:
!mkdir checkpoints

## Training

Below is the code to train the network. Every 100 batches it reports the training loss. Every 1000 batches, it'll print out the validation words.

In [14]:
epochs = 10
batch_size = 1000
window_size = 10

with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer())

    for e in range(1, epochs+1):
        batches = get_batches(train_words, batch_size, window_size)
        start = time.time()
        for x, y in batches:
            
            feed = {inputs: x,
                    labels: np.array(y)[:, None]}
            train_loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            
            loss += train_loss
            
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(loss/100),
                      "{:.4f} sec/batch".format((end-start)/100))
                loss = 0
                start = time.time()
            
            if iteration % 1000 == 0:
                ## From Thushan Ganegedara's implementation
                # note that this is expensive (~20% slowdown if computed every 500 steps)
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = int_to_vocab[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = int_to_vocab[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
            
            iteration += 1
    save_path = saver.save(sess, "checkpoints/text8.ckpt")
    embed_mat = sess.run(normalized_embedding)

Epoch 1/10 Iteration: 100 Avg. Training loss: 9.1221 0.0306 sec/batch
Epoch 1/10 Iteration: 200 Avg. Training loss: 9.0676 0.0241 sec/batch
Epoch 1/10 Iteration: 300 Avg. Training loss: 9.0406 0.0238 sec/batch
Epoch 1/10 Iteration: 400 Avg. Training loss: 9.0652 0.0236 sec/batch
Epoch 1/10 Iteration: 500 Avg. Training loss: 8.9473 0.0235 sec/batch
Epoch 1/10 Iteration: 600 Avg. Training loss: 8.9503 0.0234 sec/batch
Epoch 1/10 Iteration: 700 Avg. Training loss: 8.9307 0.0230 sec/batch
Epoch 1/10 Iteration: 800 Avg. Training loss: 8.8522 0.0231 sec/batch
Epoch 1/10 Iteration: 900 Avg. Training loss: 8.7550 0.0230 sec/batch
Epoch 1/10 Iteration: 1000 Avg. Training loss: 8.6917 0.0230 sec/batch
Nearest to new: nord, juror, arda, bakesperson, euratom, word, plp, insensitivity,
Nearest to about: ath, pixar, kubitzki, drastic, demonstrations, dionysius, archbishopric, harker,
Nearest to american: erasmus, brooklyn, emporia, texas, nutation, wan, maser, surprised,
Nearest to after: cherished,

Epoch 1/10 Iteration: 4100 Avg. Training loss: 6.8539 0.0241 sec/batch
Epoch 1/10 Iteration: 4200 Avg. Training loss: 6.7934 0.0231 sec/batch
Epoch 1/10 Iteration: 4300 Avg. Training loss: 6.7757 0.0231 sec/batch
Epoch 1/10 Iteration: 4400 Avg. Training loss: 6.7088 0.0231 sec/batch
Epoch 1/10 Iteration: 4500 Avg. Training loss: 6.7251 0.0229 sec/batch
Epoch 1/10 Iteration: 4600 Avg. Training loss: 6.6198 0.0231 sec/batch
Epoch 1/10 Iteration: 4700 Avg. Training loss: 6.6015 0.0230 sec/batch
Epoch 1/10 Iteration: 4800 Avg. Training loss: 6.5147 0.0233 sec/batch
Epoch 1/10 Iteration: 4900 Avg. Training loss: 6.4833 0.0231 sec/batch
Epoch 1/10 Iteration: 5000 Avg. Training loss: 6.4322 0.0231 sec/batch
Nearest to new: sheol, aerobatics, involve, hosea, versioning, coenzyme, scarred, vyasa,
Nearest to about: rows, amish, cleaveland, baddeck, tibetans, deceitful, rinzai, clay,
Nearest to american: excommunicated, deutscher, cultus, benthic, propel, kucinich, nie, rival,
Nearest to after: t

Epoch 1/10 Iteration: 8100 Avg. Training loss: 5.1880 0.0243 sec/batch
Epoch 1/10 Iteration: 8200 Avg. Training loss: 5.1609 0.0233 sec/batch
Epoch 1/10 Iteration: 8300 Avg. Training loss: 5.1214 0.0231 sec/batch
Epoch 1/10 Iteration: 8400 Avg. Training loss: 5.1648 0.0232 sec/batch
Epoch 1/10 Iteration: 8500 Avg. Training loss: 5.0940 0.0231 sec/batch
Epoch 1/10 Iteration: 8600 Avg. Training loss: 5.0744 0.0231 sec/batch
Epoch 1/10 Iteration: 8700 Avg. Training loss: 5.0739 0.0230 sec/batch
Epoch 1/10 Iteration: 8800 Avg. Training loss: 5.0384 0.0232 sec/batch
Epoch 1/10 Iteration: 8900 Avg. Training loss: 5.0450 0.0231 sec/batch
Epoch 1/10 Iteration: 9000 Avg. Training loss: 5.0189 0.0232 sec/batch
Nearest to new: liver, pendant, neurosurgery, propellers, revolved, tsunami, header, lewdness,
Nearest to about: katie, restated, nuit, walloon, quips, amistad, worms, ninja,
Nearest to american: lichtenberg, heflin, abaye, thrift, grace, merced, lxxviii, kombat,
Nearest to after: syntacti

Epoch 2/10 Iteration: 12100 Avg. Training loss: 4.7413 0.0116 sec/batch
Epoch 2/10 Iteration: 12200 Avg. Training loss: 4.7380 0.0232 sec/batch
Epoch 2/10 Iteration: 12300 Avg. Training loss: 4.7167 0.0233 sec/batch
Epoch 2/10 Iteration: 12400 Avg. Training loss: 4.6960 0.0231 sec/batch
Epoch 2/10 Iteration: 12500 Avg. Training loss: 4.7016 0.0232 sec/batch
Epoch 2/10 Iteration: 12600 Avg. Training loss: 4.6891 0.0231 sec/batch
Epoch 2/10 Iteration: 12700 Avg. Training loss: 4.6736 0.0230 sec/batch
Epoch 2/10 Iteration: 12800 Avg. Training loss: 4.6489 0.0230 sec/batch
Epoch 2/10 Iteration: 12900 Avg. Training loss: 4.6367 0.0229 sec/batch
Epoch 2/10 Iteration: 13000 Avg. Training loss: 4.6703 0.0230 sec/batch
Nearest to new: tomorrowland, crokinole, sweeter, expansions, hospital, thuringiensis, duras, franchises,
Nearest to about: bae, ambit, shams, prescience, jordanes, ives, spider, raiders,
Nearest to american: digested, mounting, scalars, discontents, primrose, farms, crystallize,

Epoch 2/10 Iteration: 16100 Avg. Training loss: 4.5661 0.0246 sec/batch
Epoch 2/10 Iteration: 16200 Avg. Training loss: 4.5759 0.0235 sec/batch
Epoch 2/10 Iteration: 16300 Avg. Training loss: 4.5705 0.0236 sec/batch
Epoch 2/10 Iteration: 16400 Avg. Training loss: 4.5624 0.0235 sec/batch
Epoch 2/10 Iteration: 16500 Avg. Training loss: 4.5660 0.0233 sec/batch
Epoch 2/10 Iteration: 16600 Avg. Training loss: 4.5693 0.0234 sec/batch
Epoch 2/10 Iteration: 16700 Avg. Training loss: 4.5695 0.0234 sec/batch
Epoch 2/10 Iteration: 16800 Avg. Training loss: 4.5487 0.0233 sec/batch
Epoch 2/10 Iteration: 16900 Avg. Training loss: 4.5597 0.0233 sec/batch
Epoch 2/10 Iteration: 17000 Avg. Training loss: 4.5403 0.0234 sec/batch
Nearest to new: itanium, monosaccharide, bioprospecting, detachable, rees, apportioned, simeon, predictably,
Nearest to about: triiodothyronine, cinema, clippers, asm, lessig, turbochargers, keeper, notaries,
Nearest to american: exegetes, ruach, steaks, severance, sleepers, nipp

Epoch 2/10 Iteration: 20100 Avg. Training loss: 4.4558 0.0244 sec/batch
Epoch 2/10 Iteration: 20200 Avg. Training loss: 4.4654 0.0234 sec/batch
Epoch 2/10 Iteration: 20300 Avg. Training loss: 4.4493 0.0234 sec/batch
Epoch 2/10 Iteration: 20400 Avg. Training loss: 4.4805 0.0232 sec/batch
Epoch 2/10 Iteration: 20500 Avg. Training loss: 4.4866 0.0235 sec/batch
Epoch 2/10 Iteration: 20600 Avg. Training loss: 4.4827 0.0236 sec/batch
Epoch 2/10 Iteration: 20700 Avg. Training loss: 4.4403 0.0233 sec/batch
Epoch 2/10 Iteration: 20800 Avg. Training loss: 4.4739 0.0234 sec/batch
Epoch 2/10 Iteration: 20900 Avg. Training loss: 4.4757 0.0235 sec/batch
Epoch 2/10 Iteration: 21000 Avg. Training loss: 4.4627 0.0236 sec/batch
Nearest to new: unfree, duisburg, entice, proteg, deccan, doe, quan, matic,
Nearest to about: sps, boundary, themas, sweets, cu, vivant, zionism, learner,
Nearest to american: may, bdp, fielded, lufthansa, kernow, bashing, smg, kinderhook,
Nearest to after: sven, stoneman, shakuh

Epoch 2/10 Iteration: 24100 Avg. Training loss: 4.4228 0.0245 sec/batch
Epoch 3/10 Iteration: 24200 Avg. Training loss: 4.4416 0.0226 sec/batch
Epoch 3/10 Iteration: 24300 Avg. Training loss: 4.4205 0.0234 sec/batch
Epoch 3/10 Iteration: 24400 Avg. Training loss: 4.4283 0.0233 sec/batch
Epoch 3/10 Iteration: 24500 Avg. Training loss: 4.3960 0.0234 sec/batch
Epoch 3/10 Iteration: 24600 Avg. Training loss: 4.4014 0.0233 sec/batch
Epoch 3/10 Iteration: 24700 Avg. Training loss: 4.3840 0.0233 sec/batch
Epoch 3/10 Iteration: 24800 Avg. Training loss: 4.3851 0.0230 sec/batch
Epoch 3/10 Iteration: 24900 Avg. Training loss: 4.3427 0.0231 sec/batch
Epoch 3/10 Iteration: 25000 Avg. Training loss: 4.3903 0.0233 sec/batch
Nearest to new: folic, sorenson, sudan, deicide, aise, poachers, carabineros, parra,
Nearest to about: casas, philippi, cleo, astringent, paine, sublayer, separately, sprint,
Nearest to american: sinhala, tripolitania, follower, instabilities, coexist, blanket, tomatoes, unsurpri

Epoch 3/10 Iteration: 28100 Avg. Training loss: 4.3842 0.0246 sec/batch
Epoch 3/10 Iteration: 28200 Avg. Training loss: 4.3785 0.0236 sec/batch
Epoch 3/10 Iteration: 28300 Avg. Training loss: 4.3702 0.0234 sec/batch
Epoch 3/10 Iteration: 28400 Avg. Training loss: 4.3883 0.0235 sec/batch
Epoch 3/10 Iteration: 28500 Avg. Training loss: 4.3815 0.0235 sec/batch
Epoch 3/10 Iteration: 28600 Avg. Training loss: 4.3807 0.0232 sec/batch
Epoch 3/10 Iteration: 28700 Avg. Training loss: 4.3816 0.0233 sec/batch
Epoch 3/10 Iteration: 28800 Avg. Training loss: 4.3726 0.0234 sec/batch
Epoch 3/10 Iteration: 28900 Avg. Training loss: 4.3750 0.0235 sec/batch
Epoch 3/10 Iteration: 29000 Avg. Training loss: 4.3641 0.0234 sec/batch
Nearest to new: deprogrammings, neurath, caver, snoopy, bomb, departs, airship, adrift,
Nearest to about: unsophisticated, omri, bauds, additive, argos, postumus, siggraph, comment,
Nearest to american: recantation, hypnotized, charity, remediation, venerate, catalog, prog, amuse

Epoch 3/10 Iteration: 32100 Avg. Training loss: 4.3361 0.0247 sec/batch
Epoch 3/10 Iteration: 32200 Avg. Training loss: 4.3309 0.0235 sec/batch
Epoch 3/10 Iteration: 32300 Avg. Training loss: 4.3289 0.0237 sec/batch
Epoch 3/10 Iteration: 32400 Avg. Training loss: 4.3402 0.0235 sec/batch
Epoch 3/10 Iteration: 32500 Avg. Training loss: 4.3504 0.0232 sec/batch
Epoch 3/10 Iteration: 32600 Avg. Training loss: 4.3520 0.0232 sec/batch
Epoch 3/10 Iteration: 32700 Avg. Training loss: 4.3634 0.0234 sec/batch
Epoch 3/10 Iteration: 32800 Avg. Training loss: 4.3310 0.0231 sec/batch
Epoch 3/10 Iteration: 32900 Avg. Training loss: 4.3482 0.0233 sec/batch
Epoch 3/10 Iteration: 33000 Avg. Training loss: 4.3440 0.0234 sec/batch
Nearest to new: beverage, birdman, attends, channelling, chording, yeni, releases, undirected,
Nearest to about: krew, santander, drowning, progs, deferred, prima, flynn, midtown,
Nearest to american: outcry, coy, craven, basements, down, reject, napo, cooperating,
Nearest to aft

Epoch 3/10 Iteration: 36100 Avg. Training loss: 4.3374 0.0245 sec/batch
Epoch 4/10 Iteration: 36200 Avg. Training loss: 4.3430 0.0103 sec/batch
Epoch 4/10 Iteration: 36300 Avg. Training loss: 4.3381 0.0233 sec/batch
Epoch 4/10 Iteration: 36400 Avg. Training loss: 4.3263 0.0232 sec/batch
Epoch 4/10 Iteration: 36500 Avg. Training loss: 4.3292 0.0231 sec/batch
Epoch 4/10 Iteration: 36600 Avg. Training loss: 4.3046 0.0231 sec/batch
Epoch 4/10 Iteration: 36700 Avg. Training loss: 4.3054 0.0232 sec/batch
Epoch 4/10 Iteration: 36800 Avg. Training loss: 4.2956 0.0232 sec/batch
Epoch 4/10 Iteration: 36900 Avg. Training loss: 4.2643 0.0231 sec/batch
Epoch 4/10 Iteration: 37000 Avg. Training loss: 4.2748 0.0232 sec/batch
Nearest to new: thoroughfares, tolbert, mazurek, metric, chariots, leapt, woodson, tauranac,
Nearest to about: ephesians, caspar, campana, insisted, goalposts, teleoperation, exhaled, contestants,
Nearest to american: radhakrishnan, convenes, lederman, archaelogical, letting, doc

Epoch 4/10 Iteration: 40100 Avg. Training loss: 4.3187 0.0244 sec/batch
Epoch 4/10 Iteration: 40200 Avg. Training loss: 4.2958 0.0234 sec/batch
Epoch 4/10 Iteration: 40300 Avg. Training loss: 4.3039 0.0233 sec/batch
Epoch 4/10 Iteration: 40400 Avg. Training loss: 4.3023 0.0233 sec/batch
Epoch 4/10 Iteration: 40500 Avg. Training loss: 4.3081 0.0233 sec/batch
Epoch 4/10 Iteration: 40600 Avg. Training loss: 4.3045 0.0231 sec/batch
Epoch 4/10 Iteration: 40700 Avg. Training loss: 4.3051 0.0230 sec/batch
Epoch 4/10 Iteration: 40800 Avg. Training loss: 4.3133 0.0230 sec/batch
Epoch 4/10 Iteration: 40900 Avg. Training loss: 4.3081 0.0232 sec/batch
Epoch 4/10 Iteration: 41000 Avg. Training loss: 4.3131 0.0231 sec/batch
Nearest to new: rose, janine, businesspeople, lawless, taxonomists, sealift, bondi, hotelier,
Nearest to about: warrington, acs, whitney, saxophones, succinic, potter, chianti, daimler,
Nearest to american: henrique, connally, assemblage, allophones, grossly, family, clashing, su

Epoch 4/10 Iteration: 44100 Avg. Training loss: 4.2890 0.0242 sec/batch
Epoch 4/10 Iteration: 44200 Avg. Training loss: 4.2850 0.0232 sec/batch
Epoch 4/10 Iteration: 44300 Avg. Training loss: 4.2773 0.0232 sec/batch
Epoch 4/10 Iteration: 44400 Avg. Training loss: 4.2807 0.0232 sec/batch
Epoch 4/10 Iteration: 44500 Avg. Training loss: 4.2802 0.0231 sec/batch
Epoch 4/10 Iteration: 44600 Avg. Training loss: 4.3043 0.0231 sec/batch
Epoch 4/10 Iteration: 44700 Avg. Training loss: 4.3148 0.0232 sec/batch
Epoch 4/10 Iteration: 44800 Avg. Training loss: 4.2736 0.0232 sec/batch
Epoch 4/10 Iteration: 44900 Avg. Training loss: 4.2940 0.0233 sec/batch
Epoch 4/10 Iteration: 45000 Avg. Training loss: 4.3037 0.0236 sec/batch
Nearest to new: kambojas, restrict, homoerotic, undiscovered, uta, kwanzaa, theocracy, singapore,
Nearest to about: lussac, trams, mayne, expounds, gunther, upload, gss, ringen,
Nearest to american: photographer, imitate, northside, hussite, mercader, trademark, paulus, stated,
N

Epoch 4/10 Iteration: 48100 Avg. Training loss: 4.2816 0.0244 sec/batch
Epoch 4/10 Iteration: 48200 Avg. Training loss: 4.2923 0.0234 sec/batch
Epoch 5/10 Iteration: 48300 Avg. Training loss: 4.2923 0.0213 sec/batch
Epoch 5/10 Iteration: 48400 Avg. Training loss: 4.2907 0.0233 sec/batch
Epoch 5/10 Iteration: 48500 Avg. Training loss: 4.2805 0.0234 sec/batch
Epoch 5/10 Iteration: 48600 Avg. Training loss: 4.2747 0.0234 sec/batch
Epoch 5/10 Iteration: 48700 Avg. Training loss: 4.2678 0.0234 sec/batch
Epoch 5/10 Iteration: 48800 Avg. Training loss: 4.2512 0.0232 sec/batch
Epoch 5/10 Iteration: 48900 Avg. Training loss: 4.2451 0.0232 sec/batch
Epoch 5/10 Iteration: 49000 Avg. Training loss: 4.2144 0.0233 sec/batch
Nearest to new: flagella, belgaum, pelayo, boucher, eek, steyr, tape, disobedience,
Nearest to about: relies, bundeswehr, cleyre, falcons, goofy, leaps, likes, warranty,
Nearest to american: residents, sojourns, rematch, rutles, consultants, bernard, altenberg, ulema,
Nearest to 

Epoch 5/10 Iteration: 52100 Avg. Training loss: 4.2663 0.0243 sec/batch
Epoch 5/10 Iteration: 52200 Avg. Training loss: 4.2770 0.0233 sec/batch
Epoch 5/10 Iteration: 52300 Avg. Training loss: 4.2617 0.0233 sec/batch
Epoch 5/10 Iteration: 52400 Avg. Training loss: 4.2645 0.0234 sec/batch
Epoch 5/10 Iteration: 52500 Avg. Training loss: 4.2667 0.0232 sec/batch
Epoch 5/10 Iteration: 52600 Avg. Training loss: 4.2712 0.0233 sec/batch
Epoch 5/10 Iteration: 52700 Avg. Training loss: 4.2724 0.0232 sec/batch
Epoch 5/10 Iteration: 52800 Avg. Training loss: 4.2592 0.0233 sec/batch
Epoch 5/10 Iteration: 52900 Avg. Training loss: 4.2733 0.0233 sec/batch
Epoch 5/10 Iteration: 53000 Avg. Training loss: 4.2684 0.0233 sec/batch
Nearest to new: bakr, convicts, destabilization, peru, relief, breaking, violate, katrina,
Nearest to about: bridgeport, munroe, webster, paltry, unmibh, patiently, amnesia, cum,
Nearest to american: anticoagulant, blame, insertion, motivations, cochrane, surprisingly, xcf, horos

Epoch 5/10 Iteration: 56100 Avg. Training loss: 4.2512 0.0242 sec/batch
Epoch 5/10 Iteration: 56200 Avg. Training loss: 4.2559 0.0232 sec/batch
Epoch 5/10 Iteration: 56300 Avg. Training loss: 4.2514 0.0232 sec/batch
Epoch 5/10 Iteration: 56400 Avg. Training loss: 4.2453 0.0233 sec/batch
Epoch 5/10 Iteration: 56500 Avg. Training loss: 4.2554 0.0234 sec/batch
Epoch 5/10 Iteration: 56600 Avg. Training loss: 4.2560 0.0232 sec/batch
Epoch 5/10 Iteration: 56700 Avg. Training loss: 4.2687 0.0232 sec/batch
Epoch 5/10 Iteration: 56800 Avg. Training loss: 4.2773 0.0234 sec/batch
Epoch 5/10 Iteration: 56900 Avg. Training loss: 4.2576 0.0231 sec/batch
Epoch 5/10 Iteration: 57000 Avg. Training loss: 4.2590 0.0232 sec/batch
Nearest to new: washes, rabinowitz, deepwater, krey, shelf, lifelong, preached, chydenius,
Nearest to about: geophysical, germs, patrolled, pressures, flavors, melons, fontvieille, playstation,
Nearest to american: fontaine, tablets, ravages, sabbats, costanza, lignite, jukebox, 

Epoch 5/10 Iteration: 60100 Avg. Training loss: 4.2433 0.0242 sec/batch
Epoch 5/10 Iteration: 60200 Avg. Training loss: 4.2592 0.0233 sec/batch
Epoch 6/10 Iteration: 60300 Avg. Training loss: 4.2715 0.0088 sec/batch
Epoch 6/10 Iteration: 60400 Avg. Training loss: 4.2647 0.0231 sec/batch
Epoch 6/10 Iteration: 60500 Avg. Training loss: 4.2516 0.0231 sec/batch
Epoch 6/10 Iteration: 60600 Avg. Training loss: 4.2649 0.0233 sec/batch
Epoch 6/10 Iteration: 60700 Avg. Training loss: 4.2426 0.0231 sec/batch
Epoch 6/10 Iteration: 60800 Avg. Training loss: 4.2454 0.0232 sec/batch
Epoch 6/10 Iteration: 60900 Avg. Training loss: 4.2191 0.0230 sec/batch
Epoch 6/10 Iteration: 61000 Avg. Training loss: 4.2054 0.0230 sec/batch
Nearest to new: jaundice, dread, believer, manganese, lg, eruptions, butte, raman,
Nearest to about: corrupts, transylvanian, horse, foamy, thickets, payoffs, millar, plautdietsch,
Nearest to american: astatine, lew, sickness, saddled, elgin, coached, petronas, abgar,
Nearest to 

Epoch 6/10 Iteration: 64100 Avg. Training loss: 4.2460 0.0244 sec/batch
Epoch 6/10 Iteration: 64200 Avg. Training loss: 4.2505 0.0232 sec/batch
Epoch 6/10 Iteration: 64300 Avg. Training loss: 4.2359 0.0234 sec/batch
Epoch 6/10 Iteration: 64400 Avg. Training loss: 4.2376 0.0234 sec/batch
Epoch 6/10 Iteration: 64500 Avg. Training loss: 4.2503 0.0233 sec/batch
Epoch 6/10 Iteration: 64600 Avg. Training loss: 4.2537 0.0234 sec/batch
Epoch 6/10 Iteration: 64700 Avg. Training loss: 4.2554 0.0233 sec/batch
Epoch 6/10 Iteration: 64800 Avg. Training loss: 4.2484 0.0232 sec/batch
Epoch 6/10 Iteration: 64900 Avg. Training loss: 4.2651 0.0230 sec/batch
Epoch 6/10 Iteration: 65000 Avg. Training loss: 4.2467 0.0230 sec/batch
Nearest to new: chalcedonian, drinker, ceefax, ulrich, codification, provides, meun, sylvian,
Nearest to about: fatima, thanking, cram, deliverance, maricopa, incheon, capone, clans,
Nearest to american: kauravas, bubonic, channeled, genius, sympathizer, cofinality, gamma, bogart

Epoch 6/10 Iteration: 68100 Avg. Training loss: 4.2251 0.0241 sec/batch
Epoch 6/10 Iteration: 68200 Avg. Training loss: 4.2404 0.0231 sec/batch
Epoch 6/10 Iteration: 68300 Avg. Training loss: 4.2457 0.0231 sec/batch
Epoch 6/10 Iteration: 68400 Avg. Training loss: 4.2265 0.0231 sec/batch
Epoch 6/10 Iteration: 68500 Avg. Training loss: 4.2253 0.0232 sec/batch
Epoch 6/10 Iteration: 68600 Avg. Training loss: 4.2408 0.0232 sec/batch
Epoch 6/10 Iteration: 68700 Avg. Training loss: 4.2456 0.0231 sec/batch
Epoch 6/10 Iteration: 68800 Avg. Training loss: 4.2571 0.0232 sec/batch
Epoch 6/10 Iteration: 68900 Avg. Training loss: 4.2322 0.0232 sec/batch
Epoch 6/10 Iteration: 69000 Avg. Training loss: 4.2493 0.0232 sec/batch
Nearest to new: doer, heidi, booth, perjury, fong, amphetamines, marathas, thuringian,
Nearest to about: snoopy, dodecanese, knob, tegucigalpa, spouts, resolving, muller, auks,
Nearest to american: outkast, extensional, bangla, plausibly, precipitation, agony, banjos, literalism,

Epoch 6/10 Iteration: 72100 Avg. Training loss: 4.2112 0.0244 sec/batch
Epoch 6/10 Iteration: 72200 Avg. Training loss: 4.2381 0.0233 sec/batch
Epoch 6/10 Iteration: 72300 Avg. Training loss: 4.2466 0.0236 sec/batch
Epoch 7/10 Iteration: 72400 Avg. Training loss: 4.2514 0.0199 sec/batch
Epoch 7/10 Iteration: 72500 Avg. Training loss: 4.2424 0.0232 sec/batch
Epoch 7/10 Iteration: 72600 Avg. Training loss: 4.2465 0.0234 sec/batch
Epoch 7/10 Iteration: 72700 Avg. Training loss: 4.2321 0.0233 sec/batch
Epoch 7/10 Iteration: 72800 Avg. Training loss: 4.2227 0.0232 sec/batch
Epoch 7/10 Iteration: 72900 Avg. Training loss: 4.2213 0.0233 sec/batch
Epoch 7/10 Iteration: 73000 Avg. Training loss: 4.2045 0.0232 sec/batch
Nearest to new: pilkington, vuk, insurer, emitting, golden, conceptions, villein, fibres,
Nearest to about: meun, loc, adjoint, foreleg, philosophicus, mannered, hrer, holdover,
Nearest to american: questioner, figurines, marlins, boomerangs, egoism, track, intransitive, anchorit

Epoch 7/10 Iteration: 76100 Avg. Training loss: 4.2432 0.0245 sec/batch
Epoch 7/10 Iteration: 76200 Avg. Training loss: 4.2278 0.0234 sec/batch
Epoch 7/10 Iteration: 76300 Avg. Training loss: 4.2388 0.0231 sec/batch
Epoch 7/10 Iteration: 76400 Avg. Training loss: 4.2211 0.0232 sec/batch
Epoch 7/10 Iteration: 76500 Avg. Training loss: 4.2323 0.0233 sec/batch
Epoch 7/10 Iteration: 76600 Avg. Training loss: 4.2202 0.0232 sec/batch
Epoch 7/10 Iteration: 76700 Avg. Training loss: 4.2332 0.0231 sec/batch
Epoch 7/10 Iteration: 76800 Avg. Training loss: 4.2509 0.0231 sec/batch
Epoch 7/10 Iteration: 76900 Avg. Training loss: 4.2424 0.0231 sec/batch
Epoch 7/10 Iteration: 77000 Avg. Training loss: 4.2394 0.0230 sec/batch
Nearest to new: prosecutes, cheshire, naqada, colonna, ombudsman, arteriovenous, fptp, coupland,
Nearest to about: haredi, quantifier, caer, ashmole, standout, knows, vocation, apollo,
Nearest to american: esquire, bikes, explored, diverted, honoria, shipman, restarting, ramoth,


Epoch 7/10 Iteration: 80100 Avg. Training loss: 4.2236 0.0242 sec/batch
Epoch 7/10 Iteration: 80200 Avg. Training loss: 4.2130 0.0231 sec/batch
Epoch 7/10 Iteration: 80300 Avg. Training loss: 4.2174 0.0232 sec/batch
Epoch 7/10 Iteration: 80400 Avg. Training loss: 4.2125 0.0232 sec/batch
Epoch 7/10 Iteration: 80500 Avg. Training loss: 4.2275 0.0233 sec/batch
Epoch 7/10 Iteration: 80600 Avg. Training loss: 4.2165 0.0234 sec/batch
Epoch 7/10 Iteration: 80700 Avg. Training loss: 4.2381 0.0231 sec/batch
Epoch 7/10 Iteration: 80800 Avg. Training loss: 4.2399 0.0233 sec/batch
Epoch 7/10 Iteration: 80900 Avg. Training loss: 4.2444 0.0233 sec/batch
Epoch 7/10 Iteration: 81000 Avg. Training loss: 4.2215 0.0231 sec/batch
Nearest to new: netting, tequila, glu, chad, brawl, gelatin, dispelled, middle,
Nearest to about: cantona, beginning, afterward, remade, frustrated, platonist, poulsen, jaspers,
Nearest to american: shipwrecks, conspicuous, overwhelms, cohomology, manukau, metallurgical, playwrig

Epoch 7/10 Iteration: 84100 Avg. Training loss: 4.1885 0.0240 sec/batch
Epoch 7/10 Iteration: 84200 Avg. Training loss: 4.2080 0.0231 sec/batch
Epoch 7/10 Iteration: 84300 Avg. Training loss: 4.2251 0.0233 sec/batch
Epoch 8/10 Iteration: 84400 Avg. Training loss: 4.2464 0.0074 sec/batch
Epoch 8/10 Iteration: 84500 Avg. Training loss: 4.2294 0.0232 sec/batch
Epoch 8/10 Iteration: 84600 Avg. Training loss: 4.2291 0.0229 sec/batch
Epoch 8/10 Iteration: 84700 Avg. Training loss: 4.2338 0.0233 sec/batch
Epoch 8/10 Iteration: 84800 Avg. Training loss: 4.2211 0.0231 sec/batch
Epoch 8/10 Iteration: 84900 Avg. Training loss: 4.2176 0.0230 sec/batch
Epoch 8/10 Iteration: 85000 Avg. Training loss: 4.2021 0.0231 sec/batch
Nearest to new: selwyn, characterizes, methanol, deepening, unsecured, sei, inefficiencies, undercarriage,
Nearest to about: scipione, iuds, generated, maturing, colle, sandford, philosophie, obituaries,
Nearest to american: outcasts, resartus, theroux, bevan, theora, wynette, ad

Epoch 8/10 Iteration: 88100 Avg. Training loss: 4.2328 0.0243 sec/batch
Epoch 8/10 Iteration: 88200 Avg. Training loss: 4.2177 0.0231 sec/batch
Epoch 8/10 Iteration: 88300 Avg. Training loss: 4.2318 0.0233 sec/batch
Epoch 8/10 Iteration: 88400 Avg. Training loss: 4.2148 0.0234 sec/batch
Epoch 8/10 Iteration: 88500 Avg. Training loss: 4.2226 0.0233 sec/batch
Epoch 8/10 Iteration: 88600 Avg. Training loss: 4.2043 0.0234 sec/batch
Epoch 8/10 Iteration: 88700 Avg. Training loss: 4.2224 0.0232 sec/batch
Epoch 8/10 Iteration: 88800 Avg. Training loss: 4.2344 0.0232 sec/batch
Epoch 8/10 Iteration: 88900 Avg. Training loss: 4.2272 0.0232 sec/batch
Epoch 8/10 Iteration: 89000 Avg. Training loss: 4.2244 0.0233 sec/batch
Nearest to new: tann, playboy, frequented, boty, feat, selene, faults, foil,
Nearest to about: elliptic, footed, seamonkey, skirmishing, jobs, armed, shrieks, biscuit,
Nearest to american: benno, px, troy, utilised, rosenbloom, chaldeans, longshanks, systran,
Nearest to after: ho

Epoch 8/10 Iteration: 92100 Avg. Training loss: 4.2109 0.0243 sec/batch
Epoch 8/10 Iteration: 92200 Avg. Training loss: 4.2093 0.0231 sec/batch
Epoch 8/10 Iteration: 92300 Avg. Training loss: 4.2264 0.0232 sec/batch
Epoch 8/10 Iteration: 92400 Avg. Training loss: 4.2009 0.0233 sec/batch
Epoch 8/10 Iteration: 92500 Avg. Training loss: 4.2075 0.0231 sec/batch
Epoch 8/10 Iteration: 92600 Avg. Training loss: 4.2074 0.0232 sec/batch
Epoch 8/10 Iteration: 92700 Avg. Training loss: 4.2194 0.0232 sec/batch
Epoch 8/10 Iteration: 92800 Avg. Training loss: 4.2207 0.0231 sec/batch
Epoch 8/10 Iteration: 92900 Avg. Training loss: 4.2258 0.0233 sec/batch
Epoch 8/10 Iteration: 93000 Avg. Training loss: 4.2132 0.0232 sec/batch
Nearest to new: coleridge, perkin, brokerage, tallis, photosynthesis, picnic, hecker, familiaris,
Nearest to about: angelo, anglicans, relatedness, headstone, viz, resilient, quoting, cpr,
Nearest to american: multilevel, kakatiya, mandela, terceira, disloyalty, rescind, packages

Epoch 8/10 Iteration: 96100 Avg. Training loss: 4.1959 0.0242 sec/batch
Epoch 8/10 Iteration: 96200 Avg. Training loss: 4.2050 0.0232 sec/batch
Epoch 8/10 Iteration: 96300 Avg. Training loss: 4.2163 0.0233 sec/batch
Epoch 8/10 Iteration: 96400 Avg. Training loss: 4.2315 0.0234 sec/batch
Epoch 9/10 Iteration: 96500 Avg. Training loss: 4.2209 0.0185 sec/batch
Epoch 9/10 Iteration: 96600 Avg. Training loss: 4.2251 0.0232 sec/batch
Epoch 9/10 Iteration: 96700 Avg. Training loss: 4.2148 0.0234 sec/batch
Epoch 9/10 Iteration: 96800 Avg. Training loss: 4.2198 0.0232 sec/batch
Epoch 9/10 Iteration: 96900 Avg. Training loss: 4.2023 0.0231 sec/batch
Epoch 9/10 Iteration: 97000 Avg. Training loss: 4.2008 0.0230 sec/batch
Nearest to new: enz, wondering, multilevel, tangential, stitched, voyagers, observation, lecturer,
Nearest to about: rocketry, govt, epp, adhered, aquino, sexual, permissions, alvarez,
Nearest to american: affectation, trismegistus, strictest, trot, shivers, spurs, redesign, heck

Epoch 9/10 Iteration: 100100 Avg. Training loss: 4.2046 0.0241 sec/batch
Epoch 9/10 Iteration: 100200 Avg. Training loss: 4.2362 0.0232 sec/batch
Epoch 9/10 Iteration: 100300 Avg. Training loss: 4.2108 0.0232 sec/batch
Epoch 9/10 Iteration: 100400 Avg. Training loss: 4.2238 0.0233 sec/batch
Epoch 9/10 Iteration: 100500 Avg. Training loss: 4.1949 0.0232 sec/batch
Epoch 9/10 Iteration: 100600 Avg. Training loss: 4.2165 0.0233 sec/batch
Epoch 9/10 Iteration: 100700 Avg. Training loss: 4.2185 0.0233 sec/batch
Epoch 9/10 Iteration: 100800 Avg. Training loss: 4.2194 0.0234 sec/batch
Epoch 9/10 Iteration: 100900 Avg. Training loss: 4.2245 0.0231 sec/batch
Epoch 9/10 Iteration: 101000 Avg. Training loss: 4.2165 0.0230 sec/batch
Nearest to new: except, eukaryotic, sv, flippers, berkshire, vanderbilt, kio, branch,
Nearest to about: thief, eyes, oicw, tente, leda, ramifications, hug, funchal,
Nearest to american: unconditional, klaviertrio, antrim, converged, bookmakers, tema, kuru, tocqueville,


Epoch 9/10 Iteration: 104100 Avg. Training loss: 4.2043 0.0244 sec/batch
Epoch 9/10 Iteration: 104200 Avg. Training loss: 4.2044 0.0230 sec/batch
Epoch 9/10 Iteration: 104300 Avg. Training loss: 4.2107 0.0231 sec/batch
Epoch 9/10 Iteration: 104400 Avg. Training loss: 4.2088 0.0233 sec/batch
Epoch 9/10 Iteration: 104500 Avg. Training loss: 4.1918 0.0231 sec/batch
Epoch 9/10 Iteration: 104600 Avg. Training loss: 4.2136 0.0232 sec/batch
Epoch 9/10 Iteration: 104700 Avg. Training loss: 4.2106 0.0233 sec/batch
Epoch 9/10 Iteration: 104800 Avg. Training loss: 4.2074 0.0234 sec/batch
Epoch 9/10 Iteration: 104900 Avg. Training loss: 4.2178 0.0233 sec/batch
Epoch 9/10 Iteration: 105000 Avg. Training loss: 4.2207 0.0235 sec/batch
Nearest to new: clear, colonised, lifestyles, shame, suppression, x, inadvertently, granites,
Nearest to about: commissary, reinforcement, reprints, aquifers, allori, reddish, sdi, ellsworth,
Nearest to american: schwarzwald, ebook, bultmann, displacements, lingual, dem

Epoch 9/10 Iteration: 108100 Avg. Training loss: 4.1968 0.0242 sec/batch
Epoch 9/10 Iteration: 108200 Avg. Training loss: 4.1735 0.0229 sec/batch
Epoch 9/10 Iteration: 108300 Avg. Training loss: 4.2074 0.0232 sec/batch
Epoch 9/10 Iteration: 108400 Avg. Training loss: 4.2088 0.0234 sec/batch
Epoch 10/10 Iteration: 108500 Avg. Training loss: 4.2285 0.0061 sec/batch
Epoch 10/10 Iteration: 108600 Avg. Training loss: 4.2203 0.0232 sec/batch
Epoch 10/10 Iteration: 108700 Avg. Training loss: 4.2103 0.0233 sec/batch
Epoch 10/10 Iteration: 108800 Avg. Training loss: 4.2144 0.0234 sec/batch
Epoch 10/10 Iteration: 108900 Avg. Training loss: 4.1988 0.0232 sec/batch
Epoch 10/10 Iteration: 109000 Avg. Training loss: 4.1864 0.0232 sec/batch
Nearest to new: tub, goebbels, worry, khanty, einen, cornelius, bluish, capitalisation,
Nearest to about: staring, seeding, persecute, climaxed, illegitimate, ludwik, estimated, dunk,
Nearest to american: rabinowitz, paddy, contagious, dadaists, landsmannschaft, t

Epoch 10/10 Iteration: 112100 Avg. Training loss: 4.2418 0.0242 sec/batch
Epoch 10/10 Iteration: 112200 Avg. Training loss: 4.2205 0.0232 sec/batch
Epoch 10/10 Iteration: 112300 Avg. Training loss: 4.2029 0.0233 sec/batch
Epoch 10/10 Iteration: 112400 Avg. Training loss: 4.2139 0.0231 sec/batch
Epoch 10/10 Iteration: 112500 Avg. Training loss: 4.2041 0.0233 sec/batch
Epoch 10/10 Iteration: 112600 Avg. Training loss: 4.1997 0.0231 sec/batch
Epoch 10/10 Iteration: 112700 Avg. Training loss: 4.2088 0.0233 sec/batch
Epoch 10/10 Iteration: 112800 Avg. Training loss: 4.2034 0.0235 sec/batch
Epoch 10/10 Iteration: 112900 Avg. Training loss: 4.2104 0.0233 sec/batch
Epoch 10/10 Iteration: 113000 Avg. Training loss: 4.2173 0.0231 sec/batch
Nearest to new: hooked, conurbation, buys, falsifiable, immortalised, cortona, soured, yamaha,
Nearest to about: kilobytes, sovereignly, councillor, moody, overtaking, sporadically, supremes, euclidean,
Nearest to american: videoconferencing, objected, traveli

Epoch 10/10 Iteration: 116100 Avg. Training loss: 4.2075 0.0243 sec/batch
Epoch 10/10 Iteration: 116200 Avg. Training loss: 4.2056 0.0234 sec/batch
Epoch 10/10 Iteration: 116300 Avg. Training loss: 4.1854 0.0233 sec/batch
Epoch 10/10 Iteration: 116400 Avg. Training loss: 4.2077 0.0233 sec/batch
Epoch 10/10 Iteration: 116500 Avg. Training loss: 4.2020 0.0233 sec/batch
Epoch 10/10 Iteration: 116600 Avg. Training loss: 4.1925 0.0234 sec/batch
Epoch 10/10 Iteration: 116700 Avg. Training loss: 4.2141 0.0233 sec/batch
Epoch 10/10 Iteration: 116800 Avg. Training loss: 4.2111 0.0233 sec/batch
Epoch 10/10 Iteration: 116900 Avg. Training loss: 4.2086 0.0233 sec/batch
Epoch 10/10 Iteration: 117000 Avg. Training loss: 4.2068 0.0232 sec/batch
Nearest to new: lavos, aneristic, suited, horeb, ethiopians, jagan, dwelt, doubles,
Nearest to about: deacon, daewoo, compensatory, suda, paradis, pershing, magnates, teleportation,
Nearest to american: limes, disbarment, brahms, infusions, planner, samhain, f

Epoch 10/10 Iteration: 120100 Avg. Training loss: 4.2033 0.0243 sec/batch
Epoch 10/10 Iteration: 120200 Avg. Training loss: 4.1936 0.0232 sec/batch
Epoch 10/10 Iteration: 120300 Avg. Training loss: 4.1937 0.0232 sec/batch
Epoch 10/10 Iteration: 120400 Avg. Training loss: 4.2059 0.0234 sec/batch
Epoch 10/10 Iteration: 120500 Avg. Training loss: 4.2209 0.0234 sec/batch


Restore the trained network if you need to:

In [None]:
with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    embed_mat = sess.run(embedding)

## Visualizing the word vectors

Below we'll use T-SNE to visualize how our high-dimensional word vectors cluster together. T-SNE is used to project these vectors into two dimensions while preserving local stucture. Check out [this post from Christopher Olah](http://colah.github.io/posts/2014-10-Visualizing-MNIST/) to learn more about T-SNE and other ways to visualize high-dimensional data.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
viz_words = 500
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])

In [None]:
fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)