# Lecture 4: Structure Your TensorFlow Model
<a href="http://web.stanford.edu/class/cs20si/lectures/notes_04.pdf">CS20 SI Lecture 4 Notes</a><br>
<a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">McCormickML Word2vec Tutorial</a>

## Don't run the entire notebook! Just run one of the two models, or you'll get errors

In this lecture, we will try to build word2vec, a skip-gram model.

Skip-gram vs. CBOW (Continuous Bag-of-Words)
Algorithmically, these models are similar, except that CBOW predicts the center word from its context words, while skip-gram does the inverse and predicts the context words from the center word. For example, given the sentence "The quick brown fox jumps", CBOW tries to predict "brown" from "the", "quick", "fox", and "jumps", while skip-gram tries to predict "the", "quick", "fox", and "jumps" from "brown".

Statistically, CBOW smooths over a lot of the distributional information by treating an entire context as one observation. For the most part, this turns out to be useful for smaller datasets. Skip-gram, in contrast, treats each context-target pair as a new observation, which tends to do better on larger datasets.
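To make the difference concrete, here is a minimal sketch of how training examples could be generated from the example sentence. The function names `skipgram_pairs` and `cbow_examples` and the window size of 2 are illustrative assumptions, not part of the lecture code.

```python
def skipgram_pairs(tokens, window):
    """Return one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox jumps".split()

# Skip-gram: "brown" contributes four separate observations.
print([pair for pair in skipgram_pairs(tokens, window=2) if pair[0] == "brown"])

# CBOW: the whole context of "brown" is a single observation.
print(cbow_examples(tokens, window=2)[2])
```

Skip-gram produces many more observations from the same text, which is one way to read the claim that it benefits from larger datasets.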

<img src="http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" style="width:500px;height:300px;"></img>

We want 300 features, so the hidden layer has 300 neurons. With a vocabulary of 10000 words, the hidden-layer weight matrix has 10000 rows, and each row is the word vector (of length 300) for one word.

<img src="http://mccormickml.com/assets/word2vec/word2vec_weight_matrix_lookup_table.png" style="width:500px;height:300px;"></img>
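The lookup-table view can be checked numerically. Below is a minimal NumPy sketch, assuming a tiny 10 x 4 matrix in place of the 10000 x 300 one in the figures; the sizes and variable names are illustrative only.

```python
import numpy as np

VOCAB, FEATURES = 10, 4  # tiny stand-ins for 10000 words x 300 features

rng = np.random.default_rng(0)
weights = rng.uniform(-1.0, 1.0, size=(VOCAB, FEATURES))

word_index = 7
one_hot = np.zeros(VOCAB)
one_hot[word_index] = 1.0

via_matmul = one_hot @ weights    # full vector-matrix product
via_lookup = weights[word_index]  # plain row lookup, no multiplications

print(np.allclose(via_matmul, via_lookup))
```

Multiplying by a one-hot vector only ever selects one row, so indexing the row directly gives the same vector while skipping all the multiplications by zero.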

In CS 224N, we learned about two training methods: hierarchical softmax and negative sampling. We rule out the full softmax because its normalization factor is too computationally expensive; the students in CS 224N implemented the skip-gram model with negative sampling.

Negative sampling belongs to a family of sampling-based approaches that also includes importance sampling and target sampling. Negative sampling is a simplified form of Noise Contrastive Estimation (NCE): it makes certain assumptions about the number of noise samples to generate (k) and the distribution of noise samples (Q), namely that kQ(w) = 1, to simplify computation.
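The relationship between the two losses can be sketched for a single training pair. This is a simplified illustration, not the TensorFlow implementation: it assumes the model gives each word an unnormalized score s(w), and that NCE classifies "data vs. noise" with the logit s(w) - log(kQ(w)). The function names are hypothetical. When kQ(w) = 1 the correction term is log(1) = 0, and the NCE loss reduces to the negative sampling loss.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(true_score, true_q, noise, k):
    """NCE loss for one pair; `noise` is a list of k (score, Q(w)) tuples."""
    loss = -log_sigmoid(true_score - math.log(k * true_q))
    for score, q in noise:
        loss -= log_sigmoid(-(score - math.log(k * q)))
    return loss


def negative_sampling_loss(true_score, noise_scores):
    """Same binary classification loss, without the log(kQ(w)) correction."""
    loss = -log_sigmoid(true_score)
    for score in noise_scores:
        loss -= log_sigmoid(-score)
    return loss


k = 4
# Every word has Q(w) = 0.25, so k * Q(w) = 1 for all of them.
noise = [(0.3, 0.25), (-1.2, 0.25), (0.8, 0.25), (-0.1, 0.25)]

print(nce_loss(2.0, 0.25, noise, k))
print(negative_sampling_loss(2.0, [score for score, _ in noise]))
```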

While negative sampling is useful for learning word embeddings, it doesn't have the theoretical guarantee that its derivative tends towards the gradient of the softmax function, which makes it not so useful for language modelling. 

NCE does have this guarantee: as the number of noise samples increases, its gradient approaches the gradient of the softmax. Mnih and Teh (2012) reported that 25 noise samples are sufficient to match the performance of the regular softmax, with an expected speed-up factor of about 45.

In this example, we will be using NCE because of its nice theoretical guarantee. Note that sampling-based approaches are only useful at training time - during inference, the full softmax still needs to be computed to obtain a normalized probability.
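For reference, a minimal sketch of the full softmax that inference still requires. The five scores are made-up values; in the code later in these notes the vocabulary has 50000 words, so the normalization sum has 50000 terms per prediction.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities over the whole vocabulary."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)   # normalization factor: one term per vocabulary word
    return [e / total for e in exps]


scores = [2.0, 0.5, -1.0, 0.0, 1.5]  # one score per word in a 5-word vocabulary
probs = softmax(scores)
print(probs)
```

Sampling-based losses avoid this sum during training by scoring only the true word and a handful of noise words.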

**About the dataset**
text8 is the first 100 MB of cleaned text of the English Wikipedia dump on Mar. 3, 2006. It is not enough to train really good word embeddings; for better results, use the fil9 dataset, the first 10^9 bytes of the Wikipedia dump, as described on Matt Mahoney's website.
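The code below imports a `process_data` helper that is not shown in these notes. As a rough idea of what such preprocessing could involve, here is a hypothetical sketch: it builds a vocabulary of the most frequent words, maps everything else to an `UNK` index, and yields batches of center and target indices. The names `build_vocab` and `batch_generator`, the choice of index 0 for `UNK`, and the fixed (non-random) window are all assumptions for illustration, not the actual helper.

```python
from collections import Counter


def build_vocab(words, vocab_size):
    """Map the most frequent words to indices 1..vocab_size-1; 0 is UNK."""
    index = {"UNK": 0}
    for word, _ in Counter(words).most_common(vocab_size - 1):
        index[word] = len(index)
    return index


def batch_generator(indices, batch_size, skip_window):
    """Yield (centers, targets) forever; targets has shape [batch_size, 1]."""
    centers, targets = [], []
    while True:
        for i, center in enumerate(indices):
            lo = max(0, i - skip_window)
            hi = min(len(indices), i + skip_window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                centers.append(center)
                targets.append([indices[j]])
                if len(centers) == batch_size:
                    yield centers, targets
                    centers, targets = [], []


# A toy corpus standing in for the text8 word stream.
words = "the quick brown fox jumps over the lazy dog".split()
vocab = build_vocab(words, vocab_size=5)
indices = [vocab.get(word, 0) for word in words]

batch_gen = batch_generator(indices, batch_size=4, skip_window=1)
centers, targets = next(batch_gen)
print(centers, targets)
```

The nested `[index]` shape of each target mirrors the `[BATCH_SIZE, 1]` shape of the `target_words` placeholder used below.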

**Interface: How to structure your TensorFlow model**
We've only built two models so far, and they have more or less the same structure:

Phase 1: assemble your graph
1. Define placeholders for input and output
2. Define the weights (variable)
3. Define the inference model
4. Define loss function
5. Define optimizer

Phase 2: execute the computation (training your model)
1. Initialize all model variables for the first time.
2. Feed in the training data. Might involve randomizing the order of the data samples. 
3. Execute the inference model on the training data, computing the output for each training example with the current model parameters.
4. Compute the cost
5. Adjust the model parameters to minimize/maximize the cost depending on the model.
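
The Phase 2 steps above can be sketched without TensorFlow. The toy model below (a one-variable linear fit trained with stochastic gradient descent) and all of its names are assumptions chosen for illustration; in TensorFlow, steps 3 to 5 happen inside a single `sess.run` call on the optimizer op.

```python
import random


def train_linear(data, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    # 1. Initialize all model variables.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # 2. Feed in the training data, in random order.
        samples = data[:]
        rng.shuffle(samples)
        for x, y in samples:
            # 3. Run inference with the current parameters.
            prediction = w * x + b
            # 4. Compute the cost (squared error); only its gradient is needed.
            error = prediction - y
            # 5. Adjust the parameters to reduce the cost.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return w, b


# Noise-free samples of y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = train_linear(data)
print(w, b)
```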

Let's apply these steps to create our word2vec skip-gram model.

## No-frills word2vec skip-gram

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
import math

from process_data import process_data

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64    # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 10000
SKIP_STEP = 2000 # how many steps to skip before reporting the loss

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    
    # Step 1: Define the placeholders for input and output
    # center_words must be integers for the embedding lookup to work
    # Input: center word (a vocabulary index rather than a one-hot vector, e.g. 234)
    # Output: target word
    with tf.name_scope("data"):
        center_words = tf.placeholder(tf.int32, shape = [BATCH_SIZE],
                                      name = 'center_words')
        target_words = tf.placeholder(tf.int32, shape = [BATCH_SIZE, 1],
                                      name = 'target_words')

    # Step 2: Define weights. In word2vec, it's actually the weights that 
    # we care about
    # If one word is represented with a vector of size EMBED_SIZE, then the
    # embedding matrix will have shape [VOCAB_SIZE, EMBED_SIZE]
    # Initialized to random uniform -1 to 1
    with tf.device('/cpu:0'):
        with tf.name_scope("embed"):
            embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, 
                                                          EMBED_SIZE], 
                                                         -1.0, 1.0), 
                                       name = 'embed_matrix')

        # Step 3: Define the inference
        # Our goal is to get the vector representations of words in our dictionary.
        # Each row of the embedding matrix corresponds to the vector representation
        # of the word at that index. So to get the representation of all the center
        # words in the batch, we get the slice of all corresponding rows in the
        # embedding matrix. TensorFlow provides a convenient method to do so called
        # tf.nn.embedding_lookup()
        # This method is really useful when it comes to matrix multiplication with
        # one-hot vectors because it saves us from doing a bunch of computation that
        # will return 0 anyway.
        with tf.name_scope("loss"):
            #  embedding_lookup retrieves rows of the first tensor
            embed = tf.nn.embedding_lookup(embed_matrix, tf.to_int32(center_words), name='embed')

            # Step 4: construct variables for NCE loss
            # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
            # nce_weight (vocab size x embed size), initialized to truncated_normal stddev=1.0 / (EMBED_SIZE ** 0.5)
            # bias: vocab size, initialized to 0

            # Define the loss function
            nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                stddev = 1.0 / math.sqrt(EMBED_SIZE)), 
                                     name = "nce_weight")
            nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name = "nce_bias")

            # define loss function to be NCE loss function
            # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
            # need to get the mean across the batch
            loss = tf.reduce_mean(tf.nn.nce_loss(weights = nce_weight,
                                        biases = nce_bias,
                                        labels = target_words,
                                        inputs = embed,
                                        num_sampled = NUM_SAMPLED,
                                        num_classes = VOCAB_SIZE),
                                 name="loss")

        # Step 5: define optimizer
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

    with tf.Session() as sess:
        # initialize variables
        sess.run(tf.global_variables_initializer())
        
        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter("graphs", sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            batch = next(batch_gen)
            # create feed_dict, run optimizer, fetch loss_batch
            loss_batch, _ = sess.run([loss, optimizer],
                                    feed_dict = {center_words: batch[0],
                                                target_words: batch[1]})
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        print(embed_matrix.eval())
        writer.close()

def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()

Dataset ready
Average loss at step 1999: 113.5
Average loss at step 3999:  52.8
Average loss at step 5999:  33.5
Average loss at step 7999:  23.2
Average loss at step 9999:  17.8
[[ 0.73839045 -0.15143414 -0.03583032 ..., -0.5138132  -0.07501161
  -0.14482619]
 [-0.40334132 -0.46847922 -0.11962534 ..., -0.76152456 -0.21833348
  -0.82695729]
 [ 0.39207357 -0.30086923  0.69663316 ..., -0.35250014  0.16394222
  -0.30729359]
 ..., 
 [ 0.12885094  0.29688597  0.68227816 ...,  0.4535439  -0.28766537
  -0.62341571]
 [ 0.49020457 -0.24589086 -0.58393574 ...,  0.75632381  0.14895558
  -0.97532344]
 [ 0.98762894 -0.62763882  0.14082861 ...,  0.3909626   0.07093954
   0.88820386]]


Remember to activate the TensorFlow environment first:
```
source activate tensorflow
```
Then run this in a terminal to launch TensorBoard:
``` 
tensorboard --logdir="graphs"
```

You can visualize the embedding with PCA or t-SNE:

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.
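
As one way to try the PCA option outside TensorBoard, here is a minimal NumPy sketch. It assumes the embeddings are available as a 2-D array (for example the result of `embed_matrix.eval()`); random numbers stand in for trained embeddings here, and `pca_project` is a hypothetical helper name.

```python
import numpy as np


def pca_project(embeddings, num_components=2):
    """Project rows onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


rng = np.random.default_rng(0)
fake_embeddings = rng.uniform(-1.0, 1.0, size=(1000, 128))  # stand-in data

points_2d = pca_project(fake_embeddings)
print(points_2d.shape)
```

Each row of `points_2d` is an (x, y) coordinate that can be drawn in a scatter plot and labeled with the corresponding word.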

## Word2vec with NCE loss and code to visualize the embeddings on TensorBoard
<a href = "https://github.com/chiphuyen/tf-stanford-tutorials/blob/master/examples/04_word2vec_visualize.py">Word2vec_visualize.py Reference</a>

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

from process_data import process_data

import math

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64 # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
WEIGHTS_FLD = 'processed/'
SKIP_STEP = 2000

class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.lr = learning_rate
        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words')
            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words')

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
        with tf.device('/cpu:0'):
            with tf.name_scope("embed"):
                self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size, 
                                                                    self.embed_size], -1.0, 1.0), 
                                                                    name='embed_matrix')

    def _create_loss(self):
        """ Step 3 + 4: define the model + the loss function """
        with tf.device('/cpu:0'):
            with tf.name_scope("loss"):
                # Step 3: define the inference
                embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')

                # Step 4: define loss function
                # construct variables for NCE loss
                nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size],
                                                            stddev=1.0 / math.sqrt(self.embed_size)), name='nce_weight')
                nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name='nce_bias')

                # define loss function to be NCE loss function
                self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                                    biases=nce_bias, 
                                                    labels=self.target_words, 
                                                    inputs=embed, 
                                                    num_sampled=self.num_sampled, 
                                                    num_classes=self.vocab_size), name='loss')
    def _create_optimizer(self):
        """ Step 5: define optimizer """
        with tf.device('/cpu:0'):
            self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, 
                                                              global_step=self.global_step)

    def _create_summaries(self):
        with tf.name_scope("summaries"):
            tf.summary.scalar("loss", self.loss)
            tf.summary.histogram("histogram_loss", self.loss)
            # because we have several summaries, we should merge them all
            # into one op to make it easier to manage
            self.summary_op = tf.summary.merge_all()

    def build_graph(self):
        """ Build the graph for our model """
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

def train_model(model, batch_gen, num_train_steps, weights_fld):
    saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias

    initial_step = 0
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        # if that checkpoint exists, restore from checkpoint
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph)
        initial_step = model.global_step.eval()
        for index in range(initial_step, initial_step + num_train_steps):
            centers, targets = next(batch_gen)
            feed_dict={model.center_words: centers, model.target_words: targets}
            loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op], feed_dict=feed_dict)
            writer.add_summary(summary, global_step=index)
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
                saver.save(sess, 'checkpoints/skip-gram', index)
        
        ####################
        # code to visualize the embeddings
        final_embed_matrix = sess.run(model.embed_matrix)
        
        # it has to be a variable. constants don't work here. you can't reuse model.embed_matrix
        embedding_var = tf.Variable(final_embed_matrix[:1000], name='embedding')
        sess.run(embedding_var.initializer)

        config = projector.ProjectorConfig()
        summary_writer = tf.summary.FileWriter('processed')

        # add embedding to the config file
        embedding = config.embeddings.add()
        embedding.tensor_name = embedding_var.name
        
        # link this tensor to its metadata file, in this case the first 1000 words of vocab
        embedding.metadata_path = 'processed/vocab_1000.tsv'

        # saves a configuration file that TensorBoard will read during startup.
        projector.visualize_embeddings(summary_writer, config)
        saver_embed = tf.train.Saver([embedding_var])
        saver_embed.save(sess, 'processed/model3.ckpt', 1)

def main():
    model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)
    model.build_graph()
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    train_model(model, batch_gen, NUM_TRAIN_STEPS, WEIGHTS_FLD)

if __name__ == '__main__':
    main()

Dataset ready
INFO:tensorflow:Restoring parameters from checkpoints/skip-gram-199999
Average loss at step 201999:   4.3
Average loss at step 203999:   4.3
Average loss at step 205999:   4.3
Average loss at step 207999:   4.3
Average loss at step 209999:   4.3
Average loss at step 211999:   4.3
Average loss at step 213999:   4.4
Average loss at step 215999:   4.2
Average loss at step 217999:   4.2
Average loss at step 219999:   4.3
Average loss at step 221999:   4.3
Average loss at step 223999:   4.3
Average loss at step 225999:   4.3
Average loss at step 227999:   4.3
Average loss at step 229999:   4.3
Average loss at step 231999:   4.3
Average loss at step 233999:   4.3
Average loss at step 235999:   4.3
Average loss at step 237999:   4.2
Average loss at step 239999:   4.2
Average loss at step 241999:   4.3
Average loss at step 243999:   4.3
Average loss at step 245999:   4.3
Average loss at step 247999:   4.3
Average loss at step 249999:   4.3
Average loss at step 251999:   4.3

```
source activate tensorflow
tensorboard --logdir="improved_graph/lr1.0"
tensorboard --logdir="checkpoints/"
```
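
The projector configuration above points `metadata_path` at `processed/vocab_1000.tsv`, which these notes do not show being written (presumably `process_data` takes care of it). A hypothetical sketch of producing such a file is below. It relies on the projector's convention that a single-column metadata file has no header row and that line i labels row i of the embedding; the helper name and the sample words are assumptions.

```python
import os
import tempfile


def write_projector_metadata(words, path, limit=1000):
    """Write the first `limit` words, one per line, in embedding-row order."""
    with open(path, "w") as handle:
        for word in words[:limit]:
            handle.write(word + "\n")


# Hypothetical vocabulary, ordered by index like the rows of the embedding matrix.
vocab_words = ["UNK", "the", "of", "and", "one"]

out_dir = tempfile.mkdtemp()  # a scratch directory instead of 'processed/'
metadata_path = os.path.join(out_dir, "vocab_1000.tsv")
write_projector_metadata(vocab_words, metadata_path)
print(metadata_path)
```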