#  How to structure your model in TensorFlow
----------------
Common steps:

Phase 1: assemble your graph.

1. Define placeholders for input and output.
2. Define the weights.
3. Define the inference model.
4. Define the loss function.
5. Define the optimizer.

Phase 2: execute the computation, which is basically training your model.

1. Initialize all model variables for the first time.
2. Feed in the training data. This might involve randomizing the order of the data samples.
3. Execute the inference model on the training data, so that it calculates the output for each training input example with the current model parameters.
4. Compute the cost.
5. Adjust the model parameters to minimize or maximize the cost, depending on the model.

Let’s apply these steps to create our word2vec skip-gram model (which uses the center word to predict its context words).


## Phase 1: Assemble graph
-------
### 1. Define placeholders for input and output
Use word indices as input instead of one-hot vectors. Each index is an integer in the range [0, VOCAB_SIZE).

```python
center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
# tf.nn.nce_loss expects labels of shape [batch_size, num_true]; here num_true is 1
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])
```
### 2. Define the weight (in this case, embedding matrix)
```python
embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))
```
### 3. Inference (compute the forward path of the graph)
```python
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)
```
    
This function looks up the rows of `params` (the embedding vectors) that correspond to the given `ids`, as illustrated below:
![embedding-lookup](pic/embedding-lookup.png)

So we can get the embeddings of the center words with the following code:
```python
embed = tf.nn.embedding_lookup(embed_matrix, center_words)
```
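To see why word indices can stand in for one-hot vectors, here is a small NumPy sketch (not TensorFlow code; the sizes and the name `toy_embed_matrix` are made up for illustration). It checks that multiplying a one-hot matrix by the embedding matrix selects the same rows as plain indexing, which is conceptually what the lookup does for a single, unpartitioned `params` tensor.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
vocab_size, embed_size = 6, 3
rng = np.random.RandomState(0)
toy_embed_matrix = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_size))

center_ids = np.array([4, 0, 4, 2])               # a batch of word indices

# Lookup by index: pick one row of the matrix per word.
looked_up = toy_embed_matrix[center_ids]           # shape (4, 3)

# The one-hot route: build a (batch, vocab) matrix and multiply.
one_hot = np.eye(vocab_size)[center_ids]           # shape (4, 6)
via_matmul = np.dot(one_hot, toy_embed_matrix)     # shape (4, 3)

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Both routes give the same vectors, but the indexed lookup never materializes the `(batch, vocab)` one-hot matrix, which matters when the vocabulary has tens of thousands of words.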
### 4. Define the loss function
The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, we need to define the weights and biases for each word in the vocabulary (also called the output weights, as opposed to the input embeddings).
```python
nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / EMBED_SIZE ** 0.5))
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]))
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                     biases=nce_bias,
                                     labels=target_words,
                                     inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE))

```
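The "logistic regression" view of this loss can be sketched in NumPy for a single center word. This is a simplified illustration, not what `tf.nn.nce_loss` does internally: it omits the correction by the noise sampling probabilities that full NCE applies, and it draws noise words uniformly, which is an assumption of this sketch. The function name `sampled_logistic_loss` and all toy sizes are hypothetical.

```python
import numpy as np

def sampled_logistic_loss(embed_vec, out_weights, out_biases, true_id, noise_ids):
    """Logistic loss for one center word: the true context word vs. sampled noise words.

    Simplified sketch: no log-sampling-probability correction is applied.
    """
    ids = np.concatenate(([true_id], noise_ids)).astype(int)
    # One logit per candidate word: w_j . embed + b_j
    logits = np.dot(out_weights[ids], embed_vec) + out_biases[ids]
    # Binary targets: 1 for the true word, 0 for every noise word.
    targets = np.zeros(len(ids))
    targets[0] = 1.0
    # Numerically stable sigmoid cross-entropy, summed over the candidates.
    losses = (np.maximum(logits, 0) - logits * targets
              + np.log1p(np.exp(-np.abs(logits))))
    return losses.sum()

# Toy example: 10-word vocabulary, 4-dimensional embeddings, 3 noise words.
rng = np.random.RandomState(0)
out_weights = rng.normal(scale=0.5, size=(10, 4))
out_biases = np.zeros(10)
embed_vec = rng.uniform(-1.0, 1.0, size=4)
noise_ids = rng.randint(0, 10, size=3)   # uniform noise: an assumption of this sketch
loss_value = sampled_logistic_loss(embed_vec, out_weights, out_biases,
                                   true_id=7, noise_ids=noise_ids)
print(loss_value)
```

The point of sampling is that each step only touches `1 + len(noise_ids)` rows of the output weights instead of all `VOCAB_SIZE` rows that a full softmax would need.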
### 5. Define optimizer
```python
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
```
## Phase 2: Execute the computation
-------
```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    average_loss = 0.0
    for index in range(NUM_TRAIN_STEPS):
        batch = next(batch_gen)
        loss_batch, _ = sess.run([loss, optimizer], feed_dict={center_words: batch[0], target_words: batch[1]})
        average_loss += loss_batch
        
        if (index + 1) % 2000 == 0:
            print('Average loss at step {}: {:5.1f}'.format(index + 1, average_loss / (index + 1)))
 ```
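The loop above relies on a `batch_gen` generator that this section does not define. A minimal sketch of one possible generator follows, assuming the corpus is already a list of word indices; the names `skip_gram_pairs` and `make_batch_gen` are hypothetical, and unlike the full code further down it emits every context word in the window and does no shuffling.

```python
import numpy as np

def skip_gram_pairs(indices, window):
    """Yield (center, target) pairs for every word within `window` of each center."""
    for i, center in enumerate(indices):
        lo = max(0, i - window)
        hi = min(len(indices), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, indices[j]

def make_batch_gen(indices, window, batch_size):
    """Endlessly yield (centers, targets) with shapes [batch_size] and [batch_size, 1]."""
    centers = np.zeros(batch_size, dtype=np.int32)
    targets = np.zeros((batch_size, 1), dtype=np.int32)
    pairs = skip_gram_pairs(indices, window)
    while True:
        for k in range(batch_size):
            try:
                center, target = next(pairs)
            except StopIteration:
                # End of the corpus: start another pass from the beginning.
                pairs = skip_gram_pairs(indices, window)
                center, target = next(pairs)
            centers[k], targets[k, 0] = center, target
        # Copy so that later batches do not overwrite earlier ones.
        yield centers.copy(), targets.copy()

# Toy corpus of four word indices, window of 1, batches of 4 pairs.
batch_gen = make_batch_gen([0, 1, 2, 3], window=1, batch_size=4)
batch = next(batch_gen)
print(batch[0].tolist())   # center words
print(batch[1].tolist())   # target (context) words
```

The targets come out as a column of shape `[batch_size, 1]`, which is the `[batch_size, num_true]` layout that `tf.nn.nce_loss` expects for its `labels` argument.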
## Name Scope
-------
Name scopes group related ops together, which simplifies the graph displayed by TensorBoard.
```python
with tf.name_scope('data'):
    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')
with tf.name_scope('embed'):
    embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name='embed_matrix')
with tf.name_scope('loss'):
    embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')
    nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / EMBED_SIZE ** 0.5), name='nce_weight')
    nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                         biases=nce_bias,
                                         labels=target_words,
                                         inputs=embed,
                                         num_sampled=NUM_SAMPLED,
                                         num_classes=VOCAB_SIZE),
                                         name='loss')
```
## TensorBoard Visualization Guide
-------
```python
from tensorflow.contrib.tensorboard.plugins import projector

# obtain the embedding_matrix after you’ve trained it
final_embed_matrix = sess.run(model.embed_matrix)

# create a variable to hold your embeddings. It has to be a variable; constants
# don’t work. You also can’t just use the embed_matrix we defined earlier for our model
# (why that is so, I don’t know). Here we keep only the 500 most popular words.
embedding_var = tf.Variable(final_embed_matrix[:500], name='embedding')
sess.run(embedding_var.initializer)
config = projector.ProjectorConfig()
summary_writer = tf.summary.FileWriter(LOGDIR)

# add embeddings to config
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name

# link the embeddings to their metadata file. In this case, the file that contains
# the 500 most popular words in our vocabulary
embedding.metadata_path = LOGDIR + '/vocab_500.tsv'

# save a configuration file that TensorBoard will read during startup
projector.visualize_embeddings(summary_writer, config)

# save our embedding
saver_embed = tf.train.Saver([embedding_var])
saver_embed.save(sess, LOGDIR + '/skip-gram.ckpt', 1)
```
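The snippet above points the projector at `vocab_500.tsv` but does not show how that file is produced. A minimal sketch follows, assuming you have an index-to-word mapping (such as the `reversed_vocabulary` dictionary built in the code section below) whose indices match the rows of the embedding matrix. The helper name `write_projector_metadata` and the toy vocabulary are hypothetical. A single-column metadata file holds one label per line, with no header row.

```python
import os
import tempfile

def write_projector_metadata(index_to_word, num_words, path):
    """Write the first `num_words` labels, one per line, in embedding-row order."""
    with open(path, 'w') as f:
        for i in range(num_words):
            f.write(index_to_word[i] + '\n')

# Toy index-to-word mapping standing in for the real vocabulary.
toy_index_to_word = {0: 'UNK', 1: 'the', 2: 'of', 3: 'and'}
toy_logdir = tempfile.mkdtemp()                        # stands in for LOGDIR
metadata_path = os.path.join(toy_logdir, 'vocab_3.tsv')
write_projector_metadata(toy_index_to_word, 3, metadata_path)

with open(metadata_path) as f:
    print(f.read().splitlines())   # ['UNK', 'the', 'of']
```

Line `i` of the file labels row `i` of `embedding_var`, so the number of lines written should match the number of rows kept (500 in the snippet above).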

## Code
----
The code below is deliberately simple, to help you understand word2vec and the TensorFlow interfaces. A better-structured version will be provided in this repository.

```python
from __future__ import division

import tensorflow as tf
import zipfile
import collections
import random
import numpy as np

# hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128
EPOCH = 100001    # number of training steps (one batch per step)
SKIP_WINDOW = 1   # the number of context words from left/right of input word
NUM_SKIPS = 2     # the number of labels used for one input
NUM_SAMPLED = 64  # the number of negative (noise) words sampled per batch

# data
data_name = "data/text/text8.zip"

def read_data():
    # read the first file in the archive and split it into a list of words
    with zipfile.ZipFile(data_name) as zf:
        data = tf.compat.as_str(zf.read(zf.namelist()[0])).split()
    return data

words = read_data()
print("data size=%r" % len(words))


def build_dataset(words):
    # index 0 is reserved for unknown (out-of-vocabulary) words
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(VOCAB_SIZE - 1))
    vocabulary = dict()

    for word, _ in count:
        vocabulary[word] = len(vocabulary) # index

    indices = list()
    unk_count = 0
    for word in words:
        if word in vocabulary:
            index = vocabulary[word]
        else:
            index = 0
            unk_count += 1
        indices.append(index)

    count[0][1] = unk_count
    reversed_vocabulary = dict(zip(vocabulary.values(), vocabulary.keys()))
    return indices, count, vocabulary, reversed_vocabulary

indices, count, vocabulary, reversed_vocabulary = build_dataset(words)

del vocabulary
print('Most common words (+UNK)', count[:5])
print('Sample data', indices[:10], [reversed_vocabulary[i] for i in indices[:10]])

index = 0
def generate_batch():
    assert BATCH_SIZE % NUM_SKIPS == 0
    assert NUM_SKIPS <= (2 * SKIP_WINDOW)
    batch = np.ndarray(shape=(BATCH_SIZE), dtype=np.int32)
    labels = np.ndarray(shape=(BATCH_SIZE, 1), dtype=np.int32)
    span = 2 * SKIP_WINDOW + 1
    buf = collections.deque(maxlen=span)

    global index
    # round back
    if index + span > len(indices):
        index = 0

    buf.extend(indices[index:index + span])
    index += span

    for i in range(BATCH_SIZE // NUM_SKIPS): # for each span
        target = SKIP_WINDOW # start at the center position, which must not be picked as a label
        targets_to_avoid = [SKIP_WINDOW]

        for j in range(NUM_SKIPS):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * NUM_SKIPS + j] = buf[SKIP_WINDOW]
            labels[i * NUM_SKIPS + j, 0] = buf[target]

        if index == len(indices):
            # wrap around; a deque does not support slice assignment,
            # but extend() with maxlen=span replaces the whole window
            buf.extend(indices[:span])
            index = span
        else:
            buf.append(indices[index])
            index += 1

    # backtrack a little so that the words at the end of a batch are not skipped
    index = (index + len(indices) - span) % len(indices)
    return batch, labels


# skip-gram model
# define placeholder for input and output
train_inputs = tf.placeholder(tf.int32, [BATCH_SIZE])
train_labels = tf.placeholder(tf.int32, [BATCH_SIZE, 1])

# define the weight
embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))

# inference
embed = tf.nn.embedding_lookup(embed_matrix, train_inputs)

# define the loss function
nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / EMBED_SIZE ** 0.5))
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]))
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                     biases=nce_bias,
                                     labels=train_labels,
                                     inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    average_loss = 0.0
    for step in range(EPOCH):
        batch_inputs, batch_labels = generate_batch()
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, batch_loss = sess.run([optimizer, loss], feed_dict)
        average_loss += batch_loss

        # report the loss averaged over the last 2000 steps
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            print("average loss=%r" % average_loss)
            average_loss = 0
```


Output:

```text
data size=17005207
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5237, 3082, 12, 6, 195, 2, 3137, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
average loss=280.04736328125
average loss=113.89051965999603
average loss=52.584254512548448
average loss=33.501050638914109
average loss=23.523179512977599
average loss=17.446803200244904
average loss=14.081227049946785
average loss=11.60793666422367
average loss=10.112544078707694
average loss=8.3949593905210502
average loss=8.0428597589731208
average loss=7.1501791990995409
average loss=6.8352160725593567
average loss=6.7004434989690784
average loss=6.4221606501340869
average loss=5.9205268704891205
average loss=5.9438270325660705
average loss=5.6778586281538006
average loss=5.760340624809265
average loss=5.4955958029031757
average loss=5.2549022141695021
average loss=5.3695967180728914
average loss=5.2273469
```