# TensorFlow Modeling Tutorial
In this tutorial you will learn how to create a word2vec model. TensorFlow provides a detailed word2vec tutorial [here](https://www.tensorflow.org/tutorials/word2vec). In this Notebook, I will just highlight the main features of the model and give a general introduction to word2vec.

## Background
Word2Vec is a model created by [Mikolov et al.](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). It uses "word embeddings", which represent relationships between words as vectors. For example, given the words 'king' and 'queen', there is a vector that connects 'king' to 'queen'. Now take the word 'man'. If the same vector is applied starting from 'man', the word2vec model points to the word 'woman'. This means the model can use its learned embeddings to predict relationships between word pairs that it was never explicitly taught.
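
The vector arithmetic behind this example can be sketched with a few hand-picked two-dimensional vectors. The values below are assumptions chosen so that the arithmetic works out exactly; real embeddings are learned from text, have many more dimensions, and only match approximately.

```python
# Toy 2-D embeddings, hand-picked for illustration (not learned values).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}


def offset(start, end):
    """Vector that connects the embedding of `start` to the embedding of `end`."""
    return [e - s for s, e in zip(toy_embeddings[start], toy_embeddings[end])]


def project(word, vector):
    """Apply an offset vector starting from the embedding of `word`."""
    return [w + v for w, v in zip(toy_embeddings[word], vector)]


king_to_queen = offset("king", "queen")
landing_point = project("man", king_to_queen)

# With these hand-picked values the landing point coincides with 'woman'.
print(landing_point == toy_embeddings["woman"])
```

With learned embeddings the landing point rarely coincides with a word exactly, so the closest word vector is reported instead.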

Here is an example visualization of learned word embeddings (a t-SNE projection), taken from the TensorFlow tutorial:

![embedding_matrix](https://www.tensorflow.org/images/tsne.png)

## Bias
Take care that an embedding matrix does not encode gender bias. Embeddings are learned from text, so stereotypes in the training data can carry over into the vectors. There is a good article on debiasing word embeddings [here](https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pd).

## `word2vec_optimized.py`
The TensorFlow team created a word2vec model, word2vec_optimized.py, which I have included in the pybotframework repo to make it easier to run. Let's take a look at the main parts of this file.

### ops file
TensorFlow uses ops files to store common operations. In our case, we need to load an ops file that is not included with the standard TensorFlow installation, so we extend TensorFlow with the **word2vec_ops.so** ops file. Please see the install instructions in the **examples/tf_bot/** directory if you have not already installed the word2vec ops file on your computer. In **word2vec_optimized.py**, the ops file is loaded with TensorFlow's `tf.load_op_library` function.
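
A minimal sketch of loading a compiled ops file is shown below. It assumes that **word2vec_ops.so** sits in the same directory as the Python module that loads it; the helper names `ops_library_path` and `load_word2vec_ops` are hypothetical and are not taken from word2vec_optimized.py. TensorFlow is imported only inside the loader, so the path helper can be tried without TensorFlow installed.

```python
import os


def ops_library_path(module_file, library_name="word2vec_ops.so"):
    """Return the path of a compiled ops file located next to `module_file`."""
    module_dir = os.path.dirname(os.path.realpath(module_file))
    return os.path.join(module_dir, library_name)


def load_word2vec_ops(module_file):
    """Load the compiled ops, failing early if the .so file is missing."""
    path = ops_library_path(module_file)
    if not os.path.exists(path):
        raise FileNotFoundError("Compiled ops file not found: " + path)
    # Imported lazily so that the path helper above works without TensorFlow.
    import tensorflow as tf
    return tf.load_op_library(path)


# Show where the loader would look, using a made-up module location.
print(ops_library_path("/tmp/tf_bot/word2vec/word2vec_optimized.py"))
```

Inside a module, `load_word2vec_ops(__file__)` would return an object that exposes the custom operations as Python functions.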

## Flags
Various parameters in a TensorFlow session, such as hyperparameters and file paths, can be defined using flags. In word2vec_optimized.py, the flags include `save_path`, `train_data`, and `eval_data`, which are set later in this tutorial.

## `Options()`
An options class is used to retrieve flag values and pass them to the algorithm.
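
As a rough illustration of how flag values flow into an options object, the sketch below uses Python's standard `argparse` module as a stand-in for TensorFlow's flag mechanism. The names `save_path`, `train_data`, and `eval_data` are set as flags later in this tutorial, and `epochs_to_train` is read from the options object in the training loop. The default of 5 epochs, the `build_flag_parser` helper, and the `OptionsSketch` class are hypothetical and are not the code in word2vec_optimized.py.

```python
import argparse


def build_flag_parser():
    """Declare a few word2vec-style settings (argparse stand-in for TF flags)."""
    parser = argparse.ArgumentParser(description="Illustrative word2vec flags")
    # Paths start as None placeholders and must be filled in before training.
    parser.add_argument("--save_path", default=None, help="Directory for model output")
    parser.add_argument("--train_data", default=None, help="Training text file")
    parser.add_argument("--eval_data", default=None, help="Analogy questions file")
    # Placeholder default, not necessarily the value used by the real script.
    parser.add_argument("--epochs_to_train", type=int, default=5, help="Number of epochs")
    return parser


class OptionsSketch:
    """Collect flag values as plain attributes for the model to read."""

    def __init__(self, flags):
        self.save_path = flags.save_path
        self.train_data = flags.train_data
        self.eval_data = flags.eval_data
        self.epochs_to_train = flags.epochs_to_train


flags = build_flag_parser().parse_args(["--train_data", "text8_trimmed.txt"])
opts = OptionsSketch(flags)
print(opts.train_data, opts.epochs_to_train, opts.save_path)
```

Keeping the settings on one object means the model only needs a single argument to reach every hyperparameter and path.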


## `Word2Vec()`
This class defines the word2vec object. This is where all of the modeling is done. Let's take a look at a few of the methods.

### `read_analogies()`
Read in a list of word analogies. Each line of the text file consists of four words; the first pair of words stands in the same relationship as the second pair. The model uses this file to evaluate its accuracy after each training epoch.
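
The parsing step can be sketched in plain Python. This is a simplified stand-in rather than the method from word2vec_optimized.py: it assumes that lines starting with `:` are category headers, that words are lowercased before lookup, and that any question containing a word outside the vocabulary is skipped. The function name and the `word2id` dictionary are hypothetical.

```python
def read_analogies_sketch(lines, word2id):
    """Turn four-word analogy lines into lists of word ids.

    Returns (questions, skipped), where `skipped` counts the lines that were
    dropped because they were malformed or used an out-of-vocabulary word.
    """
    questions = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # blank line or category header, not a question
        ids = [word2id.get(word) for word in line.lower().split()]
        if len(ids) != 4 or None in ids:
            skipped += 1
        else:
            questions.append(ids)
    return questions, skipped


vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
sample_lines = [
    ": family",
    "king queen man woman",
    "Athens Greece Baghdad Iraq",  # none of these words are in the toy vocabulary
]
questions, skipped = read_analogies_sketch(sample_lines, vocab)
print("Questions:", len(questions))
print("Skipped:", skipped)
```

Skipping of this kind would explain the training log later in this tutorial, where a vocabulary of 1467 words leaves 76 usable questions and 19468 skipped ones.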

### `build_graph()`
Build the TensorFlow graph used for training.

### `build_eval_graph()`
Build the evaluation graph. It takes batches of three words, looks up the embedding vector for each word, and uses those vectors to predict the embedding vector of the fourth, unknown word.

### `train()`
Set up a TensorFlow session and train the model.

### `eval()`
Evaluate the model accuracy and report it.
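
The accuracy calculation can be sketched as follows. The embeddings are hand-picked toy values, the prediction rule (the closest word by cosine similarity to `c + (b - a)`, excluding the three input words) is an assumption about how the comparison works, and the function names are hypothetical.

```python
import math

# Toy embeddings, hand-picked for illustration (not learned values).
toy_embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


def predict_fourth(emb, a, b, c):
    """Predict d in 'a is to b as c is to d' from the vector c + (b - a)."""
    target = [vc + vb - va for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]  # never echo an input word
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Return (number correct, number of questions) for four-word analogies."""
    correct = sum(1 for a, b, c, d in questions if predict_fourth(emb, a, b, c) == d)
    return correct, len(questions)


questions = [
    ("king", "queen", "man", "woman"),
    ("man", "woman", "king", "queen"),
    ("king", "man", "queen", "apple"),  # deliberately wrong expected answer
]
correct, total = analogy_accuracy(toy_embeddings, questions)
print("Eval %4d/%d accuracy = %4.1f%%" % (correct, total, 100.0 * correct / total))
```

The printed line imitates the `Eval` lines in the training log shown later in this tutorial.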

### `analogy()`
This function is used with the final, trained model. The user supplies three words and the model tries to predict the fourth word.

### `nearby()`
This function is also used with the final, trained model. Given a list of words, it prints the nearest words to each one.
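
A nearest-word lookup can be sketched with cosine similarity over toy vectors. The embeddings, the choice to leave the query word out of its own results, and the `nearby_sketch` name are all assumptions made for illustration.

```python
import math

# Toy embeddings, hand-picked for illustration (not learned values).
toy_embeddings = {
    "king": [1.0, 1.0],
    "prince": [0.9, 1.0],
    "man": [0.0, 1.0],
    "queen": [1.0, -1.0],
    "woman": [0.0, -1.0],
}


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


def nearby_sketch(emb, word, k=2):
    """Return the k words most similar to `word`, most similar first."""
    scored = [(cosine(emb[word], vec), other)
              for other, vec in emb.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


for query in ["king", "woman"]:
    print(query, "->", nearby_sketch(toy_embeddings, query))
```

Looking at nearest neighbours is a quick sanity check on a trained model, since related words should end up close together.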

## Modeling in Practice
Now that we have a brief understanding of how word2vec works, let's train a model. This section follows the script tf_train_model.py in the _examples/tf_bot/_ directory. First, we need to import some things:

In [1]:
import os
import tensorflow as tf
import examples.tf_bot.word2vec.word2vec_optimized as word2vec_optimized
from examples.tf_bot.word2vec.word2vec_optimized import Options, Word2Vec, FLAGS

The TensorFlow ops file must also be loaded, as described in the ops file section above.

Some TensorFlow flags are initially set to `None` as placeholders. We need to define them so that the model can run:

In [2]:
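# Repository root: two directory levels above the current working directory
repo_path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, os.pardir))

# Directory where the trained model is saved
FLAGS.save_path = os.path.join(repo_path, 'examples/tf_bot/data/')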

# Dataset to train the model
FLAGS.train_data = os.path.join(repo_path, 'examples/tf_bot/data/text8_trimmed.txt')

# A list of analogies to test the model
FLAGS.eval_data = os.path.join(repo_path, 'examples/tf_bot/data/questions-words.txt')

Note: `repo_path` is derived from the current working directory, so it depends on where you cloned the repo on your system and on the directory the notebook is run from.
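
Because these paths depend on the local setup, it can help to check them before building the model. The helper below is a hypothetical addition, not part of the tutorial's script, and it uses a temporary directory in place of the real data folder.

```python
import os
import tempfile


def missing_paths(named_paths):
    """Return the names whose path is unset (None) or does not exist on disk."""
    return [name for name, path in named_paths.items()
            if path is None or not os.path.exists(path)]


# Demonstration with a temporary directory standing in for the data folder.
with tempfile.TemporaryDirectory() as data_dir:
    train_file = os.path.join(data_dir, "text8_trimmed.txt")
    with open(train_file, "w") as handle:
        handle.write("a tiny stand-in corpus\n")

    problems = missing_paths({
        "save_path": data_dir,
        "train_data": train_file,
        "eval_data": os.path.join(data_dir, "questions-words.txt"),  # never created
    })

print(problems)
```

In the notebook, the same check could be run on `FLAGS.save_path`, `FLAGS.train_data`, and `FLAGS.eval_data` before training starts.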

Lastly, we can set up our TensorFlow graph and train the model:

In [3]:
opts = Options()
with tf.Graph().as_default(), tf.Session() as session:
    # Build the model on the CPU and load the evaluation analogies
    with tf.device("/cpu:0"):
        model = Word2Vec(opts, session)
        model.read_analogies()
    # Train for one epoch, then report the analogy accuracy
    for _ in range(opts.epochs_to_train):
        model.train()
        model.eval()
    # Save the model after training has finished
    model.saver.save(session, os.path.join(opts.save_path, "model.ckpt"),
                     global_step=model.global_step)

Data file:  /Users/dave/DataScience/Projects/GitHub/pybotframework/examples/tf_bot/data/text8_trimmed.txt
Vocab size:  1467  + UNK
Words per epoch:  49999
Eval analogy file:  /Users/dave/DataScience/Projects/GitHub/pybotframework/examples/tf_bot/data/questions-words.txt
Questions:  76
Skipped:  19468
Epoch    1 Step      327: lr = 0.024 words/sec =     5700
Eval    0/76 accuracy =  0.0%
Epoch    2 Step      661: lr = 0.023 words/sec =     5812
Eval    0/76 accuracy =  0.0%
Epoch    3 Step      995: lr = 0.022 words/sec =     5820
Eval    1/76 accuracy =  1.3%
Epoch    4 Step     1329: lr = 0.021 words/sec =     5859
Eval    1/76 accuracy =  1.3%
Epoch    5 Step     1645: lr = 0.020 words/sec =     5511
Eval    1/76 accuracy =  1.3%
Epoch    6 Step     1979: lr = 0.019 words/sec =     5819
Eval    2/76 accuracy =  2.6%
Epoch    7 Step     2313: lr = 0.018 words/sec =     5859
Eval    2/76 accuracy =  2.6%
Epoch    8 Step     2641: lr = 0.017 words/sec =     5713
Eval    2/76 accuracy = 

## Exercises:

1. Using the text8_trimmed.txt file and the questions-words.txt file, create a TensorFlow model and train it.

2. Think of three analogies, each consisting of four words. Pass three of the words to the trained TensorFlow model and return a prediction for the fourth word. Hint: You will need to use the `analogy()` method.

## Advanced Exercises:
1. As can be seen, the model did not train well. One reason is that the training dataset is too small, which was done to save time during the workshop. At home, or if there is time, download the larger training set from here [provide link] and train the model. Does the model perform better with this larger text file?

2. Find a large text document online. Download it, reformat it to look like the above training set, and train the model. Does the model improve?