In [1]:
import tensorflow as tf
import math

In this iteration, I want to map out all the variables described in Bengio's Neural Probabilistic Language Model and focus entirely on the `class Model()` architecture. The full class will be defined at the bottom, with each individual attribute and method explained in detail. 

## Initialization

Four things need to be defined in the initialization that I won't get from the data preprocessor. They are either standard variables or derived from a specific version of the model implemented on the Brown Corpus:
1. batch_size (standard across all configurations)
2. embedding_size (dependent on the number of word features in a specific configuration, i.e. m on pg. 1149 of Bengio)
3. window_size (dependent on the order of the model in a specific configuration, i.e. n on pg. 1149 of Bengio). This determines the n in __n__-gram.
4. hidden_units (i.e. h on pg. 1149 of Bengio, dependent on the configuration chosen)

__Nico's route__: Nico did initialization with a nested dictionary: the outer key is the configuration name, and its value is another dictionary that maps each variable name to its value, so each value is extracted two dictionaries in. All of the variables become attributes of the model upon initialization. The variables are automatically initialized when a method of the MyModel class is called __(not entirely sure how the configs get in there b/c nothing that gets passed into any of the methods relates to them, so it's magic for now)__
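
A minimal sketch of that nested-dictionary idea, assuming the configuration is passed in by name rather than picked up from a global. The names `CONFIGS`, `ConfiguredModel`, `"small"` and `"large"` and all the numbers are made up for illustration; they are not Nico's code and not the settings from Bengio's paper.

```python
# Hypothetical configuration table: names and values are placeholders.
CONFIGS = {
    "small": {"window_size": 3, "embedding_size": 30, "hidden_units": 50},
    "large": {"window_size": 5, "embedding_size": 60, "hidden_units": 100},
}


class ConfiguredModel:
    """Toy stand-in for the model class that only stores hyperparameters."""

    def __init__(self, config_name, batch_size=128):
        config = CONFIGS[config_name]              # first level: pick a configuration
        self.window_size = config["window_size"]   # second level: pick a value
        self.embedding_size = config["embedding_size"]
        self.hidden_units = config["hidden_units"]
        self.batch_size = batch_size


model = ConfiguredModel("small")
print(model.window_size, model.embedding_size, model.hidden_units, model.batch_size)
```

Passing the configuration name into `__init__` makes it explicit where the values come from, which removes the "magic" at the cost of one extra argument.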

__On Batch Sizes__: They're needed for the stochastic gradient descent. Good batch sizes are typically 32, 64, 128, or 256. 

In [2]:
def __init__(self):
    self.window_size = user_type['window_size']
    self.embedding_size = user_type['embedding_size']
    self.hidden_units = user_type['hidden_units']
    self.batch_size = 128 

## Training

This is the most complicated part of the entire model. I've broken it down into two major phases, the second phase being so exhaustive that it needs to be broken down itself into seven steps.

`def train():` takes in the following arguments:
* `self` allows the attributes assigned in the `__init__()` method above to be used by the method.
* `train_data` is the result of the `corpus.generate_data(path)` method, a method of the Preprocessor class that is available on a `corpus` object once it's assigned by calling Preprocessor on a path to the data. It takes the form of a __list__.
* `validation_data` is the result of the same process as above, just with a different path. It takes the form of a __list__. 
* `number_of_epochs=int`, a manually set integer indicating how many epochs to run. Because of the size of the data, an epoch isn't the smallest unit of input: each epoch is __split into batches, and the training commands run on one batch at a time__. 
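
A minimal sketch of how an epoch breaks down into batches. `make_batches` is a hypothetical helper (not the `generate_batches` function used later, which isn't shown in this notebook), and the ten integers stand in for ten training examples.

```python
def make_batches(examples, batch_size):
    """Split a list of examples into consecutive batches (hypothetical helper)."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


examples = list(range(10))      # stand-in for 10 training examples
number_of_epochs = 2

for epoch in range(number_of_epochs):
    batches = make_batches(examples, batch_size=4)
    for batch in batches:
        pass  # one optimizer step per batch would go here

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the last batch can be shorter than `batch_size` when the data doesn't divide evenly.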

### Step One - If Statements to Determine Running on GPU/CPU/TPU
In this step, the program makes use of the following methods:
1. `tf.test.is_gpu_available()`
    * This method accepts two arguments:
        - cuda_only=__True/False__ = _limit the search to CUDA GPUs_
        - min_cuda_compute_capability=__None__ = _a (major, minor) pair that indicates the minimum CUDA compute capability required, or None if no requirement_
    * This method returns a boolean that indicates whether or not a GPU device of the requested kind is available. 
2. `tf.device()`
    * This method allows me to __manually__ choose the device that operations are placed on, instead of tf automatically assigning things. 
    * This method accepts an argument: device. 
        - device is a __string__ that will take one of the following forms:
            * '/cpu:0'
            * '/gpu:0' (<-- the integer is the index of the GPU, counted from 0 the same way list indexes are, so a second GPU on my machine would be '/gpu:1')
            
3. `with tf.device(device):`
    * This indicates that everything indented after this statement will run on the intended machine. 
    
__NOTE:__ Not having Step One implemented was the reason that none of my training runs were actually going faster with Google Colab. My program was just ignoring that a GPU existed, even though I had selected a GPU runtime (all that selection did was make the GPU available for the program to use; I actually had to hard code a command for my program to take advantage of it). 

I still have a bit of confusion here - the TF documentation indicates that operations with both a CPU and a GPU implementation give priority to the GPU if one is made available. Still don't understand why my programs were obnoxiously slow if this was the case - maybe I just didn't use an operation that gets that priority, which would make manual placement the thing I was missing. 

In [3]:
# def train(self, train_data, validation_data, number_of_epochs=5):
    # insert program here

In [4]:
if tf.test.is_gpu_available(cuda_only=False,
                            min_cuda_compute_capability=None):
    device = '/gpu:0'
    print("Currently using GPU capabilities")
# here I'll figure out how to make it use a TPU
else: 
    device = '/cpu:0'
    print('No GPU available, using CPU capabilities')
with tf.device(device):
    # everything I want to run using the device indented here 
    pass  # placeholder so the empty block parses

### Step Two - Everything Else But Saving the Model

Here, I have to split the remaining part into six parts because they __all happen with the device running__. In this part, I'll assign, define, and describe:
1. Placeholders
2. The Model's Variables
3. The Hidden Layer
4. The Softmax
5. Stochastic Gradient Descent Optimizer
6. The Compiler (bit more complicated than how Keras makes it) 

#### 2.1 = The Placeholders
_What are placeholders in TensorFlow?_ A placeholder is just a variable that I'll assign data to at a later date. It allows me to create operations and begin to build the computation graph without actually needing the data itself. Useful when I'm creating the model as a class instead of actually passing it the data beforehand. 

_What does `tf.placeholder()` take as its arguments?_
* This method takes three arguments:
    1. dtype = a `tf.DType` such as `tf.int64`
    2. shape = [optional] the shape of the tensor to be fed (e.g. `[None, window_size]`, where `None` leaves the batch dimension open). If the shape isn't specified, a tensor of any shape can be fed.
    3. name = [optional] a name for the operation. 

_What are the placeholders I need to create?_ Right now, I need two placeholders:
1. x_input = indexes of the window-size words before the label
2. y_input = indexes of the next word

_Where does Bengio actually talk about these in his paper?_ Who the fuck knows. __CHECKUP: I'll have to run the total program in PyCharm to see what they produce__. 
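
A minimal sketch of what these two placeholders might be fed, assuming `window_size` counts the context words that come before the label. `make_examples` and the word indexes are made up for illustration; the real pairs come from the preprocessor.

```python
def make_examples(word_ids, window_size):
    """Pair each run of `window_size` word indexes with the index that follows it."""
    x, y = [], []
    for i in range(len(word_ids) - window_size):
        x.append(word_ids[i:i + window_size])  # context words
        y.append(word_ids[i + window_size])    # next word (the label)
    return x, y


word_ids = [4, 7, 1, 9, 2]      # made-up indexes for a five-word sentence
x, y = make_examples(word_ids, window_size=3)
print(x)  # [[4, 7, 1], [7, 1, 9]]
print(y)  # [9, 2]
```

Each row of `x` has `window_size` entries, which matches the `[None, self.window_size]` shape of `x_input`, and `y` is a flat list, which matches the `[None]` shape of `y_input`.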

In [None]:
# I need my test function to be able to access this outside of train()
self.x_input = tf.placeholder(tf.int64, [None, self.window_size])
# Not sure why I don't need to feed in the shape here
self.y_input = tf.placeholder(tf.int64, [None])

#### 2.2 = The Variables

_What are all the variables that Bengio discusses in his discussion of the model?_ Bengio begins his discussion of his model on Pg. 1141 of his paper, while the exact variables are discussed on Pg. 1143

__The Set-Up__:
* z = shorthand for a common multiplication of embedding_size and window_size
* x_flat = ??
* embed = ??
* x_t = ??

__The Free-Parameters:__
* C = word features/embeddings
* b = output biases
* d = hidden layer biases 
* W = word features to output weights
* U = hidden-to-output weights
* H = hidden layer weights

__The New `tf.methods`__:
- `tf.Variable()` = adding a variable to graph. 

__Note:__ Page 1143 explicitly lays out what each of the free parameters are, though I found it difficult to ascertain what exactly the set up variables meant. I also found it very difficult to plot the exact way of representing all of the calculations in TensorFlow/python - required a lot of tutoring and careful walking through to get to that point. 

_I keep hearing about a TensorFlow dataflow graph, explain more about that._ __Dataflow__ is a common programming model for parallel computing. In a dataflow graph, the nodes represent units of computation and the edges represent data consumed or produced by a computation. If I look at the diagram [here](https://www.tensorflow.org/guide/graphs), I can see that each operation (like `tf.matmul`) has two incoming edges (edges meaning the arrows that flow into the node) symbolizing its two inputs, and one outgoing edge (i.e. the output) symbolizing the result. The dataflow graph makes it easier to visualize these operations. 

_That's cool - but there's a thing called `tf.Graph` that seems different_ Yes, `tf.Graph` is a bit more technical. A single `tf.Graph` instance contains two types of information:
1. Graph structure = the nodes and edges of the graph, indicating how individual operations are composed together but not talking about how they should be used. The graph structure is __like assembly code__: inspecting it can convey useful information, but it doesn't contain context the same way that source code does. 
2. Graph collections = collections of metadata. `tf.add_to_collection()` enables me to associate a list of objects with a key (where `tf.GraphKeys` defines some of the standard keys), and `tf.get_collection()` enables me to look up all objects associated with a key. 

_Awesome, but I still don't get how that connects to the actual stuff I'm writing, that still sounds really abstract_ Yep, let's just keep diving into it. Many parts of the TensorFlow library use the `tf.Graph` facility. When I create a `tf.Variable`, it's added by default to collections representing "global variables" and "trainable variables". When I later come to create a `tf.train.Saver` or `tf.train.Optimizer`, the variables stored in these collections are used as the default arguments.

_Alright, so how do I build a graph? And I still don't really understand how it fits into a model construction workflow?_ No problem - most TensorFlow programs start with a dataflow graph construction phase. You don't realize you're doing it because it's done automatically through other TensorFlow API methods that construct new `tf.Operation` (i.e. nodes) and `tf.Tensor` (edges) objects and add them to a `tf.Graph` instance. This is all implicit within the API - unfortunate if you want to get a handle on what's going on underneath the surface, but rest easy: you can get a sense of what's happening when you call these auxiliary functions.

* Executing something like what I'm doing below (i.e. `v = tf.Variable()`) adds a `tf.Operation` that stores a writeable tensor value that can be used like a tensor. __This is what I'm saying - get a sense of what's happening__. Ask yourself, what can a tensor do? Having a tensor object lets you use TensorFlow operations. This is great! Most TF operations take one or more `tf.Tensor` objects as arguments. 
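
To make the "build first, run later" idea concrete, here is a toy dataflow graph in plain Python. It is only an analogy under my own assumptions: `Node`, `placeholder` and `run` are hypothetical names, and none of this is how TensorFlow is implemented internally.

```python
class Node:
    """One unit of computation; `inputs` are the incoming edges."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name


def placeholder(name):
    """A node with no computation; its value is supplied at run time."""
    return Node(op=None, name=name)


def run(node, feed):
    """Evaluate a node by first evaluating everything that flows into it."""
    if node.op is None:                 # a placeholder: value comes from the feed
        return feed[node.name]
    args = [run(n, feed) for n in node.inputs]
    return node.op(*args)


# Construction phase: nothing is computed yet.
a = placeholder("a")
b = placeholder("b")
total = Node(lambda x, y: x + y, inputs=(a, b))
doubled = Node(lambda x: 2 * x, inputs=(total,))

# Execution phase: feed values in and ask for a result.
result = run(doubled, {"a": 3, "b": 4})
print(result)  # 14
```

The last two lines play the role of `session.run(..., feed_dict=...)`: the graph is described once and only evaluated when values are fed in.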
    
_Cool, so I know how to create these things now. What do I do with them?_ That's the next section. For now, I'll focus on the code implementation of the abstract math that's on pg. 1143 of the Bengio model.

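- `tf.reshape()` and `tf.truncated_normal()` do different things: `tf.reshape()` changes the shape of an existing tensor without changing its values, while `tf.truncated_normal()` generates new random values of a given shape. 

- `tf.shape()` = returns the shape of a tensor as a 1-D integer tensor. A shape is basically just a description of a tensor's dimensions (i.e. a shape of `[2, 3]` describes a 2 X 3 matrix). 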

- `tf.random_uniform()` = the bread and butter of creating tf variables. It takes the following arguments:
    * shape = a 1-D integer Tensor or  Python array. The shape of the output tensor. 
    * minval = a 0-D Tensor or  Python value of the same type. Essentially the lower bound on the range of random values to generate. 
    * maxval = same, but for the upper bound. 
    * dtype (optional) 
    * seed (random seed) 
    * name = name for the operation 
- `tf.layers.flatten()` = flattens an input tensor while preserving the batch axis (axis 0).  

In [None]:
# seemed to need this repetitively 
z = self.embedding_size * self.window_size

#### Creating The Free Parameters

In [None]:
d = tf.Variable(tf.random_uniform([self.hidden_units]))

In [None]:
b = tf.Variable(tf.random_uniform([vocab_size]))

##### The C Matrix
Now I'll do the weights - first with the word features matrix. This is the most complicated parameter to make and understand, because it requires four different operations to get to the point where it can be used in the model. 

First, I need to create a TF variable called word_embeddings to represent a matrix of shape [vocab_size, embedding_size] with the lower bound being -1.0 and upper bound being 1.0. 

In [5]:
word_embeddings = tf.Variable(tf.random_uniform([vocab_size, 
                                                 self.embedding_size],
                                                -1.0,
                                                1.0))

NameError: name 'vocab_size' is not defined

Next, I need to flatten x_input, i.e. collapse every axis after the batch axis into a single axis. 

__CHECKUP__: I'm not really sure why I need to do this - it seems like it just multiplies any additional axes together to make a [None, k] matrix, and I thought my self.x_input was already that (its shape is [None, window_size]). I'll check the values when I have values to experiment with.

_GUESS_: It might be that the `.layers` does something unique here too. I'll check the difference later. 

In [None]:
x_flat = tf.layers.flatten(self.x_input)

Next, I'll need to take my newfound x_flat elements and word_embeddings and use:

`tf.nn.embedding_lookup()` = returns a `Tensor` with the same type as the tensors in `params`. It takes the following as arguments:
* `params` = A single tensor representing the complete embedding tensor, or a list of P tensors all of the same shape except for the first dimension, representing sharded embedding tensors. 
* `ids` = A `Tensor` with type `int64` or something similar containing the IDs  to be looked up in params.

In [None]:
lookup = tf.nn.embedding_lookup(word_embeddings, x_flat)

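Finally, I'll reshape the lookup result into a matrix of shape [batch_size, z], where z = embedding_size * window_size, so each row holds the word features (m per word) of one window laid end to end. 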

In [None]:
xt = tf.reshape(lookup, [self.batch_size, z])
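
A minimal plain-Python sketch of what the lookup and reshape do to the shapes, using a made-up 4-word vocabulary with 2 features per word. The numbers are invented for illustration; in the model the rows of `C` are learned.

```python
# Toy embedding matrix C: 4 words in the vocabulary, 2 features per word.
C = [
    [0.1, 0.2],   # word 0
    [0.3, 0.4],   # word 1
    [0.5, 0.6],   # word 2
    [0.7, 0.8],   # word 3
]

# A batch of 2 examples, each a window of 3 word indexes.
x_batch = [[0, 2, 3],
           [1, 1, 0]]

# Lookup: replace every index with its row of C -> shape [2, 3, 2].
lookup = [[C[word_id] for word_id in window] for window in x_batch]

# Reshape: join the rows of each window end to end -> shape [2, 3 * 2].
xt = [[value for row in window for value in row] for window in lookup]

print(xt[0])  # [0.1, 0.2, 0.5, 0.6, 0.7, 0.8]
print(xt[1])  # [0.3, 0.4, 0.3, 0.4, 0.1, 0.2]
```

Each row of `xt` has window_size * embedding_size entries, which is the `z` defined earlier.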

##### H

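H is the hidden layer weights, an [h x (n - 1)m] matrix in Bengio's notation. In the code below it is stored transposed, as [z, hidden_units], so that `tf.matmul(xt, H)` works on a batch of rows. 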

In creating this weight, I'm going to use:
`tf.truncated_normal(shape, mean, stddev)` which returns generated values that follow a normal distribution with specified mean and standard deviation. Values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked. This method takes in the following as arguments:
- `shape` a 1-D integer Tensor or Python array that is the shape of the output tensor. 
- `mean` a 0-D Tensor or Python value of type dtype. It will be the mean of the truncated normal distribution. 
- `stddev` a 0-D Tensor or Python value of type dtype. It will be the standard deviation of the normal distribution before truncation. Below I leave `mean` at its default of 0.0 and pass only `stddev`, set to 1 over the square root of the input dimension.  

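Weights are essential to get artificial neurons to learn, and so is where they start. If the initial weights are too large, the tanh units start out saturated, their gradients are close to zero, and the neuron learns very slowly. Drawing from a truncated normal distribution keeps every initial weight within 2 standard deviations of the mean. 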
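
A minimal sketch of the "drop and re-pick" rule in plain Python, assuming simple rejection sampling. This illustrates the idea only and is not TensorFlow's implementation; the value `z = 150` is an arbitrary example.

```python
import math
import random


def truncated_normal(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n normal samples, re-drawing any that fall beyond 2 stddev of the mean."""
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


z = 150                               # e.g. embedding_size * window_size
stddev = 1.0 / math.sqrt(z)
weights = truncated_normal(1000, stddev=stddev, rng=random.Random(0))
print(max(abs(w) for w in weights) <= 2 * stddev)  # True
```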

In [None]:
H = tf.Variable(tf.truncated_normal([z, self.hidden_units], 
                                    stddev=1.0 / math.sqrt(z)))

#####  W 

W is the word-features to output weight found by taking a matrix of [ |V| x (n - 1)m].

I will also be using `tf.truncated_normal()` to create this weight. 

In [None]:
W = tf.Variable(tf.truncated_normal([z, vocab_size], 
                                    stddev=1.0 / math.sqrt(z)))

##### U 

U is the hidden-to-output weight found by taking a matrix [ |V| x h]. 

I will also be using `tf.truncated_normal()` to create this weight. 

In [None]:
U = tf.Variable(tf.truncated_normal([self.hidden_units, vocab_size],
                                    stddev=1.0 / vocab_size))

#### 2.3 - Hidden Layer 

For this layer, I'll be using:

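`tf.nn.bias_add()` = adds a 1-D bias vector to a tensor, broadcasting it across every row.

AND

`tf.matmul()` = multiplies matrix a by matrix b, producing a * b.

AND

`tf.tanh()` = computes the hyperbolic tangent of a tensor element-wise.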


This layer is referenced on Pg. 1142 of Bengio's paper, which notes that there are really two hidden layers: the shared word features layer C (xt below), which has no non-linearity, and the ordinary hyperbolic tangent layer, which takes xt as its input. 

Unnormalized log-probabilities (y) for each output word i are computed with the last line of the cell below. The equation is seen in the first paragraph of Pg. 1143 of Bengio's paper. 
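
For reference, the equation being implemented, written out in Bengio's notation, where x is the concatenation of the word feature vectors of the context words:

```latex
y = b + W x + U \tanh(d + H x),
\qquad
x = \bigl(C(w_{t-1}),\, C(w_{t-2}),\, \ldots,\, C(w_{t-n+1})\bigr)
```

In the code, `xt` holds one x per row and the weight matrices are stored transposed relative to the paper, so `H x` becomes `tf.matmul(xt, H)`, `W x` becomes `tf.matmul(xt, W)`, and the `U` term becomes a `tf.matmul` of the tanh output with `U`.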

In [None]:
# hidden layers
# WAR
tanh = tf.nn.tanh(tf.nn.bias_add(tf.matmul(xt, H), d))
y = tf.nn.bias_add(tf.matmul(xt, W), b) + tf.matmul(tanh, U)

#### 2.4 Softmax 

Here, I'll be using:

`tf.nn.softmax()` = computes softmax activations (i.e. divides the exponential of each logit by the sum of the exponentials of all the logits). Takes an input of a vector of K real numbers and normalizes it into a probability distribution.  

`tf.math.argmax()` = returns the index with the largest value across axes of a tensor. Used as a way of measuring accuracy of result in the optimizer. 

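`tf.one_hot()` = turns a tensor of indices into one-hot vectors of a given depth: the position named by each index gets 1 and every other position gets 0. 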

`tf.math.reduce_mean()` = computes the mean of elements across dimensions of a tensor. 

`tf.nn.softmax_cross_entropy_with_logits_v2` = Measures the probability error in discrete classification tasks. Takes the following arguments:
* arg1 = __labels__ = valid probability distribution 
* arg2 = __logits__ = unscaled log probabilities (the function applies softmax internally, so these must not already be softmaxed) 

__NOTE:__ In all honesty, I had no idea I was supposed to use softmax cross entropy with logits. I was kind of winging it with softmax cross entropy up to this point. 

The goal of this section is to create the arguments to feed into the `tf.nn.softmax_cross_entropy_with_logits_v2`. 
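
A minimal plain-Python sketch of softmax and cross-entropy from logits, to show why the loss wants the raw `y` rather than the softmax output. The helper names and the three logits are made up for illustration; this mirrors the math, not TensorFlow's implementation.

```python
import math


def softmax(logits):
    """Turn a list of real numbers into probabilities that sum to 1."""
    shifted = [v - max(logits) for v in logits]   # subtract the max for stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_with_logits(one_hot_label, logits):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


logits = [2.0, 1.0, 0.1]    # made-up unnormalized scores for 3 words
label = [1, 0, 0]           # one-hot: the correct word is index 0

correct = cross_entropy_with_logits(label, logits)
double_softmax = cross_entropy_with_logits(label, softmax(logits))
print(correct, double_softmax)
```

The second number is what comes out when already-softmaxed probabilities are passed in as if they were logits: softmax gets applied twice, the distribution is flattened, and the loss no longer matches the model's actual prediction.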

In [None]:
y_probdist = tf.nn.softmax(y)
y_ideal = tf.math.argmax(y_probdist, axis=1)
# produces labels to use in the softmax_cross_entropy_with_logits
y_labels = tf.one_hot(self.y_input, vocab_size)
# I want other functions to be able to access the result (i.e. self)
# WAR
# pass the raw logits y: the op applies softmax itself
self.ce_result = tf.math.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_labels, logits=y))

#### 2.5 Stochastic Gradient Ascent Optimizer

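Bengio names his optimizer on Pg. 1143 in his discussion of SGA. Gradient _ascent_ differs from _descent_ by aiming to maximize some objective function (here the log-likelihood) rather than minimize a loss. Maximizing the log-likelihood is the same as minimizing the cross-entropy, so with the cross-entropy from 2.4 as the loss, the TF optimizer has to _minimize_ it. 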
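
A minimal sketch of why ascent on an objective and descent on its negative are the same update, using a made-up one-parameter objective L(theta) = -(theta - 3)^2. The function name and numbers are illustrative only.

```python
def objective_grad(theta):
    """Gradient of the toy objective L(theta) = -(theta - 3)^2, maximized at theta = 3."""
    return -2.0 * (theta - 3.0)


learning_rate = 0.1
theta_ascent = 0.0
theta_descent = 0.0

for _ in range(200):
    # Gradient ascent on L: step along the gradient of L.
    theta_ascent += learning_rate * objective_grad(theta_ascent)
    # Gradient descent on the loss -L: step against the gradient of -L.
    theta_descent -= learning_rate * (-objective_grad(theta_descent))

print(round(theta_ascent, 6), round(theta_descent, 6))
```

Both parameters follow the same path and end at the maximizer of L, which is also the minimizer of -L.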

Constructing an optimizer needs a learning rate, which is mentioned on Pg. 1147 as $\varepsilon_0 = 10^{-3}$.

To build the optimizer, I just went off of what TF had in their [instructions](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) for the Adam Optimizer. 

In [None]:
# building the optimizer: maximizing log-likelihood (SGA) == minimizing cross-entropy
# WAR WAR WAR
learning_rate = 0.001
beta1 = 0.9
beta2 = 0.999
optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2).minimize(self.ce_result)
ra = tf.equal(y_ideal, self.y_input)
self.accuracy = tf.math.reduce_mean(tf.cast(ra, tf.float32))

In [None]:
self.session = tf.Session()
self.session.run(tf.global_variables_initializer())
saver = tf.train.Saver()

In [None]:
class MyModel:
    
    def __init__(self):
        self.batch_size = 256 # could be changed 
        self.embedding_size = config['embedding_size']
        self.window_size = config['window_size']
        self.hidden_units = config['hidden_units']
        

    def train(self, train_data, validation_data, number_of_epochs=5):
        # insert rest of program here
        if tf.test.is_gpu_available(cuda_only=False,
                                    min_cuda_compute_capability=None):
            device = '/gpu:0'
            print("Currently using GPU capabilities")
        # here I'll figure out how to make it use a TPU
        else: 
            device = '/cpu:0'
            print('No GPU available, using CPU capabilities')
        with tf.device(device):
            # everything I want to run using the device indented here 
            # I need my test function to be able to access this outside of train()
            self.x_input = tf.placeholder(tf.int64, [None, self.window_size])
            # Not sure why I don't need to feed in the shape here
            self.y_input = tf.placeholder(tf.int64, [None])
            
            # seemed to need this repetitively 
            z = self.embedding_size * self.window_size
            
            # hidden layer biases
            d = tf.Variable(tf.random_uniform([self.hidden_units]))
            # output biases
            b = tf.Variable(tf.random_uniform([vocab_size]))
            
            # weights
            # C function 
            word_embeddings = tf.Variable(tf.random_uniform([vocab_size, 
                                                 self.embedding_size],
                                                -1.0,
                                                1.0))
            x_flat = tf.layers.flatten(self.x_input)
            lookup = tf.nn.embedding_lookup(word_embeddings, x_flat)
            xt = tf.reshape(lookup, [self.batch_size, z])
            
            # H
            H = tf.Variable(tf.truncated_normal([z, self.hidden_units], 
                                    stddev=1.0 / math.sqrt(z)))
            # W
            W = tf.Variable(tf.truncated_normal([z, vocab_size], 
                                    stddev=1.0 / math.sqrt(z)))
            # U
            U = tf.Variable(tf.truncated_normal([self.hidden_units, vocab_size],
                                    stddev=1.0 / vocab_size))
                            
            # hidden layers
            # WAR
            tanh = tf.nn.tanh(tf.nn.bias_add(tf.matmul(xt, H), d))
            y = tf.nn.bias_add(tf.matmul(xt, W), b) + tf.matmul(tanh, U)
            
            # softmax 
            y_probdist = tf.nn.softmax(y)
            y_ideal = tf.math.argmax(y_probdist, axis=1)
            # produces labels to use in the softmax_cross_entropy_with_logits
            y_labels = tf.one_hot(self.y_input, vocab_size)
            # I want other functions to be able to access the result (i.e. self)
            # WAR
            # pass the raw logits y: the op applies softmax itself
            self.ce_result = tf.math.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_labels, logits=y))
            
            # building the optimizer: maximizing log-likelihood (SGA) == minimizing cross-entropy
            # WAR WAR WAR
            learning_rate = 0.001
            beta1 = 0.9
            beta2 = 0.999
            optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2).minimize(self.ce_result)
            ra = tf.equal(y_ideal, self.y_input)
            self.accuracy = tf.math.reduce_mean(tf.cast(ra, tf.float32))
                            
            self.session = tf.Session()
            self.session.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            
            print('training beginning . . . . .')
            # accuracy_hist_train and cost_hist_train must already exist as module-level lists
            global accuracy_hist_train, cost_hist_train
            patience = 2  # checks in a row with rising cost before stopping early
            for i in range(number_of_epochs):
                # generate_batches is defined outside this notebook
                batches = generate_batches(train_data, 
                                           self.batch_size, 
                                           self.window_size)
                total_batches = len(batches)
                batch_count = 0
                last_complete = 0
                num_messages = 10 # the number of  printouts  per  epoch
                # at least 1, so the modulo below never divides by zero
                message_every = max(1, total_batches // num_messages)
                for batch in batches:
                    batch_count += 1
                    x_batch = batch[0]
                    y_batch = batch[1]
                    feed_dict_train = {self.x_input: x_batch,
                                       self.y_input: y_batch}
                    self.session.run(optimizer, feed_dict=feed_dict_train)
                    completion = 100 * batch_count / total_batches
                    if batch_count % message_every == 0:
                        print('Epoch #%2d-   Batch #%5d:   %4.2f %% completed.' % (i + 1, batch_count, completion))
                        a_t, c_t = self.test(train_data)
                        a, c = self.test(validation_data)
                        accuracy_hist_train.append(a)
                        cost_hist_train.append(c)

                        # only compare once there are two full groups of four costs
                        if len(cost_hist_train) >= 8 and sum(cost_hist_train[-4:]) > sum(cost_hist_train[-8:-4]):
                            patience = patience - 1
                        else:
                            patience = 2

                        if patience == 0:
                            print("Cost Too High, Early Stop Activated")
                            # arg_2 and arg_3 are not defined in this notebook
                            save_path = saver.save(self.session, "../models/" + arg_2 + '_' + arg_3 + ".ckpt")
                            print("Model saved in path: %s" % save_path)
                            return
                            
        print("The training has been completed")
        save_path = saver.save(self.session, "../models/" + arg_2 + '_' + arg_3 + ".ckpt")
        print("Model saved in path: %s" % save_path)
        return
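
The early-stopping rule inside the loop can be pulled out and checked on its own. This is a minimal sketch with made-up validation costs; `update_patience` is a hypothetical helper, and the guard that waits for eight recorded costs before comparing is an assumption on my part about what the rule should do with a short history.

```python
def update_patience(cost_history, patience, window=4, reset_to=2):
    """Lower patience when the last `window` costs sum higher than the `window` before."""
    if len(cost_history) < 2 * window:
        return patience                      # not enough history to compare yet
    recent = sum(cost_history[-window:])
    previous = sum(cost_history[-2 * window:-window])
    return patience - 1 if recent > previous else reset_to


# Made-up validation costs: falling at first, then rising.
costs = [5.0, 4.0, 3.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
patience = 2
history = []
stopped_at = None
for step, cost in enumerate(costs, start=1):
    history.append(cost)
    patience = update_patience(history, patience)
    if patience == 0:
        stopped_at = step
        break

print(stopped_at)  # 10
```

Without the length guard, the slice `[-8:-4]` is empty while the history is short, its sum is 0, and the rule would count almost any early cost as "rising".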