In [1]:
import tensorflow as tf
import numpy as np

# Calculations on Computation Graph

## Computation Graphs

Consider the expression:
$$e=(a+b)*(b+1)$$
To build a computation graph from this expression, introduce intermediate variables:
$$c=a+b$$
$$d=b+1$$
$$e=c*d$$

The computation graph for this expression:

<img src="comp-graph.png" alt="Drawing" style="width: 400px;"/>

### Forward Propagation

To run forward propagation on a computation graph, assign values to all variables in the leaf nodes. Then compute the value of every other node from the values of its child nodes, following the operation that the node represents.

<img src="comp-graph-eval.png" alt="Drawing" style="width: 400px;"/>
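The forward pass can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow code; the function name `forward` and the leaf values a = 2, b = 1 are chosen here for the example.

```python
def forward(a, b):
    """Evaluate e = (a + b) * (b + 1) node by node, starting from the leaves."""
    c = a + b   # internal node c
    d = b + 1   # internal node d
    e = c * d   # top node e
    return {"a": a, "b": b, "c": c, "d": d, "e": e}


values = forward(a=2.0, b=1.0)
print(values["c"], values["d"], values["e"])  # 3.0 2.0 6.0
```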

### Back Propagation

First, compute the derivatives for the nodes and extend the computation graph with them by adding backward (derivative) relations between nodes. Then the derivative values can be computed by traversing the graph from the top node down to its leaf nodes.

<img src="comp-graph-derivs.png" alt="Drawing" style="width: 400px;"/>

Take a look at the node for variable $b$. To compute the derivative $\partial e / \partial b$, one needs the values of the two nodes $c$ and $d$ and sums them. For more complex expressions with more variables, such a sum may become very complicated. To avoid computing the values of $c$ and $d$ again and again, one needs to **cache** them.
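A minimal hand-written sketch of this backward pass for $e=(a+b)*(b+1)$ is shown below. The function name `backward` and the leaf values are assumptions made for illustration. The values of $c$ and $d$ are computed once in the forward pass and reused when the derivatives are assembled.

```python
def backward(a, b):
    # Forward pass: compute the intermediate nodes once and keep (cache) them.
    c = a + b
    d = b + 1
    e = c * d

    # Backward pass: go from the top node e down to the leaves.
    de_dc = d                           # because e = c * d
    de_dd = c
    de_da = de_dc * 1.0                 # a reaches e only through c
    de_db = de_dc * 1.0 + de_dd * 1.0   # b reaches e through both c and d
    return e, de_da, de_db


e, de_da, de_db = backward(2.0, 1.0)
print(e, de_da, de_db)  # 6.0 2.0 5.0
```

Here $\partial e / \partial b = d + c$, which is the sum of the two cached node values mentioned above.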


## TensorFlow Computation Model

TensorFlow builds the computation graph from your Python code, where the graph nodes are defined. Leaf nodes are variables and constants; internal nodes are operations. Each node in the computation graph has its own unique identifier, and within a single run its value is computed only once and then cached.

TensorFlow extends each node in the computation graph with backward derivative relations to its child nodes. Each node thus has relations to its child nodes, used in forward propagation, and relations to its parent nodes, used in back propagation.
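The "compute once and cache" behaviour can be illustrated with a toy graph in plain Python. This is a sketch under stated assumptions and not how TensorFlow is implemented internally: the `Node` class, its `evaluate` method and the extra node `f` are hypothetical names introduced only for this example.

```python
class Node:
    """Hypothetical graph node: a leaf with a fixed value or an operation on child nodes."""

    def __init__(self, name, op=None, children=(), value=None):
        self.name = name          # unique identifier of the node
        self.op = op              # function applied to child values (None for leaves)
        self.children = children
        self.value = value        # only used by leaf nodes
        self.evaluations = 0      # how many times the operation actually ran

    def evaluate(self, cache):
        # Reuse the value if this node was already computed during this run.
        if self.name in cache:
            return cache[self.name]
        if self.op is None:
            result = self.value
        else:
            args = [child.evaluate(cache) for child in self.children]
            result = self.op(*args)
            self.evaluations += 1
        cache[self.name] = result
        return result


def add(x, y):
    return x + y


def mul(x, y):
    return x * y


a = Node("a", value=2.0)
b = Node("b", value=1.0)
one = Node("one", value=1.0)
c = Node("c", add, (a, b))
d = Node("d", add, (b, one))
e = Node("e", mul, (c, d))
f = Node("f", add, (e, c))   # extra node that needs c a second time

cache = {}                   # one cache per run
print(f.evaluate(cache))     # 9.0
print(c.evaluations)         # 1
```

Although `c` is needed both by `e` and by `f`, its operation runs only once because the second request is served from the cache.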

## Using TensorFlow

#### The first phase:
Define the computation graph in your code using operations and variables from the tf namespace.

#### The second phase:
Create a calculation session with the TF server and run computations on the graph by passing the values of the leaf nodes and the names of the top nodes of the graph.

In [2]:
tf.reset_default_graph()

a = tf.constant(2.)
b = tf.constant(3.)
r = tf.add(a, b)

with tf.Session() as sess:
    print(sess.run(r))

5.0


In [3]:
tf.reset_default_graph()

a = tf.Variable(2., name='a', dtype=tf.float32)
b = tf.Variable(1., name='b', dtype=tf.float32)
c = tf.add(a, b)
d = tf.add(b, tf.constant(1.))
e = tf.multiply(c, d)

init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    print('e = ', sess.run(e))

e =  6.0


# Feed Forward Neural Network


## Formula for one neuron

<img src="artificial_neuron.png" alt="Drawing" style="width: 500px;"/>

For each neuron in the layer we have:

Input: $\boldsymbol{x} = \{ x_1, x_2, \ldots, x_n \}$

Weights: $\boldsymbol{w} = \{ w_1, w_2, \ldots, w_n \}$

Bias: $b$

Sum: $s = \sum_{i=1}^n w_ix_i + b = \boldsymbol{w} \boldsymbol{x} + b$

Activation function: $y = f(s)$
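The formulas for a single neuron translate directly into a few lines of Python. This is a minimal sketch: the function names `neuron` and `sigmoid`, the choice of the sigmoid as activation function $f$, and the sample numbers are assumptions made for illustration.

```python
import math


def sigmoid(s):
    # One possible activation function f(s); chosen here only as an example.
    return 1.0 / (1.0 + math.exp(-s))


def neuron(x, w, b, f):
    # Weighted sum s = sum_i w_i * x_i + b, followed by the activation y = f(s).
    s = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(s)


y = neuron(x=[1.0, 2.0, 3.0], w=[0.5, -1.0, 0.5], b=0.0, f=sigmoid)
print(y)  # 0.5, because the weighted sum is 0 and sigmoid(0) = 0.5
```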

## Network Layer

<img src="ffn.png" alt="Drawing" style="width: 300px;"/>

The NN layer can be described by:

Matrix $W$ whose columns are the weight vectors $\boldsymbol{w}$ of the neurons in the layer

Vector $\boldsymbol{b}$ of biases, one for each neuron in the layer

The sum is implemented as the multiplication of matrix $W^T$ by the input vector $\boldsymbol{x}$, plus the bias: $\boldsymbol{s} = W^T\boldsymbol{x} + \boldsymbol{b}$

Vector $\boldsymbol{y} = f(\boldsymbol{s})$ holds the activation function outputs for all neurons in the layer
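A layer can be sketched with numpy as follows. The sketch assumes that $W$ stores one neuron's weights per column, so that $W^T\boldsymbol{x}$ yields one weighted sum per neuron; the function names `layer` and `relu` and the sample numbers are illustrative only.

```python
import numpy as np


def relu(s):
    # Element-wise activation; ReLU is used here only as an example of f.
    return np.maximum(s, 0.0)


def layer(x, W, b, f):
    # W has shape (n_inputs, n_neurons): column j holds the weights of neuron j.
    return f(W.T @ x + b)


x = np.array([1.0, 2.0])               # 2 inputs
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, -2.0]])       # 2 inputs, 3 neurons
b = np.array([0.0, 0.5, 1.0])          # one bias per neuron

y = layer(x, W, b, relu)
print(y.shape)  # (3,)
```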


## TensorFlow representation for n-dim values

Note that a NN works with values of different dimensions:
* Vectors of inputs, biases and activations.
* Matrices of weights.
* More complex NN configurations have to operate on values with 3 or more dimensions.

So TF introduces the notion of a **tensor** to operate on values of different dimensions. Tensors are like numpy arrays, but they are used inside the TF server during the execution cycle of forward and back propagation calculations. It's possible to pass a numpy array as an input value of the computation graph before running the graph inside a session.

Tensors have a shape, which defines the dimensions of the tensor. For example, a single number has shape=(), i.e. it is 0-dimensional. Vectors have shape=(N,), where N is the length of the vector. Matrices have shape=(N,M), where N and M are the matrix dimensions.
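Since tensors behave like numpy arrays in this respect, the shapes can be explored with numpy alone. The variable names below are arbitrary and serve only as an illustration.

```python
import numpy as np

scalar = np.array(3.0)               # a single number
vector = np.array([1.0, 2.0, 3.0])   # N = 3
matrix = np.zeros((2, 3))            # N = 2, M = 3
cube = np.zeros((2, 3, 4))           # a 3-dimensional value

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(name, value.shape, value.ndim)
# scalar () 0
# vector (3,) 1
# matrix (2, 3) 2
# cube (2, 3, 4) 3
```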

One needs a way to pass values as numpy arrays (or something else) to the graph in order to run computations on the same graph with different values. TensorFlow provides the following ways to do it:
1. Use predefined values in constants and variables, as demonstrated before.
2. Use placeholders.
3. Use special data structures to fast value feeding (tf.data.Dataset).

### Using Placeholders

A placeholder is declared much like a variable, by defining the data type and shape of the data it will hold. The placeholder can then be used anywhere in the operations of the program code. The placeholder node in the computation graph doesn't contain any value yet. One should pass (**feed**) the value to the placeholder when the computation is run in a session. The TF Session run method has an additional parameter 'feed_dict', a Python dict in which one can pass values by assigning them to placeholders.

Let's take a look at the example below.


In [4]:
tf.reset_default_graph()

a = tf.placeholder(dtype=tf.float32, shape=())
b = tf.placeholder(dtype=tf.float32, shape=())
c = tf.add(a, b)
d = tf.add(b, tf.constant(1.))
e = tf.multiply(c, d)

init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    print('e = ', sess.run(e, feed_dict={a: 2., b: 1.}))
    print('e = ', sess.run(e, feed_dict={a: 5., b: 6.}))

e =  6.0
e =  77.0


# Neural Network Learning
## Gradient Descent

From Wikipedia:

> Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead

> Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task

> The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning

Neural networks (NN) are also ML models, where the optimization parameters are the weights and biases of each NN layer. The most popular learning algorithm for NNs is Gradient Descent, which tries to minimize the error by repeatedly subtracting the gradients of the error, scaled by a learning rate, from the layers' weights and biases.

<img src="gd.png" alt="Drawing" style="width: 600px;"/>
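The update rule can be sketched on a one-dimensional toy problem. The function name `gradient_descent`, the objective $f(w) = (w-3)^2$, the learning rate and the step count are all assumptions chosen for illustration; a real network applies the same step to every weight and bias.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    # Repeatedly move against the gradient: w <- w - learning_rate * grad(w).
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy objective f(w) = (w - 3)^2 with gradient f'(w) = 2 * (w - 3).
w_min = gradient_descent(grad=lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_min, 4))  # 3.0
```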

TensorFlow provides a variety of optimization algorithm implementations. They all inherit from the tf.train.Optimizer class. The class provides two ways to run optimization:
* The minimize method is the usual way to optimize NN parameters.
* compute_gradients and apply_gradients are the two parts of the minimize method. Calling them separately lets one, for instance, clip the calculated gradients before applying them to the parameters to avoid exploding gradients (typically an issue for RNNs).
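The idea of splitting the step into "compute", "clip" and "apply" can be sketched without TensorFlow. Everything below is a plain-Python stand-in: the functions `compute_grads`, `clip_by_global_norm` and `apply_grads`, the toy loss and the numbers are hypothetical and only mimic the roles of the corresponding optimizer steps.

```python
import math


def compute_grads(params):
    # Toy loss L(p) = 50 * sum(p_i^2), so dL/dp_i = 100 * p_i (assumed for the demo).
    return [100.0 * p for p in params]


def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together if their joint L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def apply_grads(params, grads, learning_rate):
    # Plain gradient descent step.
    return [p - learning_rate * g for p, g in zip(params, grads)]


params = [0.3, 0.4]
grads = compute_grads(params)                       # large gradients, norm about 50
clipped = clip_by_global_norm(grads, max_norm=5.0)  # direction kept, norm reduced to 5
params = apply_grads(params, clipped, learning_rate=0.01)
print(params)
```

Clipping keeps the direction of the gradient and only limits its length, so a single exploding step cannot throw the parameters far away.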

Some of the optimizer implementations:
* tf.train.GradientDescentOptimizer - the classic optimization algorithm. It can be successfully applied to many tasks, but more sophisticated methods are recommended.
* tf.train.MomentumOptimizer implements the momentum algorithm, which works better on plateaus.
* tf.train.AdamOptimizer is the implementation of the Adam algorithm, which is suitable for almost all cases. It's the recommended choice.
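To give a feeling for what momentum adds to plain gradient descent, here is a sketch of one common formulation of the momentum update in plain Python. The function name `momentum_descent`, the toy objective and the hyperparameter values are assumptions; the exact update rule inside a particular library may differ in detail.

```python
def momentum_descent(grad, w0, learning_rate=0.1, momentum=0.9, steps=500):
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Accumulate a decaying sum of past gradients ...
        velocity = momentum * velocity + grad(w)
        # ... and step along the accumulated direction instead of the raw gradient.
        w = w - learning_rate * velocity
    return w


# Same toy objective as for plain gradient descent: f(w) = (w - 3)^2.
w_min = momentum_descent(grad=lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_min, 4))  # 3.0
```

Because the velocity keeps contributions from earlier steps, the parameter keeps moving even where the current gradient is small, which is the plateau behaviour mentioned in the list.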


## Loss Function

One needs to know the formula of the error function to understand how to calculate gradients with respect to the NN parameters. This formula connects the network with the prediction errors calculated on a training collection. A NN together with its error function is called an extended neural network. The error function is also called the **loss** function.

<img src="extended_fnn.png" alt="Drawing" style="width: 800px;"/>

The two most popular loss functions are:
* Square (quadratic) error loss function. The formula is: $\lambda(\boldsymbol{x})=\frac{1}{2}\sum_i (t_i-x_i)^2$, where $t_i$ are the target values and $x_i$ are the predictions. It's typically used for continuous values. One of its implementations in TF is tf.losses.mean_squared_error.
* Cross entropy is used to learn categorical values (a finite set of labels). The formula is: $H(\boldsymbol{x})=-\sum_i t_i \log(x_i)$. Usually, one uses cross entropy together with softmax: cross entropy is used for learning, while softmax is used for prediction.
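Both losses are easy to write down in plain Python. The sketch below assumes the usual convention that $t$ is the target and $x$ is the prediction, and uses the standard cross entropy form $-\sum_i t_i \log(x_i)$; the function names and sample values are illustrative.

```python
import math


def squared_error(t, x):
    # 1/2 * sum_i (t_i - x_i)^2
    return 0.5 * sum((t_i - x_i) ** 2 for t_i, x_i in zip(t, x))


def softmax(z):
    # Turn raw scores into probabilities; subtracting max(z) avoids overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(t, x):
    # -sum_i t_i * log(x_i); terms with t_i == 0 contribute nothing.
    return -sum(t_i * math.log(x_i) for t_i, x_i in zip(t, x) if t_i > 0)


target = [1.0, 0.0]
probs = softmax([0.0, 0.0])            # two equal scores give [0.5, 0.5]
print(squared_error(target, probs))    # 0.25
print(cross_entropy(target, probs))    # log(2), about 0.693
```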

## Stochastic Gradient Descent

Suppose one has a training data collection $\boldsymbol{x} = \{ x_1, x_2, \ldots, x_n\}$. The question is: how to calculate the error and apply gradients? There are the following options:
1. Calculate the error on one training example and then update the NN parameters based on this error. This is called Stochastic Gradient Descent.
2. Calculate the error on the whole training collection and then apply the gradients.
3. Randomly select K examples from the training collection, do forward propagation and calculate the error on them, then apply the gradients. This technique is known as **mini-batches** and it's used in TF in almost all cases.

<img src="sgd.png" alt="Drawing" style="width: 800px;"/>
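Mini-batch selection can be sketched as follows. This is one common variant, shuffling the collection once and then cutting it into consecutive batches of K examples, rather than the only way to sample; the helper name `minibatches` and the toy data are assumptions made for illustration.

```python
import random


def minibatches(examples, batch_size, rng):
    # Shuffle the indices once, then yield consecutive slices of batch_size examples.
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


rng = random.Random(0)            # fixed seed so that the run is repeatable
data = list(range(10))            # stand-in for a training collection
batches = list(minibatches(data, batch_size=4, rng=rng))

for batch in batches:
    # In real training: forward pass, loss and gradient update on this batch.
    print(len(batch), batch)
```

With batch_size=1 this degenerates to option 1, and with batch_size=len(data) to option 2.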


# MNIST Prediction

## MNIST Database

<img src="mnist-examples.png" alt="Drawing" style="width: 500px;"/>

The training data collection contains 60000 examples, the test collection contains 10000. Each picture is a $28\times 28$ grayscale image.


## Feed Forward Network for MNIST Prediction

<img src="ffn-mnist.png" alt="Drawing" style="width: 600px;"/>

* $X$ - flattened vector of length $28\times 28 = 784$.
* $Y$ - labels as one-hot vectors. For instance, 3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
* $W_1$, $b_1$ - weights and bias for the first layer.
* $W_2$, $b_2$ - weights and bias for the second layer.
* $H = ReLU(X\times W_1 + b_1)$
* $O = Softmax(H\times W_2 + b_2)$

<img src="relu.png" alt="Drawing" style="width: 600px;"/>

<img src="softmax.png" alt="Drawing" style="width: 800px;"/>
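Before moving to TensorFlow, the forward pass of this network can be checked with numpy alone. This is a sketch with random, untrained weights and random inputs in place of real MNIST images; the helper names `one_hot`, `relu` and `softmax`, the batch of 5 examples and the hidden size of 100 are assumptions made for the example.

```python
import numpy as np


def one_hot(label, num_classes=10):
    # Vector of zeros with a single 1 at the position of the label.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.random((5, 784))                       # 5 fake flattened images
W_1 = rng.normal(scale=0.1, size=(784, 100))   # first layer weights
b_1 = np.zeros(100)
W_2 = rng.normal(scale=0.1, size=(100, 10))    # second layer weights
b_2 = np.zeros(10)

H = relu(X @ W_1 + b_1)                        # hidden activations, shape (5, 100)
O = softmax(H @ W_2 + b_2)                     # class probabilities, shape (5, 10)

print(one_hot(3))
print(H.shape, O.shape)                        # (5, 100) (5, 10)
```

Each row of `O` sums to 1, so it can be read as a probability distribution over the 10 digits.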

In [5]:
from tensorflow.examples.tutorials.mnist import input_data

tf.reset_default_graph()

mnist = input_data.read_data_sets("data/MNIST_data/", one_hot=True)

INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE = 784, 100, 10 

initializer = tf.random_normal_initializer(stddev=0.1)

# Input placeholders
X = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))  
Y = tf.placeholder(tf.float32, shape=(None, OUTPUT_SIZE))

# Hidden layer weights and bias
W_1 = tf.get_variable("Hidden_W", shape=[INPUT_SIZE, HIDDEN_SIZE], initializer=initializer)
b_1 = tf.get_variable("Hidden_b", shape=[HIDDEN_SIZE], initializer=initializer)

# Hidden layer sum and activation
hidden = tf.nn.relu(tf.matmul(X, W_1) + b_1)

# Output layer weights and bias
W_2 = tf.get_variable("Output_W", shape=[HIDDEN_SIZE, OUTPUT_SIZE], initializer=initializer)
b_2 = tf.get_variable("Output_b", shape=[OUTPUT_SIZE], initializer=initializer)

# Output layer sum
output = tf.matmul(hidden, W_2) + b_2

# Loss as crossentropy with softmax
loss = tf.losses.softmax_cross_entropy(Y, output)

# Accuracy for prediction
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(output, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Optimizer is Adam with default learning rate
train_op = tf.train.AdamOptimizer().minimize(loss)

# Run session
BATCH_SIZE, NUM_TRAINING_STEPS = 100, 1000
with tf.Session() as sess:
    # Initialize all variables in the graph
    sess.run(tf.global_variables_initializer())

    # Training Loop
    for i in range(NUM_TRAINING_STEPS):
        batch_x, batch_y = mnist.train.next_batch(BATCH_SIZE)
        curr_acc, _ = sess.run([accuracy, train_op], feed_dict={X: batch_x, Y: batch_y})
        if i % 100 == 0:
            print('Step {} Current Training Accuracy: {:.3f}'.format(i, curr_acc))
    
    # Evaluate on Test Data
    print('Test Accuracy: {:.3f}'.format(sess.run(accuracy, feed_dict={X: mnist.test.images, 
                                                                Y: mnist.test.labels})))

Extracting data/MNIST_data/train-images-idx3-ubyte.gz
Extracting data/MNIST_data/train-labels-idx1-ubyte.gz
Extracting data/MNIST_data/t10k-images-idx3-ubyte.gz
Extracting data/MNIST_data/t10k-labels-idx1-ubyte.gz
Step 0 Current Training Accuracy: 0.080
Step 100 Current Training Accuracy: 0.830
Step 200 Current Training Accuracy: 0.900
Step 300 Current Training Accuracy: 0.900
Step 400 Current Training Accuracy: 0.940
Step 500 Current Training