In [1]:
import tensorflow as tf

## Basic TensorFlow

Use TensorFlow placeholders and math functions (tf.divide, tf.subtract) to do basic arithmetic, and use feed_dict to supply the input values at run time.

In [2]:
import tensorflow as tf

# TODO: Convert the following to TensorFlow:
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
c = tf.placeholder(tf.float32)

# TODO: Print z from a session
z = tf.subtract(tf.divide(x,y),c)

with tf.Session() as sess:
    output = sess.run(z,feed_dict={x:10,y:2,c:1})
    print(output)

4.0


## Linear Function in TensorFlow

Implement a linear function of the inputs $x$, weights $W$, and bias $b$:

$$ y = xW + b $$

The weights and bias are updated during training, so they must live in a tensor that can be modified. Instead of tf.constant() or tf.placeholder(), we use tf.Variable.

In [3]:
init = tf.global_variables_initializer() # returns operation that will initialize variables from graph
with tf.Session() as sess:
    sess.run(init)

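It is good practice to initialize the weights with small random values drawn from a normal distribution, so that no single weight overwhelms the others. tf.truncated_normal() draws from a normal distribution but re-draws any value that falls more than two standard deviations from the mean.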

In [4]:
n_features = 120
n_labels = 5

# Initialize the weights from a truncated normal distribution
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

# Initialize the bias as a tensor of zeros, one per label
bias = tf.Variable(tf.zeros(n_labels))
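
With weights of shape (n_features, n_labels) and a bias of shape (n_labels,), the linear function $y = xW + b$ maps each input row to one score per label. In TensorFlow this would typically be written as tf.add(tf.matmul(x, weights), bias). The sketch below spells out the same arithmetic in plain Python so it runs without a session; the function name `linear` and the toy values are assumptions made for illustration.

```python
def linear(x, weights, bias):
    """Return x @ weights + bias for a batch of row vectors, using plain lists."""
    n_labels = len(bias)
    return [
        [sum(row[i] * weights[i][j] for i in range(len(row))) + bias[j]
         for j in range(n_labels)]
        for row in x
    ]

# Hypothetical toy sizes: 2 samples, 3 features, 2 labels
x = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
weights = [[0.5, -1.0],
           [0.25, 0.0],
           [1.0, 2.0]]
bias = [0.0, 0.0]

logits = linear(x, weights, bias)
print(logits)  # [[4.0, 5.0], [0.25, 0.0]]
```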

In [5]:
#todo

## Softmax function in TensorFlow

In [6]:
import tensorflow as tf

def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)
    
    # TODO: Calculate the softmax of the logits
    softmax = tf.nn.softmax(logits)    
    
    with tf.Session() as sess:
        # TODO: Feed in the logit data
        output = sess.run(softmax,feed_dict={logits:logit_data})
    
    return output

output = run()
print(output)

[ 0.65900117  0.24243298  0.09856589]
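
To see what tf.nn.softmax computes, here is a minimal plain-Python sketch: exponentiate each logit and divide by the sum. Subtracting the largest logit first is a common trick to avoid overflow and does not change the result; it is an addition of this sketch, not something the cell above does explicitly.

```python
import math

def softmax(logits):
    """Softmax of a list of logits, shifted by the max for numerical safety."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # agrees with the TensorFlow output above to float32 precision
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```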


## One-hot Coding
Transform labels into binary (one-hot) vectors using scikit-learn's LabelBinarizer.

In [7]:
import numpy as np
from sklearn import preprocessing

# Example labels
labels = np.array([1,5,3,2,1,4,2,1,3])

# Create the encoder
lb = preprocessing.LabelBinarizer()

# Here the encoder finds the classes and assigns one-hot vectors 
lb.fit(labels)

# And finally, transform the labels into one-hot encoded vectors
lb.transform(labels)

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0]])
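
For intuition, the encoding can be sketched by hand: sort the distinct labels, then give each label a vector with a 1 at its class position. This is a simplified illustration of the idea, not scikit-learn's implementation, and `one_hot_encode` is a hypothetical helper name.

```python
def one_hot_encode(labels):
    """Map each label to a one-hot list, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [[1 if index[label] == j else 0 for j in range(len(classes))]
            for label in labels]

encoded = one_hot_encode([1, 5, 3, 2, 1, 4, 2, 1, 3])
for row in encoded:
    print(row)
```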

## Cross-entropy in TensorFlow
Cross-entropy, $D$, is an error measurement for categorical outputs. This is computed by

<img src="./www/cross-entropy-diagram.png" alt="cross-entropy" style="width: 400px;"/>

where $y$ is the one-hot coded ground truth vector, and $\hat{y}$ is our probability output from the softmax function.
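
Written out, the quantity that the code below evaluates can be sketched as follows, assuming the natural logarithm (which is what tf.log computes) and a sum over the classes $j$:

$$ D(\hat{y}, y) = -\sum_j y_j \ln(\hat{y}_j) $$

For example, with $\hat{y} = (0.7, 0.2, 0.1)$ and $y = (1, 0, 0)$ only the first term survives, so $D = -\ln(0.7) \approx 0.357$.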

In [8]:
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session
# Build the graph from the placeholders so the values passed in feed_dict are actually used
DD = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

with tf.Session() as sess:
    
    output = sess.run(DD,feed_dict={softmax:softmax_data,one_hot:one_hot_data})
    
    print(output)

0.356675


## Numerical Stability
When computing the gradient, we must ensure that the values we output are not too large or too small. A computer's precision is finite so things can quickly explode and overflow, or become too small and effectively be treated as zero.

In the example below, we start with a = 1 billion, add $10^{-6}$ one million times, and then subtract 1 billion. We expect a result of 1, but...

In [9]:
a = 1000000000
for i in range(1000000):
    a = a + 1e-6
print(a - 1000000000)

0.95367431640625


The result is actually less than one!
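
The shortfall can be explained by the spacing between adjacent floating-point numbers near one billion. The sketch below, which assumes Python 3.9+ for math.ulp, shows that each addition of 1e-6 is rounded to a multiple of that spacing, and that accumulating the small values separately before adding them to the large one avoids the problem.

```python
import math

big = 1_000_000_000.0
step = 1e-6

# Gap between adjacent doubles near one billion (2**-23, about 1.19e-07)
spacing = math.ulp(big)
print(spacing)

# Each addition is rounded to the nearest multiple of that gap
rounded_step = round(step / spacing) * spacing
print(rounded_step * 1_000_000)  # 0.95367431640625

# Summing the small values on their own keeps their precision
small_total = 0.0
for _ in range(1_000_000):
    small_total += step
print((big + small_total) - big)  # very close to 1
```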

### A rule of thumb:
Whenever possible, our variables should have a mean of zero and equal variance, that is:

\begin{align} \mu(X_i) &= 0\\
\sigma(X_i) &= \sigma(X_j)\end{align}

<img src="./www/conditioning-vars.png" alt="conditioning-variables" style="width: 600px;"/>

Without properly conditioned variables (left), our gradient descent algorithm may need many more steps to find the minimum.

In the case of images, where (R,G,B) pixel values range from 0 to 255, we should subtract 128 from each channel value and then divide by 128.

$$(R,G,B) \rightarrow \left(\frac{R-128}{128}, \frac{G-128}{128}, \frac{B-128}{128}\right)$$
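
As a small illustration of this rescaling, the following sketch maps one (R, G, B) pixel into roughly the range [-1, 1]. The helper name `normalize_pixel` is an assumption for this example.

```python
def normalize_pixel(rgb):
    """Shift and scale 0-255 channel values to roughly [-1, 1]."""
    return tuple((channel - 128) / 128 for channel in rgb)

print(normalize_pixel((0, 128, 255)))  # (-1.0, 0.0, 0.9921875)
```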

### Weight Initialization

<img src="./www/weight-initialization.png" alt="weight-initialization" style="width: 500px;"/>

Draw the weights randomly from a Gaussian distribution with $\mu = 0$ and standard deviation $\sigma$. $\sigma$ determines the order of magnitude of the initial points of our optimization. Because of the softmax() on top of our function, this order of magnitude determines the "peakiness" of our initial probability distribution. Large $\sigma$ will result in large peaks, and a very opinionated distribution. A small $\sigma$ means that our distribution will be very uncertain. 

It is better to start with an uncertain (small $\sigma$) distribution, and let our distribution become more certain with more iterations.
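
The effect of $\sigma$ on the initial distribution can be illustrated with a toy experiment. The sketch below scales the same five standard-normal draws by different values of $\sigma$, treats them as logits, and reports the largest softmax probability. The five-class setup and the helper `peak` are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
# Shared standard-normal draws, so only sigma changes between runs
base = [random.gauss(0.0, 1.0) for _ in range(5)]

def peak(sigma):
    """Largest class probability when the logits are sigma * base."""
    return max(softmax([sigma * z for z in base]))

for sigma in (0.01, 1.0, 10.0):
    print(sigma, round(peak(sigma), 3))
```

With a small $\sigma$ the largest probability stays close to the uniform value of 0.2, while a large $\sigma$ pushes most of the probability mass onto a single class.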

## Stochastic Gradient Descent

One major issue with scaling gradient descent: computing a loss function's gradient takes ~3x as long as computing the loss function itself. And the loss function is huge -- it depends on every single element in the training set. We must also do this many times as we perform many iterations over the dataset.

In practice, we use a method called **stochastic gradient descent**. Instead of computing the loss, we compute an _estimate_ of it. We draw a random sample from the dataset (with replacement) and compute the average loss over that sample.

<img src="./www/sgd.png" alt="stochastic-gradient-descent" style="width:500px">

Because this is only an estimate of the true loss, each step is noisier (sometimes even in the wrong direction), so we take many smaller steps towards the optimum. Each step is much cheaper than a full gradient descent step, though, so in the end we win.
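
A minimal sketch of the idea, assuming a one-parameter model $y \approx wx$ with a mean squared error loss and synthetic data generated from $y = 3x$ plus noise. The data, batch size, and learning rate are all made up for this example.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 3x plus a little Gaussian noise
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + random.gauss(0.0, 0.1)))

def batch_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w*x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
learning_rate = 0.1
for step in range(500):
    batch = random.choices(data, k=10)  # small sample, drawn with replacement
    w -= learning_rate * batch_gradient(w, batch)

print(w)  # ends up close to the true slope of 3
```

Each step looks at only 10 of the 1000 samples, so a single gradient is noisy, but the many cheap steps still bring w close to the true slope.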

### Helping S.G.D.
**Inputs:**

- mean = 0
- equal variance (small)

**Weight initialization:**

- random!
- mean = 0
- equal variance (small)

**Momentum:**
<img src="./www/momentum.png" alt="momentum" style="width:500px">

- instead of using only the direction of the current batch gradient, keep a running average of the gradients and step in that direction
- this momentum technique works very well and often leads to better convergence

**Learning rate decay:**
<img src="./www/learningrate1.png" alt="learning-rate" style="width:500px">

- recall: S.G.D. yields smaller, noisier steps towards the optimum
- how small should the step be? This is an area of research
- it generally helps to make the steps smaller and smaller as you train
- some apply exponential decay to the learning rate
- some make it smaller when the loss reaches a plateau

### Learning Rate Tuning
<img src="./www/learningratetuning.png" alt="learning-rate-tuning" style="width:500px">

- counterintuitively, a lower learning rate may lead to better final performance than a higher learning rate
- don't be deceived by how fast the loss drops early in training
    

**Hyper-parameters:**

- initial learning rate
- learning rate decay
- momentum
- batch size
- weight initialization

In practice it's not _that_ bad. One thing to keep in mind: try lowering the learning rate _first_.

**Adagrad:**

- modification of S.G.D.
- implicitly does momentum and learning rate decay
- makes learning less sensitive to hyper-parameters
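
A minimal sketch of the Adagrad update for a single parameter, again on the hypothetical loss $(w - 3)^2$. Adagrad divides each step by the square root of the accumulated squared gradients, so the effective step size shrinks on its own as training proceeds. The learning rate and epsilon below are assumptions for this example.

```python
import math

def gradient(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 1.0
accumulated = 0.0  # running sum of squared gradients
epsilon = 1e-8     # avoids division by zero

for step in range(100):
    g = gradient(w)
    accumulated += g * g
    w -= learning_rate * g / (math.sqrt(accumulated) + epsilon)

print(w)  # close to the minimum at w = 3
```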

## Mini-batching

Mini-batching is a technique for training on subsets of the dataset instead of all the data at once. This is computationally less efficient, since we can't calculate the loss over all samples simultaneously, but it lets us train a model even when the dataset is too large to fit in memory.

Combined with SGD: shuffle the data at the start of each epoch, then create mini-batches. For each mini-batch, train the network weights with gradient descent. Because the batches are random, each update is an SGD step.

In [10]:
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: List of [features, labels] batches
    """
    assert len(features) == len(labels)
    n_samples = len(features)

    # Named output_batches so it does not shadow the function name
    output_batches = []

    # Step through the data in strides of batch_size; the last batch may be smaller
    for start in range(0, n_samples, batch_size):
        end = start + batch_size
        output_batches.append([features[start:end], labels[start:end]])

    return output_batches

How you would call the code in practice (sess, optimizer, and the features and labels placeholders are assumed to be defined elsewhere):

for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
    sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})
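
To check the batching behaviour without a TensorFlow session, here is a compact, self-contained version of the same function applied to made-up data. With five samples and a batch size of 2, the last batch holds the single leftover sample.

```python
def batches(batch_size, features, labels):
    """Split features and labels into consecutive [features, labels] batches."""
    assert len(features) == len(labels)
    return [[features[i:i + batch_size], labels[i:i + batch_size]]
            for i in range(0, len(features), batch_size)]

# Hypothetical toy data: 5 samples with 2 features each
example_features = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
example_labels = [0, 1, 0, 1, 1]

result = batches(2, example_features, example_labels)
for batch_features, batch_labels in result:
    print(batch_features, batch_labels)
```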

## Epochs
One epoch is a single forward and backward pass of the whole dataset. 

for epoch_i in range(epochs):
    # then loop over batches
    for batch_features, batch_labels in train_batches:
        train_feed_dict = ...
    
With each epoch, the cost typically becomes lower and the accuracy higher. We should adjust the learning rate to see how it affects the accuracy over many epochs.
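
The epoch and mini-batch loops can be sketched end to end without TensorFlow. The example below assumes a one-parameter model $y \approx wx$, noise-free synthetic data from $y = 3x$, and hand-picked values for the learning rate, batch size, and number of epochs. It reshuffles the data at the start of every epoch and records the loss after each one.

```python
import random

random.seed(0)

# Hypothetical noise-free data generated from y = 3x
samples = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(200))]

def mean_loss(w, data):
    """Mean squared error of y ~ w*x over the whole dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, learning_rate, batch_size, epochs = 0.0, 0.1, 20, 5
losses = []
for epoch_i in range(epochs):
    random.shuffle(samples)  # new batch composition every epoch
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad
    losses.append(mean_loss(w, samples))
    print(epoch_i, losses[-1])
```

On this toy problem the recorded loss drops after every epoch; on real data the trend is usually similar but not guaranteed at every step.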