### TensorFlow Basics

Tensors have a rank (number of dimensions) and a data type. __hello_constant__ is a 0-dimensional string tensor.

- tf.constant

TensorFlow's API is built around the concept of a computational graph.

A TensorFlow Session is an environment for running a graph.
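
To make the graph/session split concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the `Constant`, `Add`, and `run` names are made up for illustration. The point is only that building the graph records operations, and nothing is computed until the graph is run.

```python
class Constant:
    """Graph node that holds a fixed value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value


class Add:
    """Graph node that adds the results of two other nodes."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self):
        return self.left.evaluate() + self.right.evaluate()


def run(node):
    """Hypothetical stand-in for a session: evaluates a node on demand."""
    return node.evaluate()


# Building the graph performs no arithmetic yet.
graph = Add(Constant(2), Add(Constant(3), Constant(4)))

# Only running it produces a value.
result = run(graph)
print(result)
```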


In [4]:
import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

b'Hello World!'


We can use TensorFlow __placeholder__ variables to feed values into the computational graph within a Session.

- tf.placeholder requires a type. A shape can also be provided.



In [5]:
import tensorflow as tf

place_x = tf.placeholder(tf.string)
with tf.Session() as session:
    output = session.run(place_x,{place_x:"Heyo"})
    print(output)

Heyo


Doing math in TensorFlow (full list of ops - https://www.tensorflow.org/versions/r0.11/api_docs/python/math_ops.html#math):

- tf.add
- tf.sub
- tf.mul
- tf.div 

In [6]:
x = tf.placeholder(tf.int32)
y = tf.placeholder(tf.int32)
z = tf.sub(tf.div(x,y),tf.constant(1))

with tf.Session() as session:
    out = session.run(z,feed_dict={x:10,y:2})
    print(out)

4


The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and biases, you'll need a Tensor that can be modified.

- This leaves out tf.placeholder() and tf.constant(), since those __Tensors can't be modified__. 
- This is where tf.Variable() comes in.


In [7]:
x = tf.Variable(5)
init = tf.initialize_all_variables() # <- returns an operation that we can call in a session to initialize all the variables
x = tf.add(x,1)

with tf.Session() as session:
    session.run(init)
    out = session.run(x)
    print(out)

6


Let's build a logistic classifier. 

- Given inputs x, we can use weights w and biases b to generate logits (scores) y. 
- x: N observations, K features
- w: K features, L classes
- b: 1, L classes

We can calculate either:

- w*x+b = y, or 
- x*w + b = y

We will calculate x*w + b. 
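
The shapes are easy to lose track of, so here is a small sketch of x*w + b with nested lists instead of tensors. The helper names `matmul` and `add_bias` and all the numbers are made up for illustration; x is N x K, w is K x L, and b has L entries that get added to every row.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x l) matrix, both as nested lists."""
    return [[sum(a_row[j] * b[j][c] for j in range(len(b)))
             for c in range(len(b[0]))]
            for a_row in a]


def add_bias(rows, bias):
    """Add the same bias vector to every row (what broadcasting does)."""
    return [[value + b for value, b in zip(row, bias)] for row in rows]


# Illustrative values: N=2 observations, K=2 features, L=3 classes.
x = [[1, 2],
     [0, 1]]
w = [[1, 0, -1],
     [2, 1, 0]]
b = [0.5, 0.5, 0.5]

logits = add_bias(matmul(x, w), b)  # shape N x L
print(logits)
```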

Before computing this and training, we should initialize our weights and biases. It is appropriate to use random numbers drawn from a truncated standard normal distribution to initialize our parameters. We use truncated normals so that:

- Randomness allows for more variation when we restart the algorithm. Helps decrease the likelihood we fall into a local minimum
- Small values prevent overfitting
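
As a rough picture of what a truncated normal does, the sketch below re-draws any sample that lands more than two standard deviations from the mean, which is the behaviour the TensorFlow docs describe for tf.truncated_normal. This is a plain-Python approximation, not TensorFlow's code; the function name and the seed are assumptions made for the example.

```python
import random


def truncated_normal(mean=0.0, stddev=1.0, rng=random):
    """Draw from a normal distribution, re-drawing any sample that lands
    more than two standard deviations from the mean."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_features, n_classes = 2, 5
weights = [[truncated_normal(rng=rng) for _ in range(n_classes)]
           for _ in range(n_features)]
biases = [0.0] * n_classes

for row in weights:
    print(row)
```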

Since we want to update our weights/biases, we should represent them as TensorFlow __variables__

In [8]:
import numpy as np
n_features = 2
n_classes = 5

x = tf.placeholder(tf.float32)
weights = tf.Variable(tf.truncated_normal(shape=(n_features,n_classes)))
biases = tf.Variable(tf.zeros(n_classes))

logits = tf.add(tf.matmul(x,weights),biases)

init = tf.initialize_all_variables()

inp = np.array([[1,2],[0,1]])
               
with tf.Session() as session:
    session.run(init)
    print(inp)
    print(session.run(weights))
    print(session.run(biases))
    out = session.run(logits,feed_dict={x:inp})
    print(out)
    
    
                

[[1 2]
 [0 1]]
[[-1.07275569  0.58411843 -0.6492455   0.38904485 -0.93004668]
 [ 0.86994207 -1.79790843 -0.25243679  0.82723159  0.83809364]]
[ 0.  0.  0.  0.  0.]
[[ 0.66712844 -3.01169848 -1.15411901  2.04350805  0.7461406 ]
 [ 0.86994207 -1.79790843 -0.25243679  0.82723159  0.83809364]]


Convert logits to probabilities using the softmax function

In [9]:
import numpy as np
def softmax(x):
    """
    x: array of logits
    returns numpy array of same size with softmaxes
    """
    return np.exp(x)/np.sum(np.exp(x),axis=0)

print(softmax([.01,.02,.03]))
print(softmax([.1,.2,.3]))
print(softmax([1,2,3]))
print(softmax([10,20,30]))
print(softmax([100,200,300]))


[ 0.33000561  0.33332222  0.33667217]
[ 0.30060961  0.33222499  0.3671654 ]
[ 0.09003057  0.24472847  0.66524096]
[  2.06106005e-09   4.53978686e-05   9.99954600e-01]
[  1.38389653e-87   3.72007598e-44   1.00000000e+00]


- If we multiply logits by 10, we see that the probabilities get closer to 0 or 1.
- If we divide logits by 10, the probabilities become more uniform

So, the magnitude of the logits is important. Initially, we want our logits to be small. As the model is trained and gets better, we want the magnitude of the logits to increase.
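
One practical consequence of large logits: exponentiating them directly can overflow. A common workaround, sketched here in plain Python, is to subtract the largest logit first, which leaves the probabilities unchanged. The name `stable_softmax` is made up for this example.

```python
import math


def stable_softmax(logits):
    """Softmax that subtracts the largest logit before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(stable_softmax([1, 2, 3]))
print(stable_softmax([1000, 2000, 3000]))  # math.exp(3000) alone would overflow
```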

### Training the model. Cross Entropy, Average Cross Entropy, Optimizing avg Cross Entropy by updating weights/biases 

Now that we have softmaxes, we can calculate how close our predictions are to the true labels of our data. For this part, it helps to have our labels 1-hot encoded, or represented as vectors where all values are 0 except the index corresponding to the class. e.g. if there are 5 classes, and obs i is classified as 3, then its 1-hot encoding is [0,0,1,0,0].
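
A minimal sketch of that encoding, assuming class labels count from 1 as in the example above (many libraries count from 0 instead). The function name `one_hot` is made up here.

```python
def one_hot(label, n_classes):
    """Return a one-hot list for a 1-based class label."""
    if not 1 <= label <= n_classes:
        raise ValueError("label must be between 1 and n_classes")
    return [1 if i == label else 0 for i in range(1, n_classes + 1)]


print(one_hot(3, 5))
```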

We can use the 1 hot encoding and softmaxes to calculate the cross-entropy of our model. Cross-entropy is a distance measure on 2 vectors defined as:

d(S,L) = -1*sum(L\_{i} * log(S\_{i}))

Where S are our predicted softmaxes and L is the one-hot encoded representation of the true label. Some key points:

- Order matters for cross-entropy, d(S,L) is not necessarily equal to d(L,S)
- Log is natural log
- Our softmax functions will give a non-zero probability to every class, so the natural log will not be undefined.
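
Here is the formula above written out in plain Python, with three made-up predictions for the same label. It is only a sketch of d(S,L) for a single example; the lists and names are illustrative.

```python
import math


def cross_entropy(softmaxes, one_hot_label):
    """d(S, L) = -sum_i L_i * log(S_i), using the natural log."""
    return -sum(target * math.log(s)
                for s, target in zip(softmaxes, one_hot_label))


label = [0, 0, 1]
confident_right = [0.05, 0.05, 0.90]
unsure = [0.30, 0.30, 0.40]
confident_wrong = [0.90, 0.05, 0.05]

for name, s in [("confident and right", confident_right),
                ("unsure", unsure),
                ("confident and wrong", confident_wrong)]:
    print(name, cross_entropy(s, label))
```

The distance is smallest when the predicted probability of the true class is close to 1, and it grows quickly when the model is confidently wrong.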

As in all machine learning problems, we do training by formulating a loss function and minimizing it. For this multinomial logistic regression, we can express our Loss function as a sum of d(S,L) over all the examples in our data, divided by the total number of examples in our data. 

Then we can minimize this loss function by taking its gradient and incrementally updating our parameters in the direction of steepest descent (the negative gradient) for the loss function.
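
The whole loop can be sketched in plain Python on a tiny made-up dataset. This is an illustration under stated assumptions, not the TensorFlow implementation: the data, learning rate, and function names are invented, and it relies on the standard result that for softmax followed by cross-entropy the gradient with respect to the logits is S - L.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, w, b):
    """Softmax of x*w + b for one example x (a list of K features)."""
    logits = [sum(x[k] * w[k][c] for k in range(len(x))) + b[c]
              for c in range(len(b))]
    return softmax(logits)


def average_cross_entropy(xs, labels, w, b):
    total = 0.0
    for x, label in zip(xs, labels):
        probs = predict(x, w, b)
        total -= sum(t * math.log(p) for p, t in zip(probs, label))
    return total / len(xs)


def gradient_step(xs, labels, w, b, learning_rate):
    """One gradient descent step on the average cross-entropy."""
    n = len(xs)
    grad_w = [[0.0] * len(b) for _ in w]
    grad_b = [0.0] * len(b)
    for x, label in zip(xs, labels):
        probs = predict(x, w, b)
        for c in range(len(b)):
            error = probs[c] - label[c]  # (S - L) for this class
            grad_b[c] += error / n
            for k in range(len(x)):
                grad_w[k][c] += x[k] * error / n
    new_w = [[w[k][c] - learning_rate * grad_w[k][c] for c in range(len(b))]
             for k in range(len(w))]
    new_b = [b[c] - learning_rate * grad_b[c] for c in range(len(b))]
    return new_w, new_b


# Toy data (made up): 4 examples, 2 features, 2 classes, one-hot labels.
xs = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
w = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

losses = []
for step in range(200):
    losses.append(average_cross_entropy(xs, labels, w, b))
    w, b = gradient_step(xs, labels, w, b, learning_rate=0.5)

print(losses[0], losses[-1])
```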

### Preprocessing Input before Training

- Badly conditioned vs. well conditioned
- numerical instability



In [10]:

for exp in range(10,0,-1):
    a = 10**exp
    b = a
    for i in range(1000000):
        a += 10**-6
    print(1-(a-b))

-0.9073486328125
0.04632568359375
0.0016222000122070312
-0.00024044513702392578
-7.614493370056152e-06
6.9374218583106995e-06
-3.3853575587272644e-07
2.5247572921216488e-09
2.5247572921216488e-09
7.484004527213983e-10


Adding really big and really small numbers gives unpredictable results. The loop above shows that as a and b get smaller, the error 1-(a-b) gets closer to 0. We do not want numerical instability to impact the minimization of our loss function. So, for our input features, it is better to normalize the values to have 0 mean and equal variance before feeding them to the algorithm. For example, for images whose pixel values range from 0 to 255, for each color channel, we can do:

- r = (r-128)/128
- g = (g-128)/128
- b = (b-128)/128
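
As a quick check of the pixel scaling, centering each 8-bit channel on 128 and dividing by 128 maps values in [0, 255] to roughly [-1, 1]. The helper name and the sample pixel below are made up.

```python
def normalize_channel(value):
    """Map an 8-bit channel value in [0, 255] to roughly [-1, 1]."""
    return (value - 128) / 128


pixel = (0, 128, 255)  # made-up (r, g, b) values
normalized = tuple(normalize_channel(c) for c in pixel)
print(normalized)
```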

### Validation Set Size

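If classes are well-balanced, we can use the rule of 30 as a proxy for determining validation set size: a change in accuracy can usually be trusted when it affects at least 30 validation examples.

- Hold back more than 30000 examples for validation.
- With that many examples, accuracy changes >.1% (30 of 30000) are significant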
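
The arithmetic behind those numbers is just 30 divided by the validation set size. The function name below is made up for this sketch.

```python
def smallest_trustworthy_change(validation_size, min_examples=30):
    """Accuracy change (as a fraction) that corresponds to `min_examples`
    validation examples changing from wrong to right, or vice versa."""
    return min_examples / validation_size


for size in (3000, 30000):
    print(size, "{:.1%}".format(smallest_trustworthy_change(size)))
```

With only 3000 validation examples, improvements smaller than about 1% are hard to tell apart from noise under this rule of thumb.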

If classes are not well-balanced, as is the case for most real-world examples, what can we do?

- Resample to balance classes
- generate synthetic data to rebalance
- ??

### Stochastic Gradient Descent

- Gradient Descent runs into scalability issues for large data sets. The loss function over the entire dataset is computationally intensive to calculate, and calculating gradients can be approx 3x more intensive (as a rule of thumb) than computing the loss function
- So, we can approximate GD by training on small batches (< 1000 examples), and calculating the loss (average cross entropy) for these examples, assuming it is an approximation to the true loss (given the current parameters and data), and updating the parameters accordingly by propagating this error back through the network.
- This is a scalable approach but not a great optimizer. Oftentimes, the gradient of these batches is not in the best direction and cost function may not be monotonically decreasing
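
A minimal sketch of the mini-batch idea in plain Python. To keep it short it fits a line with squared error instead of a classifier with cross-entropy, so treat it as an illustration of batching only; the data, batch size, learning rate, and names are all assumptions.

```python
import random


def minibatches(examples, batch_size, rng):
    """Yield shuffled batches that together cover the data once (one epoch)."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


rng = random.Random(0)
# Made-up data from the line y = 2x + 1, so the right answer is known.
data = [(x / 100.0, 2.0 * (x / 100.0) + 1.0) for x in range(100)]

w, b = 0.0, 0.0
learning_rate = 0.1
for epoch in range(200):
    for batch in minibatches(data, batch_size=10, rng=rng):
        # Gradient of the mean squared error on this batch only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Each update only looks at 10 examples, so a single step is cheap, at the price of a noisier direction than the full gradient would give.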

Tricks to implement SGD:
    
1. Inputs: 0 mean and equal variance
2. Weights: random values, 0 mean, equal variance
3. Momentum - running average of gradient to get the general direction in which we should update our parameters and move towards our objective.
4. Learning Rate Decay - smaller, noisier steps to the objective. It is beneficial to make the learning rate smaller as we train; lowering it over time has been shown empirically to help.
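
Tricks 3 and 4 can be sketched on a one-dimensional toy problem. This is one common way to write them (an accumulated velocity and an exponentially shrinking learning rate), not the only one; the objective, the constants, and the function name are made up.

```python
def sgd_momentum(grad_fn, w, learning_rate=0.1, momentum=0.9,
                 decay=0.99, steps=200):
    """Minimize a 1-D function given its gradient.

    velocity is a decaying accumulation of past gradients; the learning
    rate shrinks by a constant factor after every step.
    """
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad_fn(w)
        w -= learning_rate * velocity
        learning_rate *= decay
    return w


# Made-up objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```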


SGD Parameters:

- initial weights/biases
- initial learning rate
- momentum
- decay
- batch size

When things don't work, try lowering learning rate to start.

#### Adagrad

- implicitly, does momentum and learning rate decay
- often makes learning less sensitive to hyper parameters
- but, may be less performant than SGD with good tuning.
- but, good place to start
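
The core of the Adagrad update is short enough to sketch. The version below only shows the part where the step size shrinks as squared gradients accumulate, on a made-up one-dimensional objective; the function name and constants are assumptions.

```python
import math


def adagrad(grad_fn, w, learning_rate=0.5, epsilon=1e-8, steps=500):
    """Minimize a 1-D function with the Adagrad update rule."""
    sum_sq_grad = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq_grad += g * g
        # The effective step shrinks as squared gradients accumulate.
        w -= learning_rate * g / (math.sqrt(sum_sq_grad) + epsilon)
    return w


# Made-up objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adagrad(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```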


## Deep Neural Network in TensorFlow

In [11]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)




Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz


### Building a Deep NN Flow

- Read in data and preprocess
    - Normalization
    - 1 hot encoding labels
    - split into train, validation, test

- Define Learning Parameters
    - learning rate
    - training epochs
    - batch size

- Define Input Parameters
    - Features Size
    - Number of Classes

- Define the number of units in each hidden layer
- Initialize weights and biases for each layer
- Input
    - Placeholder variables for x, labels
    - reshape x if necessary

- Multilayer Perceptron
    - layer 1 output
    - layer 2 output 
    - etc
    
- Optimizer
    - Use softmax to convert logits to probabilities
    - define average cross entropy as a function of the true labels (1 hot encoded) and logits
    - Choose SGD or ADAGRAD as gradient descent implementation to use to minimize loss function
    
- Session
    - initialize and run variables
    - for each training epoch, get a batch of __batch_size__, run optimizer with learning rate __learning_rate__ and feed in current batch, current labels. 
    - Calculate validation error after each epoch?

In [12]:
import tensorflow as tf

learning_rate = 0.01
training_epochs = 15
batch_size = 100
display_step = 1

n_input = 784
n_classes = 10

layers = {
    'layer_1':256,
    'layer_2':512
}

weights = {
    'layer_1':tf.Variable(tf.truncated_normal([n_input,layers['layer_1']])),
    'layer_2':tf.Variable(tf.truncated_normal([layers['layer_1'],layers['layer_2']])),
    'output':tf.Variable(tf.truncated_normal([layers['layer_2'],n_classes]))

}

biases = {
    'layer_1': tf.Variable(tf.truncated_normal([layers['layer_1']])),
    'layer_2': tf.Variable(tf.truncated_normal([layers['layer_2']])),
    'output': tf.Variable(tf.truncated_normal([n_classes]))
}


x = tf.placeholder(dtype=tf.float32,shape=[None,28,28,1])
y = tf.placeholder(dtype=tf.float32,shape=[None,n_classes])

x_flat = tf.reshape(x,[-1,n_input])

keep_prob = tf.placeholder(tf.float32)

def multi_layer_perceptron(x_flat,weights,biases,keep_prob):
    
    layer_1 = tf.add(tf.matmul(x_flat,weights['layer_1']),biases['layer_1'])
    layer_1 = tf.nn.relu(layer_1)
    layer_1 = tf.nn.dropout(layer_1,keep_prob)

    layer_2 = tf.add(tf.matmul(layer_1,weights['layer_2']),biases['layer_2'])
    layer_2 = tf.nn.relu(layer_2)
    layer_2 = tf.nn.dropout(layer_2,keep_prob=keep_prob)
    
    logits = tf.add(tf.matmul(layer_2,weights['output']),biases['output'])
    return(logits)
    
logits = multi_layer_perceptron(x_flat,weights,biases,keep_prob)
loss = tf.nn.softmax_cross_entropy_with_logits(logits,y)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

init = tf.initialize_all_variables()

accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits,1),tf.argmax(y,1)),tf.float32))

with tf.Session() as session:
    session.run(init)
    
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        for i in range(total_batch):
            batch_x,batch_y = mnist.train.next_batch(batch_size=batch_size)
            session.run(optimizer,feed_dict={x: batch_x,
                                            y: batch_y,
                                            keep_prob:.5})
        print(epoch)
    
        
    test_x,test_y = mnist.test.next_batch(batch_size=10000)
    print(session.run(accuracy,feed_dict={x:test_x,
                                 y:test_y,
                                 keep_prob:1.0}))


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
0.1135

Note: an accuracy of 0.1135 is about what guessing gives on 10 classes, so this run did not learn anything useful. Two likely suspects, not verified here: tf.truncated_normal defaults to a standard deviation of 1.0, which makes the initial logits very large, and loss is a per-example vector rather than its mean, which makes the effective step size scale with the batch size.


### Methods to Prevent Overfitting

- Early Termination: Stop training when validation performance begins to fall
- L2 Regularization: Add a penalty term to the loss function, that is, the sum of the squared weights (scaled by a small constant)
- Dropout: Force the network to hold redundant representations of information. Dropout randomly sets the activations for x% of nodes in a hidden layer to 0. So, for the same input, it's possible that we get different nodes set to 0, that could lead to different predictions. 
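
Dropout and the L2 penalty are both a few lines in plain Python. The sketch below uses the "inverted" form of dropout, where surviving activations are scaled by 1/keep_prob, which is also how tf.nn.dropout is documented to behave. The function names, the `beta` constant, and the sample values are made up.

```python
import random


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    and scale the survivors by 1 / keep_prob so the expected sum is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


def l2_penalty(weights, beta):
    """beta times the sum of squared weights, to be added to the loss."""
    return beta * sum(w * w for w in weights)


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, keep_prob=0.5, rng=rng))  # training: random zeros
print(dropout(activations, keep_prob=1.0, rng=rng))  # evaluation: unchanged
print(l2_penalty([0.5, -1.0, 2.0], beta=0.01))
```

Feeding keep_prob = 1.0 at evaluation time, as the cell above does for the test set, turns dropout off.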

### Convolutional Neural Networks

- NN's that share their parameters across space
- width, height, depth

#### Weight Sharing

- Enforces invariance
- reduces parameter space of problem
- squeezes higher order features out of image

#### Parameters

- number of filters
- a convolution is the application of a filter across an image
- the result of a convolution is a feature map
- patch/kernel: height, width, depth
- stride: how many pixels do we shift our filter each time we move it?
- padding: how to handle the edges? 'valid' (no padding) or 'same' (adds a padding of zeros to the outside of the image so that the output map size is the same as the input map)
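
To see filter, stride, and feature map together, here is a single-channel convolution without padding in plain Python. Like deep learning libraries, it does not flip the kernel. The function name, image, and kernel are made up for the example.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both 2-D nested lists) without padding
    and return the resulting feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            row.append(sum(image[top + di][left + dj] * kernel[di][dj]
                           for di in range(k_h) for dj in range(k_w)))
        feature_map.append(row)
    return feature_map


# Made-up 4x4 single-channel image and a 2x2 filter.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

print(convolve2d(image, kernel, stride=1))  # 3x3 feature map
print(convolve2d(image, kernel, stride=2))  # 2x2 feature map
```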

#### Calculate volume of output

Given an input layer of width (or height) W, a filter of width (or height) F, a stride of S, and a padding of P, the following formula gives the width (or height) of the next layer: (W−F+2P)/S+1. The depth of the next layer equals the number of filters.

#### Calculate number of parameters (with and without weight sharing)

- Input Image: 32x32x3
- Filters: Apply 20 filters of 8x8x3
- Stride: 2
- Padding: 1

1. new height: (32-8+2x1)/2 + 1 = 14
2. new width: (32-8+2x1)/2 + 1 = 14
3. new depth: # of filters = 20
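
The formula and the worked example can be checked with a few lines of Python. Here the inputs are taken to be the width (or height) of the input and of the filter; the function name is made up.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Output width (or height) from (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1


new_height = conv_output_size(32, 8, padding=1, stride=2)
new_width = conv_output_size(32, 8, padding=1, stride=2)
n_filters = 20  # the new depth is the number of filters
print((new_height, new_width, n_filters))
```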

##### Number of Parameters without weight sharing

- Connection between every neuron in filter and every neuron in output map, plus bias neurons
- (8x8x3 + 1)x 14x14x20 = 756560

##### Number of Parameters with weight sharing

- Connection between every neuron in filter and each output map, plus bias neurons
- (8x8x3 + 1)x 20 = 3860

We use 196X fewer parameters when using weight sharing!
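
The same counts, spelled out in Python so the arithmetic is easy to re-run with other sizes. The variable names are made up; the numbers are the ones used above.

```python
filter_h, filter_w, filter_depth = 8, 8, 3
out_h, out_w, n_filters = 14, 14, 20

weights_per_neuron = filter_h * filter_w * filter_depth + 1  # +1 for the bias

without_sharing = weights_per_neuron * out_h * out_w * n_filters
with_sharing = weights_per_neuron * n_filters

print(without_sharing, with_sharing, without_sharing // with_sharing)
```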





In [16]:
input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1] # (batch, height, width, depth)
# TensorFlow has no explicit P=1 option: 'VALID' adds no padding,
# so this layer outputs 13x13x20 rather than the 14x14x20 computed above.
padding = 'VALID'
conv_layer = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
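
TensorFlow's two padding modes use their own output-size rules, which is why the cell above does not produce a 14x14 map. The sketch below follows the rules given in the TensorFlow convolution docs ('SAME': ceil(in / stride); 'VALID': ceil((in - filter + 1) / stride)); the function name is made up.

```python
import math


def tf_output_size(input_size, filter_size, stride, padding):
    """Spatial output size under TensorFlow's 'SAME' and 'VALID' rules."""
    if padding == 'SAME':
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


print(tf_output_size(32, 8, stride=2, padding='VALID'))
print(tf_output_size(32, 8, stride=2, padding='SAME'))
```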

### Improvements to Simple ConvNets

- Pooling
- 1x1 Convolutions
- Inception Architecture

#### Pooling

- at every point on the feature map, look at a small neighborhood around it and compute the maximum
- doesn't add parameters
- makes computation more expensive, because we have presumably modified the convolution in the previous layer to have a lower stride.
- adds hyperparameters (pooling region size, pooling stride)
- LeNet-5 (an early network built from alternating convolution and pooling layers) and AlexNet are well-known examples.
- instead of taking the max, we can take the average

- Pooling decreases the size of the output and helps prevent overfitting. The latter is a consequence of reducing the output size, which in turn reduces the number of parameters in later layers.

Recent datasets are so big and complex we're more concerned about underfitting.
Dropout is a much better regularizer.
Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.
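
A small plain-Python sketch of pooling on a made-up 4x4 feature map, with the max and the average as interchangeable reducers. The function names and the `reducer` parameter are assumptions for this example; there is no padding.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Apply `reducer` (max by default) to each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[reducer([feature_map[i * stride + di][j * stride + dj]
                      for di in range(size) for dj in range(size)])
             for j in range(out_w)]
            for i in range(out_h)]


def mean(values):
    return sum(values) / len(values)


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [0, 1, 3, 4]]

print(pool2d(feature_map))                # max pooling
print(pool2d(feature_map, reducer=mean))  # average pooling
```

Each 2x2 window of four numbers is reduced to one, so three quarters of the values are discarded by max pooling.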

In [18]:
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

In [21]:
# Parameters
learning_rate = 0.001
batch_size = 128
training_epochs = 30

n_classes = 10  # MNIST total classes (0-9 digits)
layer_depth = {
    'layer_1': 32,
    'layer_2': 64,
    'layer_3': 128,
    'fully_connected': 512
}

weights = {
    'layer_1': tf.Variable(tf.truncated_normal(
        [5, 5, 1, layer_depth['layer_1']])),
    'layer_2': tf.Variable(tf.truncated_normal(
        [5, 5, layer_depth['layer_1'], layer_depth['layer_2']])),
    'layer_3': tf.Variable(tf.truncated_normal(
        [5, 5, layer_depth['layer_2'], layer_depth['layer_3']])),
    'fully_connected': tf.Variable(tf.truncated_normal(
        [4*4*128, layer_depth['fully_connected']])),
    'out': tf.Variable(tf.truncated_normal(
        [layer_depth['fully_connected'], n_classes]))
}
biases = {
    'layer_1': tf.Variable(tf.zeros(layer_depth['layer_1'])),
    'layer_2': tf.Variable(tf.zeros(layer_depth['layer_2'])),
    'layer_3': tf.Variable(tf.zeros(layer_depth['layer_3'])),
    'fully_connected': tf.Variable(tf.zeros(layer_depth['fully_connected'])),
    'out': tf.Variable(tf.zeros(n_classes))
}

def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')


def conv_net(x, weights, biases):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['layer_1'], biases['layer_1'])
    conv1 = maxpool2d(conv1)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['layer_2'], biases['layer_2'])
    conv2 = maxpool2d(conv2)

    # Layer 3 - 7*7*64 to 4*4*128
    conv3 = conv2d(conv2, weights['layer_3'], biases['layer_3'])
    conv3 = maxpool2d(conv3)

    # Fully connected layer - 4*4*128 to 512
    # Reshape conv3 output to fit fully connected layer input
    fc1 = tf.reshape(
        conv3,
        [-1, weights['fully_connected'].get_shape().as_list()[0]])
    fc1 = tf.add(
        tf.matmul(fc1, weights['fully_connected']),
        biases['fully_connected'])
    fc1 = tf.nn.tanh(fc1)

    # Output Layer - class prediction - 512 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

x = tf.placeholder("float", [None, 28, 28, 1])
y = tf.placeholder("float", [None, n_classes])

logits = conv_net(x, weights, biases)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) on the current batch
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
        # Display the cost on the last batch of each epoch
        c = sess.run(cost, feed_dict={x: batch_x, y: batch_y})
        print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c))
    print("Optimization Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print(
        "Accuracy:",
        accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

Epoch: 0001 cost= 9.523663521
Epoch: 0002 cost= 6.950458050
Epoch: 0003 cost= 6.193962097
Epoch: 0004 cost= 5.135995865
Epoch: 0005 cost= 4.015914917


KeyboardInterrupt: 