Tensors have a rank (number of dimensions) and a type. __hello_constant__, defined below, is a 0-dimensional string tensor.

- tf.constant

TensorFlow's API is built around the concept of a computational graph.

A TensorFlow Session is an environment for running a graph.


In [1]:
import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

b'Hello World!'


We can use TensorFlow __placeholder__ variables to feed values into the computational graph within a Session.

- tf.placeholder requires a type. A shape can also be provided.



In [10]:
import tensorflow as tf

place_x = tf.placeholder(tf.string)
with tf.Session() as session:
    output = session.run(place_x,{place_x:"Heyo"})
    print(output)

Heyo


Doing math in TensorFlow (full list of ops - https://www.tensorflow.org/versions/r0.11/api_docs/python/math_ops.html#math):

- tf.add
- tf.sub
- tf.mul
- tf.div 

In [17]:
x = tf.placeholder(tf.int32)
y = tf.placeholder(tf.int32)
z = tf.sub(tf.div(x,y),tf.constant(1))

with tf.Session() as session:
    out = session.run(z,feed_dict={x:10,y:2})
    print(out)

4


The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and biases, you'll need a Tensor that can be modified.

- This leaves out tf.placeholder() and tf.constant(), since those __Tensors can't be modified__. 
- This is where tf.Variable() comes in.


In [37]:
x = tf.Variable(5)
init = tf.initialize_all_variables() # <- returns an operation that we can call in a session to initialize all the variables
y = tf.add(x,1)

with tf.Session() as session:
    session.run(init)
    out = session.run(x)
    print(out)

5


Let's build a logistic classifier. 

- Given inputs x, we can use weights w and biases b to generate logits (scores) y. 
- x: N observations, K features
- w: K features, L classes
- b: 1, L classes

We can calculate either:

- w*x+b = y, or 
- x*w + b = y

We will calculate x*w + b. 
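The shapes listed above can be checked with a small pure-Python sketch. The helper names `matmul` and `add_bias` are made up for illustration and are not TensorFlow functions; the sketch assumes x is N x K, w is K x L, and b has one entry per class.

```python
def matmul(a, b):
    """Multiply an (N x K) list-of-lists by a (K x L) list-of-lists."""
    n, k, l = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(l)]
            for i in range(n)]


def add_bias(m, bias):
    """Add a length-L bias to every row of an (N x L) matrix."""
    return [[v + bj for v, bj in zip(row, bias)] for row in m]


x = [[1.0, 2.0], [0.0, 1.0]]            # N=2 observations, K=2 features
w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # K=2 features, L=3 classes
b = [0.0, 0.0, 1.0]                     # one bias per class

logits = add_bias(matmul(x, w), b)      # N x L: one row of scores per observation
```

Computing x*w + b keeps one observation per row, so the bias is simply added to each row of the result.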

Before computing this and training our weights and biases, we need to initialize them. It is appropriate to initialize the parameters with random numbers drawn from a truncated standard normal distribution. We use truncated normals so that:

- Randomness allows for more variation when we restart the algorithm, which helps decrease the likelihood that we fall into the same local minimum.
- Truncation keeps the initial values small, so no single weight dominates the others at the start of training.
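As a rough illustration of what a truncated normal draw looks like, here is a pure-Python sketch. It assumes the rule documented for tf.truncated_normal, where values more than two standard deviations from the mean are dropped and re-drawn. The function below is a hypothetical stand-in, not the TensorFlow implementation.

```python
import random


def truncated_normal(shape, stddev=1.0, seed=None):
    """Draw a rows x cols list-of-lists, re-drawing values beyond 2 * stddev."""
    rng = random.Random(seed)
    rows, cols = shape

    def draw():
        while True:
            v = rng.gauss(0.0, stddev)
            if abs(v) <= 2.0 * stddev:  # keep only values within two std devs
                return v

    return [[draw() for _ in range(cols)] for _ in range(rows)]


weights = truncated_normal((2, 5), seed=0)  # n_features x n_classes
```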

Since we want to update our weights/biases, we should represent them as TensorFlow __variables__

In [57]:
import numpy as np
n_features = 2
n_classes = 5

x = tf.placeholder(tf.float32)
weights = tf.Variable(tf.truncated_normal(shape=(n_features,n_classes)))
biases = tf.Variable(tf.zeros(n_classes))

logits = tf.add(tf.matmul(x,weights),biases)

init = tf.initialize_all_variables()

inp = np.array([[1,2],[0,1]])
               
with tf.Session() as session:
    session.run(init)
    print(inp)
    print(session.run(weights))
    print(session.run(biases))
    out = session.run(logits,feed_dict={x:inp})
    print(out)
    
    
                

[[1 2]
 [0 1]]
[[-1.68838227 -1.87039447 -1.45894945 -0.71141493 -0.059352  ]
 [-0.79901886 -0.23569219  0.10964096  0.03092351 -0.22617172]]
[ 0.  0.  0.  0.  0.]
[[-3.28641987 -2.34177876 -1.23966753 -0.6495679  -0.51169544]
 [-0.79901886 -0.23569219  0.10964096  0.03092351 -0.22617172]]


Convert logits to probabilities using the softmax function

In [72]:
import numpy as np
def softmax(x):
    """
    x: array of logits
    returns numpy array of same size with softmaxes
    """
    return np.exp(x)/np.sum(np.exp(x),axis=0)

print(softmax([.01,.02,.03]))
print(softmax([.1,.2,.3]))
print(softmax([1,2,3]))
print(softmax([10,20,30]))
print(softmax([100,200,300]))


[ 0.33000561  0.33332222  0.33667217]
[ 0.30060961  0.33222499  0.3671654 ]
[ 0.09003057  0.24472847  0.66524096]
[  2.06106005e-09   4.53978686e-05   9.99954600e-01]
[  1.38389653e-87   3.72007598e-44   1.00000000e+00]


- If we multiply the logits by 10, we see that the probabilities get closer to 0 or 1.
- If we divide the logits by 10, the probabilities become more uniform.

So, the magnitude of the logits is important. Initially, we want our logits to be small. As the model is trained and gets better, we want the magnitude of the logits to increase.
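The effect of logit magnitude can be reproduced without NumPy. The sketch below is a variant of the softmax above, not the same function: it subtracts the largest logit before exponentiating, a common trick that leaves the result unchanged but avoids overflow for very large logits.

```python
import math


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


base = [1.0, 2.0, 3.0]
peaked = softmax([10 * v for v in base])  # larger logits: close to 0 or 1
flat = softmax([v / 10 for v in base])    # smaller logits: close to uniform
big = softmax([1000.0, 2000.0, 3000.0])   # math.exp(3000.0) alone would overflow
```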

### Training the model. Cross Entropy, Average Cross Entropy, Optimizing avg Cross Entropy by updating weights/biases 

Now that we have softmaxes, we can calculate how close our predictions are to the true labels of our data. For this part, it helps to have our labels one-hot encoded, that is, represented as vectors where all values are 0 except at the index corresponding to the class. For example, if there are 5 classes and observation i belongs to class 3, then its one-hot encoding is [0,0,1,0,0].
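A minimal sketch of one-hot encoding follows. The helper `one_hot` is hypothetical, and it assumes class labels are numbered from 1 so that it matches the example above.

```python
def one_hot(label, n_classes):
    """Return a one-hot list; `label` is 1-based to match the example above."""
    if not 1 <= label <= n_classes:
        raise ValueError("label out of range")
    return [1 if i == label - 1 else 0 for i in range(n_classes)]


encoded = one_hot(3, 5)
print(encoded)
```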

We can use the 1 hot encoding and softmaxes to calculate the cross-entropy of our model. Cross-entropy is a distance measure on 2 vectors defined as:

$$d(S, L) = -\sum_{i} L_{i} \log(S_{i})$$

Where S is the vector of our predicted softmaxes and L is the one-hot encoded representation of the true label. Some key points:

- Order matters for cross-entropy: d(S,L) is not necessarily equal to d(L,S).
- Log is natural log
- Our softmax functions will give a non-zero probability to every class, so the natural log will not be undefined.
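The definition of d(S,L) can be sketched in a few lines of pure Python. The softmax values below are made-up numbers for illustration, and terms where L_i is 0 are skipped because they contribute nothing to the sum.

```python
import math


def cross_entropy(s, l):
    """d(S, L) = -sum_i L_i * log(S_i), skipping terms where L_i is 0."""
    return -sum(li * math.log(si) for si, li in zip(s, l) if li != 0)


L = [1, 0, 0]                             # true class is the first one
good = cross_entropy([0.7, 0.2, 0.1], L)  # equals -log(0.7)
bad = cross_entropy([0.1, 0.2, 0.7], L)   # equals -log(0.1)
```

A confident correct prediction gives a small distance and a confident wrong prediction gives a large one. Swapping the arguments would require log(0) for the zero entries of L, which is one way to see that the order matters.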

As in all machine learning problems, we do training by formulating a loss function and minimizing it. For this multinomial logistic regression, we can express our Loss function as a sum of d(S,L) over all the examples in our data, divided by the total number of examples in our data. 

Then we can minimize this loss function by computing its gradient with respect to the weights and biases and incrementally updating our parameters in the direction of steepest descent, i.e. along the negative gradient, until the gradient is close to zero.
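A pure-Python sketch of this procedure on a tiny made-up dataset follows. It relies on the standard result that, for softmax followed by cross-entropy, the gradient of the loss with respect to the logits is S - L. All names, the data, and the learning rate of 0.5 are illustrative assumptions.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]


def predict(x, w, b):
    """Softmax of x*w + b for one observation x (length K)."""
    logits = [sum(x[k] * w[k][j] for k in range(len(x))) + b[j]
              for j in range(len(b))]
    return softmax(logits)


def average_cross_entropy(xs, ls, w, b):
    total = 0.0
    for x, l in zip(xs, ls):
        s = predict(x, w, b)
        total -= sum(li * math.log(si) for si, li in zip(s, l) if li)
    return total / len(xs)


def gradient_step(xs, ls, w, b, lr):
    """One full-batch gradient descent step, updating w and b in place."""
    n, k, c = len(xs), len(w), len(b)
    grad_w = [[0.0] * c for _ in range(k)]
    grad_b = [0.0] * c
    for x, l in zip(xs, ls):
        s = predict(x, w, b)
        for j in range(c):
            err = (s[j] - l[j]) / n  # d(loss)/d(logit_j) for this example
            grad_b[j] += err
            for i in range(k):
                grad_w[i][j] += x[i] * err
    for j in range(c):
        b[j] -= lr * grad_b[j]
        for i in range(k):
            w[i][j] -= lr * grad_w[i][j]


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 observations, 2 features
ls = [[1, 0], [0, 1], [1, 0]]              # one-hot labels, 2 classes
w = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

losses = [average_cross_entropy(xs, ls, w, b)]
for _ in range(100):
    gradient_step(xs, ls, w, b, lr=0.5)
    losses.append(average_cross_entropy(xs, ls, w, b))
```

With all parameters at zero the model predicts a uniform distribution, so the first loss is log(2). The later entries in `losses` show the loss shrinking as the parameters are updated.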

### Preprocessing Input before Training

- Badly conditioned vs. well conditioned problems
- numerical instability



In [90]:

for exp in range(10,0,-1):
    a = 10**exp
    b = a
    for i in range(1000000):
        a += 10**-6
    print(1-(a-b))

-0.9073486328125
0.04632568359375
0.0016222000122070312
-0.00024044513702392578
-7.614493370056152e-06
6.9374218583106995e-06
-3.3853575587272644e-07
2.5247572921216488e-09
2.5247572921216488e-09
7.484004527213983e-10


Adding very small numbers to very large numbers introduces floating-point rounding error. We see that as a and b get smaller, the error 1-(a-b) -> 0. We do not want numerical instability to impact the minimization of our loss function. So, for our input features, it is better to normalize the values to have 0 mean and equal variance before feeding them to the algorithm. For example, for images with pixel values between 0 and 255, for each color channel, we can do:

- r = (r-128)/128
- g = (g-128)/128
- b = (b-128)/128
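A minimal sketch of this kind of rescaling for one color channel follows. It assumes 8-bit values in the range 0 to 255 and subtracts 128 before dividing, so the results fall roughly in [-1, 1] and are centered near 0. The helper name `normalize_channel` is hypothetical.

```python
def normalize_channel(values, center=128.0, scale=128.0):
    """Shift and scale raw channel values so they are centered near zero."""
    return [(v - center) / scale for v in values]


pixels = [0, 64, 128, 255]
normalized = normalize_channel(pixels)
print(normalized)
```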

### Validation Set Size

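If classes are well-balanced, we can use the rule of 30 as a proxy for determining validation set size: a change that affects at least 30 examples in the validation set is usually considered significant.

- Hold back more than 30000 examples for validation.
- Then accuracy changes > 0.1% (30 out of 30000 examples) are significant.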
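The arithmetic behind these numbers can be sketched as follows, assuming the rule of 30 means that a change must affect at least 30 validation examples to be trusted. The helper name is hypothetical.

```python
def min_significant_change(validation_size, rule=30):
    """Smallest accuracy change (as a fraction) that moves at least `rule` examples."""
    return rule / validation_size


threshold = min_significant_change(30000)  # 30 / 30000 = 0.001, i.e. 0.1%
small_set = min_significant_change(3000)   # 30 / 3000 = 0.01, i.e. 1%
```

Under this assumption, a smaller validation set can only resolve coarser differences in accuracy.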

If classes are not well-balanced, as is the case for more real world examples, what can we do?

- Resample to balance classes
- generate synthetic data to rebalance
- ??

### Stochastic Gradient Descent

- Gradient descent runs into scalability issues for large data sets. The loss function over the entire dataset is computationally intensive to calculate, and calculating its gradient can be approximately 3x more expensive (as a rule of thumb) than computing the loss itself.
- So, we can approximate GD by training on small batches (< 1000 examples): we calculate the loss (average cross-entropy) for the batch, treat it as an estimate of the true loss (given the current parameters and data), and update the parameters accordingly by propagating this error back through the network.
- This is a scalable approach but not a great optimizer. Oftentimes, the gradient computed from a batch does not point in the best direction, and the cost function may not decrease monotonically.
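The batching idea can be sketched on a deliberately simple problem. Instead of the classifier, the example below minimizes mean((w - x)^2) over a made-up dataset, whose exact minimizer is the data mean. Each update uses the gradient from one small shuffled batch. The helper `minibatches`, the batch size, and the learning rate are illustrative assumptions.

```python
import random


def minibatches(examples, batch_size, seed=None):
    """Yield shuffled batches that together cover `examples` once (one epoch)."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


data = [float(v) for v in range(100)]  # minimizer of mean((w - x)^2) is 49.5
w = 0.0
lr = 0.05
for epoch in range(20):
    for batch in minibatches(data, batch_size=10, seed=epoch):
        # Noisy estimate of the full gradient, computed from 10 examples only.
        grad = sum(2.0 * (w - x) for x in batch) / len(batch)
        w -= lr * grad
```

The final `w` lands near 49.5 but keeps jittering around it, because each batch gradient is only an approximation of the full one.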

Tricks to implement SGD:
    
1. Inputs: 0 mean and equal variance
2. Weights: random values, 0 mean, equal variance
3. Momentum: keep a running average of the gradients to get the general direction in which we should update our parameters to move towards our objective.
4. Learning rate decay: SGD takes smaller, noisier steps towards the objective, so it is beneficial to make the learning rate smaller as we train. Lowering it over time has been shown empirically to help.
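Tricks 3 and 4 can be sketched on a one-dimensional problem. The running-average form of momentum and the exponential decay schedule below are one possible choice among several. The function names and constants are assumptions for illustration.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """Return updated (w, velocity) using a running average of gradients."""
    velocity = momentum * velocity + (1.0 - momentum) * grad
    return w - lr * velocity, velocity


def decayed_lr(initial_lr, step, decay_rate=0.96, decay_steps=100):
    """Exponential decay: the rate shrinks by `decay_rate` every `decay_steps` steps."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, velocity = 0.0, 0.0
for step in range(500):
    grad = 2.0 * (w - 3.0)
    w, velocity = sgd_momentum_step(w, grad, velocity, decayed_lr(0.1, step))
```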


SGD Parameters:

- initial weights/biases
- initial learning rate
- momentum
- decay
- batch size

When things don't work, try lowering the learning rate first.

#### Adagrad

- Implicitly does momentum and learning rate decay.
- Often makes learning less sensitive to hyperparameters.
- May perform slightly worse than SGD with good tuning.
- Still, a good place to start.
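A minimal sketch of the basic Adagrad update for a single parameter follows, assuming the standard formulation in which each step is divided by the square root of the accumulated squared gradients. The function name, learning rate, and `eps` value are illustrative.

```python
import math


def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """Return updated (w, accum); accum is the running sum of squared gradients."""
    accum = accum + grad * grad
    return w - lr * grad / (math.sqrt(accum) + eps), accum


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, accum = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, accum = adagrad_step(w, grad, accum)
```

Because `accum` only grows, the effective step size shrinks over time, which is where the implicit learning rate decay comes from.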
