# Hello, Tensor World!
## Tensor

In TensorFlow, data is encapsulated in an object called a tensor. In `hello_constant = tf.constant('Hello World!')`, `hello_constant` is a 0-dimensional string tensor. Tensors come in a variety of sizes:
```python
import tensorflow as tf

# A is a 0-dimensional int32 tensor
A = tf.constant(1234)
# B is a 1-dimensional int32 tensor
B = tf.constant([123, 456, 789])
# C is a 2-dimensional int32 tensor
C = tf.constant([[123, 456, 789], [222, 333, 444]])
```

The tensor returned by `tf.constant()` is called a constant tensor, because its value never changes.
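To make "dimension" concrete, here is a minimal pure-Python sketch (no TensorFlow required) that infers the shape of the same nested values used for `A`, `B`, and `C`. The helper name `shape_of` is hypothetical and only illustrates the idea; it assumes regularly nested lists.

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a tuple (hypothetical helper)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:          # an empty list has no first element to descend into
            break
        value = value[0]
    return tuple(shape)


a = 1234
b = [123, 456, 789]
c = [[123, 456, 789], [222, 333, 444]]

for name, value in (("A", a), ("B", b), ("C", c)):
    s = shape_of(value)
    # The number of dimensions is the length of the shape tuple.
    print(name, "dimensions:", len(s), "shape:", s)
```

A scalar has an empty shape, which is why `A` counts as 0-dimensional.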

## Session
TensorFlow's API is built around the idea of a computational graph, a way of visualizing a mathematical process.  The TensorFlow code expressed as a graph:
![Session](Lesson6/Session.png)

A TensorFlow Session is an environment for running a graph.  The session is in charge of allocating the operations to GPUs and CPUs, including those on remote machines.

# TensorFlow Input
Go over the basics of feeding data into TensorFlow.

## tf.placeholder()
You can't just set `x` to your dataset and put it in TensorFlow.  **tf.placeholder()** returns a tensor that gets its value from data passed to **tf.Session.run()**, which lets you set the input right before the session runs.

## Session's feed_dict
```python
x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: "Hello World"})
```

Use the **feed_dict** parameter in **tf.Session.run()** to set the placeholder tensor. It is also possible to set more than one tensor:
```python
x = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Test String', y: 123, z: 45.67})
```
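The split between building a graph and running it with fed values can be sketched without TensorFlow. The toy classes below (`Placeholder`, `Add`, and the `run` function) are hypothetical stand-ins, not TensorFlow's implementation; they only show that nothing is computed until the graph is run and that a placeholder takes its value from the dictionary supplied at that moment.

```python
class Placeholder:
    """A graph node with no value until one is fed at run time."""
    def __init__(self, name):
        self.name = name


class Add:
    """A graph node that sums two other nodes."""
    def __init__(self, left, right):
        self.left, self.right = left, right


def run(node, feed_dict):
    """Evaluate `node`, looking up placeholder values in `feed_dict`."""
    if isinstance(node, Placeholder):
        if node not in feed_dict:
            raise KeyError("no value fed for placeholder " + node.name)
        return feed_dict[node]
    if isinstance(node, Add):
        return run(node.left, feed_dict) + run(node.right, feed_dict)
    return node  # plain Python values act as constants


x = Placeholder("x")
y = Placeholder("y")
total = Add(x, Add(y, 1))   # building the graph computes nothing

print(run(total, {x: 5, y: 2}))    # 8
print(run(total, {x: 10, y: 20}))  # 31
```

The same graph is reused with different inputs, which is the role placeholders play when a model is fed different batches of data.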

# TensorFlow Math
## Addition
```python
x = tf.add(5, 2)
```

`tf.add()` takes in two numbers, two tensors, or one of each, and returns their sum as a tensor.

## Subtraction and Multiplication
```python
x = tf.subtract(10, 4) # 6
y = tf.multiply(2, 5)  # 10
```

## Converting Types
It may be necessary to convert between types to make certain operators work together, e.g. floats and ints.

You can make sure your data is all of the same type, or you can cast a value to another type.
```python
tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))
```

## Quiz
```python
import tensorflow as tf

# TODO: Convert the following to TensorFlow:
x = tf.constant(10)
y = tf.constant(2)
z = tf.subtract(tf.divide(x, y), tf.cast(tf.constant(1), tf.float64))

# TODO: Print z from a session
with tf.Session() as sess:
    output = sess.run(z)
    print(output)
```

# Supervised Classification
* Take an input and get an output, e.g. identify a letter.
* Use a training set with already labeled data
* We will be learning about a logistic classifier

# Training Your Logistic Classifier
* It is linear
* Takes input and applies a linear function to generate its prediction
* Pretty much a matrix multiplier
* A lot like last lesson
* Turns scores into the probability of the correct label.
    * Use the softmax function to turn scores into probabilities
$$
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

* Scores of logistic regression are also called logits

# TensorFlow Linear Function
Let's build up the function **y = Wx + b**.

First it's **y = Wx**
![wx 1](Lesson6/wx-1.jpg)

**y = Wx + b**
![wx 1](Lesson6/wx-b.jpg)

## Transposition
Instead of **y = Wx + b**, we will be using **y = xW + b**.  The two forms compute the same values; the difference is that `x` is now a row vector (one row per example) and `W` is transposed, so its rows and columns are switched.  This is the form used in the TensorFlow code.
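A small numeric check makes the equivalence concrete. This is a sketch in plain Python under assumed example values (3 features, 2 labels); the helpers `matmul` and `transpose` are written here for illustration and are not TensorFlow functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(row) for row in zip(*m)]


# Assumed example: 3 input features, 2 output labels.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # shape (2, 3): labels x features
x_col = [[1.0], [0.5], [-1.0]]   # shape (3, 1): one example as a column
b = [0.1, 0.2]

# y = Wx + b, with x as a column vector
y_col = [[row[0] + b[i]] for i, row in enumerate(matmul(W, x_col))]

# y = xW + b, with x as a row vector and W transposed to (features, labels)
x_row = transpose(x_col)         # shape (1, 3)
W_t = transpose(W)               # shape (3, 2)
y_row = [[v + b[j] for j, v in enumerate(row)] for row in matmul(x_row, W_t)]

print(y_col)   # column layout, shape (2, 1)
print(y_row)   # row layout, shape (1, 2), same numbers
```

The row layout is convenient because a batch of examples can be stacked as extra rows of `x` without changing `W`.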

## Weights and Bias in TensorFlow
The goal of training a neural network is to modify the weights and biases to best predict the labels.  To do that, the weights and biases must be stored in a tensor that can be modified, which rules out **tf.placeholder()** and **tf.constant()**. We will use **tf.Variable()**.

### tf.Variable()
```python
x = tf.Variable(5)
```

A **tf.Variable** tensor stores its state in the session, so its state must be initialized manually.  Use **tf.global_variables_initializer()** to initialize the state of all variable tensors:

```python
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
```

The **tf.global_variables_initializer()** call returns an operation that will initialize all TensorFlow variables from the graph.  Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it.

Choosing weights from a normal distribution prevents any one weight from overwhelming the other weights.  Use **tf.truncated_normal()** to generate random numbers from a normal distribution.

### tf.truncated_normal()
```python
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
```

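The **tf.truncated_normal()** function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.  Since the weights are already randomized, the bias does not need to be; the simplest choice is to set it to 0.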
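The truncation can be illustrated with a small rejection-sampling sketch in plain Python: draw from a normal distribution and re-draw any value that lands more than 2 standard deviations from the mean. The function below is a hypothetical stand-in written for illustration, not TensorFlow's implementation, and the `seed` parameter is an assumption added for repeatability.

```python
import random


def truncated_normal(n, mean=0.0, stddev=1.0, seed=None):
    """Draw n samples from a normal distribution, re-drawing any sample
    that lands more than 2 standard deviations from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


weights = truncated_normal(10, seed=0)
print(max(abs(w) for w in weights) <= 2.0)  # True
```

Capping the magnitude this way keeps any single initial weight from being an extreme outlier.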

### tf.zeros()
```python
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
```

## Quiz
### quiz.py

```python
import tensorflow as tf


def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # Weights drawn from a truncated normal distribution
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # Biases initialized to zero
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # Linear function (xW + b)
    return tf.add(tf.matmul(input, w), b)
```

### sandbox.py

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # Initialize session variables
    init = tf.global_variables_initializer()
    session.run(init)
    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))
```


# Softmax
The next step is to assign a probability to each label, which you can then use to classify the data.  Here is the formula again:
$$
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

Taking $e$ to the power of any real value always gives back a positive value, and dividing by the sum makes the outputs add up to 1.
```python
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
```
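To see the formula at work, here is a dependency-free sketch of the same computation applied to an assumed set of scores. It also rescales the scores by a factor of 10 in each direction, which shows how the size of the scores changes how peaked the resulting probabilities are. The example scores are illustrative, not taken from the course data.

```python
import math


def softmax(scores):
    """Softmax over a flat list of scores."""
    # Subtracting the max does not change the result but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [3.0, 1.0, 0.2]   # assumed example logits
probs = softmax(scores)
print([round(p, 3) for p in probs])   # [0.836, 0.113, 0.051]

# Larger scores give a sharper distribution; smaller scores a flatter one.
print([round(p, 3) for p in softmax([s * 10 for s in scores])])
print([round(p, 3) for p in softmax([s / 10 for s in scores])])
```

The probabilities always sum to 1, and the largest score always receives the largest probability.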

# One-Hot Encoding
![NN Confidence](Lesson6/nn-confidence.png)

If you increase the size of the outputs (the scores fed into softmax), the classifier becomes very confident about its predictions.  If you reduce their size, it becomes unsure.  You want the classifier to be unsure in the beginning and gain confidence as it learns.

One-hot encoding represents each label with a vector that has a 1 in the position of its corresponding label and 0 everywhere else.

This works well for many problems, but it does not scale well to a large number of classes, because the vectors contain mostly zeros.  We will deal with that problem using embeddings later on.
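A one-hot vector is simple to build by hand. The helper below is a minimal sketch with a hypothetical name (`one_hot`), assuming labels are integer class indices starting at 0.

```python
def one_hot(label, n_labels):
    """Return a vector with 1.0 at index `label` and 0.0 elsewhere."""
    if not 0 <= label < n_labels:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(n_labels)]


print(one_hot(0, 3))  # [1.0, 0.0, 0.0]
print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

With thousands of classes each vector would hold a single 1 among thousands of zeros, which is the scaling problem described above.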

# Cross Entropy
A nice thing about this approach is that we can measure how well we are doing by comparing two vectors: the probabilities from the softmax and the one-hot encoded label.
![cross entropy](Lesson6/cross-entropy.png)

## Multinomial Logistic Classification
$$
D(S(WX + b), L)
$$
![mls](Lesson6/mlc.png)

```python
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

cross_entropy = -tf.reduce_sum(one_hot * tf.log(softmax))

# Print cross entropy from session
with tf.Session() as sess:
    output = sess.run(cross_entropy, feed_dict={softmax: softmax_data, one_hot: one_hot_data})
    print(output)  # 0.356675
```
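The printed value can be checked by hand. The sketch below assumes the usual definition of the distance, D(S, L) = -sum over i of L_i * log(S_i), which is what the TensorFlow expression in the quiz computes. It uses only the standard library, and the second call uses an assumed "bad" prediction to show that the distance grows when the correct class gets low probability.

```python
import math


def cross_entropy(softmax_output, one_hot_label):
    """D(S, L) = -sum_i L_i * log(S_i)."""
    # Terms where the label is 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes with zero probability.
    return -sum(l * math.log(s)
                for s, l in zip(softmax_output, one_hot_label)
                if l != 0.0)


good = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
bad = cross_entropy([0.1, 0.2, 0.7], [1.0, 0.0, 0.0])
print(round(good, 6))  # 0.356675
print(bad > good)      # True
```

With a one-hot label only the term for the correct class survives, so the distance is just the negative log of the probability assigned to that class.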


# Minimize Cross Entropy
How are we going to find the weights $w$ and biases $b$ that get our classifier to do what we want it to do?  We want a low distance for the correct class and a high distance for the incorrect classes.  Measure that distance averaged over the entire training set; that average is the training loss.  We try to minimize the loss using gradient descent.
![Loss](Lesson6/loss.png)
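
Written out, the training loss is the average cross entropy over the training set. The index notation is an assumption chosen to match the expression $D(S(WX + b), L)$ used earlier: $N$ is the number of training examples, $x_i$ is the $i$-th input, and $L_i$ is its one-hot label.
$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} D(S(W x_i + b), L_i)
$$

Gradient descent then repeatedly moves the parameters a small step against the gradient of this loss, where the step size $\alpha$ (the learning rate) is an assumed symbol:
$$
w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}
$$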
 
# Normalized Inputs and Initial Weights
We want the inputs to have zero mean and equal variance.

**Mean:** $\mu(X_i) = 0$

**Variance:** $\sigma(X_i) = \sigma(X_j)$
![Normalized](Lesson6/normalized.png)
 
![Normalized Pics](Lesson6/normalized-pics.png)
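
As a rough sketch of what normalizing inputs can look like in code, the first helper below rescales 8-bit pixel values with the common `(pixel - 128) / 128` rule, and the second standardizes any list of values to zero mean and unit variance. Both function names and the sample pixel values are assumptions made for illustration.

```python
import math


def scale_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to roughly [-1, 1]."""
    return [(p - 128) / 128 for p in pixels]


def standardize(values):
    """Shift values to zero mean and scale them to unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


raw = [0, 64, 128, 192, 255]   # assumed example pixel values
print(scale_pixels(raw))

z = standardize(raw)
print(abs(sum(z) / len(z)) < 1e-9)  # True: the mean is numerically zero
```

The simple pixel rule does not give exactly zero mean for every image, but it keeps all inputs in the same small range, which is usually what the optimizer needs.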
 
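Sigma is the standard deviation of the distribution used to draw the initial weights.  A large sigma means the initial output distribution will have large peaks, so the classifier starts out very opinionated.  A small sigma means it starts out very uncertain about things.  It is better to begin with an uncertain distribution.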
 
![Init](Lesson6/init.png)
  
# Measuring Performance
This section is about a classifier failing to generalize once it is released into production.  It classified the test data correctly but failed when given new data.  This happens because every time you use the test set to make a decision about the classifier, a little information from the test set bleeds into training.  One solution is to split the data into three piles: training, validation, and test.  Tune on the validation set and never look at the test set until the very end.

# Validation and Test Set Size

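The bigger the test set, the less noise you have in the accuracy measurement.

**Rule of thumb:** a change that affects at least 30 examples in the validation set can usually be trusted.

Aim for a validation set of more than 30,000 examples, so that accuracy changes larger than 0.1% are meaningful.  If the dataset is small, consider cross-validation.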
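
The arithmetic behind the rule of thumb is short enough to spell out. This sketch assumes the rule is read as "a change must affect at least 30 examples"; the function name is hypothetical.

```python
def smallest_trustworthy_change(validation_size, min_examples=30):
    """Smallest accuracy change (as a fraction) that affects at least
    `min_examples` examples in a validation set of the given size."""
    return min_examples / validation_size


for size in (3000, 30000):
    pct = 100 * smallest_trustworthy_change(size)
    print("{} examples: changes of at least {:.1f}% can be trusted".format(size, pct))
```

With 3,000 examples only changes of a full percentage point are meaningful, while 30,000 examples resolves changes down to a tenth of a percent.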

# Stochastic Gradient Descent
Gradient descent does not scale well, because it uses all of the data to compute the loss function at every step, and we want to be able to train on big data.  Instead we use stochastic gradient descent (SGD): compute the average loss for a very small, random fraction of the training data.  The sample has to be random!  This gives a bad estimate of the true gradient, so we have to iterate many more times, and at times a step may even take us in the wrong direction.  To compensate, we take small steps, which means a smaller learning rate.

SGD scales well with both data and model size, but it comes with a lot of issues in practice because of its bad estimates.
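
The idea can be tried on a toy problem. The sketch below fits a line `y = w*x + b` with SGD on mean squared error, using a small random batch for every step. The data, the function name `sgd_linear_fit`, and all hyperparameter values are assumptions chosen so the example runs quickly; it is not the course's classifier.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.01, steps=2000, batch_size=4, seed=0):
    """Fit y = w*x + b by SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Estimate the gradient from a small random fraction of the data.
        batch = rng.sample(indices, batch_size)
        grad_w = grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2 * error * xs[i] / batch_size
            grad_b += 2 * error / batch_size
        # Take a small step against the estimated gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [x / 10 for x in range(-50, 50)]
ys = [3.0 * x + 1.0 for x in xs]   # noise-free data generated with w=3, b=1
w, b = sgd_linear_fit(xs, ys)
print(round(w, 2), round(b, 2))    # should land close to 3.0 and 1.0
```

Each step looks at only 4 of the 100 points, so a single step is cheap and noisy; it is the large number of small steps that gets the parameters to the right place.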

# Momentum and Learning Rate Decay
![momentum](Lesson6/momentum.png)
![learning rate](Lesson6/learning-rate.png)
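
Both ideas fit in a few lines of code. The sketch below assumes a common formulation: momentum keeps a running sum of past gradients and steps along that instead of the raw gradient, and the learning rate shrinks exponentially with the step count. The function names, the decay constants, and the toy objective `f(w) = w**2` are assumptions for illustration and may differ from what the figures show.

```python
def decayed_learning_rate(initial_rate, step, decay_rate=0.96, decay_steps=100):
    """Exponential decay: multiply the rate by `decay_rate` every `decay_steps` steps."""
    return initial_rate * decay_rate ** (step / decay_steps)


def momentum_step(weight, velocity, gradient, learning_rate, momentum=0.9):
    """One update that follows a running average of past gradients."""
    velocity = momentum * velocity + gradient
    weight = weight - learning_rate * velocity
    return weight, velocity


# Minimize f(w) = w**2 (gradient 2*w), starting from w = 5.0.
w, v = 5.0, 0.0
for step in range(300):
    lr = decayed_learning_rate(0.05, step)
    w, v = momentum_step(w, v, 2 * w, lr)

print(abs(w) < 1e-3)  # True
```

Momentum smooths out the noisy direction of individual SGD steps, while decay lets training start with larger steps and finish with smaller ones.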

# Parameter Hyperspace!
Never trust how quickly you learn.  It is usually a bad indicator of how well you learn.  A larger learning rate does not mean that you will learn faster!!

Many Hyper-Parameters:
* **Initial Learning Rate**
* **Learning Rate Decay**
* **Momentum**
* **Batch Size**
* **Weight Initialization**

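There are lots of good solutions for small models, but none that are completely satisfactory for large models.

Check out AdaGrad, a variant of SGD that adapts the learning rate for each parameter, which makes training less sensitive to these hyperparameters.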

# Mini-batching
Mini-batching is a technique for training on subsets of the dataset instead of all the data at one time. This provides the ability to train a model, even if a computer lacks the memory to store the entire dataset.

Useful when combined with SGD.  Randomly shuffle the data at the start of each epoch, then create mini-batches.  For each mini-batch, you train the network weights with gradient descent. Since these batches are random, you're performing SGD with each batch.
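
The shuffle-then-split step can be sketched in plain Python. The function below is a hypothetical illustration (its name and the `seed` parameter are assumptions); it shuffles features and labels together so each pair stays aligned, then cuts the shuffled data into batches.

```python
import random


def shuffled_batches(batch_size, features, labels, seed=None):
    """Shuffle features and labels together, then split them into batches."""
    assert len(features) == len(labels)
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    batches = []
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        batches.append(([features[i] for i in chunk],
                        [labels[i] for i in chunk]))
    return batches


features = [[i, i] for i in range(7)]
labels = [i % 2 for i in range(7)]

# Reshuffle at the start of every epoch by changing the seed.
for epoch in range(2):
    sizes = [len(f) for f, _ in shuffled_batches(3, features, labels, seed=epoch)]
    print("epoch", epoch, "batch sizes", sizes)  # [3, 3, 1]
```

When the dataset size is not a multiple of the batch size, the last batch is simply smaller than the rest.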

```python
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
import tensorflow as tf

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
```
The total memory space required for the inputs, weights and bias is around 174 megabytes, which isn't that much memory. You could train this whole dataset on most CPUs and GPUs.
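
The figure can be reproduced with a little arithmetic. This sketch assumes 55,000 training examples (the size of the MNIST training split loaded above) and 4 bytes per `float32` value, and it counts only the training features, training labels, weights, and bias.

```python
BYTES_PER_FLOAT32 = 4

# Assumed shapes: 55,000 training examples, 784 features, 10 classes.
shapes = {
    "train_features": (55000, 784),
    "train_labels": (55000, 10),
    "weights": (784, 10),
    "bias": (10,),
}


def n_bytes(shape):
    """Memory needed for a float32 array of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT32


total = sum(n_bytes(shape) for shape in shapes.values())
print(total)                        # 174711400
print(round(total / 1e6, 1), "MB")  # 174.7 MB
```

Almost all of the memory goes to the training features; the weights and bias are negligible for a model this small.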

Mini-batching matters for larger datasets and models, which do not fit in memory this easily.

## TensorFlow Mini-batching
### quiz.py

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


# Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```



### helper.py

```python
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
```

# Epochs
An epoch is a single forward and backward pass of the whole dataset. Training for multiple epochs is used to increase the accuracy of the model without requiring more data.

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

The accuracy only reached 0.86, but that could be because the learning rate was too high. Lowering the learning rate would require more epochs, but could ultimately achieve better accuracy.