1. [Tensor](#Tensor)
2. [Session](#Session)
3. [Input](#Input)
    1. [tf.placeholder()](#tf.placeholder)
    2. [Session's feed_dict](#feed_dict)
4. [TensorFlow Math](#TensorFlow Math)
    1. [Addition](#Addition)
    2. [Subtraction and Multiplication](#Substraction and Multiplication)
    3. [Converting types](#Converting types)
5. [Classification (Multinomial logistic classification)](#Classification)
    1. [Logistic classifier](#Logistic classifier)
    2. [TensorFlow Linear Function](#TensorFlow Linear Function)
    3. [Weights and Bias in TensorFlow](#Weights and Bias in TensorFlow)
    4. [TensorFlow Softmax](#TensorFlow Softmax)
    5. [One-Hot Encoding](#One-Hot Encoding)
    6. [Cross Entropy](#Cross Entropy)
    7. [Minimizing Cross Entropy](#Minimizing Cross Entropy)
6. [Practical Aspects of Learning](#Practical Aspects of Learning)
    1. [Numerical Stability](#Numerical Stability)
    2. [Validation set](#validation set)
    3. [Stochastic Gradient Descent](#Stochastic Gradient Descent)
    4. [Parameter Hyperspace](#Parameter Hyperspace)
    5. [Mini-batch](#Mini-batch)
    6. [Epochs](#Epochs)

Let’s analyze the Hello World script.

In [None]:
import tensorflow as tf

hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    output = sess.run(hello_constant)
    print(output)

---
# 1. Tensor <a name='Tensor'></a>

In TensorFlow, data isn’t stored as integers, floats, or strings. _These values are encapsulated in an object called a tensor_. In the case of __hello_constant = tf.constant('Hello World!')__, __hello_constant__ is a 0-dimensional string tensor, but tensors come in a variety of sizes as shown below:

```Python
# A is a 0-dimensional int32 tensor
A = tf.constant(1234) 
# B is a 1-dimensional int32 tensor
B = tf.constant([123,456,789]) 
# C is a 2-dimensional int32 tensor
C = tf.constant([ [123,456,789], [222,333,444] ])
```

__tf.constant()__ is one of many TensorFlow operations you will use in this lesson. The tensor returned by __tf.constant()__ is called a __constant tensor__, because _the value of the tensor never changes_.

---
# 2. Session <a name='Session'></a>

TensorFlow’s API is built around the idea of a computational graph, a way of visualizing a mathematical process which you learned about in the MiniFlow lesson. Let’s take the TensorFlow code you ran and turn that into a graph:

<img src='Figures3/session.png' width = 500>

A "TensorFlow Session", as shown above, is _an environment for running a graph_. The session is in charge of allocating the operations to GPU(s) and/or CPU(s), including remote machines. Let’s see how you use it.

```Python
with tf.Session() as sess:
    output = sess.run(hello_constant)
```

The code has already created the tensor, __hello_constant__, from the previous lines. The next step is to evaluate the tensor in a session.

The code creates a session instance, __sess__, using __tf.Session__. The __sess.run()__ function then evaluates the tensor and returns the results.

---
# 3. Input <a name='Input'></a>

In the last section, you passed a tensor into a session and it returned the result. What if you want to use a non-constant? This is where __tf.placeholder()__ and __feed_dict__ come into play. In this section, you'll go over the basics of feeding data into TensorFlow.

### 3.1. tf.placeholder() <a name='tf.placeholder'></a>

Sadly you can’t just set __x__ to your dataset and put it in TensorFlow, because over time you'll want your TensorFlow model to take in different datasets with different parameters. You need __tf.placeholder()__!

__tf.placeholder()__ returns a tensor that gets its value from data passed to the __tf.Session.run()__ function, allowing you to set the input right before the session runs.

### 3.2. Session's feed_dict <a name='feed_dict'></a>

```Python
x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})
```

Use the __feed_dict__ parameter in __tf.Session.run()__ to set the placeholder tensor. The above example shows the tensor __x__ being set to the string __"Hello World"__. It's also possible to set more than one tensor using __feed_dict__ as shown below.

```Python
x = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Test String', y: 123, z: 45.67})
```

__Note__: If the data passed to the **feed\_dict** doesn’t match the tensor type and can’t be cast into the tensor type, you’ll get the error “__ValueError: invalid literal for__...”.

__Quiz__:
Let's see how well you understand __tf.placeholder()__ and __feed_dict__. The code below throws an error, but I want you to make it return the number __123__. Change the __sess.run(x)__ call under the TODO comment so that the code returns the number __123__.

In [None]:
import tensorflow as tf


def run():
    output = None
    x = tf.placeholder(tf.int32)

    with tf.Session() as sess:
        # TODO: Feed the x tensor 123
        output = sess.run(x)

    return output

__Answer__:

```Python
# Quiz Solution
import tensorflow as tf


def run():
    output = None
    x = tf.placeholder(tf.int32)

    with tf.Session() as sess:
        output = sess.run(x, feed_dict={x: 123})

    return output
```

---
# 4. TensorFlow Math <a name='TensorFlow Math'></a>

[Documentation](https://www.tensorflow.org/api_guides/python/math_ops)

### 4.1. Addition <a name='Addition'></a>

```Python
x = tf.add(5, 2)  # 7
```

The __tf.add()__ function takes in two numbers, two tensors, or one of each, and returns their sum as a tensor.

### 4.2. Subtraction and Multiplication <a name='Substraction and Multiplication'></a>

```Python
x = tf.subtract(10, 4) # 6
y = tf.multiply(2, 5)  # 10
```

### 4.3. Converting types <a name='Converting types'></a>

It may be necessary to convert between types to make certain operators work together. For example, if you tried the following, it would fail with an exception:

```Python
tf.subtract(tf.constant(2.0),tf.constant(1))  # Fails with ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32:
```

That's because the constant __1__ is an integer but the constant __2.0__ is a floating point value and __subtract__ expects them to match.

In cases like these, you can either make sure your data is all of the same type, or you can cast a value to another type. In this case, converting the __2.0__ to an integer before subtracting, like so, will give the correct result:

```Python
tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1
```

__Quiz:__

Convert the following algorithm in regular Python to TensorFlow and print the results of the session. You can use tf.constant() for the values 10, 2, and 1.

```Python
# Solution is available in the other "solution.py" tab
import tensorflow as tf

# TODO: Convert the following to TensorFlow:
x = 10
y = 2
z = x/y - 1

# TODO: Print z from a session
```

__Answer__:

```Python
# Quiz Solution
import tensorflow as tf

# TODO: Convert the following to TensorFlow:
x = tf.constant(10)
y = tf.constant(2)
z = tf.subtract(tf.divide(x,y),tf.cast(tf.constant(1), tf.float64))

# TODO: Print z from a session
with tf.Session() as sess:
    output = sess.run(z)
    print(output)
```

---
# 5. Classification (Multinomial logistic classification) <a name='Classification'></a>

Classification is the task of taking an input, and giving it a label. 

The typical setting for classification is that we have a lot of examples, called the training set, which have already been sorted into classes. Then, when a completely new example comes along, the goal is to figure out which of those classes it belongs to.

There is a lot more to machine learning than just classification. But classification, or more generally prediction, is the central building block of machine learning. Once we know how to classify things, it's very easy, for example, to learn how to detect them, or to rank them, or to do regression or reinforcement learning.

### 5.1. Logistic classifier <a name='Logistic classifier'></a>

The logistic classifier is a basic but very important classifier. It is what's called a __linear classifier__.

$$
W\mathbf{x} + b = y
$$

It takes the input $\mathbf{x}$, for example the pixels in an image, and applies a linear function to them to generate its predictions, $y$. A linear function is just a giant matrix multiply. The weights, $W$, and the bias, $b$, are where the machine learning comes in. We are going to train the model, which means we are going to try to find the values for the weights and bias that are good at performing those predictions.

How are we going to use the scores, $y$, to perform the classification? Each image that we have as an input can have one and only one possible label. So we are going to turn the scores, $y$, into probabilities, where the probability of the correct class is very close to 1 and the probability of every other class is close to 0.

<img src='Figures3/Screen Shot 2017-03-22 at 12.33.23.png' width=400>

The way to turn scores into probabilities is to use a softmax function, $S(y_i)$. 

$$
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

<img src='Figures3/Screen Shot 2017-03-22 at 12.35.11.png' width=400>

Scores in the context of logistic regression are often also called __logits__. 

### 5.2. TensorFlow Linear Function <a name='TensorFlow Linear Function'></a>

Let’s derive the function $y = W\mathbf{x} + b$. We want to translate our input, $x$, to labels, $y$.

For example, imagine we want to classify images as digits.

$x$ would be our list of pixel values, and $y$ would be the logits, one for each digit. Let's take a look at $y = Wx$, where the weights, $W$, determine the influence of $x$ at predicting each $y$.

<img src='Figures3/wx-1.jpg' width = 500>
$$ \text{Function } y=W\mathbf{x} $$

$y = W\mathbf{x}$ allows us to segment the data into their respective labels using a line.

However, this line has to pass through the origin, because whenever $x$ equals 0, then $y$ is also going to equal 0.

We want the ability to shift the line away from the origin to fit more complex data. The simplest solution is to add a number to the function, which we call “bias”.

<img src='Figures3/wx-b.jpg' width = 500>
$$ \text{Function } y=W\mathbf{x}+b $$

Our new function becomes $W\mathbf{x} + b$, allowing us to create predictions on linearly separable data. Let’s use a concrete example and calculate the logits.

For example, 

<img src='Figures3/codecogseqn-13.gif' width=500>
$$ y=W\mathbf{x}+b $$

We've been using the $y = W\mathrm{x} + b$ function for our linear function.

But there's another function that does the same thing, $y = \mathrm{x}W + b$. These functions do the same thing and are interchangeable, except for the dimensions of the matrices involved.

To shift from one function to the other, you simply have to swap the row and column dimensions of each matrix. This is called transposition.

For example,

<img src='Figures3/codecogseqn-18.gif' width=500>
$$ \text{Function } y= \text{x}W+b $$

The above example is identical, except that the matrices are transposed.

$\mathrm{x}$ now has the dimensions 1x3, $W$ now has the dimensions 3x2, and $b$ now has the dimensions 1x2. Calculating this will produce a matrix with the dimension of 1x2.

You'll notice that the elements in this 1x2 matrix are the same as the elements in the 2x1 matrix from the equation before transposition. Again, these matrices are simply transposed.

<img src='Figures3/codecogseqn-20.gif' width=200>

We now have our logits! The columns represent the logits for our two labels.
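
The equivalence of the two forms can be checked numerically. The following is a minimal sketch in plain Python; the values of `W`, `x`, and `b` are hypothetical (they are not taken from the figures above), and `matmul`, `transpose`, and `add` are small helper functions written only for this illustration.

```Python
# Hypothetical values chosen for illustration; not taken from the figures.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Swap the row and column dimensions of a matrix."""
    return [list(row) for row in zip(*m)]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[-0.5, 0.2, 0.1],
     [0.7, -0.8, 0.2]]       # 2x3: one row of weights per label
x = [[0.2], [0.5], [0.6]]    # 3x1: one feature per row
b = [[0.1], [0.2]]           # 2x1: one bias per label

# y = Wx + b produces a 2x1 column of logits
y_column = add(matmul(W, x), b)

# y = xW + b with every matrix transposed produces a 1x2 row of logits
y_row = add(matmul(transpose(x), transpose(W)), transpose(b))

print(y_column)
print(y_row)
```

Both results hold the same two logits; only the layout (column versus row) differs.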

Now you can learn how to train this function in TensorFlow.

### 5.3. Weights and Bias in TensorFlow <a name='Weights and Bias in TensorFlow'></a>

__The goal of training a neural network is to modify weights and biases to best predict the labels__. In order to use weights and bias, you'll need a Tensor that can be modified. This leaves out __tf.placeholder()__ and __tf.constant()__, since those Tensors can't be modified. This is where the __tf.Variable__ class comes in.

#### tf.Variable()

```Python
x = tf.Variable(5)
```

The [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable) class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so you must initialize the state of the tensor manually. You'll use the [tf.global_variables_initializer()](https://www.tensorflow.org/programmers_guide/variables) function to initialize the state of all the Variable tensors.

```Python
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
```

The __tf.global_variables_initializer()__ call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above. Using the tf.Variable class allows us to change the weights and bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it. You'll learn more about this in the next lesson, when you study gradient descent.

Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. You'll use the __tf.truncated_normal()__ function to generate random numbers from a normal distribution.

#### tf.truncated_normal()

```Python
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
```
The [tf.truncated_normal()](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
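
To make the "no more than 2 standard deviations" behaviour concrete, here is a rough sketch in plain Python that redraws any sample falling outside that range. It is an illustration of the idea only, not TensorFlow's actual implementation, and the helper name `truncated_normal` and its parameters are assumptions made for this example.

```Python
import random


def truncated_normal(shape, mean=0.0, stddev=1.0, rng=random):
    """Return a rows x cols list of normal samples within 2 stddev of the mean.

    Illustrative sketch only: samples outside the range are simply redrawn.
    """
    rows, cols = shape

    def draw():
        while True:
            value = rng.gauss(mean, stddev)
            # Keep the sample only if it lies within two standard deviations
            if abs(value - mean) <= 2 * stddev:
                return value

    return [[draw() for _ in range(cols)] for _ in range(rows)]


n_features = 120
n_labels = 5
weights = truncated_normal((n_features, n_labels))

largest = max(abs(v) for row in weights for v in row)
print(len(weights), len(weights[0]), largest)
```

Because extreme samples are rejected, no single initial weight can be much larger than the others.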

Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use the simplest solution, setting the bias to 0.

#### tf.zeros()
```Python
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
```

The [tf.zeros()](https://www.tensorflow.org/api_docs/python/tf/zeros) function returns a tensor with all zeros.

__Linear Classifier Quiz:__

<img src='Figures3/mnist-012.png' width=500>
$$ \text{Subset of MNIST dataset.} $$

You'll be classifying the handwritten numbers 0, 1, and 2 from the MNIST dataset using TensorFlow. The above is a small sample of the data you'll be training on. Notice how some of the 1s are written with a serif at the top and at different angles. The similarities and differences will play a part in shaping the weights of the model.

<img src='Figures3/weights-0-1-2.png' width=500>
$$ \text{Left: Weights for labeling 0. Middle: Weights for labeling 1. Right: Weights for labeling 2.} $$

The images above are trained weights for each label (0, 1, and 2). The weights display the unique properties of each digit they have found. Complete this quiz to train your own weights using the MNIST dataset.

1. Open quiz.py.
    1. Implement get_weights to return a tf.Variable of weights
    2. Implement get_biases to return a tf.Variable of biases
    3. Implement xW + b in the linear function
1. Open sandbox.py
    1. Initialize all weights

Since $xW$ in $xW + b$ is matrix multiplication, you have to use the [tf.matmul()](https://www.tensorflow.org/api_docs/python/tf/matmul) function instead of [tf.multiply()](https://www.tensorflow.org/api_docs/python/tf/multiply). Don't forget that order matters in matrix multiplication, so __tf.matmul(a,b)__ is not the same as __tf.matmul(b,a)__.


In [1]:
# quiz.py
# Solution is available in the other "quiz_solution.py" tab

import tensorflow as tf

def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b)

In [4]:
# sandbox.py

# Solution is available in the other "sandbox_solution.py" tab
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
#from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('datasets/mnist', one_hot=True)
    #mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
    
    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # TODO: Initialize session variables
    session.run(tf.global_variables_initializer())
    
    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))

Extracting datasets/ud730/mnist/train-images-idx3-ubyte.gz
Extracting datasets/ud730/mnist/train-labels-idx1-ubyte.gz
Extracting datasets/ud730/mnist/t10k-images-idx3-ubyte.gz
Extracting datasets/ud730/mnist/t10k-labels-idx1-ubyte.gz
Loss: 4.6871771812438965


### 5.4. TensorFlow Softmax <a name='TensorFlow Softmax'></a>

The next step is to assign a probability to each label, which you can then use to classify the data. Use the softmax function to turn your logits into probabilities.

$$
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

We can do this by using the formula above, which takes the $y$ values as input and uses the mathematical constant $e$, which is approximately equal to 2.718. Taking $e$ to the power of any real value always gives a positive value, which lets us handle negative $y$ values. The summation symbol in the denominator indicates that we add together all the $e^{y_j}$ terms, so that the calculated probabilities sum to 1.

Implement a $softmax(x)$ function that takes in $x$, a one or two dimensional array of logits.

In the one dimensional case, the array is just a single set of logits. In the two dimensional case, each column in the array is a set of logits. The $softmax(x)$ function should return a NumPy array of the same shape as $x$.

For example, given a one-dimensional array:

```Python
# logits is a one-dimensional array with 3 elements
logits = [1.0, 2.0, 3.0]
# softmax will return a one-dimensional array with 3 elements
print(softmax(logits))
```

```Python
-> [ 0.09003057  0.24472847  0.66524096]
```

Given a two-dimensional array where each column represents a set of logits:

```Python
# logits is a two-dimensional array
logits = np.array([
    [1, 2, 3, 6],
    [2, 4, 5, 6],
    [3, 8, 7, 6]])
# softmax will return a two-dimensional array with the same shape
print(softmax(logits))
```

```Python
->[
    [ 0.09003057  0.00242826  0.01587624  0.33333333]
    [ 0.24472847  0.01794253  0.11731043  0.33333333]
    [ 0.66524096  0.97962921  0.86681333  0.33333333]
  ]
```

The probabilities for each column must sum to 1. Feel free to test your function with the inputs above.

In [None]:
# Solution is available in the other "solution.py" tab
import numpy as np


def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    # TODO: Compute and return softmax(x)
    return np.exp(x) / sum(np.exp(x))

logits = [3.0, 1.0, 0.2]
print(softmax(logits))

__Answer:__
```Python
# Quiz Solution
import numpy as np


def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

logits = [3.0, 1.0, 0.2]
print(softmax(logits))
```

Now that you've built a softmax function from scratch, let's see how softmax is done in TensorFlow.

```Python
x = tf.nn.softmax([2.0, 1.0, 0.2])
```

Easy as that! __tf.nn.softmax()__ implements the softmax function for you. It takes in logits and returns softmax activations.

Use the softmax function in the quiz below to return the softmax of the logits.

In [None]:
# Solution is available in the other "solution.py" tab
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    #logit_data = [20.0, 10.0, 1.0]  # multiply the logits by 10 (probabilities get closer to 0.0 or 1.0)
    #logit_data = [0.2, 0.1, 0.01]  # divide the logits by 10 (probabilities get closer to the uniform distribution)

    logits = tf.placeholder(tf.float32)
    
    # TODO: Calculate the softmax of the logits
    softmax = tf.nn.softmax(logits)
    
    with tf.Session() as sess:
        # TODO: Feed in the logit data
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output

run()

__Answer:__
```Python
# Quiz Solution
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    softmax = tf.nn.softmax(logits)

    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output
```

- If we multiply the logits by 10, the probabilities get closer to 0.0 or 1.0.
- If we divide the logits by 10, the probabilities get closer to the uniform distribution: since all the scores decrease in magnitude, the resulting softmax probabilities are closer to each other.

In other words, if we increase the size of the outputs from the linear function, the classifier becomes very confident about its predictions. But if we reduce the size of the outputs, the classifier becomes very unsure.
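
The effect of scaling can be observed directly. This is a small sketch in plain Python (using `math.exp` rather than TensorFlow or NumPy) that applies softmax to the quiz logits divided by 10, unchanged, and multiplied by 10; the helper `softmax` here is written just for this example.

```Python
import math


def softmax(logits):
    """Softmax of a one-dimensional list of logits."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]

for scale in (0.1, 1.0, 10.0):
    scaled = [scale * v for v in logits]
    probabilities = softmax(scaled)
    # Round only for display; the probabilities always sum to 1
    print(scale, [round(p, 4) for p in probabilities])
```

With the small scale the three probabilities are nearly equal, and with the large scale almost all of the probability goes to the first class.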

<img src='Figures3/Screen Shot 2017-03-24 at 14.15.13.png' width=400>

We'll want the classifier to not be too sure of itself in the beginning, and then over time, it will gain confidence as it learns. 

### 5.5. One-Hot Encoding <a name='One-Hot Encoding'></a>

Now that we have probabilities, we need a way to represent the labels mathematically. Each label is represented by a vector with one entry per class, with the value 1.0 for the correct class and 0.0 for every other class. This is often called one-hot encoding.

<img src='Figures3/Screen Shot 2017-03-24 at 14.13.49.png' width=400>

One-hot encoding works very well for most problems until we get into situations where we have tens of thousands, or even millions, of classes. In that case, the vector becomes very large and is mostly zeros, which is very inefficient.
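
As a concrete illustration, a one-hot vector can be built with a single list comprehension. The function name `one_hot` and the class names below are hypothetical, chosen only for this sketch.

```Python
def one_hot(label, classes):
    """Return a vector with 1.0 at the position of label and 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]


classes = ['a', 'b', 'c']
print(one_hot('b', classes))  # [0.0, 1.0, 0.0]
```

The vector has one entry per class, which is why it grows large and sparse when there are very many classes.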

### 5.6. Cross Entropy<a name='Cross Entropy'></a>

We can now measure how well we're doing by simply comparing two vectors: one that comes out of the classifier (out of the softmax function) containing the probabilities of the classes, and the one-hot encoded vector that corresponds to the labels.

The natural way to measure the distance between those two probability vectors is called the cross-entropy, denoted by $D$ for distance. 
$$
D(S, L) = - \sum_i L_i \log(S_i).
$$

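The cross-entropy is not symmetric, $D(S,L) \not= D(L,S)$, and it has a nasty log in it, so the arguments must be in the right place. The one-hot labels $L$ contain many zeros, and we do not want to take the log of zero; the softmax output $S$ always assigns some probability to every class, so $\log(S_i)$ is always defined.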

<img src='Figures3/Screen Shot 2017-03-24 at 16.36.59.png' width=400>
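
The formula and its asymmetry can be tried out in plain Python. In this sketch the probability vector `S` and the label vector `L` are hypothetical values, and `cross_entropy` is a direct transcription of $D(S, L)$ using `math.log`.

```Python
import math


def cross_entropy(s, l):
    """D(S, L) = -sum_i L_i * log(S_i)."""
    return -sum(l_i * math.log(s_i) for s_i, l_i in zip(s, l))


S = [0.5, 0.3, 0.2]   # hypothetical softmax output
L = [0.0, 1.0, 0.0]   # one-hot label: the second class is correct

# Only the term for the correct class survives, so this equals -log(0.3)
print(cross_entropy(S, L))

# Swapping the arguments would take the log of the zeros in L
try:
    cross_entropy(L, S)
except ValueError as error:
    print('D(L, S) is undefined here:', error)
```

The smaller the probability assigned to the correct class, the larger the distance.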

This is the overview of our task.

<img src='Figures3/Screen Shot 2017-03-24 at 16.41.10.png' width=400>

This entire setting is often called __multinomial logistic classification__.
- Input is going to be turned into logits using a linear model, which is basically matrix multiply and a bias. 
- The logits, which are scores, are going to be fed into a softmax to turn them into probabilities.
- We compare those probabilities to the one hot encoded labels using the cross entropy function. 

To create a cross entropy function in TensorFlow, we'll need to use two new functions:

- Reduce Sum __tf.reduce_sum()__ : This function takes an array of numbers and sums them together.
```Python
x = tf.reduce_sum([1, 2, 3, 4, 5])  # 15
```
- Natural Log __tf.log()__ : This function does exactly what you would expect it to do. tf.log() takes the natural log of a number.
```Python
x = tf.log(100.0)  # 4.60517
```

In [None]:
# Solution is available in the other "solution.py" tab
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

cross_entropy = - tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))
#cross_entropy = -tf.reduce_sum(one_hot * tf.log(softmax))

# TODO: Print cross entropy from session
with tf.Session() as sess:
    output = sess.run(cross_entropy, feed_dict={one_hot:one_hot_data , softmax: softmax_data})
    print(output)

__Answer:__
```Python
# Quiz Solution
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# ToDo: Print cross entropy from session
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

with tf.Session() as sess:
    print(sess.run(cross_entropy, feed_dict={softmax: softmax_data, one_hot: one_hot_data}))
```

### 5.7. Minimizing Cross Entropy<a name='Minimizing Cross Entropy'></a>

The question of course is how we're going to find those weights $w$ and those biases $b$ that will get the classifier to do what we want it to do; that is have a low distance for the correct class but have a high distance for the incorrect class. 

<img src='Figures3/Screen Shot 2017-03-24 at 17.54.57.png' width=250>

One thing we can do is measure that distance averaged over the entire training set, for all the inputs and all the labels that we have available.

<img src='Figures3/Screen Shot 2017-03-24 at 17.56.53.png' width=300>

We want all the distances to be small; in other words, we want to minimize the loss (the average cross-entropy), which would mean we're doing a good job at classifying every example in the training data.

<img src='Figures3/Screen Shot 2017-03-24 at 17.59.52.png' width=350>

The loss is a function of the weights and the biases, so we are simply going to try and minimize that function.
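
To see the loss as a function of the weights and biases, the whole pipeline (linear function, softmax, cross-entropy, average over the training set) can be sketched in plain Python. The toy inputs, labels, and the two weight matrices `poor_W` and `good_W` are hypothetical values picked for illustration, and the helper functions are written only for this example.

```Python
import math


def linear(x, W, b):
    """Logits for one input row x: xW + b, with W shaped features x labels."""
    return [sum(x_k * W[k][j] for k, x_k in enumerate(x)) + b[j]
            for j in range(len(b))]


def softmax(logits):
    """Turn a list of logits into probabilities."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(s, l):
    """D(S, L) = -sum_i L_i * log(S_i)."""
    return -sum(l_i * math.log(s_i) for s_i, l_i in zip(s, l))


def average_loss(inputs, labels, W, b):
    """Average cross-entropy over every example in the training set."""
    losses = [cross_entropy(softmax(linear(x, W, b)), l)
              for x, l in zip(inputs, labels)]
    return sum(losses) / len(losses)


# Hypothetical toy training set: two examples, two features, two labels
inputs = [[1.0, 0.0], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

poor_W = [[0.0, 0.0], [0.0, 0.0]]     # predicts 50/50 for every example
good_W = [[5.0, -5.0], [-5.0, 5.0]]   # favours the correct class

print(average_loss(inputs, labels, poor_W, b))
print(average_loss(inputs, labels, good_W, b))
```

Changing only the weights changes the loss, which is what training exploits: the second set of weights gives a much lower average cross-entropy than the first.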

In our example, the loss is a function of two weights (weight 1 and weight 2). The simplest numerical optimization method is gradient descent: take the derivative of the loss with respect to the parameters, follow that derivative by taking a step backwards, and repeat until we get to the bottom.

<img src='Figures3/Screen Shot 2017-03-24 at 18.05.48.png' width=400>

But for a typical problem, the loss could be a function of thousands, millions, or even billions of parameters.
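
Gradient descent on a loss with two weights can be sketched in a few lines of plain Python. The quadratic `loss` below is a hypothetical stand-in with a known minimum at (3, -1), not the cross-entropy loss of the classifier, and the learning rate and step count are assumptions chosen so the example converges.

```Python
def loss(w1, w2):
    """Hypothetical bowl-shaped loss with its minimum at w1 = 3, w2 = -1."""
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2


def gradient(w1, w2):
    """Partial derivatives of the loss with respect to w1 and w2."""
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)


def gradient_descent(w1, w2, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from (w1, w2)."""
    for _ in range(steps):
        g1, g2 = gradient(w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


w1, w2 = gradient_descent(0.0, 0.0)
print(w1, w2, loss(w1, w2))
```

Each step moves the weights a little in the direction that lowers the loss, so the pair ends up very close to the minimum.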

# 6. Practical Aspects of Learning <a name='Practical Aspects of Learning'></a>

### 6.1. Numerical Stability <a name='Numerical Stability'></a>

When we do numerical computations, we always have to worry a bit about calculating values that are too large or too small. In particular, adding very small values to a very large value can introduce a lot of errors.

<img src='Figures3/Screen Shot 2017-03-24 at 18.16.03.png' width=500>

In [None]:
a = 1000000000
for i in range(1000000):
    a = a + 1e-6
print(a - 1000000000)
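
Mathematically the loop above adds exactly 1.0, but with 64-bit floating point each tiny addition is rounded, so the printed difference comes out near 0.95 instead. The sketch below repeats that experiment and compares it with one possible workaround, adding the small values to each other first; the variable names are chosen for this illustration.

```Python
big = 1000000000.0
tiny = 1e-6
count = 1000000

# Naive: add each tiny value directly to the big one (rounded every time)
naive = big
for _ in range(count):
    naive += tiny

# Alternative: accumulate the tiny values first, then add the total once
small_total = 0.0
for _ in range(count):
    small_total += tiny
grouped = big + small_total

print(naive - big)
print(grouped - big)
```

Keeping the values being added at a similar order of magnitude preserves far more precision, which is one reason to keep variables in a small, well-behaved range.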

One good guiding principle is that we always want the variables to have zero mean and equal variance whenever possible. 

$$
\begin{align}
\text{Mean: } \mathbb{E}[x_i] &= 0\\
\text{Variance: } \sigma(x_i) &= \sigma(x_j)
\end{align}
$$

There are also really good mathematical reasons to keep the values we compute roughly around a mean of zero, with equal variance, when we are doing optimization. A badly conditioned problem means that the optimizer has to do a lot of searching to go and find a good solution. A well conditioned problem makes it a lot easier for the optimizer to do its job.

<img src='Figures3/Screen Shot 2017-03-24 at 18.26.37.png' width=500>

For example, when dealing with images, pixel values range from 0 to 255, so each color channel can be normalized as follows:

$$
\frac{Red - 128}{128}, \frac{Green - 128}{128}, \frac{Blue - 128}{128}.
$$
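
Applied to a single pixel, the formula above looks like this in plain Python. The function name `normalize_pixel` and the sample pixel are hypothetical.

```Python
def normalize_pixel(value):
    """Map a channel value in [0, 255] to roughly [-1, 1]."""
    return (value - 128.0) / 128.0


pixel = (0, 128, 255)  # hypothetical (Red, Green, Blue) values
print([normalize_pixel(channel) for channel in pixel])  # [-1.0, 0.0, 0.9921875]
```

The image content is unchanged; only the numerical range is shifted and scaled so the optimizer works with values centered near zero.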

We also want the weights and biases to be initialized at a good enough starting point for the gradient descent to proceed. The general method is to draw the weights randomly from a Gaussian distribution with mean zero and standard deviation sigma.

<img src='Figures3/Screen Shot 2017-03-24 at 18.36.43.png' width=500>

The sigma value determines the order of magnitude of the outputs at the initial point of the optimization. Because of the softmax on top of it, the order of magnitude also determines the peakiness of the initial probability distribution. A large sigma means that the distribution will have large peaks; it's going to be very opinionated. A small sigma means that the distribution is very uncertain about things.

It's usually better to begin with an uncertain distribution and let the optimization become more confident as the training progresses.
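
The link between sigma and the confidence of the initial predictions can be explored with a small simulation in plain Python. Everything here is an assumption made for illustration: the helper `initial_confidence`, the random inputs in [-1, 1], the number of features and labels, and the two sigma values being compared.

```Python
import math
import random


def softmax(logits):
    """Turn a list of logits into probabilities."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_confidence(sigma, n_features=10, n_labels=3, trials=200, seed=0):
    """Average largest softmax probability for freshly drawn random weights."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Random input with zero-centered features
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        # One logit per label, each using weights drawn from N(0, sigma)
        logits = [sum(rng.gauss(0.0, sigma) * x_k for x_k in x)
                  for _ in range(n_labels)]
        total += max(softmax(logits))
    return total / trials


print(initial_confidence(0.01))  # small sigma: close to uniform
print(initial_confidence(10.0))  # large sigma: strongly peaked
```

With a small sigma the largest probability stays near one third (an uncertain classifier), while a large sigma produces predictions that are already very opinionated before any training.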

<img src='Figures3/Screen Shot 2017-03-24 at 18.39.32.png' width=400>

The optimization then proceeds as follows:

<img src='Figures3/Screen Shot 2017-03-24 at 18.41.12.png' width=300>

### 6.2. Validation Set <a name='validation set'></a>

The problem is that the classifier memorizes has memorized the training set and it fails to generalize to new examples. Our job is to help for the classifier to generalize to new data. So, how do we measure the generalization instead of measuring how well the classifier memorize the data? 

The simplest way is to take a small subset of the training set, not use it in training and measure the error on that test data. 

But there is still a problem, because training a classifier is usually a process of trial and error. For example, we try a classifier, we measure its performance, and then we try another one, and we measure again and another and another, tweak the model and parameters, and finally we have what we think is the perfect classifier. Then if we get and try more data, and score its performance on that new data, however it doesn't do nearly as well.

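There are many ways to solve this problem, and a popular one is to use a validation set: a separate chunk of data, held out from training, that we use for tuning so that the test set stays untouched until the end. Using a validation set can also help with the overfitting problem. 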

__Validation Set Size:__

How big do the validation and test sets need to be? Well, the bigger the validation set, the more precise the numbers will be. The bigger the test set, the less noisy the accuracy measurement will be. 

A useful rule of thumb is that a change that affects 30 examples in the validation set, one way or another, is usually statistically significant and can typically be trusted. 

Based on the __rule of 30__, for example, if we have 3000 examples in our validation set and an accuracy of 80%, we can trust an improvement to 81%, because a 1.0% change affects 30 examples. 

$$
\frac{1.0 \times 3000}{100} = 30
$$

If the accuracy only improves to 80.5%, we cannot trust it and should assume it's noise, because a 0.5% change affects only 15 examples.

$$
\frac{0.5 \times 3000}{100} = 15
$$
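The two calculations above can be wrapped in a small helper. This is only a sketch of the rule of thumb; the function names and the `threshold` parameter are illustrative and do not come from the lesson.

```Python
def examples_affected(n_validation, accuracy_change_pct):
    """Number of validation examples whose outcome changes.

    accuracy_change_pct is given in percentage points, e.g. 1.0 for 80% -> 81%.
    """
    return n_validation * accuracy_change_pct / 100.0

def is_trustworthy(n_validation, accuracy_change_pct, threshold=30):
    """Apply the rule of 30: trust changes that affect at least 30 examples."""
    return examples_affected(n_validation, accuracy_change_pct) >= threshold

print(examples_affected(3000, 1.0), is_trustworthy(3000, 1.0))  # 30.0 True
print(examples_affected(3000, 0.5), is_trustworthy(3000, 0.5))  # 15.0 False
```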

### 6.3. Stochastic Gradient Descent <a name='Stochastic Gradient Descent'></a>

The problem with scaling gradient descent is that if computing the loss takes $n$ floating point operations, computing its gradient takes about three times that compute.

<img src='Figures3/Screen Shot 2017-03-24 at 20.46.41.png' width=500>

The loss function is huge. It depends on every single element in the training set. That can be a lot of compute if the data set is big. And because gradient descent is iterative, we have to do that for many steps. 

So, instead of computing the loss, we're going to compute an estimate of it, a very bad estimate, in fact. We simply compute the average loss for a very small random fraction of the training data. Randomness is very important, because if the way we pick the samples isn't random enough, it no longer works at all. 

<img src='Figures3/Screen Shot 2017-03-24 at 20.53.13.png' width=400>

We're going to take a very small sliver of the training data, compute the loss for that sample, compute the derivative for that sample, and pretend that that derivative is the right direction to use to do gradient descent. 

<img src='Figures3/Screen Shot 2017-03-24 at 20.55.42.png' width=400>

It is not at all the right direction, and in fact, at times it might increase the real loss, not reduce it. But we're going to compensate by doing this many, many times, taking very small steps each time. Each step is a lot cheaper to compute, but we pay a price: we have to take many more small steps instead of one large step. On balance, though, we win by a lot, because doing this is vastly more efficient than doing gradient descent. 

This technique is called __stochastic gradient descent (S.G.D.)__. S.G.D. scales well with both data and model size, and we want both big data and big models, so S.G.D. is nice and scalable. However, it comes with a lot of issues in practice, and we live with them because it is the only method that's fast enough. 
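The idea can be sketched in a few lines of NumPy. To keep the example short, it swaps the softmax classifier for a least-squares problem on synthetic data, so the data, the step size of 0.01, and the sample size of 32 are all assumptions made for illustration.

```Python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative, not the lesson's dataset)
n_samples, n_features = 10000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

def mse_loss(w, X, y):
    """Average squared error over the given samples."""
    return np.mean((X @ w - y) ** 2)

def mse_grad(w, X, y):
    """Gradient of the average squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

w = np.zeros(n_features)
step_size = 0.01
sample_size = 32  # the "very small sliver" of the training data

initial_loss = mse_loss(w, X, y)
for step in range(2000):
    # Pick a small random sample and pretend its gradient is the true one
    idx = rng.integers(0, n_samples, size=sample_size)
    w -= step_size * mse_grad(w, X[idx], y[idx])
final_loss = mse_loss(w, X, y)

print('Loss on the full data before: {:.4f}, after: {:.4f}'.format(
    initial_loss, final_loss))
```

Each step only touches 32 of the 10,000 samples, yet the loss measured on the full dataset should still fall to roughly the noise level of the data.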

### 6.4. Momentum and Learning Rate Decay <a name='Momentum and Learning Rate Decay'></a>

Making the inputs zero mean and equal variance, and keeping that variance relatively small, is very important for S.G.D. However, two other techniques can also help with some of S.G.D.'s issues in practice. 

The first one is __momentum__. At each step in S.G.D., we're taking a very small step in a random direction, but in aggregate, those steps take us toward the minimum of the loss. _We can take advantage of the knowledge that we've accumulated from previous steps about where we should be headed_. A cheap way to do that is to keep a running average of the gradients, and to use that running average instead of the direction of the current batch of the data. 

<img src='Figures3/Screen Shot 2017-03-24 at 22.42.26.png' width=500>

The second one is __learning rate decay__. When replacing gradient descent with SGD, we're going to take smaller, noisier steps towards our objective. How small should that step be? It's beneficial to make that step smaller and smaller as we train. So, lowering it over time is the key thing.

<img src='Figures3/Screen Shot 2017-03-24 at 23.03.52.png' width=500>
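Both ideas fit in a few lines of plain Python. This is a sketch under stated assumptions: the running average is kept as `beta * velocity + gradient` with `beta = 0.9`, and the decay is an exponential schedule that halves the learning rate every 100 steps. The function names and all constants are illustrative choices, not values given in the lesson.

```Python
def momentum_step(weights, velocity, gradient, learning_rate, beta=0.9):
    """One update that follows a running average of the gradients."""
    velocity = beta * velocity + gradient
    weights = weights - learning_rate * velocity
    return weights, velocity

def exponential_decay(initial_rate, step, decay_rate=0.5, decay_steps=100):
    """Learning rate that shrinks smoothly as training progresses."""
    return initial_rate * decay_rate ** (step / decay_steps)

# Toy problem: minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for step in range(200):
    lr = exponential_decay(0.05, step)
    w, v = momentum_step(w, v, 2.0 * w, lr)

print('w after 200 steps: {:.6f}'.format(w))
```

On this toy problem `w` should end up very close to the minimum at 0, with the steps getting smaller as the learning rate decays.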

### 6.5. Mini-batch <a name='Mini-batch'></a>

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one time. This provides the ability to train a model, even if a computer lacks the memory to store the entire dataset.

Mini-batching is computationally inefficient, since you can't calculate the loss simultaneously across all samples. However, this is a small price to pay in order to be able to run the model at all.

It's also quite useful combined with SGD. The idea is to randomly shuffle the data at the start of each epoch, then create the mini-batches. For each mini-batch, you train the network weights with gradient descent. Since these batches are random, you're performing SGD with each batch.
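The shuffle-then-batch step could look like the following sketch. It assumes the features and labels are NumPy arrays; the `shuffled_batches` helper and the toy data are hypothetical and only serve to show the pattern.

```Python
import numpy as np

def shuffled_batches(batch_size, features, labels, rng):
    """Shuffle features and labels together, then split them into batches."""
    assert len(features) == len(labels)
    order = rng.permutation(len(features))  # same order for both arrays
    features, labels = features[order], labels[order]
    return [(features[i:i + batch_size], labels[i:i + batch_size])
            for i in range(0, len(features), batch_size)]

rng = np.random.default_rng(0)
toy_features = np.arange(10).reshape(10, 1)  # 10 samples, 1 feature each
toy_labels = np.arange(10)                   # label i belongs to sample i

for epoch in range(2):
    # Reshuffle at the start of every epoch, then create the mini-batches
    epoch_batches = shuffled_batches(4, toy_features, toy_labels, rng)
    print(epoch, [batch_labels.tolist() for _, batch_labels in epoch_batches])
```

Because one permutation is applied to both arrays, every sample keeps its own label while the composition of the batches changes from epoch to epoch.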

Let's look at the MNIST dataset with weights and a bias to see if your machine can handle it.

```Python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np  # needed for np.float32 below

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
```

__Question 1:__
Calculate the memory size of train_features, train_labels, weights, and bias in bytes. Ignore memory for overhead, just calculate the memory required for the stored data.

You may have to look up how much memory a float32 requires.
- train_features Shape: (55000, 784) Type: float32
- train_labels Shape: (55000, 10) Type: float32
- weights Shape: (784, 10) Type: float32
- bias Shape: (10,) Type: float32

__Answer 1:__
- train_features Shape: (55000, 784) Type: float32 = 172,480,000
- train_labels Shape: (55000, 10) Type: float32 = 2,200,000
- weights Shape: (784, 10) Type: float32 = 31,360
- bias Shape: (10,) Type: float32 = 40

The total memory space required for the inputs, weights and bias is around 174 megabytes, which isn't that much memory. You could train this whole dataset on most CPUs and GPUs.
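The arithmetic behind these numbers can be checked with a short script. It is a sketch that only uses the shapes listed above and NumPy's reported size of a float32 (4 bytes); the dictionary names are illustrative.

```Python
import numpy as np

bytes_per_value = np.dtype(np.float32).itemsize  # a float32 takes 4 bytes

shapes = {
    'train_features': (55000, 784),
    'train_labels': (55000, 10),
    'weights': (784, 10),
    'bias': (10,),
}

# Memory = number of stored values * bytes per value
memory_bytes = {name: int(np.prod(shape)) * bytes_per_value
                for name, shape in shapes.items()}
total_bytes = sum(memory_bytes.values())

for name, size in memory_bytes.items():
    print('{:<15} {:>12,} bytes'.format(name, size))
print('{:<15} {:>12,} bytes'.format('total', total_bytes))
```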

But the larger datasets that you'll use in the future are measured in gigabytes or more. It's possible to purchase more memory, but it's expensive. A Titan X GPU with 12 GB of memory costs over $1,000.

Instead, in order to run large models on your machine, you'll learn how to use mini-batching.

Let's look at how you implement mini-batching in TensorFlow.

In order to use mini-batching, you must first divide your data into batches.

Unfortunately, it's sometimes impossible to divide the data into batches of exactly equal size. For example, imagine you'd like to create batches of 128 samples each from a dataset of 1000 samples. Since 128 does not evenly divide into 1000, you'd wind up with 7 batches of 128 samples, and 1 batch of 104 samples. $(7 \times 128 + 1 \times 104 = 1000)$

In that case, the size of the batches would vary, so you need to take advantage of TensorFlow's __tf.placeholder()__ function to receive the varying batch sizes.

Continuing the example, if each sample had __n_input = 784__ features and __n_classes = 10__ possible labels, the dimensions for __features__ would be __[None, n_input]__ and __labels__ would be __[None, n_classes]__.

```Python
# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
```

What does __None__ do here?

The __None__ dimension is a placeholder for the batch size. At runtime, TensorFlow will accept any batch size greater than 0.

Going back to our earlier example, this setup allows you to feed __features__ and __labels__ into the model as either the batches of 128 samples or the single batch of 104 samples.

__Question 2:__
Using the parameters below, how many batches are there, and what is the last batch size?

- features is (50000, 400)
- labels is (50000, 10)
- batch_size is 128

__Answer 2:__

There are 391 batches, and the last batch size is 80.
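One way to arrive at these numbers is to round the number of batches up and give the remainder to the last batch. The `batch_stats` helper below is a hypothetical name used only for this sketch.

```Python
import math

def batch_stats(n_samples, batch_size):
    """Return (number of batches, size of the last batch)."""
    assert n_samples > 0 and batch_size > 0
    n_batches = math.ceil(n_samples / batch_size)
    last_batch_size = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch_size

print(batch_stats(50000, 128))  # (391, 80)
print(batch_stats(1000, 128))   # (8, 104)
```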

__Question 3:__

Implement the batches function to batch features and labels. The function should return each batch with a maximum size of batch_size. To help you with the quiz, look at the following example output of a working batches function.

```Python
# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)
```

The example_batches variable would be the following:

```Python
[
    # 2 batches:
    #   First is a batch of size 3.
    #   Second is a batch of size 1
    [
        # First Batch is size 3
        [
            # 3 samples of features.
            # There are 4 features per sample.
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels.
            # There are 2 labels per sample.
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second Batch is size 1.
        # Since batch size is 3, there is only one sample left from the 4 samples.
        [
            # 1 sample of features.
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels.
            ['L41', 'L42']
        ]
    ]
]
```

Implement the __batches function__ in the "quiz.py" file below.

In [None]:
# quiz.py

import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching

    output_batches = []
    sample_size = len(features)
    
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches

In [None]:
# from quiz import batches
from pprint import pprint

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))

__Answer 3:__

```Python
import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches
```

Let's use mini-batching to feed batches of MNIST features and labels into a linear model.

Set the batch size and run the optimizer over all the __batches__ with the batches function. The recommended batch size is 128. If you have memory restrictions, feel free to make it smaller.

In [6]:
# helper.py

import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches

In [10]:
# quiz.py

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
# from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('datasets/mnist/', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


# TODO: Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

Extracting datasets/mnist/train-images-idx3-ubyte.gz
Extracting datasets/mnist/train-labels-idx1-ubyte.gz
Extracting datasets/mnist/t10k-images-idx3-ubyte.gz
Extracting datasets/mnist/t10k-labels-idx1-ubyte.gz
Test Accuracy: 0.11140000075101852


__Answer:__

```Python
# quiz.py

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


# TODO: Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

The accuracy is low, but you could train on the dataset more than once: a model can be trained using the same dataset multiple times. You'll go over this subject in the next section, where we talk about "epochs".

### 6.6. Epochs <a name='Epochs'></a>

An epoch is a single forward and backward pass of the whole dataset. This is used to increase the accuracy of the model without requiring more data. This section will cover epochs in TensorFlow and how to choose the right number of epochs.

The following TensorFlow code trains a model using 10 epochs.

```Python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

Running the code will output the following:

```Python
Epoch: 0    - Cost: 11.0     Valid Accuracy: 0.204
Epoch: 1    - Cost: 9.95     Valid Accuracy: 0.229
Epoch: 2    - Cost: 9.18     Valid Accuracy: 0.246
Epoch: 3    - Cost: 8.59     Valid Accuracy: 0.264
Epoch: 4    - Cost: 8.13     Valid Accuracy: 0.283
Epoch: 5    - Cost: 7.77     Valid Accuracy: 0.301
Epoch: 6    - Cost: 7.47     Valid Accuracy: 0.316
Epoch: 7    - Cost: 7.2      Valid Accuracy: 0.328
Epoch: 8    - Cost: 6.96     Valid Accuracy: 0.342
Epoch: 9    - Cost: 6.73     Valid Accuracy: 0.36 
Test Accuracy: 0.3801000118255615
```

Each epoch attempts to move to a lower cost, leading to better accuracy.

This model continues to improve accuracy up to Epoch 9. Let's increase the number of epochs to 100.

```Python
...
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.86
Epoch: 80   - Cost: 0.11     Valid Accuracy: 0.869
Epoch: 81   - Cost: 0.109    Valid Accuracy: 0.869
....
Epoch: 85   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 86   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 87   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 88   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 89   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 90   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 91   - Cost: 0.104    Valid Accuracy: 0.869
Epoch: 92   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 93   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 94   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 95   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 96   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 97   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 98   - Cost: 0.1      Valid Accuracy: 0.869
Epoch: 99   - Cost: 0.1      Valid Accuracy: 0.869
Test Accuracy: 0.8696000006198883
```

From looking at the output above, you can see the model doesn't increase the validation accuracy after epoch 80. Let's see what happens when we increase the learning rate.

_learn_rate = 0.1_

```Python
Epoch: 76   - Cost: 0.214    Valid Accuracy: 0.752
Epoch: 77   - Cost: 0.21     Valid Accuracy: 0.756
Epoch: 78   - Cost: 0.21     Valid Accuracy: 0.756
...
Epoch: 85   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 86   - Cost: 0.209    Valid Accuracy: 0.756
Epoch: 87   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 88   - Cost: 0.208    Valid Accuracy: 0.756
Epoch: 89   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 90   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 91   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 92   - Cost: 0.204    Valid Accuracy: 0.756
Epoch: 93   - Cost: 0.206    Valid Accuracy: 0.756
Epoch: 94   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 95   - Cost: 0.2974   Valid Accuracy: 0.756
Epoch: 96   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 97   - Cost: 0.2996   Valid Accuracy: 0.756
Epoch: 98   - Cost: 0.203    Valid Accuracy: 0.756
Epoch: 99   - Cost: 0.2987   Valid Accuracy: 0.756
Test Accuracy: 0.7556000053882599
```

Looks like the learning rate was increased too much. The final accuracy was lower, and it stopped improving earlier. Let's stick with the previous learning rate, but change the number of epochs to 80.

```Python
Epoch: 65   - Cost: 0.122    Valid Accuracy: 0.868
Epoch: 66   - Cost: 0.121    Valid Accuracy: 0.868
Epoch: 67   - Cost: 0.12     Valid Accuracy: 0.868
Epoch: 68   - Cost: 0.119    Valid Accuracy: 0.868
Epoch: 69   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 70   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 71   - Cost: 0.117    Valid Accuracy: 0.868
Epoch: 72   - Cost: 0.116    Valid Accuracy: 0.868
Epoch: 73   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 74   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 75   - Cost: 0.114    Valid Accuracy: 0.868
Epoch: 76   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 77   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 78   - Cost: 0.112    Valid Accuracy: 0.868
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.868
Epoch: 80   - Cost: 0.111    Valid Accuracy: 0.869
Test Accuracy: 0.86909999418258667
```

The accuracy only reached 0.86, but that could be because the learning rate was too high. Lowering the learning rate would require more epochs, but could ultimately achieve better accuracy.
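Instead of reading the plateau off the logs by eye, you could record the validation accuracy per epoch and look for the last epoch that improved it. The sketch below is a hypothetical helper, not part of the lesson code, and the `history` list only repeats a few values from the 100-epoch run shown earlier.

```Python
def last_improving_epoch(history):
    """Return (epoch, accuracy) of the last strict improvement.

    history is a list of (epoch, valid_accuracy) pairs in training order.
    """
    best_accuracy = float('-inf')
    best_epoch = None
    for epoch, valid_accuracy in history:
        if valid_accuracy > best_accuracy:
            best_accuracy = valid_accuracy
            best_epoch = epoch
    return best_epoch, best_accuracy

# A few (epoch, validation accuracy) pairs from the 100-epoch run
history = [(79, 0.86), (80, 0.869), (81, 0.869), (85, 0.869),
           (90, 0.869), (95, 0.869), (99, 0.869)]

print(last_improving_epoch(history))  # (80, 0.869)
```

Training much beyond the returned epoch did not help the validation accuracy in that run, which is the observation the epoch count of 80 was based on.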

In the upcoming TensorFlow Lab, you'll get the opportunity to choose your own learning rate, epoch count, and batch size to improve the model's accuracy.