In [1]:
%load_ext nb_black


**Note:** These lectures are for TensorFlow 1 (tensorflow 1.15.2 was used).

# What is Deep Learning?

- Deep Learning is an exciting branch of machine learning that uses data, lots of data, to teach computers how to do things only humans were capable of before
- Deep Learning has emerged as a central tool to solve perception problems in recent years
- it's the state of the art on everything having to do with computer vision and speech recognition
- but there is more; increasingly, people are finding that Deep Learning is a much better tool for solving problems like discovering new medicines, understanding natural language, understanding documents and, for example, ranking them for search

# Solving Problems - Big and Small

- many companies today have made deep learning a central part of their machine learning toolkit
  - Facebook, Baidu, Microsoft and Google are all using deep learning in their products and pushing the research forward
  - it's easy to understand why: deep learning shines wherever there is lots of data and complex problems to solve
  - and all these companies are facing lots of complicated problems, like understanding what's in an image to help you find it, or translating a document into another language that you can speak
- this class will explore a continuum of complexity, from very simple models, to very large ones that you will still be able to train in minutes on your personal computer, to very elaborate tasks like predicting the meaning of words or classifying images
- one of the nice things about deep learning is that it's really a family of techniques that adapts to all sorts of data and all sorts of problems, all using a common infrastructure and a common language to describe things
- a lot of the important work on neural networks happened in the 80s and 90s, but back then computers were slow and data sets were tiny
- the researchers didn't really find many applications in the real world
- as a result, in the first decade of the 21st century, neural networks had completely disappeared from the world of machine learning
  - working on neural networks was definitely fringe
- it's only in the last few years, first in speech recognition around 2009, and then in computer vision around 2012, that neural networks made a big comeback
  - what changed? lots of data, and cheap and fast GPUs

# Hello, Tensor World!

In [2]:
import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant("Hello World!")

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

b'Hello World!'



## Tensor

- in TensorFlow, data isn’t stored as integers, floats, or strings
- these values are encapsulated in an object called a **tensor**
- in the case of `hello_constant = tf.constant('Hello World!')`, `hello_constant` is a 0-dimensional string tensor, but tensors come in a variety of sizes as shown below:

`A = tf.constant(1234)` // A is a 0-dimensional int32 tensor

`B = tf.constant([123,456,789])` // B is a 1-dimensional int32 tensor

`C = tf.constant([ [123,456,789], [222,333,444] ])` // C is a 2-dimensional int32 tensor

- `tf.constant()` is one of many TensorFlow operations you will use in this lesson
  - the tensor returned by `tf.constant()` is called a **constant tensor**, because the value of the tensor never changes

## Session

- TensorFlow’s API is built around the idea of a computational graph, a way of visualizing a mathematical process
- https://medium.com/tebs-lab/deep-neural-networks-as-computational-graphs-867fcaa56c9
- let’s take the TensorFlow code you ran and turn that into a graph:

<img src="resources/tf_session.png" style="width: 50%;"/>

- a "TensorFlow Session", as shown above, is an environment for running a graph
- the session is in charge of allocating the operations to GPU(s) and/or CPU(s), including remote machines

## Input
- what if you want to use a non-constant?
- this is where `tf.placeholder()` and `feed_dict` come into play


- sadly you can’t just set `x` to your dataset and put it in TensorFlow, because over time you'll want your TensorFlow model to take in different datasets with different parameters
- you need `tf.placeholder()`
  - it returns a tensor that gets its value from data passed to the `tf.Session.run()` function, allowing you to set the input right before the session runs

### Session’s feed_dict

```python
x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})
```
- use the `feed_dict` parameter in `tf.Session.run()` to set the placeholder tensor
- the above example shows the tensor `x` being set to the string `"Hello World"`
- it's also possible to set more than one tensor using `feed_dict` as shown below

```python
x = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Test String', y: 123, z: 45.67})
```

**Note:** If the data passed to the `feed_dict` doesn’t match the tensor type and can’t be cast into the tensor type, you’ll get the error `ValueError: invalid literal for...`.

## Tensorflow Math
- https://www.tensorflow.org/api_docs/python/tf/math

### Addition
```python
x = tf.add(5, 2)  # 7
```
- the `tf.add()` function does exactly what you expect it to do; it takes in two numbers, two tensors, or one of each, and returns their sum as a tensor

### Subtraction
```python
x = tf.subtract(10, 4) # 6
```

### Multiplication
```python
y = tf.multiply(2, 5)  # 10
```

### Converting types
- it may be necessary to convert between types to make certain operators work together
- for example, if you tried the following, it would fail with an exception:
```python
tf.subtract(tf.constant(2.0),tf.constant(1))  # Fails with ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32: 
```
- that's because the constant `1` is an integer but the constant `2.0` is a floating point value and `subtract` expects them to match

- in cases like these, you can either make sure your data is all of the same type, or you can cast a value to another type
- in this case, converting the `2.0` to an integer before subtracting, like so, will give the correct result:
```python
tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1
```

# Supervised Classification

- throughout this course, I'm going to focus on the problem of classification
- classification is the task of taking an input, like this letter, and giving it a label that says this is a B
- the typical setting is that you have a lot of examples, called the training set, that have already been sorted into "this is an A", "this is a B", and so on
  - now if you get a completely new example, your goal is going to be to figure out which of those classes it belongs to
- there is a lot more to machine learning than just classification, but classification, or more generally prediction, is the central building block of machine learning
- once you know how to classify things, it's very easy, for example, to learn how to detect them or to rank them

# Training Your Logistic Classifier

- a logistic classifier is what’s called a linear classifier, $WX + b = y$
  - it takes the input (X), for example, the pixels in an image, and applies a linear function to them to generate its predictions (y)
  - a linear function is just a giant matrix multiply
    - it takes all the inputs as a big vector, which we'll denote by $X$, and multiplies them with a matrix to generate its predictions, one per output class
  - throughout we'll denote the inputs by $X$, the weights by $W$ and the bias term by $b$
  - the weights of that matrix and the bias are where the machine learning comes in
  - we're going to train that model; that means we're going to try to find the values for the weights and bias which are good at performing those predictions
  
<img src="resources/linear_classification_scores.png" style="width: 50%;"/>

- how are we going to use those scores to perform the classification?
- each image that we have as an input can have one and only one possible label
  - so we're going to turn those scores into probabilities
  - we're going to want the probability of the correct class to be very close to $1$
  - and the probability for every other class to be close to $0$
- the way to turn scores into probabilities is to use a softmax function $S(y_i) = \dfrac{e^{y_i}}{\sum_{j} e^{y_j}}$ which I'll denote here by $S$
- this is what it looks like

<img src="resources/linear_classification_softmax_workfow.png" style="width: 50%;"/>

- but beyond the formula what's important to know about it is that it can take any kind of scores and turn them into proper probabilities
- proper probabilities sum to $1$; they will be large when the scores are large, and small when the scores are comparatively smaller
- scores, in the context of logistic regression, are often also called logits
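
The steps above can be sketched in a few lines. This is only an illustration: the weights, bias and input are made-up numbers rather than values from the lecture, and NumPy stands in for TensorFlow.

```python
import numpy as np

# Hypothetical toy setup: 3 classes, 2 input features (all values made up)
W = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [-1.0, 0.0]])    # one row of weights per class
b = np.array([0.1, 0.2, 0.3])  # one bias per class
X = np.array([2.0, 1.0])       # a single input vector

# Linear function: one score (logit) per class
y = W @ X + b

# Softmax: S(y_i) = exp(y_i) / sum_j exp(y_j)
S = np.exp(y) / np.sum(np.exp(y))

print("scores:       ", y)
print("probabilities:", S)
print("sum:          ", S.sum())
```

The largest score ends up with the largest probability, and the probabilities add up to $1$ regardless of the scale or sign of the scores.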

# TensorFlow Linear Function

- let’s derive the function `y = Wx + b`
  - we want to translate our input, `x`, to labels, `y`
- for example, imagine we want to classify images as digits
- `x` would be our list of pixel values, and `y` would be the logits, one for each digit
- let's take a look at `y = Wx`, where the weights, `W`, determine the influence of `x` at predicting each `y`

<img src="resources/tensorflow_linear_function_1.jpg" style="width: 50%;"/>

- `y = Wx` allows us to segment the data into their respective labels using a line
- however, this line has to pass through the origin, because whenever `x` equals $0$, then `y` is also going to equal $0$
- we want the ability to shift the line away from the origin to fit more complex data
  - the simplest solution is to add a number to the function, which we call “bias”
  
<img src="resources/tensorflow_linear_function_2.jpg" style="width: 50%;"/>

- our new function becomes `Wx + b`, allowing us to create predictions on linearly separable data

## Transposition

- we've been using the `y = Wx + b` function for our linear function
- but there's another function that does the same thing, `y = xW + b`
- these functions do the same thing and are interchangeable, except for the dimensions of the matrices involved
- to shift from one function to the other, you simply have to swap the row and column dimensions of each matrix
  - this is called transposition

- for the rest of this lesson, we actually use `xW + b`, because this is what TensorFlow uses
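
To make the equivalence concrete, here is a small NumPy check. The shapes (4 features, 3 labels) and the random values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical shapes: 4 input features, 3 output labels
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weights for the y = Wx + b convention
x = rng.normal(size=(4, 1))   # input as a column vector
b = rng.normal(size=(3, 1))   # bias as a column vector

y_col = W @ x + b             # shape (3, 1)

# Transposed convention: row-vector input, transposed weights and bias
y_row = x.T @ W.T + b.T       # shape (1, 3)

print(y_col.shape, y_row.shape)
print(np.allclose(y_col.T, y_row))
```

Both conventions produce the same numbers; only the orientation of the result differs.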

## Weights and Bias in TensorFlow

- the goal of training a neural network is to modify weights and biases to best predict the labels
- in order to use weights and bias, you'll need a Tensor that can be modified - this leaves out `tf.placeholder()` and `tf.constant()`, since those Tensors can't be modified
- this is where the `tf.Variable` class comes in

### tf.Variable()

```python
x = tf.Variable(5)
```

- the `tf.Variable` class creates a tensor with an initial value that can be modified, much like a normal Python variable
- this tensor stores its state in the session, so you must initialize the state of the tensor manually
- you'll use the `tf.global_variables_initializer()` function to initialize the state of all the Variable tensors


#### Initialization
```python
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
```


- the `tf.global_variables_initializer()` call returns an operation that will initialize all TensorFlow variables from the graph
- you call the operation using a session to initialize all the variables as shown above
- using the `tf.Variable` class allows us to change the weights and bias, but an initial value needs to be chosen


- initializing the weights with random numbers from a normal distribution is good practice
- randomizing the weights helps keep the model from becoming stuck in the same place every time you train it
- you'll learn more about this in the next lesson, when you study gradient descent


- similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights
- you'll use the `tf.truncated_normal()` function to generate random numbers from a normal distribution

### tf.truncated_normal()

```python
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
```
- the `tf.truncated_normal()` function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean
- since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias
- let's use the simplest solution, setting the bias to $0$

### tf.zeros()

```python
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
```
- the `tf.zeros()` function returns a tensor with all zeros

## Quiz: Linear Function

- you'll be classifying the handwritten numbers 0, 1, and 2 from the MNIST dataset using TensorFlow
- since `xW` in `xW + b` is matrix multiplication, you have to use the `tf.matmul()` function instead of `tf.multiply()`
  - don't forget that order matters in matrix multiplication, so `tf.matmul(a,b)` is not the same as `tf.matmul(b,a)`

```python
import tensorflow as tf

def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b)
```

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from test import *

def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # TODO: Initialize session variables
    session.run(tf.global_variables_initializer())
    
    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))
```

# Linear Update

- you can’t train a neural network on a single sample
- let’s apply `n` samples of `x` to the function `y = Wx + b`, which becomes `Y = WX + B`

<img src="resources/linear_update.jpg" style="width: 50%;"/>

- for every sample of `X` (`X1`, `X2`, `X3`), we get logits for label 1 (`Y1`) and label 2 (`Y2`)
- in order to add the bias to the product of `WX`, we had to turn `b` into a matrix of the same shape
  - this is a bit unnecessary, since the bias is only two numbers
  - it should really be a vector


- we can take advantage of an operation called broadcasting used in TensorFlow and Numpy
- this operation allows arrays of different dimensions to be added or multiplied with each other
- for example:
```python
import numpy as np
t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
u = np.array([1, 2, 3])
print(t + u)
```

- the code above will print...

```python
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]
 [11 13 15]]
```

- this is because the length of `u` matches the size of the last dimension of `t`, so `u` is added to every row of `t`
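
Applied to the bias, broadcasting gives the same result as building a full bias matrix by hand. The sketch below uses made-up numbers and the `xW + b` layout (samples as rows, labels as columns), which is an assumption about orientation rather than something fixed by the figure above.

```python
import numpy as np

# Hypothetical logits for 3 samples (rows) and 2 labels (columns), before the bias
XW = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])
b = np.array([0.5, -0.5])   # one bias per label

# Explicit approach: repeat b once per sample to build a matrix of the same shape
B = np.tile(b, (XW.shape[0], 1))
explicit = XW + B

# Broadcasting: NumPy stretches b across the rows automatically
broadcast = XW + b

print(broadcast)
print(np.array_equal(explicit, broadcast))
```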

## Quiz: Softmax

- the next step is to assign a probability to each label, which you can then use to classify the data
- use the softmax function to turn your logits into probabilities
- we can do this by using the formula above, which takes the input $y$ values and the mathematical constant $e$, which is approximately equal to $2.718$
- taking $e$ to the power of any real value always gives back a positive value, which helps us scale when there are negative $y$ values
- the summation symbol in the denominator indicates that we add together all the $e^\text{(input y value)}$ elements in order to get our calculated probability outputs

- for the next quiz, you'll implement a `softmax(x)` function that takes in `x`, a one or two dimensional array of logits
  - in the one dimensional case, the array is just a single set of logits
  - in the two dimensional case, each column in the array is a set of logits
- the `softmax(x)` function should return a NumPy array of the same shape as `x`


- for example, given a one-dimensional array:
```python
# logits is a one-dimensional array with 3 elements
logits = [1.0, 2.0, 3.0]
# softmax will return a one-dimensional array with 3 elements
print(softmax(logits))
```

```python
[ 0.09003057  0.24472847  0.66524096]
```


- given a two-dimensional array where each column represents a set of logits:
```python
# logits is a two-dimensional array
logits = np.array([
    [1, 2, 3, 6],
    [2, 4, 5, 6],
    [3, 8, 7, 6]])
# softmax will return a two-dimensional array with the same shape
print(softmax(logits))
```

```python
[
  [ 0.09003057  0.00242826  0.01587624  0.33333333]
  [ 0.24472847  0.01794253  0.11731043  0.33333333]
  [ 0.66524096  0.97962921  0.86681333  0.33333333]
]
```

In [3]:
# Implement the softmax function, which is specified by the formula above in the lectures.
# The probabilities for each column must sum to 1

import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    # TODO: Compute and return softmax(x)
    return np.exp(x) / np.sum(np.exp(x), axis=0)


logits = [3.0, 1.0, 0.2]
print(softmax(logits))

[0.8360188  0.11314284 0.05083836]
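
A common variation, which is not part of the quiz solution, subtracts the largest logit in each column before exponentiating. The result is mathematically the same, but `np.exp` no longer overflows for large logits. The function name `stable_softmax` is a hypothetical one chosen for this sketch.

```python
import numpy as np

def stable_softmax(x):
    """Column-wise softmax that subtracts the per-column maximum before exponentiating."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=0, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=0, keepdims=True)

logits_2d = np.array([
    [1, 2, 3, 6],
    [2, 4, 5, 6],
    [3, 8, 7, 6]])
probs = stable_softmax(logits_2d)
print(probs.sum(axis=0))   # each column sums to 1 (up to rounding)

# Large logits overflow np.exp in the naive formula, but not in the shifted one
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```

The last call returns the same probabilities as the logits `[1.0, 2.0, 3.0]`, because softmax only depends on the differences between the scores.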



## Quiz: TensorFlow Softmax Workspaces

- now that you've built a softmax function from scratch, let's see how softmax is done in TensorFlow

```python
x = tf.nn.softmax([2.0, 1.0, 0.2])
```

- easy as that! `tf.nn.softmax()` implements the softmax function for you
  - it takes in logits and returns softmax activations


```python
import tensorflow as tf

def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)
    
    # TODO: Calculate the softmax of the logits
    softmax = tf.nn.softmax(logits)
    
    with tf.Session() as sess:
        # TODO: Feed in the logit data
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output
```

**Q:** What happens to the softmax probabilities when you multiply the logits by 10?
<br/>
**A:** Probabilities get closer to 0.0 or 1.0


**Q:** What happens to the softmax probabilities when you divide the logits by 10?
<br/>
**A:** Probabilities get closer to the uniform distribution. Since all the scores decrease in magnitude, the resulting softmax probabilities will be closer to each other.
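
Both answers can be checked numerically. The sketch below reuses the logits from the workspace above; the small `softmax` helper is redefined here in NumPy so the example is self-contained.

```python
import numpy as np

def softmax(x):
    """Softmax over a one-dimensional array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

print("original:  ", softmax(logits))
print("times 10:  ", softmax(logits * 10))   # approaches a one-hot vector
print("divided 10:", softmax(logits / 10))   # approaches the uniform distribution
```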

# One-Hot Encoding

- we need a way to represent our labels mathematically
- we just said, let's have the probabilities for the correct class be close to $1$ and the probability for all the others be close to $0$
- we can write down exactly that
- each label will be represented by a vector that is as long as the number of classes, with the value $1.0$ for the correct class and $0$ everywhere else
- this is often called one-hot encoding

<img src="resources/one_hot_encoding.png" style="width: 50%;"/>
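
One simple way to build such vectors, assuming the labels are already available as integer class indices (an assumption made for this sketch), is to index into an identity matrix with NumPy:

```python
import numpy as np

# Hypothetical integer class labels for five examples, with 3 classes (0 = a, 1 = b, 2 = c)
labels = np.array([0, 2, 1, 2, 0])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]

print(one_hot)
```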

# Cross Entropy

- one-hot encoding works very well for most problems until you get into situations where you have tens of thousands, or even millions of classes
- in that case, your vector becomes really, really large and has mostly zeros everywhere and that becomes very inefficient
- you'll see later how we can deal with these problems using embeddings
- what's nice about this approach is that we can now measure how well we're doing by simply comparing two vectors
  - one that comes out of your classifiers and contains the probabilities of your classes and the one-hot encoded vector that corresponds to your labels
- the natural way to measure the distance between those two probability vectors is called the **Cross Entropy**

<img src="resources/cross_entropy.png" style="width: 50%;"/>

- I'll denote it by $D$ here for distance
- mathematically, it looks like this: $D(S,L) = -\sum_{i} L_i \log(S_i)$
- be careful, the cross entropy is not symmetric; $D(S,L) \neq D(L,S)$ and you have a nasty log in there so you have to make sure that your labels and your distributions are in the right place
- your labels, because they're one-hot encoded, will have a lot of zeroes in them and you don't want to take the log of zeroes
- for your distribution, the softmax will always guarantee that you have a little bit of probability going everywhere, so you never really take a log of zero
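
A direct NumPy translation of the formula, using made-up probability vectors, might look like the sketch below. Note that the probabilities go inside the log and the one-hot labels outside; swapping them would take the log of zero.

```python
import numpy as np

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i), with S the predicted probabilities and L the labels."""
    return -np.sum(L * np.log(S))

S = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
L = np.array([1.0, 0.0, 0.0])   # one-hot label for the first class

print(cross_entropy(S, L))        # equals -log(0.7), roughly 0.357

# A confident wrong prediction is penalized much more heavily
S_wrong = np.array([0.1, 0.2, 0.7])
print(cross_entropy(S_wrong, L))  # equals -log(0.1), roughly 2.303
```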

- let's recap, because we have a lot of pieces already
- we have an input, it's going to be turned into logits using a linear model, which is basically your matrix multiply and a bias
- we're then going to feed the logits, which are scores, into a softmax to turn them into probabilities
- then we're going to compare those probabilities to the one-hot encoded labels using the cross entropy function

<img src="resources/multinomial_logistic_classification.png" style="width: 50%;"/>

- this entire setting is often called **multinomial logistic classification**
  - $D(S(WX+b),L)$

# Minimizing Cross Entropy

- now we have all the pieces of our puzzle
- the question is how we're going to find those weights $w$ and those biases $b$ that will get our classifier to do what we want it to do
- that is, have a low distance for the correct class but have a high distance for the incorrect class
- one thing you can do is measure that distance averaged over the entire training set, for all the inputs and all the labels that you have available
  - that's called the training loss
  - $£ = \dfrac{1}{N} \sum_{i} D(S(wx_i + b), L_i)$
    - this loss, which is the **average cross-entropy** over your entire training set, is one humongous function
    - every example in your training set gets multiplied by this one big matrix $w$, and then they get all added up in one big sum
    - we want all the distances to be small, which would mean we're doing a good job at classifying every example in the training data
    - so we want the loss to be small
    - the loss is a function of the weights and the biases
    - so we are simply going to try and minimize that function


- imagine that the loss is a function of two weights, weight one and weight two, just for the sake of argument
- it's going to be a function which will be large in some areas, and small in others
- we're going to try to find the weights which cause this loss to be the smallest
- we've just turned the machine learning problem into one of numerical optimization
- there are lots of ways to solve a numerical optimization problem
- the simplest way is one you've probably encountered before, **gradient descent**
  - take the derivative of your loss, with respect to your parameters, and follow that derivative by taking a step backwards and repeat until you get to the bottom
  - gradient descent is relatively simple, especially when you have powerful numerical tools that compute the derivatives for you
  - remember, I'm showing you the derivative for a function of just two parameters here, but for a typical problem it could be a function of thousands, millions or even billions of parameters

<img src="resources/gradient_descent_circle_graph.png" style="width: 50%;"/>
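
As a minimal sketch of the idea, the loop below runs gradient descent on a made-up loss of two weights. The loss function, starting point and learning rate are all assumptions chosen so that the example is easy to follow; a real classifier's loss is the average cross-entropy described above.

```python
def loss(w1, w2):
    """A made-up bowl-shaped loss with its minimum at (3, -1)."""
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def gradient(w1, w2):
    """Partial derivatives of the loss with respect to w1 and w2."""
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)

w1, w2 = 0.0, 0.0        # arbitrary starting point
learning_rate = 0.1      # step size, chosen by hand for this example

for step in range(100):
    g1, g2 = gradient(w1, w2)
    # Step in the direction opposite to the derivative
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2

print(w1, w2, loss(w1, w2))
```

After enough steps the weights settle very close to the minimum at (3, -1), where the loss is essentially zero.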

## Practical Aspects of Learning

- in the coming lectures, I'll talk about these tools that compute the derivatives for you, and a lot about what's good and bad about gradient descent
- for the moment, though, we'll assume that I give you the optimizer as a black box that you can simply use
- there are two last practical things that stand in your way of training your first model
  - first, how do you feed image pixels to this classifier, and second, where do you initialize the optimization? let's look into this


- this is where we have to talk a bit about numerical stability
- when you do numerical computations, you always have to worry a bit about calculating values that are too large or too small
- in particular, adding very small values to a very large value can introduce a lot of errors
  - try this in Python: take the value 1,000,000,000, add the value $10^{-6}$ to it 1,000,000 times, then subtract 1,000,000,000 again

In [4]:
a = 1000000000
for i in range(1000000):
    a = a + 1e-6
print(a - 1000000000)

0.95367431640625



# Normalized Inputs and Initial Weights

- in the example in the quiz, the math says the result should be $1.0$, but the code says $0.95$; that's a big difference
  - go ahead, replace the one billion with just one, and you'll see that the error becomes very tiny
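
The comparison suggested above can be scripted as follows. The helper name `accumulate` is a hypothetical one introduced for this sketch; it repeats the same experiment from two different starting values.

```python
def accumulate(start, increment=1e-6, steps=1000000):
    """Add a small increment many times, then subtract the starting value again."""
    a = start
    for _ in range(steps):
        a = a + increment
    return a - start

# The exact answer is 1.0 in both cases; only the rounding error differs
error_large = abs(accumulate(1000000000.0) - 1.0)
error_small = abs(accumulate(1.0) - 1.0)

print("starting from one billion, error:", error_large)
print("starting from one, error:        ", error_small)
```

Floating-point numbers have a fixed number of significant digits, so next to one billion each tiny increment is rounded noticeably, while next to one it is represented almost exactly.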


- we're going to want the values involved in the calculation of this big loss function that we care about to never get too big or too small
- one good guiding principle is that we always want our variables to have:
  - $0$ mean; $\mu(X_i) = 0$
  - equal variance $\sigma (X_i) = \sigma (X_j)$ whenever possible


- on top of the numerical issues, there are also really good mathematical reasons to keep the values you compute roughly around a mean of zero, with equal variance, when you're doing optimization
- a badly conditioned problem means that the optimizer has to do a lot of searching to go and find a good solution
- a well conditioned problem makes it a lot easier for the optimizer to do its job

<img src="resources/well_vs_bad_conditioned_problem.png" style="width: 80%;"/>

- if you're dealing with images it's simple
- you can take the pixel values of your image, they are usually between $0$ and $255$, and simply subtract $128$ and divide by $128$
  - it doesn't change the content of your image, but it makes it much easier for the optimization to proceed numerically
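
In NumPy, that normalization could look like the sketch below. The tiny 2x3 "image" is made up for illustration; the conversion to float is there because subtracting from unsigned 8-bit values would otherwise wrap around.

```python
import numpy as np

# Hypothetical 2x3 grayscale image with 8-bit pixel values
pixels = np.array([[0, 64, 128],
                   [192, 255, 32]], dtype=np.uint8)

# Convert to float first so the subtraction cannot wrap around in uint8
normalized = (pixels.astype(np.float32) - 128.0) / 128.0

print(normalized)
print(normalized.min(), normalized.max())
```

The values now lie roughly between $-1$ and $1$, centered near zero.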


- you also want your weights and biases to be initialized at a good enough starting point for the gradient descent to proceed
- there are lots of fancy schemes to find good initialization values, but we're going to focus on a simple general method
  - draw the weights randomly from a Gaussian distribution with mean $0$ and standard deviation $\sigma$
  - the sigma value determines the order of magnitude of your outputs at the initial point of your optimization
  - because of the softmax on top of it, the order of magnitude also determines the peakiness of your initial probability distribution
    - a large sigma will mean that your distribution will have large peaks, it's going to be very opinionated
    - a small sigma means that your distribution is very uncertain about things
    - it's usually better to begin with an uncertain distribution, and let the optimization become more confident as the training progresses
      - so, use a small sigma to begin with
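
The effect of sigma on the initial probabilities can be seen in a small NumPy experiment. Everything here is an assumption for illustration: the input is random noise in the normalized range, the shapes mimic a 784-pixel image with 10 classes, and the two sigma values are arbitrary.

```python
import numpy as np

def softmax(x):
    """Softmax over a one-dimensional array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n_features, n_labels = 784, 10
x = rng.uniform(-1.0, 1.0, size=n_features)      # a made-up normalized input
z = rng.standard_normal((n_features, n_labels))  # unit-variance draws, rescaled below

max_prob = {}
for sigma in (0.001, 1.0):
    W = sigma * z                                # weights with standard deviation sigma
    probs = softmax(x @ W)                       # bias is zero at initialization
    max_prob[sigma] = probs.max()
    print("sigma =", sigma, "max probability =", max_prob[sigma])
```

With the small sigma, the largest probability stays close to the uniform value of $0.1$; with the large sigma, the distribution is much more peaked.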


- now we have everything we need to actually train this classifier
- we've got our training data ($x_i$), which is normalized to have zero mean and unit variance
- we multiply it by a large matrix ($W$), which is initialized with random weights, and add the bias ($b$)
- we apply the softmax ($S$) then the cross-entropy loss ($D$) and we calculate the average of this loss over the entire training data ($\dfrac{1}{N}$)
- then our magical optimization package:
  - computes the derivative of this loss with respect to:
    - the weights: $w \leftarrow w - \alpha \nabla_w £$
    - the biases: $b \leftarrow b - \alpha \nabla_b £$
  - takes a step back in the direction opposite to that derivative
  - then, we start all over again; we repeat the process (loop) until we reach a minimum of the loss function
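
Putting the pieces together, the loop above can be written out by hand in NumPy. This is a sketch under several assumptions: the data is a made-up set of three 2-D clusters rather than images, the gradients are derived manually instead of coming from an optimization package, and the learning rate and step count are picked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: three 2-D clusters, one per class (not MNIST)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
class_ids = rng.integers(0, 3, size=300)
X = centers[class_ids] + 0.5 * rng.standard_normal((300, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # zero mean, unit variance
L = np.eye(3)[class_ids]                    # one-hot labels

W = 0.01 * rng.standard_normal((2, 3))      # small random initial weights
b = np.zeros(3)                             # zero initial bias
alpha = 0.5                                 # learning rate, chosen by hand

def average_cross_entropy(W, b):
    """Return the softmax probabilities and the average cross-entropy loss."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return S, -np.mean(np.sum(L * np.log(S), axis=1))

_, initial_loss = average_cross_entropy(W, b)
for step in range(200):
    S, loss = average_cross_entropy(W, b)
    grad_logits = (S - L) / len(X)          # derivative of the loss w.r.t. the logits
    W -= alpha * (X.T @ grad_logits)        # w <- w - alpha * dLoss/dw
    b -= alpha * grad_logits.sum(axis=0)    # b <- b - alpha * dLoss/db

_, final_loss = average_cross_entropy(W, b)
print(initial_loss, final_loss)
```

The initial loss is close to $\log 3$, which is what a classifier that is uncertain between three classes produces, and it drops as the weights and bias are updated.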

# Measuring Performance

- now that you have trained your first model, there is something very important I want to discuss
- you might have seen in the assignment that we had a training set, as well as a validation set, and a test set
- it has to do with measuring how well you're doing without accidentally shooting yourself in the foot, and it is a lot more subtle than you might initially think
- it's also very important because as we will discover later, once you know how to measure your performance on a problem, you've already solved half of it


- let me explain why measuring performance is subtle
- let's go back to our classification task
- you've got a whole lot of images with labels
- you could say, okay, I'm going to run my classifier on those images, and see how many I got right
  - that's my error measure
- and then you go out and use your classifier on new images; images that you've never seen in the past, and you measure how many you get right, and your performance gets worse
  - the classifier doesn't do as well


- so what happened?
- well, imagine I construct a classifier that simply compares the new image to any of the other images that I've already seen in my training set, and just returns the label
- by the measure we defined earlier, it's a great classifier; it would get 100% accuracy on the training set but as soon as it sees a new image, it's lost; it has no idea what to do
  - it's not a great classifier
- the problem is that your classifier has memorized the training set, and it fails to generalize to new examples
- it's not just a theoretical problem
- every classifier that you will build will tend to try and memorize the training set
- and it will usually do that very, very well
- your job though, is to help it generalize to new data instead


- so, how do we measure generalization instead of measuring how well the classifier memorized the data?
- the simplest way is to take a small subset of the training set, not use it in training, and measure the error on that test data
- problem solved, now your classifier cannot cheat because it never sees the test data, so it can't memorize it
- but there is still a problem, because training a classifier is usually a process of trial and error
  - you try a classifier, you measure its performance and then you try another one and you measure again
  - and another, and another, you tweak the model, you explore the parameters, you measure, and finally, you have what you think is the perfect classifier
  - and then after all this care you've taken to separate your test data from your training data and only measuring your performance on the test data, now you deploy your system in a real production environment
  - and you get more data and you score your performance on that new data and it doesn't do nearly as well


- what happened is that your classifier has seen your test data, indirectly, through your own eyes
- every time you make a decision about which classifier to use or which parameter to tune, you actually give information to your classifier about the test set
- just a tiny bit, but it adds up
- so over time, as you run many, and many experiments, your test data bleeds into your training data


- there are many ways to deal with this; I'll give you the simplest one
- take another chunk of your training set and hide it under a rock; this is your test set
- never look at it until you have made your final decision
- the subset you held out earlier becomes your validation set: use it to measure your error while you experiment, and accept that it may bleed into the training set
- but that's okay because you'll always have this test set that you can rely on to actually measure your real performance
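- one way this three-way split could look in plain Python; `split_dataset` and the 10% fractions are assumptions chosen for illustration, not values from the lecture

```python
import random


def split_dataset(examples, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then carve out disjoint train / validation / test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * valid_fraction)

    test = shuffled[:n_test]                    # hidden "under a rock"
    valid = shuffled[n_test:n_test + n_valid]   # used while experimenting
    train = shuffled[n_test + n_valid:]         # used to fit the model
    return train, valid, test


train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))
```

- the test slice is only scored once, after every modelling decision has been made against the validation slice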

# Transition: Overfitting -> Dataset Size

- I'm not going to talk about cross-validation here, but if you've never encountered it in your curriculum, I'd strongly recommend that you learn about it
- I am spending time on this because it's essential to deep learning in particular
- deep learning has many knobs that you can tweak, and you will be tempted to tweak them over and over
- you have to be very careful about overfitting on your test set
- use the validation set
  - how big do your validation and test sets need to be? it depends
  - the bigger your validation set the more precise your numbers will be

# Validation and Test Set Size

- imagine that your validation set has just six examples with an accuracy of $66\%$
- now you tweak your model and your performance goes from $66\%$ to $83\%$
  - is this something you can trust? no, of course not
  - this is only a change of label for a single example
  - it could just be noise
- the bigger your test set, the less noisy the accuracy measurement will be
- here is a useful rule of thumb, and if you're a statistician, feel free to cover your ears right now
  - a change that affects 30 examples in your validation set, one way or another, is usually statistically significant and typically can be trusted

## Quiz: Validation Set Size

- let's do some back of the envelope calculations
- imagine you have 3000 examples in your validation set, and assume you trust my hand-wavy rule of 30
- which level of accuracy improvement can you trust to not be in the noise?
  - a difference from 80% to 81%? YES
  - a difference from 80% to 80.5%? NO
  - a difference of 80% to 80.1%? NO


- if your accuracy changes from 80% to 80.1%, that's only at most 3 examples changing their labels
  - it's 0.1 times 3000 divided by 100
  - that's very few; it could just be noise and it definitely doesn't meet my rule of thumb of 30 examples minimum
- same thing going from 80% to 80.5%
  - at worst, only 15 examples are changing then
- when you get an improvement of 1% going from 80% to 81%, that's now a more robust 30 examples that are going from incorrect to correct
  - that's a stronger signal that whatever you're doing is indeed improving your accuracy
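- the quiz arithmetic can be sketched as a small helper; `examples_changed` and `is_trustworthy` are hypothetical names, and the threshold of 30 is the hand-wavy rule of thumb from above, not a statistical test

```python
def examples_changed(n_validation, accuracy_before, accuracy_after):
    """Net number of validation examples that must flip for this accuracy change."""
    return round(abs(accuracy_after - accuracy_before) * n_validation)


def is_trustworthy(n_validation, accuracy_before, accuracy_after, threshold=30):
    """Apply the rule of thumb: trust changes that affect at least `threshold` examples."""
    return examples_changed(n_validation, accuracy_before, accuracy_after) >= threshold


for after in (0.81, 0.805, 0.801):
    changed = examples_changed(3000, 0.80, after)
    print(after, changed, is_trustworthy(3000, 0.80, after))
```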

- this is why for most classification tasks people tend to hold back more than 30,000 examples for validation
- this makes accuracy figures significant to the first decimal place and gives you enough resolution to see small improvements
- $> 30{,}000 \text{ examples} \rightarrow \text{ changes } > 0.1\% \text{ in accuracy}$
- if your classes are not well balanced, for example if some important classes are very rare, this heuristic is no longer good
- bad news: in that case you're going to need much more data
- now, holding back even 30,000 examples can be a lot of data if you have a small training set
- **cross-validation**, which I've mentioned before, is one possible way to mitigate the issue
  - but cross-validation can be a slow process, so getting more data is often the right solution

# Optimizing a Logistic Classifier

- training logistic regression using gradient descent is great
- for one thing, you're directly optimizing the error measure that you care about
  - that's always a great idea
  - that's why in practice, a lot of machine learning research is about designing the right loss function to optimize
- but as you might have experienced if you've run the model in the assignments, it's got problems
- the biggest one is that it's very difficult to scale

# Stochastic Gradient Descent

- the problem with scaling gradient descent is simple; you need to compute these gradients $- \alpha \nabla £(w_1,w_2)$
- here's another rule of thumb
  - if computing your loss $£ = \sum_{i} D_i$ takes $n$ floating point operations, computing its gradient takes about three times that compute
  - as we saw earlier, this loss function is huge
    - it depends on every single element in your training set
      - that can be a lot of compute if your data set is big, and we want to be able to train on lots of data because in practice, on real problems, you'll always get more gains, the more data you use
    - because gradient descent is iterative, you have to do that for many steps
    - that means going through your data tens or hundreds of times; that's not good
    
<img src="resources/stochastic_gradient_descent_1.png" style="width: 50%;"/>


- so instead, we're going to cheat
- instead of computing the loss, we're going to compute an estimate of it, a very bad estimate; a terrible estimate, in fact
  - that estimate is going to be so bad, you might wonder why it works at all
  - you would be right because we're going to also have to spend some time making it less terrible
- the estimate we're going to use is simply computing the average loss for a very small random fraction of the training data
  - think between $1$ and $1000$ training samples each time
  - I say random because it's very important
    - if the way you pick your samples isn't random enough, it no longer works at all


- so we're going to take a very small sliver of the training data, compute the loss for that sample, compute the derivative for that sample, and pretend that that derivative is the right direction to use to do gradient descent
  - it is not at all the right direction; in fact, at times, it might increase the real loss, not reduce it
  - but we're going to compensate by doing this many, many times, taking very, very small steps each time
  - so each step is a lot cheaper to compute, but we pay a price
  - we have to take many more smaller steps instead of one large step
  - on balance though, we win by a lot
  - in fact, as you'll see in the assignments, doing this is vastly more efficient than doing gradient descent

<img src="resources/stochastic_gradient_descent_2.png" style="width: 50%;"/>

- this technique is called **stochastic gradient descent** and is at the core of deep learning
  - that's because stochastic gradient descent scales well with both data and model size, and we want both big data and big models
  - stochastic gradient descent, *SGD* for short, is nice and scalable
  - but because it's fundamentally a pretty bad optimizer that happens to be the only one that's fast enough, it comes with a lot of issues in practice
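- a minimal sketch of the idea in plain Python, assuming a toy one-dimensional regression problem (the data, `batch_gradient`, the learning rate, and the batch size are all made up for illustration); each step estimates the gradient from a small random sample instead of the whole dataset

```python
import random

rng = random.Random(0)

# Toy regression data (an assumption for illustration): y = 3x + 1 plus noise.
data = []
for _ in range(10000):
    x = rng.uniform(-1, 1)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)))


def batch_gradient(w, b, batch):
    """Gradient of the mean squared error, estimated on one batch only."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2 * error * x
        grad_b += 2 * error
    return grad_w / len(batch), grad_b / len(batch)


w, b = 0.0, 0.0
learning_rate = 0.05
batch_size = 32

for step in range(2000):
    batch = rng.sample(data, batch_size)  # a small random sliver of the data
    grad_w, grad_b = batch_gradient(w, b, batch)
    w -= learning_rate * grad_w  # small step along the noisy estimate
    b -= learning_rate * grad_b

print("w = {:.2f}, b = {:.2f}".format(w, b))
```

- each step touches 32 of the 10,000 examples, so the 2,000 steps together cost about as much as 6.4 full passes over the data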

# Momentum and Learning Rate Decay

- you've already seen some of these tricks
  - I asked you to make your inputs zero mean and equal variance earlier
    - it's very important for SGD
  - I also told you to initialize with random weights that have relatively small variance, same thing
- I'm going to talk about a few more of those important tricks and that should cover all you really need to worry about to implement SGD

<img src="resources/momentum.png" style="width: 50%;"/>

- the first one is momentum
- remember that at each step, we're taking a very small step in a random direction
- but on aggregate, those steps take us towards the minimum of the loss
- we can take advantage of the knowledge that we've accumulated from previous steps about where we should be headed
  - a cheap way to do that is to keep a running average of the gradients $M \leftarrow 0.9M + \nabla£$ and to use that running average $M(w_1, w_2)$ instead of the direction of the current batch of the data
- this momentum technique works very well and often leads to better convergence
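- a small sketch of the running-average update $M \leftarrow 0.9M + \text{gradient}$ in plain Python; `momentum_step`, `noisy_gradient`, and the toy loss $w_1^2 + 10 w_2^2$ are assumptions for illustration only

```python
import random


def momentum_step(weights, running, grads, learning_rate, decay=0.9):
    """Update M <- decay * M + gradient, then step the weights along M."""
    new_running = [decay * m + g for m, g in zip(running, grads)]
    new_weights = [w - learning_rate * m for w, m in zip(weights, new_running)]
    return new_weights, new_running


def noisy_gradient(weights, rng):
    """Noisy gradient of the toy loss w1^2 + 10 * w2^2 (stands in for a mini-batch)."""
    w1, w2 = weights
    return [2 * w1 + rng.gauss(0, 0.1), 20 * w2 + rng.gauss(0, 0.1)]


rng = random.Random(0)
weights, running = [5.0, 5.0], [0.0, 0.0]

for step in range(500):
    grads = noisy_gradient(weights, rng)
    weights, running = momentum_step(weights, running, grads, learning_rate=0.01)

print(weights)
```

- the weights are moved along the accumulated direction `running` rather than along the gradient of the current batch alone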


- the second one is learning rate decay
- remember, when replacing gradient descent with SGD, I said that we were going to take smaller, noisier steps towards our objective
  - how small should that step be? that's a whole area of research, as well
  - one thing that's always the case, however, is that it's beneficial to make that step smaller and smaller as you train
  - some like to apply an exponential decay to their learning rate
  - some like to make it smaller every time the loss reaches a plateau
  - there are lots of ways to go about it, but lowering it over time is the key thing to remember
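- as one example, exponential decay can be sketched as below; the helper name and the numbers are made up for illustration, and the formula mirrors the non-staircase form documented for TensorFlow 1's `tf.train.exponential_decay`

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` updates: multiplied by `decay_rate` every `decay_steps`."""
    return initial_rate * decay_rate ** (step / decay_steps)


for step in (0, 1000, 2000, 5000):
    rate = exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.5)
    print(step, rate)
```

- with these values the learning rate halves every 1,000 steps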

<img src="resources/learning_rate_decay.png" style="width: 50%;"/>

# Parameter Hyperspace

- learning rate tuning can be very strange
- for example, you might think that using a higher learning rate means that you learn more or that you learn faster, that's just not true
- in fact, you can often take a model, lower the learning rate and get to a better model faster


- it gets even worse
- you might be tempted to look at the curve that shows the loss over time to see how quickly you learn
- here the higher learning rate starts faster, but then it plateaus while the lower learning rate keeps on going and gets better

<img src="resources/learning_rate_tuning.png" style="width: 50%;"/>

- it is a very familiar picture for anyone who's trained neural networks
- never trust how quickly you learn; it often has little to do with how well you train
- this is where *SGD* gets its reputation for being *black magic*
  - you have many, many hyperparameters that you could play with
    - initial learning rate, learning rate decay, momentum, batch size, weight initialization
      - you have to get them right
    - in practice, it's not that bad, but if you have to remember just one thing, it's that **when things don't work, always try to lower your learning rate first**
- there are lots of good solutions for small models
- but sadly, none that's completely satisfactory, so far, for the very large models that we really care about
- I'll mention one approach called **ADAGRAD** that makes things a little bit easier
  - *ADAGRAD* is a modification of *SGD* which implicitly does momentum and learning rate decay for you
  - using *ADAGRAD* often makes learning less sensitive to hyperparameters
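- a rough sketch of the usual AdaGrad update in plain Python, assuming the standard formulation (accumulate squared gradients per weight and divide each step by the square root of that sum); `adagrad_step` and the toy loss $w_1^2 + 10 w_2^2$ are illustrative names and choices, not the course's implementation

```python
import math


def adagrad_step(weights, accum, grads, learning_rate=0.5, eps=1e-8):
    """Scale each weight's step by the history of its own squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_weights = [
        w - learning_rate * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_accum)
    ]
    return new_weights, new_accum


def gradient(weights):
    """Exact gradient of the toy loss w1^2 + 10 * w2^2."""
    w1, w2 = weights
    return [2 * w1, 20 * w2]


weights, accum = [5.0, 5.0], [0.0, 0.0]
for step in range(500):
    weights, accum = adagrad_step(weights, accum, gradient(weights))

print(weights)
```

- because the accumulated sum only grows, the effective step size shrinks over time, which is the built-in learning rate decay; in this toy problem both weights follow nearly the same path even though one direction is ten times steeper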

# Mini-batch

- mini-batching is a technique for training on subsets of the dataset instead of all the data at one time
- this provides the ability to train a model, even if a computer lacks the memory to store the entire dataset
- mini-batching is computationally inefficient, since you can't calculate the loss simultaneously across all samples
  - however, this is a small price to pay in order to be able to run the model at all


- it's also quite useful combined with SGD
- the idea is to randomly shuffle the data at the start of each epoch, then create the mini-batches
- for each mini-batch, you train the network weights with gradient descent
- since these batches are random, you're performing SGD with each batch
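- the shuffle-then-batch step could look like this in plain Python; `shuffled_batches` is a hypothetical helper for illustration and the toy features and labels are made up

```python
import random


def shuffled_batches(batch_size, features, labels, rng):
    """Shuffle features and labels together, then cut them into mini-batches."""
    assert len(features) == len(labels)
    order = list(range(len(features)))
    rng.shuffle(order)  # one shared permutation keeps each feature with its label
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield [features[i] for i in indices], [labels[i] for i in indices]


features = [[i, i + 0.5] for i in range(10)]
labels = [i % 2 for i in range(10)]
rng = random.Random(0)

for epoch in range(2):
    sizes = [len(f) for f, _ in shuffled_batches(4, features, labels, rng)]
    print("epoch", epoch, "batch sizes", sizes)
```

- calling the generator again at the start of each epoch reshuffles the data, so the batches differ from one epoch to the next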

- let's look at the MNIST dataset with weights and a bias to see if your machine can handle it

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
```


**Question 1:** Calculate the memory size of `train_features`, `train_labels`, `weights`, and `bias` in bytes. Ignore memory for overhead, just calculate the memory required for the stored data.

You may have to look up how much memory a float32 requires. - It's 4 bytes

`train_features` Shape: (55000, 784) Type: float32 - It needs 172480000 bytes
<br/>
`train_labels` Shape: (55000, 10) Type: float32 - It needs 2200000 bytes
<br/>
`weights` Shape: (784, 10) Type: float32 - It needs 31360 bytes
<br/>
`bias` Shape: (10,) Type: float32 - It needs 40 bytes
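
- the byte counts above can be double-checked with a few lines of plain Python; `float32_bytes` is a hypothetical helper, and the shapes are the ones quoted in the question

```python
BYTES_PER_FLOAT32 = 4

shapes = {
    "train_features": (55000, 784),
    "train_labels": (55000, 10),
    "weights": (784, 10),
    "bias": (10,),
}


def float32_bytes(shape):
    """Bytes needed to store a float32 array of the given shape (data only)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT32


sizes = {name: float32_bytes(shape) for name, shape in shapes.items()}
total = sum(sizes.values())

for name, size in sizes.items():
    print("{:<15} {:>12,} bytes".format(name, size))
print("{:<15} {:>12,} bytes".format("total", total))
```

- for actual NumPy arrays, the `nbytes` attribute should report the same figures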


- the total memory space required for the inputs, weights and bias is around 174 megabytes, which isn't that much memory
  - you could train this whole dataset on most CPUs and GPUs
  - but larger datasets that you'll use in the future are measured in gigabytes or more
- it's possible to purchase more memory, but it's expensive
  - a Titan X GPU with 12 GB of memory costs over \\$1,000


- instead, in order to run large models on your machine, you'll learn how to use mini-batching

## TensorFlow Mini-batching

- in order to use mini-batching, you must first divide your data into batches
- unfortunately, it's sometimes impossible to divide the data into batches of exactly equal size
  - for example, imagine you'd like to create batches of 128 samples each from a dataset of 1000 samples
  - since 128 does not evenly divide into 1000, you'd wind up with 7 batches of 128 samples, and 1 batch of 104 samples ($7\cdot128 + 1\cdot104 = 1000$)
  - in that case, the size of the batches would vary, so you need to take advantage of TensorFlow's `tf.placeholder()` function to receive the varying batch sizes


- continuing the example, if each sample had `n_input = 784` features and `n_classes = 10` possible labels, the dimensions for `features` would be `[None, n_input]` and `labels` would be `[None, n_classes]`
- what does `None` do here?
  - the `None` dimension is a placeholder for the batch size
  - at runtime, TensorFlow will accept any batch size greater than $0$


- going back to our earlier example, this setup allows you to feed features and labels into the model as either the batches of 128 samples or the single batch of 104 samples


**Question 2:** Using the parameters below, how many batches are there, and what is the last batch size? The `batch_size` is 128.

`features` is (50000, 400)
<br/>
`labels` is (50000, 10)


There are 391 batches (50000 / 128 = 390.625, rounded up).
<br/>
The last batch size is 80.
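
- the same calculation as a small sketch; `batch_stats` is a hypothetical helper name used only for this illustration

```python
import math


def batch_stats(n_samples, batch_size):
    """Return (number of batches, size of the last batch)."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch_size = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch_size


print(batch_stats(50000, 128))
print(batch_stats(1000, 128))
```

- the second call reproduces the earlier example of 1000 samples split into 7 full batches and 1 batch of 104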

## Quiz: Mini-batch 1

Implement the `batches` function to batch `features` and `labels`. The function should return each batch with a maximum size of `batch_size`. To help you with the quiz, look at the following example output of a working `batches` function.

```python
# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)
```

The example_batches variable would be the following:

```python
[
    # 2 batches:
    #   First is a batch of size 3.
    #   Second is a batch of size 1
    [
        # First Batch is size 3
        [
            # 3 samples of features.
            # There are 4 features per sample.
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels.
            # There are 2 labels per sample.
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second Batch is size 1.
        # Since batch size is 3, there is only one sample left from the 4 samples.
        [
            # 1 sample of features.
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels.
            ['L41', 'L42']
        ]
    ]
]
```

In [5]:
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches


In [6]:
from pprint import pprint

# 4 Samples of features
example_features = [
    ["F11", "F12", "F13", "F14"],
    ["F21", "F22", "F23", "F24"],
    ["F31", "F32", "F33", "F34"],
    ["F41", "F42", "F43", "F44"],
]
# 4 Samples of labels
example_labels = [["L11", "L12"], ["L21", "L22"], ["L31", "L32"], ["L41", "L42"]]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))

[[[['F11', 'F12', 'F13', 'F14'],
   ['F21', 'F22', 'F23', 'F24'],
   ['F31', 'F32', 'F33', 'F34']],
  [['L11', 'L12'], ['L21', 'L22'], ['L31', 'L32']]],
 [[['F41', 'F42', 'F43', 'F44']], [['L41', 'L42']]]]



## Quiz: Mini-batch 2

Let's use mini-batching to feed batches of MNIST features and labels into a linear model.

Set the batch size and run the optimizer over all the batches with the batches function. The recommended batch size is 128. If you have memory restrictions, feel free to make it smaller.

```python
import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
```

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```

```python
# TODO: Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    # for batch_features, batch_labels in ______
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

The accuracy is low, but you can train on the dataset more than once to improve it. The next section covers this under the name "epochs".

# Epochs

- an epoch is a single forward and backward pass of the whole dataset
- this is used to increase the accuracy of the model without requiring more data
- this section will cover epochs in TensorFlow and how to choose the right number of epochs

- the following TensorFlow code trains a model using 10 epochs

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

- running the code will output the following:

```python
Epoch: 0    - Cost: 11.0     Valid Accuracy: 0.204
Epoch: 1    - Cost: 9.95     Valid Accuracy: 0.229
Epoch: 2    - Cost: 9.18     Valid Accuracy: 0.246
Epoch: 3    - Cost: 8.59     Valid Accuracy: 0.264
Epoch: 4    - Cost: 8.13     Valid Accuracy: 0.283
Epoch: 5    - Cost: 7.77     Valid Accuracy: 0.301
Epoch: 6    - Cost: 7.47     Valid Accuracy: 0.316
Epoch: 7    - Cost: 7.2      Valid Accuracy: 0.328
Epoch: 8    - Cost: 6.96     Valid Accuracy: 0.342
Epoch: 9    - Cost: 6.73     Valid Accuracy: 0.36 
Test Accuracy: 0.3801000118255615
```

- each epoch attempts to move to a lower cost, leading to better accuracy
- this model continues to improve accuracy up to Epoch 9. Let's increase the number of epochs to 100

```python
...
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.86
Epoch: 80   - Cost: 0.11     Valid Accuracy: 0.869
Epoch: 81   - Cost: 0.109    Valid Accuracy: 0.869
....
Epoch: 85   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 86   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 87   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 88   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 89   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 90   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 91   - Cost: 0.104    Valid Accuracy: 0.869
Epoch: 92   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 93   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 94   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 95   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 96   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 97   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 98   - Cost: 0.1      Valid Accuracy: 0.869
Epoch: 99   - Cost: 0.1      Valid Accuracy: 0.869
Test Accuracy: 0.8696000006198883
```

- from looking at the output above, you can see the model doesn't increase the validation accuracy after epoch 80
- let's see what happens when we increase the learning rate

`learn_rate = 0.1`

```python
Epoch: 76   - Cost: 0.214    Valid Accuracy: 0.752
Epoch: 77   - Cost: 0.21     Valid Accuracy: 0.756
Epoch: 78   - Cost: 0.21     Valid Accuracy: 0.756
...
Epoch: 85   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 86   - Cost: 0.209    Valid Accuracy: 0.756
Epoch: 87   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 88   - Cost: 0.208    Valid Accuracy: 0.756
Epoch: 89   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 90   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 91   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 92   - Cost: 0.204    Valid Accuracy: 0.756
Epoch: 93   - Cost: 0.206    Valid Accuracy: 0.756
Epoch: 94   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 95   - Cost: 0.2974   Valid Accuracy: 0.756
Epoch: 96   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 97   - Cost: 0.2996   Valid Accuracy: 0.756
Epoch: 98   - Cost: 0.203    Valid Accuracy: 0.756
Epoch: 99   - Cost: 0.2987   Valid Accuracy: 0.756
Test Accuracy: 0.7556000053882599
```

- looks like the learning rate was increased too much
- the final accuracy was lower, and it stopped improving earlier
- let's stick with the previous learning rate, but change the number of epochs to 80

```python
Epoch: 65   - Cost: 0.122    Valid Accuracy: 0.868
Epoch: 66   - Cost: 0.121    Valid Accuracy: 0.868
Epoch: 67   - Cost: 0.12     Valid Accuracy: 0.868
Epoch: 68   - Cost: 0.119    Valid Accuracy: 0.868
Epoch: 69   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 70   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 71   - Cost: 0.117    Valid Accuracy: 0.868
Epoch: 72   - Cost: 0.116    Valid Accuracy: 0.868
Epoch: 73   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 74   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 75   - Cost: 0.114    Valid Accuracy: 0.868
Epoch: 76   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 77   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 78   - Cost: 0.112    Valid Accuracy: 0.868
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.868
Epoch: 80   - Cost: 0.111    Valid Accuracy: 0.869
Test Accuracy: 0.86909999418258667
```

- the accuracy only reached 0.86, but that could be because the learning rate was too high
- lowering the learning rate would require more epochs, but could ultimately achieve better accuracy