# Lesson11-2 TensorFlow - Classification

## 1. Forward Propagation
### 1. TensorFlow Linear Function
Let’s derive the function `y = Wx + b`. We want to translate our input, `x`, to labels, `y`.

For example, imagine we want to classify images as digits.

`x` would be our list of pixel values, and `y` would be the logits, one for each digit. Let's take a look at `y = Wx`, where the weights, `W`, determine the influence of `x` at predicting each `y`.

![](https://video.udacity-data.com/topher/2018/May/5af21f64_wx-1/wx-1.jpg)

`y = Wx` allows us to segment the data into their respective labels using a line.

However, this line has to pass through the origin, because whenever `x` equals 0, then `y` is also going to equal 0.

We want the ability to shift the line away from the origin to fit more complex data. The simplest solution is to add a number to the function, which we call the **"bias"**.

![](https://video.udacity-data.com/topher/2018/May/5af21f8c_wx-b/wx-b.jpg)

Our new function becomes `Wx + b`, allowing us to create predictions on linearly separable data. Let’s use a concrete example and calculate the logits.

#### Matrix Multiplication
Calculate the logits `a` and `b` for the following formula.
![](https://video.udacity-data.com/topher/2018/May/5af21fd9_codecogseqn-13/codecogseqn-13.gif)
The answer is `a = 0.16`, `b = 0.06`.


#### Transposition
We've been using the `y = Wx + b` function for our linear function.

But there's another function that does the same thing, `y = xW + b`. The two are interchangeable, except for the dimensions of the matrices involved.

To shift from one function to the other, you simply have to swap the row and column dimensions of each matrix. This is called **Transposition**.

For the rest of this lesson, we use `xW + b`, because this is what TensorFlow uses.

![img](https://video.udacity-data.com/topher/2018/May/5af220e2_codecogseqn-18/codecogseqn-18.gif)

The above example is identical to the quiz you just completed, except that the matrices are **transposed**.

`x` now has the dimensions 1x3, `W` now has the dimensions 3x2, and `b` now has the dimensions `1x2`. Calculating this will produce a matrix with the dimension of `1x2`.

You'll notice that the elements in this 1x2 matrix are the same as the elements in the 2x1 matrix from the quiz. Again, these matrices are simply transposed.
![](https://video.udacity-data.com/topher/2018/May/5af2210f_codecogseqn-20/codecogseqn-20.gif)

We now have our logits! The columns represent the logits for our two labels.   
Now you can learn how to train this function in TensorFlow.
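
To check numerically that the two forms agree, here is a minimal NumPy sketch. The values of `W`, `x`, and `b` are made up for illustration and are not the numbers from the quiz images.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])        # shape (2, 3)
x = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)
b = np.array([[0.5], [-0.5]])          # shape (2, 1)

y_col = W @ x + b          # y = Wx + b, shape (2, 1)
y_row = x.T @ W.T + b.T    # y = xW + b with every matrix transposed, shape (1, 2)

print(y_col.shape, y_row.shape)
print(np.allclose(y_col.T, y_row))
```

The two results hold the same logits; only the layout (column versus row) differs.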

#### Weights and Bias in TensorFlow
The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and bias, you'll need a Tensor that can be modified. This leaves out `tf.placeholder()` and `tf.constant()`, since those Tensors can't be modified. This is where the `tf.Variable` class comes in.

- **tf.Variable()**
```python
x = tf.Variable(5)
```
The `tf.Variable` class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so you must initialize the state of the tensor manually. You'll use the `tf.global_variables_initializer()` function to initialize the state of all the Variable tensors.

    - Initialization
    ```python
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
    ```
    The `tf.global_variables_initializer()` call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above. Using the tf.Variable class allows us to change the weights and bias, but an initial value needs to be chosen.

    Initializing the weights with random numbers from a **normal distribution** is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it. You'll learn more about this in the next lesson, when you study gradient descent.

    Similarly, choosing weights from a **normal distribution** prevents any one weight from overwhelming other weights. You'll use the `tf.truncated_normal()` function to generate random numbers from a normal distribution.

- **tf.truncated_normal()**

```python
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
```
The `tf.truncated_normal()` function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
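
The idea of truncation can be sketched in plain NumPy by redrawing any value that falls more than 2 standard deviations from the mean. This is an illustration of the concept only, not TensorFlow's implementation; the function name `truncated_normal_sketch` and the `seed` parameter are hypothetical.

```python
import numpy as np

def truncated_normal_sketch(shape, mean=0.0, stddev=1.0, seed=0):
    """Draw normal samples and redraw any that lie beyond 2 standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(mean, stddev, size=shape)
    out_of_range = np.abs(values - mean) > 2 * stddev
    while out_of_range.any():
        # Redraw only the offending entries and check again.
        values[out_of_range] = rng.normal(mean, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values - mean) > 2 * stddev
    return values

n_features = 120
n_labels = 5
weights = truncated_normal_sketch((n_features, n_labels))
print(weights.shape, np.abs(weights).max() <= 2.0)
```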

Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use the simplest solution, setting the bias to 0.

- **tf.zeros()**
``` python
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
```
The `tf.zeros()` function returns a tensor with all zeros.

### Linear Classification Quiz
![](https://video.udacity-data.com/topher/2018/May/5af2214a_mnist-012/mnist-012.png)
You'll be classifying the handwritten numbers `0`, `1`, and `2` from the **MNIST dataset** using TensorFlow. The above is a small sample of the data you'll be training on. Notice how some of the 1s are written with a [serif](https://en.wikipedia.org/wiki/Serif) at the top and at different angles. The similarities and differences will play a part in shaping the weights of the model.

![](https://video.udacity-data.com/topher/2018/May/5af22171_weights-0-1-2/weights-0-1-2.png)
Left: Weights for labeling 0.   
Middle: Weights for labeling 1.   
Right: Weights for labeling 2.   

The images above are trained weights for each label (`0`, `1`, and `2`). The weights display the unique properties of each digit they have found. Complete this quiz to train your own weights using the MNIST dataset.

- **Instructions**
    1. Open quiz.py.
        1. Implement get_weights to return a tf.Variable of weights
        2. Implement get_biases to return a tf.Variable of biases
        3. Implement xW + b in the linear function
    2. Open sandbox.py
        1. Initialize all weights

Since `xW` in `xW + b` is matrix multiplication, you have to use the `tf.matmul()` function instead of `tf.multiply()`. Don't forget that order matters in matrix multiplication, so `tf.matmul(a,b)` is not the same as `tf.matmul(b,a)`.
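
Why order matters can be seen with two small matrices. The sketch below uses NumPy's `@` operator as a stand-in for matrix multiplication; the matrices `a` and `b` are made-up examples.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
b = np.array([[1, 0],
              [0, 1],
              [1, 1]])       # shape (3, 2)

ab = a @ b    # (2, 3) times (3, 2) gives shape (2, 2)
ba = b @ a    # (3, 2) times (2, 3) gives shape (3, 3)

print(ab)
print(ba)
```

Swapping the operands changes both the shape and the values of the result, and an element-wise product of `a` and `b` is not even defined here because the shapes differ.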

In [1]:
## The Udacity Workspace runs tensorflow version == 1.3.0.
## However, the version that can be installed now is tensorflow == 1.15.0.
## Using the Classroom code (the code below) unchanged raises errors, so let's fix it...

# Solution is available in the other "quiz_solution.ipynb" 
import tensorflow as tf

def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    return tf.Variable(tf.zeros(n_labels))


def linear(input, W, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param W: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    return tf.add(tf.matmul(input, W), b)

In [2]:
# Solution is available in the other "quiz_solution.ipynb" tab

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from test import *

def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

Extracting /datasets/ud730/mnist\train-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist\train-labels-idx1-ubyte.gz
Extracting /datasets/ud730/mnist\t10k-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist\t10k-labels-idx1-ubyte.gz


In [3]:
with tf.Session() as session:
    # TODO: Initialize session variables
    session.run(tf.global_variables_initializer())
    
    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), axis=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))

Loss: 5.583266258239746


#### Running the Grader

To run the grader below, run the above training from scratch (in case you have already run it multiple times). You can reset your kernel and then run all cells so that the grader code can properly check that your weights and biases achieve the desired end result.

In [4]:
### DON'T MODIFY ANYTHING BELOW ###
### Be sure to run all cells above before running this cell ###
import grader_quiz_LinearClassification as grader

try:
    grader.run_grader(get_weights, get_biases, linear)
except Exception as err:
    print(str(err))


Function weights isn't correct.




### 2. Softmax
Congratulations on successfully implementing a linear function that outputs logits. You're one step closer to a working classifier.

The next step is to assign a probability to each label, which you can then use to classify the data. Use the softmax function to turn your logits into probabilities.

We can do this with the formula below, which takes the logits `y` as input and uses the mathematical constant `e`, approximately 2.718. Raising `e` to any real power always yields a positive value, which lets the function handle negative `y` values. The summation in the denominator adds up `e^y` over all the logits, so each output is a probability and the outputs sum to 1.

![Softmax function](https://video.udacity-data.com/topher/2017/August/59a3b336_softmax/softmax.png)

#### Quiz   
For the next quiz, you'll implement a `softmax(x)` function that takes in `x`, a one or two dimensional array of logits.

In the one dimensional case, the array is just a single set of logits. In the two dimensional case, each column in the array is a set of logits. The `softmax(x)` function should return a NumPy array of the same shape as `x`.

For example, given a one-dimensional array:
```python
# logits is a one-dimensional array with 3 elements
logits = [1.0, 2.0, 3.0]
# softmax will return a one-dimensional array with 3 elements
print(softmax(logits))
```
```
$ [ 0.09003057  0.24472847  0.66524096]
```
   
Given a two-dimensional array where each column represents a set of logits:
```python
# logits is a two-dimensional array
logits = np.array([
    [1, 2, 3, 6],
    [2, 4, 5, 6],
    [3, 8, 7, 6]])
# softmax will return a two-dimensional array with the same shape
print(softmax(logits))
```
```
$ [
    [ 0.09003057  0.00242826  0.01587624  0.33333333]
    [ 0.24472847  0.01794253  0.11731043  0.33333333]
    [ 0.66524096  0.97962921  0.86681333  0.33333333]
  ]
```
Implement the softmax function, which is specified by the formula at the top of the page.

The probabilities for each column must sum to 1. Feel free to test your function with the inputs above.

In [5]:
# Solution is available in the other "solution.py" tab
import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    # TODO: Compute and return softmax(x)
    
    arr = np.array(x)
    arr_exp = np.exp(arr)
    
    if arr.ndim == 1 or arr.ndim == 2:
        arr_result = arr_exp / np.sum(arr_exp, axis = 0)
        return arr_result
    
    else:
        raise NotImplementedError
        
# logits = [3.0, 1.0, 0.2]
logits = np.arange(9).reshape(3, 3)
print(softmax(logits))


[[0.00235563 0.00235563 0.00235563]
 [0.04731416 0.04731416 0.04731416]
 [0.95033021 0.95033021 0.95033021]]
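
One caveat with a direct implementation is that `np.exp` overflows for large logits. A common remedy, shown here as a hedged sketch rather than as the course's solution, is to subtract the per-column maximum before exponentiating, which leaves the result unchanged. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(x):
    """Softmax over axis 0, shifted by the column maximum to avoid overflow."""
    arr = np.asarray(x, dtype=np.float64)
    shifted = arr - np.max(arr, axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=0, keepdims=True)

small = stable_softmax([1.0, 2.0, 3.0])
large = stable_softmax([1000.0, 1001.0, 1002.0])
print(small)
print(large)
```

Both calls return the same probabilities, because adding a constant to every logit does not change the softmax.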


#### TensorFlow Softmax
Now that you've built a softmax function from scratch, let's see how softmax is done in TensorFlow.
```python
x = tf.nn.softmax([2.0, 1.0, 0.2])
```
Easy as that! `tf.nn.softmax()` implements the softmax function for you. It takes in logits and returns softmax activations.

#### Quiz
Use the softmax function in the quiz below to return the softmax of the logits.
![Softmax Quiz](https://video.udacity-data.com/topher/2017/February/58950908_softmax-input-output/softmax-input-output.png)

In [6]:
# Solution is available in the other "solution.ipynb" 
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)
    
    # TODO: Calculate the softmax of the logits
    softmax = tf.nn.softmax(logits)
    
    with tf.Session() as sess:
        # TODO: Feed in the logit data
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output

In [7]:
### DON'T MODIFY ANYTHING BELOW ###
### Be sure to run all cells above before running this cell ###
import grader_quiz_softmax as grader

try:
    grader.run_grader(run)
except Exception as err:
    print(str(err))

That's the correct softmax!




### 3. One-Hot Encoding
In classification, one-hot encoding sets the label with the highest probability to 1 and all the other labels to 0.
$$
\begin{bmatrix} 0.7 \\ 0.1 \\ 0.2 \end{bmatrix} \rightarrow \begin{bmatrix} 1.0\\0.0\\0.0 \end{bmatrix}
$$
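
The mapping above can be sketched in NumPy with `np.argmax`. The variable names are illustrative.

```python
import numpy as np

probs = np.array([0.7, 0.1, 0.2])

# Put 1.0 at the position of the largest probability and 0.0 everywhere else.
one_hot = np.zeros_like(probs)
one_hot[np.argmax(probs)] = 1.0

print(one_hot)  # [1. 0. 0.]
```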

### 4. Cross Entropy
Let $S(y)$ be the output of forward propagation and let $L$ be the one-hot encoded label, which in this example equals the one-hot encoding of $S(y)$.

$$
S(y) = \begin{bmatrix} 0.7 \\ 0.1 \\ 0.2 \end{bmatrix},\quad L = \begin{bmatrix} 1.0\\0.0\\0.0 \end{bmatrix}
$$
   
Defining the error between the two as the cross entropy and writing it as $D(S, L)$, we get

$$
D(S, L) = -\sum_{i} L_{i}\log(S_i)
$$
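
As a worked example, the formula can be evaluated for the vectors $S$ and $L$ given above. Only the term with $L_i = 1$ contributes, so the result is $-\log(0.7)$. The function name `cross_entropy` is an illustrative choice.

```python
import numpy as np

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i), using the natural logarithm."""
    S = np.asarray(S, dtype=np.float64)
    L = np.asarray(L, dtype=np.float64)
    return -np.sum(L * np.log(S))

d = cross_entropy([0.7, 0.1, 0.2], [1.0, 0.0, 0.0])
print(round(d, 4))  # 0.3567
```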

### 5. Summary: Forward Propagation

![Forward Propagation](Images/ForwardPropagationProcess.png)

## 2. Model Training

### 1. Input Data Normalization
![Data Normalization](Images/DataNormalization.png)

We train the model with gradient descent, but the range of values differs from feature to feature: one may span 0\~100 while another spans 0\~1,000,000.   

If the gradient is computed on such data as-is, it is dominated by the features with the larger ranges, and updating in that direction scaled by the learning rate makes training less effective.   

Data normalization (more precisely, **data standardization**) brings the features to a common range and therefore makes training more effective.   

The lecture uses the term normalization, but the formula it presents is the standardization formula, which uses the mean and the standard deviation. 
   
Strictly speaking, normalization and standardization are different.
- Standardization
$$
    x_{new} = \frac{x - \mu}{\sigma} \qquad \mu = \text{mean},\  \sigma = \text{standard deviation}
$$   
   
   
- Normalization: rescales the data to the range between 0 and 1
$$
    x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}
$$
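
The two formulas can be compared on a small sample. This is a minimal NumPy sketch; the array `x` is a made-up example.

```python
import numpy as np

x = np.array([0.0, 25.0, 50.0, 75.0, 100.0])

# Standardization: zero mean and unit standard deviation.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max scaling): values rescaled to the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)
```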

### 2. Weight Initialization
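Initializing the model's weights from a normal distribution is also said to help training. (I still need to look up the papers for the reason.)    
   
When initializing the weights from a normal distribution, we have to choose the mean and the standard deviation. The lecture gives the following guidelines.
- Mean = 0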
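- Standard deviation: start with a small value   
    The scale of the weights sets the scale of the initial logits. With a large standard deviation the initial logits are large, so the softmax output starts out sharply peaked, that is, confident in an arbitrary class. With a small standard deviation the initial probabilities are close to uniform, and training can make the model more confident as it progresses. It is therefore better to initialize the weights from a normal distribution with a small standard deviation.
    ![Normal distributions for different standard deviations](https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1200px-Normal_Distribution_PDF.svg.png)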

### 3. Validation Set, Test Set
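When training a model, the dataset is usually split into three parts: the Training Set, the Validation Set, and the Test Set.   

- Training Set   : the dataset used to train the model
- Validation Set : the dataset used to validate the training
- Test Set       : the dataset used for the final evaluation of the trained model

   
**Why do we need a `Validation Set` in addition to the `Training Set` and the `Test Set`?**   
Consider how training works. We train on the Training Set and use the resulting error to update the model through back propagation.   
   
The more often this is repeated, the more likely the model is to get the `Training Set` right. But when the model is evaluated on the Test Set, the accuracy can turn out lower than the accuracy reached on the Training Set. The model is then said to be `Overfitting`.   
   
Put simply, `Overfitting` means that the model is optimized only for the Training Set and fails to predict well on other datasets. To detect it, we set aside a Test Set and check whether the accuracy on the Test Set is also high.   

But does a high accuracy on the Test Set mean that we have avoided overfitting? Not necessarily. If we look at the Test Set results and then retrain the weights, the model is indirectly overfit to the Test Set, that is, indirectly trained to be optimal on the Test Set. We therefore use a `Validation Set` to check the results during training, and evaluate on the `Test Set` only at the very end to confirm that the model trained well without overfitting.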
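
A three-way split can be sketched as follows. The function `split_dataset`, its ratio parameters, and the toy data are assumptions made for illustration; the lecture does not prescribe specific proportions.

```python
import numpy as np

def split_dataset(features, labels, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle once, then cut into training, validation, and test parts."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    assert len(features) == len(labels)

    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))

    n_test = int(len(order) * test_ratio)
    n_valid = int(len(order) * valid_ratio)
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]

    return ((features[train_idx], labels[train_idx]),
            (features[valid_idx], labels[valid_idx]),
            (features[test_idx], labels[test_idx]))

# Toy data: 50 samples with 2 features each.
features = np.arange(100).reshape(50, 2)
labels = np.arange(50)
train, valid, test = split_dataset(features, labels)
print(len(train[0]), len(valid[0]), len(test[0]))  # 40 5 5
```

The validation part is consulted repeatedly while tuning, and the test part is held back until the end.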

### 4. Gradient Descent Variants and Learning Rate Decay

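- SGD (Stochastic Gradient Descent)   
    Training on the entire dataset at once is expensive, so SGD trains on a few samples drawn at random from the dataset.    
    
    Computing the error over every sample in the dataset and averaging it takes a long time. Instead, SGD draws a subset of the Training Set, computes and averages the error on that subset, and takes a gradient descent step with it, as if the step had been computed on the whole dataset.   
    
    Each step is less accurate than one computed on the whole dataset, but training on small batches thousands or tens of thousands of times ends up having much the same effect as training on the whole dataset many times.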
![SGD](Images/SGD.png)


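- Momentum   
    An improvement on SGD that keeps a running average of the previous gradients, so that the direction of the previous gradients carries inertia into the current update.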
![Momentum](Images/Momentum.png)


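- Learning Rate Decay   
    As training goes on, each gradient descent step reduces the error less and less. If training continues with the same learning rate, the error stops dropping below a certain level. This is like circling at a fixed height on the walls of a valley instead of descending to the bottom. To reach a more precise solution, it is better to shrink the learning rate as training progresses.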
![LearningRateDecay](Images/LearningRateDecay.png)   
   
   
* Adagrad : a training method that implicitly applies momentum and learning rate decay. (Look up the paper for details...)
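
The first three ideas can be combined in a few lines. The sketch below fits a toy linear model `y = 3x + 1` with mini-batch SGD, a momentum term, and a decaying learning rate. The data, the hyperparameters (`base_lr`, `decay`, `momentum`, `batch_size`), and the particular decay schedule are all assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + 1.0 + 0.01 * rng.normal(size=1000)

w, b = 0.0, 0.0        # parameters to learn
v_w, v_b = 0.0, 0.0    # momentum accumulators
base_lr, decay, momentum, batch_size = 0.1, 0.01, 0.9, 32

for step in range(500):
    # SGD: estimate the gradient on a small random batch.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    err = w * X[idx] + b - y[idx]
    grad_w = 2 * np.mean(err * X[idx])
    grad_b = 2 * np.mean(err)

    # Learning rate decay: shrink the step size as training progresses.
    lr = base_lr / (1.0 + decay * step)

    # Momentum: accumulate past gradients so the update keeps its direction.
    v_w = momentum * v_w + grad_w
    v_b = momentum * v_b + grad_b
    w -= lr * v_w
    b -= lr * v_b

print(round(float(w), 2), round(float(b), 2))
```

After training, `w` and `b` should land close to the true slope 3 and intercept 1.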

### 5. Mini-batching
In this section, you'll go over what mini-batching is and how to apply it in TensorFlow.

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one time. This provides the ability to train a model, even if a computer **lacks the memory to store the entire dataset**.

Mini-batching is computationally inefficient, since you can't calculate the loss simultaneously across all samples. However, this is a small price to pay in order to be able to run the model at all.

It's also quite useful combined with SGD. The idea is to randomly shuffle the data at the start of each epoch, then create the mini-batches. For each mini-batch, you train the network weights with gradient descent. Since these batches are random, you're performing SGD with each batch.
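
The shuffle-then-batch idea can be sketched as a generator. The name `shuffled_batches` and the toy arrays are hypothetical; features and labels are indexed with the same permutation so that each sample keeps its label.

```python
import numpy as np

def shuffled_batches(features, labels, batch_size, rng):
    """Yield (features, labels) mini-batches in a fresh random order."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    order = rng.permutation(len(features))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
features = np.arange(10).reshape(10, 1)
labels = np.arange(10)

for epoch in range(2):
    # A new shuffle is drawn at the start of every epoch.
    sizes = [len(f) for f, _ in shuffled_batches(features, labels, 4, rng)]
    print(epoch, sizes)
```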

Let's look at the MNIST dataset with weights and a bias to see if your machine can handle it.

```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
```

##### Question 1
Calculate the memory size of train_features, train_labels, weights, and bias in bytes. Ignore memory for overhead, just calculate the memory required for the stored data.

You may have to look up how much memory a float32 requires, using this [link](https://en.wikipedia.org/wiki/Single-precision_floating-point_format).

train_features Shape: (55000, 784) Type: float32 → 172.48 MB

train_labels Shape: (55000, 10) Type: float32    → 2.2 MB

weights Shape: (784, 10) Type: float32           → 31.36 KB

bias Shape: (10,) Type: float32                  → 40 bytes

The total memory space required for the inputs, weights and bias is around 174 megabytes, which isn't that much memory. You could train this whole dataset on most CPUs and GPUs.
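
The figures above follow from multiplying the number of elements by 4 bytes per `float32`. A minimal sketch of the arithmetic, using only the shapes listed in the question:

```python
import numpy as np

shapes = {
    'train_features': (55000, 784),
    'train_labels': (55000, 10),
    'weights': (784, 10),
    'bias': (10,),
}
bytes_per_value = np.dtype(np.float32).itemsize  # 4 bytes per float32

# Number of elements times bytes per element, ignoring any overhead.
sizes = {name: int(np.prod(shape)) * bytes_per_value
         for name, shape in shapes.items()}

for name, n_bytes in sizes.items():
    print('{}: {} bytes'.format(name, n_bytes))
print('total: {} bytes'.format(sum(sizes.values())))
```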

But the larger datasets that you'll use in the future will be measured in gigabytes or more. It's possible to purchase more memory, but it's expensive. A Titan X GPU with 12 GB of memory costs over $1,000.

Instead, in order to run large models on your machine, you'll learn how to use mini-batching.

Let's look at how you implement mini-batching in TensorFlow.

#### TensorFlow Mini-batching
In order to use mini-batching, you must first divide your data into batches.

Unfortunately, it's sometimes impossible to divide the data into batches of exactly equal size. For example, imagine you'd like to create batches of 128 samples each from a dataset of 1000 samples. Since 128 does not evenly divide into 1000, you'd wind up with 7 batches of 128 samples, and 1 batch of 104 samples. (`7*128 + 1*104 = 1000`)

In that case, the size of the batches would vary, so you need to take advantage of TensorFlow's `tf.placeholder()` function to receive the varying batch sizes.

Continuing the example, if each sample had `n_input = 784` features and `n_classes = 10` possible labels, the dimensions for `features` would be `[None, n_input]` and `labels` would be `[None, n_classes]`.

```python
# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
```
What does `None` do here?

The `None` dimension is a placeholder for the batch size. At runtime, TensorFlow will accept any batch size greater than 0.

Going back to our earlier example, this setup allows you to feed `features` and `labels` into the model as either the batches of 128 samples or the single batch of 104 samples.

##### Question 2
Using the parameters below, how many batches are there, and what is the last batch size?

features is (50000, 400)

labels is (50000, 10)

batch_size is 128

`Answer : batches = 391, last batch size = 80`
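
The answer can be checked with a short calculation. The helper name `batch_count_and_last_size` is hypothetical.

```python
import math

def batch_count_and_last_size(n_samples, batch_size):
    """Return the number of batches and the size of the final batch."""
    n_batches = math.ceil(n_samples / batch_size)
    last_size = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_size

print(batch_count_and_last_size(50000, 128))  # (391, 80)
print(batch_count_and_last_size(1000, 128))   # (8, 104)
```

The second call reproduces the earlier example of 1000 samples: 7 full batches plus one batch of 104.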

##### Question 3
Implement the `batches` function to batch `features` and `labels`. The function should return each batch with a maximum size of `batch_size`. To help you with the quiz, look at the following example output of a working `batches` function.
```python
# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)
```
The `example_batches` variable would be the following:
```python
[
    # 2 batches:
    #   First is a batch of size 3.
    #   Second is a batch of size 1
    [
        # First Batch is size 3
        [
            # 3 samples of features.
            # There are 4 features per sample.
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels.
            # There are 2 labels per sample.
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second Batch is size 1.
        # Since batch size is 3, there is only one sample left from the 4 samples.
        [
            # 1 sample of features.
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels.
            ['L41', 'L42']
        ]
    ]
]
```

In [8]:
# quiz.py
import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    
    # TODO: Implement batching
    result = []
    
    data_size = len(features)
    for idx_start in range(0, data_size, batch_size):
        idx_end = idx_start + batch_size
        result.append([ features[idx_start:idx_end], labels[idx_start:idx_end] ])
    return result

In [9]:
# sandbox.py
# from quiz import batches
from pprint import pprint

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))

[[[['F11', 'F12', 'F13', 'F14'],
   ['F21', 'F22', 'F23', 'F24'],
   ['F31', 'F32', 'F33', 'F34']],
  [['L11', 'L12'], ['L21', 'L22'], ['L31', 'L32']]],
 [[['F41', 'F42', 'F43', 'F44']], [['L41', 'L42']]]]


#### Quiz : Mini-batch
Let's use mini-batching to feed batches of MNIST features and labels into a linear model.

Set the batch size and run the optimizer over all the batches with the `batches` function. The recommended batch size is 128. If you have memory restrictions, feel free to make it smaller.

This quiz is not graded; see the solution notebook for one way to solve it.

In [15]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

Extracting /datasets/ud730/mnist\train-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist\train-labels-idx1-ubyte.gz
Extracting /datasets/ud730/mnist\t10k-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist\t10k-labels-idx1-ubyte.gz


In [19]:
# TODO: Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(accuracy, feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

Test Accuracy: 0.13760000467300415


### 6. Epochs
An epoch is a single forward and backward pass of the whole dataset. This is used to increase the accuracy of the model without requiring more data. This section will cover epochs in TensorFlow and how to choose the right number of epochs.

The following TensorFlow code trains a model using 10 epochs.
```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
```
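To get a feel for how much training the settings above imply, the sketch below counts optimizer steps. It assumes the MNIST training split loaded by `read_data_sets` holds 55,000 examples and that the `batches` helper keeps the final partial batch; `updates_per_run` is a hypothetical name used only for this illustration.

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    """Return (batches per epoch, total optimizer steps).

    Assumes the last, smaller batch is kept rather than dropped.
    """
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs

# 55000 is the assumed size of mnist.train; 128 and 10 come from the code above.
per_epoch, total = updates_per_run(n_samples=55000, batch_size=128, epochs=10)
print(per_epoch, total)
```

Under those assumptions, each epoch performs 430 weight updates, so 10 epochs perform 4,300.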

Running the code will output the following:

```
Epoch: 0    - Cost: 11.0     Valid Accuracy: 0.204
Epoch: 1    - Cost: 9.95     Valid Accuracy: 0.229
Epoch: 2    - Cost: 9.18     Valid Accuracy: 0.246
Epoch: 3    - Cost: 8.59     Valid Accuracy: 0.264
Epoch: 4    - Cost: 8.13     Valid Accuracy: 0.283
Epoch: 5    - Cost: 7.77     Valid Accuracy: 0.301
Epoch: 6    - Cost: 7.47     Valid Accuracy: 0.316
Epoch: 7    - Cost: 7.2      Valid Accuracy: 0.328
Epoch: 8    - Cost: 6.96     Valid Accuracy: 0.342
Epoch: 9    - Cost: 6.73     Valid Accuracy: 0.36
Test Accuracy: 0.3801000118255615
```

Each epoch attempts to move to a lower cost, leading to better accuracy.

This model continues to improve accuracy up to Epoch 9. Let's increase the number of epochs to 100.

```
...
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.86
Epoch: 80   - Cost: 0.11     Valid Accuracy: 0.869
Epoch: 81   - Cost: 0.109    Valid Accuracy: 0.869
....
Epoch: 85   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 86   - Cost: 0.107    Valid Accuracy: 0.869
Epoch: 87   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 88   - Cost: 0.106    Valid Accuracy: 0.869
Epoch: 89   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 90   - Cost: 0.105    Valid Accuracy: 0.869
Epoch: 91   - Cost: 0.104    Valid Accuracy: 0.869
Epoch: 92   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 93   - Cost: 0.103    Valid Accuracy: 0.869
Epoch: 94   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 95   - Cost: 0.102    Valid Accuracy: 0.869
Epoch: 96   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 97   - Cost: 0.101    Valid Accuracy: 0.869
Epoch: 98   - Cost: 0.1      Valid Accuracy: 0.869
Epoch: 99   - Cost: 0.1      Valid Accuracy: 0.869
Test Accuracy: 0.8696000006198883
```

The output above shows that the validation accuracy stops improving after epoch 80. Let's see what happens when we increase the learning rate.
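One way to spot such a plateau without reading the log by eye is to record the validation accuracy per epoch and find the last epoch that improved on the best value so far. This is a minimal sketch in plain Python; `last_improvement_epoch` and `min_delta` are hypothetical names, not part of the lesson's code.

```python
def last_improvement_epoch(valid_accuracies, min_delta=0.0):
    """Return the index of the last entry that beat the best accuracy so far."""
    best = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(valid_accuracies):
        if acc > best + min_delta:
            best = acc
            best_epoch = epoch
    return best_epoch

# Validation accuracies copied from epochs 79 to 81 of the log above.
history = [0.86, 0.869, 0.869]
first_epoch = 79
print(first_epoch + last_improvement_epoch(history))  # -> 80
```

Raising `min_delta` makes the check ignore very small gains, which may be useful when accuracy fluctuates slightly between epochs.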

`learn_rate = 0.1`

```
Epoch: 76   - Cost: 0.214    Valid Accuracy: 0.752
Epoch: 77   - Cost: 0.21     Valid Accuracy: 0.756
Epoch: 78   - Cost: 0.21     Valid Accuracy: 0.756
...
Epoch: 85   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 86   - Cost: 0.209    Valid Accuracy: 0.756
Epoch: 87   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 88   - Cost: 0.208    Valid Accuracy: 0.756
Epoch: 89   - Cost: 0.205    Valid Accuracy: 0.756
Epoch: 90   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 91   - Cost: 0.207    Valid Accuracy: 0.756
Epoch: 92   - Cost: 0.204    Valid Accuracy: 0.756
Epoch: 93   - Cost: 0.206    Valid Accuracy: 0.756
Epoch: 94   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 95   - Cost: 0.2974   Valid Accuracy: 0.756
Epoch: 96   - Cost: 0.202    Valid Accuracy: 0.756
Epoch: 97   - Cost: 0.2996   Valid Accuracy: 0.756
Epoch: 98   - Cost: 0.203    Valid Accuracy: 0.756
Epoch: 99   - Cost: 0.2987   Valid Accuracy: 0.756
Test Accuracy: 0.7556000053882599
```

Looks like the learning rate was increased too much. The final accuracy was lower, and it stopped improving earlier. Let's stick with the previous learning rate, but change the number of epochs to 80.
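Why can a larger learning rate hurt? The toy example below is not the MNIST model; it is a sketch that minimizes the one-dimensional function `f(w) = w**2` with plain gradient descent, an assumption chosen only because the effect is easy to see. The function name and the three learning rates are illustrative.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final cost."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad   # gradient descent update
    return w ** 2

for lr in (0.01, 0.4, 1.1):
    print(lr, gradient_descent(lr))
```

With the small rate the cost falls slowly, with the moderate rate it falls quickly, and with the large rate every step overshoots the minimum so the cost grows instead of shrinking. The real model's loss surface is more complicated, but the same overshooting can keep the cost from settling.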

```
Epoch: 65   - Cost: 0.122    Valid Accuracy: 0.868
Epoch: 66   - Cost: 0.121    Valid Accuracy: 0.868
Epoch: 67   - Cost: 0.12     Valid Accuracy: 0.868
Epoch: 68   - Cost: 0.119    Valid Accuracy: 0.868
Epoch: 69   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 70   - Cost: 0.118    Valid Accuracy: 0.868
Epoch: 71   - Cost: 0.117    Valid Accuracy: 0.868
Epoch: 72   - Cost: 0.116    Valid Accuracy: 0.868
Epoch: 73   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 74   - Cost: 0.115    Valid Accuracy: 0.868
Epoch: 75   - Cost: 0.114    Valid Accuracy: 0.868
Epoch: 76   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 77   - Cost: 0.113    Valid Accuracy: 0.868
Epoch: 78   - Cost: 0.112    Valid Accuracy: 0.868
Epoch: 79   - Cost: 0.111    Valid Accuracy: 0.868
Epoch: 80   - Cost: 0.111    Valid Accuracy: 0.869
Test Accuracy: 0.86909999418258667
```

The accuracy only reached about 0.87, but that could be because the learning rate was too high. Lowering the learning rate would require more epochs, but could ultimately achieve better accuracy.
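The "more epochs" half of that trade-off can be illustrated on a toy problem. The sketch below counts how many gradient descent steps are needed to push the cost of `f(w) = w**2` under a threshold. It is a stand-in for the real model, and `steps_to_reach`, `target_cost`, and `max_steps` are hypothetical names.

```python
def steps_to_reach(lr, target_cost=1e-3, w0=1.0, max_steps=10000):
    """Count gradient descent steps on f(w) = w**2 until the cost is <= target_cost.

    Returns None if the target is not reached within max_steps.
    """
    w = w0
    for step in range(1, max_steps + 1):
        w -= lr * 2.0 * w   # update using the gradient 2w
        if w ** 2 <= target_cost:
            return step
    return None

for lr in (0.1, 0.01, 0.001):
    print(lr, steps_to_reach(lr))
```

Each tenfold reduction in the learning rate needs roughly ten times as many steps here. The toy cannot show whether a lower rate ends at better accuracy on MNIST; that has to be checked by running the model.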

In the upcoming TensorFlow Lab, you'll get the opportunity to choose your own learning rate, epoch count, and batch size to improve the model's accuracy.