# TensorFlow 2.0 Practical

This notebook is based on the TensorFlow without a PhD codelab by Martin Gorner and was modified to run on TensorFlow 2.0.

### Get Started
<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/Tensorflow_2_0_practical.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/Tensorflow_2_0_practical.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


## 1. Introduction
In this practical, you will learn how to build and train a neural network that recognises handwritten digits. Along the way, as you enhance your neural network to achieve 99% accuracy, you will also discover the tools of the trade that deep learning professionals use to train their models efficiently.

This codelab uses the MNIST dataset, a collection of 60,000 labeled training digits that has kept generations of PhDs busy for almost two decades. You will solve the problem with fewer than 100 lines of Python / TensorFlow code.

What you'll learn:
*  What is a neural network and how to train it
*   How to build a basic 1-layer neural network using TensorFlow
*   How to add more layers
*  Training tips and tricks: overfitting, dropout, learning rate decay...
*  How to troubleshoot deep neural networks
*  How to build convolutional networks

What you'll need:
*  Python 2 or 3 (Python 3 recommended)
*  TensorFlow

### Running on GPU
For this practical, you will need to use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then, in the popup menu, choose "GPU" in the "Hardware accelerator" box. This is all you need to do: Colab and TensorFlow will take care of the rest!

### Requirements installation
First, we need to install TensorFlow 2.0 and download ngrok (ngrok is only used to run TensorBoard on Google Colab).

In [0]:
#@title Dependencies and Imports (RUN ME!) { display-mode: "form" }
!pip install -q tensorflow-gpu==2.0.0-alpha0

! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
! unzip ngrok-stable-linux-amd64.zip

from __future__ import absolute_import, division, print_function

import os
import datetime
import numpy as np

import tensorflow as tf

### TensorBoard
TensorBoard is a tool used to monitor the training and investigate the model and the results.

Run the next cell to launch TensorBoard in the background. Then, click on the link to access TensorBoard.

In [0]:
#@title Start TensorBoard (RUN ME!) { display-mode: "form" }
LOG_DIR = "/tmp/log"
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)

## 2. The Data
In this practical, we use the MNIST dataset consisting of 70,000 greyscale images and their labels.
The idea is to train a classifier to identify the class value (what handwritten digit it is) given the image. We train and tune a model on the 60,000 training images and then evaluate how well it classifies the 10,000 test images that the model did not see during training. This task is an example of a supervised learning problem, where we are given both input and labels (targets) to learn from. This is in contrast to unsupervised learning where we only have inputs from which to learn patterns or reinforcement learning where an agent learns how to maximise a reward signal through interaction with its environment.

### Aside: Train/Validation/Test Split
When we build machine learning models, the goal is to build a model that will perform well on future data that we have not seen yet. We say that we want our models to be able to generalise well from whatever training data we can collect and do have available, to whatever data we will be applying them to in future. To do this, we split whatever data we have available into a training set, a validation set and a test set. The idea is that we train our model and use the performance on the validation set to make any adjustments to the model and its hyperparameters, but then we report the final accuracy on the test set. The test set (which we never train on), therefore acts as a proxy for our future data.
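As a rough sketch of the split described above, the snippet below carves a validation set out of a training array with NumPy. The synthetic data, the variable names and the 90/10 ratio are assumptions made purely for illustration; the training cells in this notebook simply pass the test set as `validation_data`.

```python
import numpy as np

# Synthetic stand-in for a training set: 1,000 "images" of 28x28 pixels with integer labels.
rng = np.random.default_rng(seed=0)
images = rng.random((1000, 28, 28))
labels = rng.integers(0, 10, size=1000)

# Shuffle the indices once so the split is random but images and labels stay aligned.
indices = rng.permutation(len(images))
n_val = len(images) // 10  # hold out 10% for validation (an arbitrary choice)

val_idx, train_idx = indices[:n_val], indices[n_val:]
x_val, y_val = images[val_idx], labels[val_idx]
x_tr, y_tr = images[train_idx], labels[train_idx]

print(x_tr.shape, x_val.shape)
```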

### Load the dataset
Execute the next cell to download the MNIST dataset and prepare the training and test sets.

In [0]:
# Load the mnist dataset
mnist = tf.keras.datasets.mnist

# Split the dataset in train/test sets
(x_train, y_train),(x_test, y_test) = mnist.load_data()

# Normalize the inputs
x_train, x_test = x_train / 255.0, x_test / 255.0

## 3. Theory: a 1-layer neural network

![image_1](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/1.png?raw=true)

Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a 1-layer neural network.

![](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/2.png?raw=true)

Each "neuron" in a neural network does a weighted sum of all of its inputs, adds a constant called the "bias" and then feeds the result through some non-linear activation function.

Here we design a 1-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).

For a classification problem, an activation function that works well is softmax. Applying softmax to a vector is done by taking the exponential of each element and then normalising the vector by dividing each element by the sum of all the exponentials, so that the outputs are positive and add up to 1.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/3.png?raw=true)

***Why is "softmax" called softmax? The exponential is a steeply increasing function. It will increase differences between the elements of the vector. It also quickly produces large values. Then, as you normalise the vector, the largest element, which dominates the norm, will be normalised to a value close to 1 while all the other elements will end up divided by a large value and normalised to something close to 0. The resulting vector clearly shows which was its largest element, the "max", but retains the original relative order of its values, hence the "soft".***
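To make this concrete, here is a minimal NumPy sketch of softmax. It is an illustration rather than TensorFlow's implementation; the function name and the example scores are made up, and it normalises by the sum of the exponentials so that the outputs add up to 1.

```python
import numpy as np

def softmax(scores):
    """Exponentiate each element, then divide by the sum so the result adds up to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([1.0, 2.0, 5.0])
probs = softmax(scores)
print(probs)
```

The largest score ends up with most of the probability mass, while the smaller scores keep their relative order.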

We will now summarise the behaviour of this single layer of neurons into a simple formula using a matrix multiply. Let us do so directly for a "mini-batch" of 100 images as the input, producing 100 predictions (10-element vectors) as the output.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/4.png?raw=true)

Using the first column of weights in the weights matrix W, we compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron. Using the second column of weights, we do the same for the second neuron and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images. If we call X the matrix containing our 100 images, all the weighted sums for our 10 neurons, computed on 100 images are simply X.W (matrix multiply).

Each neuron must now add its bias (a constant). Since we have 10 neurons, we have 10 bias constants. We will call this vector of 10 values b. It must be added to each row of the previously computed matrix. Using a bit of magic called "broadcasting" we will write this with a simple plus sign.

***"Broadcasting" is a standard trick used in Python and numpy, its scientific computation library. It extends how normal operations work on matrices with incompatible dimensions. "Broadcasting add" means "if you are adding two matrices but you cannot because their dimensions are not compatible, try to replicate the small one as much as needed to make it work."***
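A short NumPy sketch of a broadcasting add, using the shapes from this section. The array of zeros is only a placeholder for the weighted sums; the variable names are illustrative.

```python
import numpy as np

weighted_sums = np.zeros((100, 10))   # stand-in for X.W: 100 images, 10 neurons
b = np.arange(10, dtype=float)        # one bias per neuron, shape (10,)

# Shapes (100, 10) and (10,) are compatible under broadcasting:
# b is treated as if it were repeated on each of the 100 rows.
result = weighted_sums + b
print(result.shape)
```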

We finally apply the softmax activation function and obtain the formula describing a 1-layer neural network, applied to 100 images:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/5.png?raw=true)
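The whole layer can be sketched in a few lines of NumPy. This is only an illustration of softmax(X.W + b) on a mini-batch of 100 images: the random inputs and weights are placeholders, and `softmax_rows` is a helper name invented for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.random((100, 784))                    # mini-batch of 100 flattened 28x28 images
W = rng.normal(scale=0.01, size=(784, 10))    # one column of weights per neuron
b = np.zeros(10)                              # one bias per neuron

def softmax_rows(z):
    """Apply softmax independently to each row of z."""
    z = z - z.max(axis=1, keepdims=True)      # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Y = softmax_rows(X @ W + b)                   # matrix multiply, broadcasting add, softmax
print(Y.shape)
```

Each of the 100 rows of `Y` is a 10-element prediction whose values sum to 1.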

***By the way, what is a "tensor"?
A "tensor" is like a matrix but with an arbitrary number of dimensions. A 1-dimensional tensor is a vector. A 2-dimensional tensor is a matrix. And then you can have tensors with 3, 4, 5 or more dimensions.***

## 4. Theory: gradient descent

Now that our neural network produces predictions from input images, we need to measure how good they are, i.e. the distance between what the network tells us and what we know to be the truth. Remember that we have true labels for all the images in this dataset.

Any distance would work. The ordinary Euclidean distance is fine, but for classification problems one distance, called the "cross-entropy", is more efficient.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/6.png?raw=true)

***"One-hot" encoding means that you represent the label "6" by using a vector of 10 values, all zeros except the value at index 6 (counting from 0), which is 1. It is handy here because the format is very similar to how our neural network outputs its predictions, also as a vector of 10 values.***
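The sketch below shows one-hot encoding and the cross-entropy of a single prediction in NumPy, computed as minus the sum of each label value times the log of the matching prediction. The function names and the two example predictions are assumptions chosen for illustration.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1 at index `label`."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

def cross_entropy(prediction, target):
    """-sum(target * log(prediction)) for one example."""
    return -np.sum(target * np.log(prediction))

target = one_hot(6)

confident = np.full(10, 0.01)
confident[6] = 0.91            # almost all probability on the correct digit
unsure = np.full(10, 0.1)      # the same probability for every digit

print(cross_entropy(confident, target), cross_entropy(unsure, target))
```

The confident, correct prediction gets a much smaller cross-entropy than the uniform one.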

"Training" the neural network actually means using training images and labels to adjust weights and biases so as to minimise the cross-entropy loss function. Here is how it works.

The cross-entropy is a function of weights, biases, pixels of the training image and its known label.

If we compute the partial derivatives of the cross-entropy with respect to all the weights and all the biases, we obtain a "gradient", computed for a given image, label and present value of weights and biases. Remember that we have 7850 weights and biases (784 x 10 weights plus 10 biases), so computing the gradient sounds like a lot of work. Fortunately, TensorFlow will do it for us.

The mathematical property of a gradient is that it points "up". Since we want to go where the cross-entropy is low, we go in the opposite direction. We update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images. Hopefully, this gets us to the bottom of the pit where the cross-entropy is minimal.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/7.png?raw=true)

In this picture, cross-entropy is represented as a function of 2 weights. In reality, there are many more. The gradient descent algorithm follows the path of steepest descent into a local minimum. The training images are changed at each iteration too so that we converge towards a local minimum that works for all images.

***"Learning rate": you cannot update your weights and biases by the whole length of the gradient at each iteration. It would be like trying to get to the bottom of a valley while wearing seven-league boots. You would be jumping from one side of the valley to the other. To get to the bottom, you need to do smaller steps, i.e. use only a fraction of the gradient, typically in the 1/1000th region. We call this fraction the "learning rate".***
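The effect of the learning rate is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, a function chosen only for illustration; the two learning rates are arbitrary and much larger than what you would use for a neural network.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

small_step = descend(learning_rate=0.1)   # shrinks steadily towards the minimum at 0
too_large = descend(learning_rate=1.1)    # jumps across the valley, further out each time
print(small_step, too_large)
```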

To sum up, here is what the training loop looks like:
```
Training digits and labels => loss function => gradient (partial derivatives) => steepest descent => update weights and biases => repeat with next mini-batch of training images and labels
```
***Why work with "mini-batches" of 100 images and labels?***

***You can definitely compute your gradient on just one example image and update the weights and biases immediately (it's called "stochastic gradient descent" in scientific literature). Doing so on 100 examples gives a gradient that better represents the constraints imposed by different example images and is therefore likely to converge towards the solution faster. The size of the mini-batch is an adjustable parameter though. There is another, more technical reason: working with batches also means working with larger matrices and these are usually easier to optimise on GPUs.***
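The training loop summarised above can be sketched in plain NumPy. This is a hand-written illustration on synthetic data, not what Keras does internally: the dataset, its dimensions, the learning rate and all names are assumptions. The gradient uses the standard result that, for softmax followed by cross-entropy, the derivative with respect to the weighted sums is the prediction minus the one-hot label.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: 600 points in 20 dimensions, labelled by a hidden linear rule.
num_features, num_classes = 20, 3
X = rng.normal(size=(600, num_features))
true_W = rng.normal(size=(num_features, num_classes))
labels = np.argmax(X @ true_W, axis=1)
T = np.eye(num_classes)[labels]               # one-hot targets

W = np.zeros((num_features, num_classes))
b = np.zeros(num_classes)
learning_rate, batch_size = 0.1, 100

def predict(X, W, b):
    """1-layer network: softmax(X.W + b), applied row by row."""
    z = X @ W + b
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(X, T, W, b):
    """Average cross-entropy over all examples."""
    return -np.mean(np.sum(T * np.log(predict(X, W, b) + 1e-12), axis=1))

initial_loss = mean_loss(X, T, W, b)
for epoch in range(50):
    order = rng.permutation(len(X))           # reshuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Y = predict(X[idx], W, b)
        error = (Y - T[idx]) / len(idx)       # gradient of the mean loss w.r.t. the weighted sums
        W -= learning_rate * (X[idx].T @ error)   # step against the gradient
        b -= learning_rate * error.sum(axis=0)
final_loss = mean_loss(X, T, W, b)
print(initial_loss, final_loss)
```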

## 5. Lab: let's jump into the code

### Define the model
In this section we'll build a classifier. A **classifier** is a function that takes an object's characteristics (or "features") as inputs and outputs a prediction of the class (or group) that the object belongs to. It may make a single prediction for each input or it may output some score (for example a probability) for each of the possible classes. Specifically, we will build a classifier that takes in (a batch of) 28 x 28 MNIST images as we've seen above, and outputs predictions about which class the image belongs to.

For each (batch of) input images, we will use a **feed-forward neural network** to compute un-normalised scores (also known as **logits**) for each of the 10 possible classes that the image could belong to. We can then **classify** the image as belonging to the class which receives the highest score, or we can quantify the model's "confidence" about the classifications by converting the scores into a probability distribution. 

A feed-forward neural network consisting of $N$ layers, applied to an input vector $\mathbf{x}$ can be defined as:

\begin{equation}
\mathbf{f_0} = \mathbf{x} \\
\mathbf{f_i} = \sigma_i(\mathbf{W_if_{i-1}} + \mathbf{b_i}) \ \ \ i \in [1, N]
\end{equation}

Each layer has a particular number, $m_i$, of neurons. The parameters of a layer consist of a matrix $\mathbf{W_i} \in \mathbb{R}^{m_i \times m_{i-1}}$ and bias vector $\mathbf{b_i} \in \mathbb{R}^{m_i}$. Each layer also has a non-linear activation function $\sigma_i$. 

**QUESTION**: Why do you think the activation functions need to be *non-linear*? What would happen if they were *linear*? **HINT**: If you're stuck, consider the very simplest case of an identity activation (which essentially does nothing) and ignore the biases. 

### Aside: Activation functions

Activation functions are a core ingredient in deep neural networks. In fact they are what allows us to make use of multiple layers in a neural network. There are a number of different activation functions, each of which is more or less useful under different circumstances. The four activation functions that you are most likely to encounter are, arguably, [ReLU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ReLU), [Tanh](https://www.tensorflow.org/api_docs/python/tf/keras/activations/tanh), [Sigmoid](https://www.tensorflow.org/api_docs/python/tf/keras/activations/sigmoid), and [Softmax](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Softmax).

ReLU has, in recent years, become the default choice for use in MLPs and Convolutional Neural Networks (CNNs). ReLU has two advantages over Tanh and Sigmoid: it is computationally much more efficient, and it allows us to use deeper networks because it is far less prone to [vanishing gradients](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). As a result of its success, a number of ReLU variants, such as [LeakyReLU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LeakyReLU) and [PReLU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/PReLU), have been developed.

Sigmoid and Softmax activations are often found after the last layer in binary and multi-class classification networks, respectively, as they transform the outputs of the network in such a way that we can interpret them as class probabilities.

Both Tanh and Sigmoid are found in LSTM and GRU recurrent neural networks, which we will find out more about in the coming days. They are also useful in MLPs and CNNs where we want the output to be bounded between -1 and 1 (Tanh) or 0 and 1 (Sigmoid).

Read more about activation functions [here](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6). 
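For reference, the element-wise activations discussed above are short enough to write out in NumPy. This is a sketch for intuition only; in the models below we use the TensorFlow versions such as `tf.nn.relu` and `tf.nn.sigmoid`.

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(np.tanh(x))      # tanh squashes values into the range (-1, 1)
print(sigmoid(x))
```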

### Create a 1-layer neural network
We configure the feed-forward neural network part of our classifier using the [Keras Layers API](https://www.tensorflow.org/api_docs/python/tf/keras/layers). This API consists of various reusable building blocks that allow us to define many different neural network architectures.

Here we use the [Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) component which allows us to wrap together a sequence of layers. An important point to note here is that we are **configuring** our neural network architecture as a pipeline. We can think of the resulting ```model``` variable as a *function* that takes a batch of images as inputs and outputs a batch of predictions: because the final layer applies softmax, each prediction is a vector of 10 class probabilities.

In [0]:
# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

The following summary shows how many parameters each layer is made up of (the number of entries in the weight matrices and bias vectors). Note that a value of ```None``` in a particular dimension of a shape means that the shape will dynamically adapt based on the shape of the inputs. In particular, the output shape of the ```flatten_input``` layer will be $[N, 784]$ when the batch of inputs passed to the model has shape $[N, 28, 28]$.

In [0]:
model.summary()
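The parameter count of a fully connected layer can be checked by hand. The helper below is a small illustrative function (its name is made up for this example) that applies the rule "one weight per input per neuron, plus one bias per neuron" to our 1-layer network.

```python
def dense_parameter_count(num_inputs, num_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return num_inputs * num_neurons + num_neurons

flattened_pixels = 28 * 28          # the Flatten layer has no parameters of its own
total = dense_parameter_count(flattened_pixels, 10)
print(total)  # 7850
```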

### Define the loss
As we did in the previous practical, we need to specify a loss function for our classifier. This tells us how good our model's predictions are compared to the actual labels (the targets), with a lower loss meaning better predictions. The standard loss function to use with a **multi-class classifier** is the **cross-entropy loss** also known as the "negative log likelihood". Suppose we have a classification problem with $C$ classes. A classifier is trained to predict a probability distribution $p(y | X_i)$ for each input $X_i$ from a batch of $N$ examples. The vector $p(y|X_i)$ is $C$ dimensional, sums to one, and we use $p(y|X_i)_c$ to denote the $c$th component of  $p(y|X_i)$. The true class for example $i$ in the batch is $y_i$ and we define the indicator function $\mathbb{1}[y_i=c]$ to be 1 whenever $y_i = c$ and $0$ otherwise. This classifier has a cross-entropy loss of

$- \frac{1}{N}\sum_{i=1}^N \sum_{c=1}^C \log( p(y|X_i)_c) \mathbb{1}[y_i=c]$

**NOTE**: The indicator is one for the true class label, and zero everywhere else. So in that sum, the indicator just "lifts out" the $\log(p(y|X_i)_c)$ values for all true classes. So the above expression is minimised (note the negative at the front) when the model places all its probability mass on the true labels for all the examples. Remember that $\log(1)=0$, thus the closer all probabilities of $y_i = c$ are to one, the lower the loss will be and the better the model will be performing.
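A minimal NumPy sketch of this loss for integer class labels is shown below. It is an illustration of the formula, not the TensorFlow implementation; the function name and the tiny batch of probabilities are assumptions.

```python
import numpy as np

def sparse_cross_entropy(probabilities, labels):
    """Average of -log(p(y|X_i)_c) taken at the true class c = y_i of each example."""
    n = len(labels)
    # Indexing with (row, true label) plays the role of the indicator:
    # it picks exactly one probability per example.
    true_class_probs = probabilities[np.arange(n), labels]
    return -np.mean(np.log(true_class_probs))

probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = sparse_cross_entropy(probabilities, labels)
print(loss)
```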

**QUESTION**: 
* Why do you think this is a *good* loss function?
* Can you think of any potential issues with this loss function?

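Fortunately we don't need to write this function ourselves, as TensorFlow provides a version called

```tf.nn.sparse_softmax_cross_entropy_with_logits```.

**NOTE**: This function computes the cross-entropy loss directly from the un-normalised logits, rather than from the probability distribution, for numerical stability. In this notebook the last layer of the model already applies softmax, so the training cells instead pass the Keras loss ```'sparse_categorical_crossentropy'``` to ```model.compile```; by default it expects probabilities rather than logits.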
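To see why working from logits helps, compare a naive computation with one that uses the log-sum-exp trick. This NumPy sketch only illustrates the idea and is not how TensorFlow implements it; the function names and the deliberately extreme logits are assumptions.

```python
import numpy as np

def loss_from_probabilities(logits, label):
    """Naive route: softmax first, then the log of the true-class probability."""
    exps = np.exp(logits)
    return -np.log(exps[label] / np.sum(exps))

def loss_from_logits(logits, label):
    """Stable route: -log(softmax) computed with the log-sum-exp trick."""
    shifted = logits - np.max(logits)
    return -(shifted[label] - np.log(np.sum(np.exp(shifted))))

large = np.array([1000.0, 0.0, -1000.0])

# exp(1000) overflows to infinity, so the naive route cannot return a finite value here.
with np.errstate(all="ignore"):
    naive = loss_from_probabilities(large, 1)
stable = loss_from_logits(large, 1)
print(naive, stable)
```

For moderate logits the two routes agree; they only differ once the exponentials overflow or underflow.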

By the way, for training data in which the labels are themselves distributions rather than exact values, this definition of cross-entropy still works, where the indicator function is replaced with the corresponding probability of each class for that example. This might be important when we are not sure whether the training data has been labelled correctly, or when the data was labelled by a human who gave their answer along with a degree of confidence that the answer was correct.

### Train the model
Now that we have our data, data processing pipeline, our neural network architecture and the corresponding loss that we want to minimise, we need to **train** the model using mini-batch stochastic gradient descent. We do this over multiple **epochs**, where an epoch is a single pass through the entire training dataset. Briefly, in each epoch we loop over all the batches of images and labels, and for each batch we perform the following steps:
* Get the **predictions** of the model on the current batch of images
* Compute the **average loss** values across the batch, telling us how good these predictions are / how close they are to the true targets.
* Compute the **gradient of the average loss** (or the average gradient of the losses in the batch) with respect to each of the model's parameters: This tells us the direction to move in "parameter space" to decrease the loss value
* **Adjust the parameters** by taking a small step in the direction opposite to the gradient, since the gradient points towards increasing loss (the learning rate controls the size of the step)

During training we also track some metrics, such as the loss and accuracy to see how well the classifier is doing. Note that the cell below may take a few minutes to run!

In [0]:
def train_model():
  model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.005),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "1-layer-fully-connected-"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=100,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Visualize in TensorBoard (RUN ME!) { display-mode: "form" }

get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

## 6. Add more Layers

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/8.png?raw=true)

To improve the recognition accuracy we will add more layers to the neural network. The neurons in the second layer, instead of computing weighted sums of pixels will compute weighted sums of neuron outputs from the previous layer. Here is for example a 5-layer fully connected neural network:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/9.png?raw=true)

We keep softmax as the activation function on the last layer because that is what works best for classification. On intermediate layers, however, we will use the most classical activation function: the sigmoid:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/10.png?raw=true)

**Your task in this section is to add four intermediate layers to your model to increase its performance.**

In [0]:
# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),

    # YOUR CODE GOES HERE
    
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

In [0]:
#@title Solution

# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# add the first hidden layer
  tf.keras.layers.Dense(200, activation=tf.nn.sigmoid, name='hidden_layer_1'),
# add the second hidden layer
  tf.keras.layers.Dense(100, activation=tf.nn.sigmoid, name='hidden_layer_2'),
# add the third hidden layer
  tf.keras.layers.Dense(60, activation=tf.nn.sigmoid, name='hidden_layer_3'),
# add the fourth hidden layer
  tf.keras.layers.Dense(30, activation=tf.nn.sigmoid, name='hidden_layer_4'),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

### Train the model

In [0]:
def train_model():
  model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.005),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "five-layers-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=100,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Visualize in TensorBoard (RUN ME!) { display-mode: "form" }

get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

## 7. Lab: special care for deep networks

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/11.png?raw=true)

As layers are added, neural networks tend to converge with more difficulty. But we know today how to make them behave. Here are a couple of 1-line updates that will help if you see an accuracy curve like this:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/12.png?raw=true)

**ReLU activation function**
The sigmoid activation function is actually quite problematic in deep networks. It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely. It was mentioned for historical reasons but modern networks use the ReLU (Rectified Linear Unit), which looks like this:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/13.png?raw=true)
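A quick numerical check makes the vanishing effect visible. The slope of the sigmoid is never larger than 0.25, whereas the slope of the ReLU is exactly 1 for positive inputs. The sketch below is a simplified illustration: it looks only at the activation slopes and ignores the weights that are also part of backpropagation. The function names are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_slope(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(x > 0, 1.0, 0.0)

x = np.linspace(-6.0, 6.0, 121)
max_sigmoid_slope = sigmoid_slope(x).max()        # reached around x = 0

# Backpropagation multiplies one such slope per layer, so four sigmoid
# layers scale the gradient by at most this factor:
sigmoid_factor_4_layers = max_sigmoid_slope ** 4
relu_factor_4_layers = relu_slope(np.array([1.0]))[0] ** 4
print(sigmoid_factor_4_layers, relu_factor_4_layers)
```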

**TO DO:**

**Replace all your sigmoids with ReLUs now and you will get faster initial convergence and avoid problems later as we add layers. Simply swap tf.nn.sigmoid with tf.nn.relu in your code.**

In [0]:
# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# add the first hidden layer
  tf.keras.layers.Dense(200, activation=tf.nn.sigmoid, name='hidden_layer_1'),
# add the second hidden layer
  tf.keras.layers.Dense(100, activation=tf.nn.sigmoid, name='hidden_layer_2'),
# add the third hidden layer
  tf.keras.layers.Dense(60, activation=tf.nn.sigmoid, name='hidden_layer_3'),
# add the fourth hidden layer
  tf.keras.layers.Dense(30, activation=tf.nn.sigmoid, name='hidden_layer_4'),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

In [0]:
#@title Solution

# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# add the first hidden layer
  tf.keras.layers.Dense(200, activation=tf.nn.relu, name='hidden_layer_1'),
# add the second hidden layer
  tf.keras.layers.Dense(100, activation=tf.nn.relu, name='hidden_layer_2'),
# add the third hidden layer
  tf.keras.layers.Dense(60, activation=tf.nn.relu, name='hidden_layer_3'),
# add the fourth hidden layer
  tf.keras.layers.Dense(30, activation=tf.nn.relu, name='hidden_layer_4'),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

**A better optimizer**

In very high dimensional spaces like here (the five-layer network above has roughly 185K weights and biases), "saddle points" are frequent. These are points that are not local minima but where the gradient is nevertheless zero, so the gradient descent optimizer stays stuck there. TensorFlow has a full array of available optimizers, including some that work with an amount of inertia and will safely sail past saddle points.
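As an illustration of what "inertia" means, here is a simplified NumPy sketch of the Adam update rule applied to a toy quadratic. It is not the `tf.keras.optimizers.Adam` implementation; the function name, the toy objective and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def adam_minimise(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Simplified Adam: running averages of the gradient (m) and its square (v)."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # "inertia": smoothed gradient
        v = beta2 * v + (1 - beta2) * g * g      # smoothed squared gradient
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = sum((w - 3)**2), with gradient 2 * (w - 3).
w_final = adam_minimise(lambda w: 2 * (w - 3.0), np.zeros(2))
print(w_final)
```

Because the step depends on the smoothed gradient `m` rather than on the current gradient alone, the optimizer keeps moving for a while even where the current gradient is close to zero.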

**TO DO:**

**Replace your tf.keras.optimizers.SGD with a tf.keras.optimizers.Adam now.**

In [0]:
def train_model():
  model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.005), ## change the optimizer here
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "relu-and-adam-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=100,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Solution

def train_model():
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "relu-and-adam-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=100,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Visualize in TensorBoard (RUN ME!) { display-mode: "form" }

get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

## 8. Lab: dropout, overfitting

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/14.png?raw=true)

You will have noticed that cross-entropy curves for test and training data start disconnecting after a couple thousand iterations. The learning algorithm works on training data only and optimises the training cross-entropy accordingly. It never sees test data so it is not surprising that after a while its work no longer has an effect on the test cross-entropy which stops dropping and sometimes even bounces back up.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/15.png?raw=true)

This does not immediately affect the real-world recognition capabilities of your model but it will prevent you from running many iterations and is generally a sign that the training is no longer having a positive effect. This disconnect is usually labeled "overfitting" and when you see it, you can try to apply a regularisation technique called "dropout".

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/16.png?raw=true)

In dropout, at each training iteration, you drop random neurons from the network. You choose a probability pkeep for a neuron to be kept, usually between 50% and 75%, and then at each iteration of the training loop, you randomly remove neurons with all their weights and biases. Different neurons will be dropped at each iteration (and you also need to boost the output of the remaining neurons in proportion to make sure activations on the next layer do not shift).
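The mechanism can be sketched in NumPy as follows. This is an illustration of the idea (often called "inverted dropout"), not the Keras implementation; the function name and the all-ones activations are assumptions.

```python
import numpy as np

def dropout(activations, pkeep, rng):
    """Zero each activation with probability 1 - pkeep, then scale the
    survivors by 1 / pkeep so the expected value is unchanged."""
    mask = rng.random(activations.shape) < pkeep
    return activations * mask / pkeep

rng = np.random.default_rng(seed=0)
activations = np.ones((1000, 200))
dropped = dropout(activations, pkeep=0.75, rng=rng)
print(dropped.mean())   # close to 1: the boost compensates for the dropped neurons
```

Note that `tf.keras.layers.Dropout` takes the fraction of units to *drop* rather than the fraction to keep, so `Dropout(0.25)` corresponds to pkeep = 0.75.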

**TO DO:**

**Now, add dropout to each intermediate layer of your network.**

In [0]:
# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# add the first hidden layer
  tf.keras.layers.Dense(200, activation=tf.nn.relu, name='hidden_layer_1'),
# add the second hidden layer
  tf.keras.layers.Dense(100, activation=tf.nn.relu, name='hidden_layer_2'),
# add the third hidden layer
  tf.keras.layers.Dense(60, activation=tf.nn.relu, name='hidden_layer_3'),
# add the fourth hidden layer
  tf.keras.layers.Dense(30, activation=tf.nn.relu, name='hidden_layer_4'),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

In [0]:
#@title Solution

# Define the model
model = tf.keras.models.Sequential([
# flatten the images into a single line of pixels
  tf.keras.layers.Flatten(input_shape=(28, 28), name='flatten_input'),
# add dropout
  tf.keras.layers.Dropout(0.25),
# add the first hidden layer
  tf.keras.layers.Dense(200, activation=tf.nn.relu, name='hidden_layer_1'),
# add dropout
  tf.keras.layers.Dropout(0.25),
# add the second hidden layer
  tf.keras.layers.Dense(100, activation=tf.nn.relu, name='hidden_layer_2'),
# add dropout
  tf.keras.layers.Dropout(0.25),
# add the third hidden layer
  tf.keras.layers.Dense(60, activation=tf.nn.relu, name='hidden_layer_3'),
# add dropout
  tf.keras.layers.Dropout(0.25),
# add the fourth hidden layer
  tf.keras.layers.Dense(30, activation=tf.nn.relu, name='hidden_layer_4'),
# add dropout
  tf.keras.layers.Dropout(0.25),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

### Train the model with dropout

In [0]:
def train_model():
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "dropout-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=100,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Visualize in TensorBoard (RUN ME!) { display-mode: "form" }

get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/17.png?raw=true)

You should see that the test loss is largely brought back under control. Noise reappears (unsurprisingly, given how dropout works), but in this case at least, the test accuracy remains unchanged, which is a little disappointing. There must be another reason for the "overfitting".

Before we continue, a recap of all the tools we have tried so far:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/18.png?raw=true)

Whatever we do, we do not seem to be able to break the 98% barrier in a significant way, and our loss curves still exhibit the "overfitting" disconnect. What is "overfitting", really? Overfitting happens when a neural network learns "badly", in a way that works for the training examples but not so well on real-world data. There are regularisation techniques like dropout that can force it to learn in a better way, but overfitting also has deeper roots.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/19.png?raw=true)

Basic overfitting happens when a neural network has too many degrees of freedom for the problem at hand. Imagine we have so many neurons that the network can store all of our training images in them and then recognise them by pattern matching. It would fail on real-world data completely. A neural network must be somewhat constrained so that it is forced to generalise what it learns during training.

If you have very little training data, even a small network can learn it by heart. Generally speaking, you always need lots of data to train neural networks.

Finally, if you have done everything well, experimented with different sizes of network to make sure its degrees of freedom are constrained, applied dropout, and trained on lots of data you might still be stuck at a performance level that nothing seems to be able to improve. This means that your neural network, in its present shape, is not capable of extracting more information from your data, as in our case here.

Remember how we are using our images, with all pixels flattened into a single vector? That was a really bad idea. Handwritten digits are made of shapes, and we discarded the shape information when we flattened the pixels. However, there is a type of neural network that can take advantage of shape information: convolutional networks. Let us try them.

## 9. Theory: convolutional networks

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/20.png?raw=true)

In a layer of a convolutional network, one "neuron" does a weighted sum of the pixels just above it, across a small region of the image only. It then acts normally by adding a bias and feeding the result through its activation function. The big difference is that each neuron reuses the same weights whereas in the fully-connected networks seen previously, each neuron had its own set of weights.

In the animation above, you can see that by sliding the patch of weights across the image in both directions (a convolution) you obtain as many output values as there were pixels in the image (some padding is necessary at the edges though).
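As a rough sketch of what "sliding the patch of weights" means, the plain-Python function below applies one shared patch to a single-channel image with zero padding at the edges. The function name `conv2d_same`, the toy 5x5 image and the relu activation are assumptions chosen for illustration; real code would use `tf.keras.layers.Conv2D`.

```python
def conv2d_same(image, patch, bias=0.0):
    """Slide one square patch of shared weights over a 2D image (zero padded)."""
    h, w = len(image), len(image[0])
    k = len(patch)
    pad_before = (k - 1) // 2  # padding so the output keeps the input size
    output = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = bias
            for dy in range(k):
                for dx in range(k):
                    iy, ix = y + dy - pad_before, x + dx - pad_before
                    if 0 <= iy < h and 0 <= ix < w:  # outside the image counts as 0
                        total += image[iy][ix] * patch[dy][dx]
            output[y][x] = max(0.0, total)  # relu activation
    return output

image = [[float(x + 5 * y) for x in range(5)] for y in range(5)]

# A patch with a single 1 in the middle copies the image unchanged.
identity_patch = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
same_image = conv2d_same(image, identity_patch)

# A 4x4 averaging patch: every output value reuses the same 16 weights.
averaging_patch = [[1.0 / 16] * 4 for _ in range(4)]
blurred = conv2d_same(image, averaging_patch)
print(len(blurred), "x", len(blurred[0]))  # one output value per input pixel
```

Every output position is computed from the same `patch`, which is the weight sharing that distinguishes a convolutional layer from a fully-connected one.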

To generate one plane of output values using a patch size of 4x4 and a color image as the input, as in the animation, we need 4x4x3=48 weights. That is not enough. To add more degrees of freedom, we repeat the same thing with a different set of weights.

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/21.png?raw=true)

The two (or more) sets of weights can be rewritten as one by adding a dimension to the tensor and this gives us the generic shape of the weights tensor for a convolutional layer. Since the number of input and output channels are parameters, we can start stacking and chaining convolutional layers.
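The weight count can be checked with a little arithmetic. The sketch below assumes the usual TensorFlow ordering for a convolutional weights tensor, [patch height, patch width, input channels, output channels]; the helper names are made up for this example.

```python
def conv_weights_shape(patch_height, patch_width, in_channels, out_channels):
    """Shape of the weights tensor of one convolutional layer (biases excluded)."""
    return (patch_height, patch_width, in_channels, out_channels)

def count_weights(shape):
    """Multiply all dimensions together to get the number of weights."""
    n = 1
    for dim in shape:
        n *= dim
    return n

# One 4x4 patch over a colour (3-channel) image, one plane of output values.
one_plane = conv_weights_shape(4, 4, 3, 1)
# The same thing repeated with a second set of weights: two output channels.
two_planes = conv_weights_shape(4, 4, 3, 2)

print(one_plane, count_weights(one_plane))    # (4, 4, 3, 1) 48
print(two_planes, count_weights(two_planes))  # (4, 4, 3, 2) 96
```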

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/22.png?raw=true)

One last issue remains. We still need to boil the information down. In the last layer, we still want only 10 neurons for our 10 classes of digits. Traditionally, this was done by a "max-pooling" layer. Even if there are simpler ways today, "max-pooling" helps understand intuitively how convolutional networks operate: if you assume that during training, our little patches of weights evolve into filters that recognise basic shapes (horizontal and vertical lines, curves, ...) then one way of boiling useful information down is to keep through the layers the outputs where a shape was recognised with the maximum intensity. In practice, in a max-pool layer, neuron outputs are processed in groups of 2x2 and only the maximum one is retained.

There is a simpler way though: if you slide the patches across the image with a stride of 2 pixels instead of 1, you also obtain fewer output values. This approach has proven just as effective and today's convolutional networks use convolutional layers only.
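Both ways of reducing the number of output values can be illustrated with a small plain-Python sketch. The helper names and the toy 4x4 plane are assumptions for the example; the output-size rule shown is the one that applies with `padding='same'`, where the size is the input size divided by the stride, rounded up.

```python
import math

def max_pool_2x2(plane):
    """Keep only the largest value of every non-overlapping 2x2 group."""
    h, w = len(plane), len(plane[0])
    return [[max(plane[y][x], plane[y][x + 1],
                 plane[y + 1][x], plane[y + 1][x + 1])
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def same_padding_output_size(input_size, stride):
    """Number of output values along one axis for a 'same'-padded convolution."""
    return math.ceil(input_size / stride)

plane = [[1, 2, 0, 1],
         [3, 4, 1, 0],
         [0, 0, 5, 6],
         [1, 0, 7, 8]]
pooled = max_pool_2x2(plane)
print(pooled)  # [[4, 1], [1, 8]]

# With a stride of 2, a convolution alone halves each side: 28 -> 14 -> 7.
print(same_padding_output_size(28, 2), same_padding_output_size(14, 2))  # 14 7
```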

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/23.png?raw=true)

Let us build a convolutional network for handwritten digit recognition. We will use three convolutional layers at the top, our traditional softmax readout layer at the bottom and connect them with one fully-connected layer:

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/24.png?raw=true)

Notice that the second and third convolutional layers have a stride of two which explains why they bring the number of output values down from 28x28 to 14x14 and then 7x7. The sizing of the layers is done so that the number of neurons goes down roughly by a factor of two at each layer: 28x28x4≈3000 → 14x14x8≈1500 → 7x7x12≈500 → 200.
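The rounded figures above can be reproduced with a short calculation. This is just the arithmetic from the paragraph written out; the variable names are illustrative.

```python
# (stride, output channels) for the three convolutional layers described above
conv_layers = [(1, 4), (2, 8), (2, 12)]

side = 28  # MNIST images are 28x28 pixels
sizes = []
for stride, channels in conv_layers:
    side = -(-side // stride)  # ceiling division: 'same' padding with a stride
    sizes.append(side * side * channels)
    print(f"{side}x{side}x{channels} = {side * side * channels} output values")

# The fully-connected layer that follows has 200 neurons.
print(sizes + [200])  # [3136, 1568, 588, 200]
```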

## 10. Lab: the 99% challenge
A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a few more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem.

Here, for example, we used only 4 patches in the first convolutional layer. If you accept that those patches of weights evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem. Handwritten digits are made from more than 4 elemental shapes.

So let us bump up the patch sizes a little, increase the number of patches in our convolutional layers from 4, 8, 12 to 6, 12, 24 and then add dropout on the fully-connected layer. Why not on the convolutional layers? Their neurons reuse the same weights, so dropout, which effectively works by freezing some weights during one training iteration, would not work on them.
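To get a feel for how many degrees of freedom the enlarged network has, its trainable parameters can be counted by hand. The sketch below assumes patch sizes of 6, 5 and 4 with strides of 1, 2 and 2 (the values used in this lab's code cells), a 200-neuron fully-connected layer and a 10-neuron output layer; the helper names are made up for the example.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights of a square patch plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(inputs, outputs):
    """One weight per input/output pair plus one bias per output neuron."""
    return inputs * outputs + outputs

counts = {
    "conv_layer_1": conv_params(6, 1, 6),        # 28x28x1  -> 28x28x6
    "conv_layer_2": conv_params(5, 6, 12),       # 28x28x6  -> 14x14x12
    "conv_layer_3": conv_params(4, 12, 24),      # 14x14x12 -> 7x7x24
    "fc_layer_4": dense_params(7 * 7 * 24, 200), # flattened 7x7x24 -> 200
    "logits": dense_params(200, 10),             # 200 -> 10 classes
}

for name, n in counts.items():
    print(f"{name:>12}: {n:>7}")
print(f"{'total':>12}: {sum(counts.values()):>7}")
```

Under these assumptions, the fully-connected layer holds the vast majority of the parameters, while the three convolutional layers together account for only a few thousand.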

![alt text](https://github.com/YoussefBenDhieb/tensorflow-without-a-phd/blob/master/colabs/assets/25.png?raw=true)

**Go for it and break the 99% limit. Increase the patch sizes and channel numbers as on the picture above and add dropout on the fully-connected layer.**

In [0]:
# Define the model
model = tf.keras.models.Sequential([
# First convolutional layer
  tf.keras.layers.Conv2D(filters=6, kernel_size=6, padding='same', strides=1, activation='relu', input_shape=(28,28,1), name='conv_layer_1'),

    #YOUR CODE GOES HERE! ADD TWO CONVOLUTIONAL LAYERS
    
#flatten layer
  tf.keras.layers.Flatten(name='flatten_layer'),
# add a fully connected layer
  tf.keras.layers.Dense(200, activation=tf.nn.relu, name='fc_layer_4'),

    #YOUR CODE GOES HERE! ADD DROPOUT
    
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

In [0]:
#@title Solution

# Define the model
model = tf.keras.models.Sequential([
# First convolutional layer
  tf.keras.layers.Conv2D(filters=6, kernel_size=6, padding='same', strides=1, activation='relu', input_shape=(28,28,1), name='conv_layer_1'),
# Second convolutional layer
  tf.keras.layers.Conv2D(filters=12, kernel_size=5, padding='same', strides=2, activation='relu', name='conv_layer_2'),
# Third convolutional layer
  tf.keras.layers.Conv2D(filters=24, kernel_size=4, padding='same', strides=2, activation='relu', name='conv_layer_3'),
#flatten layer
  tf.keras.layers.Flatten(name='flatten_layer'),
# add a fully connected layer
  tf.keras.layers.Dense(200, activation=tf.nn.relu, name='fc_layer_4'),
# add dropout
  tf.keras.layers.Dropout(0.35, name='dropout_layer'),
# add output layer and apply softmax as an activation function
  tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='logits')
])

model.summary()

### Train the model

In [0]:
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train, x_test = np.expand_dims(x_train,-1), np.expand_dims(x_test,-1)

def train_model():
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003, decay=0.001),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  log_dir = os.path.join("/tmp/log", "convolutional-and-lr-decay-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train,
            batch_size=32,
            epochs=300, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

In [0]:
#@title Visualize in TensorBoard (RUN ME!) { display-mode: "form" }

get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"