# 1. Hands-on in the TensorFlow Playground

This lesson is based on the [TensorFlow Playground](http://playground.tensorflow.org/). The exercises are done on the site itself, so there is no code for this lesson; instead, the images and an explanation of what I did are presented.

The TensorFlow Playground lets you literally play around with deep learning. It has 4 different datasets and an interface for choosing a handful of the neural network parameters. The image shows the site homepage when you first access it. Here we can see the dataset to be used, the interactive neural net (where you can choose the number of layers and the number of neurons in each layer), the output result, and other neural network parameters.

Once you set up your neural network, you can press the play button and it will start training. As the epochs go on, the output plot is updated so that the background colour matches the classes of the samples (orange or blue). At the bottom of the page, the site has an explanation of neural networks, a license disclaimer, and the credits.

<img src="course_imgs/tensorflowplayground.png" alt="Tensorflow playground" width="500"/>

 ## First dataset experimentation
 
 ##### First try
 
The first experiment used the default setup provided by the page. After the play button was pressed, the output converged to the result shown in the image. The result was already satisfactory after only 50 epochs.
 
 <img src="course_imgs/playground_1.png" alt="First playground attempt" width="500"/>
 
 ##### Second try
 
Since the first try was clearly overkill, the neural network size was reduced. After removing one neuron from the second layer, the result was still reached very quickly. Because of that, I removed the whole second layer, leaving a single hidden layer with 4 neurons. Convergence was still very fast, so the number of neurons was reduced to 3. The image shows the result: after 150 epochs, a satisfactory result was achieved.
 
<img src="course_imgs/playground_2.png" alt="Second playground attempt" width="500"/>

However, if the remaining hidden layer is reduced to 2 neurons, the neural network no longer performs well, as shown in the image. In these experiments, 3 neurons in a single hidden layer was the lower limit for this dataset.

<img src="course_imgs/playground_3.png" alt="Third playground attempt" width="500"/>

## Second dataset experimentation

The lesson continues with the second dataset. I reset the neural network to the default configuration and selected the spiral dataset, as in the image.

<img src="course_imgs/playground_4.png" alt="Second dataset" width="500"/>

After running the neural network for 1000 epochs, it was still not possible to obtain a satisfactory result, as seen in the image. This means that the network size needs to be increased.

<img src="course_imgs/playground_5.png" alt="Second dataset" width="500"/>

After several attempts, the neural network was defined as in the image. Because of the network size, each epoch became very slow, but eventually a satisfactory result was obtained, especially considering how hard this dataset is to classify.

<img src="course_imgs/playground_6.png" alt="Second dataset" width="500"/>

# 2. Deep Learning Details

### Constructing, training, and tuning multi-layer perceptrons

## Backpropagation

Backpropagation is gradient descent using reverse-mode autodiff. It is how the MLP weights are trained, that is, how the network **learns**. In each training iteration:
1. Compute the output error;
2. Compute how much each neuron in the previous hidden layer contributed;
3. Back-propagate that error in a reverse pass;
4. Tweak weights to reduce the error using gradient descent.
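
The four steps above can be sketched in plain Python for a tiny network. This is only an illustration under assumptions that are not in the lesson: one `tanh` hidden layer, a single linear output, a squared-error loss, and a single training sample. All function names (`init_params`, `forward`, `backward`, `gradient_descent_step`) are made up for this example.

```python
import math
import random


def init_params(n_in, n_hidden, seed=0):
    """Small random weights for one tanh hidden layer and a linear output."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b2 = 0.0
    return W1, b1, W2, b2


def forward(params, x):
    """Forward pass: hidden activations h and the scalar output y."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, y


def loss(params, x, target):
    """Squared error for a single sample (an assumed loss, for illustration)."""
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Reverse pass: gradient of the loss with respect to every parameter."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    d_y = y - target                      # step 1: output error
    d_W2 = [d_y * hi for hi in h]
    d_b2 = d_y
    # steps 2-3: each hidden neuron's share of the error, pushed back through tanh
    d_pre = [d_y * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
    d_W1 = [[d * xi for xi in x] for d in d_pre]
    d_b1 = list(d_pre)
    return d_W1, d_b1, d_W2, d_b2


def gradient_descent_step(params, grads, lr=0.1):
    """Step 4: move every weight a small step against its gradient."""
    W1, b1, W2, b2 = params
    d_W1, d_b1, d_W2, d_b2 = grads
    W1 = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W1, d_W1)]
    b1 = [b - lr * g for b, g in zip(b1, d_b1)]
    W2 = [w - lr * g for w, g in zip(W2, d_W2)]
    b2 = b2 - lr * d_b2
    return W1, b1, W2, b2


# Usage: repeat the forward/backward/update cycle on one sample.
params = init_params(n_in=2, n_hidden=3)
x, target = [0.5, -0.3], 1.0
for _ in range(20):
    params = gradient_descent_step(params, backward(params, x, target), lr=0.1)
print("loss after 20 steps:", loss(params, x, target))
```

In practice a framework computes the reverse pass automatically; writing it by hand here just makes each step visible.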

## Activation functions

Step functions have a gradient of zero everywhere except at the jump, where it is undefined, so they don't work with gradient descent. However, there are other functions to use as activation:
- Logistic (sigmoid);
- Hyperbolic tangent (tanh);
- Exponential linear unit (ELU);
- Rectified linear unit (ReLU), also known as the rectifier.

ReLU is the common choice: it is fast to compute and works well in practice. However, ELU can sometimes lead to faster learning.
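
As a rough illustration, the activation functions listed above can be written in a few lines of plain Python. The `step` and `numerical_grad` helpers are additions for this sketch only; they show why a step function gives gradient descent nothing to work with. The ELU parameter `alpha=1.0` is an assumed default.

```python
import math


def step(z):
    """Step function: flat on both sides of the jump at 0."""
    return 1.0 if z >= 0 else 0.0


def logistic(z):
    """Logistic (sigmoid), written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Hyperbolic tangent."""
    return math.tanh(z)


def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def elu(z, alpha=1.0):
    """Exponential linear unit: smooth and negative below zero."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def numerical_grad(f, z, eps=1e-4):
    """Central-difference estimate of df/dz, used here only for comparison."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


for name, f in [("step", step), ("logistic", logistic), ("tanh", tanh),
                ("relu", relu), ("elu", elu)]:
    print(name, "gradient at z = 0.7:", numerical_grad(f, 0.7))
```

Away from the jump, the estimated gradient of `step` is exactly zero, while the other functions give a usable, non-zero slope.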

## Optimization functions

There are faster optimizers than plain gradient descent:
- Momentum optimization:
    - Introduces a momentum term to the descent, so the speed follows the slope: the steeper the slope, the faster it moves, and as things flatten out it slows down.
- Nesterov Accelerated Gradient:
    - A small tweak on momentum optimization: it computes the gradient slightly ahead, in the direction of the momentum, instead of at the current position.
- RMSProp:
    - Uses an adaptive learning rate to help reach the minimum.
- Adam:
    - Adaptive moment estimation: momentum and RMSProp combined. A popular choice today, and easy to use.
    
Choose the one that gives the best trade-off between training speed and computational cost.
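
To make the differences concrete, here is a minimal sketch of one update step for each optimizer on a single scalar weight. The formulas follow the commonly published versions of each method; the function names, the hyperparameter defaults, and the toy objective `f(w) = w**2` are assumptions chosen for illustration, not values from the lesson.

```python
import math


def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Momentum: the velocity v accumulates past gradients."""
    v = beta * v - lr * grad_fn(w)
    return w + v, v


def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov: same as momentum, but the gradient is taken at the look-ahead point."""
    v = beta * v - lr * grad_fn(w + beta * v)
    return w + v, v


def rmsprop_step(w, s, grad_fn, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    g = grad_fn(w)
    s = rho * s + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(s) + eps), s


def adam_step(w, m, v, t, grad_fn, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style average m plus RMSProp-style average v, bias-corrected."""
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # t is the 1-based step count
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def grad(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2.0 * w


# Usage: run Adam for a few steps from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 6):
    w, m, v = adam_step(w, m, v, t, grad)
    print("step", t, "w =", w)
```

Note that each optimizer carries extra state between steps (a velocity, a running average, or both), which is part of the computational cost mentioned above.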
    
## Avoiding Overfitting

With thousands of weights to tune, overfitting is a problem. One option is early stopping: stop the training as soon as the performance (typically measured on validation data) starts to drop. Another is to add regularization terms to the cost function during training. Another reliable technique is **Dropout**:

- Randomly ignore a fraction of the neurons at each training step. This works well because it forces the model to spread the learning out, as if you used only half of your brain to learn something. This way, neurons that would otherwise contribute little can start taking part in the learning.
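
A minimal sketch of dropout applied to one layer's activations is shown below. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing has to change at prediction time; that rescaling is an assumption of this sketch and is not described in the lesson. The function name and signature are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero out roughly `rate` of the activations, only while training.

    Survivors are divided by the keep probability (inverted dropout, an
    assumed variant) so the expected activation stays the same.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Usage: drop about half of the neurons in a layer for one training step.
layer_output = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
print(dropout(layer_output, rate=0.5, rng=random.Random(42)))
print(dropout(layer_output, rate=0.5, training=False))  # unchanged at prediction time
```

A fresh random mask is drawn on every call, so a different subset of neurons is ignored at each training step.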

## Tuning your topology

It is possible to tune by trial and error:
- Iterate, loosely like a numerical search method: first evaluate a smaller network with fewer neurons and hidden layers, then evaluate a larger one. Once the result is acceptable, keep reducing and increasing the network size to narrow in on a good topology.

More layers can yield faster learning. It is also possible to use more layers and neurons than needed and rely on early stopping to limit overfitting. There are also "model zoos", which are collections of validated topologies for solving specific problems.
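
Early stopping, mentioned above, can be sketched as a simple rule over the validation loss recorded after each epoch. The `patience` parameter (how many epochs without improvement to tolerate) and the function name are assumptions for this example; the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the weights from `best_epoch` are the ones to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, current in enumerate(val_losses):
        if current < best_loss:
            best_loss, best_epoch, waited = current, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Usage with made-up validation losses: improvement stalls after epoch 2.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.5]
stop, best = early_stopping_epoch(losses, patience=3)
print("stop at epoch", stop, "and keep the weights from epoch", best)
```

A larger `patience` trades extra training time for a lower chance of stopping before a late improvement.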