# Several Tips for Improving Neural Networks
> In this post, we look at how to improve the performance of a neural network. In particular, we cover the ReLU activation function, weight initialization, dropout, and batch normalization.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Deep_Learning]
- image: 

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams['text.usetex'] = True
plt.rc('font', size=15)

## ReLU Activation Function
### Problem of Sigmoid
Previously, we discussed what happens inside a neural network. When the input passes through the network and generates the output, we call it **forward propagation**. From the output, we can measure the error between the predicted and actual values. Of course, we want to train the network to minimize this error, so we differentiate the error with respect to the weights and update the weights based on the result. This is called **backpropagation**.

![sigmoid](image/sigmoid.png)

$$g(z) = \frac{1}{1 + e^{-z}} $$

This is the **sigmoid** function. We used it to model the probability in binary classification, and its range is from 0 to 1. When we use the sigmoid as an activation, its derivative becomes one of the factors in backpropagation. Near the middle of the curve (around $z = 0$) this causes no trouble, since the slope there is clearly nonzero. The problem appears when the input $z$ goes toward $\infty$ or $-\infty$. As you can see, when $z$ is a large positive number the gradient of the sigmoid goes to 0, and when $z$ is a large negative number it goes to 0 as well. As we saw when covering the chain rule in the previous post, the gradient from the later step is multiplied in to compute the overall gradient. So if the sigmoid saturates at some nodes, the overall gradient is driven toward 0 by the chain rule. This problem is called the **vanishing gradient**. The gradient can still be computed, but it is so small that the weights barely update.
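
To make the vanishing gradient concrete, here is a minimal NumPy sketch that is not part of the original notebook. It relies on the identity $g'(z) = g(z)(1 - g(z))$, which peaks at 0.25 for $z = 0$ and shrinks toward 0 as $|z|$ grows. The helper names `sigmoid` and `sigmoid_grad` and the depth of 10 layers are assumptions chosen for illustration, and the weight factors that also appear in the chain rule are ignored.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(z)
print(grads)  # largest at z = 0, close to 0 at both extremes

# The chain rule multiplies one local gradient per sigmoid layer.
# Even in the best case (every pre-activation is exactly 0),
# each layer contributes a factor of only 0.25.
n_layers = 10  # hypothetical depth, for illustration only
best_case = sigmoid_grad(0.0) ** n_layers
print(best_case)
```

Even under this best-case assumption, the product over 10 layers is on the order of $10^{-6}$, which suggests why deep stacks of sigmoid layers are hard to train.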

### ReLU
Here, we introduce a new activation function, the **Rectified Linear Unit** (ReLU for short). A simple linear unit looks like this,

$$ f(x) = x $$

ReLU keeps only the range above 0 and ignores values less than 0. We can express it like this,

$$ f(x) = \max(0, x) $$

In other words, when the input is less than 0 the output is 0, and when the input is larger than 0 the output is the input itself.

![relu](image/relu.png)

So how can we analyze its gradient? If $x$ is larger than 0, the gradient is 1. Unlike the sigmoid, no matter how many layers are stacked, as long as the input is larger than 0 the gradient is preserved and passed on to the next step of the chain rule. There is a small problem when the input is less than 0: in this range the gradient is 0, so the gradient is dropped. This may look like the same situation as the sigmoid case, but at least we can maintain the gradient terms when the input is larger than 0.
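
The gradient behavior described above can be sketched in NumPy as follows. This is an illustrative addition rather than code from the original notebook: the names `relu` and `relu_grad` are assumptions, and assigning a gradient of 0 at exactly $x = 0$ is a convention, since ReLU is not differentiable there.

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 where x > 0, 0 elsewhere (0 at x = 0 by convention)."""
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # gradient is either 0 or 1

# Along a path where every unit is active (input > 0), the product of
# local gradients stays at 1 regardless of depth.
n_layers = 10  # hypothetical depth, for illustration only
active_path = relu_grad(3.0) ** n_layers
print(active_path)
```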

There are other variations for handling the vanishing gradient problem, such as the Exponential Linear Unit (ELU), the Scaled Exponential Linear Unit (SELU), Leaky ReLU, and so on.
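
As a rough illustration of two of these variants, the sketch below reimplements Leaky ReLU and ELU in NumPy from their standard definitions. It is not the notebook's code; the function names and the default values `alpha=0.01` and `alpha=1.0` are assumptions (commonly used defaults), and SELU is left out for brevity.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise (alpha is an assumed default)."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise (alpha is an assumed default)."""
    # np.where evaluates both branches, so clamp the exponent input at 0
    # to avoid overflow in the branch that is discarded for large x.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))  # small negative outputs instead of hard zeros
print(elu(x))         # smooth negative outputs that saturate near -alpha
```

Both functions keep a nonzero slope for negative inputs, so some gradient still flows where plain ReLU would pass none.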

### Comparing the performance of each activation function

In this example, we will use the MNIST dataset to compare the performance of each activation function.

In [5]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

# Load dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape, X_test.shape)

# Add a trailing channel axis, so each image goes from 2D (28, 28) to 3D (28, 28, 1)
X_train = tf.expand_dims(X_train, axis=-1)
X_test = tf.expand_dims(X_test, axis=-1)
print(X_train.shape, X_test.shape)

(60000, 28, 28) (10000, 28, 28)
(60000, 28, 28, 1) (10000, 28, 28, 1)


Expanding the dimension may look confusing. We do it because TensorFlow expects image inputs shaped like `[batch_size, height, width, channel]` by default, but the MNIST dataset included in Keras has no channel axis. So we append a dimension at the end of the dataset to represent the channel (MNIST is grayscale, so the channel size is 1).

Because the images are grayscale, the pixel values range from 0 to 255. Training usually works better when the data is normalized, so we scale the values to the range from 0 to 1.

In [7]:
X_train = tf.cast(X_train, tf.float32) / 255.0
X_test = tf.cast(X_test, tf.float32) / 255.0

The labels range from 0 to 9 and are categorical, so we need to convert them with one-hot encoding. Keras offers the `to_categorical` API to do this. (There are many other approaches to one-hot encoding, so feel free to try whichever you prefer.)

In [8]:
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
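
As one example of an alternative approach, one-hot encoding can also be sketched with plain NumPy by indexing into an identity matrix. This is an illustrative addition, not the notebook's method; the `labels` array below holds hypothetical sample values rather than data loaded from MNIST.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # hypothetical sample labels for illustration
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array builds all vectors at once.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot.shape)
print(one_hot[0])

# Taking argmax along the class axis recovers the original labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```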

WIP