<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/v2/xx_misc/activation_functions/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Activation Functions

Activation functions are core components of neural networks. These functions are used in every node of a network to reduce a vector of inputs into a single output value.

Learning when to apply specific activation functions is a critical skill for anyone building deep learning models.

## What is an activation function?

Picture yourself as a node in a neural network. On one side of you there are multiple input streams passing data from the prior layer. On the other side there are multiple output streams that we use to pass data to every node in the next layer.

We expect the data on the input side to contain many different values, since we are getting data from different nodes. On the output side we'll give every node in the next layer the same value. Distilling the multiple diverse inputs into a single value that we can hand to the next layer is the job of an activation function.

In mathematical terms it looks something like this:

> $y = \text{activation}\left(\sum_{i=0}^{n}{x_i} + \text{bias}\right)$

We sum our inputs from prior nodes, $x$, and our bias. We then pass that summation through an activation function in order to get our output value, $y$, that we then pass to every node in the next layer of the network.
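As a minimal sketch of the formula above, the snippet below sums a few inputs with a bias and passes the result through an activation. The input values, the bias, and the helper name `node_output` are hypothetical, and the sigmoid is used only as an example activation. Like the formula, the sketch leaves out the learned weights that a real network multiplies each input by before summing.

```python
import math

def sigmoid(z):
    """Logistic function, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, bias, activation):
    """Sum the inputs and the bias, then apply the activation."""
    return activation(sum(inputs) + bias)

# Hypothetical values arriving from three nodes in the prior layer.
inputs = [0.5, -1.0, 2.0]
bias = 0.5

# The sum is 2.0, so y is sigmoid(2.0).
y = node_output(inputs, bias, sigmoid)
print(y)
```

The single value `y` is what this node would hand to every node in the next layer.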

## Pass-through Activation

The most basic activation function is the [linear](https://www.tensorflow.org/api_docs/python/tf/keras/activations/linear) activation function. This function takes the sum of inputs and bias, does nothing to it, and hands the result to the next layer of the network.

Let's plot the linear activation function in the code block below.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def linear(x):
  return x

inputs = np.linspace(-10, 10, 10)
outputs = [linear(x) for x in inputs]
_ = plt.plot(inputs, outputs)

That's a pretty simple activation function to understand. But what value does it provide?

This function can be useful, especially in your output layer, if you want your model to produce large or negative values. Many of the activation functions that we'll see greatly restrict the range of values that they output. The linear activation function does not restrict its output range at all. Any real number can be produced by a node with this activation function.

## Rectified Linear Units (ReLU)

There is another, piecewise-linear activation function that turns out to be quite useful: the [Rectified Linear Unit (ReLU)](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu).

ReLU simply returns the input value unless that value is less than zero. In that case it returns zero.

$$a = \begin{cases}
x \ , &x \geq 0 \\
0 \ , &x < 0 \\
\end{cases}$$

Let's take a look at ReLU:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def relu(x):
  if x < 0:
    return 0
  return x

inputs = np.linspace(-10, 10, 100)
outputs = [relu(x) for x in inputs]
_ = plt.plot(inputs, outputs)

ReLU is also quite simple, but it turns out to be very useful in practice. Many powerful neural networks use ReLU activation, at least in part. It has the advantage of making training very fast; however, nodes using ReLU run the risk of "dying" during the training process. A node dies when it reaches a state where it always produces a zero output.
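To make the idea of a dead node concrete, here is a small sketch. The batch of inputs and the strongly negative bias are hypothetical values chosen so that the summed input is negative for every example. Because ReLU is flat for negative inputs, its slope there is zero, so gradient-based training has nothing to push such a node back to life.

```python
def relu(x):
    return x if x > 0 else 0.0

def node(inputs, bias):
    """A single ReLU node: sum the inputs and bias, then rectify."""
    return relu(sum(inputs) + bias)

# Hypothetical batch of inputs from the prior layer.
batch = [[0.2, 0.7], [1.5, -0.3], [0.9, 0.4]]

# Hypothetical bias that has drifted far below zero during training.
dead_bias = -10.0

outputs = [node(inputs, dead_bias) for inputs in batch]
print(outputs)  # every entry is 0.0
```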

Let's also think about the use of ReLU nodes in a network. If the output layer consists of ReLU nodes, then the output of the network will range from `0` to infinity.

This works fine for models that are predicting positive values, but what if your model is predicting Celsius temperatures in Antarctica or some other potentially negative value?

In this case you would need to adjust the target training data to all be positive, say by adding `100` to it, and then reverse the adjustment on the output of the model by subtracting `100` from each value.
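The adjustment described above can be sketched in a few lines. The temperature readings are made up, and the "predictions" are a stand-in that simply echoes the shifted targets, since no model is trained here.

```python
OFFSET = 100  # the shift suggested in the text

# Hypothetical Celsius readings, some well below zero.
temperatures = [-62.5, -30.0, 4.0]

# Shift the targets into the non-negative range before training.
shifted_targets = [t + OFFSET for t in temperatures]

# Stand-in for model output: pretend the model predicts the targets exactly.
predictions = list(shifted_targets)

# Undo the shift to get back to Celsius.
restored = [p - OFFSET for p in predictions]

print(shifted_targets)
print(restored)
```

The offset only works if it is larger than the magnitude of the coldest value you expect, so it is worth checking the minimum of your targets before choosing it.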

You'll find that you'll need to do this type of adjustment quite often when building models. Understanding your activation functions, especially in your output layer, is critically important. When you know the range of values that your model can produce, you can adjust your training data to fall within that range.

<img src="https://i.imgur.com/0CIHbg7.png" width="250">

Given some input $x$ either from previous layers, or some input data, an activation takes the input and returns a decision to activate.  In the diagram above:

- inputs $x_1,x_2,x_3$ are passed to a hidden layer
- activation function $a_1$ outputs to the final layer
- activation function $a_2$ returns a value $\hat{y}$, a prediction

There are many types of activation functions, each with its own strengths in both effectiveness and computational cost. In practice, a sigmoid activation might be used for a binary classifier, while a [softmax](https://en.wikipedia.org/wiki/Softmax_function) would be used for a multiclass problem, and ReLU is commonly used in image classification with convolutional neural network (CNN) architectures.
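Softmax is mentioned here without a definition, so a small sketch may help. It turns a list of raw scores into probabilities that sum to one, which is why it suits multiclass outputs. The scores below are hypothetical, and subtracting the maximum score is a common numerical-stability step rather than part of the mathematical definition.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
class_probabilities = softmax([2.0, 1.0, 0.1])
print(class_probabilities)

# The predicted class is the one with the highest probability.
predicted_class = class_probabilities.index(max(class_probabilities))
print(predicted_class)
```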

Many activations are variants of the following three functions:

- Sigmoid (logistic function)
$$a=\frac{1}{1+e^{-x}}$$

- Tanh (hyperbolic tangent)
$$a=tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$

- ReLU (Rectified Linear Unit)
$$a = \begin{cases}
x \ , &x \geq 0 \\
0 \ , &x < 0 \\
\end{cases}$$



For each activation function listed above (sigmoid, tanh, ReLU), let's apply the function to some data and plot the results to visualize it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Sigmoid

def sigmoid(x):
  return [1 / (1 + np.exp(-item)) for item in x]
   
# Create sample data
x = np.arange(-10., 10., 0.2)
sig = sigmoid(x)
fig, ax = plt.subplots(1,1)
plt.plot(x,sig, marker = "o")

plt.axhline(0, c='k')
plt.axvline(0, c='k')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xlabel("X") 
plt.ylabel("Y")
plt.title("Sigmoid Function")
plt.text(3, 0.8, r'$a=\frac{1}{1+e^{-x}}$', fontsize=16)
plt.show()

In [None]:
# Tanh

in_array = np.linspace(-np.pi, np.pi, 12) 
out_array = np.tanh(in_array) 

plt.plot(in_array, out_array, color = 'red', marker = "o") 
plt.title('Tanh Function') 
plt.xlabel("X") 
plt.ylabel("Y")
plt.axhline(0, c='k')
plt.axvline(0, c='k')
plt.text(.2, -.5, r'$a=tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$', fontsize=16)
plt.show() 

In [None]:
# ReLU

def ReLU(x):
  return np.maximum(0.0, x)

X = np.linspace(-5, 5, 100)
plt.plot(X, ReLU(X),'b', marker = "o")
plt.xlabel('X')
plt.ylabel('Y')
plt.title('ReLU Function')
plt.axvline(0, c='k')
plt.show()
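The three plots differ most visibly in their output ranges. As a rough check, the sketch below evaluates each function on a handful of hypothetical sample points and records the smallest and largest value seen. It uses plain Python rather than NumPy to stay self-contained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Hypothetical sample points on both sides of zero.
samples = [-5.0, -1.0, 0.0, 1.0, 5.0]
functions = {"sigmoid": sigmoid, "tanh": math.tanh, "relu": relu}

ranges = {}
for name, fn in functions.items():
    values = [fn(x) for x in samples]
    ranges[name] = (min(values), max(values))
    print(f"{name:8s} min={ranges[name][0]: .4f}  max={ranges[name][1]: .4f}")
```

Sigmoid stays strictly between 0 and 1, tanh stays strictly between -1 and 1, and ReLU is bounded below by 0 but has no upper bound.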

The "leaky ReLU" function is a variant of the ReLU function, and it is commonly used in neural nets.

$$a = \begin{cases}
x \ , &x \geq 0 \\
0.01x \ , &x < 0 \\
\end{cases}$$

In [None]:
# Leaky ReLU

def LeakyReLU(x):
  # np.where applies the piecewise rule element-wise, so arrays work too.
  return np.where(x >= 0, x, 0.01 * x)

X = np.linspace(-5, 5, 100)
plt.plot(X, LeakyReLU(X),'b', marker = "o")
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Leaky ReLU Function')
plt.axvline(0, c='k')
plt.show()
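One way to see why the small negative slope matters is to compare the slopes of the two functions directly. This is a sketch: the helper names are hypothetical, and the slope at exactly `0` is set by convention to match the `x >= 0` branch of the definitions above.

```python
def relu_slope(x):
    """Slope of ReLU: 1 on the non-negative side, 0 on the negative side."""
    return 1.0 if x >= 0 else 0.0

def leaky_relu_slope(x, alpha=0.01):
    """Slope of leaky ReLU: 1 on the non-negative side, alpha otherwise."""
    return 1.0 if x >= 0 else alpha

for x in [-3.0, 2.0]:
    print(x, relu_slope(x), leaky_relu_slope(x))
```

For negative inputs ReLU has a slope of exactly zero, while leaky ReLU keeps a small non-zero slope, so a node with negative input can still receive a training signal.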

## Activation Functions in Convolutional Neural Nets

In [None]:
from tensorflow.keras.datasets import mnist

In [None]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
# Reshape data to fit model.
X_train = X_train.reshape(60000,28,28,1)
X_test = X_test.reshape(10000,28,28,1)

In [None]:
from tensorflow.keras.utils import to_categorical
# one-hot encode target column
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
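For readers who have not seen one-hot encoding before, here is a plain-Python sketch of the idea. The `one_hot` helper and the sample labels are hypothetical and are not how `to_categorical` is implemented.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at the label's index."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

# Hypothetical digit labels.
labels = [5, 0, 4]
encoded = [one_hot(label, 10) for label in labels]
print(encoded[0])

# Taking the index of the largest entry recovers the original label.
decoded = [row.index(max(row)) for row in encoded]
print(decoded)
```

Each row lines up with the ten softmax outputs of the final layer, which is what lets a categorical loss compare predictions with targets.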

### Create the model architecture 

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten
# Create model.
model = Sequential()
# Add model layers.
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

As we can see, this model has two convolutional layers. Each of the Conv2D layers uses a ReLU activation to pick out interesting parts of the digit image, and a final dense layer uses a softmax to classify the image as one of the 10 digits.
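The layer shapes and parameter counts of this model can be worked out by hand. The sketch below assumes the Keras defaults for `Conv2D` (stride of 1, no padding, one bias per filter); the helper names are hypothetical.

```python
def conv_output_size(input_size, kernel_size):
    """Output width or height for stride 1 and no padding."""
    return input_size - kernel_size + 1

def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias for each filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

size_after_conv1 = conv_output_size(28, 3)                 # 26
size_after_conv2 = conv_output_size(size_after_conv1, 3)   # 24

# Flatten turns the 24 x 24 x 32 volume into a single vector.
flattened = size_after_conv2 * size_after_conv2 * 32       # 18432

dense_params = flattened * 10 + 10
total_params = conv_params(3, 1, 64) + conv_params(3, 64, 32) + dense_params

print(size_after_conv1, size_after_conv2, flattened)
print(total_params)  # 203434
```

Under those assumptions, most of the parameters sit in the final dense layer rather than in the convolutional layers.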

### Visualize the model architecture

In [None]:
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='model_plot.png', show_shapes=True,
           show_layer_names=True)

In [None]:
from IPython.display import Image
Image(filename='model_plot.png') 

### Compile the model

Compiling configures the model for training. Here you choose the `optimizer`, the `loss` function, and the performance `metrics` to report.

In [None]:
model.compile(optimizer='adam', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])
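To make the `categorical_crossentropy` loss a little less abstract, here is a sketch of the calculation for a single example. The target and the two predicted distributions are hypothetical, and the `eps` clamp is an assumption added only to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-12):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, predicted_probs))

# Hypothetical three-class example where the true class is index 1.
target = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the softmax output puts most of its probability on the correct class and grows as that probability shrinks.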

### Train Model

In [None]:
model.fit(X_train, y_train,
          batch_size=128,
          epochs=10,
          verbose=1,
          validation_data=(X_test, y_test))

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Using a CNN architecture with ReLU activation functions, we were able to train a model with ~98% accuracy!

### Get Confusion Matrix from predictions

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
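As a sketch of what a confusion matrix holds, the snippet below builds one by hand for a tiny, made-up set of labels. Rows are true labels and columns are predicted labels, the same layout `confusion_matrix` uses. The helper name is hypothetical.

```python
def confusion_matrix_counts(true_labels, predicted_labels, num_classes):
    """Count how often each true label was predicted as each class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix

# Hypothetical labels for six examples across three classes.
true_labels = [0, 1, 2, 2, 1, 0]
predicted_labels = [0, 2, 2, 2, 1, 0]

matrix = confusion_matrix_counts(true_labels, predicted_labels, 3)
for row in matrix:
    print(row)

# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(3)) / len(true_labels)
print(accuracy)
```

Off-diagonal entries show which classes get mistaken for which; here one example of class `1` was predicted as class `2`.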

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the plotted image and the printed values agree.
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Show two decimals when normalized, whole counts otherwise.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
plot_confusion_matrix(cm, [x for x in range(10)],
                  normalize=False,
                  title='Confusion matrix',
                  cmap=plt.cm.Greens)

### Show activation maps

In [None]:
from tensorflow.keras.models import Model

def display_activation(activations, col_size, row_size, act_index): 
    activation = activations[act_index]
    activation_index=0
    # figsize is (width, height), so width scales with the number of columns.
    fig, ax = plt.subplots(row_size, col_size, figsize=(col_size*2.5,row_size*1.5))
    for row in range(0,row_size):
        for col in range(0,col_size):
            ax[row][col].imshow(activation[0, :, :, activation_index], cmap='bone')
            activation_index += 1

In [None]:
n = np.random.randint(len(X_test))
plt.imshow(X_test[n][:,:,0]);

In [None]:
layer_outputs = [layer.output for layer in model.layers]
activation_model = Model(inputs=model.input, outputs=layer_outputs)
activations = activation_model.predict(X_test[n].reshape(1,28,28,1))

### First layer activation maps

In [None]:
display_activation(activations, 8, 8, 0)

### 2nd layer activation maps

In [None]:
display_activation(activations, 5, 6, 1)
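Each panel in the grids above is the output of one filter after its ReLU. As a rough illustration of how one such map comes about, the sketch below slides a small kernel over a tiny image and rectifies the result. The image, the kernel, and the `activation_map` helper are all hypothetical, and the loop is a plain-Python stand-in for what `Conv2D` does far more efficiently.

```python
def relu(x):
    return max(0.0, x)

def activation_map(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), then apply ReLU."""
    k = len(kernel)
    out_size = len(image) - k + 1
    result = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(relu(total))
        result.append(row)
    return result

# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Hypothetical 2x2 kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1.0, 1.0],
    [-1.0, 1.0],
]

feature_map = activation_map(image, kernel)
for row in feature_map:
    print(row)
```

The map is non-zero only in the column where the edge sits, which is the kind of localized response visible in the activation grids.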

# Resources

* [Comparison of activation functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions)
* [deeplearning.ai](https://www.coursera.org/lecture/neural-networks-deep-learning/activation-functions-4dDC1)
* [Discriminative Localization](http://cnnlocalization.csail.mit.edu/)
* [CNN with Keras](https://www.kaggle.com/amarjeet007/visualize-cnn-with-keras)
* [Disadvantages of ReLU](https://www.quora.com/What-are-the-disadvantages-of-using-the-ReLu-when-using-Neural-Networks)

# Exercises

Build a model that uses ReLU activations on another image dataset, either one you have worked with before or a new one, and plot the activation maps.

### Student Solution

In [None]:
# Your Code

### Answer Key

**Solution**

In [None]:
# Put the recommended solution here; if there is more than one "good" solution
# that you think students should know put those solutions in subsequent code
# boxes with "# Solution" in the first line.