<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/v2/xx_misc/activation_functions/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Activation Functions

Activation functions are core components of neural networks. These functions are used in every node of a network to reduce a vector of inputs to a single output value.

Learning when to apply specific activation functions is a critical skill for anyone building deep learning models.

## What is an activation function?

Picture yourself as a node in a neural network. On one side of you there are multiple input streams passing data from the prior layer. On the other side there are multiple output streams that we use to pass data to every node in the next layer.

We expect the data from our input streams to contain many different values since we are getting data from different nodes. On the output side we'll give every node in the next layer the same value. Distilling the multiple diverse inputs into a single value that we can hand to the next layer is the job of an activation function.

In mathematical terms it looks something like this:

$$a = \text{activation}\left(\sum_{i=0}^{n}{x_i} + \text{bias}\right)$$

We sum our inputs from prior nodes, $x_i$, and our bias. We then pass that summation through an activation function in order to get our output value, $a$, which we then pass to every node in the next layer of the network.
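
To make the formula concrete, here is a minimal sketch of a single node. The `node_output` and `identity` helpers and the input values are hypothetical and exist only for illustration. The sketch follows the simplified formula above, which sums the inputs directly and leaves out the per-input weights that a real network would learn.

```python
import numpy as np

def node_output(inputs, bias, activation):
  """Sums the incoming values, adds the bias, and applies the activation."""
  return activation(np.sum(inputs) + bias)

def identity(z):
  # Placeholder activation that returns its input unchanged.
  return z

# Hypothetical values arriving from three nodes in the prior layer.
x = np.array([0.5, -1.5, 2.0])

a = node_output(x, bias=0.25, activation=identity)
print(a)  # 1.25
```

Any of the activation functions in the rest of this document could be passed in place of `identity`.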

Though activation functions are used in every layer of a network, it is particularly important to understand how they behave at the output layer of a model.

## Pass-through Activation

The most basic activation function is the [linear](https://www.tensorflow.org/api_docs/python/tf/keras/activations/linear) activation function. This function takes the sum of inputs and bias, does nothing to it, and hands the result to the next layer of the network.

Let's plot the linear activation function in the code block below.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def linear(x):
  return x

inputs = np.linspace(-10, 10, 10)
outputs = [linear(x) for x in inputs]
_ = plt.plot(inputs, outputs)

That's a pretty simple activation function to understand. But what value does it provide?

This function can be useful, especially in your output layer, if you want your model to produce large or negative values. Many of the activation functions that we'll see greatly restrict the range of values that they output. The linear activation function does not restrict its output range at all. Any real number can be produced by a node with this activation function.

## Rectified Linear Units (ReLU)

There is another activation function built from linear pieces that turns out to be quite useful, the [Rectified Linear Unit (ReLU)](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu).

ReLU simply returns the input value unless that value is less than zero. In that case it returns zero.

$$a = \begin{cases}
x \ , &x \geq 0 \\
0 \ , &x < 0 \\
\end{cases}$$

Let's take a look at ReLU:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def relu(x):
  if x < 0:
    return 0
  return x

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [relu(x) for x in inputs]
_ = plt.plot(inputs, outputs)

This is also a simple activation function, but it turns out to be quite useful in practice. Many powerful neural networks utilize ReLU activation, at least in part. It has the advantage of making training very fast; however, nodes using ReLU do run the risk of "dying" during the training process. A node dies when it gets into a state where it always produces a zero output.
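
As a rough illustration of a dead node, the sketch below gives a ReLU node a large negative bias. The bias and the input batches are made-up values. Every summed input lands on the flat part of ReLU, where the slope is zero, so gradient-based training receives no signal that could push the node back into its active range.

```python
import numpy as np

def relu(x):
  return max(0.0, x)

# Hypothetical bias that has drifted far below zero during training.
bias = -50.0

# Hypothetical sets of inputs arriving from the prior layer.
batches = [
    np.array([1.0, 2.0, 3.0]),
    np.array([10.0, -4.0, 0.5]),
    np.array([20.0, 15.0, 5.0]),
]

# Each sum plus the bias is negative, so the node always outputs zero.
outputs = [relu(np.sum(batch) + bias) for batch in batches]
print(outputs)
```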

Let's also think about the use of a ReLU node in a network. If the output layer consists of ReLU values, then the output of the network will be from `0` to infinity.

This works fine for models that are predicting positive values, but what if your model is predicting Celsius temperatures in Antarctica or some other potentially negative value?

In this case you would need to adjust the target training data so that it is all positive, say by adding `100` to it, and then reverse the adjustment on the output of the model by subtracting `100` from each value.

You'll find that you need to make this type of adjustment quite often when building models. Understanding your activation functions, especially in your output layer, is critically important. When you know the range of values that your model can produce, you can adjust your training data to fall within that range.
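
The shift described above can be sketched in a few lines. The temperatures are made-up values, and the model itself is left out: the sketch pretends that the predictions exactly match the shifted targets so that the round trip is easy to follow.

```python
import numpy as np

OFFSET = 100.0  # The shift suggested in the text.

# Hypothetical Celsius readings, most of them below zero.
celsius = np.array([-62.5, -30.0, -4.0, 3.5])

# Shift the targets into the non-negative range that a ReLU output can produce.
shifted_targets = celsius + OFFSET

# Stand-in for model output: assume the model predicted the targets perfectly.
predictions = shifted_targets

# Undo the shift to get back to real temperatures.
recovered = predictions - OFFSET

print(shifted_targets)
print(recovered)
```

The offset only works if it is at least as large as the coldest temperature you expect to see, so it should be chosen from knowledge of the data rather than picked arbitrarily.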

## Leaky ReLU

We talked about dead nodes when discussing the ReLU activation function. One strategy that helps mitigate the dead node issue is a "leaky" ReLU. Leaky ReLUs are ReLU functions that pass through any value zero or greater. Values less than zero are multiplied by a small alpha value, and the result is returned.

$$a = \begin{cases}
x \ , &x \geq 0 \\
x * \alpha \ , &x < 0 \\
\end{cases}$$

TensorFlow Keras doesn't make a distinction between ReLU and Leaky ReLU; it simply provides an alpha parameter to [relu](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu).

### Exercise 1: Leaky ReLU

Write a `leaky_relu` function that passes through any value zero or greater and applies an alpha of `0.1` to values less than zero.

**Student Solution**

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def leaky_relu(x):
  pass # Your code goes here

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [leaky_relu(x) for x in inputs]
_ = plt.plot(inputs, outputs)

---

#### Answer Key

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def leaky_relu(x):
  if x < 0:
    return x * 0.1
  return x

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [leaky_relu(x) for x in inputs]
_ = plt.plot(inputs, outputs)

---

## Binary Step

The binary step activation function serves as an on/off switch for a node. This function returns zero if its input is on one side of a threshold and one if it is on the other.

$$a = \begin{cases}
1 \ , &x \geq 0 \\
0 \ , &x < 0 \\
\end{cases}$$


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def binary_step(x):
  if x < 0:
    return 0
  return 1

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [binary_step(x) for x in inputs]
_ = plt.plot(inputs, outputs)

At the output layer this function can be useful when you need to make a yes/no decision and don't care about the confidence of the model in that decision.

## Sigmoid

Activation functions can also be non-linear. The [sigmoid](https://www.tensorflow.org/api_docs/python/tf/keras/activations/sigmoid) function works using a logistic curve.

$$a=\frac{1}{1+e^{-x}}$$


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [sigmoid(x) for x in inputs]
_ = plt.plot(inputs, outputs)

You'll notice that the sigmoid function restricts its output range to $(0.0, 1.0)$. This is typically not a concern in hidden layers, but needs to be considered in the output layer. You'll likely need to scale your training targets down to this range and expand your predictions back to your actual data range.
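
One common way to do this scaling is min-max scaling, sketched below. The helper names, the prices, and the `low` and `high` bounds are all hypothetical. The bounds are deliberately chosen a little wider than the data so that every scaled target falls strictly inside $(0.0, 1.0)$, which a sigmoid can approach but never reach at the ends.

```python
import numpy as np

def scale_to_unit(values, low, high):
  """Maps values from the range [low, high] into [0, 1]."""
  return (values - low) / (high - low)

def unscale_from_unit(scaled, low, high):
  """Maps values from [0, 1] back into the range [low, high]."""
  return scaled * (high - low) + low

# Hypothetical training targets and assumed bounds for them.
prices = np.array([120.0, 250.0, 310.0, 480.0])
low, high = 100.0, 500.0

scaled = scale_to_unit(prices, low, high)          # Train against these.
restored = unscale_from_unit(scaled, low, high)    # Apply this to predictions.

print(scaled)
print(restored)
```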

Sigmoids in the output layer can be very useful for predicting continuous values. They can also be useful when making binary classification decisions. You can build a model that outputs values in $(0.0, 1.0)$ and treat the output as a confidence in a decision, where values closer to `0.0` show no confidence and values closer to `1.0` show extreme confidence. You then experiment and set a threshold where you make your binary decision.

For example, if you were making a classifier to determine if an image contained a cat, you might find that any time the model returned a value over `0.85` there was typically a cat in the image. Before making this decision you'd need to experiment, find the precision and recall for different thresholds, and choose the one that fits your use case best.
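
The threshold experiment could look something like the sketch below. The scores and labels are invented for illustration, and `precision_recall` is a hypothetical helper rather than a library function. Returning `0.0` when a denominator is zero is a convention chosen here, not a universal rule.

```python
import numpy as np

def precision_recall(scores, labels, threshold):
  """Computes precision and recall when scores >= threshold count as 'cat'."""
  predicted = scores >= threshold
  actual = labels == 1
  true_pos = np.sum(predicted & actual)
  false_pos = np.sum(predicted & ~actual)
  false_neg = np.sum(~predicted & actual)
  precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
  recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
  return float(precision), float(recall)

# Hypothetical sigmoid outputs and the true labels (1 = cat, 0 = no cat).
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

for threshold in (0.25, 0.5, 0.85):
  p, r = precision_recall(scores, labels, threshold)
  print(f'threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}')
```

On this toy data, raising the threshold to `0.85` removes every false positive but misses half of the cats, which is the kind of trade-off you would weigh against your use case.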

## Hyperbolic Tangent (tanh)

Similar to sigmoid, the hyperbolic tangent, [tanh](https://www.tensorflow.org/api_docs/python/tf/keras/activations/tanh), is a non-linear activation function that can be used in your models. The biggest difference between sigmoid and tanh is that tanh has an output range of $(-1.0, 1.0)$.

$$a=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def tanh(x):
  return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

inputs = np.linspace(-10, 10, 100)  # 100 evenly spaced points from -10 to 10
outputs = [tanh(x) for x in inputs]
_ = plt.plot(inputs, outputs)

The tanh function is generally useful in hidden layers and can be especially useful in output layers where you need to produce negative numbers.
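
The similarity between the two functions is more than visual: tanh is a sigmoid that has been stretched and shifted, $\tanh(x) = 2 \cdot sigmoid(2x) - 1$. The short check below confirms this numerically against NumPy's built-in `np.tanh`. The input range is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

inputs = np.linspace(-5, 5, 11)

# Stretch the sigmoid's (0, 1) range to (-1, 1).
rescaled_sigmoid = 2 * sigmoid(2 * inputs) - 1

print(np.allclose(rescaled_sigmoid, np.tanh(inputs)))  # True
```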

## Softmax

So far all of the activation functions that we have seen operate without knowing anything about other nodes in their layer. Each node accepts input from the layer before it and passes output to the next layer in the model. The node is unaware of any other node in its own layer, and the activation functions on the nodes work independently.

[Softmax](https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax) is a different type of activation function. Softmax is aware of nodes in the same layer and adjusts their outputs in relation to each other.

Softmax outputs values in the range of $[0.0, 1.0]$. If you were to sum the outputs of every node in a layer, the sum would always equal `1.0`, or something very close to `1.0` because of floating-point rounding.

Let's say that we had a model that tried to determine if an image contained an apple, orange, or grapefruit. If given a picture of a bright red apple, it might output `[1.0, 0.0, 0.0]` to show that it was highly confident that the image contained an apple. If given a picture of a yellow apple, it might output `[0.8, 0.15, 0.05]`, indicating a little less confidence. If given a picture of a large orange, it might output `[0.05, 0.55, 0.4]`, showing that it was having a tough time making a decision.
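
A minimal NumPy sketch of softmax follows, using the standard definition in which each output is $e^{x_i}$ divided by the sum of $e^{x_j}$ over the whole layer. The `layer_sums` values are hypothetical pre-activation values for the apple, orange, and grapefruit nodes. Subtracting the maximum before exponentiating does not change the result; it is only there to avoid overflow for large inputs.

```python
import numpy as np

def softmax(values):
  """Converts a layer's values into outputs that sum to 1."""
  shifted = values - np.max(values)  # Guards against overflow in np.exp.
  exponentials = np.exp(shifted)
  return exponentials / np.sum(exponentials)

# Hypothetical summed inputs for the apple, orange, and grapefruit nodes.
layer_sums = np.array([2.0, 1.0, 0.1])

probabilities = softmax(layer_sums)
print(probabilities)
print(probabilities.sum())
```

Unlike the earlier functions, `softmax` takes the whole layer at once, which is how each node's output can depend on its neighbors.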

It is worth noting that softmax is typically not used in hidden layers of a model. Most of the time you will see it used on the output layer.

## Exercise 2: Which Activation Function?

In this exercise we will describe a model that we are building and you will answer with the best activation function to use in the output layer and why. Be sure to talk about what your output data represents and how it will be interpreted.

1. We are building a model that predicts the stock price for a stock. Which activation function should we use in the output layer and why?

> *Your answer goes here*

2. We are building a model that classifies a lung scan image as having pneumonia or not. Which activation function should we use in the output layer and why?

> *Your answer goes here*

3. We are building a model that determines if an image of a number is 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Which activation function should we use in the output layer and why?

> *Your answer goes here*

4. We are building a model that predicts the daily change in temperature at a location. Which activation function should we use in the output layer and why?

> *Your answer goes here*

5. We are building a model that attempts to predict which single Unicode character is depicted in an image. Which activation function should we use in the output layer and why?

> *Your answer goes here*

---

### Answer Key

1. In this case a pass-through, ReLU, or Leaky ReLU would all be candidates. The ReLU and Leaky ReLU would be better choices due to their empirical performance in other models.
2. The Binary Step or Sigmoid would both be reasonable candidates. The Sigmoid is likely preferable so that we get more visibility into the confidence of the predictions. Though softmax could be used, it is typically only necessary when there are three or more possible cases.
3. Softmax is likely the best candidate in this case since we are trying to classify between ten distinct cases. It would be possible to use another function, say sigmoid, and multiply the output by 10 and truncate to an integer, but that is a more creative model than is typically seen in practice.
4. The best activation function in this case is the Hyperbolic Tangent since we need to predict positive and negative values. It could be argued that pass-through or even Leaky ReLU might be okay, but Hyperbolic Tangent is probably the better choice. Since pass-through is just a linear regression, it isn't the best candidate.
5. This is a bit of a tricky question because it really depends on how we want to interpret the output. Softmax is definitely not a good choice because there are too many Unicode characters and the layer would be too wide. If we set up the output to be the Unicode code point, ReLU or Leaky ReLU might work with the right amount of output post-processing since code points are non-negative numbers. If we set up the output so that each node is a bit in some specific encoding, say UTF-32, then we could use Binary Step or Sigmoid and classify each bit as on or off.

---