# What is an activation function?  

The activation function was inspired by the "action potential", an electrical phenomenon between two biological neurons.

Let's start with a quick biology refresher. A neuron has a cell body, an axon that allows it to send messages to other neurons, and dendrites that allow it to receive signals from other neurons.

![Image of a neuron](./img/cn.jpg)

The neuron receives signals from other neurons through the dendrites. The weight associated with a dendrite, called synaptic weight, is multiplied by the incoming signal. Dendrite signals are accumulated in the cell body and if the resulting signal strength exceeds a certain threshold, the neuron transmits the message to the axon. Otherwise, the signal is killed by the neuron and does not spread any further. The action potential is therefore the variation in signal strength indicating whether or not communication should take place.

The activation function decides whether or not to transmit the signal. In this case, it is a simple function with only one parameter: the threshold. When we learn something new, the thresholds and connection strengths (the synaptic weights) of some neurons change. This creates new connections between neurons, allowing the brain to learn new things.

Now let's see how this works in an artificial neural network. The incoming values of a neuron (x1, x2, x3, ..., xn) are multiplied by their associated weights (w1, w2, w3, ..., wn), which play the role of the synaptic weights. These products are then summed and the bias b, which plays the role of the threshold, is added: `z = w1*x1 + w2*x2 + ... + wn*xn + b`.
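The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weighted_sum` and the numeric values are assumptions made for the example.

```python
def weighted_sum(inputs, weights, bias):
    """Return w1*x1 + w2*x2 + ... + wn*xn + bias for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only: three inputs, three weights, one bias.
x = [1.0, 0.0, 0.5]
w = [0.5, -1.0, 2.0]
b = -0.5

z = weighted_sum(x, w, b)
print(z)  # 1.0
```

The value `z` is what the activation function receives as its input.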

The activation function is then applied to this sum. Its purpose is to transform the signal so that the network can model complex relationships between its inputs and its output. To do this, the activation function must be non-linear: a stack of purely linear layers is equivalent to a single linear transformation, however many layers it contains.
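To see why the non-linearity matters, here is a small sketch with one-dimensional "layers". The weights and biases are hypothetical values chosen for illustration. Two linear layers in a row give exactly the same result as one linear layer, whereas placing a ReLU between them changes the output.

```python
def linear(x, w, b):
    """A one-input linear layer: w * x + b."""
    return w * x + b


def relu(x):
    """ReLU: keep positive values, replace the rest with 0."""
    return max(0.0, x)


# Hypothetical parameters, chosen only for illustration.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)


def collapsed(x):
    # w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
    return linear(x, w2 * w1, w2 * b1 + b2)


def with_relu(x):
    return linear(relu(linear(x, w1, b1)), w2, b2)


for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), collapsed(x), with_relu(x))
```

For `x = -2.0` the two purely linear versions agree, while the version with ReLU gives a different value because the negative intermediate result is clipped to 0.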

The most commonly used examples of activation functions are sigmoid, softmax, ReLU, tanh, etc.

![activation functions](./img/reLU.png)

**Sigmoid :**  
The main reason to use the sigmoid function is that its output always lies between 0 and 1. It is therefore especially useful for models that must predict a probability, since a probability also lies between 0 and 1.
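For reference, the logistic sigmoid is commonly written as follows. The symbols σ and x are notation introduced here, not taken from the text above.

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad 0 < \sigma(x) < 1
$$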

**Tanh or hyperbolic tangent :**  
Tanh resembles the logistic sigmoid: it is also sigmoidal (S-shaped), but its output ranges from -1 to 1 and is centered on zero. The advantage is that strongly negative inputs are mapped to strongly negative outputs, and inputs near zero are mapped near zero.
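The link between tanh and the logistic sigmoid can be made explicit. In the notation assumed here, σ(x) = 1 / (1 + e^(-x)) is the logistic sigmoid, and tanh is a rescaled and shifted version of it:

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1, \qquad -1 < \tanh(x) < 1
$$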

**ReLU (Rectified Linear Unit) :**  
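ReLU is currently one of the most widely used activation functions: it appears in almost all convolutional neural networks and in many other deep learning models. It returns its input when the input is positive and 0 otherwise, that is, ReLU(x) = max(0, x).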

**Softmax :**  
The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. 
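As a reference, softmax is usually defined as below for a vector of K raw scores. The symbols z, i, j and K are notation assumed for this sketch.

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

Every output is positive and the K outputs sum to 1, which is why they can be read as class probabilities.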

This list is not exhaustive, but these are the functions used most often.

## How to choose the right activation function? 

It depends on what you want to achieve. There is no magic formula, but there are some conventions. 

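Softmax is normally not used for hidden layers, and in modern feed-forward networks tanh has largely given way to ReLU there (tanh remains common inside recurrent networks). Their output is usually not in a convenient form for the following layers to process; they are much better suited to the output neurons.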

### Let us look at this in detail
The first layer holds the inputs. The inputs are passed linearly to the first hidden layer, and whether the next neuron is activated depends on the weights of the synapses. For example, a black pixel has an input value of 1 and a white pixel has a value of 0. 
![](./img/act1.png)  

If we take the example of handwritten digit recognition, we would like only the dark pixels (and therefore the corresponding neurons) to be activated.  
![](./img/ex.jpg)

### ReLU function
ReLU only activates when the input it receives is greater than 0. So if the pixels are white, the next neuron will not activate and nothing will happen. Conversely, the darker the pixel, the larger the input, and the stronger the signal passed on to the next neuron, in direct proportion. This is usually the behaviour we want for hidden layers. Had we used tanh instead, a negative input would still have produced a non-zero (negative) output and passed a signal to the next neuron, which is not what we want here.

Remember that ReLU is mainly used for hidden layers and rarely for output layers.

![](./img/relu1.png)
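The difference between ReLU and tanh on the same inputs can be checked with a short sketch. The list `pre_activations` holds hypothetical weighted inputs reaching a hidden neuron; the values are chosen only for illustration.

```python
import math


def relu(x):
    """ReLU: pass positive inputs through unchanged, output 0 otherwise."""
    return max(0.0, x)


# Hypothetical weighted inputs reaching a hidden neuron.
pre_activations = [-0.8, 0.0, 0.3, 1.0]

for z in pre_activations:
    print(f"z={z:+.1f}  relu={relu(z):.3f}  tanh={math.tanh(z):+.3f}")
```

For the negative input, ReLU outputs 0 while tanh outputs a negative value, so only tanh passes a signal on.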


### Softmax function

The softmax function is well suited to an output layer that performs multi-class classification. Unlike tanh, it makes all the predictions sum to 1.

For example, if our model must recognize cats, dogs and hot dogs, the softmax function would give a result like this:

- Cat: 0.80
- Dog: 0.05
- Hot dog: 0.15 

The three values sum to exactly 1, or 100%. For multi-class classification where each input belongs to a single class, softmax is the standard choice for the output layer.
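A minimal sketch of softmax in plain Python follows. The raw scores in `scores` are hypothetical values, picked so that the resulting probabilities resemble the example above; they do not come from a trained model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for each class.
labels = ["cat", "dog", "hot dog"]
scores = [2.0, -0.8, 0.3]

probs = softmax(scores)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
print("sum:", round(sum(probs), 6))
```

With these scores the printed probabilities round to about 0.80, 0.05 and 0.15.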

### Tanh function 

On the other hand, if we had chosen the tanh function, we could expect a result like this:

- Cat: 0.90
- Dog: 0.55
- Hot dog: -0.43

Why? Because tanh maps each output to the range -1 to 1 independently, without normalizing it against the other neurons.

![](./img/tanh.png)

The tanh function is more suitable for other problems. For example, imagine that our model has to tell an autonomous car whether to turn right, left or go straight ahead. 
One could imagine the following. 

-1 = the steering wheel is turned fully to the left.  
0 = the steering wheel is centered.  
1 = the steering wheel is turned fully to the right.  
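The steering mapping above could be sketched as follows. The function names `steering_command` and `describe`, and the dead zone of 0.1 around the centre, are assumptions made for this illustration rather than part of any real driving system.

```python
import math


def steering_command(raw_output):
    """Squash an unbounded network output into the range -1 to 1 with tanh."""
    return math.tanh(raw_output)


def describe(command):
    # Hypothetical dead zone of 0.1 around the centre position.
    if command < -0.1:
        return "left"
    if command > 0.1:
        return "right"
    return "straight"


for raw in (-3.0, 0.0, 0.4, 5.0):
    cmd = steering_command(raw)
    print(f"raw={raw:+.1f} -> steering={cmd:+.3f} ({describe(cmd)})")
```

However large the raw output becomes, the command stays between -1 and 1, which matches the physical limits of the steering wheel.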


### Sigmoid function 
Sigmoid gives results from 0 to 1. It is particularly effective for producing probabilities (example: the probability that a house sells within a month). It is therefore also effective for binary classification.

![](./img/sig.png)
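Binary classification with a sigmoid output can be sketched like this. The helper `predict` and its default threshold of 0.5 are assumptions made for the example; the right threshold depends on the application.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow in exp() for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict(raw_output, threshold=0.5):
    """Return (probability, is_positive) for a binary classifier."""
    p = sigmoid(raw_output)
    return p, p >= threshold


for raw in (-4.0, 0.0, 2.5):
    p, is_positive = predict(raw)
    print(f"raw={raw:+.1f}  p={p:.3f}  positive={is_positive}")
```

The probability comes from the sigmoid, and the class label comes from comparing that probability with the threshold.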


## To summarize 

 1. Hidden layers will almost always use a ReLU activation function.
 2. Output layers will very often use one of these functions: softmax, sigmoid or tanh.
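These two conventions can be put together in a tiny forward pass. This is only a sketch: the layer sizes and every weight and bias below are hypothetical values, not the result of any training.

```python
import math


def relu(x):
    return max(0.0, x)


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 2 output classes.
W_hidden = [[0.5, -1.0], [1.5, 0.5], [-0.5, 1.0]]
b_hidden = [0.0, -0.5, 0.5]
W_out = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]]
b_out = [0.0, 0.0]


def forward(inputs):
    # ReLU in the hidden layer, softmax on the output layer.
    hidden = [relu(z) for z in dense(inputs, W_hidden, b_hidden)]
    return softmax(dense(hidden, W_out, b_out))


probs = forward([1.0, 0.0])
print(probs, sum(probs))
```

The output is a list of two class probabilities that sum to 1.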


![](./img/partial_derivative_notations.png)