# Activation functions

# 1.Sigmoid function

The sigmoid is a very commonly used function whose output always lies between 0 and 1. It is
usually denoted σ(z), with σ(z) = 1 / (1 + e^(-z)).

![Sigmoid.jpeg](attachment:Sigmoid.jpeg)

* It is used for binary classification problems.
* Its output lies between 0 and 1, so it can be read as the probability of the positive class.
* It is monotonic, which means the function is either entirely non-increasing or entirely non-decreasing; the sigmoid is always increasing.
* It is not zero-centered: its outputs are always positive and centered around 0.5 rather than 0.
* If z is very negative or very positive, the gradient (slope) becomes very small. This slows down gradient descent and leads to the vanishing gradient problem.
* The gradient is only reasonably large when the input is near 0, as the graph above shows.
* It is relatively expensive to compute, since it involves an exponential.
* For these reasons, Andrew Ng (co-founder of Coursera) suggests not using the sigmoid in hidden layers; it is best kept for the output layer of a binary classifier.
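
The points above about saturation can be checked numerically. This is a minimal NumPy sketch, not part of the original notes; the function names `sigmoid` and `sigmoid_grad` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)); output lies in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, s * (1 - s); largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))       # close to 0, exactly 0.5, close to 1
print(sigmoid_grad(z))  # tiny at both ends, 0.25 in the middle
```

The gradient never exceeds 0.25, which is one way to see why stacking many sigmoid layers shrinks the backpropagated signal.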

# 2.Tanh

The hyperbolic tangent function is tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

![Tanh.jpeg](attachment:Tanh.jpeg)

* The tanh function ranges from -1 to 1.
* Its outputs are centered on 0, so the mean activation of a hidden layer comes out at or very close to 0. This centering makes learning for the next layer much easier.
* It is a monotonic function and can be used for classification between two classes.
* It is also prone to the vanishing gradient problem, because it saturates for large positive or negative inputs.
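
As a quick check of the formula and the saturation behaviour described above, here is a small NumPy sketch. The helper names are illustrative; in practice `np.tanh` would normally be used directly.

```python
import numpy as np

def tanh(x):
    """tanh written out from its definition (fine for moderate |x|)."""
    x = np.asarray(x, dtype=float)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2; equals 1 at x = 0."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tanh(x), np.tanh(x)))  # the definition matches NumPy's tanh
print(tanh(x).mean())                    # symmetric inputs give a mean near 0
print(tanh_grad(np.array([0.0, 5.0])))   # gradient shrinks as |x| grows
```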

# 3.ReLU Function

* ReLU stands for Rectified Linear Unit. It returns the maximum of 0 and its input: f(x) = max(0, x).
* If the input is negative, the output is zero; if the input is positive, the output is the input itself.
* It is cheap to compute, as there is no complex math, which also makes it easier to optimize.
* Convergence with stochastic gradient descent is much faster than with the sigmoid and tanh functions.
* There is no vanishing gradient problem for large positive values.
* It is usually used in hidden layers.
* If the input is negative, both the output and the gradient are zero, so a neuron that only receives negative inputs becomes completely inactive, or dead.
* If the learning rate is too high, the weights may jump to values for which the neuron is never active on any data point again, so it stops being updated. This is called the dying ReLU problem.
* Like the sigmoid, ReLU is not zero-centered.


![Relu.jpeg](attachment:Relu.jpeg)
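
The behaviour listed above can be sketched in a few lines of NumPy. This is only an illustration; the gradient at exactly 0 is set to 0 here by convention, which is an assumption of this sketch.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (0 at x = 0 by convention)."""
    return (np.asarray(x, dtype=float) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # zero gradient for non-positive inputs: the neuron gets no update
```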

# 4.Leaky ReLU, Parametric ReLU and Randomised ReLU

* To fix dead neurons, Leaky ReLU introduces a small slope 'a' for negative inputs: f(x) = x for x > 0 and f(x) = a·x otherwise.
* As we see in the graph below, the value of 'a' is 0.01.
* Its output ranges from -infinity to infinity.
* If 'a' is sampled at random from a given range during training instead of being fixed, it is called Randomised ReLU.
* If 'a' is learnt through back propagation, it is Parametric ReLU.

![Leaky%20Relu.jpeg](attachment:Leaky%20Relu.jpeg)
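
The three variants share the same forward pass and differ only in where 'a' comes from. The sketch below assumes NumPy; the sampling bounds `low` and `high` and the helper `prelu_grad_a` are hypothetical choices made for illustration, not values from these notes.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for x > 0, a * x otherwise (a is a fixed constant)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, a * x)

def randomized_relu(x, low=0.1, high=0.3, rng=None):
    """Randomised ReLU (training time): the slope is drawn from U(low, high).
    The bounds are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(low, high)
    return leaky_relu(x, a), a

def prelu_grad_a(x):
    """Parametric ReLU: derivative of the output with respect to the slope a.
    It is x for non-positive inputs and 0 otherwise, so a can be learnt."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 0.0, x)

x = np.array([-100.0, -1.0, 0.0, 5.0])
print(leaky_relu(x))  # negative inputs are scaled by 0.01 instead of zeroed
out, a = randomized_relu(x, rng=np.random.default_rng(0))
print(a, out)         # sampled slope and the resulting activations
print(prelu_grad_a(x))
```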

# 5. ELU (Exponential Linear Units) function

* The Exponential Linear Unit can give higher classification accuracy than the traditional ReLU.
* It follows the same rule as ReLU for x >= 0. For x < 0 it follows the exponential curve alpha·(e^x - 1), which flattens out smoothly towards -alpha.
* ELU tries to make the mean activations closer to zero, which speeds up training.

![ELU.jpeg](attachment:ELU.jpeg)

* Since it uses an exponential function for negative values, the gradient there is not zero and dead neurons can be avoided.
* It produces negative outputs, which helps the network update the weights and biases in the right direction.
* It takes longer to compute, because it involves an exponential operation.
* It does not avoid the exploding gradient problem.
* The alpha value is a fixed hyperparameter; it is not learnt by the neural network.
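
A minimal NumPy sketch of ELU, assuming the common default alpha = 1.0 (the notes do not state a value). `np.expm1` computes e^x - 1 accurately for small x.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on the branch that is discarded
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-100.0, -1.0, 0.0, 2.0])
print(elu(x))  # negative side saturates near -alpha, positive side is the identity
```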

# 6.SELU (Scaled Exponential Linear Unit):

* Let's look at the formula for this activation function first.

![SELU.jpeg](attachment:SELU.jpeg)

* The lambda and alpha values were calculated by the authors:
    * alpha ≈ 1.6732632423543772848170429916717
    * λ ≈ 1.0507009873554804934193349852946
* As you see in the above formula, if the input x is greater than zero, the output is x multiplied by λ.
* When x is less than zero, the output is λ·(alpha·e^x - alpha): alpha times the exponential of x, minus alpha, all multiplied by lambda.
* With these constants the activations are self-normalizing, which is intended to avoid the vanishing and exploding gradient problems.
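
The formula described above can be sketched with NumPy as follows. The constants are the ones quoted in the list, truncated to double precision; the sample size and seed are arbitrary choices for this illustration.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

# Feed standard-normal inputs through SELU and inspect the output statistics.
rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
y = selu(z)
print(y.mean(), y.var())  # expected to stay near 0 and 1 respectively
```

For standard-normal inputs the output mean and variance should stay close to 0 and 1, which is the property the constants were chosen for.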
  

# 7.Maxout activation

The Maxout activation function is defined by the following formula:

![Maxout.jpeg](attachment:Maxout.jpeg)

* The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function.
* It is a piecewise linear function that returns the maximum of several learned linear functions of the input, and it was designed to be used in conjunction with the dropout regularization technique.
* Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
* However, with two linear pieces it doubles the number of parameters for each neuron, and hence a higher total number of parameters needs to be trained.
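
To make the "special case" claim concrete, the sketch below builds a Maxout unit with two linear pieces and fixes the second piece to zero, which reproduces ReLU. The weights `w` and the test inputs are hypothetical values picked for illustration.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the maximum over k affine pieces W[i] @ x + b[i].
    x has shape (d,), W has shape (k, d), b has shape (k,)."""
    return np.max(W @ x + b)

w = np.array([1.0, -2.0])        # hypothetical weights of one ReLU neuron
W = np.stack([w, np.zeros(2)])   # second piece is the constant 0
b = np.zeros(2)

for x in (np.array([3.0, 1.0]), np.array([1.0, 3.0])):
    # both values agree: maxout with a zero piece behaves like ReLU(w @ x)
    print(maxout(x, W, b), max(0.0, w @ x))
```

Note that `W` stores two weight vectors where a plain ReLU neuron stores one, which is the parameter doubling mentioned above.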

# 8.Softmax 

* Softmax normalizes a vector of input values into a vector of values that follows a probability distribution, so the outputs sum to one.
* Each output value lies in the range [0, 1], which is useful because we are not limited to binary classification and can accommodate as many classes or dimensions as needed in our neural network model.
* This is why a softmax classifier is sometimes referred to as multinomial logistic regression.

![Softmax.jpeg](attachment:Softmax.jpeg)
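
A minimal NumPy sketch of softmax follows. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result. The example scores are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; outputs are non-negative and sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for three classes
p = softmax(scores)
print(p, p.sum())  # class probabilities and their total
```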

# 9. Swish (A Self-Gated) Function

* Swish is the multiplication of x with the sigmoid function: f(x) = x · σ(x).
* Swish is a smooth function. That means that it does not abruptly change direction like ReLU does at x = 0. Rather, it smoothly bends from 0 towards values < 0 and then upwards again.
* This function is non-monotonic: unlike the ReLU, tanh and sigmoid activation functions, it does not move in one direction only.
* The output is unbounded above, so it does not saturate for large positive inputs and the gradient does not vanish there.


![Swish.jpeg](attachment:Swish.jpeg)
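
The non-monotonic dip described above is easy to see numerically. This is a small NumPy sketch; the sample points are arbitrary.

```python
import numpy as np

def swish(x):
    """Swish: x * sigmoid(x)."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))
# The value at -1 is lower than at both -5 and 0: the curve dips and comes back up.
```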

# 10.Softplus

* This function ranges from 0 to +infinity.
* It is similar to ReLU but is a smooth and continuous function.
* Softplus function: f(x) = ln(1 + e^x)

![Softplus.jpeg](attachment:Softplus.jpeg)
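
A short NumPy sketch of softplus. Computing ln(1 + e^x) directly overflows for large x, so this version uses `np.logaddexp(0, x)`, which evaluates log(e^0 + e^x) in a stable way; that choice is an implementation detail added here.

```python
import numpy as np

def softplus(x):
    """Softplus ln(1 + exp(x)), computed stably as logaddexp(0, x)."""
    return np.logaddexp(0.0, np.asarray(x, dtype=float))

x = np.array([-1000.0, 0.0, 1000.0])
print(softplus(x))  # near 0 for very negative x, close to x for very positive x
```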

# Loss Functions

## 1.Mean Squared Error Loss(MSE)

* The MSE is calculated as the average of the squared differences between the predicted and actual values.
* The result is never negative, regardless of the sign of the predicted and actual values.
* The main disadvantage of MSE is that errors are squared, so a single outlier can make the loss huge.

![MSE.jpeg](attachment:MSE.jpeg)
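
The outlier sensitivity mentioned above can be illustrated with a few made-up numbers. This is a minimal NumPy sketch; `mse` is a name chosen here.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of (y_true - y_pred)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [1.0, 2.0, 3.0]
print(mse(y_true, [1.0, 2.0, 5.0]))   # one error of size 2
print(mse(y_true, [1.0, 2.0, 23.0]))  # one error of size 20 dominates the loss
```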

## 2.Mean Absolute Error(MAE)

* The MAE is defined as follows.

![MAE.jpeg](attachment:MAE.jpeg)

* It is the average of the absolute differences between the predicted and actual values.
* It is used in regression problems.
* It measures the average magnitude of the errors in a set of predictions, without considering their direction.
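
A minimal NumPy sketch of MAE with made-up numbers, showing that over- and under-predictions of the same size contribute equally:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

y_true = [10.0, 10.0]
print(mae(y_true, [8.0, 12.0]))   # one under-prediction, one over-prediction
print(mae(y_true, [12.0, 12.0]))  # same magnitudes, same loss
```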

## 3.Mean Absolute Percentage Error(MAPE)

* The MAPE is defined by

![MAPE.jpeg](attachment:MAPE.jpeg)

* It is commonly used as a loss function for regression problems.
* The difference between y and predicted y is divided by the actual y.
* The absolute value of this ratio is summed over every predicted y and divided by the number of fitted points n.
* Multiplying by 100 makes it a percentage error.
* It cannot be used if any actual value is zero, because there would be a division by zero.
* For predictions that are too low (but not negative), the percentage error cannot exceed 100%, but for predictions that are too high there is no upper limit to the percentage error.
* MAPE therefore puts a heavier penalty on negative errors (y < predicted y) than on positive errors.
* As a consequence, when MAPE is used to compare the accuracy of prediction methods, it is biased in that it will systematically select a method whose forecasts are too low.
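
The asymmetry described above shows up with a single made-up data point. The sketch below assumes NumPy; raising an error on zero actual values is a design choice of this sketch, not something prescribed by the notes.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape([100.0], [0.0]))    # lowest possible non-negative forecast: 100%
print(mape([100.0], [300.0]))  # an over-forecast can exceed 100%
```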


## 4.Mean Squared Logarithmic Error(MSLE)

* The MSLE is a variant of the Mean Squared Error.
* It is used in regression problems.
* The logarithm in MSLE measures the relative difference between the true and predicted values, in other words the percentage difference between them.
* MSLE will treat small differences between small true and predicted values approximately the same as big differences between large true and predicted values.
* The problem with MSLE is that it penalizes underestimates more than overestimates, introducing an asymmetry in the error curve.


![MSLE%20formula.jpeg](attachment:MSLE%20formula.jpeg)

![MSLE%20table.jpeg](attachment:MSLE%20table.jpeg)
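
The relative-error behaviour and the asymmetry can be reproduced with a few made-up numbers. This sketch assumes the common form that takes log(1 + y) of both values (via `np.log1p`), which may differ in detail from the formula in the figure.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error, using log(1 + y) on both arguments."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print(msle([100.0], [50.0]))   # underestimate by 50
print(msle([100.0], [150.0]))  # overestimate by 50: smaller loss
print(msle([10.0], [20.0]), msle([1000.0], [2000.0]))  # similar relative error
```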

## 5.Binary Crossentropy

* This loss function is used in binary classification, which means yes/no decisions.
* The loss tells you how wrong your model’s predictions are.
* It also applies to multi-label problems, where an example can belong to multiple classes at the same time and the model tries to decide for each class whether the example belongs to that class or not.
* Binary crossentropy measures how far the prediction is from the true value (which is either 0 or 1) for each of the classes, and then averages these class-wise errors to obtain the final loss.

![Binary%20cross%20entropy.jpeg](attachment:Binary%20cross%20entropy.jpeg)
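
A minimal NumPy sketch of binary crossentropy. Clipping the probabilities with a small `eps` is an assumption added here to avoid log(0); the labels and predictions are made up.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-12):
    """Average of -[y * log(p) + (1 - y) * log(1 - p)] over all entries."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = [1.0, 0.0]
print(binary_crossentropy(y, [0.9, 0.1]))  # confident and correct: low loss
print(binary_crossentropy(y, [0.5, 0.5]))  # undecided: loss equals ln(2)
print(binary_crossentropy(y, [0.1, 0.9]))  # confident and wrong: high loss
```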

## 6.Categorical Crossentropy

* Categorical crossentropy is a loss function that is used for single label categorization. 
* This is when only one category is applicable for each data point. In other words, an example can belong to one class only.
* Categorical crossentropy will compare the distribution of the predictions (the activations in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and 0 for the other classes. 
* To put it in a different way, the true class is represented as a one-hot encoded vector, and the closer the model’s outputs are to that vector, the lower the loss.

![Categorical%20cross%20entropy.jpeg](attachment:Categorical%20cross%20entropy.jpeg)
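
With a one-hot target, only the predicted probability of the true class contributes to the loss. The sketch below assumes NumPy and predictions that already sum to one per row (for example softmax outputs); the `eps` clipping is an added safeguard.

```python
import numpy as np

def categorical_crossentropy(y_onehot, p_pred, eps=1e-12):
    """Average over samples of -sum_c y_c * log(p_c)."""
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=-1))

y = [[0.0, 1.0, 0.0]]  # true class is the second one
print(categorical_crossentropy(y, [[0.1, 0.8, 0.1]]))  # close to the one-hot vector
print(categorical_crossentropy(y, [[0.6, 0.2, 0.2]]))  # further away: higher loss
```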

## 7.Poisson loss function

* The Poisson loss function is used for regression when modelling count data.
* Minimizing the Poisson loss is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a Poisson distribution, conditioned on the input.


![Posson%20loss.jpeg](attachment:Posson%20loss.jpeg)
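
The negative log-likelihood of a Poisson distribution with rate ŷ is ŷ - y·log(ŷ) + log(y!). The last term does not depend on the prediction, so the sketch below drops it. This is an assumption about the exact form (the figure may differ), and `eps` is added only to avoid log(0).

```python
import numpy as np

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Poisson loss: mean of y_pred - y_true * log(y_pred), constant term dropped."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(y_pred - y_true * np.log(y_pred + eps))

# For an observed count of 3, the loss is lowest when the predicted rate is 3.
for rate in (2.0, 3.0, 4.0):
    print(rate, poisson_loss([3.0], [rate]))
```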

## 8.Squared hinge loss function

* The squared hinge loss is a loss function used for “maximum margin” binary classification problems. 
* Mathematically it is defined as:

![Squared%20hinge.jpeg](attachment:Squared%20hinge.jpeg)

* Minimizing the hinge loss pushes the classifier, during training, towards the classification boundary that is as far as possible from the data points of each class.
* In other words, it looks for the classification boundary with the maximum margin between the data points of the different classes.
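
A minimal NumPy sketch of the squared hinge loss, assuming labels encoded as -1 and +1 and raw (unsquashed) scores as predictions. The example scores are made up.

```python
import numpy as np

def squared_hinge(y_true, y_score):
    """Squared hinge loss: mean of max(0, 1 - y * score)^2, with y in {-1, +1}."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(y_score, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y * s) ** 2)

y = [1.0, -1.0]
print(squared_hinge(y, [2.0, -3.0]))  # both outside the margin: zero loss
print(squared_hinge(y, [2.0, -0.5]))  # correct side but inside the margin: penalized
print(squared_hinge(y, [2.0, 0.5]))   # wrong side: penalized more
```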

## 9.Huber loss 

* Huber loss is less sensitive to outliers in data than the squared error loss. 
* Unlike absolute error, it is also differentiable at 0. It is basically absolute error, which becomes quadratic when the error is small.
* How small that error has to be to make it quadratic depends on a hyperparameter, 𝛿 (delta), which can be tuned.
* Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers).

![Huber%20loss.jpeg](attachment:Huber%20loss.jpeg)
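
The switch between the quadratic and linear regimes can be sketched as below. This assumes the standard piecewise definition (0.5·e² for |e| ≤ 𝛿, and 𝛿·(|e| - 0.5·𝛿) otherwise) and a default `delta` of 1.0 chosen for illustration.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    e = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * e ** 2
    linear = delta * (e - 0.5 * delta)
    return np.mean(np.where(e <= delta, quadratic, linear))

print(huber([0.0], [0.5]))   # small error: behaves like squared error
print(huber([0.0], [3.0]))   # large error: grows linearly, like absolute error
print(huber([0.0], [30.0]))  # an outlier costs far less than its squared error
```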

## 10.Log-Cosh loss function

* The log-cosh loss function is used for regression problems.
* It is less sensitive to outliers than the L2 loss function.
* log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. 
* This means that 'logcosh' works mostly like the mean squared error, but will not be so strongly affected by the occasional incorrect prediction.

![Log-cosh.jpeg](attachment:Log-cosh.jpeg)
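
The two approximations quoted above can be checked numerically. Calling `np.cosh` directly overflows for large errors, so this sketch uses the equivalent form |x| + log(1 + e^(-2|x|)) - log(2); that rewrite is an implementation choice added here.

```python
import numpy as np

def logcosh(y_true, y_pred):
    """Log-cosh loss: mean of log(cosh(error)), computed in a stable form."""
    x = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return np.mean(x + np.log1p(np.exp(-2.0 * x)) - np.log(2.0))

print(logcosh([0.0], [0.001]), 0.001 ** 2 / 2)      # small error: about x^2 / 2
print(logcosh([0.0], [1000.0]), 1000.0 - np.log(2))  # large error: about |x| - log(2)
```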