
Index


◼️ Regression Loss

◼️ Classification Loss

◼️ References

⬛ Loss Functions

A loss function is a method of evaluating how well your algorithm models your dataset. The better the prediction, the smaller the loss.

The loss (error) is computed for a single training example. If we have ‘m’ examples, then the average of the loss over the entire training set is called the cost function.

$$\Large \mathrm{Cost\ function\ (J)}\mathbf{\ ={\color{Purple}\frac{1}{m} \sum_{i=1}^{m}\mathcal{L}^{(i)}}}$$

where $\mathcal{L}^{(i)}$ is the loss on the $i^{th}$ training example.

Depending on the problem, the cost function can be formed in many different ways.

The purpose of a cost function is to be either:

  • Minimized — the returned value is usually called cost, loss, or error. The goal is to find the values of the model parameters for which the cost function returns as small a number as possible.
  • Maximized — the value it yields is then called a reward. The goal is to find the values of the model parameters for which the returned number is as large as possible.

In other words, the terms cost function and loss function refer to almost the same thing. However, the loss function mainly applies to a single training example, whereas the cost function deals with the penalty over a number of training examples or the complete batch. It is also sometimes called an error function. In short, we can say that the loss function is a part of the cost function.

The cost function is calculated as the average of the loss functions. The loss function is a value that is calculated for every instance, so within a single training cycle the loss is calculated numerous times, while the cost function is only calculated once (see the sketch after the examples below).

Example 1:

  • One of the loss functions used in Linear Regression: the square loss
  • One of the cost functions used in Linear Regression: the Mean Squared Error

Example 2:

  • One of the loss functions used in SVM: the hinge loss
  • SVM cost function
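
The distinction can be made concrete with a minimal NumPy sketch (the array values below are made up purely for illustration): the square loss from Example 1 is computed once per training example, while the Mean Squared Error cost is its average over the whole set.

```python
import numpy as np

# Hypothetical true targets and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Loss: square loss, one value per training example
per_example_loss = (y_true - y_pred) ** 2

# Cost: Mean Squared Error, a single value for the whole training set
cost = per_example_loss.mean()

print(per_example_loss)  # [0.25 0.25 0.   1.  ]
print(cost)              # 0.375
```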

⬛ Loss-Cost/Error:

When performing supervised training, a neural network’s actual output must be compared against the ideal output specified in the training data. The difference between actual and ideal output is the error of the neural network. Error calculation occurs at two levels.

🔲 Local Error

First, there is the local error. This is the difference between the actual output of one individual neuron and the ideal output that was expected. The local error is calculated using an error function.

🔲 Global Error

The local errors are aggregated together to form a global error. The global error is a measurement of how well a neural network performs on the entire training set. There are several different means by which a global error can be calculated; the most common ones are discussed below.

🔲 Generalization Error or out-of-sample Error

The only way to know how well a model will generalize to new cases is to actually try it out on new cases, by splitting your data into two sets:

  1. Training set
  2. Test set

As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.
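
To make this concrete, here is a small illustrative sketch (not from the original text) using NumPy's polyfit: an over-flexible polynomial fitted to a handful of noisy points reaches a low training error, while its error on the held-out test set is typically much higher, which is the overfitting symptom described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: y = 2x + noise
x = rng.uniform(-1, 1, size=30)
y = 2 * x + rng.normal(scale=0.3, size=30)

# Split the data into a training set and a test set
train_x, test_x = x[:20], x[20:]
train_y, test_y = y[:20], y[20:]

# Fit a deliberately over-flexible degree-9 polynomial to the training set
coeffs = np.polyfit(train_x, train_y, deg=9)

train_error = np.mean((np.polyval(coeffs, train_x) - train_y) ** 2)  # training error
test_error = np.mean((np.polyval(coeffs, test_x) - test_y) ** 2)     # generalization estimate

# The training MSE is small; the test MSE is typically noticeably larger
print(f"training MSE: {train_error:.4f}, test MSE: {test_error:.4f}")
```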

⬛ Different Regression and Classification Errors

*(Figure: comparison of the different loss functions.)*

⬛ Regression Errors:

🔲 L1 and L2 loss or Mean-Absolute-Error and Mean-Squared-Error

L1 and L2 are two common loss functions in Machine Learning which are mainly used to minimize the error. They are the following:

  1. $\large{\color{Purple}\textrm{L1-Loss function and Mean-Absolute-Error(MAE)}}$
  2. $\large{\color{Purple}\textrm{L2-loss function or Mean-Squared-Error(MSE)}}$

🔲 1. L1-Loss function or Least-Absolute-Deviations(LAD) and Mean-Absolute-Error(MAE)

It is used to minimize the error, which is the sum of all the absolute differences between the true value and the predicted value.

♠️ Math equation for L1-Loss:

$$\Large \mathrm{\ L1-loss}\mathbf{\ ={\color{Purple} \sum_{i=1}^{n}|y_{true}-y_{predicted}|} }$$

♠️ Math equation for Mean-Absolute-Error(MAE):

$$\Large \mathrm{\ MAE}\mathbf{\ ={\color{Purple}\frac{1}{n} \sum_{i=1}^{n}|y_{i}-\hat{y_{i}}|}}$$
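
A minimal NumPy sketch of MAE, matching the formula above (the array values are just for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average of the absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y_true, y_pred))  # 0.5
```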

🔲 2. L2-Loss or Least Squared Errors(LS) and Mean Squared Error(MSE)

It is also used to minimize the error, which is the sum of all the squared differences between the true value and the predicted value.

♠️ Math equation for L2-Loss:

$$\Large \mathrm{L2-loss}\mathbf{\ ={\color{Purple} \sum_{i=1}^{n}(y_{true}-y_{predicted})^2}}$$

♠️ Math equation for Mean-Squared-Error(MSE):

$$\Large \mathrm{MSE}\mathbf{\ ={\color{Purple}\frac{1}{n} \sum_{i=1}^{n}(y_{i}-\hat{y_i})^2}}$$
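
And the corresponding NumPy sketch for MSE, using the same illustrative values:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
```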

♠️ Disadvantages:

The disadvantage of the L2 norm is that when there are outliers, these points account for the main component of the loss. For example, suppose the true value is 1 and the model makes 10 predictions: one predicted value is 1000 and the other predicted values are about 1. Clearly the loss is then dominated by the single prediction of 1000.
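
This sensitivity can be checked numerically with a small sketch (the numbers below mirror the made-up example above): a single outlying prediction inflates the MSE far more than the MAE.

```python
import numpy as np

y_true = np.full(10, 1.0)    # the true value is 1 for all 10 examples
y_pred = np.full(10, 1.1)    # nine predictions are close to 1 ...
y_pred[0] = 1000.0           # ... and one prediction is a huge outlier

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

# The single outlier dominates the squared loss far more than the absolute loss
print(f"MSE: {mse:.2f}")   # about 99800.11
print(f"MAE: {mae:.2f}")   # about 99.99
```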

Huber loss

Huber Loss is often used in regression problems. Compared with the L2 loss, Huber Loss is less sensitive to outliers (because it is a piecewise function: when the residual is too large, the loss is a linear function of the residual).

$$\Large \mathbf{L_{\delta} (y,f(x))=} \begin{cases} \mathbf{{\color{Purple}\frac{1}{2}(y-f(x))^2}}& \textrm{for } |y-f(x)|\leq \delta, \\ \mathbf{{\color{Purple}\delta\ |y-f(x)|-\frac{1}{2}\delta ^2}}& \textrm{otherwise.} \end{cases}$$

Here, δ is a set parameter, y represents the true value, and f(x) represents the predicted value.

The advantage of this is that when the residual is small, the loss function behaves like the L2 norm, and when the residual is large, it is a linear function like the L1 norm.
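
A small NumPy sketch of this piecewise definition (the threshold delta=1.0 and the residuals are just illustrative choices):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic (L2-like) for small residuals, linear (L1-like) for large ones."""
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.where(residual <= delta, quadratic, linear)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.2, 2.5, 10.0])
print(huber_loss(y_true, y_pred))  # roughly [0.02 1.   8.5 ]
```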

Hinge Loss

Hinge loss is often used for binary classification problems, for example with ground truth t = 1 or -1 and predicted value y = wx + b.

In the SVM classifier, the definition of the hinge loss is

$$\Large \mathrm{hinge}\mathbf{(t,y)\ ={\color{Purple}\ \max(0,\ 1-t\cdot y)}}$$

In other words, the larger the margin t·y is, the smaller the loss; once t·y ≥ 1 the loss is exactly zero.
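
A minimal NumPy sketch of the hinge loss, using labels t ∈ {-1, +1} and raw scores y = wx + b as described above (the numbers are made up):

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss: zero once the margin t*y reaches 1, linear in the violation otherwise."""
    return np.maximum(0.0, 1.0 - t * y)

t = np.array([1, 1, -1, -1])          # ground-truth labels
y = np.array([2.0, 0.3, -0.5, 1.5])   # raw classifier scores w*x + b
print(hinge_loss(t, y))               # [0.  0.7 0.5 2.5]
```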

Cross-entropy loss


Categorical Crossentropy

Categorical crossentropy is a loss function that is used in multi-class classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one.

Formally, it is designed to quantify the difference between two probability distributions.

Math equation:

$$\Large \mathrm{CCE}\mathbf{\ ={\color{Purple} -\sum_{i=1}^{C} y_{i}\ \log(\hat{y_{i}})}}$$

where C is the number of classes, $y_{i}$ is the true (one-hot) probability of class i, and $\hat{y_{i}}$ is the predicted probability of class i.

How to use Categorical Crossentropy

The categorical crossentropy is well suited to classification tasks, since one example can be considered to belong to a specific category with probability 1, and to other categories with probability 0.

Example: The MNIST number recognition tutorial, where you have images of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. The model uses the categorical crossentropy to learn to give a high probability to the correct digit and a low probability to the other digits.

Activation functions

Softmax is the only activation function recommended for use with the categorical crossentropy loss function.

Strictly speaking, the output of the model only needs to be positive so that the logarithm of every output value exists. However, the main appeal of this loss function is for comparing two probability distributions. The softmax activation rescales the model output so that it has the right properties.
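
As a small framework-agnostic NumPy sketch (the logits and target below are illustrative), softmax first rescales the raw model outputs into a probability distribution, and the categorical crossentropy then compares that distribution against the one-hot target:

```python
import numpy as np

def softmax(logits):
    """Rescale raw scores into a probability distribution."""
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / exp.sum()

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Crossentropy between a one-hot target and a predicted distribution."""
    return -np.sum(y_true * np.log(y_pred + eps))

logits = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
y_true = np.array([1.0, 0.0, 0.0])   # one-hot target: the first class is correct

y_pred = softmax(logits)
print(y_pred)                                    # roughly [0.659 0.242 0.099]
print(categorical_crossentropy(y_true, y_pred))  # about 0.417
```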

Binary crossentropy

Binary crossentropy is a loss function that is used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right). Several independent such questions can be answered at the same time, as in multi-label classification or in binary image segmentation.

Formally, this loss is equal to the average of the categorical crossentropy loss on many two-category tasks.

Math equation:

$$\Large \mathrm{BCE}\mathbf{\ ={\color{Purple} -\frac{1}{N}\sum_{i=1}^{N}\left [ y_{i}\ \log(\hat{y_{i}}) + (1-y_{i})\ \log(1-\hat{y_{i}}) \right ]}}$$

where N is the number of binary questions, $y_{i}$ is the true label (0 or 1), and $\hat{y_{i}}$ is the predicted probability.

How to use binary crossentropy

The binary crossentropy is very convenient to train a model to solve many classification problems at the same time, if each classification can be reduced to a binary choice (i.e. yes or no, A or B, 0 or 1).

Example: The build your own music critic tutorial contains music data and 46 labels like Happy, Hopeful, Laid back, Relaxing etc. The model uses the binary crossentropy to learn to tag songs with every applicable label.

Activation functions

Sigmoid is the only activation function compatible with the binary crossentropy loss function. You must use it on the last block before the target block.

The binary crossentropy needs to compute the logarithms of $\hat{y_{i}}$ and $(1-\hat{y_{i}})$, which only exist if $\hat{y_{i}}$ is between 0 and 1.

The sigmoid activation function is the only one to guarantee that independent outputs lie within this range.
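
And a matching NumPy sketch for the binary case (again with illustrative numbers): the sigmoid squashes each raw score into the (0, 1) range, after which the binary crossentropy is well defined.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Average binary crossentropy over all yes/no questions."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

logits = np.array([2.0, -1.0, 0.5])  # raw model outputs for 3 independent yes/no questions
y_true = np.array([1.0, 0.0, 1.0])   # true answers

y_pred = sigmoid(logits)
print(y_pred)                                # roughly [0.881 0.269 0.622]
print(binary_crossentropy(y_true, y_pred))   # about 0.305
```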

References:
