In [16]:
# set up the matplotlib graphics library and configure it to show
# figures inline in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Loss Functions
In this article, we focus on the choice of loss functions for regression and classification problems. First, we justify minimizing the mean squared error (MSE) as the loss function for linear regression. We start from the formulation of regression as a task with an infinite number of possible outcomes, which naturally suggests a continuous loss function. In contrast, a classification problem has a discrete set of outcomes, and its loss function is of a different nature. We introduce a measure of dissimilarity on the space of distributions that approximate the classification targets and show that, although this quantity does not have all the properties of a metric, it can still serve as a “distance” between distributions and can therefore be used as a loss function. This quantity is called cross-entropy, and it is widely used in machine learning libraries (TensorFlow, PyTorch) for building classification models.

## Maximum likelihood estimation (MLE) and KL Divergence {#KLD}
There are many ways to introduce a metric on a space of distributions. Some metrics were borrowed from functional analysis, while others were introduced for special cases because of their particular properties. The latter include pre-metrics, which satisfy only some of the metric axioms but are nevertheless often used to define a topology on the space of distributions and, to some extent, play the role of a distance on it. One such pre-metric comes from information theory: the Kullback-Leibler divergence. For discrete distributions, it is defined as


$$D_{KL}(P||Q)  =\sum_{x \in X} P(x) \log(\frac{P(x)}{Q(x)})$$

And for continuous distributions:

$$D_{KL}(P||Q)=\int_{-\infty}^{\infty} p(x) \log(\frac{p(x)}{q(x)}) dx$$

This divergence is not symmetric and does not satisfy the triangle inequality. In general,
$$D_{KL}(P||Q) \neq D_{KL} (Q||P)$$

The only property that the Kullback-Leibler divergence shares with a metric is that it is non-negative and equals zero only when $P=Q$ almost everywhere.
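The properties above can be checked numerically. The following is a minimal sketch, assuming two hypothetical distributions `p` and `q` over three outcomes and the natural logarithm; the helper name `kl_divergence` is ours and does not come from any library.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P||Q) in nats; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # convention: 0 * log(0 / q) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over three outcomes (illustrative values)
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print("D_KL(P||Q) =", kl_divergence(p, q))
print("D_KL(Q||P) =", kl_divergence(q, p))  # differs from the line above
print("D_KL(P||P) =", kl_divergence(p, p))  # zero for identical distributions
```

The sketch assumes that $Q(x) > 0$ wherever $P(x) > 0$; otherwise the divergence is infinite and NumPy will emit a division warning.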

To explain the meaning of this quantity, let us step back and formalize an intuitive idea: the less likely an event is, the more information its occurrence carries.

This is expressed well by the function on the graph below:
![](https://raw.githubusercontent.com/olegkleiman/ml_theory/master/images/MinusLog.png)

The probability of the event is plotted on the horizontal axis and its “amount of information” on the vertical axis. On the interval $0 < x \le 1$, this function matches the intuition described above:

1. It equals 0 at $x = 1$, the maximum possible probability; that is, an event that is certain to happen (probability 1) carries no information.
2. The lower the probability of an event, the greater its information: $\lim_{x \to +0} (-\log x) = \infty$
3. It is non-negative on the whole interval: $\forall x \in (0, 1]: -\log x \ge 0$

This quantity was introduced by C. Shannon in his seminal work [^4] and is called the *self-information* of an event:
$$I(x)=-\log p(x)$$
where the negative sign ensures that the information is positive or zero.
The choice of the logarithm base is arbitrary, but by convention in information theory base 2 is used, so that information is measured in bits.
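As a small illustration, self-information can be computed directly from its definition. This is a sketch only; the function name `self_information` and the sample probabilities are chosen for the example.

```python
import numpy as np

def self_information(prob, base=2.0):
    """Self-information -log_base(prob) for probabilities in (0, 1]."""
    prob = np.asarray(prob, dtype=float)
    # log(1 / prob) equals -log(prob); dividing by log(base) changes the base
    return np.log(1.0 / prob) / np.log(base)

# Illustrative probabilities, from a certain event to a rare one
probs = np.array([1.0, 0.5, 0.25, 0.01])
bits = self_information(probs)

for pr, b in zip(probs, bits):
    print(f"p = {pr:<5} -> I = {b:.3f} bits")
```

With base 2, a certain event carries 0 bits, an event of probability 1/2 carries 1 bit, and each further halving of the probability adds one more bit.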

It is readily extended from a single event to an entire (discrete) distribution:
$$H(X)=-\sum_{i=1}^m p(x_i) \log p(x_i)$$

In this case, it is called the *entropy (information entropy)* of the random variable $X$, and it is the expectation of the self-information with respect to the distribution of $X$.
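The definition translates into a few lines of code. Below is a minimal sketch that computes the entropy in bits for three hypothetical coins; `entropy_bits` is an illustrative helper, not a library function.

```python
import numpy as np

def entropy_bits(p):
    """Entropy -sum p(x_i) log2 p(x_i) of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log 0 = 0
    # Expectation of the self-information log2(1 / p) under p itself
    return float(np.sum(p * np.log2(1.0 / p)))

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain_coin = [1.0, 0.0]

print("fair coin:   ", entropy_bits(fair_coin))
print("biased coin: ", entropy_bits(biased_coin))
print("certain coin:", entropy_bits(certain_coin))
```

The fair coin has the highest entropy of the three, and the coin whose outcome is known in advance has zero entropy.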

Viewing entropy as a measure of the disorder or uncertainty of a distribution, we now note its behavior for some well-known distributions.
1. In general, a non-uniform distribution has lower entropy than a uniform one.
2. The uniform distribution has the largest entropy among all distributions on $n$ outcomes:
$$H(P) = - \sum_{i=1}^n \frac{1}{n} \log \frac{1}{n} = - \frac{n}{n} ( - \log n) = \log n$$
3. Distributions $p(x_i) $ that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
4. The entropy of a Gaussian distribution with standard deviation $\sigma$:
$$H(P)=\ln(\sigma \sqrt{2 \pi e })$$
is independent of the mean. (It is [calculated](http://cito-web.yspu.org/link1/metod/theory/node30.html) using the discrete Abel transform, or integration by parts in the continuous case.) We see again that the entropy increases as the distribution becomes broader, i.e., as $\sigma$ increases.

5. The Laplace distribution (double exponential), which often arises as the limit distribution when summing a random number of random variables, has density $\frac{\lambda}{2} e^{- \lambda |x-a|}$ and entropy:
$$H(X) = - \int_{-\infty}^{+\infty} \frac{\lambda}{2} e^{- \lambda |x-a| } \log \left( \frac{\lambda}{2} e^{- \lambda |x-a|} \right) dx= \log \frac{2e}{\lambda}$$
which is also independent of the mean. ([Calculated](http://cito-web.yspu.org/link1/metod/theory/node30.html) in the same way.)
6. Finally, the entropy of the binomial distribution:
$$H(X)= \sum_{m=0}^{n} C_n^m p^m q^{n-m} \log ( C_n^m p^m q^{n-m})^{-1}= - \sum \limits_{m = 0}^n {C_n^m p^mq^{n - m}\left[ {\log C_n^m + m\log p +(n - m)\log q} \right]} =$$
$$=- \sum\limits_{m = 1}^n {C_n^m p^mq^{n - m}\log C_n^m } - n(p\log p + q\log q).$$
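The entropy values listed above can be cross-checked numerically. The sketch below is an illustration under stated assumptions: all parameter values are hypothetical, natural logarithms are used throughout, the continuous entropies are approximated by a plain Riemann sum on a truncated interval, and the Laplace density is taken in its normalized form $\frac{\lambda}{2} e^{-\lambda |x-a|}$, whose entropy is $\log \frac{2e}{\lambda}$.

```python
import math
import numpy as np

def entropy(p):
    """Discrete entropy in nats; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def differential_entropy(pdf, lo, hi, num=200_001):
    """Riemann-sum approximation of -integral pdf(x) ln pdf(x) dx on [lo, hi]."""
    x = np.linspace(lo, hi, num)
    f = pdf(x)
    safe_f = np.where(f > 0, f, 1.0)  # log(1) = 0 where the density underflows
    return float(np.sum(-f * np.log(safe_f)) * (x[1] - x[0]))

# Items 1-3: uniform vs. sharply peaked distribution on n = 4 outcomes
n = 4
h_uniform = entropy(np.full(n, 1.0 / n))
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])

# Item 4: Gaussian with hypothetical mean 1 and standard deviation 2
mu, sigma = 1.0, 2.0
gauss = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
h_gauss_num = differential_entropy(gauss, mu - 12 * sigma, mu + 12 * sigma)
h_gauss_closed = float(np.log(sigma * np.sqrt(2 * np.pi * np.e)))

# Item 5: Laplace with hypothetical rate 1.5 and location -0.5
lam, a = 1.5, -0.5
laplace = lambda x: 0.5 * lam * np.exp(-lam * np.abs(x - a))
h_laplace_num = differential_entropy(laplace, a - 40 / lam, a + 40 / lam)
h_laplace_closed = float(np.log(2 * np.e / lam))

# Item 6: binomial with hypothetical 10 trials and success probability 0.3
trials, p, q = 10, 0.3, 0.7
ms = np.arange(trials + 1)
coeffs = np.array([math.comb(trials, m) for m in ms], dtype=float)
pmf = coeffs * p ** ms * q ** (trials - ms)
h_binom_direct = entropy(pmf)
h_binom_formula = float(-np.sum(pmf * np.log(coeffs))
                        - trials * (p * np.log(p) + q * np.log(q)))

print(f"uniform:  {h_uniform:.6f} vs log n = {math.log(n):.6f}; peaked: {h_peaked:.6f}")
print(f"Gaussian: numeric {h_gauss_num:.6f} vs closed form {h_gauss_closed:.6f}")
print(f"Laplace:  numeric {h_laplace_num:.6f} vs closed form {h_laplace_closed:.6f}")
print(f"binomial: direct {h_binom_direct:.6f} vs formula {h_binom_formula:.6f}")
```

Changing `mu` or `a` should leave the Gaussian and Laplace results unchanged, while increasing `sigma` or decreasing `lam` should increase them.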

Generally speaking, information entropy is closely related to physical entropy. Nature does not tend toward order: any manifestation of organized structure in physical space can be regarded as a temporary anomaly. The tendency toward a uniform distribution of properties, with its maximum entropy, is in essence the content of the second law of thermodynamics.

When comparing two distributions, it makes sense to consider the cross-entropy, which is defined as:
$$H(P,Q)=-\sum_{x \in X} p(x) \log q(x)$$
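To connect this definition with classification, here is a minimal sketch in which $P$ is a one-hot "true label" distribution and $Q$ is a model's predicted distribution. The helper `cross_entropy`, the clipping constant `eps`, and the sample predictions are assumptions made for the example.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy -sum p(x) log q(x) in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    q = np.clip(q, eps, 1.0)  # assumption: clip q to avoid log(0)
    return float(-np.sum(p * np.log(q)))

# Hypothetical 3-class example: the true class is the second one
true_label = [0.0, 1.0, 0.0]
good_prediction = [0.05, 0.90, 0.05]
bad_prediction = [0.60, 0.20, 0.20]

print("good prediction:", cross_entropy(true_label, good_prediction))
print("bad prediction: ", cross_entropy(true_label, bad_prediction))
```

For a one-hot $P$ the sum reduces to $-\log q$ of the true class, so the prediction that puts more probability on the correct class gets the smaller value.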

These quantities can also be generalized to continuous distributions. In this case, the quantity under consideration is called differential entropy and is derived as the first term of the asymptotic expansion of entropy [^5].

Here, we are primarily interested in the discrete case, so let us return to the Kullback-Leibler divergence and write out its discrete form in more detail:
$$D_{KL}(P||Q)  =\sum_{x \in X} P(x) \log(\frac{P(x)}{Q(x)}) =-\sum_{x \in X} p(x) \log q(x) + \sum_{x \in X} p(x) \log p(x) = H(P,Q) - H(P)$$
where $H(P, Q)$ is the cross-entropy between $P$ and $Q$, and $H(P)$ is the entropy of $P$.

For now, as an important milestone, we have:
$$D_{KL}(P||Q) = H(P,Q) - H(P)$$
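This identity is easy to confirm numerically. The following sketch assumes two hypothetical distributions with strictly positive probabilities, so that no special handling of zeros is needed; the helper names are illustrative.

```python
import numpy as np

def entropy(p):
    """Entropy of P in nats (assumes all probabilities are positive)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """Cross-entropy between P and Q in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(P||Q) in nats, computed directly from its definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical distributions over three outcomes
p = np.array([0.1, 0.6, 0.3])
q = np.array([0.3, 0.3, 0.4])

lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)
print("D_KL(P||Q)      =", lhs)
print("H(P,Q) - H(P)   =", rhs)
```

Because $H(P)$ does not depend on $Q$, the cross-entropy and the divergence differ only by a constant when $P$ is fixed, so the same $Q$ minimizes both.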


## Literature
- [^1] Kullback S. (1959). Information theory and statistics. Dover Publications.
- [^2] Bishop C. M. (2006). Pattern Recognition and Machine Learning. Springer.
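- [^3] Goodfellow I., Bengio Y., Courville A. (2016). Deep Learning. MIT Press.
- [^4] Shannon C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
- [^5] Kolmogorov A. N. Information Theory and the Theory of Algorithms (Теория информации и теория алгоритмов, in Russian).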


















