In [1]:
#math and linear algebra stuff
import numpy as np

#statistics stuff
import scipy.stats as scs

#plots
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15.0, 15.0)
#mpl.rc('text', usetex = True)
import matplotlib.pyplot as plt
%matplotlib inline

# Lagrange Multipliers for Maximum Entropy distribution estimation

This notebook has been inspired by a really nice lecture on this topic: [MIT course](http://www-mtl.mit.edu/Courses/6.050/2003/notes/chapter10.pdf)


## A problem of probability

Many problems arising in engineering, imaging, or data science involve estimating a mixture of, say, $N$ components in the same bucket.<br>
The mixture can be described by a point
$ p = \begin{pmatrix}
             p_0\\
             p_1\\
             \vdots\\
             p_{N-1}
    \end{pmatrix}
\in [0,1]^N$, such that $\sum_{i=0}^{N-1} p_i = 1$.<br>
The set of all such points is known as the probability simplex, see [this book](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf), and is also written as
$$
    \{ p \in \mathbb{R}^N, p \succeq 0, \mathbb{1}^\intercal p = 1 \}
$$

In the more general framework of linear programming, this definition actually implies $N+1$ constraints:
$$
\begin{align*}
    &-p_i \leq 0, \text{ for } i = 0,1,\dots, N-1\\
    &\mathbb{1}^\intercal p = 1
\end{align*}
$$
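
As a minimal sketch, the $N+1$ constraints can be checked numerically for a candidate vector. The helper name `is_on_simplex` and the tolerance `tol` are assumptions introduced here for illustration only.

```python
import numpy as np

def is_on_simplex(p, tol=1e-9):
    """Check the N inequality constraints and the single equality constraint."""
    p = np.asarray(p, dtype=float)
    nonnegative = np.all(-p <= tol)              # -p_i <= 0 for every i
    sums_to_one = abs(p.sum() - 1.0) <= tol      # 1^T p = 1
    return bool(nonnegative and sums_to_one)

print(is_on_simplex([0.1, 0.2, 0.3, 0.4]))    # a valid mixture
print(is_on_simplex([0.5, 0.6, -0.1, 0.0]))   # sums to 1 but has a negative entry
```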

## Additional constraints of the problem
Although we know that all probabilities should sum to 1, we generally seek a specific mixture that should meet some additional linear constraints.

For instance, imagine that we have four advertisement-related links on a webpage, and every time a person clicks on one of the links, it generates a given amount of money in an online wallet, let's say
- $a_0$ = 0.06\$ for link 0
- $a_1$ = 0.28\$ for link 1
- $a_2$ = 0.15\$ for link 2
- $a_3$ = 0.005\$ for link 3

Unfortunately, the advertisement company that rewards you for the clicks doesn't want to, or cannot, tell you which link generated the money you get; it only tells you the average reward per click, i.e.:
$$
\begin{align*}
    M &= \sum_{i=0}^{N-1} a_i p_i \\
    &= p^\intercal a
\end{align*}
$$
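
To make the formula concrete, here is a small sketch that evaluates $M = p^\intercal a$ with the four rewards listed above. The uniform click distribution `p_uniform` is a purely hypothetical choice; the true $p$ is exactly what we do not know.

```python
import numpy as np

# Rewards per click for links 0..3, taken from the list above (in dollars)
a = np.array([0.06, 0.28, 0.15, 0.005])

# Hypothetical click distribution (assumption): every link is equally likely
p_uniform = np.full(4, 0.25)

# Average reward per click, M = p^T a
M_uniform = p_uniform @ a
print(M_uniform)
```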

In many cases, we have $N$ unknowns $p_0,\dots,p_{N-1}$, $K_1$ linear equality constraints and $K_2$ linear inequality constraints defining a convex polytope, and often $K_1<N$.
In this kind of setting, the problem often has infinitely many feasible solutions. This is why we would like to set up an optimization problem that encodes some prior knowledge, for instance through a metric.
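
The non-uniqueness can be illustrated on the four-link example, where $N=4$ and $K_1=2$ (normalization and average reward). The sketch below uses `scipy.linalg.null_space` to find directions that leave both equality constraints unchanged; the starting point `p0` and the step size `0.05` are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.linalg import null_space

a = np.array([0.06, 0.28, 0.15, 0.005])
A = np.vstack([np.ones(4), a])   # rows: 1^T p = 1 and a^T p = M

# Orthonormal basis of the directions d such that A d = 0
Z = null_space(A)

p0 = np.full(4, 0.25)            # one feasible point, for M = a @ p0
p1 = p0 + 0.05 * Z[:, 0]         # small step, so every entry stays positive

print(A @ p0)                    # same constraint values ...
print(A @ p1)                    # ... for a different distribution
```

Any sufficiently small step along a combination of the columns of `Z` gives yet another feasible distribution, which is why an extra selection criterion is needed.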

Some people may think of the $\ell_1$ or $\ell_2$ regularization terms that they may have seen in regularized regression methods, LASSO and ridge regression respectively.
This may apply, i.e., we may have reason to believe that a few links generate most of the rewards (sparsity), or we may on the contrary limit the total energy of the solution so that no single link ends up with all the reward in the regression, which can occur if multiple links have very similar rewards.
Those may be valid priors depending on the topic, but there is another one which is much more elegant, and it is called the maximum entropy principle.

## Maximum entropy principle

We gave a friendly beginner introduction to information theory and statistics in the notebook called "InformationTheoryOptimization", so let's just recall the definition of the entropy of a discrete distribution $p$:
$$
    H(p) = - \sum_{i=0}^{N-1} p_i \log(p_i)
$$

If the $\log$ is in base 2, the entropy of a distribution gives the mean number of bits we should use to encode one outcome of the random process, with a perfect code.
The higher the entropy, the less we know in advance about the outcome of the random process.
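
As a quick numerical illustration, the sketch below computes the entropy in bits of two four-outcome distributions. The helper name `entropy_bits` and both example distributions are assumptions made here; the usual convention $0 \log(0) = 0$ is applied by skipping zero entries.

```python
import numpy as np

def entropy_bits(p):
    """Entropy of a discrete distribution, in bits (base-2 logarithm)."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0                      # convention: 0 * log(0) = 0
    return float(-np.sum(p[nonzero] * np.log2(p[nonzero])))

h_uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])   # no outcome is favoured
h_peaked = entropy_bits([0.97, 0.01, 0.01, 0.01])    # outcome 0 is almost certain

print(h_uniform, h_peaked)
```

The uniform distribution reaches the maximum of 2 bits for four outcomes, while the peaked one has a much lower entropy because its outcome is nearly predictable.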

We can use this concept to "regularize" our problem, by selecting, among all feasible solutions of the initial linear problem, the one that carries the minimum quantity of information in the information theory sense.
This reduces the bias we might introduce by imposing a less neutral prior.

## Setting up our problem

Let's now write down our maximum entropy problem:
$$
\begin{align*}
    \text{maximize } & -\sum_{i=0}^{N-1} p_i \log(p_i)\\
    \text{subject to } & -p_i \leq 0, \text{ for } i = 0,1,\dots, N-1\\
    & \mathbb{1}^\intercal p = 1 \\
    & a^\intercal p = M
\end{align*}
$$

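Fortunately, the entropy function is concave in $p$, so maximizing it is equivalent to minimizing the convex function $\sum_{i=0}^{N-1} p_i \log(p_i)$. Since linear constraints define a convex set, our problem is convex, see the Boyd & Vandenberghe book, p362.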
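
Because the problem is convex, a generic constrained solver should be able to find its optimum. The following is only a numerical sketch using `scipy.optimize.minimize` with the SLSQP method, not the Lagrangian approach discussed in this notebook. The observed average reward `M = 0.10` is a hypothetical value chosen for illustration, and the clipping constant `1e-12` is an assumption used to avoid evaluating $\log(0)$.

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([0.06, 0.28, 0.15, 0.005])   # rewards per click from the example
M = 0.10                                   # hypothetical average reward (assumption)

def negative_entropy(p):
    # Minimizing sum(p log p) is the same as maximizing the entropy
    q = np.clip(p, 1e-12, 1.0)             # guard against log(0)
    return np.sum(q * np.log(q))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # 1^T p = 1
    {"type": "eq", "fun": lambda p: a @ p - M},         # a^T p = M
]
bounds = [(0.0, 1.0)] * len(a)             # -p_i <= 0, and p_i <= 1
p_start = np.full(len(a), 1.0 / len(a))    # start from the uniform distribution

result = minimize(negative_entropy, p_start, method="SLSQP",
                  bounds=bounds, constraints=constraints)
p_maxent = result.x
print(p_maxent, p_maxent.sum(), a @ p_maxent)
```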

## Using Lagrange multipliers for solving the convex problem

The method of Lagrange multipliers is a generic method for solving constrained optimization problems, where the objective and the equality/inequality constraints aren't necessarily linear.

In short, it relaxes the constraints by integrating them into the objective, each one weighted by a new variable.
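
As a generic sketch, consider a problem that minimizes $f(p)$ subject to $g_j(p) \leq 0$ and $h_k(p) = 0$. The symbols $f$, $g_j$, $h_k$, $\lambda_j$ and $\nu_k$ are notation assumed here, following the usual convention. The relaxed objective, called the Lagrangian, then reads:
$$
    \mathcal{L}(p, \lambda, \nu) = f(p) + \sum_{j} \lambda_j g_j(p) + \sum_{k} \nu_k h_k(p), \quad \lambda_j \geq 0
$$
Each multiplier $\lambda_j$ or $\nu_k$ is one of the new weighting variables, and violating a constraint now costs something in the objective instead of being forbidden outright.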