In [1]:
#math and linear algebra stuff
import numpy as np

#statistics stuff
import scipy.stats as scs

#plots
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15.0, 15.0)
#mpl.rc('text', usetex = True)
import matplotlib.pyplot as plt
%matplotlib inline

# Lagrange Multipliers for Maximum Entropy distribution estimation

This notebook has been inspired by a really nice lecture on this topic: [MIT course](http://www-mtl.mit.edu/Courses/6.050/2003/notes/chapter10.pdf)


## A problem of probability

Many problems arising in engineering, imaging, or data science involve estimating a mixture of, let's say, $N$ components in the same bucket.<br>
The mixture can be described by a point
$ p = \begin{pmatrix}
             p_0\\
             p_1\\
             \vdots\\
             p_{N-1}
    \end{pmatrix}
\in [0,1]^N$, such that $\sum_{i=0}^{N-1} p_i = 1$.
The set of all such points is known as the probability simplex, see [this book](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf), and can also be written as
$$
    \{ p \in \mathbb{R}^N, p \succeq 0, \mathbb{1}^\intercal p = 1 \}
$$

In the more general framework of linear programming, this definition actually implies $N+1$ constraints:
$$
\begin{align*}
    &-p_i \leq 0, \text{ for } i = 0,1,\dots, N-1\\
    &\mathbb{1}^\intercal p = 1
\end{align*}
$$
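As a quick sanity check, the $N+1$ constraints above can be tested numerically. The following is a minimal sketch; the helper name `in_simplex` and the tolerance `tol` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def in_simplex(p, tol=1e-12):
    """Return True if p satisfies the N+1 probability simplex constraints."""
    p = np.asarray(p, dtype=float)
    nonneg = np.all(-p <= tol)                   # the N inequalities -p_i <= 0
    sums_to_one = abs(np.sum(p) - 1.0) <= tol    # the single equality 1^T p = 1
    return bool(nonneg and sums_to_one)

print(in_simplex([0.25, 0.25, 0.25, 0.25]))  # uniform mixture
print(in_simplex([0.5, 0.6, -0.1]))          # sums to 1 but has a negative entry
```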

## Additional constraints of the problem
Although we know that all probabilities should sum to 1, we generally seek a specific mixture that should meet some additional linear constraints.

For instance, imagine that we have four advertisement-related URL links on a webpage, and every time a person clicks on one of the links, it generates a given amount of money in an online wallet, let's say
- $a_0$ = 0.06\$ for link 0
- $a_1$ = 0.28\$ for link 1
- $a_2$ = 0.15\$ for link 2
- $a_3$ = 0.005\$ for link 3

Unfortunately, the advertisement company that rewards you for the clicks doesn't want to, or cannot, tell you which link generated the money you get. It only tells you the average click reward, i.e.:
$$
\begin{align*}
    M &= \sum_{i=0}^{N-1} a_i p_i \\
    &= p^\intercal a
\end{align*}
$$
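To make the formula concrete, here is a small sketch that computes the average reward for one mixture. The rewards `a` come from the list above; the mixture `p_example` is a hypothetical choice made only for illustration.

```python
import numpy as np

a = np.array([0.06, 0.28, 0.15, 0.005])      # reward per click for links 0..3
p_example = np.array([0.4, 0.1, 0.2, 0.3])   # hypothetical click probabilities

# Average click reward M = p^T a
M_example = p_example @ a
print(M_example)
```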

In many cases, we have $N$ unknowns $p_0,\dots,p_{N-1}$, $K_1$ linear equality constraints and $K_2$ linear inequality constraints defining a convex polytope, and often $K_1<N$.
In this kind of setting, the problem often has infinitely many feasible solutions. This is why we would like to set up an optimization problem that encodes some prior knowledge, for instance through a metric.
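The non-uniqueness can be illustrated on the link example, where there are $K_1 = 2$ equality constraints (sum to one, and average reward) for $N = 4$ unknowns. The sketch below assumes a hypothetical starting mixture `p0` and uses `scipy.linalg.null_space` to move along a direction that leaves both constraints unchanged.

```python
import numpy as np
from scipy.linalg import null_space

a = np.array([0.06, 0.28, 0.15, 0.005])   # rewards from the example
C = np.vstack([np.ones(4), a])            # rows: 1^T p = 1 and a^T p = M
p0 = np.array([0.4, 0.1, 0.2, 0.3])       # hypothetical feasible mixture
targets = C @ p0                          # [1, M] for this mixture

# Orthonormal basis of directions z such that C z = 0 (2-dimensional here)
Z = null_space(C)

# A small step along a null-space direction gives another feasible mixture
p1 = p0 + 0.05 * Z[:, 0]
print(C @ p1, targets)
```

Since the basis vectors have unit norm and the smallest entry of `p0` is 0.1, a step of 0.05 keeps every component of `p1` positive.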

Some people may think of the $\ell_1$ or $\ell_2$ regularization terms that they may have seen in regularized regression methods, respectively LASSO or ridge regression.
These may apply, i.e., we may have reason to believe that a few links generate most of the rewards (sparsity), or we may on the contrary limit the total energy of the solution so that no single link ends up with all the reward in the regression, which can occur if multiple links have very similar rewards.
Those may be valid priors depending on the topic, but there is another one which is much more elegant, and it is called the maximum entropy principle.

## Maximum entropy principle

We gave a friendly beginner introduction to information theory and statistics in the notebook called "InformationTheoryOptimization", so let's just recall the definition of the entropy of a discrete distribution $x$:
$$
    H(x) = - \sum_{i=0}^{N-1} x_i log(x_i)
$$

If the $log$ is in base 2, the entropy of a distribution gives the mean number of bits we should use to encode one outcome of the random process with a perfect code.
The higher the entropy, the less we know in advance about the outcome of the random process.
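As an illustration of the definition, the following sketch computes the entropy in bits. The function name `entropy_bits` is an assumption, and the code uses the usual convention $0 \cdot log(0) = 0$.

```python
import numpy as np

def entropy_bits(x):
    """Shannon entropy in bits of a discrete distribution x."""
    x = np.asarray(x, dtype=float)
    nz = x > 0                       # skip zero entries: 0 * log(0) is taken as 0
    return -np.sum(x[nz] * np.log2(x[nz]))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # deterministic outcome
```

A uniform distribution over four outcomes needs 2 bits per outcome, while a deterministic outcome carries no uncertainty at all.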

We can use this concept to "regularize" our problem by selecting, among all feasible solutions of the initial linear problem, the one that carries the minimum amount of information in the information theory sense, i.e. the one with maximum entropy.
This reduces the bias we might otherwise introduce by imposing a less neutral prior.

## Setting up our problem

Let's now write down our maximum entropy problem:

\begin{align*}
    \text{minimize } & f_0(x) = \sum_{i=0}^{N-1} x_i log(x_i)\\
    \text{subject to } & A x \preceq b \\
    & \mathbb{1}^\intercal x = 1 \\
\end{align*}

Where $\text{dom } f_0 = \mathbb{R}^{n}_{++}$ <br>
Fortunately, the negative entropy $f_0$ is convex in $x$, and linear constraints define a convex set, so our problem is convex; see Boyd & Vandenberghe's book, p362.

### Convex conjugate of entropy

In the notebook "ForwardBackwardDual" we have seen the definition of the convex conjugate: given a convex, proper, and lower semi-continuous function $f$ defined on $\mathbb{R}^n_{+}$, its
conjugate $f^*$ is the convex function defined as
$$ \forall y \in \mathbb{R}^n, \quad f^*(y) = \underset{x \in \mathbb{R}^n_{+}}{sup} \  x^{\intercal}y - f(x) $$

This gives the following derivation in the simple one dimensional case where $f(x) = x log(x)$:
\begin{align*}
    f^*(y) &= \underset{x \in \mathbb{R}_{+}}{sup} \  xy - f(x)\\
           &= \underset{x \in \mathbb{R}_{+}}{sup} \  g(x,y)\\
           & \text{Where }\\
    g(x,y) &= xy - x log(x)\\
\end{align*}
Assuming $f(0)=0$, it is easy to see that $g(x,y)$ is bounded above on $x \in \mathbb{R}_{+}$ for any given $y \in \mathbb{R}$. Let's take a look at its derivative:
\begin{align*}
    \frac{\partial g(x,y)}{\partial x} &= y - (log(x) + x\frac{1}{x})\\
        &= y - log(x) - 1
\end{align*}
We see that this derivative is strictly decreasing over $\mathbb{R}_{+}$ because $log(x)$ is strictly increasing, so it vanishes exactly once over $\mathbb{R}_{+}$, and that point is the maximizer. Equating this derivative to zero in order to find the supremum gives us:
\begin{align*}
    \frac{\partial g(x,y)}{\partial x} &= 0\\
        y - log(x) - 1 &= 0\\
        log(x) &= y - 1\\
        x &= e^{y - 1}\\
\end{align*}
So we can finally write
\begin{align*}
    f^*(y) &= \underset{x \in \mathbb{R}_{+}}{sup} \  xy - f(x)\\
           &= e^{y - 1}y - e^{y - 1} log(e^{y - 1})\\
           &= ye^{y - 1} - ye^{y - 1} + e^{y - 1}\\
           &= e^{y - 1}
\end{align*}
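The closed form $f^*(y) = e^{y-1}$ can be checked numerically by solving the one dimensional supremum directly. This is only a sketch: the function name `conjugate_numeric` and the search interval are assumptions, chosen so that the maximizer $e^{y-1}$ lies inside the interval for the tested values of $y$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_numeric(y):
    """Approximate sup_x (x*y - x*log(x)) over x > 0 by bounded minimization."""
    res = minimize_scalar(lambda x: x * np.log(x) - x * y,
                          bounds=(1e-12, 20.0), method="bounded")
    return -res.fun   # the supremum is minus the minimum of the negated function

for y in [-1.0, 0.0, 1.0, 2.0]:
    # numerical supremum next to the closed form e^(y-1)
    print(y, conjugate_numeric(y), np.exp(y - 1.0))
```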

### Convex conjugate and Lagrange duality

#### General case: the link between Lagrange dual problem, and the conjugate
It is interesting to make the link between the convex conjugate function and the Lagrange dual function. Let's consider a general optimization problem of the form:

\begin{align*}
    \text{minimize } & f_0(x)\\
    \text{subject to } & A x \preceq b \\
    & C x = d \\
\end{align*}

Using the conjugate of $f_0$, we can write the dual function for the previous problem as:

\begin{align*}
    g(\lambda, \nu) &= \underset{x}{inf}\left( f_0(x) + \lambda^{\intercal} (Ax-b) + \nu^{\intercal}(Cx-d) \right)\\
    &= -b^{\intercal}\lambda - d^{\intercal}\nu + \underset{x}{inf}\left( \underbrace{ f_0(x) + (A^{\intercal}\lambda+C^{\intercal}\nu)^{\intercal}x}_{\textrm{recognize } -( y^{\intercal}x - f(x))} \right) \\
    &= -b^{\intercal}\lambda - d^{\intercal}\nu - f_0^*(-A^{\intercal}\lambda-C^{\intercal}\nu)
\end{align*}

Where the domain of $g$ follows from the domain of $f_0^*$:
$$
    \textrm{dom} \; g = \{(\lambda,\nu) | -A^{\intercal}\lambda-C^{\intercal}\nu \in \textrm{dom}\; f_0^* \}
$$

#### The specific case of entropy maximization (or negative entropy minimization)

In the specific case of negative entropy minimization, we have already derived the conjugate of the one dimensional negative entropy: $f(x) = xlog(x) \iff f^*(y) = e^{y-1}$ <br>

The derivation of the multidimensional (vectorial) case, which is only a sum of scalar functions, is straightforward:
$$
    f(x) = \sum_{i=0}^{n-1} x_i log(x_i) \iff f^*(y) = \sum_{i=0}^{n-1} e^{y_i-1}
$$
With $\text{dom } f_0^* = \mathbb{R}^{n}$ <br>

We can now derive the dual function of our entropy maximization problem from the general expression found earlier, which depends on the conjugate, with $C = \mathbb{1}^{\intercal}$ and $d = 1$ so that $\nu$ becomes a scalar:
\begin{align*}
    g(\lambda, \nu) &= -b^{\intercal}\lambda - \nu - f_0^*(-A^{\intercal}\lambda-\mathbb{1}\nu) \\
    &= -b^{\intercal}\lambda - \nu - \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda-\nu-1} \\
    &= -b^{\intercal}\lambda - \nu - e^{-\nu-1} \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} 
\end{align*}

Where $a_i$ is the $i^{th}$ column of $A$. <br>
Here we must specify that $\text{dom} \; g = \{(\lambda,\nu) | -A^{\intercal}\lambda - \mathbb{1}\nu \in \mathbb{R}^n \}$, but this amounts to $\text{dom} \; g = \mathbb{R}^m \times \mathbb{R}$, where $m$ is the number of rows of $A$, which is not restrictive at all.
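The dual function is cheap to evaluate, and weak duality ($g(\lambda,\nu) \leq f_0(x)$ for any feasible $x$ and any $\lambda \succeq 0$) can be checked empirically. The sketch below rests on several assumptions made for illustration: the mixture `x_feasible` is hypothetical, the average reward constraint is encoded as the two inequalities $a^{\intercal}x \leq M$ and $-a^{\intercal}x \leq -M$, and the multipliers are drawn at random.

```python
import numpy as np

def dual_function(lam, nu, A, b):
    """g(lam, nu) = -b^T lam - nu - exp(-nu-1) * sum_i exp(-a_i^T lam)."""
    # A.T @ lam stacks the scalars a_i^T lam, one per column a_i of A
    return -b @ lam - nu - np.exp(-nu - 1.0) * np.sum(np.exp(-A.T @ lam))

a = np.array([0.06, 0.28, 0.15, 0.005])       # rewards from the example
x_feasible = np.array([0.4, 0.1, 0.2, 0.3])   # hypothetical feasible mixture
M = a @ x_feasible                            # its average reward
A = np.vstack([a, -a])                        # a^T x <= M and -a^T x <= -M
b = np.array([M, -M])

f0 = np.sum(x_feasible * np.log(x_feasible))  # primal objective at x_feasible

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    lam = rng.uniform(0.0, 5.0, size=2)       # random lam >= 0
    nu = rng.uniform(-3.0, 3.0)               # random scalar nu
    gaps.append(f0 - dual_function(lam, nu, A, b))
min_gap = min(gaps)
print(min_gap)   # expected to be non-negative by weak duality
```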

### Solving the dual problem

Thanks to the previous part, we established the dual optimization problem of the entropy maximization which is:
\begin{align*}
    &\text{maximize} \quad & g(\lambda, \nu) \\
    &\text{subject to} & \lambda \succeq 0
\end{align*}

where
$$
    g(\lambda, \nu) = -b^{\intercal}\lambda - \nu - e^{-\nu-1} \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} 
$$

Since $\nu$ is a scalar, we can try to obtain the optimal coordinate of the solution in this direction in closed form. Let's take a look at the partial derivative:
\begin{align*}
    \frac{\partial g(\lambda, \nu)}{\partial \nu} &= \frac{\partial \left( -b^{\intercal}\lambda - \nu - e^{-\nu-1} \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda}\right) }{\partial \nu} \\
    &= -0-1 - \frac{\partial \left(e^{-\nu-1} \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda}\right) }{\partial \nu} \\
    &= e^{-\nu-1} \left( \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} \right) -1
\end{align*}

Equating this partial derivative to zero gives us:
\begin{align*}
    \frac{\partial g(\lambda, \nu)}{\partial \nu} &= 0 \\
    e^{-\nu-1} \left( \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} \right)-1 &= 0\\
    e^{-\nu-1} &= \frac{1}{\sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda}} \\
    -\nu-1 &= log \left( \frac{1}{\sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda}} \right)\\
    \nu &= -log \left( \frac{1}{\sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda}} \right) -1 \\
    \nu &= log \left( \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} \right) -1
\end{align*}
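The closed form for $\nu$ can be compared with a direct one dimensional maximization of $g$ for a fixed $\lambda$. In this sketch the matrix `A`, the vector `b` and the multiplier `lam` are arbitrary hypothetical values, and `scipy.special.logsumexp` is used to evaluate the log of the sum of exponentials in a numerically stable way.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def dual_function(lam, nu, A, b):
    """g(lam, nu) = -b^T lam - nu - exp(-nu-1) * sum_i exp(-a_i^T lam)."""
    return -b @ lam - nu - np.exp(-nu - 1.0) * np.sum(np.exp(-A.T @ lam))

# Hypothetical data: 2 inequality constraints, 4 unknowns
A = np.array([[0.06, 0.28, 0.15, 0.005],
              [1.0, 0.0, 0.0, 0.0]])
b = np.array([0.1, 0.5])
lam = np.array([0.7, 1.3])

# Closed form: nu = log(sum_i exp(-a_i^T lam)) - 1
nu_closed = logsumexp(-A.T @ lam) - 1.0

# Numerical maximization of g over nu (minimize -g, which is convex in nu)
res = minimize_scalar(lambda nu: -dual_function(lam, nu, A, b))
nu_numeric = res.x
print(nu_closed, nu_numeric)
```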

This optimal value of $\nu$ can be injected into the dual problem, which immediately simplifies and is now a function of $\lambda$ only:
\begin{align*}
    &\text{maximize} \quad & -b^{\intercal}\lambda - log \left( \sum_{i=0}^{n-1} e^{-a_i^{\intercal}\lambda} \right) \\
    &\text{subject to} & \lambda \succeq 0
\end{align*}
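One possible way to solve this reduced dual numerically is a bound-constrained quasi-Newton method. The sketch below is not the notebook's own solver: it assumes a hypothetical observed average reward `M = 0.0835`, encodes the equality $a^{\intercal}x = M$ as two inequalities, and relies on `scipy.optimize.minimize` with the `L-BFGS-B` method to enforce $\lambda \succeq 0$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

a = np.array([0.06, 0.28, 0.15, 0.005])   # rewards from the example
M = 0.0835                                # hypothetical observed average reward
A = np.vstack([a, -a])                    # a^T x <= M and -a^T x <= -M
b = np.array([M, -M])

def neg_reduced_dual(lam):
    # Minus the dual objective: b^T lam + log(sum_i exp(-a_i^T lam))
    return b @ lam + logsumexp(-A.T @ lam)

res = minimize(neg_reduced_dual, x0=np.zeros(2), method="L-BFGS-B",
               bounds=[(0.0, None)] * 2)   # lam >= 0
lam_star = res.x
d_star = -res.fun                          # optimal dual value

# Normalized exponential weights, i.e. minus the gradient of the
# log-sum-exp term with respect to the vector A^T lam
w = np.exp(-A.T @ lam_star - logsumexp(-A.T @ lam_star))
print(lam_star, d_star, a @ w)
```

At the optimum, the weights `w` sum to one and their average reward `a @ w` should be close to `M`, which is what stationarity of the dual objective with respect to the active multiplier requires.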


### Solving the primal, using the dual problem solution

If strong duality holds for this problem, the primal optimal value $p^*$ and the dual optimal value $d^*$ are equal. As the primal problem is convex, Slater's condition tells us that strong duality holds if the problem is strictly feasible, i.e.:
$$
    \exists x \in \mathbb{R}_{+}^n \; \text{ such that } \; \mathbb{1}^{\intercal}x = 1, Ax \prec b
$$

For now, let's assume that such a strictly feasible point exists, so that strong duality holds, and that we know the dual solution $(\lambda^*,\nu^*)$. We can then write the Lagrangian at this point:

$$
    L(x,\lambda^*,\nu^*) = \sum_{i=0}^{n-1} x_i log(x_i) + {\lambda^*}^{\intercal} (Ax-b) + \nu^* (\mathbb{1}^{\intercal } x-1)
$$