# "[InformationTheory] CH01. Entropy"
> Information theory summary note.

- toc: false
- badges: false
- comments: false
- categories: [information-theory]
- hide_{github,colab,binder,deepnote}_badge: true

#### 1.0. Overview
Information theory was developed to study the problem of sending messages, composed of symbols from a discrete alphabet, over a noisy channel such as a wireless link. In machine learning, the relevant setting is the application of information theory to continuous variables. _T. M. Cover, 2006_, _MacKay, 2003_<br><br>

The core intuition of information theory is the following.

> Learning an event that is less likely to occur is more informative than learning an event that is more likely to occur.

For example, learning that "there was a solar eclipse this morning" gives you more information than learning that "the sun rose this morning". To quantify the amount of information, the following properties must be satisfied.

- Events with a high probability of occurrence should have less information.
- An event that is certain to occur should carry no information.
- Events with a low probability of occurrence should have more information.
- The amount of information for independent events should be additive.
    - Ex) For two independent coin tosses, $I([H, H]) = 2I([H])$
  
##### Definition.1.1. Self-information of Event x=$x$
$$
I_p(x) = -\log P(x).
$$

When $\log$ is the natural logarithm, the unit of $I_p(x)$ above is the nat. One nat is the amount of information obtained by observing an event with probability $\frac{1}{e}$. If $\log_2$ is used instead, the unit is called the bit or shannon, and one bit is the amount of information obtained by observing an event with probability $\frac{1}{2}$. The numerical examples below use $\log_2$, so their values are in bits.
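
The definition and the additivity property can be checked numerically. The following is a minimal sketch; the helper name `self_information` and its `base` parameter are illustrative choices, not part of the original note.

```python
import math

def self_information(prob, base=math.e):
    """Self-information -log(prob); base e gives nats, base 2 gives bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob) / math.log(base)

# One head of a fair coin, and two independent heads, measured in bits.
one_head_bits = self_information(0.5, base=2)
two_heads_bits = self_information(0.5 * 0.5, base=2)

# The same single-head event measured in nats (1 bit = ln 2 nats).
one_head_nats = self_information(0.5)

# An event that is certain to occur carries no information.
certain_event = self_information(1.0)

print(one_head_bits, two_heads_bits, one_head_nats)
```

Two independent heads carry twice the information of one head, which is the additivity property listed above.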

Consider the binomial distribution of the number of heads $x$ when a coin that lands heads with probability $p$ is tossed 3 times. The expression is:

$$
p(x) = {3 \choose x}p^x (1 - p)^{3 - x} \quad \text{for} \,\ x=0,1,2,3.
$$

In the above equation, for a fair coin ($p = 0.5$), the probability of each event $x=0, x=1, x=2, x=3$ is as follows.

$$
\begin{matrix}
p(x=0) = 0.125 \\
p(x=1) = 0.375 \\
p(x=2) = 0.375 \\
p(x=3) = 0.125 \\
\end{matrix}
$$

The corresponding self-information values, computed with $\log_2$ (in bits), are as follows.

$$
\begin{matrix}
I_p(x=0) = 3 \\
I_p(x=1) = 1.415037\cdots \\
I_p(x=2) = 1.415037\cdots \\
I_p(x=3) = 3 \\
\end{matrix}
$$

Because self-information is the negative logarithm of a probability, it is never negative, and it becomes larger as the probability becomes smaller.
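
The table above can be reproduced with a short script. This is a sketch assuming a fair coin ($p = 0.5$) and base-2 logarithms; the names `binomial_pmf`, `pmf`, and `info_bits` are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k heads in n independent tosses with head probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_tosses, p_heads = 3, 0.5

# Probability and self-information (in bits) of each outcome x = 0, 1, 2, 3.
pmf = [binomial_pmf(k, n_tosses, p_heads) for k in range(n_tosses + 1)]
info_bits = [-math.log2(q) for q in pmf]

for k, (q, i) in enumerate(zip(pmf, info_bits)):
    print(f"x={k}: p={q:.3f}, I={i:.6f} bits")
```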

#### 1.1. Shannon Entropy
Self-information as defined above deals with only one event. The uncertainty of the entire probability distribution can be quantified with the Shannon entropy.

##### Definition.1.2. Shannon Entropy
$$
H[p] = \mathbb{E}_{x \sim  p}[I(x)] = - \mathbb{E}_{x \sim  p}[\log p(x)].
$$

This is the average amount of information over the events in the distribution, with each event weighted by its probability. When $\log_2$ is used, this value gives a lower bound on the average number of bits required to encode symbols drawn from the distribution $p$. In the example above, the Shannon entropy is calculated as follows.

$$
H[p] = 0.125 \cdot 3 + 0.375 \cdot 1.415037 + 0.375 \cdot 1.415037 + 0.125 \cdot 3 \approx 1.811278 \,\ \text{bits}.
$$

The Shannon entropy is low when the distribution is close to deterministic, and it becomes higher the closer the distribution is to the uniform distribution. In particular, when $x$ is a continuous variable, the Shannon entropy is called differential entropy.
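
As a quick numerical check of the expectation $-\sum_x p(x) \log_2 p(x)$ for the coin example, and of the claim about deterministic and uniform distributions, here is a minimal sketch. The function name `shannon_entropy_bits` and the two comparison distributions are illustrative assumptions.

```python
import math

def shannon_entropy_bits(pmf):
    """H[p] = -sum_x p(x) log2 p(x), using the convention 0 * log 0 = 0."""
    return -sum(q * math.log2(q) for q in pmf if q > 0)

binomial = [0.125, 0.375, 0.375, 0.125]   # 3 tosses of a fair coin
uniform = [0.25, 0.25, 0.25, 0.25]        # same four outcomes, equally likely
deterministic = [1.0, 0.0, 0.0, 0.0]      # one outcome is certain

h_binomial = shannon_entropy_bits(binomial)
h_uniform = shannon_entropy_bits(uniform)
h_deterministic = shannon_entropy_bits(deterministic)

print(f"binomial: {h_binomial:.6f} bits")
print(f"uniform: {h_uniform:.6f} bits")
print(f"deterministic: {abs(h_deterministic):.6f} bits")
```

The binomial case comes out at about 1.811 bits, between the deterministic case (0 bits) and the uniform case (2 bits).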

---
##### Application.1.1. Maximization of Shannon Entropy
Let $p$ be a probability density function supported on $x\in [a, b]$.<br>
Then,

$$
\begin{matrix}
\int_{a}^{b} p(x)dx = 1 \\
\end{matrix}
$$

Consider the following problem:

$$
\max_{p(x)} H[p(x)] = \max_{p(x)} - \mathbb{E}[\log p(x)] = \max_{p(x)} - \int_a^b p(x) \log p(x) dx
$$

The above problem is an equality-constrained optimization problem, so we form the Lagrangian $\mathcal{L}$.

$$
\mathcal{L} = - \int_a^b p(x) \log p(x) dx + \lambda_1 \left( \int_a^b p(x) dx - 1 \right)
$$

Since $\mathcal{L}$ is a functional of $p$, we set its functional derivative with respect to $p(x)$ to zero:

$$
\frac{\delta \mathcal{L}}{\delta p(x)} = - \log p(x) - 1 + \lambda_1 = 0
$$

$$
\therefore \,\ p(x) = \exp(-1 + \lambda_1) = c \,\ \text{for some constant} \,\ c.
$$

By the equality constraint,
$$
\int_a^b c dx = c(b - a) = 1.
$$
Then

$$
\therefore \,\ p(x) = c = \frac{1}{b - a}, \quad \text{i.e.,} \,\ x \sim U(a, b).
$$

This result shows that, among densities supported on $[a, b]$, the uniform distribution is the one that maximizes the Shannon entropy.
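
This can be illustrated numerically by comparing a few densities on the same interval. The sketch below is not a proof; it assumes $[a, b] = [0, 2]$, a simple midpoint-rule integrator, and two hand-picked alternative densities (a symmetric triangle and a linear ramp), all of which are illustrative choices.

```python
import math

def differential_entropy(pdf, a, b, n=100_000):
    """Midpoint-rule estimate of -int_a^b p(x) ln p(x) dx, in nats."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        p = pdf(a + (i + 0.5) * dx)
        if p > 0:  # convention: 0 * log 0 = 0
            total -= p * math.log(p) * dx
    return total

a, b = 0.0, 2.0

# Three densities on [0, 2]; each integrates to 1 over the interval.
candidates = {
    "uniform": lambda x: 1.0 / (b - a),
    "triangular": lambda x: 1.0 - abs(x - 1.0),
    "ramp": lambda x: x / 2.0,
}

entropies = {name: differential_entropy(pdf, a, b) for name, pdf in candidates.items()}

for name, h in entropies.items():
    print(f"{name}: {h:.4f} nats")
```

The uniform density attains $\log(b - a) = \log 2$ nats, and both alternatives come out lower, which agrees with the derivation above.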

---
##### Application.1.2. Maximization of Shannon Entropy with Fixed Variance
Let the probability density function $p$ have expectation $\mu$ and variance $\sigma^2$.<br>
Then,

$$
\begin{matrix}
\int_{-\infty}^{\infty} p(x)dx = 1 \\
\int_{-\infty}^{\infty} xp(x)dx = \mu \\
\int_{-\infty}^{\infty} (x - \mu)^2p(x)dx = \sigma^2 \\
\end{matrix}
$$

To maximize $H[p]$ under these constraints, we use Lagrange multipliers $\mathbf{\lambda} = (\lambda_1, \lambda_2, \lambda_3)^T$.<br>
Then,

$$
\max \mathcal{L}[p](\mathbf{\lambda}) = \max - \int_{-\infty}^{\infty} p(x) \log p(x) dx + \mathbf{\lambda}^T 
\begin{bmatrix}
\int_{-\infty}^{\infty} p(x)dx - 1 \\
\int_{-\infty}^{\infty} xp(x)dx - \mu \\
\int_{-\infty}^{\infty} (x-\mu)^2 p(x)dx - \sigma^2 \\
\end{bmatrix}
$$

Then,

$$
\frac{\delta \mathcal{L}}{\delta p(x)} = - \log p(x) - 1 + \lambda_1  + \lambda_2x + \lambda_3(x-\mu)^2  = 0
$$

$$
\therefore \,\ p(x) = \exp(-1 + \lambda_1 + \lambda_2x + \lambda_3(x-\mu)^2)
$$

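By the first equality constraint, expanding $\lambda_3(x - \mu)^2 = \lambda_3x^2 - 2\mu\lambda_3x + \mu^2\lambda_3$,

$$
\begin{matrix}
\int_{-\infty}^{\infty} p(x) dx = \int_{-\infty}^{\infty} \exp(-1 + \lambda_1 + \lambda_2x + \lambda_3(x - \mu)^2) dx \\
= \int_{-\infty}^{\infty} \exp( \lambda_3x^2 + (\lambda_2 - 2\mu\lambda_3)x + \mu^2 \lambda_3 + \lambda_1 - 1) dx \quad (\lambda_3 < 0)\\ 
\end{matrix}
$$

The condition $\lambda_3 < 0$ is needed for the integral to converge. Writing $\lambda_3 = -|\lambda_3|$ and completing the square in the exponent,

$$
\begin{matrix}
\lambda_3x^2 + (\lambda_2 - 2\mu\lambda_3)x + \mu^2 \lambda_3 + \lambda_1 - 1 \\
= - |\lambda_3|(x^2 - \frac{\lambda_2 - 2\mu\lambda_3}{|\lambda_3|}x + \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4\lambda_3^2} ) + \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2 \lambda_3 + \lambda_1 - 1 \\
= - |\lambda_3|( x + \frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3} )^2 + \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2 \lambda_3 + \lambda_1 - 1 \\
\end{matrix}
$$

Let $\sqrt{|\lambda_3|}(x + \frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3}) = t$, so that $dx = \frac{1}{\sqrt{|\lambda_3|}} dt$.<br>
Then

$$
\begin{matrix}
\int_{-\infty}^{\infty} \exp(- |\lambda_3|( x + \frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3} )^2 + \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2 \lambda_3 + \lambda_1 - 1) dx \\
= \frac{1}{\sqrt{|\lambda_3|}} \exp\left\{ \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2\lambda_3 + \lambda_1 - 1 \right\} \int_{-\infty}^{\infty} \exp(-t^2) dt
\end{matrix}
$$

Since $\int_{-\infty}^{\infty} \exp(-t^2) dt = \sqrt{\pi}$,

$$
\therefore \,\ \frac{\sqrt{\pi}}{\sqrt{|\lambda_3|}} \exp\left\{ \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2\lambda_3 + \lambda_1 - 1 \right\} = 1 \qquad \cdots \,\ (1) 
$$

In a similar way, substituting $x = \frac{t}{\sqrt{|\lambda_3|}} - \frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3}$ into the second equality constraint (the term that is odd in $t$ integrates to zero),

$$
- \frac{(\lambda_2 - 2\mu\lambda_3)\sqrt{\pi}}{2\lambda_3 \sqrt{|\lambda_3|}} \exp\left\{ \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2\lambda_3 + \lambda_1 - 1 \right\} = \mu \qquad \cdots \,\ (2)
$$

In a similar way, using $x - \mu = \frac{t}{\sqrt{|\lambda_3|}} - \frac{\lambda_2}{2\lambda_3}$ and $\int_{-\infty}^{\infty} t^2 \exp(-t^2) dt = \frac{\sqrt{\pi}}{2}$ in the third equality constraint,

$$
\frac{\sqrt{\pi}}{\sqrt{|\lambda_3|}} \left( \frac{1}{2|\lambda_3|} + \frac{\lambda_2^2}{4\lambda_3^2} \right) \exp\left\{ \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2\lambda_3 + \lambda_1 - 1 \right\} = \sigma^2 \qquad \cdots \,\ (3)
$$

Dividing $(2)$ by $(1)$ gives $-\frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3} = \mu$, so $\lambda_2 = 0$. Dividing $(3)$ by $(1)$ with $\lambda_2 = 0$ gives $\frac{1}{2|\lambda_3|} = \sigma^2$, so $\lambda_3 = -\frac{1}{2\sigma^2}$. Then $(1)$ fixes the normalizing factor:

$$
\exp\left\{ \frac{(\lambda_2 - 2\mu\lambda_3)^2}{4|\lambda_3|} + \mu^2\lambda_3 + \lambda_1 - 1 \right\} = \frac{\sqrt{|\lambda_3|}}{\sqrt{\pi}} = \frac{1}{\sqrt{2\pi \sigma^2}}
$$

Substituting these values into the completed-square form of $p(x)$, where $\frac{\lambda_2 - 2\mu\lambda_3}{2\lambda_3} = -\mu$,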

$$
\therefore \,\ p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}
$$
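
As a sanity check on this result, the differential entropy of the Gaussian can be compared with that of other distributions having the same variance. The sketch below uses the standard closed-form entropies (in nats) of the Gaussian, Laplace, and uniform distributions; the choice of comparison distributions, the value of `sigma`, and the function names are illustrative assumptions.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def laplace_entropy(sigma):
    """Laplace with variance sigma^2 has scale s = sigma / sqrt(2); entropy 1 + ln(2s)."""
    scale = sigma / math.sqrt(2)
    return 1.0 + math.log(2 * scale)

def uniform_entropy(sigma):
    """Uniform with variance sigma^2 has width w = sigma * sqrt(12); entropy ln(w)."""
    width = sigma * math.sqrt(12)
    return math.log(width)

sigma = 1.5
h_gauss = gaussian_entropy(sigma)
h_laplace = laplace_entropy(sigma)
h_uniform = uniform_entropy(sigma)

print(f"gaussian: {h_gauss:.4f} nats")
print(f"laplace:  {h_laplace:.4f} nats")
print(f"uniform:  {h_uniform:.4f} nats")
```

For any fixed `sigma`, the Gaussian value is the largest of the three, in line with the derivation. The mean does not appear because differential entropy is invariant to shifts.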

---