# MSDM5058 Tutorial 4 - The Meaning of Entropy

## Contents
1. Physics' origin of entropy
2. Principle of maximum uncertainty
3. Entropy of continuous distributions

---

# 1. Physics' origin of entropy

The term "entropy" was coined by the German physicist Rudolf Clausius in the 19th century. At that time physicists were constructing models of heat engines, and found that the change in the quantity $S$, defined by 

$$ \Delta S = \frac{\Delta Q}{T} $$ 

is never negative, i.e. $S$ never decreases no matter what process the engine is running. Here $\Delta Q$ is the heat input to the engine, and $T$ is the temperature of the engine as a function of $Q$. Physicists named this $S$ the **entropy of heat engines** - however, they did not yet have any idea why the quantity never decreases with time, until Boltzmann offered his interpretation. 

## 1.1. Microstates, macrostates and multiplicity

These terms have deep roots in physical theory. To begin with, let's look at the two biggest pictures of physics in the 19th century.

### 1.1.1. Classical mechanics

From the viewpoint of classical mechanics, the state of every object can be described and predicted exactly. If we know the initial position $x(t=0)$ and velocity $v(t=0)$ of an object, then we can predict its subsequent motion $(x(t), v(t))$ by solving differential equations (i.e. Newton's 2nd law). 

The procedure is the same for a system of many objects. Say there are $10000$ objects. In theory, a prediction can be made if we can measure the initial positions and velocities of all $10000$ objects and then solve a system of $10000$ differential equations. Describing the system's initial state by $10000$ initial positions and $10000$ initial velocities is an example of a **microstate description** - one that looks into the configurations of all objects. 

However, measuring every object's configuration is basically an impossible task in reality. Today we know that a few tens of grams of matter already consist of $\sim 10^{23}$ atoms. So classical mechanics is not suitable for modelling a large number of objects.

### 1.1.2. Thermodynamics

For a system whose objects are almost identical, it is much easier to describe the objects' collective behavior than to look into each object one by one. This is the **macrostate description** of a system of objects. For example, for a box of gas, instead of looking into the motion of each gas particle, we prefer to look at its statistical quantities, like:

- Pressure = _Average_ collision force per unit area on a surface.

- Volume of liquid/gas = _Average_ volume occupied by the gas, because gas particles are always moving. Most of the time, though, we simply use the container's volume for convenience. 

- Temperature = a measure of the _average_ kinetic energy of a system of particles. 


Description by macrostate is much easier than by microstate because the number of parameters we need to measure is greatly reduced (from $\sim 10^{23}$ to around $10$). However, this also implies a loss of fine details about the system.



### 1.1.3. Multiplicity

Microstates and macrostates are in a many-to-one correspondence:

- If we can measure all the microstate parameters, we can calculate the statistical quantities and determine which macrostate the system is in.

- It is impossible to determine the microstate by measuring macrostate parameters, because the macrostate description doesn't give the fine details. We can have many different configurations (microstates) that yield the same statistical properties (macrostate). 

We can define the **multiplicity $W$ of a macrostate** as the number of microstates that correspond to it. 

#### Example: Two box model

Suppose we have 4 balls that are allowed to move between 2 boxes $L$ and $R$. What are the 

1. Microstate description?
2. Macrostate description?
3. Multiplicity of each macrostate?

**Solution.** 

1. For microstate, we have to look at the configuration of the individual balls - for each ball whether it is in the $L$ box or the $R$ box. There are $2^4=16$ different microstates in total.

2. For macrostate, we look at collective behavior of the balls - how many balls are in the $L$ and $R$ box respectively. There are 5 different kinds of macrostates in total. 

3. The multiplicities of the macrostates with $L = 0,1,2,3,4$ balls in the $L$ box are $C^4_L = \{1,4,6,4,1\}$ respectively. 
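
The counting in this solution can be reproduced by brute force. The snippet below is a minimal sketch (the variable names are illustrative, not from the lecture) that enumerates every microstate of the 4-ball model and groups them by macrostate:

```python
from itertools import product
from collections import Counter

n_balls = 4

# Each microstate assigns every individual ball to box "L" or box "R".
microstates = ["".join(s) for s in product("LR", repeat=n_balls)]

# A macrostate only records how many balls sit in the L box.
multiplicity = Counter(state.count("L") for state in microstates)

print(len(microstates))  # total number of microstates
for n_left in range(n_balls + 1):
    w = multiplicity[n_left]
    print(f"{n_left}L-{n_balls - n_left}R: W = {w}")
```

Changing `n_balls` shows how quickly the middle macrostates come to dominate the count.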

<figure style="text-align: center">
  
    <figcaption> <b>Fig. 1</b> All the microstates and macrostates of a two-box model.</figcaption>
</figure>

## 1.2. Boltzmann entropy

### 1.2.1. From multiplicity to entropy

Imagine we allow the 4 balls to move freely between the two boxes and take snapshots from time to time. What will be the average number of balls in each box? In the simplest scenario, we can _assume_ the two boxes to be identical, so that each ball has equal probability to be found in either $L$ or $R$ box. Then 

- Each microstate has an equal probability $\frac{1}{16}$ of being found.
- The probability of being in a macrostate is proportional to how many microstates belong to it. E.g. the macrostate 2L-2R has a probability $\frac{6}{16}$ because its multiplicity is $6$. 

Now consider a more extreme case: say we begin with $1000$ balls in the $L$ box. After allowing the balls to move freely between the boxes, what will we observe? 

- The multiplicity of macrostate 1000L-0R is $C^{1000}_0 = 1$, so its probability of occurrence is $\frac{1}{2^{1000}} \sim 10^{-301}$.

- The multiplicity of macrostate 500L-500R is $C^{1000}_{500}$, so its probability of occurrence is $\frac{C^{1000}_{500}}{2^{1000}} \sim 10^{-2}$.

Our universe is only about 13.8 billion years old, i.e. $\sim 4\times 10^{17}$ seconds. So even if we took one snapshot every second, it would be effectively impossible to find the system in the 1000L-0R state. Not to mention that in reality, a few tens of grams of matter already contain $\sim 10^{23}$ atoms. 
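
These orders of magnitude can be checked with exact integer arithmetic. The following is a small sketch; the window of 450 to 550 balls used at the end is an arbitrary choice made here for illustration:

```python
from math import comb, log10

n = 1000  # number of balls

def log10_prob(n_left):
    """log10 of the probability of the macrostate with n_left balls in box L."""
    # Stay in log space: 2**-1000 is far too small to handle comfortably as a float.
    return log10(comb(n, n_left)) - n * log10(2)

print(log10_prob(0))    # exponent for the 1000L-0R macrostate
print(log10_prob(500))  # exponent for the 500L-500R macrostate

# Probability of finding between 450 and 550 balls in box L (exact integers, one division).
near_half = sum(comb(n, k) for k in range(450, 551)) / 2 ** n
print(near_half)
```

The last number illustrates the point of the argument: almost all of the probability sits in macrostates close to the even split.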


To cut a long story short, we have now learnt that: 

1. We can start from any macrostate, but in the end the system will be found in the macrostate with the highest multiplicity most of the time, i.e. $W$ of a system basically never decreases.

2. The quantity $S$ - the entropy of a heat engine - never decreases (which is derivable from the equations of thermodynamics).

Could $S$ and $W$ be something similar?

### 1.2.2. Boltzmann's hypothesis

The physicist Ludwig Boltzmann noticed this similarity between $S$ and $W$ in their irreversibility, and proposed his most famous hypothesis:

$$S \propto \log W $$

The Boltzmann hypothesis greatly extended the meaning of entropy - **_from originally only a concept in heat engines, to a concept which is applicable to any physical system that can be described by probabilities_**. The proportionality constant, which was later known as the **Boltzmann constant** $k_B \approx 1.38\times 10^{-23}\ \mathrm{J/K}$, is obtained by comparing the _experimental values of $S$_ and _theoretical values of $W$_ in an ideal gas heat engine. 

**Note:** When we talk about entropy in physics, we always use $k_B$ as the proportionality constant. However, in systems outside physics, we may use any convenient value as the proportionality constant. 

> The presence of the $\log$ follows from a purely mathematical argument - the total $S$ of two heat engines is computed by addition, i.e. $S_\text{tot}=S_X+S_Y$, but the total multiplicity of two systems is computed by multiplication, i.e. $W_\text{tot}=W_X\times W_Y$. The $\log$ is just the right operation so that 
>
> $$ S_\text{tot} = S_X+S_Y = k\log W_X + k\log W_Y = k\log (W_X\times W_Y) = k\log W_\text{tot} $$

P.S. The equation $S = k \log W$ is carved on Boltzmann's [grave](https://commons.wikimedia.org/wiki/File:Zentralfriedhof_Vienna_-_Boltzmann.JPG#/media/File:Zentralfriedhof_Vienna_-_Boltzmann.JPG) to commemorate his contribution.

### 1.2.3. Entropy as disorderness

In traditional physics textbooks, **_we usually regard entropy as disorder because of the Boltzmann hypothesis_**. The idea is intuitive - there are more ways to make things look messy than to make things look tidy, so a messy state implies higher multiplicity and thus higher entropy. Clothes have to be in wardrobes and books on bookshelves if you want your room to look clean, but you can put them anywhere else if you want your room to look messy.


<figure style="text-align: center">
  <img src="https://i0.wp.com/comic.hmp.is.it/wp-content/uploads/2014/02/0018-Entropy.png?fit=1041%2C1200&ssl=1" alt="entropy comic" style="width:30%">
    <figcaption> <b>Fig. 2</b> Entropy as disorderness. Retrieved from <a href="https://comic.hmp.is.it/comic/entropy/">HMP Comics.</a></figcaption>
</figure>

---

# 2. Principle of maximum uncertainty

## 2.1. Gibbs entropy

### 2.1.1. Distribution of microstates

The Boltzmann hypothesis relates entropy to multiplicity - the number of microstates that correspond to the macrostate - but it does not hint at how probable each of the microstates is. Consider the two-box model again. We know there are 6 microstates for the 2L-2R macrostate, but we cannot be sure that their probabilities of occurrence are equal. For example, what if the red ball (say, the first ball) is conscious and likes to stay in the $L$ box more? Then the LLRR, LRLR and LRRL states will occur more frequently than the other three. 

**_Telling the distribution of microstates requires prior knowledge about the system_**. You may read the first paragraph of section 1.2.1. again - before stating that every microstate of the two-box model has the same probability $\frac{1}{16}$, we **_assumed_** that each ball has equal probability to be found in either the $L$ or the $R$ box. But if we have more prior information (like the red ball being conscious), then the probability of each microstate is not $\frac{1}{16}$ anymore.

In classical physics, we rarely care about the distribution of microstates because 
1. Measurement is physically impossible, since the number of particles is normally of order $\sim 10^{23}$.
2. All molecules/atoms/particles of the same kind are indistinguishable, so it is safe to assume that all microstates are of equal probability. 

However, these two points do not carry over to information theory, where the equiprobable assumption of microstates is seldom valid. 

### 2.1.2. Formulation

Suppose the system is now observed to be in a macrostate with multiplicity $W$, and we know that the $W$ microstates occur with probabilities $\{p_1, p_2,...,p_W\}$. As probabilities, they must satisfy 

$$p_1+p_2+...+p_W=1 \ .$$

1. Let's say we measure which microstate the system is in $N$ times. The expected number of occurrences of each of the $W$ microstates would be $\{n_1, n_2, ..., n_W\}$ where

    $$n_j = p_jN \quad \text{with} \quad n_1 + n_2 + ... + n_W = N \ .$$

    $N$ can be chosen to be a very large number so that each $n_j$ is (practically) an integer. 
 
2. Then we can argue that the multiplicity $W$ is equivalent to the number of possible ways to observe the distribution $\{n_1, n_2, ..., n_W\}$ over a total of $N$ observations. It is just like having $W$ boxes and $N$ balls, where we have to arrange $n_1$ balls in the first box, $n_2$ balls in the second box, etc. By combinatorics, the number of possible arrangements is 

    $$ W = \frac{N!}{n_1!n_2!...n_W!} $$


3. What remains is to substitute this into the Boltzmann entropy and simplify with Stirling's approximation $\log n! \approx n\log n - n$. The linear terms cancel because $\sum_j n_j = N$, and the $\log N$ terms cancel because $\sum_j p_j = 1$. The Boltzmann entropy becomes 

    $$\begin{align*}
    S_\text{Boltz} &\propto \log \left(\frac{N!}{n_1!n_2!...n_W!}\right) \\[0.5em]
    &= \log N! - \log n_1! - \log n_2! -... - \log n_W! \\[0.5em]
    &\approx (N\log N - N) - (n_1\log n_1 - n_1) - (n_2\log n_2 - n_2) - ... - (n_W\log n_W - n_W) \\[0.5em]
    &= -[n_1\log n_1+ n_2\log n_2 + ... + n_W\log n_W] + N\log N \\[0.5em]
    &= -[Np_1\log (Np_1) + Np_2\log (Np_2) + ... + Np_W\log (Np_W)] + N\log N \\[0.5em]
    &= - N[p_1\log p_1 + p_2\log p_2 + ... + p_W\log p_W] - N[p_1 \log N + p_2 \log N + ... + p_W \log N] + N\log N \\[0.5em]
    &= - N\sum_{j=1}^W p_j\log p_j 
    \end{align*}
    $$

4. Absorbing the overall factor $N$ into the proportionality constant, we define the Gibbs entropy of a macrostate as 

    $$S_\text{Gibbs} \propto -\sum_{j=1}^W p_j \log p_j $$

    which concerns only the internal distribution of the microstates' probabilities within the same macrostate. Again, if we are talking about a physical system, we use $k_B$ as the proportionality constant. 
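
The Stirling step can be checked numerically. The sketch below assumes a hypothetical three-microstate distribution `probs` and compares $\frac{1}{N}\log\frac{N!}{n_1!...n_W!}$ with $-\sum_j p_j\log p_j$ for growing $N$, using `math.lgamma` for exact log-factorials:

```python
from math import lgamma, log

def log_multinomial(counts):
    """log( N! / (n_1! n_2! ... n_W!) ), using lgamma(n + 1) = log(n!)."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)

def gibbs_entropy(probs):
    """-sum_j p_j log p_j, skipping zero-probability microstates."""
    return -sum(p * log(p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]  # hypothetical microstate probabilities

for n_obs in (10, 1_000, 100_000):
    counts = [round(p * n_obs) for p in probs]  # n_j = p_j N
    per_observation = log_multinomial(counts) / n_obs
    print(n_obs, per_observation, gibbs_entropy(probs))
```

The per-observation value approaches the Gibbs entropy as $N$ grows, which is the content of the derivation above.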




> To be clear, the $p_j$ above are the conditional probabilities $P_{(\text{micro}=j|\text{macro}=i)} = p_{(j|i)}$, i.e. the probability of being in the $j$-th microstate given that the system is in a specific macrostate $i$. 
> 
> Suppose the system has access to the macrostates $i=\{1,2,...,n\}$, whose multiplicities are $\{W_1, W_2,...,W_n\}$ respectively. The probability of being in the $i$-th macrostate is:
> 
> $$P_{(\text{macro}=i)} = \frac{W_i}{\sum_{k=1}^n W_k} $$
> 
> Given that the system is in the $i$-th macrostate, and that its $W_i$ microstates have conditional probabilities of occurrence $\{p_{(1|i)}, p_{(2|i)},...,p_{(W_i|i)}\}$, the overall probability of being in the $j$-th microstate of macrostate $i$ is:
>
> $$
\begin{align*}
P_{(\text{micro}=j)} &= P_{(\text{micro}=j|\text{macro}=i)}\times P_{(\text{macro}=i)} \\
&= p_{(j|i)}\times\frac{W_i}{\sum_{k=1}^n W_k} 
\end{align*}
$$

### 2.1.3. Entropy as uncertainty

Interpreting entropy as "uncertainty" is a relatively new idea compared to "disorder". By uncertainty, we mean **how unsure we are about which microstate the system is in, given that we know its current macrostate**. Therefore, the distribution of microstates matters:

- If one of the microstates happens 99% of the time, we are quite certain which microstate we will observe.
- If all microstates are equiprobable, we have no idea which microstate we will observe. 
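
To put numbers on these two cases, here is a short sketch with an assumed macrostate of 6 microstates (the count 6 and the way the remaining 1% is spread are choices made for illustration):

```python
from math import log

def entropy(probs):
    """-sum p log p in nats, skipping zero-probability entries."""
    return -sum(p * log(p) for p in probs if p > 0)

n_micro = 6  # hypothetical number of microstates

# One microstate occurs 99% of the time; the rest share the remaining 1%.
peaked = [0.99] + [0.01 / (n_micro - 1)] * (n_micro - 1)
# All microstates are equally likely.
uniform = [1 / n_micro] * n_micro

print(entropy(peaked))   # small: we are almost certain of the outcome
print(entropy(uniform))  # equals log(6), the largest value possible here
```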

It was Claude Shannon who first suggested using the function $-\sum p\log p$ as a measure of the uncertainty of a distribution, borrowing the concept of entropy from physics. Since then, more kinds of entropies have been suggested as alternative measures of uncertainty in various systems.



## 2.2. The principle of maximum entropy

Usually, an observation of a macrostate provides very little or no prior information about the distribution of the microstates. An objective scientist should be honest about his observations, so he should not assume anything that he does not know. In other words, after using up the prior information, he should make himself as uncertain as possible about the microstate distribution, i.e. **_assume the microstate distribution to be the one with maximized entropy_**. This philosophy was promoted by Edwin Thompson Jaynes and has become a foundational technique in many disciplines. 


### 2.1 Review: Lagrange multipliers

The maximization of entropy is a typical problem of constrained optimization, which the method of Lagrange multipliers is designed to deal with. 

In general, we optimize an $n$-variable function $f(\mathbf{x})$ for $\mathbf{x}\equiv(x_1, x_2, \dots, x_n)$ subject to $m$ equalities $g_j(\mathbf{x}) = 0$. We define the Lagrangian function

$$
L(\mathbf{x}, \mathbf{\lambda}) = f(\mathbf{x})-\mathbf{\lambda}\cdot\mathbf{g}(\mathbf{x})
$$

with $\mathbf{\lambda}\equiv(\lambda_1, \lambda_2, \dots, \lambda_m)$ and $\mathbf{g}(\mathbf{x})\equiv [g_1(\mathbf{x}), g_2(\mathbf{x}), \dots, g_m(\mathbf{x})]$. Then we solve the system of $n+m$ equations:

$$
\nabla L = \left(\frac{\partial L}{\partial x_1},
\frac{\partial L}{\partial x_2}, \dots,
\frac{\partial L}{\partial x_n},
\frac{\partial L}{\partial \lambda_1},
\frac{\partial L}{\partial \lambda_2}, \dots,
\frac{\partial L}{\partial \lambda_m}\right) = 0
$$

for its optima $(\mathbf{x}^*,\mathbf{\lambda}^*)$, at which the constrained optima $f(\mathbf{x}^*)$ are yielded.



> The simplest case is the optimization of a 2-variable function $f(x, y)$ subject to only 1 constraint $g(x,y) = 0$. Define the Lagrangian function
>
>$$
L(x, y, \lambda) = f(x, y) -  \lambda g(x, y) 
$$
> Then we solve the system of 2+1 equations 
>
>$$ \begin{cases} \dfrac{\partial L}{\partial x}=0 \\[0.5em] \dfrac{\partial L}{\partial y}=0 \\[0.5em] \dfrac{\partial L}{\partial \lambda}=0 \end{cases}$$
>
> for its optima $(x^*, y^*, \lambda^*)$. The constrained optima of $f(x, y)$ are then given by $f(x^*, y^*)$.
>
> You can find some practical examples with visualization on the [Wiki page](https://en.wikipedia.org/wiki/Lagrange_multiplier#Examples)
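>
> As a quick illustration (a toy problem chosen here, not taken from the lecture), let us maximize $f(x,y)=xy$ subject to $g(x,y)=x+y-1=0$. The Lagrangian function is
>
>$$
L(x, y, \lambda) = xy - \lambda (x+y-1)
$$
> The three equations $\dfrac{\partial L}{\partial x}=y-\lambda=0$, $\dfrac{\partial L}{\partial y}=x-\lambda=0$ and $\dfrac{\partial L}{\partial \lambda}=-(x+y-1)=0$ give $x^*=y^*=\lambda^*=\tfrac{1}{2}$, so the constrained optimum is $f(x^*,y^*)=\tfrac{1}{4}$. On the line $x+y=1$ we have $f=x(1-x)$, which confirms that this optimum is a maximum.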
  

**Note:** Normally you have to check whether the optima are maxima, minima, or saddle points after optimizing with Lagrange multipliers. However, the function $f(p) = -p\log p$ for $0<p<1$ is concave with one and only one local maximum (a single-humped curve), so the entropy $-\sum_i p_i \log p_i$ is concave as well. With linear constraints, the stationary point we find is therefore guaranteed to be a maximum. 

## 2.2. Example: A dice

Let's demonstrate the use of Lagrange multipliers with a 6-faced dice. We can regard the microstates as the occurrence of each number. What can we say about the probability distribution of the dice's microstates? Let the probability of getting the $i$-th face be $p_i$.

### 2.2.1. Case 1: Know nothing

We know nothing about the dice except that its outcome is discrete. As probabilities, the $p_i$ always satisfy the normalization constraint $\sum_{i=1}^6 p_i =1$, so the only constraint is 

$$g_\text{norm}(p_1,...p_6) = \sum_{i=1}^6 p_i - 1 = 0$$ 

We want to maximize the dice's entropy $H=-\sum_{i=1}^6 p_i \ln p_i$, so we define the Lagrangian function as

$$
L(\mathbf{p}, \lambda) =- \sum_{i=1}^6 p_i \ln p_i - \lambda\left(\sum_{i=1}^6 p_i -1\right) \,.
$$

To solve $\nabla L = 0$, we consider the $i$-th face:

$$
\frac{\partial L}{\partial p_i} = -\ln p_i-1-\lambda = 0 \ .
$$

This yields $p_i=e^{-1-\lambda}$, which is constant for all faces. Upon normalizing, we get

$$p_i= \frac{e^{-1-\lambda}}{\sum_{k=1}^6 e^{-1-\lambda}} = \frac{1}{6}$$
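
As a sanity check on this result, the sketch below draws many random distributions over the six faces and confirms that none of them beats the entropy of the uniform one. The sampling scheme (normalizing uniform random weights) and the fixed seed are arbitrary choices made here:

```python
import random
from math import log

def entropy(probs):
    """-sum p ln p, skipping zero-probability entries."""
    return -sum(p * log(p) for p in probs if p > 0)

random.seed(0)  # fixed seed so the run is reproducible

h_uniform = entropy([1 / 6] * 6)

best_random = 0.0
for _ in range(10_000):
    weights = [random.random() for _ in range(6)]
    total = sum(weights)
    probs = [w / total for w in weights]  # a random normalized distribution
    best_random = max(best_random, entropy(probs))

print(h_uniform, best_random)
```

A random search like this proves nothing by itself, but it is a cheap way to catch an algebra slip in the Lagrangian calculation.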

### 2.2.2. Case 2: Know the mean of outcome

Let the outcome of the $i$-th face be $x_i = i$, i.e. $x_i \in \{1,2,3,4,5,6\}$. Suppose this time we have learnt that the mean of its outcome is $\mu=3.75$. Besides the normalization constraint, we get another constraint (call it the mean constraint)

$$g_\text{mean}(p_1,...p_6) = \langle x\rangle -\mu = \sum_{i=1}^6 x_i p_i - \mu = 0$$
 
so we have the following Lagrangian function:

$$
L(\mathbf{p}, \lambda_1, \lambda_2) = -\sum_{i=1}^6 p_i \ln p_i - \lambda_1\left(\sum_{i=1}^6 p_i -1\right) - \lambda_2 \left(\sum_{i=1}^6 x_ip_i -\mu\right) \ .
$$

Upon differentiation, it becomes

$$
\begin{align*}
\frac{\partial L}{\partial p_i} &= -\ln p_i-1-\lambda_1-\lambda_2 x_i = 0 \\
\Rightarrow p_i &= e^{-1-\lambda_1-\lambda_2 x_i} \ .
\end{align*}
$$

Substituting the result back into the normalization constraint, we get 

$$
\begin{align*}
\sum_{i=1}^6 e^{-1-\lambda_1-\lambda_2 x_i} &= e^{-1-\lambda_1}\left(\sum_{i=1}^6 e^{-\lambda_2x_i}\right) = 1 \\
e^{-1-\lambda_1} &= \frac{1}{\sum_{i=1}^6 e^{-\lambda_2x_i}}
\end{align*}
$$ 

and thus $p_i=\frac{e^{-\lambda_2x_i}}{\sum_{k=1}^6 e^{-\lambda_2x_k}}$. Then, we substitute this normalized probability into the mean constraint, which we have not used yet, to get

$$
\frac{\sum_{i=1}^6 x_i e^{-\lambda_2x_i}}{\sum_{i=1}^6 e^{-\lambda_2x_i}} = \mu \ .
$$

Finally, we solve this equation numerically for $\lambda_2$ and then get $p_i$. For $\mu=3.75$, we get $\lambda_2\approx-0.0861$, and $p_i \approx (0.133, 0.145, 0.158, 0.172, 0.188, 0.204)$.
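
The numerical step can be done with any one-dimensional root finder. Below is a minimal sketch using plain bisection; the bracket $[-5, 5]$ for $\lambda_2$ is an assumption made here, justified by the fact that the mean decreases monotonically from about 6 to about 1 over that range:

```python
from math import exp

faces = [1, 2, 3, 4, 5, 6]
mu = 3.75  # target mean from the text

def maxent_probs(lam):
    """p_i proportional to exp(-lam * x_i), normalized over the six faces."""
    weights = [exp(-lam * x) for x in faces]
    z = sum(weights)
    return [w / z for w in weights]

def mean_gap(lam):
    """Mean under maxent_probs(lam) minus the target mean."""
    return sum(x * p for x, p in zip(faces, maxent_probs(lam))) - mu

lo, hi = -5.0, 5.0  # assumed bracket: mean_gap(lo) > 0 > mean_gap(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_gap(mid) > 0:
        lo = mid  # mean still too large, so lambda_2 must increase
    else:
        hi = mid

lam2 = 0.5 * (lo + hi)
probs = maxent_probs(lam2)
print(lam2)
print([round(p, 3) for p in probs])
```

A negative $\lambda_2$ tilts the distribution towards the larger faces, as expected for a mean above $3.5$.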

### 2.2.3. Case 3: Know the mean and variance of outcome

The procedure of entropy maximization stays the same when we are given more information, but we can formulate our Lagrangian function skillfully to make our lives easier.

Consider the dice again. What is the probability distribution if we further know that the variance of its outcome is $\sigma^2=0.5$? The constraint of variance is

$$g_\text{var}(p_1,...p_6) = \langle (x-\mu)^2\rangle - \sigma^2 = \sum_{i=1}^6 (x_i-\mu)^2p_i-\sigma^2 = 0 \ .$$

You may be tempted to define the Lagrangian function as

$$
\begin{align}
\tilde{L}(\mathbf{p}, \lambda_1, \lambda_2, \lambda_3) 
=&-\sum_{i=1}^6 p_i \ln p_i - \lambda_1\left(\sum_{i=1}^6 p_i -1\right) - \lambda_2 \left(\sum_{i=1}^6 x_ip_i -\mu\right) - \lambda_3 \left[\sum_{i=1}^6 \left(x_i-\mu\right)^2p_i -\sigma^2\right]\,,
\end{align}
$$

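with which you will obtain $\tilde{p}_i=\frac{e^{-\lambda_2x_i-\lambda_3(x_i-\mu)^2}}{\sum_{k=1}^6 e^{-\lambda_2x_k-\lambda_3(x_k-\mu)^2}}$. This is not wrong - you can solve for the two unknowns $\lambda_2$ and $\lambda_3$ with the two constraints of mean and variance - but it is unnecessarily complicated. Completing the square shows that $\tilde{p}_i$ is just a discretized Gaussian centered at $\mu-\lambda_2/(2\lambda_3)$. Strictly speaking, the constraint $\sum_{i=1}^6 (x_i-\mu)^2p_i=\sigma^2$ alone does not force the mean to be $\mu$, because $\langle (x-\mu)^2\rangle = \mathrm{Var}(x) + (\langle x\rangle-\mu)^2$. In this example, however, the distribution centered exactly at $\mu$ (i.e. $\lambda_2=0$) already reproduces the mean to a very good approximation. Therefore, as a shortcut, we can drop the constraint of mean and simplify the Lagrangian function as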

$$
L(\mathbf{p}, \lambda_1, \lambda_3)
=-\displaystyle \sum_{i=1}^6 p_i \ln p_i - \lambda_1\left(\sum_{i=1}^6 p_i -1\right) - \lambda_3 \left[\sum_{i=1}^6 \left(x_i-\mu\right)^2p_i -\sigma^2\right] \ ,
$$

which yields $p_i=\frac{e^{-\lambda_3(x_i-\mu)^2}}{\sum_{k=1}^6 e^{-\lambda_3(x_k-\mu)^2}}$, having one unknown fewer. For $\mu=3.75$ and $\sigma^2=0.5$, we get $\lambda_3\approx1$ and $\mathbf{p} \approx (0.000, 0.026, 0.321, 0.530, 0.118, 0.004)$.
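
The same bisection idea works for $\lambda_3$. The sketch below solves the simplified single-constraint form and then reports the mean of the resulting distribution, so that one can see how well the dropped mean constraint is satisfied. The bracket $[0, 5]$ is an assumption made here (the spread about $\mu$ shrinks monotonically as $\lambda_3$ grows):

```python
from math import exp

faces = [1, 2, 3, 4, 5, 6]
mu, sigma2 = 3.75, 0.5  # targets from the text

def maxent_probs(lam):
    """p_i proportional to exp(-lam * (x_i - mu)^2), normalized."""
    weights = [exp(-lam * (x - mu) ** 2) for x in faces]
    z = sum(weights)
    return [w / z for w in weights]

def spread(probs):
    """Second moment about mu: sum_i (x_i - mu)^2 p_i."""
    return sum((x - mu) ** 2 * p for x, p in zip(faces, probs))

lo, hi = 0.0, 5.0  # assumed bracket: spread too large at lo, too small at hi
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if spread(maxent_probs(mid)) > sigma2:
        lo = mid  # distribution still too wide, so lambda_3 must increase
    else:
        hi = mid

lam3 = 0.5 * (lo + hi)
probs = maxent_probs(lam3)
mean = sum(x * p for x, p in zip(faces, probs))
print(lam3)
print([round(p, 3) for p in probs])
print(mean)  # compare with the target mu
```

If the printed mean were noticeably off from $\mu$, one would have to go back to the two-multiplier form $\tilde{p}_i$ and solve for $\lambda_2$ and $\lambda_3$ together.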



---
# 3. Entropy of continuous distributions

## 3.1. Differential entropy

For a continuous variable $X$ distributed with $f_X(x)$, Shannon wrote the information entropy formula by naively replacing the summation with integration: 

$$
H_\text{discrete}(X) = -\sum_i P(x_i)\log P(x_i) \quad\Rightarrow\quad H_\text{continuous}(X) = - \int f_X(x) \log f_X(x) \mathrm{d}x 
$$

which is also called differential entropy. This definition is less useful than its discrete version for a number of reasons:

- Its value can be negative.
- It is not invariant under change of variables, i.e. replacing $X \rightarrow g(X)$.
- It is not dimensionally correct - $\log$ requires a dimensionless argument, but $f(x)$ must carry units of $\sim\frac{1}{\mathrm{d}x}$ in order for the integral $\int f(x)\mathrm{d}x$ to be dimensionless. 
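
The first two points are easy to see numerically. The sketch below uses a simple midpoint Riemann sum (an assumption made here for illustration; any quadrature would do) on a uniform density over $[0, 0.5]$, and on the same variable rescaled by a factor of 100:

```python
from math import log

def differential_entropy(pdf, a, b, n=100_000):
    """Midpoint-rule estimate of -∫_a^b f(x) log f(x) dx."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        f = pdf(a + (k + 0.5) * dx)
        if f > 0:
            total -= f * log(f) * dx
    return total

# X uniform on [0, 0.5]: density 2 everywhere on the interval.
h_x = differential_entropy(lambda x: 2.0, 0.0, 0.5)

# Y = 100 X (e.g. the same length in cm instead of m): uniform on [0, 50].
h_y = differential_entropy(lambda y: 0.02, 0.0, 50.0)

print(h_x)        # negative
print(h_y - h_x)  # shifted by log(100) just from changing units
```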

Jaynes proposed an adaptation that permits a smooth transition from the discrete to the continuous limit:

$$
H(X)= -\int f_X(x) \log \left(\frac{f_X(x)}{m(x)}\right) \mathrm{d}x
$$

where $m(x)$ is the _limit of increasingly dense discrete distributions_. It is formally defined through

\begin{align*}
\int_a^b m(x) \mathrm{d}x &= \lim_{N\to \infty} \frac{\text{no. of points in }(a,b)}{\text{total no. of points } N \text{ on number line}}
\end{align*}

> **Informal explanation:** 
> 
> To visualize the above definition, you may try to scatter a lot of points on a line, with the separations between points representing the density of the points (smaller separation = denser). If the points in the interval $(a,b)$ are uniformly distributed, the number of points inside should be proportional to the length of the interval $(b-a) = \int_a^b 1 \mathrm{d}x$. Thus $m(x)$ is some distribution describing the density of points per length: 
> 
> $$
\int_a^b m(x) \mathrm{d}x \sim m(x)\times (\text{length of interval }(a,b)) \sim \frac{\text{no. of points in }(a,b)}{\text{total no. of points on number line}}
$$
> 
> The limit of $N\to \infty$ is sort of a mathematical argument to ensure that the points are "dense enough" to look like a continuous distribution.



## 3.2. Kullback–Leibler (KL) divergence

A few years before Jaynes proposed the new entropy formula, the concept of KL divergence, a.k.a. relative entropy, had already been introduced in information theory by Solomon Kullback and Richard Leibler. The symbol $D_{KL}(P||Q)$, read as "relative entropy of $P$ with respect to $Q$", is defined as 

$$
D_{KL}(P||Q) = 
\begin{cases}
\sum_i P(x_i)\log \dfrac{P(x_i)}{Q(x_i)} \quad\quad (\text{discrete }X)\\[1.2em]
\displaystyle\int_{-\infty}^\infty p_X(x)\log \dfrac{p_X(x)}{q_X(x)} \mathrm{d}x  \quad\quad (\text{continuous }X)
\end{cases}
$$
    
The quantity originally measured the expected number of extra bits required to code a message drawn from the distribution $p(x)$ when using a code based on $q(x)$, rather than a code based on $p(x)$. 

\begin{align*}
D_{KL}(P||Q) &= \sum_i P(x_i)\log \frac{P(x_i)}{Q(x_i)} \\
&= \left[-\sum_i P(x_i) \log Q(x_i)\right] - \left[-\sum_i P(x_i) \log P(x_i)\right] \\
&= \begin{pmatrix}\text{expected uncertainty} \\ \text{when coded using }Q(x)\end{pmatrix} - \begin{pmatrix}\text{original uncertainty} \\ \text{when coded using }P(x)\end{pmatrix}
\end{align*}

But nowadays KL divergence is more frequently used as a measure of the discrepancy between two distributions $P(x)$ and $Q(x)$, with $P(x)$ being the theoretical/original distribution and $Q(x)$ being the distribution to be tested, because KL divergence is always non-negative: 

- $D_{KL}(P||Q) = 0$ if and only if $P(x)$ and $Q(x)$ are the same distribution.
- $D_{KL}(P||Q) \rightarrow \infty$ if there is any $x$ with $P(x)\neq 0$ but $Q(x)=0$, i.e. $Q(x)$ rules out an outcome that $P(x)$ allows. 

Note that Jaynes's re-definition of Shannon entropy and the KL divergence share the same formula, differing only by a minus sign. When $m(x)=1$, Jaynes's formula returns Shannon's differential entropy. Therefore we can interpret the Shannon differential entropy of a distribution as (minus) its discrepancy from the most humble testing distribution $m(x)=1$, i.e. a uniform density equal to $1$ everywhere.

> **Note:** Although people often refer to KL divergence as some kind of "distance" (metric) between distributions, it is not a metric in the mathematical sense since it
> 
> - Is not symmetric between $P$ and $Q$.
> - Does not satisfy the triangle inequality. 
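
The properties listed in this section can be illustrated with a few lines of code. This is a minimal sketch for discrete distributions; the example distributions `p` and `q` are hypothetical:

```python
from math import log, inf

def kl_divergence(p, q):
    """Discrete D_KL(P||Q) in nats; terms with P(x) = 0 contribute zero."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return inf  # P puts mass where Q has none
        total += pi * log(pi / qi)
    return total

p = [0.5, 0.4, 0.1]  # hypothetical "original" distribution
q = [0.3, 0.3, 0.4]  # hypothetical distribution to be tested

print(kl_divergence(p, p))                # zero for identical distributions
print(kl_divergence(p, q))                # positive
print(kl_divergence(q, p))                # positive, but a different value
print(kl_divergence(p, [0.5, 0.5, 0.0]))  # infinite
```

The second and third lines differ, which is the asymmetry mentioned in the note above.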

## 3.3. Principle of maximum entropy - calculus of variations

(This part is purely for fun. If you don't understand a word, just skip it.)

Although it is possible to apply the principle of maximum entropy to continuous distributions, like what we did in the discrete cases, the math becomes way more complicated when the things we want to optimize are not numbers but continuous functions. Let's say the optimization is formulated as 

$$
\max_{f} H(f) \quad\quad \text{subject to }\quad g(f)=0
$$

But note that 
- $f=f(x)$ is a function that we have to vary.
- $H(f) = -\displaystyle \int_a^b f(x)\log f(x) \mathrm{d}x$ is the entropy "function of function" we want to optimize. 
- $g(f) = 0$ is another "function of function" that constrains $f(x)$. 

Mathematicians have invented the term "_functional_" for these kinds of "function of function", which map a function to a number. The field of optimizing a functional is called the "_calculus of variations_". 

The method of Lagrange multipliers starts by writing a bigger functional that looks like the Lagrangian function in the discrete case: 

$$
J(f,\lambda) = -\int_a^b f(x)\log f(x) \mathrm{d}x - \lambda \cdot g(f(x))
$$

If $J$ can be turned into something that looks like $J(f,\lambda) = \int L[f(x)]\mathrm{d}x$, the "derivative" of $J$ with respect to $f$ can be formulated as 

\begin{align*}
\frac{\delta J}{\delta f} &\equiv \frac{\partial{L}}{\partial f} 
- \frac{\mathrm{d}}{\mathrm{d}x}\frac{\partial L}{\partial f'}
+ \frac{\mathrm{d}^2}{\mathrm{d}x^2}\frac{\partial L}{\partial f''}
- \frac{\mathrm{d}^3}{\mathrm{d}x^3}\frac{\partial L}{\partial f'''} + \cdots\\ 
&= \sum_{n=0}^\infty (-1)^n
\frac{\mathrm{d}^n}{\mathrm{d}x^n}
\frac{\partial L}{\partial f^{(n)}}
\end{align*}

where $f$ and its $n$-th derivatives $f^{(n)}$ are treated like independent variables when carrying out the differentiation. Setting this expression to zero gives the Euler-Lagrange equation, and the function $L$ is, somewhat unfortunately thanks to Lagrange's wisdom, also called a Lagrangian function. Finally, the optimal $f$ can be found by solving $\dfrac{\delta J}{\delta f} = 0$.


#### Example: A uniform prior

Suppose we have a continuous random variable $X$. If the only thing we know is that $X$ takes values only within $[a,b]$, what can we say about its distribution $f(x)$?

**Solution.** The distribution is still subject to the normalization condition $\int_a^b f(x) \mathrm{d}x=1$. So the functional $J$ can be written as

$$
J(f,\lambda) = - \int_a^b f(x) \log f(x)\mathrm{d}x - \lambda\left[\int_a^b f(x)\mathrm{d}x-1\right]
$$

Thus the Lagrangian function is 

$$
L[f] = -f \log f -\lambda f + \frac{\lambda}{b-a}
$$

It contains no derivatives of $f(x)$, so only the first term remains in the differentiation:

$$
\frac{\delta J}{\delta f}=\frac{\mathrm{\partial}L}{\mathrm{\partial}f}=-\log f-1-\lambda
$$

By solving $\dfrac{\delta J}{\delta f}=0$, we can see that $f(x)=e^{-1-\lambda}$, which is independent of $x$. Substituting into the normalization condition: 

\begin{align*}
\int_a^b e^{-1-\lambda} \mathrm{d}x= (b-a)e^{-1-\lambda} &= 1 \\
e^{-1-\lambda} &= \frac{1}{b-a} = f(x)
\end{align*}

which is a pretty intuitive solution - if we know nothing about a continuous variable except its range $[a,b]$, the most objective belief is that it is distributed uniformly over the range, i.e. $f(x)=\dfrac{1}{b-a}$.  
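
As a rough numerical cross-check, the sketch below compares the differential entropy of the uniform density with that of one alternative density on the same range. The range $[2, 5]$, the symmetric triangular alternative and the midpoint-rule integrator are all choices made here for illustration:

```python
from math import log

def differential_entropy(pdf, a, b, n=100_000):
    """Midpoint-rule estimate of -∫_a^b f(x) log f(x) dx."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        f = pdf(a + (k + 0.5) * dx)
        if f > 0:
            total -= f * log(f) * dx
    return total

a, b = 2.0, 5.0  # hypothetical range

def uniform(x):
    return 1.0 / (b - a)

def triangular(x):
    """Symmetric triangular density on [a, b], peaked at the midpoint."""
    mid, half = 0.5 * (a + b), 0.5 * (b - a)
    return (1.0 - abs(x - mid) / half) / half

h_uniform = differential_entropy(uniform, a, b)
h_triangular = differential_entropy(triangular, a, b)
print(h_uniform, log(b - a))  # the uniform density attains log(b - a)
print(h_triangular)           # any other density on [a, b] scores lower
```

Comparing against a single alternative is of course not a proof; the variational argument above is what guarantees that no density on $[a,b]$ does better.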