### KL Divergence

KL divergence, short for Kullback-Leibler divergence, measures how much a probability distribution $P$ differs from a reference distribution $Q$. It is widely used in statistics, information theory, and machine learning. It is not symmetric in its two arguments, so it is not a true distance metric.

#### Definition
For continuous distributions, KL divergence is defined as:

$$
D_{KL}(P\|Q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)} dx
$$

where $p(x)$ and $q(x)$ are probability density functions of two distributions $P$ and $Q$, respectively. When $P$ and $Q$ are discrete probability distributions, the integral is replaced by a sum, and the formula becomes:

$$
D_{KL}(P\|Q) = \sum_{i=1}^{n} p(i) \log\frac{p(i)}{q(i)}
$$

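where $p(i)$ and $q(i)$ are the probabilities of the $i$-th outcome in distributions $P$ and $Q$, respectively. By convention, terms with $p(i) = 0$ contribute zero, and the divergence is infinite if $q(i) = 0$ for some outcome with $p(i) > 0$. With the natural logarithm the result is in nats; with base 2 it is in bits.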
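
The discrete formula translates almost directly into code. The following is a minimal sketch, assuming both distributions are given as equal-length lists of probabilities that each sum to 1; the function name `kl_divergence` is a hypothetical helper chosen for illustration, and the result is in nats because it uses the natural logarithm.

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats (hypothetical helper).

    Assumes p and q are equal-length sequences of probabilities.
    Terms with p[i] == 0 are skipped (they contribute zero);
    if q[i] == 0 while p[i] > 0, the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # divergence of P from Q
print(kl_divergence(q, p))  # generally a different value
print(kl_divergence(p, p))  # identical distributions
```

Swapping the arguments generally changes the result, which is why the order of $P$ and $Q$ matters.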

#### Jensen's Inequality
Jensen's inequality can be used to prove several important properties of KL divergence. It states that for any convex function $f$ and any random variable $X$ with finite expectation, the expectation of $f(X)$ is greater than or equal to $f$ evaluated at the expectation of $X$. Mathematically, it can be written as:

$$
f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]
$$

or equivalently,

$$
\mathbb{E}[f(X)] - f(\mathbb{E}[X]) \geq 0
$$

Intuitively, applying a convex function and then averaging gives a value at least as large as averaging first and then applying the function. Equality holds when $f$ is linear (affine) on the range of $X$, or when $X$ is almost surely constant. For a strictly convex function such as $-\log x$, equality holds only when $X$ is almost surely constant.
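
Jensen's inequality can be checked numerically on a sample, since it also holds for the empirical distribution of any finite set of points. The sketch below assumes $f(x) = -\log x$ as the convex function and a uniform sample on $[0.5, 2]$; the helper names `jensen_gap` and `neg_log` are hypothetical.

```python
import math
import random

def jensen_gap(f, samples):
    """Return E[f(X)] - f(E[X]) under the empirical distribution of samples."""
    n = len(samples)
    mean_x = sum(samples) / n
    mean_fx = sum(f(x) for x in samples) / n
    return mean_fx - f(mean_x)

def neg_log(x):
    # -log(x) is convex on x > 0
    return -math.log(x)

random.seed(0)
samples = [random.uniform(0.5, 2.0) for _ in range(10_000)]

gap_convex = jensen_gap(neg_log, samples)                   # positive gap
gap_affine = jensen_gap(lambda x: 3.0 * x + 1.0, samples)   # zero up to rounding
gap_const = jensen_gap(neg_log, [1.5] * 100)                # zero up to rounding

print(gap_convex, gap_affine, gap_const)
```

The gap is positive for the convex function on a spread-out sample, and vanishes (up to floating-point rounding) for an affine function or a constant sample.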

#### Properties of KL Divergence
1. **KL Divergence is always non-negative.**    
   $$
   \begin{aligned}
   D_{\text{KL}}(P||Q) &= -\int p(x) \text{log} \frac{q(x)}{p(x)} dx \\
   &= E_{p} \left [ -\text{log} \frac{q(x)}{p(x)} \right ] \\
   &\geq -\text{log} E_{p} \left [ \frac{q(x)}{p(x)} \right ] \quad \because -\text{log}(x) \text{ is convex (Jensen's inequality)} \\
   &= -\text{log} \int p(x) \frac{q(x)}{p(x)} dx \\
   &= -\text{log} \int q(x) dx \\
   &= -\text{log}(1) \\
   &= 0 \\  
   \\
   \therefore D_{\text{KL}}(P||Q) & \geq 0
   \end{aligned}
   $$

2. **The cross-entropy is always greater than or equal to entropy.**
   $$
   \begin{aligned} 
   D_{\text{KL}}(P||Q) &= \int p(x) \text{log} \frac{p(x)}{q(x)} dx \\
   &= \int p(x) \text{log } p(x) dx - \int p(x) \text{log } q(x) dx \\
   &\geq 0 \\
   \\
   \therefore -\int p(x) \text{log } q(x) dx &\geq -\int p(x) \text{log } p(x) dx
   \end{aligned}
   $$
   where $-\int p(x) \text{log } q(x) dx$ and $-\int p(x) \text{log } p(x) dx$ are the cross-entropy of $Q$ relative to $P$ and the (differential) entropy of $P$, respectively. The inequality $\geq 0$ follows from property 1.

3. **For two univariate normal distributions $P = \mathcal{N}(\mu_{p}, \sigma_{p}^{2})$ and $Q = \mathcal{N}(\mu_{q}, \sigma_{q}^{2})$, the KL divergence simplifies to**
   $$
   \begin{aligned}
   D_{\text{KL}}(P||Q) = \text{log }\frac{\sigma_{q}}{\sigma_{p}} + \frac{\sigma_{p}^{2} + (\mu_{p}-\mu_{q})^{2}}{2 \sigma_{q}^{2}} - \frac{1}{2}
   \end{aligned}
   $$

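   **Proof.** Split the divergence into two expectations under $p$ and evaluate each one using $E_{p}[x] = \mu_{p}$ and $E_{p}[(x-\mu_{p})^{2}] = \sigma_{p}^{2}$.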
   $$
   \begin{aligned}
   D_{\text{KL}}(P||Q) &= \int p(x) \text{log} \frac{p(x)}{q(x)} dx \\
   &= E_{p} \left [ \text{log} \frac{p(x)}{q(x)} \right ] \\
   &= E_{p} [ \text{log } p(x) - \text{log } q(x) ] \\
   &= E_{p} [ \text{log } p(x) ] - E_{p} [ \text{log } q(x) ] \\
   E_{p} [ \text{log } p(x) ] &= E_{p} \left \{ \text{log } \frac{1}{\sqrt{2\pi}\sigma_{p}} \text{exp} \left [ {-\frac{(x-\mu_{p})^{2}}{2\sigma_{p}^{2}}} \right ] \right \} \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{p}} - \frac{1}{2\sigma_{p}^{2}} E_{p} \left [ (x-\mu_{p})^{2} \right ] \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{p}} - \frac{1}{2\sigma_{p}^{2}} \cdot \sigma_{p}^{2} \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{p}} - \frac{1}{2} \\
   E_{p} [ \text{log } q(x) ] &= E_{p} \left \{ \text{log } \frac{1}{\sqrt{2\pi}\sigma_{q}} \text{exp} \left [ {-\frac{(x-\mu_{q})^{2}}{2\sigma_{q}^{2}}} \right ] \right \} \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{1}{2\sigma_{q}^{2}} E_{p} \left [ (x-\mu_{q})^{2} \right ] \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{1}{2\sigma_{q}^{2}} E_{p} \left [ x^{2} - 2x\mu_{q} + \mu_{q}^{2} \right ] \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{1}{2\sigma_{q}^{2}} E_{p} \left [ x^{2} - 2x\mu_{q} + \mu_{q}^{2} - 2x\mu_{p} + 2x\mu_{p} + \mu_{p}^{2} - \mu_{p}^{2} \right ] \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{1}{2\sigma_{q}^{2}} E_{p} \left [ (x^{2} - 2x\mu_{p} + \mu_{p}^{2}) - 2x\mu_{q} + 2x\mu_{p} + \mu_{q}^{2} - \mu_{p}^{2} \right ] \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{1}{2\sigma_{q}^{2}} \left \{ E_{p} [ (x-\mu_{p})^{2} ] - 2\mu_{p}\mu_{q} + 2\mu_{p}^{2} + \mu_{q}^{2} - \mu_{p}^{2} \right \} \\
   &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} - \frac{\sigma_{p}^{2} + (\mu_{p}-\mu_{q})^{2}}{2\sigma_{q}^{2}}  \\
   D_{\text{KL}}(P||Q) &= \text{log }\frac{1}{\sqrt{2\pi}\sigma_{p}} - \frac{1}{2} - \text{log }\frac{1}{\sqrt{2\pi}\sigma_{q}} + \frac{\sigma_{p}^{2} + (\mu_{p}-\mu_{q})^{2}}{2\sigma_{q}^{2}} \\
   &= \text{log }\frac{\sigma_{q}}{\sigma_{p}} + \frac{\sigma_{p}^{2} + (\mu_{p}-\mu_{q})^{2}}{2\sigma_{q}^{2}} - \frac{1}{2}
   \end{aligned}
   $$
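
The closed form in property 3 can be sanity-checked against a direct numerical evaluation of the defining integral. The following is a rough sketch, assuming a midpoint-rule approximation over $\mu_{p} \pm 12\sigma_{p}$; the function names `kl_normal`, `log_normal_pdf`, and `kl_numeric`, and the parameters `n` and `width`, are hypothetical choices for illustration.

```python
import math

def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
            - 0.5)

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (x - mu)**2 / (2.0 * sigma**2))

def kl_numeric(mu_p, sigma_p, mu_q, sigma_q, n=20_000, width=12.0):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    lo = mu_p - width * sigma_p
    hi = mu_p + width * sigma_p
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        log_p = log_normal_pdf(x, mu_p, sigma_p)
        log_q = log_normal_pdf(x, mu_q, sigma_q)
        # Working with log densities avoids dividing by a tiny q(x).
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total

closed = kl_normal(0.0, 1.0, 1.0, 2.0)
approx = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(closed, approx)                    # the two values should agree closely
print(kl_normal(0.0, 1.0, 0.0, 1.0))     # identical distributions
print(kl_normal(1.0, 2.0, 0.0, 1.0))     # arguments swapped
```

The closed form and the numerical integral agree closely, the divergence of a distribution from itself is zero, and swapping $P$ and $Q$ gives a different value, consistent with properties 1 and 3.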