# Total Variation Distance

## Definition
Let $P$ and $Q$ be probability measures on a discrete sample space $E$ with probability mass functions $f$ and $g$. The total variation distance between $P$ and $Q$ is defined as$$\text {TV}(\mathbf{P}, \mathbf{Q}) = {\max _{A \subset E}}| \mathbf{P}(A) - \mathbf{Q}(A) |$$
and can be computed as$$\text {TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \, \sum _{x \in E} |f(x) - g(x)|$$
Let $P$ and $Q$ be probability distributions on a continuous sample space $E$ with probability density functions $f$ and $g$. The total variation distance between $P$ and $Q$ is defined in the same way,$$\text {TV}(\mathbf{P}, \mathbf{Q}) = {\max _{A \subset E}}| \mathbf{P}(A) - \mathbf{Q}(A) |$$
and can be computed as$$\text {TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \, \int _{E} |f(x) - g(x)|~ \text {d}x$$
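
To make the two discrete formulas concrete, here is a minimal sketch that computes the total variation distance both ways on a tiny sample space. The function names `tv_distance` and `tv_by_max` and the example pmfs are illustrative assumptions, not part of the notes; the brute-force maximum over subsets is only feasible when $E$ has a handful of elements.

```python
from itertools import chain, combinations

def tv_distance(f, g):
    """Half the L1 distance between two pmfs given as dicts {outcome: probability}."""
    support = set(f) | set(g)
    return 0.5 * sum(abs(f.get(x, 0.0) - g.get(x, 0.0)) for x in support)

def tv_by_max(f, g):
    """Brute-force max over all subsets A of E of |P(A) - Q(A)|."""
    support = sorted(set(f) | set(g))
    subsets = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        abs(sum(f.get(x, 0.0) - g.get(x, 0.0) for x in A)) for A in subsets
    )

# Hypothetical pmfs on E = {"a", "b", "c"}
f = {"a": 0.5, "b": 0.3, "c": 0.2}
g = {"a": 0.2, "b": 0.3, "c": 0.5}

print(tv_distance(f, g))  # both values agree (about 0.3) up to rounding
print(tv_by_max(f, g))
```

In this example the maximizing event is $A = \{a\}$ (or its complement), the set where $f$ exceeds $g$.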

## Properties
$\text{TV}(\mathbf{P}, \mathbf{Q}) = \text{TV}(\mathbf{Q}, \mathbf{P})$ (symmetric)  
$\text{TV}(\mathbf{P}, \mathbf{Q}) \geq 0$ (nonnegative)  
$\text{TV}(\mathbf{P}, \mathbf{Q}) = 0 \iff \mathbf{P}= \mathbf{Q}$ (definite)  
$\text{TV}(\mathbf{P}, \mathbf{V}) \leq \text{TV}(\mathbf{P}, \mathbf{Q}) + \text{TV}(\mathbf{Q}, \mathbf{V})$ (triangle inequality)  
These imply that the total variation is a distance between probability distributions.  
The smallest number $M$ such that $\text{TV}(P, Q) \leq M$ for all probability measures $P, Q$ is $1$.
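
As a quick worked check of this bound (an illustrative example, not taken from the notes), take $P = \text{Ber}(p)$ and $Q = \text{Ber}(q)$ on $E = \{0, 1\}$:
$$\text {TV}(\text{Ber}(p), \text{Ber}(q)) = \frac{1}{2} \left( |p - q| + |(1 - p) - (1 - q)| \right) = |p - q|$$
Choosing $p = 1$ and $q = 0$ gives a total variation distance of $1$, so the value $1$ is attained. Since $|P(A) - Q(A)| \leq 1$ for every event $A$, no pair of distributions can exceed it.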

# Kullback-Leibler (KL) Divergence

## Definition
Let $P$ and $Q$ be discrete probability distributions on a common sample space $E$ with pmfs $p$ and $q$ respectively. The KL divergence (also known as relative entropy) between $P$ and $Q$ is defined by$$\text {KL}(\mathbf{P}, \mathbf{Q}) = \sum _{x \in E} p(x) \ln \left( \frac{p(x)}{q(x)} \right)$$where the sum is only over the support of $P$.  
Analogously, if $P$ and $Q$ are continuous probability distributions with pdfs $p$ and $q$ on a common sample space $E$, then$$\text {KL}(\mathbf{P}, \mathbf{Q}) = \int _{E} p(x) \ln \left( \frac{p(x)}{q(x)} \right) \text {d}x$$where the integral is again only over the support of $P$.
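
The discrete definition translates almost directly into code. The sketch below is one possible implementation, with the hypothetical name `kl_divergence` and made-up example pmfs. It also adopts the common convention, not stated in the notes, that the divergence is $+\infty$ when $P$ puts mass on a point where $q(x) = 0$.

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) for pmfs given as dicts; sums over the support of P only."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # x is outside the support of P and contributes nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # assumed convention: P has mass where Q has none
        total += px * math.log(px / qx)
    return total

# Hypothetical pmfs on E = {0, 1}
p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}

print(kl_divergence(p, q))  # about 0.51
print(kl_divergence(q, p))  # about 0.37, so the order of the arguments matters
print(kl_divergence(p, p))  # 0.0
```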

## Properties
$\text{KL}(\mathbf{P}, \mathbf{Q}) \neq \text{KL}(\mathbf{Q}, \mathbf{P})$ in general (asymmetric)  
$\text{KL}(\mathbf{P}, \mathbf{Q}) \geq 0$ (nonnegative)  
$\text{KL}(\mathbf{P}, \mathbf{Q}) = 0$ if and only if $P$ and $Q$ are the same distribution (definite)  
$\text{KL}(\mathbf{P}, \mathbf{V}) \nleq \text{KL}(\mathbf{P}, \mathbf{Q}) + \text{KL}(\mathbf{Q}, \mathbf{V})$ in general (no triangle inequality)  

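Because it is asymmetric and fails the triangle inequality, KL is not a distance; it is called a divergence.  
Asymmetry is the key to our ability to estimate it.  
$\theta^*$ is the unique minimizer of $\theta \mapsto \text{KL}(P_{\theta^*}, P_\theta)$, provided the model is identifiable, since the divergence is zero only when $P_\theta = P_{\theta^*}$.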
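
The following sketch checks both claims numerically for a Bernoulli model, chosen here only for illustration. The closed form in `kl_bernoulli` (a hypothetical helper) is the discrete definition specialized to $E = \{0, 1\}$, and the value of `theta_star` and the grid are assumptions.

```python
import math

def kl_bernoulli(a, b):
    """KL(Ber(a), Ber(b)) for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

theta_star = 0.3  # assumed true parameter for this illustration
grid = [k / 100 for k in range(1, 100)]  # candidate values 0.01, ..., 0.99

# Minimize theta -> KL(P_theta_star, P_theta) over the grid
best = min(grid, key=lambda t: kl_bernoulli(theta_star, t))
print(best)  # the grid point equal to theta_star

# Asymmetry: swapping the arguments changes the value
print(kl_bernoulli(0.3, 0.6), kl_bernoulli(0.6, 0.3))
```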

## Estimating KL Divergence
$$
\begin{align} 
\text{KL}(P_{\theta ^*}, P_{\theta }) &= \mathbb {E}_{\theta ^*}\left[\ln \left(\frac{p_{\theta ^*}(X)}{p_{\theta}(X)}\right) \right]=\sum _{x \in E} p_{\theta ^*}(x) \ln p_{\theta ^*}(x) - \sum _{x \in E} p_{\theta ^*}(x) \ln p_{\theta }(x)\\
&= \mathbb {E}_{\theta ^*}[\ln p_{\theta ^*}(X) ] - \mathbb {E}_{\theta ^*}[\ln p_\theta (X)]
\end{align}
$$
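Since the first term does not depend on $\theta$, the function $\theta \mapsto \text{KL}(P_{\theta^*}, P_\theta)$ has the form
$$\text{constant} - \mathbb {E}_{\theta ^*}[\ln p_\theta (X)]$$
By the law of large numbers, if $X_1, \dots, X_n$ are i.i.d. with distribution $P_{\theta^*}$, then $\displaystyle \frac{1}{n} \sum _{i = 1}^ n \ln (p_\theta (X_ i)) \to \mathbb {E}_{\theta ^*}[\ln p_\theta (X)]$ in probability. Replacing the expectation by this sample average gives the estimator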
$$\hat{\text {KL}}(P_{\theta ^*}, P_\theta ) := \mathbb {E}_{\theta ^*}[\ln p_{\theta ^*}(X) ] - \displaystyle \frac{1}{n} \sum _{i = 1}^ n \ln (p_\theta (X_ i)).$$
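Therefore, although we cannot directly minimize $\text{KL}(P_{\theta ^*}, P_{\theta })$, which involves an expectation under the unknown $P_{\theta^*}$, we can find the $\theta$ that minimizes $\hat{\text {KL}}(P_{\theta ^*}, P_\theta )$: its first term is unknown but does not depend on $\theta$, and its second term is computable from the data.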
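
A small simulation can illustrate the law-of-large-numbers step. This is a sketch under stated assumptions: a Bernoulli model, a fixed seed, and arbitrary values for `theta_star`, `theta`, and `n`. The constant term is computed here only because the simulation knows `theta_star`; with real data it would be unavailable, which is why only the second term matters for minimization.

```python
import math
import random

def log_pmf(theta, x):
    """ln p_theta(x) for a Bernoulli(theta) observation x in {0, 1}."""
    return math.log(theta) if x == 1 else math.log(1 - theta)

random.seed(0)
theta_star, theta, n = 0.3, 0.6, 5000  # assumed values for illustration
sample = [1 if random.random() < theta_star else 0 for _ in range(n)]

# Sample average (1/n) * sum_i ln p_theta(X_i)
sample_avg = sum(log_pmf(theta, x) for x in sample) / n

# Exact expectation E_{theta*}[ln p_theta(X)], available only in a simulation
exact = theta_star * math.log(theta) + (1 - theta_star) * math.log(1 - theta)

# First term E_{theta*}[ln p_{theta*}(X)]: constant in theta, unknown in practice
const = theta_star * math.log(theta_star) + (1 - theta_star) * math.log(1 - theta_star)

kl_hat = const - sample_avg
kl_true = const - exact
print(sample_avg, exact)  # close to each other for large n
print(kl_hat, kl_true)
```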

## Maximum Likelihood Principle
$$\hat{\text {KL}}(P_{\theta ^*}, P_\theta ) := \mathbb {E}_{\theta ^*}[\ln p_{\theta ^*}(X) ] - \displaystyle \frac{1}{n} \sum _{i = 1}^ n \ln (p_\theta (X_ i)).$$
$$
\begin{align} 
\min _{\theta \in \Theta}\hat{\text {KL}}(P_{\theta ^*}, P_\theta ) &\iff \min _{\theta \in \Theta} -\displaystyle \frac{1}{n} \sum _{i = 1}^ n \ln (p_\theta (X_ i))\\
&\iff \max _{\theta \in \Theta } \displaystyle \frac{1}{n} \sum _{i = 1}^ n \ln (p_\theta (X_ i))\\
&\iff \max _{\theta \in \Theta } \displaystyle \ln \Bigg [\prod \limits_{i=1}^n p_\theta (X_ i)\Bigg]\\
&\iff \max _{\theta \in \Theta } \displaystyle \prod \limits_{i=1}^n p_\theta (X_ i)\\
\end{align}
$$
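
The chain of equivalences says that minimizing the estimated KL divergence amounts to maximizing the (log-)likelihood. As a hedged illustration, the sketch below maximizes the average log-likelihood of simulated Bernoulli data over a grid and compares the result with the sample mean, which is the known closed-form maximum likelihood estimator for the Bernoulli model. The model, seed, grid, and names such as `avg_log_likelihood` are assumptions made for this example.

```python
import math
import random

random.seed(1)
theta_star, n = 0.3, 2000  # theta_star is known only because the data are simulated
sample = [1 if random.random() < theta_star else 0 for _ in range(n)]
ones = sum(sample)

def avg_log_likelihood(theta):
    """(1/n) * sum_i ln p_theta(X_i) for Bernoulli(theta) data."""
    return (ones * math.log(theta) + (n - ones) * math.log(1 - theta)) / n

step = 0.001
grid = [k * step for k in range(1, 1000)]  # candidate values in (0, 1)

# Maximizing the average log-likelihood is the same as minimizing KL-hat
theta_hat = max(grid, key=avg_log_likelihood)
sample_mean = ones / n

print(theta_hat, sample_mean)  # agree up to the grid resolution
```

Working with the sum of logs rather than the product itself is also the numerically safer choice, since a product of thousands of probabilities underflows to zero in floating point.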

# Likelihood of a Discrete Distribution

# Likelihood of a Continuous Distribution