# MLE: Cross Entropy and KL Divergence Minimization

## Maximum Likelihood Estimation (MLE)

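Given a dataset $\{x_1, x_2, \dots, x_n\}$ of independent and identically distributed (i.i.d.) samples drawn from an unknown distribution, the goal of MLE is to estimate the parameters $\theta$ of a probability distribution $p(x|\theta)$ such that the likelihood of the observed data is maximized.

The likelihood function is the joint probability of the observed data, viewed as a function of $\theta$. Because the samples are independent, it factorizes into a product: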

$$
L(\theta) = p(x_1, x_2, \dots, x_n | \theta) = \prod_{i=1}^{n} p(x_i | \theta)
$$

The **log-likelihood** is then:

$$
\log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)
$$

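Because the logarithm is strictly increasing, maximizing the log-likelihood gives the same $\theta$ as maximizing the likelihood itself. MLE therefore finds $\theta$ by maximizing the log-likelihood: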

$$
\hat{\theta} = \arg\max_{\theta} \log L(\theta)
$$
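
As a quick illustration of the definition above, here is a minimal sketch that fits a Gaussian by MLE. The Gaussian model, the synthetic data, the grid range, and the names `log_likelihood`, `mu_hat`, `sigma_hat`, and `mu_grid_best` are assumptions made for this example and do not come from the notebook. It compares the closed-form Gaussian MLE (sample mean and biased sample standard deviation) with a brute-force search over the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1,000 i.i.d. draws from N(2.0, 1.5^2).
x = rng.normal(loc=2.0, scale=1.5, size=1_000)


def log_likelihood(mu, sigma, x):
    """Sum of log N(x_i | mu, sigma^2) over the dataset."""
    return np.sum(
        -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    )


# Closed-form Gaussian MLE: sample mean and biased (ddof=0) standard deviation.
mu_hat = x.mean()
sigma_hat = x.std()

# Brute-force check: maximize the log-likelihood over a grid of mu values,
# holding sigma fixed at its MLE.
mu_grid = np.linspace(0.0, 4.0, 401)
ll = np.array([log_likelihood(mu, sigma_hat, x) for mu in mu_grid])
mu_grid_best = mu_grid[np.argmax(ll)]

print("closed-form mu_hat:", mu_hat)
print("grid-search mu:    ", mu_grid_best)
```

The two estimates should agree up to the grid spacing of 0.01; in practice a gradient-based optimizer would replace the grid search.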


## Minimizing CE Loss & KL Divergence against the data density estimate

In this notebook, we will show that minimizing the Cross Entropy (CE) loss and the Kullback-Leibler (KL) Divergence between the data distribution and the model distribution are equivalent to maximizing the log-likelihood of the data.

Let's consider a dataset $\{x_1, x_2, \dots, x_n\}$ drawn from an unknown distribution $p_{\text{data}}(x)$.

We want to estimate the parameters $\theta$ of a model distribution $p_{\text{model}}(x|\theta)$ such that the model distribution is as close as possible to the data distribution.

The data density estimate $\hat{p}_{\text{data}}(x)$, defined below, is a crude estimate of the true data distribution $p_{\text{data}}(x)$. The model distribution is the distribution we are trying to learn.

For the dataset, the data density estimate is:

$$
\hat{p}_{\text{data}}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)
$$

where $\delta(x)$ is the Dirac delta function. This estimate places probability mass $\frac{1}{n}$ on each observed sample.

The CE loss between the data density estimate and the model distribution is:

$$
\text{CE}(\theta) = - \mathbb{E}_{x \sim \hat{p}_{\text{data}}(x)} \left[ \log p_{\text{model}}(x|\theta) \right]
$$

For fixed $\hat{p}_{\text{data}}(x)$, the CE loss is minimized when the model distribution is as close as possible to the data density estimate, as measured by the KL Divergence.

Proof:

$$
\begin{aligned}
\text{CE}(\theta) &= - \mathbb{E}_{x \sim \hat{p}_{\text{data}}(x)} \left[ \log p_{\text{model}}(x|\theta) \right] \\
&= - \int \hat{p}_{\text{data}}(x) \log p_{\text{model}}(x|\theta) dx \\
&= - \int \hat{p}_{\text{data}}(x) \log \left( \frac{p_{\text{model}}(x|\theta)}{\hat{p}_{\text{data}}(x)} \hat{p}_{\text{data}}(x) \right) dx \\
&= - \int \hat{p}_{\text{data}}(x) \log \left( \frac{p_{\text{model}}(x|\theta)}{\hat{p}_{\text{data}}(x)} \right) dx - \int \hat{p}_{\text{data}}(x) \log \hat{p}_{\text{data}}(x) dx \\
&= \int \hat{p}_{\text{data}}(x) \log \left( \frac{\hat{p}_{\text{data}}(x)}{p_{\text{model}}(x|\theta)} \right) dx + \text{H}(\hat{p}_{\text{data}}(x)) \\
&= \text{KL}(\hat{p}_{\text{data}}(x) || p_{\text{model}}(x|\theta)) + \text{H}(\hat{p}_{\text{data}}(x))
\end{aligned}
$$

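where $\text{H}(\hat{p}_{\text{data}}(x)) = - \int \hat{p}_{\text{data}}(x) \log \hat{p}_{\text{data}}(x) dx$ is the entropy of the data density estimate. For continuous $x$ the delta-based estimate makes this entropy a formal quantity, while for discrete data it is an ordinary finite entropy; in either case it does not depend on $\theta$.

The CE loss is therefore the sum of the KL Divergence between the data density estimate and the model distribution and the entropy of the data density estimate.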
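
The decomposition derived above can be checked numerically in a discrete setting, where sums replace the integrals. The sketch below is only an illustration: the three-outcome distribution `p_data`, the one-parameter softmax family `model`, and the grid `thetas` are hypothetical choices, not part of the notebook. It confirms that the gap between CE and KL is the entropy for every $\theta$, so both are minimized at the same $\theta$.

```python
import numpy as np

# Hypothetical data distribution over three outcomes (all probabilities > 0).
p_data = np.array([0.5, 0.3, 0.2])


def model(theta):
    """Hypothetical one-parameter model: softmax of theta * [0, 1, 2]."""
    logits = theta * np.arange(3)
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return weights / weights.sum()


def cross_entropy(p, q):
    return -np.sum(p * np.log(q))


def entropy(p):
    return -np.sum(p * np.log(p))


def kl(p, q):
    return np.sum(p * np.log(p / q))


thetas = np.linspace(-2.0, 2.0, 401)
ce = np.array([cross_entropy(p_data, model(t)) for t in thetas])
kld = np.array([kl(p_data, model(t)) for t in thetas])

# CE - KL should equal the entropy of p_data for every theta.
gap = ce - kld
print("entropy of p_data:   ", entropy(p_data))
print("min and max of gap:  ", gap.min(), gap.max())
print("argmin CE, argmin KL:", thetas[np.argmin(ce)], thetas[np.argmin(kld)])
```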

If we substitute the data density estimate $\hat{p}_{\text{data}}(x)$ with the true data distribution $p_{\text{data}}(x)$, the same steps give:

$$
\text{CE}(\theta) = \text{KL}(p_{\text{data}}(x) || p_{\text{model}}(x|\theta)) + \text{H}(p_{\text{data}}(x))
$$

The entropy term does not depend on $\theta$, so the CE loss is minimized exactly when the KL Divergence between the data distribution and the model distribution is minimized.

Going back to the data density estimate $\hat{p}_{\text{data}}(x)$ and using the sifting property of the Dirac delta function, $\int \delta(x - x_i) f(x) dx = f(x_i)$, the CE loss becomes:

$$
\text{CE}(\theta) = - \frac{1}{n} \sum_{i=1}^{n} \log p_{\text{model}}(x_i|\theta)
$$

because $\hat{p}_{\text{data}}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)$, which gives:

$$
\begin{aligned}
\text{CE}(\theta) &= - \int \hat{p}_{\text{data}}(x) \log p_{\text{model}}(x|\theta) dx \\
&= - \int \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i) \log p_{\text{model}}(x|\theta) dx \\
&= - \frac{1}{n} \sum_{i=1}^{n} \log p_{\text{model}}(x_i|\theta)
\end{aligned}
$$


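Comparing with the log-likelihood, $\text{CE}(\theta) = -\frac{1}{n} \log L(\theta)$. Since $n$ is a positive constant, minimizing the CE loss is the same as maximizing the log-likelihood of the data, $\log L(\theta)$.

At the minimum of the CE loss, $\hat{\theta}$, the value $\text{CE}(\hat{\theta})$ equals $-\frac{1}{n} \log L(\hat{\theta})$ (the maximized log-likelihood up to the constant factor $-\frac{1}{n}$), and the model distribution is as close as possible to the data density estimate.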
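
To make the link between the CE loss and the log-likelihood concrete, here is a small sketch with a Bernoulli model. The synthetic coin-flip data, the grid of $\theta$ values, and the function names `log_likelihood` and `ce_loss` are assumptions chosen for illustration only. It evaluates both quantities on the same grid and checks that the CE loss is the log-likelihood scaled by $-\frac{1}{n}$, so the same $\theta$ optimizes both.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 500 coin flips with true success probability 0.3.
x = rng.binomial(n=1, p=0.3, size=500)
n = x.size


def log_likelihood(theta, x):
    """log L(theta) = sum_i log p_model(x_i | theta) for a Bernoulli model."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))


def ce_loss(theta, x):
    """CE against the data density estimate: mean negative log-probability."""
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))


thetas = np.linspace(0.01, 0.99, 99)
ll = np.array([log_likelihood(t, x) for t in thetas])
ce = np.array([ce_loss(t, x) for t in thetas])

theta_ce = thetas[np.argmin(ce)]  # minimizer of the CE loss on the grid
theta_ml = thetas[np.argmax(ll)]  # maximizer of the log-likelihood on the grid

print("argmin CE:     ", theta_ce)
print("argmax log L:  ", theta_ml)
print("sample mean:   ", x.mean())  # closed-form Bernoulli MLE
```

Both grid optimizers should land within one grid step of the sample mean, which is the closed-form Bernoulli MLE.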

