# Maximum Likelihood Estimation

## ***Vocabulary***

- **empirical data**
    - data gathered through empirical means (e.g., observations, measurements, experiments, simulations, etc.). this type of data allows ai models to learn from real-world examples and test against practical scenarios
- **realistic data**
    - data which mimics the complexity, variability, and noise found in the real world. this type of data should capture the imperfections of real-world processes
- **multimodality**
    - "multimodal" refers to receiving, processing, or outputting media of multiple forms (modalities) (e.g., audio, video)
- **stochastic**
    - systems or processes that are random or involve some degree of randomness. in a stochastic system, outcomes are not deterministic: they can vary even under the same conditions. you cannot predict the exact outcome, but you can describe its likelihood using probabilities
- **multivariate**
    - refers to something that uses multiple variables or features
- **argmax**
    - "argument of the maximum", the input value that produces the maximum output of a given function
- **second-derivative test**
    - Second derivative > 0: Local minimum.
    - Second derivative < 0: Local maximum.
    - Second derivative = 0: Test is inconclusive, and further analysis is needed.
- **parametric distribution**
    - a family of distributions that share the same mathematical form and have a finite number of parameters that, once known, describe the distribution completely. for example, the gaussian distribution uses $\sigma^2$ for the variance and $\mu$ for the mean, and the exponential distribution uses $\gamma$ for the decay rate. different parameter values produce different behaviours of the distribution.
- **closed-form solution**
    - an explicit mathematical expression which can be evaluated in a finite number of standard operations. it does not require iterative procedures or methods.
- **sum of squared differences**
- **asymptotically**
- **regularity**

# Lecture Notes #

## ***2.1.0 Introduction***

### **Intuition/Background/Example 1**

#### **Introduction**

Maximum likelihood estimation is a way to estimate the parameters of a given probabilistic distribution from observations.

#### **Problem Statement**

<br>
<center>
    <img width="60%" src="images/2.1.1.png" alt="Professor Notes" />
</center>
<br>

In this example, we can very intuitively estimate $\theta = \frac{4}{5}$. However, is there a theoretical and mathematical way to estimate this parameter that we can apply to more complex problems? **Maximum likelihood estimation does exactly that**.

#### **Basis for MLE**

<br>
<center>
    <img width="60%" src="images/2.1.2.png" alt="Professor Notes" />
</center>
<br>

Simplifying the problem, let's compare how likely the observed data is when $\theta = 0.1$ versus when $\theta = 0.9$. These values can be calculated with the product rule, since the draws from the distribution are independent. After calculating both, we can see that the likelihood for $\theta = 0.9$ is much higher than the likelihood for $\theta = 0.1$. This is the basis for estimating parameters using maximum likelihood estimation.

In practice, we can repeat this same process for every value between 0 and 1. By doing this, we create a function, the likelihood function, that evaluates how likely each parameter value is. Then, we can maximize that likelihood function to find the optimal estimate.

<br>
<center>
    <img width="60%" src="images/2.1.3.png" alt="Professor Notes" />
</center>
<br>

#### **Applying MLE**

In the case of the example problem, the likelihood function and maximum likelihood estimation would be as follows:

<br>
<center>
    <img width="60%" src="images/2.1.4.png" alt="Professor Notes" />
</center>
<br>

So, we end up with the optimization problem:

$$ \hat{\theta} = \underset{\theta}{argmax}\; \theta^4\;(1-\theta) $$

#### **Log-Likelihood**

There is a nice trick for solving this problem: because the log is monotonically increasing, we can maximize the log of the function instead. Thus, we maximize the log-likelihood function:

$$ \hat{\theta}  = \underset{\theta}{argmax}\; log(\theta^4\;(1-\theta)) $$
$$ = \underset{\theta}{argmax}\; \{4\;log\;\theta+log\;(1-\theta)\} $$
$$ \ell(\theta) \triangleq log\;L(\theta) $$

Then, we can calculate the gradient of this function (by taking the derivative) and find the point where the gradient is 0. This yields the maximum of the function, as long as the second-derivative test confirms it is a maximum. Setting the gradient to 0 for this problem:

$$\nabla \ell(\theta) = \frac{4}{\theta}-\frac{1}{1-\theta} = 0 \implies \hat{\theta} = \frac{4}{5} $$
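As a quick numerical check of this derivation, the sketch below evaluates the gradient and second derivative of $4\;log\;\theta+log\;(1-\theta)$ at $\theta = \frac{4}{5}$. The function names are illustrative choices, not part of the lecture.

```python
import math

def log_likelihood(theta):
    # l(theta) = 4 log(theta) + log(1 - theta)
    return 4 * math.log(theta) + math.log(1 - theta)

def gradient(theta):
    # dl/dtheta = 4/theta - 1/(1 - theta)
    return 4 / theta - 1 / (1 - theta)

def second_derivative(theta):
    # d^2 l/dtheta^2 = -4/theta^2 - 1/(1 - theta)^2, negative on all of (0, 1)
    return -4 / theta**2 - 1 / (1 - theta) ** 2

theta_hat = 4 / 5  # solves 4/theta = 1/(1 - theta)

print(gradient(theta_hat))            # approximately 0 (up to rounding)
print(second_derivative(theta_hat))   # negative, so theta_hat is a maximum
print(log_likelihood(0.9) > log_likelihood(0.1))  # 0.9 explains the data better than 0.1
```

Since the second derivative is negative everywhere on $(0,1)$, the stationary point is the only maximum on that interval.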

#### **Family of Problems MLE can Solve**

<br>
<center>
    <img width="60%" src="images/2.1.5.png" alt="Professor Notes" />
</center>
<br>

So we can define the likelihood function as:

$$ L(\theta) = P(x_1, \dots, x_n | \theta) $$

Which, since the observations are drawn independently, is a product of probabilities:

$$ \prod_{i=1}^n P(x_i|\theta) $$

And as we said, it's easier to work with logs when working with many products, so we will often prefer to use the log-likelihood function:

$$ \ell(\theta) = log\;L(\theta) = log(\prod_{i=1}^n P(x_i|\theta)) $$

**Thus, we can define the log-likelihood function as:**

$$ \ell(\theta) = \sum_{i=1}^n log P(x_i|\theta) $$

**We can then describe the maximum likelihood estimation as:** 

$$ \hat{\theta} = \underset{\theta}{argmax}\;\ell(\theta) = \underset{\theta}{argmax}\{\sum_{i=1}^n log P(x_i|\theta)\} $$

#### **Formally, the Steps of MLE**
1. Define the likelihood function
2. Calculate the log-likelihood function
3. Maximize the log-likelihood function
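The three steps can be sketched generically as follows. This is a minimal illustration, assuming a coin model with $P(x=1) = \theta$, a hypothetical data set, and a simple grid search in place of calculus. The helper names (`bernoulli_log_prob`, `mle_on_grid`) are made up for this example.

```python
import math

def bernoulli_log_prob(x, theta):
    # Step 1: model for one observation, P(x | theta) = theta^x * (1 - theta)^(1 - x)
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)

def log_likelihood(theta, data, log_prob):
    # Step 2: independent draws turn the product into a sum of logs
    return sum(log_prob(x, theta) for x in data)

def mle_on_grid(data, log_prob, candidates):
    # Step 3: keep the candidate with the largest log-likelihood
    return max(candidates, key=lambda t: log_likelihood(t, data, log_prob))

data = [1, 1, 1, 1, 0]                         # hypothetical observations
candidates = [k / 100 for k in range(1, 100)]  # 0.01, 0.02, ..., 0.99
theta_hat = mle_on_grid(data, bernoulli_log_prob, candidates)
print(theta_hat)
```

A grid search only works for low-dimensional parameters, but it makes the structure of the three steps explicit without relying on derivatives.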

### **Example 2: Biased Coin Revisited; Bernoulli Distribution**

<br>
<center>
    <img width="60%" src="images/2.1.6.png" alt="Professor Notes" />
</center>
<br>

This is another binary problem. Notice that the sigmoid function $\frac{exp(w)}{1+exp(w)}$ is being used. This transformation maps any real-valued number $w \in (-\infty, \infty)$ to a value $\theta \in (0,1)$.

**Step 1: Define Likelihood Function**

$$L(w) = P(x_1, \dots, x_n | w)$$
$$ = \prod_{i=1}^n P(x_i|w) $$

**Step 2: Calculate Log-Likelihood Function**

One problem is that the probability is defined case by case: if $x_i = 0$ we use one probability, and if $x_i = 1$ we use the other. Fortunately, we can write these probabilities in a unified form:

$$ Pr(x|w) = \frac{exp(xw)}{1+exp(w)} $$

Proof:

<br>
<center>
    <img width="60%" src="images/2.1.7.png" alt="Professor Notes" />
</center>
<br>

Plugging in that formula to get the expression for each $x$:

$$ log\;Pr(x|w) = log(\frac{exp(xw)}{1+exp(w)}) $$
$$ = xw-log(1+exp(w)) $$

Recall the formula for likelihood of $w$, and plug in the log-likelihood formula derived:


$$ \ell(w) = log(L(w)) = log(\prod_{i=1}^n P(x_i|w))$$
$$ = \sum_{i=1}^n [x_iw-log(1+exp(w))] $$

Simplifying:

$$ = (\sum_{i=1}^n x_i)w-n\;log(1+exp(w)) $$
$$ = n(\bar{x}w - log(1+exp(w))) $$

**Step 3: Optimize Log-Likelihood Function**

$$ \hat{w} = \underset{w}{argmax}\;\ell(w) = \underset{w}{argmax}\{n(\bar{x}w - log(1+exp(w)))\} $$

We will start by taking the gradient:

$$ \nabla \ell(w) = \frac{d}{dw}\; n(\bar{x}w - log(1+exp(w))) $$
$$ = n(\bar{x}-\frac{exp(w)}{1+exp(w)}) $$

And we want to find where this gradient equals 0.
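Setting $n(\bar{x}-\frac{exp(w)}{1+exp(w)})$ to 0 means the sigmoid of $\hat{w}$ must equal $\bar{x}$, which gives $\hat{w} = log\frac{\bar{x}}{1-\bar{x}}$ whenever $0 < \bar{x} < 1$. The sketch below checks this on a hypothetical data set. The data and function names are assumptions made for illustration.

```python
import math

def sigmoid(w):
    return math.exp(w) / (1 + math.exp(w))

def log_likelihood(w, data):
    # l(w) = n * (x_bar * w - log(1 + exp(w)))
    n = len(data)
    x_bar = sum(data) / n
    return n * (x_bar * w - math.log(1 + math.exp(w)))

def gradient(w, data):
    # dl/dw = n * (x_bar - sigmoid(w))
    n = len(data)
    x_bar = sum(data) / n
    return n * (x_bar - sigmoid(w))

data = [1, 1, 1, 1, 0]                   # hypothetical draws
x_bar = sum(data) / len(data)
w_hat = math.log(x_bar / (1 - x_bar))    # log-odds of the sample mean

print(w_hat, sigmoid(w_hat), gradient(w_hat, data))
```

Mapping $\hat{w}$ back through the sigmoid recovers $\hat{\theta} = \bar{x}$, the same estimate as in the first example.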

Here are some other equivalent ways to write that final optimization problem:

<br>
<center>
    <img width="60%" src="images/2.1.8.png" alt="Professor Notes" />
</center>
<br>

### **Example 3: Gaussian Distribution**

#### **Setup**
<br>
<center>
    <img width="60%" src="images/2.1.9.png" alt="Professor Notes" />
</center>
<br>

Here we have $x$ drawn from a Gaussian distribution, which is indexed by two parameters, $\mu$ and $\sigma$.

Since we already have the formula for $P(x|\theta)$, we will jump straight into step 2. The log-likelihood function can be defined as:

$$ \ell(\theta) = \sum_{i=1}^n [log\frac{1}{(2\pi)^{1/2}\sigma}-\frac{1}{2\sigma^2}(x_i-\mu)^2] $$

In this case we have two parameters, so $\theta = (\mu, \sigma)$.

Simplifying:

$$ \ell(\theta) = \frac{-n}{2}log(2\pi)-n\;log\;\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 $$

This is what we want to maximize.

#### **Optimizing mu**

To maximize over $\mu$, we only need to look at the terms that depend on $\mu$, since every other term is a constant with respect to $\mu$. The only such term is:

$$ -\frac{1}{2\sigma^2} \sum_{i=1}^n(x_i-\mu)^2 $$

Since $\frac{1}{2\sigma^2}$ is a positive constant with respect to $\mu$ (for a fixed $\sigma$), and the term is preceded by a negative sign, maximizing it is the same as solving the following minimization problem:

$$ \hat{\mu} = \underset{\mu}{argmin}\{\sum_{i=1}^n(x_i-\mu)^2\} $$

Intuitively, $x_i-\mu$ measures how far a data point is from the mean, and the best $\mu$ makes the total squared distance as small as possible.

Calculating the gradient, we get:

$$ \sum_{i=1}^n2(\mu-x_i)=0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^nx_i $$

As shown here:

<br>
<center>
    <img width="60%" src="images/2.1.10.png" alt="Professor Notes" />
</center>
<br>

#### **Optimizing sigma**

Only the second and third terms depend on $\sigma$, so our optimization problem turns into:

$$ \hat{\sigma} = \underset{\sigma}{argmax}\{-n\;log\;\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\hat{\mu})^2\} $$

We will call this objective $f(\sigma)$. Its gradient is:

$$ \nabla f(\sigma) = -\frac{n}{\sigma}+\frac{1}{\sigma^3}  \sum_{i=1}^n(x_i-\hat{\mu})^2 $$

Solving for this gradient set to 0:

$$ \implies \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\hat{\mu})^2 $$
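The two closed-form estimates can be computed directly. The sketch below uses a small hypothetical sample and compares the result with `statistics.pvariance`, which also divides by $n$. The sample values and function names are assumptions for illustration.

```python
import math
import statistics

def gaussian_mle(xs):
    # mu_hat is the sample mean; sigma2_hat divides by n (not n - 1)
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def log_likelihood(xs, mu, sigma):
    # l = -(n/2) log(2 pi) - n log(sigma) - (1 / (2 sigma^2)) * sum (x_i - mu)^2
    n = len(xs)
    return (-n / 2 * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2))

xs = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical observations
mu_hat, sigma2_hat = gaussian_mle(xs)
sigma_hat = math.sqrt(sigma2_hat)

print(mu_hat, sigma2_hat, statistics.pvariance(xs))
print(log_likelihood(xs, mu_hat, sigma_hat))
```

Perturbing either estimate slightly should lower the log-likelihood, which is a cheap way to sanity-check the derivation.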

#### **Conclusion**

Both optimizations turned out very nicely: $\hat{\mu}$ is the mean of the $x_i$'s, and $\hat{\sigma}^2$ is the empirical variance.

### **Example 4: Uniform Distribution**

<br>
<center>
    <img width="60%" src="images/2.1.11.png" alt="Professor Notes" />
</center>
<br>

This problem involves the uniform distribution, whose density function is shown below. It is a step function: any point in the green shaded area evaluates to $\frac{1}{\theta}$, and any point anywhere else evaluates to 0.

It is shown that both orange shaded areas can be used to estimate $\theta$ given the observations shown in red, but we want to keep $\theta$ as small as possible. We can use MLE to get the optimal $\theta$.

<br>
<center>
    <img width="60%" src="images/2.1.12.png" alt="Professor Notes" />
</center>
<br>

#### **Step 1: Defining**

We will start by rewriting the probability function in terms of $x$:

$$ P(x|\theta) = \frac{1}{\theta}1(x\in[0,\theta]) $$

#### **Step 2: Calculating**

Deriving the likelihood function:

$$ L(\theta) = \prod_{i=1}^n P(x_i|\theta) $$
$$ = \prod_{i=1}^n(\frac{1}{\theta})1(x_i\in[0,\theta]) $$
$$ = (\frac{1}{\theta})^n \prod_{i=1}^n 1(x_i\in[0,\theta]) $$ 

Because this is a product of indicator functions, it equals 1 only when every observation lies in $[0,\theta]$, so we can further simplify the expression to:

$$ L(\theta) = 
\begin{cases}  
(\frac{1}{\theta})^n & \text{if } x_i \in [0, \theta] \;\; \forall i \\ 
0 & \text{otherwise} 
\end{cases} $$

#### **Step 3: Maximizing**

This piecewise function is what we want to maximize. The likelihood is 0 unless $\theta$ encloses all of the observations, so $\theta$ must be at least as large as the largest $x_i$. Beyond that point, $(\frac{1}{\theta})^n$ decreases as $\theta$ grows, so the largest $x_i$ is the best choice for $\theta$.

Below is the likelihood function graphed, which illustrates that the largest $x_i$ is the best choice for $\theta$.

<br>
<center>
    <img width="60%" src="images/2.1.13.png" alt="Professor Notes" />
</center>
<br>

Formally:

$$ \hat{\theta} = max\{x_1, \dots, x_n\} $$
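The shape of this likelihood is easy to probe numerically. The sketch below uses a hypothetical sample and evaluates the piecewise likelihood just below, at, and above the largest observation. The data and function name are illustrative assumptions.

```python
def likelihood(theta, xs):
    # (1/theta)^n if every observation lies in [0, theta], otherwise 0
    if theta <= 0 or any(x < 0 or x > theta for x in xs):
        return 0.0
    return (1 / theta) ** len(xs)

xs = [0.7, 2.3, 1.1, 1.8]   # hypothetical observations
theta_hat = max(xs)

print(likelihood(theta_hat - 0.01, xs))  # 0.0: the largest point is excluded
print(likelihood(theta_hat, xs))         # largest attainable value
print(likelihood(theta_hat + 0.5, xs))   # positive but smaller
```

Note that the maximum sits at a discontinuity, so setting a gradient to 0 would not find it. This is why the argument here is made directly from the shape of the function.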

### **Example 4: Regression; Gaussian Distribution**

<br>
<center>
    <img width="60%" src="images/2.1.14.png" alt="Professor Notes" />
</center>
<br>

It turns out that MLE can be used for more complicated problems such as regression. In regression we have pairs of variables, and we want to understand the relationship between the feature (typically $x_i$) and the response (typically $y_i$). Basically, we want to fit a curve to predict $y$ given $x$.

#### **Step 1: Defining**

**Breaking down the given probability:**
It's pretty reasonable to assume that $y$ is generated from some distribution. A typical further assumption is that, conditioned on each $x_i$, $y_i$ follows a Gaussian distribution whose mean is given by a function $f(x_i,\theta)$ from some function class and whose variance is $\sigma^2$.

This graph outlines these principles visually, where $f(x,\theta)$ is the mean conditioned on $x$, and $\sigma^2$ is the variance, which determines how far the point tends to fall from that mean.

<br>
<center>
    <img width="60%" src="images/2.1.15.png" alt="Professor Notes" />
</center>
<br>

Instead of maximizing a single likelihood, we maximize the conditional likelihood.

$$ \ell(\theta, \sigma) = \sum_{i=1}^n log\;P(y_i|x_i, \theta) $$

#### **Step 2: Calculating**

Using the same formula derived in Example 3, but replacing $\mu$ with the mean function:

$$ \ell(\theta, \sigma) = \frac{-n}{2}log(2\pi)-n\;log\;\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-f(x_i,\theta))^2 $$

#### **Step 3: Maximizing**

$$ \underset{\theta}{max}\;\ell(\theta,\sigma) \implies \underset{\theta}{min}\sum_{i=1}^n(y_i-f(x_i,\theta))^2 $$

And for sigma, after finding $\hat{\theta}$:

$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i-f(x_i,\hat{\theta}))^2 $$
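To make the least-squares connection concrete, here is a minimal sketch that assumes a linear mean function $f(x,\theta) = \theta_0 + \theta_1 x$. The notes leave the function class open, so this choice, the data, and the helper names are all assumptions. For a line, the minimizer of the sum of squared differences has a well-known closed form.

```python
def fit_line(xs, ys):
    # Minimizes sum (y_i - (intercept + slope * x_i))^2 in closed form
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def squared_error(xs, ys, intercept, slope):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical features
ys = [1.1, 2.9, 5.2, 6.8, 9.0]   # hypothetical responses

intercept, slope = fit_line(xs, ys)
# MLE of the noise variance: mean squared residual at the fitted parameters
sigma2_hat = squared_error(xs, ys, intercept, slope) / len(xs)

print(intercept, slope, sigma2_hat)
```

The estimate of $\theta$ does not depend on $\sigma$ at all, which is why the two parameters can be estimated one after the other.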

### **Example 5: Logistic Regression**

<br>
<center>
    <img width="60%" src="images/2.1.16.png" alt="Professor Notes" />
</center>
<br>

While in regression we have pairs of variables and $y$ is a real number, in logistic regression $y$ is a binary variable, or label, conditioned on $x$. That is, logistic regression is a classification problem.

This problem is essentially an extension of the Bernoulli example we saw earlier, replacing $w$ with this function $f(x;\;\theta)$.

#### **Step 1: Defining**

Given the above probability, we can derive that:

$$ P(y=1|x,\theta) = \frac{exp(f(x,\theta))}{1+exp(f(x,\theta))} $$

$$ P(y=0|x,\theta) = \frac{1}{1+exp(f(x,\theta))} $$

#### **Step 2: Calculating**

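We do not need to derive the log-likelihood from scratch here. Because the probabilities above are written with the sigmoid function, the unified form from the Bernoulli example applies with $w$ replaced by $f(x,\theta)$:

$$ log\;P(y|x,\theta) = y\;f(x,\theta)-log(1+exp(f(x,\theta))) $$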

#### **Step 3: Maximizing**

$$ max \sum_{i=1}^n log\;P(y_i|x_i,\theta) $$
$$ = \sum_{i=1}^n(y_i\;f(x_i,\theta)-log(1+exp(f(x_i,\theta)))) $$
$$ \triangleq \ell(\theta) $$

In this case, it is difficult or impossible to derive a closed-form solution for the optimal $\theta$, so we rely on numerical methods instead. One of the most widely used algorithms for solving this problem is **gradient descent**.
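As a rough sketch of the numerical approach, the code below runs gradient ascent on $\ell(\theta)$ (equivalently, gradient descent on $-\ell(\theta)$). It assumes a linear function $f(x,\theta) = \theta_0 + \theta_1 x$, a fixed step size, a fixed iteration count, and a small hypothetical data set whose labels overlap so that a finite maximizer exists. None of these choices come from the lecture.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_likelihood(theta, xs, ys):
    # l(theta) = sum_i [ y_i * f_i - log(1 + exp(f_i)) ], with f_i = theta0 + theta1 * x_i
    total = 0.0
    for x, y in zip(xs, ys):
        f = theta[0] + theta[1] * x
        total += y * f - math.log(1 + math.exp(f))
    return total

def gradient(theta, xs, ys):
    # dl/dtheta = sum_i (y_i - sigmoid(f_i)) * (1, x_i)
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        residual = y - sigmoid(theta[0] + theta[1] * x)
        g0 += residual
        g1 += residual * x
    return g0, g1

def fit(xs, ys, step=0.1, iters=5000):
    theta = [0.0, 0.0]
    for _ in range(iters):
        g0, g1 = gradient(theta, xs, ys)
        theta[0] += step * g0   # move uphill on the log-likelihood
        theta[1] += step * g1
    return theta

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]   # hypothetical features
ys = [0, 0, 0, 1, 0, 1, 1, 1]                       # hypothetical labels (not separable)

theta_hat = fit(xs, ys)
print(theta_hat, gradient(theta_hat, xs, ys))
```

In practice the step size usually needs tuning, and if the classes are perfectly separable the parameters grow without bound, so this plain loop is only a starting point.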

### **Conclusion**

<br>
<center>
    <img width="60%" src="images/2.1.17.png" alt="Professor Notes" />
</center>
<br>

The emphasis here is that the MLE estimator is a random variable, whose quality is evaluated using bias, variance, and MSE.

## ***2.1.1 Theoretical Properties***

### **Introduction**

<br>
<center>
    <img width="60%" src="images/2.1.18.png" alt="Professor Notes" />
</center>
<br>

The first thing we want to highlight is that the MLE estimator is a **random variable**. That is because the MLE estimator is a function of $x_1, \dots, x_n$, which are drawn iid from a distribution. That is, each $x_i$ is a random variable. Since $\hat{\theta}$ is a function of the data, it's also a random variable. Because it is a random variable, we must understand the statistical properties of $\hat{\theta}$.

This section will define the bias, variance, and mean squared error of the MLE estimator and the relation between them. It will also define unbiased and consistent estimators.

### **Bias**

#### **Definition:** 
The bias is simply the difference between the expectation of the MLE and the true parameter.

$$ Bias(\hat{\theta}) = \mathbf{E}_{\theta^*}[\hat{\theta}(x_1, \dots, x_n)] - \theta^* $$

Where $\theta^*$ is the true parameter. This can be written more explicitly using the definition of the expected value:

$$ \int \hat{\theta}(x_1, \dots, x_n) \prod_{i=1}^nP(x_i|\theta^*)\;dx_1 \cdots dx_n - \theta^* $$

This is a fairly complicated formula and generally does not have a closed-form solution in complicated cases.

### **Variance**

#### **Definition:** 
The variance represents the fluctuation around the mean. 

$$ Var(\hat{\theta}) = \mathbf{E}_{\theta^*}[(\hat{\theta}(x_1, \dots, x_n)-\mathbf{E}_{\theta^*}(\hat{\theta}(x_1, \dots, x_n)))^2] $$

### **Mean Squared Error**

#### **Definition:** 
The MSE measures the quality of the estimator produced by MLE.

$$ MSE(\hat{\theta}) = \mathbf{E}_{\theta^*}[(\hat{\theta}(x_1, \dots, x_n) - \theta^*)^2] $$

Note that the MSE takes the expected value of the squared difference between the estimated parameter and the true parameter. In this way **MSE is a very direct measure of the quality of the MLE estimator**. We want this value to be as small as possible.

### **Relating Bias, Variance, and MSE; the Bias-Variance Decomposition**

#### **Bias-Variance Decomposition Equation**

$$ MSE(\hat{\theta}) = (Bias(\hat{\theta}))^2 + Var(\hat{\theta}) $$

#### **Proof**

Note: We are dropping the dependency of $\hat{\theta}$ on $x_1$ through $x_n$ in order to simplify this notation for the proof. Starting with the definition of MSE:

$$ MSE(\hat{\theta}) = \mathbf{E}[(\hat{\theta}-\theta^*)^2] $$

Adding and subtracting the same term, $\mathbf{E}(\hat{\theta})$, inside the square:

$$ = \mathbf{E}[(\hat{\theta}-\mathbf{E}(\hat{\theta}) + \mathbf{E}(\hat{\theta})-\theta^*)^2] $$

Expanding out the square:

$$ = \mathbf{E}[(\hat{\theta}-\mathbf{E}(\hat{\theta}))^2 + (\mathbf{E}(\hat{\theta})-\theta^*)^2 + 2(\hat{\theta}-\mathbf{E}(\hat{\theta}))(\mathbf{E}(\hat{\theta})-\theta^*) ] $$

Showing that this expression is equivalent to the initial equation:

<br>
<center>
    <img width="60%" src="images/2.1.19.png" alt="Professor Notes" />
</center>
<br>

Proving that the final term equals 0:

<br>
<center>
    <img width="60%" src="images/2.1.20.png" alt="Professor Notes" />
</center>
<br>

#### **Importance**

This equation is critical, because oftentimes there will be a trade-off between bias and variance when performing MLE. By trading off these terms, we can find a sweet spot where their sum is minimized, optimizing the overall performance.
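The decomposition can also be checked by simulation. The sketch below repeatedly draws Gaussian samples, computes the MLE of the variance from Example 3 on each, and compares the simulated MSE with squared bias plus variance. The true parameters, sample size, trial count, and seed are arbitrary assumptions.

```python
import random

random.seed(0)
true_mu, true_sigma2 = 0.0, 4.0
n, trials = 10, 5000

estimates = []
for _ in range(trials):
    xs = [random.gauss(true_mu, true_sigma2 ** 0.5) for _ in range(n)]
    mu_hat = sum(xs) / n
    estimates.append(sum((x - mu_hat) ** 2 for x in xs) / n)   # MLE of sigma^2

mean_estimate = sum(estimates) / trials
bias = mean_estimate - true_sigma2
variance = sum((e - mean_estimate) ** 2 for e in estimates) / trials
mse = sum((e - true_sigma2) ** 2 for e in estimates) / trials

print(bias, variance, mse, bias**2 + variance)
```

For these simulated moments the identity holds up to floating-point rounding rather than only approximately, because the same algebra as in the proof applies to averages over the simulated estimates.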

### **Unbiased and Consistent Estimators**

#### **Unbiased Estimator Definition**
If an estimator has a bias of 0 ($Bias(\hat{\theta}) = 0$), then we call $\hat{\theta}$ an unbiased estimator.

#### **Consistent Estimator Definition**
If the $MSE(\hat{\theta})$ moves toward 0 as the amount of data, $n$, moves to $\infty$, then $\hat{\theta}$ is consistent.

#### **Relation to One Another**

It is important to note that these are two very different properties, and the presence of one does not indicate the presence of the other. Specifically, an unbiased estimator may still have variance that does not shrink as $n$ grows, in which case the MSE will not converge to 0 even with infinite data. Conversely, a consistent estimator may still have some bias for a finite amount of data.

#### **Asymptotic Unbiasedness**
This is the case when the bias goes to 0 as $n$ goes to $\infty$. Having a consistent estimator **does** imply asymptotic unbiasedness, but it does not imply unbiasedness for finite $n$.

<br>

---

#### **Example**

<br>
<center>
    <img width="60%" src="images/2.1.19.png" alt="Professor Notes" />
</center>
<br>

We can see that this is a Gaussian distribution problem. (See Example 3 from 2.1.0)

Looking at $\hat{\mu}$ first, checking for unbiasedness:

$$ Bias(\hat{\mu}) = \mathbf{E}[\hat{\mu}]-\mu $$ 
$$ = \mathbf{E}[\frac{1}{n}\sum_{i=1}^nx_i]-\mu $$ 
$$ = \frac{1}{n}\sum_{i=1}^n\mathbf{E}[x_i]-\mu $$ 
$$ = \frac{1}{n} \sum_{i=1}^n\mu - \mu $$
$$ = 0 $$

Now checking variance:

$$ Var(\hat{\mu}) = \mathbf{E}[(\hat{\mu}-\mu)^2] $$
$$ = \mathbf{E}[(\frac{1}{n}\sum_{i=1}^nx_i-\mu)^2] $$
$$ = \mathbf{E}[\frac{1}{n^2}(\sum_{i=1}^n(x_i-\mu)^2+\sum_{i\ne j}(x_i-\mu)(x_j-\mu))] $$
$$ = \frac{1}{n^2}(\sum_{i=1}^n\mathbf{E}[(x_i-\mu)^2] + \sum_{i\ne j}\mathbf{E}[( x_i-\mu)(x_j-\mu)]) $$
$$ = \frac{1}{n^2}(n\sigma^2 + 0) = \frac{1}{n}\sigma^2 $$

The cross terms vanish because the $x_i$ are independent, so $\mathbf{E}[(x_i-\mu)(x_j-\mu)] = 0$ for $i \ne j$.

Now that we have both of those, we can check if the estimator is consistent:

$$ MSE(\hat{\mu}) = (Bias(\hat{\mu}))^2 + Var(\hat{\mu}) $$
$$ = \frac{\sigma^2}{n} $$

Thus, as $n \to \infty$, $MSE(\hat{\mu}) \to 0 \implies$ consistent.

We can say that this estimator, $\hat{\mu}$, is unbiased and consistent. It can be shown (but we will not prove it here) that $\hat{\sigma}^2$ is biased, but is asymptotically unbiased and consistent:

$$ Bias(\hat{\sigma}^2) \to 0 \;\text{ as }\; n \to \infty $$
$$ Var(\hat{\sigma}^2) \to 0 \;\text{ as }\; n \to \infty $$
$$ MSE(\hat{\sigma}^2) \to 0 \;\text{ as }\; n \to \infty $$
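These limits can be illustrated with a small simulation. The sketch below estimates $MSE(\hat{\mu})$ and $Bias(\hat{\sigma}^2)$ for a few sample sizes. For a Gaussian the exact values are $\frac{\sigma^2}{n}$ and $-\frac{\sigma^2}{n}$ respectively, the latter because $\mathbf{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$. The true parameters, sample sizes, trial count, and seed are assumptions picked for illustration.

```python
import random

random.seed(1)
mu, sigma = 1.0, 2.0
trials = 2000

def simulate(n):
    # Returns (simulated MSE of mu_hat, simulated bias of sigma2_hat) for sample size n
    mu_hats, sigma2_hats = [], []
    for _ in range(trials):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        m = sum(xs) / n
        mu_hats.append(m)
        sigma2_hats.append(sum((x - m) ** 2 for x in xs) / n)
    mse_mu = sum((m - mu) ** 2 for m in mu_hats) / trials
    bias_sigma2 = sum(sigma2_hats) / trials - sigma**2
    return mse_mu, bias_sigma2

results = {n: simulate(n) for n in (5, 50, 200)}
for n, (mse_mu, bias_sigma2) in results.items():
    print(n, mse_mu, sigma**2 / n, bias_sigma2, -sigma**2 / n)
```

Both quantities shrink roughly in proportion to $\frac{1}{n}$, which matches the claim that $\hat{\mu}$ is consistent and $\hat{\sigma}^2$ is asymptotically unbiased.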

### **Kullback-Leibler Divergence**

#### **Why MLE?**

It turns out that in most cases an MLE estimator can be shown to be consistent (save for unusual cases where regularity conditions do not hold).

***Question: Why is MLE consistent in most cases?*** \
It turns out that MLE can be viewed as minimizing a notion of difference measured by the KL divergence. The KL divergence measures the difference between two distributions, so MLE is minimizing the difference between the data distribution and the model distribution.

#### **Understanding KL Divergence**

<br>
<center>
    <img width="60%" src="images/2.1.22.png" alt="Professor Notes" />
</center>
<br>

Above is the definition of the KL divergence. We can see that the formula explicitly written for a discrete $x$ is a summation, and for a continuous $x$ is an integration.

Another thing to note is that this equation is not symmetric:

$$ KL(q||p) \ne KL(p||q) $$

However, the KL divergence is **always** non-negative, and the KL divergence equals 0 if and only if $p = q$.
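These properties can be checked on a small discrete example. The sketch below assumes the common convention $KL(q||p) = \sum_x q(x)\;log\frac{q(x)}{p(x)}$. The two distributions are made-up values for illustration.

```python
import math

def kl_divergence(q, p):
    # KL(q || p) = sum_x q(x) * log(q(x) / p(x)); terms with q(x) = 0 contribute 0
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.5, 0.3, 0.2]   # hypothetical distribution over three outcomes
p = [0.2, 0.5, 0.3]   # a second hypothetical distribution

print(kl_divergence(q, p))   # positive
print(kl_divergence(p, q))   # positive, but a different value: not symmetric
print(kl_divergence(q, q))   # 0.0 when the two distributions are identical
```

This version assumes $p(x) > 0$ wherever $q(x) > 0$. Otherwise the divergence is infinite and the division would fail.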

#### **Jensen's Inequality**

If $f(x)$ is convex, then we can show that $\mathbf{E}_q[f(x)] \ge  f(\mathbf{E}_q[x])$: the expected value of $f(x)$ is always greater than or equal to $f$ evaluated at the expected value of $x$.
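A tiny numerical check of the inequality, using a hypothetical discrete distribution and two convex functions, $x^2$ and $-log\;x$. The values, probabilities, and helper name are assumptions for this example.

```python
import math

values = [1.0, 2.0, 6.0]   # hypothetical outcomes of x
probs = [0.5, 0.3, 0.2]    # their probabilities under q

def expectation(g):
    # E_q[g(x)] for the discrete distribution above
    return sum(p * g(x) for x, p in zip(values, probs))

mean_x = expectation(lambda x: x)

square = lambda x: x ** 2
neg_log = lambda x: -math.log(x)

print(expectation(square), square(mean_x))     # E[f(x)] versus f(E[x]) for f(x) = x^2
print(expectation(neg_log), neg_log(mean_x))   # same comparison for f(x) = -log(x)
```

In both cases the first number is at least as large as the second, as the inequality requires.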

<br>
<center>
    <img width="60%" src="images/2.1.23.png" alt="Professor Notes" />
</center>
<br>

#### **Proving Property 1 of KL Divergence**

<br>
<center>
    <img width="60%" src="images/2.1.24.png" alt="Professor Notes" />
</center>
<br>

#### **Proving Property 2 of KL Divergence**

<br>
<center>
    <img width="60%" src="images/2.1.25.png" alt="Professor Notes" />
</center>
<br>

#### **Connection**

As it turns out, minimizing the KL Divergence IS maximizing log-likelihood. Thus, it is MLE.
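One way to see this connection numerically is with a coin model. The sketch below assumes the divergence is taken from the empirical data distribution $q$ to the model $p_\theta$, that is $KL(q||p_\theta)$, and uses a hypothetical data set and a grid of candidate values. In that setting $KL(q||p_\theta) + \frac{1}{n}\ell(\theta)$ does not depend on $\theta$, so the same $\theta$ minimizes one and maximizes the other.

```python
import math

data = [1, 1, 1, 1, 0]               # hypothetical draws
n = len(data)
x_bar = sum(data) / n
empirical = [1 - x_bar, x_bar]       # q(0), q(1): observed frequencies

def model(theta):
    return [1 - theta, theta]        # p_theta(0), p_theta(1)

def kl_divergence(q, p):
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def log_likelihood(theta):
    return sum(math.log(model(theta)[x]) for x in data)

candidates = [k / 100 for k in range(1, 100)]
theta_min_kl = min(candidates, key=lambda t: kl_divergence(empirical, model(t)))
theta_max_ll = max(candidates, key=log_likelihood)

print(theta_min_kl, theta_max_ll)
```

The constant is $\sum_x q(x)\;log\;q(x)$, which involves only the data, so it plays no role in the optimization over $\theta$.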

<br>
<center>
    <img width="60%" src="images/2.1.26.png" alt="Professor Notes" />
</center>
<br>

# Personal Notes #