# Maximum Likelihood Estimation
---

**Bernoulli example**

- Suppose that we know that the following numbers were simulated using a Bernoulli distribution: 0 0 0 1 1 1 0 1 1 1
    
- We can denote them by $y_{1}, y_{2}, \dots, y_{10}$

- Recall that the probability mass function (pmf) of a Bernoulli random variable is $f(y ; p)=p^{y}(1-p)^{1-y}$, where $y \in\{0,1\}$. The probability of 1 is $p$, while the probability of 0 is $1-p$

- We want to figure out the $p$ that was used to generate those numbers; we call our estimate $\hat{p}$

- The probability of the first number is given by $p^{y_{1}}(1-p)^{1-y_{1}}$, the probability of the second by $p^{y_{2}}(1-p)^{1-y_{2}}$, and so on

- If we assume that the numbers above are independent, the joint probability of seeing all 10 numbers will be given by the multiplication of the individual probabilities

- We use the product symbol $\prod$. For example, $\prod_{i=1}^{2} x_{i}=x_{1} * x_{2}$

- So we can write the joint probability or the likelihood (L) of seeing those 10 numbers as:

$$L(p)=\prod_{i=1}^{10} p^{y_{i}}(1-p)^{1-y_{i}}$$

- Remember that we are trying to find the $p$ that was used to generate the 10 numbers. In other words, we want to find the $p$ that maximizes the likelihood function $L(p)$ (we write $\hat{p}$ for the maximizing value)

- Said another way, we want to find the $\hat{p}$ that makes the joint likelihood of seeing those numbers as high as possible

- Sounds like calculus... We can take the derivative of $L(p)$ with respect to $p$ and set it to zero to find the optimal $\hat{p}$

- We take the log to simplify taking the derivative. Because the log is a strictly increasing (monotonic) transformation, it does not change the optimal $\hat{p}$ value

- We will use several properties of the log, in particular:

$$\log \left(x^{a} y^{b}\right)=\log \left(x^{a}\right)+\log \left(y^{b}\right)=a * \log (x)+b * \log (y)$$

- The advantage of taking the log is that the multiplication becomes a summation. So now we have:

$$\ln L(p)=\sum_{i=1}^{n} y_{i} \ln (p)+\sum_{i=1}^{n}\left(1-y_{i}\right) \ln (1-p)$$

$$\ln L(p)=n \overline{y} \ln (p)+(n-n \overline{y}) \ln (1-p)$$

- This looks a lot easier; all we have to do is take $\frac{d \ln L(p)}{d p}$, set it to zero, and solve for $p$ (written for general $n$; here $n = 10$)

$$\frac{d \ln L(p)}{d p}=\frac{n \overline{y}}{p}-\frac{(n-n \overline{y})}{(1-p)}=0$$

- Multiplying through by $p(1-p)$ gives $n \overline{y}(1-p)=(n-n \overline{y}) p$, which simplifies to $n \overline{y}=n p$. So $\hat{p}=\overline{y}=\frac{1}{n} \sum_{i=1}^{n} y_{i}$

- So that's the maximum likelihood estimator (MLE). This is saying more or less the obvious: our best guess for the $p$ that generated the data is the proportion of 1s, in this case $\hat{p} = 0.6$
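The Bernoulli result above can be checked numerically. The following is a minimal sketch, not part of the original notes: it evaluates the log-likelihood on a grid of candidate $p$ values and compares the grid maximizer with the closed-form $\overline{y}$. The helper name `log_likelihood` and the grid resolution of 0.001 are illustrative choices.

```python
import math

# The ten simulated draws from the Bernoulli example.
y = [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]
n = len(y)

# Closed-form MLE: the proportion of 1s.
y_bar = sum(y) / n


def log_likelihood(p, data):
    """Bernoulli log-likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in data)


# Candidate values strictly inside (0, 1), since ln(0) is undefined.
# The step of 0.001 is an arbitrary illustrative choice.
grid = [k / 1000 for k in range(1, 1000)]

# Grid search: pick the candidate with the highest log-likelihood.
p_grid = max(grid, key=lambda p: log_likelihood(p, y))

print("closed form:", y_bar)
print("grid search:", p_grid)
```

A grid search is used here only because it makes the "find the $p$ that maximizes $L(p)$" idea concrete; in practice the closed-form answer is all that is needed for this model.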

**Normal example**

What if we do the same, but now we have numbers like these:

90.46561 105.1319 117.5445 102.7179 102.7788 107.6234 94.87266 95.48918 75.63886 87.40594

I tell you that they were simulated from a normal distribution with parameters $\mu$ and $\sigma^{2}$. The numbers are independent. Your job is to come up with the best guess of the two parameters.

- As before, we know the formula for the pdf of a normal, and because the observations are independent we multiply the densities:

$$L\left(\mu, \sigma^{2}\right)=\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(\frac{-\left(y_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right)$$

- Remember the rules of exponents. We can write the likelihood as:

$$L\left(\mu, \sigma^{2}\right)=\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)^{n} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right)$$

- As before, we can simplify the problem by taking the ln to help us take the derivatives. After taking the ln, we have:

$$\ln L\left(\mu, \sigma^{2}\right)=-\frac{n}{2} \ln \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}$$

- All we have left is to take the derivative with respect to our two unknowns, $\mu$ and $\sigma^{2}$ and set them to zero. Let’s start with $\mu$:

$$\frac{\partial \ln \left(L\left(\mu, \sigma^{2}\right)\right)}{\partial \mu}=2 \frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)=0$$

- The above expression reduces to

$$\sum_{i=1}^{n}\left(y_{i}-\hat{\mu}\right)=0$$

- Solving, we find that $\hat{\mu}=\frac{\sum_{i=1}^{n} y_{i}}{n}=\overline{y}$. In other words, our best guess is just the mean of the numbers

- We can also figure out the variance by taking the derivative with respect to $\sigma^{2}$. We will find that $\hat{\sigma}^{2}=\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{\mu}\right)^{2}}{n}$

- We know that this estimator is biased in finite samples. Dividing by $(n-1)$ instead of $n$ gives the usual unbiased sample variance

- We just figured out that the maximum likelihood estimates are the sample mean and the sample variance (with $n$ in the denominator)
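As a quick check of the normal example, the sketch below computes $\hat{\mu}$ and $\hat{\sigma}^{2}$ from the ten numbers listed above, compares the divide-by-$n$ and divide-by-$(n-1)$ variances, and confirms that moving away from the estimates lowers the log-likelihood. The function name `log_likelihood` and the perturbation sizes are illustrative assumptions, not part of the original notes.

```python
import math
import statistics

# The ten simulated draws from the normal example.
y = [90.46561, 105.1319, 117.5445, 102.7179, 102.7788,
     107.6234, 94.87266, 95.48918, 75.63886, 87.40594]
n = len(y)

# MLE of the mean: the sample average.
mu_hat = sum(y) / n

# Sum of squared deviations around the estimated mean.
ss = sum((yi - mu_hat) ** 2 for yi in y)

sigma2_mle = ss / n             # MLE: divides by n (biased)
sigma2_unbiased = ss / (n - 1)  # usual sample variance: divides by n - 1


def log_likelihood(mu, sigma2, data):
    """Normal log-likelihood: -(n/2) ln(2 pi sigma^2) - SS(mu) / (2 sigma^2)."""
    m = len(data)
    ss_mu = sum((yi - mu) ** 2 for yi in data)
    return -m / 2 * math.log(2 * math.pi * sigma2) - ss_mu / (2 * sigma2)


# Log-likelihood at the MLE and at a few nearby points (arbitrary offsets).
best = log_likelihood(mu_hat, sigma2_mle, y)
nearby = [
    log_likelihood(mu_hat + d_mu, sigma2_mle + d_s2, y)
    for d_mu in (-0.5, 0.0, 0.5)
    for d_s2 in (-5.0, 0.0, 5.0)
    if (d_mu, d_s2) != (0.0, 0.0)
]

print("mu_hat:", mu_hat)
print("sigma2 (divide by n):", sigma2_mle)
print("sigma2 (divide by n - 1):", sigma2_unbiased)
print("MLE beats all nearby points:", all(v < best for v in nearby))

# Cross-check against the standard library's population and sample variance.
print(statistics.pvariance(y), statistics.variance(y))
```

`statistics.pvariance` divides by $n$ and so matches the MLE, while `statistics.variance` divides by $n-1$.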

**Linear regression**

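- What if I told you that the numbers I generated are a linear function of one variable, say $x_{1}$, plus normal noise? That is, $y_{i}=\beta_{0}+\beta_{1} x_{1 i}+\epsilon_{i}$ with $\epsilon_{i} \sim N\left(0, \sigma^{2}\right)$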

- Now we want to find the parameters $\beta_{0}$, $\beta_{1}$, $\sigma^{2}$ that maximize the likelihood function

- The likelihood function is now:

$$L\left(\beta_{0}, \beta_{1}, \sigma^{2}\right)=\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)^{n} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{1 i}\right)^{2}\right)$$

- The ln likelihood is

$$\ln L\left(\beta_{0}, \beta_{1}, \sigma^{2}\right)=-\frac{n}{2} \ln \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{1 i}\right)^{2}$$

- If we take the derivatives with respect to $\beta_{0}$ and $\beta_{1}$, we will find exactly the same first-order conditions as we did with OLS. For example, with respect to $\beta_{1}$:

$$\sum_{i=1}^{n} x_{1 i}\left(y_{i}-\beta_{0}-\beta_{1} x_{1 i}\right)=0$$

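- The MLE of $\sigma^{2}$ divides the sum of squared residuals by $n$ and is biased. To get an unbiased estimate we divide by $(n-p-1)$ instead, where $p$ here is the number of predictors (one in this example, so $n-2$)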

- MLE is much more general than OLS. You will use MLE for logit, probit, Poisson, mixture models...

- AIC and BIC, which are used to compare non-nested models, are based on the log-likelihood function
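To illustrate the regression case, here is a minimal sketch on simulated data. Everything about the data is an assumption made for illustration: the "true" values `beta0_true`, `beta1_true`, `sigma_true`, the sample size, the range of $x_{1}$, and the seed do not come from the notes. The code computes the OLS closed-form estimates, checks that they satisfy the first-order conditions of the log-likelihood, and then evaluates AIC and BIC using their standard definitions, $2k-2 \ln L$ and $k \ln (n)-2 \ln L$, with $k=3$ estimated parameters.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical "true" parameters used to simulate data (assumptions).
beta0_true, beta1_true, sigma_true = 2.0, 0.5, 1.0
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [beta0_true + beta1_true * xi + random.gauss(0, sigma_true) for xi in x]

# OLS closed form for one predictor; the MLE of beta0 and beta1 coincides.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals and the two variance estimates.
resid = [yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in resid)
sigma2_mle = ssr / n             # MLE: divides by n (biased)
sigma2_unbiased = ssr / (n - 2)  # divides by n - p - 1 with p = 1

# First-order conditions: both sums should be zero up to rounding error.
foc_beta0 = sum(resid)
foc_beta1 = sum(xi * e for xi, e in zip(x, resid))

# Maximized log-likelihood, then AIC and BIC (k counts beta0, beta1, sigma^2).
log_lik = -n / 2 * math.log(2 * math.pi * sigma2_mle) - ssr / (2 * sigma2_mle)
k = 3
aic = 2 * k - 2 * log_lik
bic = k * math.log(n) - 2 * log_lik

print("beta0_hat, beta1_hat:", beta0_hat, beta1_hat)
print("sigma2 MLE vs unbiased:", sigma2_mle, sigma2_unbiased)
print("first-order conditions:", foc_beta0, foc_beta1)
print("AIC, BIC:", aic, bic)
```

With real data, the simulated `x` and `y` would simply be replaced by the observed values; the rest of the sketch does not depend on how the data were generated.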