# Lecture 2.2: Method of Moments (MOM), Maximum Likelihood Estimation (MLE)  

## Outline

* Likelihood  
* Method of moments (MOM) estimation   
* Maximum likelihood estimation (MLE)  

---

## Likelihood  

**Quiz**: Assume the height of a black cherry tree follows a Normal distribution, $X \sim N(\mu, \sigma^2)$. What is the likelihood $L(\mu, \sigma^2)$ based on the observed data below?

```r
data()
trees
```

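|    | Girth | Height | Volume |
|----|-------|--------|--------|
| 1  | 8.3   | 70.0   | 10.3   |
| 2  | 8.6   | 65.0   | 10.3   |
| 3  | 8.8   | 63.0   | 10.2   |
| 4  | 10.5  | 72.0   | 16.4   |
| 5  | 10.7  | 81.0   | 18.8   |
| 6  | 10.8  | 83.0   | 19.7   |
| 7  | 11.0  | 66.0   | 15.6   |
| 8  | 11.0  | 75.0   | 18.2   |
| 9  | 11.1  | 80.0   | 22.6   |
| 10 | 11.2  | 75.0   | 19.9   |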


* If $X_1, \dots, X_n$ are an i.i.d. (independent and identically distributed) sample from a population with PDF or PMF $f(x | \theta_1, \dots, \theta_k)$, the likelihood function is defined by

$$L(\theta | \textbf{x}) = L(\theta_1, \dots, \theta_k | x_1, \dots, x_n) = \prod_{i = 1}^n f(x_i | \theta_1, \dots, \theta_k) $$  
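As a rough sketch of this definition, the snippet below evaluates a Normal likelihood and log-likelihood for tree heights at a couple of candidate parameter values. It assumes R's built-in `trees` data set (all 31 rows, not only the 10 displayed above); the helper names `normal_lik` and `normal_loglik` and the candidate values are illustrative choices, not part of the lecture.

```r
# Heights from R's built-in `trees` data set (assumption: full data set is used)
x <- trees$Height

# Likelihood: product of Normal densities evaluated at each observation
normal_lik <- function(mu, sigma2, x) {
  prod(dnorm(x, mean = mu, sd = sqrt(sigma2)))
}

# Log-likelihood: sum of log densities (numerically safer than log(prod(...)))
normal_loglik <- function(mu, sigma2, x) {
  sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# Candidate parameter values, chosen arbitrarily for illustration
lik_a <- normal_lik(mu = 76, sigma2 = 40, x = x)
lik_b <- normal_lik(mu = 60, sigma2 = 40, x = x)
loglik_a <- normal_loglik(mu = 76, sigma2 = 40, x = x)

print(c(lik_a = lik_a, lik_b = lik_b, loglik_a = loglik_a))
```

The data stay fixed while the parameters vary: a candidate that sits closer to the bulk of the observed heights yields a larger likelihood.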


### Difference between the likelihood function and the probability mass function

* The likelihood function is a **FUNCTION of the unknown parameters**.  


* The likelihood gives the probability (or density, for a continuous variable) of a **FIXED** observation $x$, for every possible value of the parameter.  


* The probability function gives the probability (or density) of every possible value of $x$, for a **FIXED** value of the parameter.
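The contrast can be checked numerically. The sketch below uses a Binomial model with 10 trials and 7 observed successes, and the fixed value $p = 0.7$; these numbers and the name `lik` are chosen here for illustration only.

```r
# Fixed parameter p = 0.7: probabilities over all outcomes x = 0, ..., 10
pmf_values <- dbinom(0:10, size = 10, prob = 0.7)
pmf_total <- sum(pmf_values)   # a PMF sums to 1 over x

# Fixed observation x = 7: likelihood as a function of p
lik <- function(p) dbinom(7, size = 10, prob = p)

# Area under the likelihood curve over p in [0, 1]; it need not equal 1
lik_area <- integrate(lik, lower = 0, upper = 1)$value

print(pmf_total)
print(lik_area)
```

The PMF sums to 1 over $x$, whereas the likelihood does not integrate to 1 over $p$: it is not a probability distribution for the parameter.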

---

## Method of Moments (MOM)  

**Quiz**: I flipped a coin 10 times, and got 7 heads. What is the MOM estimate for the probability of success (getting a head)?

### What are moments? 

The $k^{th}$ **raw moment** of a random variable $X$ is defined to be $E(X^k)$ assuming that $E(|X|^k) < \infty$.

e.g.

* The $1^{st}$ moment of $X$: $\mu_1 = E(X)$
* The $2^{nd}$ moment of $X$: $\mu_2 = E(X^2)$
* The $3^{rd}$ moment of $X$: $\mu_3 = E(X^3)$
* $\dots$

Note:

$$\mu = E(X)$$  


$$\sigma^2 = E(X^2) - (E(X))^2$$

The true/population raw moments cannot be observed, but we can estimate them by the **sample moments**:  

$$ \hat{\mu}_1 = \frac{1}{n} \sum_{i = 1}^n X_i $$
$$ \hat{\mu}_2 = \frac{1}{n} \sum_{i = 1}^n X_i^2 $$
$$ \vdots $$
$$ \hat{\mu}_k = \frac{1}{n} \sum_{i = 1}^n X_i^k $$
$$ \vdots $$  
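A minimal sketch of these formulas in R, assuming the heights from the built-in `trees` data set as the sample; `sample_moment` is a hypothetical helper name introduced here.

```r
# k-th sample raw moment: the average of x^k
sample_moment <- function(x, k) {
  mean(x^k)
}

x <- trees$Height   # assumption: full built-in data set
n <- length(x)

m1 <- sample_moment(x, 1)   # identical to mean(x)
m2 <- sample_moment(x, 2)

# Plug-in variance from the identity sigma^2 = E(X^2) - (E(X))^2
var_plugin <- m2 - m1^2

# R's var() divides by n - 1, so the two differ by the factor (n - 1) / n
print(c(m1 = m1, m2 = m2, var_plugin = var_plugin, var_R = var(x)))
```

The plug-in variance built from sample moments divides by $n$, so it is slightly smaller than the value returned by `var()`.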

**Moment estimators**


* Suppose we have a sample, $X_1, \dots, X_n$ from a population with PDF or PMF $f(x | \theta_1, \dots, \theta_k)$, and we want to estimate the $k$ parameters of the population (the $\theta_i$'s).  


* Method of moments estimators are found by equating the first $k$ sample moments to the corresponding $k$ population moments, and solving the resulting system of equations.
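Written out as a system, the recipe above might look as follows. The notation $\mu_j(\theta_1, \dots, \theta_k)$ is introduced here only to stress that each population moment is a function of the unknown parameters; this is a sketch of the general setup, not a formula for any particular distribution.

$$ \begin{align*}
     \hat{\mu}_1 = \frac{1}{n} \sum_{i = 1}^n X_i   &= \mu_1(\theta_1, \dots, \theta_k) = E(X) \\
     \hat{\mu}_2 = \frac{1}{n} \sum_{i = 1}^n X_i^2 &= \mu_2(\theta_1, \dots, \theta_k) = E(X^2) \\
                                                    &\;\;\vdots \\
     \hat{\mu}_k = \frac{1}{n} \sum_{i = 1}^n X_i^k &= \mu_k(\theta_1, \dots, \theta_k) = E(X^k)
   \end{align*} $$

Solving these $k$ equations for $\theta_1, \dots, \theta_k$ gives the MOM estimators $\hat{\theta}_1, \dots, \hat{\theta}_k$.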

In the coin example, $X \sim Binomial(n = 10, p)$, so the first population moment is:  

$$ \mu_1 = E(X) = np = 10p$$  

We have a single observation, $X = 7$, so the corresponding sample moment is:

$$\hat{\mu}_1 = \bar{X} = 7 $$

Equating the two, we have:

$$ 10\hat{p} = 7 $$  
$$ \hat{p} = 7/10 $$
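The same calculation as a short R sketch; the variable names are illustrative.

```r
n_flips <- 10   # number of coin flips (known)
x_obs <- 7      # observed number of heads

# First sample moment: with a single Binomial observation it is x_obs itself
mu1_hat <- mean(x_obs)

# Equate to the population moment n_flips * p and solve for p
p_hat_mom <- mu1_hat / n_flips
print(p_hat_mom)
```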

**Advantages of MOM**

* Simple to calculate
* Consistent (as the sample size grows, the estimates converge to the true parameter values)

**Disadvantages of MOM**

* Often biased (more on bias later today)
* Sometimes gives estimates outside the parameter space
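Consistency can be illustrated with a small simulation. The sketch below assumes individual coin flips are recorded as Bernoulli(p) draws, so that $E(X) = p$ and the MOM estimate is the sample proportion of heads; the true value `p_true`, the seed, and the sample sizes are all hypothetical choices.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
p_true <- 0.7        # hypothetical true probability of heads
sizes <- c(10, 100, 1000, 100000)

# MOM estimate of p from n Bernoulli flips: the sample proportion of heads
p_hats <- sapply(sizes, function(n) mean(rbinom(n, size = 1, prob = p_true)))

print(data.frame(n = sizes, p_hat = p_hats))
```

The estimates for small samples can be noticeably off, while the estimate from the largest sample should sit close to `p_true`.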

---

## Maximum Likelihood Estimation (MLE)  

**Quiz**: How would you find the MLE for the probability of success (getting a head) for the coin example?

### Maximum Likelihood Estimation (MLE)


* Recall that if $X_1, \dots, X_n$ are an i.i.d. (independent and identically distributed) sample from a population with PDF or PMF $f(x | \theta_1, \dots, \theta_k)$, the likelihood function is defined by

$$L(\theta | \textbf{x}) = L(\theta_1, \dots, \theta_k | x_1, \dots, x_n) = \prod_{i = 1}^n f(x_i | \theta_1, \dots, \theta_k) $$  

* Definition: For each sample $\textbf{x}$, let $\hat{\theta}(\textbf{x})$ be a parameter value at which $L(\theta | \textbf{x})$ attains its maximum as a function of $\theta$, with $\textbf{x}$ fixed. A **maximum likelihood estimator (MLE)** of the parameter $\theta$ based on a sample $\textbf{X}$ is $\hat{\theta}(\textbf{X})$.  

* We sometimes write $L(\theta | \textbf{x})$ as $L(\theta)$, and $\hat{\theta}(\textbf{x})$ as $\hat{\theta}$.

The likelihood function for the coin example is:

$$ \begin{align*}
     L(p) &= P(X = 7) \text{ when } X \sim Binomial(10, p) \\
          &= \binom{10}{7} p^7 (1 - p)^{10 - 7} \\
          &= \binom{10}{7} p^7 (1 - p)^3
   \end{align*} $$
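To get a feel for this function, one option is to evaluate it on a grid of candidate values of $p$ and see where it peaks. This is only a crude grid search, and the grid spacing of 0.01 is an arbitrary choice.

```r
# Likelihood of p given 7 heads in 10 flips
lik <- function(p) choose(10, 7) * p^7 * (1 - p)^3

# Candidate values of p
p_grid <- seq(0, 1, by = 0.01)
lik_values <- lik(p_grid)

# Grid point with the largest likelihood
p_best <- p_grid[which.max(lik_values)]
print(p_best)
```

Passing `p_grid` and `lik_values` to `plot(..., type = "l")` draws the likelihood curve, which has a single peak.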

### The Log-likelihood Function

If we take the natural logarithm of the likelihood function, we get the **log-likelihood** function.  

$$l(p) = \log(L(p))$$

For our coin example,

$$ \begin{align*}
     l(p) &= \log(L(p)) \\
          &= \log\left( \binom{10}{7} p^7 (1 - p)^3 \right) \\
          &= \log \binom{10}{7} + 7 \log(p) + 3 \log(1 - p)
   \end{align*} $$

Note: The $\log \binom{10}{7}$ term is usually dropped since it is just a constant and does not affect the value that maximizes the function (we only care about the terms that involve $p$).
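The note about the constant can be checked numerically. In this sketch, `loglik_full` and `loglik_kernel` are hypothetical names for the log-likelihood with and without the constant term, and the grid is an arbitrary choice that avoids $p = 0$ and $p = 1$, where the logarithm is infinite.

```r
# Log-likelihood with and without the constant term
loglik_full <- function(p) log(choose(10, 7)) + 7 * log(p) + 3 * log(1 - p)
loglik_kernel <- function(p) 7 * log(p) + 3 * log(1 - p)

p_grid <- seq(0.01, 0.99, by = 0.01)

# Maximizing grid point for each version
p_full <- p_grid[which.max(loglik_full(p_grid))]
p_kernel <- p_grid[which.max(loglik_kernel(p_grid))]

# The two curves differ by the same constant, log(choose(10, 7)), at every p
gap <- loglik_full(p_grid) - loglik_kernel(p_grid)

print(c(p_full = p_full, p_kernel = p_kernel))
```

Dropping the constant shifts the curve vertically without moving the location of its maximum.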

### Finding the Maximum of the Log-likelihood Function  

To find the maximizing value of $p$, we first differentiate the log-likelihood with respect to $p$:

$$ \begin{align*}
     \frac{dl}{dp} &= 7 \left( \frac{1}{p} \right) + 3 \left( \frac{-1}{1 - p} \right) \\
                   &= \frac{7}{p} - \frac{3}{1 - p}
   \end{align*} $$
   
The maximizing value of $p$ occurs when  

$$ \frac{dl}{dp} = 0 $$

This gives us  

$$ \frac{dl}{dp} = \frac{7}{p} - \frac{3}{1 - p} = 0 $$  

$$ \Rightarrow 7(1 - p) = 3p $$

$$ \Rightarrow \hat{p} = 7/10 = 0.7 $$
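The analytical answer can be cross-checked numerically. The sketch below uses base R's `optimize()` on the log-likelihood (with the constant term dropped) and also evaluates the second derivative at the solution to confirm that it is a maximum; the names `loglik` and `d2_loglik` are illustrative.

```r
# Log-likelihood for 7 heads in 10 flips, constant term dropped
loglik <- function(p) 7 * log(p) + 3 * log(1 - p)

# One-dimensional numerical maximization over p in (0, 1)
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
p_hat_mle <- fit$maximum

# Second derivative of the log-likelihood; negative values indicate a maximum
d2_loglik <- function(p) -7 / p^2 - 3 / (1 - p)^2

print(p_hat_mle)
print(d2_loglik(0.7))
```

The numerical maximizer agrees with $\hat{p} = 0.7$ up to the optimizer's tolerance, and the second derivative is negative for every $p$ in $(0, 1)$, so the stationary point is the unique maximum.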

---

### Exercise  

1) Find the MOM estimate and the MLE for $(\mu, \sigma^2)$ in the cherry tree example.

```r
trees
```

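|    | Girth | Height | Volume |
|----|-------|--------|--------|
| 1  | 8.3   | 70.0   | 10.3   |
| 2  | 8.6   | 65.0   | 10.3   |
| 3  | 8.8   | 63.0   | 10.2   |
| 4  | 10.5  | 72.0   | 16.4   |
| 5  | 10.7  | 81.0   | 18.8   |
| 6  | 10.8  | 83.0   | 19.7   |
| 7  | 11.0  | 66.0   | 15.6   |
| 8  | 11.0  | 75.0   | 18.2   |
| 9  | 11.1  | 80.0   | 22.6   |
| 10 | 11.2  | 75.0   | 19.9   |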


2) We will revisit the stock market data:

```r
library(ISLR)
summary(Smarket)
```

      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max. 

Assume the following logistic regression model:  

$$ P(\text{Direction} = \text{Up}) = p = \frac{\exp(\beta_0 + \beta_1 \text{Volume})}{1 + \exp(\beta_0 + \beta_1 \text{Volume})} $$  

Find the MLE for $\beta_0$ and $\beta_1$ without using any built-in function/library. You may find this [paper](http://czep.net/stat/mlelr.pdf) helpful.