# Lecture 4.2: Point Estimation

## Outline

* Difference between estimand, estimator and estimate

* Methods of finding estimators
    * Method of moments (MOM)
    * Maximum likelihood estimation (MLE)
* Evaluating estimators
    * Mean squared error (MSE)
        * Bias of an estimator
        * Variance of an estimator

## Objectives

* Understand clearly the difference between an estimator and an estimate
* Know how to use MOM for estimation
* Know how to find the MLE for a parameter of interest
* Know how to evaluate an estimator using mean squared error (MSE)
* Have a general understanding of the bias-variance trade-off

## Estimand, Estimator and Estimate

* **Estimand**: what we are trying to estimate (the quantity of interest)
    * e.g. the population mean, $\mu$  

* **Estimator**: what we use to estimate the estimand (the rule)
    * e.g. the sample mean $\bar{X} = \frac{1}{n} \sum_{i = 1}^n X_i$
    * the estimator is written in terms of the random variables $X_i$, not their observed values
    * the estimator itself is a random variable
    * the estimator has its own mean and variance  

* **Estimate**: the actual estimate given by the estimator and a set of data (the result)
    * e.g. the sample mean we calculate from real data $\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$
    * an estimate is a realization of the estimator
    * estimator vs estimate is just like $X$ vs $x$ (the random variable vs one value of the random variable)
    * the estimate is a fixed quantity   
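
The distinction can be illustrated with a small simulation. This is only a sketch: the population $N(5, 2^2)$, the sample size 30 and the number of repetitions are arbitrary choices made for illustration, and `sample_mean` is a hypothetical helper name.

```python
import random
import statistics


def sample_mean(xs):
    """The estimator: a rule that maps any sample to a number."""
    return sum(xs) / len(xs)


rng = random.Random(0)  # fixed seed so the run is reproducible

# One observed data set gives one estimate (a fixed number).
data = [rng.gauss(5.0, 2.0) for _ in range(30)]
estimate = sample_mean(data)

# Repeating the sampling many times reveals the estimator's own
# distribution, which has a mean and a variance.
estimates = [
    sample_mean([rng.gauss(5.0, 2.0) for _ in range(30)])
    for _ in range(5000)
]

print("one estimate:", round(estimate, 3))
print("mean of the estimator:", round(statistics.mean(estimates), 3))
print("variance of the estimator:", round(statistics.variance(estimates), 3))
```

The single `estimate` never changes once the data are observed, while the spread of `estimates` should be close to $\sigma^2 / n = 4 / 30$.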

## Methods of Finding Estimators

<img src="images/coins.png" width="400">

### Method of Moments (MOM)

Coin experiment: 8 heads out of 10 tosses  

We wanted to estimate the true probability of getting a head (the estimand) 

You said: it's common sense, $8/10 = 0.8$ (an estimate based on the data observed) 

You were essentially using a moment estimator (method of moments)  

**What are moments?**

The $k^{th}$ **moment** of a random variable $X$ is defined to be $E(X^k)$ assuming that $E(|X|^k) < \infty$.

e.g.

* The $1^{st}$ moment of $X$: $E(X)$
* The $2^{nd}$ moment of $X$: $E(X^2)$
* The $3^{rd}$ moment of $X$: $E(X^3)$
* $\dots$

Note:

$$\mu = E(X)$$  


$$\sigma^2 = E(X^2) - (E(X))^2$$

**Moment estimators**

* The **method of moments** is probably the oldest method of finding point estimators, and the estimators found are called **moment estimators**.  


* Suppose we have a sample, $X_1, \dots, X_n$ from a population with PDF or PMF $f(x | \theta_1, \dots, \theta_k)$, and we want to estimate the $k$ parameters of the population (the $\theta_i$'s).  


* Method of moments estimators are found by equating the first $k$ sample moments to the corresponding $k$ population moments and solving the resulting system of equations.

$X_1, \dots, X_n \sim f(x | \theta_1, \dots, \theta_k)$  

* Sample moments:
$$ m_1 = \frac{1}{n} \sum_{i = 1}^n X_i $$
$$ m_2 = \frac{1}{n} \sum_{i = 1}^n X_i^2 $$
$$ \vdots $$
$$ m_k = \frac{1}{n} \sum_{i = 1}^n X_i^k $$
$$ \vdots $$  

* Population moments:  
$$ \mu_1^{'} = E(X) $$
$$ \mu_2^{'} = E(X^2) $$
$$ \vdots $$
$$ \mu_k^{'} = E(X^k) $$
$$ \vdots $$   

* Let the sample moments be the estimates of the population moments:  
$$ \frac{1}{n} \sum_{i = 1}^n X_i = \widehat{E(X)} $$
$$ \frac{1}{n} \sum_{i = 1}^n X_i^2 = \widehat{E(X^2)} $$
$$ \vdots $$
$$ \frac{1}{n} \sum_{i = 1}^n X_i^k = \widehat{E(X^k)} $$
$$ \vdots $$  


* Solve the system of equations for $\theta_i$'s

**Example: Normal method of moments**

* Suppose $X_1, \dots, X_n$ are i.i.d. (independent and identically distributed) $N(\mu, \sigma^2)$. In the preceding notation, $\theta_1 = \mu$ and $\theta_2 = \sigma^2$.  

* We have 

$$m_1 = \bar{X} \text{, } m_2 = \frac{1}{n} \sum X_i^2$$  


$$E(X) = \mu$$  


$$E(X^2) = \mu^2 + \sigma^2 $$  
$$(Var(X) = \sigma^2 = E(X^2) - (E(X))^2 = E(X^2) - \mu^2) $$ 

* Equating the sample moments to the estimates of the population moments
$$ \bar{X} = \hat{\mu}$$
$$ \frac{1}{n} \sum X_i^2 = \hat{\mu}^2 + \hat{\sigma}^2 $$  

* Solving for $\hat{\mu}$ and $\hat{\sigma}^2$, we get:  

$$ \hat{\mu} = \bar{X} $$  


$$ \hat{\sigma}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2 $$
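
The Normal example above can be sketched in a few lines of Python. The function name `normal_mom` and the simulated population $N(10, 3^2)$ are assumptions made for illustration only.

```python
import random


def normal_mom(xs):
    """Method of moments estimates (mu_hat, sigma2_hat) for a Normal sample."""
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    mu_hat = m1                       # from m1 = mu_hat
    sigma2_hat = m2 - m1 ** 2         # from m2 = mu_hat^2 + sigma2_hat
    return mu_hat, sigma2_hat


rng = random.Random(1)  # fixed seed so the run is reproducible
xs = [rng.gauss(10.0, 3.0) for _ in range(2000)]
mu_hat, sigma2_hat = normal_mom(xs)

# The second form of the estimator: average squared deviation from the mean.
centered = sum((x - mu_hat) ** 2 for x in xs) / len(xs)

print("mu_hat:", round(mu_hat, 3))
print("sigma2_hat:", round(sigma2_hat, 3))
print("centered form:", round(centered, 3))
```

Up to floating-point rounding, `sigma2_hat` and `centered` agree, matching the two expressions for $\hat{\sigma}^2$. Note the divisor is $n$, not $n - 1$.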

**Advantages of MOM**

* Simple to calculate
* Consistent (under mild conditions, the estimator converges to the true parameter value as the sample size grows)

**Disadvantages of MOM**

* Often biased (more on bias later today)
* Sometimes gives estimates outside the parameter space

### Maximum Likelihood Estimation (MLE)

* The method of maximum likelihood we looked at yesterday is arguably the most popular technique for deriving estimators.

* Recall that if $X_1, \dots, X_n$ is an i.i.d. sample from a population with PDF or PMF $f(x | \theta_1, \dots, \theta_k)$, the likelihood function is defined by

$$L(\theta | \textbf{x}) = L(\theta_1, \dots, \theta_k | x_1, \dots, x_n) = \prod_{i = 1}^n f(x_i | \theta_1, \dots, \theta_k) $$  

* Definition: For each sample $\textbf{x}$, let $\hat{\theta}(\textbf{x})$ be a parameter value at which $L(\theta | \textbf{x})$ attains its maximum as a function of $\theta$, with $\textbf{x}$ fixed. A **maximum likelihood estimator (MLE)** of the parameter $\theta$ based on a sample $\textbf{X}$ is $\hat{\theta}(\textbf{X})$.  

* We sometimes write $L(\theta | \textbf{x})$ as $L(\theta)$, and $\hat{\theta}(\textbf{x})$ as $\hat{\theta}$.

**Example: Bernoulli MLE**

Let $X_1, \dots, X_n$ be i.i.d. $Bernoulli(p)$. Then the likelihood function is

$$ L(p | \textbf{x}) = \prod_{i = 1}^n p^{x_i} (1 - p)^{1 - x_i} = p^y(1 - p)^{n - y} $$  

where $y = \sum x_i$.  

While this function is not hard to differentiate, it is much easier to differentiate the log-likelihood  

$$ l(p | \textbf{x}) = \log(L(p | \textbf{x})) = y \log(p) + (n - y) \log(1 - p) $$

Differentiate $l(p | \textbf{x})$ with respect to $p$ and set the derivative to 0  

$$ \frac{y}{p} - \frac{n - y}{1 - p} = 0 $$

Solving for $p$, we get  

$$\hat{p} = \frac{y}{n}$$
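
As a sanity check, the closed form can be compared with a brute-force maximization of the log-likelihood. The sketch below reuses the coin data from earlier (8 heads in 10 tosses); the grid resolution and the function name `bernoulli_log_likelihood` are arbitrary illustrative choices, and a grid search is used only because it needs no extra libraries.

```python
import math


def bernoulli_log_likelihood(p, y, n):
    """Log-likelihood of p given y successes in n Bernoulli trials."""
    return y * math.log(p) + (n - y) * math.log(1 - p)


y, n = 8, 10  # coin experiment: 8 heads out of 10 tosses

# Candidate values of p strictly inside (0, 1), where the logs are defined.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, y, n))

p_closed = y / n  # closed-form MLE derived above

print("grid maximizer:", p_grid)
print("closed form:", p_closed)
```

For this example the MLE coincides with the "common sense" estimate $8/10$ from the coin experiment.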

## Evaluating Estimators

In many cases, different methods will lead to different estimators, and in today's class, we will introduce one of the criteria for evaluating estimators.

### Mean Squared Error (MSE)

* The **mean squared error (MSE)** of an estimator $\hat{\theta}$ of a parameter $\theta$ is the function of $\theta$ defined by $E(\hat{\theta} - \theta)^2$.

* In other words, the MSE measures the average squared distance between the estimator $\hat{\theta}$ and the parameter $\theta$ it is meant to estimate.  

* Adding and subtracting $E(\hat{\theta})$ inside the square and expanding, we can express the MSE as

$$ E(\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2 = Var(\hat{\theta}) + (Bias(\hat{\theta}))^2 $$ 
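
For reference, the intermediate steps of this decomposition can be written out as follows. The cross term vanishes because $E(\hat{\theta}) - \theta$ is a constant and $E[\hat{\theta} - E(\hat{\theta})] = 0$.

$$
\begin{aligned}
E(\hat{\theta} - \theta)^2
&= E\left[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\right]^2 \\
&= E(\hat{\theta} - E(\hat{\theta}))^2 + 2\,(E(\hat{\theta}) - \theta)\,E[\hat{\theta} - E(\hat{\theta})] + (E(\hat{\theta}) - \theta)^2 \\
&= Var(\hat{\theta}) + 0 + [E(\hat{\theta}) - \theta]^2
\end{aligned}
$$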

#### Bias

* The **bias** of a point estimator $\hat{\theta}$ is the difference between the expected value of $\hat{\theta}$ and $\theta$.  

* That is, $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$. 

* If an estimator satisfies $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta = 0$ for all $\theta$, it is called an **unbiased estimator**.

#### The two components of MSE

* $Var(\hat{\theta})$, the variability of the estimator (precision)  


* $Bias(\hat{\theta})$, the bias of the estimator (accuracy)  


**To find an estimator with good MSE properties**, we need to find estimators that control both variance and bias.

#### Unbiased estimators

* Unbiased estimators do a good job of controlling bias  

* When $Bias(\hat{\theta}) = 0$, the squared-bias term drops out of the MSE, and we have  

$$ MSE = E(\hat{\theta} - \theta)^2 = Var(\hat{\theta}) $$  

* If an estimator is unbiased, its MSE is equal to its variance.

**Example: Normal MSE**  

Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. The statistics $\bar{X}$ (sample mean) and $S^2 = \frac{1}{n - 1} \sum (X_i - \bar{X})^2$ (sample variance) are unbiased estimators of $\mu$ and $\sigma^2$, respectively, since

$$ E(\bar{X}) = \mu $$  


$$ E(S^2) = \sigma^2 $$  


for all $\mu$ and $\sigma^2$.

The **MSE**s of these estimators are given by  


$$ E(\bar{X} - \mu)^2 = Var(\bar{X}) = \frac{\sigma^2}{n} $$  


$$ E(S^2 - \sigma^2)^2 = Var(S^2) = \frac{2 \sigma^4}{n - 1} $$
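
These two formulas can be checked approximately by simulation. The sketch below is illustrative only: the values $\mu = 0$, $\sigma = 2$, $n = 10$ and the number of repetitions are arbitrary assumptions, and the helper names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sample_variance(xs):
    """S^2, the sample variance with divisor n - 1."""
    xbar = mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(2)  # fixed seed so the run is reproducible
mu, sigma, n, reps = 0.0, 2.0, 10, 20000

sq_err_xbar, sq_err_s2 = [], []
for _ in range(reps):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    sq_err_xbar.append((mean(xs) - mu) ** 2)
    sq_err_s2.append((sample_variance(xs) - sigma ** 2) ** 2)

# Monte Carlo approximations of the two MSEs
mse_xbar = mean(sq_err_xbar)
mse_s2 = mean(sq_err_s2)

# Theoretical values from the formulas above
theory_xbar = sigma ** 2 / n
theory_s2 = 2 * sigma ** 4 / (n - 1)

print("MSE of X-bar: simulated", round(mse_xbar, 3), "theory", round(theory_xbar, 3))
print("MSE of S^2:   simulated", round(mse_s2, 3), "theory", round(theory_s2, 3))
```

The simulated values should be close to, but not exactly equal to, the theoretical ones, since they are averages over a finite number of repetitions.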

#### Bias-variance trade-off

* Although many unbiased estimators are also reasonable from the standpoint of MSE, keep in mind that controlling bias alone does not guarantee that MSE is controlled.  

* In particular, it is sometimes the case that a trade-off occurs between variance and bias in such a way that a small increase in bias can be traded for a larger decrease in variance, resulting in an improved MSE.
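
A standard illustration of this trade-off, offered here as a sketch rather than as part of the lecture's own examples, compares two estimators of $\sigma^2$ for Normal data: the unbiased $S^2$ (divisor $n - 1$) and the moment estimator $\hat{\sigma}^2$ (divisor $n$). Since $\hat{\sigma}^2 = \frac{n - 1}{n} S^2$, its bias is $-\sigma^2 / n$ and its variance is $\frac{2(n - 1)\sigma^4}{n^2}$, so its MSE is $\frac{(2n - 1)\sigma^4}{n^2}$, which is smaller than $\frac{2\sigma^4}{n - 1}$. The simulation settings ($\sigma^2 = 4$, $n = 10$) and helper names below are arbitrary assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def summarize(estimates, truth):
    """Return (bias, variance, mse) of simulated estimates around truth."""
    centre = mean(estimates)
    bias = centre - truth
    variance = mean([(e - centre) ** 2 for e in estimates])
    mse = mean([(e - truth) ** 2 for e in estimates])
    return bias, variance, mse


rng = random.Random(3)  # fixed seed so the run is reproducible
sigma2, n, reps = 4.0, 10, 20000

unbiased, biased = [], []
for _ in range(reps):
    xs = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = mean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations
    unbiased.append(ss / (n - 1))          # S^2
    biased.append(ss / n)                  # moment estimator of sigma^2

bias_u, var_u, mse_u = summarize(unbiased, sigma2)
bias_b, var_b, mse_b = summarize(biased, sigma2)

print("S^2:       bias", round(bias_u, 3), "var", round(var_u, 3), "MSE", round(mse_u, 3))
print("divisor n: bias", round(bias_b, 3), "var", round(var_b, 3), "MSE", round(mse_b, 3))
```

In this setting the divisor-$n$ estimator accepts a small negative bias in exchange for a larger reduction in variance, so its MSE comes out lower than that of $S^2$.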