# Estimation

We want to find out the values of the population parameters, but we cannot poll the entire population, so we use a (random) sample of it and infer these values from the sample.

## Point estimators and Point estimates

A point estimator is a statistic used to estimate a population parameter.

If $X_1, X_2, ..., X_n$ are the random variables corresponding to the _n_ elements of the sample, then a function of these random variables, $f(X_1, X_2, ..., X_n)$, used to estimate the population parameter $\theta$ is a point estimator.

### Examples of point estimators

1. Sample mean (estimator of the population mean $\mu$):
$$
\bar{X} = \frac{1}{n} \sum^n X_i
$$

2. Sample variance (estimator of the population variance $\sigma^2$):
$$
S^2 = \frac{1}{n-1} \sum^n (X_i-\bar{X})^2
$$

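3. Sample proportion (estimator of the population proportion _p_), where each $X_i$ is 1 if the element has the characteristic of interest and 0 otherwise:
$$
\hat{P} = \frac{1}{n} \sum^n X_i
$$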

The random variables corresponding to the sample elements (denoted in uppercase, $X_1, ..., X_n$) become, once observed, sample observations (denoted in lowercase, $x_1, ..., x_n$). The values taken by the point estimators $f(X_1, ..., X_n)$, now written $f(x_1, ..., x_n)$, are called "point estimates".
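As a small illustration of the three estimators above, here is a minimal Python sketch that turns observed values into point estimates. The data (`heights_cm`, `answers`) are made up for the example, and the standard library's `statistics.variance` is used because it applies the $n-1$ divisor.

```python
import statistics

# Hypothetical observations x_1, ..., x_n (made up for illustration).
heights_cm = [172.0, 165.5, 180.2, 158.9, 175.4, 169.8]

# Point estimate of the population mean: (1/n) * sum(x_i).
x_bar = statistics.mean(heights_cm)

# Point estimate of the population variance, using the n - 1 divisor.
s_squared = statistics.variance(heights_cm)

# Point estimate of a population proportion: share of 1s ("successes").
answers = [1, 0, 1, 1, 0, 1, 0, 1]
p_hat = sum(answers) / len(answers)

print(x_bar, s_squared, p_hat)
```

A different sample would give different numbers: the formulas (the estimators) stay the same, only the estimates change.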

## Maximum likelihood

One way of estimating the population parameters is by maximizing the likelihood of the data we observe.

Given an i.i.d. random sample $X_1, X_2, ..., X_n$, where each $X_i$ has density function $f(x_i;\theta)$, and observations $x_1, x_2, ..., x_n$, the joint density function (the likelihood, denoted $\mathbf{L}$) is given by:
$$
\mathbf{L(\theta)} = P(X_1=x_1, X_2=x_2, ..., X_n=x_n;\theta) = f(x_1;\theta) \times f(x_2;\theta) \times ... \times f(x_n;\theta) = \prod^n f(x_i;\theta)
$$

All sample elements come from the same distribution (that is what i.i.d. gives us), which is parameterized by the unknown $\theta$; with the observations fixed, $\mathbf{L}$ is thus a function of $\theta$ only. We want to find the $\theta$ that maximizes $\mathbf{L}$.

In general, $\mathbf{L}$ can depend not on one $\theta$ but on many: $\mathbf{L(\theta_1, \theta_2, ..., \theta_m)}$. We want to estimate the value of each unknown $\theta_i$.

The function $f_i(X_1, X_2, ..., X_n)$ that gives the value of $\theta_i$ maximizing the likelihood function is known as the __maximum likelihood estimator__ of $\theta_i$. The parameter $\theta_i$ is a population parameter, so it is denoted in lowercase; the estimator is a random variable, so it is denoted in uppercase: $\hat{\Theta}_i$. We will have as many of these functions as thetas, of course:
$$
\hat{\Theta}_i = f_i(X_1, X_2, ..., X_n)
$$
Once the sample elements are observed, these estimators (the "maximum likelihood estimators": $f_i(X_1, X_2, ..., X_n)$) become __maximum likelihood estimates__ ($\hat{\theta}_i = f_i(x_1, x_2, ..., x_n)$).
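To make the idea concrete, here is a minimal sketch that finds a maximum likelihood estimate numerically. It assumes a Bernoulli model (each observation is 0 or 1, with unknown success probability $p$ playing the role of $\theta$), made-up observations, and a plain grid search. It maximizes $\log \mathbf{L}$ instead of $\mathbf{L}$, which gives the same maximizer because the logarithm is increasing.

```python
import math

def bernoulli_log_likelihood(p, observations):
    """log L(p) = sum_i log f(x_i; p), with f(x; p) = p^x * (1 - p)^(1 - x)."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in observations)

# Hypothetical observations: 7 successes out of 10.
observations = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate values of p strictly inside (0, 1), to avoid log(0).
grid = [k / 1000 for k in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the largest log-likelihood.
p_mle = max(grid, key=lambda p: bernoulli_log_likelihood(p, observations))

print(p_mle)
```

For this model the maximizer can also be found analytically and equals the sample proportion, so the grid search should land on the fraction of ones in `observations`.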

## Unbiased estimators

The estimator is a random variable, so its value will vary from sample to sample. For it to be a good estimator, we would like it to equal, on average, the true value of the unknown population parameter. Estimators with this property are called __unbiased estimators__:
$$
\mathbf{E}[\hat{\Theta}] = \theta
$$

### Example: $S^2$ as an unbiased estimator of $\sigma^2$

For this proof, we will need the following identities ($\bar{X}$ is a random variable, and thus, has an expected value and a variance):

$$
\sigma_X^2 = \mathbf{VAR}[X] = \frac{1}{n} \sum (X_i - \mu)^2 = \frac{1}{n} [ \sum X_i^2 - 2 \times \mu \times \sum X_i + \sum \mu^2 ] = \frac{1}{n} [ \sum X_i^2 - 2 \times \mu \times n \times \mu + n \times \mu^2 ] = \frac{1}{n} [ \sum X_i^2 - n \times \mu^2 ] = \mathbf{E}[X^2] - \mu^2
$$

$$
\mu_{\bar{X}} = \mathbf{E}[\bar{X}] = \mathbf{E}[\frac{1}{n} \sum X_i ] = \frac{1}{n} \times \sum \mathbf{E}[X_i] = \frac{1}{n} \times n \times \mu = \mu
$$

$$
\sigma_{\bar{X}}^2 = \mathbf{VAR}[\bar{X}] = \mathbf{VAR}[ \frac{1}{n} \sum X_i ] = \frac{1}{n^2} \sum \mathbf{VAR}[X_i] = \frac{1}{n^2} \times n \times \sigma^2 = \frac{\sigma^2}{n}
$$

$$
\sigma_{\bar{X}}^2 = \frac{1}{n} \sum (\bar{X}_i - \mu_{\bar{X}})^2 = \frac{1}{n} [ \sum \bar{X}_i^2 - 2 \times \mu_{\bar{X}} \times \sum \bar{X}_i + \sum \mu_{\bar{X}}^2 ] = \frac{1}{n} [ \sum \bar{X}_i^2 - 2 \times \mu_{\bar{X}} \times n \times \mu_{\bar{X}} + n \times \mu_{\bar{X}}^2 ] = \frac{1}{n} [ \sum \bar{X}_i^2 - n \times \mu_{\bar{X}}^2 ] = \mathbf{E}[\bar{X}^2] - \mu_{\bar{X}}^2
$$

$$
\begin{align}
\begin{split}
\mathbf{E}[S^2] &= \mathbf{E}[ \frac{1}{n-1} \sum (X_i - \bar{X})^2 ] \\
&= \frac{1}{n-1} \mathbf{E}[ \sum X_i^2 - 2 \times \bar{X} \times (\sum X_i) + \sum \bar{X}^2 ] \\
&= \frac{1}{n-1} \mathbf{E}[ \sum X_i^2 - 2 \times n \times \bar{X} \times \bar{X} + n \times \bar{X}^2 ] \\
&= \frac{1}{n-1} \mathbf{E}[ \sum X_i^2 - n \times \bar{X}^2 ] \\
&= \frac{1}{n-1} [ \mathbf{E}[ \sum X_i^2 ] - n \times \mathbf{E}[\bar{X}^2 ] ] \\
&= \frac{1}{n-1} [ n \times \sigma_X^2 + n \times \mu^2 - n \times [ \frac{\sigma^2}{n} + \mu_{\bar{X}}^2 ] ] \\
&= \frac{(n-1) \times \sigma^2}{n-1} = \sigma^2
\end{split}
\end{align}
$$
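The result can also be checked empirically. The sketch below is a simulation under assumed values (a normal population with $\mu = 10$, $\sigma = 2$, samples of size 5, all chosen arbitrarily): it draws many samples and averages both $S^2$ and the version that divides by $n$ instead of $n-1$.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population parameters and sample size (arbitrary choices).
mu, sigma, n, trials = 10.0, 2.0, 5, 20000

unbiased, biased = [], []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    squared_deviations = sum((x - x_bar) ** 2 for x in sample)
    unbiased.append(squared_deviations / (n - 1))  # S^2, divisor n - 1
    biased.append(squared_deviations / n)          # same sum, divisor n

# Averages over all samples approximate the expected values.
mean_unbiased = statistics.fmean(unbiased)
mean_biased = statistics.fmean(biased)
print(mean_unbiased, mean_biased)
```

The first average should come out close to $\sigma^2 = 4$, while the second should sit near $\frac{n-1}{n} \sigma^2 = 3.2$, which is why the $n-1$ divisor is used.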

## Confidence intervals

As mentioned before, the point estimates vary from sample to sample; in reality, we don't really care about the point estimates (the values obtained from one particular sample), but would rather like to know the value of the population parameters. We will use the point estimates to construct a range of values in which we can be confident the parameter lies.

Usually, confidence intervals are computed for:
1. One mean
2. Mean difference
3. One variance
4. Variance difference
5. One proportion
6. Proportion difference

### Deriving a confidence interval for one mean (assuming known $\sigma$, as an example)

Think about a standard Gaussian curve. Mark two points equidistant from zero, $-Z_{\frac{\alpha}{2}}$ and $+Z_{\frac{\alpha}{2}}$, that leave an area of $\frac{\alpha}{2}$ in each tail (and thus enclose an area of $1 - \alpha$ between them), i.e.:
<img src="img/confidence_interval.png"/>

With that picture in mind, it's easy to see that:
$$
P(-Z_{\frac{\alpha}{2}} < Z < +Z_{\frac{\alpha}{2}}) = 1 - \alpha
$$

Then, we can replace $Z$ by its formula, and derive the confidence interval:
$$
\begin{align}
\begin{split}
P(-Z_{\frac{\alpha}{2}} < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} < +Z_{\frac{\alpha}{2}}) &= 1 - \alpha \Rightarrow \\
P(-Z_{\frac{\alpha}{2}} \times \sigma_{\bar{X}} - \bar{X} < - \mu < +Z_{\frac{\alpha}{2}} \times \sigma_{\bar{X}} - \bar{X}) &= 1 - \alpha \Rightarrow \\
P(-Z_{\frac{\alpha}{2}} \times \sigma_{\bar{X}} + \bar{X} < \mu < +Z_{\frac{\alpha}{2}} \times \sigma_{\bar{X}} + \bar{X}) &= 1 - \alpha \Rightarrow \\
P(-Z_{\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} + \bar{X} < \mu < +Z_{\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} + \bar{X}) &= 1 - \alpha
\end{split}
\end{align}
$$
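The last line of the derivation translates directly into code. The following is a minimal sketch, assuming $\sigma$ is known and using made-up measurements; the function name `z_confidence_interval` is only illustrative, and the critical value comes from the standard library's `NormalDist().inv_cdf`.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Interval for the mean, x_bar -/+ z * sigma / sqrt(n), with sigma known."""
    n = len(sample)
    x_bar = sum(sample) / n
    alpha = 1.0 - confidence
    # Point that leaves an area of alpha / 2 in the upper tail of the standard normal.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical measurements, with an assumed known sigma of 0.3.
sample = [9.8, 10.4, 10.1, 9.6, 10.3, 10.0, 9.9, 10.2, 10.5]
low, high = z_confidence_interval(sample, sigma=0.3)
print(low, high)
```

The interval is centered on the sample mean, and its half-width shrinks with $\sqrt{n}$, so quadrupling the sample size halves the width.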

### Interpretation (important, and not so intuitive)

It is incorrect to say that _"there is a 95% (for example) probability that the population mean lies in this particular confidence interval"_. The population mean is a fixed (unknown) value, and the interval is derived from the sample, so each sample gives a different interval. The correct interpretation is that, out of all the intervals we could construct from repeated samples, 95% would contain the population mean (that is, 95% of the intervals would be correct).

In reality, we only have one sample, and the confidence interval derived from it will either be correct or incorrect. As we do not know the true value of the parameter, we will never know whether it is. We can only state that we are 95% confident in the interval, because the procedure that produced it yields a correct interval 95% of the time.
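This "95% of the intervals" reading can be illustrated with a simulation. The sketch below assumes a normal population with arbitrarily chosen $\mu = 50$ and known $\sigma = 8$ (in a simulation we get to know the true mean), builds one interval per sample, and counts how many intervals contain $\mu$.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

# Assumed population parameters and sample size (arbitrary choices).
mu, sigma, n, trials = 50.0, 8.0, 25, 10000

z = NormalDist().inv_cdf(0.975)    # critical value for a 95% interval
margin = z * sigma / math.sqrt(n)  # same half-width for every sample

hits = 0
for _ in range(trials):
    # Each sample yields its own x_bar, hence its own interval.
    x_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if x_bar - margin < mu < x_bar + margin:
        hits += 1

coverage = hits / trials
print(coverage)
```

The printed fraction should be close to 0.95. Any single interval in the loop either contains $\mu$ or it does not; the 95% describes the procedure, not one particular interval.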

### A more realistic situation ($\sigma$ unknown)

It is unrealistic to believe that we would know the population parameter $\sigma$ (if we did, we would most likely also know $\mu$). In reality, we have to build the confidence interval without the value of $\sigma$. What we do in this case is use our best estimate of $\sigma$, which is _S_ (the square root of the sample variance $S^2$); and when doing so, we no longer use a _Z_-statistic, but a _T_-statistic with $n-1$ degrees of freedom (for the rest, it is the same formula and procedure).
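Written out with the same steps as in the known-$\sigma$ case, and assuming the population is (at least approximately) normal so that the _T_-statistic applies, the interval would look like this. The symbol $t_{\frac{\alpha}{2}, n-1}$ is notation introduced here for the point that leaves an area of $\frac{\alpha}{2}$ in the upper tail of the _T_ distribution with $n-1$ degrees of freedom:

$$
P(-t_{\frac{\alpha}{2}, n-1} \times \frac{S}{\sqrt{n}} + \bar{X} < \mu < +t_{\frac{\alpha}{2}, n-1} \times \frac{S}{\sqrt{n}} + \bar{X}) = 1 - \alpha
$$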