# Lecture 4.4: Confidence Intervals

## Outline

* Recap: Central Limit Theorem
* Confidence intervals for one sample mean
    * z-based
    * t-based
    * Difference between Normal and t-distribution

## Objectives

* Understand the frequentist interpretation of confidence intervals
* Know how to calculate and interpret confidence intervals for one sample mean (both z- and t-based)

## Recap: The Really Cool Central Limit Theorem (CLT)

### Sampling Distribution

#### Review: Population Parameters vs Sample Statistics

<img src="images/population_vs_sample.png" width="300">

* The value of a **population parameter** is a **fixed number**; it is **NOT random**, but its value is usually **not known**.  


* The value of a sample statistic is calculated from sample data.  


* The value of a **sample statistic** will **vary** from sample to sample (sampling distributions).

**Example**: Suppose (and this is never the case) that we know our entire population and have everyone's age. Then we would know the population parameters, e.g. $\mu$ and $\sigma^2$.  

* Each time we take a sample from this population and calculate the sample mean, we would potentially get a different sample mean value - this is called **sampling variation**.

* If I took many (say 1000) samples of size 20 from this population and made a histogram of the 1000 means, it would look like this:

<img src="images/sample_dist.png" width="400">

* In general we only take one sample. The histogram shows the range and relative likelihood of the different $\bar{x}$ values we could have obtained.  

<img src="images/sample_dist2.png" width="500">
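The sampling experiment described above can be sketched in a few lines of NumPy. The population here is hypothetical (ages drawn from a Gamma distribution, with an arbitrary seed), so the numbers are illustrative only and will not reproduce the histograms above. Because the simulated population is large, sampling with replacement is used as a close stand-in for drawing 20 distinct people.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Hypothetical population of 100,000 ages (an assumption, not data from the lecture)
population = rng.gamma(shape=9.0, scale=4.0, size=100_000)
mu = population.mean()    # population mean, known here only because we simulated it
sigma = population.std()  # population standard deviation

# Draw 1000 samples of size 20 and compute the mean of each sample
n, n_samples = 20, 1000
samples = rng.choice(population, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(f"population mean:        {mu:.2f}")
print(f"mean of the 1000 means: {sample_means.mean():.2f}")
print(f"std of the 1000 means:  {sample_means.std(ddof=1):.2f}")
print(f"sigma / sqrt(n):        {sigma / np.sqrt(n):.2f}")
```

The sample means scatter around the population mean, and their spread should be close to $\sigma / \sqrt{n}$, which is much smaller than the spread of individual ages.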

### Central Limit Theorem

* The CLT states that if random samples of size $n$ are repeatedly drawn from any population with mean $\mu$ and variance $\sigma^2$, then when $n$ is large, the distribution of the sample means will be approximately Normal:  


$$ \bar{X} \dot{\sim} N(\mu, \frac{\sigma^2}{n}) $$  

* This is a really cool and powerful result. It says that no matter what our initial data looks like (as long as its variance is finite), when we take averages of large enough samples we end up with an approximately Normal distribution.  

* This result is so useful because we usually don’t know where our data comes from, i.e. what the shape of the underlying true distribution looks like. The CLT says that as long as we work with averages, it doesn’t matter.


* [Sampling distribution simulation](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)

**How Large is Large Enough?**  

* For **most** distributions, $n > 30$ will give a sampling distribution that is nearly Normal   


* For **fairly symmetric** distributions, $n > 15$  


* For **Normal** population distributions, the sampling distribution of the mean is always Normally distributed
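One way to get a feel for these rules of thumb is to start from a clearly skewed population and watch the skewness of the sample means shrink as $n$ grows. The sketch below assumes an Exponential(1) population, which is an arbitrary choice and not part of the lecture. A skewness near 0 is consistent with a symmetric, Normal-like shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # arbitrary seed
n_reps = 20_000                 # number of simulated samples per sample size

skewness_by_n = {}
for n in (2, 15, 30, 100):
    # Each row is one sample of size n from a right-skewed Exponential(1) population
    means = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)
    skewness_by_n[n] = skew(means)
    print(f"n = {n:3d}: skewness of sample means = {skewness_by_n[n]:.2f}")
```

For this population the skewness falls steadily with $n$ but is still visible at $n = 30$, which is why the thresholds above are guidelines rather than guarantees.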

**Example**: The service times for customers coming through a checkout counter in a retail store are independent random variables with a mean of 1.5 minutes and a variance of 1.0.  
What is the probability that 88 customers can be serviced in less than 2 hours of total service time by this one checkout counter?

We want to find  


$$P(\sum_{i = 1}^{88} X_i < 120)$$

If we divide both sides by 88, we get  


$$  P(\bar{X} < 1.36) $$

From CLT, we have  

$$ \begin{align*}
     \bar{X} &\sim N(\mu, \frac{\sigma^2}{n}) \\
             &\sim N(1.5, \frac{1}{88})
   \end{align*} $$          

Then  


$$ P(\bar{X} < 1.36) = 0.095 $$

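```python
from scipy.stats import norm
import numpy as np

# X-bar ~ N(1.5, 1/88); scipy's norm takes the standard deviation as its second argument
norm(1.5, 1 / np.sqrt(88)).cdf(1.36)
```

```
0.094538175129462776
```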

So there is about a 10% chance that all 88 customers can be serviced within 2 hours.
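As a cross-check, the same probability can be computed without rounding $120 / 88$ to 1.36, and also directly from the distribution of the total service time, which by the same CLT argument is approximately $N(n\mu, n\sigma^2)$. This is a sketch using the numbers from the example; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

mu, var, n = 1.5, 1.0, 88  # mean and variance of one service time, number of customers
total_minutes = 120        # 2 hours

# Via the sample mean: X-bar ~ N(mu, var / n), evaluated at the unrounded 120 / 88
p_mean = norm(loc=mu, scale=np.sqrt(var / n)).cdf(total_minutes / n)

# Via the sum: X_1 + ... + X_n ~ N(n * mu, n * var)
p_sum = norm(loc=n * mu, scale=np.sqrt(n * var)).cdf(total_minutes)

print(f"via the mean: {p_mean:.4f}")
print(f"via the sum:  {p_sum:.4f}")
```

Both routes give the same answer, roughly 0.10. The small gap from the 0.095 above comes only from rounding the cutoff to 1.36.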

## Confidence Intervals

Say we have a population of interest and we want to determine what its mean $\mu$ is.  

We know the procedure is to generate a random sample $X_1, X_2, \dots, X_n$ and form the estimate  

$$ \bar{X} = \frac{1}{n} \sum_{i = 1}^n X_i $$

It would be grossly **misleading** to claim that $\mu$ is precisely equal to the observed $\bar{x}$.  

To detail our uncertainty about our estimate for $\mu$, we can construct a confidence interval or interval estimate for $\mu$.


**Example**: Say I want to estimate the mean weight of the population.  

* I can give a point estimate, $\bar{x} = 150$ pounds,  
* or I can give an interval estimate: I estimate the mean weight to be between 140 and 160 pounds with a certain level of confidence (more on this later).  


Confidence intervals are a vital part of statistics, since an estimate is useless without some notion of its precision. That is exactly what a confidence interval tells us: how good our estimate is.

### Point and Interval Estimates

* A point estimate is a single number
* A confidence interval provides additional information about variability

<img src="images/point_vs_interval.png" width="500">


### Confidence Intervals for One Sample Mean

**Example**: We are interested in estimating the mean weight of a certain population, and we know the standard deviation of the population is 3 pounds. We took a random sample of 100 people from the population and calculated their average weight, and we got $\bar{X} = 150$ pounds. Is this a good estimate for the population mean weight? How do we know?

Let's construct an interval estimate.

We construct the interval estimate using the CLT, which says 

$$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$  

Then by **standardizing** it (taking the Z score):  

$$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) $$

#### Standardizing Rule

If $X \sim N(\text{mean, variance})$, then  

$$ Z = \frac{X - \text{mean}}{\sqrt{\text{variance}}} \sim N(0, 1) $$   

This is also called **normalization**.

For $Z \sim N(0, 1)$, we have  

$$ P(-1.96 < Z < 1.96) = 0.95 $$  
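The value 1.96 can be checked numerically with the standard Normal CDF. This is a minimal sketch using `scipy.stats.norm`:

```python
from scipy.stats import norm

# P(-1.96 < Z < 1.96) for Z ~ N(0, 1), as a difference of two CDF values
prob = norm.cdf(1.96) - norm.cdf(-1.96)
print(f"{prob:.4f}")
```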

So far, we have  

$$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$  


$$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) $$


$$ P(-1.96 < Z < 1.96) = 0.95 $$  


Then

$$ P(-1.96 < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < 1.96) = 0.95 $$  


By doing some algebra, we can rearrange things so that

$$ P(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95 $$

**Note**: this is a probability statement about $\bar{X}$ (random variable), not about $\mu$ (number).

We are 95% confident that the true mean $\mu$ is in the interval  

$$ (\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \text{, } \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}) $$

Sometimes, we write it as 

$$ \bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} $$

This is the **z-based** confidence interval for one sample mean.

Going back to our weight example, we have  

$$ \sigma^2 = 3^2 $$
$$ \bar{x} = 150 $$
$$ n = 100 $$

Then the confidence interval for the population mean weight, $\mu$, is  

$$ (\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \text{, } \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}) $$

$$ (150 - 1.96 \times \frac{3}{\sqrt{100}}, 150 + 1.96 \times \frac{3}{\sqrt{100}}) $$

$$ \rightarrow (149.4, 150.6) $$

We are 95% confident that the true mean weight is somewhere between 149.4 and 150.6 pounds.
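The calculation above can be wrapped in a small helper. The function name `z_confidence_interval` and its signature are illustrative choices, not part of any library. It assumes the population standard deviation $\sigma$ is known.

```python
import numpy as np

def z_confidence_interval(x_bar, sigma, n, z=1.96):
    """z-based confidence interval for a mean, with known population sigma."""
    margin = z * sigma / np.sqrt(n)  # half-width of the interval
    return x_bar - margin, x_bar + margin

# Weight example: x_bar = 150, sigma = 3, n = 100
lower, upper = z_confidence_interval(x_bar=150, sigma=3, n=100)
print(f"({lower:.1f}, {upper:.1f})")  # (149.4, 150.6)
```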

**What does it mean that we are 95% confident?**

If we had 100 different samples with the same sample size and created 100 different intervals, we would expect (on average) 95 out of 100 of them to contain the true (but unknown) mean.
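This repeated-sampling interpretation can be checked by simulation. The sketch below assumes a Normal population with $\mu = 150$ and $\sigma = 3$ (in practice $\mu$ is unknown; it is fixed here only so that coverage can be counted), and uses 10,000 repetitions instead of 100 to get a stable estimate.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma, n = 150, 3, 100      # "true" mean, known sigma, sample size
n_reps = 10_000                 # number of repeated samples

# One row per repeated sample; compute each sample's mean
x_bars = rng.normal(mu, sigma, size=(n_reps, n)).mean(axis=1)

# 95% z-based interval around each sample mean
margin = 1.96 * sigma / np.sqrt(n)
covered = (x_bars - margin < mu) & (mu < x_bars + margin)

coverage = covered.mean()
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The fraction should land close to 0.95. Any single interval either contains $\mu$ or it does not; the 95% describes the procedure, not one particular interval.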

#### Margin of Error

**Margin of Error (e)**: the amount added and subtracted to the point estimate to form the confidence interval.

$$ \bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} \Rightarrow e = 1.96 \frac{\sigma}{\sqrt{n}}$$

**Example**: 

In the weight example, what is the margin of error?

$$ e = 1.96 \frac{\sigma}{\sqrt{n}}= 1.96 \times \frac{3}{\sqrt{100}} = 0.588 \approx 0.6 $$

**Determining sample size**

In an effort to reduce the margin of error to 0.2, we want to collect another sample. How many people's weights do we need to sample to be within 0.2 pound of the true mean?

We want

$$ e = 1.96 \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{3}{\sqrt{n}} = 0.2 $$

Solving for $n$:

$$ n = (\frac{1.96 \sigma}{e})^2 = (\frac{1.96 \times 3}{0.2})^2 = 864.36 \approx 865 $$

We need to sample at least 865 people's weights to have a margin of error of at most 0.2.

**Note**: for sample size calculations, we always round up to the next whole number.
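The sample size formula, including the round-up rule, can be sketched as follows. The function name `required_sample_size` is a hypothetical helper.

```python
import math

def required_sample_size(sigma, margin_of_error, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) is at most margin_of_error."""
    return math.ceil((z * sigma / margin_of_error) ** 2)  # always round up

n_needed = required_sample_size(sigma=3, margin_of_error=0.2)
print(n_needed)  # 865
```

Rounding down to 864 would leave the margin of error slightly above 0.2, which is why the result is always rounded up.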

**Does it always have to be 95%?**

Nope.

We can find confidence interval for any level of confidence.  

$$ \bar{X} \pm z_{\alpha /2} \frac{\sigma}{\sqrt{n}} $$  

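$z_{\alpha /2}$ is the point on the standard Normal curve with area $\alpha / 2$ to its right, so that the area between $-z_{\alpha /2}$ and $z_{\alpha /2}$ is $1 - \alpha$: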

<img src="images/z_curve.png" width="400">

For example, when we calculate a 95% confidence interval, $\alpha = 5\%$, $z_{\alpha /2}$ = 1.96.  

A confidence interval at any confidence level $1 - \alpha$ is then given by  

$$ \bar{X} \pm z_{\alpha /2} \frac{\sigma}{\sqrt{n}} $$  


Here are the most common values  
<img src="images/z_values.png" width="400">


```python
norm.ppf(0.975) # z value for 95% confidence interval
```

```
1.959963984540054
```

```python
norm.ppf(0.995) # z value for 99% confidence interval
```

```
2.5758293035489004
```

```python
norm.ppf(0.95) # z value for 90% confidence interval
```

```
1.6448536269514722
```
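The three lookups above follow one pattern: for confidence level $1 - \alpha$, evaluate the inverse CDF at $1 - \alpha / 2$. A small helper makes that explicit; the name `z_critical` is illustrative.

```python
from scipy.stats import norm

def z_critical(confidence):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    return norm.ppf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```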

**What if we don't know the population standard deviation $\sigma$?**

We needed $\sigma$ in our previous calculation for confidence intervals,  

$$ (\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \text{, } \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}) $$ 


But we don't always know the true population standard deviation $\sigma$.  

Do we panic?

We don't have the population standard deviation $\sigma$, but we can calculate the sample standard deviation $s$ from the sample we have.  

Can we simply replace $\sigma$ with $s$?  

Yes, and no.

We know by the CLT that  

$$ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) $$

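It turns out that if we replace $\sigma$ with $s$, we get a slightly different distribution: the t-distribution. This holds exactly when the population is Normal, and is a good approximation otherwise when $n$ is reasonably large.  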

$$ \frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n - 1} $$

#### Review: t-distribution

* The t-distribution looks similar to the standard Normal distribution $N(0, 1)$ except it has fatter tails.   


* It is centered at zero and defined by its degrees of freedom which equal $n-1$.


* As the sample size $n$ gets large, the t-distribution looks like the $N(0,1)$ distribution.  


$$ t_{n - 1} \rightarrow N(0, 1) \text{ as } n \rightarrow \infty $$

<img src="images/normal_t.png" width="600">
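The convergence of the t-distribution to $N(0, 1)$ can also be seen in the critical values. This sketch compares the 97.5th percentile of $t$ for a few degrees of freedom (chosen arbitrarily) with the Normal value of about 1.96.

```python
from scipy.stats import norm, t

z_value = norm.ppf(0.975)  # Normal critical value for a 95% interval

# t critical values for increasing degrees of freedom
t_values = {df: t(df).ppf(0.975) for df in (5, 10, 30, 99, 1000)}

for df, t_value in t_values.items():
    print(f"df = {df:4d}: t = {t_value:.4f}  (t - z = {t_value - z_value:.4f})")
```

The t values are always larger than the z value, reflecting the fatter tails, and the gap shrinks as the degrees of freedom grow.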

When the population standard deviation is unknown, we can construct the confidence interval for the mean by  

$$ \bar{X} \pm t \left( \frac{s}{\sqrt{n}} \right) $$  

where $t$ comes from the t-distribution, and depends on the sample size through the degrees of freedom $n - 1$.  

This is the **t-based** confidence interval for one sample mean.

```python
from scipy.stats import t

t(99).ppf(0.975)  # t value for 95% confidence interval with sample size 100 (df = 100 - 1 = 99)
```

```
1.9842169515086827
```

```python
t(99).ppf(0.995)  # t value for 99% confidence interval with sample size 100 (df = 100 - 1 = 99)
```

```
2.6264054563851857
```

```python
t(99).ppf(0.95)  # t value for 90% confidence interval with sample size 100 (df = 100 - 1 = 99)
```

```
1.6603911559963895
```

**Example**  


Say we don't actually know the population standard deviation in the weight example. Instead, we calculated the sample standard deviation from the sample and got $s = 3.5$. How do we construct a 95% confidence interval for the population mean weight $\mu$?

Information given:  

$$ s = 3.5 $$
$$ \bar{x} = 150 $$
$$ n = 100 $$



Then the t-based confidence interval for the mean weight is  

$$ \bar{X} \pm t \left( \frac{s}{\sqrt{n}} \right) $$  


$$ 150 \pm 1.98 \times \frac{3.5}{\sqrt{100}}$$  

$$ \rightarrow (149.3, 150.7) $$  

We are 95% confident that the mean weight of the population is somewhere between 149.3 and 150.7 pounds.
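The same interval can be computed in code, either by hand from the t critical value or with the `interval` method of `scipy.stats.t`. This is a sketch using the summary statistics from the example. With raw data one would first compute $\bar{x}$ and $s$ (for instance with `np.mean` and `np.std(..., ddof=1)`).

```python
import numpy as np
from scipy.stats import t

x_bar, s, n = 150, 3.5, 100  # sample mean, sample standard deviation, sample size
df = n - 1                   # degrees of freedom

# By hand: x_bar +/- t * s / sqrt(n)
t_value = t(df).ppf(0.975)
margin = t_value * s / np.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

# Same interval via scipy: first argument is the confidence level
lower_scipy, upper_scipy = t.interval(0.95, df, loc=x_bar, scale=s / np.sqrt(n))

print(f"by hand: ({lower:.1f}, {upper:.1f})")  # (149.3, 150.7)
print(f"scipy:   ({lower_scipy:.1f}, {upper_scipy:.1f})")
```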