
# Confidence Intervals for a Mean

We often want to estimate a population mean from a sample.

A confidence interval gives us a range around the sample mean that, at a chosen confidence level, is likely to contain the true population mean:

$$\bar x \pm t^* \cdot (\frac{ s_x }{\sqrt n})$$

Here:
* $n$ -> sample size
* $\bar x$ -> sample mean
* $s_x$ -> sample standard deviation
* $t^*$ -> critical t-value

The ***Margin of Error*** is the term after the $\pm$:

$$ME = t^* \cdot (\frac{ s_x }{\sqrt n})$$

The ***Standard Error*** is the last part of the Margin of Error, which divides the sample standard deviation by the square root of the sample size:

$$SE = \frac{ s_x }{\sqrt n}$$
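When the raw observations are available instead of summary statistics, the quantities defined above can be computed directly. The following is a minimal sketch: the `sample` values and all variable names are made up for illustration and are not part of this notebook's data. Note the `ddof=1` argument, which makes NumPy divide by $n - 1$ so that the result is the sample standard deviation $s_x$.

```python
import math
import numpy as np

# Hypothetical observations, for illustration only
sample = np.array([432.0, 447.0, 455.0, 461.0, 438.0,
                   470.0, 449.0, 452.0, 444.0, 458.0])

n_obs = sample.size                             # sample size n
sample_mean = sample.mean()                     # sample mean (x bar)
sample_sd = sample.std(ddof=1)                  # sample standard deviation s_x
standard_error = sample_sd / math.sqrt(n_obs)   # SE = s_x / sqrt(n)

print('n = %d' % n_obs)
print('Sample mean: %6.4f' % sample_mean)
print('Sample standard deviation: %6.4f' % sample_sd)
print('Standard error: %6.4f' % standard_error)
```

Leaving `ddof` at its default of 0 would give the population formula instead, which slightly understates the spread for small samples.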

In [36]:
n = 10       # sample size
c = 0.90     # confidence level
x_hat = 450  # sample mean (x bar)
s_x = 15     # sample standard deviation

## Conditions for a t-interval for a mean

When the true population mean and distribution are not available to us, we must rely on the sample data.

But for an interval built from the sample to be reliable, certain conditions must be met.

### Normal
A rule of thumb is that if the sample size $n$ is at least 30, the sampling distribution of the sample mean can be treated as approximately normal, even when the population itself is not. This follows from the central limit theorem.

If the sample is smaller than 30, we have to graph the sample data to see if it looks *roughly* symmetrical or normal, with no obvious outliers. If it does, then we can treat the normal condition as met.
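The central limit theorem behind this rule of thumb can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original notebook: the population is taken to be exponential (an arbitrary choice of a strongly right-skewed distribution), and the helper `skew_of_sample_means` is a hypothetical name. It reports the skewness of many simulated sample means, which should move toward 0 as $n$ grows.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def skew_of_sample_means(sample_size, repetitions=5000):
    """Skewness of sample means drawn from a right-skewed population."""
    # Exponential population: clearly not normal
    draws = rng.exponential(scale=1.0, size=(repetitions, sample_size))
    return skew(draws.mean(axis=1))

# Skewness near 0 means the distribution of sample means is roughly symmetric
skews = {size: skew_of_sample_means(size) for size in (2, 10, 30, 100)}

for size, value in skews.items():
    print('n = %3d -> skewness of sample means: %6.3f' % (size, value))
```

A skewness check is not a substitute for plotting a small sample, but it shows why larger samples make the shape of the population matter less.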

### Random
We have to randomly select the sample from the population to ensure that there is no bias in the sample.

### Independent
For the sample to be considered independent, we either need to sample with replacement, or we need to ensure that the sample is ***less than*** 10% of the overall population. 

If the sample is small relative to the population, then the observations can be considered independent, even if we're not replacing.
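The random and independent conditions can be sketched in a few lines. The population of 1,000 IDs, the sample size of 50, and the helper name `meets_ten_percent_condition` are all hypothetical and chosen only for illustration.

```python
import random

random.seed(42)  # fixed seed so the example is repeatable

# Hypothetical population of 1,000 unit IDs
population = list(range(1000))

# Simple random sample WITHOUT replacement
sample_ids = random.sample(population, k=50)

def meets_ten_percent_condition(sample_size, population_size):
    """True when the sample is less than 10% of the population."""
    # Integer arithmetic avoids floating-point surprises at the boundary
    return 10 * sample_size < population_size

print(meets_ten_percent_condition(len(sample_ids), len(population)))  # 50 of 1000
print(meets_ten_percent_condition(150, len(population)))              # 150 of 1000
```

`random.sample` never picks the same element twice, which is why the 10% check is needed here; sampling with replacement (for example with `random.choices`) would not need it.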

## Calculate the critical t

To calculate $t^*$ you'll need:
* $c$ -> confidence level (e.g. $.95$)
* $\alpha$ -> alpha is $(1 - c)$ (e.g. $.05$)
* degrees of freedom -> ($n-1$)

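The interval is two-sided, so $\alpha$ is split evenly between the two tails. To calculate the critical t value, we therefore convert the confidence level into the cumulative probability that leaves $\frac{ \alpha }{ 2 }$ in the upper tail:

$$c_{\text{two-tail}} = c + \frac{ \alpha }{ 2 } = 1 - \frac{ \alpha }{ 2 }$$

So, for example:

$$c = 0.95$$

$$\frac{ \alpha }{ 2 } = 0.025$$

$$c_{\text{two-tail}} = 0.975$$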

In [37]:
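from scipy.stats import t

df = n - 1  # degrees of freedom

# Cumulative probability at the upper critical value: c + alpha / 2
c_two_tail = c + ( ( 1 - c ) / 2 )

# Percent point function (inverse CDF) of the t-distribution
t_two_tail = t.ppf( c_two_tail, df )
print('Two-tail critical t value: %6.3f' % (t_two_tail))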

Two-tail critical t value:  1.833



## Check the critical t value

To test our calculation, we can plug the critical t value into the `scipy.stats.t.cdf` function to ensure that it gives us back the probability we started from.

Because we calculated the critical t value from the two-tail cumulative probability, `t.cdf` should return that same probability, not the original confidence level.

For example, if our confidence level is $0.95$ then the two-tail level is $0.975$. We calculate the critical t value using $0.975$, along with our degrees of freedom. So, $0.975$ is the value that we'd expect to get back if we call `t.cdf` with the critical t value and the same degrees of freedom. In this notebook $c = 0.90$, so the expected value is $0.95$.

In [38]:
confidence_calc_two_tail = t.cdf( t_two_tail, df )
print( 'Two-tail confidence %0.3f' % ( confidence_calc_two_tail ) )

Two-tail confidence 0.950
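A complementary check, sketched here with the same confidence level ($0.90$) and sample size ($10$) repeated so the snippet runs on its own, is to confirm that the area *between* $-t^*$ and $+t^*$ equals the confidence level itself. The variable names are illustrative.

```python
from scipy.stats import t

confidence = 0.90   # same confidence level as above
dof = 10 - 1        # same degrees of freedom as above (n = 10)

# Upper critical value, found at cumulative probability c + alpha / 2
t_star = t.ppf(confidence + (1 - confidence) / 2, dof)

# The t-distribution is symmetric, so the central area is
# CDF(+t*) - CDF(-t*), which should equal the confidence level
central_area = t.cdf(t_star, dof) - t.cdf(-t_star, dof)
print('Central area between -t* and +t*: %0.3f' % central_area)
```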


## Calculate the confidence interval

The confidence interval is calculated by subtracting the "Margin of Error" from, and adding it to, the sample mean:
$$\bar x \pm t^* \cdot (\frac{ s_x }{\sqrt n})$$

The Margin of Error is:
$$t^* \cdot (\frac{ s_x }{\sqrt n})$$

In [45]:
import math

margin_error = t_two_tail * s_x / math.sqrt( n )
lower_confidence = x_hat - margin_error
upper_confidence = x_hat + margin_error

print( 'Sample Mean: %6.4f' % ( x_hat ) )
print( 'Two-tail critical t value: %6.3f' % (t_two_tail)) 
print( 'Standard Deviation: %6.4f' % ( s_x ) )
print( 'Margin of Error: %6.4f' % ( margin_error ) )
print( 'Lower Confidence Limit: %6.4f' % ( lower_confidence ) )
print( 'Upper Confidence Limit: %6.4f' % ( upper_confidence ) )
print()
print( '%6.4f +- %6.3f * ( %6.4f / sqrt( %2d ) )' % ( x_hat, t_two_tail, s_x, n ) )
print( '%6.4f +- %6.3f' % ( x_hat, margin_error ) )
print( '( %6.4f, %6.4f )' % ( lower_confidence, upper_confidence ) )

Sample Mean: 450.0000
Two-tail critical t value:  1.833
Standard Deviation: 15.0000
Margin of Error: 8.6952
Lower Confidence Limit: 441.3048
Upper Confidence Limit: 458.6952

450.0000 +-  1.833 * ( 15.0000 / sqrt( 10 ) )
450.0000 +-  8.695
( 441.3048, 458.6952 )
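As a cross-check, SciPy can produce the same interval in one call with `scipy.stats.t.interval`. This sketch repeats the inputs from the cells above under illustrative variable names so that it runs on its own. The confidence level is passed positionally because, as far as I am aware, the keyword name for that argument has differed between SciPy releases.

```python
import math
from scipy.stats import t

# Same inputs as the cells above, repeated so the snippet is self-contained
sample_size = 10
sample_mean = 450
sample_sd = 15
confidence = 0.90

standard_error = sample_sd / math.sqrt(sample_size)

# Positional arguments: confidence level, then degrees of freedom.
# loc is the sample mean and scale is the standard error.
lower, upper = t.interval(confidence, sample_size - 1,
                          loc=sample_mean, scale=standard_error)
print('( %6.4f, %6.4f )' % (lower, upper))
```

The result should agree with the manually computed limits above to the printed precision.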


## Calculating a sample size

Let's say you are going to conduct a study or experiment and you want to stay within a desired margin of error at a given confidence level.

You'll need to know how large a sample is required to meet that criterion.

Typically, to *find* a confidence interval we calculate a margin of error using a critical t-value. But finding a critical t-value requires the degrees of freedom, which depend on the sample size. The sample size is what we're trying to calculate, so we can't use a critical t-value directly.

As an alternative, ***if*** we have some insight into what the population standard deviation is, we can use a critical z-value instead:

$$\bar x \pm z^* \cdot (\frac{ \sigma }{\sqrt n})$$

Here:
* $n$ -> sample size
* $\bar x$ -> sample mean (x bar)
* $\sigma$ -> population standard deviation (sigma)
* $z^*$ -> critical z-value

So, to keep the margin of error less than a certain amount, we would define an inequality:

$$z^* \cdot (\frac{ \sigma }{\sqrt n}) \leq ME_{max}$$

Then, we can solve for $n$:

$$n \geq \Bigl(\frac{ z^* \cdot \sigma }{ME_{max}}\Bigr)^2$$ 

In [55]:
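import math
from scipy.stats import norm

c_sample = .90  # confidence level
me_max = 10     # maximum acceptable margin of error
sigma = 15      # assumed population standard deviation

c_two_tail_sample = c_sample + ( ( 1 - c_sample ) / 2 )

z_critical = norm.ppf( c_two_tail_sample )
sample_size = ( z_critical * sigma / me_max ) ** 2

print( "z-critical: %6.4f" % z_critical )
# Round up: n must be a whole number that keeps the margin of error within me_max
print( "The estimated sample size is: %6d" % math.ceil( sample_size ) )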

z-critical: 1.6449
The estimated sample size is:      7
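To see why the result is rounded up, the z-based margin of error can be evaluated at the sample sizes on either side of the threshold. This is a minimal sketch that repeats the inputs above under illustrative names (`sigma_guess`, `me_target`, and the helper `z_margin` are hypothetical) so that it runs on its own.

```python
import math
from scipy.stats import norm

confidence = 0.90
sigma_guess = 15   # assumed population standard deviation, as above
me_target = 10     # desired maximum margin of error, as above

z_star = norm.ppf(confidence + (1 - confidence) / 2)

def z_margin(sample_size):
    """z-based margin of error for a given sample size."""
    return z_star * sigma_guess / math.sqrt(sample_size)

# Compare the sample sizes just below and at the estimate
for size in (6, 7):
    margin = z_margin(size)
    verdict = 'within' if margin <= me_target else 'exceeds'
    print('n = %d -> margin of error %7.4f (%s target)' % (size, margin, verdict))
```

With these inputs, 6 observations leave the margin just above the target while 7 bring it under, which is why the estimate is taken with `math.ceil` rather than rounded to the nearest integer.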


## References

- [How to Calculate Critical Values for Statistical Hypothesis Testing with Python](https://machinelearningmastery.com/critical-values-for-statistical-hypothesis-testing/)

- [Khan Academy: Statistics & Probability - Confidence intervals](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample)