# Confidence intervals

Import python modules

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

### Table of Contents

1. [Background](#background)<br><br>
2. [Single population mean CI](#popmean)<br>
    2a. [if population sigma is known -> Normal distribution](#norm_ci)<br>
    2b. [sampling (n) required for desired CI](#norm_ci_n)<br>
    2c. [if population sigma is unknown -> t distribution](#t_ci)<br><br>
3. [Population proportion CI](#bi_ci)<br>
    3a. [using a binomial approximated as normal](#bi_ci)<br>
    3b. [+4 correction](#plus4)<br>
    3c. [sampling (n) required for desired confidence level](#bi_ci_n)<br>

### Background<a class="anchor" id="background"></a>

A **confidence interval (CI)** is a range estimate for a population parameter, such as the mean, constructed so that it captures the true value with a stated **confidence level (CL)**. More precisely, the CL is the percentage of such intervals that would contain the true population parameter if the sampling were repeated many times.
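The repeated-sampling reading of the confidence level can be checked with a small simulation. This is only a sketch: the population parameters (`true_mean`, `true_sd`), the sample size, and the number of trials are made-up values, and the population is assumed to be normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_mean, true_sd = 10.0, 2.0  # hypothetical population parameters
n, cl, trials = 15, 0.95, 10_000

# draw `trials` independent samples of size n from a normal population
samples = rng.normal(true_mean, true_sd, size=(trials, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)  # sample standard deviation of each sample

# two-sided t interval for every sample
t_crit = stats.t.ppf(1 - (1 - cl) / 2, n - 1)
ebm = t_crit * s / np.sqrt(n)

# fraction of intervals that contain the true mean
covered = (x_bar - ebm <= true_mean) & (true_mean <= x_bar + ebm)
coverage = covered.mean()
print(coverage)
```

The printed fraction should land close to the chosen confidence level, which is what the CL claims about the procedure rather than about any single interval.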

### Population mean CIs<a class="anchor" id="popmean"></a>

#### For unknown population standard deviation <a class="anchor" id="t_ci"></a>

When the population standard deviation is unknown, the Student's t distribution should be used to calculate a confidence interval for the mean.

The confidence interval is $\bar{x}$ ± EBM<br>
EBM = error bound for the mean = $t_{df}(\alpha/2)$ \* (s / $\sqrt{n}$)

s = sample standard deviation *note: calculated with n - 1 in the denominator (`ddof=1` in NumPy)*<br>
$t_{df}$ = Student's t distribution with df = n - 1 degrees of freedom<br>
$t_{df}(\alpha/2)$ = t-score on $t_{df}$ that leaves $\alpha$/2 = (1 - CL) / 2 in the upper tail. *note: $\alpha$ is divided by two because the interval is two-sided*
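To see what using the t distribution costs compared with the normal distribution, the sketch below compares the two-sided critical values at a 95% confidence level. The sample sizes are illustrative choices, not values from the text.

```python
from scipy import stats

cl = 0.95
alpha = 1 - cl
z_crit = stats.norm.ppf(1 - alpha / 2)  # normal critical value, independent of n

sample_sizes = [5, 15, 30, 100, 1000]  # illustrative values
t_crits = [stats.t.ppf(1 - alpha / 2, n - 1) for n in sample_sizes]

for n, t_crit in zip(sample_sizes, t_crits):
    print('n = {0:4d}  t = {1:.3f}  z = {2:.3f}'.format(n, t_crit, z_crit))
```

The t critical value is always larger than z, so the interval is wider, and the gap shrinks as the degrees of freedom grow.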

In [4]:
# Illowsky - Example 8.8

data = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]
cl = 0.95

mu = np.mean(data) # sample mean
s = np.std(data, ddof=1) # sample standard deviation
n = len(data)
print(n, mu, s)

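alpha = 1 - cl
# critical t values for a two-sided interval with n - 1 degrees of freedom
# (the distribution is symmetric, so t_lower == -t_upper)
t_lower = stats.t.ppf(alpha / 2, n - 1)
t_upper = stats.t.ppf(1 - alpha / 2, n - 1)
ebm = t_upper * (s / np.sqrt(n))  # error bound for the mean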
print(mu - ebm, mu + ebm)

print(stats.t.interval(1 - alpha, n - 1, loc=mu, scale=(s / np.sqrt(n))))

15 8.226666666666667 1.6722383060978339
7.300611959652363 9.15272137368097
(7.300611959652363, 9.15272137368097)


In [5]:
# Illowsky - Example 8.9

data = [ 79, 145, 147, 160, 116, 100, 159, 151, 156, 126,
        137,  83, 156,  94, 121, 144, 123, 114, 139,  99]
cl = 0.90

n = len(data)
mu = np.mean(data) # sample mean
s = np.std(data, ddof=1) # sample standard deviation
print(n, mu, s)

alpha = 1 - cl
t_lower = stats.t.ppf((alpha / 2), n - 1)
t_upper = stats.t.ppf((1 - alpha / 2), n - 1)
ebm = t_upper * (s / np.sqrt(n))
print(mu - ebm, mu + ebm)

print(stats.t.interval(cl, n - 1, loc=mu, scale=(s / np.sqrt(n))))

20 127.45 25.964500055997508
117.41093378346815 137.48906621653185
(117.41093378346815, 137.48906621653185)


#### For known population standard deviation<a class="anchor" id="norm_ci"></a>

When the population standard deviation is known (this is rare), the normal distribution can be used to calculate a confidence interval for the mean.

The confidence interval is $\bar{x}$ ± EBM<br>
EBM = error bound for the mean = z($\alpha$/2) \* ($\sigma$ / $\sqrt{n}$)<br>
z($\alpha$/2) = z-score on the standard normal distribution that leaves $\alpha$/2 = (1 - CL) / 2 in the upper tail<br>
note: $\alpha$ is divided by two because the interval is two-sided

In [29]:
#   Illowsky - Example 8.3
x_bar = 1.024  # sample mean
sigma = 0.337  # known population standard deviation
cl = 0.98
n = 30

alpha = 1 - cl
z = (stats.norm.ppf(1 - (alpha / 2), 0, 1)) 
ebm = z * sigma / np.sqrt(n)
ci_low = x_bar - ebm
ci_high = x_bar + ebm

print('The {0} percent CI is {1:.2f} to {2:.2f}'.format(cl * 100, ci_low, ci_high))
print('From a sample mean of {0} from a population with known standard deviation {1}'.format(x_bar, sigma))
print()

# or go straight to getting bounds with .interval
ci_low, ci_high = stats.norm.interval(cl, x_bar, (sigma / np.sqrt(n)))
print('The {0} percent CI is {1:.2f} to {2:.2f} found with scipy .interval'.format(cl * 100, ci_low, ci_high))

The 98.0 percent CI is 0.88 to 1.17
From a sample mean of 1.024 from a population with known standard deviation 0.337

The 98.0 percent CI is 0.88 to 1.17 found with scipy .interval


#### Number of samples required <a class="anchor" id="norm_ci_n"></a>

The sample size (n) required for a desired confidence level and error bound can be determined in advance by solving the EBM formula for n and rounding up to the next whole number.

$n = z^2 \sigma^2 / EBM^2$
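The formula follows from rearranging the error bound for a known $\sigma$. As a worked check, the last line plugs in the values used in Illowsky Example 8.7 (EBM = 2, $\sigma$ = 15, CL = 0.95, with z rounded to 1.96):

$$
\begin{aligned}
EBM &= z \, \frac{\sigma}{\sqrt{n}} \\
\sqrt{n} &= \frac{z \, \sigma}{EBM} \\
n &= \frac{z^2 \sigma^2}{EBM^2} \\
n &\approx \frac{1.96^2 \times 15^2}{2^2} = \frac{3.8416 \times 225}{4} \approx 216.09
\end{aligned}
$$

Because n must be a whole number and rounding down would miss the target error bound, the result is rounded up to 217.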

In [33]:
# Illowsky - example 8.7
ebm = 2
conf = 0.95
sigma = 15

alpha = 1 - conf
z = stats.norm.ppf(1 - alpha / 2)  # upper-tail critical value (the sign does not matter once squared)
n = np.ceil((z ** 2 * sigma ** 2) / (ebm ** 2))  # round up to a whole number of samples
print('To achieve a range of +/-{0} with a confidence of {1} percent would require {2:.0f} samples'
      .format(ebm, conf * 100, n))

To achieve a range of +/-2 with a confidence of 95.0 percent would require 217 samples


### Population Proportion Confidence Intervals<a class="anchor" id="bi_ci"></a>

In [6]:
# confidence interval of a proportion
#  not using +4 method here (see below)

n = 500
p = 421 / n
cl = 0.95

alpha = 1 - cl
z_lower = stats.norm.ppf(alpha / 2)
z_upper = stats.norm.ppf(1 - (alpha / 2))

s = np.sqrt((p * (1 - p)) / n)

ebp = z_upper * s

print(p - ebp, p + ebp)

print(stats.norm.interval(cl, p, s))

0.8100296288520179 0.873970371147982
(0.8100296288520179, 0.873970371147982)


#### +4 correction<a class="anchor" id="plus4"></a>

In [7]:
n = 25     # number of samples
pn = 6     # number of successes
cl = 0.95  # confidence level

n += 4     # applying +4 method
p = (pn + 2) / n   # probability of success, applying +4 method

alpha = 1 - cl
z_lower = stats.norm.ppf(alpha / 2)
z_upper = stats.norm.ppf(1 - (alpha / 2))

s = np.sqrt(p * (1 - p) / n)

ebp = z_upper * s

print(p - ebp, p + ebp)

print(stats.norm.interval(cl, p, s))

0.11319271756780241 0.43853142036323206
(0.11319271756780241, 0.43853142036323206)
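To show what the +4 adjustment changes, the sketch below computes the interval for the same data (6 successes in 25 trials) with and without it. The helper `proportion_ci` is a hypothetical name introduced here for illustration; it adds two successes and two failures, as the cell above does.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, n, cl=0.95, plus_four=False):
    """Normal-approximation CI for a proportion, optionally with the +4 adjustment."""
    if plus_four:
        # add two successes and two failures before estimating p
        successes, n = successes + 2, n + 4
    p = successes / n
    se = np.sqrt(p * (1 - p) / n)         # standard error of the proportion
    z = stats.norm.ppf(1 - (1 - cl) / 2)  # two-sided critical value
    return p - z * se, p + z * se

plain = proportion_ci(6, 25)
adjusted = proportion_ci(6, 25, plus_four=True)
print(plain)
print(adjusted)
```

The adjusted interval is centered on 8/29 instead of 6/25, which pulls the estimate toward 0.5 and matters most when the sample is small.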


#### n required for desired confidence level and interval<a class="anchor" id="bi_ci_n"></a>

In [8]:
# n required for desired confidence and bounds of proportion
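ebp = 0.03   # desired error bound for the proportion
conf = 0.90

alpha = 1 - conf
z = stats.norm.ppf(1 - alpha / 2)  # upper-tail critical value
n = np.ceil((z ** 2 * 0.25) / (ebp ** 2))  # 0.25 = p * (1 - p) at p = 0.5, the most conservative case
print(n)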

752.0
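As a sanity check on the result, the sketch below evaluates the error bound at n = 752 and at one sample fewer. The helper `proportion_margin` is a hypothetical name, and p = 0.5 is assumed as the worst case, matching the 0.25 factor in the cell above.

```python
import numpy as np
from scipy import stats

def proportion_margin(n, cl=0.90, p=0.5):
    """Error bound of a normal-approximation CI for a proportion."""
    z = stats.norm.ppf(1 - (1 - cl) / 2)
    return z * np.sqrt(p * (1 - p) / n)

print(proportion_margin(751), proportion_margin(752))
```

With 752 samples the bound falls just under 0.03, while 751 samples leave it just above, which is why the result is rounded up rather than to the nearest integer.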
