## Statistical Inference with Confidence Intervals


### Why Confidence Intervals?

A confidence interval is a calculated range around a sample estimate of a population parameter, constructed so that the procedure captures the true parameter with a stated level of confidence.  For example, in the lecture, we estimated, with 95% confidence, that the population proportion of parents with a toddler who use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.

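This is *__different__* from saying there is a 95% probability that the true population proportion is within our particular confidence interval. The true proportion is a fixed value, so any single interval either contains it or does not.

Essentially, the 95% describes the procedure: if we were to repeat the sampling process many times, about 95% of the calculated confidence intervals would contain the true proportion.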
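
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes a "true" proportion of 0.85 and a sample size of 659 (borrowed from the car seat example), and the names `true_p`, `n_repeats`, and `coverage` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

true_p = 0.85       # assumed "true" population proportion (illustrative)
n = 659             # sample size of each simulated survey
n_repeats = 10_000  # number of simulated surveys
z_star = 1.96       # multiplier for 95% confidence

# Observed proportion in each simulated survey.
p_hat = rng.binomial(n, true_p, size=n_repeats) / n

# One confidence interval per simulated survey.
se = np.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - z_star * se
upper = p_hat + z_star * se

# Fraction of intervals that contain the true proportion.
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(f"Coverage: {coverage:.3f}")
```

The printed coverage should land close to 0.95, although it will not be exactly 0.95 because the interval formula is an approximation.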

### How are Confidence Intervals Calculated?

The equation for calculating a confidence interval is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed proportion or mean from the sample** (our estimate of the population proportion or mean) and the *Margin of Error* is the **t-multiplier times the standard error**.

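The t-multiplier is calculated from the degrees of freedom (the number of observations minus one) and the desired confidence level.  For large samples (a common rule of thumb is more than 30 observations) and a confidence level of 95%, the t-multiplier is close to 1.96, the corresponding value from the normal distribution.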
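
As a rough illustration of how the multiplier depends on the degrees of freedom, the sketch below looks it up with `scipy.stats`. SciPy is not used elsewhere in this notebook, so treat it as an assumed extra dependency; the variable names are chosen for this example.

```python
from scipy import stats

confidence = 0.95
upper_tail = (1 + confidence) / 2  # 0.975 quantile for a two-sided interval

# t-multiplier for several degrees of freedom (df = observations - 1).
for df in (24, 30, 100, 1000):
    t_star = stats.t.ppf(upper_tail, df)
    print(f"df = {df:>4}: t* = {t_star:.3f}")

# Limiting value from the normal distribution.
z_star = stats.norm.ppf(upper_tail)
print(f"normal   : z* = {z_star:.3f}")
```

With 24 degrees of freedom the multiplier is about 2.064, and it shrinks toward the normal value of about 1.96 as the sample grows.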

The equation to create a 95% confidence interval can also be shown as:

$$Population\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier \times Standard\ Error)$$

Lastly, the Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$


In [1]:
import numpy as np

In [2]:
tstar = 1.96 # t*-multiplier
p = .85  # population proportion
n = 659  # number of observations

In [3]:
se = np.sqrt((p*(1-p))/n)  # standard error for population proportion

In [4]:
# population proportion +/- margin of error (t-multiplier * standard error)
lower_ci = p - (tstar*se)
upper_ci = p + (tstar*se)

In [6]:
(lower_ci, upper_ci)

(0.8227373256215749, 0.8772626743784251)

In [9]:
# Calculation directly using the statsmodels package
import statsmodels.api as sm

sm.stats.proportion_confint(n*p, n)

(0.8227378265796143, 0.8772621734203857)
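
The manual result and the statsmodels result differ only around the seventh decimal place. One likely explanation is that `proportion_confint`, with its default normal-approximation method, uses the unrounded normal quantile rather than 1.96. The sketch below checks that idea with `scipy.stats` (an assumed extra dependency); `z_exact`, `manual`, and `exact` are names chosen for this example.

```python
import numpy as np
from scipy import stats

p = 0.85  # observed proportion
n = 659   # number of observations
se = np.sqrt(p * (1 - p) / n)

# Unrounded 97.5% quantile of the standard normal distribution.
z_exact = stats.norm.ppf(0.975)

manual = (p - 1.96 * se, p + 1.96 * se)        # rounded multiplier
exact = (p - z_exact * se, p + z_exact * se)   # unrounded multiplier

print("z_exact:", z_exact)
print("rounded multiplier:", manual)
print("exact multiplier:  ", exact)
```

The interval built from the unrounded quantile should agree with the statsmodels output, and the one built from 1.96 with the manual calculation.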

### Using the Cartwheel Dataset

In [13]:
import pandas as pd
df = pd.read_csv('cartwheel_distance.csv')

In [15]:
df.head(3)

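| Unnamed: 0 | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |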


In [16]:
mean = df["CWDistance"].mean()
sd = df["CWDistance"].std()
n = len(df)

In [24]:
print('Mean:', mean)
print('Std Deviation:', sd)
print('Number of observations:', n)

Mean: 82.48
Std Deviation: 15.058552387264855
Number of observations: 25


In [17]:
tstar = 2.064  # t*-multiplier for 95% confidence with n - 1 = 24 degrees of freedom

In [19]:
se = sd/np.sqrt(n)  # standard error for the mean
print(se)

3.0117104774529713


In [20]:
lower_ci = mean - (tstar*se)
upper_ci = mean + (tstar*se)
(lower_ci, upper_ci)

(76.26382957453707, 88.69617042546294)

In [23]:
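# zconfint_mean uses the normal (z) multiplier rather than the t-multiplier,
# so this interval is slightly narrower than the manual t-based one
sm.stats.DescrStatsW(df['CWDistance']).zconfint_mean()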

(76.57715593233026, 88.38284406766975)
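
The statsmodels interval is narrower than the manual one because `zconfint_mean` relies on the normal distribution, while the manual calculation used a t-multiplier for 24 degrees of freedom. The sketch below reproduces both intervals from the summary statistics printed earlier, using `scipy.stats` (an assumed extra dependency) so that it runs without the CSV file; `t_interval` and `z_interval` are names chosen for this example.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the printed output for CWDistance.
mean = 82.48
sd = 15.058552387264855
n = 25

se = sd / np.sqrt(n)  # standard error for the mean

# First argument is the confidence level; the t interval needs df = n - 1.
t_interval = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
z_interval = stats.norm.interval(0.95, loc=mean, scale=se)

print("t-based:", t_interval)
print("z-based:", z_interval)
```

For a sample this small the t-based interval is usually the more appropriate choice. `DescrStatsW` also offers a `tconfint_mean()` method, which should give an interval close to the manual t-based result.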