## Constructing Confidence Intervals

Objective: Calculate confidence intervals for proportions and means in Python.

### Interpretation

A confidence interval is a **calculated range around an estimate of a population parameter** that is supported mathematically with a certain level of confidence. For example, we can estimate, with 95% confidence, that the population proportion of parents with a toddler who use a car seat for all travel with their toddler is somewhere between 82.2% and 87.7%.

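This is *__different__* from saying there is a 95% probability that the true population proportion lies within this particular interval. The true proportion is a fixed (if unknown) value, so once an interval has been calculated it either contains that value or it does not.

The 95% instead describes the procedure: if we were to repeat the sampling process many times, about 95% of the calculated confidence intervals would contain the true proportion.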
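The long-run interpretation can be checked with a small simulation. The sketch below is only an illustration: the "true" proportion of 0.85, the sample size of 500 and the number of repetitions are made-up values, not figures from the car seat study. It draws many samples, builds a 95% interval from each one, and counts how often the interval contains the true proportion.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical values chosen for illustration only
true_prop = 0.85     # assumed "true" population proportion
n_obs = 500          # assumed number of observations per sample
n_repeats = 10_000   # how many times the sampling is repeated
z_multiplier = 1.96  # multiplier for a 95% confidence level

# Sample proportion observed in each repeated sample
sample_props = rng.binomial(n_obs, true_prop, size=n_repeats) / n_obs

# 95% confidence interval for each sample
std_errs = np.sqrt(sample_props * (1 - sample_props) / n_obs)
lcbs = sample_props - z_multiplier * std_errs
ucbs = sample_props + z_multiplier * std_errs

# Share of intervals that contain the true proportion
coverage = np.mean((lcbs <= true_prop) & (true_prop <= ucbs))
print(f"Share of intervals containing the true proportion: {coverage:.3f}")
```

The printed share should land close to 0.95, although it will vary a little with the seed and the chosen values.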

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm

### Calculation

Our equation for calculating confidence intervals is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed (sample) estimate of the population proportion or mean** and the *Margin of Error* is the **t-multiplier times the standard error**.

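The t-multiplier is calculated based on the degrees of freedom and the desired confidence level. For large samples (a common rule of thumb is more than 30 observations) and a confidence level of 95%, the t-multiplier is approximately 1.96, the corresponding value from the standard normal distribution.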
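As a rough sketch of where these multipliers come from, they can be looked up with `scipy.stats` instead of a printed table. The variable names are illustrative, and the degrees of freedom are taken as the number of observations minus one.

```python
import scipy.stats as stats

conf_level = 0.95
alpha = 1 - conf_level

# Two-sided interval: put alpha / 2 in each tail
z_multiplier = stats.norm.ppf(1 - alpha / 2)

# t-multipliers for 25 and 232 observations (df = n - 1)
t_multiplier_24 = stats.t.ppf(1 - alpha / 2, df=24)
t_multiplier_231 = stats.t.ppf(1 - alpha / 2, df=231)

print(f"normal:     {z_multiplier:.4f}")
print(f"t (df=24):  {t_multiplier_24:.4f}")
print(f"t (df=231): {t_multiplier_231:.4f}")
```

The t-multiplier is always somewhat larger than the normal value and moves toward it as the degrees of freedom grow, which is why 1.96 is a reasonable shortcut for large samples only.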

The equation to create a 95% confidence interval can also be shown as:

$$Population\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier *\ Standard\ Error)$$

Lastly, the Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$



---
### Calculate confidence intervals of population **proportions**

In [3]:
# Given t-multiplier, population proportion and number of observations
alpha = 0.05
tMultiplier = 1.96 # when alpha = 0.05
popProp = 0.43
numObs = 232

In [4]:
# standard error for population proportion
stdErrPop = np.sqrt( popProp*(1-popProp) / numObs )

# confidence interval from lower and upper confidence bound
lcb = popProp - tMultiplier*stdErrPop
ucb = popProp + tMultiplier*stdErrPop

confInt = (lcb, ucb)
print(confInt)

(0.36629350165772345, 0.49370649834227653)


In [5]:
# using statsmodels.api
sm.stats.proportion_confint(count=popProp*numObs, nobs=numObs, alpha=0.05, method='normal')

(0.3662946722795803, 0.4937053277204198)
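The hand-computed bounds and the `statsmodels` bounds agree only to about five decimal places. A likely explanation is that 1.96 is a rounded multiplier, while `method='normal'` works from the unrounded normal quantile. The sketch below repeats the manual calculation with the quantile from `scipy.stats.norm.ppf`; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

alpha = 0.05
pop_prop = 0.43
num_obs = 232

# Unrounded multiplier (about 1.959964) instead of the rounded 1.96
z_multiplier = stats.norm.ppf(1 - alpha / 2)

# Standard error for a proportion
std_err = np.sqrt(pop_prop * (1 - pop_prop) / num_obs)

lcb = pop_prop - z_multiplier * std_err
ucb = pop_prop + z_multiplier * std_err
print((lcb, ucb))
```

With the unrounded multiplier the bounds should line up with the `statsmodels` output rather than with the 1.96-based result.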

---
### Calculate a confidence interval for our **mean** cartwheel distance (CWDistance)

In [6]:
df = pd.read_csv('../data/cartwheeldata.csv')
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [7]:
stats_CWDistance = df['CWDistance'].describe()
print(stats_CWDistance)
# look up each statistic by label rather than by position
nObs = stats_CWDistance['count']
mean = stats_CWDistance['mean']
std = stats_CWDistance['std']

count     25.000000
mean      82.480000
std       15.058552
min       63.000000
25%       70.000000
50%       81.000000
75%       92.000000
max      115.000000
Name: CWDistance, dtype: float64


In [8]:
# Given t-multiplier = 2.064 (95% confidence, nObs - 1 = 24 degrees of freedom)
tMultiplier = 2.064
estStdErr = std/np.sqrt(nObs)

lcb = mean - tMultiplier*estStdErr
ucb = mean + tMultiplier*estStdErr
confInt = (lcb, ucb)
print(confInt)

(76.26382957453707, 88.69617042546294)


In [14]:
# using statsmodels.api
sm.stats.DescrStatsW(df['CWDistance']).tconfint_mean(alpha=0.05)

(76.26413507754478, 88.69586492245523)
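The same interval can also be obtained from `scipy.stats.t.interval`, which looks up the unrounded t-multiplier internally. To keep the sketch self-contained it uses the summary statistics printed by `describe()` above rather than re-reading the CSV file; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Summary statistics copied from the describe() output above
n_obs = 25
sample_mean = 82.48
sample_std = 15.058552

# Standard error for a mean
std_err = sample_std / np.sqrt(n_obs)

# The confidence level is passed positionally because its keyword
# name differs between SciPy versions; df = n_obs - 1
lcb, ucb = stats.t.interval(0.95, n_obs - 1, loc=sample_mean, scale=std_err)
print((lcb, ucb))
```

The small gap between the manual result and the `statsmodels` result comes from rounding the multiplier to 2.064; the unrounded value for 24 degrees of freedom is about 2.0639.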