# Confidence Intervals: Examples

This notebook collects practical exercises of the concepts introduced in the course videos.

Overview of contents:

1. Definitions
2. Personal Notes: Using Distributions with Scipy
3. CI of One Proportion
4. CI of One Mean
5. **Cleaning and Preparing Datasets (for Two Groups): `crosstab`, `groupby.agg`**
6. CI of Two Proportions: Smokers vs Non-Smokers in Males & Females
7. CI of Two Means: BMI mean in Males & Females

## 1. Definitions

We must distinguish between:
- Population: the entire group of subjects we want to learn about.
- Sample: the subset of the population we actually measure, usually because of cost limits; measurements are assumed to be independent and identically distributed (i.i.d.).

Even though we measure the sample, we can infer parameters of a population with confidence intervals (CI) defined around a best parameter estimate we have.

We also distinguish between:
- Sample distribution: the distribution of the data we have collected. We can compute its parameters: mean, median, variance, proportions, etc.
- Sampling distribution: if we draw many i.i.d. samples from the population and compute a parameter for each one, the distribution of that parameter is the sampling distribution. According to the Central Limit Theorem (CLT), it tends to be normal as the sample size grows.

Having a confidence interval of 95% means that if we draw 100 independent samples and compute the parameter and its CI with the same method, about 95 of the CIs are expected to contain the true parameter of the population. Thus, the confidence is associated with the method we use, not with any single interval.
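
A small simulation can make this interpretation concrete. The sketch below assumes a hypothetical normal population (mean 10, standard deviation 2, both made up for illustration), draws many samples, builds a 95% CI for the mean of each one, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_mean, true_sd = 10.0, 2.0  # hypothetical population parameters
n, n_samples = 30, 2000         # sample size and number of repeated samples

t_star = t.ppf(0.975, df=n - 1)  # two-sided 95% multiplier

# Draw all samples at once: one row per sample
samples = rng.normal(true_mean, true_sd, size=(n_samples, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# For each sample, check whether its CI contains the true mean
covered = (means - t_star * ses <= true_mean) & (true_mean <= means + t_star * ses)
coverage = covered.mean()
print(f"Fraction of CIs containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, but not exactly on it, because the coverage is a property of the procedure over many repetitions.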

In general, we use the following formula for the computation of the CI:

`Confidence Interval` = `Best Estimate` $\pm$ `Margin of Error`

The terms are obtained as follows:

- The `Best Estimate` is the parameter of our sample: sample mean, sample proportion.
- The `Margin of Error` is `K x Estimated Standard Error`; that is, `K` is how many standard errors we want to cover in the sampling distribution.
- `K` is defined as the value that covers X% in a symmetric Z or T distribution; generally, `Z*(95%)` or `T*(95%,df=n-1)` are taken. Note that `Z*(95%) = 1.96`. The T distribution approaches Z as the sample size `n` grows.
- `Estimated Standard Error = sqrt(var / n)`, where `var` is the sample variance for a mean and `p * (1 - p)` for a proportion `p`.
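
The recipe above can be sketched for the mean of a small sample. The function name `mean_confidence_interval` and the sample values are made up for illustration; `K` is taken from the T distribution with `df = n - 1`.

```python
import numpy as np
from scipy.stats import t

def mean_confidence_interval(data, confidence=0.95):
    """Return (lower, upper) for the mean: Best Estimate +/- K * SE."""
    data = np.asarray(data, dtype=float)
    n = data.size
    best_estimate = data.mean()
    # Estimated standard error of the mean: sample std (ddof=1) over sqrt(n)
    se = data.std(ddof=1) / np.sqrt(n)
    # K = T*(confidence, df=n-1), taken from the two-sided quantile
    k = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = k * se
    return best_estimate - margin, best_estimate + margin

sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.1]  # made-up measurements
lower, upper = mean_confidence_interval(sample)
print(lower, upper)
```

The interval is symmetric around the sample mean, and it widens when `confidence` is raised or when the sample gets smaller.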


## 2. Personal Notes: Using Distributions with Scipy

In the course videos, fixed values are used for `Z*(95%)` and `T*(95%,df)`. However, it is possible to obtain exact values with `scipy`.

For the CI computation, note that the `95%` coverage in the chosen distribution is two-sided; the significance level `alpha` related to that CI is: `1 - alpha = 0.95 -> alpha = 0.05`. However, tables (and `ppf`) are one-sided: they give the value `v` such that `1 - alpha = P(x < v)`, whereas we want `1 - alpha = P(-v < x < v)`. In a symmetric distribution, that is achieved by taking `1 - alpha/2 = P(x < v)`, which leaves `alpha/2` in each tail.

In [24]:
from scipy.stats import norm,t

In [25]:
# Confidence 95% -> significance level alpha = 0.05
# Since we have two sides, we need to consider: alpha/2 = 0.05/2
# Thus, the quantile we look up is: 1 - alpha/2 = 0.975
T_star_95 = t(df=10).ppf(0.975)
print(T_star_95)

2.2281388519649385


In [26]:
Z_star_95 = norm.ppf(0.975)
print(Z_star_95)

1.959963984540054
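
The statement that the T distribution approaches Z for large samples can be checked numerically. This is a minimal sketch; the degrees of freedom in the loop are arbitrary choices.

```python
from scipy.stats import norm, t

z_star = norm.ppf(0.975)

# Two-sided 95% multipliers for increasing degrees of freedom
t_stars = {df: t.ppf(0.975, df=df) for df in (5, 10, 30, 100, 1000)}

for df, t_star in t_stars.items():
    print(f"df={df:5d}  T*={t_star:.4f}  T* - Z* = {t_star - z_star:.4f}")
```

`T*` is always larger than `Z*`, so T-based intervals are wider, and the gap shrinks as `df` grows.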


### 2.1 Further Notes on How to Use Distributions

Load distributions:
```python
from scipy.stats import binom,norm,cauchy
```
Instantiate a distribution with its parameters:
```python    
dist = binom(n, b)
dist = norm(m, s)
dist = cauchy(z, g)
...
```

Continuous distributions usually have at least the `loc` and `scale` parameters, which are often related to the mean and standard deviation.

Get data:
```python
dist.rvs(N) # N random variates (samples) drawn from the distribution
dist.pmf(x) # Probability Mass Function at values x for discrete distributions
dist.pdf(x) # Probability Density Function at values x for continuous distributions
dist.cdf(x) # Cumulative Distribution Function at values x for any distribution
dist.ppf(q) # Percent point function (inverse of `cdf`) at q (% of accumulated area) of the given RV
```
Note:
- `dist.cdf(v)` = $P (x < v)$; $P(x < \infty) = 1$
- `dist.ppf(q)` = $v | P(x < v) = q$
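
The relation between `cdf` and `ppf` can be seen in a short round trip. The normal distribution and its parameters below are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=100, scale=15)  # hypothetical parameters

v = 120.0
q = dist.cdf(v)        # area to the left of v
v_back = dist.ppf(q)   # value whose left area is q: recovers v
print(q, v_back)

# The area between two values is a difference of two CDF values;
# here: within one standard deviation of the mean
area = dist.cdf(115) - dist.cdf(85)
print(area)
```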

Fitting data to a distribution:
```python
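import numpy as np
from scipy.stats import norm

# Choose a distribution or iterate through a set of candidates.
# Note: fit() belongs to the distribution itself (e.g. norm), not to a frozen instance
dist = norm
# Data: replace this with the real dataset
data = dist.rvs(size=1000, random_state=0)
# Fit: returns (shape parameters..., loc, scale)
params = dist.fit(data)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Empirical density y at the histogram bin centers x
y, edges = np.histogram(data, bins=30, density=True)
x = (edges[:-1] + edges[1:]) / 2
# Calculate fitted PDF and error with fit in distribution
pdf = dist.pdf(x, *arg, loc=loc, scale=scale)
sse = np.sum(np.power(y - pdf, 2.0))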
```

Get parameters:
```python
params = dist.stats() # Mean ('m') and variance ('v') by default; pass moments='mvsk' to add skew ('s') and kurtosis ('k')
m = dist.mean()
std = dist.std()
...
```

Documentation:

    help(scipy.stats)
    https://docs.scipy.org/doc/scipy/reference/index.html

## 3. CI of One Proportion

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

Example: a hospital polls toddler parents whether they use a car seat. The estimated parameter is the proportion of parents who use a car seat. Data:
- `n = 659` parents sampled.
- 540 responded 'yes'.

### Manual Computation

In [27]:
import numpy as np

In [42]:
# Z*(95%), stored as tstar: see section 2 for how to compute it with scipy
tstar = 1.96
# Sample size
n = 659.0
# Proportion
p = 540.0/n
# Standard Error
se = np.sqrt((p * (1 - p))/n)
se

0.014984499401390045

In [43]:
# Lower and Upper Bounds
lcb = p - tstar * se
ucb = p + tstar * se
(lcb, ucb)

(0.7900537499137914, 0.8487929875672404)
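
The same computation can be wrapped in a function that takes the multiplier from `scipy` instead of hard-coding `1.96`. This is a sketch of the normal-approximation interval used above; the function name `proportion_ci` is made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation CI for one proportion, from counts."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided multiplier, e.g. about 1.96 for 95%
    z_star = norm.ppf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

# Car seat example: 540 'yes' answers out of 659 parents
lower, upper = proportion_ci(540, 659)
print(lower, upper)
```

The result agrees with the manual bounds up to the rounding of `1.96`.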

### CI Computation with Statsmodels

In [30]:
import statsmodels.api as sm

In [31]:
sm.stats.proportion_confint(n * p, n)

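(0.8227378265796143, 0.8772621734203857)

Note: this output is centered on 0.85, so it was produced with a different value of `p` than the `540/659 ≈ 0.819` used in the manual computation above (the cell numbers show it was run earlier). Re-running the cell with the values above should give an interval that matches the manual one up to the rounding of `1.96`.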

## 4. CI of One Mean

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

Example (Cartwheel dataset): What is the **average** cartwheel distance (in inches) for adults? (distance from the forward foot before performing the cartwheel to the final foot after performing it).

In [32]:
import pandas as pd

In [33]:
df = pd.read_csv("Cartwheeldata.csv")

In [34]:
df.head()

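|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|----|-----|--------|-------------|---------|--------------|--------|----------|------------|----------|---------------|-------|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |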


In [36]:
mean = df["CWDistance"].mean()
sd = df["CWDistance"].std()
n = len(df)
print(n)

25


### Manual Computation

In [37]:
# T*(95%, df=n-1=24), i.e. t(df=24).ppf(0.975) rounded
tstar = 2.064
se = sd/np.sqrt(n)
print(se)

3.0117104774529713


In [38]:
lcb = mean - tstar * se
ucb = mean + tstar * se
(lcb, ucb)

(76.26382957453707, 88.69617042546294)

### Computation with Statsmodels

In [39]:
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()

(76.57715593233026, 88.38284406766975)
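
The statsmodels interval is slightly narrower than the manual one because `zconfint_mean` uses the normal multiplier (about 1.96) while the manual computation uses `T*` with `df = 24`. The sketch below reproduces both intervals from the summary values printed above; the mean `82.48` is not printed in the notebook and is inferred here as the midpoint of the manual interval. As far as I know, `DescrStatsW` also offers `tconfint_mean` for the T-based version.

```python
from scipy.stats import norm, t

# Summary values from the cells above (mean inferred from the interval midpoint)
n_cw = 25
se_cw = 3.0117104774529713
mean_cw = 82.48

t_star = t.ppf(0.975, df=n_cw - 1)  # about 2.064
z_star = norm.ppf(0.975)            # about 1.96

t_ci = (mean_cw - t_star * se_cw, mean_cw + t_star * se_cw)
z_ci = (mean_cw - z_star * se_cw, mean_cw + z_star * se_cw)
print("T-based:", t_ci)
print("Z-based:", z_ci)
```

With only 25 observations, the T-based interval is the more cautious choice; the two converge for large samples.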

## 5. Cleaning and Preparing Datasets (for Two Groups): `crosstab`, `groupby.agg`

In order to compute **proportions or means for several categories**, some steps need to be taken with `pandas`:

- Data must be cleaned (missing values, `np.nan`) and category codes are often renamed (gender, yes/no, etc.).
- Summary tables must be created with `pd.crosstab` and `DataFrame.groupby().agg()`: these summarize counts, proportions, means and sizes for each category/group of a categorical variable.

This section prepares the NHANES data for the two examples analyzed in the next sections:
1. Compare proportions of smokers and non-smokers for males & females.
2. Compare means of BMI for males & females.
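
Before applying these steps to NHANES, the pattern can be tried on a tiny made-up table. The column names and values below are invented for illustration; the sketch uses string aggregation names (`"mean"`, `"size"`) with named aggregation to label the output columns.

```python
import numpy as np
import pandas as pd

# Small made-up dataset (not NHANES) with one missing answer
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "smoker": ["Yes", "No", "No", "Yes", np.nan, "Yes"],
})

# Drop rows with missing answers before counting
clean = toy.dropna()

# Counts for each combination of categories
counts = pd.crosstab(clean["smoker"], clean["gender"])

# Map Yes/No to 1/0 so that the group mean is the proportion of "Yes"
clean = clean.assign(smoker01=clean["smoker"].map({"Yes": 1, "No": 0}))
summary = clean.groupby("gender")["smoker01"].agg(Proportion="mean", n="size")

print(counts)
print(summary)
```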

In [100]:
import pandas as pd
import numpy as np
#import statsmodels.api as sm

In [45]:
url = "nhanes_2015_2016.csv"
da = pd.read_csv(url)

In [46]:
# Recode SMQ020 from 1/2 to Yes/No into new variable SMQ020x
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020x"].head()

0    Yes
1    Yes
2    Yes
3     No
4     No
Name: SMQ020x, dtype: object

In [47]:
# Recode RIAGENDR from 1/2 to Male/Female into new variable RIAGENDRx
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["RIAGENDRx"].head()

0      Male
1      Male
2      Male
3    Female
4    Female
Name: RIAGENDRx, dtype: object

In [52]:
# Cross-Table: Very useful for counting/frequencies of different groups
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)

| SMQ020x \ RIAGENDRx | Female | Male |
|---------------------|--------|------|
| No | 2066 | 1340 |
| Yes | 906 | 1413 |


In [54]:
# Recode (again) SMQ020x from Yes/No to 1/0 into existing variable SMQ020x
# We recode it again because with 1/0, the mean yields the proportion :)
dx["SMQ020x"] = dx.SMQ020x.replace({"Yes": 1, "No": 0})

In [62]:
# Group By + Aggregate: Very useful for proportions and means of different groups
# groupby().agg() creates a new table with aggregated summary values
# in groupby we say which category groups we want in the rows
# and with agg() we say the aggregate function to be applied in the columns
dy = dx.groupby("RIAGENDRx").agg({"SMQ020x": [np.mean, np.size]})
dy.columns = ["Proportion", "n"]
dy

| RIAGENDRx | Proportion | n |
|-----------|------------|------|
| Female | 0.304845 | 2972 |
| Male | 0.513258 | 2753 |


In [63]:
da["BMXBMI"].head()

0    27.8
1    30.8
2    28.8
3    42.4
4    20.3
Name: BMXBMI, dtype: float64

In [64]:
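# Note: np.size counts every row in the group, including rows where BMXBMI is missing,
# while the mean and std skip missing values; BMI_n may overstate the number of measurements
dz = da.groupby("RIAGENDRx").agg({"BMXBMI": [np.mean, np.std, np.size]})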
dz.columns = ["BMI_mean", "BMI_std", "BMI_n"]
dz

| RIAGENDRx | BMI_mean | BMI_std | BMI_n |
|-----------|-----------|----------|-------|
| Female | 29.939946 | 7.753319 | 2976 |
| Male | 28.778072 | 6.252568 | 2759 |


## 6. CI of Two Proportions: Smokers vs Non-Smokers in Males & Females

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{(SE_{\ 1})^2 + (SE_{\ 2})^2}$$

In [69]:
dy

| RIAGENDRx | Proportion | n |
|-----------|------------|------|
| Female | 0.304845 | 2972 |
| Male | 0.513258 | 2753 |


In [75]:
dy.columns[0]

'Proportion'

In [86]:
p_f = dy.loc['Female','Proportion']
n_f = dy.loc['Female','n']
se_female = np.sqrt(p_f * (1 - p_f)/n_f)
se_female

0.008444152146214435

In [89]:
p_m = dy.loc['Male','Proportion']
n_m = dy.loc['Male','n']
se_male = np.sqrt(p_m * (1 - p_m)/ n_m)
se_male

0.009526078653689868

In [90]:
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff

0.012729881381407434

In [91]:
d = p_f - p_m
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)

(-0.2333636091471941, -0.18346247413207697)
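
The steps of this section can be collected in one function that works directly on counts. This is a sketch of the unpooled normal-approximation interval computed above; the function name `diff_proportion_ci` is made up, and the counts are the smokers and group sizes from the cross-table in section 5.

```python
import numpy as np
from scipy.stats import norm

def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """CI for p1 - p2, with SE = sqrt(SE1^2 + SE2^2)."""
    p1, p2 = x1 / n1, x2 / n2
    se1 = np.sqrt(p1 * (1 - p1) / n1)
    se2 = np.sqrt(p2 * (1 - p2) / n2)
    se_diff = np.sqrt(se1**2 + se2**2)
    z_star = norm.ppf(1 - (1 - confidence) / 2)
    d = p1 - p2
    return d - z_star * se_diff, d + z_star * se_diff

# Female smokers: 906 of 2972; male smokers: 1413 of 2753
lower, upper = diff_proportion_ci(906, 2972, 1413, 2753)
print(lower, upper)
```

Both bounds are negative, so the interval excludes 0: the data are consistent with a lower proportion of smokers among females than among males.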

## 7. CI of Two Means: BMI mean in Males & Females

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{(SE_{\ 1})^2 + (SE_{\ 2})^2}$$

In [92]:
dz

| RIAGENDRx | BMI_mean | BMI_std | BMI_n |
|-----------|-----------|----------|-------|
| Female | 29.939946 | 7.753319 | 2976 |
| Male | 28.778072 | 6.252568 | 2759 |


In [94]:
se_mean_female = dz.loc['Female','BMI_std'] / np.sqrt(dz.loc['Female','BMI_n'])
se_mean_male = dz.loc['Male','BMI_std'] / np.sqrt(dz.loc['Male','BMI_n'])
(se_mean_female, se_mean_male)

(0.14212522940758335, 0.11903715722332033)

In [95]:
se_mean_diff = np.sqrt(se_mean_female**2 + se_mean_male**2)
se_mean_diff

0.18538992862064455

In [97]:
d = dz.loc['Female','BMI_mean'] - dz.loc['Male','BMI_mean']
d

1.1618735403269653

In [99]:
lcb = d - 1.96 * se_mean_diff
ucb = d + 1.96 * se_mean_diff
(lcb, ucb)

(0.798509280230502, 1.5252378004234286)