# Exploring Confidence Intervals

In this workbook, you will investigate the theoretical properties of confidence intervals and practice calculating them on real data.

In [56]:
from symbulate import *

import pandas as pd
pd.set_option("display.max_rows", 15)

## Confidence Interval Coverage

In this exercise, you will simulate many realizations of 95% confidence intervals and see whether they really do cover the true mean 95% of the time.
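
For reference, the coverage claim being tested can be written out for the basic interval that the bonus question refers to (known $\sigma$, multiplier $2$). Here $n$ is the sample size, $\bar{X}$ the sample mean, and $\mu$ the true mean; this is the standard large-sample form, stated as an assumption about what "the basic confidence interval" means in this workbook.

$$
P\left( \bar{X} - 2\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 2\,\frac{\sigma}{\sqrt{n}} \right) \approx 0.95
$$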

In [57]:
box = [0, 1, 3, 4, 7]
model = BoxModel(box, size=10, replace=True)

<symbulate.probability_space.BoxModel at 0x7f89df8a8470>

In [31]:
def calculate_interval(sample):
    """Calculates 95% confidence interval for a given sample."""
    # Margin of error: 2 estimated standard errors (SD taken from the sample).
    center = mean(sample)
    margin = 2 * (sd(sample) / sqrt(len(sample)))
    return (center - margin, center + margin)

CI = RV(model, calculate_interval)
CI.sim(10000)

Index,Result
0,"(1.6181093157725839, 4.5818906842274165)"
1,"(3.180140852056092, 6.019859147943907)"
2,"(1.6748913077721743, 5.125108692227825)"
3,"(1.9801408520560921, 4.819859147943908)"
4,"(1.885092596121005, 4.314907403878995)"
5,"(0.6875961595364041, 2.312403840463596)"
6,"(0.7882446312250164, 3.0117553687749834)"
7,"(1.7086918494154204, 4.69130815058458)"
8,"(0.6435340033749464, 3.3564659966250536)"
...,...


In [25]:
def contains_true_mean(interval):
    """Returns True if an interval (lower, upper) contains the true population mean."""
    return (interval[0] < mean(box) < interval[1])

CI.sim(10000).count(contains_true_mean)/10000

0.955
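
As a cross-check that does not depend on Symbulate, the same coverage experiment can be sketched with the standard library alone. This version assumes the basic interval with the known population SD $\sigma$ of the box, not the sample-SD interval computed by `calculate_interval`. The seed and the variable names are illustrative choices, not part of the original workbook.

```python
import math
import random
import statistics

box = [0, 1, 3, 4, 7]
n = 10        # sample size, as in the BoxModel above
reps = 10000  # number of simulated intervals

mu = statistics.fmean(box)         # true mean of the box
sigma = statistics.pstdev(box)     # true (population) SD of the box
margin = 2 * sigma / math.sqrt(n)  # known-sigma margin of error

rng = random.Random(0)  # arbitrary seed, only for reproducibility
covered = 0
for _ in range(reps):
    sample = rng.choices(box, k=n)  # draw n tickets with replacement
    xbar = statistics.fmean(sample)
    if xbar - margin < mu < xbar + margin:
        covered += 1

coverage = covered / reps
print(coverage)
```

The printed proportion should land near 0.95, with some run-to-run variation if the seed is changed.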

In [23]:
9584/10000

0.9584

**Bonus.** If you finish early, try different versions of the basic confidence interval. For example, what happens to the coverage if you replace the population standard deviation $\sigma$ by the sample standard deviation $S$? Can you restore the coverage by replacing $2$ with the appropriate quantile of the $t$-distribution?

_Hint_: You can calculate the quantile of a list of simulation results using `sims.quantile(q)`, where `q` is the quantile you want to calculate.
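
One way to explore the bonus question is sketched below using only the standard library. It draws the same kind of samples, estimates the standard error with the sample SD $S$ (the $n - 1$ divisor; which divisor Symbulate's `sd` uses is not confirmed here), and compares the multiplier $2$ with a hardcoded approximation of the 0.975 quantile of the $t$-distribution with 9 degrees of freedom. The $t$ interval is exact only for normal populations, so for this box the coverage is only approximately 95%. All names and the seed are illustrative.

```python
import math
import random
import statistics

box = [0, 1, 3, 4, 7]
n = 10
reps = 10000
mu = statistics.fmean(box)  # true mean of the box

# Approximate 0.975 quantile of the t-distribution with n - 1 = 9 degrees of freedom.
T_QUANTILE = 2.262

rng = random.Random(0)  # arbitrary seed, only for reproducibility
hits_2 = 0
hits_t = 0
for _ in range(reps):
    sample = rng.choices(box, k=n)  # draw n tickets with replacement
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # S uses the n - 1 divisor
    if abs(xbar - mu) < 2 * se:
        hits_2 += 1
    if abs(xbar - mu) < T_QUANTILE * se:
        hits_t += 1

coverage_2 = hits_2 / reps  # coverage with multiplier 2
coverage_t = hits_t / reps  # coverage with the t quantile
print(coverage_2, coverage_t)
```

Because both intervals are built from the same samples and the $t$ interval is wider, its coverage can never be lower than that of the multiplier-$2$ interval.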

## Calculating Confidence Intervals

In practice, we only ever get to observe one sample and calculate one interval. We cannot say for certain whether that interval contains the true mean, but when the confidence level is high (e.g., 95%), we typically proceed as though it does.

In [27]:
data = pd.read_csv("/data/harris.csv")
data

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper
0,5040,12420,Male,96,329,15,14.0
1,6300,12060,Male,82,357,15,72.0
2,6000,15120,Male,67,315,15,35.5
3,6000,16320,Male,97,354,12,24.0
4,6000,12300,Male,66,351,12,56.0
5,6840,10380,Male,92,374,15,41.5
6,8100,13980,Male,66,369,16,54.5
...,...,...,...,...,...,...,...
86,5100,10560,Female,84,458,12,36.0
87,4800,9240,Female,84,571,16,214.0


**Exercise.** Assuming that Harris Bank employees are a representative sample of all bank employees in Chicago, calculate 95% confidence intervals for the average months of experience for male bank employees and the average months of experience for female bank employees in Chicago.

In [39]:
sample = data.groupby("Sex")["Exper"].apply(calculate_interval)
sample

Sex
Female     (78.126392766, 121.504754775)
Male      (67.5186797879, 138.575070212)
Name: Exper, dtype: object

In [63]:
a = {1: 271, 0: 406}
model = BoxModel(a, size=677, replace=False)
CI = RV(model, calculate_interval)
CI.sim(1)

Index,Result
0,"(0.36263417752716026, 0.4379566644226182)"


In [45]:
x = mean(a)
stderr = sd(a)/sqrt(677)

TypeError: 'float' object is not callable
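
A `'float' object is not callable` error like the one above usually means that one of the names being called (here `mean`, `sd`, or `sqrt`) was rebound to a number earlier in the session. That is a guess, since the traceback is not shown. Note also that `a` is a dict of counts, so it may need to be expanded into individual values before summary statistics are taken. The sketch below sidesteps both issues by using only the standard library. The names `counts`, `values`, `p_hat`, and `std_err` are illustrative, and the population-SD divisor is an assumption.

```python
import math
import statistics

counts = {1: 271, 0: 406}  # same box as `a` above: value -> number of tickets

# Expand the counts into one value per ticket.
values = [value for value, k in counts.items() for _ in range(k)]

n = len(values)                                     # total number of tickets
p_hat = statistics.fmean(values)                    # proportion of 1s
std_err = statistics.pstdev(values) / math.sqrt(n)  # SD with the n divisor

interval = (p_hat - 2 * std_err, p_hat + 2 * std_err)
print(interval)
```

Because the model draws all 677 tickets without replacement, every simulated sample is the whole box, so this result should agree with the interval reported by `CI.sim(1)` above up to floating-point rounding.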