# Notebook 15: Confidence Intervals Solutions
***

We'll need NumPy, Matplotlib, pandas, and scipy.stats for this notebook, so let's load them.

In [1]:
import numpy as np 
from scipy import stats
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline

### Exercise 1 - Single Sample CI
*** 
Load `hubble.csv` into Python. A description of the variables can be obtained from page 73 of https://cran.r-project.org/web/packages/gamair/gamair.pdf.

In [2]:
# Path to the data - select the path that works for you 
file_path = 'hubble.csv'

# Load the data into a DataFrame 
df = pd.read_csv(file_path)

# Look at the data
df.head(10)


Unnamed: 0.1,Unnamed: 0,Galaxy,y,x
0,1,NGC0300,133,2.0
1,2,NGC0925,664,9.16
2,3,NGC1326A,1794,16.14
3,4,NGC1365,1594,17.95
4,5,NGC1425,1473,21.88
5,6,NGC2403,278,3.22
6,7,NGC2541,714,11.22
7,8,NGC2090,882,11.75
8,9,NGC3031,80,3.63
9,10,NGC3198,772,13.8


In [3]:
# Check the data types
df.dtypes

Unnamed: 0      int64
Galaxy         object
y               int64
x             float64
dtype: object

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  24 non-null     int64  
 1   Galaxy      24 non-null     object 
 2   y           24 non-null     int64  
 3   x           24 non-null     float64
dtypes: float64(1), int64(2), object(1)
memory usage: 896.0+ bytes


From the documentation (pg. 73), `x` is a galaxy's distance from Earth (in megaparsecs) and `y` is its relative velocity (in km/s).

#### (a) Calculate the 85% confidence interval for the mean of a galaxy's distance from Earth in megaparsecs in Python by $\color{red}{\text{doing the computation explicitly.}}$  Use the large-sample approximation even though we only have $n = 24$.


In [5]:
# 'n' is the number of rows (galaxies) in the DataFrame, so n = 24.
n = len(df)

# 'xbar' is a variable holding the mean of column 'x'.
xbar = np.sum(df['x']) / n

# 'svar' is a variable holding the sample variance of column 'x':
#   the sum of squared deviations from xbar, divided by (n - 1).
svar = sum((xi - xbar)**2 / (n - 1) for xi in df['x'])

# 'sd' is a variable holding the sample std of column 'x'.
sd = np.sqrt(svar)  # square root of the sample variance

# 'critz' is a variable holding the critical z.
# An 85% CI leaves 7.5% in each tail, so we need the 0.925 quantile.
critz = stats.norm.ppf(.925)

# Now, we create the interval by taking our sample mean and adding and subtracting our
#  critical z times the standard error.
print('The CI is: (', xbar-critz*sd/np.sqrt(n),' , ',xbar+critz*sd/np.sqrt(n),')')

The CI is: ( 10.345988767455863  ,  13.763177899210804 )



#### (b) Can you find a built in stats function that does this computation automatically? 
The $\color{red}{\text{.interval}}$ function!

In [6]:
# Of course we need not do this by hand each time!
# Just provide the .interval function inputs like:
#   confidence level, xbar, and the standard error.
# Because this is a sample mean distribution our standard error is sd/np.sqrt(n)
stats.norm.interval(.85, loc=xbar, scale=sd/np.sqrt(n))


(10.345988767455863, 13.763177899210804)
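
Part (a) asked for the large-sample (z) approximation even though $n = 24$. As a rough check on how much that choice matters, the sketch below compares the critical z with the critical value of a Student-t distribution with $n - 1 = 23$ degrees of freedom. Using the t distribution here is our own addition, not part of the exercise, and it assumes the distances are roughly normally distributed.

```python
from scipy import stats

n = 24          # number of galaxies in hubble.csv
level = 0.85    # confidence level used in part (a)
upper_q = 1 - (1 - level) / 2   # 0.925: leaves 7.5% in each tail

# Large-sample critical value, as used in part (a).
critz = stats.norm.ppf(upper_q)

# Small-sample alternative (assumption: roughly normal data).
critt = stats.t.ppf(upper_q, df=n - 1)

print("critical z:", critz)
print("critical t:", critt)

# A t-based interval would be wider than the z-based one by this factor.
print("width ratio t/z:", critt / critz)
```

With `xbar` and `sd` from part (a), the matching t-based interval could be obtained with `stats.t.interval(.85, n-1, loc=xbar, scale=sd/np.sqrt(n))`.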

#### (c) Interpret the confidence interval.

We are 85% confident that the true mean of a galaxy's distance from Earth lies between about 10.35 and 13.76 megaparsecs. The "85%" describes the procedure: 85% of intervals created in this manner will contain the true mean, and this is one such interval.

This claim means that, if we were to collect these measurements from sample after sample after sample, and calculate the confidence interval for each sample, then about 85% of the CIs would contain the true mean. It does not mean there is an 85% probability that the true mean falls inside this particular interval. Again, this is just one of those sample intervals.
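
The repeated-sampling interpretation can be checked with a small simulation. This is only a sketch: the population below is hypothetical (a normal distribution with a mean and standard deviation we made up), not an estimate from `hubble.csv`. We draw many samples of size 24, build an 85% large-sample interval from each, and count how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters (assumptions, not taken from the data).
true_mean, true_sd = 12.0, 6.0
n, n_reps, level = 24, 10_000, 0.85

# Critical z for the chosen confidence level.
critz = stats.norm.ppf(1 - (1 - level) / 2)

# Each row is one simulated sample of n observations.
samples = rng.normal(true_mean, true_sd, size=(n_reps, n))
xbars = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)  # standard error per sample

# Check, sample by sample, whether the interval contains the true mean.
covered = (xbars - critz * ses <= true_mean) & (true_mean <= xbars + critz * ses)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.85. With $n$ as small as 24 it tends to come out slightly lower, because the large-sample approximation ignores the extra uncertainty from estimating the standard deviation.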

### Exercise 2 - Two Sample CI
*** 
Load `clean_titanic_data.csv` into Python.

In [7]:
# Path to the data - select the path that works for you 
file_path = 'clean_titanic_data.csv'

# Load the data into a DataFrame 
df = pd.read_csv(file_path)

# Look at the data.
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S
5,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
6,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S
7,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S
8,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C
9,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,S


#### a) Calculate a 98% CI for the survival rate of all passengers.

In [8]:
# 'people_survived' is a variable holding the number of survivors.
# We just sum up the 1's in the column with the 'Survived' header.
people_survived = df['Survived'].sum()

# Print out how many people survived.
print("{} people survived the disaster".format(people_survived))

# 'people_total' is a variable holding the total number of people in the df.
people_total = len(df)

# Print the total, so the survivor count can be read as a proportion of our sample.
print("... out of {} people total".format(people_total))

# So the CI is: phat plus/minus z_crit * sqrt(phat(1-phat)/n)
# Because this is the sampling distribution of a proportion, our standard error is sqrt(phat(1-phat)/n)
# A 98% CI leaves 1% in each tail, so z_crit is the 0.99 quantile.
phat_t=people_survived/people_total
zcrit=stats.norm.ppf(.99)
var_t=phat_t*(1-phat_t)/people_total

print("For a 98% CI of: (", phat_t-zcrit*np.sqrt(var_t)," , ",   phat_t+zcrit*np.sqrt(var_t),')')

print("")
# Alternatively:
g = stats.norm.interval(.98, loc=phat_t, scale=np.sqrt(var_t))
print("The 98% CI is given by ",g)


290 people survived the disaster
... out of 714 people total
For a 98% CI of: ( 0.36340526395317535  ,  0.44891966601881345 )

The 98% CI is given by  (0.36340526395317535, 0.44891966601881345)


#### b) Calculate a 98% CI for the survival rate of men (all passenger classes).

In [9]:
# Find the number of male survivors.
male_survived = df.loc[df['Sex']=='male', 'Survived'].sum()
print("{} men survived the disaster".format(male_survived))

# Count how many men were in the sample.
male_total = len(df.loc[(df["Sex"]=='male')])
print("... out of {} men total".format(male_total))


# So the CI is: phat plus/minus z_crit * sqrt(phat(1-phat)/n)
phat_m=male_survived/male_total
zcrit=stats.norm.ppf(.99)
var_m=phat_m*(1-phat_m)/male_total

print("For a 98% CI of: (", phat_m-zcrit*np.sqrt(var_m),' , ',   phat_m+zcrit*np.sqrt(var_m),')')


93 men survived the disaster
... out of 453 men total
For a 98% CI of: ( 0.1611490936776798  ,  0.24944693281238642 )



#### c) Calculate a 98% CI for the survival rate of women (all passenger classes).

In [10]:
# Find the number of female survivors.
female_survived = df.loc[df['Sex']=='female', 'Survived'].sum()
print("{} women survived the disaster".format(female_survived))

# Out of how many women were in the sample.
female_total = len(df.loc[(df["Sex"]=='female')])
print("... out of {} women total".format(female_total))


# So the CI is: phat plus/minus z_crit * sqrt(phat(1-phat)/n)
phat_f=female_survived/female_total
zcrit=stats.norm.ppf(.99)
var_f=phat_f*(1-phat_f)/female_total

print("For a 98% CI of: (", phat_f-zcrit*np.sqrt(var_f),' , ',   phat_f+zcrit*np.sqrt(var_f),')')


197 women survived the disaster
... out of 261 women total
For a 98% CI of: ( 0.692839887327238  ,  0.8167386567340648 )
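
Parts (a) through (c) repeat the same three steps: compute $\hat{p}$, compute its standard error, and build the interval. A minimal sketch of how that could be wrapped in one helper is shown below. The function name `proportion_ci` is our own (hypothetical), and the counts are the ones printed in the outputs above.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, total, level=0.98):
    """Large-sample CI for a proportion: phat +/- z_crit * sqrt(phat(1-phat)/n)."""
    phat = successes / total
    se = np.sqrt(phat * (1 - phat) / total)  # standard error of phat
    return stats.norm.interval(level, loc=phat, scale=se)

# (survivors, group size) as printed in parts (a), (b), and (c).
groups = [("all passengers", 290, 714), ("men", 93, 453), ("women", 197, 261)]

for label, successes, total in groups:
    low, high = proportion_ci(successes, total)
    print(f"98% CI for {label}: ({low}, {high})")
```

The three intervals should agree with the ones computed cell by cell above.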



#### d) Calculate a 98% CI for the $\color{red}{\text{difference}}$ in survival rates between men and women.

In [11]:
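# So, the CI is: phat1-phat2 plus/minus z_crit * sqrt(phat1(1-phat1)/n1 +phat2(1-phat2)/n2 )
# Here group 1 is men and group 2 is women, so the difference is (men - women).
# For independent samples, the variance of the difference is the sum of the two variances.
var_diff=var_f+var_m
print("For a 98% CI of: (", phat_m-phat_f-zcrit*np.sqrt(var_diff),' , ',phat_m-phat_f+zcrit*np.sqrt(var_diff),')')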



For a 98% CI of: ( -0.625562628985557  ,  -0.4734198885856794 )



#### e) What can you conclude?

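- The interval does not include zero, so sex does appear to matter for survival rate. Because the difference is (men - women) and the whole interval is negative, men survived at a lower rate than women, by roughly 47 to 63 percentage points.

Probably the same conclusion as in HW 1: sex seemed to matter on the Titanic!  But now we can add the sentence: there was a statistically significant difference (i.e. one unlikely to be due to random luck alone) between the survival rates of men and women.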

That said, we also know that the survival rate may have varied quite a bit depending on passenger class and age (our sample was not stratified), and we don't know yet how to compare across multiple variables at once.

But stay tuned for regression and ANOVA!