## Inferential statistics lab

It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing). 

We've saved the data for you in a file named "housing.data". Load it in using any method you choose.

In [3]:
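import csv

data = []
with open('housing.data') as f:
    # The file is whitespace-delimited, so each line arrives as a single
    # csv field; split it on whitespace and convert every value to float.
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        data.append(list(map(float, row[0].split())))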

In [4]:
# housing_data = {
#                 'CRIM' : [row[0] for row in data],
#                 'ZN' : [row[1] for row in data],
#                 'INDUS' : [row[2] for row in data],
#                 'CHAS' : [row[3] for row in data],
#                 'NOX' : [row[4] for row in data],
#                 'RM' : [row[5] for row in data],
#                 'AGE' : [row[6] for row in data],
#                 'DIS' : [row[7] for row in data],
#                 'RAD' : [row[8] for row in data],
#                 'TAX' : [row[9] for row in data],
#                 'PTRATIO' : [row[10] for row in data],
#                 'B' : [row[11] for row in data],
#                 'LSTAT' : [row[12] for row in data],
#                 'MEDV' : [row[13] for row in data]
#                }
#### MORE EFFICIENT from brian:
# headers = 'CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV'
# headers = headers.split(',')
# data = {}
# for i,header in enumerate(headers):
#     data[header] = [row[i] for row in rows] 

# for i in housing_data:
#     housing_data[i] = [x/100 if i in ['ZN','INDUS','AGE','LSTAT'] else x for x in housing_data[i] ]
#     print i + ' ' + str(max(housing_data[i]))
# print housing_data

import pandas as pd
df = pd.DataFrame(data)
housing_data = df.apply(pd.to_numeric)

  

Exercise 1: Conduct a brief integrity check of your data. This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g., is one variable a percentage, but there are observations above 100%?).
Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

When I loaded the data, I noticed that the values are separated by whitespace rather than commas, so I split each line on whitespace when building the array.

Also, a few columns (ZN, INDUS, AGE, LSTAT) are described as proportions or percentages but are stored on a 0-100 scale rather than as decimals between 0 and 1. I experimented with dividing these columns by 100, but that step is commented out above, so the results below use the original 0-100 scale.
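
The checks described in the exercise (missing values, values outside a logical range) could be sketched as follows. This is only an illustration on a small made-up frame: the `toy` DataFrame, its values, and the variable names are assumptions, not the real housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for housing_data (values are made up).
toy = pd.DataFrame({
    "AGE": [65.2, 78.9, 101.5, np.nan],   # percentage, should lie within 0-100
    "CHAS": [0.0, 1.0, 0.0, 2.0],         # dummy variable, should be 0 or 1
})

# 1. Count missing values per column.
missing_per_column = toy.isnull().sum()

# 2. Flag percentages outside the 0-100 range (missing rows are not flagged here).
age_out_of_range = toy[(toy["AGE"] < 0) | (toy["AGE"] > 100)]

# 3. Flag dummy values other than 0 or 1.
chas_invalid = toy[~toy["CHAS"].isin([0, 1])]

print(missing_per_column)
print(age_out_of_range)
print(chas_invalid)
```

On the real data, the same three checks would be run against the corresponding columns of `housing_data`.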

Exercise 2: For what two attributes does it make the least sense to calculate mean and median? Why?

CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
RAD       index of accessibility to radial highways

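CHAS is a binary dummy variable that only takes the values 0 or 1, so its median is always 0 or 1 and its mean is just the share of tracts that bound the river. A count of each category is more informative.
We do not know how the RAD index is assigned or whether its values are evenly spaced, so an average index would not have a clear interpretation.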



Exercise 3: Find the mean, standard deviation, and the standard error of the mean for variable 'AGE.'

In [16]:
import numpy as np
from scipy import stats

age = housing_data[6]  # column 6 is AGE

mean_age = np.mean(age)
std_dev_age = np.std(age)        # population standard deviation (ddof=0)
std_error_age = stats.sem(age)   # standard error of the mean (uses ddof=1)

print(mean_age)
print(std_dev_age)
print(std_error_age)


68.5749011858
28.1210325702
1.25136952526
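
The standard error of the mean is the sample standard deviation divided by the square root of the sample size. One detail worth checking is that `np.std` divides by n by default (`ddof=0`), while `stats.sem` uses n - 1 (`ddof=1`). The sketch below illustrates this on a small made-up sample; the values are an assumption chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, used only to illustrate the formula.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(sample)

population_std = np.std(sample)        # ddof=0, divides by n
sample_std = np.std(sample, ddof=1)    # ddof=1, divides by n - 1

sem_manual = sample_std / np.sqrt(n)   # standard error computed by hand
sem_scipy = stats.sem(sample)          # scipy uses ddof=1 by default

print(population_std, sample_std)
print(sem_manual, sem_scipy)
```

With a large sample the two standard deviations are nearly identical, which is why mixing them in this lab changes the intervals only slightly.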


Exercise 4: Generate a 90%, 95%, and 99% confidence interval for 'AGE'. Do at least one of these manually (i.e. by plugging in the appropriate parts) and at least one of these using a function from scipy.stats. Interpret the results from all three confidence intervals.

In [17]:
## manual -- 90% CI
multiplier = 1.645  # z critical value for a two-sided 90% interval

conf_int_low = mean_age - (multiplier * std_error_age)
conf_int_high = mean_age + (multiplier * std_error_age)

print((conf_int_low, conf_int_high))
## (66.516398316720867, 70.633404054820701)
## We are 90% confident that the true mean AGE lies between 66.52 and 70.63.

import math
N = len(age)
sigma = std_dev_age

conf_int_95 = stats.norm.interval(.95, 
                               loc=mean_age, 
                               scale=sigma / (math.sqrt(N)))

print(conf_int_95)

## We are 95% confident that the true mean AGE lies
## between 66.12 and 71.03.

conf_int_99 = stats.norm.interval(.99,
                               loc=mean_age,
                               scale=sigma / (math.sqrt(N)))
print(conf_int_99)

## We are 99% confident that the true mean AGE lies
## between 65.35 and 71.80.
## The interval widens as the confidence level increases.



(66.516398316720867, 70.633404054820701)
(66.124686740030128, 71.02511563151144)
(65.354773561436559, 71.795028810105009)
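
Rather than hard-coding the multiplier, the critical value can be looked up with `stats.norm.ppf`. The following is a minimal sketch under that assumption; the helper `normal_mean_interval` and the `toy_sample` values are hypothetical and not part of the lab.

```python
import numpy as np
from scipy import stats

def normal_mean_interval(values, confidence):
    """Return (low, high) for the mean using a normal critical value."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)                        # sample std / sqrt(n)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return (mean - z * sem, mean + z * sem)

# Hypothetical values, used only to exercise the helper.
toy_sample = [61.0, 72.5, 45.8, 88.1, 95.0, 52.3, 79.4, 66.7]

levels = (0.90, 0.95, 0.99)
intervals = {level: normal_mean_interval(toy_sample, level) for level in levels}
for level in levels:
    print(level, intervals[level])
```

For a 90% interval, `stats.norm.ppf(0.95)` gives roughly 1.645, the multiplier used in the manual calculation.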


Exercise 5: For variable 'NOX', generate a 95% confidence interval and interpret it.

In [7]:
nox = housing_data[4]
N_nox = len(nox)
sigma_nox = np.std(nox)
mean_nox = np.mean(nox)

conf_int = stats.norm.interval(.95, 
                               loc=mean_nox, 
                               scale=sigma_nox / (math.sqrt(N_nox)))
print(conf_int)

## We are 95% confident that the true mean NOX lies
## between 0.5446 and 0.5648.

(0.54460850016434248, 0.564781618412732)


Exercise 6: For the variable 'NOX', find the median.

In [19]:
median_nox = np.median(housing_data[4])

print(median_nox)

## 0.538

0.538


Exercise 7: For the variable 'NOX', test the hypothesis that the mean is equal to the median. You may use scipy functions to complete this, but complete all steps - define hypotheses, etc. Let alpha = 0.05. Interpret your results.

In [9]:
# H_0: mean_nox = median_nox
# H_1: mean_nox != median_nox
alpha = 0.05

# Two-sided one-sample t-test of the NOX mean against the sample median,
# which is treated here as a fixed reference value.
ttest_result = stats.ttest_1samp(nox, median_nox)
print(ttest_result)

# Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.0012702109998191441)
tstat, pvalue = ttest_result
print(pvalue < alpha)

## The p-value is less than alpha = 0.05, so we reject the null hypothesis
## and conclude that the mean of NOX differs from its median.



Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.0012702109998191441)
True


Exercise 8: What do you notice about the results from Exercise 5 through Exercise 7? If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.

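The 95% confidence interval from Exercise 5, (0.5446, 0.5648), does not contain the median from Exercise 6 (0.538), and the two-sided test in Exercise 7 rejects the hypothesis that the mean equals 0.538 at alpha = 0.05. These are two views of the same result: a (1 - alpha) confidence interval contains exactly the hypothesized values that a two-sided test at level alpha would fail to reject. Any value outside the 95% interval is rejected at alpha = 0.05, and any value inside it is not. (Here the interval is normal-based and the test is t-based, so the correspondence is approximate, but with a sample this large the difference is negligible.)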
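
The link between a two-sided test and a confidence interval can be checked numerically. The sketch below uses simulated data (an assumption, not the NOX column) and a t-based interval, so that the interval and `stats.ttest_1samp` rest on the same distribution; the helper names `rejected` and `outside` are hypothetical.

```python
import numpy as np
from scipy import stats

# Simulated sample, used only for illustration.
rng = np.random.RandomState(0)
sample = rng.normal(loc=0.55, scale=0.1, size=200)

alpha = 0.05
n = len(sample)

# t-based (1 - alpha) confidence interval for the mean.
low, high = stats.t.interval(1 - alpha, n - 1,
                             loc=sample.mean(), scale=stats.sem(sample))

def rejected(mu0):
    """True if the two-sided t-test rejects H_0: mean = mu0 at level alpha."""
    return stats.ttest_1samp(sample, mu0).pvalue < alpha

def outside(mu0):
    """True if mu0 falls outside the confidence interval."""
    return mu0 < low or mu0 > high

candidates = (0.50, 0.54, 0.55, 0.56, 0.60)
for mu0 in candidates:
    print(mu0, outside(mu0), rejected(mu0))
```

For every candidate value the two columns should agree: a value is rejected by the test exactly when it lies outside the interval.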

Exercise 9: For the variable 'NOX', test the hypothesis that the mean is greater than or equal to the median. You may use scipy functions to complete this, but complete all steps - define hypotheses, etc. Let alpha = 0.05. Interpret your results.

In [13]:
## H_0: mean_nox <= median_nox
## H_1: mean_nox > median_nox
## (The null hypothesis must contain the equality, so the claim a rejection
## can support is that the mean is greater than the median.)
alpha = 0.05

ttest_result = stats.ttest_1samp(nox, median_nox)
tstat, pvalue = ttest_result

## ttest_1samp returns a two-sided p-value. For this one-sided (greater-than)
## test, reject H_0 when pvalue / 2 < alpha and tstat > 0.
## Note to self: for a less-than test, check that pvalue / 2 < alpha and tstat < 0.

print(pvalue / 2 < alpha and tstat > 0)
# True: we reject the null hypothesis and conclude that
# the mean of NOX is greater than its median.

True


The p-value in Exercise 9 is half of the p-value in Exercise 7 because Exercise 7 is a two-tailed test and Exercise 9 is a one-tailed test.
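
Recent SciPy releases (1.6 and later) accept an `alternative` argument in `stats.ttest_1samp`, which avoids halving the p-value by hand. The sketch below compares the two approaches on simulated data; the sample and the `reference` value are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Simulated sample and hypothetical reference value.
rng = np.random.RandomState(1)
sample = rng.normal(loc=0.56, scale=0.1, size=200)
reference = 0.54

# Two-sided test, then the one-sided (greater-than) version.
two_sided = stats.ttest_1samp(sample, reference)
one_sided = stats.ttest_1samp(sample, reference, alternative='greater')

# Manual conversion of the two-sided p-value to a one-sided one.
if two_sided.statistic > 0:
    halved = two_sided.pvalue / 2
else:
    halved = 1 - two_sided.pvalue / 2

print(one_sided.pvalue, halved)
```

The two printed values should match, which is the halving relationship noted above.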