<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Review CLT, Confidence Intervals, and Hypothesis Testing


---

### Read in the housing data (code provided).

You can find the original data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [4]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

data_boston = load_boston()
data = pd.DataFrame(data_boston.data, columns=data_boston.feature_names)
NOX = data['NOX']
AGE = data['AGE']

### 1. Find the mean, standard deviation, and the standard error of the mean for variable `AGE`

In [16]:
data.head()

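|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |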


In [18]:
data.AGE.describe()

count    506.000000
mean      68.574901
std       28.148861
min        2.900000
25%       45.025000
50%       77.500000
75%       94.075000
max      100.000000
Name: AGE, dtype: float64

In [22]:
AGE.sem()

1.251369525258305

In [5]:
# scipy standard error function (imported before it is used below)
from scipy.stats import sem

In [44]:
sem(data.AGE)

1.2513695252583041

In [23]:
len(AGE)

506

In [42]:
print("mean:\t\t\t {:.4f}.".format(data["AGE"].mean()))

mean:			 68.5749.


In [43]:
print("standard deviation:\t {:.4f}".format(data["AGE"].std()))

standard deviation:	 28.1489
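The standard error reported by `AGE.sem()` can be cross-checked by hand as $s/\sqrt{n}$. The sketch below is only a sanity check: it hard-codes the count and standard deviation copied from the `describe()` output above instead of reloading the data.

```python
import math

# Summary statistics copied from the AGE.describe() output above
n = 506
std_age = 28.148861  # sample standard deviation (pandas uses ddof=1)

# Standard error of the mean: s / sqrt(n)
sem_age = std_age / math.sqrt(n)
print("standard error:\t {:.4f}".format(sem_age))
```

The result should agree with `AGE.sem()` and `scipy.stats.sem` up to the rounding of the copied standard deviation.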


### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`

You can use the `scipy.stats.t.interval` function to calculate the endpoints of a confidence interval.

```python
# Endpoints of the range that contains a fraction `alpha` (e.g. 0.95) of the distribution
stats.t.interval(alpha, df, loc=0, scale=1)
```

Arguments:
- `df` = the degrees of freedom: the length of the vector minus 1.
- `loc` = the center of the t-distribution (your point estimate, i.e. the sample mean of the variable).
- `scale` = the scale of the t-distribution (the standard error of your sample mean).
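To see what these arguments do, the interval can also be built by hand as mean ± critical value × standard error. This is a minimal sketch, assuming the mean and standard error copied from the `AGE` outputs in this notebook; the helper name `t_confidence_interval` is illustrative and not part of scipy.

```python
from scipy.stats import t

# Values copied from the AGE outputs in this notebook
mean_age = 68.574901
sem_age = 1.251369525258305
df = 506 - 1  # degrees of freedom: n - 1


def t_confidence_interval(level, mean, sem, df):
    """Two-sided interval: mean +/- t_crit * sem (illustrative helper)."""
    # Critical value that leaves (1 - level) / 2 in each tail
    t_crit = t.ppf(1 - (1 - level) / 2, df)
    return mean - t_crit * sem, mean + t_crit * sem


for level in (0.90, 0.95, 0.99):
    low, high = t_confidence_interval(level, mean_age, sem_age, df)
    print("{:.0%} CI: ({:.4f}, {:.4f})".format(level, low, high))
```

The critical value grows with the confidence level, which is why the interval widens from 90% to 99%.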

**Interpret the results from all three confidence intervals.**

In [6]:
from scipy.stats import t

In [45]:
t.interval(0.9, 505, loc=AGE.mean(), scale=AGE.sem())

(66.51279866704189, 70.63700370449968)

In [29]:
t.interval(0.95, 505, loc=68.574901, scale=AGE.sem())

(66.11636953277245, 71.03343246722754)

In [30]:
t.interval(0.99, 505, loc=68.574901, scale=AGE.sem())

(65.33936023257061, 71.81044176742938)

In [None]:
# As the confidence level increases, the interval becomes wider.

### 3. Did you rely on the Central Limit Theorem in question 2? Why or why not? Explain.

In [8]:
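# Yes. The t-interval assumes the sampling distribution of the mean is approximately normal.
# AGE itself is skewed (median 77.5 vs. mean 68.6), but our sample has 506 observations,
# far more than the usual rule of thumb of 30.
# So we can rely on the CLT: the sample mean is approximately normally distributed.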
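A quick simulation can illustrate the claim. The sketch below does not use the housing data: it assumes a synthetic, strongly skewed population (exponential with mean 1 and standard deviation 1) and an arbitrary number of repeats, then checks that means of samples of size 506 cluster around the population mean with spread close to $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 506           # same sample size as the housing data
n_repeats = 5000  # number of simulated samples (arbitrary choice)

# Synthetic skewed population: exponential with mean 1 and std 1
samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", 1.0 / np.sqrt(n))
```

Plotting `sample_means` with `plt.hist` should show a roughly bell-shaped histogram even though the underlying population is far from normal.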

### 4. For the variable `NOX`, generate a 95% confidence interval and interpret it.

In [32]:
NOX.describe()

count    506.000000
mean       0.554695
std        0.115878
min        0.385000
25%        0.449000
50%        0.538000
75%        0.624000
max        0.871000
Name: NOX, dtype: float64

In [33]:
NOX.sem()

0.005151391024028495

In [34]:
t.interval(0.95, 505, loc=0.554695, scale=0.005151391024028495)

(0.5445742030036426, 0.5648157969963575)

In [35]:
t.interval(0.99, 505, loc=0.554695, scale=0.005151391024028495)

(0.541375564508894, 0.5680144354911061)

In [None]:
# 

### 5. For the variable `NOX`, test the hypothesis that the mean is equal to the median. 

You may use scipy functions to complete this, but complete all steps listed below.

1. Define hypothesis
2. Set alpha (Let alpha = 0.05)
3. Calculate point estimate
4. Calculate test statistic
5. Find the p-value
6. Interpret results

Hint: Use the function `stats.ttest_1samp` to test for equality of the mean to a particular value $\mu$. In this case, the relevant t-statistic is calculated as

$$
t = \frac{\bar{x}-\mu}{s/\sqrt{n}}
$$

where the sample standard deviation is estimated from the single sample $x$.
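The formula above can be evaluated directly as a cross-check on `stats.ttest_1samp`. This is a sketch under the assumption that the rounded summary values from `NOX.describe()` are accurate enough; the results will differ from the scipy output in the later decimals.

```python
import math
from scipy.stats import t

# Summary values copied from the NOX.describe() output in this notebook
n = 506
x_bar = 0.554695  # sample mean (point estimate)
s = 0.115878      # sample standard deviation
mu_0 = 0.538      # hypothesized value: the sample median

# t = (x_bar - mu) / (s / sqrt(n))
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value: probability of a statistic at least this extreme in either tail
p_two_sided = 2 * t.sf(abs(t_stat), n - 1)

print("t statistic: {:.4f}".format(t_stat))
print("p-value:     {:.5f}".format(p_two_sided))
```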

In [37]:
# H0: the mean of NOX is equal to its median
# H1: the mean of NOX is not equal to its median
import scipy.stats as stats


In [39]:
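# Scratch experiment: simulated normal draws (50 x 2) centered on the NOX mean, not the NOX data itself
NOX_5 = stats.norm.rvs(loc=0.554695, scale=0.005151391024028495, size=(50,2))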

In [None]:
# rvs = stats.norm.rvs(loc=5, scale=10, size=(50,2))

In [None]:
# stats.ttest_1samp(rvs,5.0)

In [40]:
stats.ttest_1samp(NOX_5, 5.0)

Ttest_1sampResult(statistic=array([-6260.6021943, -5450.7836018]), pvalue=array([2.69294796e-146, 2.38739923e-143]))

In [41]:
stats.ttest_1samp(NOX, NOX.median()) # The first argument is the sample; its mean is tested against the second argument.

Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.001270210999819144)

In [None]:
# With this p-value we can reject H0: because 0.00127 < 0.05, H0 does not hold and we accept H1.

### 6. What do you notice about the results from Exercise 4 and Exercise 5? 

**If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**
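One way to explore this question is to check whether the hypothesized value falls inside the confidence interval and compare that with the test decision. The sketch below assumes the `NOX` mean, standard error, and median copied from the outputs in this notebook; it is an illustration of the comparison, not a full answer.

```python
from scipy.stats import t

# Values copied from the NOX outputs in this notebook
n = 506
x_bar = 0.554695
sem_nox = 0.005151391024028495
mu_0 = 0.538  # hypothesized value: the sample median
alpha = 0.05

# (1 - alpha) confidence interval for the mean
t_crit = t.ppf(1 - alpha / 2, n - 1)
ci = (x_bar - t_crit * sem_nox, x_bar + t_crit * sem_nox)

# Two-sided one-sample t-test against mu_0
t_stat = (x_bar - mu_0) / sem_nox
p_value = 2 * t.sf(abs(t_stat), n - 1)

outside_ci = not (ci[0] <= mu_0 <= ci[1])
rejects = p_value < alpha
print("median outside 95% CI:", outside_ci)
print("H0 rejected at alpha = 0.05:", rejects)
```

For a two-sided t-test at level alpha, these two answers always agree: the test rejects exactly when the hypothesized value lies outside the (1 - alpha) confidence interval.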

### 7. For the variable `NOX`, test the hypothesis that the mean is smaller than or equal to the median. 

You may use scipy functions to complete this, but complete all steps listed below.

1. Define hypothesis
2. Set alpha (Let alpha = 0.05)
3. Calculate point estimate
4. Calculate test statistic
5. Find the p-value
6. Interpret results
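The steps above can be sketched for the one-sided case as follows, assuming H0: mean ≤ median against H1: mean > median, and again using summary values copied from the `NOX` outputs. Only the upper tail counts as evidence against H0, so the p-value is the upper-tail probability of the same t-statistic.

```python
from scipy.stats import t

# Values copied from the NOX outputs in this notebook
n = 506
x_bar = 0.554695                   # point estimate (sample mean)
sem_nox = 0.005151391024028495     # standard error of the mean
mu_0 = 0.538                       # hypothesized value: the sample median
alpha = 0.05

# Same test statistic as in the two-sided test
t_stat = (x_bar - mu_0) / sem_nox

# H1: mean > median, so only the upper tail is used
p_one_sided = t.sf(t_stat, n - 1)
p_two_sided = 2 * t.sf(abs(t_stat), n - 1)

print("t statistic:       {:.4f}".format(t_stat))
print("one-sided p-value: {:.6f}".format(p_one_sided))
print("two-sided p-value: {:.6f}".format(p_two_sided))
print("reject H0:", p_one_sided < alpha)
```

Recent scipy versions also accept `alternative='greater'` in `stats.ttest_1samp`, which should give the same one-sided p-value when run on the full data.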

In [12]:
# A:

### 8. Compare the p-values from Exercise 5 and Exercise 7. What do you notice?

In [13]:
# A:

### 9. Test if the data is ordered or not.

Split the dataset into the first and second half according to the index order. Perform a statistical test if the means of the two groups are the same. Assume equal variances.
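A minimal sketch of this procedure is shown below. It assumes a synthetic stand-in series (`values`, hypothetical) rather than a real column, splits it by position, and runs `stats.ttest_ind` with `equal_var=True`. The pooled-variance statistic is also computed by hand to show what the equal-variance assumption means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for one column of the dataset (hypothetical values)
values = rng.normal(loc=0.55, scale=0.12, size=506)

# Split into first and second half according to index order
half = len(values) // 2
first, second = values[:half], values[half:]

# Two-sample t-test assuming equal variances
result = stats.ttest_ind(first, second, equal_var=True)

# Manual pooled-variance t-statistic for comparison
n1, n2 = len(first), len(second)
pooled_var = ((n1 - 1) * first.var(ddof=1) + (n2 - 1) * second.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (first.mean() - second.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

print("scipy t statistic: ", result.statistic)
print("manual t statistic:", t_manual)
print("p-value:           ", result.pvalue)
```

With the real data, `values` would be replaced by a column such as `data['NOX'].values`. A small p-value would suggest the two halves have different means, i.e. the rows are not in random order with respect to that variable.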

In [14]:
# A: