## Introduction
### Null hypothesis
The null hypothesis states that nothing happened: the mean did not change, the treatment had no effect, or the answer or model did not change.

### Alternative hypothesis
The alternative hypothesis states that something happened: the mean went up, the patients' condition improved, or the model got better.

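### Steps
1. State the null hypothesis, including the assumed distribution and the critical value for p (the significance level).
2. Compute the test statistic (for example, one based on the sample mean). The key is knowing how the sample or the population is distributed.
3. From the statistic and its distribution, compute the p-value.
4. If p is below the critical value, we reject the null hypothesis.
5. If p is above the critical value, we fail to reject the null hypothesis (this does not mean the null hypothesis is true).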
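
The steps above can be sketched with a one-sample t-test. This is a minimal illustration, not a general recipe: the data are synthetic, and the names `sample`, `mu0`, and `alpha` are assumptions introduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 30 measurements drawn from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.5, scale=2.0, size=30)

# Step 1: null hypothesis "the population mean equals mu0", significance level alpha
mu0 = 5.0
alpha = 0.05

# Steps 2-3: compute the t statistic and its two-sided p-value
t_stat, p_value = stats.ttest_1samp(sample, mu0)

# Steps 4-5: compare p with the significance level
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(t_stat, p_value, decision)
```
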

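### Example
1. If the null hypothesis assumes that two variables are independent, the alternative hypothesis states that they are not independent.
2. Set the critical value (significance level) to $\alpha = 0.05$. When $p < \alpha$, we reject the null hypothesis, meaning we believe the two variables are not independent.
3. The significance level is a choice and can be adjusted; 0.05 is only a common convention.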

### Confidence interval
A confidence interval estimates a population parameter at a given confidence level.
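
As a rough sketch of the idea, the block below computes a 95% confidence interval for a population mean using the t distribution. The synthetic data and the names `sample` and `confidence` are assumptions made for illustration, and the interval is only meaningful if the sample is roughly normal or reasonably large.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=10.0, size=50)

confidence = 0.95
n = sample.size
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# t-based interval with n - 1 degrees of freedom, centered on the sample mean
low, high = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)
print(low, high)
```
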

In [1]:
import numpy as np
import matplotlib as mp
import pandas as pd
%matplotlib inline
import sklearn
import os
import sys
import datetime
import random

## 8.1 Summarizing data statistics
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

In [12]:
a = pd.DataFrame({"A":np.arange(10),"B":np.repeat(["A","B"],5)})
a

| | A | B |
|---|---|---|
| 0 | 0 | A |
| 1 | 1 | A |
| 2 | 2 | A |
| 3 | 3 | A |
| 4 | 4 | A |
| 5 | 5 | B |
| 6 | 6 | B |
| 7 | 7 | B |
| 8 | 8 | B |
| 9 | 9 | B |


In [10]:
# include='all' forces every column (numeric and non-numeric) to be summarized
a.describe(include='all')

| | A | B |
|---|---|---|
| count | 10.0 | 10 |
| unique | | 2 |
| top | | B |
| freq | | 5 |
| mean | 4.5 | |
| std | 3.02765 | |
| min | 0.0 | |
| 25% | 2.25 | |
| 50% | 4.5 | |
| 75% | 6.75 | |


### A second approach
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.describe.html

In [14]:
from scipy import stats
a = np.arange(10)
stats.describe(a)

DescribeResult(nobs=10, minmax=(0, 9), mean=4.5, variance=9.1666666666666661, skewness=0.0, kurtosis=-1.2242424242424244)

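## 8.2 Computing the relative frequency of a subset
For example, find the fraction of values that lie more than one standard deviation away from the mean.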

In [21]:
a = np.arange(10,dtype=np.float64)*2
a

array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.])

In [24]:
# The trailing digits in the result are floating-point representation error
np.mean(np.abs(a-np.mean(a))> np.std(a),dtype=np.float64)

0.40000000000000002

## 8.3 Tabulating factors into a contingency table
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html

In [31]:
foo = 'foo'
bar = 'bar'
one = 1 
two = 2
dull = 'dull'
shiny = 'shiny'
a = np.array([foo, foo, foo, foo, bar, bar,bar, bar, foo, foo, foo], dtype=object)
b = np.array([one, one, one, two, one, one,one, two, two, two, one], dtype=object)
c = np.array([dull, dull, shiny, dull, dull, shiny,shiny, dull, shiny, shiny, shiny], dtype=object)

In [32]:
pd.crosstab(a,[b,c])

| row_0 | 1 / dull | 1 / shiny | 2 / dull | 2 / shiny |
|---|---|---|---|---|
| bar | 1 | 2 | 1 | 0 |
| foo | 2 | 2 | 1 | 2 |


## 8.4 Testing independence between variables
http://hamelg.blogspot.tw/2015/11/python-for-data-analysis-part-25-chi.html

In [35]:
# //TODO
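
The cell above is still marked TODO. One possible way to fill it in is a chi-squared test of independence on a contingency table, using `scipy.stats.chi2_contingency`. The block below is only a sketch: it reuses the `a` and `c` arrays from section 8.3 under the assumed names `group` and `finish`, and with counts this small the chi-squared approximation is unreliable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same factor values as arrays `a` and `c` in section 8.3 (renamed here for clarity)
group = np.array(["foo"] * 4 + ["bar"] * 4 + ["foo"] * 3, dtype=object)
finish = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
                   "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)

# Build the 2x2 contingency table of observed counts
table = pd.crosstab(group, finish)

# Null hypothesis: the two factors are independent.
# For a 2x2 table, Yates' continuity correction is applied by default.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

alpha = 0.05  # assumed significance level
print(table)
print(chi2, p_value, dof)
print("reject independence" if p_value < alpha else "fail to reject independence")
```
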

## 8.5 Computing quantiles of a dataset
https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html

In [45]:
a = np.linspace(0,1,100)
a

array([ 0.        ,  0.01010101,  0.02020202,  0.03030303,  0.04040404,
        0.05050505,  0.06060606,  0.07070707,  0.08080808,  0.09090909,
        0.1010101 ,  0.11111111,  0.12121212,  0.13131313,  0.14141414,
        0.15151515,  0.16161616,  0.17171717,  0.18181818,  0.19191919,
        0.2020202 ,  0.21212121,  0.22222222,  0.23232323,  0.24242424,
        0.25252525,  0.26262626,  0.27272727,  0.28282828,  0.29292929,
        0.3030303 ,  0.31313131,  0.32323232,  0.33333333,  0.34343434,
        0.35353535,  0.36363636,  0.37373737,  0.38383838,  0.39393939,
        0.4040404 ,  0.41414141,  0.42424242,  0.43434343,  0.44444444,
        0.45454545,  0.46464646,  0.47474747,  0.48484848,  0.49494949,
        0.50505051,  0.51515152,  0.52525253,  0.53535354,  0.54545455,
        0.55555556,  0.56565657,  0.57575758,  0.58585859,  0.5959596 ,
        0.60606061,  0.61616162,  0.62626263,  0.63636364,  0.64646465,
        0.65656566,  0.66666667,  0.67676768,  0.68686869,  0.69

In [46]:
np.percentile(a,50)

0.5

In [47]:
np.percentile(a,[20,50])

array([ 0.2,  0.5])

## 8.6 Inverting a quantile
Find the fraction of the data that is smaller than a given number.

In [49]:
a = np.random.random(10)
print(a)

[ 0.00791305  0.13835104  0.36701642  0.64019307  0.69162787  0.9606918
  0.30717037  0.19778286  0.14175649  0.64856211]


In [50]:
np.mean(a<0.3)

0.40000000000000002
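
As an alternative sketch, `scipy.stats.percentileofscore` with `kind='strict'` reports the same quantity as a percentage. The array below copies the values printed above, since the original draw was unseeded; the name `pct` is introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Values copied from the printed output above
a = np.array([0.00791305, 0.13835104, 0.36701642, 0.64019307, 0.69162787,
              0.9606918, 0.30717037, 0.19778286, 0.14175649, 0.64856211])

# Fraction of values strictly below 0.3, computed directly
fraction = np.mean(a < 0.3)

# Same quantity as a percentage; kind='strict' counts only values < score
pct = stats.percentileofscore(a, 0.3, kind="strict")

print(fraction, pct)
```
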

## 8.7 Converting data to z-scores
Formula: $z = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$

In [52]:
a = np.arange(20)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [54]:
z = (a - np.mean(a))/np.std(a)
z

array([-1.64750894, -1.47408695, -1.30066495, -1.12724296, -0.95382097,
       -0.78039897, -0.60697698, -0.43355498, -0.26013299, -0.086711  ,
        0.086711  ,  0.26013299,  0.43355498,  0.60697698,  0.78039897,
        0.95382097,  1.12724296,  1.30066495,  1.47408695,  1.64750894])
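
For comparison, `scipy.stats.zscore` gives the same result as the manual formula. Both use the population standard deviation (`ddof=0`) by default. Passing `ddof=1` to use the sample standard deviation is an option to consider, not something the original cell does.

```python
import numpy as np
from scipy import stats

a = np.arange(20)

# Manual z-score, as in the cell above (np.std defaults to ddof=0)
z_manual = (a - np.mean(a)) / np.std(a)

# Library equivalent; stats.zscore also defaults to ddof=0
z_scipy = stats.zscore(a)

# Variant using the sample standard deviation instead
z_sample = stats.zscore(a, ddof=1)

print(np.allclose(z_manual, z_scipy))
```
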

## 8.8 Testing a sample mean (t-test)
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.ttest_ind.html
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp

In [56]:
from scipy import stats
np.random.seed(7654567)  # fix seed to get the same result
rvs = stats.norm.rvs(loc=5, scale=10, size=(50,2))
rvs

array([[ -4.46976756e+00,   2.23747986e+00],
       [  3.01265677e+00,   1.99351998e+01],
       [  6.96381504e-01,   8.02590763e+00],
       [  1.12881422e+01,  -8.19430475e+00],
       [ -4.06294214e+00,   5.40597577e-01],
       [  7.49758326e+00,  -1.57148818e+00],
       [ -1.44414717e+00,   1.01871511e+01],
       [  4.65777598e+00,  -4.79148636e+00],
       [ -1.49960303e+00,  -4.09743532e+00],
       [  3.32305696e+00,   4.03947153e+00],
       [  7.14723880e+00,   2.02604829e+00],
       [  1.27751142e+01,  -1.03867073e+01],
       [ -1.49519210e+01,   1.52315561e-02],
       [  2.05524559e+01,   3.56715484e+00],
       [ -4.80533143e+00,   6.66137409e+00],
       [  1.11086989e+00,  -1.12748227e+01],
       [ -1.60772481e+01,   8.80348065e+00],
       [ -4.29292484e+00,   1.64598738e+00],
       [  8.46555570e+00,  -1.44406425e+00],
       [  2.54389177e+01,  -4.71895192e+00],
       [  9.67603416e+00,   3.25855752e+00],
       [ -2.40258078e+00,   1.81492883e+01],
       [  

In [57]:
# Returns the t statistic for each column together with its p-value
stats.ttest_1samp(rvs,5.0)

Ttest_1sampResult(statistic=array([-0.68014479, -0.04323899]), pvalue=array([ 0.49961383,  0.96568674]))
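
To connect the result back to the decision rule from the introduction, the sketch below compares each p-value with a significance level. It also runs `scipy.stats.ttest_ind`, the function in the first link above, on the two columns. The value `alpha = 0.05` and the choice to compare the two columns with each other are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(7654567)  # same seed as the cell above
rvs = stats.norm.rvs(loc=5, scale=10, size=(50, 2))

alpha = 0.05  # assumed significance level

# One-sample test per column: is the population mean equal to 5.0?
result = stats.ttest_1samp(rvs, 5.0)
reject_one_sample = result.pvalue < alpha

# Two-sample test: do the two columns share the same population mean?
t_ind, p_ind = stats.ttest_ind(rvs[:, 0], rvs[:, 1])
reject_two_sample = p_ind < alpha

print(reject_one_sample, reject_two_sample)
```

In the output shown above, both p-values (about 0.50 and 0.97) are well above 0.05, so we fail to reject the null hypothesis that each column has mean 5.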