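# Stratified Sampling
The main purpose of stratified sampling is to ensure that every class is represented by a reasonable number of samples in each split, and that the class distribution in each split matches the distribution of the original data. scikit-learn's `StratifiedKFold` aims to: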
* Generate test sets such that all contain the same distribution of classes, or as close as possible.
* Be invariant to class label: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
* Preserve order dependencies in the dataset ordering, when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
* Generate test sets where the smallest and largest differ by at most one sample.
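
The properties listed above can be checked on a small example. The sketch below is illustrative only: the toy labels, the names `y_str`, `y_int`, and `X_toy`, and the choice of `n_splits=3` are assumptions made for the demonstration, and it assumes a reasonably recent scikit-learn release in which the label-invariance property holds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical data): 7 samples of one class, 4 of the other.
y_str = np.array(["Happy"] * 7 + ["Sad"] * 4)
y_int = np.where(y_str == "Happy", 1, 0)  # same classes, relabelled as 1 / 0
X_toy = np.zeros((len(y_str), 1))         # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=3)  # shuffle=False, so the split is deterministic

# Collect the test indices of every fold for both labellings.
folds_str = [test for _, test in skf.split(X_toy, y_str)]
folds_int = [test for _, test in skf.split(X_toy, y_int)]

# Property: relabelling the classes should not change the generated indices.
same_indices = all(np.array_equal(a, b) for a, b in zip(folds_str, folds_int))

# Property: the smallest and largest test sets differ by at most one sample.
fold_sizes = [len(test) for test in folds_str]

print("same test indices after relabelling:", same_indices)
print("test fold sizes:", fold_sizes)
```

Using `shuffle=False` keeps the comparison deterministic, so any difference between the two runs could only come from the labels themselves.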

In [1]:
import numpy as np

In [2]:
X = np.random.rand(1000, 2)  # 1000 samples with 2 random features

In [3]:
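y = np.random.rand(1000)
y = y < 0.01       # boolean mask: True for roughly 1% of the samples
y = np.float32(y)  # convert the mask to 0.0 / 1.0 labels
# Note: the printed value is a fraction (0.013 means 1.3%), not a percentage
print("the % of y==1 is ", y.sum() / 1000)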

the % of y==1 is  0.013
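
With a positive class this rare, an unstratified split can easily leave one fold with almost no positives. The following sketch contrasts scikit-learn's plain `KFold` with `StratifiedKFold` on a deliberately extreme, hypothetical dataset in which all ten positives sit at the start of the array; the names `y_sorted`, `X_sorted`, and `positives_per_test_fold` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical worst case: all 10 positives sit at the start of the array.
y_sorted = np.zeros(1000, dtype=np.float32)
y_sorted[:10] = 1
X_sorted = np.zeros((1000, 2))  # placeholder features

def positives_per_test_fold(splitter):
    """Count the positive labels in each test fold produced by `splitter`."""
    return [int(y_sorted[test].sum())
            for _, test in splitter.split(X_sorted, y_sorted)]

# Without shuffling, KFold cuts the data into contiguous blocks.
plain_counts = positives_per_test_fold(KFold(n_splits=2))
# StratifiedKFold spreads each class evenly across the folds.
stratified_counts = positives_per_test_fold(StratifiedKFold(n_splits=2))

print("KFold positives per test fold:          ", plain_counts)
print("StratifiedKFold positives per test fold:", stratified_counts)
```

In this contrived case the plain split places every positive in the first test fold, while the stratified split gives each test fold five. Real data is rarely this extreme, but the same effect appears in milder form whenever a class is rare.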


# Use StratifiedKFold to keep the distribution of y unchanged after sampling

In [8]:
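import numpy as np
from sklearn.model_selection import StratifiedKFold

# Two folds; with the default shuffle=False the split is deterministic
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)  # returns the number of splits (2); the value is not used here

print(skf)

# Each iteration overwrites the previous split, so after the loop
# X_train / X_test / y_train / y_test hold the last fold only
for train_index, test_index in skf.split(X, y):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]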



StratifiedKFold(n_splits=2, random_state=None, shuffle=False)


In [9]:
print("the % of train_y==1 is " ,(y_train==1).sum()/y_train.shape[0])

the % of train_y==1 is  0.012


In [10]:
print("the % of test_y==1 is " ,(y_test==1).sum()/y_test.shape[0])

the % of test_y==1 is  0.014
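
The cells above only inspect the last fold, because the loop overwrites `y_train` and `y_test` on every iteration. A minimal sketch of checking every fold is shown below. It regenerates its own data with a fixed seed and a 5% positive rate, both of which are assumptions made so the example is self-contained and reproducible; the names `X_demo`, `y_demo`, and `test_rates` are likewise hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
X_demo = rng.random((1000, 2))
y_demo = (rng.random(1000) < 0.05).astype(np.float32)  # roughly 5% positives

skf = StratifiedKFold(n_splits=2)
overall_rate = float(y_demo.mean())

test_rates = []
for fold, (train_index, test_index) in enumerate(skf.split(X_demo, y_demo)):
    # Fraction of positives in this fold's train and test parts
    train_rate = float(y_demo[train_index].mean())
    test_rate = float(y_demo[test_index].mean())
    test_rates.append(test_rate)
    print(f"fold {fold}: train rate = {train_rate:.3f}, test rate = {test_rate:.3f}")

print(f"overall rate = {overall_rate:.3f}")
```

Each fold's rate should sit within rounding distance of the overall rate, since the positives can only be divided between the two folds in whole samples.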
