<a href="https://colab.research.google.com/github/oymin2001/DataScience/blob/main/CrossValidation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, StratifiedGroupKFold, ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit, RepeatedKFold, LeaveOneOut, LeavePOut, LeaveOneGroupOut, LeavePGroupsOut, TimeSeriesSplit
import numpy as np

# KFold

Splits the dataset into k folds. On each iteration the model is trained on (k-1) folds and the remaining fold is used as the test set.

In [None]:
X = np.array(range(10))
kf = KFold(n_splits=2)
for i, (train, test) in enumerate(kf.split(X)):
    print('Validation %s:'%i)
    print('Train idx: %s, Test idx: %s'%(train, test))

Validation 0:
Train idx: [5 6 7 8 9], Test idx: [0 1 2 3 4]
Validation 1:
Train idx: [0 1 2 3 4], Test idx: [5 6 7 8 9]


# StratifiedKFold

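Used mainly for classification problems where the class distribution is imbalanced. Folds are assigned so that each one preserves the label proportions.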

In [None]:
label = np.array([0,0,0,0,0,0,0,0,1,1])
skf = StratifiedKFold(n_splits=2)
for i, (train, test) in enumerate(skf.split(X, label)):
    print('Validation %s:'%i)
    print('Train idx: %s, Test idx: %s'%(train, test))
    print('Fold_train: ', label[train])
    print('Fold_test: ', label[test])
    print('==========================================')

Validation 0:
Train idx: [4 5 6 7 9], Test idx: [0 1 2 3 8]
Fold_train:  [0 0 0 0 1]
Fold_test:  [0 0 0 0 1]
==========================================
Validation 1:
Train idx: [0 1 2 3 8], Test idx: [4 5 6 7 9]
Fold_train:  [0 0 0 0 1]
Fold_test:  [0 0 0 0 1]
==========================================
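
A short comparison shows why stratification matters for these labels. The sketch counts the label-1 samples that land in each test fold for plain `KFold` and for `StratifiedKFold`. `positive_counts` is a hypothetical helper introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

label = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(len(label))

def positive_counts(splitter):
    """Number of label-1 samples in each test fold."""
    return [int(label[test].sum()) for _, test in splitter.split(X, label)]

# Unshuffled KFold cuts the data in order, so both positives fall in the last fold.
kfold_counts = positive_counts(KFold(n_splits=2))
# StratifiedKFold spreads the positives evenly across the folds.
stratified_counts = positive_counts(StratifiedKFold(n_splits=2))

print('KFold:', kfold_counts)
print('StratifiedKFold:', stratified_counts)
```

With plain `KFold`, the first test fold contains no positive sample at all, so any metric computed on it says nothing about the minority class.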


# Repeated KFold

The n_repeats parameter repeats KFold several times, with a different randomization in each repetition.

In [None]:
X = np.array(range(6))
rkf = RepeatedKFold(n_splits=2, n_repeats=4, random_state=42)
for cnt, (train, test) in enumerate(rkf.split(X), start=1):
    print('train: %s, test: %s'%(train, test))
    if cnt % 2 == 0:  # every n_splits folds completes one repetition
        print('================repeated============')

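train: [2 3 4], test: [0 1 5]
train: [0 1 5], test: [2 3 4]
================repeated============
train: [2 4 5], test: [0 1 3]
train: [0 1 3], test: [2 4 5]
================repeated============
train: [2 3 5], test: [0 1 4]
train: [0 1 4], test: [2 3 5]
================repeated============
train: [1 2 4], test: [0 3 5]
train: [0 3 5], test: [1 2 4]
================repeated============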
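
As a quick sanity check on the structure of the repetitions, the sketch below confirms that the total number of splits is `n_splits * n_repeats` and that the test folds inside each repetition cover every sample exactly once. The variable names (`covered`, `total`) are illustrative only.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(6)
n_splits, n_repeats = 2, 4
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)

splits = list(rkf.split(X))
total = rkf.get_n_splits(X)  # n_splits * n_repeats

# Group the splits by repetition and collect the test indices of each one.
covered = []
for r in range(n_repeats):
    block = splits[r * n_splits:(r + 1) * n_splits]
    test_idx = np.concatenate([test for _, test in block])
    covered.append(sorted(test_idx.tolist()))

print('total splits:', total)
print('test coverage per repetition:', covered)
```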


# GroupKFold

Used when the data is organized into groups.

It prevents the same group from appearing in both the train data and the test data within any iteration.

In [None]:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
gkf.get_n_splits(X,y,groups)

3

- Group1 idx = 0:2
- Group2 idx = 3:5
- Group3 idx = 6:9

In [None]:
for idx, (train, test) in enumerate(gkf.split(X,y,groups)):
    print('Fold %s:'%idx)
    print("Train index: %s, Test index: %s"%(train,test))

Fold 0:
Train index: [0 1 2 3 4 5], Test index: [6 7 8 9]
Fold 1:
Train index: [0 1 2 6 7 8 9], Test index: [3 4 5]
Fold 2:
Train index: [3 4 5 6 7 8 9], Test index: [0 1 2]
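
The no-leakage property can be checked directly instead of read off the indices. This small sketch intersects the set of train groups with the set of test groups in every fold. It uses the same `groups` as above, with a placeholder `X`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(10)  # placeholder features; only the length matters here
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

overlaps = []
for train, test in GroupKFold(n_splits=3).split(X, groups=groups):
    # groups that appear on both sides of the split (should be none)
    shared = set(groups[train].tolist()) & set(groups[test].tolist())
    overlaps.append(shared)

print(overlaps)
```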


In [None]:
gskf = StratifiedGroupKFold(n_splits=3)
y = np.array([0,0,1,0,0,1,0,0,1,0]) # 0:1 = 7:3
print("Datasets prop:", (len(y[y==1]) / len(y)))
groups = np.array(groups)
for idx, (train, test) in enumerate(gskf.split(X,y,groups)):
    print('Fold %s:'%idx)
    y_train = y[train]
    prop = len(y_train[y_train == 1]) / len(y_train)
    print('Train Group: %s, Test Group: %s'%(groups[train], groups[test]))
    print("train %s prop: %s"%(idx, np.round(prop,2)))
    print('====================================================')

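Datasets prop: 0.3
Fold 0:
Train Group: [1 1 1 2 2 2], Test Group: [3 3 3 3]
train 0 prop: 0.33
====================================================
Fold 1:
Train Group: [1 1 1 3 3 3 3], Test Group: [2 2 2]
train 1 prop: 0.29
====================================================
Fold 2:
Train Group: [2 2 2 3 3 3 3], Test Group: [1 1 1]
train 2 prop: 0.29
====================================================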


With StratifiedGroupKFold, no group appears in both train and test on any iteration, and the folds are also assigned so that the label proportions are preserved as closely as possible.

# Shuffle & Split

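KFold partitions the data into K folds and builds the train/test sets from them. Shuffle & Split does not partition the data into K folds. Instead, it draws a random test set on each iteration, so test sets from different iterations can overlap.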

In [None]:
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=42)
for train, test in ss.split(X):
    print("Train idx: %s, Test idx: %s"%(train, test))

Train idx: [0 7 2 9 4 3 6], Test idx: [8 1 5]
Train idx: [5 3 4 7 9 6 2], Test idx: [0 1 8]
Train idx: [6 8 5 3 7 1 4], Test idx: [9 2 0]
Train idx: [2 8 0 3 4 5 9], Test idx: [1 7 6]
Train idx: [8 0 7 6 3 2 9], Test idx: [1 5 4]
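
The overlap between iterations is visible in the output above (index 1 is tested several times). A small sketch, using a placeholder `X` of 10 samples, counts how often each sample is drawn into a test set. `times_tested` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=42)

test_sizes = []
times_tested = np.zeros(len(X), dtype=int)
for train, test in ss.split(X):
    test_sizes.append(len(test))
    times_tested[test] += 1  # each index appears at most once per test set

print('test set sizes:', test_sizes)
print('times each sample was tested:', times_tested)
```

With 5 iterations of 3 test samples each and only 10 samples in total, some sample has to be tested more than once, and others may never be tested at all. This is the main practical difference from KFold.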


In [None]:
# GroupShuffleSplit: Shuffle&Split + LeavePGroupsOut
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gss = GroupShuffleSplit(n_splits=4, test_size=0.3, random_state=42)

for train, test in gss.split(X,y,groups=groups):
    print("Train idx: %s, Test idx: %s"%(train, test))

Train idx: [3 4 5 6 7 8 9], Test idx: [0 1 2]
Train idx: [0 1 2 6 7 8 9], Test idx: [3 4 5]
Train idx: [3 4 5 6 7 8 9], Test idx: [0 1 2]
Train idx: [0 1 2 6 7 8 9], Test idx: [3 4 5]
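
One detail worth checking here: to my understanding of the scikit-learn documentation, a float `test_size` in `GroupShuffleSplit` is a proportion of groups (rounded up), not of samples. The sketch below lists which groups are held out in each split. `test_groups` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(10)  # placeholder features
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

gss = GroupShuffleSplit(n_splits=4, test_size=0.3, random_state=42)

test_groups = []
leaks = []
for train, test in gss.split(X, groups=groups):
    held_out = set(groups[test].tolist())
    test_groups.append(sorted(held_out))
    # a leak would be a group present on both sides of the split
    leaks.append(held_out & set(groups[train].tolist()))

print(test_groups)
```

With 3 groups and `test_size=0.3`, each split holds out a single whole group, which is why the test sets above contain 3 samples rather than exactly 30% of the rows.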


# LeaveOneOut

Uses a single sample as the test data on each iteration. Suitable when the dataset is small.

In [None]:
X = np.array(range(6))
loo = LeaveOneOut()
for train, test in loo.split(X):
    print('train: %s, test: %s'%(train, test))

train: [1 2 3 4 5], test: [0]
train: [0 2 3 4 5], test: [1]
train: [0 1 3 4 5], test: [2]
train: [0 1 2 4 5], test: [3]
train: [0 1 2 3 5], test: [4]
train: [0 1 2 3 4], test: [5]
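
LeaveOneOut can be seen as the limiting case of KFold where the number of folds equals the number of samples. The sketch below compares the two on the same data. `as_lists` is a hypothetical helper used only to make the splits comparable.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6)

def as_lists(splitter):
    """Materialize a splitter's (train, test) index pairs as plain lists."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]

loo_splits = as_lists(LeaveOneOut())
kf_splits = as_lists(KFold(n_splits=len(X)))

same = loo_splits == kf_splits
print('LeaveOneOut matches KFold(n_splits=n):', same)
```

The cost is one model fit per sample, which is why this scheme is usually reserved for small datasets.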


In [None]:
# LeavePOut: uses p samples as the test data
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print('train: %s, test: %s'%(train, test))

train: [2 3 4 5], test: [0 1]
train: [1 3 4 5], test: [0 2]
train: [1 2 4 5], test: [0 3]
train: [1 2 3 5], test: [0 4]
train: [1 2 3 4], test: [0 5]
train: [0 3 4 5], test: [1 2]
train: [0 2 4 5], test: [1 3]
train: [0 2 3 5], test: [1 4]
train: [0 2 3 4], test: [1 5]
train: [0 1 4 5], test: [2 3]
train: [0 1 3 5], test: [2 4]
train: [0 1 3 4], test: [2 5]
train: [0 1 2 5], test: [3 4]
train: [0 1 2 4], test: [3 5]
train: [0 1 2 3], test: [4 5]
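
LeavePOut enumerates every possible test set of size p, so the number of splits is the binomial coefficient C(n, p). The sketch below counts the splits for a few values of p and prints them next to `math.comb`. `counts` is an illustrative name.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(6)

counts = {}
for p in (1, 2, 3):
    # count the splits by exhausting the generator
    counts[p] = sum(1 for _ in LeavePOut(p=p).split(X))
    print('p=%s: %s splits, C(%s, %s) = %s'
          % (p, counts[p], len(X), p, comb(len(X), p)))
```

The count grows combinatorially with n and p, so this splitter quickly becomes impractical beyond small datasets.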


In [None]:
# LeaveOneGroupOut: uses one group as the test data
groups = [1,1,2,2,3,3]
logo = LeaveOneGroupOut()
for train, test in logo.split(X,groups = groups):
    print('train: %s, test: %s'%(train, test))

train: [2 3 4 5], test: [0 1]
train: [0 1 4 5], test: [2 3]
train: [0 1 2 3], test: [4 5]


In [None]:
# LeavePGroupsOut: uses p groups as the test data
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X,groups = groups):
    print('train: %s, test: %s'%(train, test))

train: [4 5], test: [0 1 2 3]
train: [2 3], test: [0 1 4 5]
train: [0 1], test: [2 3 4 5]
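
The three splits above correspond to the three ways of choosing 2 groups out of 3. As a sketch, the held-out groups of each split can be compared with `itertools.combinations` over the unique group labels. `held_out` and `expected` are illustrative names.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
groups = np.array([1, 1, 2, 2, 3, 3])

lpgo = LeavePGroupsOut(n_groups=2)
# the set of groups that ends up in each test split
held_out = [tuple(sorted(set(groups[test].tolist())))
            for _, test in lpgo.split(X, groups=groups)]
# every way of picking 2 of the unique groups
expected = list(combinations(sorted(set(groups.tolist())), 2))

print('held out:', held_out)
print('expected:', expected)
```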


# Time Series Split

Used for cross-validating time series data.

**cross-validation on a rolling basis**


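iter = 1:

The first t samples are used as train data, and the next n samples (up to sample t + n) as test data.

iter = 2:

The first (t + n) samples are used as train data, and the next n samples (up to sample t + 2n) as test data.


iter = k:

The first (t + (k-1)n) samples are used as train data, and the next n samples (up to sample t + kn) as test data.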

In [None]:
X = np.array(range(8))
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train, test in tscv.split(X):
    print("%s %s" % (train, test))

[0 1] [2 3]
[0 1 2 3] [4 5]
[0 1 2 3 4 5] [6 7]
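
The rolling scheme described above can be written out by hand in a few lines of plain Python. `rolling_splits` is a hypothetical helper, not a scikit-learn function. It assumes a fixed `test_size` (the n above), no gap between train and test, and no cap on the training window, and it takes t as whatever remains before the first test block.

```python
def rolling_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) lists for an expanding training window."""
    # t: size of the first training window, so the last test block ends at n_samples
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        train_idx = list(range(start))                   # everything before the test block
        test_idx = list(range(start, start + test_size))  # the next n samples
        yield train_idx, test_idx

splits = list(rolling_splits(n_samples=8, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

For 8 samples, 3 splits and a test size of 2, this gives t = 2 and reproduces the index pattern printed by `TimeSeriesSplit` above. The training window only ever grows forward in time, so no future sample is used to predict the past.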
