### Cross-validation features in scikit-learn

    - Simple data splitting
    - Preparing multiple test sets (cross validation generator)
    - Repeating the evaluation over multiple test sets (cross validation calculator)


 


### 1. Simple data splitting

`train_test_split(X, y, test_size, train_size, random_state)`

In [2]:
import numpy as np

In [3]:
X = np.arange(10).reshape((5,2))
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [5]:
y = np.arange(5)
y

array([0, 1, 2, 3, 4])

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [8]:
X_train

array([[4, 5],
       [0, 1],
       [6, 7]])

In [9]:
X_test

array([[2, 3],
       [8, 9]])

In [10]:
y_test

array([1, 4])

In [11]:
y_train

array([2, 0, 3])
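
As a small check of the parameters in the signature above, the sketch below reuses the same toy `X` and `y` and illustrates two points: a fixed `random_state` should reproduce the same split on every call, and `test_size` can also be given as an absolute number of samples instead of a fraction. The names `first`, `second`, and `same_split` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Two calls with the same random_state should return identical splits.
first = train_test_split(X, y, test_size=0.33, random_state=42)
second = train_test_split(X, y, test_size=0.33, random_state=42)
same_split = all(np.array_equal(p, q) for p, q in zip(first, second))
print(same_split)

# test_size given as an integer means "this many test samples".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=42
)
print(X_train.shape, X_test.shape)
```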

- - -

### 2. Cross validation generator

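    - Each generator provides a split method that returns an iterator yielding the index arrays for the training and test data.
    
    - KFold CV : splits the dataset into K subsets; one subset is the test set and the rest are the training set, and the test subset is rotated until every subset has been used once
    - Leave One Out(LOO) : keeps only a single sample as the test set in each split.
    - Shuffle Split : draws each split at random, so the same sample may appear in the test set of more than one split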
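
To compare the three generators listed above side by side, here is a minimal sketch on a 20-sample toy array. It uses `get_n_splits` to report how many splits each generator produces and looks at the sizes of the first train/test pair. The dictionary names (`generators`, `n_splits`, `first_sizes`) and the choice of `n_splits=5` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

X = np.arange(40).reshape(-1, 2)  # 20 samples, 2 features

generators = {
    "KFold": KFold(n_splits=5),
    "LeaveOneOut": LeaveOneOut(),
    "ShuffleSplit": ShuffleSplit(n_splits=5, test_size=0.5, random_state=0),
}

n_splits = {}
first_sizes = {}
for name, cv in generators.items():
    # Number of train/test pairs this generator will yield for X.
    n_splits[name] = cv.get_n_splits(X)
    # split() yields (train_index, test_index) pairs of integer arrays.
    train_index, test_index = next(cv.split(X))
    first_sizes[name] = (len(train_index), len(test_index))
    print(name, n_splits[name], first_sizes[name])
```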

#### KFold CV

In [12]:
N = 5
X = np.arange(8 * N).reshape(-1, 2) * 10
y = np.hstack([np.ones(N), np.ones(N) * 2, np.ones(N) * 3, np.ones(N) * 4])
print("X:\n", X, sep="")
print("y:\n", y, sep="")

X:
[[  0  10]
 [ 20  30]
 [ 40  50]
 [ 60  70]
 [ 80  90]
 [100 110]
 [120 130]
 [140 150]
 [160 170]
 [180 190]
 [200 210]
 [220 230]
 [240 250]
 [260 270]
 [280 290]
 [300 310]
 [320 330]
 [340 350]
 [360 370]
 [380 390]]
y:
[1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4.]


In [21]:
from sklearn.model_selection import KFold

cv = KFold(n_splits=7, shuffle=True, random_state=0) # this is the key line

for train_index, test_index in cv.split(X):
    print("test index : ", test_index)
    print("."*60)
    print("train index : ", train_index)
    print("="*60)

test index :  [ 1 18 19]
............................................................
train index :  [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
test index :  [ 8 10 17]
............................................................
train index :  [ 0  1  2  3  4  5  6  7  9 11 12 13 14 15 16 18 19]
test index :  [ 4  6 13]
............................................................
train index :  [ 0  1  2  3  5  7  8  9 10 11 12 14 15 16 17 18 19]
test index :  [ 2  5 14]
............................................................
train index :  [ 0  1  3  4  6  7  8  9 10 11 12 13 15 16 17 18 19]
test index :  [ 7  9 16]
............................................................
train index :  [ 0  1  2  3  4  5  6  8 10 11 12 13 14 15 17 18 19]
test index :  [ 0  3 11]
............................................................
train index :  [ 1  2  4  5  6  7  8  9 10 12 13 14 15 16 17 18 19]
test index :  [12 15]
......................................................
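
The printed folds suggest that every index lands in exactly one test fold and that the folds differ in size by at most one sample (20 samples do not divide evenly by 7). The sketch below checks this on the same `X`; `test_counts` and `fold_sizes` are illustrative helper names.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(-1, 2) * 10  # same 20-sample X as above
cv = KFold(n_splits=7, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)  # times each sample is tested
fold_sizes = []
for train_index, test_index in cv.split(X):
    # Train and test indices never overlap within a split.
    assert np.intersect1d(train_index, test_index).size == 0
    test_counts[test_index] += 1
    fold_sizes.append(len(test_index))

print(fold_sizes)   # [3, 3, 3, 3, 3, 3, 2]
print(test_counts)  # every entry is 1
```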

#### Leave One Out (LOO)

In [16]:
from sklearn.model_selection import LeaveOneOut

cv = LeaveOneOut() # this is the key line

for train_index, test_index in cv.split(X):
    print("test index : ", test_index)
    print("."*60)
    print("train index : ", train_index)
    print("="*60)

test index :  [0]
............................................................
train index :  [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [1]
............................................................
train index :  [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [2]
............................................................
train index :  [ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [3]
............................................................
train index :  [ 0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [4]
............................................................
train index :  [ 0  1  2  3  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [5]
............................................................
train index :  [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
test index :  [6]
............................................................
tra
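
One way to read Leave One Out is as KFold with as many folds as there are samples and no shuffling. The sketch below compares the test indices produced by both generators on the same `X`; `loo_tests` and `kfold_tests` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(40).reshape(-1, 2) * 10  # same 20-sample X as above

# Collect the test indices of every split as plain lists.
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X)]
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(X)).split(X)]

print(len(loo_tests))            # 20 splits, one per sample
print(loo_tests == kfold_tests)  # True
```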

#### Shuffle Split

In [23]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, test_size=.5, random_state=0)

for train_index, test_index in cv.split(X):
    print("test X:\n", X[test_index])
    print("=" *20)

test X:
 [[360 370]
 [ 20  30]
 [380 390]
 [160 170]
 [200 210]
 [340 350]
 [120 130]
 [260 270]
 [ 80  90]
 [ 40  50]]
test X:
 [[220 230]
 [ 20  30]
 [360 370]
 [340 350]
 [ 40  50]
 [240 250]
 [380 390]
 [320 330]
 [200 210]
 [  0  10]]
test X:
 [[300 310]
 [260 270]
 [240 250]
 [100 110]
 [220 230]
 [ 40  50]
 [160 170]
 [120 130]
 [ 60  70]
 [340 350]]
test X:
 [[360 370]
 [  0  10]
 [260 270]
 [ 40  50]
 [ 60  70]
 [340 350]
 [140 150]
 [240 250]
 [280 290]
 [320 330]]
test X:
 [[140 150]
 [ 20  30]
 [ 40  50]
 [380 390]
 [100 110]
 [360 370]
 [160 170]
 [340 350]
 [300 310]
 [320 330]]
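
Unlike KFold, the test sets above overlap from one split to the next. The sketch below counts how often each sample is drawn into a test set across the five splits. The exact counts depend on the random draw, so no output is shown; `test_counts` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(40).reshape(-1, 2) * 10  # same 20-sample X as above
cv = ShuffleSplit(n_splits=5, test_size=.5, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in cv.split(X):
    # Within one split, train and test are still disjoint.
    assert np.intersect1d(train_index, test_index).size == 0
    test_counts[test_index] += 1

# 5 splits x 10 test samples = 50 test slots for only 20 samples,
# so some samples must be reused across splits.
print(test_counts)
```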


- - -

### 3. Cross validation calculation: running the cross-evaluation

- A cv generator only splits the dataset.
- A cv calculator has to repeat the evaluation on the resulting splits so that the mean and variance of $R^{2}$ can be computed.

`cross_val_score(estimator, X, y=None, scoring=None, cv=None)`
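
The `cv` argument in this signature accepts a generator object, and it can also be a plain integer. For a regressor such as `LinearRegression`, an integer should be treated as an unshuffled `KFold` with that many folds. The sketch below compares the two forms; the data settings and the names `scores_int` and `scores_kfold` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=0)
model = LinearRegression()

# cv as an integer versus cv as an explicit generator object.
scores_int = cross_val_score(model, X, y, scoring="r2", cv=10)
scores_kfold = cross_val_score(model, X, y, scoring="r2", cv=KFold(10))

print(np.allclose(scores_int, scores_kfold))
```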

In [31]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y, coef = make_regression(n_samples=100, n_features=1, noise=20, coef=True, random_state=0)

model = LinearRegression()
cv = KFold(10)

scores = np.zeros(10)

for i, (train_index, test_index) in enumerate(cv.split(X)):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[i] = r2_score(y_test, y_pred)

scores

array([0.48150779, 0.88388748, 0.1119309 , 0.57355877, 0.80187345,
       0.75415636, 0.90806019, 0.79230414, 0.71207105, 0.47630251])

In [33]:
# The longer loop above can be replaced by the short call below.

from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, scoring="r2", cv=cv)

array([0.48150779, 0.88388748, 0.1119309 , 0.57355877, 0.80187345,
       0.75415636, 0.90806019, 0.79230414, 0.71207105, 0.47630251])
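
The notes above describe the goal as the mean and variance of $R^{2}$ across the splits, so a natural last step is to summarize the score array. The sketch below repeats the same setup and reduces the ten scores to those two numbers; `mean_r2` and `var_r2` are illustrative names, and no output is shown because the exact values may vary with the library version.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y, coef = make_regression(
    n_samples=100, n_features=1, noise=20, coef=True, random_state=0
)
model = LinearRegression()
cv = KFold(10)

scores = cross_val_score(model, X, y, scoring="r2", cv=cv)

mean_r2 = scores.mean()  # average R^2 over the 10 folds
var_r2 = scores.var()    # spread of R^2 between folds
print(mean_r2, var_r2)
```

A large variance relative to the mean is a hint that the score depends strongly on which samples end up in the test fold.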