### sklearn.model_selection.KFold

* _class_ sklearn.model_selection.KFold(_n_splits=5_, _*_, _shuffle=False_, _random_state=None_)[[source]](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/model_selection/_split.py#L365)

Parameters:

**n_splits** : int, default=5

Number of folds. Must be at least 2.

Changed in version 0.22: `n_splits` default value changed from 3 to 5.

**shuffle** : bool, default=False

Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.

**random_state** : int, RandomState instance or None, default=None

When `shuffle` is True, `random_state` affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. See [Glossary](https://scikit-learn.org/1.1/glossary.html#term-random_state).
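A minimal sketch of how `shuffle` and `random_state` interact, assuming a toy array of ten samples (the array `X` and the variable names are illustrative and not taken from the scikit-learn documentation):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy data: 10 samples, 1 feature (assumption)

# Without shuffling, the test folds are consecutive blocks of indices.
plain = [test.tolist() for _, test in KFold(n_splits=5).split(X)]

# With shuffling, a fixed random_state makes the folds reproducible.
shuffled_a = [test.tolist()
              for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
shuffled_b = [test.tolist()
              for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]

print(plain)                     # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print(shuffled_a == shuffled_b)  # True: same seed, same folds
```

Either way, every sample lands in exactly one test fold; shuffling only changes which samples are grouped together.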


### sklearn.model_selection.train_test_split

* sklearn.model_selection.train_test_split(_*arrays_, _test_size=None_, _train_size=None_, _random_state=None_, _shuffle=True_, _stratify=None_)[[source]](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/model_selection/_split.py#L2349)
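As a rough illustration of the signature above, the sketch below splits the iris data 80/20. Passing `stratify=y` is an assumption added here for illustration (the notebook cell that follows does not use it); it keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 holds out 30 of the 150 samples; stratify=y keeps classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=121, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]
```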

In [2]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
features = iris.data
label = iris.target

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=121)
dt_clf = DecisionTreeClassifier(random_state=0)

# 5-fold splitter and a list that collects the accuracy of each fold
kfold = KFold(n_splits=5)
cv_accuracy = []
print("Iris dataset size:", features.shape[0])

Iris dataset size: 150


In [3]:
kfold.split(X_train) 

<generator object _BaseKFold.split at 0x0000024429CECA60>

In [4]:
n_iter = 0

In [5]:
for i in kfold.split(X_train) :
    print(i)

(array([ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,
        37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,
        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,
        76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,
        89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101,
       102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
       115, 116, 117, 118, 119]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23]))
(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  48,  49,
        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,
        76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86
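Each tuple printed above is a `(train_indices, test_indices)` pair, produced lazily by the generator. A small sketch, assuming a placeholder array `X_demo` with the same 120 rows as `X_train`, confirms the fold sizes:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((120, 4))  # placeholder with the same shape as X_train (assumption)

# list() consumes the generator and materializes all five folds.
folds = list(KFold(n_splits=5).split(X_demo))
train_idx, test_idx = folds[0]

print(len(folds))                    # 5
print(len(train_idx), len(test_idx)) # 96 24
print(test_idx[0], test_idx[-1])     # 0 23
```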

In [67]:
# Reset the counters so that re-running this cell does not accumulate stale results
n_iter = 0
cv_accuracy = []

for train_index, test_index in kfold.split(features):
    # Select the training and validation rows of this fold
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]

    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    n_iter += 1

    accuracy = np.round(accuracy_score(y_test, pred), 4)
    print(accuracy)
    cv_accuracy.append(accuracy)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print(f"{n_iter} cross-validation accuracy: {accuracy}, training data size: {train_size}, "
          f"validation data size: {test_size}")
    print(f"{n_iter} validation set indices: {test_index}")

1.0
1 cross-validation accuracy: 1.0, training data size: 120, validation data size: 30
1 validation set indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
0.9667
2 cross-validation accuracy: 0.9667, training data size: 120, validation data size: 30
2 validation set indices: [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
 54 55 56 57 58 59]
0.8333
3 cross-validation accuracy: 0.8333, training data size: 120, validation data size: 30
3 validation set indices: [60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89]
0.9333
4 cross-validation accuracy: 0.9333, training data size: 120, validation data size: 30
4 validation set indices: [ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119]
0.8
5 cross-validation accuracy: 0.8, training data size: 120, validation data size: 30
5 validation set indices: [120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 13

In [69]:
cv_accuracy

[1.0, 0.9667, 0.8333, 0.9333, 0.8]

In [24]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [70]:
np.round(np.mean(cv_accuracy), 4)

0.9067
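The manual loop above can also be written more compactly with `cross_val_score`, which clones the estimator, fits it on each training fold, and returns one score per fold. This is a sketch of an alternative, not the approach used in the cells above; the variable names `clf` and `scores` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold, using the same unshuffled 5-fold splitter.
scores = cross_val_score(clf, X, y, scoring="accuracy", cv=KFold(n_splits=5))

print(np.round(scores, 4))         # per-fold accuracy
print(np.round(scores.mean(), 4))  # mean accuracy over the folds
```

Because the scores are recomputed on every call, re-running the cell cannot leave stale values behind the way an externally defined list can.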

### 2. pandas.DataFrame

* _class_ pandas.DataFrame(_data=None_, _index=None_, _columns=None_, _dtype=None_, _copy=None_)[[source]](https://github.com/pandas-dev/pandas/blob/v1.5.3/pandas/core/frame.py#L475-L11996)

In [18]:
import pandas as pd

In [32]:
iris_df = pd.DataFrame(iris.data, columns= iris.feature_names)

In [33]:
iris_df

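| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |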


In [35]:
iris_df["label"] = iris.target

In [36]:
iris_df

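| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |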


In [37]:
iris_df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [38]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   label              150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
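`describe()` and `info()` summarize the columns but do not show how many samples belong to each class. A short sketch using `value_counts()` on the `label` column (the DataFrame is rebuilt here so the example runs on its own; `class_counts` is an illustrative name):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["label"] = iris.target

# Count the samples per class, ordered by class label.
counts = iris_df["label"].value_counts().sort_index()
class_counts = {int(k): int(v) for k, v in counts.items()}

print(class_counts)  # {0: 50, 1: 50, 2: 50}
```

The classes are perfectly balanced, and the rows are stored in class order, which matters when an unshuffled splitter cuts the data into consecutive blocks.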


In [94]:
X_train.shape

(120, 4)

### sklearn.model_selection.StratifiedKFold

* _class_ sklearn.model_selection.StratifiedKFold(_n_splits=5_, _*_, _shuffle=False_, _random_state=None_)[[source]](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/model_selection/_split.py#L581)

Parameters:

**n_splits** : int, default=5

Number of folds. Must be at least 2.

Changed in version 0.22: `n_splits` default value changed from 3 to 5.

**shuffle** : bool, default=False

Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

**random_state** : int, RandomState instance or None, default=None

When `shuffle` is True, `random_state` affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave `random_state` as `None`. Pass an int for reproducible output across multiple function calls. See [Glossary](https://scikit-learn.org/1.1/glossary.html#term-random_state).
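To see why stratification matters for data stored in class order, the sketch below compares the per-class counts in each test fold for `KFold` and `StratifiedKFold` on the iris labels. The helper names `kfold_counts` and `skf_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # labels are sorted: 50 samples per class

# Per-class sample counts in each test fold.
kfold_counts = [np.bincount(y[test], minlength=3).tolist()
                for _, test in KFold(n_splits=3).split(X)]
skf_counts = [np.bincount(y[test], minlength=3).tolist()
              for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print(kfold_counts)  # [[50, 0, 0], [0, 50, 0], [0, 0, 50]]
print(skf_counts)    # every fold contains all three classes in near-equal numbers
```

With unshuffled `KFold`, each test fold holds a single class that the model never saw during training, so the scores are meaningless. `StratifiedKFold` needs the labels as the second argument of `split` so that it can preserve the class proportions.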

In [89]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
n_iter = 0

for train_index, test_index in skf.split(iris.data, iris.target):
    n_iter += 1
    # Index the labels (iris.target), not the features, to inspect the class distribution
    label_train = iris.target[train_index]
    label_test = iris.target[test_index]
    # np.bincount counts the samples of class 0, 1 and 2 in each split
    print(f"Cross-validation: {n_iter}, training label distribution: {np.bincount(label_train)}, "
          f"validation label distribution: {np.bincount(label_test)}")

Cross-validation: 1, training label distribution: [33 33 34], validation label distribution: [17 17 16]
Cross-validation: 2, training label distribution: [33 34 33], validation label distribution: [17 16 17]
Cross-validation: 3, training label distribution: [34 33 33], validation label distribution: [16 17 17]

In [93]:
cv_accuracy = []
n_iter = 0
skf = StratifiedKFold(n_splits=5)

# split() needs the labels as well so that each fold keeps the class proportions
for train_index, test_index in skf.split(iris.data, iris.target):
    X_train, X_test = iris.data[train_index], iris.data[test_index]
    y_train, y_test = iris.target[train_index], iris.target[test_index]

    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    n_iter += 1

    accuracy = np.round(accuracy_score(y_test, pred), 4)
    print(accuracy)
    cv_accuracy.append(accuracy)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print(f"{n_iter} cross-validation accuracy: {accuracy}, training data size: {train_size}, "
          f"validation data size: {test_size}")
    print(f"{n_iter} validation set indices: {test_index}")

0.9667
1 cross-validation accuracy: 0.9667, training data size: 120, validation data size: 30
1 validation set indices: [  0   1   2   3   4   5   6   7   8   9  50  51  52  53  54  55  56  57
  58  59 100 101 102 103 104 105 106 107 108 109]
0.9667
2 cross-validation accuracy: 0.9667, training data size: 120, validation data size: 30
2 validation set indices: [ 10  11  12  13  14  15  16  17  18  19  60  61  62  63  64  65  66  67
  68  69 110 111 112 113 114 115 116 117 118 119]
0.9
3 cross-validation accuracy: 0.9, training data size: 120, validation data size: 30
3 validation set indices: [ 20  21  22  23  24  25  26  27  28  29  70  71  72  73  74  75  76  77
  78  79 120 121 122 123 124 125 126 127 128 129]
0.9667
4 cross-validation accuracy: 0.9667, training data size: 120, validation data size: 30
4 validation set indices: [ 30  31  32  33  34  35  36  37  38  39  80  81  82  83  84  85  86  87
  88  89 130 131 132 133 134 135 136 137 138 139]
1.0
5 cross-validation accuracy: 1.0, training data size: 120, validation data size: 30
5 validation set indices

In [95]:
np.mean(cv_accuracy)

0.9600200000000001