
# Cross Validation

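Cross validation is a widely used model evaluation technique for estimating how well a machine learning model predicts on unseen data. Because the score no longer depends on one particular train/test split, it reduces the variance that the choice of split introduces and gives a more stable and reliable estimate of model performance.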

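## Basic cross-validation procedure:
1. **Split the dataset into K subsets (K-fold)**: K=5 and K=10 are common choices; the data is divided into K roughly equal parts.
2. **Train and test**: in each round, one subset serves as the test data and the remaining K-1 subsets serve as the training data. This yields K trained models, each evaluated on a different test subset.
3. **Compute the evaluation metric**: after each round, compute the chosen metric (accuracy, F1 score, and so on), then average the K results to obtain the model's overall performance.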
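
The three steps above can be sketched without any library. This is a minimal illustration, not how scikit-learn implements it: the helper names `k_fold_indices` and `nearest_mean_predict` and the toy data are hypothetical, and the classifier is a deliberately trivial stand-in.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Step 1: shuffle the indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def nearest_mean_predict(train_x, train_y, value):
    """Toy 1-D classifier: predict the label whose training mean is closest to value."""
    means = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(members) / len(members)
    return min(means, key=lambda label: abs(means[label] - value))

# Hypothetical toy data: one feature, two well-separated classes.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

folds = k_fold_indices(len(xs), k=5)
fold_scores = []
for test_idx in folds:
    # Step 2: one fold is held out for testing, the other k-1 folds are used for training.
    held_out = set(test_idx)
    train_idx = [i for i in range(len(xs)) if i not in held_out]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    correct = sum(nearest_mean_predict(train_x, train_y, xs[i]) == ys[i] for i in test_idx)
    fold_scores.append(correct / len(test_idx))

# Step 3: average the k per-fold accuracies.
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Every index appears in exactly one test fold, so each sample is scored once by a model that never saw it during training.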

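## Advantages of cross validation:
- **Exposes overfitting**: the model is trained and tested on several different subsets of the data, so its ability to generalize is assessed more faithfully.
- **Makes full use of the data**: every data point is used for testing exactly once and for training in the other K-1 rounds, which matters most when data is scarce.
- **More stable results**: compared with a single train/test split, the averaged score is less sensitive to how the data happened to be divided.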

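## Common cross-validation methods:
1. **K-fold cross validation**: divide the dataset into K equal parts (optionally after shuffling), use one part as the test set and the rest as the training set, and repeat K times.
2. **Leave-one-out cross validation (LOOCV)**: the extreme case of K-fold in which K equals the number of samples. Each sample serves once as the test set while all the others form the training set. It suits small datasets, because it requires one model fit per sample.
3. **Stratified K-fold cross validation**: a refinement of K-fold that keeps the class distribution of each subset close to that of the full dataset, which is particularly helpful when classes are imbalanced.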

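## Summary:
Cross validation is an important and practical technique in machine learning that makes the evaluation of a model more reliable and stable.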


In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv("../data/Iris.csv")
data.head()

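|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |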


In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

input_field = ['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
output_field = 'Species'

# train_test_split

## Method 1: split the rows first, then select the columns

In [None]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size=0.2, random_state=0)
print('Train data size', data_train.shape)
print('Test data size', data_test.shape)

Train data size (120, 6)
Test data size (30, 6)


In [None]:
train_x = data_train[input_field]
train_y = data_train[output_field]
test_x = data_test[input_field]
test_y = data_test[output_field]

In [None]:
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

(120, 4)
(120,)
(30, 4)
(30,)


## Method 2: select the columns first, then split the rows

In [None]:
x = data[input_field]
y = data[output_field]

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

(120, 4)
(120,)
(30, 4)
(30,)


## Training and testing

In [None]:
from sklearn.metrics import accuracy_score

model.fit(train_x, train_y)

pred_train = model.predict(train_x)
pred_test = model.predict(test_x)

print('The train accuracy is:', accuracy_score(train_y,pred_train))
print('The test accuracy is:', accuracy_score(test_y,pred_test))

The train accuracy is: 0.95
The test accuracy is: 0.9666666666666667
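
A score from a single split depends on which rows happen to land in the test set. As a rough sketch, assuming scikit-learn's bundled `load_iris` data stands in for `../data/Iris.csv`, the loop below repeats the same 80/20 split with several `random_state` values and collects the test accuracy of each run.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data replaces the notebook's CSV file.
X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):
    # Same split ratio every time; only the random seed changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier().fit(X_train, y_train)
    split_scores.append(accuracy_score(y_test, knn.predict(X_test)))

print('Test accuracy per seed:', split_scores)
print('Spread (max - min):', max(split_scores) - min(split_scores))
```

Any spread between the runs is the split-to-split variation that averaging over folds is meant to smooth out.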


# cross_validate

In [None]:
from sklearn.model_selection import cross_validate

x = data[input_field]
y = data[output_field]

score_type = ['accuracy','balanced_accuracy','roc_auc','f1','precision','recall','roc_auc_ovr']

cv_scores = cross_validate(model, x, y, cv=5, n_jobs=5, scoring=score_type)
cv_scores

{'fit_time': array([0.0019989 , 0.0019989 , 0.00299811, 0.00168991, 0.00199986]),
 'score_time': array([0.16244054, 0.15913296, 0.15490985, 0.09976792, 0.12406325]),
 'test_accuracy': array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ]),
 'test_balanced_accuracy': array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ]),
 'test_roc_auc': array([nan, nan, nan, nan, nan]),
 'test_f1': array([nan, nan, nan, nan, nan]),
 'test_precision': array([nan, nan, nan, nan, nan]),
 'test_recall': array([nan, nan, nan, nan, nan]),
 'test_roc_auc_ovr': array([0.97333333, 1.        , 0.99333333, 0.97      , 1.        ])}

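### The NaN scores above come from scorers such as `roc_auc`, `f1`, `precision` and `recall`, which by default expect a binary target; one workaround is to map the classes to 0 and 1 first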
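
An alternative to recoding the target, which keeps all three classes, is to request averaged scorers. The sketch below assumes scikit-learn's bundled `load_iris` data stands in for `../data/Iris.csv` and uses the built-in scorer names `f1_macro`, `precision_macro`, `recall_macro` and `roc_auc_ovr`. Macro averaging is an illustrative choice here, not something the notebook itself prescribes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data replaces the notebook's CSV file.
X, y = load_iris(return_X_y=True)

# Scorers that define how per-class results are combined for multiclass targets.
multiclass_scoring = ['accuracy', 'f1_macro', 'precision_macro',
                      'recall_macro', 'roc_auc_ovr']

scores = cross_validate(KNeighborsClassifier(), X, y, cv=5,
                        scoring=multiclass_scoring)

for name in multiclass_scoring:
    print(name, np.round(scores['test_' + name], 3))
```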

In [None]:
from sklearn.model_selection import cross_validate

# Merge setosa and versicolor into class 0 so that the target becomes binary.
target_field_map = {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1}
# target_field_map = {'Iris-setosa': 'Iris-setosa', 'Iris-versicolor': 'Iris-setosa', 'Iris-virginica': 'Iris-virginica'}

x = data[input_field]
y = data[output_field].map(target_field_map)


score_type = ['accuracy','balanced_accuracy','roc_auc','f1','precision','recall','roc_auc_ovr']

cv_scores = cross_validate(model, x, y, cv=5, n_jobs=5, scoring=score_type)
cv_scores

{'fit_time': array([0.00542116, 0.00551391, 0.00444508, 0.00444078, 0.00302482]),
 'score_time': array([0.02718616, 0.02820516, 0.02828693, 0.02416134, 0.02826953]),
 'test_accuracy': array([0.96666667, 0.96666667, 1.        , 0.86666667, 0.96666667]),
 'test_balanced_accuracy': array([0.95 , 0.95 , 1.   , 0.9  , 0.975]),
 'test_roc_auc': array([0.95  , 1.    , 1.    , 0.9925, 0.995 ]),
 'test_f1': array([0.94736842, 0.94736842, 1.        , 0.83333333, 0.95238095]),
 'test_precision': array([1.        , 1.        , 1.        , 0.71428571, 0.90909091]),
 'test_recall': array([0.9, 0.9, 1. , 1. , 1. ]),
 'test_roc_auc_ovr': array([0.95  , 1.    , 1.    , 0.9925, 0.995 ])}

# KFold


In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(data):
    print('train_index = \n', train_index)
    print('test_index = \n', test_index, '\n')

train_index = 
 [ 30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65
  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]
test_index = 
 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29] 

train_index = 
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  60  61  62  63  64  65
  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
 120 

### The indices above are contiguous blocks; since the rows are ordered by species, it is usually better to pass shuffle=True

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, random_state=0, shuffle=True)
for train_index, test_index in kf.split(data):
    print('train_index = \n', train_index)
    print('test_index = \n', test_index, '\n')

train_index = 
 [  0   1   2   3   4   5   6   9  10  11  12  13  14  15  17  18  19  20
  21  23  25  27  28  29  30  31  32  34  35  36  38  39  41  42  43  46
  47  48  49  50  52  53  55  56  57  58  59  60  61  64  65  67  68  69
  70  72  74  75  77  79  80  81  82  83  84  85  87  88  89  91  92  94
  95  96  98  99 101 102 103 104 105 106 108 109 110 111 112 113 115 116
 117 118 119 120 122 123 124 125 127 128 129 130 131 132 133 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]
test_index = 
 [  7   8  16  22  24  26  33  37  40  44  45  51  54  62  63  66  71  73
  76  78  86  90  93  97 100 107 114 121 126 134] 

train_index = 
 [  0   1   3   4   5   6   7   8   9  11  12  13  14  15  16  17  19  20
  21  22  23  24  25  26  28  29  30  31  32  33  34  35  36  37  38  39
  40  41  42  44  45  46  47  48  49  51  52  53  54  55  57  58  62  63
  64  65  66  67  68  70  71  72  73  74  75  76  77  78  79  81  82  85
  86  87  88  89  90  91  93  94  95  96  97  98

## Complete KFold example

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, random_state=0, shuffle=True)

train_score = []
test_score = []

for train_index, test_index in kf.split(data):
    train_x = data[input_field].iloc[train_index]
    train_y = data[output_field].iloc[train_index]
    test_x = data[input_field].iloc[test_index]
    test_y = data[output_field].iloc[test_index]

    model.fit(train_x, train_y)

    pred_train = model.predict(train_x)
    pred_test = model.predict(test_x)

    train_score.append(accuracy_score(train_y,pred_train))
    test_score.append(accuracy_score(test_y,pred_test))

print('The train accuracy is:', train_score)
print('The test accuracy is:', test_score)

np.mean(train_score)


The train accuracy is: [0.95, 0.9916666666666667, 0.9666666666666667, 0.9666666666666667, 0.9833333333333333]
The test accuracy is: [0.9666666666666667, 0.9, 1.0, 1.0, 0.9333333333333333]


np.float64(0.9716666666666667)
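
The manual loop can also be delegated to `cross_val_score` by passing the `KFold` object as `cv`. The sketch below assumes scikit-learn's bundled `load_iris` data stands in for `../data/Iris.csv` and checks that both routes yield the same per-fold test accuracies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data replaces the notebook's CSV file.
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Route 1: explicit loop over the folds.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    knn = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    manual_scores.append(accuracy_score(y[test_idx], knn.predict(X[test_idx])))

# Route 2: let cross_val_score run the same folds.
helper_scores = cross_val_score(KNeighborsClassifier(), X, y,
                                cv=kf, scoring='accuracy')

print('manual :', np.round(manual_scores, 4))
print('helper :', np.round(helper_scores, 4))
print('mean test accuracy:', helper_scores.mean())
```

Because `random_state` is a fixed integer, `kf.split` yields the same folds on every call, which is why the two routes agree.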

# RepeatedKFold

In [None]:
x = np.zeros(10)

In [None]:
from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=2652124)
for train_index, test_index in rkf.split(x):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [0 4 5 6 9]
test_index [1 2 3 7 8]

train_index [1 2 3 7 8]
test_index [0 4 5 6 9]

train_index [0 1 4 8 9]
test_index [2 3 5 6 7]

train_index [2 3 5 6 7]
test_index [0 1 4 8 9]

train_index [0 3 4 5 8]
test_index [1 2 6 7 9]

train_index [1 2 6 7 9]
test_index [0 3 4 5 8]



# RepeatedStratifiedKFold
Each test fold keeps approximately the same class proportions as the original data.

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

x = np.arange(0,14)
y = np.array([0,0,1,1,1,1,2,2,2,2,2,2,2,2])
rkf = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=2652124)
for train_index, test_index in rkf.split(x,y):
    print('test_index',test_index)
    print('test',y[test_index])
    print()

test_index [ 0  2  5  7  9 10 11]
test [0 1 1 2 2 2 2]

test_index [ 1  3  4  6  8 12 13]
test [0 1 1 2 2 2 2]

test_index [ 1  4  5  6  9 10 11]
test [0 1 1 2 2 2 2]

test_index [ 0  2  3  7  8 12 13]
test [0 1 1 2 2 2 2]

test_index [ 0  3  5  6  7 10 13]
test [0 1 1 2 2 2 2]

test_index [ 1  2  4  8  9 11 12]
test [0 1 1 2 2 2 2]



# LeaveOneOut

In [None]:
x = np.arange(0,5)
from sklearn.model_selection import LeaveOneOut
rkf = LeaveOneOut()
for train_index, test_index in rkf.split(x):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [1 2 3 4]
test_index [0]

train_index [0 2 3 4]
test_index [1]

train_index [0 1 3 4]
test_index [2]

train_index [0 1 2 4]
test_index [3]

train_index [0 1 2 3]
test_index [4]



# LeavePOut
Provides train/test indices to split data in train/test sets. This results in testing on all distinct samples of size p, while the remaining n - p samples form the training set in each iteration.

In [None]:
x = np.arange(0,5)
from sklearn.model_selection import LeavePOut
rkf = LeavePOut(2)
for train_index, test_index in rkf.split(x):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [2 3 4]
test_index [0 1]

train_index [1 3 4]
test_index [0 2]

train_index [1 2 4]
test_index [0 3]

train_index [1 2 3]
test_index [0 4]

train_index [0 3 4]
test_index [1 2]

train_index [0 2 4]
test_index [1 3]

train_index [0 2 3]
test_index [1 4]

train_index [0 1 4]
test_index [2 3]

train_index [0 1 3]
test_index [2 4]

train_index [0 1 2]
test_index [3 4]
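
The number of splits equals the number of ways to choose p test samples out of n, which is C(5, 2) = 10 for the run above. The sketch below enumerates the same index pairs with the standard library only; the figure of 150 samples in the last line is just an illustration of how quickly the count grows.

```python
from itertools import combinations
from math import comb

n, p = 5, 2

# Every distinct set of p indices serves once as the test set.
test_sets = list(combinations(range(n), p))
train_sets = [tuple(i for i in range(n) if i not in test) for test in test_sets]

for train, test in zip(train_sets, test_sets):
    print('train', train, 'test', test)

print('number of splits:', len(test_sets), '= C(n, p) =', comb(n, p))
print('splits for n=150, p=2:', comb(150, 2))
```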



# ShuffleSplit
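Draws n independent random train/test splits with a fixed test ratio; it can be seen as running train_test_split n times in a row. Unlike KFold, the test sets of different splits may overlap.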

In [None]:
x = np.arange(0,10)
from sklearn.model_selection import ShuffleSplit
rkf = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_index, test_index in rkf.split(x):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [9 1 6 7 3 0 5]
test_index [2 8 4]

train_index [2 9 8 0 6 7 4]
test_index [3 5 1]

train_index [4 5 1 0 6 9 7]
test_index [2 3 8]



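# TimeSeriesSplit
Each training set contains only samples that come before its test set, so the model is never trained on later observations, and the training window grows with every split.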

In [None]:
x = np.arange(0,10)
from sklearn.model_selection import TimeSeriesSplit
rkf = TimeSeriesSplit(n_splits=3, test_size=2)
for train_index, test_index in rkf.split(x):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [0 1 2 3]
test_index [4 5]

train_index [0 1 2 3 4 5]
test_index [6 7]

train_index [0 1 2 3 4 5 6 7]
test_index [8 9]
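
The index pattern above can be reproduced with a few lines of plain Python. The helper `expanding_window_splits` is a hypothetical name written for this illustration; it mimics only the case shown here (fixed `test_size`, no gap, no cap on the training size) and is not scikit-learn's implementation.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train, test) index lists where train is everything before test."""
    # The last n_splits * test_size samples are reserved for the test windows.
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print('train_index', train)
    print('test_index', test)
    print()
```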



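# GroupKFold
Samples that share a group label are kept together, so the same group never appears in both the training set and the test set of a split.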

In [None]:
x = np.arange(0,10)
g = [0,0,1,1,2,2,3,3,4,4]
from sklearn.model_selection import GroupKFold
rkf = GroupKFold(n_splits=3)
for train_index, test_index in rkf.split(x,groups=g):
    print('train_index',train_index)
    print('test_index',test_index)
    print()

train_index [0 1 4 5 6 7]
test_index [2 3 8 9]

train_index [2 3 4 5 8 9]
test_index [0 1 6 7]

train_index [0 1 2 3 6 7 8 9]
test_index [4 5]
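
A quick way to confirm the grouping behaviour is to check, for every split, that no group label occurs on both sides. The sketch reuses the same ten samples and five groups as the cell above; the variable names `groups` and `shared_groups` are chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

x = np.arange(10)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

shared_groups = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(x, groups=groups):
    # Groups present in both the training and the test indices of this split.
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    shared_groups.append(overlap)
    print('test groups', sorted(set(groups[test_idx])), 'overlap', overlap)
```

This property is what matters when samples within a group are correlated, for example several measurements taken from the same subject.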

