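# Overview
Overfitting occurs when a model fits the training data too closely, noise included, and therefore generalizes poorly to new data.

Common remedies include regularization, hyperparameter tuning, and K-fold cross-validation.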

In [24]:
# Import the dataset loader
from sklearn.datasets import load_breast_cancer

In [40]:
# Load the dataset
cancer = load_breast_cancer()
X_train, y_train = cancer.data, cancer.target

print(X_train.shape)
print(y_train.shape)


(569, 30)
(569,)
[1 1 1 0 0 0 0 0 0]


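# 1. Traditional Cross-Validation
![Traditional cross-validation](传统交叉验证.png "Traditional cross-validation")
- The classic approach is to split the dataset into three fixed subsets: training, validation, and test.
- A common choice is 60% for training, 20% for validation, and 20% for testing.
- These proportions can be adjusted to the size of the dataset. For a small dataset this ratio is reasonable; with more data, a larger training share and smaller validation and test shares can be considered.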
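
The 60/20/20 split described above can be sketched in plain Python on sample indices. This is only an illustration: the helper name `three_way_split` and its parameters are assumptions made for this example and are not part of scikit-learn. With scikit-learn, the same effect is usually obtained by calling `train_test_split` twice.

```python
import random


def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Hypothetical helper: shuffle indices and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything left over is training data
    return train_idx, val_idx, test_idx


# 569 is the number of samples in the breast cancer dataset loaded above
train_idx, val_idx, test_idx = three_way_split(569)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint and together cover every sample, so the test subset is never seen during training or model selection.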


In [27]:
from sklearn.model_selection import train_test_split
X_train2, X_val, y_train2, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("train dataset:", X_train2.shape)
print("validation dataset:", X_val.shape)



train dataset: (455, 30)
validation dataset: (114, 30)


In [28]:
# Evaluate the traditional split with a decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


dt = DecisionTreeClassifier(max_depth=10,random_state=0,min_samples_split=2,max_features=11)
dt.fit(X_train2, y_train2)

y_pred_val = dt.predict(X_val)
print("Traditional cross-validation accuracy:", accuracy_score(y_val, y_pred_val))

Traditional cross-validation accuracy: 0.9035087719298246


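# 2. Leave-One-Out Cross-Validation
![Leave-one-out cross-validation](留一法交叉验证.png "Leave-one-out cross-validation")
Leave-one-out uses n-1 observations to fit the model and the single remaining observation to evaluate it. This is repeated n times, once for each observation.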
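
The index bookkeeping behind leave-one-out can be sketched in a few lines of plain Python. The generator name `leave_one_out_indices` is an assumption made for this illustration; it is not how scikit-learn implements `LeaveOneOut`, only a small model of the same idea.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, val_indices) pairs, one pair per observation."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]  # all samples except i
        yield train, [i]


splits = list(leave_one_out_indices(5))
for train, val in splits:
    print("TRAIN:", train, "VAL:", val)
```

Because there are as many splits as samples, the model is refit n times, which is why this method becomes expensive on large datasets.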

In [38]:
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()

for train_index, val_index in loo.split(X_train):
    print("TEST: ", val_index[:10], "TRAIN: ", train_index[:10])
    Xtrain, Xval = X_train[train_index], X_train[val_index]
    ytrain, yval = y_train[train_index], y_train[val_index]
    # Class counts: train class 1, train class 0, validation class 1, validation class 0
    print((ytrain==1).sum(), (ytrain==0).sum(), (yval==1).sum(), (yval==0).sum())


TEST:  [0] TRAIN:  [ 1  2  3  4  5  6  7  8  9 10]
357 211 0 1
TEST:  [1] TRAIN:  [ 0  2  3  4  5  6  7  8  9 10]
357 211 0 1
TEST:  [2] TRAIN:  [ 0  1  3  4  5  6  7  8  9 10]
357 211 0 1
TEST:  [3] TRAIN:  [ 0  1  2  4  5  6  7  8  9 10]
357 211 0 1
TEST:  [4] TRAIN:  [ 0  1  2  3  5  6  7  8  9 10]
357 211 0 1
TEST:  [5] TRAIN:  [ 0  1  2  3  4  6  7  8  9 10]
357 211 0 1
TEST:  [6] TRAIN:  [ 0  1  2  3  4  5  7  8  9 10]
357 211 0 1
TEST:  [7] TRAIN:  [ 0  1  2  3  4  5  6  8  9 10]
357 211 0 1
TEST:  [8] TRAIN:  [ 0  1  2  3  4  5  6  7  9 10]
357 211 0 1
TEST:  [9] TRAIN:  [ 0  1  2  3  4  5  6  7  8 10]
357 211 0 1
TEST:  [10] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [11] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [12] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [13] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [14] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [15] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 211 0 1
TEST:  [16] TRAIN:  [0 1 2 3 4 5 6 7 8 9]
357 

In [30]:
from sklearn.model_selection import cross_val_score
import numpy as np

dt = DecisionTreeClassifier(max_depth=10,random_state=0,min_samples_split=2,max_features=11)
scores = cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=loo, n_jobs=1)
print("Leave-one-out accuracy:", np.mean(scores))

Leave-one-out accuracy: 0.9244288224956063


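# 3. K-Fold Cross-Validation
![K-fold cross-validation](K折交叉验证.png "K-fold cross-validation")
In K-fold cross-validation, the dataset is split into k folds. k-1 folds are used to train the model and the remaining fold is used to evaluate it. The procedure is repeated k times so that each fold serves as the validation set exactly once.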
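
The fold construction described above can be sketched without any library. The function name `k_fold_indices` is hypothetical, and the sketch uses contiguous, unshuffled folds for readability, whereas the notebook cell below passes `shuffle=True` to `KFold`. When the sample count is not divisible by k, the remainder is spread over the first folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print("VAL:", val, "TRAIN:", train)
```

Every sample appears in exactly one validation fold, so averaging the k scores uses each observation for evaluation once.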

In [31]:
from sklearn.model_selection import KFold

# K=5
cv = KFold(n_splits=5, random_state=0,shuffle=True)

for train_index, val_index in cv.split(X_train):
    print("TEST: ", val_index[:10], "TRAIN: ", train_index[:10])
    Xtrain, Xval = X_train[train_index], X_train[val_index]
    ytrain, yval = y_train[train_index], y_train[val_index]
    # Class counts: train class 1, train class 0, validation class 1, validation class 0
    print((ytrain==1).sum(), (ytrain==0).sum(), (yval==1).sum(), (yval==0).sum())


TEST:  [ 1 10 12 14 15 17 21 31 37 45] TRAIN:  [ 0  2  3  4  5  6  7  8  9 11]
290 165 67 47
TEST:  [ 6  7  8 20 30 34 38 49 54 55] TRAIN:  [ 0  1  2  3  4  5  9 10 11 12]
279 176 78 36
TEST:  [ 2  4  5 18 22 26 33 35 39 44] TRAIN:  [ 0  1  3  6  7  8  9 10 11 12]
283 172 74 40
TEST:  [ 3 11 13 16 19 24 25 27 29 32] TRAIN:  [ 0  1  2  4  5  6  7  8  9 10]
289 166 68 46
TEST:  [ 0  9 23 28 42 43 47 48 50 53] TRAIN:  [ 1  2  3  4  5  6  7  8 10 11]
287 169 70 43


In [32]:
dt = DecisionTreeClassifier(max_depth=10,random_state=0,min_samples_split=2,max_features=11)
scores = cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=1)


#print(scores)
print("K-fold cross-validation accuracy:", np.mean(scores))


K-fold cross-validation accuracy: 0.9262226362366093


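# 4. Stratified K-Fold Cross-Validation
![Stratified K-fold cross-validation](分层K折交叉验证.png "Stratified K-fold cross-validation")
Stratified K-fold cross-validation works the same way as K-fold cross-validation. The only difference is that it keeps the proportion of each class approximately the same in every fold.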
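
One simple way to obtain folds with matching class proportions is to deal the indices of each class round-robin into the folds. The sketch below assumes this dealing strategy and a hypothetical helper name `stratified_folds`; it is not scikit-learn's actual `StratifiedKFold` algorithm, and it does not shuffle.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Return k lists of validation indices with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)  # group sample indices by class

    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)  # deal this class round-robin
    return folds


# Toy labels with a 2:1 class ratio
labels = [1] * 12 + [0] * 6
folds = stratified_folds(labels, 3)
for fold in folds:
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Each fold keeps the 2:1 ratio of the full label list, which is the property the stratified split output below shows for the real dataset.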

In [33]:
from sklearn.model_selection import StratifiedKFold


scv = StratifiedKFold(n_splits=5, random_state=0,shuffle=True)


for train_index, val_index in scv.split(X_train,y_train):
    print("\n TEST: ", val_index[:10], "\n\n TRAIN: ", train_index[:10])
    Xtrain, Xval = X_train[train_index], X_train[val_index]
    ytrain, yval = y_train[train_index], y_train[val_index]
    # Class counts: train class 1, train class 0, validation class 1, validation class 0
    print((ytrain==1).sum(), (ytrain==0).sum(), (yval==1).sum(), (yval==0).sum())




 TEST:  [ 1  8 17 28 30 33 40 46 49 53] 

 TRAIN:  [ 0  2  3  4  5  6  7  9 10 11]
286 169 71 43

 TEST:  [ 5  9 10 14 32 34 62 64 65 66] 

 TRAIN:  [ 0  1  2  3  4  6  7  8 11 12]
286 169 71 43

 TEST:  [ 2  6 13 16 19 20 31 36 42 44] 

 TRAIN:  [ 0  1  3  4  5  7  8  9 10 11]
285 170 72 42

 TEST:  [ 4  7 11 12 18 22 23 35 38 39] 

 TRAIN:  [ 0  1  2  3  5  6  8  9 10 13]
285 170 72 42

 TEST:  [ 0  3 15 21 24 25 26 27 29 37] 

 TRAIN:  [ 1  2  4  5  6  7  8  9 10 11]
286 170 71 42


In [37]:
dt = DecisionTreeClassifier(max_depth=10,random_state=0,min_samples_split=2,max_features=11)

scores = cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=scv, n_jobs=1)


#print(scores)
print("Stratified K-fold cross-validation accuracy:", np.mean(scores))


Stratified K-fold cross-validation accuracy: 0.9261915851575843


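# 5. Time Series Cross-Validation
![Time series cross-validation](时间序列交叉验证.png "Time series cross-validation")
The last method is time series cross-validation. It is useful when the data is time-dependent, because the order of the observations must be preserved.

Shuffling the data would destroy the dependency between consecutive observations.

Unlike the other methods, the first step does not use all samples to train and evaluate the model, only a subset. After the first step, each training set is the union of the previous training and validation sets, and a new, small block of later data is used for evaluation each time. Only the final split uses all of the data.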
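
The expanding-window scheme described above can be sketched in plain Python. The function name `expanding_window_splits` is hypothetical, and the sizing rule (validation block of `n_samples // (n_splits + 1)` samples) is an assumption intended to mirror the default behaviour of `TimeSeriesSplit`, not a copy of its implementation.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) where training always precedes validation."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few samples for the requested number of splits")
    first_start = n_samples - n_splits * test_size
    for start in range(first_start, n_samples, test_size):
        train = list(range(start))  # everything before the validation block
        val = list(range(start, start + test_size))
        yield train, val


splits = list(expanding_window_splits(10, n_splits=5))
for train, val in splits:
    print("TRAIN:", train, "VAL:", val)
```

The training window grows by one validation block per step and never contains an index later than the block being evaluated, so no future information leaks into training.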

In [21]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit()

# Split only the first 10 samples so the printed indices stay readable
for train_index, val_index in tscv.split(X_train[:10]):
    print("\n TRAIN:", train_index, "\n \n TEST:", val_index)
    Xtrain, Xval = X_train[train_index], X_train[val_index]
    ytrain, yval = y_train[train_index], y_train[val_index]



 TRAIN: [0 1 2 3 4] 
 
 TEST: [5]

 TRAIN: [0 1 2 3 4 5] 
 
 TEST: [6]

 TRAIN: [0 1 2 3 4 5 6] 
 
 TEST: [7]

 TRAIN: [0 1 2 3 4 5 6 7] 
 
 TEST: [8]

 TRAIN: [0 1 2 3 4 5 6 7 8] 
 
 TEST: [9]


In [36]:
dt = DecisionTreeClassifier(max_depth=10,random_state=0,min_samples_split=2,max_features=11)

scores = cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=tscv, n_jobs=1)

#print(scores)
print("Time series cross-validation accuracy:", np.mean(scores))


Time series cross-validation accuracy: 0.9


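# Overall Results
- Traditional cross-validation: 0.9035
- Leave-one-out: 0.9244
- K-fold cross-validation: 0.9262
- Stratified K-fold cross-validation: 0.9262
- Time series cross-validation: 0.90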

