<h1>Cross Validation</h1>

A common approach is to split the data once, for example 80/20 or 75/25, into a training set and a test set. Suppose the data consists of four blocks and we split it 75/25: it is not obvious which block should be held out for testing, the first, a middle one, or the last. Cross validation sidesteps this choice by using each block as the test set in turn and summarizing the results at the end.


Cross validation divides the dataset into multiple folds. One fold is held out as the validation set (it plays the role of the test set), and the model is trained on the remaining folds. The process is repeated so that each fold serves as the validation set once. Finally, the scores are averaged to obtain a more robust estimate of the model's performance.

<h3>Common steps involved in Cross Validation include:</h3>
1. Reserve a portion of the dataset.<br>
2. Train the model on the remaining samples.<br>
3. Use the reserved portion as a test set for the model.
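The steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data from `make_classification` (an assumption made so the block is self-contained; it is not the heart-failure dataset used below), and the names `X_demo`, `y_demo` and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

kf_demo = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = LogisticRegression(max_iter=1000)
    # Steps 1 and 2: hold out one fold and train on the remaining folds.
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Step 3: evaluate on the held-out fold.
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
print("Fold accuracies:", fold_scores)
print("Mean %.3f, std %.3f" % (fold_scores.mean(), fold_scores.std()))
```

`cross_val_score`, used in the rest of this notebook, wraps this kind of loop in a single call.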


<h1>Importing Libraries</h1>

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler


<h1>Loading Dataset</h1>

In [2]:
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

In [3]:
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


<h1>Extracting Dependent and Independent Variable</h1>

In [4]:
x = df.iloc[:, :-1]
x

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8
...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280


In [5]:
y = df.loc[:, 'DEATH_EVENT']
# y = df.iloc[:, 12]
y

0      1
1      1
2      1
3      1
4      1
      ..
294    0
295    0
296    0
297    0
298    0
Name: DEATH_EVENT, Length: 299, dtype: int64

In [6]:
m1 = LogisticRegression()
m2 = SVC()

<h1>KFold Cross-Validation</h1>
We split the dataset into k folds (subsets), train on k-1 of them, and test on the remaining fold.

Here, a cross-validation iterator is passed as the cv argument.

In [7]:
kfold = KFold(n_splits = 10, shuffle = True, random_state = 42)

In [8]:
scores = cross_val_score(m1, x, y, cv= kfold, scoring = 'accuracy')
print("Scores", scores)
print("Logistic Regression Score, Mean is %.3f and Standard Deviation is %.3f" %(scores.mean(), scores.std()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Scores [0.8        0.8        0.76666667 0.86666667 0.86666667 0.93333333
 0.8        0.8        0.7        0.86206897]
Logistic Regression Score, Mean is 0.820 and Standard Deviation is 0.061


In [9]:
scores = cross_val_score(m2, x, y, cv = kfold, scoring = 'accuracy')
print("Scores", scores)
print("SVC Score, Mean is %.3f and Standard Deviation is %.3f" %(scores.mean(), scores.std()))

Scores [0.6        0.56666667 0.6        0.56666667 0.83333333 0.7
 0.73333333 0.76666667 0.6        0.82758621]
SVC Score, Mean is 0.679 and Standard Deviation is 0.100
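
The convergence warning above recommends scaling the data. One way to do that inside cross validation is to put the scaler and the model in a pipeline, so the scaler is fitted only on the training folds of each split. The sketch below assumes synthetic data with one deliberately large-valued column (a stand-in for a feature such as `platelets`); `X_demo`, `y_demo`, `scaled_lr` and `scaled_svc` are hypothetical names, and the scores it prints say nothing about the heart-failure data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption); inflate one column to mimic
# features measured on very different scales.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
X_demo[:, 0] *= 1e5

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=42)

# Each pipeline re-fits StandardScaler on the training folds of every split,
# so no statistics from the validation fold reach the model.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_svc = make_pipeline(StandardScaler(), SVC())

lr_scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=kfold_demo, scoring="accuracy")
svc_scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=kfold_demo, scoring="accuracy")

print("Scaled LR  mean %.3f, std %.3f" % (lr_scores.mean(), lr_scores.std()))
print("Scaled SVC mean %.3f, std %.3f" % (svc_scores.mean(), svc_scores.std()))
```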


<h1>Cross-Validation with train/test split</h1>

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [26]:
standardscaler = StandardScaler()
standardscaler.fit(X_train)
X_train_scaled= standardscaler.transform(X_train)
X_test_scaled = standardscaler.transform(X_test)

In [12]:
m1.fit(X_train_scaled, Y_train)

In [13]:
m2.fit(X_train_scaled, Y_train)

Here, the cv argument is an integer. In that case cross_val_score uses StratifiedKFold when the estimator is a classifier and the target is binary or multiclass, and KFold otherwise.

In [14]:
scores = cross_val_score(m1, X_train_scaled, Y_train, cv = 10, scoring = 'f1_macro')
print("Scores", scores)
print("Logistic Regression Mean is %0.3f and Standard Deviation is %0.3f" %(scores.mean(), scores.std()))


Scores [0.75757576 0.95151515 0.77229602 0.88888889 0.94725275 0.89915966
 0.82309582 0.52527473 0.70515971 0.69196429]
Logistic Regression Mean is 0.796 and Standard Deviation is 0.127


In [15]:
# No scoring argument: the estimator's default scorer is used (accuracy for SVC).
scores = cross_val_score(m2, X_train_scaled, Y_train, cv=10)
print("Scores", scores)
print("SVC Mean is %0.3f and Standard Deviation is %0.3f" %(scores.mean(), scores.std()))

Scores [0.75       0.875      0.66666667 0.875      0.91666667 0.91666667
 0.875      0.66666667 0.79166667 0.73913043]
SVC Mean is 0.807 and Standard Deviation is 0.092
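
How an integer `cv` is resolved can be inspected with scikit-learn's `check_cv` helper. The sketch below uses a small hypothetical label array (`y_binary`) rather than the notebook's data, and it assumes that `cross_val_score` applies the same rule internally.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical binary labels, for illustration only.
y_binary = np.array([0] * 60 + [1] * 30)

# classifier=True with a binary target resolves to StratifiedKFold.
cv_for_classifier = check_cv(10, y_binary, classifier=True)
# classifier=False resolves to plain KFold.
cv_for_regressor = check_cv(10, y_binary, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```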


<h1>Repeated KFold Cross Validation</h1>
RepeatedKFold repeats K-Fold n times, producing different splits in each repetition. Note that n_splits must be at least 2; with n_splits=1 scikit-learn raises the error "k-fold cross-validation requires at least one train/test split by setting n_splits=2 or more, got n_splits=1".

In [16]:
rkf = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 1234567)

In [17]:
# m2 is the SVC, so the printed label refers to SVC.
scores = cross_val_score(m2, x, y, cv= rkf, scoring = 'accuracy')
print("Scores", scores)
print("SVC Score, Mean is %.3f and Standard Deviation is %.3f" %(scores.mean(), scores.std()))

Scores [0.68       0.65333333 0.66666667 0.71621622 0.66666667 0.64
 0.68       0.72972973 0.72       0.70666667 0.61333333 0.67567568]
SVC Score, Mean is 0.679 and Standard Deviation is 0.033
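
The twelve scores above follow from n_splits × n_repeats = 4 × 3. A small sketch on a hypothetical 8-sample array (`X_demo`) shows the split count and that, within one repetition, the test folds together cover every sample once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# 8 hypothetical samples, for illustration only.
X_demo = np.arange(16).reshape(8, 2)

rkf_demo = RepeatedKFold(n_splits=4, n_repeats=3, random_state=1234567)
splits = list(rkf_demo.split(X_demo))
print("Total splits:", len(splits))

# The first 4 splits belong to the first repetition; their test folds
# partition the sample indices.
first_repeat_tests = np.sort(np.concatenate([test for _, test in splits[:4]]))
print("Test indices in first repetition:", first_repeat_tests)
```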


<h1>Leave One Out (LOO) Cross-Validation</h1>
LOO trains on all samples except one and uses the single held-out sample as the test set. This is repeated once for every sample.

In [18]:
loo = LeaveOneOut()
for train, test in loo.split(x,y):
    print("%s %s" %(train, test))

[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
 235 236 237 238 239 240 241 242 243 244 245 246 24

In [19]:
train

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [20]:
test

array([298])
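
A LeaveOneOut iterator can also be passed to `cross_val_score` like any other splitter. The following is a minimal sketch on a small synthetic dataset (an assumption, chosen to keep the number of fits low); each per-split score is either 0 or 1 because every test set holds a single sample.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=40, n_features=5, random_state=0)

loo_demo = LeaveOneOut()
loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=loo_demo, scoring="accuracy"
)

# One fit per sample; the mean of the 0/1 scores is the LOO accuracy.
print("Number of fits:", len(loo_scores))
print("LOO accuracy: %.3f" % loo_scores.mean())
```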

<h1>Leave P Out (LPO) Cross Validation</h1>
LPO is similar to LOO, but the size of the test set is set by p. Here p = 2, so every possible pair of samples is used once as the test set, with the remaining samples as the training set.

In [21]:
# lpo = LeavePOut(p=2)
# for train, test in lpo.split(x, y):
#     print("%s %s" %(train, test))


# LPO is computationally heavy (299 samples with p=2 give 44,551 splits), so it is commented out.
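
The cost of LPO can be checked without running it: the number of splits is the binomial coefficient C(n, p). The sketch below uses a placeholder array with 299 rows (the row count of the dataset above); `X_placeholder` is a hypothetical name and its values are irrelevant.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n_samples = 299
# Only the number of rows matters for counting splits.
X_placeholder = np.zeros((n_samples, 1))

lpo_demo = LeavePOut(p=2)
n_lpo_splits = lpo_demo.get_n_splits(X_placeholder)

# Both values are C(299, 2): the number of distinct pairs of samples.
print(n_lpo_splits, comb(n_samples, 2))
```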


<h1>StratifiedKFold Cross Validation</h1>
Stratified K-Fold is similar to K-Fold cross validation, but it uses stratified sampling rather than purely random sampling, so each fold keeps roughly the same class proportions as the full dataset.

In [22]:
x

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8
...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280


In [23]:
y

0      1
1      1
2      1
3      1
4      1
      ..
294    0
295    0
296    0
297    0
298    0
Name: DEATH_EVENT, Length: 299, dtype: int64

np.bincount counts the occurrences of each class label in the array. Our data has two classes, so it returns the number of 0s and 1s.

In [24]:
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 1)
for train, test in skf.split(x, y):
    print('train-{}|test-{}'.format(np.bincount(y[train]), np.bincount(y[test])))

train-[135  64]|test-[68 32]
train-[135  64]|test-[68 32]
train-[136  64]|test-[67 32]


In [25]:
kf = KFold(n_splits=3)
for train, test in kf.split(x,y):
    print('train-{}| test-{}'.format(np.bincount(y[train]), np.bincount(y[test])))

train-[169  30]| test-[34 66]
train-[126  73]| test-[77 23]
train-[111  89]| test-[92  7]


Here we can see that Stratified KFold preserves the class ratio (roughly 2:1) in both the training and test sets of every split, whereas plain KFold without shuffling does not.

In [32]:
skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 1)

# m2 is the SVC, so the printed label refers to SVC.
scores = cross_val_score(m2, x, y, cv= skf, scoring = 'accuracy')
print("Scores", scores)
print("SVC Score, Mean is %.3f and Standard Deviation is %.3f" %(scores.mean(), scores.std()))

Scores [0.66666667 0.66666667 0.66666667 0.66666667 0.66666667 0.66666667
 0.7        0.7        0.7        0.68965517]
SVC Score, Mean is 0.679 and Standard Deviation is 0.015


In [33]:
# Manual equivalent of cross_val_score: fit and score m1 on each stratified fold.
# The fold indices refer to the full dataset, so they index x and y directly.
# skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 1)
# for train, test in skf.split(x, y):
#     x_train_fold, x_test_fold = x.iloc[train], x.iloc[test]
#     y_train_fold, y_test_fold = y.iloc[train], y.iloc[test]
#     m1.fit(x_train_fold, y_train_fold)
#     print(m1.score(x_test_fold, y_test_fold))

<h1>Group KFold Cross Validation</h1>
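Group KFold Cross Validation is a variation of KFold Cross Validation that makes sure the same group is not represented in both the training and the testing set. Below, the class labels are reused as group ids purely to demonstrate the mechanics; in practice a group would typically identify related samples, such as records from the same patient.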

In [58]:
group = np.array(y)
s = group.tolist()
s

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,


In [72]:
# Class labels reused as group ids, for demonstration only.
group = np.array(y)
s = group.tolist()
gkf = GroupKFold(n_splits=2)
for train, test in gkf.split(x, y, groups=s):
    print("%s %s" %(train, test))

[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18
  19  21  22  24  25  26  27  28  29  30  31  32  34  35  36  37  39  40
  41  42  44  45  46  47  48  49  50  51  52  53  54  55  58  59  60  61
  63  65  66  67  68  69  72  74  75  82  84  93 105 110 113 119 124 126
 140 144 148 150 163 164 165 167 181 182 183 184 185 186 187 194 195 213
 217 220 230 246 262 266] [ 14  20  23  33  38  43  56  57  62  64  70  71  73  76  77  78  79  80
  81  83  85  86  87  88  89  90  91  92  94  95  96  97  98  99 100 101
 102 103 104 106 107 108 109 111 112 114 115 116 117 118 120 121 122 123
 125 127 128 129 130 131 132 133 134 135 136 137 138 139 141 142 143 145
 146 147 149 151 152 153 154 155 156 157 158 159 160 161 162 166 168 169
 170 171 172 173 174 175 176 177 178 179 180 188 189 190 191 192 193 196
 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 214 215
 216 218 219 221 222 223 224 225 226 227 228 229 231 232 233 234 235 236
 237 238 239 240 241 242 
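
A more typical use of GroupKFold passes an identifier of related samples as the group. The sketch below assumes a hypothetical `patient_id` array (4 patients with 3 records each) on toy data; no such column exists in the dataset above. It checks that no patient appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data and hypothetical group ids, for illustration only.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)
patient_id = np.repeat(np.arange(4), 3)  # 4 patients, 3 records each

gkf_demo = GroupKFold(n_splits=2)
fold_groups = []
for train_idx, test_idx in gkf_demo.split(X_demo, y_demo, groups=patient_id):
    train_groups = np.unique(patient_id[train_idx]).tolist()
    test_groups = np.unique(patient_id[test_idx]).tolist()
    fold_groups.append((train_groups, test_groups))
    print("train patients:", train_groups, "| test patients:", test_groups)

# No patient is shared between the training and test side of any split.
no_overlap = all(set(tr).isdisjoint(te) for tr, te in fold_groups)
print("Groups never overlap:", no_overlap)
```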

Cross Validation is usually performed to detect overfitting, which occurs when a model fits the training data too closely and then performs poorly on test (new and unseen) data. Because cross validation evaluates the model on multiple validation sets, the resulting estimate is a more realistic indication of how the model will perform on new and unseen data.
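
One way to see overfitting with cross validation is to compare training and validation scores per fold. The sketch below uses `cross_validate` with `return_train_score=True` on noisy synthetic data and an unconstrained decision tree; both choices are assumptions made for illustration and are not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (an assumption, to provoke overfitting).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)

cv_results = cross_validate(
    DecisionTreeClassifier(random_state=0),  # unconstrained depth
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)

train_mean = cv_results["train_score"].mean()
val_mean = cv_results["test_score"].mean()

# A large gap between the two means suggests the model memorizes the training folds.
print("Train accuracy      %.3f" % train_mean)
print("Validation accuracy %.3f" % val_mean)
```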