Your company specializes in recognizing handwritten characters. To improve its recognition of digits, it has gathered a dataset of 1,797 handwritten digits from 0 to 9. The images have already been converted into their numeric representation, and you have been given the dataset to split into training/validation/testing sets. You can perform either a conventional split or cross-validation. Follow these steps to complete this activity:
1. Import all the required elements to split a dataset, as well as the load_digits function from scikit-learn to load the digits dataset.
2. Load the digits dataset and create Pandas DataFrames containing the features and target matrices.
3. Take the conventional split approach, using a split ratio of 60/20/20%.
4. Using the same DataFrames, perform a 10-fold cross-validation split.

In [227]:
from sklearn.datasets import load_digits
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [228]:
digits = load_digits()

In [229]:
X = pd.DataFrame(digits.data)
X.shape

(1797, 64)

In [230]:
X.head()

,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


In [231]:
Y = pd.DataFrame(digits.target)
Y.shape

(1797, 1)

In [232]:
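# Row labels for the 60% training slice; .loc slices include both endpoints
percentile_60 = int(round(len(X) * 0.60))
train_set_range = [0, percentile_60]
# The next 20% of rows form the dev (validation) slice
percentile_20 = int(round(len(X) * 0.20))
dev_test_range = [train_set_range[1] + 1, train_set_range[1] + percentile_20]
# Last label of the dev slice; the test set starts at the following row
test_range = dev_test_range[1]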

In [233]:
X_train =X.loc[train_set_range[0]:train_set_range[1],:]
X_train.shape

(1079, 64)

In [234]:
Y_train =Y.loc[train_set_range[0]:train_set_range[1],:]
Y_train.shape

(1079, 1)

In [235]:
X_dev_test =X.loc[dev_test_range[0]:dev_test_range[1],:]
X_dev_test.shape

(359, 64)

In [236]:
Y_dev_test =Y.loc[dev_test_range[0]:dev_test_range[1],:]
Y_dev_test.shape

(359, 1)

In [237]:
X_test =X.loc[test_range+1:,:]
X_test.shape

(359, 64)

In [238]:
Y_test =Y.loc[test_range+1:,:]
Y_test.shape

(359, 1)

In [239]:
print(X_train.shape,X_test.shape,X_dev_test.shape,Y_train.shape,Y_dev_test.shape,Y_test.shape)

(1079, 64) (359, 64) (359, 64) (1079, 1) (359, 1) (359, 1)
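
The index arithmetic above can also be written with half-open positional ranges, which avoids the +1 bookkeeping that inclusive .loc slicing needs. The following is a minimal sketch in plain Python; split_bounds is a hypothetical helper written for this illustration, not part of pandas or scikit-learn. Because its ranges are half-open (intended for .iloc), the sizes come out as 1078/359/360 rather than the 1079/359/359 shown above.

```python
def split_bounds(n_rows, train_frac=0.6, dev_frac=0.2):
    """Return half-open (start, stop) row ranges for train, dev, and test."""
    train_stop = int(round(n_rows * train_frac))
    dev_stop = train_stop + int(round(n_rows * dev_frac))
    # Whatever remains after train and dev goes to the test set.
    return (0, train_stop), (train_stop, dev_stop), (dev_stop, n_rows)


train_rng, dev_rng, test_rng = split_bounds(1797)
sizes = [stop - start for start, stop in (train_rng, dev_rng, test_rng)]
print(train_rng, dev_rng, test_rng)
print(sizes)
```

With a DataFrame, each range could then be applied as X.iloc[start:stop], which excludes the stop position.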


In [240]:
kf = KFold(n_splits = 10)

In [241]:
splits = kf.split(X_dev_test)

In [242]:
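# The indices from kf.split(X_dev_test) are positions within X_dev_test,
# so index X_dev_test and Y_dev_test rather than the full X and Y
for train_index, test_index in splits:
    X_fold_train, X_fold_dev = X_dev_test.iloc[train_index,:], X_dev_test.iloc[test_index,:]
    Y_fold_train, Y_fold_dev = Y_dev_test.iloc[train_index,:], Y_dev_test.iloc[test_index,:]
print(X_fold_train.shape, Y_fold_train.shape, X_fold_dev.shape, Y_fold_dev.shape, X_test.shape, Y_test.shape)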

(324, 64) (324, 1) (35, 64) (35, 1) (359, 64) (359, 1)
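
The 324/35 shapes follow from how KFold sizes its folds: according to the scikit-learn documentation, the first n_samples % n_splits folds contain n_samples // n_splits + 1 samples and the remaining folds contain n_samples // n_splits. The print statement runs after the loop, so it reports the last fold. A small plain-Python sketch of that sizing rule, where fold_sizes is a hypothetical helper written for illustration:

```python
def fold_sizes(n_samples, n_splits):
    """Sizes of the validation folds under the KFold sizing rule."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds each take one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = fold_sizes(359, 10)
last_dev = sizes[-1]          # size of the final validation fold
last_train = 359 - last_dev   # everything else is used for training
print(sizes)
print(last_train, last_dev)
```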


In [243]:
X_new, X_test,Y_new, Y_test = train_test_split(X, Y, test_size=0.2)
print(X_new.shape, Y_new.shape, X_test.shape, Y_test.shape)

(1437, 64) (1437, 1) (360, 64) (360, 1)


In [244]:
dev_size = X_test.shape[0]/X_new.shape[0]
dev_size

0.25052192066805845

In [245]:
X_train, X_dev,Y_train, Y_dev = train_test_split(X_new, Y_new,test_size = dev_size)
print(X_train.shape, Y_train.shape, X_dev.shape,Y_dev.shape, X_test.shape, Y_test.shape)

(1077, 64) (1077, 1) (360, 64) (360, 1) (360, 64) (360, 1)
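
As a variation on the two-step split above, train_test_split also accepts an integer test_size, which is interpreted as an absolute number of samples and removes the need to compute a second ratio. The sketch below is one possible arrangement, not the activity's required solution: it works on the NumPy arrays returned by load_digits(return_X_y=True) instead of DataFrames, and the stratify arguments, random_state=0, and the names n_holdout, X_tr, X_dv, and X_te are assumptions added for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_all, y_all = load_digits(return_X_y=True)

# Absolute size of the dev set and of the test set (20% of 1,797, rounded)
n_holdout = int(round(len(X_all) * 0.20))

# First carve out the test set, keeping class proportions similar (stratify)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=n_holdout, stratify=y_all, random_state=0)

# Then carve the dev set out of what remains
X_tr, X_dv, y_tr, y_dv = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=0)

print(X_tr.shape, X_dv.shape, X_te.shape)
```

Fixing random_state makes the split reproducible between runs; leaving it out, as in the cells above, gives a different shuffle each time.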


Using the same DataFrames, perform a 10-fold cross-validation split.

First, divide the datasets into initial training and testing sets:

In [246]:
X_new_2, X_test_2,Y_new_2, Y_test_2 = train_test_split(X, Y, test_size=0.1)
print(X_new_2.shape,X_test_2.shape,Y_new_2.shape,Y_test_2.shape)

(1617, 64) (180, 64) (1617, 1) (180, 1)


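Remember that cross-validation produces a different train/validation configuration on each iteration, with each fold serving as the validation set exactly once. Note that KFold does not shuffle the data by default (shuffle=False); here the rows were already shuffled by train_test_split. Considering this, write a for loop that goes through all the split configurations: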

In [247]:
kf = KFold(n_splits = 10)
splits = kf.split(X_new_2)

In [248]:
for train_index, dev_index in splits:
    X_train_2, X_dev_2 = X_new_2.iloc[train_index,:],X_new_2.iloc[dev_index,:]
    Y_train_2, Y_dev_2 = Y_new_2.iloc[train_index,:],Y_new_2.iloc[dev_index,:]


In [249]:
print(X_train_2.shape, Y_train_2.shape, X_dev_2.shape,Y_dev_2.shape, X_test_2.shape, Y_test_2.shape)

(1456, 64) (1456, 1) (161, 64) (161, 1) (180, 64) (180, 1)
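
The loop above reassigns its variables on every pass, so the shapes printed afterwards describe only the last fold. A hedged sketch of one way to keep every fold follows. It uses a synthetic NumPy array with the same number of rows as X_new_2 rather than the digits data, and the folds list, shuffle=True, and random_state=0 are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data with 1,617 rows, matching the length of X_new_2
data = np.arange(1617 * 2).reshape(1617, 2)

kf = KFold(n_splits=10, shuffle=True, random_state=0)

folds = []
for train_index, dev_index in kf.split(data):
    # Store each (train, dev) pair instead of overwriting the previous one
    folds.append((data[train_index], data[dev_index]))

dev_sizes = [len(dev) for _, dev in folds]
print(len(folds), dev_sizes)
```

In practice, a model would be trained and evaluated inside the loop, once per fold, and the ten validation scores averaged.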
