Objective:
To evaluate and compare the learning curves of leave-one-out cross-validation with those of two-, three-, five-, and ten-fold cross-validation on a learning problem, using a real-world dataset.

Theory: 
K-fold cross-validation splits the training dataset into k folds without replacement, so any given data point belongs to exactly one fold. In each round, k-1 folds are used to train the model and the remaining fold is used for testing. The procedure is repeated k times, once per held-out fold, which yields k models and k performance estimates. Averaging these estimates gives an overall performance estimate that is less sensitive to how the training data happens to be partitioned than a single holdout split. Leave-one-out cross-validation is the special case in which k equals the number of samples, so every test fold contains exactly one data point.
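
The procedure described above can be sketched in plain Python, without scikit-learn. This is only an illustration of the idea: the helper names `k_fold_indices` and `cross_validate` and the toy scoring function are hypothetical and are not part of this notebook or of any library.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, score_fold):
    """Call score_fold(train_idx, test_idx) once per fold and average the results."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the held-out fold i.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        scores.append(score_fold(train_idx, test_idx))
    return scores, sum(scores) / len(scores)


# Toy scorer (hypothetical): reports the fraction of the data used for training.
scores, mean_score = cross_validate(10, 5, lambda tr, te: len(tr) / (len(tr) + len(te)))
print(scores, mean_score)
```

In a real experiment, `score_fold` would fit a model on the training indices and return its accuracy on the test indices; scikit-learn's `cross_val_score` does this for you.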

In [1]:
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut
from sklearn.preprocessing import StandardScaler



In [None]:
# Read the data
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\1 DO NOT CLICK  [MOHIT]\COLLEGE\LAB\AML LAB\EXP 4\diabetes.csv")
X = df.iloc[:, :8].values  # Independent variables
print(X)
y = df['Outcome'].values     # Dependent variable
print(y)

[[  6.    148.     72.    ...  33.6     0.627  50.   ]
 [  1.     85.     66.    ...  26.6     0.351  31.   ]
 [  8.    183.     64.    ...  23.3     0.672  32.   ]
 ...
 [  5.    121.     72.    ...  26.2     0.245  30.   ]
 [  1.    126.     60.    ...  30.1     0.349  47.   ]
 [  1.     93.     70.    ...  30.4     0.315  23.   ]]
[1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0
 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1
 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0
 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 0
 1 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0

In [None]:
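# Standardize the features to zero mean and unit variance.
# Note: the scaler is fitted on all rows before the train/test split,
# so statistics from the held-out rows influence the transformed training data.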
sc = StandardScaler()
X = sc.fit_transform(X)
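
One common way to keep held-out statistics out of the scaling step is to put the scaler and the classifier in a single pipeline, so the scaler is refit on the training folds of every split. The sketch below is a suggestion, not what this notebook does; it uses synthetic data from `make_classification` as a stand-in for the diabetes file, and the names `X_demo`, `y_demo`, `model`, and `pipe_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is refit on the training folds of every split,
# so the held-out fold never contributes to the mean or variance.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pipe_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
print(pipe_scores.mean())
```

For a decision tree the scaling step matters little, since splits depend on the ordering of feature values rather than their scale; the pipeline pattern becomes important for scale-sensitive models.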

In [None]:
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)

In [None]:
# Build a decision tree classifier
clf = tree.DecisionTreeClassifier(random_state=2017)

In [None]:
# Define the different CV strategies
cv_values = [3, 4, 5, 10]
cv_strategies = {f"{cv}-Fold": KFold(n_splits=cv, shuffle=True, random_state=2017) for cv in cv_values}
cv_strategies["Leave-One-Out"] = LeaveOneOut()
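
Leave-one-out can be viewed as k-fold cross-validation with k equal to the number of samples. The small check below illustrates this on toy data (the array `X_demo` is made up for the example): an unshuffled `KFold` with `n_splits=len(X_demo)` should produce the same splits as `LeaveOneOut`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)  # 6 toy samples, 2 features (illustrative data)

loo_splits = list(LeaveOneOut().split(X_demo))
kfold_splits = list(KFold(n_splits=len(X_demo)).split(X_demo))  # k = n, no shuffling

# Compare the train and test indices of each split pairwise.
same = all(
    np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
    for (tr_a, te_a), (tr_b, te_b) in zip(loo_splits, kfold_splits)
)
print(len(loo_splits), same)
```

Because every leave-one-out test fold holds a single sample, each per-fold accuracy is either 0 or 1, and only the mean over all folds is informative.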

In [None]:

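# Run cross-validation for each strategy.
# Both calls cross-validate within a single partition: the "Train" scores come from
# folds of X_train, and the "Test" scores from folds of the smaller X_test.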
for name, cv in cv_strategies.items():
    train_scores = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=cv)
    test_scores = cross_val_score(clf, X_test, y_test, scoring='accuracy', cv=cv)
    
    print(f"\n{name} Cross-Validation:")
    print("Train Fold Scores:", train_scores)
    print("Train CV Accuracy:", train_scores.mean())
    print("Test Fold Scores:", test_scores)
    print("Test CV Accuracy:", test_scores.mean())



3-Fold Cross-Validation:
Train Fold Scores: [0.68156425 0.72625698 0.67597765]
Train CV Accuracy: 0.6945996275605214
Test Fold Scores: [0.77922078 0.7012987  0.74025974]
Test CV Accuracy: 0.7402597402597403

4-Fold Cross-Validation:
Train Fold Scores: [0.65925926 0.68656716 0.69402985 0.68656716]
Train CV Accuracy: 0.6816058595909342
Test Fold Scores: [0.79310345 0.70689655 0.65517241 0.71929825]
Test CV Accuracy: 0.7186176648517846

5-Fold Cross-Validation:
Train Fold Scores: [0.7037037  0.66666667 0.70093458 0.71028037 0.72897196]
Train CV Accuracy: 0.7021114572516441
Test Fold Scores: [0.80851064 0.7173913  0.7173913  0.67391304 0.63043478]
Test CV Accuracy: 0.7095282146160962

10-Fold Cross-Validation:
Train Fold Scores: [0.68518519 0.72222222 0.61111111 0.72222222 0.75925926 0.61111111
 0.61111111 0.62264151 0.77358491 0.66037736]
Train CV Accuracy: 0.6778825995807127
Test Fold Scores: [0.83333333 0.65217391 0.69565217 0.73913043 0.69565217 0.73913043
 0.73913043 0.82608696 0.782
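
The objective refers to learning curves, which the cells above do not compute directly. A minimal sketch of how they might be obtained with scikit-learn's `learning_curve` follows. It runs on synthetic data rather than the diabetes file, and the names `X_demo`, `y_demo`, `clf_demo`, and `curves` are illustrative; leave-one-out is omitted here only to keep the run time short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf_demo = DecisionTreeClassifier(random_state=0)

curves = {}
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # Scores have shape (n_train_sizes, k): one column per fold.
    sizes, train_sc, val_sc = learning_curve(
        clf_demo, X_demo, y_demo, cv=cv, scoring="accuracy",
        train_sizes=np.linspace(0.2, 1.0, 5),
    )
    # Average over folds to get one curve point per training-set size.
    curves[k] = (sizes, train_sc.mean(axis=1), val_sc.mean(axis=1))
    print(k, sizes, val_sc.mean(axis=1).round(3))
```

Plotting the mean validation accuracy against `sizes` for each k gives one learning curve per strategy, which can then be compared side by side.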

In [None]:
# Install the required packages (run once, before the import cell)
%pip install pandas scikit-learn
