### **K-Fold Cross-Validation Technique in Machine Learning**

* K-Fold Cross-Validation is a crucial technique for evaluating the performance of a machine learning model. 

* It provides a more robust and reliable estimate of a model's generalization capability than a single train-test split. 

* This notebook will use the **Car Acceptability Dataset**, a simple yet powerful example of a multiclass classification problem.

### Step 1: Data Loading and Preparation

* We will load the Car Acceptability Dataset from its source. 

* The dataset contains 6 features related to car attributes like buying price, maintenance cost, number of doors, etc., and a target variable representing the car's overall acceptability (unacceptable, acceptable, good, very good).

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.pipeline import Pipeline

In [2]:
# Load the car dataset directly from UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
column_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataframe
df.head()


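|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | vhigh  | vhigh | 2     | 2       | small    | low    | unacc |
| 1 | vhigh  | vhigh | 2     | 2       | small    | med    | unacc |
| 2 | vhigh  | vhigh | 2     | 2       | small    | high   | unacc |
| 3 | vhigh  | vhigh | 2     | 2       | med      | low    | unacc |
| 4 | vhigh  | vhigh | 2     | 2       | med      | med    | unacc |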


---

* **OrdinalEncoder**

  * The `OrdinalEncoder` is a data preprocessing utility from the scikit-learn library. 
  
  * Its primary purpose is to convert categorical features that have a meaningful, *inherent order* into a numerical format that can be used by machine learning algorithms.

  * It transforms each category within a feature into a unique integer. 
  
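  * For example, for a feature 'size' with categories 'small', 'medium', and 'large', an `OrdinalEncoder` that is given this order through its `categories` parameter maps them to 0, 1, and 2. Without `categories`, it sorts the values (alphabetically for strings), so the resulting integers may not follow the intended order.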
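
The difference between the default (sorted) encoding and an explicitly ordered one can be sketched as follows. This is a minimal illustration on a made-up `size` column, not on the car dataset; the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy 'size' column (hypothetical data, not taken from the car dataset)
sizes = np.array([['small'], ['large'], ['medium'], ['small']])

# Default behaviour: categories are sorted, so large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order supplied via `categories`: small=0, medium=1, large=2
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # [2. 0. 1. 2.]
print(ordered_codes)  # [0. 2. 1. 0.]
```

Supplying `categories` matters mainly for models that treat the integers as magnitudes, such as linear models; tree-based models are less sensitive to the exact ordering.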

---

In [3]:
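# All features are categorical, so we use OrdinalEncoder to convert them to numbers.
# Note: no `categories` argument is given, so each column is encoded in sorted
# (alphabetical) order, e.g. buying: high=0, low=1, med=2, vhigh=3.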
encoder = OrdinalEncoder(dtype=np.int64)
df_encoded = df.copy()
df_encoded[column_names] = encoder.fit_transform(df_encoded)

# Display the first few rows of the encoded dataframe
df_encoded.head()

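|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | 3      | 3     | 0     | 0       | 2        | 1      | 2     |
| 1 | 3      | 3     | 0     | 0       | 2        | 2      | 2     |
| 2 | 3      | 3     | 0     | 0       | 2        | 0      | 2     |
| 3 | 3      | 3     | 0     | 0       | 1        | 1      | 2     |
| 4 | 3      | 3     | 0     | 0       | 1        | 2      | 2     |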


In [6]:
# Split the dataset into features and target variable
X = df_encoded.drop('class', axis=1)
y = df_encoded['class']

In [7]:
# Number of samples and features
print("Dataset Info:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print("\nTarget classes distribution:")
print(y.value_counts().sort_index())

Dataset Info:
Number of samples: 1728
Number of features: 6

Target classes distribution:
class
0     384
1      69
2    1210
3      65
Name: count, dtype: int64


### Step 2: *K-Fold Cross-Validation*

* K-Fold works by partitioning the entire dataset into `K` folds of (approximately) equal size. 

* In each iteration, one fold is set aside as the test set, while the remaining `K-1` folds are combined to form the training set. 

* The model is trained on the training set and its performance is evaluated on the held-out test set.

* This process is repeated `K` times, with the final performance being the average score across all iterations.
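
* The partitioning described above can be seen on a tiny example. The sketch below uses a hypothetical 10-sample array (`X_toy`) rather than the car data, and leaves shuffling off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (hypothetical, for illustration only)
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block of indices
kf_toy = KFold(n_splits=5)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf_toy.split(X_toy)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
```

* Every sample lands in exactly one test fold and in the training set of the other `K-1` iterations.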

**Manual K-Fold Implementation**

* We will manually implement `K-Fold` to see how the data is split and how the scores are calculated across the folds.

In [8]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(solver='liblinear')
scores = []

for fold_num, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
    
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold_num+1} Accuracy: {score:.4f}")

print(f"\nAverage K-Fold Accuracy: {np.mean(scores):.4f}")

Fold 1 Accuracy: 0.6734
Fold 2 Accuracy: 0.7081
Fold 3 Accuracy: 0.7052
Fold 4 Accuracy: 0.6957
Fold 5 Accuracy: 0.6609

Average K-Fold Accuracy: 0.6886


### Step 3: *Stratified K-Fold vs. K-Fold*

* The Car Acceptability dataset has a significant class imbalance, making **Stratified K-Fold** the superior technique. 

* Stratified K-Fold ensures that each fold maintains the same proportion of class labels as the original dataset. 

* This prevents a single fold from containing a disproportionately low or high number of samples from a particular class, which would lead to a biased performance evaluation. 

* Stratified K-Fold therefore provides a more reliable and stable score for classification problems.
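
* To make the difference concrete, the sketch below counts minority-class samples per test fold for both splitters. It uses hypothetical toy labels (`y_toy`, 90 of class 0 followed by 10 of class 1) and no shuffling, which is the worst case for plain `KFold` on sorted data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 90 samples of class 0, then 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # feature values do not affect how the splits are made

def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [int(y_toy[test_idx].sum())
            for _, test_idx in splitter.split(X_toy, y_toy)]

plain_counts = minority_counts(KFold(n_splits=5))
strat_counts = minority_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)   # [0, 0, 0, 0, 10]
print("StratifiedKFold:", strat_counts)   # [2, 2, 2, 2, 2]
```

* With plain `KFold`, four of the five test folds contain no minority samples at all, while `StratifiedKFold` keeps the 90/10 ratio in every fold.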

**Manual Stratified K-Fold Implementation**

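* We'll now apply `Stratified K-Fold` to our problem. Note that this cell also switches the model from `LogisticRegression` to `RandomForestClassifier`, so the scores are not directly comparable with those from Step 2, and the difference cannot be attributed to stratification alone.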

In [9]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores_stratified = []

for fold_num, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
    
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    scores_stratified.append(score)
    print(f"Fold {fold_num+1} Accuracy (Stratified): {score:.4f}")

print(f"\nAverage Stratified K-Fold Accuracy: {np.mean(scores_stratified):.4f}")

Fold 1 Accuracy (Stratified): 0.9827
Fold 2 Accuracy (Stratified): 0.9827
Fold 3 Accuracy (Stratified): 0.9827
Fold 4 Accuracy (Stratified): 0.9768
Fold 5 Accuracy (Stratified): 0.9681

Average Stratified K-Fold Accuracy: 0.9786


### Step 4: Using `cross_val_score` for Simplicity

* The `cross_val_score` function is a powerful utility from `scikit-learn` that automates the entire cross-validation process. 

* It handles the data splitting, model training, and scoring for each fold, making the process of model evaluation and comparison much more efficient. 

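* For classifiers, an integer `cv` makes it use `StratifiedKFold` by default, without shuffling. The rows of this dataset are stored in a systematic order, which likely explains why the Random Forest scores in this step are lower than those from the shuffled splits in Step 3.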
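
* The default splitter can be checked, and replaced, by passing a splitter object as `cv`. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption standing in for the car dataset); the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (a stand-in, not the car dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# For a classifier, cv=5 behaves like an unshuffled StratifiedKFold(n_splits=5)
scores_int = cross_val_score(clf, X_syn, y_syn, cv=5)
scores_skf = cross_val_score(clf, X_syn, y_syn, cv=StratifiedKFold(n_splits=5))
print("Same scores:", np.allclose(scores_int, scores_skf))

# To shuffle before splitting, pass a configured splitter instead of an integer
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(clf, X_syn, y_syn, cv=shuffled_cv)
print(f"Shuffled mean accuracy: {scores_shuffled.mean():.4f}")
```

* Passing a shuffled splitter is worth considering whenever the rows of a dataset are stored in a meaningful order.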

**Comparing Multiple Models with `cross_val_score`**

* Let's use `cross_val_score` to compare the performance of several different models on our dataset.

In [10]:
# Define the models to evaluate
models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC()
}

# Store results for each model
results = {}

# Evaluate each model and print the results
for name, model in models.items():
    pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    results[name] = scores
    print(f"{name}: Average Accuracy = {scores.mean():.4f}, Std Dev = {scores.std():.4f}")

Logistic Regression: Average Accuracy = 0.6528, Std Dev = 0.0285
Random Forest: Average Accuracy = 0.8114, Std Dev = 0.0533
Decision Tree: Average Accuracy = 0.7946, Std Dev = 0.1059
SVM: Average Accuracy = 0.8016, Std Dev = 0.0571


### Step 5: Parameter Tuning with `cross_val_score`

* Another key application of cross-validation is *hyperparameter tuning*. 

* We can evaluate a model's performance with different parameter settings to find the combination that provides the best generalized performance. 

* Here, we tune the `n_estimators` parameter for our `RandomForestClassifier`.
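
* As an alternative to a hand-written loop, scikit-learn's `GridSearchCV` runs the same kind of cross-validated search and records the best setting. The sketch below is not the approach used in this notebook; it runs on synthetic data with a small, hypothetical grid so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary data (a stand-in, not the car dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=6,
                                   n_informative=4, random_state=0)

# Candidate values to try (hypothetical grid, kept small for speed)
param_grid = {'n_estimators': [10, 25, 50]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring='accuracy')
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

* `search.cv_results_` holds the per-candidate mean and standard deviation, which helps judge whether the differences between settings are meaningful.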

In [11]:
n_estimators_list = [20, 50, 100, 200]
best_n_estimators = 0
best_score = 0

for n in n_estimators_list:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    avg_score = scores.mean()
    print(f"N_estimators={n}: Average Accuracy = {avg_score:.4f}")
    
    if avg_score > best_score:
        best_score = avg_score
        best_n_estimators = n

print(f"\nBest N_estimators: {best_n_estimators} with an average score of {best_score:.4f}")

N_estimators=20: Average Accuracy = 0.7981
N_estimators=50: Average Accuracy = 0.7894
N_estimators=100: Average Accuracy = 0.8114
N_estimators=200: Average Accuracy = 0.7912

Best N_estimators: 100 with an average score of 0.8114


### Summary

* Based on the cross-validation results from this notebook, we can draw the following conclusions:

  1. **Importance of Cross-Validation:** Using a cross-validation approach provided a stable and reliable performance metric, which is crucial for a dataset with a significant class imbalance like the Car Acceptability Dataset. It prevents the model evaluation from being skewed by a single train-test split.

  2. **Model Performance:** Among the models evaluated with `cross_val_score`, the Random Forest Classifier had the highest average accuracy (0.8114), although its lead over SVM (0.8016) and the Decision Tree (0.7946) is smaller than the fold-to-fold standard deviation. Ensemble methods like Random Forest often perform well on tabular datasets.

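  3. **Hyperparameter Tuning:** Our analysis using `cross_val_score` did not show a monotonic trend: accuracy was 0.7981, 0.7894, 0.8114, and 0.7912 for 20, 50, 100, and 200 trees (`n_estimators`). The best value, 100, leads by only about 0.02, which is small compared with the fold-to-fold standard deviation (0.0533 at 100 trees), so the choice of `n_estimators` has little measurable effect here. Cross-validation makes this kind of comparison possible without a separate validation split.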

---

*Machine Learning - Python Notebook* by [*Prakash Ukhalkar*](https://github.com/prakash-ukhalkar)

---

#### **Exercise: K-Fold Cross-Validation for Diabetes Prediction**

#### Problem Statement: 

* Build and evaluate machine learning models for predicting diabetes using the Pima Indians Diabetes Dataset.

* Use *K-Fold Cross-Validation* to ensure the model evaluation is robust and reliable.

##### Tasks to be performed:

  * Load the dataset and examine its features and target.

  * Manually implement `K-Fold` to understand the core concept of this evaluation technique.

  * Implement `Stratified K-Fold`, highlighting its advantages for classification problems.

  * Use the `cross_val_score` utility to streamline the evaluation of multiple models.

  * Use cross-validation to find the optimal parameters for a machine learning model.

  * Summarize the results and key insights gained from the analysis.

---

* Download Dataset: [Pima Indians Diabetes Database](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
* Solution: [Exercise - K-Fold Cross-Validation for Diabetes Prediction](https://github.com/prakash-ukhalkar/ML/blob/main/11_KFold_Cross_Validation_ML/01_Exercise_KFold_Cross_Validation_ML/01_Exercise_KFold_Cross_Validation_ML.ipynb)
