

---

### 1. **KFold**

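* Splits the dataset into *k* folds (e.g., 5).
* In each round, one fold is the test set and the remaining *k* - 1 folds form the training set, so every sample is tested exactly once.
* Does **not guarantee class balance** in each fold.
* ⚠️ Risk: if your dataset is imbalanced (e.g., 90% zeros, 10% ones), some folds may contain very few minority-class samples, or none at all when the data is sorted by label and `shuffle=False` (the default).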
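
A minimal sketch of that risk, assuming a toy label vector that is not part of the notebook's data: 90 zeros followed by 10 ones, split with an unshuffled `KFold`. The feature matrix `X` is a placeholder, since only the labels matter here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy imbalanced labels (an assumption for illustration): sorted on purpose.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

kf = KFold(n_splits=5)  # shuffle=False, so folds are contiguous blocks of 20

# Count minority-class samples in each test fold.
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in kf.split(X)]
print(minority_per_fold)  # [0, 0, 0, 0, 10]
```

Four of the five test folds contain no positive samples at all. Passing `shuffle=True` makes this much less likely, but it still does not guarantee balanced folds.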

---

### 2. **StratifiedKFold**

* Same as KFold, but it **preserves the class distribution** (ratios of each label) in every fold.
* Usually the better choice for classification problems, especially with imbalanced data.
* Example: If your dataset has 30% “dog” and 70% “cat,” then every fold will keep roughly that 30/70 ratio.
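
The dog/cat example can be checked with a small sketch. The label array below is made up to match the 30/70 ratio, and `X` is again a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels matching the 30% dog / 70% cat example.
y = np.array(["dog"] * 30 + ["cat"] * 70)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each test fold has 20 samples; count how many of them are dogs.
dogs_per_fold = [int(np.sum(y[test_idx] == "dog")) for _, test_idx in skf.split(X, y)]
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
print(dogs_per_fold)  # [6, 6, 6, 6, 6]
```

Note that `split` needs `y` here, because the stratification is computed from the labels. When class counts do not divide evenly by the number of folds, the per-fold counts can differ by one.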

---

### 3. **Leave-One-Out (LOO)**

* Extreme case of KFold where **k = number of samples**.
* Each fold uses exactly 1 sample as the test set and the rest as training data.
* Uses almost all the data for training in every round, but needs one model fit per sample, so it is computationally **very expensive** for large datasets.
* Typically used when the dataset is very small (e.g., small medical datasets).
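
To illustrate the "k = number of samples" point, here is a small sketch on six made-up samples comparing `LeaveOneOut` with an unshuffled `KFold` whose `n_splits` equals the sample count.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # 6 toy samples (illustrative only)

# Collect the test indices produced by each splitter.
loo_tests = [test_idx.tolist() for _, test_idx in LeaveOneOut().split(X)]
kf_tests = [test_idx.tolist() for _, test_idx in KFold(n_splits=len(X)).split(X)]

print(loo_tests)              # [[0], [1], [2], [3], [4], [5]]
print(loo_tests == kf_tests)  # True
```

With `cross_val_score`, this means one model fit per sample, which is why the approach is rarely practical beyond small datasets.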

---

### 4. **ShuffleSplit**

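* Randomly splits the dataset into train/test sets multiple times, with each split drawn independently.
* You can set the train/test size (e.g., 80% train, 20% test) independently of the number of splits.
* Doesn’t require strict “folds” → more flexible, but test sets can overlap and some samples may never be tested.
* Useful when you want random resampling instead of strict partitioning.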
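
A short sketch of how this differs from fold-based splitting, using 20 placeholder samples and illustrative values for `n_splits` and `test_size`. It records the size of each split and how often each sample lands in a test set.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.zeros((20, 1))  # 20 placeholder samples
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

test_counts = Counter()  # how many times each sample index was in a test set
split_sizes = []
for train_idx, test_idx in ss.split(X):
    split_sizes.append((len(train_idx), len(test_idx)))
    test_counts.update(test_idx.tolist())

print(split_sizes[0])             # (15, 5)
print(max(test_counts.values()))  # at least 2: some sample is tested more than once
```

Ten splits with five test samples each give 50 test slots for only 20 samples, so repeats are unavoidable here. With KFold, by contrast, each sample is tested exactly once.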

---

👉 Summary:

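* **KFold** → simple splitting, may leave minority classes under-represented or missing in some folds.
* **StratifiedKFold** → keeps class ratios roughly equal across folds (what scikit-learn uses for classifiers when `cv` is an integer).
* **Leave-One-Out** → one test sample per fold; very slow, and the estimate has low bias but can have high variance.
* **ShuffleSplit** → random splits, flexible, can control train/test ratio.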
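
The note about StratifiedKFold being the default for classification can be checked with scikit-learn's `check_cv` helper, which, as far as I know, is what `cross_val_score` uses to interpret an integer `cv`. The label vector below is a made-up binary target.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y = np.array([0, 1] * 10)  # toy binary labels (illustrative only)

# Resolve cv=5 for a classifier and for a non-classifier.
cv_clf = check_cv(5, y, classifier=True)
cv_reg = check_cv(5, y, classifier=False)

print(type(cv_clf).__name__)  # StratifiedKFold
print(type(cv_reg).__name__)  # KFold
```

So `cross_val_score(clf, X, y, cv=5)` on a classifier should already stratify. Neither resolved splitter shuffles, though, so pass an explicit splitter when shuffling or a fixed `random_state` is needed.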

---



```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, ShuffleSplit
import numpy as np

# Load data
digits = load_digits()
X, y = digits.data, digits.target

# Model
rf = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42)

# 1. Simple KFold (no stratification)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(rf, X, y, cv=kf)

# 2. StratifiedKFold (keeps class balance)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(rf, X, y, cv=skf)

# 3. ShuffleSplit (random train/test splits)
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores_ss = cross_val_score(rf, X, y, cv=ss)

print("KFold mean accuracy:", np.mean(scores_kf))
print("StratifiedKFold mean accuracy:", np.mean(scores_skf))
print("ShuffleSplit mean accuracy:", np.mean(scores_ss))
```


```text
KFold mean accuracy: 0.9771804394924171
StratifiedKFold mean accuracy: 0.9788502011761067
ShuffleSplit mean accuracy: 0.9755555555555556
```
