# Cross-Validation in Machine Learning

Cross-validation is a technique used to **assess the performance of a machine learning model** and to **tune hyperparameters** effectively.

---

## Data Splitting Basics

1. **Training Data**: Used to train the model.
2. **Validation Data**: Used to **tune hyperparameters**.
3. **Test Data**: Used to **evaluate final model performance** on unseen data.

### Example Split

If we have 1000 records:

* **Train-Test Split:** 70% train (700 records), 30% test (300 records)
* **Train-Validation Split:** Further divide the 700 training records into training and validation subsets for hyperparameter tuning.
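The split above can be sketched in plain Python. This is a minimal illustration, not a prescribed recipe: the function name `train_val_test_split`, the 20% validation fraction, and the seed are assumptions made for the example; only the 70/30 train-test ratio comes from the text.

```python
import random

def train_val_test_split(records, test_frac=0.30, val_frac=0.20, seed=42):
    """Shuffle records, hold out a test set, then carve validation out of training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    # Hold out the test set first so it is never touched during tuning
    n_test = round(len(shuffled) * test_frac)
    test, remaining = shuffled[:n_test], shuffled[n_test:]

    # Split what is left into validation and training (val_frac is an assumption)
    n_val = round(len(remaining) * val_frac)
    val, train = remaining[:n_val], remaining[n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 560 140 300
```

Changing `seed` changes which records land in each subset, which is the motivation for cross-validation.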

---

## Why Cross-Validation?

* A single train-validation split depends on the **random state** (the seed used to shuffle the data).
* Different splits can yield noticeably different accuracy, for example:

$$
\text{Accuracy} \in [75\%, 92\%]
$$

* Cross-validation provides a **robust estimate** of model performance by averaging results across multiple splits.
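To make the dependence on the random state concrete, here is a small sketch that scores the same toy model on several random splits. Everything in it is hypothetical: the synthetic one-dimensional data, the nearest-centroid rule standing in for a model, and the helper names `make_toy_data` and `holdout_accuracy` are assumptions chosen only to keep the example self-contained.

```python
import random

def make_toy_data(n=200, seed=0):
    """Hypothetical 1-D dataset: class 1 centred at +1, class 0 at -1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label == 1 else -1.0, 1.5)
        data.append((x, label))
    return data

def holdout_accuracy(data, split_seed, val_frac=0.3):
    """Accuracy of a nearest-centroid rule on one random train-validation split."""
    rng = random.Random(split_seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    val, train = shuffled[:n_val], shuffled[n_val:]

    # "Training": compute the mean feature value of each class
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)

    # "Prediction": assign each validation point to the nearer class mean
    correct = sum(
        (1 if abs(x - mean1) < abs(x - mean0) else 0) == y for x, y in val
    )
    return correct / len(val)

data = make_toy_data()
accuracies = [holdout_accuracy(data, split_seed=s) for s in range(5)]
print([round(a, 3) for a in accuracies])
print("mean over splits:", round(sum(accuracies) / len(accuracies), 3))
```

The individual accuracies typically differ from seed to seed, while their mean is a steadier summary of how the model behaves.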

---

# Types of Cross-Validation

## 1. Leave-One-Out Cross-Validation (LOOCV)

* Each experiment uses **one record as validation** and the rest as training.
* For $n$ records, perform $n$ experiments.

**Example:**

* Training size: 500
* Experiment 1: Record 1 → validation, Records 2-500 → training
* Experiment 2: Record 2 → validation, Records 1 & 3-500 → training
* ... up to Experiment 500

**Disadvantages:**

* Computationally expensive for large datasets
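* **High-variance performance estimate**: each validation set is a single record, and the $n$ training sets are almost identical, so the per-experiment scores are noisy and highly correlated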
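The experiment schedule in the example can be generated with a few lines of Python. This is a sketch using zero-based indices (so "Record 1" is index 0); the function name `leave_one_out_splits` is an assumption for illustration.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, val_indices) for each of the n experiments."""
    for i in range(n):
        # Everything except record i is used for training
        train = [j for j in range(n) if j != i]
        yield train, [i]

splits = list(leave_one_out_splits(500))
print(len(splits))                   # 500
first_train, first_val = splits[0]
print(first_val, len(first_train))   # [0] 499
```

Each experiment requires fitting the model once, so the cost grows linearly with the number of records.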

---

## 2. Leave-P-Out Cross-Validation

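* Similar to LOOCV but uses **p records** as validation in each experiment.
* The value of $p$ is chosen by the practitioner (e.g., 10, 20, 30).
* In its exhaustive form, every possible set of $p$ records is used once as validation, giving $\binom{n}{p}$ experiments. This grows very quickly: for $n = 500$ and $p = 2$ there are already 124,750 experiments, compared with 500 for LOOCV.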
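A minimal sketch of the exhaustive form of leave-p-out, in which every size-p subset of the records serves once as the validation set. The helper name `leave_p_out_splits` is an assumption; `itertools.combinations` and `math.comb` are standard library functions (`math.comb` requires Python 3.8 or later).

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n, p):
    """Yield (train_indices, val_indices) for every size-p validation subset."""
    for val in combinations(range(n), p):
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        yield train, list(val)

splits = list(leave_p_out_splits(5, 2))
print(len(splits), comb(5, 2))   # 10 10
print(splits[0])                 # ([2, 3, 4], [0, 1])
print(comb(500, 2))              # 124750
```

Because the number of subsets is a binomial coefficient, exhaustive leave-p-out is usually practical only for small datasets or small values of p.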

---

## 3. K-Fold Cross-Validation

* Divide the training data into $k$ folds of (approximately) equal size.
* Perform $k$ experiments, with each fold used exactly once as validation and the remaining $k - 1$ folds used for training.

**Formulas:**

$$
\text{Fold size} = \frac{n}{k}
$$

* Example: $n = 500$, $k = 5$ → Fold size = 100
* Experiment 1: Fold 1 → validation, Folds 2-5 → training
* Experiment 2: Fold 2 → validation, Folds 1, 3-5 → training
* ...
* Average the accuracies:

$$
\text{CV Accuracy} = \frac{1}{k} \sum_{i=1}^{k} \text{Accuracy}_i
$$
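The fold layout and the averaging formula can be sketched as follows. The function names `k_fold_splits` and `cv_accuracy` are assumptions, the folds are contiguous index ranges with no shuffling (in practice the data is often shuffled first), and the per-fold accuracies are made-up numbers used only to exercise the formula.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    # If n is not divisible by k, the first n % k folds get one extra record
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cv_accuracy(fold_accuracies):
    """Average the per-fold accuracies, as in the CV Accuracy formula."""
    return sum(fold_accuracies) / len(fold_accuracies)

for train, val in k_fold_splits(500, 5):
    print(len(train), len(val))  # 400 100 on every iteration

# Hypothetical per-fold accuracies, for illustration only
print(cv_accuracy([0.80, 0.85, 0.90, 0.75, 0.95]))
```

With $n = 500$ and $k = 5$, every record appears in exactly one validation fold and in four training sets.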

---

## 4. Stratified K-Fold Cross-Validation

* Ensures **class proportions** in each fold are similar to the original dataset.
* Particularly useful for **imbalanced classification**.
* Prevents a validation fold from having only one class.

**Example:**

* Binary classification: 60 ones, 40 zeros
* Each fold maintains ~60% ones, 40% zeros
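One simple way to obtain such folds is to deal the records of each class round-robin across the folds. The sketch below assumes that approach and a hypothetical helper name `stratified_k_fold`; library implementations may differ in detail (for example, by shuffling within each class first).

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Return k validation folds whose class proportions mirror the full dataset."""
    # Group record indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's records round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [1] * 60 + [0] * 40
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    ones = sum(labels[i] for i in fold)
    print(len(fold), ones, len(fold) - ones)  # 20 12 8 on every iteration
```

Each fold holds 12 ones and 8 zeros, matching the 60/40 ratio of the full dataset; the training set for a given experiment is the union of the other four folds.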

---

## 5. Time Series Cross-Validation

* For **time-dependent data**, order matters.
* Split sequentially based on time:

$$
\text{Training: Day } 1 \to \text{Day } t, \quad \text{Validation: Day } t+1 \to \text{Day } t+n
$$

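* Random splitting must be **avoided**: shuffling would let the model train on records from the future and validate on the past, which leaks information.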
* Useful for applications like **sentiment analysis over time** or **stock price prediction**.
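A minimal sketch of sequential splitting, assuming an expanding training window (one common variant; a fixed-length rolling window is another). The names `expanding_window_splits`, `initial_train`, and `horizon` are assumptions for illustration, with `horizon` playing the role of the validation length in the formula above.

```python
def expanding_window_splits(n_days, initial_train, horizon):
    """Yield (train_days, val_days) where training always precedes validation."""
    t = initial_train
    while t + horizon <= n_days:
        train = list(range(1, t + 1))               # Day 1 .. Day t
        val = list(range(t + 1, t + horizon + 1))   # Day t+1 .. Day t+horizon
        yield train, val
        t += horizon  # grow the training window to absorb the validated days

for train, val in expanding_window_splits(n_days=10, initial_train=4, horizon=2):
    print(train[-1], val)
# 4 [5, 6]
# 6 [7, 8]
# 8 [9, 10]
```

In every split the last training day comes strictly before the first validation day, so no future information reaches the model.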

---

## Summary Table of Cross-Validation Types

| CV Type           | Validation Size | Advantages                                | Disadvantages                                     |
| ----------------- | --------------- | ----------------------------------------- | ------------------------------------------------- |
| LOOCV             | 1 record        | Nearly unbiased, trains on almost all data | Computationally expensive, high-variance estimate |
| Leave-P-Out       | p records       | Flexible validation size                  | Number of splits grows combinatorially            |
| K-Fold            | ~n/k records    | Efficient, averages over k splits         | May not maintain class distribution               |
| Stratified K-Fold | ~n/k records    | Maintains class distribution              | Slightly more complex                             |
| Time Series CV    | Sequential days | Preserves temporal order                  | Cannot randomly shuffle data                      |

---

**Key Takeaways:**

1. Cross-validation makes **hyperparameter tuning more robust** by scoring each setting on several splits instead of one.
2. Choice of CV depends on **dataset size, class balance, and temporal dependencies**.
3. **Average, max, and min CV accuracies** together provide a better understanding of model performance than a single score.
4. **Time series data requires sequential validation** to avoid data leakage.

---


