# Data Splitting

Divide a dataset into two or more subsets to train, validate, and evaluate a *Machine Learning* model.

* **Training Set**
    * Largest portion (70-80%)
    * **Goal**: Train the model (learn **patterns and relationships** in the data).

* **Validation/Development Set**
    * Small portion (10-15%)
    * **Goal**:
        * Tune the model's hyperparameters and detect **overfitting** before the final evaluation.

* **Test Set**
    * Small portion (10-15%)
    * **Goal**:
        * Final, unbiased estimate of the trained model's performance.
        * The model **does not learn anything** from this set.

### Importance of Data Splitting

* **Prevent Overfitting**: By evaluating the model on data unseen during training (validation/test sets), we can check if the model has **learned generalizable patterns** (instead of **memorizing** the training data itself).
  
* **Model Selection and Hyperparameter Tuning**: The validation set allows us to compare different models and their configurations (hyperparameters) to choose the one that performs best on unseen data.

* **Assess Generalization**: The test set provides a final, unbiased estimate of the model's **ability to generalize to new data**.

#### Data Ordering Bias

* Data might be ordered by class or by some other feature.
* If data was collected in batches, samples within a batch may share similarities.
    * Example: data stored in order of collection time.

To prevent bias from data ordering, data must be randomly shuffled prior to any data splitting.

* **Shuffle first, then split**
    * Or use a splitting function that does the shuffling.
* Exception: time series, where the temporal order must be preserved.
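
The effect of ordering can be illustrated with a minimal sketch. It assumes a toy label list sorted by class (three classes of 50 samples each, mirroring how Iris is commonly stored) and uses only Python's standard library; the variable names are illustrative, not part of any library.

```python
import random

# Toy dataset ordered by class: 50 samples of class 0, then class 1, then class 2.
# (Assumed layout for illustration only.)
labels = [0] * 50 + [1] * 50 + [2] * 50
indices = list(range(len(labels)))
n_test = len(indices) // 5  # hold out 20% for testing

# Naive split: take the last 20% without shuffling.
naive_test = indices[-n_test:]
naive_test_classes = {labels[i] for i in naive_test}

# Shuffle first (fixed seed for reproducibility), then split.
rng = random.Random(42)
shuffled = indices[:]
rng.shuffle(shuffled)
shuffled_test = shuffled[-n_test:]
shuffled_test_classes = {labels[i] for i in shuffled_test}

print("classes in naive test set:   ", sorted(naive_test_classes))
print("classes in shuffled test set:", sorted(shuffled_test_classes))
```

Without shuffling, the held-out 20% contains only the last class, so the test score would say nothing about the other two. Library helpers such as scikit-learn's `train_test_split` shuffle by default.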

### Data Splitting Techniques

* **Train-Test Split**: The simplest method, and a risky one: with no validation set, any hyperparameter tuning has to use the test set, which biases the final performance estimate.

<center><img src="img/split-train-test.png" alt="train-test data split" style="width: 60%;"/></center>

* **Train-Validation-Test Split**: The most common method. Divides the data into three sets for training, hyperparameter tuning, and final evaluation.

<center><img src="img/split-train-valid-test.png" alt="train-validation-test data split" style="width: 60%;"/></center>
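
A three-way split can be sketched as below. The helper `train_valid_test_split` and its default fractions (70/15/15) are assumptions made for illustration; it is not a library function.

```python
import random

def train_valid_test_split(samples, valid_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle `samples` and return (train, valid, test) lists.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle first, then split
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = train_valid_test_split(range(100))
print(len(train), len(valid), len(test))  # 70 15 15
```

The three subsets are disjoint and together cover every sample, so nothing used for tuning or final evaluation is seen during training.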

* **K-Fold Cross-Validation**: The dataset is first divided into training and test sets. The training set is then further divided into $k$ equal-sized *folds*.
    * The model is trained and evaluated $k$ times, with each fold serving as the validation set once and the remaining $k-1$ folds used for training.
    * The performance is <u>averaged</u> across all $k$ evaluations.
    * Provides a <u>more robust estimate</u> of performance than a single validation split, especially with smaller datasets.

<center><img src="img/5-fold-cross-validation.png" alt="5-fold-cross-validation" style="width: 62%;"/></center>

Final evaluation:

* Use cross-validation to tune the model's hyperparameters (or to select the best model).
* Retrain the chosen model on the <u>entire training set</u>.
* Evaluate that model once on the test set.
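
The whole procedure (tune with k-fold cross-validation, retrain, evaluate on the test set) can be sketched with the standard library alone. The "model" here is an assumption chosen for brevity: a mean shrunk toward zero, with the shrinkage `alpha` standing in for a hyperparameter. The data is synthetic and all function names are illustrative.

```python
import random

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

def fit(y_train, alpha):
    """Toy model: a mean shrunk toward zero; `alpha` is the hyperparameter."""
    return sum(y_train) / (len(y_train) + alpha)

def mse(prediction, y_true):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in y_true) / len(y_true)

def cv_score(y, alpha, k=5):
    """Average validation MSE over the k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(y), k):
        model = fit([y[i] for i in train_idx], alpha)
        scores.append(mse(model, [y[i] for i in valid_idx]))
    return sum(scores) / k

# Synthetic targets (generated in random order), split into train and test.
rng = random.Random(0)
y = [10 + rng.gauss(0, 1) for _ in range(100)]
y_train, y_test = y[:80], y[80:]

# 1. Cross-validation on the training set selects the hyperparameter.
candidates = [0.0, 1.0, 10.0]
best_alpha = min(candidates, key=lambda a: cv_score(y_train, a))
# 2. Retrain on the entire training set with the chosen hyperparameter.
final_model = fit(y_train, best_alpha)
# 3. Evaluate once on the untouched test set.
test_mse = mse(final_model, y_test)
print("best alpha:", best_alpha, "test MSE:", round(test_mse, 3))
```

In practice a library would usually handle the fold bookkeeping; scikit-learn's `KFold` and `GridSearchCV` cover the first two steps.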

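* **Nested Cross-Validation**: An outer cross-validation loop estimates generalization performance, while an inner cross-validation loop, run on each outer training split, tunes the hyperparameters.
    * Keeps hyperparameter tuning separate from the performance estimate, so the estimate is not optimistically biased.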

<center><img src="img/nested-cross-validation.png" alt="nested cross-validation" style="width: 60%;"/></center>
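
A minimal sketch of nested cross-validation follows. As an assumption made for brevity, the "model" is a mean shrunk toward zero with a shrinkage hyperparameter `alpha`; the data is synthetic and every function name is illustrative rather than taken from a library.

```python
import random

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs over k contiguous folds."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        valid_idx = list(range(lo, hi))
        train_idx = list(range(lo)) + list(range(hi, n_samples))
        yield train_idx, valid_idx

def fit(y_train, alpha):
    """Toy model: a mean shrunk toward zero; `alpha` is the hyperparameter."""
    return sum(y_train) / (len(y_train) + alpha)

def mse(prediction, y_true):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in y_true) / len(y_true)

def select_alpha(y, candidates, k):
    """Inner loop: pick the candidate with the lowest mean validation MSE."""
    def cv_mse(alpha):
        folds = list(k_fold_indices(len(y), k))
        return sum(
            mse(fit([y[i] for i in tr], alpha), [y[i] for i in va])
            for tr, va in folds
        ) / len(folds)
    return min(candidates, key=cv_mse)

def nested_cv(y, candidates, outer_k=5, inner_k=3):
    """Outer loop: score the whole 'tune, then fit' procedure on held-out folds."""
    outer_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), outer_k):
        y_outer_train = [y[i] for i in train_idx]
        # Tuning only ever sees the outer training data.
        alpha = select_alpha(y_outer_train, candidates, inner_k)
        model = fit(y_outer_train, alpha)
        outer_scores.append(mse(model, [y[i] for i in test_idx]))
    return sum(outer_scores) / len(outer_scores)

rng = random.Random(0)
y = [10 + rng.gauss(0, 1) for _ in range(60)]
estimate = nested_cv(y, candidates=[0.0, 1.0, 10.0])
print("nested CV estimate of MSE:", round(estimate, 3))
```

The returned number estimates how well the complete procedure (tuning included) generalizes; it does not correspond to a single fixed hyperparameter value, since each outer fold may select a different one.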

* **Time Series Split**: Used for time-dependent data.
    * The data is split chronologically (train &rarr; test), without shuffling.
    * Avoids *lookahead bias* (training on information from the future).


<center><img src="img/time-series-split.png" alt="time-series split" style="width: 60%;"/></center>
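
A chronological split can be sketched as follows. The observations are hypothetical `(timestamp, value)` pairs and `chronological_split` is an illustrative helper, not a library function.

```python
from datetime import date, timedelta

# Hypothetical daily observations as (timestamp, value) pairs, deliberately out of order.
start = date(2024, 1, 1)
observations = [
    (start + timedelta(days=d), float(d)) for d in (3, 0, 9, 1, 7, 2, 8, 5, 4, 6)
]

def chronological_split(rows, test_frac=0.2):
    """Sort by timestamp, then cut once: older rows train, newer rows test."""
    ordered = sorted(rows, key=lambda row: row[0])
    n_test = max(1, round(len(ordered) * test_frac))
    return ordered[:-n_test], ordered[-n_test:]

train, test = chronological_split(observations)
print("train ends:  ", train[-1][0])
print("test starts: ", test[0][0])
```

Every training timestamp precedes every test timestamp, so the model is never fitted on data from the period it is evaluated on.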

* **Time-Series Cross-Validation**
    * Avoids lookahead bias.
    * Splits the data chronologically.
    * For each validation fold, trains only on data that precedes it in time.

<center><img src="img/time-series cross-validation.png" alt="time-series cross-validation" style="width: 62%;"/></center>
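
One common way to build such folds is an expanding training window, sketched below. The generator `time_series_folds` is an illustrative helper that assumes the samples are already sorted from oldest to newest.

```python
def time_series_folds(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs with an expanding training window.

    Assumes the samples are already sorted from oldest to newest.
    """
    valid_size = n_samples // (n_splits + 1)
    if valid_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_valid_start = n_samples - n_splits * valid_size
    for valid_start in range(first_valid_start, n_samples, valid_size):
        train_idx = list(range(valid_start))  # everything before the fold
        valid_idx = list(range(valid_start, valid_start + valid_size))
        yield train_idx, valid_idx

for train_idx, valid_idx in time_series_folds(n_samples=12, n_splits=3):
    print("train:", train_idx, "-> validate:", valid_idx)
# train: [0, 1, 2] -> validate: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] -> validate: [6, 7, 8]
# train: [0, 1, 2, 3, 4, 5, 6, 7, 8] -> validate: [9, 10, 11]
```

Each training window ends exactly where its validation fold begins. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window scheme.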

Time-Series Cross-Validation with overlapping:

<center><img src="img/time-series cross-validation2.png" alt="time-series cross-validation with overlap" style="width: 62%;"/></center>

Datasets:

* Multi-class Classification Task &rarr; Iris Dataset
   * Number of Instances: 150
   * Number of Features: 4 (sepal length, sepal width, petal length, petal width)
   * Number of classes: 3 (setosa, versicolor, virginica)
* Binary Classification Task &rarr; Breast Cancer Wisconsin (Diagnostic) Dataset
   * Number of Instances: 569
   * Number of Features: 30 (real-valued features computed from digitized images of cell nuclei)
   * Number of classes: 2 (Malignant/Benign)
* Regression Task &rarr; Wine Quality Dataset
   * Number of Instances: 4,898 (white wine subset)
   * Number of Features: 11 (physicochemical tests)
   * Prediction: quality score