# Data Splitting

Divide a dataset into two or more subsets to train, validate, and evaluate a *Machine Learning* model.

* **Training Set**: The largest portion of the data (typically 70-80%), used to train the model (learn **patterns and relationships** in the data).

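* **Validation/Development Set**: A smaller portion of the data (typically 10-15%), used to tune the model's hyperparameters and assess its performance during training. It helps detect **overfitting** by evaluating the model on data it was not fitted on. Because tuning decisions are based on this set, its score gradually becomes optimistic, which is why a separate test set is kept for the final evaluation.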

* **Test Set**: Another smaller portion of the data (typically 10-15%), used for the final evaluation of the trained model's performance. It provides an unbiased estimate of how well the model will perform on completely new, unseen data. The model **does not learn from this set** during training or hyperparameter tuning.
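
The three-way split described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `train_val_test_split`, its parameters, and the 70/15/15 proportions are assumptions chosen for the example. In practice, scikit-learn's `train_test_split` (applied twice) is a common way to get the same result.

```python
import random


def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle `data` and cut it into train, validation, and test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")

    items = list(data)
    # Shuffle a copy with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(items)

    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)

    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]  # everything that is left
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before cutting matters when the data is stored in some order (for example, sorted by class); it should be skipped for time-dependent data.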

### Importance of Data Splitting

* **Prevent Overfitting**: By evaluating the model on data unseen during training (validation/test sets), we can check if the model has **learned generalizable patterns** (instead of **memorizing** the training data itself).
  
* **Model Selection and Hyperparameter Tuning**: The validation set allows us to compare different models and their configurations (hyperparameters) to choose the one that performs best on unseen data.

* **Assess Generalization**: The test set provides a final, unbiased estimate of the model's **ability to generalize to new data**.
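
To make the difference between memorizing and generalizing concrete, here is a small sketch on hypothetical toy data. The data is deliberately pure noise (random inputs, random labels), and the "model" is just a lookup table, so both are assumptions made for illustration only. The training score looks perfect, while the held-out score reveals that nothing generalizable was learned.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: random inputs with labels that are pure noise,
# so there is no real pattern to learn.
points = [(rng.random(), rng.choice([0, 1])) for _ in range(400)]
train, test = points[:300], points[300:]

# A "model" that only memorizes: a lookup table from input to label.
memory = {x: y for x, y in train}


def predict(x):
    # Fall back to class 0 for inputs never seen during training.
    return memory.get(x, 0)


def accuracy(rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Without the held-out portion, the perfect training accuracy would be the only number available, and the model would look far better than it is.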

### Data Splitting Techniques

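* **Train-Test Split**: The simplest method, but it leaves no separate data for hyperparameter tuning. Tuning against the test set leaks information into the model and makes the final performance estimate optimistic.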

* **Train-Validation-Test Split**: The most common method. Divides the data into three sets for training, hyperparameter tuning, and final evaluation. 

* **K-Fold Cross-Validation**: The dataset is divided into $k$ equal-sized (or nearly equal-sized) *folds*. The model is trained and evaluated $k$ times, with each fold serving as the held-out set once and the remaining $k-1$ folds used for training. The performance is then averaged across all $k$ evaluations. This provides a more robust estimate of performance, especially with smaller datasets.

* **Time Series Split**: Used for time-dependent data, where the data is split chronologically. The training set consists of earlier time periods, and the test set consists of later time periods, preventing *looking into the future* during training. 
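
The index bookkeeping behind the last two techniques can be sketched as follows. Both helper names (`k_fold_indices`, `time_series_splits`) and the expanding-window scheme for the chronological split are assumptions made for this example; scikit-learn offers `KFold` and `TimeSeriesSplit` in `sklearn.model_selection` for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def time_series_splits(n_samples, n_splits):
    """Yield expanding-window (train_idx, test_idx) pairs in time order."""
    test_size = n_samples // (n_splits + 1)
    if n_splits < 1 or test_size == 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        # Train on everything before the test window, never after it.
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


folds = list(k_fold_indices(10, 5))
ts_splits = list(time_series_splits(10, 3))

for train_idx, test_idx in folds:
    print("k-fold      train:", train_idx, "test:", test_idx)
for train_idx, test_idx in ts_splits:
    print("time series train:", train_idx, "test:", test_idx)
```

In the k-fold case every sample is held out exactly once, so the $k$ scores can be averaged. In the time series case the training indices always precede the test indices, which is what prevents looking into the future. For k-fold on ordered data, the samples would normally be shuffled first.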

Datasets:

* Multi-class Classification Task &rarr; Iris Dataset
   * Number of Instances: 150
   * Number of Features: 4 (sepal length, sepal width, petal length, petal width)
   * Number of classes: 3 (setosa, versicolor, virginica)
* Binary Classification Task &rarr; Breast Cancer Wisconsin (Diagnostic) Dataset
   * Number of Instances: 569
   * Number of Features: 30 (real-valued features computed from digitized images of cell nuclei)
   * Number of classes: 2 (Malignant/Benign)
* Regression Task &rarr; Wine Quality Dataset
   * Number of Instances: 4,898 (white wine subset; the red wine subset has 1,599)
   * Number of Features: 11 (physicochemical tests)
   * Prediction: quality score