
# Model Selection Process

Imagine you have a model `g` and a dataset `X` with target values `y`.  
Your model performs well on this data, but you want to know **how it performs on new, unseen data**.

To estimate this, we use a **Train–Validation Split** — for example:
- 80% of the data for **training** (the data the model learns from)
- 20% for **validation** (held out to stand in for new, unseen data)
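As a minimal sketch of the 80/20 split above, the following uses only the Python standard library. The helper name `train_val_split`, the seed, and the 100-record stand-in dataset are assumptions made for illustration; in practice a library routine would typically be used instead.

```python
import random

def train_val_split(rows, val_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, validation) lists."""
    shuffled = list(rows)                    # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))                      # hypothetical stand-in for 100 records
train, val = train_val_split(data)
print(len(train), len(val))                  # 80 20
```

Shuffling before splitting matters when the records are stored in some meaningful order (for example by date or by class); otherwise the validation part may not resemble the training part.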

---

## Steps to Evaluate Model Performance

1. Extract **feature matrix** `X_train` from the training dataset.  
2. Extract **target values** `y_train` from the training dataset.  
3. Train the model `g` using `X_train` and `y_train`.  
4. Take the validation dataset and extract `X_val` and `y_val` (written $X_V$ and $y_V$ in the formulas below).  
5. Apply the trained model to the validation set:  
   $$
   \hat{y}_V = g(X_V)
   $$
6. Compare predicted values $\hat{y}_V$ with actual values $y_V$ to assess model performance.

| Predicted (Probability) | Predicted (Label) | Actual (Target) |
|--------------------------|-------------------|------------------|
| 0.8 | 1 | 1 |
| 0.7 | 1 | 0 |
| 0.6 | 1 | 1 |
| 0.1 | 0 | 0 |
| 0.9 | 1 | 1 |
| 0.6 | 1 | 0 |

4 out of 6 predictions are correct → **accuracy ≈ 66.7%** (4/6)
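The accuracy calculation for the table above can be reproduced in a few lines. The 0.5 decision threshold is an assumption; the notes do not state it, but it is consistent with every predicted label in the table.

```python
# Values copied from the table above.
probabilities = [0.8, 0.7, 0.6, 0.1, 0.9, 0.6]
actual = [1, 0, 1, 0, 1, 0]

threshold = 0.5  # assumed cut-off for turning a probability into a label
predicted = [1 if p >= threshold else 0 for p in probabilities]

# Accuracy = share of predictions that match the actual target.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

print(predicted)                                   # [1, 1, 1, 0, 1, 1]
print(f"{correct}/{len(actual)} correct, accuracy = {accuracy:.3f}")
# 4/6 correct, accuracy = 0.667
```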

---

## Trying Different Models

| Model | Type | Accuracy |
|--------|------|-----------|
| g₁ | Logistic Regression | 66% |
| g₂ | Decision Tree | 60% |
| g₃ | Random Forest | 67% |
| g₄ | Neural Network | **80%**|

The **Neural Network (g₄)** performs best based on validation accuracy.
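Picking the winner is just a maximum over the validation scores. A small sketch, with the accuracies from the table hard-coded and ASCII keys (`g1` to `g4`) standing in for the model names:

```python
# Validation accuracies copied from the table above.
validation_accuracy = {"g1": 0.66, "g2": 0.60, "g3": 0.67, "g4": 0.80}

# Select the model whose validation accuracy is highest.
best_model = max(validation_accuracy, key=validation_accuracy.get)
print(best_model, validation_accuracy[best_model])  # g4 0.8
```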

---

## Multiple Comparison Problem

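When many models are compared on the **same validation set**,  
the top-scoring model may have won partly **by chance**: with enough candidates, one of them can get lucky on that particular sample, much like one of many people flipping coins producing a long streak of heads.

To check that the selected model's performance **genuinely generalizes**,  
we hold out a **third dataset**, the **test set**.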
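The effect can be illustrated with a small simulation. This is a sketch under deliberately artificial assumptions: the labels are fair coin flips, and each of the 50 hypothetical "models" simply guesses at random, so none of them can truly do better than 50%. The sizes and the seed are arbitrary.

```python
import random

rng = random.Random(0)
n_val, n_test, n_models = 100, 10_000, 50

# Labels are fair coin flips, so no model can truly beat 50% accuracy.
y_val = [rng.randint(0, 1) for _ in range(n_val)]
y_test = [rng.randint(0, 1) for _ in range(n_test)]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Each "model" guesses at random; all are scored on the same validation set.
val_scores = []
for _ in range(n_models):
    guesses = [rng.randint(0, 1) for _ in range(n_val)]
    val_scores.append(accuracy(y_val, guesses))

best_val = max(val_scores)

# The "winner" is still a random guesser, so on fresh test data
# its predictions are just another set of random guesses.
test_guesses = [rng.randint(0, 1) for _ in range(n_test)]
best_test = accuracy(y_test, test_guesses)

print(f"best validation accuracy: {best_val:.2f}")
print(f"same model on test data:  {best_test:.2f}")
```

The best validation score typically lands noticeably above 0.5 even though every model is pure noise, while the test score falls back to roughly 0.5. That gap is what the test set is meant to expose.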

---

## Training–Validation–Test Split

Use three separate datasets:
- **Training (60%)**
- **Validation (20%)**
- **Test (20%)**

We use the **train-validation** process to choose the best model,  
then apply the **winning model** to the **test dataset** to verify generalization.
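A minimal sketch of a 60/20/20 split, again with the standard library only. The function name `train_val_test_split`, the seed, and the 100-record stand-in dataset are illustrative assumptions.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Shuffle rows and split them into (train, validation, test) lists."""
    shuffled = list(rows)                    # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_val - n_test  # whatever remains is training data
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = list(range(100))                      # hypothetical stand-in for 100 records
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))       # 60 20 20
```

The three parts are disjoint by construction, so the test records are never seen during training or model comparison.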

| Model | Type | Validation Accuracy | Test Accuracy |
|--------|------|----------------------|----------------|
| g₁ | Logistic Regression | 66% | — |
| g₂ | Decision Tree | 60% | — |
| g₃ | Random Forest | 67% | — |
| g₄ | Neural Network | 80% | **79%** |

Since **g₄ performs similarly on both validation and test sets** (80% vs. 79%),  
its validation score was probably not just luck, and we can conclude that the model **generalizes well**.

---

## Summary of the Model Selection Process

This entire workflow is known as the **Model Selection Process**,  
one of the most critical aspects of Machine Learning.

### The 7 Steps

1. Split dataset into **train, validation, and test sets** (60%-20%-20%)  
2. Train the model  
3. Apply model to validation dataset  
4. Repeat steps 2 & 3 for multiple models  
5. Select the best-performing model  
6. Apply that model to the test dataset  
7. Compare validation vs. test accuracy to ensure consistency  
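The steps above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the two toy classifiers (`MajorityClassifier`, `ThresholdClassifier`), the one-feature dataset, and the seed. Real candidates would be models such as those in the tables above.

```python
import random

class MajorityClassifier:
    """Always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = 1 if sum(y) * 2 >= len(y) else 0
        return self
    def predict(self, X):
        return [self.label for _ in X]

class ThresholdClassifier:
    """Predicts 1 when x exceeds the midpoint between the two class means."""
    def fit(self, X, y):
        ones = [x for x, t in zip(X, y) if t == 1]
        zeros = [x for x, t in zip(X, y) if t == 0]
        self.cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        return self
    def predict(self, X):
        return [1 if x > self.cut else 0 for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data: one feature in [0, 1]; label is 1 when a noisy copy exceeds 0.5.
rng = random.Random(7)
X = [rng.random() for _ in range(500)]
y = [1 if x + rng.gauss(0, 0.1) > 0.5 else 0 for x in X]

# Step 1: 60/20/20 split (the generated data is already in random order).
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:400], y[300:400]
X_test, y_test = X[400:], y[400:]

# Steps 2-4: train each candidate and score it on the validation set.
candidates = {"majority": MajorityClassifier(), "threshold": ThresholdClassifier()}
val_acc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_acc[name] = accuracy(y_val, model.predict(X_val))

# Step 5: select the best model by validation accuracy.
best = max(val_acc, key=val_acc.get)

# Steps 6-7: score the winner once on the test set and compare the two numbers.
test_acc = accuracy(y_test, candidates[best].predict(X_test))
print(f"best={best} validation={val_acc[best]:.2f} test={test_acc:.2f}")
```

The test set is touched exactly once, after the choice has been made; using it earlier would turn it into a second validation set.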

---

## Alternative Approach

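Sometimes, we want to **avoid wasting the validation dataset**: once the best model has been chosen, the validation data is no longer needed for comparison and can be used for training instead.

In that case, the process becomes: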

1. Split the original dataset (60%-20%-20%)  
2. Train models on the **training dataset**  
3. Evaluate models on the **validation dataset**  
4. Select the best-performing model  
5. **Combine training and validation datasets** to create a **larger training set**  
6. Retrain the selected model on this combined data  
7. Apply the retrained model to the **test dataset**

> By using more data for training, the model may capture richer patterns  
> and improve its ability to **generalize** to unseen data.

However, this approach doesn't guarantee better performance;  
the outcome depends on the **dataset size, variability, and initial model performance**.  
Experimentation and evaluation are key to choosing the best strategy.
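Steps 5 to 7 of this alternative approach could look like the sketch below. The threshold "model" (`fit_threshold`), the synthetic one-feature data, and the seed are all hypothetical stand-ins for whichever model was selected on the validation set.

```python
import random

def fit_threshold(X, y):
    """Toy 'training': midpoint between the mean feature value of each class."""
    ones = [x for x, t in zip(X, y) if t == 1]
    zeros = [x for x, t in zip(X, y) if t == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def accuracy(cut, X, y):
    """Share of samples where (x > cut) matches the label."""
    return sum((1 if x > cut else 0) == t for x, t in zip(X, y)) / len(y)

# Synthetic data with the same 60/20/20 split as before.
rng = random.Random(3)
X = [rng.random() for _ in range(500)]
y = [1 if x + rng.gauss(0, 0.1) > 0.5 else 0 for x in X]
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:400], y[300:400]
X_test, y_test = X[400:], y[400:]

# Step 5: combine training and validation data into a larger training set.
X_full = X_train + X_val
y_full = y_train + y_val

# Step 6: retrain the selected model on the combined data.
cut_full = fit_threshold(X_full, y_full)

# Step 7: apply the retrained model to the untouched test set.
test_acc = accuracy(cut_full, X_test, y_test)
print(f"trained on {len(X_full)} rows, test accuracy = {test_acc:.2f}")
```

Note that the retrained model no longer has a validation score of its own, so the test accuracy is the only unbiased number left to judge it by.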

---

## Key Takeaway

The **Model Selection Process** is fundamental to ML success.  
It ensures you choose a model that performs **well not only on training data**  
but also on **new, unseen data**.

Combining **careful validation** and **robust testing**  
helps detect overfitting and gives a more trustworthy estimate of real-world performance.

---

## References & Resources

- [YouTube Video: ML Zoomcamp](https://www.youtube.com/watch?v=OH_R0Sl9neM&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=7)  
- [Reference Notes](https://knowmledge.com/2023/09/13/ml-zoomcamp-2023-introduction-to-machine-learning-part-5/)  
- [Slides: ML Zoomcamp – Model Selection](https://www.slideshare.net/slideshow/ml-zoomcamp-15-model-selection-process/250116524)
