In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import ipykernel
import sklearn

print("Numpy Version: ",np.__version__)
print("Pandas Version: ",pd.__version__)
print("Matplotlib Version: ",matplotlib.__version__)
print("Sklearn Version: ",sklearn.__version__)
print("ipykernel Version: ",ipykernel.__version__)

Numpy Version:  2.0.2
Pandas Version:  2.3.3
Matplotlib Version:  3.9.4
Sklearn Version:  1.6.1
ipykernel Version:  6.31.0


# 📘 Data Splitting in Machine Learning

Before training a machine learning model, we must **split the dataset** properly. Without held-out data, we risk:
- ❌ Undetected overfitting (the model memorizes the training data)
- ❌ Poor generalization to new data
- ❌ Misleading performance estimates

Data splitting divides our dataset into:

1. **Training Set**
   - Used to train the model.
   - Typically 70–80% of the dataset.

2. **Test Set**
   - Used only for final evaluation.
   - Typically 20–30% of the dataset.

3. **Validation Set** (optional but useful)
   - Used for hyperparameter tuning.
   - Helps prevent overfitting.

In practice:
- Small datasets → use **Cross-Validation**
- Medium/large datasets → use **Train/Validation/Test split**


# 📘 Train/Test Split

`train_test_split` from `sklearn.model_selection` is the most common method.

### Why Train/Test Split?
- Training → learn model parameters
- Testing → estimate performance on data the model has never seen

### Typical Ratios:
- 80% Training / 20% Testing
- 70% Training / 30% Testing

### Example Dataset:
We will use the built-in `diabetes` dataset from sklearn.


In [2]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load dataset
data = load_diabetes()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Set Shape:", X_train.shape)
print("Test Set Shape:", X_test.shape)


Training Set Shape: (353, 10)
Test Set Shape: (89, 10)
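
The shapes above follow from how a fractional `test_size` is turned into a sample count. As a minimal sketch, assuming scikit-learn rounds the test count up and gives the remainder to training when only `test_size` is passed (the behaviour observed in the versions listed at the top), the arithmetic can be checked directly. The names `expected_test` and `expected_train` are illustrative only.

```python
import math

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
n_samples = X.shape[0]  # 442 rows in the diabetes dataset

test_size = 0.2
# Assumption: the test count is rounded up, the rest goes to training
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42
)

print("Expected (train, test):", (expected_train, expected_test))
print("Actual   (train, test):", (X_train.shape[0], X_test.shape[0]))
```

Because of the rounding, the realized ratio is close to, but not exactly, 80/20.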


# 📘 Validation Set

A **validation set** helps tune hyperparameters such as:
- Learning rate
- Model complexity
- Regularization strength

### Why not tune on the Test Set?
Because:
- Tuning decisions leak information from the test set into the model
- The final performance estimate becomes overly optimistic and misleading

### Solution:
Split into:
- Training (60–70%)
- Validation (15–20%)
- Test (15–20%)


In [3]:
from sklearn.model_selection import train_test_split

# First split → Train + Temp (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split → Validation + Test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print("Train:", X_train.shape)
print("Validation:", X_val.shape)
print("Test:", X_test.shape)


Train: (309, 10)
Validation: (66, 10)
Test: (67, 10)
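
To show how the three sets are used together, here is a minimal sketch that picks a regularization strength on the validation set and touches the test set only at the end. The `Ridge` model and the `candidate_alphas` grid are assumptions chosen for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Same 70/15/15 split as in the cell above
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Hypothetical grid of regularization strengths (illustrative values)
candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Fit on the training set, score (R^2) on the validation set
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)

best_alpha = max(val_scores, key=val_scores.get)
print("Best alpha on validation set:", best_alpha)

# The test set is used once, after the choice has been made
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print("Test R^2:", test_r2)
```

A common variation is to refit the chosen model on training plus validation data before the final test evaluation, since the validation rows are no longer needed for tuning.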


# 📘 Cross-Validation (K-Fold)

Cross-validation helps when data is limited.  
Instead of relying on a single train/test split, it **splits the data into K folds** and rotates which fold is held out:

Example (K = 5):

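| Iteration | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|-----------|--------|--------|--------|--------|--------|
| 1         | Train  | Train  | Train  | Train  | Test   |
| 2         | Train  | Train  | Train  | Test   | Train  |
| 3         | Train  | Train  | Test   | Train  | Train  |
| 4         | Train  | Test   | Train  | Train  | Train  |
| 5         | Test   | Train  | Train  | Train  | Train  |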

Each fold is used once as testing data.

### Benefits:
- More accurate performance estimation
- Every sample gets used for training & testing
- Helps with small datasets

We use:  
✔ `KFold`  
✔ `cross_val_score`  
from sklearn.


In [4]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

model = LinearRegression()

# Define KFold with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Compute cross-validation scores
scores = cross_val_score(model, X, y, cv=kf, scoring="r2")

print("Each Fold Score:", scores)
print("Mean CV Score:", np.mean(scores))


Each Fold Score: [0.45260276 0.57320015 0.39144785 0.58428888 0.39081186]
Mean CV Score: 0.47847030225778475
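
To make the fold rotation concrete, the following sketch writes the loop out by hand and compares it with `cross_val_score`. It assumes the same `LinearRegression` model and `KFold` settings as the cell above; the variable names `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Fit a fresh model on the K-1 training folds
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])

    # Score it on the single held-out fold
    score = r2_score(y[test_idx], model.predict(X[test_idx]))
    manual_scores.append(score)
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, R^2={score:.4f}")

# The helper function should give the same per-fold scores
library_scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print("Matches cross_val_score:", np.allclose(manual_scores, library_scores))
```

Writing the loop explicitly is mainly useful when each fold needs extra work, such as logging predictions or fitting preprocessing on the training folds only.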


# 📘 When to Use Which Data Splitting Method?

### ✔ When the dataset is large:
Use:
- Train (70%)
- Validation (15%)
- Test (15%)

### ✔ When the dataset is small:
Use:
- K-Fold Cross-Validation (K=5 or K=10)
- Avoid wasting data on a separate validation set

### ✔ When tuning hyperparameters:
Use:
- Validation Set OR
- Cross-Validation (GridSearchCV / RandomizedSearchCV)

### ✔ Final Model Evaluation:
Use the test set only once, after all tuning and model selection are finished.
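
The guidance in this section can be combined into one workflow: hold out a test set, tune with cross-validation on the remaining data, then evaluate once. The sketch below assumes a `Ridge` model and an illustrative `param_grid`; neither comes from the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out the test set before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space (illustrative values)
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# Cross-validated search that only ever sees the training portion
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="r2",
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best mean CV R^2:", search.best_score_)

# Single final evaluation on the untouched test set
test_r2 = search.score(X_test, y_test)
print("Test R^2:", test_r2)
```

By default `GridSearchCV` refits the best configuration on all of `X_train`, so `search.score` evaluates that refitted model.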


# 🎯 Summary

### ✔ Train/Test Split
Simple, quick, used on medium/large datasets.

### ✔ Validation Set
Used for hyperparameter tuning and model selection.

### ✔ Cross-Validation (K-Fold)
Best for small datasets. Averaging over folds reduces the variance of the performance estimate, which makes it more reliable than a single split.

All three techniques help you detect **overfitting** and give a more honest estimate of how well your model generalizes.
