# Train-Test Split and Cross-Validation Basics

## Overview

**Why Split Data?**
- Training on entire dataset → Can't estimate real-world performance
- Model might just memorize training data (overfitting)
- Need **unseen data** to evaluate generalization

**The Golden Rule**: Never test on training data!
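
The memorization risk can be made concrete with a small sketch. It assumes a synthetic dataset from `make_classification` in which roughly 20% of the labels are assigned at random (`flip_y=0.2`); the dataset and the variable names are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: about 20% of labels are assigned at random,
# so no model can legitimately be perfect on new samples
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_fit, X_new, y_fit, y_new = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unpruned decision tree can memorize every training sample
tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
train_acc = tree.score(X_fit, y_fit)
test_acc = tree.score(X_new, y_new)

print(f"Accuracy on the training data: {train_acc:.3f}")
print(f"Accuracy on unseen data:       {test_acc:.3f}")
```

The training accuracy alone would suggest a flawless model; only the held-out score reveals how much of that is memorized noise.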

## Key Concepts

### 1. Train-Test Split
- Simple split: 70-80% train, 20-30% test
- Fast and straightforward
- ⚠️ High variance (depends on random split)

### 2. Cross-Validation (CV)
- Split data into K folds
- Train on K-1 folds, test on remaining fold
- Repeat K times, average results
- ✓ More reliable performance estimate
- ✓ Every sample is used for training and, exactly once, for testing

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_wine, load_diabetes, load_breast_cancer
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    cross_validate,
    KFold
)
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully")

## 1. Train-Test Split Basics

### Simple Split

In [None]:
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print("Dataset Info:")
print(f"Total samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {iris.target_names}")
print(f"Class distribution: {np.bincount(y)}")

# Basic split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print(f"\nAfter Split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X):.1%})")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X):.1%})")
print(f"\nTrain class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")

### Stratified Split (For Imbalanced Data)

**Problem**: A purely random split can give the train and test sets different class proportions, especially when one class is rare

**Solution**: `stratify` parameter maintains class distribution

In [None]:
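# Create an imbalanced two-class dataset: keep all 50 samples of class 0 but only 15 of class 1
keep = np.r_[np.where(y == 0)[0], np.where(y == 1)[0][:15]]
X_imbalanced = X[keep]
y_imbalanced = y[keep]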

print("Imbalanced Dataset:")
print(f"Class distribution: {np.bincount(y_imbalanced)}")
print(f"Class ratio: {np.bincount(y_imbalanced)[0]/len(y_imbalanced):.1%} vs {np.bincount(y_imbalanced)[1]/len(y_imbalanced):.1%}")

# Split WITHOUT stratification
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.3, random_state=42
)

print(f"\nWithout Stratification:")
print(f"Train: {np.bincount(y_tr1)} → {np.bincount(y_tr1)[0]/len(y_tr1):.1%} vs {np.bincount(y_tr1)[1]/len(y_tr1):.1%}")
print(f"Test:  {np.bincount(y_te1)} → {np.bincount(y_te1)[0]/len(y_te1):.1%} vs {np.bincount(y_te1)[1]/len(y_te1):.1%}")

# Split WITH stratification
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_imbalanced, y_imbalanced, 
    test_size=0.3, 
    random_state=42,
    stratify=y_imbalanced  # Maintain class distribution
)

print(f"\nWith Stratification:")
print(f"Train: {np.bincount(y_tr2)} → {np.bincount(y_tr2)[0]/len(y_tr2):.1%} vs {np.bincount(y_tr2)[1]/len(y_tr2):.1%}")
print(f"Test:  {np.bincount(y_te2)} → {np.bincount(y_te2)[0]/len(y_te2):.1%} vs {np.bincount(y_te2)[1]/len(y_te2):.1%}")

print("\n✓ Stratification preserves class distribution!")

### Training and Evaluating a Model

In [None]:
# Train model on training set
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Evaluate on both sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Model Performance:")
print(f"Training accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")
print(f"\nDifference: {abs(train_score - test_score):.4f}")

if train_score - test_score > 0.1:
    print("⚠️ Warning: Possible overfitting (train >> test)")
elif test_score >= train_score:
    print("✓ Good generalization (test ≥ train)")
else:
    print("✓ Model generalizes well (small train-test gap)")

### Problem with Single Train-Test Split

Performance varies with different random splits:

In [None]:
# Test with different random states
test_scores = []

for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    score = model.score(X_te, y_te)
    test_scores.append(score)

print("Test Accuracy Across Different Splits:")
print(f"Scores: {[f'{s:.3f}' for s in test_scores]}")
print(f"\nMean: {np.mean(test_scores):.4f}")
print(f"Std:  {np.std(test_scores):.4f}")
print(f"Range: [{np.min(test_scores):.4f}, {np.max(test_scores):.4f}]")

print("\n⚠️ The score changes with the split! Need a more reliable evaluation method...")

## 2. Cross-Validation: More Reliable Evaluation

### How K-Fold CV Works

```
Fold 1: [Test] [Train] [Train] [Train] [Train]
Fold 2: [Train] [Test] [Train] [Train] [Train]
Fold 3: [Train] [Train] [Test] [Train] [Train]
Fold 4: [Train] [Train] [Train] [Test] [Train]
Fold 5: [Train] [Train] [Train] [Train] [Test]

Final Score = Average of all fold scores
```
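
The diagram can be reproduced by hand before relying on the helper functions. This is a minimal sketch, assuming the Iris data and a `LogisticRegression` model; the variable names are illustrative. It uses `StratifiedKFold` because scikit-learn applies stratified, unshuffled folds when a classifier is given an integer `cv`, so the test folds are not the contiguous blocks drawn above, but the train/test rotation is the same.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)
base_model = LogisticRegression(max_iter=1000)

# Stratified folds without shuffling: what cv=5 means for a classifier
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for fold, (train_idx, test_idx) in enumerate(folds.split(X_cv, y_cv), start=1):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    score = fold_model.score(X_cv[test_idx], y_cv[test_idx])
    manual_scores.append(score)
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, accuracy={score:.3f}")

print(f"Final score (mean of folds): {np.mean(manual_scores):.4f}")

# The helper should report the same per-fold numbers
helper_scores = cross_val_score(base_model, X_cv, y_cv, cv=5)
print(f"cross_val_score mean:        {helper_scores.mean():.4f}")
```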

### Using cross_val_score

In [None]:
# Perform 5-fold cross-validation
model = LogisticRegression(max_iter=1000, random_state=42)

cv_scores = cross_val_score(
    model, X, y, 
    cv=5,              # 5 folds
    scoring='accuracy' # Metric to compute
)

print("5-Fold Cross-Validation Results:")
print(f"Fold scores: {cv_scores}")
print(f"\nMean accuracy: {cv_scores.mean():.4f}")
print(f"Std deviation: {cv_scores.std():.4f}")
print(f"Mean ± 2 std (rough spread across folds, not a formal confidence interval): [{cv_scores.mean() - 2*cv_scores.std():.4f}, {cv_scores.mean() + 2*cv_scores.std():.4f}]")

# Compare with single split
print(f"\nComparison:")
print(f"Std across 10 random single splits: {np.std(test_scores):.4f}")
print(f"Std across 5 CV folds: {cv_scores.std():.4f}")
print(f"\n✓ The CV mean averages 5 held-out folds, so it depends less on any single split!")
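
One way to check how stable the CV mean itself is: repeat the whole K-fold procedure with different shuffles. The sketch below assumes `RepeatedStratifiedKFold` with 10 repetitions, a choice made here for illustration rather than something the steps above prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X_rep, y_rep = load_iris(return_X_y=True)

# 5 folds, reshuffled and repeated 10 times -> 50 fold scores in total
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
rep_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_rep, y_rep, cv=rkf
)

# Scores arrive repetition by repetition, so each row is one full 5-fold run
per_repeat_means = rep_scores.reshape(10, 5).mean(axis=1)

print(f"Overall mean accuracy:          {rep_scores.mean():.4f}")
print(f"Std of individual fold scores:  {rep_scores.std():.4f}")
print(f"Std of the 10 per-run CV means: {per_repeat_means.std():.4f}")
```

Comparing the last two lines shows how much of the fold-to-fold spread disappears once scores are averaged over a full K-fold run.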

### Effect of Different K Values

In [None]:
# Test different K values
k_values = [3, 5, 10, 15]
model = LogisticRegression(max_iter=1000, random_state=42)

print("Cross-Validation with Different K:")
print("=" * 50)

for k in k_values:
    scores = cross_val_score(model, X, y, cv=k)
    print(f"\nK={k:2d}: Mean={scores.mean():.4f}, Std={scores.std():.4f}")
    print(f"      Training size per fold: {len(X) * (k-1) / k:.0f} samples")

print("\n💡 Trade-off:")
print("  - Higher K → More training data per fold, but more computation")
print("  - Lower K → Faster, but each model trains on less data (more pessimistic estimates)")
print("  - K=5 or K=10 are common choices")

## 3. cross_validate: More Information

`cross_validate` returns more details than `cross_val_score`:

In [None]:
from sklearn.model_selection import cross_validate

model = LogisticRegression(max_iter=1000, random_state=42)

# Get detailed CV results
cv_results = cross_validate(
    model, X, y,
    cv=5,
    scoring='accuracy',
    return_train_score=True,  # Also return training scores
    return_estimator=False     # Don't return fitted models (saves memory)
)

print("Detailed Cross-Validation Results:")
print("=" * 50)
print(f"\nTest scores:  {cv_results['test_score']}")
print(f"Train scores: {cv_results['train_score']}")
print(f"\nFit time (s): {cv_results['fit_time']}")
print(f"Score time (s): {cv_results['score_time']}")

print(f"\nSummary:")
print(f"  Test accuracy:  {cv_results['test_score'].mean():.4f} ± {cv_results['test_score'].std():.4f}")
print(f"  Train accuracy: {cv_results['train_score'].mean():.4f} ± {cv_results['train_score'].std():.4f}")
print(f"  Average fit time: {cv_results['fit_time'].mean():.4f}s")

# Check for overfitting
train_mean = cv_results['train_score'].mean()
test_mean = cv_results['test_score'].mean()
if train_mean - test_mean > 0.1:
    print("\n⚠️ Warning: Possible overfitting")
else:
    print("\n✓ Model generalizes well")
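
When the fitted models themselves are of interest, `return_estimator=True` keeps one fitted estimator per fold at the cost of extra memory. A minimal sketch, assuming the same Iris setup (variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_est, y_est = load_iris(return_X_y=True)

results = cross_validate(
    LogisticRegression(max_iter=1000), X_est, y_est,
    cv=5,
    return_train_score=True,
    return_estimator=True,  # keep the model fitted on each fold
)

print("Available keys:", sorted(results.keys()))
for fold_model, test_acc in zip(results["estimator"], results["test_score"]):
    # coef_ has shape (n_classes, n_features) for this multiclass problem
    print(f"test accuracy={test_acc:.3f}, coef_ shape={fold_model.coef_.shape}")
```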

### Multiple Metrics at Once

In [None]:
# Evaluate multiple metrics
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

cv_results = cross_validate(
    model, X, y,
    cv=5,
    scoring=scoring
)

print("Multiple Metrics Cross-Validation:")
print("=" * 50)

for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f"\n{metric.upper()}:")
    print(f"  Mean: {scores.mean():.4f}")
    print(f"  Std:  {scores.std():.4f}")
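
`scoring` also accepts a dictionary, which is useful when a metric needs non-default arguments or a shorter result key. The sketch below is one possible setup; the key names `acc` and `f1_weighted` are arbitrary labels chosen here, and they determine the `test_<key>` entries in the result.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate

X_ms, y_ms = load_iris(return_X_y=True)

custom_scoring = {
    "acc": "accuracy",                                          # built-in scorer by name
    "f1_weighted": make_scorer(f1_score, average="weighted"),  # metric with an argument
}

ms_results = cross_validate(
    LogisticRegression(max_iter=1000), X_ms, y_ms,
    cv=5, scoring=custom_scoring,
)

for key in custom_scoring:
    fold_scores = ms_results[f"test_{key}"]
    print(f"{key}: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```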

## 4. Real-World Example: Wine Classification

In [None]:
# Load Wine dataset
wine = load_wine()
X_wine, y_wine = wine.data, wine.target

print("Wine Dataset:")
print(f"Samples: {len(X_wine)}")
print(f"Features: {X_wine.shape[1]}")
print(f"Classes: {wine.target_names}")

# Compare models with train-test split
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42, stratify=y_wine
)

# Scale features
scaler = StandardScaler()
X_train_w_scaled = scaler.fit_transform(X_train_w)
X_test_w_scaled = scaler.transform(X_test_w)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

print("\n" + "=" * 60)
print("METHOD 1: Single Train-Test Split")
print("=" * 60)

split_results = {}
for name, model in models.items():
    model.fit(X_train_w_scaled, y_train_w)
    test_score = model.score(X_test_w_scaled, y_test_w)
    split_results[name] = test_score
    print(f"{name:25s}: {test_score:.4f}")

In [None]:
# Compare with cross-validation
from sklearn.pipeline import Pipeline

print("\n" + "=" * 60)
print("METHOD 2: 5-Fold Cross-Validation")
print("=" * 60)

cv_results = {}
for name, model in models.items():
    # Create pipeline with scaling
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])
    
    scores = cross_val_score(pipeline, X_wine, y_wine, cv=5)
    cv_results[name] = scores
    print(f"{name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)
for name in models.keys():
    print(f"{name}:")
    print(f"  Single split: {split_results[name]:.4f}")
    print(f"  CV mean:      {cv_results[name].mean():.4f} ± {cv_results[name].std():.4f}")
    print()

## 5. Cross-Validation for Regression

In [None]:
# Load Diabetes dataset (regression)
diabetes = load_diabetes()
X_diab, y_diab = diabetes.data, diabetes.target

print("Diabetes Dataset (Regression):")
print(f"Samples: {len(X_diab)}")
print(f"Features: {X_diab.shape[1]}")
print(f"Target range: [{y_diab.min():.0f}, {y_diab.max():.0f}]")

# Compare regression models
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

reg_models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

print("\n" + "=" * 70)
print("REGRESSION CROSS-VALIDATION (R² and RMSE)")
print("=" * 70)

for name, model in reg_models.items():
    # R² is scored directly; RMSE is derived from negated MSE (scikit-learn scorers follow "higher is better")
    r2_scores = cross_val_score(model, X_diab, y_diab, cv=5, scoring='r2')
    neg_mse_scores = cross_val_score(model, X_diab, y_diab, cv=5, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-neg_mse_scores)
    
    print(f"\n{name}:")
    print(f"  R² Score:  {r2_scores.mean():.4f} ± {r2_scores.std():.4f}")
    print(f"  RMSE:      {rmse_scores.mean():.2f} ± {rmse_scores.std():.2f}")
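
Both regression metrics can also be collected in a single pass, so each model is fitted once per fold instead of twice. This sketch assumes `Ridge` as the example model and shuffled folds (an integer `cv` gives unshuffled `KFold` for regressors); `neg_root_mean_squared_error` requires a reasonably recent scikit-learn release.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

X_reg, y_reg = load_diabetes(return_X_y=True)

# Shuffle before splitting so fold membership does not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=42)

reg_results = cross_validate(
    Ridge(alpha=1.0), X_reg, y_reg, cv=kf,
    scoring=["r2", "neg_root_mean_squared_error"],
)

r2 = reg_results["test_r2"]
rmse = -reg_results["test_neg_root_mean_squared_error"]  # flip the sign back

print(f"R²:   {r2.mean():.4f} ± {r2.std():.4f}")
print(f"RMSE: {rmse.mean():.2f} ± {rmse.std():.2f}")
```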

## 6. Cross-Validation Best Practices

In [None]:
print("Cross-Validation Best Practices:")
print("=" * 70)

practices = [
    ("1. Always use stratified splits for classification",
     "Use: cv=StratifiedKFold() or stratify=y"),
    
    ("2. Include preprocessing in pipeline",
     "Prevents data leakage from scaling/imputation"),
    
    ("3. Set random_state for reproducibility",
     "KFold(n_splits=5, shuffle=True, random_state=42)"),
    
    ("4. Choose K based on dataset size",
     "Small: K=10 or LOOCV, Medium: K=5-10, Large: K=3-5 (or a single split)"),
    
    ("5. Report mean AND std deviation",
     "Shows both performance and stability"),
    
    ("6. Use cross_validate for detailed info",
     "Get train scores, timing, multiple metrics"),
    
    ("7. For time series, use TimeSeriesSplit",
     "Respects temporal ordering"),
    
    ("8. For small datasets, consider LOOCV",
     "Leave-One-Out: K=n (expensive but thorough)")
]

for practice, tip in practices:
    print(f"\n{practice}")
    print(f"   → {tip}")
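
The splitters named in practices 1, 7 and 8 can be compared on a toy example. The 12-sample array below is made up purely to keep the printed indices readable.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, TimeSeriesSplit

# Toy data: 12 samples, 8 of class 0 and 4 of class 1 (illustrative only)
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.array([0] * 8 + [1] * 4)

# Stratified folds keep the 2:1 class ratio in every test fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X_toy, y_toy):
    print("StratifiedKFold test class counts:", np.bincount(y_toy[test_idx]))

# Time-series folds always train on earlier samples and test on later ones
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X_toy):
    print("TimeSeriesSplit train:", train_idx, "test:", test_idx)

# Leave-one-out produces one fold per sample
loo = LeaveOneOut()
print("LOOCV folds:", loo.get_n_splits(X_toy))
```

Any of these objects can be passed as the `cv` argument of `cross_val_score` or `cross_validate` in place of an integer.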

## 7. When to Use What?

In [None]:
# Create decision guide
guide = pd.DataFrame({
    'Scenario': [
        'Large dataset (>100k samples)',
        'Small dataset (<1000 samples)',
        'Very small dataset (<100 samples)',
        'Imbalanced classes',
        'Time series data',
        'Quick experimentation',
        'Final model evaluation',
        'Hyperparameter tuning'
    ],
    'Recommended Method': [
        'Single train-test split (80-20)',
        '5-fold or 10-fold CV',
        'Leave-One-Out CV (LOOCV)',
        'Stratified K-Fold CV',
        'TimeSeriesSplit CV',
        'Single split or 3-fold CV',
        '5-fold or 10-fold CV',
        'Nested CV or GridSearchCV'
    ],
    'K Value': [
        'N/A',
        '5-10',
        'n (LOOCV)',
        '5-10',
        '5',
        'N/A or 3',
        '5-10',
        '3-5 (outer)'
    ]
})

print("\nDecision Guide: Which Method to Use?")
print("=" * 80)
print(guide.to_string(index=False))
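
The hyperparameter-tuning row deserves a sketch, because tuning and evaluating on the same folds gives optimistic scores. Nested CV wraps a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). The grid of `C` values and the fold counts below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_n, y_n = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner loop: choose C using only the training part of each outer fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe, param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv
)

# Outer loop: score the whole "tune, then fit" procedure on held-out folds
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X_n, y_n, cv=outer_cv)

print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```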

## Key Takeaways

### Train-Test Split
- ✓ Fast and simple
- ✓ Good for large datasets
- ✗ High variance (depends on split)
- ✗ Wastes some data (test set not used for training)

### Cross-Validation
- ✓ More reliable performance estimate
- ✓ Every sample is used for training and, exactly once, for testing
- ✓ Lower variance
- ✗ K times slower than single split
- ✗ Can be too slow for very large datasets or expensive models

### Critical Points

1. **Never test on training data!**
2. **Use stratification for imbalanced data**
3. **Include preprocessing in pipeline** (prevents data leakage)
4. **Report mean ± std** (not just mean)
5. **Choose K wisely**: 5-10 for most cases
6. **Set random_state** for reproducibility
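
Point 3 is easiest to appreciate with an exaggerated case. With scaling the leak is usually small, so this sketch swaps in feature selection, where the effect is much more visible. The data is pure random noise generated here for illustration, so an honest accuracy estimate should sit near 50%.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: labels are unrelated to the 2000 features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: features are chosen using ALL labels, including those of later test folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
)

# Leak-free: the selector is refit inside each training fold
leak_free_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(leak_free_pipe, X_noise, y_noise, cv=5)

print(f"Selection before CV (leaky):   {leaky.mean():.3f}")
print(f"Selection inside the pipeline: {honest.mean():.3f}")
```

The same reasoning applies to scalers and imputers: anything fitted on the data belongs inside the pipeline so that it only ever sees the training folds.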

### Common Workflows

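```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000, random_state=42)

# Quick Experiment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

# Proper Evaluation
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"{scores.mean():.3f} ± {scores.std():.3f}")

# Detailed Analysis (macro-averaged metrics, because the target is multiclass)
cv_results = cross_validate(
    pipeline, X, y, cv=5,
    scoring=['accuracy', 'precision_macro', 'recall_macro'],
    return_train_score=True
)
```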