# OmicSelector2: Hyperparameter Tuning & Cross-Validation

This notebook demonstrates advanced model training with hyperparameter optimization and cross-validation.

**Learning Objectives:**
- Understand cross-validation strategies
- Optimize hyperparameters with Optuna
- Use training callbacks (early stopping, model checkpointing)
- Evaluate model stability across folds

**Prerequisites:**
```bash
pip install omicselector2 optuna plotly  # plotly is needed by optuna.visualization
```

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import OmicSelector2 components
from omicselector2.models.classical import RandomForestClassifier as RFClassifier
from omicselector2.models.classical import XGBoostClassifier
from omicselector2.training.trainer import Trainer
from omicselector2.training.callbacks import EarlyStopping, ModelCheckpoint, ProgressLogger
from omicselector2.training.cross_validation import StratifiedKFoldSplitter, CrossValidator
from omicselector2.training.evaluator import ClassificationEvaluator
from omicselector2.training.hyperparameter import HyperparameterOptimizer, PREDEFINED_SEARCH_SPACES
from omicselector2.features.classical.random_forest import RandomForestSelector

# Set random seed
np.random.seed(42)

# Configure visualization
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Generate Dataset

In [None]:
# Generate synthetic biomarker data
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    n_repeated=0,
    n_classes=2,
    flip_y=0.01,
    random_state=42
)

# Convert to DataFrame
X = pd.DataFrame(X, columns=[f"gene_{i}" for i in range(X.shape[1])])
y = pd.Series(y, name="response")

print(f"Dataset: {X.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")

## 2. Feature Selection

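First, select the most important features to reduce dimensionality. Note that the selector here is fit on the full dataset, so the cross-validation scores reported later in this notebook are optimistic: the held-out folds have already influenced which features were chosen. For an unbiased estimate, refit the selector inside each training fold.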

In [None]:
# Select top 30 features using Random Forest
selector = RandomForestSelector(
    n_estimators=100,
    n_features_to_select=30,
    random_state=42
)

X_selected = selector.fit_transform(X, y)

print(f"Selected features: {X_selected.shape[1]}")
print(f"Top 10 features: {selector.selected_features_[:10]}")
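
One way to keep feature selection inside each training fold is to wrap the selector and the classifier in a single pipeline and cross-validate the pipeline as a whole. The sketch below uses plain scikit-learn (`Pipeline`, `SelectFromModel`) rather than `RandomForestSelector`, because it is an assumption that the OmicSelector2 selector can be dropped into a scikit-learn pipeline; the `_demo` data and the choice of 15 features are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Small synthetic dataset, separate from the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

# threshold=-inf with max_features=15 keeps exactly the 15 most important features
selector_step = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=15,
    threshold=-np.inf,
)

leak_free = Pipeline([
    ("select", selector_step),  # refit on the training part of every fold
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(leak_free, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}")
print(f"Mean AUC: {scores.mean():.3f}")
```

Because the selector is a pipeline step, `cross_val_score` fits it only on each fold's training rows, so the validation rows never influence which features are kept.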

## 3. Cross-Validation Strategy

Use stratified k-fold cross-validation to ensure balanced class distribution in each fold.

In [None]:
# Create cross-validation splitter
cv_splitter = StratifiedKFoldSplitter(n_splits=5, shuffle=True, random_state=42)

# Summarize the class balance of each training fold
fold_info = []
for fold_idx, (train_idx, val_idx) in enumerate(cv_splitter.split(X_selected, y)):
    # Count classes once; reindex so a missing class reports 0 instead of raising KeyError
    train_counts = y.iloc[train_idx].value_counts().reindex([0, 1], fill_value=0)
    fold_info.append({
        'Fold': fold_idx + 1,
        'Train Size': len(train_idx),
        'Val Size': len(val_idx),
        'Train Class 0': train_counts[0],
        'Train Class 1': train_counts[1]
    })

fold_df = pd.DataFrame(fold_info)
print("\nCross-validation folds:")
print(fold_df)
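
To see why stratification matters, it can help to compare it with an unstratified split on imbalanced labels. The sketch below uses scikit-learn's `KFold` and `StratifiedKFold` directly (not `StratifiedKFoldSplitter`, whose internals are not shown in this notebook) on hypothetical toy labels that are sorted by class, which is the worst case for an unshuffled split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 80 negatives followed by 20 positives (20% positive overall)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant to the split itself


def positive_fraction_per_fold(splitter):
    """Return the share of positives in each validation fold."""
    return [float(y_toy[val_idx].mean()) for _, val_idx in splitter.split(X_toy, y_toy)]


plain_fracs = positive_fraction_per_fold(KFold(n_splits=5))
strat_fracs = positive_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:          ", plain_fracs)
print("StratifiedKFold:", strat_fracs)
```

With unshuffled `KFold`, four validation folds contain no positives at all and the fifth contains only positives, so AUC cannot even be computed on them. The stratified split keeps every validation fold at the overall 20% positive rate.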

## 4. Train Model with Callbacks

Use callbacks to monitor training and save the best model.

In [None]:
# Split data for demonstration
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X_selected, y, test_size=0.2, random_state=42, stratify=y
)

# Create model
model = RFClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# Create callbacks
callbacks = [
    EarlyStopping(
        monitor='val_auc',
        patience=5,
        min_delta=0.001,
        mode='max'
    ),
    ModelCheckpoint(
        filepath='best_model.pkl',
        monitor='val_auc',
        save_best_only=True,
        mode='max'
    ),
    ProgressLogger()
]

# Create trainer
trainer = Trainer(
    model=model,
    callbacks=callbacks,
    random_state=42
)

# Train model
print("Training with callbacks...")
history = trainer.fit(
    X_train,
    y_train,
    X_val=X_val,
    y_val=y_val,
    epochs=20
)

# Plot training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history['accuracy'], label='Train Accuracy')
plt.plot(history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(history['auc'], label='Train AUC')
plt.plot(history['val_auc'], label='Val AUC')
plt.xlabel('Epoch')
plt.ylabel('AUC')
plt.title('Model AUC')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
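
The `patience`, `min_delta`, and `mode='max'` settings used above follow a common early-stopping rule: stop once the monitored metric has failed to improve by at least `min_delta` for `patience` consecutive epochs. The function below is a generic sketch of that rule, not the OmicSelector2 implementation; the name `early_stopping_epoch` and the example scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=0.001):
    """Return the 0-based epoch at which training would stop, or None.

    Assumes a metric that should be maximized (mode='max').
    """
    best = float("-inf")
    stale_epochs = 0  # consecutive epochs without a sufficient improvement
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None  # patience was never exhausted


# Hypothetical validation AUC values that plateau after the third epoch
plateau = [0.80, 0.85, 0.86, 0.8605, 0.859, 0.86, 0.858, 0.86]
print(early_stopping_epoch(plateau, patience=5))   # stops at epoch 7
print(early_stopping_epoch(plateau, patience=2))   # stops at epoch 4
print(early_stopping_epoch([0.5, 0.6, 0.7]))       # None: still improving
```

In the plateau example, the gain from 0.86 to 0.8605 is smaller than `min_delta`, so it counts as no improvement and the patience counter keeps running.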

## 5. Hyperparameter Optimization with Optuna

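Use Optuna to search for the best hyperparameters. Optuna's default sampler is TPE (Tree-structured Parzen Estimator), a form of Bayesian optimization that uses the results of earlier trials to choose the next configuration to try.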

In [None]:
# View predefined search spaces
print("Predefined search spaces:")
for model_name, space in PREDEFINED_SEARCH_SPACES.items():
    print(f"\n{model_name}:")
    for param, values in space.items():
        print(f"  {param}: {values}")

In [None]:
# Create hyperparameter optimizer for Random Forest
optimizer = HyperparameterOptimizer(
    model_class=RFClassifier,
    search_space='RandomForest',  # Use predefined space
    metric='auc',
    direction='maximize',
    n_trials=20,  # Number of trials
    cv=3,  # 3-fold CV
    random_state=42,
    verbose=True
)

# Run optimization
print("Starting hyperparameter optimization...")
study = optimizer.optimize(X_selected, y, timeout=120)  # give up after 2 minutes, even if fewer than n_trials have finished

# Get best parameters
best_params = optimizer.get_best_params()
print(f"\nBest parameters: {best_params}")
print(f"Best CV AUC: {study.best_value:.4f}")

In [None]:
# Visualize optimization history
import optuna

fig = optuna.visualization.plot_optimization_history(study)
fig.show()

# Plot parameter importances
fig = optuna.visualization.plot_param_importances(study)
fig.show()

## 6. Train Best Model

Train the final model with optimized hyperparameters.

In [None]:
# Get best model (trained with best hyperparameters)
best_model = optimizer.get_best_model(X_selected, y)

print("Best model trained successfully!")
print(f"Model parameters: {best_params}")

## 7. Cross-Validation Evaluation

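Evaluate model stability across folds. Because `best_params` was tuned on the same data, treat these scores as a measure of fold-to-fold variability rather than an unbiased estimate of performance on new samples; nested cross-validation or a held-out test set gives a less optimistic estimate.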

In [None]:
# Create cross-validator
cv = CrossValidator(
    model=RFClassifier(**best_params),
    cv_strategy=cv_splitter,
    random_state=42
)

# Perform cross-validation
print("Performing 5-fold cross-validation...")
cv_results = cv.cross_validate(X_selected, y)

# Display results
print("\nCross-validation results:")
for fold_idx, metrics in enumerate(cv_results['fold_metrics']):
    print(f"\nFold {fold_idx + 1}:")
    print(f"  Accuracy: {metrics['accuracy']:.4f}")
    print(f"  AUC: {metrics['auc']:.4f}")
    print(f"  F1: {metrics['f1']:.4f}")

# Calculate mean and std
mean_metrics = cv_results['mean_metrics']
std_metrics = cv_results['std_metrics']

print("\nMean ± Std:")
print(f"  Accuracy: {mean_metrics['accuracy']:.4f} ± {std_metrics['accuracy']:.4f}")
print(f"  AUC: {mean_metrics['auc']:.4f} ± {std_metrics['auc']:.4f}")
print(f"  F1: {mean_metrics['f1']:.4f} ± {std_metrics['f1']:.4f}")

In [None]:
# Visualize cross-validation results
metrics_df = pd.DataFrame(cv_results['fold_metrics'])
metrics_df['Fold'] = [f"Fold {i+1}" for i in range(len(metrics_df))]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['accuracy', 'auc', 'f1']):
    axes[idx].bar(metrics_df['Fold'], metrics_df[metric])
    axes[idx].axhline(
        y=mean_metrics[metric],
        color='r',
        linestyle='--',
        label=f'Mean: {mean_metrics[metric]:.3f}'
    )
    axes[idx].set_ylabel(metric.upper())
    axes[idx].set_title(f'{metric.upper()} per Fold')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
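
Nested cross-validation is one way to score a tuned model on data that played no part in the tuning. The sketch below uses scikit-learn's `GridSearchCV` as the inner loop and `cross_val_score` as the outer loop. It swaps in a tiny grid search for the Optuna search to stay short, and the grid, fold counts, and `_demo` data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Small synthetic dataset, separate from the notebook's X_selected and y
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Inner loop: choose max_depth using only the outer-training rows
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each fold re-runs the whole search, then scores on unseen rows
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="roc_auc")

print(f"AUC per outer fold: {np.round(nested_scores, 3)}")
print(f"Nested CV AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

The spread of `nested_scores` plays the same role as the per-fold bars above, but each score comes from rows that were excluded from both fitting and hyperparameter selection.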

## Summary

In this notebook, you learned:

1. **Cross-validation strategies** - Stratified k-fold for balanced evaluation
2. **Training callbacks** - Early stopping and model checkpointing
3. **Hyperparameter optimization** - Using Optuna for Bayesian optimization
4. **Model stability evaluation** - Cross-validation metrics with mean ± std

**Key Takeaways:**
- Always use cross-validation to assess model generalization
- Hyperparameter tuning can significantly improve performance
- Monitor multiple metrics (accuracy, AUC, F1) for comprehensive evaluation
- Use early stopping to limit overfitting and model checkpointing to keep the best model

**Next Steps:**
- Learn about signature benchmarking (next notebook)
- Try different models (XGBoost, Logistic Regression)
- Experiment with custom search spaces
- Apply to your own biomarker discovery problems