## What this notebook does (Classical Baselines)

This notebook builds **classical machine learning baselines** for the PLAsTiCC supernova subset (SNIa vs SNII) using the **full engineered feature set**. The goal is to establish a strong reference point before testing a quantum classifier.

### Steps in this notebook

1. **Load engineered features**
   - Reads `../data/plasticc/transient_features.csv`, which contains one row per transient and the precomputed light-curve features.

2. **Prepare the dataset**
   - Uses the full set of **16 engineered features** (magnitude, flux, timing, slopes, and `n_points`).
   - Converts labels from strings to numeric targets: **SNII → 0**, **SNIa → 1**.

3. **Train/test split**
   - Splits the data into **80% train / 20% test** with stratification to preserve the class balance.

4. **Scaling**
   - Applies `StandardScaler` for models that benefit from standardized inputs (Logistic Regression and the Ensemble).

5. **Train classical models**
   - **Logistic Regression**: linear baseline on scaled features.
   - **Random Forest**: non-linear baseline using raw (unscaled) features.
   - **CatBoost**: gradient boosting baseline using raw (unscaled) features.
   - **Soft Voting Ensemble**: combines LR + RF + CatBoost by weighted probability averaging (weights 1 : 2 : 3), refitting all three members on the scaled features.

6. **Evaluate performance**
   - Reports **Accuracy** and **ROC AUC** for each model.
   - Prints a **confusion matrix** to show SNII vs SNIa errors.

7. **Save results for reproducibility**
   - Writes metrics to `../results/plasticc_classical_results_2k.json` so they can be referenced later in the README and compared directly to the quantum notebook.
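
The soft-voting step in the list above can be illustrated with a minimal sketch. It assumes each model outputs a probability for SNIa and uses the same 1, 2, 3 weights passed to `VotingClassifier` in the code cell below. The probability values and the helper name `soft_vote` are made up for illustration.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class-1 probabilities."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probas)) / total

# Hypothetical P(SNIa) from LR, RF, and CatBoost for one transient
probas = [0.40, 0.55, 0.70]
weights = [1, 2, 3]  # same weights as the VotingClassifier in this notebook

p_snia = soft_vote(probas, weights)
label = 1 if p_snia > 0.5 else 0  # 1 = SNIa, 0 = SNII
print(f"P(SNIa) = {p_snia:.3f}, predicted label = {label}")
```

Because CatBoost carries the largest weight, its vote pulls the averaged probability above 0.5 here even though Logistic Regression alone would have predicted SNII.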

### Output / takeaway

At the end, you'll have a clear classical performance baseline (including an ensemble) that serves as the benchmark for the **quantum ML comparison** in Notebook 03.
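
Accuracy can be read straight off a confusion matrix. As a worked check, the sketch below recomputes it from the Logistic Regression matrix printed in this notebook's output (rows are actual classes, columns are predicted classes, both in SNII, SNIa order). The per-class recall values are an extra that the notebook itself does not report.

```python
# Logistic Regression confusion matrix from the notebook output
# rows = actual (SNII, SNIa), columns = predicted (SNII, SNIa)
cm = [[75, 35],
      [27, 78]]

total = sum(sum(row) for row in cm)   # number of test samples
correct = cm[0][0] + cm[1][1]         # diagonal entries are correct predictions
accuracy = correct / total

snii_recall = cm[0][0] / sum(cm[0])   # fraction of true SNII labelled SNII
snia_recall = cm[1][1] / sum(cm[1])   # fraction of true SNIa labelled SNIa

print(f"Accuracy:    {accuracy:.3f}")
print(f"SNII recall: {snii_recall:.3f}")
print(f"SNIa recall: {snia_recall:.3f}")
```

The recomputed accuracy, 153 / 215, rounds to the 0.712 reported for Logistic Regression.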

In [1]:
import os
import json
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve
)

# ============================================================================
# LOAD FEATURES
# ============================================================================

DATA_DIR = os.path.join("..", "data", "plasticc")
features_df = pd.read_csv(os.path.join(DATA_DIR, "transient_features.csv"))

print(f"Loaded {len(features_df)} samples with {len(features_df.columns)} columns")
print(f"\nClass distribution:")
print(features_df['label'].value_counts())

# ============================================================================
# PREPARE DATA - USE ALL 16 FEATURES
# ============================================================================

feature_cols = [
    # Magnitude features
    "mag_min", "mag_max", "mag_mean", "mag_std", "mag_range",
    # Flux features
    "flux_max", "flux_mean", "flux_std",
    # Time features
    "time_span", "rise_time", "decline_time", "rise_decline_ratio",
    # Slope features
    "mean_rise_slope", "mean_decline_slope", "max_slope",
    # Metadata
    "n_points"
]

X = features_df[feature_cols].values
y_str = features_df['label'].values

label_map = {'SNII': 0, 'SNIa': 1}
y = np.array([label_map[label] for label in y_str])

print(f"\nFeature matrix: {X.shape}")
print(f"Labels: {y.shape}, distribution: {np.unique(y, return_counts=True)}")

# ============================================================================
# TRAIN-TEST SPLIT
# ============================================================================

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nTrain: {X_train.shape[0]}, Test: {X_test.shape[0]}")

# ============================================================================
# SCALE FEATURES
# ============================================================================

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ============================================================================
# TRAIN MODELS
# ============================================================================

print("\n" + "=" * 70)
print("TRAINING CLASSICAL MODELS")
print("=" * 70)

models = {}

# Logistic Regression
print("\n1. Logistic Regression...")
lr = LogisticRegression(max_iter=2000, C=2.0, random_state=42)
lr.fit(X_train_scaled, y_train)
models['Logistic Regression'] = lr

# Random Forest
print("2. Random Forest...")
rf = RandomForestClassifier(
    n_estimators=600,
    max_depth=20,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
models['Random Forest'] = rf

# CatBoost
print("3. CatBoost...")
cb = CatBoostClassifier(
    iterations=800,
    depth=8,
    learning_rate=0.05,
    random_state=42,
    verbose=False
)
cb.fit(X_train, y_train)
models['CatBoost'] = cb

# Ensemble
print("4. Ensemble (Voting)...")
ensemble = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('rf', rf),
        ('cb', cb)
    ],
    voting='soft',
    weights=[1, 2, 3]
)
ensemble.fit(X_train_scaled, y_train)
models['Ensemble'] = ensemble

# ============================================================================
# EVALUATE ALL MODELS
# ============================================================================

print("\n" + "=" * 70)
print("CLASSICAL MODEL RESULTS")
print("=" * 70)

results = {}

for name, model in models.items():
    print(f"\n{name}:")
    print("-" * 50)
    
    # Use scaled features for LR and Ensemble, raw for RF and CB
    if name in ['Logistic Regression', 'Ensemble']:
        X_test_input = X_test_scaled
    else:
        X_test_input = X_test
    
    y_pred = model.predict(X_test_input)
    y_proba = model.predict_proba(X_test_input)[:, 1]
    
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    
    print(f"Accuracy: {acc:.1%} ({acc:.3f})")
    print(f"AUC:      {auc:.1%} ({auc:.3f})")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"              SNII  SNIa")
    print(f"Actual SNII   {cm[0,0]:3d}   {cm[0,1]:3d}")
    print(f"       SNIa   {cm[1,0]:3d}   {cm[1,1]:3d}")
    
    results[name] = {
        'accuracy': float(acc),
        'auc': float(auc),
        'confusion_matrix': cm.tolist()
    }

# ============================================================================
# SAVE RESULTS
# ============================================================================

RESULTS_DIR = os.path.join("..", "results")
os.makedirs(RESULTS_DIR, exist_ok=True)

results_path = os.path.join(RESULTS_DIR, "plasticc_classical_results_2k.json")
with open(results_path, 'w') as f:
    json.dump(results, f, indent=2)

print(f"\n✓ Results saved: {results_path}")

print("\n" + "=" * 70)
print("CLASSICAL TRAINING COMPLETE!")
print("=" * 70)
print("Next step: Train quantum model with top 3 features")

Loaded 1072 samples with 18 columns

Class distribution:
label
SNII    549
SNIa    523
Name: count, dtype: int64

Feature matrix: (1072, 16)
Labels: (1072,), distribution: (array([0, 1]), array([549, 523]))

Train: 857, Test: 215

TRAINING CLASSICAL MODELS

1. Logistic Regression...
2. Random Forest...
3. CatBoost...
4. Ensemble (Voting)...

CLASSICAL MODEL RESULTS

Logistic Regression:
--------------------------------------------------
Accuracy: 71.2% (0.712)
AUC:      77.0% (0.770)

Confusion Matrix:
              SNII  SNIa
Actual SNII    75    35
       SNIa    27    78

Random Forest:
--------------------------------------------------
Accuracy: 75.8% (0.758)
AUC:      84.5% (0.845)

Confusion Matrix:
              SNII  SNIa
Actual SNII    81    29
       SNIa    23    82

CatBoost:
--------------------------------------------------
Accuracy: 74.4% (0.744)
AUC:      84.6% (0.846)

Confusion Matrix:
              SNII  SNIa
Actual SNII    83    27
       SNIa    28    77

Ensemble:

## Feature Correlation Analysis (Preparation for Quantum ML)

Before moving to the quantum classifier, we analyze which engineered features
are most strongly correlated with the target label (SNIa vs SNII).

This step serves two purposes:
1. **Dimensionality reduction** – quantum models are limited to a small number
   of input features due to qubit constraints.
2. **Fair comparison** – ensures the quantum model uses the *best available*
   features rather than an arbitrary subset.

We use point-biserial correlation (appropriate for binary labels) to rank
all features and identify the top 3 candidates for quantum encoding.

The selected features and their class separation statistics are saved and
reused directly in `03_quantum_classifier.ipynb`.
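
Point-biserial correlation is the Pearson correlation between a binary variable and a continuous one. The sketch below uses made-up values for a single hypothetical feature, computes the coefficient from the group-means formula, and compares it with a direct Pearson calculation. The helper names `point_biserial` and `pearson` are illustrative and not part of the notebook.

```python
import math

def point_biserial(labels, values):
    """Point-biserial r from group means (labels must be 0/1)."""
    g1 = [v for lab, v in zip(labels, values) if lab == 1]
    g0 = [v for lab, v in zip(labels, values) if lab == 0]
    n, n1, n0 = len(values), len(g1), len(g0)
    mean_all = sum(values) / n
    # Population standard deviation of the feature over all samples
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    m1, m0 = sum(g1) / n1, sum(g0) / n0
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n**2)

def pearson(xs, ys):
    """Plain Pearson correlation, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: 0 = SNII, 1 = SNIa, with one hypothetical feature
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feature = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]

r_pb = point_biserial(labels, feature)
r = pearson(labels, feature)
print(f"point-biserial r = {r_pb:.4f}, Pearson r = {r:.4f}")
```

The two values agree up to floating-point error. The notebook calls `scipy.stats.pointbiserialr`, which returns this coefficient together with a p-value.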

In [None]:
# ============================================================================
# FEATURE CORRELATION ANALYSIS - FIND TOP 3 FOR QUANTUM
# ============================================================================

from scipy.stats import pointbiserialr

print("\n" + "=" * 70)
print("FEATURE CORRELATION WITH LABEL (SNIa vs SNII)")
print("=" * 70)

# Numeric encoding: SNII=0, SNIa=1
label_numeric = y

correlations = []
for i, feat in enumerate(feature_cols):
    corr, pval = pointbiserialr(label_numeric, X[:, i])
    correlations.append({
        'feature': feat,
        'correlation': abs(corr),  # absolute value for ranking
        'correlation_signed': corr,
        'p_value': pval
    })

corr_df = pd.DataFrame(correlations).sort_values('correlation', ascending=False)

print("\nFeatures ranked by correlation strength:")
print(corr_df.to_string(index=False))

# Select TOP 3
top_3_features = corr_df.head(3)['feature'].tolist()

print("\n" + "=" * 70)
print(f"🎯 TOP 3 FEATURES FOR QUANTUM: {top_3_features}")
print("=" * 70)

# Show class separation for these features
print("\nClass separation check:")
for feat in top_3_features:
    feat_idx = feature_cols.index(feat)
    feat_data = X[:, feat_idx]
    
    snia_vals = feat_data[y == 1]
    snii_vals = feat_data[y == 0]
    
    print(f"\n{feat}:")
    print(f"  SNIa: mean={snia_vals.mean():.3f}, std={snia_vals.std():.3f}")
    print(f"  SNII: mean={snii_vals.mean():.3f}, std={snii_vals.std():.3f}")
    # Absolute difference of class means, also expressed in units of the SNII std
    separation = abs(snia_vals.mean() - snii_vals.mean())
    print(f"  Separation: {separation:.3f} ({separation / snii_vals.std():.2f} σ)")

# Save for quantum notebook
top_3_path = os.path.join(RESULTS_DIR, "top_3_features.json")
with open(top_3_path, 'w') as f:
    json.dump({'top_3_features': top_3_features}, f, indent=2)

print(f"\n✓ Saved top 3 features to: {top_3_path}")