# Exoplanet Classification Model Training
## LightGBM + Random Forest Baselines

This notebook trains baseline classifiers that separate confirmed exoplanets from false positives among Kepler Objects of Interest (KOIs), using the cleaned Kepler dataset.

### Objectives:
1. Load and prepare cleaned Kepler data
2. Define classification target (binary or multi-class)
3. Train baseline models (RandomForest & LightGBM)
4. Evaluate performance metrics
5. Export trained models and metadata

### Models:
- **Random Forest**: `n_estimators=300, max_depth=None`
- **LightGBM**: `num_leaves=63, n_estimators=500, learning_rate=0.05`

### Evaluation Metrics:
- Accuracy
- ROC-AUC (binary) or Macro F1 (multi-class)
- Precision-Recall AUC (for imbalanced data)
- Confusion Matrix
- Feature Importance
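
For the imbalanced case, PR-AUC is usually reported as average precision, which is what scikit-learn's `average_precision_score` computes. The sketch below reimplements it in plain Python for illustration only. It assumes no tied scores, and the helper name `average_precision` and the toy values are hypothetical, not part of this notebook.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming no tied scores."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")

    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive is recovered.
            total += hits / rank
    return total / n_pos


# Toy example: 2 positives among 5 samples (illustrative values only).
y_toy = [1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.3, 0.1]
ap_toy = average_precision(y_toy, s_toy)
print(f"Average precision: {ap_toy:.4f}")
```

A model that ranks samples at random scores close to the positive-class fraction on this metric, which makes it a stricter yardstick than ROC-AUC when positives are rare.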

## 1. Setup & Imports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import json
from datetime import datetime, timezone

# Machine learning imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, 
    roc_auc_score, 
    f1_score, 
    confusion_matrix,
    classification_report,
    precision_recall_curve,
    roc_curve,
    auc,
    average_precision_score
)
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import joblib

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✓ Libraries imported successfully")
print(f"Random state set to: {RANDOM_STATE}")

OSError: dlopen(/Users/jorgesandoval/Documents/current/fermix/venv/lib/python3.13/site-packages/lightgbm/lib/lib_lightgbm.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib
  Referenced from: <D44045CD-B874-3A27-9A61-F131D99AACE4> /Users/jorgesandoval/Documents/current/fermix/venv/lib/python3.13/site-packages/lightgbm/lib/lib_lightgbm.dylib
  Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/libomp.dylib' (no such file), '/opt/homebrew/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/libomp.dylib' (no such file)

## 2. Load Cleaned Data

In [None]:
# Load the cleaned Kepler dataset
data_path = Path('../data/clean/kepler_clean.csv')

print("Loading cleaned Kepler dataset...")
df = pd.read_csv(data_path)

print(f"✓ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nFirst few rows:")
display(df.head())

print(f"\nColumn types:")
print(df.dtypes.value_counts())

print(f"\nDataset info:")
print(f"  - Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  - Missing values: {df.isnull().sum().sum():,}")

## 3. Define Classification Target

We'll create a classification target based on `koi_disposition`:
- **Binary**: CONFIRMED (1) vs FALSE POSITIVE (0), excluding CANDIDATE
- **3-Class**: CONFIRMED (2) vs CANDIDATE (1) vs FALSE POSITIVE (0)

We start with binary classification: CANDIDATE rows have not yet been confirmed or ruled out, so excluding them leaves a target with unambiguous labels.
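
The two target encodings listed above can be sketched as a small mapping function. This is a plain-Python illustration, not the code the notebook runs; `encode_dispositions` and the sample list are hypothetical names.

```python
BINARY_MAP = {"FALSE POSITIVE": 0, "CONFIRMED": 1}
THREE_CLASS_MAP = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}


def encode_dispositions(dispositions, task="binary"):
    """Return (kept_indices, labels) for the chosen task."""
    if task == "binary":
        mapping = BINARY_MAP
    elif task == "multiclass":
        mapping = THREE_CLASS_MAP
    else:
        raise ValueError(f"unknown task: {task!r}")

    kept, labels = [], []
    for i, disposition in enumerate(dispositions):
        if disposition in mapping:  # the binary task drops CANDIDATE rows
            kept.append(i)
            labels.append(mapping[disposition])
    return kept, labels


sample = ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"]
binary_idx, binary_labels = encode_dispositions(sample, task="binary")
multi_idx, multi_labels = encode_dispositions(sample, task="multiclass")
print(binary_idx, binary_labels)
print(multi_idx, multi_labels)
```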

In [None]:
# Examine target distribution
print("="*80)
print("TARGET VARIABLE ANALYSIS")
print("="*80)

print(f"\nOriginal koi_disposition distribution:")
print(df['koi_disposition'].value_counts())
print(f"\nPercentages:")
print(df['koi_disposition'].value_counts(normalize=True) * 100)

# Create binary classification target (CONFIRMED vs FALSE POSITIVE)
# Remove CANDIDATE for cleaner binary classification
df_binary = df[df['koi_disposition'].isin(['CONFIRMED', 'FALSE POSITIVE'])].copy()

# Create target variable
df_binary['label'] = (df_binary['koi_disposition'] == 'CONFIRMED').astype(int)

print(f"\n{'='*80}")
print(f"BINARY CLASSIFICATION DATASET")
print(f"{'='*80}")
print(f"Dataset size: {df_binary.shape[0]:,} rows (removed {df.shape[0] - df_binary.shape[0]:,} CANDIDATE rows)")
print(f"\nTarget distribution:")
print(df_binary['label'].value_counts())
print(f"\nClass balance:")
class_pct = df_binary['label'].value_counts(normalize=True) * 100
print(f"  - Class 0 (FALSE POSITIVE): {class_pct[0]:.2f}%")
print(f"  - Class 1 (CONFIRMED): {class_pct[1]:.2f}%")

# Check for class imbalance
imbalance_ratio = class_pct.max() / class_pct.min()
print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1")
if imbalance_ratio > 2:
    print("⚠️  Dataset is imbalanced - PR-AUC will be an important metric")
else:
    print("✓ Dataset is relatively balanced")

# Store task type
TASK_TYPE = "binary"
print(f"\n🎯 Task type: {TASK_TYPE} classification")
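
If the imbalance check fires, one common mitigation (not used in this notebook) is class weighting. Both `RandomForestClassifier` and `LGBMClassifier` accept `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below only reproduces that formula on toy labels; `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels with a 3:1 imbalance (illustrative values only).
toy_labels = [0] * 6 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight, so each of its samples contributes more to the training loss.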

## 4. Feature Selection & Preprocessing

In [None]:
# Select features for modeling
print("="*80)
print("FEATURE SELECTION")
print("="*80)

# Exclude non-feature columns
exclude_cols = [
    'rowid', 'kepid', 'kepoi_name', 'kepler_name', 
    'koi_disposition', 'koi_pdisposition', 'koi_comment',
    'koi_disp_prov', 'label'  # target
]

# Get all numeric columns
numeric_cols = df_binary.select_dtypes(include=[np.number]).columns.tolist()

# Remove excluded columns
feature_cols = [col for col in numeric_cols if col not in exclude_cols]

print(f"\nTotal numeric columns: {len(numeric_cols)}")
print(f"Excluded numeric columns: {len(numeric_cols) - len(feature_cols)}")
print(f"Selected features: {len(feature_cols)}")

# Check for missing values in features
missing_by_col = df_binary[feature_cols].isnull().sum()
cols_with_missing = missing_by_col[missing_by_col > 0]

print(f"\nFeatures with missing values: {len(cols_with_missing)}")
if len(cols_with_missing) > 0:
    print("\nTop 10 features with missing values:")
    for col, count in cols_with_missing.head(10).items():
        pct = (count / len(df_binary) * 100)
        print(f"  - {col}: {count} ({pct:.2f}%)")

# Create feature matrix and target
X = df_binary[feature_cols].copy()
y = df_binary['label'].copy()

# Handle missing values - simple imputation with median
print(f"\n{'='*80}")
print("HANDLING MISSING VALUES")
print(f"{'='*80}")

missing_before = X.isnull().sum().sum()
print(f"Missing values before imputation: {missing_before:,}")

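# Impute with the per-column median.
# Assign the result back rather than calling fillna(..., inplace=True) on X[col]:
# that chained in-place form is deprecated in pandas 2.x and does not update X
# under copy-on-write.
# NOTE: these medians come from the full dataset, before the train/test split,
# so test rows influence the imputed values.
X = X.fillna(X.median())

missing_after = X.isnull().sum().sum()
print(f"Missing values after imputation: {missing_after:,}")
if missing_after == 0:
    print("✓ All missing values handled")
else:
    print("⚠️  Some columns are entirely missing and could not be imputed")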

# Final dataset info
print(f"\n{'='*80}")
print("FINAL DATASET")
print(f"{'='*80}")
print(f"Features (X): {X.shape[0]:,} rows × {X.shape[1]} features")
print(f"Target (y): {y.shape[0]:,} samples")
print(f"Feature list (first 10): {feature_cols[:10]}")
print(f"...")
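
Selecting every numeric column can let vetting outputs into the feature matrix. In the public KOI cumulative table, `koi_score` and the `koi_fpflag_*` columns are produced by the disposition pipeline itself, so a model that sees them is partly reading the answer. Whether the cleaned file still carries these columns is an assumption here. The sketch below shows one way to flag them; `split_leaky` and the example column list are hypothetical.

```python
# Assumed leakage candidates, based on the public KOI column naming.
LEAKY_PREFIXES = ("koi_fpflag_",)
LEAKY_NAMES = {"koi_score"}


def split_leaky(feature_names, prefixes=LEAKY_PREFIXES, names=LEAKY_NAMES):
    """Split feature names into (kept, flagged) lists, preserving order."""
    kept, flagged = [], []
    for name in feature_names:
        if name in names or name.startswith(prefixes):
            flagged.append(name)
        else:
            kept.append(name)
    return kept, flagged


example_cols = ["koi_period", "koi_fpflag_nt", "koi_depth", "koi_score", "koi_fpflag_ss"]
kept_cols, flagged_cols = split_leaky(example_cols)
print("kept:   ", kept_cols)
print("flagged:", flagged_cols)
```

In the notebook this could be applied as `feature_cols, flagged = split_leaky(feature_cols)` before building `X`, followed by a manual review of the flagged names.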

## 5. Train/Test Split

In [None]:
# Split data into train and test sets
TEST_SIZE = 0.2

print("="*80)
print("TRAIN/TEST SPLIT")
print("="*80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=TEST_SIZE, 
    random_state=RANDOM_STATE,
    stratify=y  # Maintain class distribution
)

print(f"\nTest size: {TEST_SIZE:.0%}")
print(f"\nTraining set:")
print(f"  - X_train: {X_train.shape[0]:,} rows × {X_train.shape[1]} features")
print(f"  - y_train: {y_train.shape[0]:,} samples")
print(f"  - Class distribution: {y_train.value_counts().to_dict()}")

print(f"\nTest set:")
print(f"  - X_test: {X_test.shape[0]:,} rows × {X_test.shape[1]} features")
print(f"  - y_test: {y_test.shape[0]:,} samples")
print(f"  - Class distribution: {y_test.value_counts().to_dict()}")

# Verify stratification
train_pos_pct = (y_train.sum() / len(y_train)) * 100
test_pos_pct = (y_test.sum() / len(y_test)) * 100
print(f"\nClass balance verification:")
print(f"  - Train positive class: {train_pos_pct:.2f}%")
print(f"  - Test positive class: {test_pos_pct:.2f}%")
print(f"✓ Stratification successful (similar distributions)")
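
The imputation step in Section 4 computes medians on all rows before this split, so the test set influences the values used to fill the training data. A leakage-free variant fits the medians on the training portion only and reuses them for the test portion. The following is a minimal sketch with pandas on toy frames; `impute_with_train_medians` and the toy values are hypothetical.

```python
import pandas as pd


def impute_with_train_medians(train_df, test_df):
    """Fill missing values in both frames using medians from the training frame only."""
    medians = train_df.median(numeric_only=True)
    return train_df.fillna(medians), test_df.fillna(medians), medians


# Toy frames (illustrative values only).
train_toy = pd.DataFrame({"koi_period": [1.0, None, 3.0], "koi_depth": [10.0, 20.0, None]})
test_toy = pd.DataFrame({"koi_period": [None, 100.0], "koi_depth": [None, 5.0]})

train_filled, test_filled, medians = impute_with_train_medians(train_toy, test_toy)
print(medians)
print(test_filled)
```

Here the test row's missing `koi_period` is filled with the training median, so the extreme test value of 100.0 does not affect it.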

## 6. Train Baseline Models

### 6.1 Random Forest Classifier

In [None]:
# Train Random Forest Classifier
print("="*80)
print("TRAINING RANDOM FOREST CLASSIFIER")
print("="*80)

# Initialize model with specified hyperparameters
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=RANDOM_STATE,
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

print("\nModel hyperparameters:")
print(f"  - n_estimators: {rf_model.n_estimators}")
print(f"  - max_depth: {rf_model.max_depth}")
print(f"  - random_state: {rf_model.random_state}")

print("\nTraining model...")
import time
start_time = time.time()

rf_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds")

# Make predictions
print("\nGenerating predictions...")
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("✓ Random Forest model trained and predictions generated")
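
Feature importance is listed among the evaluation outputs, and fitted tree ensembles expose it through `feature_importances_`. The helper below is a plain-Python sketch of ranking features by that array; `top_features` and the toy numbers are hypothetical.

```python
def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the highest importance."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values (illustrative only); in the notebook the inputs would be
# feature_cols and rf_model.feature_importances_.
toy_names = ["koi_period", "koi_depth", "koi_duration", "koi_prad"]
toy_importances = [0.10, 0.45, 0.15, 0.30]
toy_ranking = top_features(toy_names, toy_importances, k=3)
for name, score in toy_ranking:
    print(f"{name:<15} {score:.2f}")
```

Random Forest importances are impurity-based and normalized, while `LGBMClassifier` reports split counts by default, so the two rankings are not on the same scale.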

### 6.2 LightGBM Classifier

In [None]:
# Train LightGBM Classifier
print("="*80)
print("TRAINING LIGHTGBM CLASSIFIER")
print("="*80)

# Initialize model with specified hyperparameters
lgbm_model = lgb.LGBMClassifier(
    num_leaves=63,
    n_estimators=500,
    learning_rate=0.05,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=-1  # Suppress per-iteration output
)

print("\nModel hyperparameters:")
print(f"  - num_leaves: {lgbm_model.num_leaves}")
print(f"  - n_estimators: {lgbm_model.n_estimators}")
print(f"  - learning_rate: {lgbm_model.learning_rate}")
print(f"  - random_state: {lgbm_model.random_state}")

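print("\nTraining model...")
start_time = time.time()

# Hold out part of the training data for early stopping. Monitoring the test
# set here would tune the number of trees on the data used for the final
# evaluation and make the reported test metrics optimistic.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2,
    random_state=RANDOM_STATE, stratify=y_train
)

lgbm_model.fit(
    X_fit, y_fit,
    eval_set=[(X_val, y_val)],
    eval_metric='auc',
    callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
)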

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds")
print(f"  - Best iteration: {lgbm_model.best_iteration_}")
print(f"  - Best score: {lgbm_model.best_score_['valid_0']['auc']:.4f}")

# Make predictions
print("\nGenerating predictions...")
y_pred_lgbm = lgbm_model.predict(X_test)
y_pred_proba_lgbm = lgbm_model.predict_proba(X_test)[:, 1]

print("✓ LightGBM model trained and predictions generated")
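
The `lgb.early_stopping(stopping_rounds=50)` callback stops training once the monitored metric has gone 50 rounds without improving and keeps the best iteration. The sketch below illustrates that rule on a toy score sequence. It is a simplified model of the behavior, not LightGBM's implementation, and `early_stopping_point` is a hypothetical name.

```python
def early_stopping_point(scores, patience):
    """Return (best_iteration, stop_iteration), both 1-indexed, for a higher-is-better metric."""
    best_score = float("-inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_iter, i
    # The sequence ended before the patience budget ran out.
    return best_iter, len(scores)


# Toy validation AUC per boosting round (illustrative values only).
toy_auc = [0.80, 0.85, 0.84, 0.86, 0.85, 0.85, 0.85]
best_it, stop_it = early_stopping_point(toy_auc, patience=2)
print(f"best iteration: {best_it}, stopped at: {stop_it}")
```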

## 7. Model Evaluation

Calculate comprehensive metrics for both models.

In [None]:
# Evaluate both models
print("="*80)
print("MODEL EVALUATION")
print("="*80)

def evaluate_model(y_true, y_pred, y_pred_proba, model_name):
    """Calculate comprehensive evaluation metrics"""
    
    # Basic metrics
    accuracy = accuracy_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    pr_auc = average_precision_score(y_true, y_pred_proba)
    f1 = f1_score(y_true, y_pred)
    
    # Confusion matrix (labels fixed so the 2x2 unpacking below always holds)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    # Precision and Recall
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    print(f"\n{'='*80}")
    print(f"{model_name.upper()} PERFORMANCE")
    print(f"{'='*80}")
    print(f"\n📊 Classification Metrics:")
    print(f"  - Accuracy:        {accuracy:.4f}")
    print(f"  - ROC-AUC:         {roc_auc:.4f}")
    print(f"  - PR-AUC:          {pr_auc:.4f}")
    print(f"  - F1 Score:        {f1:.4f}")
    print(f"  - Precision:       {precision:.4f}")
    print(f"  - Recall:          {recall:.4f}")
    
    print(f"\n📈 Confusion Matrix:")
    print(f"  - True Negatives:  {tn}")
    print(f"  - False Positives: {fp}")
    print(f"  - False Negatives: {fn}")
    print(f"  - True Positives:  {tp}")
    
    print(f"\n📋 Classification Report:")
    print(classification_report(y_true, y_pred, target_names=['FALSE POSITIVE', 'CONFIRMED']))
    
    return {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
        'f1_score': f1,
        'precision': precision,
        'recall': recall,
        'confusion_matrix': cm.tolist()
    }

# Evaluate Random Forest
metrics_rf = evaluate_model(y_test, y_pred_rf, y_pred_proba_rf, "Random Forest")

# Evaluate LightGBM
metrics_lgbm = evaluate_model(y_test, y_pred_lgbm, y_pred_proba_lgbm, "LightGBM")

# Compare models
print(f"\n{'='*80}")
print("MODEL COMPARISON")
print(f"{'='*80}")
print(f"\n{'Metric':<20} {'Random Forest':<15} {'LightGBM':<15} {'Winner':<10}")
print(f"{'-'*60}")
for metric in ['accuracy', 'roc_auc', 'pr_auc', 'f1_score']:
    rf_val = metrics_rf[metric]
    lgbm_val = metrics_lgbm[metric]
    winner = "🏆 RF" if rf_val > lgbm_val else "🏆 LGBM" if lgbm_val > rf_val else "Tie"
    print(f"{metric:<20} {rf_val:<15.4f} {lgbm_val:<15.4f} {winner:<10}")

print("\n🎯 Evaluation complete for both models")
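
The precision, recall, F1, and accuracy values reported above all follow from the four confusion-matrix counts. A worked example in plain Python, using made-up counts and a hypothetical helper `metrics_from_counts`:

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the headline metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Made-up counts for illustration: 100 samples, 40 of them truly positive.
toy_metrics = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
for name, value in toy_metrics.items():
    print(f"{name:<10} {value:.4f}")
```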

## 8. Export Trained Models

Save models and metadata for future use.

In [None]:
# Create models directory
models_dir = Path('../models')
models_dir.mkdir(parents=True, exist_ok=True)

print("="*80)
print("EXPORTING MODELS AND METADATA")
print("="*80)

# Save Random Forest model
rf_model_path = models_dir / 'model_rf.pkl'
joblib.dump(rf_model, rf_model_path)
print(f"\n✓ Random Forest model saved: {rf_model_path}")
print(f"  - File size: {rf_model_path.stat().st_size / 1024**2:.2f} MB")

# Save LightGBM model
lgbm_model_path = models_dir / 'model_lgbm.pkl'
joblib.dump(lgbm_model, lgbm_model_path)
print(f"\n✓ LightGBM model saved: {lgbm_model_path}")
print(f"  - File size: {lgbm_model_path.stat().st_size / 1024**2:.2f} MB")

# Create metadata
metadata = {
    "created_utc": datetime.now(timezone.utc).isoformat(),
    "dataset": "Kepler KOI cleaned (binary classification)",
    "task": "binary",
    "n_samples": {
        "total": len(df_binary),
        "train": len(X_train),
        "test": len(X_test)
    },
    "n_features": len(feature_cols),
    "features": feature_cols,
    "target": "label",
    "target_mapping": {
        "0": "FALSE POSITIVE",
        "1": "CONFIRMED"
    },
    "class_distribution": {
        "train": y_train.value_counts().to_dict(),
        "test": y_test.value_counts().to_dict()
    },
    "models": {
        "random_forest": {
            "version": "1.0.0",
            "hyperparameters": {
                "n_estimators": 300,
                "max_depth": None,
                "random_state": RANDOM_STATE
            },
            "metrics": {
                "accuracy": float(metrics_rf['accuracy']),
                "roc_auc": float(metrics_rf['roc_auc']),
                "pr_auc": float(metrics_rf['pr_auc']),
                "f1_score": float(metrics_rf['f1_score']),
                "precision": float(metrics_rf['precision']),
                "recall": float(metrics_rf['recall'])
            },
            "confusion_matrix": metrics_rf['confusion_matrix']
        },
        "lightgbm": {
            "version": "1.0.0",
            "hyperparameters": {
                "num_leaves": 63,
                "n_estimators": 500,
                "learning_rate": 0.05,
                "random_state": RANDOM_STATE
            },
            "metrics": {
                "accuracy": float(metrics_lgbm['accuracy']),
                "roc_auc": float(metrics_lgbm['roc_auc']),
                "pr_auc": float(metrics_lgbm['pr_auc']),
                "f1_score": float(metrics_lgbm['f1_score']),
                "precision": float(metrics_lgbm['precision']),
                "recall": float(metrics_lgbm['recall'])
            },
            "confusion_matrix": metrics_lgbm['confusion_matrix'],
            "best_iteration": int(lgbm_model.best_iteration_)
        }
    },
    "preprocessing": {
        "missing_value_strategy": "median imputation",
        "feature_selection": "numeric features only, excluded identifiers and target",
        "train_test_split": {
            "test_size": TEST_SIZE,
            "random_state": RANDOM_STATE,
            "stratify": True
        }
    }
}

# Save metadata
metadata_path = models_dir / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\n✓ Metadata saved: {metadata_path}")

# Save feature list separately for easy access
features_path = models_dir / 'features.json'
with open(features_path, 'w') as f:
    json.dump({"features": feature_cols, "n_features": len(feature_cols)}, f, indent=2)

print(f"✓ Feature list saved: {features_path}")

print(f"\n{'='*80}")
print("EXPORT COMPLETE")
print(f"{'='*80}")
print(f"\n📦 Exported files:")
print(f"  - model_rf.pkl")
print(f"  - model_lgbm.pkl")
print(f"  - metadata.json")
print(f"  - features.json")
print(f"\n🎉 All models and metadata successfully exported!")
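
When the exported models are reloaded (for example with `joblib.load(models_dir / 'model_rf.pkl')`), new data has to present the same features in the same order as during training. The sketch below shows how `features.json` could be used for that check. It writes a stand-in file to a temporary directory, and `load_feature_names`, `align_row`, and the sample values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_feature_names(path):
    """Read the ordered feature list from a features.json-style file."""
    with open(path) as f:
        return json.load(f)["features"]


def align_row(row, feature_names):
    """Order a dict of feature values to match training; fail on missing features."""
    missing = [name for name in feature_names if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [row[name] for name in feature_names]


# Stand-in for ../models/features.json (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    features_file = Path(tmp) / "features.json"
    features_file.write_text(
        json.dumps({"features": ["koi_period", "koi_depth"], "n_features": 2})
    )
    names = load_feature_names(features_file)

aligned = align_row({"koi_depth": 120.5, "koi_period": 3.2, "extra": 1}, names)
print(names, aligned)
```

Extra keys are ignored and missing ones raise, so a schema drift in new data surfaces before the model is called.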

## Summary

### ✅ Completed Workflow:
1. **Data Loading** - Loaded cleaned Kepler dataset
2. **Target Definition** - Created binary classification (CONFIRMED vs FALSE POSITIVE)
3. **Feature Selection** - Selected numeric features, handled missing values
4. **Train/Test Split** - 80/20 split with stratification
5. **Model Training** - Trained Random Forest and LightGBM classifiers
6. **Evaluation** - Calculated comprehensive metrics (Accuracy, ROC-AUC, PR-AUC, F1)
7. **Export** - Saved models and metadata

### 🎯 Next Steps:
- Run `04_eval_plots.ipynb` to generate visualizations
- Review `docs/model_card.md` for model documentation
- Use models for predictions on new data

---
*Model training pipeline completed successfully!*