# Exoplanet Classification Model Training
## LightGBM + Random Forest Baselines

This notebook trains baseline models that classify Kepler Objects of Interest (KOIs) as confirmed exoplanets or false positives, using the cleaned Kepler dataset.

### Objectives:
1. Load and prepare cleaned Kepler data
2. Define classification target (binary or multi-class)
3. Train baseline models (RandomForest & LightGBM)
4. Evaluate performance metrics
5. Export trained models and metadata

### Models:
- **Random Forest**: `n_estimators=300, max_depth=None`
- **LightGBM**: `num_leaves=63, n_estimators=500, learning_rate=0.05`

### Evaluation Metrics:
- Accuracy
- ROC-AUC (binary) or Macro F1 (multi-class)
- Precision-Recall AUC (for imbalanced data)
- Confusion Matrix
- Feature Importance
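
As a rough illustration of how these metrics can be computed with scikit-learn, the sketch below uses a small hand-made set of labels and scores rather than Kepler data. The 0.5 decision threshold and the toy values are assumptions made only for this example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)

# Toy labels and predicted probabilities (illustrative values, not Kepler data)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.05])

# Threshold-dependent metrics need hard predictions; 0.5 is an assumed cutoff
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Ranking metrics use the scores directly, without a threshold
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # average precision as a PR-AUC summary

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} roc_auc={roc_auc:.3f} pr_auc={pr_auc:.3f}")
print(cm)
```

Accuracy, F1 and the confusion matrix depend on the chosen threshold, while ROC-AUC and PR-AUC summarize ranking quality across all thresholds.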

## 1. Setup & Imports

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import json
from datetime import datetime, timezone

# Machine learning imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, 
    roc_auc_score, 
    f1_score, 
    confusion_matrix,
    classification_report,
    precision_recall_curve,
    roc_curve,
    auc,
    average_precision_score
)
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import joblib

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✓ Libraries imported successfully")
print(f"Random state set to: {RANDOM_STATE}")

✓ Libraries imported successfully
Random state set to: 42


## 2. Load Cleaned Data

In [3]:
# Load the cleaned Kepler dataset
data_path = Path('../data/clean/kepler_clean.csv')

print("Loading cleaned Kepler dataset...")
df = pd.read_csv(data_path)

print(f"✓ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nFirst few rows:")
display(df.head())

print(f"\nColumn types:")
print(df.dtypes.value_counts())

print(f"\nDataset info:")
print(f"  - Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  - Missing values: {df.isnull().sum().sum():,}")

Loading cleaned Kepler dataset...
✓ Dataset loaded: 9,564 rows × 122 columns

First few rows:


Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_vet_stat,koi_vet_date,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_disp_prov,koi_comment,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,koi_time0bk_err2,koi_time0,koi_time0_err1,koi_time0_err2,koi_eccen,koi_impact,koi_impact_err1,koi_impact_err2,koi_duration,koi_duration_err1,koi_duration_err2,koi_depth,koi_depth_err1,koi_depth_err2,koi_ror,koi_ror_err1,koi_ror_err2,koi_srho,koi_srho_err1,koi_srho_err2,koi_fittype,koi_prad,koi_prad_err1,koi_prad_err2,koi_sma,koi_incl,koi_teq,koi_insol,koi_insol_err1,koi_insol_err2,koi_dor,koi_dor_err1,koi_dor_err2,koi_limbdark_mod,koi_ldm_coeff4,koi_ldm_coeff3,koi_ldm_coeff2,koi_ldm_coeff1,koi_parm_prov,koi_max_sngle_ev,koi_max_mult_ev,koi_model_snr,koi_count,koi_num_transits,koi_tce_plnt_num,koi_tce_delivname,koi_quarters,koi_bin_oedp_sig,koi_trans_mod,koi_datalink_dvr,koi_datalink_dvs,koi_steff,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_smet,koi_smet_err1,koi_smet_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_smass,koi_smass_err1,koi_smass_err2,koi_sparprov,ra,dec,koi_kepmag,koi_gmag,koi_rmag,koi_imag,koi_zmag,koi_jmag,koi_hmag,koi_kmag,koi_fwm_stat_sig,koi_fwm_sra,koi_fwm_sra_err,koi_fwm_sdec,koi_fwm_sdec_err,koi_fwm_srao,koi_fwm_srao_err,koi_fwm_sdeco,koi_fwm_sdeco_err,koi_fwm_prao,koi_fwm_prao_err,koi_fwm_pdeco,koi_fwm_pdeco_err,koi_dicco_mra,koi_dicco_mra_err,koi_dicco_mdec,koi_dicco_mdec_err,koi_dicco_msky,koi_dicco_msky_err,koi_dikco_mra,koi_dikco_mra_err,koi_dikco_mdec,koi_dikco_mdec_err,koi_dikco_msky,koi_dikco_msky_err
0,1,10797460,K00752.01,Kepler-227 b,CONFIRMED,Done,2018-08-16,CANDIDATE,1.0,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,-0.00216,2455003.539,0.00216,-0.00216,0.0,0.146,0.318,-0.146,2.9575,0.0819,-0.0819,615.8,19.5,-19.5,0.022344,0.000832,-0.000528,3.20796,0.33173,-1.09986,LS+MCMC,2.26,0.26,-0.15,0.0853,89.66,793.0,93.59,29.45,-16.65,24.81,2.6,-2.6,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.2291,0.4603,q1_q17_dr25_koi,5.135849,28.47082,35.8,2,142.0,1.0,q1_q17_dr25_tce,11111111111111111000000000000000,0.6864,Mandel and Agol (2002 ApJ 580 171),010/010797/010797460/dv/kplr010797460-20160209...,010/010797/010797460/dv/kplr010797460-001-2016...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.14,0.15,-0.15,0.927,0.105,-0.061,0.919,0.052,-0.046,q1_q17_dr25_stellar,291.93423,48.141651,15.347,15.89,15.27,15.114,15.006,14.082,13.751,13.648,0.002,19.462294,1.4e-05,48.14191,0.00013,0.43,0.51,0.94,0.48,-0.0002,0.00032,-0.00055,0.00031,-0.01,0.13,0.2,0.16,0.2,0.17,0.08,0.13,0.31,0.17,0.32,0.16
1,2,10797460,K00752.02,Kepler-227 c,CONFIRMED,Done,2018-08-16,CANDIDATE,0.969,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,54.418383,0.0002479,-0.0002479,162.51384,0.00352,-0.00352,2454995.514,0.00352,-0.00352,0.0,0.586,0.059,-0.443,4.507,0.116,-0.116,874.8,35.5,-35.5,0.027954,0.009078,-0.001347,3.02368,2.20489,-2.49638,LS+MCMC,2.83,0.32,-0.19,0.2734,89.57,443.0,9.11,2.87,-1.62,77.9,28.4,-28.4,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.2291,0.4603,q1_q17_dr25_koi,7.027669,20.109507,25.8,2,25.0,2.0,q1_q17_dr25_tce,11111111111111111000000000000000,0.0023,Mandel and Agol (2002 ApJ 580 171),010/010797/010797460/dv/kplr010797460-20160209...,010/010797/010797460/dv/kplr010797460-002-2016...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.14,0.15,-0.15,0.927,0.105,-0.061,0.919,0.052,-0.046,q1_q17_dr25_stellar,291.93423,48.141651,15.347,15.89,15.27,15.114,15.006,14.082,13.751,13.648,0.003,19.462265,2e-05,48.14199,0.00019,-0.63,0.72,1.23,0.68,0.00066,0.00065,-0.00105,0.00063,0.39,0.36,0.0,0.48,0.39,0.36,0.49,0.34,0.12,0.73,0.5,0.45
2,3,10811496,K00753.01,,CANDIDATE,Done,2018-08-16,CANDIDATE,0.0,0,0,0,0,q1_q17_dr25_sup_koi,DEEP_V_SHAPED,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,-0.000581,2455008.85,0.000581,-0.000581,0.0,0.969,5.126,-0.077,1.7822,0.0341,-0.0341,10829.0,171.0,-171.0,0.154046,5.034292,-0.042179,7.29555,35.03293,-2.75453,LS+MCMC,14.6,3.92,-1.31,0.1419,88.96,638.0,39.3,31.04,-10.49,53.5,25.7,-25.7,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.2711,0.3858,q1_q17_dr25_koi,37.159767,187.4491,76.3,1,56.0,1.0,q1_q17_dr25_tce,11111101110111011000000000000000,0.6624,Mandel and Agol (2002 ApJ 580 171),010/010811/010811496/dv/kplr010811496-20160209...,010/010811/010811496/dv/kplr010811496-001-2016...,5853.0,158.0,-176.0,4.544,0.044,-0.176,-0.18,0.3,-0.3,0.868,0.233,-0.078,0.961,0.11,-0.121,q1_q17_dr25_stellar,297.00482,48.134129,15.436,15.943,15.39,15.22,15.166,14.254,13.9,13.826,0.278,19.800321,1.9e-06,48.13412,2e-05,-0.021,0.069,-0.038,0.071,0.0007,0.0024,0.0006,0.0034,-0.025,0.07,-0.034,0.07,0.042,0.072,0.002,0.071,-0.027,0.074,0.027,0.074
3,4,10848459,K00754.01,,FALSE POSITIVE,Done,2018-08-16,FALSE POSITIVE,0.0,0,1,0,0,q1_q17_dr25_sup_koi,MOD_ODDEVEN_DV---MOD_ODDEVEN_ALT---DEEP_V_SHAPED,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,-0.000115,2455003.308,0.000115,-0.000115,0.0,1.276,0.115,-0.092,2.40641,0.00537,-0.00537,8079.2,12.8,-12.8,0.387394,0.109232,-0.08495,0.2208,0.00917,-0.01837,LS+MCMC,33.46,8.5,-2.83,0.0267,67.09,1395.0,891.96,668.95,-230.35,3.278,0.136,-0.136,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.2865,0.3556,q1_q17_dr25_koi,39.06655,541.8951,505.6,1,621.0,1.0,q1_q17_dr25_tce,11111110111011101000000000000000,0.0,Mandel and Agol (2002 ApJ 580 171),010/010848/010848459/dv/kplr010848459-20160209...,010/010848/010848459/dv/kplr010848459-001-2016...,5805.0,157.0,-174.0,4.564,0.053,-0.168,-0.52,0.3,-0.3,0.791,0.201,-0.067,0.836,0.093,-0.077,q1_q17_dr25_stellar,285.53461,48.28521,15.597,16.1,15.554,15.382,15.266,14.326,13.911,13.809,0.0,19.035638,8.6e-07,48.28521,7e-06,-0.111,0.031,0.002,0.027,0.00302,0.00057,-0.00142,0.00081,-0.249,0.072,0.147,0.078,0.289,0.079,-0.257,0.072,0.099,0.077,0.276,0.076
4,5,10854555,K00755.01,Kepler-664 b,CONFIRMED,Done,2018-08-16,CANDIDATE,1.0,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,-0.00113,2455004.596,0.00113,-0.00113,0.0,0.701,0.235,-0.478,1.6545,0.042,-0.042,603.3,16.9,-16.9,0.024064,0.003751,-0.001522,1.98635,2.71141,-1.74541,LS+MCMC,2.75,0.88,-0.35,0.0374,85.41,1406.0,926.16,874.33,-314.24,8.75,4.0,-4.0,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.2844,0.3661,q1_q17_dr25_koi,4.749945,33.1919,40.9,1,515.0,1.0,q1_q17_dr25_tce,01111111111111111000000000000000,0.309,Mandel and Agol (2002 ApJ 580 171),010/010854/010854555/dv/kplr010854555-20160209...,010/010854/010854555/dv/kplr010854555-001-2016...,6031.0,169.0,-211.0,4.438,0.07,-0.21,0.07,0.25,-0.3,1.046,0.334,-0.133,1.095,0.151,-0.136,q1_q17_dr25_stellar,288.75488,48.2262,15.509,16.015,15.468,15.292,15.241,14.366,14.064,13.952,0.733,19.250326,9.7e-06,48.22626,0.0001,-0.01,0.35,0.23,0.37,8e-05,0.0002,-7e-05,0.00022,0.03,0.19,-0.09,0.18,0.1,0.14,0.07,0.18,0.02,0.16,0.07,0.2



Column types:
float64    98
object     17
int64       7
Name: count, dtype: int64

Dataset info:
  - Memory usage: 18.34 MB
  - Missing values: 55,396


## 3. Define Classification Target

We derive the classification target from `koi_disposition`. Two encodings are possible:
- **Binary**: CONFIRMED (1) vs FALSE POSITIVE (0), excluding CANDIDATE
- **3-Class**: CONFIRMED (2) vs CANDIDATE (1) vs FALSE POSITIVE (0)

This notebook uses the binary encoding. CANDIDATE rows have not yet been confirmed or ruled out, so excluding them gives the models unambiguous labels.
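
To make the two encodings concrete, here is a minimal sketch on a four-row toy frame. The frame `toy` and the column `label_3class` are hypothetical names used only for illustration; the real `koi_disposition` column is loaded from the cleaned CSV.

```python
import pandas as pd

# Toy dispositions standing in for the `koi_disposition` column
toy = pd.DataFrame(
    {"koi_disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"]}
)

# 3-class encoding: FALSE POSITIVE -> 0, CANDIDATE -> 1, CONFIRMED -> 2
three_class_map = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}
toy["label_3class"] = toy["koi_disposition"].map(three_class_map)

# Binary encoding: drop CANDIDATE rows, then CONFIRMED -> 1, FALSE POSITIVE -> 0
toy_binary = toy[toy["koi_disposition"] != "CANDIDATE"].copy()
toy_binary["label"] = (toy_binary["koi_disposition"] == "CONFIRMED").astype(int)

print(toy)
print(toy_binary)
```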

In [4]:
# Examine target distribution
print("="*80)
print("TARGET VARIABLE ANALYSIS")
print("="*80)

print(f"\nOriginal koi_disposition distribution:")
print(df['koi_disposition'].value_counts())
print(f"\nPercentages:")
print(df['koi_disposition'].value_counts(normalize=True) * 100)

# Create binary classification target (CONFIRMED vs FALSE POSITIVE)
# Remove CANDIDATE for cleaner binary classification
df_binary = df[df['koi_disposition'].isin(['CONFIRMED', 'FALSE POSITIVE'])].copy()

# Create target variable
df_binary['label'] = (df_binary['koi_disposition'] == 'CONFIRMED').astype(int)

print(f"\n{'='*80}")
print(f"BINARY CLASSIFICATION DATASET")
print(f"{'='*80}")
print(f"Dataset size: {df_binary.shape[0]:,} rows (removed {df.shape[0] - df_binary.shape[0]:,} CANDIDATE rows)")
print(f"\nTarget distribution:")
print(df_binary['label'].value_counts())
print(f"\nClass balance:")
class_pct = df_binary['label'].value_counts(normalize=True) * 100
print(f"  - Class 0 (FALSE POSITIVE): {class_pct[0]:.2f}%")
print(f"  - Class 1 (CONFIRMED): {class_pct[1]:.2f}%")

# Check for class imbalance
imbalance_ratio = class_pct.max() / class_pct.min()
print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1")
if imbalance_ratio > 2:
    print("⚠️  Dataset is imbalanced - PR-AUC will be an important metric")
else:
    print("✓ Dataset is relatively balanced")

# Store task type
TASK_TYPE = "binary"
print(f"\n🎯 Task type: {TASK_TYPE} classification")

TARGET VARIABLE ANALYSIS

Original koi_disposition distribution:
koi_disposition
FALSE POSITIVE    4839
CONFIRMED         2746
CANDIDATE         1979
Name: count, dtype: int64

Percentages:
koi_disposition
FALSE POSITIVE    50.595985
CONFIRMED         28.711836
CANDIDATE         20.692179
Name: proportion, dtype: float64

BINARY CLASSIFICATION DATASET
Dataset size: 7,585 rows (removed 1,979 CANDIDATE rows)

Target distribution:
label
0    4839
1    2746
Name: count, dtype: int64

Class balance:
  - Class 0 (FALSE POSITIVE): 63.80%
  - Class 1 (CONFIRMED): 36.20%

Imbalance ratio: 1.76:1
✓ Dataset is relatively balanced

🎯 Task type: binary classification


## 4. Feature Selection & Preprocessing

In [5]:
# Select features for modeling
print("="*80)
print("FEATURE SELECTION")
print("="*80)

# Exclude non-feature columns
exclude_cols = [
    'rowid', 'kepid', 'kepoi_name', 'kepler_name', 
    'koi_disposition', 'koi_pdisposition', 'koi_comment',
    'koi_disp_prov', 'label'  # target
]

# Get all numeric columns
numeric_cols = df_binary.select_dtypes(include=[np.number]).columns.tolist()

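# Remove excluded columns (most entries in exclude_cols are non-numeric,
# so only the numeric ones actually reduce the feature count)
feature_cols = [col for col in numeric_cols if col not in exclude_cols]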

print(f"\nTotal numeric columns: {len(numeric_cols)}")
print(f"Excluded columns: {len(exclude_cols)}")
print(f"Selected features: {len(feature_cols)}")

# Check for missing values in features
missing_by_col = df_binary[feature_cols].isnull().sum()
cols_with_missing = missing_by_col[missing_by_col > 0]

print(f"\nFeatures with missing values: {len(cols_with_missing)}")
if len(cols_with_missing) > 0:
    print("\nTop 10 features with missing values:")
    for col, count in cols_with_missing.head(10).items():
        pct = (count / len(df_binary) * 100)
        print(f"  - {col}: {count} ({pct:.2f}%)")

# Create feature matrix and target
X = df_binary[feature_cols].copy()
y = df_binary['label'].copy()

# Handle missing values - simple imputation with median
print(f"\n{'='*80}")
print("HANDLING MISSING VALUES")
print(f"{'='*80}")

missing_before = X.isnull().sum().sum()
print(f"Missing values before imputation: {missing_before:,}")

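# Impute with median (assign back instead of calling fillna(inplace=True) on a
# column selection, which is deprecated chained assignment in recent pandas)
for col in feature_cols:
    if X[col].isnull().any():
        median_val = X[col].median()
        X[col] = X[col].fillna(median_val)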

missing_after = X.isnull().sum().sum()
print(f"Missing values after imputation: {missing_after:,}")
print(f"✓ All missing values handled")

# Final dataset info
print(f"\n{'='*80}")
print("FINAL DATASET")
print(f"{'='*80}")
print(f"Features (X): {X.shape[0]:,} rows × {X.shape[1]} features")
print(f"Target (y): {y.shape[0]:,} samples")
print(f"Feature list (first 10): {feature_cols[:10]}")
print(f"...")

FEATURE SELECTION

Total numeric columns: 106
Excluded columns: 9
Selected features: 103

Features with missing values: 92

Top 10 features with missing values:
  - koi_score: 910 (12.00%)
  - koi_period_err1: 338 (4.46%)
  - koi_period_err2: 338 (4.46%)
  - koi_time0bk_err1: 338 (4.46%)
  - koi_time0bk_err2: 338 (4.46%)
  - koi_time0_err1: 338 (4.46%)
  - koi_time0_err2: 338 (4.46%)
  - koi_eccen: 259 (3.41%)
  - koi_impact: 259 (3.41%)
  - koi_impact_err1: 338 (4.46%)

HANDLING MISSING VALUES
Missing values before imputation: 33,546
Missing values after imputation: 0
✓ All missing values handled

FINAL DATASET
Features (X): 7,585 rows × 103 features
Target (y): 7,585 samples
Feature list (first 10): ['koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_period_err1', 'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1']
...
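
The cell above computes each median over all rows before the train/test split, so test rows contribute to the imputation values. One alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this pipeline, is to learn the medians from the training rows only and apply them to both splits. The toy frame and its column names `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with gaps (hypothetical columns, not the Kepler features)
X_toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
        "b": [10.0, np.nan, 30.0, 40.0, np.nan, 60.0, 70.0, 80.0],
    }
)
y_toy = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Learn medians from the training rows only, then reuse them for the test rows
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_imp = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(dict(zip(X_tr.columns, imputer.statistics_)))
```

The fitted imputer would also need to be saved alongside the models so that new data is filled with the same values.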

## 5. Train/Test Split

In [6]:
# Split data into train and test sets
TEST_SIZE = 0.2

print("="*80)
print("TRAIN/TEST SPLIT")
print("="*80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=TEST_SIZE, 
    random_state=RANDOM_STATE,
    stratify=y  # Maintain class distribution
)

print(f"\nTest size: {TEST_SIZE * 100}%")
print(f"\nTraining set:")
print(f"  - X_train: {X_train.shape[0]:,} rows × {X_train.shape[1]} features")
print(f"  - y_train: {y_train.shape[0]:,} samples")
print(f"  - Class distribution: {y_train.value_counts().to_dict()}")

print(f"\nTest set:")
print(f"  - X_test: {X_test.shape[0]:,} rows × {X_test.shape[1]} features")
print(f"  - y_test: {y_test.shape[0]:,} samples")
print(f"  - Class distribution: {y_test.value_counts().to_dict()}")

# Verify stratification
train_pos_pct = (y_train.sum() / len(y_train)) * 100
test_pos_pct = (y_test.sum() / len(y_test)) * 100
print(f"\nClass balance verification:")
print(f"  - Train positive class: {train_pos_pct:.2f}%")
print(f"  - Test positive class: {test_pos_pct:.2f}%")
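# Only report success if the positive-class shares actually agree
if abs(train_pos_pct - test_pos_pct) < 1.0:
    print(f"✓ Stratification successful (similar distributions)")
else:
    print("⚠️  Train/test class distributions differ by more than 1 percentage point")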

TRAIN/TEST SPLIT

Test size: 20.0%

Training set:
  - X_train: 6,068 rows × 103 features
  - y_train: 6,068 samples
  - Class distribution: {0: 3871, 1: 2197}

Test set:
  - X_test: 1,517 rows × 103 features
  - y_test: 1,517 samples
  - Class distribution: {0: 968, 1: 549}

Class balance verification:
  - Train positive class: 36.21%
  - Test positive class: 36.19%
✓ Stratification successful (similar distributions)


## 6. Train Baseline Models

### 6.1 Random Forest Classifier

In [7]:
# Train Random Forest Classifier
print("="*80)
print("TRAINING RANDOM FOREST CLASSIFIER")
print("="*80)

# Initialize model with specified hyperparameters
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=RANDOM_STATE,
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

print("\nModel hyperparameters:")
print(f"  - n_estimators: {rf_model.n_estimators}")
print(f"  - max_depth: {rf_model.max_depth}")
print(f"  - random_state: {rf_model.random_state}")

print("\nTraining model...")
import time
start_time = time.time()

rf_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds")

# Make predictions
print("\nGenerating predictions...")
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("✓ Random Forest model trained and predictions generated")

TRAINING RANDOM FOREST CLASSIFIER

Model hyperparameters:
  - n_estimators: 300
  - max_depth: None
  - random_state: 42

Training model...


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    0.9s



✓ Training completed in 1.52 seconds

Generating predictions...
✓ Random Forest model trained and predictions generated


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.5s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    0.0s finished
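
Feature importance is listed among the evaluation outputs, so a short sketch of how it could be read from a fitted random forest may help. This example trains a small forest on synthetic data with hypothetical feature names `f0` to `f7`; with the notebook's objects, the same pattern would presumably use `rf_model.feature_importances_` and `feature_cols`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the Kepler features (hypothetical names f0..f7)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=42
)
demo_cols = [f"f{i}" for i in range(X_demo.shape[1])]

demo_rf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_rf.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(
    demo_rf.feature_importances_, index=demo_cols
).sort_values(ascending=False)
top_features = importances.head(3)

print(top_features)
```

Impurity-based importances can favor high-cardinality features, so they are best treated as a first look rather than a definitive ranking.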


### 6.2 LightGBM Classifier

In [8]:
# Train LightGBM Classifier
print("="*80)
print("TRAINING LIGHTGBM CLASSIFIER")
print("="*80)

# Initialize model with specified hyperparameters
lgbm_model = lgb.LGBMClassifier(
    num_leaves=63,
    n_estimators=500,
    learning_rate=0.05,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=-1  # Suppress per-iteration output
)

print("\nModel hyperparameters:")
print(f"  - num_leaves: {lgbm_model.num_leaves}")
print(f"  - n_estimators: {lgbm_model.n_estimators}")
print(f"  - learning_rate: {lgbm_model.learning_rate}")
print(f"  - random_state: {lgbm_model.random_state}")

print("\nTraining model...")
start_time = time.time()

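# NOTE: early stopping monitors the test set here, so best_iteration_ is chosen
# on the same rows used for the final metrics, which makes those metrics optimistic.
lgbm_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric='auc',
    callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
)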

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds")
print(f"  - Best iteration: {lgbm_model.best_iteration_}")
print(f"  - Best score: {lgbm_model.best_score_['valid_0']['auc']:.4f}")

# Make predictions
print("\nGenerating predictions...")
y_pred_lgbm = lgbm_model.predict(X_test)
y_pred_proba_lgbm = lgbm_model.predict_proba(X_test)[:, 1]

print("✓ LightGBM model trained and predictions generated")

TRAINING LIGHTGBM CLASSIFIER

Model hyperparameters:
  - num_leaves: 63
  - n_estimators: 500
  - learning_rate: 0.05
  - random_state: 42

Training model...

✓ Training completed in 3.44 seconds
  - Best iteration: 287
  - Best score: 0.9999

Generating predictions...
✓ LightGBM model trained and predictions generated
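
The training cell passes `(X_test, y_test)` as `eval_set`, so the stopping point is selected on the rows later used for evaluation. A common alternative, assumed here rather than taken from the notebook, is a three-way split in which a validation set drives early stopping and the test set stays untouched. The sketch below shows only the split, on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly the notebook's positive-class share
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.36).astype(int)

# Step 1: hold out the final test set (20% of all rows)
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: carve a validation set out of the remainder (25% of 80% = 20% overall)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=42, stratify=y_trval
)

# (X_val, y_val) would be the eval_set for early stopping;
# (X_te, y_te) is used once, for the final metrics.
split_sizes = (len(X_tr), len(X_val), len(X_te))
print(split_sizes)
```

With this layout the call would become `eval_set=[(X_val, y_val)]`, and the test metrics would no longer depend on the stopping decision.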


## 7. Model Evaluation

Calculate comprehensive metrics for both models.

In [9]:
# Evaluate both models
print("="*80)
print("MODEL EVALUATION")
print("="*80)

def evaluate_model(y_true, y_pred, y_pred_proba, model_name):
    """Calculate comprehensive evaluation metrics"""
    
    # Basic metrics
    accuracy = accuracy_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    pr_auc = average_precision_score(y_true, y_pred_proba)
    f1 = f1_score(y_true, y_pred)
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    # Precision and Recall
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    print(f"\n{'='*80}")
    print(f"{model_name.upper()} PERFORMANCE")
    print(f"{'='*80}")
    print(f"\n📊 Classification Metrics:")
    print(f"  - Accuracy:        {accuracy:.4f}")
    print(f"  - ROC-AUC:         {roc_auc:.4f}")
    print(f"  - PR-AUC:          {pr_auc:.4f}")
    print(f"  - F1 Score:        {f1:.4f}")
    print(f"  - Precision:       {precision:.4f}")
    print(f"  - Recall:          {recall:.4f}")
    
    print(f"\n📈 Confusion Matrix:")
    print(f"  - True Negatives:  {tn}")
    print(f"  - False Positives: {fp}")
    print(f"  - False Negatives: {fn}")
    print(f"  - True Positives:  {tp}")
    
    print(f"\n📋 Classification Report:")
    print(classification_report(y_true, y_pred, target_names=['FALSE POSITIVE', 'CONFIRMED']))
    
    return {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
        'f1_score': f1,
        'precision': precision,
        'recall': recall,
        'confusion_matrix': cm.tolist()
    }

# Evaluate Random Forest
metrics_rf = evaluate_model(y_test, y_pred_rf, y_pred_proba_rf, "Random Forest")

# Evaluate LightGBM
metrics_lgbm = evaluate_model(y_test, y_pred_lgbm, y_pred_proba_lgbm, "LightGBM")

# Compare models
print(f"\n{'='*80}")
print("MODEL COMPARISON")
print(f"{'='*80}")
print(f"\n{'Metric':<20} {'Random Forest':<15} {'LightGBM':<15} {'Winner':<10}")
print(f"{'-'*60}")
for metric in ['accuracy', 'roc_auc', 'pr_auc', 'f1_score']:
    rf_val = metrics_rf[metric]
    lgbm_val = metrics_lgbm[metric]
    winner = "🏆 RF" if rf_val > lgbm_val else "🏆 LGBM" if lgbm_val > rf_val else "Tie"
    print(f"{metric:<20} {rf_val:<15.4f} {lgbm_val:<15.4f} {winner:<10}")

print(f"\n🎯 Both models show strong performance for exoplanet classification!")

MODEL EVALUATION

RANDOM FOREST PERFORMANCE

📊 Classification Metrics:
  - Accuracy:        0.9927
  - ROC-AUC:         0.9997
  - PR-AUC:          0.9996
  - F1 Score:        0.9900
  - Precision:       0.9927
  - Recall:          0.9872

📈 Confusion Matrix:
  - True Negatives:  964
  - False Positives: 4
  - False Negatives: 7
  - True Positives:  542

📋 Classification Report:
                precision    recall  f1-score   support

FALSE POSITIVE       0.99      1.00      0.99       968
     CONFIRMED       0.99      0.99      0.99       549

      accuracy                           0.99      1517
     macro avg       0.99      0.99      0.99      1517
  weighted avg       0.99      0.99      0.99      1517


LIGHTGBM PERFORMANCE

📊 Classification Metrics:
  - Accuracy:        0.9954
  - ROC-AUC:         0.9999
  - PR-AUC:          0.9999
  - F1 Score:        0.9936
  - Precision:       0.9963
  - Recall:          0.9909

📈 Confusion Matrix:
  - True Negatives:  966
  - False Positi
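
Scores this close to 1.0 can occur when an input column is itself derived from the label. Columns such as `koi_score` and the `koi_fpflag_*` flags are produced by the KOI vetting process, so they would be natural ones to check. One quick screen, sketched here on synthetic data with the hypothetical columns `leaky_flag` and `noise`, ranks each feature by the ROC-AUC it achieves on its own. The 0.99 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y_demo = rng.integers(0, 2, size=n)

# Hypothetical features: one nearly copies the label, one is pure noise
demo = pd.DataFrame(
    {
        "leaky_flag": y_demo + rng.normal(scale=0.05, size=n),
        "noise": rng.normal(size=n),
    }
)

# AUC of each feature used alone as a score; values below 0.5 are folded
# so that the direction of the relationship does not matter
single_auc = {}
for col in demo.columns:
    a = roc_auc_score(y_demo, demo[col])
    single_auc[col] = max(a, 1 - a)

# Flag features that almost separate the classes by themselves
suspects = [c for c, a in single_auc.items() if a > 0.99]
print(single_auc)
print("Suspect features:", suspects)
```

A flagged feature is not automatically invalid, but it deserves a look at how and when it is produced relative to the label.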

## 8. Export Trained Models

Save models and metadata for future use.

In [10]:
# Create models directory
models_dir = Path('../models')
models_dir.mkdir(exist_ok=True)

print("="*80)
print("EXPORTING MODELS AND METADATA")
print("="*80)

# Save Random Forest model
rf_model_path = models_dir / 'model_rf.pkl'
joblib.dump(rf_model, rf_model_path)
print(f"\n✓ Random Forest model saved: {rf_model_path}")
print(f"  - File size: {rf_model_path.stat().st_size / 1024**2:.2f} MB")

# Save LightGBM model
lgbm_model_path = models_dir / 'model_lgbm.pkl'
joblib.dump(lgbm_model, lgbm_model_path)
print(f"\n✓ LightGBM model saved: {lgbm_model_path}")
print(f"  - File size: {lgbm_model_path.stat().st_size / 1024**2:.2f} MB")

# Create metadata
metadata = {
    "created_utc": datetime.now(timezone.utc).isoformat(),
    "dataset": "Kepler KOI cleaned (binary classification)",
    "task": "binary",
    "n_samples": {
        "total": len(df_binary),
        "train": len(X_train),
        "test": len(X_test)
    },
    "n_features": len(feature_cols),
    "features": feature_cols,
    "target": "label",
    "target_mapping": {
        "0": "FALSE POSITIVE",
        "1": "CONFIRMED"
    },
    "class_distribution": {
        "train": y_train.value_counts().to_dict(),
        "test": y_test.value_counts().to_dict()
    },
    "models": {
        "random_forest": {
            "version": "1.0.0",
            "hyperparameters": {
                "n_estimators": 300,
                "max_depth": None,
                "random_state": RANDOM_STATE
            },
            "metrics": {
                "accuracy": float(metrics_rf['accuracy']),
                "roc_auc": float(metrics_rf['roc_auc']),
                "pr_auc": float(metrics_rf['pr_auc']),
                "f1_score": float(metrics_rf['f1_score']),
                "precision": float(metrics_rf['precision']),
                "recall": float(metrics_rf['recall'])
            },
            "confusion_matrix": metrics_rf['confusion_matrix']
        },
        "lightgbm": {
            "version": "1.0.0",
            "hyperparameters": {
                "num_leaves": 63,
                "n_estimators": 500,
                "learning_rate": 0.05,
                "random_state": RANDOM_STATE
            },
            "metrics": {
                "accuracy": float(metrics_lgbm['accuracy']),
                "roc_auc": float(metrics_lgbm['roc_auc']),
                "pr_auc": float(metrics_lgbm['pr_auc']),
                "f1_score": float(metrics_lgbm['f1_score']),
                "precision": float(metrics_lgbm['precision']),
                "recall": float(metrics_lgbm['recall'])
            },
            "confusion_matrix": metrics_lgbm['confusion_matrix'],
            "best_iteration": int(lgbm_model.best_iteration_)
        }
    },
    "preprocessing": {
        "missing_value_strategy": "median imputation",
        "feature_selection": "numeric features only, excluded identifiers and target",
        "train_test_split": {
            "test_size": TEST_SIZE,
            "random_state": RANDOM_STATE,
            "stratify": True
        }
    }
}

# Save metadata
metadata_path = models_dir / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\n✓ Metadata saved: {metadata_path}")

# Save feature list separately for easy access
features_path = models_dir / 'features.json'
with open(features_path, 'w') as f:
    json.dump({"features": feature_cols, "n_features": len(feature_cols)}, f, indent=2)

print(f"✓ Feature list saved: {features_path}")

print(f"\n{'='*80}")
print("EXPORT COMPLETE")
print(f"{'='*80}")
print(f"\n📦 Exported files:")
print(f"  - model_rf.pkl")
print(f"  - model_lgbm.pkl")
print(f"  - metadata.json")
print(f"  - features.json")
print(f"\n🎉 All models and metadata successfully exported!")

EXPORTING MODELS AND METADATA

✓ Random Forest model saved: ../models/model_rf.pkl
  - File size: 5.46 MB

✓ LightGBM model saved: ../models/model_lgbm.pkl
  - File size: 1.80 MB

✓ Metadata saved: ../models/metadata.json
✓ Feature list saved: ../models/features.json

EXPORT COMPLETE

📦 Exported files:
  - model_rf.pkl
  - model_lgbm.pkl
  - metadata.json
  - features.json

🎉 All models and metadata successfully exported!
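
For later use of the exported files, a downstream script would reload the model and select columns in the saved feature order. The sketch below demonstrates that round trip with a tiny stand-in model in a temporary directory; the features `f1`, `f2` and the column `extra` are hypothetical, and only the file names mirror the export above.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train and export a tiny stand-in model (hypothetical features)
train = pd.DataFrame({"f1": [0.0, 1.0, 0.1, 0.9], "f2": [1.0, 0.0, 0.8, 0.2]})
labels = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(train, labels)

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    joblib.dump(model, tmp_dir / "model_rf.pkl")
    (tmp_dir / "features.json").write_text(
        json.dumps({"features": list(train.columns)})
    )

    # Reload both artifacts, as a downstream script might
    loaded = joblib.load(tmp_dir / "model_rf.pkl")
    features = json.loads((tmp_dir / "features.json").read_text())["features"]

# New rows may arrive with extra or reordered columns;
# indexing by the saved list restores the training order
new_rows = pd.DataFrame({"f2": [0.9, 0.1], "extra": [5, 6], "f1": [0.05, 0.95]})
proba = loaded.predict_proba(new_rows[features])[:, 1]
print(proba)
```

New data would also need the same missing-value handling that was applied before training.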


## Summary

### ✅ Completed Workflow:
1. **Data Loading** - Loaded the cleaned Kepler dataset (9,564 rows × 122 columns)
2. **Target Definition** - Created a binary target (CONFIRMED vs FALSE POSITIVE), leaving 7,585 rows after excluding CANDIDATE
3. **Feature Selection** - Selected 103 numeric features and imputed missing values with the column median
4. **Train/Test Split** - 80/20 split with stratification (6,068 training rows, 1,517 test rows)
5. **Model Training** - Trained Random Forest and LightGBM classifiers
6. **Evaluation** - Calculated Accuracy, ROC-AUC, PR-AUC, F1, Precision, and Recall on the test set
7. **Export** - Saved both models, `metadata.json`, and `features.json` to `../models`

### 🎯 Next Steps:
- Run `04_eval_plots.ipynb` to generate visualizations
- Review `docs/model_card.md` for model documentation
- Use models for predictions on new data

---
*Model training pipeline completed successfully!*