# Advanced Model Exploration

This notebook explores additional advanced and experimental machine learning algorithms to further evaluate performance on the Kepler exoplanet dataset.

The goal is to expand the model search space beyond standard baselines and ensembles, while maintaining consistent evaluation and reproducibility.

## Section 1 - Import Required Libraries

In [9]:
import pandas as pd
import numpy as np
import os
import json
import joblib
import time

from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

## Load Feature-Engineered Dataset

In [10]:
DATA_PATH = "../data/processed/feature_engineered_data.csv"
df = pd.read_csv(DATA_PATH)

target_column = "koi_disposition"
X = df.drop(columns=[target_column])
y = df[target_column]

## Section 2 - Train–Test Split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
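
Because the split passes `stratify=y`, each class should appear in the training and test sets in roughly the same proportion as in the full dataset. The sketch below shows one way to check that. It uses a small toy label series (`y_demo`, with hypothetical class names and counts) rather than the real `koi_disposition` column, so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels standing in for koi_disposition (imbalanced on purpose).
y_demo = pd.Series(["FALSE POSITIVE"] * 50 + ["CONFIRMED"] * 30 + ["CANDIDATE"] * 20)
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class proportions across the full data and both splits.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above.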

## Section 3 - Feature Scaling

In [12]:
scaler = joblib.load("../models/standard_scaler.pkl")

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Results Directory Setup

In [13]:
RESULTS_DIR = "../results"
MODELS_DIR = "../models"

os.makedirs(RESULTS_DIR, exist_ok=True)
os.makedirs(MODELS_DIR, exist_ok=True)

## Section 4 - Training, Evaluation, and Saving Utility

In [14]:
def train_evaluate_save(
    model,
    model_name,
    X_tr,
    X_te,
    y_tr,
    y_te
):
    print("\n" + "=" * 50)
    print(f"Training {model_name}")
    print("=" * 50)

    start_time = time.time()

    # Train model
    model.fit(X_tr, y_tr)

    elapsed = time.time() - start_time
    print(f"Training completed in {elapsed:.2f} seconds")

    # Predictions
    y_pred = model.predict(X_te)

    # Create model-specific results directory
    model_dir = os.path.join(
        RESULTS_DIR,
        model_name.replace(" ", "_").lower()
    )
    os.makedirs(model_dir, exist_ok=True)

    # Classification report
    report = classification_report(y_te, y_pred, output_dict=True)
    with open(os.path.join(model_dir, "classification_report.json"), "w") as f:
        json.dump(report, f, indent=4)

    # Confusion matrix
    cm = confusion_matrix(y_te, y_pred)
    np.savetxt(
        os.path.join(model_dir, "confusion_matrix.csv"),
        cm,
        delimiter=",",
        fmt="%d"
    )

    # ROC-AUC: only for models exposing predict_proba; one-vs-rest across classes
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_te)
        roc_auc = roc_auc_score(y_te, y_prob, multi_class="ovr")
        with open(os.path.join(model_dir, "roc_auc.txt"), "w") as f:
            f.write(str(roc_auc))
        print("ROC-AUC:", roc_auc)
    else:
        print("ROC-AUC: N/A")

    # Save model
    joblib.dump(
        model,
        os.path.join(
            MODELS_DIR,
            f"{model_name.replace(' ', '_').lower()}.pkl"
        )
    )

    print(f"{model_name} saved successfully")
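
To make the saved `classification_report.json` easier to interpret, the sketch below shows the structure that `classification_report(..., output_dict=True)` returns. The labels here are a hypothetical toy example, not the Kepler data.

```python
import json
from sklearn.metrics import classification_report

# Hypothetical toy labels, used only to show the shape of the report dictionary.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

report = classification_report(y_true, y_pred, output_dict=True)

# One entry per class, plus "accuracy", "macro avg", and "weighted avg".
print(sorted(report.keys()))
print(json.dumps(report["macro avg"], indent=4))
```

Each per-class entry holds `precision`, `recall`, `f1-score`, and `support`. If the target is integer-encoded, the class keys appear as strings such as `"0"` and `"1"`.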

## Section 5 - AdaBoost Classifier

In [15]:
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42
)

train_evaluate_save(
    ada_model,
    "AdaBoost",
    X_train,
    X_test,
    y_train,
    y_test
)


Training AdaBoost
Training completed in 6.94 seconds
ROC-AUC: 0.9557548845233241
AdaBoost saved successfully


## Section 6 - Extra Trees Classifier

In [16]:
extra_trees_model = ExtraTreesClassifier(
    n_estimators=400,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1,
    verbose=1  # log joblib progress during fit and predict
)

train_evaluate_save(
    extra_trees_model,
    "Extra Trees",
    X_train,
    X_test,
    y_train,
    y_test
)


Training Extra Trees


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:    0.4s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 400 out of 400 | elapsed:    0.0s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 400 out of 400 | elapsed:    0.0s finished


Training completed in 0.58 seconds
ROC-AUC: 0.9792610301500869
Extra Trees saved successfully


## Section 7 - Ridge Classifier

In [17]:
ridge_model = RidgeClassifier(class_weight="balanced")

train_evaluate_save(
    ridge_model,
    "Ridge Classifier",
    X_train_scaled,
    X_test_scaled,
    y_train,
    y_test
)


Training Ridge Classifier
Training completed in 0.06 seconds
ROC-AUC: N/A
Ridge Classifier saved successfully
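
`RidgeClassifier` has no `predict_proba`, which is why the utility reports `ROC-AUC: N/A`. One possible workaround, sketched below on synthetic data, is to compute a macro one-vs-rest ROC-AUC per class from `decision_function` scores. The raw scores cannot be passed to `roc_auc_score(..., multi_class="ovr")` directly, because that mode expects rows that sum to 1. This is an illustration under those assumptions, not part of the notebook's pipeline, and the result is not directly comparable to probability-based scores unless the same averaging is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the Kepler features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RidgeClassifier(class_weight="balanced").fit(X_tr, y_tr)

# decision_function returns one column of scores per class for multiclass problems.
scores = clf.decision_function(X_te)

# Score each class against the rest, then take the unweighted (macro) mean.
per_class_auc = [
    roc_auc_score(y_te == label, scores[:, idx])
    for idx, label in enumerate(clf.classes_)
]
macro_ovr_auc = float(np.mean(per_class_auc))
print("Macro one-vs-rest ROC-AUC:", round(macro_ovr_auc, 4))
```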


## Section 8 - Gaussian Process Classifier

In [18]:
gpc_model = GaussianProcessClassifier(
    kernel=RBF(length_scale=1.0),
    random_state=42
)

train_evaluate_save(
    gpc_model,
    "Gaussian Process",
    X_train_scaled,
    X_test_scaled,
    y_train,
    y_test
)


Training Gaussian Process
Training completed in 1993.08 seconds
ROC-AUC: 0.9716933289647193
Gaussian Process saved successfully
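
The long training time is expected: exact Gaussian process inference scales roughly cubically with the number of training samples. A rough way to gauge the cost before committing to a full fit is to time the model on increasing subsamples, as in the sketch below. The synthetic data and the subsample sizes are hypothetical, chosen only so the example runs quickly.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic three-class data standing in for the scaled Kepler features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=42
)

rng = np.random.default_rng(42)
timings = {}
for n_samples in (50, 100, 200):
    # Draw a random subsample without replacement.
    idx = rng.choice(len(X_demo), size=n_samples, replace=False)
    model = GaussianProcessClassifier(kernel=RBF(length_scale=1.0), random_state=42)
    start = time.perf_counter()
    model.fit(X_demo[idx], y_demo[idx])
    timings[n_samples] = time.perf_counter() - start
    print(f"n={n_samples}: {timings[n_samples]:.2f} s")
```

If the timings grow steeply, training on a stratified subsample is one option to consider, at the cost of some accuracy.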


## Summary

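- Four additional models were trained, with training time reported for each: AdaBoost, Extra Trees, Ridge Classifier, and a Gaussian Process classifier
- Both linear (Ridge) and non-linear (tree ensembles, Gaussian Process) algorithms were explored
- Extra Trees reached the highest one-vs-rest ROC-AUC (0.979) after 0.58 s of training; the Gaussian Process scored 0.972 but took about 33 minutes (1993 s); AdaBoost scored 0.956
- Ridge Classifier exposes no `predict_proba`, so no ROC-AUC was recorded for it
- All fitted models and evaluation outputs were saved under `../models` and `../results`
- This notebook further expands the model search space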

The project is now ready for final aggregation and model selection.
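
As a starting point for that aggregation step, the sketch below collects the `roc_auc.txt` files written by `train_evaluate_save` into a single table. `collect_roc_auc` is a hypothetical helper, not part of this notebook. The demo runs on a temporary directory that mimics the results layout and is filled with the ROC-AUC values printed above.

```python
import os
import tempfile
import pandas as pd

def collect_roc_auc(results_dir):
    """Read roc_auc.txt from each model subdirectory; NaN where the file is absent."""
    rows = []
    for name in sorted(os.listdir(results_dir)):
        model_dir = os.path.join(results_dir, name)
        if not os.path.isdir(model_dir):
            continue
        roc_auc = float("nan")
        path = os.path.join(model_dir, "roc_auc.txt")
        if os.path.exists(path):
            with open(path) as f:
                roc_auc = float(f.read().strip())
        rows.append({"model": name, "roc_auc": roc_auc})
    table = pd.DataFrame(rows)
    return table.sort_values("roc_auc", ascending=False, na_position="last")

# Demo: a temporary directory mimicking the layout written by train_evaluate_save.
printed = {
    "adaboost": 0.9557548845233241,
    "extra_trees": 0.9792610301500869,
    "gaussian_process": 0.9716933289647193,
    "ridge_classifier": None,  # no predict_proba, so no roc_auc.txt was written
}
with tempfile.TemporaryDirectory() as tmp:
    for name, value in printed.items():
        os.makedirs(os.path.join(tmp, name))
        if value is not None:
            with open(os.path.join(tmp, name, "roc_auc.txt"), "w") as f:
                f.write(str(value))
    summary = collect_roc_auc(tmp)

print(summary.to_string(index=False))
```

In the project itself the helper would be pointed at `RESULTS_DIR`, which may also contain results from earlier notebooks.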