# Model Usage & Accuracy Showcase

This notebook demonstrates how the trained machine learning models are used in practice and compares their predictive accuracy on the same test dataset.

The objective is to clearly identify the most accurate model for the Kepler exoplanet classification task.

## Import Required Libraries

In [11]:
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Load Feature-Engineered Data (for models) and Raw Data (for identifiers only)

In [12]:
# Feature-engineered dataset (used for prediction)
FEATURES_PATH = "../data/processed/feature_engineered_data.csv"
df_features = pd.read_csv(FEATURES_PATH)

# Raw dataset (used ONLY for identifiers)
RAW_PATH = "../data/raw/cumulative.csv"
df_raw = pd.read_csv(RAW_PATH)

## Extract Planet Identifiers

Identifiers are NOT used as features. They are attached back after prediction for interpretability.

In [13]:
identifiers = df_raw[["kepoi_name", "kepler_name"]]

# Align identifiers with the feature-engineered data by index.
# This assumes both CSV files list the same rows in the same order.
identifiers = identifiers.loc[df_features.index].reset_index(drop=True)
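
Because `pd.read_csv` gives each file a fresh 0..n-1 index, aligning by index here amounts to aligning by row position. That is only safe if feature engineering neither dropped nor reordered rows. The sketch below shows one cheap guard, assuming a hypothetical helper named `attach_identifiers` (not part of the original notebook). It only checks row counts, so it catches dropped rows but not reordering.

```python
import pandas as pd


def attach_identifiers(df_raw, df_features, id_columns=("kepoi_name", "kepler_name")):
    """Return identifier columns positionally aligned with df_features.

    Hypothetical helper: refuses to align when the row counts differ,
    because positional alignment would then pair rows with the wrong names.
    """
    if len(df_raw) != len(df_features):
        raise ValueError(
            f"Row count mismatch: raw={len(df_raw)}, features={len(df_features)}"
        )
    return df_raw[list(id_columns)].reset_index(drop=True)


# Made-up rows for illustration only (not real Kepler records)
raw_demo = pd.DataFrame({
    "kepoi_name": ["KOI-A", "KOI-B", "KOI-C"],
    "kepler_name": ["Planet-A", None, None],
})
features_demo = pd.DataFrame({"feature_1": [0.1, 0.2, 0.3]})

ids_demo = attach_identifiers(raw_demo, features_demo)
print(ids_demo)
```

If the processed file might contain fewer rows than the raw file, a sturdier option would be to carry a key column through feature engineering and join on it; whether that suits this pipeline is an open question.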

## Build Feature Matrix and Target Vector

In [14]:
target_column = "koi_disposition"

X = df_features.drop(columns=[target_column])
y = df_features[target_column]

## Train–Test Split (Preserving Identifiers)

In [15]:
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X,
    y,
    identifiers,
    test_size=0.2,
    random_state=42,
    stratify=y
)
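
`train_test_split` accepts any number of equal-length inputs and applies the same shuffled row selection to each, which is why the identifiers stay matched to their feature rows. The sketch below illustrates this on a small synthetic dataset (the `*_demo` names and values are invented for the example, not taken from the Kepler data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X, y, and identifiers (20 rows, two balanced classes)
X_demo = pd.DataFrame({"f": np.arange(20, dtype=float)})
y_demo = pd.Series([0, 1] * 10)
ids_demo = pd.DataFrame({"name": [f"obj-{i}" for i in range(20)]})

X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X_demo,
    y_demo,
    ids_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# Every output keeps the original row labels, so matching indexes
# confirm that features, targets, and identifiers still line up.
aligned = X_te.index.equals(id_te.index) and X_te.index.equals(y_te.index)
print(aligned)
```

For the test set to be genuinely held out, the split parameters here presumably need to match those used when the models were trained.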

## Load Trained Scaler

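The scaler was fitted during training, so it is only applied here (`transform`, not `fit`). Use `X_test_scaled` for scale-sensitive models only.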

In [16]:
scaler = joblib.load("../models/standard_scaler.pkl")
X_test_scaled = scaler.transform(X_test)

## Load Trained Models

In [17]:
models = {
    "Logistic Regression": joblib.load("../models/baseline_logistic_regression.pkl"),
    "Random Forest": joblib.load("../models/random_forest_model.pkl"),
    "Gradient Boosting": joblib.load("../models/gradient_boosting_model.pkl"),
    "Extra Trees": joblib.load("../models/extra_trees.pkl"),
    "AdaBoost": joblib.load("../models/adaboost.pkl"),
    "KNN": joblib.load("../models/knn_model.pkl"),
    "Naive Bayes": joblib.load("../models/naive_bayes_model.pkl"),
    "SVM RBF": joblib.load("../models/svm_rbf_model.pkl"),
    "MLP Neural Network": joblib.load("../models/mlp_neural_network_model.pkl"),
    "Ridge Classifier": joblib.load("../models/ridge_classifier.pkl")
}
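
With all models loaded, the accuracy comparison the notebook aims for could be sketched as a single loop. The helper name `compare_accuracy` and the `scale_sensitive` argument are assumptions introduced for this example; which models actually expect scaled input depends on how each one was trained. The demo uses two trivial `DummyClassifier` models on synthetic data rather than the saved Kepler models.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def compare_accuracy(models, X_raw, X_scaled, y_true, scale_sensitive):
    """Return a DataFrame of test accuracies, best model first."""
    rows = []
    for name, model in models.items():
        # Models named in scale_sensitive receive the scaled matrix;
        # all others receive the unscaled features.
        X_in = X_scaled if name in scale_sensitive else X_raw
        accuracy = accuracy_score(y_true, model.predict(X_in))
        rows.append({"model": name, "accuracy": accuracy})
    return (
        pd.DataFrame(rows)
        .sort_values("accuracy", ascending=False)
        .reset_index(drop=True)
    )


# Tiny synthetic demo (not the Kepler data)
X_demo = np.zeros((6, 2))
y_demo = np.array([0, 0, 0, 0, 1, 2])
demo_models = {
    "always_0": DummyClassifier(strategy="constant", constant=0).fit(X_demo, y_demo),
    "always_2": DummyClassifier(strategy="constant", constant=2).fit(X_demo, y_demo),
}

results = compare_accuracy(
    demo_models, X_demo, X_demo, y_demo, scale_sensitive={"always_2"}
)
print(results)
```

In this notebook the call might look like `compare_accuracy(models, X_test, X_test_scaled, y_test, scale_sensitive={...})`, with the set filled in to match the training setup of each model.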

## Demonstration: Predictions With Identifiers

This section shows which Kepler objects are predicted as CONFIRMED, CANDIDATE, or FALSE POSITIVE. The output table reports these dispositions as integer class codes (0, 1, 2).

In [25]:
# Choose the best-performing model (example: Extra Trees)
best_model = models["Extra Trees"]

# Predict first 10 samples
predictions = best_model.predict(X_test.iloc[:10])

# Build readable output table
showcase_df = id_test.iloc[:10].copy()
showcase_df["Predicted_Disposition"] = predictions
showcase_df["Actual_Disposition"] = y_test.iloc[:10].values

showcase_df

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 400 out of 400 | elapsed:    0.0s finished


| index | kepoi_name | kepler_name | Predicted_Disposition | Actual_Disposition |
|---|---|---|---|---|
| 1297 | K01482.01 | | 2 | 2 |
| 6885 | K05884.01 | | 0 | 2 |
| 6496 | K05819.01 | | 0 | 0 |
| 2507 | K02821.01 | | 0 | 0 |
| 3701 | K04109.01 | | 0 | 0 |
| 9032 | K08006.01 | | 2 | 2 |
| 1241 | K01539.01 | | 2 | 0 |
| 2089 | K02586.01 | Kepler-1282 b | 1 | 1 |
| 8219 | K06749.01 | | 2 | 2 |
| 6803 | K05933.01 | | 2 | 2 |
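
The table reports dispositions as integer codes. To make it easier to read, the codes could be mapped back to class names, as in the sketch below. The mapping is an assumption: it follows the alphabetical order a `LabelEncoder` would produce, which is consistent with the one named planet in the table having code 1, but the encoding step is not shown in this notebook and should be checked against it. `decode_dispositions` and `CODE_TO_LABEL` are hypothetical names.

```python
import pandas as pd

# ASSUMPTION: alphabetical encoding of the three classes; verify against
# the encoder actually used during feature engineering.
CODE_TO_LABEL = {0: "CANDIDATE", 1: "CONFIRMED", 2: "FALSE POSITIVE"}


def decode_dispositions(
    df,
    columns=("Predicted_Disposition", "Actual_Disposition"),
    mapping=CODE_TO_LABEL,
):
    """Return a copy of df with integer codes replaced by class names."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(mapping)
    # Flag rows where the prediction matches the actual disposition
    out["Correct"] = out[columns[0]] == out[columns[1]]
    return out


# Three rows copied from the output table above
demo = pd.DataFrame({
    "kepoi_name": ["K01482.01", "K05884.01", "K02586.01"],
    "Predicted_Disposition": [2, 0, 1],
    "Actual_Disposition": [2, 2, 1],
})

decoded = decode_dispositions(demo)
print(decoded)
```

Applied to the notebook's own table, this would be `decode_dispositions(showcase_df)`.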
