# Advanced Models — Random Forest & Gradient Boosting

This notebook evaluates non-linear, tree-based models to improve upon
the Logistic Regression baseline. Specifically, we:
- Train a Random Forest classifier
- Train a Gradient Boosting classifier
- Compare performance using consistent metrics
- Identify performance gains over the baseline model

## Section 1 - Import Required Libraries

The following libraries are used for advanced modeling and evaluation.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

In [2]:
# Load the feature-engineered dataset
DATA_PATH = "../data/processed/feature_engineered_data.csv"

df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
df.head()

Dataset shape: (9564, 32)


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_time0bk,koi_impact,koi_tce_plnt_num,koi_steff,koi_slogg,...,koi_period_log,koi_duration_log,koi_depth_log,koi_prad_log,koi_teq_log,koi_insol_log,koi_model_snr_log,depth_to_snr_ratio,planet_to_star_radius_ratio,koi_disposition
0,1.0,0.0,0.0,0.0,0.0,170.53875,0.146,1.0,5455.0,4.467,...,2.350235,1.375613,6.424545,1.181727,6.677083,4.549552,3.605498,17.201117,2.437969,1
1,0.969,0.0,0.0,0.0,0.0,162.51384,0.586,2.0,5455.0,4.467,...,4.014911,1.70602,6.775138,1.342865,6.095825,2.313525,3.288402,33.906975,3.052855,1
2,0.0,0.0,1.0,0.0,0.0,175.850252,0.969,1.0,5853.0,4.544,...,3.039708,1.023242,9.290075,2.747271,6.459904,3.696351,4.347694,141.926604,16.820257,2
3,0.0,0.0,1.0,0.0,0.0,170.307565,1.276,1.0,5805.0,4.564,...,1.006845,1.225659,8.997172,3.539799,7.241366,6.794542,6.227722,15.97943,42.300831,2
4,1.0,0.0,0.0,0.0,0.0,171.59555,0.701,1.0,6031.0,4.438,...,1.260048,0.976256,6.404071,1.321756,7.249215,6.832126,3.735286,14.750611,2.629061,1


## Section 2 - Separate Features and Target

Features and target labels are separated prior to training.

In [3]:
target_column = "koi_disposition"

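# NOTE: the CSV includes an "Unnamed: 0" column, which appears to be a saved
# row index rather than a physical feature; it is kept in X here.
X = df.drop(columns=[target_column])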
y = df[target_column]

## Train–Test Split

A stratified split is used to preserve class proportions.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
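
As a quick check of what `stratify=y` does, the sketch below splits a synthetic, imbalanced label vector and compares class proportions before and after the split. The class counts and the names `X_demo`, `y_demo`, and `class_proportions` are made up for illustration; this is not the Kepler data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class labels (illustrative counts, not the Kepler data)
y_demo = np.array([0] * 200 + [1] * 300 + [2] * 500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_proportions(labels):
    """Return the fraction of samples in each class, ordered by class label."""
    return np.bincount(labels) / len(labels)

print("full :", class_proportions(y_demo))
print("train:", class_proportions(y_tr))
print("test :", class_proportions(y_te))
```

With stratification, the three printed rows should agree closely; without it, the test proportions can drift, which matters most for the smaller classes.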

## Section 3 - Random Forest Model

Random Forest is an ensemble of decision trees that captures non-linear relationships and feature interactions.

In [5]:
rf_model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

RandomForestClassifier parameters (estimator display):
parameter,value
n_estimators,200
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True
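
To make `class_weight="balanced"` concrete, here is a minimal sketch of the weights scikit-learn derives from class frequencies. As an assumption for illustration, it uses the test-set class supports reported in the evaluation output of this notebook (449, 459, 1005); the training-set counts are larger but, given the stratified split, should have nearly the same proportions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative class counts taken from the test-set supports in this notebook.
counts = {0: 449, 1: 459, 2: 1005}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same formula written out: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.3f}")
```

The two smaller classes receive weights of roughly 1.4 each and the largest class roughly 0.63, so errors on the smaller classes count more during tree construction.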


## Random Forest Evaluation

Model performance is evaluated using standard classification metrics.

In [None]:
rf_pred = rf_model.predict(X_test)
rf_prob = rf_model.predict_proba(X_test)

# Generate the report 
print("Random Forest Classification Report")
print(classification_report(y_test, rf_pred))

# Generate the confusion matrix
rf_conf_matrix = confusion_matrix(y_test, rf_pred)
rf_conf_matrix

Random Forest Classification Report
              precision    recall  f1-score   support

           0       0.83      0.81      0.82       449
           1       0.83      0.85      0.84       459
           2       0.98      0.99      0.98      1005

    accuracy                           0.91      1913
   macro avg       0.88      0.88      0.88      1913
weighted avg       0.91      0.91      0.91      1913



array([[362,  76,  11],
       [ 61, 389,   9],
       [ 11,   1, 993]])
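
The per-class figures in the report can be recovered directly from the confusion matrix. The sketch below copies the Random Forest matrix from the output above and recomputes precision, recall, and accuracy with NumPy; rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
import numpy as np

# Random Forest confusion matrix from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [362,  76,  11],
    [ 61, 389,   9],
    [ 11,   1, 993],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of that true class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as that class
accuracy = np.trace(cm) / cm.sum()

for k in range(cm.shape[0]):
    print(f"class {k}: precision={precision[k]:.2f}  recall={recall[k]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Most of the errors sit in the off-diagonal cells between classes 0 and 1 (76 and 61 samples), while class 2 is rarely confused with either.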

In [None]:
# Compute the one-vs-rest (macro-averaged) ROC-AUC for the Random Forest model
rf_roc_auc = roc_auc_score(y_test, rf_prob, multi_class="ovr")
print("Random Forest ROC-AUC:", rf_roc_auc)

Random Forest ROC-AUC: 0.9780873148851614
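
For a multiclass target, `multi_class="ovr"` scores each class against the rest and, with the default macro averaging, takes the unweighted mean. The sketch below checks this on a synthetic 3-class problem; the data, the `LogisticRegression` stand-in model, and all `*_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (stand-in for the Kepler features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)  # shape (n_samples, n_classes)

# Built-in multiclass score (macro average is the default)
auc_ovr = roc_auc_score(y_te, prob, multi_class="ovr")

# Manual version: one binary "class k vs. rest" AUC per class, then the mean
per_class = [
    roc_auc_score((y_te == k).astype(int), prob[:, i])
    for i, k in enumerate(clf.classes_)
]
auc_manual = float(np.mean(per_class))

print(f"built-in OvR AUC: {auc_ovr:.4f}")
print(f"manual mean     : {auc_manual:.4f}")
```

Because the average is unweighted, each class contributes equally to the score regardless of its support.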


## Section 4 - Gradient Boosting Model

Gradient Boosting builds shallow trees sequentially: each new tree is fit to the errors (loss gradients) of the ensemble built so far, and its contribution is scaled by the learning rate.

In [8]:
gb_model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    random_state=42
)

gb_model.fit(X_train, y_train)

GradientBoostingClassifier parameters (estimator display):
parameter,value
loss,'log_loss'
learning_rate,0.1
n_estimators,200
subsample,1.0
criterion,'friedman_mse'
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_depth,3
min_impurity_decrease,0.0
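
The sequential nature of boosting can be observed with `staged_predict`, which yields predictions after each boosting stage. The sketch below is a minimal illustration on synthetic data with a smaller ensemble (50 stages instead of 200) to keep it fast; `gb_demo` and the dataset are assumptions, not the notebook's model or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, random_state=0
).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n stages
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

for n_stages in (1, 10, 25, 50):
    print(f"{n_stages:>2} boosting stages: test accuracy = {staged_acc[n_stages - 1]:.3f}")
```

Plotting `staged_acc` against the stage index is one way to judge whether `n_estimators=200` is more than the data needs.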


## Gradient Boosting Evaluation

Performance is evaluated using the same metrics for fair comparison.

In [None]:
gb_pred = gb_model.predict(X_test)
gb_prob = gb_model.predict_proba(X_test)

# Print the classification report
print("Gradient Boosting Classification Report")
print(classification_report(y_test, gb_pred))

# Print the confusion matrix
gb_conf_matrix = confusion_matrix(y_test, gb_pred)
gb_conf_matrix

Gradient Boosting Classification Report
              precision    recall  f1-score   support

           0       0.83      0.81      0.82       449
           1       0.84      0.85      0.84       459
           2       0.98      0.99      0.99      1005

    accuracy                           0.91      1913
   macro avg       0.88      0.88      0.88      1913
weighted avg       0.91      0.91      0.91      1913



array([[364,  76,   9],
       [ 61, 391,   7],
       [ 13,   0, 992]])

In [10]:
gb_roc_auc = roc_auc_score(y_test, gb_prob, multi_class="ovr")
print("Gradient Boosting ROC-AUC:", gb_roc_auc)

Gradient Boosting ROC-AUC: 0.9788897399930105
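
Both classification reports round to 0.91 accuracy, so the raw counts are more informative for comparing the two models. The sketch below copies both confusion matrices from the outputs above and counts correct predictions.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows = true, cols = predicted)
rf_cm = np.array([[362, 76, 11], [61, 389, 9], [11, 1, 993]])
gb_cm = np.array([[364, 76, 9], [61, 391, 7], [13, 0, 992]])

rf_correct, gb_correct = np.trace(rf_cm), np.trace(gb_cm)
n_test = rf_cm.sum()

print(f"Random Forest    : {rf_correct}/{n_test} correct ({rf_correct / n_test:.4f})")
print(f"Gradient Boosting: {gb_correct}/{n_test} correct ({gb_correct / n_test:.4f})")
print(f"Difference       : {gb_correct - rf_correct} samples")
```

The gap is a handful of test samples, and the two ROC-AUC values differ by less than 0.001, so on this single split the models are effectively tied; cross-validation would be needed to separate them.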


## Section 5 - Model Benchmarking

This section evaluates additional machine learning models using the same train–test split and evaluation metrics to enable fair comparison.

The goal is to understand how different model families perform on the Kepler exoplanet classification task.

## Import Additional Models

We include distance-based (k-NN), margin-based (SVM), probabilistic (Naive Bayes), and neural (MLP) models for comparison.

In [14]:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, roc_auc_score

from tqdm import tqdm
import time

## Reuse Existing Train–Test Split

The same split is reused to ensure fair comparison.


## Section 6 - Evaluation Utility Function

A helper function is used to train and evaluate models consistently.

In [15]:
def evaluate_model_with_progress(
    model,
    X_train,
    X_test,
    y_train,
    y_test,
    model_name,
    use_probabilities=True
):
    print("\n==============================")
    print(f"Training {model_name}")
    print("==============================")

    start_time = time.time()

    # Train model
    model.fit(X_train, y_train)

    train_time = time.time() - start_time
    print(f"Training completed in {train_time:.2f} seconds")

    # Predictions
    y_pred = model.predict(X_test)

    # ROC-AUC
    if use_probabilities and hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)
        roc_auc = roc_auc_score(y_test, y_prob, multi_class="ovr")
    else:
        roc_auc = "Not available"

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("ROC-AUC:", roc_auc)

## Section 7 - Support Vector Machine (RBF Kernel)

SVM training progress is shown via verbose output.

In [16]:
svm_model = SVC(
    kernel="rbf",
    C=1.0,
    probability=True,
    class_weight="balanced",
    verbose=True,
    random_state=42
)

evaluate_model_with_progress(
    svm_model,
    X_train,
    X_test,
    y_train,
    y_test,
    model_name="SVM (RBF Kernel)"
)


Training SVM (RBF Kernel)
[LibSVM]Training completed in 15.04 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.09      0.01      0.02       449
           1       0.27      1.00      0.42       459
           2       0.88      0.14      0.25      1005

    accuracy                           0.32      1913
   macro avg       0.41      0.38      0.23      1913
weighted avg       0.55      0.32      0.24      1913

ROC-AUC: 0.6651441473133809
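
One plausible contributor to the low SVM scores above is feature scale: the RBF kernel is distance-based, and the dataset mixes columns such as `koi_steff` (values in the thousands) with 0/1 flag columns. This is an assumption, not something the notebook verifies. The sketch below illustrates the effect on synthetic data by inflating one noise column and comparing an unscaled SVM with a `StandardScaler` + `SVC` pipeline; all names and data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: columns 0-4 are informative, columns 5-9 are pure noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, shuffle=False, random_state=0
)
X_demo[:, -1] *= 1e4  # blow up one noise column to mimic mixed feature scales

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

raw_svm = SVC(kernel="rbf", C=1.0, random_state=42).fit(X_tr, y_tr)
scaled_svm = make_pipeline(
    StandardScaler(), SVC(kernel="rbf", C=1.0, random_state=42)
).fit(X_tr, y_tr)

acc_raw = raw_svm.score(X_te, y_te)
acc_scaled = scaled_svm.score(X_te, y_te)
print(f"unscaled SVM accuracy: {acc_raw:.3f}")
print(f"scaled SVM accuracy  : {acc_scaled:.3f}")
```

Wrapping the scaler and the classifier in one pipeline also guarantees that training and test data pass through the same transformation.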


## Section 8 - k-Nearest Neighbors

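k-NN has no iterative training loop: fitting only stores (and indexes) the training data, so the reported training time is close to zero. ROC-AUC is skipped here via `use_probabilities=False`, although `KNeighborsClassifier` does provide `predict_proba`.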

In [17]:
knn_model = KNeighborsClassifier(n_neighbors=7)

evaluate_model_with_progress(
    knn_model,
    X_train,
    X_test,
    y_train,
    y_test,
    model_name="k-Nearest Neighbors",
    use_probabilities=False
)


Training k-Nearest Neighbors
Training completed in 0.00 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.31      0.24      0.27       449
           1       0.56      0.65      0.60       459
           2       0.74      0.77      0.76      1005

    accuracy                           0.61      1913
   macro avg       0.54      0.55      0.54      1913
weighted avg       0.60      0.61      0.60      1913

ROC-AUC: Not available


## Section 9 - Naive Bayes

Gaussian Naive Bayes only estimates per-class feature means and variances, so training is nearly instantaneous; the progress messages confirm execution.

In [18]:
nb_model = GaussianNB()

evaluate_model_with_progress(
    nb_model,
    X_train,
    X_test,
    y_train,
    y_test,
    model_name="Naive Bayes",
    use_probabilities=True
)


Training Naive Bayes
Training completed in 0.01 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.09      0.08      0.08       449
           1       0.32      0.99      0.49       459
           2       0.94      0.12      0.21      1005

    accuracy                           0.32      1913
   macro avg       0.45      0.40      0.26      1913
weighted avg       0.59      0.32      0.25      1913

ROC-AUC: 0.7164434822555755


## Section 10 - Neural Network (MLP)

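With `verbose=True`, the neural network prints the training loss at each iteration; because `early_stopping=True`, it also prints a validation score per iteration.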

In [19]:
mlp_model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    max_iter=500,
    random_state=42,
    verbose=True,
    early_stopping=True
)

evaluate_model_with_progress(
    mlp_model,
    X_train,
    X_test,
    y_train,
    y_test,
    model_name="MLP Neural Network"
)


Training MLP Neural Network
Iteration 1, loss = 20.59628007
Validation score: 0.451697
Iteration 2, loss = 13.62522728
Validation score: 0.523499
Iteration 3, loss = 11.32357112
Validation score: 0.570496
Iteration 4, loss = 9.23796257
Validation score: 0.520888
Iteration 5, loss = 10.60853186
Validation score: 0.422977
Iteration 6, loss = 8.95655909
Validation score: 0.481723
Iteration 7, loss = 12.76362605
Validation score: 0.502611
Iteration 8, loss = 10.42217805
Validation score: 0.502611
Iteration 9, loss = 11.85788241
Validation score: 0.546997
Iteration 10, loss = 5.76909101
Validation score: 0.331593
Iteration 11, loss = 14.24997400
Validation score: 0.471279
Iteration 12, loss = 11.18749303
Validation score: 0.640992
Iteration 13, loss = 10.66205819
Validation score: 0.595300
Iteration 14, loss = 3.81107610
Validation score: 0.284595
Iteration 15, loss = 14.72964239
Validation score: 0.627937
Iteration 16, loss = 9.48196191
Validation score: 0.617493
Iteration 17, loss = 9.68
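
The loss in the log above oscillates rather than decreasing steadily, which is often a sign of unscaled inputs (an assumption here, not verified in the notebook). As a minimal sketch on synthetic data, the example below places a `StandardScaler` in front of an `MLPClassifier` with the same architecture and inspects the early-stopping bookkeeping afterwards; `mlp_pipe` and the dataset are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0
)

mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(64, 32),
        max_iter=500,
        early_stopping=True,   # holds out validation_fraction (default 0.1) of the training data
        n_iter_no_change=10,   # default patience, written out for clarity
        random_state=42,
    ),
).fit(X_demo, y_demo)

mlp = mlp_pipe[-1]  # the fitted MLPClassifier step
print("epochs run           :", mlp.n_iter_)
print("best validation score:", round(mlp.best_validation_score_, 3))
print("scores tracked       :", len(mlp.validation_scores_))
```

Training stops once the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive epochs, so `n_iter_` is usually well below `max_iter`.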

## Section 11 - Save Trained Models

All trained models are saved to disk to allow reuse in evaluation, comparison, and deployment without retraining.

In [23]:
import joblib
import os

# Create models directory if it does not exist
os.makedirs("../models", exist_ok=True)

# Save advanced models
joblib.dump(rf_model, "../models/random_forest_model.pkl")
joblib.dump(gb_model, "../models/gradient_boosting_model.pkl")

joblib.dump(svm_model, "../models/svm_rbf_model.pkl")
joblib.dump(knn_model, "../models/knn_model.pkl")
joblib.dump(nb_model, "../models/naive_bayes_model.pkl")
joblib.dump(mlp_model, "../models/mlp_neural_network_model.pkl")

print("All trained models saved successfully.")

All trained models saved successfully.
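
A saved model is only useful if it reloads to the same estimator. The sketch below is a small round-trip check with `joblib.dump` and `joblib.load` on a throwaway model in a temporary directory; the model, data, and path are illustrative and separate from the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")  # temporary demo path
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored estimator should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions identical after reload:", same)
```

Pickled estimators are tied to the scikit-learn version that created them, so recording that version alongside the `.pkl` files is a sensible precaution.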


## Section 12 - Save Model Evaluation Results

This section saves all model evaluation outputs (reports, confusion matrices, and ROC-AUC scores) to disk for later analysis and comparison.

In [28]:
import json  # required by json.dump in save_model_results_full
import os
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Create results directory if it does not exist
RESULTS_DIR = "../results"
os.makedirs(RESULTS_DIR, exist_ok=True)

In [29]:
def save_model_results_full(
    model_name,
    model,
    X_test,
    y_test
):
    """
    Saves classification report, confusion matrix, and ROC-AUC (if available)
    for a given trained model.
    """

    model_dir = os.path.join(
        RESULTS_DIR,
        model_name.replace(" ", "_").lower()
    )
    os.makedirs(model_dir, exist_ok=True)

    # Predictions
    y_pred = model.predict(X_test)

    # Classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    with open(os.path.join(model_dir, "classification_report.json"), "w") as f:
        json.dump(report, f, indent=4)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    np.savetxt(
        os.path.join(model_dir, "confusion_matrix.csv"),
        cm,
        delimiter=",",
        fmt="%d"
    )

    # ROC-AUC (if supported)
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)
        roc_auc = roc_auc_score(y_test, y_prob, multi_class="ovr")
        with open(os.path.join(model_dir, "roc_auc.txt"), "w") as f:
            f.write(str(roc_auc))

    print(f"Results saved for {model_name}")
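
To show how a saved report can be read back for later comparison, the sketch below writes a `classification_report(..., output_dict=True)` to JSON and loads it into a pandas table. The labels are tiny hand-made lists and the file lives in a temporary directory; only the file name mirrors the function above.

```python
import json
import os
import tempfile

import pandas as pd
from sklearn.metrics import classification_report

# Tiny hand-made labels, just to produce a report with the same structure
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

report = classification_report(y_true, y_pred, output_dict=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classification_report.json")
    with open(path, "w") as f:
        json.dump(report, f, indent=4)
    with open(path) as f:
        loaded = json.load(f)

# "accuracy" is a single float; every other entry is a dict of metrics
accuracy = loaded.pop("accuracy")
table = pd.DataFrame(loaded).T  # one row per class / average
print(table.round(2))
print("accuracy:", round(accuracy, 2))
```

Loading each model's JSON this way and concatenating the tables gives a single comparison frame without rerunning any model.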

In [32]:
import joblib

scaler = joblib.load("../models/standard_scaler.pkl")

X_test_scaled = scaler.transform(X_test)

In [34]:
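# Scaled models
# NOTE: svm_model and mlp_model were fit on the unscaled X_train in Sections 7
# and 10. Scoring them on X_test_scaled is only meaningful if they were refit
# on scaler.transform(X_train) in between.
save_model_results_full("SVM RBF", svm_model, X_test_scaled, y_test)
save_model_results_full("MLP Neural Network", mlp_model, X_test_scaled, y_test)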

# Non-scaled models
save_model_results_full("Random Forest", rf_model, X_test, y_test)
save_model_results_full("Gradient Boosting", gb_model, X_test, y_test)
save_model_results_full("K Nearest Neighbors", knn_model, X_test, y_test)
save_model_results_full("Naive Bayes", nb_model, X_test, y_test)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


Results saved for SVM RBF
Results saved for MLP Neural Network
Results saved for Random Forest
Results saved for Gradient Boosting
Results saved for K Nearest Neighbors
Results saved for Naive Bayes
