# Baseline Model — Logistic Regression

This notebook establishes a baseline classification model using Logistic Regression.
The objectives are to:
- Split the data into training and test sets
- Apply feature scaling where required
- Train a baseline Logistic Regression model
- Evaluate performance using appropriate metrics

This baseline serves as a reference point for more advanced models.

## Section 1 - Import Required Libraries

Libraries for modeling, evaluation, and preprocessing are imported below.

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

In [4]:
# Load the feature engineered dataset
DATA_PATH = "../data/processed/feature_engineered_data.csv"

df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (9564, 32)


| Unnamed: 0 | koi_score | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | koi_fpflag_ec | koi_time0bk | koi_impact | koi_tce_plnt_num | koi_steff | koi_slogg | ... | koi_period_log | koi_duration_log | koi_depth_log | koi_prad_log | koi_teq_log | koi_insol_log | koi_model_snr_log | depth_to_snr_ratio | planet_to_star_radius_ratio | koi_disposition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 170.53875 | 0.146 | 1.0 | 5455.0 | 4.467 | ... | 2.350235 | 1.375613 | 6.424545 | 1.181727 | 6.677083 | 4.549552 | 3.605498 | 17.201117 | 2.437969 | 1 |
| 1 | 0.969 | 0.0 | 0.0 | 0.0 | 0.0 | 162.51384 | 0.586 | 2.0 | 5455.0 | 4.467 | ... | 4.014911 | 1.70602 | 6.775138 | 1.342865 | 6.095825 | 2.313525 | 3.288402 | 33.906975 | 3.052855 | 1 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 175.850252 | 0.969 | 1.0 | 5853.0 | 4.544 | ... | 3.039708 | 1.023242 | 9.290075 | 2.747271 | 6.459904 | 3.696351 | 4.347694 | 141.926604 | 16.820257 | 2 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 170.307565 | 1.276 | 1.0 | 5805.0 | 4.564 | ... | 1.006845 | 1.225659 | 8.997172 | 3.539799 | 7.241366 | 6.794542 | 6.227722 | 15.97943 | 42.300831 | 2 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 171.59555 | 0.701 | 1.0 | 6031.0 | 4.438 | ... | 1.260048 | 0.976256 | 6.404071 | 1.321756 | 7.249215 | 6.832126 | 3.735286 | 14.750611 | 2.629061 | 1 |


## Section 2 - Separate Features and Target

The feature matrix and target labels are separated for training.

In [5]:
target_column = "koi_disposition"

X = df.drop(columns=[target_column])
y = df[target_column]

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Feature matrix shape: (9564, 31)
Target vector shape: (9564,)
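
The `df.head()` output in Section 1 lists an `Unnamed: 0` column. A column with that name usually appears when a CSV was written together with its row index, and the 31-column feature matrix suggests it is currently kept as a feature. The following is a minimal sketch on a hypothetical toy frame (not the real dataset) of how such a column arises and how it could be dropped before modeling. Whether dropping it is appropriate here depends on how the processed file was produced.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the processed dataset
toy = pd.DataFrame(
    {"koi_impact": [0.146, 0.586, 0.969], "koi_disposition": [1, 1, 2]}
)

# to_csv writes the row index by default, under an empty header
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# read_csv names the header-less first column "Unnamed: 0"
reloaded = pd.read_csv(buffer)

# Drop any leftover index columns together with the target
leftover = [col for col in reloaded.columns if col.startswith("Unnamed:")]
X_toy = reloaded.drop(columns=leftover + ["koi_disposition"])
y_toy = reloaded["koi_disposition"]

print("Leftover index columns:", leftover)
print("Feature columns:", list(X_toy.columns))
```

An alternative with the same effect is to pass `index_col=0` to `pd.read_csv`, or to write the file with `index=False` in the first place.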


## Section 3 - Train–Test Split

The dataset is split using stratification to preserve class distribution.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (7651, 31)
Test set size: (1913, 31)
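
One way to confirm that stratification did what it is meant to do is to compare class shares in the two splits. The sketch below uses synthetic labels with hypothetical class proportions, not the notebook's data; the same `class_shares` helper (a name introduced here for illustration) could be applied to `y_train` and `y_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic three-class labels with hypothetical proportions
rng = np.random.default_rng(42)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.25, 0.25, 0.5])
X_demo = rng.normal(size=(1000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def class_shares(labels):
    """Return the fraction of samples belonging to each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)


# With stratification, the shares in both splits should nearly coincide
max_gap = np.abs(class_shares(y_tr) - class_shares(y_te)).max()

print("Train shares:", class_shares(y_tr))
print("Test shares: ", class_shares(y_te))
print("Largest gap: ", max_gap)
```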


## Section 4 - Feature Scaling

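Logistic Regression with the default L2 penalty is sensitive to feature scale, and the lbfgs solver converges more reliably on standardized inputs. Standardization is therefore applied to all input features, with the scaler fitted on the training set only so that no test-set statistics leak into training.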

In [7]:
scaler = StandardScaler()

# Fit on training data only
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to test data
X_test_scaled = scaler.transform(X_test)
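
To make the fit/transform distinction concrete, the following sketch reproduces standardization by hand with NumPy on synthetic data. The array names and the two hypothetical feature scales are assumptions for illustration. The key point is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import numpy as np

# Synthetic data: two features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
train = rng.normal(loc=[5000.0, 0.5], scale=[300.0, 0.2], size=(200, 2))
test = rng.normal(loc=[5000.0, 0.5], scale=[300.0, 0.2], size=(50, 2))

# "fit": estimate per-feature statistics on the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("Train mean after scaling:", train_scaled.mean(axis=0))
print("Train std after scaling: ", train_scaled.std(axis=0))
print("Test mean after scaling: ", test_scaled.mean(axis=0))
```

The scaled training features have mean 0 and unit variance by construction, while the test features are only approximately centered, which is expected.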

## Section 5 - Train Logistic Regression Model

A Logistic Regression model is trained with `class_weight="balanced"`, which weights each class inversely to its frequency in the training labels to address class imbalance.

In [8]:
# multi_class is left at its default; passing it explicitly is deprecated
# in recent scikit-learn releases
log_reg = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="lbfgs"
)

log_reg.fit(X_train_scaled, y_train)



LogisticRegression parameters:
    penalty: 'l2'
    dual: False
    tol: 0.0001
    C: 1.0
    fit_intercept: True
    intercept_scaling: 1
    class_weight: 'balanced'
    random_state: None
    solver: 'lbfgs'
    max_iter: 1000
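
As a rough illustration of what `class_weight="balanced"` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the test-set class supports reported in Section 6 (449, 459, 1005). These are used only as stand-in counts: the model itself derives its weights from the training labels, which should have nearly the same proportions because the split is stratified.

```python
import numpy as np

# Stand-in class counts: the test-set supports from the classification report
counts = np.array([449, 459, 1005])

n_samples = counts.sum()
n_classes = len(counts)

# "balanced" heuristic: weight each class inversely to its frequency
weights = n_samples / (n_classes * counts)

for label, weight in enumerate(weights):
    print(f"class {label}: weight {weight:.2f}")
```

The majority class receives a weight below 1 and the two smaller classes receive weights above 1, so each class contributes equally to the total weighted sample count.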


## Section 6 - Model Evaluation

The model is evaluated on the test set using per-class precision, recall, and F1-score, together with a confusion matrix.

In [9]:
# Generate predictions
y_pred = log_reg.predict(X_test_scaled)

# Classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.79      0.80       449
           1       0.80      0.85      0.82       459
           2       0.99      0.97      0.98      1005

    accuracy                           0.90      1913
   macro avg       0.86      0.87      0.87      1913
weighted avg       0.90      0.90      0.90      1913



In [10]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

array([[354,  93,   2],
       [ 63, 388,   8],
       [ 22,   5, 978]])
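
The per-class figures in the classification report can be recomputed directly from this confusion matrix, which is a useful sanity check on how the two outputs relate. In scikit-learn's convention, rows are true classes and columns are predicted classes, so recall divides the diagonal by row sums and precision divides it by column sums. The sketch below hard-codes the matrix shown above.

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted)
cm_demo = np.array([
    [354, 93, 2],
    [63, 388, 8],
    [22, 5, 978],
])

true_positives = np.diag(cm_demo)

recall = true_positives / cm_demo.sum(axis=1)     # per true class
precision = true_positives / cm_demo.sum(axis=0)  # per predicted class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm_demo.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

The off-diagonal entries also show where the errors concentrate: 93 + 63 = 156 of the 193 misclassified samples are confusions between classes 0 and 1.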

## ROC-AUC Evaluation

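ROC-AUC provides a threshold-independent performance measure. For this multi-class problem it is computed one-vs-rest from the predicted class probabilities and, with the default `average="macro"`, averaged with equal weight over the three classes.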

In [11]:
# Predict probabilities
y_prob = log_reg.predict_proba(X_test_scaled)

# Compute multi-class ROC-AUC (One-vs-Rest)
roc_auc = roc_auc_score(y_test, y_prob, multi_class="ovr")

print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9718088280945528
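
To show what the one-vs-rest score aggregates, the sketch below computes a binary ROC-AUC for each class against the rest and averages them, then compares the result with `roc_auc_score(..., multi_class="ovr")`. It runs on a synthetic dataset from `make_classification`, not on the notebook's data, and the `_demo` names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic three-class problem (stand-in for the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
model_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba_demo = model_demo.predict_proba(X_demo)

# One binary ROC-AUC per class: "class k" versus "all other classes"
per_class_auc = [
    roc_auc_score((y_demo == k).astype(int), proba_demo[:, k])
    for k in range(3)
]

manual_macro = float(np.mean(per_class_auc))
library_macro = roc_auc_score(y_demo, proba_demo, multi_class="ovr")

print("Per-class AUC:", np.round(per_class_auc, 4))
print("Manual macro average:", manual_macro)
print("roc_auc_score (ovr): ", library_macro)
```

Reporting the per-class values alongside the average can reveal a weak class that a single macro score would hide.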


## Baseline Logistic Regression Interpretation

The baseline Logistic Regression model reaches 0.90 accuracy and a ROC-AUC of about 0.97 on the 1,913-sample test set, indicating good class separability. Within each class, precision and recall are close to each other, but performance is uneven across classes: class 2, the majority class (1,005 test samples), has an F1-score of 0.98, while classes 0 and 1 reach 0.80 and 0.82. The confusion matrix shows that most errors are confusions between classes 0 and 1 (156 of 193 misclassifications). Class weighting was used during training, but its effect is not isolated here because no unweighted model was fit for comparison. Overall, this baseline indicates that the engineered features carry substantial predictive signal, and it provides a solid reference point for comparison with more complex models.

## Section 7 - Save Baseline Model Artifacts

The trained model and scaler are saved for reuse and comparison.


In [12]:
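import os
import joblib

# Create the output directory if it does not exist yet
os.makedirs("../models", exist_ok=True)
joblib.dump(log_reg, "../models/baseline_logistic_regression.pkl")
joblib.dump(scaler, "../models/standard_scaler.pkl")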

print("Baseline model and scaler saved.")

Baseline model and scaler saved.
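
When the artifacts are reused later, the scaler and the model have to be loaded together and applied in the same order as during training. The following is a minimal sketch of that round trip, using a small synthetic model and a temporary directory instead of the `../models` paths above; all `_demo` names and file names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Small synthetic binary problem standing in for the real training data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression(max_iter=1000).fit(
    scaler_demo.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")

    # Save both artifacts, then load them back
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must pass through the loaded scaler before prediction
new_rows = rng.normal(size=(5, 3))
original_pred = model_demo.predict(scaler_demo.transform(new_rows))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
same = bool(np.array_equal(original_pred, reloaded_pred))

print("Predictions identical after reload:", same)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that produced them, so recording the library version next to the artifacts is worth considering.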


## Summary

- A stratified train–test split was applied
- Features were standardized
- A class-balanced Logistic Regression model was trained
- Performance was evaluated using multiple metrics
- A reproducible baseline was established

This baseline will be used to benchmark advanced models.

## Section 8 - Save Logistic Regression Evaluation Results

This section saves the evaluation outputs of the baseline Logistic Regression model for later comparison with advanced models.

In [13]:
import os

RESULTS_DIR = "../results/logistic_regression"
os.makedirs(RESULTS_DIR, exist_ok=True)

In [14]:
import json

# Predict on test data (already scaled in Section 4)
y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)

# Classification report
report = classification_report(y_test, y_pred, output_dict=True)
with open(os.path.join(RESULTS_DIR, "classification_report.json"), "w") as f:
    json.dump(report, f, indent=4)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.savetxt(
    os.path.join(RESULTS_DIR, "confusion_matrix.csv"),
    cm,
    delimiter=",",
    fmt="%d"
)

# ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob, multi_class="ovr")
with open(os.path.join(RESULTS_DIR, "roc_auc.txt"), "w") as f:
    f.write(str(roc_auc))

print("Logistic Regression results saved successfully.")


Logistic Regression results saved successfully.
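
For the later comparison with advanced models, the saved files need to be read back in the same formats they were written in. The sketch below writes the confusion matrix and ROC-AUC value reported in this notebook to a temporary directory (standing in for `RESULTS_DIR`) and then loads them again; it is an illustration of the read path under those assumptions, not part of the original workflow.

```python
import os
import tempfile

import numpy as np

# Values reported earlier in this notebook
cm_reported = np.array([[354, 93, 2], [63, 388, 8], [22, 5, 978]])
roc_auc_reported = 0.9718088280945528

with tempfile.TemporaryDirectory() as results_dir:
    cm_path = os.path.join(results_dir, "confusion_matrix.csv")
    auc_path = os.path.join(results_dir, "roc_auc.txt")

    # Write using the same formats as Section 8
    np.savetxt(cm_path, cm_reported, delimiter=",", fmt="%d")
    with open(auc_path, "w") as f:
        f.write(str(roc_auc_reported))

    # Read the results back for comparison
    cm_loaded = np.loadtxt(cm_path, delimiter=",", dtype=int)
    with open(auc_path) as f:
        roc_auc_loaded = float(f.read())

print("Confusion matrix round trip OK:", np.array_equal(cm_loaded, cm_reported))
print("ROC-AUC loaded:", roc_auc_loaded)
```

The JSON classification report can be read back the same way with `json.load`, which returns a nested dictionary keyed by class label and by the aggregate rows.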
