# Feature Engineering

This notebook enhances the cleaned Kepler dataset by:
- Reducing redundancy from highly correlated features
- Applying transformations to skewed features
- Creating physically meaningful derived features
- Preparing a final feature matrix for modeling

No models are trained in this notebook.


## Section 1 - Import Libraries and Load Data

pandas and NumPy are used for feature transformation and analysis. The cleaned dataset is then loaded from disk.

In [1]:
import pandas as pd
import numpy as np

In [2]:
DATA_PATH = "../data/processed/cleaned_data.csv"

df = pd.read_csv(DATA_PATH)

print("Dataset shape before feature engineering:", df.shape)
df.head()

Dataset shape before feature engineering: (9564, 40)


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_kepmag,koi_disposition
0,1.0,0.0,0.0,0.0,0.0,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,...,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347,1
1,0.969,0.0,0.0,0.0,0.0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347,1
2,0.0,0.0,1.0,0.0,0.0,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,...,158.0,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,15.436,2
3,0.0,0.0,1.0,0.0,0.0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,157.0,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,15.597,2
4,1.0,0.0,0.0,0.0,0.0,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,...,169.0,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,15.509,1


## Section 2 - Feature Engineering
### 1) Separate Features and Target

The target variable is separated from the features so that the transformations below apply only to the features and cannot modify the labels.

In [3]:
target_column = "koi_disposition"

X = df.drop(columns=[target_column])
y = df[target_column]

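### 2) Reduce Redundancy in Uncertainty Features

Many features have paired uncertainty columns: err1 (the upper, positive uncertainty) and err2 (the lower, negative uncertainty). Each pair is combined in quadrature into a single uncertainty magnitude, sqrt(err1² + err2²), and the two original columns are then dropped.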

In [4]:
# Identify uncertainty feature pairs
error_features = {}

for col in X.columns:
    if col.endswith("_err1"):
        base = col[: -len("_err1")]  # strip only the trailing suffix
        err2 = base + "_err2"
        if err2 in X.columns:
            error_features[base] = (col, err2)

# Create combined uncertainty features
for base, (err1, err2) in error_features.items():
    X[f"{base}_err"] = np.sqrt(X[err1]**2 + X[err2]**2)

# Drop original err1 and err2 columns
cols_to_drop = [c for pair in error_features.values() for c in pair]
X = X.drop(columns=cols_to_drop)

print("Shape after uncertainty reduction:", X.shape)

Shape after uncertainty reduction: (9564, 29)
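
The quadrature combination used above can be checked by hand on a small example. The sketch below uses a toy frame with made-up values (not rows from the Kepler data) and a single uncertainty pair:

```python
import numpy as np
import pandas as pd

# Toy frame with one asymmetric uncertainty pair (values are made up for illustration).
toy = pd.DataFrame({
    "koi_period": [10.0, 20.0],
    "koi_period_err1": [3.0, 0.5],    # upper (positive) uncertainty
    "koi_period_err2": [-4.0, -1.2],  # lower (negative) uncertainty
})

# Quadrature sum: squaring discards the sign of err2.
toy["koi_period_err"] = np.sqrt(
    toy["koi_period_err1"] ** 2 + toy["koi_period_err2"] ** 2
)
toy = toy.drop(columns=["koi_period_err1", "koi_period_err2"])

print(toy)  # koi_period_err is 5.0 for the first row and about 1.3 for the second
```

Note that when the two uncertainties are symmetric (err1 = -err2), the combined value is √2 times err1 rather than err1 itself, so the new column is a magnitude on a slightly different scale from either original.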


### 3) Log Transformation of Skewed Features

Highly skewed, non-negative features are transformed with log1p, that is log(1 + x), to compress their range. This tends to help linear models and other scale-sensitive methods.

In [5]:
skewed_features = [
    "koi_period",
    "koi_duration",
    "koi_depth",
    "koi_prad",
    "koi_teq",
    "koi_insol",
    "koi_model_snr"
]

# log1p(x) = log(1 + x) is defined at x = 0, unlike log(x); features missing from X are skipped
for feature in skewed_features:
    if feature in X.columns:
        X[f"{feature}_log"] = np.log1p(X[feature])
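
As a rough illustration of what the transformation does, the following sketch applies `np.log1p` to a made-up, strongly right-skewed series and compares the sample skewness before and after. The values are hypothetical and chosen only to span several orders of magnitude, as the transit features do:

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed values spanning several orders of magnitude.
raw = pd.Series([0.0, 1.0, 10.0, 100.0, 1000.0, 10000.0])

logged = np.log1p(raw)        # log(1 + x); maps 0 to 0 instead of -inf
restored = np.expm1(logged)   # inverse of log1p

print("skew before:", raw.skew())
print("skew after: ", logged.skew())
print("round trip ok:", np.allclose(restored, raw))
```

Because `np.expm1` inverts `np.log1p`, the original scale can be recovered later if a result needs to be reported in physical units.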

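### 4) Create Derived Physical Features

Two simple ratios are created to capture relationships between transit depth, planet size, and signal strength: transit depth over signal-to-noise ratio, and planet radius over stellar radius. In the KOI table, koi_prad is given in Earth radii and koi_srad in solar radii, so the radius ratio is a relative size indicator rather than a dimensionless quantity.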

In [6]:
# Avoid division by zero
epsilon = 1e-6

if "koi_depth" in X.columns and "koi_model_snr" in X.columns:
    X["depth_to_snr_ratio"] = X["koi_depth"] / (X["koi_model_snr"] + epsilon)

if "koi_prad" in X.columns and "koi_srad" in X.columns:
    X["planet_to_star_radius_ratio"] = X["koi_prad"] / (X["koi_srad"] + epsilon)
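
A dimensionless variant of the radius ratio can be sketched by expressing both radii in the same unit (koi_prad is in Earth radii, koi_srad in solar radii). The conversion constant below is an approximate value, and the column name `radius_ratio_dimensionless` is a hypothetical name introduced for this illustration; neither appears in the notebook:

```python
import pandas as pd

# Approximate number of Earth radii in one solar radius (assumed constant, ~109.2).
EARTH_RADII_PER_SOLAR_RADIUS = 109.2

# Illustrative values only, not rows from the dataset.
toy = pd.DataFrame({
    "koi_prad": [2.26, 11.2],   # planet radius, Earth radii
    "koi_srad": [0.927, 1.0],   # stellar radius, solar radii
})

epsilon = 1e-6  # same division guard as the cell above

# Convert the stellar radius to Earth radii so both terms share a unit.
toy["radius_ratio_dimensionless"] = toy["koi_prad"] / (
    toy["koi_srad"] * EARTH_RADII_PER_SOLAR_RADIUS + epsilon
)

print(toy)
```

This differs from planet_to_star_radius_ratio only by a roughly constant factor, so it mainly aids interpretation; models that depend only on feature ordering, such as decision trees, would be unaffected.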

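### 5) Drop Redundant Original Features

Once the log-transformed versions exist, the raw columns are removed so that the feature matrix does not carry two strongly related versions of the same quantity, which would contribute to multicollinearity. The ratio features in the previous step are computed from the raw values, so this cell must run after them.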

In [7]:
features_to_drop = [
    f for f in skewed_features if f in X.columns
]

X = X.drop(columns=features_to_drop)

print("Final feature count:", X.shape[1])

Final feature count: 31
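
The cell above drops a raw column whenever it is present, which is safe here because a log-transformed column is created for every skewed feature found in X. A slightly more defensive variant, sketched on a toy frame with illustrative values, drops a raw column only when its `_log` counterpart actually exists:

```python
import pandas as pd

skewed = ["koi_period", "koi_depth"]

# Toy frame: koi_depth deliberately has no "_log" counterpart.
toy = pd.DataFrame({
    "koi_period": [9.49, 54.42],
    "koi_period_log": [2.35, 4.01],
    "koi_depth": [615.8, 874.8],
})

# Drop a raw column only when its log-transformed version is present.
to_drop = [f for f in skewed if f in toy.columns and f"{f}_log" in toy.columns]
toy = toy.drop(columns=to_drop)

print(list(toy.columns))  # ['koi_period_log', 'koi_depth']
```

With this guard, a feature that was skipped during transformation is kept in its raw form instead of being lost.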


## Section 3 - Recombine Features and Target

The engineered feature matrix is recombined with the target variable.

In [8]:
final_df = X.copy()
final_df[target_column] = y

print("Final dataset shape:", final_df.shape)
final_df.head()

Final dataset shape: (9564, 32)
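
Assigning a Series as a new DataFrame column aligns on the index rather than on row position, so the recombination relies on X and y still sharing the same index. The sketch below, on toy data with hypothetical names (`X_toy`, `y_toy`), shows a simple check and what happens when the indexes drift apart:

```python
import pandas as pd

X_toy = pd.DataFrame({"f": [0.1, 0.2, 0.3]}, index=[0, 1, 2])
y_toy = pd.Series([1, 2, 1], index=[0, 1, 2], name="koi_disposition")

# Safe case: identical indexes, so every row receives its own label.
assert X_toy.index.equals(y_toy.index)
final_toy = X_toy.copy()
final_toy["koi_disposition"] = y_toy

# Unsafe case: a shifted index leaves unmatched rows as NaN.
shifted = y_toy.copy()
shifted.index = [1, 2, 3]
bad = X_toy.copy()
bad["koi_disposition"] = shifted

print(final_toy)
print(bad)  # row 0 has no matching label and becomes NaN
```

In this notebook no rows are filtered or reordered between the split and the recombination, so the indexes match; the check becomes useful if a row-dropping step is added later.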


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_time0bk,koi_impact,koi_tce_plnt_num,koi_steff,koi_slogg,...,koi_period_log,koi_duration_log,koi_depth_log,koi_prad_log,koi_teq_log,koi_insol_log,koi_model_snr_log,depth_to_snr_ratio,planet_to_star_radius_ratio,koi_disposition
0,1.0,0.0,0.0,0.0,0.0,170.53875,0.146,1.0,5455.0,4.467,...,2.350235,1.375613,6.424545,1.181727,6.677083,4.549552,3.605498,17.201117,2.437969,1
1,0.969,0.0,0.0,0.0,0.0,162.51384,0.586,2.0,5455.0,4.467,...,4.014911,1.70602,6.775138,1.342865,6.095825,2.313525,3.288402,33.906975,3.052855,1
2,0.0,0.0,1.0,0.0,0.0,175.850252,0.969,1.0,5853.0,4.544,...,3.039708,1.023242,9.290075,2.747271,6.459904,3.696351,4.347694,141.926604,16.820257,2
3,0.0,0.0,1.0,0.0,0.0,170.307565,1.276,1.0,5805.0,4.564,...,1.006845,1.225659,8.997172,3.539799,7.241366,6.794542,6.227722,15.97943,42.300831,2
4,1.0,0.0,0.0,0.0,0.0,171.59555,0.701,1.0,6031.0,4.438,...,1.260048,0.976256,6.404071,1.321756,7.249215,6.832126,3.735286,14.750611,2.629061,1


## Section 4 - Save Feature-Engineered Dataset

The final dataset is saved for use in modeling notebooks.

In [9]:
OUTPUT_PATH = "../data/processed/feature_engineered_data.csv"

final_df.to_csv(OUTPUT_PATH, index=False)

print(f"Feature-engineered dataset saved to {OUTPUT_PATH}")

Feature-engineered dataset saved to ../data/processed/feature_engineered_data.csv
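
The `index=False` argument matters when the file is read back. As a minimal sketch on a toy frame, written to an in-memory buffer instead of disk, the following compares a round trip with and without the index:

```python
import io

import pandas as pd

toy = pd.DataFrame({"koi_score": [1.0, 0.5], "koi_disposition": [1, 2]})

# Round trip without the index: columns come back unchanged.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

# Round trip with the default index=True: the index returns as an extra column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

print(list(without_index.columns))  # ['koi_score', 'koi_disposition']
print(list(with_index.columns))     # ['Unnamed: 0', 'koi_score', 'koi_disposition']
```

Saving without the index keeps a spurious `Unnamed: 0` column from entering downstream notebooks as if it were a feature.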


## Summary

- Uncertainty features were combined to reduce redundancy
- Skewed features were log-transformed
- Physically meaningful ratio features were created
- Redundant raw features were removed
- A final modeling-ready dataset was saved

The saved dataset (9,564 rows, 31 features plus the target) is ready for use in the baseline and advanced modeling notebooks.