# External Data Processing for Model Usage

This notebook processes an external NASA Kepler Objects of Interest (KOI) dataset and prepares it for downstream model usage.

The processing pipeline mirrors the feature engineering steps applied to the training data, so that the external data is compatible with the trained model and any later evaluation on it is a fair comparison. No model training or inference is performed in this notebook.


## Imports and Path Configuration

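Paths are built with `pathlib` relative to the project root, which avoids hard-coded, platform-specific separators. The root is taken as the parent of the current working directory, so the notebook is assumed to run from a direct subdirectory of the project.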

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
# Resolve project root: kepler_exoplanet_ml/
PROJECT_ROOT = Path.cwd().parent

NASA_PATH = PROJECT_ROOT / "data" / "external" / "cumulative_2026.csv"
OUTPUT_PATH = PROJECT_ROOT / "data" / "processed" / "nasa_external_processed.csv"

print("NASA input path:", NASA_PATH)
print("Processed output path:", OUTPUT_PATH)

NASA input path: c:\Users\ibaan\Documents\Coding\Python\kepler_exoplanet_ml\data\external\cumulative_2026.csv
Processed output path: c:\Users\ibaan\Documents\Coding\Python\kepler_exoplanet_ml\data\processed\nasa_external_processed.csv
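
Because `Path.cwd().parent` depends on where the kernel was started, a more defensive lookup can search upward for a known marker. The following is a minimal sketch, not the project's actual setup: the helper name `find_project_root` and the choice of a `data` directory as the marker are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker name is an assumption, not a project convention.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demo on a throwaway tree: <tmp>/data and <tmp>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    (root / "notebooks").mkdir()
    found = find_project_root(root / "notebooks")
    root_matches = found == root

print(root_matches)
```

In the notebook this would be called as `find_project_root(Path.cwd())`, which returns the same root whether the kernel starts in the project directory or in one of its subdirectories.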


## Load External NASA KOI Dataset

This dataset is independent of the Kaggle training data and is used for external validation and as a usage demonstration.

In [14]:
df_ext = pd.read_csv(
    NASA_PATH,
    sep=",",          # comma-separated (confirmed by Excel)
    comment="#",      # skip metadata lines
    engine="python",
    encoding="utf-8"
)

print(df_ext.shape)
df_ext.head()


(9564, 49)


,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,10797460,K00752.01,Kepler-227 b,CONFIRMED,CANDIDATE,1.0,0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,10811496,K00753.01,,CANDIDATE,CANDIDATE,0.0,0,0,0,0,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,0,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
4,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.0,0,0,0,0,...,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509


In [15]:
# Count numeric columns. Only the last expression in a cell is displayed,
# so the column list itself is printed in the Dataset Overview section.
df_ext.select_dtypes(include="number").shape

(9564, 44)

## Dataset Overview

In [16]:
with open(NASA_PATH, "r", encoding="utf-8") as f:
    for _ in range(5):
        # readline() keeps the trailing newline, so stop print() from adding another
        print(f.readline(), end="")

# This file was produced by the NASA Exoplanet Archive  http://exoplanetarchive.ipac.caltech.edu
# Fri Jan  2 20:53:06 2026
#
# COLUMN kepid:          KepID
# COLUMN kepoi_name:     KOI Name
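
The `# COLUMN name: description` lines in the file header can be turned into a lookup table for the column names. The sketch below assumes every description line follows the pattern visible in the output above. The function name `read_column_descriptions` is hypothetical, and the sample text is a shortened stand-in for the real file.

```python
import io
import re


def read_column_descriptions(lines):
    """Collect `# COLUMN name: description` pairs from the leading comment lines."""
    pattern = re.compile(r"^#\s*COLUMN\s+(\w+):\s*(.*)$")
    descriptions = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the comment header ends at the first data line
        match = pattern.match(line.rstrip("\n"))
        if match:
            descriptions[match.group(1)] = match.group(2).strip()
    return descriptions


# Shortened stand-in for the header of the downloaded CSV
sample = io.StringIO(
    "# This file was produced by the NASA Exoplanet Archive\n"
    "#\n"
    "# COLUMN kepid:          KepID\n"
    "# COLUMN kepoi_name:     KOI Name\n"
    "kepid,kepoi_name\n"
)
column_info = read_column_descriptions(sample)
print(column_info)
```

With the real file, the open file handle can be passed in place of `sample`, since iterating over it yields lines in the same way.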



In [17]:
print("Shape:", df_ext.shape)
df_ext.columns.tolist()

Shape: (9564, 49)


['kepid',
 'kepoi_name',
 'kepler_name',
 'koi_disposition',
 'koi_pdisposition',
 'koi_score',
 'koi_fpflag_nt',
 'koi_fpflag_ss',
 'koi_fpflag_co',
 'koi_fpflag_ec',
 'koi_period',
 'koi_period_err1',
 'koi_period_err2',
 'koi_time0bk',
 'koi_time0bk_err1',
 'koi_time0bk_err2',
 'koi_impact',
 'koi_impact_err1',
 'koi_impact_err2',
 'koi_duration',
 'koi_duration_err1',
 'koi_duration_err2',
 'koi_depth',
 'koi_depth_err1',
 'koi_depth_err2',
 'koi_prad',
 'koi_prad_err1',
 'koi_prad_err2',
 'koi_teq',
 'koi_teq_err1',
 'koi_teq_err2',
 'koi_insol',
 'koi_insol_err1',
 'koi_insol_err2',
 'koi_model_snr',
 'koi_tce_plnt_num',
 'koi_tce_delivname',
 'koi_steff',
 'koi_steff_err1',
 'koi_steff_err2',
 'koi_slogg',
 'koi_slogg_err1',
 'koi_slogg_err2',
 'koi_srad',
 'koi_srad_err1',
 'koi_srad_err2',
 'ra',
 'dec',
 'koi_kepmag']

## Target Column and Identifiers

- Target: `koi_disposition`
- Identifiers (kept for interpretation, not modeling):
  - `kepoi_name` (KOI name)
  - `kepler_name` (Kepler name)

In [18]:
TARGET_COL = "koi_disposition"

IDENTIFIER_COLS = [
    "kepoi_name",
    "kepler_name"
]

available_identifiers = [c for c in IDENTIFIER_COLS if c in df_ext.columns]

identifiers = df_ext[available_identifiers].copy()

## Feature Selection

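We retain only numeric columns, as in training. The dtype filter drops the text columns but does not, on its own, remove leakage-prone numeric ones: `koi_score`, the `koi_fpflag_*` flags, and the `kepid` identifier are all numeric and remain in `X_ext`. They should be excluded before modeling unless they were also training features.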

In [19]:
# Keep numeric features only
X_ext = df_ext.select_dtypes(include=[np.number]).copy()

print("Numeric feature count:", X_ext.shape[1])

Numeric feature count: 44
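
One way to guarantee that the external features match the training features is to reindex against the training column list. This is a sketch under assumptions: `TRAIN_FEATURES` is a hypothetical stand-in, because the real training feature list is not shown in this notebook, and the toy frame replaces `df_ext`.

```python
import pandas as pd

# Hypothetical stand-in for the feature list saved at training time
TRAIN_FEATURES = ["koi_period", "koi_depth", "koi_prad"]

# Toy external frame with one missing training feature and two extra columns
external = pd.DataFrame({
    "kepid": [1, 2],
    "koi_score": [1.0, 0.0],
    "koi_period": [9.49, 54.42],
    "koi_depth": [615.8, 874.8],
})

# Report mismatches before aligning, so nothing is dropped silently
missing = [c for c in TRAIN_FEATURES if c not in external.columns]
extra = [c for c in external.columns if c not in TRAIN_FEATURES]

# reindex keeps the training column order and adds missing columns as NaN
aligned = external.reindex(columns=TRAIN_FEATURES)

print("Missing from external data:", missing)
print("Not used in training:", extra)
```

Reindexing also drops identifier and leakage-prone columns in the same step, provided they were not part of the training list.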


## Missing Value Handling

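We apply median imputation to mirror training preprocessing. The medians here are computed from the external dataset itself; for strict consistency between training and inference, the medians from the training data would be used instead. The sanity check returns 19128 rather than 0, which is exactly 2 × 9564 rows: two columns contain no values at all, so their median is NaN and `fillna` leaves them unfilled.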

In [20]:
# Median imputation
X_ext = X_ext.fillna(X_ext.median())

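# Final sanity check: count the missing values that remain.
# Columns that are entirely NaN have a NaN median and stay unfilled.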
X_ext.isnull().sum().sum()

np.int64(19128)
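
A nonzero count after median imputation can only come from columns with no values at all. A minimal sketch of one way to handle this is to drop such columns before imputing. The toy frame and the `empty_col` name are illustrative assumptions, not columns from the NASA table.

```python
import numpy as np
import pandas as pd

# Toy frame: two partially missing columns and one hypothetical all-missing column
features = pd.DataFrame({
    "koi_period": [9.5, np.nan, 19.9],
    "koi_teq": [793.0, 443.0, np.nan],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Columns with no observed values have no median, so remove them first
all_nan_cols = features.columns[features.isna().all()].tolist()
features = features.drop(columns=all_nan_cols)

# Every remaining column has a defined median, so fillna leaves no gaps
medians = features.median()
imputed = features.fillna(medians)
remaining = int(imputed.isna().sum().sum())

print("Dropped:", all_nan_cols)
print("Remaining missing values:", remaining)
```

If the medians computed on the training data are available, passing that Series to `fillna` in place of `medians` would match the training preprocessing more closely. Whether such a Series was saved is not stated in this notebook.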

## Target Encoding (Optional)

If `koi_disposition` exists, we encode it for evaluation. If not, this dataset is treated as inference-only.

In [21]:
if TARGET_COL in df_ext.columns:
    y_ext = df_ext[TARGET_COL].map({
        "FALSE POSITIVE": 0,
        "CANDIDATE": 1,
        "CONFIRMED": 2
    })
else:
    y_ext = None
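
`Series.map` with a dictionary returns NaN for any label that is not a key, so an unexpected disposition value would pass through silently. The sketch below shows one way to surface such values. The `"NOT DISPOSITIONED"` entry is a hypothetical example of an unmapped label, and `LABEL_MAP` simply repeats the mapping used above.

```python
import pandas as pd

LABEL_MAP = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}

# Toy labels; the last value is a hypothetical label missing from LABEL_MAP
labels = pd.Series(["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "NOT DISPOSITIONED"])

encoded = labels.map(LABEL_MAP)  # unmapped labels become NaN

# List the original labels that had no entry in the mapping
unmapped = sorted(labels[encoded.isna()].unique())

# Nullable integer dtype keeps the codes as integers despite the missing entry
encoded_int = encoded.astype("Int64")

print("Unmapped labels:", unmapped)
print(encoded_int.tolist())
```

An empty `unmapped` list confirms that every row received a class code.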

## Assemble Usage Dataset

We combine:
- Identifiers
- Clean numeric features
- Encoded target (if available)

In [22]:
usage_df = pd.concat(
    [identifiers.reset_index(drop=True), X_ext.reset_index(drop=True)],
    axis=1
)

if y_ext is not None:
    usage_df["true_label"] = y_ext.values

usage_df.head()

,kepoi_name,kepler_name,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,true_label
0,K00752.01,Kepler-227 b,10797460,1.0,0,0,0,0,9.488036,2.775e-05,...,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347,2
1,K00752.02,Kepler-227 c,10797460,0.969,0,0,0,0,54.418383,0.0002479,...,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347,2
2,K00753.01,,10811496,0.0,0,0,0,0,19.89914,1.494e-05,...,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436,1
3,K00754.01,,10848459,0.0,0,1,0,0,1.736952,2.63e-07,...,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597,0
4,K00755.01,Kepler-664 b,10854555,1.0,0,0,0,0,2.525592,3.761e-06,...,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509,2


## Save Processed Dataset

In [23]:
usage_df.to_csv(OUTPUT_PATH, index=False)

print(f"Saved processed usage dataset to: {OUTPUT_PATH}")

Saved processed usage dataset to: c:\Users\ibaan\Documents\Coding\Python\kepler_exoplanet_ml\data\processed\nasa_external_processed.csv
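
`to_csv` does not create missing parent directories, so the save step fails on a fresh checkout where `data/processed` does not exist yet. The sketch below creates the directory first and reads the file back as a quick check. The path and the toy frame are illustrative assumptions, not the notebook's real output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for usage_df
frame = pd.DataFrame({
    "kepoi_name": ["K00752.01"],
    "koi_period": [9.488036],
    "true_label": [2],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location inside a throwaway directory
    out_path = Path(tmp) / "data" / "processed" / "example.csv"

    # Create the parent directories if needed, then write without the index
    out_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(out_path, index=False)

    # Read the file back to confirm shape and column order survived
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
print(same_shape, same_columns)
```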
