# Kepler Exoplanet Dataset — Data Cleaning & Preprocessing

## Objective
This notebook prepares the Kepler Exoplanet dataset for machine learning by:
- Removing non-informative and leakage-prone columns
- Defining features (X) and target variable (y)
- Handling missing values appropriately
- Encoding categorical labels
- Producing a clean, model-ready dataset

The output of this notebook will be used for all subsequent modeling steps.

## Section 1 - Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

In [2]:
# Load raw dataset (relative to notebooks/)
DATA_PATH = "../data/raw/cumulative.csv"

df = pd.read_csv(DATA_PATH)

print("Raw dataset shape:", df.shape)
df.head()

Raw dataset shape: (9564, 50)


Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,1,10797460,K00752.01,Kepler-227 b,CONFIRMED,CANDIDATE,1.0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,2,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,3,10811496,K00753.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,4,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
4,5,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.0,0,0,0,...,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509


## Section 2 - Remove Identifier and Leakage-Prone Columns

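Columns that act as identifiers or introduce label leakage are removed. The identifiers (`rowid`, `kepid`, `kepoi_name`) carry no predictive information, and `kepler_name` leaks the label because it is populated only for confirmed planets. The sky coordinates (`ra`, `dec`) and `koi_tce_delivname` are treated as non-informative here and dropped as well.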

In [3]:
# Columns that do not provide predictive value or cause label leakage
columns_to_drop = [
    "rowid",
    "kepid",
    "kepoi_name",
    "kepler_name",   # exists only for confirmed planets
    "ra",
    "dec",
    "koi_tce_delivname"
]

df = df.drop(columns=columns_to_drop, errors="ignore")

print("Shape after dropping columns:", df.shape)

Shape after dropping columns: (9564, 43)
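Because `errors="ignore"` silently skips any name that is absent from the DataFrame, a typo in `columns_to_drop` would go unnoticed. The following is a minimal sketch of a check that reports which requested columns were actually present. The `toy` frame and the `split_present_missing` helper are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd


def split_present_missing(frame, names):
    """Return (present, missing) lists for the requested column names."""
    present = [c for c in names if c in frame.columns]
    missing = [c for c in names if c not in frame.columns]
    return present, missing


# Hypothetical stand-in for the raw Kepler table
toy = pd.DataFrame({"rowid": [1, 2], "kepid": [10, 11], "koi_period": [9.5, 54.4]})

requested = ["rowid", "kepid", "koi_tce_delivname"]
present, missing = split_present_missing(toy, requested)

print("Dropping:", present)
print("Not found (skipped):", missing)

# Drop only the columns that exist, so no errors="ignore" is needed
toy_dropped = toy.drop(columns=present)
print("Remaining columns:", list(toy_dropped.columns))
```

Applied to the real `df`, the same helper would take `columns_to_drop` as its second argument.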


## Section 3 - Define Target Variable

The literature-based disposition of each object is used as the classification target.

In [4]:
# Primary classification target
target_column = "koi_disposition"

# Inspect target distribution
df[target_column].value_counts()

koi_disposition
FALSE POSITIVE    5023
CONFIRMED         2293
CANDIDATE         2248
Name: count, dtype: int64
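The counts above indicate that the classes are not balanced. As a small worked example, the proportions can be derived directly from those counts; the `counts` Series below simply re-enters the numbers printed above.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {"FALSE POSITIVE": 5023, "CONFIRMED": 2293, "CANDIDATE": 2248},
    name="count",
)

# Share of each class in the full dataset, rounded to three decimals
proportions = (counts / counts.sum()).round(3)
print(proportions)
```

On the real DataFrame, `df[target_column].value_counts(normalize=True)` should give the same shares without re-entering the counts.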

In [5]:
# Remove pipeline-based disposition to avoid label leakage
df = df.drop(columns=["koi_pdisposition"], errors="ignore")
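One way to see why `koi_pdisposition` is treated as leakage is to cross-tabulate it against the target before dropping it. The sketch below uses only the five rows visible in the `df.head()` output earlier, so it illustrates the technique and says nothing about the full dataset; `sample` is a hypothetical name.

```python
import pandas as pd

# The five rows shown in the df.head() output (disposition columns only)
sample = pd.DataFrame({
    "koi_disposition": [
        "CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "FALSE POSITIVE", "CONFIRMED",
    ],
    "koi_pdisposition": [
        "CANDIDATE", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE", "CANDIDATE",
    ],
})

# Rows: pipeline disposition, columns: target label
table = pd.crosstab(sample["koi_pdisposition"], sample["koi_disposition"])
print(table)
```

If most of the mass in such a table sits in a few cells, the column largely determines the target. On the real data this check has to run before the column is dropped.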

## Section 4 - Separate Features and Target and Encode Target Labels

Features (X) and target labels (y) are separated for preprocessing. The categorical target labels are encoded into numerical form for model compatibility.

In [6]:
# Separate features and target
X = df.drop(columns=[target_column])
y = df[target_column]

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Feature matrix shape: (9564, 41)
Target vector shape: (9564,)


In [7]:
# Encode categorical target labels into integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Mapping for reference
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
label_mapping

{'CANDIDATE': np.int64(0),
 'CONFIRMED': np.int64(1),
 'FALSE POSITIVE': np.int64(2)}

The mapping above preserves the original label meanings. `LabelEncoder` assigns integer codes in sorted order of the class names, which is why `CANDIDATE` maps to 0 and `FALSE POSITIVE` to 2.
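Once a model predicts integer codes, the fitted encoder can translate them back to label names. This is a minimal sketch on a handful of labels; `predicted_codes` is a hypothetical model output, and the plain-`int` mapping is just one way to avoid `np.int64` values in the printed dictionary.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so position in classes_ equals the assigned code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Hypothetical model output, decoded back to the original label names
predicted_codes = np.array([2, 0, 1])
decoded = encoder.inverse_transform(predicted_codes)
print(list(decoded))
```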

In [8]:
# Select numerical features only
X_numeric = X.select_dtypes(include=[np.number])

print("Numerical feature count:", X_numeric.shape[1])
X_numeric.head()

Numerical feature count: 41


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_kepmag
0,1.0,0,0,0,0,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347
1,0.969,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347
2,0.0,0,1,0,0,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,...,5853.0,158.0,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,15.436
3,0.0,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,5805.0,157.0,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,15.597
4,1.0,0,0,0,0,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,...,6031.0,169.0,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,15.509


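Only numerical features are used at this stage. All 41 remaining features are already numeric, so this selection drops nothing here; it acts as a safeguard in case a categorical or identifier-style column was missed in the earlier steps.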
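To confirm what a numeric-only selection would exclude, the non-numeric columns can be listed explicitly. The sketch below uses a hypothetical frame with one text column (`note`) to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two numeric columns and one text column
toy = pd.DataFrame({
    "koi_period": [9.488036, 54.418383],
    "koi_fpflag_nt": [0, 0],
    "note": ["a", "b"],
})

numeric = toy.select_dtypes(include=[np.number])
non_numeric = toy.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric columns:", list(numeric.columns))
print("Excluded columns:", non_numeric)
```

An empty `non_numeric` list on the real `X` would match the count of 41 reported above.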

## Section 5 - Handle Missing Values

### Remove Fully Missing Numerical Columns

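Two of the numerical features contain no valid values at all. No median can be computed for such a column, and with its default settings `SimpleImputer` discards it from the output, which would leave the imputed array misaligned with the column names. These columns are therefore removed before imputation.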

In [10]:
# Drop numerical columns that are entirely missing
X_numeric = X_numeric.dropna(axis=1, how="all")

print("Numerical features after dropping empty columns:", X_numeric.shape[1])

Numerical features after dropping empty columns: 39
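Before dropping, it can help to see which columns are entirely missing and how sparse the others are. This is a minimal sketch on a hypothetical three-column frame; the column names `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" has no valid values at all
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [np.nan, 2.0, np.nan],
})

# Columns where every entry is missing
all_missing = toy.columns[toy.isna().all()].tolist()
print("Entirely missing:", all_missing)

# Fraction of missing values per column, most sparse first
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Same call as in the notebook: drop columns with no valid values
kept = toy.dropna(axis=1, how="all")
print("Kept:", list(kept.columns))
```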


Median imputation is applied to the remaining numerical features. It preserves the dataset size because no rows are dropped, and the median, unlike the mean, is not pulled toward extreme values.

In [11]:
# Median imputation for remaining numerical features
imputer = SimpleImputer(strategy="median")

X_imputed_array = imputer.fit_transform(X_numeric)

# Recreate DataFrame with correct columns
X_imputed = pd.DataFrame(
    X_imputed_array,
    columns=X_numeric.columns,
    index=X_numeric.index
)

# Verify no missing values remain
X_imputed.isnull().sum().sum()

np.int64(0)
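The robustness argument for the median can be checked on a tiny example. The values below are hypothetical: one feature with a single extreme entry and one gap, filled once with the median and once with the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One hypothetical feature with an extreme value and one missing entry
values = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

median_filled = SimpleImputer(strategy="median").fit_transform(values)
mean_filled = SimpleImputer(strategy="mean").fit_transform(values)

# The last row is the imputed one
print("Median fill:", median_filled[-1, 0])  # median of 1, 2, 3, 1000 is 2.5
print("Mean fill:", mean_filled[-1, 0])      # mean of 1, 2, 3, 1000 is 251.5
```

The mean-based fill lands far from the typical values, while the median stays among them.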

## Section 6 - Combine Cleaned Features and Target and Save the New Dataset

The cleaned feature matrix and encoded target are recombined into a single dataset.


In [12]:
cleaned_df = X_imputed.copy()
cleaned_df["koi_disposition"] = y_encoded

print("Cleaned dataset shape:", cleaned_df.shape)
cleaned_df.head()

Cleaned dataset shape: (9564, 40)


Unnamed: 0,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_kepmag,koi_disposition
0,1.0,0.0,0.0,0.0,0.0,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,...,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347,1
1,0.969,0.0,0.0,0.0,0.0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,15.347,1
2,0.0,0.0,1.0,0.0,0.0,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,...,158.0,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,15.436,2
3,0.0,0.0,1.0,0.0,0.0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,157.0,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,15.597,2
4,1.0,0.0,0.0,0.0,0.0,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,...,169.0,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,15.509,1


In [13]:
# Save dataset
OUTPUT_PATH = "../data/processed/cleaned_data.csv"

cleaned_df.to_csv(OUTPUT_PATH, index=False)

print(f"Cleaned dataset saved to {OUTPUT_PATH}")


Cleaned dataset saved to ../data/processed/cleaned_data.csv
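`to_csv` raises an error if the target directory does not exist, so creating it first and reading the file back is a cheap sanity check. The sketch below writes a hypothetical two-row frame to a temporary directory rather than to the notebook's `../data/processed/` path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for cleaned_df
toy = pd.DataFrame({
    "koi_period": [9.488036, 54.418383],
    "koi_disposition": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    out_path = os.path.join(out_dir, "cleaned_data.csv")
    toy.to_csv(out_path, index=False)

    # Read the file back and compare its structure with the original
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Round trip OK:", same_shape and same_columns)
```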


In [15]:
# Check total number of missing values in the cleaned dataset
total_missing = cleaned_df.isnull().sum().sum()

print("Total missing values in cleaned dataset:", total_missing)

Total missing values in cleaned dataset: 0


In [16]:
# Check missing values per column (only show if any exist)
missing_per_column = cleaned_df.isnull().sum()
missing_per_column[missing_per_column > 0]

Series([], dtype: int64)

## Summary

- Identifier and leakage-prone columns were removed
- Target labels were encoded numerically
- Numerical features were retained; fully missing columns were dropped and the remaining gaps were filled with the median
- A clean dataset was saved for modeling

The dataset is now ready for exploratory analysis and baseline modeling.