# Kepler Exoplanet Data Preprocessing Pipeline - 3-Class Classification

This notebook implements a comprehensive preprocessing pipeline for the Kepler exoplanet dataset from NASA's Kepler mission for 3-class classification.

## Dataset Information:
- **Source**: Kepler Objects of Interest (KOI) Cumulative Table
- **Size**: 9,564 objects × 141 features
- **Target Classes**: CONFIRMED (2,746), FALSE POSITIVE (4,839), CANDIDATE (1,979)

## Preprocessing Steps:
1. **Data Loading and Initial Exploration** - Load CSV, check dimensions, data types, and missing values
2. **Target Variable Creation** - 3-class classification: Confirmed vs Candidate vs False Positive
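3. **Feature Selection and Engineering** - Drop identifier, disposition, and metadata columns; keep the remaining Kepler features
4. **Object/Categorical Column Handling** - Encode or drop non-numeric columns
5. **Missing Value Handling** - Drop columns with more than 80% missing values and median-impute the rest
6. **Data Scaling and Splitting** - Stratified 80/20 train-test split followed by standard scaling
7. **Export Processed Data** - Save for model training pipeline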

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import joblib
import json
import os
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set display options for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("📚 Libraries imported successfully!")
print("🚀 Ready for Kepler data preprocessing!")

📚 Libraries imported successfully!
🚀 Ready for Kepler data preprocessing!


## 1. Data Loading and Initial Exploration

Load the Kepler CSV file and perform initial data exploration including:
- Dataset dimensions (rows/columns)
- Data types examination
- Missing value analysis
- Target distribution analysis

In [2]:
# Load the Kepler dataset
df = pd.read_csv('kepler.csv')

# Basic dataset information
print("=== KEPLER DATASET OVERVIEW ===")
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Target variable analysis
print("\n=== TARGET DISTRIBUTION ===")
target_counts = df['koi_disposition'].value_counts()
print(target_counts)
print(f"\n📈 Target percentages:")
for disposition, count in target_counts.items():
    pct = count / len(df) * 100
    print(f"   {disposition}: {count} ({pct:.1f}%)")

# Data types overview
print("\n=== DATA TYPES SUMMARY ===")
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Missing values analysis
print("\n=== MISSING VALUES ANALYSIS ===")
missing_total = df.isnull().sum().sum()
missing_pct = (missing_total / (df.shape[0] * df.shape[1])) * 100
print(f"Total missing values: {missing_total:,} ({missing_pct:.2f}%)")

# Columns with highest missing values
missing_by_col = df.isnull().sum().sort_values(ascending=False)
top_missing = missing_by_col[missing_by_col > 0].head(10)
if not top_missing.empty:
    print("\n📉 Top 10 columns with missing values:")
    for col, missing_count in top_missing.items():
        missing_pct_col = (missing_count / len(df)) * 100
        print(f"   {col}: {missing_count} ({missing_pct_col:.1f}%)")

# Display sample data
print("\n=== SAMPLE DATA (First 3 rows) ===")
print(df.head(3))

FileNotFoundError: [Errno 2] No such file or directory: 'kepler.csv'
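
The load above failed because `kepler.csv` was not in the working directory. A minimal sketch of a more defensive loader follows. The helper name `load_koi_table` is hypothetical, and the `comment="#"` argument rests on an assumption: that the file was exported with a block of `#`-prefixed header lines, as NASA Exoplanet Archive downloads often are. If your copy has no such header, the argument can simply be dropped.

```python
from pathlib import Path

import pandas as pd


def load_koi_table(path: str = "kepler.csv") -> pd.DataFrame:
    """Load the KOI cumulative table, failing early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected the KOI table at '{csv_path.resolve()}'. "
            "Download it or adjust the path before running the notebook."
        )

    # Assumption: header lines, if any, start with '#'.
    # Caveat: pandas treats '#' anywhere on a line as the start of a comment,
    # so free-text fields containing '#' would be truncated.
    table = pd.read_csv(csv_path, comment="#")

    # The rest of the notebook depends on this column, so check it up front.
    if "koi_disposition" not in table.columns:
        raise ValueError("Column 'koi_disposition' is missing from the table")
    return table
```

With this helper, the first line of the cell would become `df = load_koi_table('kepler.csv')`, and a missing file is reported with its resolved absolute path.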

## 2. Target Variable Creation - 3-Class Classification

Convert the Kepler disposition categories into a 3-class classification problem:
- **Candidate (0)**: Objects marked as CANDIDATE
- **Confirmed (1)**: Objects marked as CONFIRMED  
- **False_Positive (2)**: Objects marked as FALSE POSITIVE

In [None]:
# Create 3-class target variable
print("🎯 Creating 3-class target variable...")

# Map Kepler dispositions to 3 classes
disposition_mapping = {
    'CANDIDATE': 0,        # Planet candidates needing follow-up
    'CONFIRMED': 1,        # Confirmed exoplanets
    'FALSE POSITIVE': 2    # False positives and refuted objects
}

# Create target variable
df['target_3class'] = df['koi_disposition'].map(disposition_mapping)

# Check for any unmapped values
unmapped = df[df['target_3class'].isnull()]['koi_disposition'].unique()
if len(unmapped) > 0:
    print(f"⚠️ Unmapped dispositions found: {unmapped}")
    # Handle any edge cases here if needed
    
# Remove rows with missing target
initial_count = len(df)
df = df.dropna(subset=['target_3class'])
removed_count = initial_count - len(df)
if removed_count > 0:
    print(f"🗑️ Removed {removed_count} rows with missing target")

# Final target distribution
print("\n📊 FINAL 3-CLASS TARGET DISTRIBUTION:")
target_counts = df['target_3class'].value_counts().sort_index()
class_names = ['Candidate', 'Confirmed', 'False_Positive']

for class_id, count in target_counts.items():
    class_name = class_names[int(class_id)]
    pct = count / len(df) * 100
    print(f"   {class_id} ({class_name}): {count} ({pct:.1f}%)")

# Class imbalance analysis
class_counts = target_counts.values
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"\n⚖️ Class imbalance ratio: {imbalance_ratio:.1f}:1")

# Create target mapping for reference
target_mapping = {
    'original_mapping': disposition_mapping,
    'encoding': {name: idx for idx, name in enumerate(class_names)},
    'class_descriptions': {
        'Candidate': 'Planet candidates requiring follow-up observation',
        'Confirmed': 'Confirmed exoplanets with high confidence',
        'False_Positive': 'False positives and refuted planetary candidates'
    }
}

print(f"\n✅ Target variable created successfully!")
print(f"📈 Final dataset shape: {df.shape}")

🎯 Creating 3-class target variable...

📊 FINAL 3-CLASS TARGET DISTRIBUTION:
   0 (Candidate): 1979 (20.7%)
   1 (Confirmed): 2746 (28.7%)
   2 (False_Positive): 4839 (50.6%)

⚖️ Class imbalance ratio: 2.4:1

✅ Target variable created successfully!
📈 Final dataset shape: (9564, 142)
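
The 2.4:1 imbalance reported above may matter for some models. As a sketch only (the notebook itself does not do this), the class counts from the output can be turned into "balanced" class weights with scikit-learn's `compute_class_weight`. The weight for each class is `n_samples / (n_classes * class_count)`. The variable names here are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts as reported in the target distribution above
counts = {0: 1979, 1: 2746, 2: 4839}

# Rebuild a label vector with the same class frequencies
y = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.array(sorted(counts))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Dictionary form, as accepted by the class_weight parameter of many estimators
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

The minority class (Candidate) receives the largest weight and the majority class (False_Positive) the smallest. Whether weighting helps depends on the downstream model and metric.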


## 3. Feature Selection and Engineering

Select the feature columns from the Kepler dataset:
- Remove non-predictive columns (IDs, names, comments, provenance, data links)
- Remove the target and disposition columns from the feature set
- Count numerical versus object (string) features
- List the object columns that need encoding in the next step

In [None]:
print("🔧 Starting feature selection and engineering...")

# Columns to exclude from features (non-predictive)
exclude_columns = [
    # ID and name columns
    'rowid', 'kepid', 'kepoi_name', 'kepler_name',
    
    # Target and disposition columns
    'koi_disposition', 'koi_pdisposition', 'target_3class',
    
    # Comments and metadata
    'koi_comment', 'koi_disp_prov', 'koi_parm_prov', 'koi_sparprov',
    
    # Data links and deliverables
    'koi_tce_delivname', 'koi_datalink_dvr', 'koi_datalink_dvs',
    
    # Transit model details (koi_trans_mod is kept; the limb-darkening model name is excluded)
    'koi_limbdark_mod',
    
    # Date columns (excluded here; re-add if temporal analysis is needed)
    'koi_vet_date'
]

# Get all column names
all_columns = df.columns.tolist()
print(f"📋 Total columns in dataset: {len(all_columns)}")

# Feature columns (excluding target and non-predictive)
feature_columns = [col for col in all_columns if col not in exclude_columns]
print(f"🎯 Feature columns selected: {len(feature_columns)}")

# Display the excluded columns that are actually present in the dataset
excluded_present = [col for col in exclude_columns if col in all_columns]
print(f"\n🗑️ Excluded columns ({len(excluded_present)}):")
for col in excluded_present:
    print(f"   • {col}")

# Analyze feature types
X_features = df[feature_columns].copy()

print(f"\n📊 FEATURE ANALYSIS:")
print(f"   Numerical features: {X_features.select_dtypes(include=[np.number]).shape[1]}")
print(f"   Object features: {X_features.select_dtypes(include=['object']).shape[1]}")

# Check for object/string columns that might need processing
object_cols = X_features.select_dtypes(include=['object']).columns.tolist()
if object_cols:
    print(f"\n🔤 Object columns requiring attention:")
    for col in object_cols:
        unique_values = X_features[col].nunique()
        print(f"   • {col}: {unique_values} unique values")
        if unique_values <= 10:  # Show sample values for categorical
            sample_values = X_features[col].dropna().unique()[:5]
            print(f"     Sample: {sample_values}")

print(f"\n✅ Feature selection completed!")
print(f"📈 Features shape: {X_features.shape}")

🔧 Starting feature selection and engineering...
📋 Total columns in dataset: 142
🎯 Feature columns selected: 126

🗑️ Excluded columns (16):
   • rowid
   • kepid
   • kepoi_name
   • kepler_name
   • koi_disposition
   • koi_pdisposition
   • target_3class
   • koi_comment
   • koi_disp_prov
   • koi_parm_prov
   • koi_sparprov
   • koi_tce_delivname
   • koi_datalink_dvr
   • koi_datalink_dvs
   • koi_limbdark_mod
   • koi_vet_date

📊 FEATURE ANALYSIS:
   Numerical features: 122
   Object features: 4

🔤 Object columns requiring attention:
   • koi_vet_stat: 1 unique values
     Sample: ['Done']
   • koi_fittype: 4 unique values
     Sample: ['LS+MCMC' 'MCMC' 'LS' 'none']
   • koi_quarters: 212 unique values
   • koi_trans_mod: 1 unique values
     Sample: ['Mandel and Agol (2002 ApJ 580 171)']

✅ Feature selection completed!
📈 Features shape: (9564, 126)
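
Many of the selected features come as triples: a measurement plus `_err1` and `_err2` uncertainty columns (for example `koi_ingress`, `koi_ingress_err1`, `koi_ingress_err2`). The cell above passes them through unchanged. One possible derived feature, shown here purely as a sketch, is a relative uncertainty per measurement. The helper name `add_relative_errors` and the `_rel_err` suffix are hypothetical, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_relative_errors(features: pd.DataFrame) -> pd.DataFrame:
    """Add <base>_rel_err = mean(|err1|, |err2|) / |base| for every base
    column that has both an _err1 and an _err2 companion."""
    out = features.copy()
    for err1_col in [c for c in features.columns if c.endswith("_err1")]:
        base = err1_col[: -len("_err1")]
        err2_col = f"{base}_err2"
        if base not in features.columns or err2_col not in features.columns:
            continue
        mean_abs_err = (features[err1_col].abs() + features[err2_col].abs()) / 2
        # A zero measurement would cause division by zero, so treat it as missing
        denom = features[base].abs().replace(0, np.nan)
        out[f"{base}_rel_err"] = mean_abs_err / denom
    return out


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "koi_period": [10.0, 0.0, 4.0],
    "koi_period_err1": [0.5, 0.1, np.nan],
    "koi_period_err2": [-0.5, -0.1, -0.2],
    "koi_depth": [100.0, 200.0, 300.0],   # no error columns, so no new feature
})
result = add_relative_errors(toy)
print(result)
```

Rows with a missing error or a zero measurement yield NaN in the new column, so they would flow into the same missing-value handling as every other feature.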


## 4. Handle Object/Categorical Columns

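Convert the remaining object (string) columns to numeric features: count the observed quarters in `koi_quarters`, integer-encode `koi_fittype`, flag the transit model in `koi_trans_mod`, and drop any other object columns.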

In [None]:
print("🔤 Processing object/categorical columns...")

# Handle specific object columns in Kepler dataset
X_processed = X_features.copy()

# Process koi_quarters (quarters when observed)
if 'koi_quarters' in X_processed.columns:
    print("📅 Processing koi_quarters...")
    # Convert quarters to number of quarters observed
    X_processed['quarters_count'] = X_processed['koi_quarters'].apply(
        lambda x: str(x).count('1') if pd.notna(x) else 0
    )
    # The original column is dropped further below, once the count is derived
    
# Process koi_fittype (fitting method)
if 'koi_fittype' in X_processed.columns:
    print("🔬 Processing koi_fittype...")
    fittype_counts = X_processed['koi_fittype'].value_counts()
    print(f"   Fit types: {fittype_counts.to_dict()}")
    
    # Integer-encode fit types by descending frequency (0 = most common).
    # value_counts() already excludes NaN, so missing values stay NaN after map().
    fittype_mapping = {fittype: i for i, fittype in enumerate(fittype_counts.index)}
    
    X_processed['fittype_encoded'] = X_processed['koi_fittype'].map(fittype_mapping)
    print(f"   Encoding: {fittype_mapping}")

# Process koi_trans_mod (transit model)
if 'koi_trans_mod' in X_processed.columns:
    print("🌌 Processing koi_trans_mod...")
    transmod_counts = X_processed['koi_trans_mod'].value_counts()
    print(f"   Transit models: {transmod_counts.to_dict()}")
    
    # 1 for the most common transit model, 0 otherwise (missing values also become 0)
    most_common_model = transmod_counts.index[0] if len(transmod_counts) > 0 else None
    X_processed['is_mandel_agol'] = (X_processed['koi_trans_mod'] == most_common_model).astype(int)
    print(f"   Most common model: {most_common_model}")

# Drop original object columns after processing
object_cols_to_drop = ['koi_fittype', 'koi_trans_mod', 'koi_quarters']
existing_to_drop = [col for col in object_cols_to_drop if col in X_processed.columns]
if existing_to_drop:
    X_processed = X_processed.drop(columns=existing_to_drop)
    print(f"🗑️ Dropped original object columns: {existing_to_drop}")

# Check remaining object columns
remaining_objects = X_processed.select_dtypes(include=['object']).columns.tolist()
if remaining_objects:
    print(f"⚠️ Remaining object columns: {remaining_objects}")
    # Drop them for now
    X_processed = X_processed.drop(columns=remaining_objects)
    print(f"🗑️ Dropped remaining object columns")

print(f"\n✅ Object column processing completed!")
print(f"📈 Processed features shape: {X_processed.shape}")
print(f"🔢 All columns are now numerical: {X_processed.select_dtypes(include=[np.number]).shape[1] == X_processed.shape[1]}")

🔤 Processing object/categorical columns...
📅 Processing koi_quarters...
🔬 Processing koi_fittype...
   Fit types: {'LS+MCMC': 7897, 'MCMC': 1206, 'none': 369, 'LS': 92}
   Encoding: {'LS+MCMC': 0, 'MCMC': 1, 'none': 2, 'LS': 3}
🌌 Processing koi_trans_mod...
   Transit models: {'Mandel and Agol (2002 ApJ 580 171)': 9201}
   Most common model: Mandel and Agol (2002 ApJ 580 171)
🗑️ Dropped original object columns: ['koi_fittype', 'koi_trans_mod', 'koi_quarters']
⚠️ Remaining object columns: ['koi_vet_stat']
🗑️ Dropped remaining object columns

✅ Object column processing completed!
📈 Processed features shape: (9564, 125)
🔢 All columns are now numerical: True
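
The integer codes assigned to `koi_fittype` above impose an order (0 to 3) that has no physical meaning, which can mislead distance-based or linear models. An alternative, sketched here on toy data and not taken from the notebook, is one-hot encoding with `pd.get_dummies`. The same sketch repeats the quarter count using the vectorized `.str` accessor. The toy values are made up.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "koi_fittype": ["LS+MCMC", "MCMC", None, "LS", "none"],
    "koi_quarters": ["1110", "0100", None, "1111", "0000"],
})

# One indicator column per fit type; dummy_na=True adds a column for missing values
fittype_onehot = pd.get_dummies(
    toy["koi_fittype"], prefix="fittype", dummy_na=True, dtype=int
)

# Number of '1' flags per row; missing quarter strings count as 0
quarters_count = toy["koi_quarters"].str.count("1").fillna(0).astype(int)

encoded = pd.concat(
    [fittype_onehot, quarters_count.rename("quarters_count")], axis=1
)
print(encoded)
```

One-hot encoding adds a column per category instead of a single code column, which is a small cost here since `koi_fittype` has only four values.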


## 5. Missing Value Analysis and Handling

Quantify missing values per column, drop columns that are more than 80% missing, and impute the remaining gaps with the column median.

In [None]:
print("🕳️ Analyzing and handling missing values...")

# Missing value analysis
missing_analysis = pd.DataFrame({
    'column': X_processed.columns,
    'missing_count': X_processed.isnull().sum(),
    'missing_pct': (X_processed.isnull().sum() / len(X_processed)) * 100
}).sort_values('missing_pct', ascending=False)

# Filter columns with missing values
missing_cols = missing_analysis[missing_analysis['missing_count'] > 0]

print(f"📊 Columns with missing values: {len(missing_cols)} out of {len(X_processed.columns)}")

if len(missing_cols) > 0:
    print("\n📉 Top 15 columns with missing values:")
    for _, row in missing_cols.head(15).iterrows():
        print(f"   {row['column']}: {row['missing_count']} ({row['missing_pct']:.1f}%)")
    
    # Strategy for handling missing values
    print("\n🛠️ Missing value handling strategy:")
    
    # 1. Drop columns with >80% missing values
    high_missing_cols = missing_cols[missing_cols['missing_pct'] > 80]['column'].tolist()
    if high_missing_cols:
        print(f"🗑️ Dropping {len(high_missing_cols)} columns with >80% missing values")
        X_processed = X_processed.drop(columns=high_missing_cols)
        for col in high_missing_cols[:5]:  # Show first 5
            missing_pct = missing_cols[missing_cols['column'] == col]['missing_pct'].iloc[0]
            print(f"   • {col} ({missing_pct:.1f}% missing)")
        if len(high_missing_cols) > 5:
            print(f"   ... and {len(high_missing_cols) - 5} more")
    
    # 2. Impute remaining missing values
    remaining_missing = X_processed.isnull().sum().sum()
    if remaining_missing > 0:
        print(f"\n🔧 Imputing {remaining_missing:,} remaining missing values...")
        
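        # Use median imputation for numerical features.
        # NOTE: the imputer is fit on the full dataset, before the train-test split,
        # so test rows influence the medians; fit on the training split to avoid this.
        imputer = SimpleImputer(strategy='median')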
        X_processed_imputed = pd.DataFrame(
            imputer.fit_transform(X_processed),
            columns=X_processed.columns,
            index=X_processed.index
        )
        
        # Verify no missing values remain
        final_missing = X_processed_imputed.isnull().sum().sum()
        print(f"✅ Missing values after imputation: {final_missing}")
        
        X_processed = X_processed_imputed
    
else:
    print("✅ No missing values found!")

print(f"\n📈 Final processed features shape: {X_processed.shape}")
print(f"🎯 Ready for model training!")

🕳️ Analyzing and handling missing values...
📊 Columns with missing values: 111 out of 125

📉 Top 15 columns with missing values:
   koi_ingress_err1: 9564 (100.0%)
   koi_model_chisq: 9564 (100.0%)
   koi_longp_err1: 9564 (100.0%)
   koi_longp: 9564 (100.0%)
   koi_eccen_err2: 9564 (100.0%)
   koi_eccen_err1: 9564 (100.0%)
   koi_model_dof: 9564 (100.0%)
   koi_ingress: 9564 (100.0%)
   koi_ingress_err2: 9564 (100.0%)
   koi_sage: 9564 (100.0%)
   koi_sage_err1: 9564 (100.0%)
   koi_sma_err1: 9564 (100.0%)
   koi_sma_err2: 9564 (100.0%)
   koi_sage_err2: 9564 (100.0%)
   koi_incl_err1: 9564 (100.0%)

🛠️ Missing value handling strategy:
🗑️ Dropping 19 columns with >80% missing values
   • koi_ingress_err1 (100.0% missing)
   • koi_model_chisq (100.0% missing)
   • koi_longp_err1 (100.0% missing)
   • koi_longp (100.0% missing)
   • koi_eccen_err2 (100.0% missing)
   ... and 14 more

🔧 Imputing 44,101 remaining missing values...
✅ Missing values after imputation: 0

📈 Final processed features shape: (9564, 106)
🎯 Ready for model training!
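
The cell above fits the imputer on all rows before the data is split, so the medians are computed partly from rows that later land in the test set. A sketch of the leakage-free ordering follows: split first, fit the imputer on the training rows only, then apply it to both splits. It runs on synthetic data, and the column names `f1` to `f3` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed feature matrix and target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
X = X.mask(rng.random(X.shape) < 0.2)          # blank out roughly 20% of the cells
y = pd.Series(rng.integers(0, 3, size=200))

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Learn the medians from the training rows only
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)

# 3. Reuse those training medians for the test rows
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)
```

For median imputation on a dataset of this size the practical effect is likely small, but the split-first ordering keeps the test set untouched by any fitted statistic.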

## 6. Data Scaling and Train-Test Split

Split the data into stratified training and test sets (80/20), then fit a `StandardScaler` on the training set and apply it to both sets.

In [None]:
print("⚖️ Splitting data and applying scaling...")

# Prepare final dataset
X_final = X_processed.copy()
y_final = df['target_3class'].copy()

print(f"📊 Final dataset shape: X={X_final.shape}, y={y_final.shape}")

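# Verify alignment (same length and same row index)
assert len(X_final) == len(y_final), "X and y must have same number of samples"
assert X_final.index.equals(y_final.index), "X and y must share the same row index"
print("✅ X and y are properly aligned")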

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final,
    test_size=0.2,
    random_state=42,
    stratify=y_final
)

print(f"\n📊 TRAIN-TEST SPLIT RESULTS:")
print(f"   Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X_final)*100:.1f}%)")
print(f"   Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X_final)*100:.1f}%)")

# Check class distribution in splits
print(f"\n📈 Class distribution in splits:")
class_names = ['Candidate', 'Confirmed', 'False_Positive']

for split_name, y_split in [('Train', y_train), ('Test', y_test)]:
    print(f"   {split_name}:")
    split_counts = y_split.value_counts().sort_index()
    for class_id, count in split_counts.items():
        class_name = class_names[int(class_id)]
        pct = count / len(y_split) * 100
        print(f"     {class_id} ({class_name}): {count} ({pct:.1f}%)")

# Apply scaling
print(f"\n🔧 Applying StandardScaler...")
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

print(f"✅ Scaling completed!")
print(f"📊 Scaled features - Train: {X_train_scaled.shape}, Test: {X_test_scaled.shape}")

# Verify scaling
print(f"\n📏 Scaling verification (training set):")
print(f"   Mean: {X_train_scaled.mean().mean():.6f} (should be ~0)")
print(f"   Std: {X_train_scaled.std().mean():.6f} (should be ~1)")

⚖️ Splitting data and applying scaling...
📊 Final dataset shape: X=(9564, 106), y=(9564,)
✅ X and y are properly aligned

📊 TRAIN-TEST SPLIT RESULTS:
   Training set: 7651 samples (80.0%)
   Test set: 1913 samples (20.0%)

📈 Class distribution in splits:
   Train:
     0 (Candidate): 1583 (20.7%)
     1 (Confirmed): 2197 (28.7%)
     2 (False_Positive): 3871 (50.6%)
   Test:
     0 (Candidate): 396 (20.7%)
     1 (Confirmed): 549 (28.7%)
     2 (False_Positive): 968 (50.6%)

🔧 Applying StandardScaler...
✅ Scaling completed!
📊 Scaled features - Train: (7651, 106), Test: (1913, 106)

📏 Scaling verification (training set):
   Mean: 0.000000 (should be ~0)
   Std: 0.971762 (should be ~1)
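
The averaged standard deviation above is about 0.97 rather than 1. One possible explanation, which is an assumption and has not been checked against this dataset, is that a few columns are constant in the training split: `StandardScaler` maps a constant column to all zeros, which pulls the average down. A per-column check makes this visible. The sketch below uses toy data with one deliberately constant column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data (illustrative values only)
toy_train = pd.DataFrame({
    "varying_a": [1.0, 2.0, 3.0, 4.0],
    "varying_b": [10.0, 0.0, 5.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0],
})

scaler = StandardScaler().fit(toy_train)
scaled = pd.DataFrame(scaler.transform(toy_train), columns=toy_train.columns)

# StandardScaler divides by the population std (ddof=0), while pandas defaults
# to ddof=1, so pass ddof=0 to get exactly 1 for non-constant columns
per_column_std = scaled.std(ddof=0)

# Columns whose scaled std is (numerically) zero carry no information
zero_variance_cols = per_column_std[per_column_std < 1e-12].index.tolist()
print(per_column_std)
print("Zero-variance columns:", zero_variance_cols)
```

Applied to `X_train_scaled`, the same two lines would list any constant features, which could then be dropped before training.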


## 7. Export Processed Data

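Save the scaled train/test splits, the full cleaned dataset, the fitted scaler, the target mapping, and a metadata file to the `kepler_3class/` directory for model training.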

In [None]:
print("💾 Exporting processed data...")

# Create output directory
output_dir = 'kepler_3class'
os.makedirs(output_dir, exist_ok=True)
print(f"📂 Output directory: {output_dir}/")

# Save train-test splits
print("💾 Saving train-test splits...")
X_train_scaled.to_csv(f'{output_dir}/X_train_scaled.csv', index=False)
X_test_scaled.to_csv(f'{output_dir}/X_test_scaled.csv', index=False)
y_train.to_csv(f'{output_dir}/y_train.csv', index=False)
y_test.to_csv(f'{output_dir}/y_test.csv', index=False)

# Save full processed dataset
print("💾 Saving full processed dataset...")
X_final.to_csv(f'{output_dir}/X_final_cleaned.csv', index=False)
y_final.to_csv(f'{output_dir}/y_final_cleaned.csv', index=False)

# Save scaler
print("💾 Saving scaler...")
joblib.dump(scaler, f'{output_dir}/scaler.joblib')

# Save target mapping
print("💾 Saving target mapping...")
with open(f'{output_dir}/target_mapping.json', 'w') as f:
    json.dump(target_mapping, f, indent=2)

# Create metadata
metadata = {
    'dataset_name': 'Kepler Exoplanet 3-Class',
    'original_shape': df.shape,
    'final_samples': len(X_final),
    'features': len(X_final.columns),
    'target_classes': 3,
    'class_distribution': y_final.value_counts().sort_index().to_dict(),
    'train_samples': len(X_train),
    'test_samples': len(X_test),
    'feature_names': X_final.columns.tolist(),
    'preprocessing_steps': [
        'Target creation (3-class)',
        'Feature selection and engineering',
        'Object column processing',
        'Dropping columns with >80% missing values',
        'Missing value imputation (median)',
        'Train-test split (80/20, stratified)',
        'Standard scaling (fit on training set)'
    ],
    'preprocessing_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
    'class_descriptions': target_mapping['class_descriptions']
}

# Save metadata
print("💾 Saving metadata...")
with open(f'{output_dir}/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

# Summary of exported files
print(f"\n✅ PREPROCESSING COMPLETED SUCCESSFULLY!")
print(f"📂 All files saved to: {output_dir}/")
print(f"\n📋 Files created:")
print(f"   • X_train_scaled.csv - Scaled training features ({X_train_scaled.shape})")
print(f"   • X_test_scaled.csv - Scaled test features ({X_test_scaled.shape})")
print(f"   • y_train.csv - Training targets ({len(y_train)} samples)")
print(f"   • y_test.csv - Test targets ({len(y_test)} samples)")
print(f"   • X_final_cleaned.csv - Full feature matrix ({X_final.shape})")
print(f"   • y_final_cleaned.csv - Full target vector ({len(y_final)} samples)")
print(f"   • scaler.joblib - Fitted StandardScaler")
print(f"   • target_mapping.json - Class mappings and descriptions")
print(f"   • metadata.json - Complete preprocessing metadata")

print(f"\n🎯 READY FOR MODEL TRAINING!")
print(f"📊 Dataset: {len(X_final)} samples, {len(X_final.columns)} features, 3 classes")
print(f"🏆 Class distribution: {dict(y_final.value_counts().sort_index())}")
print(f"⚖️ Class imbalance ratio: {y_final.value_counts().max() / y_final.value_counts().min():.1f}:1")

💾 Exporting processed data...
📂 Output directory: kepler_3class/
💾 Saving train-test splits...
💾 Saving full processed dataset...
💾 Saving scaler...
💾 Saving target mapping...
💾 Saving metadata...

✅ PREPROCESSING COMPLETED SUCCESSFULLY!
📂 All files saved to: kepler_3class/

📋 Files created:
   • X_train_scaled.csv - Scaled training features ((7651, 106))
   • X_test_scaled.csv - Scaled test features ((1913, 106))
   • y_train.csv - Training targets (7651 samples)
   • y_test.csv - Test targets (1913 samples)
   • X_final_cleaned.csv - Full feature matrix ((9564, 106))
   • y_final_cleaned.csv - Full target vector (9564 samples)
   • scaler.joblib - Fitted StandardScaler
   • target_mapping.json - Class mappings and descriptions
   • metadata.json - Complete preprocessing metadata

🎯 READY FOR MODEL TRAINING!
📊 Dataset: 9564 samples, 106 features, 3 classes
🏆 Class distribution: {0: 1979, 1: 2746, 2: 4839}
⚖️ Class imbalance ratio: 2.4:1
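
A downstream training or inference script would typically reload these artifacts. The sketch below shows one way to restore the scaler and the feature order from `scaler.joblib` and `metadata.json`. So that it runs on its own, it first writes stand-in files to a temporary directory. In practice `export_dir` would point at `kepler_3class/`, and the columns `f1` and `f2` are placeholders.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Stand-in export (replaces the real 'kepler_3class' directory) ---
export_dir = Path(tempfile.mkdtemp())
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.0, 10.0, 20.0]})
joblib.dump(StandardScaler().fit(train), export_dir / "scaler.joblib")
with open(export_dir / "metadata.json", "w") as f:
    json.dump({"feature_names": train.columns.tolist()}, f)

# --- Downstream side: reload the artifacts ---
loaded_scaler = joblib.load(export_dir / "scaler.joblib")
with open(export_dir / "metadata.json") as f:
    feature_names = json.load(f)["feature_names"]

# New rows may arrive with columns in a different order
new_rows = pd.DataFrame({"f2": [10.0], "f1": [2.0]})

# Reorder to the training layout before scaling
new_scaled = pd.DataFrame(
    loaded_scaler.transform(new_rows[feature_names]),
    columns=feature_names,
)
print(new_scaled)
```

Note that the export above saves the scaler but not the fitted imputer or the `koi_fittype` encoding, so raw new data would need those steps reproduced before the scaler is applied.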