# TESS Exoplanet Data Preprocessing Pipeline - 3-Class Classification

This notebook implements a comprehensive preprocessing pipeline for the TESS Objects of Interest (TOI) dataset for 3-class classification.

## Dataset Information:
- **Source**: TESS Objects of Interest (TOI) Catalog
- **Size**: 7,703 objects × 87 columns
- **Target Classes**: Candidate (PC, APC), Confirmed (CP, KP), False Positive (FP, FA)

## Target Distribution:
- **PC (Planet Candidate)**: 4,679 objects
- **FP (False Positive)**: 1,197 objects
- **CP (Confirmed Planet)**: 684 objects
- **KP (Known Planet)**: 583 objects
- **APC (Ambiguous Planet Candidate)**: 462 objects
- **FA (False Alarm)**: 98 objects

KP, APC, and FA are merged into the three target classes in Section 2.

## Preprocessing Steps:
1. **Data Loading and Initial Exploration** - Load CSV, check dimensions, data types, and missing values
2. **Target Variable Creation** - Map the six TFOPWG dispositions to 3 classes: Candidate, Confirmed, False Positive
3. **Feature Selection** - Exclude ID, string, and date columns
4. **Object/Categorical Column Handling** - Encode or drop any remaining string columns
5. **Missing Value Handling** - Drop columns with >85% missing values, median-impute the rest
6. **Train-Test Split and Scaling** - Stratified 80/20 split, then StandardScaler fit on the training set
7. **Export Processed Data** - Save for model training pipeline

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import joblib
import json
import os
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set display options for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("📚 Libraries imported successfully!")
print("🚀 Ready for TESS data preprocessing!")

📚 Libraries imported successfully!
🚀 Ready for TESS data preprocessing!


## 1. Data Loading and Initial Exploration

Load the TESS TOI CSV file and perform initial data exploration including:
- Dataset dimensions (rows/columns)
- Data types examination
- Missing value analysis
- Target distribution analysis
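
The loading cell in this section reads `TOI.csv` with default settings. A raw download from the NASA Exoplanet Archive typically begins with `#`-prefixed column-description lines, so a slightly more defensive load could look like the sketch below. The helper name `load_toi` and its required-column check are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def load_toi(source, required=("tfopwg_disp",)):
    """Read a TOI table, skipping '#' comment lines, and check required columns.

    Hypothetical helper: the name and the `required` default are assumptions.
    """
    df = pd.read_csv(source, comment="#")
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df

# Tiny in-memory stand-in for the real file (values are made up for the demo).
demo_csv = io.StringIO(
    "# COLUMN tfopwg_disp: TFOPWG Disposition\n"
    "toi,tfopwg_disp,pl_orbper\n"
    "101.01,PC,1.5\n"
    "102.01,FP,\n"
)
demo_df = load_toi(demo_csv)
print(demo_df.shape)
```

Note that `comment="#"` also truncates any data line containing `#`, so it is only appropriate when field values never contain that character.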

In [2]:
# Load the TESS TOI dataset
df = pd.read_csv('TOI.csv')

# Basic dataset information
print("=== TESS TOI DATASET OVERVIEW ===")
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Target variable analysis
print("\n=== TARGET DISTRIBUTION (TFOPWG_DISP) ===")
target_counts = df['tfopwg_disp'].value_counts()
print(target_counts)
print(f"\n📈 Target percentages:")
for disposition, count in target_counts.items():
    pct = count / len(df) * 100
    print(f"   {disposition}: {count} ({pct:.1f}%)")

# Explain TESS categories
print("\n📋 TESS DISPOSITION CATEGORIES:")
tess_categories = {
    'PC': 'Planet Candidate - Objects that pass initial vetting',
    'CP': 'Confirmed Planet - Objects confirmed as exoplanets',
    'FP': 'False Positive - Objects determined to be false alarms',
    'KP': 'Known Planet - Previously known planets in TOI catalog',
    'APC': 'Ambiguous Planet Candidate - Uncertain classification',
    'FA': 'False Alarm - Clear false detections'
}

for category, description in tess_categories.items():
    if category in target_counts.index:
        count = target_counts[category]
        print(f"   • {category}: {description} ({count} objects)")

# Data types overview
print("\n=== DATA TYPES SUMMARY ===")
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Missing values analysis
print("\n=== MISSING VALUES ANALYSIS ===")
missing_total = df.isnull().sum().sum()
missing_pct = (missing_total / (df.shape[0] * df.shape[1])) * 100
print(f"Total missing values: {missing_total:,} ({missing_pct:.2f}%)")

# Columns with highest missing values
missing_by_col = df.isnull().sum().sort_values(ascending=False)
top_missing = missing_by_col[missing_by_col > 0].head(10)
if not top_missing.empty:
    print("\n📉 Top 10 columns with missing values:")
    for col, missing_count in top_missing.items():
        missing_pct_col = (missing_count / len(df)) * 100
        print(f"   {col}: {missing_count} ({missing_pct_col:.1f}%)")

# Display sample data
print("\n=== SAMPLE DATA (First 3 rows, selected columns) ===")
sample_cols = ['toi', 'tfopwg_disp', 'ra', 'dec', 'pl_orbper', 'pl_trandur', 'pl_rade', 'st_rad', 'st_teff']
available_cols = [col for col in sample_cols if col in df.columns]
print(df[available_cols].head(3))

=== TESS TOI DATASET OVERVIEW ===
📊 Dataset shape: 7703 rows × 87 columns
💾 Memory usage: 7.10 MB

=== TARGET DISTRIBUTION (TFOPWG_DISP) ===
tfopwg_disp
PC     4679
FP     1197
CP      684
KP      583
APC     462
FA       98
Name: count, dtype: int64

📈 Target percentages:
   PC: 4679 (60.7%)
   FP: 1197 (15.5%)
   CP: 684 (8.9%)
   KP: 583 (7.6%)
   APC: 462 (6.0%)
   FA: 98 (1.3%)

📋 TESS DISPOSITION CATEGORIES:
   • PC: Planet Candidate - Objects that pass initial vetting (4679 objects)
   • CP: Confirmed Planet - Objects confirmed as exoplanets (684 objects)
   • FP: False Positive - Objects determined to be false alarms (1197 objects)
   • KP: Known Planet - Previously known planets in TOI catalog (583 objects)
   • APC: Ambiguous Planet Candidate - Uncertain classification (462 objects)
   • FA: False Alarm - Clear false detections (98 objects)

=== DATA TYPES SUMMARY ===
float64    58
int64      24
object      5
Name: count, dtype: int64

=== MISSING VALUES ANALYSIS ===
Total mi

## 2. Target Variable Creation - 3-Class Classification

Convert the TESS disposition categories into a 3-class classification problem:
- **Candidate (0)**: Planet candidates needing follow-up (PC, APC)
- **Confirmed (1)**: Confirmed and known planets (CP, KP)  
- **False_Positive (2)**: False positives and false alarms (FP, FA)

In [3]:
# Create 3-class target variable
print("🎯 Creating 3-class target variable...")

# Map TESS dispositions to 3 classes
disposition_mapping = {
    'PC': 0,    # Planet Candidate
    'APC': 0,   # Ambiguous Planet Candidate
    'CP': 1,    # Confirmed Planet
    'KP': 1,    # Known Planet  
    'FP': 2,    # False Positive
    'FA': 2     # False Alarm
}

# Create target variable
df['target_3class'] = df['tfopwg_disp'].map(disposition_mapping)

# Check for any unmapped values
unmapped = df[df['target_3class'].isnull()]['tfopwg_disp'].unique()
if len(unmapped) > 0:
    print(f"⚠️ Unmapped dispositions found: {unmapped}")
    # Handle any edge cases here if needed
    
# Remove rows with missing target
initial_count = len(df)
df = df.dropna(subset=['target_3class'])
removed_count = initial_count - len(df)
if removed_count > 0:
    print(f"🗑️ Removed {removed_count} rows with missing target")

# Show original to 3-class mapping
print("\n📊 DISPOSITION TO 3-CLASS MAPPING:")
mapping_summary = df.groupby(['tfopwg_disp', 'target_3class']).size().reset_index(name='count')
class_names = ['Candidate', 'Confirmed', 'False_Positive']

for _, row in mapping_summary.iterrows():
    original_disp = row['tfopwg_disp']
    target_class = int(row['target_3class'])
    count = row['count']
    class_name = class_names[target_class]
    print(f"   {original_disp} → {target_class} ({class_name}): {count} objects")

# Final target distribution
print("\n📈 FINAL 3-CLASS TARGET DISTRIBUTION:")
target_counts = df['target_3class'].value_counts().sort_index()

for class_id, count in target_counts.items():
    class_name = class_names[int(class_id)]
    pct = count / len(df) * 100
    print(f"   {class_id} ({class_name}): {count} ({pct:.1f}%)")

# Class imbalance analysis
class_counts = target_counts.values
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"\n⚖️ Class imbalance ratio: {imbalance_ratio:.1f}:1")

# Create target mapping for reference
target_mapping = {
    'original_mapping': disposition_mapping,
    # Class name -> integer label, e.g. 'Candidate' -> 0
    'encoding': {name: label for label, name in enumerate(class_names)},
    'class_descriptions': {
        'Candidate': 'Planet candidates and ambiguous candidates requiring follow-up observation',
        'Confirmed': 'Confirmed exoplanets and known planets with high confidence',
        'False_Positive': 'False positives and false alarms determined to be non-planetary'
    }
}

print(f"\n✅ Target variable created successfully!")
print(f"📈 Final dataset shape: {df.shape}")

🎯 Creating 3-class target variable...

📊 DISPOSITION TO 3-CLASS MAPPING:
   APC → 0 (Candidate): 462 objects
   CP → 1 (Confirmed): 684 objects
   FA → 2 (False_Positive): 98 objects
   FP → 2 (False_Positive): 1197 objects
   KP → 1 (Confirmed): 583 objects
   PC → 0 (Candidate): 4679 objects

📈 FINAL 3-CLASS TARGET DISTRIBUTION:
   0 (Candidate): 5141 (66.7%)
   1 (Confirmed): 1267 (16.4%)
   2 (False_Positive): 1295 (16.8%)

⚖️ Class imbalance ratio: 4.1:1

✅ Target variable created successfully!
📈 Final dataset shape: (7703, 88)
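
The imbalance ratio reported above is the largest class count divided by the smallest: 5141 / 1267 ≈ 4.1. One common way to compensate during training is inverse-frequency class weights, n_samples / (n_classes × count). The sketch below applies that formula to the counts printed above; whether to use class weights at all is a modeling choice this notebook does not make.

```python
# Class counts taken from the 3-class distribution printed above.
class_counts = {"Candidate": 5141, "Confirmed": 1267, "False_Positive": 1295}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Imbalance ratio: largest class divided by smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Inverse-frequency ("balanced") weights: n_samples / (n_classes * count).
class_weights = {
    name: n_samples / (n_classes * count) for name, count in class_counts.items()
}

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
for name, weight in class_weights.items():
    print(f"{name}: {weight:.3f}")
```

This is the same formula scikit-learn applies when an estimator is given `class_weight='balanced'`.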


## 3. Feature Selection and Engineering

Select the features used for modeling:
- Remove non-predictive columns (IDs, names, coordinate strings, dates, comments)
- Keep the numerical features
- Flag any remaining string columns for handling in Section 4

In [4]:
print("🔧 Starting feature selection and engineering...")

# Columns to exclude from features (non-predictive)
exclude_columns = [
    # ID and name columns
    'rowid', 'toi', 'toipfx', 'tid', 'ctoi_alias', 'pl_pnum',
    
    # Target and disposition columns
    'tfopwg_disp', 'target_3class',
    
    # String identifier columns
    'rastr', 'decstr', 'toiurl',
    
    # Date/time columns (keep if needed for temporal analysis)
    'toi_created', 'rowupdate',
    
    # Notes and comments
    'tfopwg_tag', 'toi_tag'
]

# Get all column names
all_columns = df.columns.tolist()
print(f"📋 Total columns in dataset: {len(all_columns)}")

# Feature columns (excluding target and non-predictive)
feature_columns = [col for col in all_columns if col not in exclude_columns]
print(f"🎯 Feature columns selected: {len(feature_columns)}")

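# Display excluded columns
# Note: the header shows len(exclude_columns), i.e. every listed name, while the
# loop prints only names present in df, so the count and the list can differ.
print(f"\n🗑️ Excluded columns ({len(exclude_columns)}):")
for col in exclude_columns:
    if col in all_columns:
        print(f"   • {col}")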

# Analyze feature types
X_features = df[feature_columns].copy()

print(f"\n📊 FEATURE ANALYSIS:")
print(f"   Numerical features: {X_features.select_dtypes(include=[np.number]).shape[1]}")
print(f"   Object features: {X_features.select_dtypes(include=['object']).shape[1]}")

# Check for object/string columns that might need processing
object_cols = X_features.select_dtypes(include=['object']).columns.tolist()
if object_cols:
    print(f"\n🔤 Object columns requiring attention:")
    for col in object_cols:
        unique_values = X_features[col].nunique()
        print(f"   • {col}: {unique_values} unique values")
        if unique_values <= 10:  # Show sample values for categorical
            sample_values = X_features[col].dropna().unique()[:5]
            print(f"     Sample: {sample_values}")

print(f"\n✅ Feature selection completed!")
print(f"📈 Features shape: {X_features.shape}")

🔧 Starting feature selection and engineering...
📋 Total columns in dataset: 88
🎯 Feature columns selected: 76

🗑️ Excluded columns (15):
   • rowid
   • toi
   • toipfx
   • tid
   • ctoi_alias
   • pl_pnum
   • tfopwg_disp
   • target_3class
   • rastr
   • decstr
   • toi_created
   • rowupdate

📊 FEATURE ANALYSIS:
   Numerical features: 76
   Object features: 0

✅ Feature selection completed!
📈 Features shape: (7703, 76)
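
The output above reports 15 excluded columns but lists only 12, and the column count drops from 88 to 76. The difference arises because `len(exclude_columns)` counts every listed name, including names that are not in the DataFrame. A minimal sketch of splitting the list into present and absent names follows; the short column list stands in for `df.columns`, and `split_exclusions` is a hypothetical helper.

```python
def split_exclusions(all_columns, exclude_columns):
    """Return (present, absent) exclusion names, preserving list order."""
    available = set(all_columns)
    present = [col for col in exclude_columns if col in available]
    absent = [col for col in exclude_columns if col not in available]
    return present, absent

# Stand-in for df.columns.tolist(); the real table has 88 columns.
all_columns = ["rowid", "toi", "tfopwg_disp", "pl_orbper", "st_teff"]
exclude_columns = ["rowid", "toi", "tfopwg_disp", "toiurl", "toi_tag"]

present, absent = split_exclusions(all_columns, exclude_columns)
feature_columns = [col for col in all_columns if col not in set(present)]

print(f"Excluded (present): {present}")
print(f"Listed but not in data: {absent}")
print(f"Feature columns: {feature_columns}")
```

Reporting `len(present)` instead of the length of the full list keeps the printed count consistent with the names that follow it.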


## 4. Handle Object/Categorical Columns

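Encode or drop any object (string) columns that remain after feature selection: constant columns are dropped, columns with up to 20 unique values are integer-encoded by frequency rank, and higher-cardinality columns are dropped. In this run no object columns remain, so the cell makes no changes.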

In [5]:
print("🔤 Processing object/categorical columns...")

# Handle specific object columns in TESS dataset
X_processed = X_features.copy()

# Get object columns
object_cols = X_processed.select_dtypes(include=['object']).columns.tolist()

if object_cols:
    print(f"📋 Processing {len(object_cols)} object columns...")
    
    for col in object_cols:
        print(f"\n🔍 Processing {col}:")
        unique_vals = X_processed[col].nunique()
        print(f"   Unique values: {unique_vals}")
        
        if unique_vals <= 1:
            # Constant column - drop it
            print(f"   ❌ Dropping constant column")
            X_processed = X_processed.drop(columns=[col])
            
        elif unique_vals <= 20:
            # Categorical with few values - encode
            print(f"   🏷️ Categorical encoding")
            value_counts = X_processed[col].value_counts()
            print(f"   Values: {value_counts.to_dict()}")
            
            # Create numerical encoding
            encoding_map = {}
            for i, value in enumerate(value_counts.index):
                if pd.notna(value):
                    encoding_map[value] = i
            
            # Apply encoding
            X_processed[f'{col}_encoded'] = X_processed[col].map(encoding_map)
            X_processed = X_processed.drop(columns=[col])
            print(f"   ✅ Created {col}_encoded")
            
        else:
            # Too many unique values - likely need special handling or drop
            print(f"   ⚠️ Too many unique values ({unique_vals}) - dropping")
            X_processed = X_processed.drop(columns=[col])

# Check remaining object columns
remaining_objects = X_processed.select_dtypes(include=['object']).columns.tolist()
if remaining_objects:
    print(f"\n⚠️ Remaining object columns: {remaining_objects}")
    # Drop them for now
    X_processed = X_processed.drop(columns=remaining_objects)
    print(f"🗑️ Dropped remaining object columns")

print(f"\n✅ Object column processing completed!")
print(f"📈 Processed features shape: {X_processed.shape}")
print(f"🔢 All columns are now numerical: {X_processed.select_dtypes(include=[np.number]).shape[1] == X_processed.shape[1]}")

🔤 Processing object/categorical columns...

✅ Object column processing completed!
📈 Processed features shape: (7703, 76)
🔢 All columns are now numerical: True
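
In this run the cell changes nothing because no object columns survive Section 3. If a categorical column did reach it, the frequency-rank integer codes would impose an arbitrary order on the categories. One-hot encoding is a common alternative that avoids this; the sketch below uses `pd.get_dummies` on a made-up column (`obs_mode` is hypothetical, not a TOI field).

```python
import pandas as pd

# Made-up frame; 'obs_mode' is a hypothetical categorical column.
demo = pd.DataFrame({
    "pl_orbper": [1.5, 3.2, 10.1, 0.9],
    "obs_mode": ["short", "long", "short", None],
})

# One indicator column per category; dummy_na=True keeps missing values
# visible as their own indicator instead of encoding them as all zeros.
encoded = pd.get_dummies(demo, columns=["obs_mode"], dummy_na=True, dtype=int)

print(encoded.columns.tolist())
```

The trade-off is width: each category becomes its own column, which is why a cardinality cap like the cell's 20-value threshold is still useful.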


## 5. Missing Value Analysis and Handling

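Analyze missing values in the processed features, drop columns with more than 85% missing values, and fill the remaining gaps with the column median. The imputer is fit on the full dataset, before the train-test split in Section 6.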

In [6]:
print("🕳️ Analyzing and handling missing values...")

# Missing value analysis
missing_analysis = pd.DataFrame({
    'column': X_processed.columns,
    'missing_count': X_processed.isnull().sum(),
    'missing_pct': (X_processed.isnull().sum() / len(X_processed)) * 100
}).sort_values('missing_pct', ascending=False)

# Filter columns with missing values
missing_cols = missing_analysis[missing_analysis['missing_count'] > 0]

print(f"📊 Columns with missing values: {len(missing_cols)} out of {len(X_processed.columns)}")

if len(missing_cols) > 0:
    print("\n📉 Top 15 columns with missing values:")
    for _, row in missing_cols.head(15).iterrows():
        print(f"   {row['column']}: {row['missing_count']} ({row['missing_pct']:.1f}%)")
    
    # Strategy for handling missing values
    print("\n🛠️ Missing value handling strategy:")
    
    # 1. Drop columns with >85% missing values
    high_missing_cols = missing_cols[missing_cols['missing_pct'] > 85]['column'].tolist()
    if high_missing_cols:
        print(f"🗑️ Dropping {len(high_missing_cols)} columns with >85% missing values")
        X_processed = X_processed.drop(columns=high_missing_cols)
        for col in high_missing_cols[:5]:  # Show first 5
            missing_pct = missing_cols[missing_cols['column'] == col]['missing_pct'].iloc[0]
            print(f"   • {col} ({missing_pct:.1f}% missing)")
        if len(high_missing_cols) > 5:
            print(f"   ... and {len(high_missing_cols) - 5} more")
    
    # 2. Impute remaining missing values
    remaining_missing = X_processed.isnull().sum().sum()
    if remaining_missing > 0:
        print(f"\n🔧 Imputing {remaining_missing:,} remaining missing values...")
        
        # Use median imputation for numerical features
        imputer = SimpleImputer(strategy='median')
        X_processed_imputed = pd.DataFrame(
            imputer.fit_transform(X_processed),
            columns=X_processed.columns,
            index=X_processed.index
        )
        
        # Verify no missing values remain
        final_missing = X_processed_imputed.isnull().sum().sum()
        print(f"✅ Missing values after imputation: {final_missing}")
        
        X_processed = X_processed_imputed
    
else:
    print("✅ No missing values found!")

print(f"\n📈 Final processed features shape: {X_processed.shape}")
print(f"🎯 Ready for model training!")

🕳️ Analyzing and handling missing values...
📊 Columns with missing values: 48 out of 76

📉 Top 15 columns with missing values:
   pl_insolsymerr: 7703 (100.0%)
   raerr2: 7703 (100.0%)
   decerr1: 7703 (100.0%)
   decerr2: 7703 (100.0%)
   pl_eqtsymerr: 7703 (100.0%)
   pl_eqtlim: 7703 (100.0%)
   pl_eqterr2: 7703 (100.0%)
   pl_eqterr1: 7703 (100.0%)
   pl_insollim: 7703 (100.0%)
   pl_insolerr2: 7703 (100.0%)
   pl_insolerr1: 7703 (100.0%)
   raerr1: 7703 (100.0%)
   st_loggerr1: 2271 (29.5%)
   st_loggerr2: 2271 (29.5%)
   st_raderr2: 1963 (25.5%)

🛠️ Missing value handling strategy:
🗑️ Dropping 12 columns with >85% missing values
   • pl_insolsymerr (100.0% missing)
   • raerr2 (100.0% missing)
   • decerr1 (100.0% missing)
   • decerr2 (100.0% missing)
   • pl_eqtsymerr (100.0% missing)
   ... and 7 more

🔧 Imputing 18,577 remaining missing values...
✅ Missing values after imputation: 0

📈 Final processed features shape: (7703, 64)
🎯 Ready for model training!
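
The cell above fits the median imputer on all 7,703 rows before the train-test split in Section 6, so test-set values contribute to the medians used for training. A common alternative is to split first and fit the imputer on the training rows only. The sketch below shows that ordering on synthetic data; it is an alternative arrangement, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_processed / y (values are random, not TESS data).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
X.iloc[::7, 1] = np.nan           # punch holes in column 'b'
y = pd.Series(np.tile([0, 0, 1, 2], 25))

# 1. Split first, stratifying on the target as in Section 6.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit the imputer on training rows only, then apply it to both sets.
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_imp.isna().sum().sum(), X_test_imp.isna().sum().sum())
```

Persisting the fitted imputer next to the scaler (for example with `joblib.dump`) would let new data be transformed the same way at prediction time.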


## 6. Data Scaling and Train-Test Split

Split the data 80/20 with stratification on the target, then fit `StandardScaler` on the training set and apply it to both sets.

In [7]:
print("⚖️ Splitting data and applying scaling...")

# Prepare final dataset
X_final = X_processed.copy()
y_final = df['target_3class'].copy()

print(f"📊 Final dataset shape: X={X_final.shape}, y={y_final.shape}")

# Verify alignment
assert len(X_final) == len(y_final), "X and y must have same number of samples"
print("✅ X and y are properly aligned")

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final,
    test_size=0.2,
    random_state=42,
    stratify=y_final
)

print(f"\n📊 TRAIN-TEST SPLIT RESULTS:")
print(f"   Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X_final)*100:.1f}%)")
print(f"   Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X_final)*100:.1f}%)")

# Check class distribution in splits
print(f"\n📈 Class distribution in splits:")
class_names = ['Candidate', 'Confirmed', 'False_Positive']

for split_name, y_split in [('Train', y_train), ('Test', y_test)]:
    print(f"   {split_name}:")
    split_counts = y_split.value_counts().sort_index()
    for class_id, count in split_counts.items():
        class_name = class_names[int(class_id)]
        pct = count / len(y_split) * 100
        print(f"     {class_id} ({class_name}): {count} ({pct:.1f}%)")

# Apply scaling
print(f"\n🔧 Applying StandardScaler...")
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

print(f"✅ Scaling completed!")
print(f"📊 Scaled features - Train: {X_train_scaled.shape}, Test: {X_test_scaled.shape}")

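# Verify scaling
# Both figures are averages over columns. A column that is constant in the
# training set scales to all zeros (std 0), which pulls the average std below 1.
print(f"\n📏 Scaling verification (training set):")
print(f"   Mean: {X_train_scaled.mean().mean():.6f} (should be ~0)")
print(f"   Std: {X_train_scaled.std().mean():.6f} (should be ~1)")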

⚖️ Splitting data and applying scaling...
📊 Final dataset shape: X=(7703, 64), y=(7703,)
✅ X and y are properly aligned

📊 TRAIN-TEST SPLIT RESULTS:
   Training set: 6162 samples (80.0%)
   Test set: 1541 samples (20.0%)

📈 Class distribution in splits:
   Train:
     0 (Candidate): 4112 (66.7%)
     1 (Confirmed): 1014 (16.5%)
     2 (False_Positive): 1036 (16.8%)
   Test:
     0 (Candidate): 1029 (66.8%)
     1 (Confirmed): 253 (16.4%)
     2 (False_Positive): 259 (16.8%)

🔧 Applying StandardScaler...
✅ Scaling completed!
📊 Scaled features - Train: (6162, 64), Test: (1541, 64)

📏 Scaling verification (training set):
   Mean: -0.000000 (should be ~0)
   Std: 0.625051 (should be ~1)
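
The mean per-column standard deviation above is 0.625 rather than ~1. A likely cause (an inference from the output, not something the notebook verifies) is that some columns are constant in the training set: `StandardScaler` maps a zero-variance column to all zeros, so its standard deviation stays 0 and pulls the average down. The sketch below reproduces the effect on toy data and shows one way to list such columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: one varying column, one constant column (e.g. a flag that is always 0).
toy = pd.DataFrame({
    "varying": [1.0, 2.0, 3.0, 4.0],
    "constant": [0.0, 0.0, 0.0, 0.0],
})

scaled = StandardScaler().fit_transform(toy)

# Population std (ddof=0) matches what StandardScaler normalizes by.
per_column_std = scaled.std(axis=0)
print(per_column_std, per_column_std.mean())

# Columns with a single unique value carry no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```

The notebook's check uses pandas `.std()`, which defaults to `ddof=1`, so varying columns report a value slightly above 1; that small difference does not account for 0.625. Running the `nunique` check on `X_train` would confirm or rule out constant columns.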


## 7. Export Processed Data

Save the preprocessed data for model training.
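
The export cell in this section writes the class distribution to `metadata.json` as a dict keyed by integer class id. JSON object keys are always strings, so those ids come back as `"0"`, `"1"`, `"2"` when the file is read. A small round-trip sketch, using the class counts reported earlier in this notebook:

```python
import json

# Class distribution as built in the export cell (integer keys).
class_distribution = {0: 5141, 1: 1267, 2: 1295}

# json.dumps converts integer keys to strings.
serialized = json.dumps({"class_distribution": class_distribution}, indent=2)
restored = json.loads(serialized)["class_distribution"]
print(restored)

# Convert keys back to int when the ids are needed as numbers.
restored_int = {int(class_id): count for class_id, count in restored.items()}
print(restored_int == class_distribution)
```

Any downstream code that looks up a class by integer id in the loaded metadata needs this conversion, or it should index with the string form.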

In [8]:
print("💾 Exporting processed data...")

# Create output directory
output_dir = 'tess_3class'
os.makedirs(output_dir, exist_ok=True)
print(f"📂 Output directory: {output_dir}/")

# Save train-test splits
print("💾 Saving train-test splits...")
X_train_scaled.to_csv(f'{output_dir}/X_train_scaled.csv', index=False)
X_test_scaled.to_csv(f'{output_dir}/X_test_scaled.csv', index=False)
y_train.to_csv(f'{output_dir}/y_train.csv', index=False)
y_test.to_csv(f'{output_dir}/y_test.csv', index=False)

# Save full processed dataset
print("💾 Saving full processed dataset...")
X_final.to_csv(f'{output_dir}/X_final_cleaned.csv', index=False)
y_final.to_csv(f'{output_dir}/y_final_cleaned.csv', index=False)

# Save scaler
print("💾 Saving scaler...")
joblib.dump(scaler, f'{output_dir}/scaler.joblib')

# Save target mapping
print("💾 Saving target mapping...")
with open(f'{output_dir}/target_mapping.json', 'w') as f:
    json.dump(target_mapping, f, indent=2)

# Create metadata
metadata = {
    'dataset_name': 'TESS TOI 3-Class',
    # Shape of df at export time; includes the added target_3class column
    'original_shape': df.shape,
    'final_samples': len(X_final),
    'features': len(X_final.columns),
    'target_classes': 3,
    'class_distribution': y_final.value_counts().sort_index().to_dict(),
    'train_samples': len(X_train),
    'test_samples': len(X_test),
    'feature_names': X_final.columns.tolist(),
    # Listed in the order the steps are actually applied
    'preprocessing_steps': [
        'Target creation (3-class from TFOPWG_DISP)',
        'Feature selection (ID, string, and date columns excluded)',
        'Object column processing and encoding',
        'Columns with >85% missing values dropped',
        'Median imputation of remaining missing values',
        'Stratified train-test split (80/20)',
        'Standard scaling (fit on training set)'
    ],
    'preprocessing_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
    'class_descriptions': target_mapping['class_descriptions'],
    'original_dispositions': disposition_mapping
}

# Save metadata
print("💾 Saving metadata...")
with open(f'{output_dir}/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

# Summary of exported files
print(f"\n✅ PREPROCESSING COMPLETED SUCCESSFULLY!")
print(f"📂 All files saved to: {output_dir}/")
print(f"\n📋 Files created:")
print(f"   • X_train_scaled.csv - Scaled training features ({X_train_scaled.shape})")
print(f"   • X_test_scaled.csv - Scaled test features ({X_test_scaled.shape})")
print(f"   • y_train.csv - Training targets ({len(y_train)} samples)")
print(f"   • y_test.csv - Test targets ({len(y_test)} samples)")
print(f"   • X_final_cleaned.csv - Full feature matrix ({X_final.shape})")
print(f"   • y_final_cleaned.csv - Full target vector ({len(y_final)} samples)")
print(f"   • scaler.joblib - Fitted StandardScaler")
print(f"   • target_mapping.json - Class mappings and descriptions")
print(f"   • metadata.json - Complete preprocessing metadata")

print(f"\n🎯 READY FOR MODEL TRAINING!")
print(f"📊 Dataset: {len(X_final)} samples, {len(X_final.columns)} features, 3 classes")
print(f"🏆 Class distribution: {dict(y_final.value_counts().sort_index())}")
print(f"⚖️ Class imbalance ratio: {y_final.value_counts().max() / y_final.value_counts().min():.1f}:1")

# TESS-specific information
print(f"\n🌟 TESS-SPECIFIC INFO:")
print(f"📡 Source: TESS Objects of Interest (TOI) Catalog")
print(f"🎯 3-Class Mapping:")
print(f"   • Candidate (0): PC, APC - Planet candidates needing follow-up")
print(f"   • Confirmed (1): CP, KP - Confirmed and known planets")
print(f"   • False_Positive (2): FP, FA - False positives and false alarms")

💾 Exporting processed data...
📂 Output directory: tess_3class/
💾 Saving train-test splits...
💾 Saving full processed dataset...
💾 Saving scaler...
💾 Saving target mapping...
💾 Saving metadata...

✅ PREPROCESSING COMPLETED SUCCESSFULLY!
📂 All files saved to: tess_3class/

📋 Files created:
   • X_train_scaled.csv - Scaled training features ((6162, 64))
   • X_test_scaled.csv - Scaled test features ((1541, 64))
   • y_train.csv - Training targets (6162 samples)
   • y_test.csv - Test targets (1541 samples)
   • X_final_cleaned.csv - Full feature matrix ((7703, 64))
   • y_final_cleaned.csv - Full target vector (7703 samples)
   • scaler.joblib - Fitted StandardScaler
   • target_mapping.json - Class mappings and descriptions
   • metadata.json - Complete preprocessing metadata

🎯 READY FOR MODEL TRAINING!
📊 Dataset: 7703 samples, 64 features, 3 classes
🏆 Class distribution: {0: 5141, 1: 1267, 2: 1295}
⚖️ Class imbalance ratio: 4.1:1

🌟 TESS-SPECIFIC INFO:
📡 Source: TESS Objects of Interes