# Missing Data Imputation

## Objective
This notebook imputes missing values in the **30 approved features** of the dataset.
The goal is to handle missing data responsibly while preserving statistical meaning and minimizing bias.

## Why Imputation Matters
- Many ML models cannot handle missing values directly
- Improper imputation can introduce bias
- Different features require different imputation strategies

This notebook applies **several imputation techniques** (median, mode, and iterative), justifies each choice, and documents the observations.
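As a small illustration of the first point, the sketch below fits a scikit-learn `LinearRegression` on a toy array containing one `NaN`. The array and its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): one feature value is missing.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as exc:
    # scikit-learn validates its input and rejects NaN for this estimator.
    fit_failed = True
    print(f"fit failed: {exc}")
```

Some estimators do accept `NaN` natively, so the failure shown here applies to this estimator rather than to every model.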


In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
# IterativeImputer is experimental: this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# Load dataset
df = pd.read_csv("../../../data/full_data.csv")

df.shape


(4048, 112)

In [3]:
approved_columns = [
    "P_MASS", "P_RADIUS", "P_DENSITY", "P_GRAVITY", "P_ESCAPE", "P_TYPE",
    "P_PERIOD", "P_SEMI_MAJOR_AXIS", "P_ECCENTRICITY", "P_INCLINATION",
    "P_OMEGA", "P_PERIASTRON", "P_APASTRON", "P_IMPACT_PARAMETER",
    "P_HILL_SPHERE",
    "S_MASS", "S_RADIUS", "S_LUMINOSITY", "S_TEMPERATURE", "S_AGE",
    "S_METALLICITY", "S_LOG_G", "S_TYPE", "S_MAG", "S_DISC",
    "S_MAGNETIC_FIELD",
    "S_SNOW_LINE", "S_TIDAL_LOCK", "P_DETECTION", "P_DISTANCE"
]
df = df[approved_columns]
df.head()

Unnamed: 0,P_MASS,P_RADIUS,P_DENSITY,P_GRAVITY,P_ESCAPE,P_TYPE,P_PERIOD,P_SEMI_MAJOR_AXIS,P_ECCENTRICITY,P_INCLINATION,...,S_METALLICITY,S_LOG_G,S_TYPE,S_MAG,S_DISC,S_MAGNETIC_FIELD,S_SNOW_LINE,S_TIDAL_LOCK,P_DETECTION,P_DISTANCE
0,6165.8633,,,,,Jovian,326.03,1.29,0.231,,...,-0.35,2.31,K0 III,4.74,,,34.529063,0.6424,Radial Velocity,1.324418
1,4684.7848,,,,,Jovian,516.21997,1.53,0.08,,...,-0.02,1.93,K4 III,5.016,,,42.732816,0.648683,Radial Velocity,1.534896
2,1525.5744,,,,,Jovian,185.84,0.83,0.0,,...,-0.24,2.63,G8 III,5.227,,,20.593611,0.60001,Radial Velocity,0.83
3,1481.0785,,,,,Jovian,1773.4,2.93,0.37,,...,0.41,4.45,K0 V,6.61,,,2.141648,0.445415,Radial Velocity,3.130558
4,565.73385,,,,,Jovian,798.5,1.66,0.68,,...,0.06,4.36,G2.5 V,6.25,,,3.019411,0.473325,Radial Velocity,2.043792


## Missing Data Analysis
Before imputation, we examine:
- Percentage of missing values per feature
- Features with extremely high missingness


In [4]:
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent = missing_percent.sort_values(ascending=False)

missing_percent


Unnamed: 0,0
S_MAGNETIC_FIELD,100.0
S_DISC,100.0
P_ESCAPE,82.559289
P_DENSITY,82.559289
P_GRAVITY,82.559289
P_OMEGA,81.571146
P_INCLINATION,79.150198
S_TYPE,66.156126
P_ECCENTRICITY,65.909091
P_IMPACT_PARAMETER,65.192688


## Imputation Strategy

Instead of applying a single imputation method to all features,
we divide columns into categories:

1. **High Missingness (>60%)**
   - Dropped due to insufficient information

2. **Numerical Continuous Features**
   - Median imputation (robust to outliers)

3. **Categorical / Discrete Features**
   - Mode imputation

4. **Strongly Correlated Numerical Features**
   - Multivariate imputation using IterativeImputer
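The first three categories above can be sketched as a small helper that labels each column. The function name `assign_strategy`, its `drop_threshold` parameter, and the toy frame are hypothetical and only illustrate the rule; the fourth category is left out because it depends on a correlation analysis that a simple rule cannot capture.

```python
import numpy as np
import pandas as pd


def assign_strategy(frame: pd.DataFrame, drop_threshold: float = 60.0) -> pd.Series:
    """Label each column as 'drop', 'median', or 'mode' (hypothetical helper)."""
    missing_pct = frame.isnull().mean() * 100
    strategies = {}
    for col in frame.columns:
        if missing_pct[col] > drop_threshold:
            strategies[col] = "drop"      # too little information to impute
        elif pd.api.types.is_numeric_dtype(frame[col]):
            strategies[col] = "median"    # numerical feature
        else:
            strategies[col] = "mode"      # categorical feature
    return pd.Series(strategies, name="strategy")


# Toy frame (illustrative only): 75%, 25%, and 25% missing respectively.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "numeric": [1.0, np.nan, 3.0, 4.0],
    "category": ["a", None, "a", "b"],
})
plan = assign_strategy(toy)
print(plan)
```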


In [5]:
drop_cols = missing_percent[missing_percent > 60].index.tolist()

df_dropped = df.drop(columns=drop_cols)

print(f"Dropped columns: {drop_cols}")


Dropped columns: ['S_MAGNETIC_FIELD', 'S_DISC', 'P_ESCAPE', 'P_DENSITY', 'P_GRAVITY', 'P_OMEGA', 'P_INCLINATION', 'S_TYPE', 'P_ECCENTRICITY', 'P_IMPACT_PARAMETER', 'P_HILL_SPHERE', 'P_MASS']


### Observation
The 12 columns with more than 60% missing values were removed, since imputing them
would mean inventing most of their values. Two of them (S_MAGNETIC_FIELD and S_DISC)
are entirely empty and carry no information at all.


In [6]:
num_cols = df_dropped.select_dtypes(include=[np.number]).columns
cat_cols = df_dropped.select_dtypes(exclude=[np.number]).columns

num_cols, cat_cols


(Index(['P_RADIUS', 'P_PERIOD', 'P_SEMI_MAJOR_AXIS', 'P_PERIASTRON',
        'P_APASTRON', 'S_MASS', 'S_RADIUS', 'S_LUMINOSITY', 'S_TEMPERATURE',
        'S_AGE', 'S_METALLICITY', 'S_LOG_G', 'S_MAG', 'S_SNOW_LINE',
        'S_TIDAL_LOCK', 'P_DISTANCE'],
       dtype='object'),
 Index(['P_TYPE', 'P_DETECTION'], dtype='object'))

In [7]:
median_imputer = SimpleImputer(strategy="median")
df_dropped[num_cols] = median_imputer.fit_transform(df_dropped[num_cols])


### Why Median?
- Robust against skewed distributions
- Less sensitive to outliers than mean
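A minimal sketch of the difference, using an invented column with one extreme value (the numbers are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (illustrative only): three small values, one outlier, one gap.
values = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

mean_fill = mean_filled[-1, 0]      # 251.5, pulled up by the outlier
median_fill = median_filled[-1, 0]  # 2.5, unaffected by the outlier
print(mean_fill, median_fill)
```

The mean-imputed value lands far from every typical observation, while the median stays among them.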


In [9]:
mode_imputer = SimpleImputer(strategy="most_frequent")
df_dropped[cat_cols] = mode_imputer.fit_transform(df_dropped[cat_cols])


### Why Mode?
- Preserves the most common category
- Avoids introducing artificial categories
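The sketch below shows mode imputation on a toy categorical column. The label "Jovian" appears in the dataset preview; "Terran" and the row counts are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (illustrative only): "Jovian" is the most frequent label.
toy = pd.DataFrame(
    {"P_TYPE": ["Jovian", "Jovian", "Terran", np.nan]}, dtype=object
)

mode_imputer = SimpleImputer(strategy="most_frequent")
filled = mode_imputer.fit_transform(toy)

imputed_label = filled[-1, 0]     # the gap receives the most frequent label
labels_after = set(filled[:, 0])  # no new category is introduced
print(imputed_label, labels_after)
```

One side effect worth keeping in mind: the majority class grows from 2 of 3 observed rows to 3 of 4 rows, so mode imputation shifts class proportions toward the most common category.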


In [10]:
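iterative_cols = num_cols  # or a subset with strong correlations

# NOTE: num_cols were already median-filled in an earlier cell, so no NaN
# remains here and IterativeImputer returns these values unchanged.
# For this step to have an effect, run it on the numerical columns
# before they are median-imputed.
iter_imputer = IterativeImputer(
    max_iter=10,
    random_state=42
)

df_dropped[iterative_cols] = iter_imputer.fit_transform(
    df_dropped[iterative_cols]
)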


### Iterative Imputation Insight
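IterativeImputer models each feature that has missing values as a function of the
other features and refines its estimates in a round-robin fashion over several iterations.

When features are strongly correlated, this can produce more realistic values
than univariate imputation, which fills every gap in a column with the same constant.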
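To make this concrete, here is a minimal sketch on synthetic data in which one column is roughly twice the other. The data, the masking scheme, and the error metric are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Synthetic data (illustrative only): y is about 2 * x plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.1, size=200)
data = np.column_stack([x, y])

# Hide 40 values of y so the imputed values can be compared with the truth.
masked = data.copy()
missing_rows = rng.choice(200, size=40, replace=False)
masked[missing_rows, 1] = np.nan

iter_filled = IterativeImputer(max_iter=10, random_state=42).fit_transform(masked)
median_filled = SimpleImputer(strategy="median").fit_transform(masked)

truth = data[missing_rows, 1]
iter_mae = np.abs(iter_filled[missing_rows, 1] - truth).mean()
median_mae = np.abs(median_filled[missing_rows, 1] - truth).mean()
print(f"iterative MAE: {iter_mae:.3f}, median MAE: {median_mae:.3f}")

# An input without NaN passes through unchanged: only missing entries are estimated.
unchanged = IterativeImputer(max_iter=10, random_state=42).fit_transform(data)
passes_through = np.allclose(unchanged, data)
```

On this toy data the iterative estimate tracks the linear relationship, while the median assigns one constant to every gap. The size of the gain on real data depends on how strong the correlations are.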


In [11]:
df_dropped.isnull().sum().sum()


np.int64(0)

In [12]:
df_dropped.describe()


Unnamed: 0,P_RADIUS,P_PERIOD,P_SEMI_MAJOR_AXIS,P_PERIASTRON,P_APASTRON,S_MASS,S_RADIUS,S_LUMINOSITY,S_TEMPERATURE,S_AGE,S_METALLICITY,S_LOG_G,S_MAG,S_SNOW_LINE,S_TIDAL_LOCK,P_DISTANCE
count,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0,4048.0
mean,3.77381,2246.911,3.923461,3.819192,4.068383,0.997443,1.491608,5.470347,5495.668906,-1061006.0,0.015802,4.354158,12.753432,3.4522,0.441667,3.979463
std,4.277302,115104.8,61.851002,61.802309,61.978517,0.588114,3.559151,47.63246,1719.391442,47727750.0,0.147916,0.494073,2.896768,5.288483,0.066955,61.895805
min,0.3363,0.09070629,0.0044,0.004136,0.004664,0.01,0.01,7.933139e-07,575.0,-2147484000.0,-0.89,-4.995,0.85,0.002405,0.030707,0.004408
25%,1.75997,4.610251,0.094,0.052,0.055096,0.86,0.81,0.4432157,5068.0,4.07,-0.02,4.31,11.68875,1.79751,0.431567,0.054
50%,2.33168,11.87053,0.118,0.1,0.10465,0.97,0.97,0.9050349,5598.0,4.07,0.02,4.44,13.701,2.5686,0.448357,0.103
75%,2.93702,40.11097,0.154,0.240605,0.261343,1.08,1.22,1.760674,5904.25,4.07,0.05,4.54,14.866,3.582641,0.462869,0.258561
max,77.349,7300000.0,2500.0,2500.0,2500.0,23.56,71.23,1486.896,57000.0,14.9,0.69,5.52,20.15,104.11278,1.322542,2500.0
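A summary like this is also a chance to check value ranges. The S_AGE minimum above (about -2.147e9) is close to -2^31, which may indicate a placeholder rather than a measurement. That reading is a guess and is not confirmed by the data source. The sketch below counts values outside a plausible range; the helper name `flag_out_of_range`, the bounds, and the toy frame are hypothetical.

```python
import pandas as pd


def flag_out_of_range(frame: pd.DataFrame, bounds: dict) -> pd.Series:
    """Count values outside [low, high] per column (hypothetical helper)."""
    report = {}
    for col, (low, high) in bounds.items():
        outside = (frame[col] < low) | (frame[col] > high)
        report[col] = int(outside.sum())
    return pd.Series(report, name="n_out_of_range")


# Toy frame (illustrative only) with one sentinel-like negative age.
toy = pd.DataFrame({"S_AGE": [4.07, 4.07, -2147483648.0, 14.9]})

# Assumed bounds: a non-negative age, capped at the maximum seen in the summary.
bounds = {"S_AGE": (0.0, 14.9)}

report = flag_out_of_range(toy, bounds)
print(report)
```

If such values turn out to be placeholders, converting them to `NaN` before imputation would keep them from distorting the mean and standard deviation.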


## Final Observations

- Missing values were handled using feature-specific strategies
- High-missing columns were removed to prevent noise
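- Median and mode imputation filled the remaining gaps, at the cost of reduced variance in heavily imputed columns (the S_AGE quartiles are all 4.07 after imputation)
- Iterative imputation ran after median imputation, so no missing values were left for it to estimate; to leverage feature relationships, it needs to run on the numerical columns before they are median-filled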

## Conclusion
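A hybrid imputation strategy matches each feature to a method suited to its
type and degree of missingness, which a single-method approach cannot do.
The resulting dataset contains no missing values and can be passed to
downstream machine learning and statistical analysis.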
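One possible way to package the median and mode steps as a single reusable object is scikit-learn's `ColumnTransformer`. This is a sketch under assumptions rather than the notebook's actual pipeline: the toy frame and its values are invented, and only the column names `S_MASS` and `P_TYPE` come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame (illustrative only): one numerical and one categorical column.
toy = pd.DataFrame({
    "S_MASS": [0.97, np.nan, 1.08, 0.86],
    "P_TYPE": pd.Series(["Jovian", np.nan, "Jovian", "Terran"], dtype=object),
})
num_cols = ["S_MASS"]
cat_cols = ["P_TYPE"]

# Each column group is routed to its own imputer.
hybrid = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# The output columns follow the transformer order: numerical, then categorical.
filled = pd.DataFrame(hybrid.fit_transform(toy), columns=num_cols + cat_cols)
print(filled)
```

Keeping both imputers in one fitted object makes it straightforward to apply the same fill values to new data with `hybrid.transform`.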
