## Objective
The goal of this notebook is to handle missing values in the approved dataset by:

- Identifying missing-value patterns  
- Removing non-imputable columns  
- Applying KNN Imputation for numerical features  
- Applying appropriate imputation for categorical features  
- Producing a clean dataset


In [1]:
import pandas as pd
import numpy as np


pandas is used for data manipulation and analysis.
NumPy is used for numerical operations and for representing missing values as `NaN`.


In [9]:
df = pd.read_csv("../../datasets/full_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4048 entries, 0 to 4047
Columns: 112 entries, P_NAME to P_SEMI_MAJOR_AXIS_EST
dtypes: float64(94), int64(4), object(14)
memory usage: 3.5+ MB


In [10]:
missing_summary = pd.DataFrame({
    "Missing_Count": df.isnull().sum(),
    "Missing_Percentage": df.isnull().mean() * 100
}).sort_values(by="Missing_Percentage", ascending=False)

missing_summary.head(30)


| Column | Missing_Count | Missing_Percentage |
|---|---|---|
| P_ATMOSPHERE | 4048 | 100.0 |
| P_ALT_NAMES | 4048 | 100.0 |
| P_DETECTION_RADIUS | 4048 | 100.0 |
| P_GEO_ALBEDO | 4048 | 100.0 |
| P_DETECTION_MASS | 4048 | 100.0 |
| S_MAGNETIC_FIELD | 4048 | 100.0 |
| S_DISC | 4048 | 100.0 |
| P_TEMP_MEASURED | 4043 | 99.876482 |
| P_GEO_ALBEDO_ERROR_MIN | 4043 | 99.876482 |
| P_GEO_ALBEDO_ERROR_MAX | 4043 | 99.876482 |
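The objective lists "removing non-imputable columns", and the summary above is the natural input for that step. The following is a minimal sketch of one way to flag such columns. The toy DataFrame, its column names, and the `MISSING_THRESHOLD` cutoff are all assumptions made for illustration; the notebook does not state which threshold it uses.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "fully_empty": [np.nan, np.nan, np.nan, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, the same quantity as
# Missing_Percentage / 100 in the summary table.
missing_frac = toy.isnull().mean()

# Hypothetical cutoff: columns above it are treated as non-imputable.
MISSING_THRESHOLD = 0.7

non_imputable = missing_frac[missing_frac > MISSING_THRESHOLD].index.tolist()
print(non_imputable)
```

Applied to the real data, the same filter on `missing_summary` would return the names to pass to `df.drop(columns=...)`. The right cutoff is a judgment call that depends on how much the remaining values can be trusted.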


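The summary shows seven columns with no observed values at all, and the next three listed are missing in 4,043 of 4,048 rows.
Both numerical and categorical features contain missing values, so the two groups are handled separately, each with an imputation technique suited to its type.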


In [14]:

num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

len(num_cols), len(cat_cols)


(98, 14)

After separating the features based on data type, we observe that the dataset
contains **98 numerical columns** and **14 categorical columns**.
Since these feature types have different characteristics, they are imputed
using different strategies.
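For the categorical side, the notebook does not say which strategy is "appropriate". One common choice is to fill each column with its most frequent value (the mode). The sketch below assumes that strategy and uses a toy DataFrame with hypothetical column names, so it illustrates the idea rather than the notebook's actual method.

```python
import numpy as np
import pandas as pd

# Toy categorical data; column names and values are hypothetical.
toy = pd.DataFrame({
    "planet_type": ["Jovian", np.nan, "Jovian", "Terran"],
    "star_type": [np.nan, "G", "K", "G"],
})

for col in toy.columns:
    # mode() ignores NaN by default and may return several values on a tie;
    # here the first one is taken.
    modes = toy[col].mode(dropna=True)
    if not modes.empty:  # skip columns that have no observed values
        toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

Mode imputation keeps every row but inflates the share of the dominant category. When missingness itself might be informative, filling with an explicit placeholder such as `"Unknown"` is an alternative.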


In [13]:
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)

df[num_cols] = pd.DataFrame(
    knn_imputer.fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index
)

df[num_cols].isnull().sum().sum()


ValueError: Shape of passed values is (4048, 91), indices imply (4048, 98)
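This error is consistent with how `KNNImputer` behaves by default: features that have no observed values during `fit` are dropped from the output of `transform`. The result has 91 columns instead of 98, and the difference of 7 matches the seven columns that are 100% missing in the summary. A likely fix is to remove the fully empty columns before imputing. The sketch below shows this on a toy DataFrame; the data and the `_toy` and `imputable_cols` names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data with one column that has no observed values.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, np.nan, 50.0],
    "c": [1.0, 1.0, 2.0, 2.0, 3.0],
    "all_missing": [np.nan] * 5,
})
num_cols_toy = list(toy.columns)

# Keep only columns with at least one observed value; the rest cannot be
# imputed from neighbours and would be dropped by KNNImputer anyway.
imputable_cols = [c for c in num_cols_toy if toy[c].notna().any()]
empty_cols = [c for c in num_cols_toy if c not in imputable_cols]

imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(toy[imputable_cols])

# Drop the empty columns, then write the imputed values back.
# Input and output shapes now agree, so the assignment succeeds.
toy = toy.drop(columns=empty_cols)
toy[imputable_cols] = imputed

print(toy.isnull().sum().sum())
```

In the notebook itself, the equivalent change would be to drop the fully empty columns from `df` and recompute `num_cols` before calling `fit_transform`. Recent scikit-learn versions also offer a `keep_empty_features` parameter on the imputer, but dropping the columns matches the stated objective more directly.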