# **Problem Statement**  
## **21. Handle missing values using sklearn’s SimpleImputer.**

Handle missing values in a dataset using sklearn’s SimpleImputer and compare it with a manual (brute-force) imputation approach.

### Constraints & Example Inputs/Outputs

### Constraints
- Dataset may contain:
    - Numerical missing values (NaN)
    - Categorical missing values
- No row deletion
- Use:
    - Mean / Median for numerical features
    - Most frequent value for categorical features

### Example Input:
```text
Age   Salary   City
25    50000    Delhi
NaN   60000    Mumbai
30    NaN      NaN
40    80000    Delhi
```

### Expected Output:
- No missing values
- Numerical values imputed correctly
- Categorical values filled with most frequent value

### Solution Approach

### Why Handle Missing Values?
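- Many ML models, including most scikit-learn estimators, reject inputs that contain NaNs
- Missing data can bias training if it is ignored or handled inconsistently
- Imputation preserves dataset size, unlike dropping incomplete rows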
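
As a rough illustration of the first and third points, the sketch below fits a `LinearRegression` on a column that contains a NaN and then counts the rows that survive `dropna`. The target values passed to `fit` are hypothetical placeholders, chosen only so the call has something to fit against.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 40],
    "Salary": [50000, 60000, np.nan, 80000],
})

# Fitting directly on data with NaNs is rejected by this estimator.
# The target [1.0, 2.0, 3.0, 4.0] is a hypothetical placeholder.
try:
    LinearRegression().fit(df[["Age"]], [1.0, 2.0, 3.0, 4.0])
    fit_failed = False
except ValueError:
    fit_failed = True
print("fit rejected NaN input:", fit_failed)

# Dropping incomplete rows avoids the error but shrinks the dataset.
rows_after_drop = len(df.dropna())
print(len(df), "rows before,", rows_after_drop, "rows after dropna")
```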

### Two Approaches

**Brute Force (Manual)**
- Loop over columns
- Compute mean / mode manually
- Replace missing values

**Optimized (Best Practice)**
- Use sklearn.impute.SimpleImputer
- Learns fill values in fit and reuses them in transform, so training and test data get the same statistics
- Works inside scikit-learn pipelines
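
To make the pipeline point concrete, here is a minimal sketch that routes numerical and categorical columns to separate imputers with `ColumnTransformer`. The one-hot encoding step and the step names (`"impute"`, `"encode"`, `"num"`, `"cat"`) are assumptions added for illustration; they are not part of the solution in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 40],
    "Salary": [50000, 60000, np.nan, 80000],
    "City": ["Delhi", "Mumbai", np.nan, "Delhi"],
})

# Categorical branch: fill with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns get mean imputation; sparse_threshold=0 forces dense output.
preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="mean"), ["Age", "Salary"]),
        ("cat", categorical, ["City"]),
    ],
    sparse_threshold=0,
)

features = np.asarray(preprocess.fit_transform(df), dtype=float)
print(features.shape)  # 2 numerical columns + 2 one-hot city columns
```

Because the imputers live inside the transformer, a later `preprocess.transform(new_df)` reuses the statistics learned during `fit` instead of recomputing them on new data.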

### Solution Code

In [1]:
# Step 1: Import Libraries 

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [7]:
# Step 2: Create Sample Dataset

data = {
    "Age": [25, np.nan, 30, 40],
    "Salary": [50000, 60000, np.nan, 80000],
    "City": ["Delhi", "Mumbai", np.nan, "Delhi"]
}

df = pd.DataFrame(data)
df


|   | Age | Salary | City |
|---|-----|--------|------|
| 0 | 25.0 | 50000.0 | Delhi |
| 1 | NaN | 60000.0 | Mumbai |
| 2 | 30.0 | NaN | NaN |
| 3 | 40.0 | 80000.0 | Delhi |


In [8]:
# Approach 1: Brute Force Solution (Manual Imputation)
def manual_imputation(df):
    """Return a copy of df with NaNs filled by column mean (numeric) or mode (categorical)."""
    df_copy = df.copy()

    # Numerical columns: fill with the column mean
    for col in df_copy.select_dtypes(include=np.number).columns:
        mean_value = df_copy[col].mean()
        # Assign the result back; chained fillna(..., inplace=True) stops working in pandas 3.0
        df_copy[col] = df_copy[col].fillna(mean_value)

    # Categorical columns: fill with the most frequent value
    for col in df_copy.select_dtypes(include=object).columns:
        mode_value = df_copy[col].mode()[0]
        df_copy[col] = df_copy[col].fillna(mode_value)

    return df_copy


In [9]:
df_manual = manual_imputation(df)
df_manual


|   | Age | Salary | City |
|---|-----|--------|------|
| 0 | 25.0 | 50000.0 | Delhi |
| 1 | 31.666667 | 60000.0 | Mumbai |
| 2 | 30.0 | 63333.333333 | Delhi |
| 3 | 40.0 | 80000.0 | Delhi |


### Alternative Solution

In [10]:
# Approach 2: Optimized Solution (Using SimpleImputer)

# Numerical Imputation
num_imputer = SimpleImputer(strategy="mean")

df_num = df[["Age", "Salary"]]
df_num_imputed = pd.DataFrame(
    num_imputer.fit_transform(df_num),
    columns=df_num.columns,
    index=df_num.index,  # keep the original row labels so concat aligns correctly
)

# Categorical Imputation
cat_imputer = SimpleImputer(strategy="most_frequent")

df_cat = df[["City"]]
df_cat_imputed = pd.DataFrame(
    cat_imputer.fit_transform(df_cat),
    columns=df_cat.columns,
    index=df_cat.index,
)

# Combine Back
df_imputed = pd.concat([df_num_imputed, df_cat_imputed], axis=1)
df_imputed


|   | Age | Salary | City |
|---|-----|--------|------|
| 0 | 25.0 | 50000.0 | Delhi |
| 1 | 31.666667 | 60000.0 | Mumbai |
| 2 | 30.0 | 63333.333333 | Delhi |
| 3 | 40.0 | 80000.0 | Delhi |


### Alternative Approaches

- Drop rows (dropna) ❌ (loses data and breaks the no-row-deletion constraint)
- KNN Imputer
- Iterative Imputer (MICE)
- Model-based imputation

➡️ SimpleImputer is a good default for baselines and pipelines
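
For comparison, here is a minimal sketch of the KNN option using `sklearn.impute.KNNImputer` on the numerical columns only. The choice of `n_neighbors=2` is an assumption made for this tiny example. The features are left unscaled here, although in practice scaling usually matters because the imputer relies on distances.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Age and Salary from the sample dataset; KNNImputer expects numerical input.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [30.0, np.nan],
    [40.0, 80000.0],
])

# Each missing entry is filled from the nearest rows that have that feature.
knn = KNNImputer(n_neighbors=2)  # n_neighbors=2 is an illustrative choice
X_filled = knn.fit_transform(X)

print(X_filled)
```

Unlike SimpleImputer, the fill value depends on the other features in the row, so two rows missing the same column can receive different values.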

### Test Case

In [12]:
# Test Case 1: No Missing Values (Manual)
assert df_manual.isnull().sum().sum() == 0
print("Test Case 1 Passed: Manual imputation successful")


Test Case 1 Passed: Manual imputation successful


In [13]:
# Test Case 2: No Missing Values (SimpleImputer)
assert df_imputed.isnull().sum().sum() == 0
print("Test Case 2 Passed: SimpleImputer successful")


Test Case 2 Passed: SimpleImputer successful


In [14]:
# Test Case 3: Mean Imputation Check
expected_age_mean = (25 + 30 + 40) / 3
# Compare floats with a tolerance rather than exact equality
assert np.isclose(df_imputed.loc[1, "Age"], expected_age_mean)
print("Test Case 3 Passed: Mean imputation correct")


Test Case 3 Passed: Mean imputation correct


In [15]:
# Test Case 4: Categorical Mode Check
assert df_imputed.loc[2, "City"] == "Delhi"
print("Test Case 4 Passed: Categorical imputation correct")


Test Case 4 Passed: Categorical imputation correct
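
The problem statement asks for a comparison of the two approaches, so one more check worth considering is that both produce the same filled values. The sketch below is self-contained and rebuilds the sample dataset; the variable names (`manual`, `sk_num`, `sk_cat`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 40],
    "Salary": [50000, 60000, np.nan, 80000],
    "City": ["Delhi", "Mumbai", np.nan, "Delhi"],
})
num_cols = ["Age", "Salary"]
cat_cols = ["City"]

# Manual approach: column mean for numbers, mode for categories.
manual = df.copy()
for col in num_cols:
    manual[col] = manual[col].fillna(manual[col].mean())
for col in cat_cols:
    manual[col] = manual[col].fillna(manual[col].mode()[0])

# SimpleImputer approach on the same columns.
sk_num = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
sk_cat = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Numerical results are compared with a tolerance, categorical ones exactly.
assert np.allclose(manual[num_cols].to_numpy(), sk_num)
assert manual[cat_cols].to_numpy().tolist() == sk_cat.tolist()
print("Manual and SimpleImputer results match")
```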


## Complexity Analysis

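Here n is the number of rows and d is the number of columns.

**Manual Imputation**
- Time: O(n × d)
- Space: O(n × d) for the copy made by df.copy(), with O(1) extra work space per column

**SimpleImputer**
- Time: O(n × d)
- Space: O(d) for the fitted statistics, plus O(n × d) for the transformed output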
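
The O(d) term refers to what the fitted imputer keeps: one fill value per column, exposed as the `statistics_` attribute. A small sketch, assuming a hypothetical one-row test set (`X_test`) that is not part of this notebook:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from the sample dataset, used here as training data.
X_train = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [30.0, np.nan],
    [40.0, 80000.0],
])
# Hypothetical unseen row with a missing Age.
X_test = np.array([[np.nan, 70000.0]])

imputer = SimpleImputer(strategy="mean").fit(X_train)

# One stored statistic per column: this is the O(d) state.
print(imputer.statistics_.shape)

# transform reuses the training means instead of recomputing them on X_test.
X_test_filled = imputer.transform(X_test)
print(X_test_filled)
```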

