# Handling Missing Values With Scikit-Learn

These notes use the same loan DataFrame as the previous notes. The Kaggle dataset is available [here](https://www.kaggle.com/datasets/tanishaj225/loancsv).

In [1]:
import kagglehub
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from kagglehub import KaggleDatasetAdapter
from sklearn.impute import SimpleImputer
import warnings

# Ignore only FutureWarning and DeprecationWarning
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
# Set the path to the file you'd like to load
file_path = "loan.csv"  # file inside tanishaj225/loancsv dataset

# Load dataset into pandas DataFrame
dataset = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "tanishaj225/loancsv",   # dataset slug
    path=file_path           # specify file to load
)



In [3]:
dataset.head(10)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| 6 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
| 7 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
| 8 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
| 9 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |


In [4]:
dataset.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [5]:
dataset.select_dtypes(include='float64').columns

Index(['CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',
       'Credit_History'],
      dtype='object')

#### SimpleImputer provides several hyperparameters that let you control how missing values are handled. Some of the key ones are:

1. `missing_values = np.nan` → Specifies which value is treated as missing (`np.nan` by default).

2. `strategy = 'mean'` → Defines the method used to replace missing values. Options include `'mean'`, `'median'`, `'most_frequent'`, or `'constant'`.

3. `fill_value = None` → The value used when `strategy = 'constant'`. For example, you could fill all missing entries with `0` or `"Unknown"`.

4. `copy = True` → If `True`, the imputer works on a copy of the input and leaves the original data unchanged; if `False`, it imputes in place where possible.

5. `add_indicator = False` → If set to `True`, an extra column is added that indicates whether a value was missing or not.

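6. `keep_empty_features = False` → Controls whether to keep features (columns) that contain only missing values at fit time. By default, such columns are dropped from the output; if `True`, they are kept and filled with `0` (or with `fill_value` when `strategy = 'constant'`).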
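
To see a couple of these hyperparameters in action, here is a minimal sketch on a tiny made-up array (not the loan data). It combines `strategy='constant'`, `fill_value`, and `add_indicator=True`; the array `X` and the fill value `0.0` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up, not the loan dataset): two numeric columns with gaps.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy='constant' fills every gap with fill_value.
# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The result has four columns: the two filled features followed by two indicator columns that mark where a value was originally missing.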

---

Since these columns are numerical, I will use `strategy='mean'`.

In [6]:
si = SimpleImputer(strategy='mean')
si.fit_transform(dataset[dataset.select_dtypes(include='float64').columns])

array([[0.00000000e+00, 1.46412162e+02, 3.60000000e+02, 1.00000000e+00],
       [1.50800000e+03, 1.28000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 6.60000000e+01, 3.60000000e+02, 1.00000000e+00],
       ...,
       [2.40000000e+02, 2.53000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 1.87000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 1.33000000e+02, 3.60000000e+02, 0.00000000e+00]],
      shape=(614, 4))

Since the output is a NumPy array, we will save the values in a variable and then load them into a DataFrame:

In [7]:
ar = si.fit_transform(dataset[dataset.select_dtypes(include='float64').columns])

In [8]:
new_dataset = pd.DataFrame(ar,columns=dataset.select_dtypes(include='float64').columns)

In [9]:
new_dataset.isna().sum()

CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
dtype: int64

In [10]:
new_dataset.head(10)

| | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|
| 0 | 0.0 | 146.412162 | 360.0 | 1.0 |
| 1 | 1508.0 | 128.0 | 360.0 | 1.0 |
| 2 | 0.0 | 66.0 | 360.0 | 1.0 |
| 3 | 2358.0 | 120.0 | 360.0 | 1.0 |
| 4 | 0.0 | 141.0 | 360.0 | 1.0 |
| 5 | 4196.0 | 267.0 | 360.0 | 1.0 |
| 6 | 1516.0 | 95.0 | 360.0 | 1.0 |
| 7 | 2504.0 | 158.0 | 360.0 | 0.0 |
| 8 | 1526.0 | 168.0 | 360.0 | 1.0 |
| 9 | 10968.0 | 349.0 | 360.0 | 1.0 |
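
The new DataFrame above holds only the float columns, so the other columns (IDs, categories) are left behind. One possible way around that, sketched here on a small made-up stand-in frame rather than the real loan data, is to assign the imputed array straight back to the same columns. The frame `df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small stand-in for the loan data (values are made up for illustration).
df = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3", "A4"],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})

float_cols = df.select_dtypes(include="float64").columns

# Assign the imputed array back to the same columns so the
# non-numeric columns and the original index stay attached.
df[float_cols] = SimpleImputer(strategy="mean").fit_transform(df[float_cols])

print(df)
```

This keeps everything in one frame, at the cost of overwriting the original values, so work on a copy if the raw data is still needed.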


#### Although this solves the problem of missing values, it also brings a new problem: duplicate data

* When you use mean imputation, all missing values in a column are replaced with the exact same value (the mean).

* So if 50 rows were missing, all 50 get the same replacement → this creates many duplicate entries in that column.

* It’s not "wrong," but it reduces variance and can cause bias in ML models.
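
The variance point can be checked directly. The sketch below uses a synthetic column (random numbers, not the loan data) with 50 of its 200 values removed, then compares the spread before and after mean imputation. The distribution parameters and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic column (not the loan data): 200 values, 50 of them knocked out.
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=150.0, scale=40.0, size=200))
with_gaps = values.copy()
with_gaps.iloc[:50] = np.nan

imputed = (
    SimpleImputer(strategy="mean")
    .fit_transform(with_gaps.to_frame(name="x"))
    .ravel()
)

std_observed = with_gaps.std()          # spread of the 150 observed values
std_imputed = pd.Series(imputed).std()  # spread after filling 50 gaps with the mean

# All 50 filled entries share one value, which pulls the spread down.
n_distinct_fills = np.unique(imputed[:50]).size

print(std_observed, std_imputed, n_distinct_fills)
```

The standard deviation after imputation comes out smaller than that of the observed values alone, because the 50 filled entries sit exactly at the mean and add no spread.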

---

### How to counter this

There are a few strategies depending on how serious the problem is:

1. Median instead of mean

        si = SimpleImputer(strategy='median')

    Median is less sensitive to outliers, so the repeated values still appear, but the fill value is less skewed by extreme entries.

2. Use `most_frequent` for categorical-like floats

    If the float column behaves like categories (e.g., ratings 1, 2, 3), you might fill with the most common value.

3. Use interpolation (better for time series / ordered data)

        dataset['col'] = dataset['col'].interpolate(method='linear')

    This fills missing values with values “in between,” reducing duplicates.

4. KNN Imputer (smarter imputation)

        from sklearn.impute import KNNImputer
        knn = KNNImputer(n_neighbors=5)
        new_dataset = pd.DataFrame(
            knn.fit_transform(dataset.select_dtypes(include='float64')),
            columns=dataset.select_dtypes(include='float64').columns
        )

    Here, missing values are replaced using similar rows, so you don’t get identical duplicates everywhere.

5. Drop rows or columns if missing values overwhelm the dataset

    If a column is heavily missing (like 40–50%), sometimes dropping that column (or the affected rows) is better than imputing it.
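
For the last option, a rough sketch of how the decision could be made: measure the share of missing values per column and drop the columns above a cut-off. The frame `df` is made up, and the `0.5` threshold is a hypothetical choice, not a rule.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration (not the loan data).
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # complete
})

# Fraction of missing values per column.
missing_share = df.isna().mean()

threshold = 0.5  # hypothetical cut-off; pick one that suits the data
cols_to_drop = missing_share[missing_share > threshold].index
reduced = df.drop(columns=cols_to_drop)

# Rows with any remaining gap can then be dropped (or imputed instead).
complete_rows = reduced.dropna()

print(missing_share)
print(complete_rows)
```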

In [11]:
from sklearn.impute import KNNImputer

In [12]:
knn = KNNImputer(n_neighbors=5)
new_dataset = pd.DataFrame(knn.fit_transform(dataset.select_dtypes(include='float64')), columns=dataset.select_dtypes(include='float64').columns)

In [13]:
new_dataset.head(10)

| | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|
| 0 | 0.0 | 216.4 | 360.0 | 1.0 |
| 1 | 1508.0 | 128.0 | 360.0 | 1.0 |
| 2 | 0.0 | 66.0 | 360.0 | 1.0 |
| 3 | 2358.0 | 120.0 | 360.0 | 1.0 |
| 4 | 0.0 | 141.0 | 360.0 | 1.0 |
| 5 | 4196.0 | 267.0 | 360.0 | 1.0 |
| 6 | 1516.0 | 95.0 | 360.0 | 1.0 |
| 7 | 2504.0 | 158.0 | 360.0 | 0.0 |
| 8 | 1526.0 | 168.0 | 360.0 | 1.0 |
| 9 | 10968.0 | 349.0 | 360.0 | 1.0 |


---
## SimpleImputer

### Advantages

* **Fast & efficient** → Only computes one statistic per column, so it stays fast even on very large datasets.

* **Easy to implement** → Just one line of code with no complexity.

* **Stable & reproducible** → Always imputes the same value (e.g., mean, median).

* **Good as a baseline** → Quick fix to prepare data for initial models.

* **Less risk of overfitting** → Since it doesn’t “learn” from neighbors.

### Disadvantages

* **Ignores feature relationships** → Just fills a fixed value, doesn’t consider correlations.

* **Introduces bias** → Mean/median replacement can distort the distribution.

* **Reduces variance** → Too many repeated values (e.g., mean everywhere).

* **Not realistic** → Imputed values may not represent real-world behavior.

---

## KNNImputer

### Advantages

* **Preserves feature relationships** → Uses neighboring rows, so correlations are maintained.
 
* **More realistic imputation** → Produces values closer to the actual missing ones.

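* **Maintains variance better** → Avoids the single repeated fill value that `SimpleImputer` produces.

* **Works for encoded categorical data too** → `KNNImputer` only accepts numeric input, so categorical columns must be encoded first.

* **Can improve ML performance** → Models trained on KNN-imputed data often perform better, though this depends on the dataset.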

### Disadvantages

* **Computationally expensive** → Slow for large datasets, since it calculates distances.

* **Memory-heavy** → Requires storing and searching through many rows.

* **Sensitive to scaling** → Distance-based method, so features should be scaled (e.g., standardized) before imputing.

* **Sensitive to outliers** → Outliers can distort neighbor selection.

* **Choice of K is tricky** → Too small = noisy imputation, too large = over-smoothed values.
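
The scaling point can be handled by standardizing before imputing. Below is a minimal sketch on made-up data with columns on very different scales. It relies on `StandardScaler` ignoring `NaN` when fitting and passing it through on transform; the array `X` and `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): columns on very different scales, one missing value.
X = np.array([
    [1000.0, 1.0, 10.0],
    [2000.0, 2.0, 20.0],
    [3000.0, 3.0, np.nan],
    [4000.0, 4.0, 40.0],
    [5000.0, 5.0, 50.0],
])

# StandardScaler disregards NaN when fitting and keeps it in its output,
# so it can run before the imputer.
pipe = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_filled = pipe.fit_transform(X)

# Map the result back to the original units.
X_filled = pipe.named_steps["standardscaler"].inverse_transform(X_scaled_filled)

print(X_filled)
```

Without the scaler, the column measured in thousands would dominate the distance calculation and largely decide which rows count as neighbors.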

### Rule of thumb:

* Use `SimpleImputer` for speed, simplicity, and very large datasets.

* Use `KNNImputer` when accuracy and maintaining feature relationships are important, and dataset size is manageable.

---

## One-Hot Encoding & Dummy Variables