# Handling Missing Values With Scikit-Learn

These notes use the same loan DataFrame as the previous notes. The Kaggle dataset is available [here](https://www.kaggle.com/datasets/tanishaj225/loancsv).

In [1]:
import kagglehub
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from kagglehub import KaggleDatasetAdapter
from sklearn.impute import SimpleImputer
import warnings

# Ignore only FutureWarning and DeprecationWarning
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
# Set the path to the file you'd like to load
file_path = "loan.csv"  # file inside tanishaj225/loancsv dataset

# Load dataset into pandas DataFrame
dataset = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "tanishaj225/loancsv",   # dataset slug
    path=file_path           # specify file to load
)

In [3]:
dataset.head(10)

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| 6 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
| 7 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
| 8 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
| 9 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |


In [4]:
dataset.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [5]:
dataset.select_dtypes(include='float64').columns

Index(['CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',
       'Credit_History'],
      dtype='object')

#### SimpleImputer provides several hyperparameters that let you control how missing values are handled. Some of the key ones are:

1. `missing_values = NaN` → Specifies which value should be considered as missing (by default, it looks for `NaN`).

2. `strategy = 'mean'` → Defines the method used to replace missing values. Options include `'mean'`, `'median'`, `'most_frequent'`, or `'constant'`.

3. `fill_value = None` → The value used when `strategy = 'constant'`. For example, you could fill all missing entries with `0` or `"Unknown"`.

4. `copy = True` → Ensures the original dataset is not changed; instead, a copy is modified.

5. `add_indicator = False` → If set to `True`, an extra column is added that indicates whether a value was missing or not.

6. `keep_empty_features = False` → Controls whether to keep features (columns) that only contain missing values. By default, such columns are dropped.
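A few of these hyperparameters are easier to see on a tiny example. The sketch below uses a made-up two-column frame (not the loan data) to show `strategy='constant'` with `fill_value`, and `add_indicator=True`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values (an assumption for illustration, not the loan data)
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 300.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan],
})

# strategy='constant' fills every gap with fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled_constant = constant_imputer.fit_transform(toy)

# add_indicator=True appends one 0/1 column per feature that had gaps during fit
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
filled_with_flags = median_imputer.fit_transform(toy)

print(filled_constant)
print(filled_with_flags.shape)       # two imputed columns plus two indicator columns
print(median_imputer.statistics_)    # the per-column medians learned during fit
```

The indicator columns let a downstream model know which values were imputed rather than observed.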

---

Since we are dealing with numerical data, I will use `strategy='mean'`.

In [6]:
si = SimpleImputer(strategy='mean')
si.fit_transform(dataset[dataset.select_dtypes(include='float64').columns])

array([[0.00000000e+00, 1.46412162e+02, 3.60000000e+02, 1.00000000e+00],
       [1.50800000e+03, 1.28000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 6.60000000e+01, 3.60000000e+02, 1.00000000e+00],
       ...,
       [2.40000000e+02, 2.53000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 1.87000000e+02, 3.60000000e+02, 1.00000000e+00],
       [0.00000000e+00, 1.33000000e+02, 3.60000000e+02, 0.00000000e+00]],
      shape=(614, 4))

Since the output is a NumPy array, we will save it in a variable and then load it back into a DataFrame:

In [7]:
ar = si.fit_transform(dataset[dataset.select_dtypes(include='float64').columns])

In [8]:
new_dataset = pd.DataFrame(ar,columns=dataset.select_dtypes(include='float64').columns)

In [9]:
new_dataset.isna().sum()

CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
dtype: int64

In [10]:
new_dataset.head(10)

|   | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 146.412162 | 360.0 | 1.0 |
| 1 | 1508.0 | 128.0 | 360.0 | 1.0 |
| 2 | 0.0 | 66.0 | 360.0 | 1.0 |
| 3 | 2358.0 | 120.0 | 360.0 | 1.0 |
| 4 | 0.0 | 141.0 | 360.0 | 1.0 |
| 5 | 4196.0 | 267.0 | 360.0 | 1.0 |
| 6 | 1516.0 | 95.0 | 360.0 | 1.0 |
| 7 | 2504.0 | 158.0 | 360.0 | 0.0 |
| 8 | 1526.0 | 168.0 | 360.0 | 1.0 |
| 9 | 10968.0 | 349.0 | 360.0 | 1.0 |


#### Although this solves the problem of missing values, it also brings a new problem: duplicate data

* When you use mean imputation, all missing values in a column are replaced with the exact same value (the mean).

* So if 50 rows were missing, all 50 get the same replacement → this creates many duplicate entries in that column.

* It’s not "wrong," but it reduces variance and can cause bias in ML models.
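The variance reduction described above can be checked on a small made-up series. This is only a sketch with invented numbers; the same comparison could be run on any column of the loan data.

```python
import numpy as np
import pandas as pd

# Made-up values with two gaps (illustrative only)
toy_amounts = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])

column_mean = toy_amounts.mean()          # NaNs are skipped when computing the mean
filled = toy_amounts.fillna(column_mean)  # every gap receives the same value

std_before = toy_amounts.std()            # spread of the observed values only
std_after = filled.std()                  # spread after mean imputation
repeated = int((filled == column_mean).sum())

print(f"std before: {std_before:.3f}, std after: {std_after:.3f}")
print(f"rows holding the mean value: {repeated}")
```

The standard deviation shrinks because the imputed rows sit exactly at the center of the distribution.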

---

### How to counter this

There are a few strategies depending on how serious the problem is:

1. Median instead of mean

        si = SimpleImputer(strategy='median')

    Median is less sensitive to outliers, so duplicates still appear, but they’re less biased.

2. Use `most_frequent` for categorical-like floats
    If the float column behaves like categories (e.g., ratings 1, 2, 3), you might fill with the most common value.

3. Use interpolation (better for time series / ordered data)

        dataset['col'] = dataset['col'].interpolate(method='linear')

    This fills missing values with values “in between,” reducing duplicates.

4. KNN Imputer (smarter imputation)

        from sklearn.impute import KNNImputer
        knn = KNNImputer(n_neighbors=5)
        float_cols = dataset.select_dtypes(include='float64').columns
        new_dataset = pd.DataFrame(
            knn.fit_transform(dataset[float_cols]),
            columns=float_cols
        )

    Here, missing values are replaced using similar rows, so you don’t get identical duplicates everywhere.

5. Drop rows or columns if missing values overwhelm the dataset

    If a column is heavily missing (like 40–50%), sometimes dropping that column (or the affected rows) is better than imputing it.
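One way to decide what to drop is to compute the share of missing values per column first. The sketch below uses a made-up frame, and the 40% threshold is an assumption taken from the range mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Made-up frame (illustrative only)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "few_missing": [1.0, np.nan, 3.0, 4.0, 5.0],
    "complete": [10, 20, 30, 40, 50],
})

# isna() gives True/False per cell; the mean of booleans is the missing share
missing_share = toy.isna().mean()

threshold = 0.4  # assumed cut-off for "heavily missing"
sparse_cols = missing_share[missing_share > threshold].index
trimmed = toy.drop(columns=sparse_cols)

print(missing_share)
print(list(trimmed.columns))
```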

In [11]:
from sklearn.impute import KNNImputer

In [12]:
knn = KNNImputer(n_neighbors=5)
new_dataset = pd.DataFrame(knn.fit_transform(dataset.select_dtypes(include='float64')), columns=dataset.select_dtypes(include='float64').columns)

In [13]:
new_dataset.head(10)

|   | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 216.4 | 360.0 | 1.0 |
| 1 | 1508.0 | 128.0 | 360.0 | 1.0 |
| 2 | 0.0 | 66.0 | 360.0 | 1.0 |
| 3 | 2358.0 | 120.0 | 360.0 | 1.0 |
| 4 | 0.0 | 141.0 | 360.0 | 1.0 |
| 5 | 4196.0 | 267.0 | 360.0 | 1.0 |
| 6 | 1516.0 | 95.0 | 360.0 | 1.0 |
| 7 | 2504.0 | 158.0 | 360.0 | 0.0 |
| 8 | 1526.0 | 168.0 | 360.0 | 1.0 |
| 9 | 10968.0 | 349.0 | 360.0 | 1.0 |


---
## SimpleImputer

### Advantages

* **Fast & efficient** → Only computes one statistic per column, so it stays fast even on very large datasets.

* **Easy to implement** → Just one line of code with no complexity.

* **Stable & reproducible** → Always imputes the same value (e.g., mean, median).

* **Good as a baseline** → Quick fix to prepare data for initial models.

* **Less risk of overfitting** → Since it doesn’t “learn” from neighbors.

### Disadvantages

* **Ignores feature relationships** → Just fills a fixed value, doesn’t consider correlations.

* **Introduces bias** → Mean/median replacement can distort the distribution.

* **Reduces variance** → Too many repeated values (e.g., mean everywhere).

* **Not realistic** → Imputed values may not represent real-world behavior.

---

## KNNImputer

### Advantages

* **Preserves feature relationships** → Uses neighboring rows, so correlations are maintained.
 
* **More realistic imputation** → Produces values closer to the actual missing ones.

* **Maintains variance** → Avoids duplicate values like SimpleImputer does.

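* **Works for numerical and encoded categorical data** → Categorical columns must be encoded as numbers first, and because the imputed value is an average over neighbors, it may need rounding back to a valid category.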

* **Improves ML performance** → Models trained on KNN-imputed data often perform better.

### Disadvantages

* **Computationally expensive** → Slow for large datasets, since it calculates distances.

* **Memory-heavy** → Requires storing and searching through many rows.

* **Sensitive to scaling** → Distance-based method, so features should be scaled first; otherwise large-valued columns dominate the distance.

* **Sensitive to outliers** → Outliers can distort neighbor selection.

* **Choice of K is tricky** → Too small = noisy imputation, too large = over-smoothed values.
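Because of the scaling sensitivity noted above, a common pattern is to scale, impute, and then undo the scaling. The following is a minimal sketch on made-up numbers; `StandardScaler` ignores NaNs when fitting and keeps them in its output, which is what makes this order possible.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Made-up rows: [income-like value, credit-history-like flag] (illustrative only)
toy = np.array([
    [5000.0, 1.0],
    [5200.0, np.nan],   # the gap we want to fill
    [300.0, 0.0],
    [350.0, 0.0],
    [5100.0, 1.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)            # NaN stays NaN after scaling

knn_demo = KNNImputer(n_neighbors=2)
imputed_scaled = knn_demo.fit_transform(scaled)

# Back to the original units so the result is readable
imputed = scaler.inverse_transform(imputed_scaled)
print(imputed)
```

With a single observed feature per row the scaling does not change which neighbors are picked here; the sketch only shows the order of the steps.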

### Rule of thumb:

* Use `SimpleImputer` for speed, simplicity, and very large datasets.

* Use `KNNImputer` when accuracy and maintaining feature relationships are important, and dataset size is manageable.

---

## Encoding

Encoding means converting categorical/text data into numeric values so that machine learning models can understand it.
Because ML algorithms work with numbers (math, distances, probabilities), they can’t directly process words like `"red"`, `"blue"`, `"dog"`, `"cat"`.

### Why Do We Use Encoding?

1. ML algorithms only understand numbers → they can’t calculate on strings.

2. To preserve meaning → encoding helps us represent categories without losing the relationships.

3. Better performance → properly encoded data improves model accuracy.

### Types of Encoding (Most Common)

1. `Label Encoding`

* Assigns each category a number.

* Example:

        Red → 0
        Blue → 1
        Green → 2


* Problem: Implies an order/priority (0 < 1 < 2) even if there isn’t any.

2. `One-Hot Encoding`

* Creates separate binary columns for each category.

* Example:

        Red   → [1, 0, 0]
        Blue  → [0, 1, 0]
        Green → [0, 0, 1]


* No order issue, but increases dimensions if many categories.

3. `Ordinal Encoding`

* Used when categories have a natural order (e.g., "low", "medium", "high").

* Example:

        Low → 1
        Medium → 2
        High → 3


4. `Target / Mean Encoding (advanced)`

* Replace a category with the average target value for that category.

* Example (predicting salary):

        Job: Engineer → 70k avg salary
        Job: Teacher → 50k avg salary
        Job: Doctor → 100k avg salary
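Ordinal and target encoding from the list above can be sketched on a made-up frame. The column names and values are invented for illustration. Note that scikit-learn's `OrdinalEncoder` numbers categories from 0, so `Low` becomes 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame (illustrative only)
toy = pd.DataFrame({
    "risk": ["Low", "High", "Medium", "Low"],
    "job": ["Engineer", "Teacher", "Engineer", "Doctor"],
    "salary": [70, 50, 80, 100],
})

# Ordinal encoding: pass the order explicitly, otherwise categories are sorted
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
toy["risk_code"] = ordinal.fit_transform(toy[["risk"]]).ravel()

# Target (mean) encoding: replace each job with its average salary
job_means = toy.groupby("job")["salary"].mean()
toy["job_target_enc"] = toy["job"].map(job_means)

print(toy)
```

In a real project the target means would be computed on the training split only; computing them on all rows, as this sketch does, leaks the target into the feature.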

In [14]:
# Set the path to the file you'd like to load
file_path = "loan.csv"  # file inside tanishaj225/loancsv dataset

# Load dataset into pandas DataFrame
ds = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "tanishaj225/loancsv",   # dataset slug
    path=file_path           # specify file to load
)

In [15]:
ds.head()

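|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |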


Here is how One-Hot Encoding works.

Take the Gender column from the DataFrame above. Instead of keeping one column, One-Hot Encoding creates two columns (one for each category):

| Person | Gender_Male | Gender_Female |
| ------ | ----        | ------        |
| 1      | 1           | 0             |
| 2      | 0           | 1             |
| 3      | 1           | 0             |

Now:

* If Gender = Male → Male=1, Female=0

* If Gender = Female → Male=0, Female=1

This way, the model doesn’t assume order or hierarchy. Each category just becomes a separate feature.

In [16]:
# pip install --upgrade kagglehub

Before encoding, let's check for missing values in the Gender and Married columns.

In [17]:
print(f"The number of missing values in column Gender is: {ds.Gender.isna().sum()} \nThe number of missing values in column Married is: {ds.Married.isna().sum()}")

The number of missing values in column Gender is: 13 
The number of missing values in column Married is: 3


As we can see there are 13 missing values in column Gender and 3 missing values in column Married.

Let's fill them up using mode.

In [18]:
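# Assign the result back instead of using inplace=True on a column accessor,
# which newer pandas versions warn about and may not apply to the DataFrame
ds['Gender'] = ds['Gender'].fillna(ds['Gender'].mode()[0])
ds['Married'] = ds['Married'].fillna(ds['Married'].mode()[0])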

In [19]:
print('After applying the mode method we got')
print(f"The number of missing values in column Gender is: {ds.Gender.isna().sum()} \nThe number of missing values in column Married is: {ds.Married.isna().sum()}")

After applying the mode method we got
The number of missing values in column Gender is: 0 
The number of missing values in column Married is: 0


Now that there are no null values in either column, let's apply encoding.

There are two ways of applying One-Hot Encoding:

1. Use `pd.get_dummies()` for quick prototyping or EDA.
     * Easiest, quick method when working with DataFrames.
     * `Pros`: Quick, simple, great for small datasets.
     * `Cons`: Limited flexibility (no control over unseen categories in test data).   

2. Use `OneHotEncoder` for production ML models & pipelines.
    * It is a class in the scikit-learn library.
    * More powerful, especially for ML pipelines.
    * `Pros`:
      * Works smoothly inside Scikit-Learn pipelines.
      * Can handle unseen categories (handle_unknown='ignore').
    * `Cons`: Slightly more code compared to get_dummies().

---

### What is a Pipeline?

In ML, a pipeline is like an assembly line in a factory.

Instead of manually applying preprocessing → encoding → scaling → training each time, a pipeline chains them together into one process.

Think of it like:

`Raw Data  →  Preprocessing  →  Feature Encoding  →  Scaling  →  Model Training  →  Predictions`

#### A Pipeline object makes sure:

* Every step runs in the right order.

* Same transformations applied on training data are applied on test data.

* Less risk of data leakage.

* Cleaner code.

### Why Are Pipelines Important?

* Prevents **data leakage** (train/test preprocessing mismatch).

* Code is **shorter and cleaner**.

* Easy to reuse (you can save the whole pipeline and deploy it).

* Handles **complex datasets** with mixed feature types easily.
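A minimal sketch of such a pipeline is shown below. It uses a made-up six-row frame and a made-up target, and the choice of `LogisticRegression` and of the step names is an assumption for illustration, not something taken from these notes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with gaps in both a categorical and a numeric column
X = pd.DataFrame({
    "Gender": pd.Series(["Male", "Female", np.nan, "Male", "Female", "Male"], dtype=object),
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, 141.0, 267.0],
})
y = pd.Series([0, 1, 1, 0, 1, 0])  # invented target

# Categorical branch: fill with the mode, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric branch: fill with the median
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Route each column to its branch
preprocess = ColumnTransformer([
    ("cat", categorical, ["Gender"]),
    ("num", numeric, ["LoanAmount"]),
])

# Preprocessing and model chained into one object
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Calling `model.predict` on new rows reuses the imputation values and categories learned in `fit`, which is how the pipeline keeps training and test preprocessing consistent.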

In [20]:
en_data = ds[['Gender','Married']]
en_data.head()

|   | Gender | Married |
| --- | --- | --- |
| 0 | Male | No |
| 1 | Male | Yes |
| 2 | Male | Yes |
| 3 | Male | Yes |
| 4 | Male | No |


In [21]:
encoding = pd.get_dummies(en_data)

In [22]:
encoding

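|   | Gender_Female | Gender_Male | Married_No | Married_Yes |
| --- | --- | --- | --- | --- |
| 0 | False | True | True | False |
| 1 | False | True | False | True |
| 2 | False | True | False | True |
| 3 | False | True | False | True |
| 4 | False | True | True | False |
| ... | ... | ... | ... | ... |
| 609 | True | False | True | False |
| 610 | False | True | False | True |
| 611 | False | True | False | True |
| 612 | False | True | False | True |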


In [23]:
encoding.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Gender_Female  614 non-null    bool 
 1   Gender_Male    614 non-null    bool 
 2   Married_No     614 non-null    bool 
 3   Married_Yes    614 non-null    bool 
dtypes: bool(4)
memory usage: 2.5 KB


Machine learning models work with numbers, so categorical values like 'Male' or 'Female' must be converted into a numerical representation that does not imply any mathematical order between the categories.

`pd.get_dummies` does this in one call. The `OneHotEncoder` class from scikit-learn does the same job and can be used instead:

In [24]:
from sklearn.preprocessing import OneHotEncoder

In [25]:
ohe = OneHotEncoder()
ohe.fit_transform(en_data)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1228 stored elements and shape (614, 4)>

#### .fit()

* This step learns the categories from your data.

* For example, if en_data is:

        ["Male", "Female", "Male"]


* then .fit() will figure out that the categories are:

        ['Female', 'Male']

(By default scikit-learn sorts the categories it finds, which means alphabetical order for strings.)

#### .transform()

* Once categories are learned, .transform() converts each row into the one-hot encoded format.

* So the above data becomes:

        Male   → [0, 1]  
        Female → [1, 0]  
        Male   → [0, 1]

#### .fit_transform()

* This is just a shortcut → runs .fit() first (learn categories) then immediately .transform() (encode them).

* So:

        ohe.fit(en_data)        # Step 1: Learn categories
        ohe.transform(en_data)  # Step 2: Encode


* is the same as:

        ohe.fit_transform(en_data)

### Important note:

* You usually call `.fit_transform()` on the training data.

* Then, when you have new data (like test data), you only use `.transform()` (not .fit()) so that it uses the same categories learned from training.

---

***The output above was:***

    <Compressed Sparse Row sparse matrix of dtype 'float64'
        with 1228 stored elements and shape (614, 4)>

### What is a Sparse Matrix?

A sparse matrix is just a way to save space in memory.
Instead of storing all the 0s from One-Hot Encoding, it only stores the 1s and their positions.

### Why does this happen in One-Hot Encoding?

Example: suppose we encode Gender → Male / Female.

After One-Hot Encoding:

| Person | Male | Female |
| ------ | ---- | ------ |
| 1      | 1    | 0      |
| 2      | 0    | 1      |
| 3      | 1    | 0      |

Most of the numbers are 0.
Imagine if we had 100 categories → then every row would have 99 zeros and only 1 one.

Storing all those zeros would waste memory.
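The memory argument can be checked with a small sketch. It builds a made-up one-hot matrix with 100 categories directly in NumPy and compares its size with the same data stored in SciPy's CSR format, the format `OneHotEncoder` returned above. The row and category counts are arbitrary.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_categories = 1000, 100

# Random category label per row, then a dense one-hot matrix built by hand
labels = rng.integers(0, n_categories, size=n_rows)
dense = np.zeros((n_rows, n_categories))
dense[np.arange(n_rows), labels] = 1.0

# Same data in Compressed Sparse Row format: only the 1s and their positions
compact = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = compact.data.nbytes + compact.indices.nbytes + compact.indptr.nbytes

print(f"stored elements: {compact.nnz}")
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```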

---

### There are two ways to turn the sparse matrix into a normal array (dense):

* **Method 1** → While creating the encoder

        ohe = OneHotEncoder(sparse_output=False)   # new versions use sparse_output=False
        encoded = ohe.fit_transform(data)

This directly gives you a NumPy array (with 0s and 1s).

* **Method 2** → After encoding (convert manually)

        ohe = OneHotEncoder()   # default gives sparse matrix
  
        encoded_sparse = ohe.fit_transform(data)
  
        # Convert sparse → dense (normal array)
        encoded_dense = encoded_sparse.toarray()

#### Both do the same thing.

* `sparse_output=False` → tells scikit-learn: “Don’t give me sparse, give me array directly.” (Older versions used `sparse=False` for this.)

* `.toarray()` → you take the sparse matrix and convert it later.

#### Think of it like this:

* Sparse = a list of where the lights are ON (saves memory).

* Dense = a big grid showing every light, ON or OFF (takes more space).

In [26]:
# Using the method 1
ohe = OneHotEncoder(sparse_output=False)   # New versions use sparse_output=False
encoded = ohe.fit_transform(en_data)

In [27]:
encoded

array([[0., 1., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       ...,
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.]], shape=(614, 4))

But now I need the data as a DataFrame rather than an array.

To convert it, we will use:

`pd.DataFrame()`

In [28]:
pd.DataFrame(encoded,columns=['Gender_Female','Gender_Male','Married_No','Married_Yes'])

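|   | Gender_Female | Gender_Male | Married_No | Married_Yes |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 609 | 1.0 | 0.0 | 1.0 | 0.0 |
| 610 | 0.0 | 1.0 | 0.0 | 1.0 |
| 611 | 0.0 | 1.0 | 0.0 | 1.0 |
| 612 | 0.0 | 1.0 | 0.0 | 1.0 |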


When we apply One-Hot Encoding, it creates a new column for each category.

For example, if a column has two categories like Male and Female, it will create two new columns: `Gender_Male` and `Gender_Female`.

But notice something:

* If `Gender_Male = 1`, then `Gender_Female` will always be `0`.

* If `Gender_Female = 1`, then `Gender_Male` will always be `0`.

This means one of the columns is redundant. To avoid this, we can drop one column using the argument:

        OneHotEncoder(drop='first')

This way, for two categories, only one column is created.
So instead of storing both `Male` and `Female`, the model only needs one column:

* `0` = Female

* `1` = Male

This reduces unnecessary columns while still keeping all the information.

***In short: `drop='first'` helps reduce the number of columns and avoids redundancy, especially when you have many categories.***

In [29]:
ohe = OneHotEncoder(sparse_output=False,drop='first')
encoded = ohe.fit_transform(en_data)

In [30]:
pd.DataFrame(encoded,columns=['Gender_Male','Married_Yes'])

|   | Gender_Male | Married_Yes |
| --- | --- | --- |
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 |
| 4 | 1.0 | 0.0 |
| ... | ... | ... |
| 609 | 0.0 | 0.0 |
| 610 | 1.0 | 1.0 |
| 611 | 1.0 | 1.0 |
| 612 | 1.0 | 1.0 |


## Label Encoding