
# **Create Dummy Dataset**

In [None]:
# import library
import pandas as pd
import numpy as np

In [None]:
# create dummy dataset
data = {
    'Age': [25, 30, np.nan, 35, 40, 45, np.nan],
    'Salary': [50000, 60000, 55000, np.nan, 65000, 70000, 72000],
    'Gender': ['Male', 'Female', 'Female', 'Female', np.nan, 'Male', 'Female']
}
df = pd.DataFrame(data)

In [None]:
df

,Age,Salary,Gender
0,25.0,50000.0,Male
1,30.0,60000.0,Female
2,,55000.0,Female
3,35.0,,Female
4,40.0,65000.0,
5,45.0,70000.0,Male
6,,72000.0,Female


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     5 non-null      float64
 1   Salary  6 non-null      float64
 2   Gender  6 non-null      object 
dtypes: float64(2), object(1)
memory usage: 296.0+ bytes
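Before choosing an imputation method, it helps to count the gaps per column. The cell below is a minimal sketch that rebuilds the same dummy dataset so it runs on its own; the name `demo` is only an illustrative stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Same values as the dummy dataset above; `demo` is a hypothetical name
demo = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, 45, np.nan],
    'Salary': [50000, 60000, 55000, np.nan, 65000, 70000, 72000],
    'Gender': ['Male', 'Female', 'Female', 'Female', np.nan, 'Male', 'Female'],
})

# Number of missing values in each column
missing_counts = demo.isna().sum()
print(missing_counts)

# Share of missing values in each column, as a percentage
missing_pct = demo.isna().mean() * 100
print(missing_pct.round(1))
```

For this dataset the counts are 2 for `Age` and 1 each for `Salary` and `Gender`, which matches the non-null counts reported by `df.info()`.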


## **Simple Imputer**

`SimpleImputer` fills missing values column by column, according to its `strategy` parameter:

If `"mean"`, replace missing values with the mean of each column. Can only be used with numeric data.

If `"median"`, replace missing values with the median of each column. Can only be used with numeric data.

If `"most_frequent"`, replace missing values with the most frequent value of each column. Can be used with strings or numeric data. If several values tie for most frequent, the smallest one is used.

If `"constant"`, replace missing values with `fill_value`. Can be used with strings or numeric data.
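The strategies other than `"mean"` can be tried in the same way. The sketch below applies `"median"` to the salary values and `"most_frequent"` and `"constant"` to the gender values from the dummy dataset. The variable names and the `'Unknown'` fill value are illustrative assumptions, and the gender values are stored as an object array so that `np.nan` marks the missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column each, shaped (n_samples, 1) as SimpleImputer expects 2D input
salary = np.array([50000, 60000, 55000, np.nan, 65000, 70000, 72000]).reshape(-1, 1)
gender = np.array(
    ['Male', 'Female', 'Female', 'Female', np.nan, 'Male', 'Female'],
    dtype=object,
).reshape(-1, 1)

# median: numeric data only
median_imputer = SimpleImputer(strategy='median')
salary_median = median_imputer.fit_transform(salary).ravel()

# most_frequent: works with strings as well as numbers
mode_imputer = SimpleImputer(strategy='most_frequent')
gender_mode = mode_imputer.fit_transform(gender).ravel()

# constant: every missing value becomes fill_value ('Unknown' is an arbitrary choice)
constant_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
gender_constant = constant_imputer.fit_transform(gender).ravel()

print(salary_median)
print(gender_mode)
print(gender_constant)
```

Here the missing salary becomes 62500.0 (the median of the six known salaries, whereas their mean is 62000.0), and the missing gender becomes `'Female'` under `"most_frequent"` and `'Unknown'` under `"constant"`.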

In [None]:
from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(strategy='mean')

In [None]:
# fit on the original columns and store the mean-imputed values in new columns
df[['Age_imputed', 'Salary_imputed']] = simple_imputer.fit_transform(df[['Age', 'Salary']])
df

,Age,Salary,Gender,Age_imputed,Salary_imputed
0,25.0,50000.0,Male,25.0,50000.0
1,30.0,60000.0,Female,30.0,60000.0
2,,55000.0,Female,35.0,55000.0
3,35.0,,Female,35.0,62000.0
4,40.0,65000.0,,40.0,65000.0
5,45.0,70000.0,Male,45.0,70000.0
6,,72000.0,Female,35.0,72000.0


## **Localised Imputer**

In [None]:
# forward fill
df['Age'].ffill()

0    25.0
1    30.0
2    30.0
3    35.0
4    40.0
5    45.0
6    45.0
Name: Age, dtype: float64

In [None]:
# backward fill
df['Age'].bfill()

0    25.0
1    30.0
2    35.0
3    35.0
4    40.0
5    45.0
6     NaN
Name: Age, dtype: float64

In [None]:
# interpolate
df['Age'].interpolate()

0    25.0
1    30.0
2    32.5
3    35.0
4    40.0
5    45.0
6    45.0
Name: Age, dtype: float64
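The backward fill above leaves the last row as `NaN` because there is no later value to copy from. One possible way to handle such edge gaps, sketched here on a standalone copy of the `Age` values (the names `age`, `filled` and `inside_only` are illustrative), is to chain a forward fill after the backward fill. The second half shows `limit_area='inside'`, which restricts `interpolate` to gaps that have valid values on both sides.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, np.nan, 35, 40, 45, np.nan], name='Age')

# bfill handles interior gaps; ffill then covers the trailing gap bfill cannot reach
filled = age.bfill().ffill()
print(filled)

# interpolate only between known values; the trailing NaN is left untouched
inside_only = age.interpolate(limit_area='inside')
print(inside_only)
```

Whether to extend the last known value to the end of the series or leave it missing depends on the data, so treat the chained fill as one option and not a default.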

## **Regression Imputation**

In [None]:
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

In [None]:
# Only numeric columns should be imputed this way
iterative_imputer = IterativeImputer(estimator=LinearRegression())
df[['Age_imputed', 'Salary_imputed']] = iterative_imputer.fit_transform(df[['Age', 'Salary']])
df

,Age,Salary,Gender,Age_imputed,Salary_imputed
0,25.0,50000.0,Male,25.0,50000.0
1,30.0,60000.0,Female,30.0,60000.0
2,,55000.0,Female,28.430657,55000.0
3,35.0,,Female,35.0,62000.0
4,40.0,65000.0,,40.0,65000.0
5,45.0,70000.0,Male,45.0,70000.0
6,,72000.0,Female,45.880474,72000.0
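To see the idea behind regression imputation in isolation, the sketch below does a single manual pass: it fits a `LinearRegression` of `Age` on `Salary` using only the complete rows, then predicts `Age` where it is missing. This is a simplified illustration and not what `IterativeImputer` does internally, since that class repeats the process over several rounds and columns. The names `demo` and `Age_regressed` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

demo = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, 45, np.nan],
    'Salary': [50000, 60000, 55000, np.nan, 65000, 70000, 72000],
})

# Train only on rows where both Age and Salary are known
complete = demo.dropna()
model = LinearRegression().fit(complete[['Salary']], complete['Age'])

# Predict Age for rows where Age is missing but Salary is available
needs_age = demo['Age'].isna() & demo['Salary'].notna()
demo['Age_regressed'] = demo['Age']
demo.loc[needs_age, 'Age_regressed'] = model.predict(demo.loc[needs_age, ['Salary']])

print(demo)
```

The single-pass predictions come out at about 28.57 and 46.06, close to but not identical to the `IterativeImputer` values, which are refined over several iterations.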


## **K-Nearest Neighbors Imputation**

In [None]:
from sklearn.impute import KNNImputer

In [None]:
# KNNImputer works on numeric columns only; encode categorical columns before imputing them
knn_imputer = KNNImputer(n_neighbors=2)
df[['Age_imputed', 'Salary_imputed']] = knn_imputer.fit_transform(df[['Age', 'Salary']])
df

,Age,Salary,Gender,Age_imputed,Salary_imputed
0,25.0,50000.0,Male,25.0,50000.0
1,30.0,60000.0,Female,30.0,60000.0
2,,55000.0,Female,27.5,55000.0
3,35.0,,Female,35.0,62500.0
4,40.0,65000.0,,40.0,65000.0
5,45.0,70000.0,Male,45.0,70000.0
6,,72000.0,Female,42.5,72000.0
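The imputed ages can be checked by hand. With only two columns, the neighbours of a row with a missing `Age` are the rows whose `Salary` is closest, so the sketch below picks the two nearest salaries and averages their ages. It is a simplified stand-in for the distance that `KNNImputer` uses, which gives the same ordering in this two-column case. The helper `knn_age_by_salary` and the name `demo` are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, 45, np.nan],
    'Salary': [50000, 60000, 55000, np.nan, 65000, 70000, 72000],
})

def knn_age_by_salary(frame, row, k=2):
    """Average the Age of the k rows whose Salary is closest to that of `row`."""
    # Only rows with both values known can serve as neighbours
    candidates = frame.dropna(subset=['Age', 'Salary'])
    distance = (candidates['Salary'] - frame.loc[row, 'Salary']).abs()
    nearest = distance.nsmallest(k).index
    return candidates.loc[nearest, 'Age'].mean()

print(knn_age_by_salary(demo, 2))  # neighbours: rows 0 and 1
print(knn_age_by_salary(demo, 6))  # neighbours: rows 5 and 4
```

The results, 27.5 and 42.5, agree with the `Age_imputed` column produced by `KNNImputer(n_neighbors=2)` above.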

