## Handling Missing Values (Imputing Categorical Data)

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [27]:
loan_data = pd.read_csv('loan.csv')

In [28]:
loan_data.head(5)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### Backward Filling

Backward filling is a data imputation technique used in time series or tabular datasets to handle missing values. In this method, the missing values are filled using the next valid observation found in the dataset. It ensures that gaps are filled with future available data points, propagating values backwards.

#### When to Use:
- The data follows a temporal or sequential order.
- Future values are expected to provide meaningful approximations for earlier missing values.
- The trend of the data does not change significantly between consecutive observations.

#### Example:
Consider a dataset where some temperature readings are missing. Using backward filling, the missing values will be replaced by the next available reading.

| Day | Temperature (°C) |
|-----|------------------|
| 1   | 25               |
| 2   | NaN              |
| 3   | 28               |
| 4   | NaN              |
| 5   | 30               |

After backward filling:

| Day | Temperature (°C) |
|-----|------------------|
| 1   | 25               |
| 2   | 28               |
| 3   | 28               |
| 4   | 30               |
| 5   | 30               |
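
The temperature table above can be reproduced with a few lines of pandas. This is a minimal sketch on toy data; the names `temps` and `Temperature_bfill` are illustrative and not part of the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the temperature table above.
temps = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5],
    "Temperature": [25, np.nan, 28, np.nan, 30],
})

# bfill() copies the next valid observation upward into each gap.
temps["Temperature_bfill"] = temps["Temperature"].bfill()
print(temps)

# A gap at the very end has no later value to borrow, so it stays NaN.
trailing = pd.Series([1.0, np.nan]).bfill()
print(trailing)
```

Because a trailing gap is left unfilled, it is worth re-checking `isnull().sum()` after a backward fill, as the next cells do.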


In [29]:
loan_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [30]:
# Assign the result back: fillna(method=...) and chained inplace=True are deprecated in recent pandas.
loan_data['LoanAmount'] = loan_data['LoanAmount'].bfill()


In [31]:
loan_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

### Forward Fill

Forward fill is a data imputation technique used to handle missing values by propagating the last valid observation forward. It ensures that gaps in the dataset are filled with the most recent known value, maintaining continuity in sequential data.

#### When to Use:
- The data follows a temporal or sequential order.
- Recent past values are good estimates for subsequent missing values.
- Measurements are expected to remain stable over short periods.

#### Example:
Consider a dataset where some stock prices are missing. Using forward fill, the missing values will be replaced by the last available price.

| Day | Stock Price (USD) |
|-----|-------------------|
| 1   | 100               |
| 2   | NaN               |
| 3   | NaN               |
| 4   | 105               |
| 5   | NaN               |

After forward fill:

| Day | Stock Price (USD) |
|-----|-------------------|
| 1   | 100               |
| 2   | 100               |
| 3   | 100               |
| 4   | 105               |
| 5   | 105               |
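
The stock price example can be sketched the same way. The snippet below uses toy data with illustrative names (`prices`, `filled`, `capped`), and the `limit` argument is shown as one optional way to stop a stale value from being carried across a long gap.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the stock price table above.
prices = pd.Series(
    [100, np.nan, np.nan, 105, np.nan],
    index=pd.Index([1, 2, 3, 4, 5], name="Day"),
    name="Stock Price (USD)",
)

# ffill() carries the last valid observation forward into each gap.
filled = prices.ffill()
print(filled)

# limit=1 fills at most one consecutive gap, so day 3 stays NaN.
capped = prices.ffill(limit=1)
print(capped)
```

Forward fill cannot fill a gap at the very start of the data, since no earlier observation exists to carry forward.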



In [32]:
loan_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [34]:
# Assign the result back: fillna(method=...) and chained inplace=True are deprecated in recent pandas.
loan_data['Loan_Amount_Term'] = loan_data['Loan_Amount_Term'].ffill()


In [35]:
loan_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

### Left to Right or Right to Left Fill

These techniques handle missing values in tabular or matrix-like data by propagating values horizontally across columns (left to right or right to left) rather than vertically down the rows. They are only meaningful when adjacent columns measure the same quantity.

#### Left to Right Fill:
- Missing values are filled by the preceding non-null value from the left.
- Useful when data has a logical sequence across columns (e.g., measurements taken at different time intervals on the same day).

| Day | Morning | Noon  | Evening |
|-----|---------|-------|---------|
| 1   | 20      | NaN   | NaN     |
| 2   | 18      | 22    | NaN     |

After left to right fill:

| Day | Morning | Noon | Evening |
|-----|---------|------|---------|
| 1   | 20      | 20   | 20      |
| 2   | 18      | 22   | 22      |

#### Right to Left Fill:
- Missing values are filled by the next non-null value from the right.
- Useful when later observations are reliable indicators for earlier missing entries.

| Day | Morning | Noon  | Evening |
|-----|---------|-------|---------|
| 1   | NaN     | NaN   | 25      |
| 2   | NaN     | 22    | 24      |

After right to left fill:

| Day | Morning | Noon | Evening |
|-----|---------|------|---------|
| 1   | 25      | 25   | 25      |
| 2   | 22      | 22   | 24      |
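
Both tables above can be reproduced by passing `axis=1` to `ffill()` and `bfill()`. This is a small sketch on toy frames; `readings_lr` and `readings_rl` are illustrative names.

```python
import numpy as np
import pandas as pd

day_index = pd.Index([1, 2], name="Day")

# Left to right: each gap takes the nearest valid value to its left.
readings_lr = pd.DataFrame(
    {"Morning": [20, 18], "Noon": [np.nan, 22], "Evening": [np.nan, np.nan]},
    index=day_index,
)
left_to_right = readings_lr.ffill(axis=1)
print(left_to_right)

# Right to left: each gap takes the nearest valid value to its right.
readings_rl = pd.DataFrame(
    {"Morning": [np.nan, np.nan], "Noon": [np.nan, 22], "Evening": [25, 24]},
    index=day_index,
)
right_to_left = readings_rl.bfill(axis=1)
print(right_to_left)
```

Here all three columns are readings of the same quantity, which is what makes a horizontal fill sensible.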


In [36]:
# Row-wise backward fill (right to left). The result is only displayed, not assigned back.
loan_data.bfill(axis=1)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [37]:
# Row-wise forward fill (left to right). The result is only displayed, not assigned back.
loan_data.ffill(axis=1)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


### Mode Filling

Mode filling is a data imputation technique where missing values are replaced with the most frequent value (mode) in a dataset. This method is particularly useful for categorical data or discrete numerical data where the most common value represents a reasonable estimate.

#### When to Use:
- The data contains categories or discrete values.
- The most frequent value is meaningful and can provide a reasonable approximation.
- The column has only a few unique values and a small share of missing entries.

#### Example:
Consider a dataset where the preferred mode of transportation is recorded, but some values are missing.

| Person | Transportation |
|--------|----------------|
| 1      | Bus            |
| 2      | NaN            |
| 3      | Train          |
| 4      | Bus            |
| 5      | NaN            |

After mode filling:

| Person | Transportation |
|--------|----------------|
| 1      | Bus            |
| 2      | Bus            |
| 3      | Train          |
| 4      | Bus            |
| 5      | Bus            |

In this example, the mode of the column is "Bus," so the missing values are replaced by "Bus."
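
A minimal pandas sketch of the transportation example follows. The names `transport`, `most_common`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the transportation table above.
transport = pd.Series(
    ["Bus", np.nan, "Train", "Bus", np.nan],
    name="Transportation",
)

# mode() ignores NaN and returns a Series, because ties are possible;
# [0] picks the first (here the only) most frequent value.
most_common = transport.mode()[0]

# Replace every missing entry with that value.
filled = transport.fillna(most_common)
print(most_common)
print(filled)
```

When two categories are tied, `mode()` returns both in sorted order, so `[0]` silently picks the first one; that choice deserves a second look on real data.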


In [39]:
loan_data['Gender'].mode()

0    Male
Name: Gender, dtype: object

In [40]:
loan_data['Gender'].mode()[0]

'Male'

In [41]:
loan_data['Gender'].isnull().sum()

13

In [44]:
# Assign the result back instead of using chained inplace=True, which is deprecated.
loan_data['Gender'] = loan_data['Gender'].fillna(loan_data['Gender'].mode()[0])

In [45]:
loan_data['Gender'].isnull().sum()

0

### Filling All Object (Categorical) Columns with the Mode

In [46]:
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [48]:
loan_data.select_dtypes(include='object').head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,Urban,Y
4,LP001008,Male,No,0,Graduate,No,Urban,Y


In [49]:
loan_data.select_dtypes(include='object').isnull().sum()

Loan_ID           0
Gender            0
Married           3
Dependents       15
Education         0
Self_Employed    32
Property_Area     0
Loan_Status       0
dtype: int64

In [50]:
loan_data.select_dtypes(include='object').columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [52]:
for column in loan_data.select_dtypes(include='object').columns:
    print(column)

Loan_ID
Gender
Married
Dependents
Education
Self_Employed
Property_Area
Loan_Status


In [53]:
for column in loan_data.select_dtypes(include='object').columns:
    # Assign back instead of using chained inplace=True, which is deprecated.
    loan_data[column] = loan_data[column].fillna(loan_data[column].mode()[0])


In [54]:
loan_data.select_dtypes('object').isnull().sum()

Loan_ID          0
Gender           0
Married          0
Dependents       0
Education        0
Self_Employed    0
Property_Area    0
Loan_Status      0
dtype: int64
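
As an alternative to the column-by-column loop, the modes can be computed once and passed to a single `fillna` call. The sketch below uses a small toy frame (`toy`) rather than `loan.csv`, and it selects columns with `exclude="number"` instead of `include='object'`; that swap is an assumption made here so the selection does not depend on how a given pandas version labels string columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (hypothetical values).
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", np.nan, "No"],
    "Self_Employed": ["No", np.nan, "No", "Yes"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# Pick the non-numeric (categorical) columns.
cat_cols = toy.select_dtypes(exclude="number").columns

# DataFrame.mode() returns one row per tied mode; the first row holds
# the first mode of every column.
modes = toy[cat_cols].mode().iloc[0]

# fillna with a Series fills each column using the entry with the same label.
toy[cat_cols] = toy[cat_cols].fillna(modes)
print(toy)
```

The numeric column is left untouched, which matches the intent of the loop above: only categorical columns receive their mode.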