# Handling Missing Values

## Dataset Summary
- **Standard NaNs**: Present in most columns. `city` has none, because its gaps are stored as whitespace strings.
- **Non-standard Missing Values**:
    - **age**: Contains a `?` value.
    - **gender**: Contains an `Unknown` category.
    - **city**: Contains a whitespace-only string `' '`.
- **Data Types**: `salary` is stored as float because of its NaNs, and `age` is read as text (object dtype) because of the `?` placeholder.
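
The dtype effect described above can be reproduced on a tiny in-memory CSV. This is a minimal sketch: `csv_text` is made-up data, not the contents of `07_missing_data.csv`.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; not the real dataset.
csv_text = "age,salary\n28,40769.0\n?,45311.0\n,77819.0\n36,\n"
raw = pd.read_csv(io.StringIO(csv_text))

# A single "?" forces the whole age column to be read as text.
age_is_text = not pd.api.types.is_numeric_dtype(raw["age"])

# salary has only numbers and blanks, so it is read as float with NaN.
salary_is_float = pd.api.types.is_float_dtype(raw["salary"])
```

Checking `df.dtypes` right after loading is a quick way to spot columns that were pulled into text by a stray placeholder.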

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("07_missing_data.csv")
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Amit,Male,28,40769.0,Kolkata,30-10-2021
1,Riya,,41,99735.0,Pune,09-02-2018
2,John,Male,36,,Pune,02-06-2019
3,Neha,Female,32,42433.0,Kolkata,04-03-2020
4,Siddharth,Male,?,45311.0,Pune,15-10-2022
5,Zoe,Female,,77819.0,Bangalore,28-11-2023
6,Ken,Male,28,79188.0,Chennai,12-05-2018
7,Anjali,Female,47,57568.0,Kolkata,25-10-2020
8,Vijay,Male,40,93707.0,,06-09-2022
9,Priya,Female,44,59769.0,Delhi,03-09-2022


## Checking Missing Values
- To flag missing values, use `isnull()` or `isna()` (the two are aliases).
- To flag non-missing values, use `notnull()` or `notna()`.
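
A minimal sketch of these four methods on a toy Series (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy Series for illustration only.
s = pd.Series([28, np.nan, 36, None])

missing_mask = s.isna()     # True where the value is NaN/None
present_mask = s.notna()    # element-wise opposite of isna()

# isnull() is an alias of isna(), so both masks are identical.
same_as_alias = missing_mask.equals(s.isnull())

# True counts as 1, so summing a mask gives the number of missing values.
n_missing = int(missing_mask.sum())
```

Summing the mask is the same idea that `df.isnull().sum()` applies column by column.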

### Check for standard missing values

In [3]:
df.isnull().sum()

name               1
gender             1
age                1
salary             2
city               0
date of joining    1
dtype: int64

### Check for non-standard missing values

In [4]:
df["age"].unique()

array(['28', '41', '36', '32', '?', nan, '47', '40', '44', '45', '42',
       '25', '29', '24', '43'], dtype=object)

In [5]:
df["city"].unique()

array(['Kolkata', 'Pune', 'Bangalore', 'Chennai', ' ', 'Delhi',
       'Hyderabad', 'Mumbai'], dtype=object)

In [6]:
df["gender"].unique()

array(['Male', nan, 'Female', 'Unknown'], dtype=object)
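
Calling `unique()` column by column gets tedious on wide tables. One possible way to automate the check is sketched below; the `placeholders` set and the `count_placeholders` helper are hypothetical names introduced for this example, and the toy frame only imitates the placeholders seen above.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the placeholders seen above; values are illustrative.
toy = pd.DataFrame({
    "age": ["28", "?", np.nan, "41"],
    "gender": ["Male", "Unknown", "Female", np.nan],
    "city": ["Pune", " ", "Delhi", "Pune"],
})

# Hypothetical placeholder tokens (compared after strip + lower).
placeholders = {"?", "unknown", ""}

def count_placeholders(frame):
    """Count cells per column that match a placeholder token."""
    counts = {}
    for col in frame.columns:
        # Drop real NaNs first so they are not counted as text.
        text = frame[col].dropna().astype(str).str.strip().str.lower()
        counts[col] = int(text.isin(placeholders).sum())
    return pd.Series(counts)

placeholder_counts = count_placeholders(toy)
```

Stripping whitespace before the comparison means a lone space and an empty string are treated the same way.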

## Handling Non-Standard Missing Data
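Convert placeholders to actual NaN values so pandas can process them. Note that `replace` returns a new DataFrame: the call below only displays the result, so assign it back (`df = df.replace(...)`) to keep the change.

### Replace Placeholder Values with NaN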

In [7]:
df.replace(["?", "Unknown", " ",""], np.nan)

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Amit,Male,28.0,40769.0,Kolkata,30-10-2021
1,Riya,,41.0,99735.0,Pune,09-02-2018
2,John,Male,36.0,,Pune,02-06-2019
3,Neha,Female,32.0,42433.0,Kolkata,04-03-2020
4,Siddharth,Male,,45311.0,Pune,15-10-2022
5,Zoe,Female,,77819.0,Bangalore,28-11-2023
6,Ken,Male,28.0,79188.0,Chennai,12-05-2018
7,Anjali,Female,47.0,57568.0,Kolkata,25-10-2020
8,Vijay,Male,40.0,93707.0,,06-09-2022
9,Priya,Female,44.0,59769.0,Delhi,03-09-2022
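
A minimal sketch of keeping the replacement and then converting `age` to a numeric column. It runs on a toy frame rather than the real file, and `cleaned` is a name introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; not the real dataset.
toy = pd.DataFrame({
    "age": ["28", "?", "41"],
    "gender": ["Male", "Unknown", "Female"],
    "city": ["Pune", " ", "Delhi"],
})

# replace() returns a new frame, so the result has to be assigned.
cleaned = toy.replace(["?", "Unknown", " ", ""], np.nan)

# age is still text after the replacement; convert it to numbers.
# errors="coerce" turns anything unparseable into NaN.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
```

`pd.read_csv` also accepts an `na_values` argument, which is an alternative when the placeholder tokens are known before loading.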


## Imputation (Filling) Strategies

### Fill Salary with City-wise Median

In [8]:
df['salary'] = df['salary'].fillna(df.groupby('city')['salary'].transform('median'))
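
The group-wise fill leaves a salary empty when its city has no observed salary at all. The sketch below, on toy data, adds a fallback to the overall median; the fallback step and the `salary_filled` column name are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: Goa has no observed salary at all.
toy = pd.DataFrame({
    "city": ["Pune", "Pune", "Pune", "Delhi", "Goa"],
    "salary": [100.0, np.nan, 300.0, 500.0, np.nan],
})

# Median of each city's salaries, broadcast back to every row of that city.
city_median = toy.groupby("city")["salary"].transform("median")

# Fill from the city median first, then fall back to the overall median
# for rows whose city median is itself NaN.
overall_median = toy["salary"].median()
toy["salary_filled"] = toy["salary"].fillna(city_median).fillna(overall_median)
```

Here the Pune gap becomes 200.0 (the median of 100 and 300), while the Goa gap falls back to the overall median of 300.0.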

### Categorical: Fill Gender with Mode

In [9]:
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
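
`mode()` returns a Series, which is why the cell above indexes it with `[0]`. A small sketch on toy values, with a guard for the all-missing case added as an assumption:

```python
import numpy as np
import pandas as pd

# Toy values for illustration.
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan])

# mode() ignores NaN by default and returns every most-frequent value.
modes = gender.mode()

# If the column were entirely missing, modes would be empty and
# indexing it would fail, so leave the column untouched in that case.
gender_filled = gender.fillna(modes.iloc[0]) if not modes.empty else gender
```

When several values tie for most frequent, `mode()` returns all of them in sorted order, so taking the first one is an arbitrary choice worth knowing about.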

### Forward Fill

In [10]:
df['city'] = df['city'].ffill()
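
Forward fill copies the last seen value into each gap, so the result depends on row order and a gap at the very top stays empty. A minimal sketch on a toy Series:

```python
import numpy as np
import pandas as pd

# Toy values for illustration.
city = pd.Series([np.nan, "Pune", np.nan, np.nan, "Delhi", np.nan])

# Carry the last seen value forward; the leading NaN has nothing to copy.
filled_all = city.ffill()

# limit=1 fills at most one consecutive missing value per gap.
filled_limited = city.ffill(limit=1)
```

For a column like `city`, this is only meaningful if neighbouring rows are related (for example, sorted by time or location); otherwise the filled value is essentially arbitrary.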

## Deletion Strategies

### Drop rows where any value is missing
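Dropping rows is suitable when only a small proportion of values is missing,
while filling is preferred when dropping would discard a significant share of the data.
`dropna` removes only rows with real NaN values, so placeholders that were never converted (such as `?` or `Unknown`) survive, as the output below shows.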

In [11]:
df.dropna(inplace=True)
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Amit,Male,28,40769.0,Kolkata,30-10-2021
1,Riya,Female,41,99735.0,Pune,09-02-2018
2,John,Male,36,72523.0,Pune,02-06-2019
3,Neha,Female,32,42433.0,Kolkata,04-03-2020
4,Siddharth,Male,?,45311.0,Pune,15-10-2022
6,Ken,Male,28,79188.0,Chennai,12-05-2018
7,Anjali,Female,47,57568.0,Kolkata,25-10-2020
8,Vijay,Male,40,93707.0,,06-09-2022
9,Priya,Female,44,59769.0,Delhi,03-09-2022
10,Rahul,Unknown,32,68693.0,Bangalore,09-03-2020
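
`dropna` can be less aggressive than dropping every row with any gap. The sketch below compares a few of its options on a toy frame; the variable names are introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration.
toy = pd.DataFrame({
    "name": ["Amit", "Riya", None, "Zoe"],
    "age": [28.0, np.nan, np.nan, np.nan],
    "salary": [40769.0, 99735.0, np.nan, np.nan],
})

any_missing = toy.dropna()                   # drop rows with at least one NaN
all_missing = toy.dropna(how="all")          # drop rows where every value is NaN
salary_only = toy.dropna(subset=["salary"])  # consider only the salary column
at_least_two = toy.dropna(thresh=2)          # keep rows with >= 2 non-missing values
```

None of these calls uses `inplace=True`, so `toy` itself is unchanged and the variants can be compared side by side.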


## Summary
- Always inspect missing values first
- Choose drop or fill based on context
- Missing data handling impacts analysis and models