# **Data Cleaning**

## **1. Missing data handling**

In [119]:
import numpy as np
import pandas as pd

### 🔍 **What is Missing Data?**

Missing data occurs when:

* A value is **not recorded** or is **unavailable**
* A column has `NaN`, `None`, or `NaT` values

In pandas, missing values are typically represented as:

* `np.nan` (for float)
* `None` (object or string)
* `NaT` (for datetime)
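As a quick check of the list above, the sketch below asks pandas whether each marker counts as missing. The variable names are illustrative only. It also shows that an empty string is an ordinary value and is not treated as missing.

```python
import numpy as np
import pandas as pd

# Each of these markers is treated as missing by pandas.
missing_markers = [np.nan, None, pd.NaT]
is_missing = [pd.isna(value) for value in missing_markers]

# An empty string is a regular value, so isna() reports False for it.
empty_string_is_missing = pd.isna('')

print(is_missing)
print(empty_string_is_missing)
```

This is why a blank `''` in a text column is not counted by `isnull()` until it has been converted to `NaN`.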

In [120]:
data = {
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, np.nan, 22],
    'Salary': [50000, 60000, np.nan, 70000, 65000],
    'JoinDate': ['2020-01-01', '2019-05-21', None, '2021-07-15', ''],
}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,


## ✅ Techniques for Handling Missing Data

### 🔹 **1. Detecting Missing Values**

#### ▶️ Method: `isnull()` / `notnull()`

In [121]:
df.isnull()

Unnamed: 0,Name,Age,Salary,JoinDate
0,False,False,False,False
1,False,True,False,False
2,True,False,True,True
3,False,True,False,False
4,False,False,False,False


In [122]:
df.isnull().sum()

Name        1
Age         2
Salary      1
JoinDate    1
dtype: int64

In [123]:
df.Name.isnull()

0    False
1    False
2     True
3    False
4    False
Name: Name, dtype: bool

In [124]:
df['Age'].isnull().sum()

2

#### ✅ Use Case:

To **explore** where missing data exists and how much is missing.

🔹 *Why this method?*
It gives a **diagnostic overview** — vital before deciding how to clean.
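A diagnostic overview is often easier to read as a percentage per column. The following is a minimal sketch on a small illustrative frame (`df_demo` is a hypothetical name), relying on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, np.nan, 22],
})

# isna() is an alias of isnull(); mean() of booleans gives the missing fraction.
missing_share = df_demo.isna().mean() * 100

# Rows that contain at least one missing value.
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```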


### 🔹 **2. Dropping Missing Values**

#### ▶️ Method: `dropna()`

In [125]:
df

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [126]:
df.dropna() # Drop rows with any NaNs (row 4 is kept: its empty-string JoinDate is not NaN)

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
4,Eva,22.0,65000.0,


In [127]:
df.dropna(how='all') # Drop rows only if all values are NaN

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [128]:
df.dropna(subset=['Age']) # Drop rows where 'Age' is NaN

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
2,,30.0,,
4,Eva,22.0,65000.0,


In [129]:
df.dropna(subset=['Name', 'Age']) # Drop rows where 'Name' or 'Age' is NaN

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
4,Eva,22.0,65000.0,


In [130]:
df.dropna(subset=['Name', 'Age'], how='all')
# `how='all'` is applied within `subset`: a row is dropped only if both 'Name' and 'Age' are NaN, so no row is dropped here.

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,
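To make the interaction between `subset` and `how` visible, here is a small sketch on an illustrative frame (`df_demo` is a hypothetical name) that contains one row where both `Name` and `Age` are missing.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', np.nan, np.nan],
    'Age': [25, 30, np.nan],
    'Salary': [50000, 60000, 70000],
})

# Default how='any': drop a row if Name OR Age is missing.
any_missing_dropped = df_demo.dropna(subset=['Name', 'Age'])

# how='all': drop a row only if Name AND Age are both missing.
both_missing_dropped = df_demo.dropna(subset=['Name', 'Age'], how='all')

print(any_missing_dropped.index.tolist())
print(both_missing_dropped.index.tolist())
```

Row 1 survives the second call because only one of the two subset columns is missing.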


In [131]:
df.dropna(axis=1) # Drop columns with any missing values (every column here has one, so only the index remains)

0
1
2
3
4


#### ✅ Real-World Use Case:

In a **customer churn analysis**, a row with **too many missing values** may be better dropped entirely.

🔹 *Why this method?*
Use it when **missingness is high** or the values can't be reliably imputed.
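One way to express "too many missing values" is the `thresh` parameter of `dropna()`, which keeps rows with at least that many non-missing values. The sketch below uses an illustrative frame, and the threshold of 3 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan],
    'Age': [25, np.nan, 30],
    'Salary': [50000, 60000, np.nan],
    'JoinDate': ['2020-01-01', '2019-05-21', None],
})

# Keep only rows that have at least 3 non-missing values out of 4 columns.
mostly_complete = df_demo.dropna(thresh=3)

print(mostly_complete)
```

Row 2 has a single known value and is dropped, while row 1 (one gap) is kept.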


### 🔹 **3. Filling Missing Values (Imputation)**

#### ▶️ Method: `fillna()`

In [132]:
df

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [133]:
df['Age'].fillna(0) # Fill with constant

0    25.0
1     0.0
2    30.0
3     0.0
4    22.0
Name: Age, dtype: float64

In [134]:
age_mean = df['Age'].mean()
print(f"Mean: {age_mean}")
df['Age'].fillna(age_mean) # Fill with mean

Mean: 25.666666666666668


0    25.000000
1    25.666667
2    30.000000
3    25.666667
4    22.000000
Name: Age, dtype: float64

In [135]:
df.fillna(method='ffill') # Forward fill (`method=` is deprecated since pandas 2.1; prefer df.ffill())



Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,25.0,60000.0,2019-05-21
2,Bob,30.0,60000.0,2019-05-21
3,David,30.0,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [136]:
df.ffill()

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,25.0,60000.0,2019-05-21
2,Bob,30.0,60000.0,2019-05-21
3,David,30.0,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [137]:
df.fillna(method='bfill') # Backward fill (`method=` is deprecated since pandas 2.1; prefer df.bfill())



Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,30.0,60000.0,2019-05-21
2,David,30.0,70000.0,2021-07-15
3,David,22.0,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [138]:
df.bfill() # Backward fill

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,30.0,60000.0,2019-05-21
2,David,30.0,70000.0,2021-07-15
3,David,22.0,70000.0,2021-07-15
4,Eva,22.0,65000.0,


#### ✅ Real-World Use Cases:

* In time-series like stock prices, **forward fill** is used to propagate previous known values.
* In health records, **mean or median imputation** is used for missing height/weight.

🔹 *Why this method?*
It preserves rows by filling gaps with **informed estimates**, which is often preferable to dropping rows when every record matters.
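For the time-series case, a forward fill can be capped so that a stale value is not carried across a long gap. This is a minimal sketch with made-up prices; the series name and the limit of 2 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [101.0, np.nan, np.nan, np.nan, 104.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
    name='close',
)

# Carry the last known price forward across at most 2 consecutive gaps.
filled = prices.ffill(limit=2)

print(filled)
```

The third consecutive gap stays `NaN`, which keeps long outages visible instead of silently filled.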


### 🔹 **4. Interpolating Missing Values**

#### ▶️ Method: `interpolate()`

In [139]:
df.interpolate() # Linear interpolation; only the numeric columns (Age, Salary) change



Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,27.5,60000.0,2019-05-21
2,,30.0,65000.0,
3,David,26.0,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [140]:
df['Age'].interpolate()

0    25.0
1    27.5
2    30.0
3    26.0
4    22.0
Name: Age, dtype: float64

#### ✅ Real-World Use Case:

Sensor data with values missing at regular intervals — interpolate to **maintain trends**.

🔹 *Why this method?*
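Linear interpolation estimates each gap from its **neighbouring values**, so it follows the local trend instead of repeating a single value. It suits continuous, ordered data and is less meaningful when row order carries no information.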
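When readings are unevenly spaced, the default linear method treats rows as equally spaced, whereas `method='time'` weights by elapsed time (it requires a `DatetimeIndex`). The sensor values below are invented for illustration.

```python
import numpy as np
import pandas as pd

readings = pd.Series(
    [20.0, np.nan, 26.0],
    index=pd.to_datetime([
        '2024-01-01 00:00',
        '2024-01-01 01:00',
        '2024-01-01 03:00',
    ]),
)

# 'linear' ignores the index and places the gap halfway between neighbours.
by_position = readings.interpolate(method='linear')

# 'time' uses the timestamps: the gap is 1 hour into a 3-hour interval.
by_time = readings.interpolate(method='time')

print(by_position.iloc[1], by_time.iloc[1])
```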

### 🔹 **5. Replacing Specific Missing Patterns**

#### ▶️ Method: `replace()`

In [141]:
df

Unnamed: 0,Name,Age,Salary,JoinDate
0,Alice,25.0,50000.0,2020-01-01
1,Bob,,60000.0,2019-05-21
2,,30.0,,
3,David,,70000.0,2021-07-15
4,Eva,22.0,65000.0,


In [142]:
df.JoinDate.replace('', np.nan)

0    2020-01-01
1    2019-05-21
2          None
3    2021-07-15
4           NaN
Name: JoinDate, dtype: object

In [143]:
df.JoinDate.replace(['', None], np.nan)

0    2020-01-01
1    2019-05-21
2           NaN
3    2021-07-15
4           NaN
Name: JoinDate, dtype: object

#### ✅ Real-World Use Case:

In dirty imports, empty strings or placeholders like `'NA'`, `'null'`, `'?'` need to be **standardized to NaN**.

🔹 *Why this method?*
Prepares the data for consistent missing value handling using `fillna()` or `dropna()`.
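A sketch of standardizing several placeholders at once across a whole frame follows. The `City` column and the placeholder list are hypothetical; a real import may use different markers.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'City': ['Pune', 'NA', '?', 'Delhi'],
    'JoinDate': ['2020-01-01', '', 'null', '2021-07-15'],
})

# Hypothetical list of strings that stand for "missing" in this import.
placeholders = ['', 'NA', 'null', '?']

# Replace every placeholder with NaN in all columns.
cleaned = raw.replace(placeholders, np.nan)
missing_counts = cleaned.isna().sum()

print(missing_counts)
```

When the data comes from a CSV file, the `na_values` parameter of `pd.read_csv` can apply the same list at load time.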


### 🔹 **6. Creating Missing Value Indicator Columns**

#### ▶️ Method:

In [144]:
df['Age_missing'] = df['Age'].isnull()
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,False
1,Bob,,60000.0,2019-05-21,True
2,,30.0,,,False
3,David,,70000.0,2021-07-15,True
4,Eva,22.0,65000.0,,False


In [145]:
df['Age_missing'] = df['Age'].isnull().astype(int)
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,0
1,Bob,,60000.0,2019-05-21,1
2,,30.0,,,0
3,David,,70000.0,2021-07-15,1
4,Eva,22.0,65000.0,,0


#### ✅ Real-World Use Case:

In ML modeling, the **presence of a missing value** can itself be a predictive feature.

🔹 *Why this method?*
Preserves missingness info before filling — especially important in predictive models.
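The order matters: the indicator has to be created before the gaps are filled. Below is a minimal sketch that adds an indicator for every column containing missing values and only then imputes. The frame and the `_missing` suffix are illustrative choices.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, np.nan, 30],
    'Salary': [50000, 60000, np.nan],
    'City': ['Pune', 'Delhi', 'Agra'],
})

# Record missingness first, for each column that has at least one gap.
cols_with_missing = [col for col in df_demo.columns if df_demo[col].isna().any()]
for col in cols_with_missing:
    df_demo[f'{col}_missing'] = df_demo[col].isna().astype(int)

# Only then fill the original column; the indicator keeps the information.
df_demo['Age'] = df_demo['Age'].fillna(df_demo['Age'].median())

print(df_demo)
```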


### 🔹 **7. Column-wise Imputation with Custom Strategies**

In [146]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,0
1,Bob,,60000.0,2019-05-21,1
2,,30.0,,,0
3,David,,70000.0,2021-07-15,1
4,Eva,22.0,65000.0,,0


In [147]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Name'] = df['Name'].fillna('Unknown')

In [148]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,0
1,Bob,25.0,60000.0,2019-05-21,1
2,Unknown,30.0,,,0
3,David,25.0,70000.0,2021-07-15,1
4,Eva,22.0,65000.0,,0


#### ✅ Use Case:

In demographics data:

* **Age**: numerical, use **median** (robust to outliers)
* **Name**: categorical, use placeholder like **'Unknown'**

🔹 *Why this method?*
Tailors imputation strategy to **column data type and distribution**.
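The per-column strategies can also be collected in one dictionary and passed to `fillna()`. This sketch assumes a hypothetical `Dept` column filled with its most frequent value, in addition to the median and placeholder strategies described above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', np.nan, 'Eva'],
    'Age': [25, np.nan, 22],
    'Dept': ['HR', 'HR', np.nan],
})

# One fill value per column, chosen to match the column's type.
fill_values = {
    'Name': 'Unknown',                       # categorical placeholder
    'Age': df_demo['Age'].median(),          # robust numeric centre
    'Dept': df_demo['Dept'].mode().iloc[0],  # most frequent category
}
filled = df_demo.fillna(fill_values)

print(filled)
```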


### 🔹 **8. Conditional Imputation**

In [149]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,0
1,Bob,25.0,60000.0,2019-05-21,1
2,Unknown,30.0,,,0
3,David,25.0,70000.0,2021-07-15,1
4,Eva,22.0,65000.0,,0


In [150]:
# Fill missing salary with the mean salary of rows that share the same Age value
df['Salary'] = df.groupby('Age')['Salary'].transform(
    lambda x: x.fillna(x.mean())
)

In [151]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Age_missing
0,Alice,25.0,50000.0,2020-01-01,0
1,Bob,25.0,60000.0,2019-05-21,1
2,Unknown,30.0,,,0
3,David,25.0,70000.0,2021-07-15,1
4,Eva,22.0,65000.0,,0


#### ✅ Real-World Use Case:

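Filling a **missing salary** from a related feature such as an **Age** segment, on the assumption that the two are associated. In the example above, row 2 is the only record with Age 30, so its group mean is also NaN and the salary stays missing; grouping on exact values only helps when another row in the same group has a known value.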

🔹 *Why this method?*
**Context-aware**, and often more accurate than a global average when the grouping feature is related to the missing one.
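Grouping on exact ages leaves a gap whenever a group has no known salary. One possible refinement, sketched here, is to group on age bands built with `pd.cut` and fall back to the global mean. The band edges, labels, and the extra row with Age 31 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25.0, 25.0, 30.0, 25.0, 22.0, 31.0],
    'Salary': [50000, 60000, np.nan, 70000, 65000, 80000.0],
})

# Exact-age groups: Age 30 has no known salary, so the gap remains.
by_exact_age = df_demo.groupby('Age')['Salary'].transform('mean')
still_missing = df_demo['Salary'].fillna(by_exact_age).isna().sum()

# Age bands (hypothetical edges): (20, 26] and (26, 35].
age_band = pd.cut(df_demo['Age'], bins=[20, 26, 35], labels=['up to 26', '27 to 35'])
band_mean = df_demo.groupby(age_band, observed=True)['Salary'].transform('mean')

# Fill from the band mean first, then from the global mean as a fallback.
global_mean = df_demo['Salary'].mean()
filled = df_demo['Salary'].fillna(band_mean).fillna(global_mean)

print(still_missing)
print(filled)
```

Here the missing salary is filled from the only other record in the same band; with wider or narrower bands the result would differ, so the band edges deserve a domain-driven choice.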


### 🔹 **9. Filling Using External Data / Domain Rules**

Example: Fill missing age using voter records, employee database, etc. (manual or external lookup).

#### ✅ Real-World Use Case:

Employee records with missing age cross-verified with HR system.

🔹 *Why this method?*
Usually the most accurate option when domain-specific knowledge or systems are available.
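An external lookup can be sketched with `map()` and `fillna()`. The `EmpId` key and the `hr_ages` series are hypothetical stand-ins for an HR system extract, not a real data source.

```python
import numpy as np
import pandas as pd

employees = pd.DataFrame({
    'EmpId': [101, 102, 103],
    'Age': [25.0, np.nan, np.nan],
})

# Hypothetical extract from an HR system, indexed by employee id.
hr_ages = pd.Series({101: 26.0, 102: 41.0}, name='Age')

# Look up each employee's HR age and use it only where Age is missing.
employees['Age'] = employees['Age'].fillna(employees['EmpId'].map(hr_ages))

print(employees)
```

Existing values are left untouched (employee 101 keeps 25.0), and employees absent from the lookup stay missing.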


### 🔹 **10. Use of Third-party Libraries**

* `sklearn.impute.SimpleImputer` – Mean/median/mode strategies
* `fancyimpute` – KNN imputation
* `missingno` – Visualization of missingness
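As an illustration of the first item, the sketch below applies `SimpleImputer` with the median strategy to the numeric columns of a small frame. It assumes scikit-learn is installed; `df_demo` and `imputed` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_demo = pd.DataFrame({
    'Age': [25, np.nan, 30, np.nan, 22],
    'Salary': [50000, 60000, np.nan, 70000, 65000],
})

# Learn each column's median, then fill the gaps with it.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(
    imputer.fit_transform(df_demo),  # returns a NumPy array
    columns=df_demo.columns,
    index=df_demo.index,
)

print(imputed)
```

Compared with `fillna()`, a fitted imputer can be reused to fill new data with the medians learned from the training data.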


## 📌 Summary Table

| Technique              | When to Use                                | Example                              |
| ---------------------- | ------------------------------------------ | ------------------------------------ |
| `dropna()`             | Row/column has too many missing values     | Drop user record with all values NaN |
| `fillna()`             | Simple fill (mean, median, zero, etc.)     | Fill missing Age                     |
| `interpolate()`        | Time-series or continuous data             | Sensor/stock value filling           |
| `replace()`            | Replace 'NA', '', '?' with np.nan          | Clean imported CSV                   |
| Indicator Columns      | Model needs info on missing value presence | Add `Age_missing` column             |
| Group-based Imputation | Missing value depends on another feature   | Salary by Age group                  |
| Conditional Logic      | Use if-else for specific rows              | Set Age to 30 if Name == 'Bob'       |
| Domain Knowledge       | Pull data from external source             | Fill from HR system                  |
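The "Conditional Logic" row can be sketched with a boolean mask and `.loc`. The rule itself (setting Bob's age to 30) is taken from the table's example; the frame is illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David'],
    'Age': [25, np.nan, np.nan],
})

# Select rows named 'Bob' whose Age is missing.
bob_without_age = (df_demo['Name'] == 'Bob') & df_demo['Age'].isna()

# Assign the rule's value only to those rows.
df_demo.loc[bob_without_age, 'Age'] = 30

print(df_demo)
```

Rows that do not match the condition, such as David's, keep their missing value for a different strategy to handle.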


<center><b>Thanks</b></center>