# **Data Cleaning**

## **4. Handling Invalid / Out-of-Range Values**

In [17]:
import numpy as np
import pandas as pd 

Handling invalid values is a **critical part** of Data Cleaning because they can:

* Break your analytics or ML models,
* Mislead business insights,
* Indicate data corruption or poor validation.

The sections below walk through the main techniques, each with a real-world example and the reason to use it.

## 💡 What Are Invalid / Out-of-Range Values?

These are values that are:

* Outside of **logical or domain-specific boundaries**,
* **Impossible** or **improbable** entries,
* Wrong **data type** or **format** (e.g., string in numeric column).
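
The third case, text sitting in a numeric column, can be surfaced before any range check is applied. The sketch below is one possible approach; the `raw_age` series and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, mixed with non-numeric entries
raw_age = pd.Series(['25', '34', 'unknown', '45', 'N/A'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric_age = pd.to_numeric(raw_age, errors='coerce')

# Entries that were present in the raw data but failed numeric parsing
bad_entries = raw_age[numeric_age.isna() & raw_age.notna()]
print(bad_entries.tolist())
```

Keeping the raw column next to the coerced one makes it possible to review what was rejected instead of losing it silently.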

### ✅ Real-World Examples

| Domain      | Column          | Invalid Values Example            |
| ----------- | --------------- | --------------------------------- |
| Healthcare  | Age             | -5, 200                           |
| Education   | Score (%)       | 101%, -10                         |
| E-commerce  | Purchase Amount | ₹0 or ₹99999999                   |
| Banking     | Account Opened  | "32nd Feb", "N/A"                 |
| IoT/Sensors | Temperature     | 9999, -300°C (below absolute zero) |


## 🧰 Techniques for Handling Invalid / Out-of-Range Values

In [18]:
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Age': [25, 34, -1, 120, 45, np.nan],
    'Score': [88, 92, 105, -10, 76, 95]
})

df

|   | CustomerID | Age   | Score |
| - | ---------- | ----- | ----- |
| 0 | 1          | 25.0  | 88    |
| 1 | 2          | 34.0  | 92    |
| 2 | 3          | -1.0  | 105   |
| 3 | 4          | 120.0 | -10   |
| 4 | 5          | 45.0  | 76    |
| 5 | 6          | NaN   | 95    |


### 🔹 1. **Detecting Invalid Values via Logical Checks**

Use **logical conditions** to identify and flag invalid entries.

In [19]:
df[df['Age'] < 0] # Age can't be negative

|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 2 | 3          | -1.0 | 105   |


In [20]:
df[df['Score'] > 100] # Score > 100 is invalid

|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 2 | 3          | -1.0 | 105   |


#### ✅ Use Case:

* Use in domains with **fixed limits** (e.g., age, temperature, percentages)

🔹 *Why?*
Fast, intuitive, and works well with **domain knowledge.**
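
To flag entries instead of only filtering them, each rule can be stored as its own boolean column. This is a minimal sketch: `df_demo` copies the sample data above, and the column names `age_invalid` and `score_invalid` are illustrative choices. The upper age limit of 100 is an assumption borrowed from the range check in the next technique.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Age': [25, 34, -1, 120, 45, np.nan],
    'Score': [88, 92, 105, -10, 76, 95]
})

# One flag per rule, so the reason a row was rejected stays visible
df_demo['age_invalid'] = (df_demo['Age'] < 0) | (df_demo['Age'] > 100)
df_demo['score_invalid'] = (df_demo['Score'] < 0) | (df_demo['Score'] > 100)

# Rows that break at least one rule
flagged = df_demo[df_demo['age_invalid'] | df_demo['score_invalid']]
print(flagged['CustomerID'].tolist())
```

Comparisons against `NaN` evaluate to `False`, so the missing age in the last row is not flagged here; missing values need a separate `isna()` check.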

### 🔹 2. **Using Boolean Masking for Range Checks**

In [21]:
valid_age_mask = (df['Age'] >= 0) & (df['Age'] <= 100)
df[valid_age_mask]

|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 0 | 1          | 25.0 | 88    |
| 1 | 2          | 34.0 | 92    |
| 4 | 5          | 45.0 | 76    |


#### ✅ Use Case:

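Keep only the rows inside a **known valid range** (e.g., a plausible human age). Comparisons with `NaN` evaluate to `False`, so rows with a missing age are dropped by this mask as well.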

🔹 *Why?*
Precise control; you can also invert the mask with `~` to isolate the invalid data.
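
Because the inverted mask mixes out-of-range values with missing ones, it can help to separate the two. The sketch below uses `Series.between`, which is inclusive on both ends by default; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 34, -1, 120, 45, np.nan])

# Same check as (age >= 0) & (age <= 100); NaN yields False
in_range = age.between(0, 100)

# Present but outside the range, versus not recorded at all
out_of_range = age.notna() & ~in_range
missing = age.isna()

print(age[out_of_range].tolist())
print(int(missing.sum()))
```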


### 🔹 3. **Replacing Invalid Values with NaN**

In [22]:
df.loc[~valid_age_mask, 'Age'] = np.nan

In [23]:
df

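|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 0 | 1          | 25.0 | 88    |
| 1 | 2          | 34.0 | 92    |
| 2 | 3          | NaN  | 105   |
| 3 | 4          | NaN  | -10   |
| 4 | 5          | 45.0 | 76    |
| 5 | 6          | NaN  | 95    |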


#### ✅ Use Case:

Standardize invalid entries as `NaN` before **imputation** or analysis.

🔹 *Why?*
pandas treats `NaN` uniformly: aggregations such as `mean()` and `median()` skip it by default, and `fillna` / `dropna` act on it directly.
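
As an illustration of that workflow, the sketch below converts invalid scores to `NaN` with `Series.where` and then fills them with the median of the remaining values. Median filling is only one possible imputation choice, used here as an example.

```python
import pandas as pd

score = pd.Series([88, 92, 105, -10, 76, 95])

# where() keeps values where the condition holds and writes NaN elsewhere
cleaned = score.where(score.between(0, 100))

# median() ignores NaN, so it is computed from the valid scores only
filled = cleaned.fillna(cleaned.median())

print(cleaned.tolist())
print(filled.tolist())
```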

### 🔹 4. **Using `apply()` for Complex Checks**

In [24]:
def clean_score(score):
    if 0 <= score <= 100:
        return score
    else:
        return np.nan
    
df['Score'] = df['Score'].apply(clean_score)

df

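|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 0 | 1          | 25.0 | 88.0  |
| 1 | 2          | 34.0 | 92.0  |
| 2 | 3          | NaN  | NaN   |
| 3 | 4          | NaN  | NaN   |
| 4 | 5          | 45.0 | 76.0  |
| 5 | 6          | NaN  | 95.0  |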


#### ✅ Use Case:

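Flexible for rules that a single range check cannot express. When the logic depends on several columns, call `DataFrame.apply(..., axis=1)` so the function receives a whole row instead of a single value.

🔹 *Why?*
Great for **custom rules**, since any Python logic can go in the function. It runs element by element, so for simple range checks a boolean mask is faster.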
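
A row-wise check across two columns might look like the sketch below. The `staff` data, the `Role` column, and the minimum ages are all hypothetical, chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical data: the valid age depends on the role
staff = pd.DataFrame({
    'Age': [16, 30, 45, 15],
    'Role': ['Intern', 'Manager', 'Engineer', 'Manager']
})

def is_valid_row(row):
    # Assumed rule for illustration: managers must be at least 21,
    # every other role at least 16
    min_age = 21 if row['Role'] == 'Manager' else 16
    return row['Age'] >= min_age

# axis=1 passes each row to the function as a Series
staff['valid'] = staff.apply(is_valid_row, axis=1)
print(staff)
```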

### 🔹 5. **Handling Invalid Dates or Formats**

In [25]:
df1 = pd.DataFrame({
    'StartDate': ['2024-01-01', 'not-a-date', '2025-07-21']
})

df1['StartDate'] = pd.to_datetime(df1['StartDate'], errors='coerce')

df1

|   | StartDate  |
| - | ---------- |
| 0 | 2024-01-01 |
| 1 | NaT        |
| 2 | 2025-07-21 |


#### ✅ Use Case:

When parsing **date columns** with malformed entries.

🔹 *Why?*
Converts bad dates to `NaT` (like `NaN` for dates) for easy handling.
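
Coercion overwrites the original text, so it can be useful to parse into a separate variable and list the entries that failed. This sketch assumes the dates are expected in `YYYY-MM-DD` form and passes that format explicitly; the sample values are made up.

```python
import pandas as pd

raw = pd.Series(['2024-01-01', 'not-a-date', '2025-02-30'])

# Anything that does not match the format, or is not a real calendar date,
# becomes NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Raw strings that were present but could not be parsed
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())
```

The last entry is well-formed text but names a day that does not exist (30 February), and it is rejected the same way as the free-text entry.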

### 🔹 6. **Domain-Based Replacement**

In [27]:
# Replace out-of-range and missing ages with the median of the valid ages
median_age = df[(df['Age'] >= 0) & (df['Age'] <= 100)]['Age'].median()
df['Age'] = df['Age'].apply(lambda x: x if 0 <= x <= 100 else median_age)

df

|   | CustomerID | Age  | Score |
| - | ---------- | ---- | ----- |
| 0 | 1          | 25.0 | 88.0  |
| 1 | 2          | 34.0 | 92.0  |
| 2 | 3          | 34.0 | NaN   |
| 3 | 4          | 34.0 | NaN   |
| 4 | 5          | 45.0 | 76.0  |
| 5 | 6          | 34.0 | 95.0  |


#### ✅ Use Case:

When you need to keep all records (e.g., small sample size or business need).

🔹 *Why?*
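Keeps every row while substituting a plausible value. The median is not pulled around by the extreme entries being replaced, but imputing one value for many rows narrows the column's spread.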
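
The same replacement can be written without `apply()`, and an indicator column can record which values were imputed. This is a sketch on a fresh copy of the original ages; the name `was_imputed` is illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 34, -1, 120, 45, np.nan])

valid = age.between(0, 100)          # False for out-of-range and for NaN
median_age = age[valid].median()     # median of the valid ages only

# Remember which entries are about to be overwritten
was_imputed = ~valid

# where() keeps valid ages and substitutes the median everywhere else
imputed = age.where(valid, median_age)

print(imputed.tolist())
print(was_imputed.tolist())
```

Keeping `was_imputed` alongside the filled column lets later analysis exclude or down-weight the substituted values.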

## 📊 Summary Table

| Technique                         | When to Use                          | Example Use Case                  |
| --------------------------------- | ------------------------------------ | --------------------------------- |
| Logical Checks                    | Clear domain limits                  | Age < 0, Score > 100              |
| Boolean Masks                     | Efficient filtering                  | Validating temperature range      |
| Replace with NaN                  | Prepare for imputation               | Replacing -1 salary with NaN      |
| `apply()` + custom logic          | Complex validation                   | Age + Role logic for employment   |
| `pd.to_datetime(errors='coerce')` | For bad/malformed dates              | Parsing signup dates in user logs |
| Replace with Median/Mode          | Keep all rows with reasonable values | Fill invalid ages with median     |


### ✅ Best Practices

* Always **use domain knowledge** to define valid ranges.
* Replace invalid values with `NaN` to use built-in pandas tools (`fillna`, `dropna`, etc.).
* If unsure, flag invalid values rather than deleting them.
* Validate both **range** and **data type** (e.g., strings in numeric columns).

