# Lesson 3.6: Handling Missing Data

## Why This Matters for ML

Real-world data is MESSY. Sensors go offline, users skip fields, data gets corrupted.
Most ML models **can't handle missing values** (NaN), so you must deal with them before training.

### PHP Parallel
Like checking for `null` in PHP: `$value ?? 'default'` or `isset($value)`
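
One difference from PHP's `null` is worth seeing early: NaN is a floating-point value that does not compare equal to itself, so an `== np.nan` check fails silently. A minimal sketch of that behaviour (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN never compares equal to anything, including itself
equals_nan = (value == np.nan)

# pd.isna() is the reliable test, and it also recognises None
is_missing = pd.isna(value)
none_is_missing = pd.isna(None)

# NaN propagates through ordinary arithmetic
total = value + 1

print(equals_nan, is_missing, none_is_missing, total)
```

This is why the cells below use `isna()` rather than equality comparisons.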

In [None]:
import pandas as pd
import numpy as np

# Data with missing values (NaN = Not a Number)
df = pd.DataFrame({
    'filter_id': ['F001', 'F002', 'F003', 'F004', 'F005', 'F006'],
    'tds_output': [42, np.nan, 120, 35, np.nan, 180],
    'flow_rate': [2.1, 1.5, np.nan, 2.3, 1.2, 0.5],
    'age_days': [60, 180, 320, np.nan, 240, 350],
    'region': ['North', 'South', None, 'East', 'West', 'South']
})
df

In [None]:
# Detect missing values
print("Missing values per column:")
print(df.isna().sum())
print(f"\nTotal missing: {df.isna().sum().sum()}")
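
Raw counts can be hard to judge on larger tables. As a small sketch, using the same `df` as above (rebuilt here so the snippet runs on its own), the share of missing values per column and the number of gaps per row could be computed like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'filter_id': ['F001', 'F002', 'F003', 'F004', 'F005', 'F006'],
    'tds_output': [42, np.nan, 120, 35, np.nan, 180],
    'flow_rate': [2.1, 1.5, np.nan, 2.3, 1.2, 0.5],
    'age_days': [60, 180, 320, np.nan, 240, 350],
    'region': ['North', 'South', None, 'East', 'West', 'South']
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isna().mean()

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

print((missing_share * 100).round(1))
print(missing_per_row)
```

A column that is mostly empty may be a candidate for removal, while a row with several gaps may be a candidate for dropping.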

In [None]:
# Strategy 1: DROP rows with any missing value
# Use when: few missing rows, lots of data
cleaned = df.dropna()
print(f"Before: {len(df)} rows → After dropna: {len(cleaned)} rows")
cleaned
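
Dropping every incomplete row can be costly on a small table: here it keeps only 2 of the 6 rows. One less aggressive option is the `thresh` parameter of `dropna`, which keeps rows with at least a given number of non-missing values. A sketch, again rebuilding `df` so it runs standalone (the threshold of 4 is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'filter_id': ['F001', 'F002', 'F003', 'F004', 'F005', 'F006'],
    'tds_output': [42, np.nan, 120, 35, np.nan, 180],
    'flow_rate': [2.1, 1.5, np.nan, 2.3, 1.2, 0.5],
    'age_days': [60, 180, 320, np.nan, 240, 350],
    'region': ['North', 'South', None, 'East', 'West', 'South']
})

# Keep rows that have at least 4 non-missing values out of the 5 columns
mostly_complete = df.dropna(thresh=4)

print(f"Strict dropna:    {len(df.dropna())} rows")
print(f"dropna(thresh=4): {len(mostly_complete)} rows")
```

The rows that survive can still contain a NaN, so this approach is usually combined with filling.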

In [None]:
# Strategy 2: FILL with a value
# Like PHP: $value ?? 'default'

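# Numeric columns: fill with the mean or the median (median is more robust to outliers)
# Text columns: fill with a placeholder such as 'Unknown'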
df_filled = df.copy()
df_filled['tds_output'] = df_filled['tds_output'].fillna(df_filled['tds_output'].mean())
df_filled['flow_rate'] = df_filled['flow_rate'].fillna(df_filled['flow_rate'].median())
df_filled['age_days'] = df_filled['age_days'].fillna(df_filled['age_days'].median())
df_filled['region'] = df_filled['region'].fillna('Unknown')

print("After filling:")
df_filled
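
The choice between mean and median matters most when a column contains outliers. A small sketch with made-up readings (hypothetical values, not from the filter dataset) shows how one extreme value pulls the mean but leaves the median nearly untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one extreme value and one gap
readings = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

# mean() and median() skip NaN by default
mean_fill = readings.fillna(readings.mean())
median_fill = readings.fillna(readings.median())

print(f"mean = {readings.mean()}, median = {readings.median()}")
print("Filled with mean:  ", mean_fill.tolist())
print("Filled with median:", median_fill.tolist())
```

Filling the gap with the mean would invent a reading far larger than three of the four observed values, which is why the median is often the safer default for skewed columns.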

## Exercise

1. Check which rows have missing values using `df[df.isna().any(axis=1)]`
2. Fill the NaNs in each numeric column with that column's median
3. Drop only the rows where `tds_output` is missing

In [None]:
# YOUR CODE HERE