# Lesson 3.7: Data Cleaning

## The 80/20 Rule of Data Science

A common rule of thumb says data scientists spend roughly 80% of their time **cleaning data** and only about 20% on actual modeling.

### PHP Parallel
- `rename()` → Like renaming database columns in a migration
- `.str.lower()` → Like `strtolower()` applied to an entire column at once
- `apply()` → Similar to Laravel's `Collection::map()`: runs a function on each value and returns the results
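
The three methods in the list above can be tried side by side on a tiny frame. This is only a minimal sketch: the `readings` DataFrame and its values are made up for illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration
readings = pd.DataFrame({'Filter ID': ['F001', 'F002'], 'TDS Output': [42, 78]})

# rename() returns a new DataFrame; columns missing from the mapping are left alone
renamed = readings.rename(columns={'Filter ID': 'filter_id'})

# .str.lower() lower-cases every label (or every value) in one call
lowered = readings.columns.str.lower()

# apply() calls the function once per value, like mapping over a collection
doubled = readings['TDS Output'].apply(lambda x: x * 2)

print(list(renamed.columns))
print(list(lowered))
print(list(doubled))
```

Note that `rename()` does not modify `readings` in place; the result has to be assigned to a variable.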

In [None]:
import pandas as pd
import numpy as np

# Messy data (typical real-world scenario)
messy = pd.DataFrame({
    'Filter ID': ['F001', 'f002', 'F003', 'F001', 'F004'],  # Inconsistent case, duplicate
    'TDS Output ': [42, 78, '120', 35, 95],     # Extra space, string mixed in
    'Region  ': ['  North ', 'south', 'NORTH', 'North', ' East'],  # Messy strings
    'status': ['active', 'active', 'inactive', 'active', 'Active']  # Inconsistent
})
print("Messy data:")
messy

In [None]:
# Step 1: Fix column names
df = messy.copy()
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("Cleaned columns:", list(df.columns))
df
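
If the same column cleanup is needed in several notebooks, it can be wrapped in a small helper. The sketch below is one possible version, not the lesson's own code: the function name `clean_column_names` is hypothetical, and it swaps the plain `' '` replacement for a regular expression so that a run of several spaces inside a name collapses into a single underscore.

```python
import pandas as pd

def clean_column_names(frame):
    """Return a copy of frame with stripped, lower-cased, snake_case column names."""
    out = frame.copy()
    out.columns = (
        out.columns
        .str.strip()                           # drop leading/trailing whitespace
        .str.lower()                           # standardize case
        .str.replace(r'\s+', '_', regex=True)  # any run of whitespace -> one underscore
    )
    return out

# Empty frame with messy headers (the double space in 'TDS  Output ' is deliberate)
sample = pd.DataFrame(columns=['Filter ID', 'TDS  Output ', 'Region  '])
cleaned = clean_column_names(sample)
print(list(cleaned.columns))
```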

In [None]:
# Step 2: Fix string columns (strip whitespace, standardize case)
# The .str accessor applies a string method to every value in a column,
# much like running a PHP string function over the whole column.
df['filter_id'] = df['filter_id'].str.upper()        # 'f002' → 'F002'
df['region'] = df['region'].str.strip().str.title()  # '  North ' → 'North'
df['status'] = df['status'].str.lower()              # 'Active' → 'active'
df
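
A quick way to check that string cleanup worked is to count distinct values before and after. The sketch below reuses the region values from the messy data in a standalone Series; the variable names are assumptions made for the example.

```python
import pandas as pd

# Same region values as the messy data, held in a standalone Series
region = pd.Series(['  North ', 'south', 'NORTH', 'North', ' East'])

before = region.nunique()                  # every spelling counts as its own value
normalized = region.str.strip().str.title()
after = normalized.nunique()               # variants of the same region now collapse

print(f"Distinct values before: {before}, after: {after}")
print(sorted(normalized.unique()))
```

If the "after" count is still higher than the number of real categories, the remaining values usually point to typos that need a manual mapping.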

In [None]:
# Step 3: Fix data types
df['tds_output'] = pd.to_numeric(df['tds_output'])  # Convert string '120' to number
print(df.dtypes)
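
By default `pd.to_numeric()` raises an error when it meets a value it cannot parse. Real data often contains placeholders such as `'n/a'`, so a hedged sketch of the `errors='coerce'` option follows. The `'n/a'` entry is invented for illustration and does not appear in the lesson data.

```python
import pandas as pd

# 'n/a' is a made-up bad value added to show what happens to unparseable input
raw = pd.Series([42, 78, '120', 'n/a', 95])

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

# Look at the original values that failed to convert before deciding how to handle them
bad_values = raw[numeric.isna()]

print(numeric)
print("Could not convert:", list(bad_values))
```

Because NaN is a float, the resulting column has dtype `float64` rather than an integer type.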

In [None]:
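# Step 4: Remove duplicates (rows that share a filter_id)
# keep='first' keeps the earliest row per ID and drops later ones,
# even when their other values differ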
print(f"Before: {len(df)} rows")
df = df.drop_duplicates(subset='filter_id', keep='first')
print(f"After removing duplicates: {len(df)} rows")
df
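
Before dropping rows it can help to look at what is about to be removed, because the two `F001` rows carry different readings. The sketch below uses `duplicated(keep=False)` to show every row involved in a clash; the `frame` variable is a simplified stand-in for the cleaned data.

```python
import pandas as pd

# Simplified stand-in for the cleaned data (IDs already upper-cased)
frame = pd.DataFrame({
    'filter_id': ['F001', 'F002', 'F003', 'F001', 'F004'],
    'tds_output': [42, 78, 120, 35, 95],
})

# keep=False marks every row whose filter_id appears more than once
conflicts = frame[frame.duplicated(subset='filter_id', keep=False)]
print(conflicts)

# After reviewing the conflicts, drop all but the first row per ID
deduped = frame.drop_duplicates(subset='filter_id', keep='first')
print(f"{len(frame)} rows -> {len(deduped)} rows")
```

Whether the first or the last reading is the right one to keep depends on the data source; `keep='last'` is the other built-in option.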

In [None]:
# apply() - Like Collection::map(): runs the function once per value
# Thresholds: below 80 → 'Safe', 80 to 119 → 'Warning', 120 and above → 'Danger'
df['tds_category'] = df['tds_output'].apply(
    lambda x: 'Safe' if x < 80 else 'Warning' if x < 120 else 'Danger'
)
df
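
For simple threshold bucketing like this, `pd.cut()` is an alternative to `apply()` with a nested conditional. This is a different technique from the one used in the lesson, shown here as a sketch on a standalone Series with the same thresholds.

```python
import numpy as np
import pandas as pd

tds = pd.Series([42, 78, 120, 95])

# right=False makes each bin include its left edge:
# [-inf, 80) -> Safe, [80, 120) -> Warning, [120, inf) -> Danger
categories = pd.cut(
    tds,
    bins=[-np.inf, 80, 120, np.inf],
    labels=['Safe', 'Warning', 'Danger'],
    right=False,
)

# Same thresholds written with apply(), for comparison
via_apply = tds.apply(
    lambda x: 'Safe' if x < 80 else 'Warning' if x < 120 else 'Danger'
)

print(list(categories))
print(list(via_apply))
```

`pd.cut()` returns a categorical column, which keeps the label order (Safe, Warning, Danger) available for sorting and grouping.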

## Exercise

1. Create a messy DataFrame with inconsistent data
2. Clean column names, strip whitespace, standardize case
3. Use `apply()` to create a new computed column

In [None]:
# YOUR CODE HERE