# 🧹 Data Cleaning: Amazon Sales Dataset

This notebook identifies data quality issues, exports bad rows to a separate CSV, cleans the main dataset, and re-runs the validation pipeline.

> ⚠️ **Run this notebook BEFORE running `python dq_pipeline.py`** to ensure the data is clean.

## 1. Load Data

In [None]:
import pandas as pd
import importlib

CSV_PATH = "data/amazon_sales.csv"
BAD_ROWS_PATH = "data/bad_rows.csv"

df = pd.read_csv(CSV_PATH, low_memory=False)
print(f"Rows: {len(df):,}  |  Columns: {len(df.columns)}")
df.head()

## 2. Inspect Data Quality Issues

In [None]:
key_cols = ["Order ID", "Date", "Status", "Fulfilment", "currency", "Qty", "Amount", "ship-country"]

print("=== NULL COUNTS ===")
null_counts = df[key_cols].isnull().sum()
print(null_counts[null_counts > 0])
print()
print("=== All Status Values ===")
print(df["Status"].value_counts(dropna=False))
print()
print("=== Currency Values ===")
print(df["currency"].value_counts(dropna=False))
print()
print("=== Ship-Country Values ===")
print(df["ship-country"].value_counts(dropna=False))
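
The `Qty < 0` comparison used in the next section presumes that `Qty` was read as a numeric column. The following is a minimal sketch, on made-up data, of how unparseable entries could be surfaced first with `pd.to_numeric`; the `toy` frame and variable names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy column standing in for "Qty" (values are invented for illustration).
toy = pd.DataFrame({"Qty": ["1", "2", "abc", None, "-3"]})

# Coerce to numbers; anything that cannot be parsed becomes NaN.
qty_numeric = pd.to_numeric(toy["Qty"], errors="coerce")

# Entries that were present in the source but could not be parsed.
unparseable = toy.loc[qty_numeric.isnull() & toy["Qty"].notnull(), "Qty"]

# Negative quantities, now safe to compare numerically.
negative = qty_numeric[qty_numeric < 0]

print("Unparseable:", unparseable.tolist())
print("Negative:", negative.tolist())
```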

## 3. Identify & Export Bad Rows

In [None]:
# Identify all rows with any issue
mask_null_currency = df["currency"].isnull()
mask_null_amount = df["Amount"].isnull()
mask_null_country = df["ship-country"].isnull()
mask_null_order_id = df["Order ID"].isnull()
mask_neg_qty = df["Qty"] < 0

bad_mask = mask_null_currency | mask_null_amount | mask_null_country | mask_null_order_id | mask_neg_qty

bad_rows = df[bad_mask].copy()
bad_rows["issue"] = ""
bad_rows.loc[mask_null_currency, "issue"] += "null_currency; "
bad_rows.loc[mask_null_amount, "issue"] += "null_amount; "
bad_rows.loc[mask_null_country, "issue"] += "null_ship_country; "
bad_rows.loc[mask_null_order_id, "issue"] += "null_order_id; "
bad_rows.loc[mask_neg_qty, "issue"] += "negative_qty; "

print(f"Total bad rows found: {len(bad_rows):,}")
print(f"  Null currency:     {mask_null_currency.sum():,}")
print(f"  Null Amount:       {mask_null_amount.sum():,}")
print(f"  Null ship-country: {mask_null_country.sum():,}")
print(f"  Null Order ID:     {mask_null_order_id.sum():,}")
print(f"  Negative Qty:      {mask_neg_qty.sum():,}")
print()
bad_rows.head(10)

In [None]:
# Export bad rows to a separate CSV for reference
bad_rows.to_csv(BAD_ROWS_PATH, index=False)
print(f"✅ Exported {len(bad_rows):,} bad rows → {BAD_ROWS_PATH}")
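
The labelling step above repeats one line per check. As an alternative sketch (not the notebook's method), the masks could be kept in a dictionary and the `issue` text derived from whichever checks fail. The `toy` frame and the `checks` mapping are hypothetical and use made-up values.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
toy = pd.DataFrame(
    {
        "Order ID": ["A1", None, "A3", "A4"],
        "currency": ["INR", "INR", None, "INR"],
        "Qty": [1, 2, -1, 3],
    }
)

# Hypothetical mapping: issue label -> boolean mask over the rows.
checks = {
    "null_order_id": toy["Order ID"].isnull(),
    "null_currency": toy["currency"].isnull(),
    "negative_qty": toy["Qty"] < 0,
}

# One boolean column per check; join the names of the failed checks per row.
flags = pd.DataFrame(checks)
issues = flags.apply(lambda row: "; ".join(row[row].index), axis=1)

# Keep only rows that fail at least one check; assign aligns on the index.
toy_bad = toy[flags.any(axis=1)].assign(issue=issues)
print(toy_bad)
```

Adding a new check then only requires a new dictionary entry, and the labels carry no trailing separator.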

## 4. Fix Data Issues

| Fix | Column | Action | Reason |
|-----|--------|--------|--------|
| 1 | `currency` | Fill NaN → `"INR"` | All valid rows use INR |
| 2 | `Amount` | Fill NaN → `0.0` | Cancelled orders have no amount |
| 3 | `ship-country` | Fill NaN → `"IN"` | All valid rows use IN |

Rows with a null `Order ID` or a negative `Qty` are exported in step 3 but are not modified by these fixes.

In [None]:
# Apply fixes
df["currency"]     = df["currency"].fillna("INR")
df["Amount"]       = df["Amount"].fillna(0.0)
df["ship-country"] = df["ship-country"].fillna("IN")

print("✅ All fixes applied!")
print()
print("Remaining nulls in key columns:")
remaining = df[key_cols].isnull().sum()
remaining = remaining[remaining > 0]
print("  None! ✅" if remaining.empty else remaining)
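
The fills for `currency` and `ship-country` rest on the premise that every non-null value is already the default. A minimal sketch of a guard that checks this premise before filling might look like the following; `safe_fill` is a hypothetical helper and the `toy` frame uses made-up values.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
toy = pd.DataFrame(
    {
        "currency": ["INR", None, "INR"],
        "ship-country": ["IN", "IN", None],
    }
)


def safe_fill(frame, column, expected):
    """Fill NaN in `column` with `expected`, but only if every
    non-null value in that column already equals `expected`."""
    observed = set(frame[column].dropna().unique())
    unexpected = observed - {expected}
    if unexpected:
        raise ValueError(f"{column!r} has unexpected values: {sorted(unexpected)}")
    return frame[column].fillna(expected)


toy["currency"] = safe_fill(toy, "currency", "INR")
toy["ship-country"] = safe_fill(toy, "ship-country", "IN")
print(toy)
```

If a second currency or country ever appears, the guard raises instead of silently labelling unknown rows with the default.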

## 5. Save Cleaned Data

In [None]:
df.to_csv(CSV_PATH, index=False)
print(f"✅ Cleaned data saved → {CSV_PATH}")
print(f"   {len(df):,} rows  |  {len(df.columns)} columns")

## 6. Re-run Validation Pipeline

Reload the modules to pick up any code changes, then validate the cleaned data.

In [None]:
# Force reload modules (picks up code changes without kernel restart)
import src.ge_validation as _ge
import src.pydantic_validation as _py
importlib.reload(_ge)
importlib.reload(_py)

# Re-read the cleaned CSV
df_clean = pd.read_csv(CSV_PATH, low_memory=False)

print("=" * 60)
print("   RE-RUNNING VALIDATION ON CLEANED DATA")
print("=" * 60)

ge_summary = _ge.run_ge_validation(df_clean)
pydantic_summary = _py.run_pydantic_validation(df_clean)

all_ok = ge_summary["overall_success"] and pydantic_summary["overall_success"]

print("\n" + "=" * 60)
print("   FINAL RESULT")
print("=" * 60)
print(f"   GE Validation       : {'✅' if ge_summary['overall_success'] else '❌'}")
print(f"   Pydantic Validation : {'✅' if pydantic_summary['overall_success'] else '❌'}")
print(f"   Overall             : {'✅ ALL PASSED' if all_ok else '❌ ISSUES FOUND'}")
print("=" * 60)
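
To illustrate what `importlib.reload` does in the cell above, here is a small self-contained sketch using a throwaway module written to a temporary directory. The module name `toy_rules` and its `THRESHOLD` constant are hypothetical and unrelated to the `src` package.

```python
import importlib
import pathlib
import sys
import tempfile

# Avoid cached bytecode so the edited source is always re-read.
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) into a temp directory.
tmp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(tmp_dir) / "toy_rules.py"
module_path.write_text("THRESHOLD = 1\n")

sys.path.insert(0, tmp_dir)
import toy_rules

before = toy_rules.THRESHOLD

# Simulate editing the source file while the kernel keeps running.
module_path.write_text("THRESHOLD = 22\n")
importlib.invalidate_caches()
importlib.reload(toy_rules)

after = toy_rules.THRESHOLD
print(before, after)
```

Names imported earlier with `from module import name` keep pointing at the old objects after a reload, which is presumably why the cell calls the functions through the reloaded module objects `_ge` and `_py`.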

## 7. Done!

Now you can run `python dq_pipeline.py` from the terminal, and it should pass ✅