---

# 📊 Pandas\_ETL — Data Validation in ETL

### 🎯 Intent

Use **Pydantic v2 + Pandas** to validate ETL pipeline data, ensuring rows are clean, typed, and consistent before analytics or storage.

---

### 🧩 Core Components

1. **🧱 Row Models**

   * Define schema with `BaseModel` + field constraints.
   * Example fields: `id: int`, `name: str`, `amount: float`.

2. **📥 Validate Rows**

   * Convert DataFrame rows to dicts.
   * Validate with `.model_validate()`.
   * Catch `ValidationError` → log/drop.

3. **⚡ Batch Validation**

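   * Use `TypeAdapter(list[Model])` to validate a whole list of row dicts in one call (not vectorized: each row is still checked individually).
   * Typically faster than calling `.model_validate()` per row in a Python loop, since the iteration happens inside pydantic-core.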

4. **🛡️ Cleaning**

   * Drop invalid rows.
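   * Fill missing values via field defaults (a default applies only when the key is absent, so strip `NaN`/`None` cells from the row dict first).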
   * Save errors to logs/quarantine table.

5. **📦 Integration**

   * Run validation step **before transformations**.
   * Build new DataFrame from validated rows.

6. **🧪 Common Constraints**

   * Ranges (`ge`, `le`), string length (`min_length`, `max_length`), regex patterns (`pattern`).
   * Enums (`Literal`, `Enum`).
   * Dates (`datetime`, `PastDate`).

7. **🔗 Schema Reuse**

   * Same Pydantic model can validate both **API inputs** and **ETL rows**.

8. **📊 Error Auditing**

   * Collect row index + error message (e.g. from `ValidationError.errors()`).
   * Store invalid data separately for review.
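
The row-level steps above (row model, per-row validation, dropping invalid rows, error auditing) can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `OrderRow`, `validate_rows`, the specific constraints, and the sample data are all hypothetical, chosen only to match the example fields `id`, `name`, and `amount`.

```python
import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class OrderRow(BaseModel):
    """Hypothetical row schema built from the example fields."""

    id: int = Field(ge=1)
    name: str = Field(min_length=1, max_length=50)
    amount: float = Field(ge=0)


def validate_rows(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Validate row by row; return (clean, quarantine) DataFrames."""
    valid, errors = [], []
    for idx, record in zip(df.index, df.to_dict(orient="records")):
        try:
            valid.append(OrderRow.model_validate(record).model_dump())
        except ValidationError as exc:
            # Keep the row index, the raw row, and a readable message for review.
            errors.append({"index": idx, "row": record, "error": str(exc)})
    return pd.DataFrame(valid), pd.DataFrame(errors)


# Illustrative data: row 1 has an empty name, row 2 a negative amount.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["alice", "", "carol"],
        "amount": [10.5, 3.0, -1.0],
    }
)
clean, quarantine = validate_rows(raw)
```

Here `clean` feeds the transformation step, while `quarantine` could be written to a log or a separate table for later inspection.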

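For the batch path, one possible sketch uses `TypeAdapter` to validate all rows in a single call, strips `NaN` cells so field defaults can apply, and rebuilds a DataFrame from the rows that passed. The names `OrderRow`, `to_records`, and `validate_batch`, the default of `0.0`, and the sample data are assumptions made for illustration.

```python
import pandas as pd
from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class OrderRow(BaseModel):
    """Hypothetical row schema; the default is used only when the key is absent."""

    id: int
    name: str
    amount: float = Field(default=0.0, ge=0)


rows_adapter = TypeAdapter(list[OrderRow])


def to_records(df: pd.DataFrame) -> list[dict]:
    """Convert rows to dicts, dropping NaN cells so field defaults can apply."""
    return [
        {key: value for key, value in record.items() if pd.notna(value)}
        for record in df.to_dict(orient="records")
    ]


def validate_batch(df: pd.DataFrame) -> tuple[pd.DataFrame, list[dict]]:
    """Validate all rows in one call; return (clean DataFrame, issues)."""
    records = to_records(df)
    issues: list[dict] = []
    try:
        models = rows_adapter.validate_python(records)
    except ValidationError as exc:
        # For a list, the first element of "loc" is the position of the bad row.
        issues = [
            {"position": err["loc"][0], "field": err["loc"][1:], "msg": err["msg"]}
            for err in exc.errors()
        ]
        bad = {issue["position"] for issue in issues}
        good = [rec for pos, rec in enumerate(records) if pos not in bad]
        models = rows_adapter.validate_python(good)
    return pd.DataFrame([m.model_dump() for m in models]), issues


# Illustrative data: row 1 has a missing amount, row 2 a negative one.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["alice", "bob", "carol"],
        "amount": [10.5, float("nan"), -2.0],
    }
)
clean, issues = validate_batch(raw)
```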
---
