
---

# 🧠 ML/ETL Config with Pydantic

### 🎯 Intent

Use **Pydantic v2** to define configs & schemas for ML/ETL pipelines → reproducible runs, clean inputs, early failures.

---

### 🧩 Core Components

1. **⚙️ Settings Layer**

   * Centralize env configs with `BaseSettings` (in Pydantic v2 it lives in the separate `pydantic-settings` package).
   * Typical: `DATA_DIR: DirectoryPath`, `SEED: int`, `TRACKING_URL: AnyUrl`.
   * Load via `.env`, env vars, `secrets_dir`.

2. **🧱 Pipeline Config Models**

   * Split configs per stage: `ExtractConfig`, `TransformConfig`, `TrainConfig`, `PredictConfig`.
   * Nest inside `AppConfig`.
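One possible layout for the per-stage models, as a sketch. The stage class names come from the list above; every field (`source_uri`, `batch_size`, `drop_nulls`, and so on) is an invented placeholder.

```python
from pydantic import BaseModel, Field


class ExtractConfig(BaseModel):
    source_uri: str
    batch_size: int = Field(default=10_000, ge=1)


class TransformConfig(BaseModel):
    drop_nulls: bool = True


class TrainConfig(BaseModel):
    seed: int = 42
    epochs: int = Field(default=10, ge=1)


class PredictConfig(BaseModel):
    threshold: float = Field(default=0.5, ge=0, le=1)


class AppConfig(BaseModel):
    # Only the extract stage is required here; the others fall back to defaults.
    extract: ExtractConfig
    transform: TransformConfig = Field(default_factory=TransformConfig)
    train: TrainConfig = Field(default_factory=TrainConfig)
    predict: PredictConfig = Field(default_factory=PredictConfig)


# Nested dicts (e.g. parsed from YAML or JSON) validate into nested models.
config = AppConfig.model_validate(
    {"extract": {"source_uri": "s3://example-bucket/raw/"}, "train": {"epochs": 20}}
)
```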

3. **📊 Dataset Row Schema**

   * Define row-level `BaseModel` with type/range/regex checks.
   * Validate batches via `TypeAdapter(list[RowModel])`.
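A sketch of row-level validation with a reusable adapter. The `RowModel` fields and the `validate_batch` helper are made up for illustration.

```python
from pydantic import BaseModel, Field, TypeAdapter


class RowModel(BaseModel):
    user_id: int = Field(ge=1)                    # type + range check
    country: str = Field(pattern=r"^[A-Z]{2}$")   # regex check
    age: int = Field(ge=0, le=120)


# Build the adapter once at import time and reuse it for every batch.
ROWS = TypeAdapter(list[RowModel])


def validate_batch(records: list[dict]) -> list[RowModel]:
    # Raises pydantic.ValidationError listing every bad row and field.
    return ROWS.validate_python(records)
```

Each entry in the raised error carries a `loc` tuple such as `(1, "country")`, so the failing row index and column can be logged directly.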

4. **🏷️ Column Contracts**

   * Use `Enum` / `Literal` for feature names (avoid typos).
   * Map raw → canonical via `Field(alias="raw_col")`.
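A sketch of a column contract. The raw column names (`CUST_ID`, `Annual Income ($)`) and the feature list are hypothetical.

```python
from typing import Literal

from pydantic import BaseModel, Field

# Closed set of feature names: a typo fails validation instead of
# silently selecting a column that does not exist.
FeatureName = Literal["age", "income", "tenure_months"]


class FeatureSelection(BaseModel):
    features: list[FeatureName]


class FeatureRow(BaseModel):
    # Input uses the raw column names; attributes use canonical names.
    customer_id: int = Field(alias="CUST_ID")
    income: float = Field(alias="Annual Income ($)")


row = FeatureRow.model_validate({"CUST_ID": 7, "Annual Income ($)": 52000})
canonical = row.model_dump()  # keys are the canonical field names
```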

5. **🔢 Hyperparams as Types**

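   * `Annotated[float, Field(ge=0, le=1)]` → learning rate (v1-style `confloat` still works in v2 but is discouraged).
   * `Annotated[int, Field(ge=1)]` → depth (same for `conint`).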
   * Discriminated unions for algo-specific configs (`algo="xgb" | "rf"`).
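A sketch of typed hyperparameters, written with `Annotated` and `Field`, the form the Pydantic v2 docs recommend over `confloat`/`conint`. The model names and default values are assumptions.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field

LearningRate = Annotated[float, Field(ge=0, le=1)]
Depth = Annotated[int, Field(ge=1)]


class XGBConfig(BaseModel):
    algo: Literal["xgb"] = "xgb"
    learning_rate: LearningRate = 0.1
    max_depth: Depth = 6


class RFConfig(BaseModel):
    algo: Literal["rf"] = "rf"
    n_estimators: Annotated[int, Field(ge=1)] = 100
    max_depth: Depth = 10


class TrainConfig(BaseModel):
    # The "algo" tag selects which model validates the payload,
    # so errors mention only the chosen variant.
    estimator: Union[XGBConfig, RFConfig] = Field(discriminator="algo")


cfg = TrainConfig.model_validate({"estimator": {"algo": "rf", "n_estimators": 50}})
```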

6. **📁 Paths & I/O**

   * Use `Path`, `FilePath`, `DirectoryPath`, `AnyUrl`.
   * Optional `@field_validator` to check existence.
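A sketch of an explicit existence check on a plain `Path`. `IOConfig` and its fields are illustrative; `FilePath` and `DirectoryPath` perform comparable checks without custom code.

```python
from pathlib import Path

from pydantic import BaseModel, field_validator


class IOConfig(BaseModel):
    input_file: Path
    output_dir: Path  # may not exist yet, so it is left unchecked

    @field_validator("input_file")
    @classmethod
    def input_must_exist(cls, value: Path) -> Path:
        # Fail at config-load time rather than midway through the job.
        if not value.is_file():
            raise ValueError(f"input file not found: {value}")
        return value
```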

7. **🧪 Data Quality Guards**

   * `@model_validator(mode="after")`: check `start <= end`, no nulls, unique keys.
   * Raise `PydanticCustomError` (from `pydantic_core`) with a stable, machine-readable error type.
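A sketch of cross-field guards. The `ExtractWindow` model and the error type strings (`window_order`, `duplicate_keys`) are invented for the example.

```python
from datetime import date

from pydantic import BaseModel, model_validator
from pydantic_core import PydanticCustomError


class ExtractWindow(BaseModel):
    start: date
    end: date
    keys: list[str]

    @model_validator(mode="after")
    def check_window_and_keys(self) -> "ExtractWindow":
        # Runs after field validation, so start/end are already dates.
        if self.start > self.end:
            raise PydanticCustomError(
                "window_order",
                "start {start} is after end {end}",
                {"start": self.start.isoformat(), "end": self.end.isoformat()},
            )
        if len(set(self.keys)) != len(self.keys):
            raise PydanticCustomError("duplicate_keys", "keys must be unique")
        return self
```

The first argument surfaces as the `type` of the error entry, which lets downstream alerting match on a code instead of parsing messages.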

8. **🚀 Batch & Stream Validation**

   * Batch → build one `TypeAdapter` up front, then one `validate_python` call per chunk (fast).
   * Stream → same schema for Kafka/SQS messages.
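A sketch showing one schema serving both paths. `Event` and the two helper functions are hypothetical; the message bytes stand in for a Kafka or SQS payload.

```python
from pydantic import BaseModel, TypeAdapter


class Event(BaseModel):
    event_id: int
    value: float


# Adapters are built once and shared by all callers.
EVENT = TypeAdapter(Event)
EVENTS = TypeAdapter(list[Event])


def validate_chunk(records: list[dict]) -> list[Event]:
    # Batch path: a single call validates the whole chunk.
    return EVENTS.validate_python(records)


def validate_message(payload: bytes) -> Event:
    # Stream path: parse and validate the raw JSON body in one step.
    return EVENT.validate_json(payload)
```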

9. **📤 Reproducibility**

   * Save config dump (`model_dump_json`) + compute hash for lineage.
   * Store with metrics/artifacts.
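A sketch of a config fingerprint. `RunConfig` and `config_fingerprint` are illustrative names; sorting keys before hashing is an added assumption so that dict-valued fields hash the same regardless of insertion order.

```python
import hashlib
import json

from pydantic import BaseModel


class RunConfig(BaseModel):
    seed: int = 42
    learning_rate: float = 0.1
    features: list[str] = ["age", "income"]


def config_fingerprint(config: BaseModel) -> str:
    # mode="json" converts dates, paths, etc. to JSON-safe values first.
    canonical = json.dumps(
        config.model_dump(mode="json"), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run = RunConfig()
artifact = {"config": run.model_dump_json(), "config_sha256": config_fingerprint(run)}
```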

10. **🧰 Schema Docs**

    * Publish `model_json_schema(by_alias=True)` for team contracts.
    * Snapshot-test schema for stability.
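A sketch of producing a stable schema snapshot. The `RowModel` fields, aliases, and the `schema_snapshot` helper are assumptions; a real snapshot test would compare the string against a file checked into the repo.

```python
import json

from pydantic import BaseModel, Field


class RowModel(BaseModel):
    customer_id: int = Field(alias="CUST_ID")
    income: float = Field(alias="INCOME", ge=0)


def schema_snapshot(model: type[BaseModel]) -> str:
    # by_alias=True documents the raw column names producers must send.
    schema = model.model_json_schema(by_alias=True)
    # Sorted keys keep the text stable so diffs show only real changes.
    return json.dumps(schema, indent=2, sort_keys=True)
```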

11. **🛡️ PII & Secrets**

    * Use `SecretStr` + `@field_serializer` to mask.
    * Reference sensitive data by ID, not raw values.
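A sketch of masking on output. `UserRecord` and the masking rule for `email` are invented; `SecretStr` hides its value in `repr` and serialized output on its own, while the custom serializer covers a plain string field.

```python
from pydantic import BaseModel, SecretStr, field_serializer


class UserRecord(BaseModel):
    user_id: int
    email: str
    api_token: SecretStr

    @field_serializer("email")
    def mask_email(self, value: str) -> str:
        # Keep the first character and the domain, hide the rest.
        local, _, domain = value.partition("@")
        return f"{local[:1]}***@{domain}"


record = UserRecord(user_id=1, email="alice@example.com", api_token="tok-123")
dumped = record.model_dump_json()  # safe to log: no raw token, masked email
```

The raw token stays reachable through `record.api_token.get_secret_value()` for the code path that genuinely needs it.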

12. **⚡ Performance**

    * Batch > row-by-row validation.
    * Keep models flat; avoid deep nesting.
    * Use discriminators instead of wide unions.

13. **🧪 Testing**

    * Unit-test configs (valid/invalid).
    * Snapshot dumps + schema.
    * Seed values for reproducibility.
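A sketch of valid/invalid config tests written as plain functions that a runner such as pytest could collect. The `TrainConfig` fields and test names are hypothetical.

```python
from pydantic import BaseModel, Field, ValidationError


class TrainConfig(BaseModel):
    seed: int = 42
    epochs: int = Field(default=10, ge=1)


def test_defaults_are_pinned() -> None:
    # A pinned dump catches accidental changes to defaults (incl. the seed).
    assert TrainConfig().model_dump() == {"seed": 42, "epochs": 10}


def test_invalid_epochs_rejected() -> None:
    try:
        TrainConfig(epochs=0)
    except ValidationError as exc:
        assert exc.errors()[0]["loc"] == ("epochs",)
    else:
        raise AssertionError("epochs=0 should not validate")
```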

14. **🧭 Orchestration-Friendly**

    * Keep configs JSON/YAML serializable.
    * Pass typed configs through Airflow, Prefect, cron jobs.
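A sketch of handing a config between tasks as a plain payload. `JobConfig` and the two helpers are illustrative and use no orchestrator-specific API; the dict is the kind of value a scheduler can store or pass as a parameter.

```python
from datetime import date
from pathlib import Path

from pydantic import BaseModel


class JobConfig(BaseModel):
    run_date: date
    output_dir: Path
    seed: int = 42


def to_task_payload(config: JobConfig) -> dict:
    # mode="json" turns date/Path into strings, so the dict is JSON-safe.
    return config.model_dump(mode="json")


def from_task_payload(payload: dict) -> JobConfig:
    # The receiving task re-validates and gets typed fields back.
    return JobConfig.model_validate(payload)
```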

---
