
# Pandera & Pydantic — From Data Validation to Production Contracts

**Day 11/25 — #25DaysOfDataTech**  
By Prerna Joshi

This notebook covers:
- Why validation matters in production data systems
- Pandera for DataFrame-level validation
- Pydantic for API, config, and data contracts
- Using Pandera + Pydantic together
- A clear comparison table (interview-ready)

---



## 1. Why Data Validation Matters

Many real-world data failures are *silent*: nothing crashes, but the data is wrong.
- Wrong data types
- Missing values
- Unexpected categories
- Out-of-range numbers

These issues:
- Break dashboards
- Corrupt features
- Degrade ML models
- Cause late-night production incidents

**Solution:** Treat data like code — validate it with explicit contracts.
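
To make "silent" concrete, here is a minimal sketch that uses only pandas. The column names and values are made up for illustration. The point is that nothing raises, even though one age is negative and one country code is malformed.

```python
import pandas as pd

# Hypothetical raw batch: one impossible age and one badly formatted country code.
raw = pd.DataFrame({
    "age": [25, 42, -7],
    "country": ["US", "IN", "us "],
})

# No exception is raised; the bad values simply flow into the aggregates.
mean_age = raw["age"].mean()
country_counts = raw["country"].value_counts()

print(mean_age)
print(country_counts)
```

The negative age drags the mean down and `"us "` is counted as its own category, yet the code runs cleanly. An explicit contract is meant to catch exactly this kind of problem.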



## 2. Pandera — Data Validation for Pandas

Pandera lets you define schemas for Pandas DataFrames.

It answers:
> "What should *valid data* look like?"

Pandera is best used **inside data pipelines**:
- ETL
- Feature engineering
- Batch validation
- Model training inputs


In [1]:

# Install (run once)
# !pip install pandera pandas


In [2]:

import pandas as pd
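# Recent pandera releases recommend the pandas-specific import path.
# On older releases without `pandera.pandas`, use `import pandera as pa`
# and `from pandera import Column, Check` instead.
import pandera.pandas as pa
from pandera.pandas import Column, Check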


In [3]:

# Sample dataset
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 42, 17],
    "country": ["US", "IN", "US"],
    "salary": [70000, 120000, 30000]
})

df


|   | user_id | age | country | salary |
|---|---------|-----|---------|--------|
| 0 | 1 | 25 | US | 70000 |
| 1 | 2 | 42 | IN | 120000 |
| 2 | 3 | 17 | US | 30000 |


In [4]:

# Define Pandera schema
schema = pa.DataFrameSchema({
    "user_id": Column(int, nullable=False),
    "age": Column(int, Check.between(18, 65)),
    "country": Column(str, Check.isin(["US", "IN", "UK"])),
    "salary": Column(int, Check.gt(0))
})


Note: recent pandera releases emit a deprecation warning when pandas-specific classes are imported from the top-level `pandera` module, because that import path will be removed in a future version of pandera. For pandas objects, the recommended import is `import pandera.pandas as pa`. For other supported libraries such as pyspark or polars, see https://pandera.readthedocs.io/en/stable/supported_libraries.html.



In [5]:
# This cell intentionally demonstrates a validation failure.
# Pandera enforces the schema as a strict data contract: when a value such as
# age = 17 violates the business rule (18 <= age <= 65), validation raises and
# the pipeline stops immediately. That keeps bad data from silently reaching
# feature engineering, models, or dashboards.

try:
    validated_df = schema.validate(df)
    print(validated_df)

except pa.errors.SchemaError as e:
    print("Data validation failed as expected")
    print("\nReason:")
    print(e)

    print("\nFailure cases (row-level details):")
    print(e.failure_cases)



Data validation failed as expected

Reason:
Column 'age' failed element-wise validator number 0: in_range(18, 65) failure cases: 17

Failure cases (row-level details):
   index  failure_case
0      2            17



### What Pandera Gives You
- Clear schema definitions
- Automatic validation errors
- Fail-fast behavior
- Reproducible, testable pipelines



## 3. Pydantic — Validation for APIs & Configs

Pydantic is used to validate *structured inputs*:
- API requests / responses
- Config files
- Environment variables
- JSON payloads

**FastAPI** builds its request and response validation on Pydantic.
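
As a rough illustration of the config and JSON use cases, the sketch below parses a JSON string straight into a model. It assumes Pydantic v2 (where `model_validate_json` is available). The `PipelineConfig` model, its fields, and the file path are hypothetical.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class PipelineConfig(BaseModel):
    """Hypothetical pipeline config, used only for this example."""
    source_path: str
    batch_size: int = Field(default=1000, gt=0)
    mode: Literal["full", "incremental"] = "incremental"


# Parse and validate a JSON payload in one step (Pydantic v2 API).
raw_json = '{"source_path": "data/users.csv", "batch_size": 500}'
config = PipelineConfig.model_validate_json(raw_json)
print(config)

# A batch size of 0 violates the `gt=0` constraint and is rejected.
try:
    PipelineConfig.model_validate_json(
        '{"source_path": "data/users.csv", "batch_size": 0}'
    )
    config_rejected = False
except ValidationError:
    config_rejected = True
print(config_rejected)
```

Fields missing from the payload fall back to their declared defaults, so the model also documents what the config is allowed to contain.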


In [6]:

# Install (run once)
# !pip install pydantic


In [7]:

from pydantic import BaseModel, Field, ValidationError
from typing import Literal


In [8]:

class UserRequest(BaseModel):
    user_id: int
    age: int = Field(gt=17, lt=66)
    country: Literal["US", "IN", "UK"]
    salary: int = Field(gt=0)


In [9]:

# Valid input
user = UserRequest(
    user_id=1,
    age=30,
    country="US",
    salary=90000
)

user


UserRequest(user_id=1, age=30, country='US', salary=90000)

In [10]:

# Invalid input example
try:
    bad_user = UserRequest(
        user_id="abc",
        age=10,
        country="FR",
        salary=-500
    )
except ValidationError as e:
    print(e)


4 validation errors for UserRequest
user_id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='abc', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/int_parsing
age
  Input should be greater than 17 [type=greater_than, input_value=10, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/greater_than
country
  Input should be 'US', 'IN' or 'UK' [type=literal_error, input_value='FR', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/literal_error
salary
  Input should be greater than 0 [type=greater_than, input_value=-500, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/greater_than



### What Pydantic Gives You
- Automatic type coercion
- Detailed error messages
- Safe API boundaries
- Self-documenting data models
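
The first two points can be seen in a short sketch. The `Signup` model is a hypothetical, trimmed-down stand-in for `UserRequest`. Numeric strings are coerced to `int` in Pydantic's default (lax) mode, and `ValidationError.errors()` returns the failures as a list of dictionaries rather than only a formatted message.

```python
from pydantic import BaseModel, Field, ValidationError


class Signup(BaseModel):
    """Hypothetical model, used only for this example."""
    user_id: int
    age: int = Field(gt=17, lt=66)


# Numeric strings are coerced to integers in the default (lax) mode.
coerced = Signup(user_id="42", age="30")
print(coerced.user_id, type(coerced.user_id))

# Errors are also available as structured data, one dict per failing field.
error_fields = []
try:
    Signup(user_id="abc", age=10)
except ValidationError as exc:
    error_fields = [err["loc"][0] for err in exc.errors()]
    for err in exc.errors():
        print(err["loc"], err["type"], err["msg"])
```

The structured form is handy when an API needs to return field-level errors to a client instead of a raw traceback.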



## 4. Using Pandera + Pydantic Together

In real production systems:

**Pydantic**
- Validates inputs at system boundaries
- APIs, configs, services

**Pandera**
- Validates tabular data inside pipelines
- DataFrames, features, batches

They solve *different but complementary* problems.


In [11]:

# Example: Pydantic validated input -> Pandera validated DataFrame

request = UserRequest(
    user_id=10,
    age=40,
    country="IN",
    salary=80000
)

df_from_api = pd.DataFrame([request.model_dump()])

schema.validate(df_from_api)


|   | user_id | age | country | salary |
|---|---------|-----|---------|--------|
| 0 | 10 | 40 | IN | 80000 |



## 5. Pandera vs Pydantic — Comparison Table



| Aspect | Pandera | Pydantic |
|------|--------|---------|
| Primary Use | DataFrame validation | API & object validation |
| Works With | Pandas DataFrames | Python objects / JSON |
| Common Usage | ETL, features, ML pipelines | FastAPI, configs, services |
| Validation Level | Column & dataset level | Field & object level |
| Error Style | Schema validation errors | Structured validation errors |
| Production Role | Guards internal data flow | Guards system boundaries |
| Learning Curve | Medium | Easy–Medium |
| Best Paired With | Pandas, ML pipelines | FastAPI, backend services |



## 6. Interview-Ready Summary

- Pandera prevents **bad data from entering pipelines**
- Pydantic prevents **bad inputs from entering systems**
- Together, they check data at the system boundary and again inside the pipeline
- Validation is not overhead — it's production insurance


