# 01: Python Type System & Validation

**Focus**: Type hints, Pydantic models, and fail-fast validation for data engineering

---

## 🎯 The Problem

### Data Pipeline Configuration Without Types

Imagine configuring a data pipeline with plain dictionaries:

In [2]:
# What could go wrong?
config = {
    "connection": "azure_blob",
    "account_name": "myaccount",
    "container": "data",
    "port": "1433",  # Should be int!
    "retries": -5,    # Negative retries?
    # Missing required fields?
}

# Errors discovered HOURS into execution 💥
print(config)

{'connection': 'azure_blob', 'account_name': 'myaccount', 'container': 'data', 'port': '1433', 'retries': -5}


**Problems**:
1. ❌ No validation until a value is actually used, which can be hours into a run
2. ❌ Type errors caught late
3. ❌ Missing required fields silently ignored
4. ❌ No IDE autocomplete
5. ❌ Hard to document

**Solution**: Type hints + Pydantic = Fail fast, self-document, validate early
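
As a rough sketch of that idea, the dictionary above can be passed through a Pydantic model. The `BlobConfig` class and its field set are assumptions made for illustration (they are not taken from Odibi); the point is only that bad values are rejected when the config is loaded.

```python
from pydantic import BaseModel, Field, ValidationError

class BlobConfig(BaseModel):
    """Hypothetical typed version of the dictionary above."""
    connection: str
    account_name: str
    container: str
    port: int                              # the string "1433" is coerced to 1433
    retries: int = Field(default=3, ge=0)  # negative values are rejected

raw = {
    "connection": "azure_blob",
    "account_name": "myaccount",
    "container": "data",
    "port": "1433",
    "retries": -5,
}

try:
    BlobConfig(**raw)
except ValidationError as e:
    # One entry per failing field; here only "retries" fails.
    failed_fields = [err["loc"][0] for err in e.errors()]
    print(f"Rejected at load time: {failed_fields}")

# With the bad value fixed, the string port is parsed into an int.
fixed = BlobConfig(**{**raw, "retries": 3})
print(type(fixed.port).__name__, fixed.port)
```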

---

## 🦉 First Principles

### Static vs Dynamic Typing

**Dynamic Typing** (Python default):
```python
x = 5       # x is int
x = "hello" # now x is str
```

**Static Typing** (Java, C++):
```java
int x = 5;
x = "hello"; // Compile error!
```

**Gradual Typing** (Python 3.5+; variable annotations like the one below need 3.6+):
```python
x: int = 5       # Type hint
x = "hello"      # Runs fine, but mypy catches it!
```

### Type Hints: Documentation + Tooling

Type hints are:
- **Optional** (Python doesn't enforce them)
- **For tools** (mypy, IDEs, linters)
- **Self-documenting**
- **Not enforced at runtime** (the interpreter stores them but never checks them)

Pydantic adds:
- **Runtime validation**
- **Data parsing**
- **Automatic coercion**
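
A small sketch of the difference, assuming Pydantic v2 (the `double` function and `Port` model are made up for illustration): the interpreter records a hint but never checks it, while a Pydantic model parses and validates the value.

```python
from pydantic import BaseModel, ValidationError

def double(x: int) -> int:
    return x * 2

# The hint is stored on the function, but nothing enforces it.
print(double.__annotations__)
print(double("ab"))  # runs without error: str * 2 repeats the string

class Port(BaseModel):
    value: int

print(Port(value="8080").value)  # parsed and coerced to an int

try:
    Port(value="not-a-number")
except ValidationError as e:
    print(f"Rejected with {e.error_count()} error")
```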

---

## ⚡ Minimal Examples

### 1. Primitives

In [3]:
# Basic type hints
age: int = 30
name: str = "Alice"
is_active: bool = True
temperature: float = 98.6

def greet(name: str, age: int) -> str:
    return f"Hello {name}, you are {age} years old"

print(greet("Bob", 25))
# mypy would catch: greet(123, "invalid")  # Wrong types!

Hello Bob, you are 25 years old


### 2. Collections

In [4]:
from typing import Any, Dict, List, Set, Tuple

# List of integers
numbers: List[int] = [1, 2, 3, 4]

# Dictionary: string keys, any values (typing.Any, not the built-in any() function)
config: Dict[str, Any] = {"host": "localhost", "port": 8080}

# Set of strings
tags: Set[str] = {"python", "data", "engineering"}

# Tuple with specific types
coordinate: Tuple[float, float] = (40.7128, -74.0060)

print(f"Numbers: {numbers}")
print(f"Config: {config}")
print(f"Coordinate: {coordinate}")

Numbers: [1, 2, 3, 4]
Config: {'host': 'localhost', 'port': 8080}
Coordinate: (40.7128, -74.006)
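
On Python 3.9 and later, the built-in collection types can be subscripted directly, so the capitalized `typing` aliases are optional. The same hints, sketched with built-in generics (the `row` variable is an extra added for illustration):

```python
from typing import Any

numbers: list[int] = [1, 2, 3, 4]
config: dict[str, Any] = {"host": "localhost", "port": 8080}
tags: set[str] = {"python", "data", "engineering"}
coordinate: tuple[float, float] = (40.7128, -74.0060)

# A variable-length tuple of a single type uses an ellipsis.
row: tuple[int, ...] = (1, 2, 3)

print(numbers, config, coordinate, row)
```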


### 3. Optional and Union

In [5]:
from typing import Any, Dict, Optional, Union

# Optional[T] means T or None
middle_name: Optional[str] = None  # Could be str or None

# Union means one of several types
identifier: Union[int, str] = "USER123"  # Could be int OR str

def process_data(data: str, max_rows: Optional[int] = None) -> Dict[str, Any]:
    """Process data with optional row limit."""
    result = {"data": data}
    if max_rows is not None:
        result["limit"] = max_rows
    return result

print(process_data("sample"))
print(process_data("sample", max_rows=100))

{'data': 'sample'}
{'data': 'sample', 'limit': 100}


### 4. Literal (Constrained Values)

In [6]:
from typing import Literal

# Only these exact values allowed
Environment = Literal["dev", "staging", "prod"]

def deploy(env: Environment) -> None:
    print(f"Deploying to {env}")

deploy("prod")  # ✅ OK
# deploy("production")  # ❌ mypy error - not in Literal

Deploying to prod
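
Note that `Literal` on a plain function is only checked by static tools; `deploy("production")` would still run. A minimal sketch of getting the same check at runtime by using the alias as a Pydantic field type (the `DeployConfig` model is hypothetical):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

Environment = Literal["dev", "staging", "prod"]

class DeployConfig(BaseModel):
    env: Environment

print(DeployConfig(env="prod").env)

try:
    DeployConfig(env="production")
except ValidationError as e:
    # Pydantic v2 reports a value outside the Literal as "literal_error".
    error_type = e.errors()[0]["type"]
    print(error_type)
```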


### 5. Custom Types with Pydantic

In [7]:
from pydantic import BaseModel, Field

class User(BaseModel):
    """User model with validation."""
    name: str
    age: int = Field(gt=0, lt=120)  # Constraints!
    email: Optional[str] = None

# ✅ Valid user
user = User(name="Alice", age=30, email="alice@example.com")
print(user)
print(user.model_dump())  # Convert to dict

name='Alice' age=30 email='alice@example.com'
{'name': 'Alice', 'age': 30, 'email': 'alice@example.com'}


In [8]:
# ❌ Validation errors caught immediately
try:
    invalid_user = User(name="Bob", age=-5)  # Negative age!
except Exception as e:
    print(f"Validation error: {e}")

Validation error: 1 validation error for User
age
  Input should be greater than 0 [type=greater_than, input_value=-5, input_type=int]
    For further information visit https://errors.pydantic.dev/2.6/v/greater_than


---

## 🔍 Odibi Analysis

Let's examine real production code from the Odibi framework.

### Example 1: Enum for Type Safety

In [9]:
# From odibi_snippets.py
from enum import Enum

class ConnectionType(str, Enum):
    """Supported connection types."""
    LOCAL = "local"
    AZURE_BLOB = "azure_blob"
    DELTA = "delta"
    SQL_SERVER = "sql_server"

# Why Enum?
# 1. Autocomplete in IDE
# 2. Typo protection
# 3. Clear documentation of valid values

print(ConnectionType.AZURE_BLOB)
print(ConnectionType.AZURE_BLOB.value)
print(list(ConnectionType))  # All valid options

ConnectionType.AZURE_BLOB
azure_blob
[<ConnectionType.LOCAL: 'local'>, <ConnectionType.AZURE_BLOB: 'azure_blob'>, <ConnectionType.DELTA: 'delta'>, <ConnectionType.SQL_SERVER: 'sql_server'>]
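
Two more properties of a `str`-based `Enum` are useful when configs arrive as raw strings (from YAML or JSON, say). A short sketch, repeating the class so the block is self-contained:

```python
from enum import Enum

class ConnectionType(str, Enum):
    LOCAL = "local"
    AZURE_BLOB = "azure_blob"
    DELTA = "delta"
    SQL_SERVER = "sql_server"

# Look up a member from its raw string value.
conn = ConnectionType("azure_blob")
print(conn is ConnectionType.AZURE_BLOB)

# Because the class also inherits from str, members compare equal to plain strings.
print(conn == "azure_blob")

# Unknown values fail immediately instead of propagating a typo.
try:
    ConnectionType("azure-blob")
except ValueError as e:
    print(f"Rejected: {e}")
```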


### Example 2: Connection Configuration

In [10]:
from pydantic import BaseModel, Field
from typing import Dict

class AzureBlobConnectionConfig(BaseModel):
    """Azure Blob Storage connection."""
    type: ConnectionType = ConnectionType.AZURE_BLOB
    account_name: str  # Required!
    container: str     # Required!
    auth: Dict[str, str] = Field(default_factory=dict)  # Optional with default

# ✅ Valid config
azure_config = AzureBlobConnectionConfig(
    account_name="myaccount",
    container="data",
    auth={"key": "secret"}
)
print(azure_config.model_dump_json(indent=2))

{
  "type": "azure_blob",
  "account_name": "myaccount",
  "container": "data",
  "auth": {
    "key": "secret"
  }
}


In [11]:
# ❌ Missing required fields
try:
    bad_config = AzureBlobConnectionConfig(account_name="test")
except Exception as e:
    print(f"Error: {e}")

Error: 1 validation error for AzureBlobConnectionConfig
container
  Field required [type=missing, input_value={'account_name': 'test'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/missing


### Example 3: Custom Validators

In [12]:
from pydantic import model_validator
from typing import Optional

class ReadConfig(BaseModel):
    """Configuration for reading data."""
    connection: str
    format: str
    table: Optional[str] = None
    path: Optional[str] = None
    
    @model_validator(mode="after")
    def check_table_or_path(self):
        """Ensure either table or path is provided."""
        if not self.table and not self.path:
            raise ValueError("Either 'table' or 'path' must be provided")
        return self

# ✅ Valid - has path
read_config = ReadConfig(
    connection="local",
    format="parquet",
    path="/data/input.parquet"
)
print(read_config)

connection='local' format='parquet' table=None path='/data/input.parquet'


In [13]:
# ❌ Invalid - missing both table and path
try:
    invalid_read = ReadConfig(
        connection="local",
        format="parquet"
    )
except Exception as e:
    print(f"Validation error: {e}")

Validation error: 1 validation error for ReadConfig
  Value error, Either 'table' or 'path' must be provided [type=value_error, input_value={'connection': 'local', 'format': 'parquet'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error


### Example 4: Field Constraints

In [14]:
class RetryConfig(BaseModel):
    """Retry configuration with constraints."""
    enabled: bool = True
    max_attempts: int = Field(default=3, ge=1, le=10)  # Between 1 and 10
    backoff: str = Field(
        default="exponential",
        pattern="^(exponential|linear|constant)$"  # Regex pattern!
    )

# ✅ Valid
retry = RetryConfig(max_attempts=5, backoff="linear")
print(retry)

# ❌ Invalid - too many attempts
try:
    bad_retry = RetryConfig(max_attempts=100)
except Exception as e:
    print(f"Error: {e}")

enabled=True max_attempts=5 backoff='linear'
Error: 1 validation error for RetryConfig
max_attempts
  Input should be less than or equal to 10 [type=less_than_equal, input_value=100, input_type=int]
    For further information visit https://errors.pydantic.dev/2.6/v/less_than_equal


---

## 🏗️ Build It: Mini Config System

Let's build a simplified data pipeline configuration system from scratch.

### Step 1: Define Enums

In [15]:
from enum import Enum
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional, Dict, Any

class DataFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"
    JSON = "json"

class ProcessingMode(str, Enum):
    BATCH = "batch"
    STREAMING = "streaming"

### Step 2: Source Configuration

In [16]:
class DataSource(BaseModel):
    """Configuration for a data source."""
    name: str = Field(description="Unique source name")
    path: str = Field(description="Path to data")
    format: DataFormat = DataFormat.PARQUET
    options: Dict[str, Any] = Field(default_factory=dict)
    
    @field_validator('name')
    @classmethod
    def validate_name(cls, v: str) -> str:
        """Ensure name is valid identifier."""
        if not v.isidentifier():
            raise ValueError(f"Name must be valid Python identifier: {v}")
        return v

# Test it
source = DataSource(
    name="sales_data",
    path="/data/sales.parquet",
    format=DataFormat.PARQUET,
    options={"compression": "snappy"}
)
print(source.model_dump_json(indent=2))

{
  "name": "sales_data",
  "path": "/data/sales.parquet",
  "format": "parquet",
  "options": {
    "compression": "snappy"
  }
}


### Step 3: Pipeline Configuration

In [17]:
class Pipeline(BaseModel):
    """Complete pipeline configuration."""
    name: str
    mode: ProcessingMode = ProcessingMode.BATCH
    sources: List[DataSource] = Field(min_length=1)  # At least one source!
    max_workers: int = Field(default=4, ge=1, le=32)
    tags: List[str] = Field(default_factory=list)
    
    @field_validator('sources')
    @classmethod
    def check_unique_names(cls, sources: List[DataSource]) -> List[DataSource]:
        """Ensure all source names are unique."""
        names = [s.name for s in sources]
        if len(names) != len(set(names)):
            raise ValueError(f"Duplicate source names: {names}")
        return sources

# Build a pipeline
pipeline = Pipeline(
    name="daily_sales_etl",
    mode=ProcessingMode.BATCH,
    sources=[
        DataSource(name="sales", path="/data/sales.parquet"),
        DataSource(name="customers", path="/data/customers.csv", format=DataFormat.CSV)
    ],
    max_workers=8,
    tags=["daily", "production"]
)

print(pipeline.model_dump_json(indent=2))

{
  "name": "daily_sales_etl",
  "mode": "batch",
  "sources": [
    {
      "name": "sales",
      "path": "/data/sales.parquet",
      "format": "parquet",
      "options": {}
    },
    {
      "name": "customers",
      "path": "/data/customers.csv",
      "format": "csv",
      "options": {}
    }
  ],
  "max_workers": 8,
  "tags": [
    "daily",
    "production"
  ]
}


### Step 4: Complex Validation

In [18]:
from pydantic import model_validator

class AdvancedPipeline(Pipeline):
    """Pipeline with cross-field validation."""
    output_path: Optional[str] = None
    output_format: Optional[DataFormat] = None
    
    @model_validator(mode="after")
    def check_output_consistency(self):
        """If output_path is set, output_format must be set."""
        if self.output_path and not self.output_format:
            raise ValueError("output_format required when output_path is set")
        if self.output_format and not self.output_path:
            raise ValueError("output_path required when output_format is set")
        return self

# ✅ Valid - both set
valid = AdvancedPipeline(
    name="test",
    sources=[DataSource(name="src", path="/data")],
    output_path="/output",
    output_format=DataFormat.PARQUET
)
print("✅ Valid pipeline created")

# ❌ Invalid - only path set
try:
    invalid = AdvancedPipeline(
        name="test",
        sources=[DataSource(name="src", path="/data")],
        output_path="/output"  # Missing format!
    )
except Exception as e:
    print(f"❌ Validation error: {e}")

✅ Valid pipeline created
❌ Validation error: 1 validation error for AdvancedPipeline
  Value error, output_format required when output_path is set [type=value_error, input_value={'name': 'test', 'sources...output_path': '/output'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error


---

## ✅ Test It

Let's test our models systematically.

In [19]:
def test_data_source_validation():
    """Test DataSource validation."""
    # Valid case
    source = DataSource(name="test", path="/data/test.parquet")
    assert source.name == "test"
    print("✅ Valid DataSource accepted")
    
    # Invalid name
    try:
        DataSource(name="invalid-name", path="/data")  # Hyphen not allowed
        assert False, "Should have raised error"
    except ValueError as e:
        print(f"✅ Invalid name rejected: {e}")
    
    # Missing required field
    try:
        DataSource(name="test")  # Missing path!
        assert False, "Should have raised error"
    except ValueError:  # ValidationError is a ValueError; a bare Exception would also swallow the assert
        print("✅ Missing field caught: validation_error")

test_data_source_validation()

✅ Valid DataSource accepted
✅ Invalid name rejected: 1 validation error for DataSource
name
  Value error, Name must be valid Python identifier: invalid-name [type=value_error, input_value='invalid-name', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
✅ Missing field caught: validation_error


In [20]:
def test_pipeline_validation():
    """Test Pipeline validation."""
    # Valid pipeline
    pipeline = Pipeline(
        name="test",
        sources=[DataSource(name="s1", path="/data")]
    )
    assert len(pipeline.sources) == 1
    print("✅ Valid pipeline created")
    
    # Empty sources
    try:
        Pipeline(name="test", sources=[])
        assert False, "Should require at least one source"
    except ValueError:  # A bare Exception would also swallow the assert above
        print("✅ Empty sources rejected")
    
    # Duplicate names
    try:
        Pipeline(
            name="test",
            sources=[
                DataSource(name="dup", path="/a"),
                DataSource(name="dup", path="/b")
            ]
        )
        assert False, "Should reject duplicate names"
    except ValueError:
        print("✅ Duplicate source names rejected")

test_pipeline_validation()

✅ Valid pipeline created
✅ Empty sources rejected
✅ Duplicate source names rejected
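
The tests above only check that some error was raised. When a test should pin down which rule fired, `ValidationError.errors()` returns one dictionary per failure with `loc`, `type`, and `msg` keys. A self-contained sketch, using a cut-down hypothetical `Job` model rather than the classes above:

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class Job(BaseModel):
    name: str
    max_workers: int = Field(default=4, ge=1, le=32)
    sources: List[str] = Field(min_length=1)

try:
    Job(name="test", max_workers=0, sources=[])
except ValidationError as e:
    # Map each failing field to the type of error it raised.
    failures = {err["loc"][0]: err["type"] for err in e.errors()}
    print(failures)
    # {'max_workers': 'greater_than_equal', 'sources': 'too_short'}
```

Asserting on `failures` makes a test fail if the wrong constraint rejects the input, which a bare `except` cannot detect.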


### JSON Serialization

In [21]:
# Pydantic makes JSON serialization trivial
pipeline_json = pipeline.model_dump_json(indent=2)
print("Serialized to JSON:")
print(pipeline_json)

# And deserialization
import json
data = json.loads(pipeline_json)
reconstructed = Pipeline(**data)
print(f"\n✅ Reconstructed: {reconstructed.name}")
assert reconstructed.name == pipeline.name

Serialized to JSON:
{
  "name": "daily_sales_etl",
  "mode": "batch",
  "sources": [
    {
      "name": "sales",
      "path": "/data/sales.parquet",
      "format": "parquet",
      "options": {}
    },
    {
      "name": "customers",
      "path": "/data/customers.csv",
      "format": "csv",
      "options": {}
    }
  ],
  "max_workers": 8,
  "tags": [
    "daily",
    "production"
  ]
}

✅ Reconstructed: daily_sales_etl
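
The `json.loads` step is not strictly needed. Pydantic v2 also offers `model_validate_json`, which parses and validates a JSON string in one call, and `model_dump(mode="json")`, which returns a dict of JSON-safe values. A minimal round-trip sketch with a hypothetical `Source` model:

```python
from enum import Enum
from pydantic import BaseModel

class DataFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"

class Source(BaseModel):
    name: str
    format: DataFormat = DataFormat.PARQUET

original = Source(name="sales", format=DataFormat.CSV)
payload = original.model_dump_json()

# Parse and validate the JSON string directly.
restored = Source.model_validate_json(payload)
print(restored == original)

# mode="json" yields JSON-safe values (the enum becomes its string value).
as_dict = original.model_dump(mode="json")
print(as_dict)
```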


---

## 🎯 Exercises

Complete these TODOs to practice what you've learned.

### Exercise 1: Database Connection Config

Create a `DatabaseConfig` model with validation.

In [22]:
# TODO: Create DatabaseConfig with:
# - host: str (required)
# - port: int (default 5432, must be 1-65535)
# - database: str (required)
# - username: str (required)
# - password: str (optional, for security reasons)
# - ssl_enabled: bool (default True)

class DatabaseConfig(BaseModel):
    host: str
    port: int = Field(default=5432, ge=1, le=65535)
    database: str
    username: str
    password: Optional[str] = None
    ssl_enabled: bool = True

# Test your implementation
db = DatabaseConfig(
    host="localhost",
    database="analytics",
    username="analyst"
)
print(db)

host='localhost' port=5432 database='analytics' username='analyst' password=None ssl_enabled=True


### Exercise 2: Add Custom Validator

Add validation to ensure host is not 'localhost' in production.

In [23]:
# TODO: Add an environment field and validator
# - environment: Literal["dev", "staging", "prod"]
# - Add model_validator to ensure:
#   - If environment is "prod", host cannot be "localhost"

class ProductionDatabaseConfig(BaseModel):
    host: str
    port: int = Field(default=5432, ge=1, le=65535)
    database: str
    username: str
    password: Optional[str] = None
    environment: Literal["dev", "staging", "prod"]

    @model_validator(mode="after")
    def check_env_and_host(self):
        if self.environment == "prod" and self.host == "localhost":
            raise ValueError("host cannot be `localhost` if environment = 'prod'")
        return self  # "after" validators must return the model instance


# This should fail:
try:
    bad_config = ProductionDatabaseConfig(
        host="localhost",
        environment="prod",
        database="db",
        username="user"
    )
except ValueError as e:
    print(f"✅ Caught error: {e}")

✅ Caught error: 1 validation error for ProductionDatabaseConfig
  Value error, host cannot be `localhost` if environment = 'prod' [type=value_error, input_value={'host': 'localhost', 'en...db', 'username': 'user'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error


### Exercise 3: Transformation Config

Create a model for SQL transformations.

In [26]:
# TODO: Create TransformationConfig with:
# - name: str (valid identifier)
# - sql: str (required, non-empty)
# - description: Optional[str]
# - parameters: Dict[str, Any] (default empty dict)
# Add validator to ensure sql is not just whitespace

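class TransformationConfig(BaseModel):
    name: str
    sql: str
    description: Optional[str] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)

    @field_validator("name")
    @classmethod
    def check_identifier(cls, v: str) -> str:
        if not v.isidentifier():
            raise ValueError(f"name must be a valid Python identifier: {v}")
        return v

    @field_validator("sql")
    @classmethod
    def check_non_empty_string(cls, v: str) -> str:
        # strip() so a whitespace-only string also counts as empty
        if not v.strip():
            raise ValueError("sql must be a non-empty string")
        return v  # Validators must return the value, or the field becomes None

# Test:
transform = TransformationConfig(
    name="clean_sales",
    sql="SELECT * FROM sales WHERE amount > 0",
    description="Remove negative amounts"
)
print(transform)

name='clean_sales' sql='SELECT * FROM sales WHERE amount > 0' description='Remove negative amounts' parameters={}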


### Exercise 4: Complete ETL Config

Combine everything into a complete ETL configuration.

In [None]:
# TODO: Create ETLConfig that combines:
# - source: DataSource
# - transformations: List[TransformationConfig] (at least one)
# - destination: DatabaseConfig
# - schedule: Optional[str] (cron expression)
# Add validation to ensure transformations list is not empty

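class ETLConfig(BaseModel):
    source: DataSource
    transformations: List[TransformationConfig] = Field(min_length=1)  # At least one
    destination: DatabaseConfig
    schedule: Optional[str] = None  # Cron expression

# Create a complete ETL config:
etl = ETLConfig(
    source=DataSource(name="sales", path="/data/sales.parquet"),
    transformations=[TransformationConfig(
        name="clean_sales",
        sql="SELECT * FROM sales WHERE amount > 0",
        description="Remove negative amounts",
        parameters={},
    )],
    destination=DatabaseConfig(host="localhost", database="analytics", username="analyst"),
    schedule="0 2 * * *",
)
print(etl.model_dump_json(indent=2))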

---

## 🎓 Summary

You've learned:

1. ✅ **Type hints**: Document and enable tooling
2. ✅ **Pydantic models**: Runtime validation + parsing
3. ✅ **Field constraints**: `ge`, `le`, `pattern`, `min_length`
4. ✅ **Custom validators**: `@field_validator`, `@model_validator`
5. ✅ **Enums**: Type-safe constants
6. ✅ **Fail-fast**: Catch errors when the config is loaded, not hours into a run

### Next Steps

1. Complete exercises.ipynb
2. Check solutions.ipynb
3. Explore odibi_snippets.py for more patterns
4. Try running mypy on your code
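
For the last step, a file like the following gives mypy something to find. The file name `check_me.py` and the function are just examples; the script itself runs without error, which is exactly why a static check is useful.

```python
# check_me.py - check with: mypy check_me.py
def total_rows(counts: list[int]) -> int:
    return sum(counts)

retries: int = 3
retries = "three"  # mypy flags this assignment as incompatible; Python does not

print(total_rows([10, 20, 30]))
```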

### Key Takeaway

> **In data engineering, failing fast with clear errors is better than failing hours into a pipeline run.**
>
> Type hints + Pydantic = Self-documenting, validated, maintainable configs.