# Deep Dive 01: Odibi Configuration System

## 🎯 The Problem

Data pipelines need configuration:
- Where to read data from?
- What transformations to apply?
- Where to write results?
- How nodes depend on each other?

**Without type-safe config:**
```python
# ❌ Errors caught at runtime (too late!)
config = yaml.safe_load(open('pipeline.yaml'))
connection = config['conections']  # Typo - fails later
mode = config.get('mode', 'overrite')  # Wrong default - silent corruption
```

**With Pydantic:**
```python
# ✅ Errors caught immediately on load
config = ProjectConfig.from_yaml('project.yaml')
# Typos, missing fields, wrong types = instant, clear errors
```
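
The loader above can be sketched in a few lines. This is a minimal sketch, not Odibi's implementation: `MiniProjectConfig`, its `from_yaml` classmethod, and the `check` helper are hypothetical, and `extra="forbid"` is an assumption added here so that a misspelled *optional* key is rejected instead of silently ignored (Pydantic ignores unknown keys by default).

```python
from pathlib import Path
from typing import Optional

import yaml
from pydantic import BaseModel, ConfigDict, ValidationError


class MiniProjectConfig(BaseModel):
    """Hypothetical stand-in for ProjectConfig, reduced to three fields."""

    # Assumption: reject unknown keys so misspelled optional keys are errors.
    model_config = ConfigDict(extra="forbid")

    project: str
    connections: dict
    description: Optional[str] = None

    @classmethod
    def from_yaml(cls, path):
        """Parse a YAML file and validate it in one step."""
        raw = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
        return cls.model_validate(raw)


def check(raw: dict) -> list:
    """Return the error locations Pydantic reports for a raw config dict."""
    try:
        MiniProjectConfig.model_validate(raw)
        return []
    except ValidationError as exc:
        return [err["loc"] for err in exc.errors()]


# 'conections' is a typo: the required key is missing and the unknown key is rejected.
print(check({"project": "demo", "conections": {}}))

# 'descripton' is a typo of an optional key; only extra="forbid" catches it.
print(check({"project": "demo", "connections": {}, "descripton": "x"}))
```

Without `extra="forbid"`, the second call would return an empty list, which is the "silent" failure mode the raw-dict approach suffers from.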

## 🦉 First Principles

### 1. Validate Early, Fail Fast
Catch config errors **before** executing any pipeline logic.

### 2. Type Safety
Use Python's type system + Pydantic to enforce correctness.

### 3. Clear Error Messages
Users should know *exactly* what's wrong and where.

### 4. Composability
Build complex configs from simple, reusable pieces.

## ⚡ Read Actual Odibi Code

Let's examine the production config system:

In [None]:
from pathlib import Path

odibi_config_path = Path(r"c:/Users/hodibi/OneDrive - Ingredion/Desktop/Repos/Odibi/odibi/config.py")

with open(odibi_config_path) as f:
    lines = f.readlines()
    
print(f"Total lines: {len(lines)}")
print(f"\nFirst 50 lines:\n")
print(''.join(lines[:50]))

## 🔍 Analysis: Configuration Architecture

### Hierarchy Overview

```
ProjectConfig (top-level)
├── project: str
├── engine: EngineType
├── connections: Dict[str, ConnectionConfig]
├── pipelines: List[PipelineConfig]
│   └── PipelineConfig
│       ├── pipeline: str
│       └── nodes: List[NodeConfig]
│           └── NodeConfig
│               ├── name: str
│               ├── read: Optional[ReadConfig]
│               ├── transform: Optional[TransformConfig]
│               └── write: Optional[WriteConfig]
├── story: StoryConfig
├── retry: RetryConfig
└── logging: LoggingConfig
```

## 1️⃣ Enum-Based Validation

### Why Enums?

Enums provide:
- **Type safety**: Only valid values accepted
- **IDE autocomplete**: See all options
- **No typos**: `"sprk"` is rejected; only `"spark"` or `EngineType.SPARK` is accepted
- **Clear errors**: the message lists the allowed values, e.g. "Input should be 'spark' or 'pandas'"

In [None]:
from enum import Enum
from pydantic import BaseModel, ValidationError

# Odibi's actual enums
class EngineType(str, Enum):
    """Supported execution engines."""
    SPARK = "spark"
    PANDAS = "pandas"

class ConnectionType(str, Enum):
    """Supported connection types."""
    LOCAL = "local"
    AZURE_BLOB = "azure_blob"
    DELTA = "delta"
    SQL_SERVER = "sql_server"

class WriteMode(str, Enum):
    """Write modes for output operations."""
    OVERWRITE = "overwrite"
    APPEND = "append"

class LogLevel(str, Enum):
    """Logging levels."""
    DEBUG = "DEBUG"
    INFO = "INFO"
    WARNING = "WARNING"
    ERROR = "ERROR"

# Test enum validation
print("✅ Valid:")
print(f"  Engine: {EngineType.SPARK}")
print(f"  Connection: {ConnectionType.AZURE_BLOB}")
print(f"  Write mode: {WriteMode.APPEND}")

# Test with Pydantic
class Config(BaseModel):
    engine: EngineType
    mode: WriteMode

valid = Config(engine="spark", mode="append")
print(f"\n✅ Valid config: {valid}")

try:
    invalid = Config(engine="dask", mode="append")
except ValidationError as e:
    print(f"\n❌ Invalid engine type:\n{e}")

### 💡 Key Insight: `str, Enum` Pattern

```python
class EngineType(str, Enum):  # ← Inherits from str AND Enum
```

**Why both?**
- `Enum`: Provides enumeration behavior
- `str`: Makes values JSON/YAML serializable
- Pydantic accepts both `"spark"` string and `EngineType.SPARK` enum
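
The effect of the `str` mixin can be seen with the standard library alone. The sketch below compares `EngineType` with a hypothetical `PlainEngine` that has the same value but no mixin:

```python
import json
from enum import Enum


class EngineType(str, Enum):
    SPARK = "spark"
    PANDAS = "pandas"


class PlainEngine(Enum):  # hypothetical: same value, no str mixin
    SPARK = "spark"


# Members of a str-mixin enum compare equal to their raw string value.
print(EngineType.SPARK == "spark")   # True
print(PlainEngine.SPARK == "spark")  # False

# json.dumps treats the str-mixin member as a plain string.
print(json.dumps({"engine": EngineType.SPARK}))  # {"engine": "spark"}

# A plain Enum member is not JSON serializable without a custom encoder.
try:
    json.dumps({"engine": PlainEngine.SPARK})
except TypeError as exc:
    print(f"TypeError: {exc}")

# Lookup by value returns the singleton member.
print(EngineType("pandas") is EngineType.PANDAS)  # True
```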

## 2️⃣ Simple Nested Models

Let's build from simple to complex.

In [None]:
from typing import Optional, Dict, Any
from pydantic import BaseModel, Field

# Simplified Odibi configs

class RetryConfig(BaseModel):
    """Retry configuration."""
    enabled: bool = True
    max_attempts: int = Field(default=3, ge=1, le=10)  # Between 1 and 10
    backoff: str = Field(default="exponential", pattern="^(exponential|linear|constant)$")

class LoggingConfig(BaseModel):
    """Logging configuration."""
    level: LogLevel = LogLevel.INFO
    structured: bool = Field(default=False, description="Output JSON logs")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Extra metadata")

# Test defaults
retry = RetryConfig()
print(f"Default retry: {retry}")

# Named logging_config to avoid shadowing the stdlib 'logging' module
logging_config = LoggingConfig(level="DEBUG", metadata={"team": "data-eng"})
print(f"Custom logging: {logging_config}")

# Test constraints
try:
    bad_retry = RetryConfig(max_attempts=100)  # > 10
except ValidationError as e:
    print(f"\n❌ Constraint violation:\n{e}")

### Field Constraints

Pydantic provides rich validation:
```python
Field(default=3, ge=1, le=10)  # Greater/equal 1, less/equal 10
Field(pattern="^(exponential|linear|constant)$")  # Regex validation
Field(default_factory=dict)  # Mutable defaults (safe!)
```
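
For a small fixed set of strings, `typing.Literal` is an alternative to a regex `pattern`. The sketch below is an assumption about how that could look, not how Odibi defines `RetryConfig`; `RetrySketch` and `first_error_type` are hypothetical names. It also shows the error type code Pydantic attaches to each failed constraint.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class RetrySketch(BaseModel):
    """Hypothetical variant of RetryConfig using Literal instead of a regex."""

    max_attempts: int = Field(default=3, ge=1, le=10)
    backoff: Literal["exponential", "linear", "constant"] = "exponential"


def first_error_type(**kwargs) -> str:
    """Return Pydantic's error type code for the first validation failure."""
    try:
        RetrySketch(**kwargs)
        return "ok"
    except ValidationError as exc:
        return exc.errors()[0]["type"]


print(first_error_type(max_attempts=0))       # violates ge=1
print(first_error_type(max_attempts=11))      # violates le=10
print(first_error_type(backoff="quadratic"))  # not one of the literals
print(first_error_type(backoff="linear"))     # passes
```

A `Literal` keeps the allowed values visible to type checkers and IDEs, while `pattern` remains the better fit for open-ended formats.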

## 3️⃣ Read/Write Configs with Model Validators

### The Business Rule

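For reading or writing data, you need at least one of `table` or `path`. The validator below rejects a config that sets neither; it does not reject a config that sets both.

**This is cross-field validation**: a single-field validator can't check it, because the rule depends on two fields at once.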

In [None]:
from pydantic import model_validator

class ReadConfig(BaseModel):
    """Configuration for reading data."""
    connection: str = Field(description="Connection name")
    format: str = Field(description="Data format (csv, parquet, delta)")
    table: Optional[str] = Field(default=None, description="Table name for SQL/Delta")
    path: Optional[str] = Field(default=None, description="Path for file-based sources")
    options: Dict[str, Any] = Field(default_factory=dict, description="Format options")

    @model_validator(mode="after")
    def check_table_or_path(self):
        """Ensure either table or path is provided."""
        if not self.table and not self.path:
            raise ValueError("Either 'table' or 'path' must be provided for read config")
        return self

class WriteConfig(BaseModel):
    """Configuration for writing data."""
    connection: str = Field(description="Connection name")
    format: str = Field(description="Output format")
    table: Optional[str] = Field(default=None, description="Table name")
    path: Optional[str] = Field(default=None, description="Output path")
    mode: WriteMode = Field(default=WriteMode.OVERWRITE, description="Write mode")
    options: Dict[str, Any] = Field(default_factory=dict)

    @model_validator(mode="after")
    def check_table_or_path(self):
        """Ensure either table or path is provided."""
        if not self.table and not self.path:
            raise ValueError("Either 'table' or 'path' must be provided for write config")
        return self

# ✅ Valid - has table
read1 = ReadConfig(connection="delta", format="delta", table="sales")
print(f"✅ Read with table: {read1}")

# ✅ Valid - has path
write1 = WriteConfig(connection="local", format="parquet", path="output/data.parquet")
print(f"✅ Write with path: {write1}")

# ❌ Invalid - neither
try:
    read2 = ReadConfig(connection="local", format="csv")
except ValidationError as e:
    print(f"\n❌ Missing table/path:\n{e}")

### 💡 Model Validator Pattern

```python
@model_validator(mode="after")
def check_something(self):
    # self.field1, self.field2, etc. are all populated
    if some_condition:
        raise ValueError("Clear error message")
    return self  # ← MUST return self!
```

**When to use:**
- Cross-field validation
- Business rules involving multiple fields
- Complex conditional logic
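
If a project wants the stricter rule, exactly one of `table` or `path`, the same pattern covers it. This is a sketch under that assumption; `StrictReadSketch` and `is_valid` are hypothetical and not part of Odibi.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class StrictReadSketch(BaseModel):
    """Hypothetical stricter read config: exactly one of table/path."""

    connection: str
    table: Optional[str] = None
    path: Optional[str] = None

    @model_validator(mode="after")
    def check_exactly_one_target(self):
        # Collect the names of the targets that were actually set.
        provided = [name for name in ("table", "path") if getattr(self, name)]
        if len(provided) != 1:
            got = ", ".join(provided) if provided else "neither"
            raise ValueError(
                f"Exactly one of 'table' or 'path' is required, got: {got}"
            )
        return self


def is_valid(**kwargs) -> bool:
    """Report whether the keyword arguments form a valid StrictReadSketch."""
    try:
        StrictReadSketch(**kwargs)
        return True
    except ValidationError:
        return False


print(is_valid(connection="delta", table="sales"))            # True
print(is_valid(connection="local", path="in.csv"))            # True
print(is_valid(connection="local"))                           # False
print(is_valid(connection="x", table="sales", path="in.csv")) # False
```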

## 4️⃣ Transform Config - Flexibility

Transformations can be:
1. Simple SQL strings
2. Structured `TransformStep` objects

In [None]:
from typing import Union, List

class TransformStep(BaseModel):
    """Single transformation step."""
    sql: Optional[str] = None
    function: Optional[str] = None
    operation: Optional[str] = None
    params: Dict[str, Any] = Field(default_factory=dict)

    @model_validator(mode="after")
    def check_step_type(self):
        """Ensure exactly one step type is provided."""
        step_types = [self.sql, self.function, self.operation]
        if sum(x is not None for x in step_types) != 1:
            raise ValueError("Exactly one of 'sql', 'function', or 'operation' must be provided")
        return self

class TransformConfig(BaseModel):
    """Configuration for transforming data."""
    steps: List[Union[str, TransformStep]] = Field(
        description="List of transformation steps (SQL strings or TransformStep configs)"
    )

# ✅ Simple SQL strings
transform1 = TransformConfig(steps=[
    "SELECT * FROM data WHERE amount > 0",
    "SELECT customer_id, SUM(amount) as total FROM data GROUP BY customer_id"
])
print(f"✅ SQL transforms:\n{transform1}\n")

# ✅ Structured steps
transform2 = TransformConfig(steps=[
    TransformStep(sql="SELECT * FROM data"),
    TransformStep(function="deduplicate", params={"columns": ["id"]}),
    TransformStep(operation="filter_nulls", params={"columns": ["name", "email"]})
])
print(f"✅ Structured transforms:\n{transform2}\n")

# ❌ Invalid - multiple step types
try:
    bad_step = TransformStep(sql="SELECT *", function="dedupe")
except ValidationError as e:
    print(f"❌ Multiple step types:\n{e}")

### Union Types for Flexibility

```python
steps: List[Union[str, TransformStep]]
```

**Allows:**
- Quick prototyping: just pass SQL strings
- Advanced usage: structured steps with parameters
- Mix both in same pipeline!
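
One cost of the union is that downstream code must handle two shapes. A possible alternative, sketched here as an assumption and not as Odibi's approach, is to normalize bare SQL strings into step objects with a `mode="before"` field validator. `StepSketch` and `NormalizedTransformSketch` are hypothetical names.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field, field_validator


class StepSketch(BaseModel):
    """Hypothetical reduced TransformStep."""

    sql: Optional[str] = None
    function: Optional[str] = None
    params: Dict[str, Any] = Field(default_factory=dict)


class NormalizedTransformSketch(BaseModel):
    steps: List[StepSketch]

    @field_validator("steps", mode="before")
    @classmethod
    def wrap_sql_strings(cls, steps):
        # Runs before type validation, so raw YAML values arrive here.
        if not isinstance(steps, list):
            return steps  # let Pydantic report the type error
        return [{"sql": s} if isinstance(s, str) else s for s in steps]


config = NormalizedTransformSketch(steps=[
    "SELECT * FROM data",
    {"function": "deduplicate", "params": {"columns": ["id"]}},
])

# Every step is now a StepSketch, whichever form the YAML used.
print([type(step).__name__ for step in config.steps])
print(config.steps[0].sql)
```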

## 5️⃣ Node Config - Composition

Nodes are the atomic units of pipelines.

In [None]:
class ValidationConfig(BaseModel):
    """Configuration for data validation."""
    schema_validation: Optional[Dict[str, Any]] = Field(
        default=None, alias="schema", description="Schema validation rules"
    )
    not_empty: bool = Field(default=False, description="Ensure result is not empty")
    no_nulls: List[str] = Field(
        default_factory=list, description="Columns that must not have nulls"
    )

class NodeConfig(BaseModel):
    """Configuration for a single node."""
    name: str = Field(description="Unique node name")
    description: Optional[str] = Field(default=None, description="Human-readable description")
    depends_on: List[str] = Field(default_factory=list, description="List of node dependencies")

    # Operations (at least one required)
    read: Optional[ReadConfig] = None
    transform: Optional[TransformConfig] = None
    write: Optional[WriteConfig] = None

    # Optional features
    cache: bool = Field(default=False, description="Cache result for reuse")
    validation: Optional[ValidationConfig] = None

    @model_validator(mode="after")
    def check_at_least_one_operation(self):
        """Ensure at least one operation is defined."""
        if not any([self.read, self.transform, self.write]):
            raise ValueError(
                f"Node '{self.name}' must have at least one of: read, transform, write"
            )
        return self

# ✅ Read-only node
source_node = NodeConfig(
    name="raw_sales",
    read=ReadConfig(connection="delta", format="delta", table="raw.sales")
)
print(f"✅ Source node: {source_node.name}\n")

# ✅ Transform node with dependencies
transform_node = NodeConfig(
    name="clean_sales",
    depends_on=["raw_sales"],
    transform=TransformConfig(steps=[
        "SELECT * FROM raw_sales WHERE amount > 0"
    ]),
    validation=ValidationConfig(not_empty=True, no_nulls=["customer_id", "amount"]),
    cache=True
)
print(f"✅ Transform node: {transform_node.name}, depends on {transform_node.depends_on}\n")

# ✅ Write node
sink_node = NodeConfig(
    name="sales_output",
    depends_on=["clean_sales"],
    write=WriteConfig(
        connection="delta",
        format="delta",
        table="silver.sales",
        mode=WriteMode.OVERWRITE
    )
)
print(f"✅ Sink node: {sink_node.name}\n")

# ❌ Invalid - no operations
try:
    empty_node = NodeConfig(name="empty")
except ValidationError as e:
    print(f"❌ No operations:\n{e}")

### 💡 Node Design Principles

1. **Flexible**: Can read, transform, write, or any combination
2. **Validated**: Must have at least one operation
3. **Dependencies**: Explicit via `depends_on`
4. **Optional features**: Caching, validation bolt-ons
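
Explicit `depends_on` lists make it possible to check the dependency graph before running anything. The sketch below uses the standard library's `graphlib` on a plain name-to-dependencies mapping; `execution_order` is a hypothetical helper, and nothing here is claimed to match how Odibi resolves dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict) -> list:
    """Return node names in an order that respects depends_on.

    `dependencies` maps each node name to the names it depends on.
    Raises ValueError for unknown references or cycles.
    """
    known = set(dependencies)
    for name, deps in dependencies.items():
        missing = sorted(set(deps) - known)
        if missing:
            raise ValueError(f"Node '{name}' depends on unknown nodes: {missing}")
    try:
        # TopologicalSorter expects {node: predecessors}, which matches depends_on.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"Dependency cycle detected: {exc.args[1]}") from exc


deps = {
    "raw_sales": [],
    "clean_sales": ["raw_sales"],
    "sales_output": ["clean_sales"],
}
print(execution_order(deps))
```

With the Pydantic models above, the mapping could be built as `{n.name: n.depends_on for n in pipeline.nodes}`.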

## 6️⃣ Pipeline Config - Collections

Pipelines are collections of nodes with uniqueness validation.

In [None]:
from pydantic import field_validator

class PipelineConfig(BaseModel):
    """Configuration for a pipeline."""
    pipeline: str = Field(description="Pipeline name")
    description: Optional[str] = Field(default=None, description="Pipeline description")
    layer: Optional[str] = Field(default=None, description="Logical layer (bronze/silver/gold)")
    nodes: List[NodeConfig] = Field(description="List of nodes in this pipeline")

    @field_validator("nodes")
    @classmethod
    def check_unique_node_names(cls, nodes: List[NodeConfig]) -> List[NodeConfig]:
        """Ensure all node names are unique within the pipeline."""
        names = [node.name for node in nodes]
        if len(names) != len(set(names)):
            duplicates = [name for name in names if names.count(name) > 1]
            raise ValueError(f"Duplicate node names found: {set(duplicates)}")
        return nodes

# ✅ Valid pipeline
pipeline = PipelineConfig(
    pipeline="sales_processing",
    description="Clean and aggregate sales data",
    layer="silver",
    nodes=[source_node, transform_node, sink_node]
)
print(f"✅ Pipeline: {pipeline.pipeline}")
print(f"   Nodes: {[n.name for n in pipeline.nodes]}\n")

# ❌ Invalid - duplicate names
try:
    bad_pipeline = PipelineConfig(
        pipeline="bad",
        nodes=[
            NodeConfig(name="node1", read=ReadConfig(connection="x", format="csv", path="a")),
            NodeConfig(name="node1", read=ReadConfig(connection="x", format="csv", path="b"))
        ]
    )
except ValidationError as e:
    print(f"❌ Duplicate node names:\n{e}")

### Field Validator vs Model Validator

**Field Validator:**
```python
@field_validator("nodes")
@classmethod
def check_unique_node_names(cls, nodes: List[NodeConfig]):
    # Validates ONLY the 'nodes' field
    # Must be @classmethod, receives cls + field value
```

**Model Validator:**
```python
@model_validator(mode="after")
def check_cross_field(self):
    # Validates across ALL fields
    # Instance method, receives self
```

## 7️⃣ Connection Configs - Discriminated Unions

Different connection types need different fields.

In [None]:
from typing import Union

class BaseConnectionConfig(BaseModel):
    """Base configuration for all connections."""
    type: ConnectionType
    validation_mode: str = "lazy"  # 'lazy' or 'eager'

class LocalConnectionConfig(BaseConnectionConfig):
    """Local filesystem connection."""
    type: ConnectionType = ConnectionType.LOCAL
    base_path: str = Field(default="./data", description="Base directory path")

class AzureBlobConnectionConfig(BaseConnectionConfig):
    """Azure Blob Storage connection."""
    type: ConnectionType = ConnectionType.AZURE_BLOB
    account_name: str
    container: str
    auth: Dict[str, str] = Field(default_factory=dict)

class DeltaConnectionConfig(BaseConnectionConfig):
    """Delta Lake connection."""
    type: ConnectionType = ConnectionType.DELTA
    catalog: str
    schema_name: str = Field(alias="schema")  # Aliased: a field named 'schema' would shadow BaseModel.schema

class SQLServerConnectionConfig(BaseConnectionConfig):
    """SQL Server connection."""
    type: ConnectionType = ConnectionType.SQL_SERVER
    host: str
    database: str
    port: int = 1433
    auth: Dict[str, str] = Field(default_factory=dict)

# Union of all connection types
ConnectionConfig = Union[
    LocalConnectionConfig,
    AzureBlobConnectionConfig,
    DeltaConnectionConfig,
    SQLServerConnectionConfig,
]

# ✅ Local connection
local = LocalConnectionConfig(type="local", base_path="/data/raw")
print(f"✅ Local: {local}\n")

# ✅ Azure connection
azure = AzureBlobConnectionConfig(
    type="azure_blob",
    account_name="myaccount",
    container="data",
    auth={"method": "sas_token"}
)
print(f"✅ Azure: {azure}\n")

# ✅ Delta connection
delta = DeltaConnectionConfig(
    type="delta",
    catalog="main",
    schema="bronze"  # Pass the alias; the field name 'schema_name' only works with populate_by_name=True
)
print(f"✅ Delta: {delta}")

### 💡 Discriminated Union Pattern

```python
ConnectionConfig = Union[
    LocalConnectionConfig,
    AzureBlobConnectionConfig,
    # ...
]
```

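The intent is for the `type` field to select the model:
- `type: "local"` → `LocalConnectionConfig`
- `type: "azure_blob"` → `AzureBlobConnectionConfig`
- etc.

As written, though, this is a plain `Union`: Pydantic tries the member models and keeps the best match, and each `type` field accepts any `ConnectionType`. For strict dispatch on `type`, annotate each model's `type` as a `Literal` and declare `Field(discriminator="type")` on the union.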

**Benefits:**
- Type-safe: Each connection type has its required fields
- Extensible: Add new connection types without changing existing code
- Clear errors: "Missing field 'container' for Azure connection"
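
A strictly discriminated version of this pattern might look like the sketch below. It is an illustration, not Odibi's code: `LocalConn`, `AzureBlobConn`, and `ConnSketch` are hypothetical reduced models, and `TypeAdapter` is used only so the union can be validated on its own.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class LocalConn(BaseModel):
    type: Literal["local"] = "local"
    base_path: str = "./data"


class AzureBlobConn(BaseModel):
    type: Literal["azure_blob"] = "azure_blob"
    account_name: str
    container: str


# The discriminator tells Pydantic to pick the model from the 'type' value.
ConnSketch = Annotated[Union[LocalConn, AzureBlobConn], Field(discriminator="type")]
adapter = TypeAdapter(ConnSketch)

local = adapter.validate_python({"type": "local", "base_path": "/data/raw"})
print(type(local).__name__)

try:
    adapter.validate_python({"type": "azure_blob", "account_name": "myaccount"})
except ValidationError as exc:
    # Only the selected branch is validated, so only its errors are reported.
    error_locations = [err["loc"] for err in exc.errors()]
    print(error_locations)
```

Because only the branch named by `type` is validated, a missing `container` produces a single error instead of one error per union member.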

## 8️⃣ Project Config - The Top Level

Everything comes together in `ProjectConfig`.

In [None]:
class StoryConfig(BaseModel):
    """Story generation configuration.
    
    Stories are ODIBI's core value - execution reports with lineage.
    """
    connection: str = Field(description="Connection name for story output")
    path: str = Field(description="Path for stories")
    max_sample_rows: int = Field(default=10, ge=0, le=100)
    auto_generate: bool = True

class ProjectConfig(BaseModel):
    """Complete project configuration from YAML."""
    
    # === MANDATORY (except engine, which defaults to pandas) ===
    project: str = Field(description="Project name")
    engine: EngineType = Field(default=EngineType.PANDAS, description="Execution engine")
    connections: Dict[str, Dict[str, Any]] = Field(
        description="Named connections (at least one required)"
    )
    pipelines: List[PipelineConfig] = Field(
        description="Pipeline definitions (at least one required)"
    )
    story: StoryConfig = Field(description="Story generation configuration (mandatory)")

    # === OPTIONAL (with sensible defaults) ===
    description: Optional[str] = Field(default=None, description="Project description")
    version: str = Field(default="1.0.0", description="Project version")
    owner: Optional[str] = Field(default=None, description="Project owner/contact")

    # Global settings (optional with defaults)
    retry: RetryConfig = Field(default_factory=RetryConfig)
    logging: LoggingConfig = Field(default_factory=LoggingConfig)

    @model_validator(mode="after")
    def validate_story_connection_exists(self):
        """Ensure story.connection is defined in connections."""
        if self.story.connection not in self.connections:
            available = ", ".join(self.connections.keys())
            raise ValueError(
                f"Story connection '{self.story.connection}' not found. "
                f"Available connections: {available}"
            )
        return self

# ✅ Minimal valid project
project = ProjectConfig(
    project="sales_analytics",
    engine="pandas",
    connections={
        "local": {"type": "local", "base_path": "./data"}
    },
    story=StoryConfig(connection="local", path="stories/"),
    pipelines=[pipeline]
)

print(f"✅ Project: {project.project}")
print(f"   Engine: {project.engine}")
print(f"   Pipelines: {[p.pipeline for p in project.pipelines]}")
print(f"   Connections: {list(project.connections.keys())}")
print(f"   Version: {project.version}")

# ❌ Invalid - story connection doesn't exist
try:
    bad_project = ProjectConfig(
        project="bad",
        connections={"local": {"type": "local"}},
        story=StoryConfig(connection="azure", path="stories/"),  # ← doesn't exist
        pipelines=[pipeline]
    )
except ValidationError as e:
    print(f"\n❌ Story connection validation:\n{e}")

### 💡 Cross-Collection Validation

The `validate_story_connection_exists` model validator ensures referential integrity:
- Story references a connection name
- That connection must exist in `connections` dict
- Clear error if missing

A dangling story connection is reported at config load instead of failing later, when the story is written.
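
The same referential check can be extended to the connections that nodes read from and write to. The sketch below assumes a flattened, hypothetical shape (`NodeSketch`, `ProjectSketch`) to stay short; the notebook's `ProjectConfig` does not perform this check as shown.

```python
from typing import Dict, List, Optional

from pydantic import BaseModel, ValidationError, model_validator


class NodeSketch(BaseModel):
    """Hypothetical node reduced to its connection references."""

    name: str
    read_connection: Optional[str] = None
    write_connection: Optional[str] = None


class ProjectSketch(BaseModel):
    connections: Dict[str, dict]
    nodes: List[NodeSketch]

    @model_validator(mode="after")
    def check_node_connections_exist(self):
        # Gather every dangling reference so the user sees them all at once.
        problems = []
        for node in self.nodes:
            for ref in (node.read_connection, node.write_connection):
                if ref is not None and ref not in self.connections:
                    problems.append(f"{node.name} -> '{ref}'")
        if problems:
            joined = "; ".join(problems)
            available = ", ".join(sorted(self.connections))
            raise ValueError(
                f"Unknown connections referenced: {joined}. "
                f"Available connections: {available}"
            )
        return self


try:
    ProjectSketch(
        connections={"local": {"type": "local"}},
        nodes=[NodeSketch(name="write_bronze", write_connection="delta")],
    )
except ValidationError as exc:
    message = str(exc)
    print(message)
```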

## 9️⃣ YAML Loading - The Complete Workflow

Now let's load real YAML configs.

In [None]:
import yaml
from pathlib import Path

# Sample YAML configuration
yaml_config = """
project: sales_analytics
description: Daily sales data processing pipeline
version: "2.1.0"
owner: data-team@company.com
engine: pandas

connections:
  local:
    type: local
    base_path: ./data
  
  delta:
    type: delta
    catalog: main
    schema: bronze

story:
  connection: local
  path: stories/
  max_sample_rows: 5

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential

logging:
  level: INFO
  structured: false
  metadata:
    team: data-engineering
    environment: production

pipelines:
  - pipeline: bronze_ingestion
    layer: bronze
    description: Ingest raw sales data
    nodes:
      - name: read_raw_sales
        description: Read CSV sales files
        read:
          connection: local
          format: csv
          path: input/sales.csv
          options:
            header: true
            inferSchema: true
      
      - name: write_bronze
        depends_on: [read_raw_sales]
        write:
          connection: delta
          format: delta
          table: sales
          mode: append
        validation:
          not_empty: true
          no_nulls: [transaction_id, amount]

  - pipeline: silver_transformation
    layer: silver
    description: Clean and validate sales
    nodes:
      - name: clean_sales
        transform:
          steps:
            - SELECT * FROM bronze.sales WHERE amount > 0
            - SELECT DISTINCT * FROM data
        cache: true
      
      - name: aggregate_sales
        depends_on: [clean_sales]
        transform:
          steps:
            - sql: SELECT customer_id, SUM(amount) as total FROM clean_sales GROUP BY customer_id
        write:
          connection: delta
          format: delta
          table: customer_totals
          mode: overwrite
"""

# Load and validate
raw_config = yaml.safe_load(yaml_config)
config = ProjectConfig(**raw_config)

print("✅ Successfully loaded and validated project config!\n")
print(f"Project: {config.project}")
print(f"Version: {config.version}")
print(f"Owner: {config.owner}")
print(f"Engine: {config.engine}")
print(f"\nConnections: {list(config.connections.keys())}")
print(f"\nPipelines:")
for p in config.pipelines:
    print(f"  - {p.pipeline} ({p.layer}): {len(p.nodes)} nodes")
    for n in p.nodes:
        ops = []
        if n.read: ops.append("read")
        if n.transform: ops.append("transform")
        if n.write: ops.append("write")
        print(f"    • {n.name}: {', '.join(ops)}")
        if n.depends_on:
            print(f"      depends_on: {n.depends_on}")

print(f"\nRetry: {config.retry}")
print(f"Logging: {config.logging}")
print(f"Story: connection={config.story.connection}, path={config.story.path}")
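
Going the other way, from a validated model back to YAML, is a useful sanity check on a config. The following is a minimal sketch with a hypothetical `WriteSketch` model; it relies on `model_dump(mode="json")` to turn enum members into plain strings before handing the dict to `yaml.safe_dump`.

```python
from enum import Enum
from typing import Optional

import yaml
from pydantic import BaseModel


class WriteMode(str, Enum):
    OVERWRITE = "overwrite"
    APPEND = "append"


class WriteSketch(BaseModel):
    """Hypothetical reduced WriteConfig."""

    connection: str
    table: Optional[str] = None
    mode: WriteMode = WriteMode.OVERWRITE


original = yaml.safe_load("""
connection: delta
table: sales
mode: append
""")
config = WriteSketch(**original)

# mode="json" converts enum members to plain strings that safe_dump can emit.
dumped = config.model_dump(mode="json")
text = yaml.safe_dump(dumped, sort_keys=False)
print(text)

# Re-validating the dumped text yields an equal model.
round_tripped = WriteSketch(**yaml.safe_load(text))
print(round_tripped == config)
```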

## 🔟 Error Messages - User Experience

Let's see what happens with various config errors.

In [None]:
# Error 1: Missing required field
bad_yaml_1 = """
project: test
# Missing: connections, pipelines, story
"""

try:
    config = ProjectConfig(**yaml.safe_load(bad_yaml_1))
except ValidationError as e:
    print("❌ Error 1: Missing required fields")
    print(e)
    print()

In [None]:
# Error 2: Invalid enum value
bad_yaml_2 = """
project: test
engine: dask  # Not a valid EngineType
connections:
  local: {type: local}
story:
  connection: local
  path: stories/
pipelines:
  - pipeline: test
    nodes:
      - name: node1
        read: {connection: local, format: csv, path: data.csv}
"""

try:
    config = ProjectConfig(**yaml.safe_load(bad_yaml_2))
except ValidationError as e:
    print("❌ Error 2: Invalid enum value")
    print(e)
    print()

In [None]:
# Error 3: Model validator failure
bad_yaml_3 = """
project: test
engine: pandas
connections:
  local: {type: local}
story:
  connection: azure  # Doesn't exist!
  path: stories/
pipelines:
  - pipeline: test
    nodes:
      - name: node1
        read: {connection: local, format: csv, path: data.csv}
"""

try:
    config = ProjectConfig(**yaml.safe_load(bad_yaml_3))
except ValidationError as e:
    print("❌ Error 3: Story connection doesn't exist")
    print(e)
    print()

In [None]:
# Error 4: Duplicate node names
bad_yaml_4 = """
project: test
engine: pandas
connections:
  local: {type: local}
story:
  connection: local
  path: stories/
pipelines:
  - pipeline: test
    nodes:
      - name: node1
        read: {connection: local, format: csv, path: a.csv}
      - name: node1  # Duplicate!
        read: {connection: local, format: csv, path: b.csv}
"""

try:
    config = ProjectConfig(**yaml.safe_load(bad_yaml_4))
except ValidationError as e:
    print("❌ Error 4: Duplicate node names")
    print(e)
    print()

In [None]:
# Error 5: Node with no operations
bad_yaml_5 = """
project: test
engine: pandas
connections:
  local: {type: local}
story:
  connection: local
  path: stories/
pipelines:
  - pipeline: test
    nodes:
      - name: empty_node
        # No read, transform, or write!
"""

try:
    config = ProjectConfig(**yaml.safe_load(bad_yaml_5))
except ValidationError as e:
    print("❌ Error 5: Node with no operations")
    print(e)
    print()
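
Printing the exception is fine for a notebook, but a CLI may want one compact line per problem. `ValidationError.errors()` returns a list of dicts with `loc` and `msg` keys that can be reshaped freely. The sketch below assumes hypothetical models (`NodeSketch`, `PipelineSketch`) and a hypothetical `format_errors` helper.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


def format_errors(exc: ValidationError) -> List[str]:
    """Turn a ValidationError into one 'path: message' line per problem."""
    lines = []
    for err in exc.errors():
        # 'loc' is a tuple of keys and list indexes leading to the bad value.
        path = ".".join(str(part) for part in err["loc"]) or "<root>"
        lines.append(f"{path}: {err['msg']}")
    return lines


class NodeSketch(BaseModel):
    name: str
    retries: int = Field(default=3, ge=1)


class PipelineSketch(BaseModel):
    pipeline: str
    nodes: List[NodeSketch]


report = []
try:
    PipelineSketch.model_validate(
        {"pipeline": "demo", "nodes": [{"name": "a"}, {"retries": 0}]}
    )
except ValidationError as exc:
    report = format_errors(exc)

for line in report:
    print(line)
```

The dotted path (`nodes.1.name`) points at the second node in the list, which is usually enough to find the offending block in the YAML file.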

## 🏗️ Build: Simplified Config System

Let's create a mini version from scratch.

In [None]:
from enum import Enum
from typing import Optional, List, Dict, Any
from pydantic import BaseModel, Field, field_validator, model_validator, ValidationError
import yaml

# === ENUMS ===
class DataFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"
    JSON = "json"

class TransformType(str, Enum):
    FILTER = "filter"
    AGGREGATE = "aggregate"
    JOIN = "join"

# === CONFIGS ===
class SourceConfig(BaseModel):
    path: str
    format: DataFormat
    columns: Optional[List[str]] = None

class TransformConfig(BaseModel):
    type: TransformType
    params: Dict[str, Any] = Field(default_factory=dict)

class SinkConfig(BaseModel):
    path: str
    format: DataFormat

class TaskConfig(BaseModel):
    name: str
    source: Optional[SourceConfig] = None
    transforms: List[TransformConfig] = Field(default_factory=list)
    sink: Optional[SinkConfig] = None
    
    @model_validator(mode="after")
    def check_has_operation(self):
        if not any([self.source, self.transforms, self.sink]):
            raise ValueError(f"Task '{self.name}' must have at least one operation")
        return self

class WorkflowConfig(BaseModel):
    name: str
    tasks: List[TaskConfig]
    
    @field_validator("tasks")
    @classmethod
    def check_unique_names(cls, tasks):
        names = [t.name for t in tasks]
        if len(names) != len(set(names)):
            raise ValueError("Duplicate task names found")
        return tasks

# === TEST ===
workflow_yaml = """
name: data_pipeline
tasks:
  - name: load_data
    source:
      path: input/data.csv
      format: csv
      columns: [id, name, value]
  
  - name: process_data
    transforms:
      - type: filter
        params:
          condition: value > 100
      - type: aggregate
        params:
          group_by: [name]
          agg: sum
  
  - name: save_results
    sink:
      path: output/results.parquet
      format: parquet
"""

workflow = WorkflowConfig(**yaml.safe_load(workflow_yaml))
print(f"✅ Workflow: {workflow.name}")
print(f"   Tasks: {[t.name for t in workflow.tasks]}")
for task in workflow.tasks:
    print(f"\n   Task: {task.name}")
    if task.source:
        print(f"     Source: {task.source.path} ({task.source.format})")
    if task.transforms:
        print(f"     Transforms: {[t.type for t in task.transforms]}")
    if task.sink:
        print(f"     Sink: {task.sink.path} ({task.sink.format})")

## 📊 Summary

### What We Learned

1. **Enums**: Type-safe constants prevent errors
2. **Nested Models**: Compose complex configs from simple pieces
3. **Field Validators**: Validate individual fields with constraints
4. **Model Validators**: Cross-field validation and business rules
5. **Discriminated Unions**: Different models based on type field
6. **YAML → Pydantic**: Load and validate in one step
7. **Error Messages**: Clear, actionable validation errors

### Odibi's Config Architecture

```
ProjectConfig (313 lines)
├── Enums (4): EngineType, ConnectionType, WriteMode, LogLevel
├── Connections (5 types): Local, Azure, Delta, SQL Server + Union
├── Node Ops (4): ReadConfig, TransformConfig, WriteConfig, ValidationConfig
├── Hierarchy (3): NodeConfig → PipelineConfig → ProjectConfig
└── Global (3): RetryConfig, LoggingConfig, StoryConfig
```

### Key Patterns

| Pattern | Use Case | Example |
|---------|----------|----------|
| `str, Enum` | Type-safe constants | `EngineType.SPARK` |
| `Field(default_factory=dict)` | Mutable defaults | `options: Dict = Field(default_factory=dict)` |
| `@model_validator(mode="after")` | Cross-field validation | Check table OR path |
| `@field_validator("field")` | Single field validation | Unique node names |
| `Union[Type1, Type2]` | Alternative model shapes (add a discriminator for strict dispatch) | Different connection types |
| `Optional[Type]` | Nullable fields | `description: Optional[str] = None` |

### Best Practices

✅ **DO:**
- Use enums for fixed sets of values
- Validate early with Pydantic
- Provide clear error messages
- Use `default_factory` for mutable defaults
- Document fields with `description`

❌ **DON'T:**
- Use magic strings without validation
- Use `= {}` or `= []` for defaults
- Write vague error messages
- Catch errors at runtime instead of config load
- Skip type hints

---

**Next Steps:**
1. Complete `exercises.ipynb`
2. Review `odibi_config_reference.md`
3. Move to `02_execution_context/` to see configs in action