# YAML Configuration for Data Pipelines

**Master the configuration language powering modern data engineering**

---

## 🎯 The Problem

You're building a data pipeline that needs to:
- Connect to multiple data sources (local files, cloud storage, databases)
- Run in different environments (dev, staging, production)
- Be maintainable by non-developers (analysts, data scientists)
- Keep secrets separate from code

**Hard-coding is a nightmare:**
```python
# DON'T DO THIS ❌
storage_account = "prod_account"  # How do you switch to dev?
password = "super_secret_123"      # Committed to git! Security breach!
database_host = "10.0.0.5"         # What if it changes?
```

**YAML is the solution:**
```yaml
# config.yaml - Readable, versionable, environment-aware
connections:
  database:
    host: ${DB_HOST}  # From environment variable
    port: 5432
    auth_mode: key_vault  # Secrets never in code
```

---

## 🦉 First Principles: Why YAML?

### 1. Declarative Over Imperative

YAML describes **what** you want, not **how** to do it:

```yaml
# YAML: "I want a database connection with these properties"
database:
  type: postgresql
  host: localhost
  port: 5432
```

vs.

```python
# Python: "Follow these steps to create a connection"
conn = DatabaseConnection()
conn.set_type('postgresql')
conn.set_host('localhost')
conn.set_port(5432)
conn.connect()
```

### 2. Human-Readable = Maintainable

Non-developers can read and modify YAML. Try that with pickled Python!

### 3. Language-Agnostic

The same config file can be read from Python, Go, Java, or Rust, which is crucial when microservices written in different languages share settings.

---

## ⚡ Part 1: YAML Syntax Crash Course

In [None]:
import yaml
from pprint import pprint

# Helper function to load and display YAML
def show_yaml(yaml_string):
    data = yaml.safe_load(yaml_string)
    pprint(data)
    return data

### 1.1 Scalars (Simple Values)

In [None]:
yaml_scalars = """
# Strings (quotes optional unless special chars)
project: My Data Pipeline
description: "Contains: colons and special chars!"

# Numbers
max_retries: 3
timeout: 30.5

# Booleans
enabled: true
debug: false

# Nulls
optional_field: null
also_null: ~
"""

show_yaml(yaml_scalars)

### 1.2 Lists

In [None]:
yaml_lists = """
# Block style (preferred for readability)
environments:
  - dev
  - staging
  - production

# Flow style (inline)
ports: [80, 443, 8080]

# Nested lists
matrix:
  - [1, 2, 3]
  - [4, 5, 6]
"""

show_yaml(yaml_lists)

### 1.3 Dictionaries (Nested)

In [None]:
yaml_dicts = """
# Block style - natural indentation
database:
  host: localhost
  port: 5432
  credentials:
    username: admin
    auth_mode: key_vault

# Flow style (inline)
cache: {enabled: true, ttl: 300}
"""

show_yaml(yaml_dicts)

### 1.4 Multiline Strings

In [None]:
yaml_multiline = """
# Literal block (|) - preserves newlines
sql_query: |
  SELECT 
    customer_id,
    SUM(amount) as total
  FROM sales
  WHERE date > '2024-01-01'
  GROUP BY customer_id

# Folded block (>) - folds into single line
description: >
  This is a very long description
  that will be folded into a single
  line with spaces between words.
"""

data = show_yaml(yaml_multiline)
print("\n--- SQL QUERY (literal |) ---")
print(data['sql_query'])
print("\n--- DESCRIPTION (folded >) ---")
print(data['description'])
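
Both block styles keep one trailing newline by default. As a small sketch of how the chomping indicator changes that (the keys `keep_default`, `strip`, and `folded_strip` are made up for this example), `|-` and `>-` drop the final newline, which matters when the string is compared or embedded in another command:

```python
import yaml

chomp_doc = """
keep_default: |
  SELECT 1
strip: |-
  SELECT 1
folded_strip: >-
  one long
  sentence
"""

chomped = yaml.safe_load(chomp_doc)

# repr() makes the trailing newline (or its absence) visible.
for key, value in chomped.items():
    print(f"{key}: {value!r}")
```
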

### 1.5 Anchors & Aliases (DRY Principle)

In [None]:
yaml_anchors = """
# Define once with anchor (&)
default_retry: &retry_config
  max_attempts: 3
  backoff: exponential
  backoff_seconds: 2.0

# Reuse with alias (*)
pipeline_1:
  name: ETL Pipeline
  retry: *retry_config

pipeline_2:
  name: ML Pipeline
  retry: *retry_config  # Same config, no duplication!
"""

show_yaml(yaml_anchors)
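
One consequence worth knowing: with PyYAML's `safe_load`, every alias resolves to the same Python object as its anchor, not to a copy. The sketch below reuses a trimmed version of the document above to show this, and uses `copy.deepcopy` when one pipeline needs private settings (the new value of 10 is arbitrary):

```python
import copy
import yaml

alias_doc = """
default_retry: &retry_config
  max_attempts: 3
pipeline_1:
  retry: *retry_config
pipeline_2:
  retry: *retry_config
"""

loaded = yaml.safe_load(alias_doc)

# Both aliases resolve to the very same dict object, not to copies.
shared = loaded['pipeline_1']['retry'] is loaded['pipeline_2']['retry']
print("same object:", shared)

# Give pipeline_1 a private copy before changing it...
loaded['pipeline_1']['retry'] = copy.deepcopy(loaded['pipeline_1']['retry'])
loaded['pipeline_1']['retry']['max_attempts'] = 10

# ...so pipeline_2 still sees the original value.
print(loaded['pipeline_2']['retry']['max_attempts'])
```

Without the deep copy, the assignment to `max_attempts` would change the retry settings of both pipelines and of `default_retry` at once.
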

### 1.6 Merge Keys (Override Defaults)

In [None]:
yaml_merge = """
defaults: &defaults
  timeout: 30
  retries: 3
  log_level: INFO

dev_config:
  <<: *defaults  # Merge defaults
  log_level: DEBUG  # Override specific value

prod_config:
  <<: *defaults
  timeout: 60  # Production needs longer timeout
  retries: 5
"""

show_yaml(yaml_merge)
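
Merge keys are shallow: they copy top-level keys only, so overriding a nested mapping replaces it entirely instead of merging into it. A minimal sketch, assuming a hypothetical nested `database` block and the placeholder host `prod.example.com`:

```python
import yaml

shallow_doc = """
defaults: &defaults
  timeout: 30
  database:
    host: localhost
    port: 5432

prod_config:
  <<: *defaults
  database:
    host: prod.example.com
"""

merged = yaml.safe_load(shallow_doc)
prod = merged['prod_config']

print(prod['timeout'])             # inherited from defaults
print(prod['database'])            # the whole nested mapping was replaced
print('port' in prod['database'])  # the default port did not carry over
```

If nested overrides are needed, either anchor the nested mapping separately or merge the loaded dictionaries in Python.
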

---

## ⚡ Part 2: PyYAML Deep Dive

### 2.1 Loading YAML (safely!)

In [None]:
import yaml

# ✅ ALWAYS USE safe_load (prevents code execution)
with open('example_configs/basic.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("Loaded config:")
pprint(config)

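# ❌ NEVER load untrusted YAML with an unsafe loader - security risk!
# (PyYAML 6+ raises TypeError if yaml.load() is called without a Loader.)
# with open('config.yaml') as f:
#     config = yaml.load(f, Loader=yaml.UnsafeLoader)  # Can execute arbitrary Python!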

**Why `safe_load`?**

```yaml
# Malicious YAML with unsafe load
!!python/object/apply:os.system
args: ['rm -rf /']
```

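Loading this with an unsafe loader (`yaml.load(f, Loader=yaml.UnsafeLoader)`, or a bare `yaml.load(f)` in old PyYAML releases) would **execute this code**! Always use `safe_load()`.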
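
To see the protection in action, the sketch below feeds `safe_load` a harmless stand-in for the payload above (it would only echo a string) and confirms that the document is rejected instead of executed. The variable names are illustrative:

```python
import yaml

# Harmless stand-in for the malicious document: it would only echo text.
payload = "!!python/object/apply:os.system ['echo pwned']"

try:
    yaml.safe_load(payload)
    rejected = False
except yaml.YAMLError as exc:
    # safe_load has no constructor for python/* tags, so it raises
    # instead of calling os.system.
    rejected = True
    print("safe_load refused the document:", type(exc).__name__)

print("rejected:", rejected)
```
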

### 2.2 Writing YAML

In [None]:
# Python dict to YAML
config = {
    'project': 'Data Pipeline',
    'connections': {
        'database': {
            'host': 'localhost',
            'port': 5432
        }
    },
    'pipelines': ['bronze_to_silver', 'silver_to_gold']
}

# Convert to YAML string
yaml_string = yaml.safe_dump(config, default_flow_style=False, sort_keys=False)
print(yaml_string)

# Write to file
with open('example_configs/generated.yaml', 'w') as f:
    yaml.safe_dump(config, f, default_flow_style=False, sort_keys=False)

### 2.3 Type Coercion Gotchas

In [None]:
# The famous "Norway Problem"
yaml_gotcha = """
countries:
  - NO  # Norway ISO code
  - SE  # Sweden
  - DK  # Denmark
"""

data = yaml.safe_load(yaml_gotcha)
print("Countries:", data['countries'])
print("Type of NO:", type(data['countries'][0]))
# NO becomes False! PyYAML follows YAML 1.1, which treats yes/no/on/off as booleans

# Solution: Quote strings
yaml_fixed = """
countries:
  - "NO"  # Now it's a string
  - "SE"
  - "DK"
"""
data = yaml.safe_load(yaml_fixed)
print("\nFixed countries:", data['countries'])

**Other gotchas:**
- `12:30` → Interpreted as 750 (a base-60 integer: 12 × 60 + 30)
- `version: 1.0` → Becomes float, not string
- Leading zeros: `012` → Octal number (10 in decimal)

**Rule:** When in doubt, use quotes!
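
The gotchas above can be checked directly. This sketch loads each value unquoted and quoted with PyYAML and prints the resulting Python type (the key names are invented for the example):

```python
import yaml

gotchas = yaml.safe_load("""
start_time: 12:30
version: 1.0
zip_prefix: 012
feature_flag: yes
quoted_time: "12:30"
quoted_version: "1.0"
""")

# Show the parsed value next to the Python type PyYAML chose for it.
for key, value in gotchas.items():
    print(f"{key:15} {value!r:10} {type(value).__name__}")
```

The unquoted values come back as `int`, `float`, `int`, and `bool`; the quoted ones stay strings.
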

---

## ⚡ Part 3: Validation with Pydantic

In [None]:
from pydantic import BaseModel, Field, field_validator
from typing import List, Dict, Optional, Literal
from pathlib import Path

# Define schema for database connection
class DatabaseConfig(BaseModel):
    host: str
    port: int = Field(gt=0, lt=65536)  # Valid port range
    database: str
    auth_mode: Literal['password', 'key_vault', 'managed_identity']
    
    # Pydantic v2 style (model_dump_json below is also a v2 API)
    @field_validator('host')
    @classmethod
    def validate_host(cls, v):
        if not v.strip():
            raise ValueError('host must not be empty')
        # Could add more validation (IP, domain)
        return v

class RetryConfig(BaseModel):
    max_attempts: int = Field(ge=1, le=10)
    backoff_seconds: float = Field(gt=0)

class PipelineConfig(BaseModel):
    project: str
    database: DatabaseConfig
    retry: RetryConfig
    environments: List[str]

# Test with valid config
yaml_valid = """
project: My Pipeline
database:
  host: localhost
  port: 5432
  database: analytics
  auth_mode: key_vault
retry:
  max_attempts: 3
  backoff_seconds: 2.0
environments:
  - dev
  - prod
"""

config_dict = yaml.safe_load(yaml_valid)
config = PipelineConfig(**config_dict)
print("✅ Valid config:")
print(config.model_dump_json(indent=2))

In [None]:
# Test with INVALID config
yaml_invalid = """
project: My Pipeline
database:
  host: localhost
  port: 99999  # Invalid port!
  database: analytics
  auth_mode: plain_text  # Not in allowed values!
retry:
  max_attempts: 100  # Too many!
  backoff_seconds: -1  # Negative!
environments: not_a_list  # Wrong type!
"""

from pydantic import ValidationError

try:
    config_dict = yaml.safe_load(yaml_invalid)
    config = PipelineConfig(**config_dict)
except ValidationError as e:
    print("❌ Validation failed:")
    print(e)
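
Printing the exception is fine for humans, but a pipeline often needs the failures as data. As a sketch, `ValidationError.errors()` returns one dictionary per problem with the field location and message. `PortOnly` is a hypothetical cut-down model used only to keep the example short:

```python
from pydantic import BaseModel, Field, ValidationError

class PortOnly(BaseModel):  # hypothetical, cut-down model for this sketch
    host: str
    port: int = Field(gt=0, lt=65536)

try:
    PortOnly(**{'host': 'localhost', 'port': 99999})
    problems = []
except ValidationError as exc:
    # Each entry carries the field location and a human-readable message.
    problems = [(".".join(str(part) for part in err['loc']), err['msg'])
                for err in exc.errors()]

for location, message in problems:
    print(f"{location}: {message}")
```
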

---

## ⚡ Part 4: Environment Variables

In [None]:
import os
import re

def load_yaml_with_env(yaml_string: str) -> dict:
    """Load YAML and replace ${VAR} with environment variables."""
    
    # Replace ${VAR} with environment variable value
    def replace_env(match):
        var_name = match.group(1)
        return os.environ.get(var_name, match.group(0))  # Keep ${VAR} if not found
    
    expanded = re.sub(r'\$\{([^}]+)\}', replace_env, yaml_string)
    return yaml.safe_load(expanded)

# Set environment variables (demo only: in practice they are set outside the code)
os.environ['DB_HOST'] = 'production.database.com'
os.environ['DB_PASSWORD'] = 'super_secret_password'

yaml_with_env = """
database:
  host: ${DB_HOST}
  password: ${DB_PASSWORD}
  port: 5432
"""

config = load_yaml_with_env(yaml_with_env)
print("Config with env vars:")
pprint(config)
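
The loader above substitutes text before parsing, so a value containing `: ` or `#` can change the YAML structure, and an unset variable silently stays as the literal `${VAR}`. One possible alternative, sketched here under those assumptions, expands variables after parsing and fails loudly on missing ones. `expand_env` and `DEMO_DB_HOST` are hypothetical names, and the dictionary stands in for the result of `yaml.safe_load`:

```python
import os
import re

ENV_PATTERN = re.compile(r'\$\{([^}]+)\}')

def expand_env(value):
    """Recursively replace ${VAR} in string leaves of already-parsed data."""
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    if isinstance(value, str):
        # os.environ[...] raises KeyError for an unset variable.
        return ENV_PATTERN.sub(lambda m: os.environ[m.group(1)], value)
    return value

# A value that would break the YAML if substituted before parsing.
os.environ['DEMO_DB_HOST'] = 'db.internal: 5432 # not a comment'

# Stand-in for yaml.safe_load() output.
parsed = {'database': {'host': '${DEMO_DB_HOST}', 'port': 5432}}
expanded_config = expand_env(parsed)
print(expanded_config)
```
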

---

## 🔍 Part 5: Analyzing Odibi's YAML Structure

Let's examine real production YAML from Odibi!

In [None]:
# Load Odibi example config
odibi_config_path = r'c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi\examples\example_delta_pipeline.yaml'

with open(odibi_config_path, 'r') as f:
    odibi_config = yaml.safe_load(f)

print("Top-level keys:")
print(list(odibi_config.keys()))

print("\nConnections:")
pprint(odibi_config['connections'])

print("\nFirst pipeline:")
pprint(odibi_config['pipelines'][0])

### Key Patterns in Odibi Configs:

1. **Connections as First-Class Citizens**
   - Named connections (local, bronze, silver)
   - Type-specific configs (local vs azure_adls)
   - Auth modes separated from credentials

2. **Pipeline Structure**
   - Pipelines contain nodes
   - Nodes have `depends_on` for DAG execution
   - Operations: read, transform, write

3. **Delta Lake Support**
   - Format: `delta`
   - Options: `versionAsOf` (time travel)
   - Partitioning with `partition_by`

See [odibi_config_patterns.md](odibi_config_patterns.md) for detailed analysis.
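
To make the `depends_on` pattern concrete, here is a minimal sketch of turning node dependencies into an execution order with the standard library's `graphlib` (Python 3.9+). The node names are hypothetical, and this is not necessarily how Odibi itself schedules nodes:

```python
from graphlib import TopologicalSorter

# Hypothetical nodes shaped like the `depends_on` pattern described above.
nodes = [
    {'name': 'write_silver', 'depends_on': ['clean_sales']},
    {'name': 'read_bronze', 'depends_on': []},
    {'name': 'clean_sales', 'depends_on': ['read_bronze']},
]

# Map each node to the set of nodes that must run before it.
graph = {node['name']: set(node['depends_on']) for node in nodes}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(graph).static_order())
print(execution_order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle, which is a useful check to run at config load time.
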

---

## 🏗️ Build: Create Your Own Config System

In [None]:
from typing import List, Dict, Optional, Literal
from pydantic import BaseModel, Field
import yaml
from pathlib import Path

# Define schema for your own pipeline config
class ConnectionConfig(BaseModel):
    type: Literal['local', 'azure_adls', 'azure_sql']
    base_path: Optional[str] = None
    account: Optional[str] = None
    container: Optional[str] = None

class NodeConfig(BaseModel):
    name: str
    depends_on: List[str] = []
    operation: Literal['read', 'transform', 'write']

class PipelineConfig(BaseModel):
    pipeline: str
    nodes: List[NodeConfig]

class AppConfig(BaseModel):
    project: str
    connections: Dict[str, ConnectionConfig]
    pipelines: List[PipelineConfig]

# Load and validate
def load_config(path: Path) -> AppConfig:
    with open(path, 'r') as f:
        data = yaml.safe_load(f)
    return AppConfig(**data)

# Test it!
config = load_config(Path('example_configs/mini_pipeline.yaml'))
print(f"✅ Loaded project: {config.project}")
print(f"✅ Connections: {list(config.connections.keys())}")
print(f"✅ Pipelines: {[p.pipeline for p in config.pipelines]}")
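
The schema above checks each field in isolation, but it cannot tell whether a `depends_on` entry names a node that actually exists. One way to add that cross-field check, assuming Pydantic v2, is a `model_validator`. `CheckedNode` and `CheckedPipeline` are hypothetical stand-ins for `NodeConfig` and `PipelineConfig`:

```python
from typing import List
from pydantic import BaseModel, ValidationError, model_validator

class CheckedNode(BaseModel):
    name: str
    depends_on: List[str] = []

class CheckedPipeline(BaseModel):
    pipeline: str
    nodes: List[CheckedNode]

    @model_validator(mode='after')
    def dependencies_exist(self):
        # Runs after field validation, so self.nodes is fully parsed.
        known = {node.name for node in self.nodes}
        for node in self.nodes:
            missing = set(node.depends_on) - known
            if missing:
                raise ValueError(
                    f"node '{node.name}' depends on unknown nodes: {sorted(missing)}"
                )
        return self

try:
    CheckedPipeline(pipeline='demo', nodes=[
        {'name': 'load', 'depends_on': ['extract']},  # 'extract' is not defined
    ])
    caught = None
except ValidationError as exc:
    caught = exc
    print(exc)
```
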

---

## ✅ Best Practices

### 1. Security
```yaml
# ✅ GOOD: Reference secrets
auth_mode: key_vault
secret_name: db-password

# ❌ BAD: Hardcode secrets
password: my_actual_password_123
```

### 2. Environment Separation
```
config/
  base.yaml       # Shared settings
  dev.yaml        # Dev overrides
  prod.yaml       # Prod overrides
```
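
A layout like this needs a step that overlays the environment file on the base file. The sketch below shows one possible recursive merge. `deep_merge` is a hypothetical helper, and the two dictionaries stand in for the results of `yaml.safe_load` on `base.yaml` and `prod.yaml`, with made-up values:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Stand-ins for yaml.safe_load() results of base.yaml and prod.yaml.
base_cfg = {'database': {'host': 'localhost', 'port': 5432}, 'log_level': 'INFO'}
prod_cfg = {'database': {'host': 'prod.example.com'}, 'log_level': 'WARNING'}

effective = deep_merge(base_cfg, prod_cfg)
print(effective)
```

Unlike a YAML merge key, this keeps `port` from the base file while taking `host` from the override. Lists are replaced wholesale here, which is a design choice; appending them would be equally defensible.
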

### 3. Validation
```python
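import sys
import yaml
from pydantic import ValidationError

# Always validate on load (AppConfig is the schema defined earlier)
try:
    with open('config.yaml') as f:
        config = AppConfig(**yaml.safe_load(f))
except ValidationError as e:
    print(f"Invalid config: {e}")
    sys.exit(1)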
```

### 4. Documentation
```yaml
# Use comments liberally!
retry:
  max_attempts: 3  # Number of retries before failure
  backoff: exponential  # Options: linear, exponential, constant
```

---

## 🎯 Summary

You now know:

✅ YAML syntax (scalars, lists, dicts, multiline, anchors)  
✅ PyYAML (`safe_load`, type coercion gotchas)  
✅ Pydantic validation (catch errors early)  
✅ Environment variables (keep secrets separate)  
✅ Odibi's config patterns (connections, pipelines, Delta)  

**Next:** Complete the exercises in [exercises.ipynb](exercises.ipynb)!

---

## 📚 Resources

- [YAML Specification](https://yaml.org/spec/1.2/spec.html)
- [PyYAML Documentation](https://pyyaml.org/wiki/PyYAMLDocumentation)
- [Pydantic Settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/)
- [Odibi Config Docs](../../Odibi/docs/CONFIGURATION_EXPLAINED.md)