# Pipeline Orchestration: The Complete Symphony

The Pipeline is where **everything comes together**: Config, Graph, Context, Engine, and Node working in harmony.

## Part 1: Understanding Pipeline Architecture

### The Pipeline's Responsibilities

1. **Initialization**: Set up engine, context, graph, and story generator
2. **Orchestration**: Execute nodes in dependency order
3. **Tracking**: Monitor success, failures, and skipped nodes
4. **Recovery**: Handle failures gracefully
5. **Documentation**: Generate execution stories

In [None]:
# Pipeline initialization breakdown
from typing import Dict, Any, Optional
from dataclasses import dataclass, field

# Key components initialized by Pipeline:
components = {
    'config': 'PipelineConfig - defines what to run',
    'engine': 'PandasEngine/SparkEngine - how to process data',
    'context': 'ExecutionContext - shares data between nodes',
    'graph': 'DependencyGraph - determines execution order',
    'story_generator': 'StoryGenerator - creates documentation',
    'connections': 'Dict of connection objects - I/O handlers'
}

for name, purpose in components.items():
    print(f"{name:20} -> {purpose}")

In [None]:
# Real Pipeline initialization (from source)
'''
class Pipeline:
    def __init__(
        self,
        pipeline_config: PipelineConfig,
        engine: str = "pandas",
        connections: Optional[Dict[str, Any]] = None,
        generate_story: bool = True,
        story_config: Optional[Dict[str, Any]] = None,
    ):
        self.config = pipeline_config
        self.engine_type = engine
        self.connections = connections or {}
        self.generate_story = generate_story
        
        # Initialize story generator
        self.story_generator = StoryGenerator(...)
        
        # Initialize engine
        if engine == "pandas":
            self.engine = PandasEngine()
        
        # Initialize context
        self.context = create_context(engine)
        
        # Build dependency graph
        self.graph = DependencyGraph(pipeline_config.nodes)
'''
print("Pipeline initialization creates a complete execution environment")
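
The constructor excerpt above shows only the `"pandas"` branch. As a minimal sketch of how an engine name could be resolved and an unrecognised name rejected early, the snippet below uses a small registry. `ENGINE_REGISTRY`, `resolve_engine`, and `StubPandasEngine` are hypothetical names for illustration, not part of Odibi's API.

```python
from typing import Callable, Dict


class StubPandasEngine:
    """Hypothetical stand-in for the real PandasEngine."""

    name = "pandas"


# Hypothetical registry: engine name -> zero-argument factory.
ENGINE_REGISTRY: Dict[str, Callable[[], object]] = {
    "pandas": StubPandasEngine,
}


def resolve_engine(engine: str) -> object:
    """Return an engine instance, or raise ValueError for an unknown name."""
    try:
        factory = ENGINE_REGISTRY[engine]
    except KeyError:
        supported = ", ".join(sorted(ENGINE_REGISTRY))
        raise ValueError(
            f"Unsupported engine: {engine!r} (supported: {supported})"
        ) from None
    return factory()


print(type(resolve_engine("pandas")).__name__)
```

Failing at construction time keeps a misspelled engine name from surfacing later as a missing attribute during `run()`.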

## Part 2: PipelineResults - Tracking Execution

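The `PipelineResults` dataclass captures the outcome of a pipeline run: which nodes completed, failed, or were skipped, the detailed result for each node, timing, and the path to the generated story.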

In [None]:
# PipelineResults structure
@dataclass
class PipelineResults:
    pipeline_name: str
    completed: list = field(default_factory=list)      # Successfully executed nodes
    failed: list = field(default_factory=list)         # Nodes that failed
    skipped: list = field(default_factory=list)        # Nodes skipped due to failures
    node_results: Dict = field(default_factory=dict)   # Detailed results per node
    duration: float = 0.0                              # Total execution time
    start_time: Optional[str] = None                   # ISO timestamp
    end_time: Optional[str] = None                     # ISO timestamp
    story_path: Optional[str] = None                   # Path to generated story

# Example results
example_results = PipelineResults(
    pipeline_name="bronze_to_silver",
    completed=["raw_customers", "clean_customers"],
    failed=["raw_orders"],
    skipped=["clean_orders", "customer_orders"],
    duration=12.5,
    start_time="2025-01-15T10:30:00",
    end_time="2025-01-15T10:30:12.500"
)

print("Pipeline Results:")
print(f"  ✅ Completed: {len(example_results.completed)}")
print(f"  ❌ Failed: {len(example_results.failed)}")
print(f"  ⏭️  Skipped: {len(example_results.skipped)}")
print(f"  ⏱️  Duration: {example_results.duration}s")
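
The three node lists are enough to derive a quick health summary. The helper below is a hypothetical sketch (`summarize_run` is not an Odibi function) that takes the lists directly, so it runs without the library.

```python
from typing import Dict, List


def summarize_run(
    completed: List[str], failed: List[str], skipped: List[str]
) -> Dict[str, object]:
    """Derive simple aggregate figures from the three node lists."""
    total = len(completed) + len(failed) + len(skipped)
    return {
        "total": total,
        # A run only counts as a success if nothing failed or was skipped.
        "succeeded": not failed and not skipped,
        "completion_rate": len(completed) / total if total else 0.0,
    }


summary = summarize_run(
    completed=["raw_customers", "clean_customers"],
    failed=["raw_orders"],
    skipped=["clean_orders", "customer_orders"],
)
print(summary)
```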

## Part 3: Pipeline Execution Flow

The `run()` method orchestrates the entire execution.

In [None]:
# Execution flow breakdown
execution_steps = [
    "1. Start timer and create PipelineResults",
    "2. Get execution order from graph.topological_sort()",
    "3. For each node in order:",
    "   a. Check if dependencies failed -> skip if yes",
    "   b. Create Node instance with config, context, engine",
    "   c. Execute node",
    "   d. Store result in PipelineResults",
    "   e. Mark as completed or failed",
    "4. Calculate total duration",
    "5. Generate story (if enabled)",
    "6. Return PipelineResults"
]

for step in execution_steps:
    print(step)
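
Step 2 relies on a topological sort of the dependency graph. A minimal sketch using Kahn's algorithm is shown below. It assumes the graph is a plain dict mapping each node name to the names it depends on; the real `DependencyGraph` may store and sort its nodes differently.

```python
from collections import deque
from typing import Dict, List


def topological_sort(depends_on: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that every node appears after its dependencies."""
    # Count unmet dependencies per node and record the reverse edges.
    remaining = {node: len(deps) for node, deps in depends_on.items()}
    dependents: Dict[str, List[str]] = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            if dep not in depends_on:
                raise ValueError(f"Node {node!r} depends on unknown node {dep!r}")
            dependents[dep].append(node)

    # Nodes with no dependencies are ready immediately.
    ready = deque(node for node, count in remaining.items() if count == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in dependents[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any node left unprocessed is part of a cycle.
    if len(order) != len(depends_on):
        raise ValueError("Dependency cycle detected")
    return order


graph = {
    "raw_customers": [],
    "raw_orders": [],
    "clean_customers": ["raw_customers"],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_customers", "clean_orders"],
}
print(topological_sort(graph))
```

Python 3.9+ also ships `graphlib.TopologicalSorter` in the standard library, which covers the same need without hand-written code.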

In [None]:
# Simplified run() logic
'''
def run(self, parallel: bool = False) -> PipelineResults:
    start_time = time.time()
    results = PipelineResults(pipeline_name=self.config.pipeline)
    
    # Get execution order from dependency graph
    execution_order = self.graph.topological_sort()
    
    # Execute nodes in order
    for node_name in execution_order:
        node_config = self.graph.nodes[node_name]
        
        # Skip if dependencies failed
        deps_failed = any(dep in results.failed for dep in node_config.depends_on)
        if deps_failed:
            results.skipped.append(node_name)
            continue
        
        # Execute node
        node = Node(
            config=node_config,
            context=self.context,
            engine=self.engine,
            connections=self.connections
        )
        
        node_result = node.execute()
        results.node_results[node_name] = node_result
        
        if node_result.success:
            results.completed.append(node_name)
        else:
            results.failed.append(node_name)
    
    results.duration = time.time() - start_time
    
    # Generate story
    if self.generate_story:
        story_path = self.story_generator.generate(...)
        results.story_path = story_path
    
    return results
'''
print("Pipeline.run() coordinates all components")
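
To experiment with the loop without Odibi installed, the sketch below replays the same control flow with plain callables standing in for `Node.execute()`. `run_nodes` and its arguments are hypothetical. It also assumes that a skipped dependency blocks its dependents in the same way a failed one does, so that a failure propagates through more than one level.

```python
import time
from typing import Callable, Dict, List


def run_nodes(
    order: List[str],
    depends_on: Dict[str, List[str]],
    actions: Dict[str, Callable[[], None]],
) -> Dict[str, object]:
    """Run actions in the given order, skipping nodes whose upstream is broken."""
    completed: List[str] = []
    failed: List[str] = []
    skipped: List[str] = []
    start = time.perf_counter()

    for name in order:
        # Assumption: a skipped dependency blocks dependents just like a failed one.
        blocked = any(dep in failed or dep in skipped for dep in depends_on[name])
        if blocked:
            skipped.append(name)
            continue
        try:
            actions[name]()
            completed.append(name)
        except Exception:
            # Record the failure and keep going so independent branches still run.
            failed.append(name)

    return {
        "completed": completed,
        "failed": failed,
        "skipped": skipped,
        "duration": time.perf_counter() - start,
    }


def fail() -> None:
    raise RuntimeError("simulated read failure")


depends_on = {
    "raw_customers": [],
    "raw_orders": [],
    "clean_customers": ["raw_customers"],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_customers", "clean_orders"],
}
order = list(depends_on)  # already listed in a valid dependency order
actions: Dict[str, Callable[[], None]] = {name: (lambda: None) for name in order}
actions["raw_orders"] = fail

outcome = run_nodes(order, depends_on, actions)
print(outcome["completed"], outcome["failed"], outcome["skipped"])
```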

## Part 4: Failure Propagation

When a node fails, every node that depends on it, directly or transitively, is skipped automatically.

In [None]:
# Example: Failure propagation
class FailurePropagationDemo:
    def __init__(self):
        self.failed = []
        self.skipped = []
        self.completed = []
    
    def execute_node(self, node_name, depends_on, will_fail=False):
        # Skip if any dependency failed or was itself skipped,
        # so a failure propagates through every downstream level
        deps_blocked = any(
            dep in self.failed or dep in self.skipped for dep in depends_on
        )
        
        if deps_blocked:
            self.skipped.append(node_name)
            print(f"⏭️  SKIPPED: {node_name} (upstream dependency failed or was skipped)")
            return
        
        if will_fail:
            self.failed.append(node_name)
            print(f"❌ FAILED: {node_name}")
        else:
            self.completed.append(node_name)
            print(f"✅ COMPLETED: {node_name}")

# Simulate pipeline execution
demo = FailurePropagationDemo()

print("\nSimulating Pipeline Execution:\n")
demo.execute_node("raw_customers", [])
demo.execute_node("raw_orders", [], will_fail=True)  # This fails!
demo.execute_node("clean_customers", ["raw_customers"])
demo.execute_node("clean_orders", ["raw_orders"])  # Skipped due to failure
demo.execute_node("customer_orders", ["clean_customers", "clean_orders"])  # Skipped

print("\nFinal Status:")
print(f"  Completed: {demo.completed}")
print(f"  Failed: {demo.failed}")
print(f"  Skipped: {demo.skipped}")
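
It can be useful to know the full blast radius of a failure before running anything. The sketch below walks the graph from a failed node and collects every node that would be skipped. `downstream_of` is a hypothetical helper, and the graph is assumed to be a dict of node name to dependency names.

```python
from typing import Dict, List, Set


def downstream_of(failed_node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every node that depends on failed_node, directly or transitively."""
    affected: Set[str] = set()
    frontier = [failed_node]
    while frontier:
        current = frontier.pop()
        for node, deps in depends_on.items():
            if current in deps and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected


graph = {
    "raw_customers": [],
    "raw_orders": [],
    "clean_customers": ["raw_customers"],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_customers", "clean_orders"],
}
print(sorted(downstream_of("raw_orders", graph)))
```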

## Part 5: Layer-Based Execution

The graph groups nodes into **layers** for potential parallel execution. Nodes in the same layer do not depend on each other, so they could safely run at the same time.

In [None]:
# Execution layers example
layers = [
    ["raw_customers", "raw_orders", "raw_products"],  # Layer 0: No dependencies
    ["clean_customers", "clean_orders"],               # Layer 1: Depend on Layer 0
    ["customer_orders"],                               # Layer 2: Depends on Layer 1
    ["customer_analytics"]                             # Layer 3: Depends on Layer 2
]

print("Execution Layers:\n")
for i, layer in enumerate(layers):
    print(f"Layer {i}: {layer}")
    print(f"  → Can execute in parallel: {len(layer) > 1}\n")

print("Note: Current implementation is sequential,")
print("but layers enable future parallel execution!")
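
One way such layers can be derived is to repeatedly peel off every node whose dependencies have already been placed. The sketch below assumes a plain dict graph; the dependency edges are guesses that match the layer listing above, and `execution_layers` is a hypothetical stand-in for `get_execution_layers()`.

```python
from typing import Dict, List, Set


def execution_layers(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group nodes into layers; each layer depends only on earlier layers."""
    placed: Set[str] = set()
    pending = dict(depends_on)
    layers: List[List[str]] = []
    while pending:
        # A node is ready once all of its dependencies sit in earlier layers.
        layer = [
            node for node, deps in pending.items()
            if all(dep in placed for dep in deps)
        ]
        if not layer:
            raise ValueError("Dependency cycle or unknown dependency detected")
        layers.append(layer)
        placed.update(layer)
        for node in layer:
            del pending[node]
    return layers


# Assumed edges, chosen to reproduce the four layers listed above.
graph = {
    "raw_customers": [],
    "raw_orders": [],
    "raw_products": [],
    "clean_customers": ["raw_customers"],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_customers", "clean_orders"],
    "customer_analytics": ["customer_orders"],
}
for index, layer in enumerate(execution_layers(graph)):
    print(f"Layer {index}: {layer}")
```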

## Part 6: PipelineManager - Multi-Pipeline Orchestration

The `PipelineManager` manages multiple pipelines from a single YAML configuration.

In [None]:
# PipelineManager responsibilities
manager_features = {
    'Load from YAML': 'Parse entire project configuration',
    'Build Connections': 'Instantiate all connection objects',
    'Create Pipelines': 'Initialize all pipeline instances',
    'Run Selector': 'Run all, one, or multiple pipelines',
    'Story Config': 'Configure story generation globally',
    'Pipeline Access': 'Get specific pipeline instances'
}

print("PipelineManager Features:\n")
for feature, description in manager_features.items():
    print(f"{feature:20} -> {description}")

In [None]:
# PipelineManager usage patterns
'''
# Pattern 1: Run all pipelines
manager = PipelineManager.from_yaml("config.yaml")
results = manager.run()  # Dict[str, PipelineResults]

# Pattern 2: Run single pipeline
result = manager.run('bronze_to_silver')  # Returns single PipelineResults

# Pattern 3: Run multiple specific pipelines
results = manager.run(['bronze_to_silver', 'silver_to_gold'])

# Pattern 4: Get pipeline instance for inspection
pipeline = manager.get_pipeline('bronze_to_silver')
validation = pipeline.validate()
layers = pipeline.get_execution_layers()
'''
print("PipelineManager supports flexible execution patterns")
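
The three `run()` call shapes above (no argument, a single name, a list of names) suggest a small dispatch step. The sketch below shows one way that selection could work. `run_selected` is hypothetical, and zero-argument callables stand in for `Pipeline.run`.

```python
from typing import Callable, Dict, List, Optional, Union


def run_selected(
    pipelines: Dict[str, Callable[[], str]],
    selection: Optional[Union[str, List[str]]] = None,
):
    """Run all pipelines, one pipeline, or a named subset."""
    if selection is None:
        names = list(pipelines)
    elif isinstance(selection, str):
        names = [selection]
    else:
        names = list(selection)

    # Reject unknown names before running anything.
    unknown = [name for name in names if name not in pipelines]
    if unknown:
        raise ValueError(f"Unknown pipeline(s): {unknown}")

    results = {name: pipelines[name]() for name in names}
    # A single name returns the bare result; otherwise a dict keyed by name.
    if isinstance(selection, str):
        return results[selection]
    return results


pipelines = {
    "bronze_to_silver": lambda: "bronze_to_silver done",
    "silver_to_gold": lambda: "silver_to_gold done",
}
print(run_selected(pipelines))
print(run_selected(pipelines, "bronze_to_silver"))
print(run_selected(pipelines, ["silver_to_gold"]))
```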

## Part 7: Connection Management

PipelineManager builds all connections from the configuration.

In [None]:
# Connection building logic
'''
@staticmethod
def _build_connections(conn_configs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    connections = {}
    
    for conn_name, conn_config in conn_configs.items():
        conn_type = conn_config.get("type", "local")
        
        if conn_type == "local":
            connections[conn_name] = LocalConnection(
                base_path=conn_config.get("base_path", "./data")
            )
        elif conn_type == "azure_adls":
            connections[conn_name] = AzureADLS(
                account=conn_config["account"],
                container=conn_config["container"],
                ...
            )
        else:
            raise ValueError(f"Unsupported connection type: {conn_type}")
    
    return connections
'''

# Example connection config
example_connections = {
    'local_data': {'type': 'local', 'base_path': './data'},
    'azure_storage': {
        'type': 'azure_adls',
        'account': 'mystorageaccount',
        'container': 'data',
        'auth_mode': 'key_vault'
    }
}

print("Connection types supported:")
for name, config in example_connections.items():
    print(f"  {name}: {config['type']}")
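
In the builder above, `conn_config["account"]` and `conn_config["container"]` raise a bare `KeyError` when a key is missing. A pre-check along the following lines could report every problem at once with a readable message. `REQUIRED_KEYS` and `check_connection_configs` are hypothetical names, and the required keys are only those visible in the excerpt.

```python
from typing import Any, Dict, List

# Keys each connection type indexes directly in the excerpt above.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "local": [],
    "azure_adls": ["account", "container"],
}


def check_connection_configs(conn_configs: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems: List[str] = []
    for name, config in conn_configs.items():
        conn_type = config.get("type", "local")
        if conn_type not in REQUIRED_KEYS:
            problems.append(f"{name}: unsupported connection type '{conn_type}'")
            continue
        for key in REQUIRED_KEYS[conn_type]:
            if key not in config:
                problems.append(f"{name}: missing required key '{key}'")
    return problems


configs = {
    "local_data": {"type": "local", "base_path": "./data"},
    "azure_storage": {"type": "azure_adls", "account": "mystorageaccount"},
    "warehouse": {"type": "ftp"},
}
for problem in check_connection_configs(configs):
    print(problem)
```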

## Part 8: Story Generation Integration

After execution, the Pipeline generates a story document automatically unless `generate_story` is set to `False`.

In [None]:
# Story generation in Pipeline
'''
# After all nodes execute:
if self.generate_story:
    story_path = self.story_generator.generate(
        node_results=results.node_results,
        completed=results.completed,
        failed=results.failed,
        skipped=results.skipped,
        duration=results.duration,
        start_time=results.start_time,
        end_time=results.end_time,
        context=self.context,
    )
    results.story_path = story_path
'''

print("Story generation happens automatically after pipeline execution")
print("\nStory includes:")
story_contents = [
    "- Pipeline summary",
    "- Execution timeline",
    "- Node-by-node details",
    "- Data samples",
    "- Transformation SQL",
    "- Success/failure status"
]
for item in story_contents:
    print(item)
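
To make the idea concrete, here is a minimal sketch of rendering a Markdown summary from the same fields passed to `generate()`. It is not the real `StoryGenerator` output format; `render_story` and its layout are purely illustrative.

```python
from typing import List


def render_story(
    pipeline_name: str,
    completed: List[str],
    failed: List[str],
    skipped: List[str],
    duration: float,
) -> str:
    """Build a small Markdown report from the run outcome."""
    lines = [f"# Pipeline: {pipeline_name}", "", f"Duration: {duration:.2f}s", ""]
    sections = (("Completed", completed), ("Failed", failed), ("Skipped", skipped))
    for title, nodes in sections:
        lines.append(f"## {title} ({len(nodes)})")
        lines.extend(f"- {node}" for node in nodes)
        lines.append("")
    return "\n".join(lines)


story = render_story(
    pipeline_name="bronze_to_silver",
    completed=["raw_customers", "clean_customers"],
    failed=["raw_orders"],
    skipped=["clean_orders", "customer_orders"],
    duration=12.5,
)
print(story)
```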

## Part 9: Pipeline Validation

Validate a pipeline **before** execution.

In [None]:
# Validation logic
'''
def validate(self) -> Dict[str, Any]:
    validation = {
        "valid": True,
        "errors": [],
        "warnings": [],
        "node_count": len(self.graph.nodes),
        "execution_order": [],
    }
    
    try:
        # Test topological sort (checks for cycles)
        execution_order = self.graph.topological_sort()
        validation["execution_order"] = execution_order
    except DependencyError as e:
        validation["valid"] = False
        validation["errors"].append(str(e))
    
    # Check for missing connections
    for node in self.config.nodes:
        if node.read and node.read.connection not in self.connections:
            validation["warnings"].append(
                f"Node '{node.name}': connection '{node.read.connection}' not configured"
            )
    
    return validation
'''

# Example validation result
validation_result = {
    "valid": True,
    "errors": [],
    "warnings": ["Node 'raw_orders': connection 'azure_storage' not configured"],
    "node_count": 5,
    "execution_order": ["raw_customers", "raw_orders", "clean_customers", "clean_orders", "customer_orders"]
}

print("Validation Result:")
print(f"  Valid: {validation_result['valid']}")
print(f"  Nodes: {validation_result['node_count']}")
print(f"  Warnings: {len(validation_result['warnings'])}")
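
Note that in the logic above a missing connection is only a warning, so `valid` can be `True` for a pipeline that will still fail at read time. A caller that wants stricter behaviour could gate on warnings as well. The helper below is a hypothetical sketch built on the dictionary shape shown above.

```python
from typing import Any, Dict


def is_runnable(validation: Dict[str, Any], strict: bool = False) -> bool:
    """Decide whether to proceed; in strict mode warnings also block the run."""
    if not validation["valid"]:
        return False
    if strict and validation["warnings"]:
        return False
    return True


validation = {
    "valid": True,
    "errors": [],
    "warnings": ["Node 'raw_orders': connection 'azure_storage' not configured"],
    "node_count": 5,
    "execution_order": [],
}
print("default:", is_runnable(validation))
print("strict: ", is_runnable(validation, strict=True))
```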

## Part 10: Complete Execution Example

Putting it all together: from YAML to results.

In [None]:
# Complete workflow
'''
# Step 1: Load configuration
manager = PipelineManager.from_yaml("project_config.yaml")

# Step 2: List available pipelines
pipelines = manager.list_pipelines()
print(f"Available pipelines: {pipelines}")

# Step 3: Validate before running
pipeline = manager.get_pipeline('bronze_to_silver')
validation = pipeline.validate()
if not validation['valid']:
    print(f"Validation errors: {validation['errors']}")
    raise SystemExit(1)  # stop here; works in scripts without relying on the interactive exit() helper

# Step 4: Inspect execution plan
layers = pipeline.get_execution_layers()
print(f"Execution will happen in {len(layers)} layers")

# Step 5: Execute pipeline
results = manager.run('bronze_to_silver')

# Step 6: Check results
if results.failed:
    print(f"❌ Pipeline failed. Failed nodes: {results.failed}")
else:
    print(f"✅ Pipeline succeeded in {results.duration:.2f}s")
    print(f"📖 Story: {results.story_path}")

# Step 7: Inspect individual node results
for node_name in results.completed:
    node_result = results.get_node_result(node_name)
    print(f"  {node_name}: {node_result.rows_affected} rows")
'''
print("Complete pipeline execution workflow")

## Summary: The Complete Picture

### Pipeline Responsibilities
1. ✅ Initialize all components (engine, context, graph, story)
2. ✅ Execute nodes in dependency order
3. ✅ Handle failures gracefully
4. ✅ Track detailed results
5. ✅ Generate documentation

### PipelineManager Responsibilities
1. ✅ Load YAML configuration
2. ✅ Build connections
3. ✅ Create multiple pipeline instances
4. ✅ Orchestrate multi-pipeline execution
5. ✅ Provide pipeline access and inspection

### Key Patterns
- **Dependency Injection**: All components passed to constructors
- **Single Responsibility**: Each class has one clear purpose
- **Fail Fast**: Validation before execution
- **Graceful Degradation**: Partial completion on failure
- **Comprehensive Tracking**: Detailed results and timing

### Integration Flow
```
YAML Config
    ↓
PipelineManager.from_yaml()
    ↓
Parse ProjectConfig
    ↓
Build Connections
    ↓
Create Pipeline instances
    ↓
Pipeline.run()
    ↓
Graph.topological_sort() → execution order
    ↓
For each node:
  - Check dependencies
  - Create Node instance
  - Execute with Engine
  - Store in Context
  - Track in Results
    ↓
Generate Story
    ‚Üì
Return PipelineResults
```

**You now understand the complete Odibi execution lifecycle!**