# 📄 YAML Pipelines - Declarative Workflows

Welcome to YAML-first pipeline development! This notebook teaches:

- Why use a YAML-first approach for AI workflows
- YAML pipeline structure and syntax
- Building equivalent graphs in Python
- Executing declarative workflows
- Benefits of declarative over imperative pipelines

YAML pipelines enable version control, team collaboration, and Infrastructure-as-Code (IaC) patterns for AI workflows.

## Why YAML-First?

**Traditional Approach (Imperative Python):**
- Workflows defined in code
- Hard to review and collaborate on
- Requires Python knowledge
- Difficult to version and track changes

**YAML-First Approach (Declarative):**
- ✅ **Version Control** - Track workflow changes in Git like infrastructure code
- ✅ **Collaboration** - Non-developers can read and modify workflows
- ✅ **Infrastructure as Code** - Treat AI workflows as declarative infrastructure
- ✅ **Type Safety** - Schema validation ensures correctness
- ✅ **Portability** - Same YAML works across environments
- ✅ **Testing** - Workflows can be validated without execution

**Use Cases:**
- Enterprise AI pipelines requiring audit trails
- Team workflows with non-technical stakeholders
- Multi-environment deployments (dev, staging, prod)
- Workflow templates and reusable components

## Step 1: Imports

We'll use the same core components as imperative pipelines:

In [None]:
from hexdag.core.domain.dag import DirectedGraph, NodeSpec
from hexdag.core.orchestration.orchestrator import Orchestrator

## Step 2: Define Pipeline Functions

These are the building blocks that our YAML pipeline will reference. Each function represents a reusable processing step:

In [None]:
async def data_loader(input_data: str) -> dict:
    """Load and parse input data."""
    return {"raw_input": input_data, "processed": True, "timestamp": "2024-01-01T10:00:00Z"}


async def text_processor(input_data: dict) -> dict:
    """Process text data."""
    text = input_data.get("raw_input", "")
    words = text.split()

    return {
        "word_count": len(words),
        "char_count": len(text),
        "processed_text": text.upper(),
        "original": input_data,
    }


async def sentiment_analyzer(input_data: dict) -> dict:
    """Analyze sentiment of text."""
    text = input_data.get("processed_text", "")

    # Simple keyword-based sentiment analysis (substring match, not whole-word)
    positive_words = ["good", "great", "excellent", "happy", "love"]
    negative_words = ["bad", "terrible", "awful", "hate", "sad"]

    text_lower = text.lower()
    positive_score = sum(1 for word in positive_words if word in text_lower)
    negative_score = sum(1 for word in negative_words if word in text_lower)

    if positive_score > negative_score:
        sentiment = "positive"
        confidence = min(0.9, (positive_score - negative_score) / 5)
    elif negative_score > positive_score:
        sentiment = "negative"
        confidence = min(0.9, (negative_score - positive_score) / 5)
    else:
        sentiment = "neutral"
        confidence = 0.5

    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "positive_score": positive_score,
        "negative_score": negative_score,
        "analysis_data": input_data,
    }


async def report_generator(input_data: dict) -> dict:
    """Generate comprehensive report from text and sentiment analysis."""
    # Extract data from previous nodes
    text_data = input_data.get("text_processor", {})
    sentiment_data = input_data.get("sentiment_analyzer", {})

    return {
        "report": {
            "text_summary": {
                "word_count": text_data.get("word_count", 0),
                "char_count": text_data.get("char_count", 0),
                "processed_text": text_data.get("processed_text", ""),
            },
            "sentiment_analysis": {
                "sentiment": sentiment_data.get("sentiment"),
                "confidence": sentiment_data.get("confidence"),
                "positive_score": sentiment_data.get("positive_score"),
                "negative_score": sentiment_data.get("negative_score"),
            },
            "timestamp": text_data.get("original", {}).get("timestamp"),
        },
        "analysis_complete": True,
    }


print("✅ Pipeline functions defined!")
print("   - data_loader: Loads and parses input")
print("   - text_processor: Processes text content")
print("   - sentiment_analyzer: Analyzes sentiment")
print("   - report_generator: Generates comprehensive report")
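
One detail of `sentiment_analyzer` is worth keeping in mind: `word in text_lower` is a substring test, so `"sad"` also matches inside `"saddle"`. The sketch below contrasts that behaviour with a whole-word variant. `count_substring_hits` and `count_token_hits` are hypothetical helper names used only for illustration; they are not part of this notebook's pipeline or of hexdag.

```python
import re

NEGATIVE_WORDS = ["bad", "terrible", "awful", "hate", "sad"]


def count_substring_hits(text: str, keywords: list[str]) -> int:
    """Count keywords that appear anywhere in the text (the notebook's approach)."""
    text_lower = text.lower()
    return sum(1 for word in keywords if word in text_lower)


def count_token_hits(text: str, keywords: list[str]) -> int:
    """Count keywords that appear as whole words only."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sum(1 for word in keywords if word in tokens)


sample = "The saddle arrived on time."
print(count_substring_hits(sample, NEGATIVE_WORDS))  # 1 ("sad" inside "saddle")
print(count_token_hits(sample, NEGATIVE_WORDS))  # 0
```

For a tutorial the substring version is fine; the token version is one possible refinement if false positives matter.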

## Step 3: YAML Pipeline Structure

Here's what a YAML pipeline definition looks like. This is the **declarative** way to wire the functions above into a workflow:

```yaml
name: text_analysis_pipeline
version: "1.0.0"
description: "Analyze text sentiment and generate reports"

input_schema:
  type: string
  description: "Text to analyze"

output_schema:
  type: object
  properties:
    report:
      type: object
      properties:
        text_summary:
          type: object
        sentiment_analysis:
          type: object
    analysis_complete:
      type: boolean

nodes:
  data_loader:
    type: function
    function: data_loader
    description: "Load and parse input data"

  text_processor:
    type: function
    function: text_processor
    depends_on: ["data_loader"]
    description: "Process and analyze text content"

  sentiment_analyzer:
    type: function
    function: sentiment_analyzer
    depends_on: ["text_processor"]
    description: "Analyze sentiment of processed text"

  report_generator:
    type: function
    function: report_generator
    depends_on: ["text_processor", "sentiment_analyzer"]
    description: "Generate comprehensive analysis report"

config:
  validation_strategy: "coerce"
  max_concurrent_nodes: 4
  timeout_seconds: 300
```

**Key Elements:**
- `name` and `version` - Pipeline metadata for tracking
- `input_schema` / `output_schema` - Type validation
- `nodes` - Processing steps with explicit dependencies
- `depends_on` - Declares which nodes must complete first
- `config` - Execution configuration (validation, concurrency, timeouts)
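
To make the `nodes` / `depends_on` relationship concrete, here is a minimal sketch of that section after parsing, written as a plain Python dict so it runs without a YAML library. The helper `find_unknown_dependencies` is hypothetical and is not hexdag's loader; it only illustrates the kind of check a declarative definition makes possible before anything executes.

```python
# The `nodes` section above, as it might look once parsed into Python.
nodes = {
    "data_loader": {"type": "function", "function": "data_loader"},
    "text_processor": {
        "type": "function",
        "function": "text_processor",
        "depends_on": ["data_loader"],
    },
    "sentiment_analyzer": {
        "type": "function",
        "function": "sentiment_analyzer",
        "depends_on": ["text_processor"],
    },
    "report_generator": {
        "type": "function",
        "function": "report_generator",
        "depends_on": ["text_processor", "sentiment_analyzer"],
    },
}


def find_unknown_dependencies(nodes: dict) -> dict:
    """Map each node to the dependencies it names that are not defined."""
    problems = {}
    for name, spec in nodes.items():
        missing = [dep for dep in spec.get("depends_on", []) if dep not in nodes]
        if missing:
            problems[name] = missing
    return problems


print(find_unknown_dependencies(nodes))  # {}
```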

## Step 4: Visualize Pipeline Structure

The YAML above creates this execution flow:

In [None]:
print("🏗️  Pipeline Execution Structure:")
print("")
print("   data_loader")
print("   └── text_processor")
print("       ├── sentiment_analyzer")
print("       └── report_generator")
print("           └── (depends on both text_processor and sentiment_analyzer)")
print("")
print("📊 Execution Waves:")
print("   Wave 1: [data_loader]")
print("   Wave 2: [text_processor]")
print("   Wave 3: [sentiment_analyzer]")
print("   Wave 4: [report_generator]")

## Step 5: Build Equivalent Graph in Python

Let's build the same pipeline using Python (this is what the YAML would compile to):

In [None]:
print("📊 Creating DirectedGraph from YAML definition...")

# Create the graph
graph = DirectedGraph()

# Add nodes with dependencies (mimicking YAML structure)
graph.add(NodeSpec("data_loader", data_loader))
graph.add(NodeSpec("text_processor", text_processor).after("data_loader"))
graph.add(NodeSpec("sentiment_analyzer", sentiment_analyzer).after("text_processor"))
graph.add(
    NodeSpec("report_generator", report_generator).after("text_processor", "sentiment_analyzer")
)

print("   ✅ Graph created with 4 nodes")
print("   ✅ Dependencies configured")
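
How a `function: text_processor` entry in YAML becomes a callable is not shown in this notebook. One common pattern is a name-to-callable registry, and the sketch below assumes that pattern. `REGISTRY`, `resolve_nodes`, and the stub functions are hypothetical names for illustration, not hexdag APIs.

```python
from typing import Callable


async def data_loader(input_data):  # stub standing in for the real function
    return {"raw_input": input_data}


async def text_processor(input_data):  # stub standing in for the real function
    return {"processed_text": input_data["raw_input"].upper()}


# Hypothetical registry: YAML `function` names mapped to Python callables.
REGISTRY: dict[str, Callable] = {
    "data_loader": data_loader,
    "text_processor": text_processor,
}

node_defs = {
    "data_loader": {"function": "data_loader"},
    "text_processor": {"function": "text_processor", "depends_on": ["data_loader"]},
}


def resolve_nodes(node_defs: dict, registry: dict) -> list[tuple]:
    """Return (name, callable, dependencies) triples, failing on unknown names."""
    resolved = []
    for name, spec in node_defs.items():
        func_name = spec["function"]
        if func_name not in registry:
            raise KeyError(f"Node '{name}' references unknown function '{func_name}'")
        resolved.append((name, registry[func_name], tuple(spec.get("depends_on", []))))
    return resolved


for name, func, deps in resolve_nodes(node_defs, REGISTRY):
    print(name, func.__name__, deps)
```

Each triple carries the same three pieces of information that the cell above passes to `NodeSpec(...)` and `.after(...)`.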

## Step 6: Validate Pipeline

Validation ensures the graph is well-formed before execution:

In [None]:
print("🔍 Validating pipeline structure...")
try:
    graph.validate()
    print("   ✅ Pipeline validation passed!")
    print("   ✅ No cycles detected")
    print("   ✅ All dependencies satisfied")
except Exception as e:
    print(f"   ❌ Validation failed: {e}")
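
What `graph.validate()` checks internally is not shown here. As an illustration of the "no cycles" condition, the following sketch runs a depth-first search over a plain dependency map. `has_cycle` is a hypothetical helper, not the hexdag implementation.

```python
def has_cycle(dependencies: dict[str, list[str]]) -> bool:
    """Return True if following depends_on edges leads back to a node in progress."""
    VISITING, DONE = 1, 2
    state: dict[str, int] = {}

    def visit(node: str) -> bool:
        if state.get(node) == DONE:
            return False
        if state.get(node) == VISITING:
            return True  # reached a node that is still on the current path
        state[node] = VISITING
        for dep in dependencies.get(node, []):
            if visit(dep):
                return True
        state[node] = DONE
        return False

    return any(visit(node) for node in dependencies)


pipeline = {
    "data_loader": [],
    "text_processor": ["data_loader"],
    "sentiment_analyzer": ["text_processor"],
    "report_generator": ["text_processor", "sentiment_analyzer"],
}
print(has_cycle(pipeline))  # False

broken = {"a": ["b"], "b": ["a"]}
print(has_cycle(broken))  # True
```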

## Step 7: Analyze Execution Plan

View how the pipeline will execute in waves:

In [None]:
print("📊 Pipeline Analysis:")
waves = graph.waves()
print(f"   Total execution waves: {len(waves)}")
for i, wave in enumerate(waves, 1):
    print(f"   Wave {i}: {wave}")

print("")
print("💡 Nodes within the same wave can execute in parallel.")
print("   Waves themselves run one after another.")
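
The grouping into waves can be sketched in a few lines of plain Python: a node becomes ready once all of its dependencies have run, and every node that is ready at the same time forms one wave. `compute_waves` is a hypothetical helper written for this explanation and is not how `graph.waves()` is necessarily implemented.

```python
def compute_waves(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group nodes into waves; a node joins the first wave after all its dependencies."""
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(node for node, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("Cycle or unknown dependency detected")
        waves.append(ready)
        for node in ready:
            del remaining[node]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


pipeline = {
    "data_loader": [],
    "text_processor": ["data_loader"],
    "sentiment_analyzer": ["text_processor"],
    "report_generator": ["text_processor", "sentiment_analyzer"],
}
print(compute_waves(pipeline))
# [['data_loader'], ['text_processor'], ['sentiment_analyzer'], ['report_generator']]

# A fan-out shape, for contrast: two nodes share a wave.
fan_out = {"load": [], "a": ["load"], "b": ["load"], "merge": ["a", "b"]}
print(compute_waves(fan_out))
# [['load'], ['a', 'b'], ['merge']]
```

Because `sentiment_analyzer` depends on `text_processor`, every wave in this notebook's pipeline holds exactly one node, so nothing actually runs in parallel here. The second example shows the shape that does.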

## Step 8: Execute Pipeline - Test Case 1

Let's run our pipeline with a positive text sample:

In [None]:
orchestrator = Orchestrator()

test_input_1 = "I love this product! It's amazing and wonderful."
print(f"🧪 Test 1: '{test_input_1}'")
print("")

results_1 = await orchestrator.run(graph, test_input_1)

report = results_1.get("report_generator", {}).get("report", {})
sentiment = report.get("sentiment_analysis", {})

print("📈 Results:")
print(f"   Sentiment: {sentiment.get('sentiment')}")
print(f"   Confidence: {sentiment.get('confidence', 0):.2f}")
print(f"   Positive score: {sentiment.get('positive_score')}")
print(f"   Negative score: {sentiment.get('negative_score')}")
print(f"   Word count: {report.get('text_summary', {}).get('word_count', 0)}")
print(
    f"   Analysis complete: {results_1.get('report_generator', {}).get('analysis_complete', False)}"
)

## Step 9: Execute Pipeline - Test Case 2

Testing with a negative text sample:

In [None]:
test_input_2 = "This is terrible. I hate it so much."
print(f"🧪 Test 2: '{test_input_2}'")
print("")

results_2 = await orchestrator.run(graph, test_input_2)

report = results_2.get("report_generator", {}).get("report", {})
sentiment = report.get("sentiment_analysis", {})

print("📈 Results:")
print(f"   Sentiment: {sentiment.get('sentiment')}")
print(f"   Confidence: {sentiment.get('confidence', 0):.2f}")
print(f"   Positive score: {sentiment.get('positive_score')}")
print(f"   Negative score: {sentiment.get('negative_score')}")
print(f"   Word count: {report.get('text_summary', {}).get('word_count', 0)}")
print(
    f"   Analysis complete: {results_2.get('report_generator', {}).get('analysis_complete', False)}"
)

## Step 10: Execute Pipeline - Test Case 3

Testing with a neutral text sample:

In [None]:
test_input_3 = "The product is okay. Not great, not bad."
print(f"🧪 Test 3: '{test_input_3}'")
print("")

results_3 = await orchestrator.run(graph, test_input_3)

report = results_3.get("report_generator", {}).get("report", {})
sentiment = report.get("sentiment_analysis", {})

print("📈 Results:")
print(f"   Sentiment: {sentiment.get('sentiment')}")
print(f"   Confidence: {sentiment.get('confidence', 0):.2f}")
print(f"   Positive score: {sentiment.get('positive_score')}")
print(f"   Negative score: {sentiment.get('negative_score')}")
print(f"   Word count: {report.get('text_summary', {}).get('word_count', 0)}")
print(
    f"   Analysis complete: {results_3.get('report_generator', {}).get('analysis_complete', False)}"
)
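
The sentiment and confidence values for the three test inputs follow directly from the keyword counts. The standalone sketch below repeats the scoring arithmetic from `sentiment_analyzer` outside the pipeline so the numbers can be checked by hand; `score` is a hypothetical helper name.

```python
POSITIVE_WORDS = ["good", "great", "excellent", "happy", "love"]
NEGATIVE_WORDS = ["bad", "terrible", "awful", "hate", "sad"]


def score(text: str) -> tuple[str, float]:
    """Apply the same keyword counts and confidence formula as sentiment_analyzer."""
    text_lower = text.lower()
    positive = sum(1 for word in POSITIVE_WORDS if word in text_lower)
    negative = sum(1 for word in NEGATIVE_WORDS if word in text_lower)
    if positive > negative:
        return "positive", min(0.9, (positive - negative) / 5)
    if negative > positive:
        return "negative", min(0.9, (negative - positive) / 5)
    return "neutral", 0.5


print(score("I love this product! It's amazing and wonderful."))  # ('positive', 0.2)
print(score("This is terrible. I hate it so much."))  # ('negative', 0.4)
print(score("The product is okay. Not great, not bad."))  # ('neutral', 0.5)
```

In the first input only "love" is in the keyword list ("amazing" and "wonderful" are not), which is why the confidence is a modest 0.2. In the third, "great" and "bad" cancel out.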

## Step 11: Inspect Full Report

Let's examine the complete report structure from the final test:

In [None]:
print("📋 Complete Report Structure:")
print("=" * 60)

full_report = results_3.get("report_generator", {})
print("\nReport Generator Output:")
print(full_report)

print("\n" + "=" * 60)
print("\nAll Node Results:")
for node_name, result in results_3.items():
    print(f"\n{node_name}:")
    print(f"  {result}")
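
Nested dictionaries are hard to read when printed on a single line. Assuming the node results are JSON-serializable (true for the functions in this notebook, which return only dicts, strings, numbers, and booleans), `json.dumps` with `indent` gives a more readable view. The `sample_report` below is placeholder data shaped like the `report_generator` output, not a captured result.

```python
import json

# Placeholder data shaped like the report_generator output (illustrative values).
sample_report = {
    "report": {
        "text_summary": {"word_count": 8, "processed_text": "..."},
        "sentiment_analysis": {"sentiment": "neutral", "confidence": 0.5},
    },
    "analysis_complete": True,
}

formatted = json.dumps(sample_report, indent=2, sort_keys=True)
print(formatted)
```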

## Step 12: YAML Benefits Summary

Let's compare the YAML approach with traditional imperative code:

In [None]:
print("💡 YAML Pipeline Benefits:")
print("=" * 60)

print("\n1. VERSION CONTROL:")
print("   ✅ Track workflow changes in Git")
print("   ✅ Review changes via pull requests")
print("   ✅ Rollback to previous versions")
print("   ✅ Audit trail of who changed what")

print("\n2. COLLABORATION:")
print("   ✅ Non-developers can read YAML")
print("   ✅ Product managers can modify workflows")
print("   ✅ Clear documentation of dependencies")
print("   ✅ Easier code reviews")

print("\n3. INFRASTRUCTURE AS CODE:")
print("   ✅ Treat workflows as infrastructure")
print("   ✅ Deploy same YAML across environments")
print("   ✅ Environment-specific configurations")
print("   ✅ Automated testing of workflows")

print("\n4. TYPE SAFETY:")
print("   ✅ Input/output schema validation")
print("   ✅ Catch errors before execution")
print("   ✅ Type checking at compile time")
print("   ✅ Better IDE support")

print("\n5. PORTABILITY:")
print("   ✅ Language-agnostic definitions")
print("   ✅ Easy integration with CI/CD")
print("   ✅ Reusable across projects")
print("   ✅ Standard format for workflows")

print("\n6. TESTING:")
print("   ✅ Validate without execution")
print("   ✅ Dry-run capabilities")
print("   ✅ Schema validation")
print("   ✅ Dependency verification")
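
To show what "input/output schema validation" can mean in practice, here is a minimal sketch that checks a result against the `output_schema` from Step 3. It assumes the schema follows JSON Schema conventions and handles only the keywords that appear in that example (`type` and `properties`). `check` and `TYPE_MAP` are hypothetical names; a real project would more likely rely on a full schema validation library.

```python
TYPE_MAP = {"object": dict, "string": str, "boolean": bool}

output_schema = {
    "type": "object",
    "properties": {
        "report": {
            "type": "object",
            "properties": {
                "text_summary": {"type": "object"},
                "sentiment_analysis": {"type": "object"},
            },
        },
        "analysis_complete": {"type": "boolean"},
    },
}


def check(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of type mismatches; an empty list means the value conforms."""
    expected = TYPE_MAP[schema["type"]]
    if not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}, got {type(value).__name__}"]
    errors = []
    for key, sub_schema in schema.get("properties", {}).items():
        if key in value:  # missing keys are allowed in this simplified check
            errors.extend(check(value[key], sub_schema, f"{path}.{key}"))
    return errors


good = {"report": {"text_summary": {}, "sentiment_analysis": {}}, "analysis_complete": True}
bad = {"report": {"text_summary": {}, "sentiment_analysis": {}}, "analysis_complete": "yes"}

print(check(good, output_schema))  # []
print(check(bad, output_schema))  # ['$.analysis_complete: expected boolean, got str']
```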

## Step 13: Comparison - YAML vs Python

Let's see the difference in defining the same pipeline:

In [None]:
print("🔄 YAML vs Python Comparison:")
print("=" * 60)

print("\nIMPERATIVE PYTHON:")
print("```python")
print("graph = DirectedGraph()")
print("graph.add(NodeSpec('data_loader', data_loader))")
print("graph.add(NodeSpec('text_processor', text_processor).after('data_loader'))")
print("graph.add(NodeSpec('sentiment_analyzer', sentiment_analyzer).after('text_processor'))")
print(
    "graph.add(NodeSpec('report_generator', report_generator)"
    ".after('text_processor', 'sentiment_analyzer'))"
)
print("```")

print("\nDECLARATIVE YAML:")
print("```yaml")
print("nodes:")
print("  data_loader:")
print("    type: function")
print("  text_processor:")
print("    type: function")
print("    depends_on: [data_loader]")
print("  sentiment_analyzer:")
print("    type: function")
print("    depends_on: [text_processor]")
print("  report_generator:")
print("    type: function")
print("    depends_on: [text_processor, sentiment_analyzer]")
print("```")

print("\n✅ YAML Advantages:")
print("   • More readable")
print("   • Less verbose")
print("   • Self-documenting")
print("   • Version control friendly")

## Step 14: Enterprise Use Cases

Real-world scenarios where YAML pipelines excel:

In [None]:
print("🏢 Enterprise Use Cases:")
print("=" * 60)

print("\n1. MULTI-ENVIRONMENT DEPLOYMENTS:")
print("   • dev.yaml - Development configuration")
print("   • staging.yaml - Staging with test data")
print("   • prod.yaml - Production with monitoring")
print("   • Same pipeline, different configs")

print("\n2. TEAM COLLABORATION:")
print("   • Data scientists define workflows")
print("   • ML engineers implement functions")
print("   • Product managers review in YAML")
print("   • DevOps deploys via CI/CD")

print("\n3. COMPLIANCE & AUDITING:")
print("   • Track all workflow changes")
print("   • Review approval workflows")
print("   • Compliance documentation")
print("   • Change management process")

print("\n4. WORKFLOW TEMPLATES:")
print("   • Reusable pipeline patterns")
print("   • Standardized workflows")
print("   • Best practices enforcement")
print("   • Organizational standards")

print("\n5. A/B TESTING WORKFLOWS:")
print("   • Version A pipeline definition")
print("   • Version B pipeline definition")
print("   • Easy comparison and rollback")
print("   • Performance measurement")
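
The "same pipeline, different configs" idea can be sketched as a base `config` block plus small per-environment overrides. The base values below come from the `config` section in Step 3. The override values and the `config_for` helper are hypothetical and exist only to illustrate the merge.

```python
# Base values taken from the `config` section of the YAML in Step 3.
base_config = {
    "validation_strategy": "coerce",
    "max_concurrent_nodes": 4,
    "timeout_seconds": 300,
}

# Hypothetical per-environment overrides; the values are illustrative only.
overrides = {
    "dev": {"timeout_seconds": 60},
    "staging": {},
    "prod": {"max_concurrent_nodes": 8},
}


def config_for(environment: str) -> dict:
    """Return the base config with the environment's overrides applied on top."""
    if environment not in overrides:
        raise ValueError(f"Unknown environment: {environment}")
    return {**base_config, **overrides[environment]}


for env in ("dev", "staging", "prod"):
    print(env, config_for(env))
```

Keeping overrides small makes a pull request that changes one environment easy to review, since only the differing keys appear in the diff.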

## Summary: Key Concepts Learned

🎯 **Core Concepts:**

- **YAML-First Approach** - Declarative workflow definitions for better collaboration
- **Pipeline Structure** - Nodes, dependencies, schemas, and configuration
- **Version Control** - Track workflow changes like infrastructure code
- **Type Safety** - Input/output schemas ensure correctness
- **Portability** - Same YAML works across environments
- **Collaboration** - Non-developers can read and modify workflows

✅ **What We Built:**

A text analysis pipeline with 4 nodes:
1. `data_loader` - Loads and parses input text
2. `text_processor` - Processes text (word count, formatting)
3. `sentiment_analyzer` - Analyzes sentiment (positive/negative/neutral)
4. `report_generator` - Generates comprehensive analysis report

The pipeline demonstrates a fan-in dependency: `report_generator` depends on both `text_processor` and `sentiment_analyzer`, so the orchestrator has to resolve a DAG rather than a simple chain.

📊 **YAML Benefits:**

- **Readability** for reviewers who do not want to trace Python code
- **Version control** via Git for audit trails
- **Team collaboration** with non-technical stakeholders
- **Infrastructure as Code** patterns for AI workflows
- **Type safety** through schema validation
- **Portability** across environments (dev/staging/prod)

🔗 **Next Steps:**

- Explore LLM nodes for AI-powered processing
- Learn about conditional and loop nodes
- Build agent-based workflows with tools
- Create reusable workflow templates
- Implement multi-environment deployments