# Story Generation Deep Dive

## Introduction
This lesson explores ODIBI's Story Generation system, the automatic documentation engine that turns a pipeline run into rich, shareable documentation.

### What You'll Learn
1. StoryGenerator architecture and markdown generation
2. NodeExecutionMetadata - tracking detailed node metrics
3. PipelineStoryMetadata - aggregating pipeline-level data
4. Multi-format renderers (Markdown, HTML, JSON)
5. Theme system with customization and branding
6. Story content patterns (schemas, row counts, durations)
7. Testing and validating stories

## Part 1: StoryGenerator Architecture

### Basic Story Generation

In [None]:
from odibi.story.generator import StoryGenerator
from odibi.node import NodeResult
import pandas as pd
from datetime import datetime

# StoryGenerator creates markdown documentation from pipeline runs
class StoryGeneratorBasics:
    """Understanding StoryGenerator initialization."""
    
    @staticmethod
    def show_structure():
        print("""
StoryGenerator Configuration:

1. pipeline_name: str
   - Name of the pipeline being documented
   - Used in headers and filenames
   
2. max_sample_rows: int (default: 10)
   - Maximum rows to include in data samples
   - Balances detail vs. readability
   
3. output_path: str (default: "stories/")
   - Directory for generated story files
   - Auto-created if it doesn't exist
        """)

StoryGeneratorBasics.show_structure()

In [None]:
# Example: Creating a story from pipeline execution
import tempfile
import os

# Create temp directory for stories
temp_dir = tempfile.mkdtemp()

# Initialize generator
generator = StoryGenerator(
    pipeline_name="data_pipeline",
    max_sample_rows=5,
    output_path=temp_dir
)

# Mock node results
node_results = {
    "extract": NodeResult(
        success=True,
        duration=1.23,
        result_schema=["id", "name", "value"],
        rows_processed=100,
        metadata={"steps": ["Connected to source", "Loaded data"]}
    ),
    "transform": NodeResult(
        success=True,
        duration=0.45,
        result_schema=["id", "name", "value", "category"],
        rows_processed=98,
        metadata={"steps": ["Filtered nulls", "Added category"]}
    )
}

# Generate story
story_path = generator.generate(
    node_results=node_results,
    completed=["extract", "transform"],
    failed=[],
    skipped=[],
    duration=1.68,
    start_time="2024-01-15T10:30:00",
    end_time="2024-01-15T10:30:01"
)

print(f"Story generated: {story_path}")
print("\nFirst 500 chars:")
with open(story_path, 'r') as f:
    print(f.read()[:500])
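
The `max_sample_rows` setting caps how many rows of a data sample end up in the story. The sketch below shows one way such a cap could be applied when turning rows into a markdown table. `rows_to_markdown` is a hypothetical helper written for this lesson, not part of ODIBI, and it works on plain dictionaries rather than DataFrames so that it runs without extra dependencies.

```python
from typing import Any, Dict, List


def rows_to_markdown(rows: List[Dict[str, Any]], max_sample_rows: int = 10) -> str:
    """Render at most max_sample_rows rows as a GitHub-flavored markdown table.

    Hypothetical helper for illustration; not ODIBI's implementation.
    """
    if not rows:
        return "_No rows to display._"

    # Column order follows the keys of the first row
    columns = list(rows[0].keys())
    sample = rows[:max_sample_rows]

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in sample
    ]

    lines = [header, divider, *body]
    # Tell the reader when the sample was truncated
    if len(rows) > len(sample):
        lines.append(f"\n_Showing {len(sample)} of {len(rows)} rows._")
    return "\n".join(lines)


rows = [{"id": i, "name": f"item_{i}", "value": i * 10} for i in range(8)]
table = rows_to_markdown(rows, max_sample_rows=5)
print(table)
```

A small cap keeps the story readable while still showing the shape of the data; the truncation note makes it clear that the table is a sample.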

## Part 2: NodeExecutionMetadata - Tracking Metrics

### Node-Level Metadata Collection

In [None]:
from odibi.story.metadata import NodeExecutionMetadata

# NodeExecutionMetadata captures detailed execution information
node_meta = NodeExecutionMetadata(
    node_name="transform_sales",
    operation="filter_and_aggregate",
    status="success",
    duration=2.45,
    rows_in=1000,
    rows_out=850,
    schema_in=["date", "product", "amount", "region"],
    schema_out=["date", "product", "total_amount", "region", "category"],
    started_at="2024-01-15T10:30:00",
    completed_at="2024-01-15T10:30:02"
)

# Calculate metrics automatically
node_meta.calculate_row_change()
node_meta.calculate_schema_changes()

print("Node Execution Metadata:")
print(f"  Row Change: {node_meta.rows_change} ({node_meta.rows_change_pct:.1f}%)")
print(f"  Columns Added: {node_meta.columns_added}")
print(f"  Columns Removed: {node_meta.columns_removed}")
print(f"\nAs Dictionary:")
import json
print(json.dumps(node_meta.to_dict(), indent=2))
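
To make the two `calculate_*` calls less of a black box, here is a stand-alone sketch of the arithmetic they plausibly perform. The function names, the order-preserving column diff, and the handling of `rows_in == 0` are assumptions made for illustration; ODIBI's actual behavior may differ.

```python
from typing import List, Optional, Tuple


def row_change(
    rows_in: Optional[int], rows_out: Optional[int]
) -> Tuple[Optional[int], Optional[float]]:
    """Return (absolute change, percentage change) between input and output rows."""
    if rows_in is None or rows_out is None:
        return None, None
    change = rows_out - rows_in
    # Assumption: the percentage is undefined when there were no input rows
    pct = change * 100 / rows_in if rows_in > 0 else None
    return change, pct


def schema_changes(
    schema_in: List[str], schema_out: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (columns added, columns removed), keeping the original column order."""
    in_set, out_set = set(schema_in), set(schema_out)
    added = [col for col in schema_out if col not in in_set]
    removed = [col for col in schema_in if col not in out_set]
    return added, removed


change, pct = row_change(1000, 850)
added, removed = schema_changes(
    ["date", "product", "amount", "region"],
    ["date", "product", "total_amount", "region", "category"],
)
print(change, pct)     # -150 -15.0
print(added, removed)  # ['total_amount', 'category'] ['amount']
```

Note that a rename (`amount` to `total_amount`) shows up as one removal plus one addition, because a name-based diff cannot tell a rename from an unrelated drop and add.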

In [None]:
# Example: Tracking schema transformations
def demonstrate_schema_tracking():
    """Show how schema changes are automatically detected."""
    
    # Example 1: Column addition
    meta1 = NodeExecutionMetadata(
        node_name="add_features",
        operation="feature_engineering",
        status="success",
        duration=1.0,
        schema_in=["id", "value"],
        schema_out=["id", "value", "value_squared", "value_log"]
    )
    meta1.calculate_schema_changes()
    
    print("Example 1: Adding Columns")
    print(f"  Added: {meta1.columns_added}")
    print(f"  Removed: {meta1.columns_removed}")
    
    # Example 2: Column removal (column selection)
    meta2 = NodeExecutionMetadata(
        node_name="select_columns",
        operation="column_selection",
        status="success",
        duration=0.5,
        schema_in=["id", "name", "temp_col", "debug_flag", "value"],
        schema_out=["id", "name", "value"]
    )
    meta2.calculate_schema_changes()
    
    print("\nExample 2: Removing Columns")
    print(f"  Added: {meta2.columns_added}")
    print(f"  Removed: {meta2.columns_removed}")
    
    # Example 3: Both add and remove
    meta3 = NodeExecutionMetadata(
        node_name="reshape_data",
        operation="transformation",
        status="success",
        duration=1.5,
        schema_in=["date", "product_id", "qty", "price"],
        schema_out=["timestamp", "product_id", "revenue", "category"]
    )
    meta3.calculate_schema_changes()
    
    print("\nExample 3: Complex Schema Change")
    print(f"  Added: {meta3.columns_added}")
    print(f"  Removed: {meta3.columns_removed}")

demonstrate_schema_tracking()

## Part 3: PipelineStoryMetadata - Aggregation

### Pipeline-Level Metadata

In [None]:
from odibi.story.metadata import PipelineStoryMetadata, NodeExecutionMetadata

# PipelineStoryMetadata aggregates all node executions
pipeline_meta = PipelineStoryMetadata(
    pipeline_name="sales_etl",
    pipeline_layer="bronze_to_silver",
    project="analytics",
    plant="chicago",
    business_unit="operations",
    theme="corporate"
)

# Add successful node
pipeline_meta.add_node(NodeExecutionMetadata(
    node_name="extract",
    operation="read_csv",
    status="success",
    duration=1.2,
    rows_out=1000
))

# Add failed node
pipeline_meta.add_node(NodeExecutionMetadata(
    node_name="transform",
    operation="complex_join",
    status="failed",
    duration=0.8,
    error_message="Key column 'id' not found",
    error_type="KeyError"
))

# Add skipped node
pipeline_meta.add_node(NodeExecutionMetadata(
    node_name="load",
    operation="write_parquet",
    status="skipped",
    duration=0.0
))

# Complete the pipeline
pipeline_meta.completed_at = "2024-01-15T10:32:00"
pipeline_meta.duration = 2.0

# Calculate metrics
print(f"Pipeline: {pipeline_meta.pipeline_name}")
print(f"Success Rate: {pipeline_meta.get_success_rate():.1f}%")
print(f"Total Rows: {pipeline_meta.get_total_rows_processed():,}")
print(f"Status: {pipeline_meta.completed_nodes}C / {pipeline_meta.failed_nodes}F / {pipeline_meta.skipped_nodes}S")
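
The aggregate numbers printed above can be reasoned about with a small stand-alone model. The sketch below assumes that the success rate is the share of nodes with status `"success"` out of all nodes (failed and skipped included) and that total rows is the sum of each node's `rows_out`. Both definitions, and the class names `NodeSummary` and `PipelineSummary`, are assumptions for illustration rather than ODIBI's documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeSummary:
    name: str
    status: str  # "success", "failed", or "skipped"
    rows_out: Optional[int] = None


@dataclass
class PipelineSummary:
    nodes: List[NodeSummary] = field(default_factory=list)

    def count(self, status: str) -> int:
        """Number of nodes that ended with the given status."""
        return sum(1 for node in self.nodes if node.status == status)

    def success_rate(self) -> float:
        """Percentage of nodes that succeeded (assumed definition)."""
        if not self.nodes:
            return 0.0
        return self.count("success") * 100 / len(self.nodes)

    def total_rows(self) -> int:
        """Sum of rows_out over all nodes, treating missing counts as zero."""
        return sum(node.rows_out or 0 for node in self.nodes)


summary = PipelineSummary(
    nodes=[
        NodeSummary("extract", "success", rows_out=1000),
        NodeSummary("transform", "failed"),
        NodeSummary("load", "skipped"),
    ]
)
print(f"Success Rate: {summary.success_rate():.1f}%")  # Success Rate: 33.3%
print(f"Total Rows: {summary.total_rows():,}")         # Total Rows: 1,000
```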

## Part 4: Renderers - Multi-Format Output

### Markdown Renderer

In [None]:
from odibi.story.renderers import MarkdownStoryRenderer
from odibi.story.metadata import PipelineStoryMetadata, NodeExecutionMetadata

# Create sample pipeline metadata
metadata = PipelineStoryMetadata(
    pipeline_name="customer_analytics",
    pipeline_layer="silver",
    started_at="2024-01-15T10:00:00",
    completed_at="2024-01-15T10:05:00",
    duration=300.5,
    project="customer_360",
    plant="headquarters"
)

# Add nodes with calculated metrics
node1 = NodeExecutionMetadata(
    node_name="load_customers",
    operation="read_database",
    status="success",
    duration=120.5,
    rows_in=0,
    rows_out=50000,
    schema_out=["customer_id", "name", "email", "signup_date"]
)
node1.calculate_row_change()
metadata.add_node(node1)

node2 = NodeExecutionMetadata(
    node_name="enrich_data",
    operation="join_and_transform",
    status="success",
    duration=180.0,
    rows_in=50000,
    rows_out=48500,
    schema_in=["customer_id", "name", "email", "signup_date"],
    schema_out=["customer_id", "name", "email", "signup_date", "segment", "ltv"]
)
node2.calculate_row_change()
node2.calculate_schema_changes()
metadata.add_node(node2)

# Render as markdown
renderer = MarkdownStoryRenderer()
markdown_story = renderer.render(metadata)

print("Markdown Story (first 1000 chars):")
print(markdown_story[:1000])
print("\n... (truncated)")
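
Conceptually, a markdown renderer walks the metadata and emits one heading per node. The sketch below illustrates that idea with plain dictionaries. The `render_markdown` function, the exact layout, and the icon used for skipped nodes are assumptions; only the general shape (a story header, a status icon next to each node name, error details for failed nodes) is modeled on what the validation cell in Part 7 checks for.

```python
from typing import Any, Dict, List

# The "skipped" icon is an assumption made for this sketch
STATUS_ICONS = {"success": "✅", "failed": "❌", "skipped": "⏭️"}


def render_markdown(pipeline_name: str, duration: float, nodes: List[Dict[str, Any]]) -> str:
    """Build a small markdown story from pipeline and node dictionaries."""
    lines = [
        "# 📊 Pipeline Run Story",
        "",
        f"**Pipeline:** {pipeline_name}",
        f"**Duration:** {duration:.2f}s",
        "",
        "## Nodes",
        "",
    ]
    for node in nodes:
        icon = STATUS_ICONS.get(node["status"], "❔")
        lines.append(f"### {icon} {node['name']}")
        lines.append(f"- Duration: {node['duration']:.2f}s")
        # Only failed nodes carry error details
        if node.get("error_type"):
            lines.append(f"- Error: `{node['error_type']}`: {node.get('error_message', '')}")
        lines.append("")
    return "\n".join(lines)


story_md = render_markdown(
    "test_pipeline",
    300.0,
    [
        {"name": "node_1", "status": "success", "duration": 100.0},
        {
            "name": "node_2",
            "status": "failed",
            "duration": 50.0,
            "error_type": "KeyError",
            "error_message": "Column not found",
        },
    ],
)
print(story_md)
```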

### JSON Renderer

In [None]:
from odibi.story.renderers import JSONStoryRenderer

# JSON renderer for machine-readable output
json_renderer = JSONStoryRenderer()
json_story = json_renderer.render(metadata)

print("JSON Story:")
print(json_story[:800])
print("\n... (truncated)")

# Parse and analyze
import json
story_dict = json.loads(json_story)
print(f"\nParsed JSON structure:")
print(f"  Pipeline: {story_dict['pipeline_name']}")
print(f"  Nodes: {len(story_dict['nodes'])}")
print(f"  Success rate: {story_dict['success_rate']:.1f}%")
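
A JSON story is only useful if it survives a round trip. The following sketch shows the general pattern with standard-library dataclasses: convert nested metadata to a dictionary, serialize it, and parse it back. `NodeRecord` and `StoryRecord` are hypothetical stand-ins for the ODIBI metadata classes, and the field set is deliberately reduced.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeRecord:
    node_name: str
    status: str
    duration: float


@dataclass
class StoryRecord:
    pipeline_name: str
    duration: float
    nodes: List[NodeRecord] = field(default_factory=list)


def render_json(story: StoryRecord) -> str:
    """Serialize the story, including nested node records, to a JSON string."""
    # asdict() recurses into nested dataclasses and lists
    return json.dumps(asdict(story), indent=2)


story_record = StoryRecord(
    pipeline_name="customer_analytics",
    duration=300.5,
    nodes=[
        NodeRecord("load_customers", "success", 120.5),
        NodeRecord("enrich_data", "success", 180.0),
    ],
)

json_text = render_json(story_record)
parsed_story = json.loads(json_text)
print(parsed_story["pipeline_name"], len(parsed_story["nodes"]))
```

Because every field here is a string, number, or list, the parsed dictionary equals `asdict(story_record)`. Fields such as `datetime` objects would need an explicit conversion (for example to ISO 8601 strings) before serialization.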

### HTML Renderer (if Jinja2 is available)

In [None]:
# The HTML renderer requires jinja2 and a template file
try:
    from odibi.story.renderers import HTMLStoryRenderer
    
    # With a template file in place, you would instantiate it as:
    #   renderer = HTMLStoryRenderer()
    print("HTML renderer available")
    print("\nHTML rendering creates:")
    print("  - Professional, responsive design")
    print("  - Interactive collapsible sections")
    print("  - Color-coded status indicators")
    print("  - Summary dashboards")
    print("  - Printable reports")
    
except ImportError:
    print("HTML rendering requires: pip install jinja2")
except Exception as e:
    print(f"HTML renderer note: {e}")
    print("(Template file required for actual rendering)")

## Part 5: Theme System

### Built-in Themes

In [None]:
from odibi.story.themes import (
    StoryTheme, 
    DEFAULT_THEME, 
    CORPORATE_THEME, 
    DARK_THEME,
    MINIMAL_THEME,
    list_themes,
    get_theme
)

# List all built-in themes
themes = list_themes()
print("Built-in Themes:")
for name, theme in themes.items():
    print(f"\n{name.upper()}:")
    print(f"  Primary: {theme.primary_color}")
    print(f"  Success: {theme.success_color}")
    print(f"  Error: {theme.error_color}")
    print(f"  Font: {theme.font_family[:50]}...")

In [None]:
# Creating custom theme
custom_theme = StoryTheme(
    name="ingredion_brand",
    primary_color="#006837",  # Ingredion green
    success_color="#2e7d32",
    error_color="#c62828",
    warning_color="#f57c00",
    bg_color="#ffffff",
    text_color="#212121",
    font_family="'Open Sans', Arial, sans-serif",
    heading_font="'Roboto', sans-serif",
    company_name="Ingredion Incorporated",
    footer_text="Confidential - Internal Use Only",
    max_width="1400px"
)

# Generate CSS
css = custom_theme.to_css_string()
print("Custom Theme CSS:")
print(css)

In [None]:
# Theme with custom CSS
theme_with_custom_css = StoryTheme(
    name="custom_styled",
    primary_color="#0066cc",
    custom_css="""
    .node-header {
        border-left: 4px solid var(--primary-color);
        padding-left: 12px;
    }
    
    .summary {
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
        padding: 20px;
        border-radius: 8px;
    }
    """
)

print("Theme with custom CSS:")
print(theme_with_custom_css.to_css_string())
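
The custom CSS above refers to `var(--primary-color)`, which suggests that theme values are exposed as CSS custom properties. The sketch below shows how a theme object could emit such variables and then append user-supplied overrides. `ThemeSketch`, its `to_css` method, and every variable name other than `--primary-color` are assumptions for illustration; the real `StoryTheme.to_css_string()` output may look different.

```python
from dataclasses import dataclass


@dataclass
class ThemeSketch:
    """Hypothetical, reduced stand-in for a story theme."""

    primary_color: str = "#0066cc"
    success_color: str = "#2e7d32"
    error_color: str = "#c62828"
    font_family: str = "Arial, sans-serif"
    custom_css: str = ""

    def to_css(self) -> str:
        """Emit theme values as CSS custom properties, then any custom rules."""
        variables = {
            "--primary-color": self.primary_color,
            "--success-color": self.success_color,
            "--error-color": self.error_color,
            "--font-family": self.font_family,
        }
        declarations = "\n".join(f"    {name}: {value};" for name, value in variables.items())
        css = f":root {{\n{declarations}\n}}"
        # Custom rules come last so they can override or build on the variables
        if self.custom_css.strip():
            css += "\n\n/* custom overrides */\n" + self.custom_css.strip()
        return css


theme = ThemeSketch(
    primary_color="#006837",
    custom_css=".node-header { border-left: 4px solid var(--primary-color); }",
)
theme_css = theme.to_css()
print(theme_css)
```

Emitting variables first and custom rules last means a brand override only has to change one value, while hand-written rules can still reference it through `var(...)`.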

## Part 6: Story Content Patterns

### Tracking Row Counts and Data Flow

In [None]:
# Example: Building a complete data lineage story
def build_data_lineage_story():
    """Demonstrate tracking data through pipeline."""
    
    metadata = PipelineStoryMetadata(
        pipeline_name="data_quality_pipeline",
        pipeline_layer="data_quality",
        started_at="2024-01-15T09:00:00"
    )
    
    # Step 1: Extract
    extract = NodeExecutionMetadata(
        node_name="extract_raw_data",
        operation="database_read",
        status="success",
        duration=5.2,
        rows_in=0,
        rows_out=100000,
        schema_out=["id", "timestamp", "value", "sensor_id", "quality_flag"]
    )
    extract.calculate_row_change()
    metadata.add_node(extract)
    
    # Step 2: Remove duplicates
    dedupe = NodeExecutionMetadata(
        node_name="remove_duplicates",
        operation="distinct",
        status="success",
        duration=2.1,
        rows_in=100000,
        rows_out=95000,
        schema_in=["id", "timestamp", "value", "sensor_id", "quality_flag"],
        schema_out=["id", "timestamp", "value", "sensor_id", "quality_flag"]
    )
    dedupe.calculate_row_change()
    dedupe.calculate_schema_changes()
    metadata.add_node(dedupe)
    
    # Step 3: Filter bad quality
    filter_node = NodeExecutionMetadata(
        node_name="filter_quality",
        operation="filter",
        status="success",
        duration=1.5,
        rows_in=95000,
        rows_out=85000,
        schema_in=["id", "timestamp", "value", "sensor_id", "quality_flag"],
        schema_out=["id", "timestamp", "value", "sensor_id"]
    )
    filter_node.calculate_row_change()
    filter_node.calculate_schema_changes()
    metadata.add_node(filter_node)
    
    # Step 4: Aggregate
    aggregate = NodeExecutionMetadata(
        node_name="hourly_aggregation",
        operation="group_by",
        status="success",
        duration=3.8,
        rows_in=85000,
        rows_out=2040,
        schema_in=["id", "timestamp", "value", "sensor_id"],
        schema_out=["hour", "sensor_id", "avg_value", "min_value", "max_value", "count"]
    )
    aggregate.calculate_row_change()
    aggregate.calculate_schema_changes()
    metadata.add_node(aggregate)
    
    metadata.completed_at = "2024-01-15T09:00:13"
    metadata.duration = 12.6
    
    # Render story
    renderer = MarkdownStoryRenderer()
    story = renderer.render(metadata)
    
    return story

story = build_data_lineage_story()
print(story[:1500])
print("\n... (story continues)")

### Duration Tracking and Performance Metrics

In [None]:
# Example: Analyzing performance from story metadata
def analyze_performance(metadata: PipelineStoryMetadata):
    """Extract performance insights from story."""
    
    print(f"Pipeline Performance Analysis: {metadata.pipeline_name}")
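    print("=" * 60)
    
    # Overall metrics
    print("\nOverall:")
    print(f"  Total Duration: {metadata.duration:.2f}s")
    print(f"  Total Nodes: {metadata.total_nodes}")
    # Guard against an empty pipeline to avoid ZeroDivisionError
    avg_time = metadata.duration / metadata.total_nodes if metadata.total_nodes else 0.0
    print(f"  Avg Time/Node: {avg_time:.2f}s")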
    
    # Node-level analysis
    print(f"\nNode Performance:")
    sorted_nodes = sorted(metadata.nodes, key=lambda n: n.duration, reverse=True)
    
    for i, node in enumerate(sorted_nodes[:5], 1):
        pct = (node.duration / metadata.duration * 100) if metadata.duration > 0 else 0
        print(f"  {i}. {node.node_name}: {node.duration:.2f}s ({pct:.1f}%)")
        
        if node.rows_in and node.rows_out:
            throughput = node.rows_out / node.duration if node.duration > 0 else 0
            print(f"     Throughput: {throughput:,.0f} rows/sec")
    
    # Data volume analysis
    print(f"\nData Volume:")
    total_rows = metadata.get_total_rows_processed()
    print(f"  Total Rows Processed: {total_rows:,}")
    
    overall_throughput = total_rows / metadata.duration if metadata.duration > 0 else 0
    print(f"  Overall Throughput: {overall_throughput:,.0f} rows/sec")

# Try it on a small synthetic pipeline
test_metadata = PipelineStoryMetadata(
    pipeline_name="test_pipeline",
    duration=10.5
)

for i in range(3):
    node = NodeExecutionMetadata(
        node_name=f"node_{i}",
        operation="transform",
        status="success",
        duration=3.5 - i,
        rows_in=10000,
        rows_out=9000 + i*100
    )
    test_metadata.add_node(node)

analyze_performance(test_metadata)

## Part 7: Testing Stories

### Validating Story Content

In [None]:
# Testing story generation
def test_story_generation():
    """Validate story content and structure."""
    
    # Create test metadata
    metadata = PipelineStoryMetadata(
        pipeline_name="test_pipeline",
        started_at="2024-01-15T10:00:00",
        completed_at="2024-01-15T10:05:00",
        duration=300.0
    )
    
    # Add test nodes
    metadata.add_node(NodeExecutionMetadata(
        node_name="node_1",
        operation="read",
        status="success",
        duration=100.0,
        rows_out=1000
    ))
    
    metadata.add_node(NodeExecutionMetadata(
        node_name="node_2",
        operation="transform",
        status="failed",
        duration=50.0,
        error_message="Column not found",
        error_type="KeyError"
    ))
    
    # Test Markdown rendering
    md_renderer = MarkdownStoryRenderer()
    md_story = md_renderer.render(metadata)
    
    # Validate content
    tests = [
        ("# 📊 Pipeline Run Story" in md_story, "Has header"),
        ("test_pipeline" in md_story, "Contains pipeline name"),
        ("✅ node_1" in md_story, "Shows successful node"),
        ("❌ node_2" in md_story, "Shows failed node"),
        ("KeyError" in md_story, "Includes error type"),
        ("Column not found" in md_story, "Includes error message"),
        ("Duration:" in md_story, "Shows duration"),
        ("Success Rate" in md_story, "Shows success rate")
    ]
    
    print("Story Content Validation:")
    print("="*50)
    all_passed = True
    for passed, description in tests:
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"{status}: {description}")
        if not passed:
            all_passed = False
    
    print("\n" + "="*50)
    print(f"Overall: {'✅ ALL TESTS PASSED' if all_passed else '❌ SOME TESTS FAILED'}")
    
    return all_passed

test_story_generation()

In [None]:
# Testing JSON serialization
def test_json_serialization():
    """Validate JSON story can be serialized and deserialized."""
    import json
    
    # Create metadata
    metadata = PipelineStoryMetadata(
        pipeline_name="json_test",
        duration=100.5
    )
    
    node = NodeExecutionMetadata(
        node_name="test_node",
        operation="test_op",
        status="success",
        duration=50.0,
        rows_in=100,
        rows_out=90
    )
    node.calculate_row_change()
    metadata.add_node(node)
    
    # Render as JSON
    renderer = JSONStoryRenderer()
    json_str = renderer.render(metadata)
    
    # Parse back
    parsed = json.loads(json_str)
    
    # Validate
    tests = [
        (parsed["pipeline_name"] == "json_test", "Pipeline name preserved"),
        (parsed["duration"] == 100.5, "Duration preserved"),
        (len(parsed["nodes"]) == 1, "Node count correct"),
        (parsed["nodes"][0]["node_name"] == "test_node", "Node name preserved"),
        (parsed["nodes"][0]["rows_change"] == -10, "Row change calculated"),
        (parsed["success_rate"] == 100.0, "Success rate calculated")
    ]
    
    print("JSON Serialization Tests:")
    print("="*50)
    for passed, description in tests:
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"{status}: {description}")
    
    return all(t[0] for t in tests)

test_json_serialization()
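
The two functions above print a report and return a boolean, which is convenient in a notebook but will not fail an automated test run unless the return value is asserted. A minimal sketch of a helper that turns the same `(passed, description)` tuples into a hard failure follows. `assert_checks` is a hypothetical helper, and the sample story string is made up for the demonstration.

```python
from typing import Iterable, Tuple


def assert_checks(checks: Iterable[Tuple[bool, str]]) -> None:
    """Raise AssertionError listing every failed check; pass silently otherwise."""
    failures = [description for passed, description in checks if not passed]
    if failures:
        raise AssertionError("Failed checks: " + ", ".join(failures))


# Made-up story text standing in for real renderer output
sample_story = "Pipeline: test_pipeline\nKeyError: Column not found"

# All checks pass, so nothing is raised
assert_checks(
    [
        ("test_pipeline" in sample_story, "Contains pipeline name"),
        ("KeyError" in sample_story, "Includes error type"),
    ]
)

# A failing check is reported by its description
try:
    assert_checks([("Success Rate" in sample_story, "Shows success rate")])
except AssertionError as exc:
    failure_message = str(exc)
    print(failure_message)
```

Under a test runner such as pytest, calling `assert_checks(tests)` at the end of a test function would surface every failed description in a single error message.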

## Summary

### Key Concepts Covered

1. **StoryGenerator**: Creates markdown documentation from pipeline execution
   - Configurable sample sizes and output paths
   - Automatic DataFrame-to-markdown conversion
   - Rich execution context inclusion

2. **NodeExecutionMetadata**: Tracks detailed node-level metrics
   - Row counts (in/out/change/percentage)
   - Schema evolution (added/removed columns)
   - Timing and performance data
   - Error tracking with full context

3. **PipelineStoryMetadata**: Aggregates pipeline-level information
   - Overall status and success rates
   - Total rows processed across pipeline
   - Project/plant/business unit context
   - Theme and rendering preferences

4. **Multi-Format Renderers**:
   - **Markdown**: Human-readable, GitHub-flavored
   - **HTML**: Professional, interactive reports (requires Jinja2)
   - **JSON**: Machine-readable for APIs and storage

5. **Theme System**: Customizable branding and styling
   - Built-in themes (default, corporate, dark, minimal)
   - Custom color schemes and typography
   - Company branding (logo, name, footer)
   - CSS customization support

### Automatic Documentation Pattern

Story generation enables:
- **Self-documenting pipelines** - execution creates its own documentation
- **Data lineage tracking** - see exactly how data transforms
- **Performance monitoring** - identify slow nodes and bottlenecks
- **Error debugging** - rich context for troubleshooting
- **Compliance reporting** - auditable execution records
- **Knowledge sharing** - communicate pipeline behavior to stakeholders