# 05 - Build a New Pipeline from Scratch

## 🧭 Goal

Learn how to create your own ODIBI pipeline from scratch.

This notebook will:
- Guide you through pipeline design
- Teach node structure and dependencies
- Show transform patterns
- Build and run a complete custom pipeline

**Estimated time:** 5 minutes

## 🔧 Setup

In [None]:
# ✅ Environment Setup
import sys
import os
from pathlib import Path
import pandas as pd
import yaml

# Navigate to project root
project_root = Path.cwd().parent if Path.cwd().name == 'walkthroughs' else Path.cwd()
os.chdir(project_root)

from odibi.pipeline import Pipeline
from odibi.config import PipelineConfig, ProjectConfig

print("✅ Environment ready")
print(f"📁 Working directory: {Path.cwd()}")

## 🎯 Use Case: Product Analytics Pipeline

We'll build a pipeline to analyze product performance:
- **Input:** Products CSV and Orders CSV
- **Transform:** Join, filter, aggregate
- **Output:** Top-selling products report

## 📊 Step 1: Create Sample Data

In [None]:
# Create data directory
Path("data/workshop/bronze").mkdir(parents=True, exist_ok=True)

# Products data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'category': ['Computer', 'Accessory', 'Accessory', 'Computer', 'Accessory'],
    'price': [1200.00, 25.00, 75.00, 350.00, 120.00]
})

products.to_csv('data/workshop/bronze/products.csv', index=False)

# Orders data
orders = pd.DataFrame({
    'order_id': ['O001', 'O002', 'O003', 'O004', 'O005', 'O006', 'O007', 'O008'],
    'product_id': ['P001', 'P002', 'P002', 'P003', 'P001', 'P005', 'P002', 'P004'],
    'quantity': [2, 5, 3, 1, 1, 2, 10, 1],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-18', '2024-01-20', 
                   '2024-01-22', '2024-01-25', '2024-01-28', '2024-02-01']
})

orders.to_csv('data/workshop/bronze/orders.csv', index=False)

print("✅ Sample data created\n")
print("Products:")
display(products)
print("\nOrders:")
display(orders)

## 🏗️ Step 2: Design Pipeline YAML

Let's build the pipeline configuration step by step.

In [None]:
# Define pipeline configuration
pipeline_yaml = """
project: Product Analytics Workshop
engine: pandas

connections:
  local:
    type: local
    base_path: ./data/workshop

pipelines:
  - name: product_analytics
    nodes:
      # Node 1: Load products
      - name: load_products
        read:
          connection: local
          path: bronze/products.csv
          format: csv
        cache: true
      
      # Node 2: Load orders
      - name: load_orders
        read:
          connection: local
          path: bronze/orders.csv
          format: csv
        cache: true
      
      # Node 3: Join products with orders
      - name: enrich_orders
        depends_on: [load_products, load_orders]
        transform:
          steps:
            - |
              SELECT 
                o.order_id,
                o.product_id,
                p.product_name,
                p.category,
                p.price,
                o.quantity,
                p.price * o.quantity as revenue,
                o.order_date
              FROM load_orders o
              LEFT JOIN load_products p ON o.product_id = p.product_id
      
      # Node 4: Calculate product metrics
      - name: product_metrics
        depends_on: [enrich_orders]
        transform:
          steps:
            - |
              SELECT 
                product_id,
                product_name,
                category,
                SUM(quantity) as total_units_sold,
                SUM(revenue) as total_revenue,
                COUNT(DISTINCT order_id) as order_count,
                AVG(quantity) as avg_quantity_per_order
              FROM enrich_orders
              GROUP BY product_id, product_name, category
              ORDER BY total_revenue DESC
      
      # Node 5: Save results
      - name: save_report
        depends_on: [product_metrics]
        write:
          connection: local
          path: silver/product_performance.parquet
          format: parquet
          mode: overwrite
"""

print("📋 Pipeline Configuration Created")
print("\nPipeline Structure:")
print("  1. load_products (read CSV)")
print("  2. load_orders (read CSV)")
print("  3. enrich_orders (join products + orders)")
print("  4. product_metrics (aggregate by product)")
print("  5. save_report (write Parquet)")
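
Before running the pipeline, it can help to preview what the two SQL steps compute. The sketch below replays the `enrich_orders` and `product_metrics` queries against the sample data using Python's built-in `sqlite3` module. This is only a stand-in for illustration: it is not how ODIBI executes SQL, the table names simply mirror the node names, and the `preview_product_metrics` helper is hypothetical.

```python
import sqlite3

# Sample data copied from Step 1
PRODUCTS = [
    ("P001", "Laptop", "Computer", 1200.00),
    ("P002", "Mouse", "Accessory", 25.00),
    ("P003", "Keyboard", "Accessory", 75.00),
    ("P004", "Monitor", "Computer", 350.00),
    ("P005", "Headset", "Accessory", 120.00),
]
ORDERS = [
    ("O001", "P001", 2, "2024-01-15"),
    ("O002", "P002", 5, "2024-01-16"),
    ("O003", "P002", 3, "2024-01-18"),
    ("O004", "P003", 1, "2024-01-20"),
    ("O005", "P001", 1, "2024-01-22"),
    ("O006", "P005", 2, "2024-01-25"),
    ("O007", "P002", 10, "2024-01-28"),
    ("O008", "P004", 1, "2024-02-01"),
]


def preview_product_metrics():
    """Hypothetical helper: replay both SQL steps in an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE load_products "
        "(product_id TEXT, product_name TEXT, category TEXT, price REAL)"
    )
    con.execute(
        "CREATE TABLE load_orders "
        "(order_id TEXT, product_id TEXT, quantity INTEGER, order_date TEXT)"
    )
    con.executemany("INSERT INTO load_products VALUES (?, ?, ?, ?)", PRODUCTS)
    con.executemany("INSERT INTO load_orders VALUES (?, ?, ?, ?)", ORDERS)

    # Same join as the enrich_orders node, stored as a table named after the node
    con.execute("""
        CREATE TABLE enrich_orders AS
        SELECT o.order_id, o.product_id, p.product_name, p.category, p.price,
               o.quantity, p.price * o.quantity AS revenue, o.order_date
        FROM load_orders o
        LEFT JOIN load_products p ON o.product_id = p.product_id
    """)

    # Same aggregation as the product_metrics node
    rows = con.execute("""
        SELECT product_id, product_name, category,
               SUM(quantity) AS total_units_sold,
               SUM(revenue) AS total_revenue,
               COUNT(DISTINCT order_id) AS order_count,
               AVG(quantity) AS avg_quantity_per_order
        FROM enrich_orders
        GROUP BY product_id, product_name, category
        ORDER BY total_revenue DESC
    """).fetchall()
    con.close()
    return rows


for row in preview_product_metrics():
    print(row)
```

With the sample data, the first row should be the Laptop with a total revenue of 3600.0 (2 × 1200 + 1 × 1200), which is the value to look for in the pipeline's report.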

## ▶️ Step 3: Run the Pipeline

In [None]:
# Parse configuration
config = yaml.safe_load(pipeline_yaml)

# Create pipeline
pipeline_config = PipelineConfig(**config['pipelines'][0])
project_config = ProjectConfig(**{k: v for k, v in config.items() if k != 'pipelines'})

pipeline = Pipeline.from_config(pipeline_config, project_config)

print("🔄 Running product analytics pipeline...\n")
result = pipeline.run()

print(f"\n✅ Pipeline Status: {result.status}")
for node_name, node_result in result.node_results.items():
    status_icon = "✅" if node_result.status == "success" else "❌"
    print(f"   {status_icon} {node_name}: {node_result.status}")
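
The `depends_on` lists in the YAML define a dependency graph, and any valid run has to respect it. A minimal sketch of how such an order can be derived with the standard library's `graphlib.TopologicalSorter` follows. This is not ODIBI's scheduler; the `execution_batches` helper and the batching scheme are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Node name -> upstream nodes, copied from the depends_on entries in the YAML
DEPENDENCIES = {
    "load_products": [],
    "load_orders": [],
    "enrich_orders": ["load_products", "load_orders"],
    "product_metrics": ["enrich_orders"],
    "save_report": ["product_metrics"],
}


def execution_batches(dependencies):
    """Hypothetical helper: group nodes into batches with no dependencies on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        # Nodes whose upstream nodes have all been marked done
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for i, batch in enumerate(execution_batches(DEPENDENCIES), start=1):
    print(f"Batch {i}: {batch}")
```

In this sketch the two read nodes land in the same batch because neither depends on the other, while the remaining nodes form a strict chain.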

## 🔍 Step 4: Inspect Results

In [None]:
# Load and display the product performance report
report = pd.read_parquet('data/workshop/silver/product_performance.parquet')

print("📊 Product Performance Report:\n")
display(report)

print("\n💡 Insights:")
print(f"   • Top product: {report.iloc[0]['product_name']} (${report.iloc[0]['total_revenue']:.2f})")
print(f"   • Total products analyzed: {len(report)}")
print(f"   • Total revenue: ${report['total_revenue'].sum():.2f}")

## 🧩 Step 5: Understand the Patterns

Let's break down the key patterns used:

In [None]:
print("""
🎯 Key Pipeline Patterns:

1. **Read Nodes** (no dependencies)
   - Load data from external sources
   - Set `cache: true` for reuse

2. **Transform Nodes** (depend on read nodes or other transform nodes)
   - Use SQL for joins, filters, aggregations
   - Reference upstream nodes by name
   - Can have multiple dependencies

3. **Write Nodes** (depend on transforms)
   - Save results to files
   - Specify format (csv, parquet, etc.)
   - Set mode (overwrite, append)

4. **Dependencies**
   - Use `depends_on: [node1, node2]`
   - Creates execution order
   - Enables parallel execution where possible

5. **SQL Transforms**
   - Full DuckDB SQL support
   - Reference nodes as tables
   - Use JOINs, GROUP BY, window functions
""")

## 🪞 Reflect

**What we learned:**
- How to design a pipeline from scratch
- Node types: read, transform, write
- Dependency management with `depends_on`
- SQL transforms for data manipulation
- Running and inspecting pipeline results

**Your Turn:**
Try modifying the pipeline to:
- Add a filter for high-revenue products only
- Calculate category-level metrics
- Add a date range filter
- Write results to CSV instead of Parquet
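
One possible starting point for the first three exercises is to prototype the SQL before pasting it into a `transform` step. The sketch below does that with `sqlite3` as a stand-in engine, seeded with the rows that `enrich_orders` would produce from the sample data. The revenue threshold of 300, the January date range, and the `run_query` helper are arbitrary, hypothetical choices, and a query may need adjusting for the engine the pipeline actually uses.

```python
import sqlite3

# (order_id, product_id, category, revenue, order_date) as produced by enrich_orders
ENRICHED = [
    ("O001", "P001", "Computer", 2400.0, "2024-01-15"),
    ("O002", "P002", "Accessory", 125.0, "2024-01-16"),
    ("O003", "P002", "Accessory", 75.0, "2024-01-18"),
    ("O004", "P003", "Accessory", 75.0, "2024-01-20"),
    ("O005", "P001", "Computer", 1200.0, "2024-01-22"),
    ("O006", "P005", "Accessory", 240.0, "2024-01-25"),
    ("O007", "P002", "Accessory", 250.0, "2024-01-28"),
    ("O008", "P004", "Computer", 350.0, "2024-02-01"),
]

# Exercise: category-level metrics
CATEGORY_SQL = """
    SELECT category,
           SUM(revenue) AS total_revenue,
           COUNT(DISTINCT order_id) AS order_count
    FROM enrich_orders
    GROUP BY category
    ORDER BY total_revenue DESC
"""

# Exercises: date range filter (WHERE) plus high-revenue filter (HAVING)
HIGH_REVENUE_SQL = """
    SELECT product_id, SUM(revenue) AS total_revenue
    FROM enrich_orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY product_id
    HAVING SUM(revenue) >= 300
    ORDER BY total_revenue DESC
"""


def run_query(sql):
    """Hypothetical helper: run one query against an in-memory enrich_orders table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE enrich_orders "
        "(order_id TEXT, product_id TEXT, category TEXT, revenue REAL, order_date TEXT)"
    )
    con.executemany("INSERT INTO enrich_orders VALUES (?, ?, ?, ?, ?)", ENRICHED)
    rows = con.execute(sql).fetchall()
    con.close()
    return rows


print("Category metrics:", run_query(CATEGORY_SQL))
print("High-revenue products in January:", run_query(HIGH_REVENUE_SQL))
```

Once a query returns what you expect, it can be added as a new node with its own `depends_on` entry, following the same structure as `product_metrics`.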

**Next Steps:**
1. Explore more examples in `examples/getting_started/`
2. Read [CONTRIBUTING.md](../CONTRIBUTING.md) to contribute
3. Check [PHASES.md](../PHASES.md) for upcoming features
4. Join the community at https://github.com/henryodibi11/Odibi

**Congratulations!** 🎉 You've completed all ODIBI walkthroughs!

## ✅ Self-Check

In [None]:
# ✅ Self-Check
try:
    import sys, os
    print("Running self-check...")
    
    # Verify sample data was created
    assert os.path.exists("data/workshop/bronze/products.csv"), "Missing products.csv"
    assert os.path.exists("data/workshop/bronze/orders.csv"), "Missing orders.csv"
    
    # Verify pipeline ran successfully
    assert os.path.exists("data/workshop/silver/product_performance.parquet"), "Pipeline did not create output"
    
    # Verify output quality
    import pandas as pd
    report = pd.read_parquet("data/workshop/silver/product_performance.parquet")
    
    assert len(report) > 0, "Report is empty"
    assert 'total_revenue' in report.columns, "Missing expected column"
    assert 'product_name' in report.columns, "Missing expected column"
    
    # Verify data integrity
    assert report['total_revenue'].sum() > 0, "Revenue calculation failed"
    
    print(f"✅ Pipeline created {len(report)} product metrics")
    print(f"✅ Total revenue calculated: ${report['total_revenue'].sum():.2f}")
    
    print("🎉 Walkthrough 05 verified successfully")
    print("\n🎓 Congratulations! You've completed all ODIBI walkthroughs!")
except Exception as e:
    print(f"❌ Walkthrough failed self-check: {e}")
    raise