# ODIBI Getting Started - Complete Walkthrough

**Welcome to ODIBI!** This notebook walks you through the core features you need to build data pipelines with ODIBI.

## What You'll Learn:

1. ✅ Basic pipeline (read → write)
2. ✅ Transform functions (custom logic)
3. ✅ SQL transforms
4. ✅ Multi-source pipelines (joins)
5. ✅ Debugging techniques
6. ✅ Error handling

## Prerequisites:

- ODIBI installed: `pip install -e d:/odibi`
- This notebook in `examples/getting_started/`
- Sample data in `data/` folder

Let's get started! 🚀

---

## Setup

In [1]:
# Setup: switch to the examples directory and add the project root to sys.path
import sys
import os
from pathlib import Path

# Change to examples/getting_started unless we are already somewhere inside it
if 'getting_started' not in str(Path.cwd()):
    os.chdir(Path.cwd() / 'examples' / 'getting_started')

# Add project root to path
project_root = Path.cwd().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"📁 Working directory: {Path.cwd()}")
print(f"📦 Project root: {project_root}")

📁 Working directory: c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi\examples\getting_started
📦 Project root: c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi


In [2]:
# Imports
import pandas as pd
import yaml

from odibi.pipeline import Pipeline
from odibi.config import PipelineConfig, ProjectConfig
from odibi.connections import LocalConnection
from odibi.registry import FunctionRegistry

# Import our transform functions
import transforms  # ← This registers all @transform functions

print("✅ All imports successful!")
print(f"📝 Registered transforms: {FunctionRegistry.list_functions()}")

✅ All imports successful!
📝 Registered transforms: ['calculate_revenue', 'filter_by_category', 'enrich_with_customer_data', 'aggregate_by_product']


---

## Step 1: Explore the Sample Data

Let's see what data we're working with.

In [3]:
# Load sample sales data
sales_df = pd.read_csv('data/sales.csv')

print("📊 Sales Data:")
print(f"  Rows: {len(sales_df)}")
print(f"  Columns: {list(sales_df.columns)}")
print("\nSample:")
sales_df.head()

📊 Sales Data:
  Rows: 10
  Columns: ['id', 'date', 'product', 'category', 'quantity', 'price', 'customer_id']

Sample:

|   | id | date | product | category | quantity | price | customer_id |
|---|----|------|---------|----------|----------|-------|-------------|
| 0 | 1 | 2024-01-01 | Widget A | Electronics | 5 | 29.99 | 101 |
| 1 | 2 | 2024-01-01 | Widget B | Electronics | 3 | 49.99 | 102 |
| 2 | 3 | 2024-01-02 | Gadget X | Home | 2 | 15.5 | 103 |
| 3 | 4 | 2024-01-02 | Widget A | Electronics | 1 | 29.99 | 101 |
| 4 | 5 | 2024-01-03 | Tool Y | Tools | 4 | 12.0 | 104 |


In [4]:
# Load customer data
customers_df = pd.read_csv('data/customers.csv')

print("👥 Customer Data:")
print(f"  Rows: {len(customers_df)}")
print(f"  Columns: {list(customers_df.columns)}")
print("\nSample:")
customers_df.head()

👥 Customer Data:
  Rows: 7
  Columns: ['customer_id', 'name', 'region', 'tier']

Sample:

|   | customer_id | name | region | tier |
|---|-------------|------|--------|------|
| 0 | 101 | Alice Smith | North | Gold |
| 1 | 102 | Bob Johnson | South | Silver |
| 2 | 103 | Carol White | East | Gold |
| 3 | 104 | David Brown | West | Bronze |
| 4 | 105 | Eve Davis | North | Silver |


---

## Step 2: Simple Pipeline (Read → Write)

**Goal:** Load CSV, save as Parquet

**Pipeline:** `pipelines/simple.yaml`

In [5]:
# Load the simple pipeline config
with open('pipelines/simple.yaml') as f:
    pipeline_yaml = yaml.safe_load(f)

print("📋 Pipeline Config:")
print(yaml.dump(pipeline_yaml, default_flow_style=False))

📋 Pipeline Config:
description: Load CSV and save as Parquet
nodes:
- description: Load sales data from CSV
  name: load_sales
  read:
    connection: local_data
    format: csv
    path: sales.csv
- depends_on:
  - load_sales
  description: Save sales data as Parquet
  name: save_parquet
  write:
    connection: local_output
    format: parquet
    mode: overwrite
    path: sales.parquet
pipeline: simple_etl



In [6]:
# Create pipeline config from YAML
pipeline_config = PipelineConfig(**pipeline_yaml)

print(f"✅ Pipeline: {pipeline_config.pipeline}")
print(f"📦 Nodes: {[node.name for node in pipeline_config.nodes]}")

✅ Pipeline: simple_etl
📦 Nodes: ['load_sales', 'save_parquet']


In [8]:
# Setup connections
connections = {
    "local_data": LocalConnection(base_path="./data"),
    "local_output": LocalConnection(base_path="./output")
}

print("🔌 Connections configured:")
for name, conn in connections.items():
    print(f"  - {name}: {conn.base_path}")

🔌 Connections configured:
  - local_data: data
  - local_output: output


In [9]:
# Create and run pipeline
pipeline = Pipeline(pipeline_config, connections=connections)

# Validate first
validation = pipeline.validate()
print("🔍 Validation:")
print(f"  Valid: {validation['valid']}")
print(f"  Execution order: {validation['execution_order']}")

# Run pipeline
print("\n▶️ Running pipeline...\n")
results = pipeline.run()

print("\n✅ Pipeline Results:")
print(f"  Completed: {results.completed}")
print(f"  Failed: {results.failed}")
print(f"  Duration: {results.duration:.4f}s")

🔍 Validation:
  Valid: True
  Execution order: ['load_sales', 'save_parquet']

▶️ Running pipeline...


✅ Pipeline Results:
  Completed: ['load_sales', 'save_parquet']
  Failed: []
  Duration: 0.3377s


In [12]:
# Verify output file was created (Path was imported in the setup cell)
output_file = Path('output/sales.parquet')
if output_file.exists():
    print(f"✅ Output file created: {output_file}")
    print(f"   Size: {output_file.stat().st_size} bytes")

    # Load and verify
    saved_df = pd.read_parquet(output_file)
    print(f"   Rows: {len(saved_df)}")
    print("\n📊 First few rows:")
    display(saved_df.head())
else:
    print("❌ Output file not found!")

✅ Output file created: output\sales.parquet
   Size: 4601 bytes
   Rows: 10

📊 First few rows:

|   | id | date | product | category | quantity | price | customer_id |
|---|----|------|---------|----------|----------|-------|-------------|
| 0 | 1 | 2024-01-01 | Widget A | Electronics | 5 | 29.99 | 101 |
| 1 | 2 | 2024-01-01 | Widget B | Electronics | 3 | 49.99 | 102 |
| 2 | 3 | 2024-01-02 | Gadget X | Home | 2 | 15.5 | 103 |
| 3 | 4 | 2024-01-02 | Widget A | Electronics | 1 | 29.99 | 101 |
| 4 | 5 | 2024-01-03 | Tool Y | Tools | 4 | 12.0 | 104 |


**🎉 Success!** You just ran your first ODIBI pipeline!

**What happened:**
1. Loaded `pipelines/simple.yaml`
2. Validated with Pydantic
3. Created Pipeline with connections
4. Executed: load_sales → save_parquet
5. Saved output to `output/sales.parquet`

---

## Step 3: Transform Pipeline (With Custom Functions)

**Goal:** Load data, calculate revenue, filter by category, save

**Pipeline:** `pipelines/transform.yaml`

**Uses:** Custom `@transform` functions from `transforms.py`

In [13]:
# Load transform pipeline
with open('pipelines/transform.yaml') as f:
    pipeline_yaml = yaml.safe_load(f)

print("📋 Transform Pipeline:")
print(f"  Pipeline: {pipeline_yaml['pipeline']}")
print(f"  Nodes: {[node['name'] for node in pipeline_yaml['nodes']]}")

📋 Transform Pipeline:
  Pipeline: transform_etl
  Nodes: ['load_sales', 'add_revenue', 'electronics_only', 'save_electronics']


In [14]:
# See the transform functions being used
for node in pipeline_yaml['nodes']:
    if 'transform' in node:
        print(f"\n🔧 {node['name']}:")
        for step in node['transform']['steps']:
            if isinstance(step, dict) and 'function' in step:
                print(f"  Function: {step['function']}")
                print(f"  Params: {step['params']}")


🔧 add_revenue:
  Function: calculate_revenue
  Params: {'source': 'load_sales'}

🔧 electronics_only:
  Function: filter_by_category
  Params: {'source': 'add_revenue', 'category': 'Electronics'}


In [15]:
# Create and run pipeline
pipeline_config = PipelineConfig(**pipeline_yaml)
pipeline = Pipeline(pipeline_config, connections=connections)

print("▶️ Running transform pipeline...\n")
results = pipeline.run()

print("\n✅ Results:")
print(f"  Completed: {results.completed}")
print(f"  Failed: {results.failed}")
print(f"  Duration: {results.duration:.4f}s")

▶️ Running transform pipeline...


✅ Results:
  Completed: ['load_sales', 'add_revenue', 'electronics_only', 'save_electronics']
  Failed: []
  Duration: 0.0461s


In [16]:
# Check the output
electronics_df = pd.read_csv('output/electronics_sales.csv')

print("📊 Electronics Sales Output:")
print(f"  Rows: {len(electronics_df)}")
print(f"  Columns: {list(electronics_df.columns)}")
print("\n💰 Revenue calculated and filtered:")
electronics_df

📊 Electronics Sales Output:
  Rows: 5
  Columns: ['id', 'date', 'product', 'category', 'quantity', 'price', 'customer_id', 'revenue']

💰 Revenue calculated and filtered:

|   | id | date | product | category | quantity | price | customer_id | revenue |
|---|----|------|---------|----------|----------|-------|-------------|---------|
| 0 | 1 | 2024-01-01 | Widget A | Electronics | 5 | 29.99 | 101 | 149.95 |
| 1 | 2 | 2024-01-01 | Widget B | Electronics | 3 | 49.99 | 102 | 149.97 |
| 2 | 4 | 2024-01-02 | Widget A | Electronics | 1 | 29.99 | 101 | 29.99 |
| 3 | 7 | 2024-01-04 | Widget B | Electronics | 2 | 49.99 | 102 | 99.98 |
| 4 | 9 | 2024-01-05 | Widget A | Electronics | 3 | 29.99 | 107 | 89.97 |


**What happened:**
1. Loaded sales data
2. **Added revenue column** using `calculate_revenue()` function
3. **Filtered for Electronics** using `filter_by_category()` function
4. Saved results

**Key insight:** Transform functions are just Python functions with the `@transform` decorator!

---

## Step 4: Advanced Pipeline (SQL, Joins, Parallel)

**Goal:** Join sales with customers, aggregate by product

**Features:**
- Multiple data sources
- SQL transforms
- Function transforms
- Parallel writes

In [17]:
# Load advanced pipeline
with open('pipelines/advanced.yaml') as f:
    pipeline_yaml = yaml.safe_load(f)

pipeline_config = PipelineConfig(**pipeline_yaml)

print(f"📋 Pipeline: {pipeline_config.pipeline}")
print(f"📦 Nodes ({len(pipeline_config.nodes)}):")
for node in pipeline_config.nodes:
    deps = f" ← {node.depends_on}" if node.depends_on else ""
    print(f"  - {node.name}{deps}")

📋 Pipeline: advanced_etl
📦 Nodes (7):
  - load_sales
  - load_customers
  - sales_with_revenue ← ['load_sales']
  - enriched_sales ← ['sales_with_revenue', 'load_customers']
  - product_summary ← ['enriched_sales']
  - save_enriched ← ['enriched_sales']
  - save_summary ← ['product_summary']


In [None]:
# Visualize the dependency graph
pipeline = Pipeline(pipeline_config, connections=connections)

print("🔀 Dependency Graph:\n")
print(pipeline.visualize())

In [None]:
# See execution layers (what can run in parallel)
layers = pipeline.get_execution_layers()

print("⚡ Execution Layers (nodes in same layer can run in parallel):\n")
for i, layer in enumerate(layers, 1):
    print(f"  Layer {i}: {layer}")

In [None]:
# Run the pipeline
print("▶️ Running advanced pipeline...\n")
results = pipeline.run()

print("\n✅ Results:")
print(f"  Completed: {results.completed}")
print(f"  Failed: {results.failed}")
print(f"  Duration: {results.duration:.4f}s")

print("\n📝 Node Details:")
for node_name in results.completed:
    node_result = results.get_node_result(node_name)
    print(f"  {node_name}: {node_result.duration:.4f}s")

In [None]:
# Check enriched output
enriched_df = pd.read_csv('output/enriched_sales.csv')

print("📊 Enriched Sales (with customer data):")
print(f"  Rows: {len(enriched_df)}")
print(f"  Columns: {list(enriched_df.columns)}")
print("\nSample (notice customer name, region, tier):")
enriched_df.head()

In [None]:
# Check product summary
summary_df = pd.read_csv('output/product_summary.csv')

print("📈 Product Summary (aggregated):")
summary_df

**🎉 Advanced pipeline complete!**

**What happened:**
1. Loaded **2 data sources** (sales + customers)
2. Used **SQL** to calculate revenue
3. **Joined** sales with customer data
4. **Aggregated** by product
5. Saved **2 outputs** (could run in parallel!)

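**Notice:**
- Going by the dependencies printed above, `product_summary` and `save_enriched` both depend only on `enriched_sales`, so they share Layer 4; `save_summary` has to wait for `product_summary` and lands one layer later. Compare this with the `get_execution_layers()` output from the earlier cell.
- Nodes in the same layer could run in parallel (feature not implemented yet)
- Dependencies are handled automatically!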
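As a rough illustration of how execution layers can be derived from `depends_on`, the sketch below applies a common rule: a node joins the first layer in which all of its dependencies have already been placed. The dependency map is copied from the node listing printed earlier in this step; `execution_layers` is a hypothetical helper, and ODIBI's own `get_execution_layers()` may group nodes differently.

```python
from typing import Dict, List

# Dependency map copied from the advanced pipeline's node listing.
deps: Dict[str, List[str]] = {
    "load_sales": [],
    "load_customers": [],
    "sales_with_revenue": ["load_sales"],
    "enriched_sales": ["sales_with_revenue", "load_customers"],
    "product_summary": ["enriched_sales"],
    "save_enriched": ["enriched_sales"],
    "save_summary": ["product_summary"],
}


def execution_layers(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group nodes into layers; a node is ready once all its parents are placed."""
    placed = set()
    remaining = dict(deps)
    layers: List[List[str]] = []
    while remaining:
        ready = [
            name for name, parents in remaining.items()
            if all(parent in placed for parent in parents)
        ]
        if not ready:
            raise ValueError("Dependency cycle detected")
        layers.append(ready)
        placed.update(ready)
        for name in ready:
            del remaining[name]
    return layers


layers = execution_layers(deps)
for i, layer in enumerate(layers, 1):
    print(f"Layer {i}: {layer}")
```

Under this rule the graph resolves into five layers, with the two `load_*` nodes together in the first one.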

---

## Step 5: Debugging - Run Single Nodes

**Problem:** You want to test one transform without running the whole pipeline.

**Solution:** `pipeline.run_node()` with mock data

In [None]:
# Test the aggregate_by_product function in isolation

# Create mock data
mock_sales = pd.DataFrame({
    'product': ['Widget A', 'Widget A', 'Widget B', 'Widget B'],
    'quantity': [5, 3, 2, 4],
    'revenue': [100, 60, 80, 160]
})

print("🧪 Testing single node with mock data:\n")
print("Input:")
print(mock_sales)

# Run just the aggregate node
result = pipeline.run_node(
    "product_summary",
    mock_data={"enriched_sales": mock_sales}
)

print(f"\n✅ Node executed: {result.success}")
print(f"⏱ Duration: {result.duration:.4f}s")

print("\nOutput:")
pipeline.context.get("product_summary")

**Great for debugging!**
- Test transforms without loading real data
- Iterate quickly
- Debug failures in isolation
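The same idea works outside a pipeline: feed a transform a few hand-written rows and assert on the totals. The sketch below uses a hypothetical pure-Python stand-in, `aggregate_rows_by_product`, because the real `aggregate_by_product` in `transforms.py` is not shown here. The mock rows mirror the ones in the cell above.

```python
from collections import defaultdict


def aggregate_rows_by_product(rows):
    """Hypothetical stand-in: sum quantity and revenue per product."""
    totals = defaultdict(lambda: {"quantity": 0, "revenue": 0})
    for row in rows:
        totals[row["product"]]["quantity"] += row["quantity"]
        totals[row["product"]]["revenue"] += row["revenue"]
    return {product: dict(values) for product, values in totals.items()}


# Same mock values as the notebook cell, expressed as plain dicts.
mock_sales = [
    {"product": "Widget A", "quantity": 5, "revenue": 100},
    {"product": "Widget A", "quantity": 3, "revenue": 60},
    {"product": "Widget B", "quantity": 2, "revenue": 80},
    {"product": "Widget B", "quantity": 4, "revenue": 160},
]

summary = aggregate_rows_by_product(mock_sales)
print(summary)
```

Small inputs like these make it easy to check the expected totals by hand before trusting the transform on real data.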

---

## Step 6: Error Handling

**What happens when things go wrong?**

In [None]:
# Create a pipeline with an error
from odibi.config import NodeConfig, ReadConfig, TransformConfig

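# Clear the registry so only the failing function below is registered.
# Note: this reaches into FunctionRegistry's private attributes, which is fine
# for a demo; Step 7 reloads transforms to restore the original functions.
FunctionRegistry._functions.clear()
FunctionRegistry._signatures.clear()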

from odibi import transform

@transform
def failing_transform(context, source: str):
    """This will fail on purpose."""
    df = context.get(source)
    raise ValueError("Intentional failure for demo!")

# Create pipeline with error
error_pipeline_config = PipelineConfig(
    pipeline="error_demo",
    nodes=[
        NodeConfig(
            name="load",
            read=ReadConfig(connection="local_data", format="csv", path="sales.csv")
        ),
        NodeConfig(
            name="fail_node",
            depends_on=["load"],
            transform=TransformConfig(
                steps=[{"function": "failing_transform", "params": {"source": "load"}}]
            )
        ),
        NodeConfig(
            name="dependent",
            depends_on=["fail_node"],
            transform=TransformConfig(
                steps=["SELECT * FROM fail_node"]
            )
        ),
        NodeConfig(
            name="independent",
            read=ReadConfig(connection="local_data", format="csv", path="customers.csv")
        )
    ]
)

error_pipeline = Pipeline(error_pipeline_config, connections=connections)

print("▶️ Running pipeline with intentional error...\n")
results = error_pipeline.run()

print("\n📊 Results:")
print(f"  ✅ Completed: {results.completed}")
print(f"  ❌ Failed: {results.failed}")
print(f"  ⏭ Skipped: {results.skipped}")

In [None]:
# Get error details
if results.failed:
    failed_node = results.failed[0]
    node_result = results.get_node_result(failed_node)
    
    print(f"❌ Node '{failed_node}' failed:\n")
    print(f"Error: {node_result.error}")

**Error Handling:**
- ✅ `load` completed (ran first)
- ❌ `fail_node` failed (intentional error)
- ⏭ `dependent` skipped (dependency failed)
- ✅ `independent` completed (no dependency on failed node)

**Philosophy:** 
- Don't fail fast
- Skip dependents of failed nodes
- Continue with independent nodes
- Collect all errors
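The policy above can be sketched in a few lines. This is an illustrative model only, not ODIBI's executor: `run_all` and the `actions` mapping are hypothetical, and nodes are reduced to zero-argument callables. The node names match the demo pipeline in this step.

```python
def run_all(order, deps, actions):
    """Run nodes in order, skipping any node whose dependency failed or was skipped."""
    completed, failed, skipped = [], [], []
    errors = {}
    for name in order:
        if any(parent in failed or parent in skipped for parent in deps[name]):
            skipped.append(name)
            continue
        try:
            actions[name]()
            completed.append(name)
        except Exception as exc:  # collect the error instead of stopping the run
            failed.append(name)
            errors[name] = exc
    return completed, failed, skipped, errors


def fail():
    raise ValueError("Intentional failure for demo!")


deps = {"load": [], "fail_node": ["load"], "dependent": ["fail_node"], "independent": []}
actions = {"load": lambda: None, "fail_node": fail,
           "dependent": lambda: None, "independent": lambda: None}

completed, failed, skipped, errors = run_all(
    ["load", "fail_node", "dependent", "independent"], deps, actions
)
print(completed, failed, skipped)
```

Because a skipped node also blocks its own dependents, a single failure prunes the whole downstream branch while unrelated branches keep running.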

---

## Step 7: Inspect the Context

**The Context** is how data passes between nodes.

In [None]:
# Reload transforms to register them again (Step 6 cleared the registry)
import importlib
importlib.reload(transforms)

# Run the transform pipeline again
with open('pipelines/transform.yaml') as f:
    pipeline_yaml = yaml.safe_load(f)

pipeline_config = PipelineConfig(**pipeline_yaml)
pipeline = Pipeline(pipeline_config, connections=connections)
results = pipeline.run()

print("📦 After pipeline execution, context contains:\n")
print(f"  Registered DataFrames: {pipeline.context.list_names()}")

In [None]:
# Access intermediate results from context
revenue_df = pipeline.context.get("add_revenue")

print("💰 Intermediate result 'add_revenue' (before filtering):")
print(f"  Rows: {len(revenue_df)}")
print(f"  Has revenue column: {'revenue' in revenue_df.columns}")
revenue_df.head()

In [None]:
# Compare before and after filtering
electronics_df = pipeline.context.get("electronics_only")

print("📊 Before filtering (add_revenue):")
print(f"  Total rows: {len(revenue_df)}")
print(f"  Categories: {revenue_df['category'].unique().tolist()}")

print("\n📊 After filtering (electronics_only):")
print(f"  Total rows: {len(electronics_df)}")
print(f"  Categories: {electronics_df['category'].unique().tolist()}")
print(f"  Filtered out: {len(revenue_df) - len(electronics_df)} rows")

**Context is powerful!**
- Every node registers its result
- Downstream nodes access by name
- Great for debugging (inspect intermediate results)
- Same API for Spark and Pandas
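Conceptually, a context of this kind is a named store of node results. The sketch below is a simplified model under that assumption, not ODIBI's class: `get` and `list_names` mirror the calls used in this notebook, while `register` and the class name `SimpleContext` are hypothetical, and rows are plain dicts instead of DataFrames.

```python
class SimpleContext:
    """Hypothetical minimal context: maps node names to their results."""

    def __init__(self):
        self._data = {}

    def register(self, name, value):
        """Store a node's result under the node's name."""
        self._data[name] = value

    def get(self, name):
        """Fetch an upstream result, with a readable error if it is missing."""
        if name not in self._data:
            raise KeyError(f"'{name}' not found; available: {self.list_names()}")
        return self._data[name]

    def list_names(self):
        return list(self._data)


context = SimpleContext()
context.register("add_revenue", [
    {"category": "Electronics", "revenue": 149.95},
    {"category": "Home", "revenue": 31.0},
])

# A downstream node reads its input by name and registers its own output.
electronics = [r for r in context.get("add_revenue") if r["category"] == "Electronics"]
context.register("electronics_only", electronics)
print(context.list_names())
```

Keeping every intermediate result addressable by node name is what makes the before/after comparison in this step possible.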

---

## Step 8: Simplified Orchestration with PipelineManager

**Problem:** Manually loading YAML, creating configs, and setting up connections is verbose.

**Solution:** `PipelineManager` handles everything from a single project configuration file.

In [6]:
from odibi.pipeline import PipelineManager

# Initialize manager from project config (pipelines/manager_demo.yaml)
# This automatically:
# 1. Loads the project config
# 2. Sets up connections
# 3. Creates all pipelines
manager = PipelineManager.from_yaml("pipelines/manager_demo.yaml")

print(f"✅ Loaded pipelines: {manager.list_pipelines()}")

# Run the pipeline managed by the manager
print("\n▶️ Running managed pipeline...\n")
results = manager.run("simple_etl_managed")

print(f"\n✅ Success! Output saved to {results.story_path}")

✅ Loaded pipelines: ['simple_etl_managed']

▶️ Running managed pipeline...


Running pipeline: simple_etl_managed


✅ SUCCESS - simple_etl_managed
  Completed: 2 nodes
  Failed: 0 nodes
  Duration: 239.39s
  Story: c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi\examples\getting_started\output\stories\simple_etl_managed_20251119_122921.md

✅ Success! Output saved to c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi\examples\getting_started\output\stories\simple_etl_managed_20251119_122921.md


**Why use PipelineManager?**
- Single source of truth (project config)
- Automatic connection setup
- Manages multiple pipelines
- Consistent execution environment

---

## Summary

**🎉 Congratulations!** You've learned:

### ✅ Core Concepts:
1. **YAML Configs** - Declarative pipeline definitions
2. **Nodes** - Read → Transform → Write units
3. **Dependencies** - Explicit `depends_on` declarations
4. **Context** - Data passing between nodes
5. **Connections** - Abstract data sources/destinations

### ✅ Features Used:
- ✅ CSV/Parquet read/write
- ✅ Transform functions (`@transform` decorator)
- ✅ SQL transforms (DuckDB)
- ✅ Multi-source joins
- ✅ Aggregation
- ✅ Dependency graph visualization
- ✅ Single node debugging
- ✅ Error handling

### 🚀 Next Steps:

**Try building your own pipeline:**
1. Create your own YAML config
2. Write custom `@transform` functions
3. Process your own data

**Explore more:**
- Read the framework docs: `docs/ODIBI_FRAMEWORK_PLAN.md`
- Review test examples: `test_exploration_phase2.ipynb`
- Check improvements list: `docs/IMPROVEMENTS.md`

**Happy data engineering!** 🎉