# 01 - Local Pipeline with Pandas

## 🧭 Goal

Run a complete data pipeline using ODIBI's Pandas engine.

This notebook will:
- Create sample sales data
- Run the `example_local.yaml` pipeline
- Transform Bronze → Silver → Gold layers
- Inspect output files

**Estimated time:** 2 minutes

## 🔧 Setup

In [1]:
# ✅ Environment Setup
import sys
import os
from pathlib import Path
import pandas as pd
import yaml

# Navigate to project root
project_root = Path.cwd().parent if Path.cwd().name == 'walkthroughs' else Path.cwd()
os.chdir(project_root)

# Import ODIBI
from odibi.pipeline import Pipeline
from odibi.config import PipelineConfig, ProjectConfig
from odibi.connections import LocalConnection

print("✅ Environment ready")
print(f"📁 Working directory: {Path.cwd()}")

✅ Environment ready
📁 Working directory: c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi


## 📊 Create Sample Data

Let's create some sample sales data for our pipeline.

In [2]:
# Create data directories
Path("data/bronze").mkdir(parents=True, exist_ok=True)

# Create sample sales CSV
sales_data = pd.DataFrame({
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005', 'T006'],
    'customer_id': ['C001', 'C001', 'C002', 'C002', 'C003', 'C001'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P002', 'P001'],
    'amount': [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
    'transaction_date': ['2024-01-15', '2024-01-20', '2024-01-22', '2024-01-25', '2024-02-01', '2024-02-05']
})

sales_data.to_csv('data/bronze/sales.csv', index=False)

print("✅ Sample data created")
print("\nSample data preview:")
display(sales_data)

✅ Sample data created

Sample data preview:


| | transaction_id | customer_id | product_id | amount | transaction_date |
|---|---|---|---|---|---|
| 0 | T001 | C001 | P001 | 50.0 | 2024-01-15 |
| 1 | T002 | C001 | P002 | 75.5 | 2024-01-20 |
| 2 | T003 | C002 | P001 | 120.0 | 2024-01-22 |
| 3 | T004 | C002 | P003 | 45.0 | 2024-01-25 |
| 4 | T005 | C003 | P002 | 200.0 | 2024-02-01 |
| 5 | T006 | C001 | P001 | 30.0 | 2024-02-05 |


## ▶️ Run Pipeline

Now let's run the Bronze → Silver → Gold pipelines defined in `example_local.yaml`.

In [24]:
# Load pipeline configuration
with open('examples/example_local.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("📋 Pipeline configuration loaded (v1.1)")
print(f"   Project: {config['project']}")
print(f"   Engine: {config['engine']}")
print(f"   Connections: {list(config['connections'].keys())}")
print(f"   Pipelines: {len(config['pipelines'])}")

📋 Pipeline configuration loaded (v1.1)
   Project: Local Pandas Example
   Engine: pandas
   Connections: ['data', 'outputs']
   Pipelines: 2


In [25]:
# Print the parsed YAML as a plain dict
print(config)

{'project': 'Local Pandas Example', 'engine': 'pandas', 'connections': {'data': {'type': 'local', 'base_path': './data'}, 'outputs': {'type': 'local', 'base_path': './outputs'}}, 'story': {'connection': 'outputs', 'path': 'stories/', 'max_sample_rows': 10}, 'pipelines': [{'pipeline': 'bronze_to_silver', 'layer': 'transformation', 'nodes': [{'name': 'load_raw_sales', 'read': {'connection': 'data', 'path': 'bronze/sales.csv', 'format': 'csv', 'options': {'header': 0, 'dtype': {'transaction_id': 'str', 'amount': 'float'}}}, 'cache': True}, {'name': 'clean_sales', 'depends_on': ['load_raw_sales'], 'transform': {'steps': ['SELECT\n  transaction_id,\n  customer_id,\n  product_id,\n  amount,\n  transaction_date\nFROM load_raw_sales\nWHERE amount > 0  -- Remove invalid transactions\n  AND transaction_date IS NOT NULL\n']}}, {'name': 'save_silver', 'depends_on': ['clean_sales'], 'write': {'connection': 'data', 'path': 'silver/sales.parquet', 'format': 'parquet', 'mode': 'overwrite'}}]}, {'pipel


### 💡 Concept: v1.1 Single Source of Truth

**Key changes in ODIBI v1.1:**

- **ProjectConfig is the single source of truth**: No raw dict parsing or field slicing needed
  - All configuration fields (connections, story, retry, logging, pipelines) validated at load time
  - Simply pass `ProjectConfig(**config)` - no manual filtering required

- **Story configuration is mandatory**: Every pipeline generates an execution story
  - `story.connection` specifies where stories are saved
  - Stories provide observability into pipeline execution

- **Connections are referenced by name**: Multiple connections are supported (e.g., `data` for inputs, `outputs` for stories)
  - Each node specifies which connection to use via `connection` field
  - Connection objects still required at runtime for actual I/O operations

**In notebooks:** Create `ProjectConfig(**config)` directly - no dict manipulation needed!
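
The name-matching rule above can be checked on the parsed YAML before any pipeline runs. The following is a minimal sketch, not part of ODIBI: `check_connection_names` is a hypothetical helper that works on the plain dict returned by `yaml.safe_load`, and it assumes the layout printed earlier in this notebook (top-level `connections`, `story`, and `pipelines`, with `read`/`write` blocks that carry a `connection` field).

```python
def check_connection_names(config: dict) -> list:
    """Return a list of problems found in a parsed pipeline config dict.

    Hypothetical helper for illustration; ODIBI's own validation may differ.
    """
    problems = []

    # Top-level sections that the v1.1 notes describe as required.
    for key in ("connections", "story", "pipelines"):
        if key not in config:
            problems.append(f"missing top-level section: {key}")
    defined = set(config.get("connections", {}))

    # The story must point at a defined connection.
    story_conn = config.get("story", {}).get("connection")
    if story_conn is None:
        problems.append("story.connection is not set")
    elif story_conn not in defined:
        problems.append(f"story uses undefined connection: {story_conn}")

    # Every read/write block must point at a defined connection.
    for pipeline in config.get("pipelines", []):
        for node in pipeline.get("nodes", []):
            for action in ("read", "write"):
                conn = node.get(action, {}).get("connection")
                if conn is not None and conn not in defined:
                    problems.append(
                        f"{pipeline['pipeline']}.{node['name']}: "
                        f"{action} uses undefined connection: {conn}"
                    )
    return problems


# A deliberately broken config: the story points at 'outputs',
# but only 'data' is defined under connections.
sample = {
    "connections": {"data": {"type": "local", "base_path": "./data"}},
    "story": {"connection": "outputs", "path": "stories/"},
    "pipelines": [
        {
            "pipeline": "bronze_to_silver",
            "nodes": [
                {
                    "name": "load_raw_sales",
                    "read": {"connection": "data", "path": "bronze/sales.csv", "format": "csv"},
                }
            ],
        }
    ],
}
problems = check_connection_names(sample)
print(problems)
```

In this notebook the same call would be `check_connection_names(config)`; an empty list means every connection name used by a node or by the story is defined.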

In [34]:
for pipeline in (config['pipelines']):
    print(pipeline)

{'pipeline': 'bronze_to_silver', 'layer': 'transformation', 'nodes': [{'name': 'load_raw_sales', 'read': {'connection': 'data', 'path': 'bronze/sales.csv', 'format': 'csv', 'options': {'header': 0, 'dtype': {'transaction_id': 'str', 'amount': 'float'}}}, 'cache': True}, {'name': 'clean_sales', 'depends_on': ['load_raw_sales'], 'transform': {'steps': ['SELECT\n  transaction_id,\n  customer_id,\n  product_id,\n  amount,\n  transaction_date\nFROM load_raw_sales\nWHERE amount > 0  -- Remove invalid transactions\n  AND transaction_date IS NOT NULL\n']}}, {'name': 'save_silver', 'depends_on': ['clean_sales'], 'write': {'connection': 'data', 'path': 'silver/sales.parquet', 'format': 'parquet', 'mode': 'overwrite'}}]}
{'pipeline': 'silver_to_gold', 'layer': 'aggregation', 'nodes': [{'name': 'load_silver_sales', 'read': {'connection': 'data', 'path': 'silver/sales.parquet', 'format': 'parquet'}, 'cache': True}, {'name': 'customer_summary', 'depends_on': ['load_silver_sales'], 'transform': {'ste

In [26]:
# Run Bronze → Silver pipeline
print("\n🔄 Running Bronze → Silver pipeline...\n")

# v1.1: ProjectConfig is single source of truth - no dict slicing needed
project_config = ProjectConfig(**config)
pipeline_config = PipelineConfig(**config['pipelines'][0])

# Create runtime connection instances
# v1.1 uses multiple connections: 'data' for inputs, 'outputs' for stories
connections = {
    'data': LocalConnection(base_path='./data'),
    'outputs': LocalConnection(base_path='./outputs')
}

# Create pipeline
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections
)
results = pipeline.run()

# Check results
print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")

# Debug tip: If pipeline fails, inspect failures
if results.failed:
    print("\n⚠️ Failed nodes detected:")
    for node_name in results.failed:
        node_result = results.get_node_result(node_name)
        if node_result and node_result.error:
            print(f"   {node_name}: {node_result.error}")


🔄 Running Bronze → Silver pipeline...


✅ Pipeline completed
   Completed nodes: 3
   Failed nodes: 0
   Nodes: ['load_raw_sales', 'clean_sales', 'save_silver']


In [35]:
# Run Silver → Gold pipeline
print("\n🔄 Running Silver → Gold pipeline...\n")

pipeline_config = PipelineConfig(**config['pipelines'][1])
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections  # Reuse connection objects from above
)
results = pipeline.run()

print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")


🔄 Running Silver → Gold pipeline...


✅ Pipeline completed
   Completed nodes: 3
   Failed nodes: 0
   Nodes: ['load_silver_sales', 'customer_summary', 'save_gold']


### 💡 How SQL Works with the Pandas Engine

**You might wonder:** How can we use SQL with `engine='pandas'`?

**Answer:** ODIBI uses [DuckDB](https://duckdb.org/) to run SQL queries over in-memory Pandas DataFrames:
- Each node's output is registered as a SQL view using the **node name**
- In the pipeline YAML, you can reference upstream nodes directly in SQL (e.g., `FROM load_raw_sales`)
- DuckDB executes the query with its own engine directly against those DataFrames and returns the result as a new DataFrame

**Example from `example_local.yaml`:**
```sql
SELECT transaction_id, customer_id, amount
FROM load_raw_sales  -- ← This is the upstream node name!
WHERE amount > 0
```

This is why node names matter: they become your SQL table names!

## 🔍 Inspect Outputs

Let's examine the data at each layer.

In [20]:
# Check Bronze layer (original CSV)
bronze_data = pd.read_csv('data/bronze/sales.csv')
print("📁 Bronze Layer (Raw Data):")
print(f"   Rows: {len(bronze_data)}")
display(bronze_data)

📁 Bronze Layer (Raw Data):
   Rows: 6


| | transaction_id | customer_id | product_id | amount | transaction_date |
|---|---|---|---|---|---|
| 0 | T001 | C001 | P001 | 50.0 | 2024-01-15 |
| 1 | T002 | C001 | P002 | 75.5 | 2024-01-20 |
| 2 | T003 | C002 | P001 | 120.0 | 2024-01-22 |
| 3 | T004 | C002 | P003 | 45.0 | 2024-01-25 |
| 4 | T005 | C003 | P002 | 200.0 | 2024-02-01 |
| 5 | T006 | C001 | P001 | 30.0 | 2024-02-05 |


In [21]:
# Check Silver layer (cleaned Parquet)
silver_data = pd.read_parquet('data/silver/sales.parquet')
print("\n📁 Silver Layer (Cleaned Data):")
print(f"   Rows: {len(silver_data)}")
print(f"   Columns: {list(silver_data.columns)}")
display(silver_data)


📁 Silver Layer (Cleaned Data):
   Rows: 6
   Columns: ['transaction_id', 'customer_id', 'product_id', 'amount', 'transaction_date']


| | transaction_id | customer_id | product_id | amount | transaction_date |
|---|---|---|---|---|---|
| 0 | T001 | C001 | P001 | 50.0 | 2024-01-15 |
| 1 | T002 | C001 | P002 | 75.5 | 2024-01-20 |
| 2 | T003 | C002 | P001 | 120.0 | 2024-01-22 |
| 3 | T004 | C002 | P003 | 45.0 | 2024-01-25 |
| 4 | T005 | C003 | P002 | 200.0 | 2024-02-01 |
| 5 | T006 | C001 | P001 | 30.0 | 2024-02-05 |


In [22]:
# Check Gold layer (aggregated analytics)
gold_data = pd.read_parquet('data/gold/customer_summary.parquet')
print("\n📁 Gold Layer (Customer Analytics):")
print(f"   Rows: {len(gold_data)}")
print(f"   Columns: {list(gold_data.columns)}")
display(gold_data)


📁 Gold Layer (Customer Analytics):
   Rows: 4
   Columns: ['customer_id', 'transaction_count', 'total_spent', 'avg_transaction', 'last_purchase_date']


| | customer_id | transaction_count | total_spent | avg_transaction | last_purchase_date |
|---|---|---|---|---|---|
| 0 | C002 | 2 | 145.0 | 72.5 | 2024-01-25 |
| 1 | C001 | 3 | 155.5 | 51.833333 | 2024-02-01 |
| 2 | C003 | 1 | 150.0 | 150.0 | 2024-01-28 |
| 3 | C004 | 1 | 200.0 | 200.0 | 2024-02-05 |


## 🔧 Troubleshooting

**Common issues and solutions (v1.1):**

| Error | Cause | Solution |
|-------|-------|----------|
| `ValidationError: story is required` | Missing mandatory story field in YAML | Add `story:` section with `connection`, `path`, `enabled` fields |
| `ValidationError: story.connection is required` | Story section missing connection field | Add `connection: outputs` to story section |
| `KeyError: 'data'` | Connection name mismatch | Ensure nodes use `connection: data` and connections dict has `'data': LocalConnection(...)` |
| `FileNotFoundError: data/silver/sales.parquet` | Wrong working directory or pipeline failed | Re-run Setup cell; check `results.failed` for errors |
| `ImportError: Missing optional dependency 'pyarrow'` | Parquet library not installed | Run: `pip install pyarrow` |
| `KeyError: 'load_raw_sales'` in SQL | Node name mismatch in dependencies or SQL | Ensure SQL table names match upstream node names exactly |
| Pipeline runs but no output files | Pipeline node failed silently | Check `results.failed` and inspect node errors (see debug code above) |

**Debug checklist:**
1. ✅ Re-run Setup cell to ensure correct working directory
2. ✅ Check `results.failed` for any failed nodes
3. ✅ Verify YAML has required fields: `story`, `connections`, `pipelines`
4. ✅ Ensure connection objects match YAML connection names (e.g., `'data'`, `'outputs'`)
5. ✅ Ensure bronze data exists: `data/bronze/sales.csv`
6. ✅ Install dependencies: `pip install pyarrow pyyaml pandas`
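
Items 1, 5, and 6 of the checklist can be automated with the standard library. The sketch below is illustrative only: `preflight` is a hypothetical helper, and its default file and module lists are taken from this walkthrough (PyYAML is imported as `yaml`). Run it from the project root.

```python
import importlib.util
from pathlib import Path


def preflight(
    root=".",
    files=("examples/example_local.yaml", "data/bronze/sales.csv"),
    modules=("pandas", "yaml", "pyarrow"),
):
    """Return human-readable problems; an empty list means all checks passed.

    Hypothetical helper for this walkthrough, not part of ODIBI.
    """
    root = Path(root)
    problems = []

    # Files are resolved relative to the project root, so a wrong
    # working directory shows up here as missing files.
    for rel in files:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")

    # find_spec returns None when a top-level module is not importable.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")

    return problems


for problem in preflight():
    print(problem)
```

If both files are reported missing at once, the working directory is the most likely cause, so re-running the Setup cell is the first thing to try.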

## 🪞 Reflect

**What we learned:**
- Created sample data programmatically
- Ran a multi-layer pipeline (Bronze → Silver → Gold)
- Transformed CSV to Parquet format
- Applied SQL-based filtering and aggregation
- Inspected outputs at each layer

**Key concepts:**
- **Bronze:** Raw data, minimal processing
- **Silver:** Cleaned, validated, ready for analysis
- **Gold:** Business-level aggregates and metrics
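
For comparison, the three layers can be written directly in pandas. This is a rough sketch rather than what ODIBI executes: the Silver filter mirrors the `clean_sales` SQL shown in the config, while the Gold aggregation is an assumption inferred from the Gold column names, because the `customer_summary` SQL is not shown in full in this notebook.

```python
import pandas as pd

# Bronze: the raw sample rows created earlier in this walkthrough.
bronze = pd.DataFrame({
    "transaction_id": ["T001", "T002", "T003", "T004", "T005", "T006"],
    "customer_id": ["C001", "C001", "C002", "C002", "C003", "C001"],
    "product_id": ["P001", "P002", "P001", "P003", "P002", "P001"],
    "amount": [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
    "transaction_date": ["2024-01-15", "2024-01-20", "2024-01-22",
                         "2024-01-25", "2024-02-01", "2024-02-05"],
})

# Silver: same rule as the clean_sales SQL
# (amount > 0 AND transaction_date IS NOT NULL).
silver = bronze[(bronze["amount"] > 0) & bronze["transaction_date"].notna()]

# Gold: one row per customer. The aggregations are assumed from the
# Gold column names, not copied from the pipeline's SQL.
gold = silver.groupby("customer_id", as_index=False).agg(
    transaction_count=("transaction_id", "count"),
    total_spent=("amount", "sum"),
    avg_transaction=("amount", "mean"),
    last_purchase_date=("transaction_date", "max"),  # ISO dates sort correctly as strings
)

print(gold)
```

The SQL-based pipeline expresses the same steps declaratively in YAML, which keeps the transformation logic out of notebook code.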

**Next step:**  
Go to **`02_cli_and_testing.ipynb`** to learn about CLI tools and testing (Phase 2 preview).

## ✅ Self-Check

In [23]:
# ✅ Self-Check
try:
    import sys, os
    print("Running self-check...")
    
    # Verify example config exists
    assert os.path.exists("examples/example_local.yaml"), "Missing example_local.yaml"
    
    # Verify data layers were created
    assert os.path.exists("data/bronze/sales.csv"), "Missing Bronze layer"
    assert os.path.exists("data/silver/sales.parquet"), "Missing Silver layer"
    assert os.path.exists("data/gold/customer_summary.parquet"), "Missing Gold layer"
    
    # Verify data integrity
    import pandas as pd
    gold = pd.read_parquet("data/gold/customer_summary.parquet")
    assert len(gold) > 0, "Gold layer has no data"
    assert 'total_spent' in gold.columns, "Missing expected column in Gold layer"
    
    print("✅ Data pipeline ran successfully")
    print(f"   Bronze: {len(pd.read_csv('data/bronze/sales.csv'))} rows")
    print(f"   Silver: {len(pd.read_parquet('data/silver/sales.parquet'))} rows")
    print(f"   Gold: {len(gold)} customers")
    
    print("🎉 Walkthrough 01 verified successfully")
except Exception as e:
    print(f"❌ Walkthrough failed self-check: {e}")
    raise

Running self-check...
✅ Data pipeline ran successfully
   Bronze: 6 rows
   Silver: 6 rows
   Gold: 4 customers
🎉 Walkthrough 01 verified successfully
