# 01 - Local Pipeline with Pandas

## 🧭 Goal

Run a complete data pipeline using ODIBI's Pandas engine.

This notebook will:
- Create sample sales data
- Run the `example_local.yaml` pipeline
- Transform Bronze → Silver → Gold layers
- Inspect output files

**Estimated time:** 2 minutes

## 🔧 Setup

In [1]:
# ✅ Environment Setup
import sys
import os
from pathlib import Path
import pandas as pd
import yaml

# Navigate to project root
project_root = Path.cwd().parent if Path.cwd().name == 'walkthroughs' else Path.cwd()
os.chdir(project_root)

# Import ODIBI
from odibi.pipeline import Pipeline
from odibi.config import PipelineConfig, ProjectConfig
from odibi.connections import LocalConnection

print("✅ Environment ready")
print(f"📁 Working directory: {Path.cwd()}")

✅ Environment ready
📁 Working directory: d:\odibi
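
The one-line root check in the setup cell only covers a notebook started from `walkthroughs/` or from the project root itself. A slightly more defensive variant, sketched below, walks up the directory tree until it finds the example config. The helper name `find_project_root` and the choice of `examples/example_local.yaml` as the marker file are assumptions made for illustration; neither is part of ODIBI.

```python
from pathlib import Path


def find_project_root(start, marker="examples/example_local.yaml"):
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for this walkthrough, not an ODIBI function.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker} at or above {start}")
```

In the setup cell this could replace the `project_root = ...` line with `project_root = find_project_root(Path.cwd())`, which fails loudly instead of silently staying in the wrong directory.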


## 📊 Create Sample Data

Let's create some sample sales data for our pipeline.

In [2]:
# Create data directories
Path("data/bronze").mkdir(parents=True, exist_ok=True)

# Create sample sales CSV
sales_data = pd.DataFrame({
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005', 'T006'],
    'customer_id': ['C001', 'C001', 'C002', 'C002', 'C003', 'C001'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P002', 'P001'],
    'amount': [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
    'transaction_date': ['2024-01-15', '2024-01-20', '2024-01-22', '2024-01-25', '2024-02-01', '2024-02-05']
})

sales_data.to_csv('data/bronze/sales.csv', index=False)

print("✅ Sample data created")
print("\nSample data preview:")
display(sales_data)

✅ Sample data created

Sample data preview:


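|   | transaction_id | customer_id | product_id | amount | transaction_date |
|---|----------------|-------------|------------|--------|------------------|
| 0 | T001 | C001 | P001 | 50.0 | 2024-01-15 |
| 1 | T002 | C001 | P002 | 75.5 | 2024-01-20 |
| 2 | T003 | C002 | P001 | 120.0 | 2024-01-22 |
| 3 | T004 | C002 | P003 | 45.0 | 2024-01-25 |
| 4 | T005 | C003 | P002 | 200.0 | 2024-02-01 |
| 5 | T006 | C001 | P001 | 30.0 | 2024-02-05 |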


## ▶️ Run Pipeline

Now let's run the Bronze → Silver → Gold pipeline using `example_local.yaml`.

In [3]:
# Load pipeline configuration
with open('examples/example_local.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("📋 Pipeline configuration loaded")
print(f"   Project: {config['project']}")
print(f"   Engine: {config['engine']}")
print(f"   Pipelines: {len(config['pipelines'])}")

📋 Pipeline configuration loaded
   Project: Local Pandas Example
   Engine: pandas
   Pipelines: 2


### 💡 Concept: Configuration vs Runtime Objects

**Key distinction in ODIBI:**

- **Configuration (YAML/dicts)**: Declarative definitions of *what* should happen
  - Defines engine type, node names, file paths, connection types
  - Used for validation, portability, and version control

- **Runtime Objects**: Executable instances that *do* the work
  - `LocalConnection` objects perform actual file I/O operations
  - `Pipeline` orchestrates execution and calls methods on connections

**Why not use `project_config.connections` directly?**
- It contains configuration dicts, not executable objects
- ODIBI needs connection instances with methods like `get_path()`
- This separation enables: (1) validation without I/O, (2) easy testing/mocking, (3) secure credential injection at runtime

**In notebooks:** Always instantiate connection objects before passing them to `Pipeline`.
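
To make the distinction concrete, here is a minimal sketch of turning connection config dicts into runtime objects. It does not use ODIBI's own classes: `SketchLocalConnection`, `CONNECTION_TYPES`, and `build_connections` are hypothetical names, and the `get_path()` signature is an assumption based only on the method name mentioned above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchLocalConnection:
    """Hypothetical stand-in for a runtime connection; not ODIBI's class."""

    base_path: str

    def get_path(self, relative: str) -> Path:
        # Resolve a relative file name against the connection's base directory.
        return Path(self.base_path) / relative


# Configuration: plain data, as it might come out of yaml.safe_load().
connection_configs = {"local": {"type": "local", "base_path": "./data"}}

# Hypothetical registry mapping a config "type" to a runtime class.
CONNECTION_TYPES = {"local": SketchLocalConnection}


def build_connections(configs: dict) -> dict:
    """Turn connection config dicts into runtime objects."""
    built = {}
    for name, cfg in configs.items():
        options = {key: value for key, value in cfg.items() if key != "type"}
        built[name] = CONNECTION_TYPES[cfg["type"]](**options)
    return built


sketch_connections = build_connections(connection_configs)

# The raw config is only a dict, so it has no I/O helpers.
print(hasattr(connection_configs["local"], "get_path"))  # False
# The runtime object can resolve paths.
print(sketch_connections["local"].get_path("bronze/sales.csv"))
```

The config dict can be validated, diffed, and version-controlled without touching the filesystem, while only the built object knows how to resolve paths.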

In [None]:
# Run Bronze → Silver pipeline
print("\n🔄 Running Bronze → Silver pipeline...\n")

pipeline_config = PipelineConfig(**config['pipelines'][0])
project_config = ProjectConfig(**{k: v for k, v in config.items() if k != 'pipelines'})

# Create runtime connection instances from config
# ODIBI requires objects (e.g., LocalConnection) at runtime to perform reads/writes
# The config dict just tells us WHAT to create, not HOW to execute I/O
connections = {
    'local': LocalConnection(base_path='./data')
}

# Create pipeline
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections
)
results = pipeline.run()

# Check results
print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")

# Debug tip: If pipeline fails, inspect failures
if results.failed:
    print("\n⚠️ Failed nodes detected:")
    for node_name in results.failed:
        node_result = results.get_node_result(node_name)
        if node_result and node_result.error:
            print(f"   {node_name}: {node_result.error}")

In [None]:
# Run Silver → Gold pipeline
print("\n🔄 Running Silver → Gold pipeline...\n")

pipeline_config = PipelineConfig(**config['pipelines'][1])
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections  # Reuse connection objects from above
)
results = pipeline.run()

print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")

### 💡 How SQL Works with the Pandas Engine

**You might wonder:** How can we use SQL with `engine='pandas'`?

**Answer:** ODIBI uses [DuckDB](https://duckdb.org/) to run SQL queries over in-memory Pandas DataFrames:
- Each node's output is registered as a SQL view using the **node name**
- In the pipeline YAML, you can reference upstream nodes directly in SQL (e.g., `FROM load_raw_sales`)
- DuckDB executes the SQL directly against those DataFrames and hands the result back as a DataFrame

**Example from `example_local.yaml`:**
```sql
SELECT transaction_id, customer_id, amount
FROM load_raw_sales  -- ← This is the upstream node name!
WHERE amount > 0
```

This is why node names matter: each one becomes a SQL table name!

## 🔍 Inspect Outputs

Let's examine the data at each layer.

In [None]:
# Check Bronze layer (original CSV)
bronze_data = pd.read_csv('data/bronze/sales.csv')
print("📁 Bronze Layer (Raw Data):")
print(f"   Rows: {len(bronze_data)}")
display(bronze_data)

In [None]:
# Check Silver layer (cleaned Parquet)
silver_data = pd.read_parquet('data/silver/sales.parquet')
print("\n📁 Silver Layer (Cleaned Data):")
print(f"   Rows: {len(silver_data)}")
print(f"   Columns: {list(silver_data.columns)}")
display(silver_data)

In [None]:
# Check Gold layer (aggregated analytics)
gold_data = pd.read_parquet('data/gold/customer_summary.parquet')
print("\n📁 Gold Layer (Customer Analytics):")
print(f"   Rows: {len(gold_data)}")
print(f"   Columns: {list(gold_data.columns)}")
display(gold_data)

## 🔧 Troubleshooting

**Common issues and solutions:**

| Error | Cause | Solution |
|-------|-------|----------|
| `TypeError: expected Connection, got dict` | Passing `project_config.connections` (raw dicts) to Pipeline | Create `LocalConnection` objects (see Config vs Runtime section above) |
| `FileNotFoundError: data/silver/sales.parquet` | Wrong working directory or pipeline failed | Re-run Setup cell to set working directory; check `results.failed` for errors |
| `ImportError: Missing optional dependency 'pyarrow'` | Parquet library not installed | Run: `pip install pyarrow` |
| `KeyError: 'load_raw_sales'` in SQL | Node name mismatch in dependencies or SQL | Ensure SQL table names match upstream node names exactly |
| `AttributeError: 'dict' object has no attribute 'get_path'` | Connection objects not instantiated | See Config vs Runtime section - use `LocalConnection()` not raw dicts |
| Pipeline runs but no output files | Pipeline node failed silently | Check `results.failed` and inspect node errors (see debug code above) |

**Debug checklist:**
1. ✅ Re-run Setup cell to ensure correct working directory
2. ✅ Check `results.failed` for any failed nodes
3. ✅ Verify `connections` uses `LocalConnection()` objects, not `project_config.connections`
4. ✅ Ensure bronze data exists: `data/bronze/sales.csv`
5. ✅ Install dependencies: `pip install pyarrow pyyaml pandas`
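
Parts of this checklist can be automated. The sketch below checks that the input files exist and that the required packages are importable, using only the standard library. The `preflight` function and its defaults are hypothetical helpers written for this walkthrough; the file paths and package names are the ones listed above (`pyyaml` is imported as `yaml`).

```python
import importlib.util
from pathlib import Path

REQUIRED_FILES = ["examples/example_local.yaml", "data/bronze/sales.csv"]
REQUIRED_MODULES = ["pandas", "yaml", "pyarrow"]


def preflight(root=".", files=REQUIRED_FILES, modules=REQUIRED_MODULES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for rel in files:
        # Paths are resolved relative to `root`, normally the project root.
        if not (Path(root) / rel).exists():
            problems.append(f"missing file: {rel}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


for problem in preflight():
    print(problem)
```

Run from the project root, it prints nothing when the files and packages are in place. It does not cover items 2 and 3, which depend on the live `results` and `connections` objects.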

## 🪞 Reflect

**What we learned:**
- Created sample data programmatically
- Ran a multi-layer pipeline (Bronze → Silver → Gold)
- Transformed CSV to Parquet format
- Applied SQL-based filtering and aggregation
- Inspected outputs at each layer

**Key concepts:**
- **Bronze:** Raw data, minimal processing
- **Silver:** Cleaned, validated, ready for analysis
- **Gold:** Business-level aggregates and metrics
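
As a rough illustration of the three layers, the flow can be reproduced in plain pandas on this notebook's sample rows. The exact transformations in `example_local.yaml` are not shown here, so the cleaning rule (`amount > 0`) and the `transaction_count` column are assumptions; `total_spent` is the one Gold column that the self-check below asserts on.

```python
import pandas as pd

# Bronze: raw records as ingested (same sample rows as this notebook).
bronze_sketch = pd.DataFrame({
    "transaction_id": ["T001", "T002", "T003", "T004", "T005", "T006"],
    "customer_id": ["C001", "C001", "C002", "C002", "C003", "C001"],
    "amount": [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
})

# Silver: cleaned rows (assumed rule: keep positive amounts only).
silver_sketch = bronze_sketch[bronze_sketch["amount"] > 0]

# Gold: one row per customer with aggregated metrics (assumed columns).
gold_sketch = silver_sketch.groupby("customer_id", as_index=False).agg(
    total_spent=("amount", "sum"),
    transaction_count=("transaction_id", "count"),
)

print(gold_sketch)
```

With the sample data this yields three customers: C001 with 155.5 over 3 transactions, C002 with 165.0 over 2, and C003 with 200.0 over 1.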

**Next step:**  
Go to **`02_cli_and_testing.ipynb`** to learn about CLI tools and testing (Phase 2 preview).

## ✅ Self-Check

In [None]:
# ✅ Self-Check
try:
    import sys, os
    print("Running self-check...")
    
    # Verify example config exists
    assert os.path.exists("examples/example_local.yaml"), "Missing example_local.yaml"
    
    # Verify data layers were created
    assert os.path.exists("data/bronze/sales.csv"), "Missing Bronze layer"
    assert os.path.exists("data/silver/sales.parquet"), "Missing Silver layer"
    assert os.path.exists("data/gold/customer_summary.parquet"), "Missing Gold layer"
    
    # Verify data integrity
    import pandas as pd
    gold = pd.read_parquet("data/gold/customer_summary.parquet")
    assert len(gold) > 0, "Gold layer has no data"
    assert 'total_spent' in gold.columns, "Missing expected column in Gold layer"
    
    print("✅ Data pipeline ran successfully")
    print(f"   Bronze: {len(pd.read_csv('data/bronze/sales.csv'))} rows")
    print(f"   Silver: {len(pd.read_parquet('data/silver/sales.parquet'))} rows")
    print(f"   Gold: {len(gold)} customers")
    
    print("🎉 Walkthrough 01 verified successfully")
except Exception as e:
    print(f"❌ Walkthrough failed self-check: {e}")
    raise