# 01 - Local Pipeline with Pandas

## 🧭 Goal

Run a complete data pipeline using ODIBI's Pandas engine.

This notebook will:
- Create sample sales data
- Run the `example_local.yaml` pipeline
- Transform Bronze → Silver → Gold layers
- Inspect output files

**Estimated time:** 2 minutes

## 🔧 Setup

In [None]:
# ✅ Environment Setup
import sys
import os
from pathlib import Path
import pandas as pd
import yaml

# Navigate to project root
project_root = Path.cwd().parent if Path.cwd().name == 'walkthroughs' else Path.cwd()
os.chdir(project_root)

# Import ODIBI
from odibi.pipeline import Pipeline
from odibi.config import PipelineConfig, ProjectConfig
from odibi.connections import LocalConnection

print("✅ Environment ready")
print(f"📁 Working directory: {Path.cwd()}")

## 📊 Create Sample Data

Let's create some sample sales data for our pipeline.

In [None]:
# Create data directories
Path("data/bronze").mkdir(parents=True, exist_ok=True)

# Create sample sales CSV
sales_data = pd.DataFrame({
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005', 'T006'],
    'customer_id': ['C001', 'C001', 'C002', 'C002', 'C003', 'C001'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P002', 'P001'],
    'amount': [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
    'transaction_date': ['2024-01-15', '2024-01-20', '2024-01-22', '2024-01-25', '2024-02-01', '2024-02-05']
})

sales_data.to_csv('data/bronze/sales.csv', index=False)

print("✅ Sample data created")
print("\nSample data preview:")
display(sales_data)

## ▶️ Run Pipeline

Now let's run the Bronze → Silver → Gold pipeline using `example_local.yaml`.

In [None]:
# Load pipeline configuration
with open('examples/example_local.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("📋 Pipeline configuration loaded")
print(f"   Project: {config['project']}")
print(f"   Engine: {config['engine']}")
print(f"   Pipelines: {len(config['pipelines'])}")

In [None]:
# Run Bronze → Silver pipeline
print("\n🔄 Running Bronze → Silver pipeline...\n")

pipeline_config = PipelineConfig(**config['pipelines'][0])
project_config = ProjectConfig(**{k: v for k, v in config.items() if k != 'pipelines'})

# Create connection objects (NOT raw dicts from config)
connections = {
    'local': LocalConnection(base_path='./data')
}

# Create pipeline
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections
)
results = pipeline.run()

# Check results
print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")

In [None]:
# Run Silver → Gold pipeline
print("\n🔄 Running Silver → Gold pipeline...\n")

pipeline_config = PipelineConfig(**config['pipelines'][1])
pipeline = Pipeline(
    pipeline_config=pipeline_config,
    engine=project_config.engine,
    connections=connections  # Reuse connection objects from above
)
results = pipeline.run()

print("\n✅ Pipeline completed")
print(f"   Completed nodes: {len(results.completed)}")
print(f"   Failed nodes: {len(results.failed)}")
print(f"   Nodes: {results.completed}")
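
Both run cells above print the same three summary lines. A minimal sketch of a helper that factors this out is shown here. It assumes only what the cells above already rely on: that the object returned by `pipeline.run()` exposes `completed` and `failed` collections. The name `summarize_results` is hypothetical and not part of ODIBI, and the `SimpleNamespace` object (with made-up node names) stands in for a real results object so the sketch runs on its own.

```python
from types import SimpleNamespace


def summarize_results(results, label):
    """Return a short text summary of a pipeline run.

    Hypothetical helper (not part of ODIBI). Assumes `results` has
    `completed` and `failed` attributes that support len(), as used above.
    """
    status = "completed" if len(results.failed) == 0 else "finished with failures"
    lines = [
        f"{label} pipeline {status}",
        f"   Completed nodes: {len(results.completed)}",
        f"   Failed nodes: {len(results.failed)}",
        f"   Nodes: {list(results.completed)}",
    ]
    return "\n".join(lines)


# Stand-in for the object returned by pipeline.run(); node names are made up.
fake_results = SimpleNamespace(completed=["load_sales", "clean_sales"], failed=[])
print(summarize_results(fake_results, "Bronze to Silver"))
```

In the notebook, the same call could take the real `results` object in place of `fake_results`, provided it behaves as assumed.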

## 🔍 Inspect Outputs

Let's examine the data at each layer.

In [None]:
# Check Bronze layer (original CSV)
bronze_data = pd.read_csv('data/bronze/sales.csv')
print("📁 Bronze Layer (Raw Data):")
print(f"   Rows: {len(bronze_data)}")
display(bronze_data)

In [None]:
# Check Silver layer (cleaned Parquet)
silver_data = pd.read_parquet('data/silver/sales.parquet')
print("\n📁 Silver Layer (Cleaned Data):")
print(f"   Rows: {len(silver_data)}")
print(f"   Columns: {list(silver_data.columns)}")
display(silver_data)

In [None]:
# Check Gold layer (aggregated analytics)
gold_data = pd.read_parquet('data/gold/customer_summary.parquet')
print("\n📁 Gold Layer (Customer Analytics):")
print(f"   Rows: {len(gold_data)}")
print(f"   Columns: {list(gold_data.columns)}")
display(gold_data)

## 🪞 Reflect

**What we learned:**
- Created sample data programmatically
- Ran a multi-layer pipeline (Bronze → Silver → Gold)
- Transformed CSV to Parquet format
- Applied SQL-based filtering and aggregation
- Inspected outputs at each layer

**Key concepts:**
- **Bronze:** Raw data, minimal processing
- **Silver:** Cleaned, validated, ready for analysis
- **Gold:** Business-level aggregates and metrics
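
To make the three layers concrete, here is a minimal pandas-only sketch of what a Bronze → Silver → Gold flow over the sample sales data could look like. It is not the logic defined in `example_local.yaml`: the Silver rule (keep rows with a positive amount and parse the dates) and the `transaction_count` column are assumptions for illustration. Only the `total_spent` column name is taken from this notebook's self-check.

```python
import pandas as pd

# Bronze: raw records as they arrive (same values as the sample data above).
bronze = pd.DataFrame({
    "transaction_id": ["T001", "T002", "T003", "T004", "T005", "T006"],
    "customer_id": ["C001", "C001", "C002", "C002", "C003", "C001"],
    "amount": [50.00, 75.50, 120.00, 45.00, 200.00, 30.00],
    "transaction_date": ["2024-01-15", "2024-01-20", "2024-01-22",
                         "2024-01-25", "2024-02-01", "2024-02-05"],
})

# Silver: cleaned and typed. The filter below is an assumed rule,
# not necessarily the one the real pipeline applies.
silver = bronze[bronze["amount"] > 0].copy()
silver["transaction_date"] = pd.to_datetime(silver["transaction_date"])

# Gold: one row per customer with business-level metrics.
gold = (
    silver.groupby("customer_id", as_index=False)
    .agg(total_spent=("amount", "sum"),
         transaction_count=("transaction_id", "count"))
)

print(gold)
```

With the sample data this yields one row for each of the three customers. Every sample amount is positive, so the assumed Silver filter removes nothing here.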

**Next step:**  
Go to **`02_cli_and_testing.ipynb`** to learn about CLI tools and testing (Phase 2 preview).

## ✅ Self-Check

In [None]:
# ✅ Self-Check
try:
    import os
    print("Running self-check...")
    
    # Verify example config exists
    assert os.path.exists("examples/example_local.yaml"), "Missing example_local.yaml"
    
    # Verify data layers were created
    assert os.path.exists("data/bronze/sales.csv"), "Missing Bronze layer"
    assert os.path.exists("data/silver/sales.parquet"), "Missing Silver layer"
    assert os.path.exists("data/gold/customer_summary.parquet"), "Missing Gold layer"
    
    # Verify data integrity
    import pandas as pd
    gold = pd.read_parquet("data/gold/customer_summary.parquet")
    assert len(gold) > 0, "Gold layer has no data"
    assert 'total_spent' in gold.columns, "Missing expected column in Gold layer"
    
    print("✅ Data pipeline ran successfully")
    print(f"   Bronze: {len(pd.read_csv('data/bronze/sales.csv'))} rows")
    print(f"   Silver: {len(pd.read_parquet('data/silver/sales.parquet'))} rows")
    print(f"   Gold: {len(gold)} customers")
    
    print("🎉 Walkthrough 01 verified successfully")
except Exception as e:
    print(f"❌ Walkthrough failed self-check: {e}")
    raise