# 🚀 odibi_de_v2 Framework Evolution Tutorial
## From v1.x (Industry-Specific) to v2.0 (Universal)

**Version:** 2.0  
**Last Updated:** October 29, 2025  
**Target Audience:** Data Engineers, Analytics Engineers, ML Engineers

---

## 📑 Table of Contents

1. [Introduction: What Changed and Why](#1-introduction)
2. [Architecture Overview](#2-architecture)
3. [Step-by-Step Walkthrough](#3-walkthrough)
4. [Examples by Industry](#4-examples)
5. [Troubleshooting Guide](#5-troubleshooting)
6. [Quick Reference](#6-reference)

---

## 1. Introduction: What Changed and Why {#1-introduction}

### 1.1 The Evolution Story

**v1.x - The Ingredion Era:**
- Built specifically for manufacturing use cases
- Hardcoded entity columns: `plant`, `asset`
- Single input/output per transformation
- Manual orchestrator instantiation
- Great for Ingredion, limiting for other domains

**v2.0 - Universal Framework:**
- Works for ANY industry: manufacturing, retail, finance, healthcare, ML
- Generic entity system: `entity_1`, `entity_2`, `entity_3`
- Multiple inputs/constants/outputs (JSON-based)
- One-command execution: `run_project()`
- Project scaffolding: `initialize_project()`
- Full backward compatibility

### 1.2 Why the Refactoring?

**Pain Points in v1.x:**
```python
# ❌ Old way - Too specific
orchestrator = IngredionProjectOrchestrator(
    project="Energy Efficiency",
    env="qat",
    # Hardcoded for manufacturing only
)
result = orchestrator.run(
    repo_path="/path/to/repo",
    layer_order=["Silver_1", "Gold_1"],
    cache_plan={...},
    max_workers=4
)
```

**v2.0 Solution:**
```python
# ✅ New way - Universal and simple
from odibi_de_v2 import run_project

result = run_project(project="Energy Efficiency", env="qat")
```

### 1.3 Key Benefits

| Feature | v1.x | v2.0 | Benefit |
|---------|------|------|----------|
| **Domain Support** | Manufacturing only | Any domain | Reusability |
| **Entity Model** | `plant`, `asset` | `entity_1/2/3` | Flexibility |
| **Inputs** | Single table | Multiple (JSON) | Complex workflows |
| **Constants** | None | JSON object | Parameterization |
| **Outputs** | Single table | Multiple (JSON) | Multi-target pipelines |
| **Execution** | Manual class | `run_project()` | Simplicity |
| **Setup** | Manual | `initialize_project()` | Speed |
| **Lines of Code** | ~50 lines | 1 line | Productivity |

### 1.4 Backward Compatibility

✅ All v1.x projects continue to work  
✅ Legacy view `TransformationConfig_Legacy` provided  
✅ Gradual migration path  
✅ No breaking changes to transformation functions  

---

## 2. Architecture Overview {#2-architecture}

### 2.1 System Architecture

```mermaid
graph TB
    subgraph "Configuration Layer"
        A[TransformationRegistry SQL Table]
        B[Project Manifest JSON]
        C[IngestionSourceConfig]
    end
    
    subgraph "Orchestration Layer"
        D[run_project API]
        E[TransformationRunner]
        F[Layer Executor]
    end
    
    subgraph "Execution Layer"
        G[Bronze Ingestion]
        H[Silver Transformations]
        I[Gold Aggregations]
    end
    
    subgraph "Storage Layer"
        J[Delta Lake]
        K[Azure Blob Storage]
        L[SQL Database]
    end
    
    A --> D
    B --> D
    C --> G
    D --> E
    E --> F
    F --> G
    F --> H
    F --> I
    G --> J
    H --> J
    I --> J
    J --> K
    A --> L
```

### 2.2 TransformationRegistry Schema

**The Heart of Configuration**

```sql
CREATE TABLE TransformationRegistry (
    -- Identity
    transformation_id VARCHAR(100) PRIMARY KEY,     -- Unique ID
    transformation_group_id VARCHAR(100),           -- Logical grouping
    
    -- Scoping
    project VARCHAR(100) NOT NULL,                  -- Project name
    environment VARCHAR(20) NOT NULL,               -- qat, prod, dev
    
    -- Execution Control
    layer VARCHAR(50) NOT NULL,                     -- Bronze, Silver_1, Gold_1
    step INT NOT NULL DEFAULT 1,                    -- Execution order within layer
    enabled BIT NOT NULL DEFAULT 1,                 -- On/off switch
    
    -- Generic Entity Hierarchy (Domain-Agnostic)
    entity_1 VARCHAR(100),                          -- Top-level entity
    entity_2 VARCHAR(100),                          -- Mid-level entity
    entity_3 VARCHAR(100),                          -- Detail-level entity
    
    -- Transformation Logic
    module VARCHAR(255) NOT NULL,                   -- Python module path
    [function] VARCHAR(255) NOT NULL,               -- Function name (bracketed: FUNCTION is a reserved T-SQL keyword)
    
    -- Flexible I/O (JSON)
    inputs NVARCHAR(MAX),                           -- ["table1", "table2"]
    constants NVARCHAR(MAX),                        -- {"threshold": 100}
    outputs NVARCHAR(MAX),                          -- [{"table": "...", "mode": "..."}]
    
    -- Metadata
    description NVARCHAR(500),
    created_at DATETIME,
    updated_at DATETIME
);
```

**Key Concepts:**

1. **Generic Entities**: Adapt to any domain
   - Manufacturing: `entity_1=plant`, `entity_2=asset`, `entity_3=equipment`
   - Retail: `entity_1=region`, `entity_2=store`, `entity_3=department`
   - Finance: `entity_1=business_unit`, `entity_2=product`, `entity_3=account_type`

2. **JSON Fields**: Maximum flexibility
   - `inputs`: Array of table names or query objects
   - `constants`: Parameters passed to transformation function
   - `outputs`: Array of output configurations

3. **Layer-Based Execution**: Medallion architecture
   - Layers run in sequence: Bronze → Silver → Gold
   - Within layer, transformations run by `step` order
   - Parallel execution within same step
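
To make the JSON columns and the `step` ordering concrete, here is a minimal sketch that decodes a registry row and groups rows into step batches. It is not the framework's own code: `parse_row` and `group_by_step` are hypothetical helpers, and the rows are plain dictionaries standing in for SQL results.

```python
import json
from collections import defaultdict


def parse_row(row):
    """Decode the JSON text columns of one registry row (hypothetical helper)."""
    parsed = dict(row)
    # A NULL column is treated as an empty array/object.
    parsed["inputs"] = json.loads(row.get("inputs") or "[]")
    parsed["constants"] = json.loads(row.get("constants") or "{}")
    parsed["outputs"] = json.loads(row.get("outputs") or "[]")
    return parsed


def group_by_step(rows):
    """Group enabled rows into (step, rows) batches, lowest step first."""
    batches = defaultdict(list)
    for row in rows:
        if row.get("enabled", 1):
            batches[row.get("step", 1)].append(row)
    return sorted(batches.items())


rows = [
    {"transformation_id": "combine", "step": 2, "enabled": 1,
     "inputs": '["a_silver", "b_silver"]', "constants": None,
     "outputs": '[{"table": "ab_gold", "mode": "overwrite"}]'},
    {"transformation_id": "clean-a", "step": 1, "enabled": 1,
     "inputs": '["a_bronze"]', "constants": '{"threshold": 100}',
     "outputs": '[{"table": "a_silver", "mode": "overwrite"}]'},
    {"transformation_id": "clean-b", "step": 1, "enabled": 0,
     "inputs": '["b_bronze"]', "constants": None, "outputs": None},
]

parsed_rows = [parse_row(r) for r in rows]
batches = group_by_step(parsed_rows)
for step, batch in batches:
    print(step, [r["transformation_id"] for r in batch])
```

Rows that land in the same batch are the ones eligible to run in parallel; the disabled row (`clean-b`) is dropped before grouping.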

### 2.3 Manifest System

**Project-Level Configuration**

```json
{
  "project_name": "Energy Efficiency",
  "project_type": "manufacturing",
  "layer_order": ["Bronze", "Silver_1", "Silver_2", "Gold_1", "Gold_2"],
  "entity_labels": {
    "entity_1": "plant",
    "entity_2": "asset",
    "entity_3": "equipment"
  },
  "cache_plan": {
    "Gold_1": ["combined_dryers", "combined_boilers"]
  },
  "environments": ["qat", "prod"],
  "default_env": "qat"
}
```

**Manifest Benefits:**
- Centralized project configuration
- Version-controlled (Git)
- Self-documenting
- Environment-specific overrides
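
Because the manifest is plain JSON, basic consistency checks are easy to script. The sketch below is one possible check, not something the framework is known to ship: `check_manifest` and the set of required keys are assumptions chosen to match the example manifest above.

```python
# Assumed minimum set of keys, based on the example manifest in this section.
REQUIRED_KEYS = ("project_name", "layer_order", "environments", "default_env")


def check_manifest(manifest):
    """Return a list of problems found in a manifest dict (empty means OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    if problems:
        return problems
    layers = manifest["layer_order"]
    if len(set(layers)) != len(layers):
        problems.append("layer_order contains duplicates")
    if manifest["default_env"] not in manifest["environments"]:
        problems.append("default_env is not listed in environments")
    for layer in manifest.get("cache_plan", {}):
        if layer not in layers:
            problems.append(f"cache_plan references unknown layer: {layer}")
    return problems


manifest = {
    "project_name": "Energy Efficiency",
    "project_type": "manufacturing",
    "layer_order": ["Bronze", "Silver_1", "Silver_2", "Gold_1", "Gold_2"],
    "cache_plan": {"Gold_1": ["combined_dryers", "combined_boilers"]},
    "environments": ["qat", "prod"],
    "default_env": "qat",
}

problems = check_manifest(manifest)
print(problems or "manifest looks consistent")
```

A check like this fits naturally in a pre-commit hook or CI step, since the manifest lives in Git.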

### 2.4 Component Interaction Flow

```mermaid
sequenceDiagram
    participant User
    participant API as run_project()
    participant Manifest
    participant Registry as TransformationRegistry
    participant Runner as TransformationRunner
    participant Spark
    participant Delta as Delta Lake
    
    User->>API: run_project("Energy Efficiency", "qat")
    API->>Manifest: Load manifest.json
    Manifest-->>API: layer_order, entity_labels, cache_plan
    
    loop For each layer in layer_order
        API->>Registry: SELECT * FROM TransformationRegistry WHERE project='...' AND layer='...'
        Registry-->>API: transformation configs
        API->>Runner: Execute transformations
        
        loop For each transformation
            Runner->>Spark: Import module.function
            Runner->>Spark: Execute with inputs/constants
            Spark->>Delta: Write outputs
            Delta-->>Runner: Success
        end
        
        Runner-->>API: Layer complete
    end
    
    API-->>User: {status: 'SUCCESS', layers_run: [...]}
```
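
The flow in the diagram can be approximated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not the actual `run_project()` implementation: `run_layers` is a hypothetical function, the `resolve` argument stands in for the dynamic import step, and each transformation receives its whole registry row rather than separate `inputs`/`constants`/`outputs` arguments.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def run_layers(layer_order, rows, resolve, max_workers=4):
    """Run rows layer by layer; rows sharing a step run concurrently."""
    layers_run = []
    for layer in layer_order:
        layer_rows = sorted(
            (r for r in rows if r["layer"] == layer), key=lambda r: r["step"]
        )
        for _, batch in groupby(layer_rows, key=lambda r: r["step"]):
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                futures = [
                    pool.submit(resolve(r["module"], r["function"]), r)
                    for r in batch
                ]
                for future in futures:
                    future.result()  # re-raises a failure, which stops the run
        if layer_rows:
            layers_run.append(layer)
    return {"status": "SUCCESS", "layers_run": layers_run}


executed = []


def fake_transform(row):
    """Stand-in for a real transformation; just records that it ran."""
    executed.append(row["transformation_id"])


def make_row(tid, layer, step):
    return {"transformation_id": tid, "layer": layer, "step": step,
            "module": "demo", "function": "fake_transform"}


rows = [
    make_row("gold-a", "Gold_1", 1),
    make_row("silver-c", "Silver_1", 2),
    make_row("silver-a", "Silver_1", 1),
    make_row("silver-b", "Silver_1", 1),
    make_row("bronze-a", "Bronze", 1),
]

result = run_layers(
    ["Bronze", "Silver_1", "Gold_1"],
    rows,
    resolve=lambda module, function: fake_transform,
)
print(result)
```

Creating one executor per step batch acts as a barrier: every transformation in step 1 finishes before step 2 starts, which matches the ordering rules in section 2.2.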

---

## 3. Step-by-Step Walkthrough {#3-walkthrough}

### Cell 1: Import the New API

In [None]:
# Import the two main functions you'll use
from odibi_de_v2 import run_project, initialize_project

# Optional: Import utilities for health checks and helpers
from odibi_de_v2.utils import quick_health_check, print_project_summary
from odibi_de_v2.config import TransformationRegistryUI

print("✅ odibi_de_v2 v2.0 imported successfully!")
print("\nMain functions available:")
print("  - initialize_project(name, type): Create new project")
print("  - run_project(project, env): Run pipeline")
print("\nHelper utilities:")
print("  - quick_health_check(): Validate configuration")
print("  - print_project_summary(): View project status")
print("  - TransformationRegistryUI(): Interactive config editor")

### Cell 2: Understanding TransformationRegistry Table

**The TransformationRegistry is where all transformation logic is configured.**

Let's explore the schema and see example records:

In [None]:
# Connect to your SQL provider (replace with your actual connection)
from your_sql_provider import get_sql_provider

sql_provider = get_sql_provider()

# View the schema
print("📋 TransformationRegistry Schema:")
print("""
Key Fields:
  • transformation_id: Unique identifier (e.g., 'energy-argo-boilers-silver')
  • project: Project name (e.g., 'Energy Efficiency')
  • environment: qat, prod, dev
  • layer: Bronze, Silver_1, Silver_2, Gold_1, etc.
  • entity_1/2/3: Generic hierarchy (adapt to your domain)
  • module: Python module path (e.g., 'silver.functions')
  • function: Function name (e.g., 'process_argo_boilers')
  • inputs: JSON array ["table1", "table2"]
  • constants: JSON object {"threshold": 100}
  • outputs: JSON array [{"table": "...", "mode": "overwrite"}]
""")

# Query some example records
query = """
SELECT TOP 3
    transformation_id,
    project,
    layer,
    entity_1,
    entity_2,
    [function],
    inputs,
    outputs
FROM TransformationRegistry
WHERE enabled = 1
ORDER BY layer, step
"""

# Execute query
# df = sql_provider.execute_query(query)
# display(df)

print("\n💡 Tip: Use TransformationRegistryUI() to edit configs interactively!")

### Cell 3: Understanding Project Manifests

**Manifests define project-level configuration in JSON format.**

In [None]:
import json

# Example manifest structure
example_manifest = {
    "project_name": "Energy Efficiency",
    "project_type": "manufacturing",
    "description": "Manufacturing energy analytics pipeline",
    "version": "1.0.0",
    
    # Layer execution order
    "layer_order": ["Bronze", "Silver_1", "Silver_2", "Gold_1", "Gold_2"],
    
    # Entity labels (domain-specific naming)
    "entity_labels": {
        "entity_1": "plant",
        "entity_2": "asset",
        "entity_3": "equipment"
    },
    
    # Caching strategy
    "cache_plan": {
        "Gold_1": ["combined_dryers", "combined_boilers"]
    },
    
    # Environments
    "environments": ["qat", "prod"],
    "default_env": "qat"
}

print("📄 Example Project Manifest:")
print(json.dumps(example_manifest, indent=2))

print("\n🔑 Key Manifest Components:")
print("  1. layer_order: Defines execution sequence")
print("  2. entity_labels: Maps generic entities to domain terms")
print("  3. cache_plan: Specifies which tables to cache per layer")
print("  4. environments: Supported environments (qat, prod, dev)")

print("\n💡 Manifests are auto-generated by initialize_project()!")

### Cell 4: How to Create New Projects with initialize_project()

**The fastest way to start a new data pipeline.**

In [None]:
from odibi_de_v2 import initialize_project
from odibi_de_v2.project import ProjectType

# Example 1: Manufacturing Project
result_manufacturing = initialize_project(
    project_name="Plant Reliability",
    project_type=ProjectType.MANUFACTURING,
    description="Predictive maintenance and reliability analytics"
)

print("✅ Manufacturing project created!")
print(f"   Path: {result_manufacturing['project_path']}")
print(f"   Manifest: {result_manufacturing['manifest_path']}")

# Example 2: Analytics Project
result_analytics = initialize_project(
    project_name="Customer Churn",
    project_type=ProjectType.ANALYTICS,
    description="Customer churn prediction pipeline"
)

print("\n✅ Analytics project created!")
print(f"   Path: {result_analytics['project_path']}")

# Example 3: ML Pipeline
result_ml = initialize_project(
    project_name="Sales Forecasting",
    project_type=ProjectType.ML_PIPELINE,
    description="Sales forecasting with ML models"
)

print("\n✅ ML Pipeline project created!")
print(f"   Path: {result_ml['project_path']}")

print("""
\n📁 Project Structure Created:
project_name/
├── manifest.json              # Project configuration
├── transformations/           # Transformation modules
│   ├── bronze/
│   ├── silver/
│   └── gold/
├── sql/                       # SQL scripts
│   └── ddl/
├── notebooks/                 # Databricks notebooks
├── config/                    # Config files
├── tests/                     # Unit tests
└── README.md                  # Auto-generated documentation
""")

print("\n💡 Next step: Add transformations to TransformationRegistry table!")

### Cell 5: How to Run Projects with run_project()

**One command to rule them all.**

In [None]:
from odibi_de_v2 import run_project

# Example 1: Run entire pipeline (all layers)
print("🚀 Example 1: Full Pipeline Execution")
result = run_project(
    project="Energy Efficiency",
    env="qat"
)
print(f"Status: {result['status']}")
print(f"Layers run: {result['layers_run']}")
print(f"Duration: {result['duration_seconds']:.2f}s")

# Example 2: Run specific layers only
print("\n🎯 Example 2: Target Specific Layers")
result = run_project(
    project="Energy Efficiency",
    env="qat",
    target_layers=["Silver_1", "Gold_1"]  # Only these layers
)
print(f"Layers run: {result['layers_run']}")

# Example 3: With logging and caching
print("\n📊 Example 3: Advanced Options")
result = run_project(
    project="Energy Efficiency",
    env="qat",
    log_level="INFO",
    save_logs=True,
    cache_plan={"Gold_1": ["combined_dryers"]}  # Cache specific tables
)

# Example 4: Custom authentication
print("\n🔐 Example 4: Custom Authentication")

def custom_auth_provider(env, repo_path, logger_metadata):
    """Custom authentication logic."""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    sql_provider = get_sql_provider()  # Your SQL provider
    return {"spark": spark, "sql_provider": sql_provider}

result = run_project(
    project="Energy Efficiency",
    env="qat",
    auth_provider=custom_auth_provider
)

# Example 5: Dry run (validation only)
print("\n✅ Example 5: Validation Mode")
result = run_project(
    project="Energy Efficiency",
    env="qat",
    dry_run=True  # Check config without executing
)
print(f"Validation status: {result['status']}")
print(f"Transformations found: {result['transformation_count']}")
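
For intuition about what a validation-only run could check, here is a rough sketch. It is an assumption about the kind of checks involved, not a description of what `dry_run=True` does internally: `module_exists` and `dry_run_check` are hypothetical helpers, and the `"FAILED"` status value is invented for the example.

```python
import importlib.util
import json


def module_exists(name):
    """True if `name` can be located (parent packages are imported to find it)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        return False


def dry_run_check(rows):
    """Validate registry rows without executing them (hypothetical helper)."""
    errors = []
    for row in rows:
        tid = row["transformation_id"]
        if not module_exists(row["module"]):
            errors.append(f"{tid}: module not found: {row['module']}")
        for field in ("inputs", "constants", "outputs"):
            try:
                json.loads(row.get(field) or "null")
            except json.JSONDecodeError as exc:
                errors.append(f"{tid}: invalid JSON in {field}: {exc.msg}")
    return {
        "status": "FAILED" if errors else "SUCCESS",
        "transformation_count": len(rows),
        "errors": errors,
    }


rows = [
    # The stdlib `json` module stands in for a real transformation module.
    {"transformation_id": "ok-row", "module": "json", "function": "dumps",
     "inputs": '["table1"]', "outputs": '[{"table": "t", "mode": "overwrite"}]'},
    {"transformation_id": "bad-row", "module": "no_such_package_xyz.functions",
     "function": "f", "inputs": "table1"},
]

report = dry_run_check(rows)
print(report["status"], report["transformation_count"])
for error in report["errors"]:
    print(" ", error)
```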

### Cell 6: Working with the Config UI

**Interactive configuration editor for TransformationRegistry.**

In [None]:
from odibi_de_v2.config import TransformationRegistryUI, TransformationRegistryBrowser

# Create new transformation config interactively
print("🎨 Interactive Config Editor")
ui = TransformationRegistryUI(
    project="Energy Efficiency",
    env="qat"
)

# This will render an interactive form
# ui.render()

print("""
Features:
  ✅ Visual field editor
  ✅ JSON validation (inputs, constants, outputs)
  ✅ Auto-generate SQL INSERT statement
  ✅ Copy to clipboard
  ✅ Pre-filled templates
""")

# Browse existing configurations
print("\n📚 Browse Existing Configs")
browser = TransformationRegistryBrowser(
    sql_provider=sql_provider,
    project="Energy Efficiency",
    env="qat"
)

# This will render a filterable table
# browser.render()

print("""
Features:
  ✅ Filter by layer, entity, enabled status
  ✅ Search transformations
  ✅ Quick edit/delete
  ✅ Export to CSV/JSON
""")

### Cell 7: Migration from Old to New

**Step-by-step migration guide for v1.x projects.**

In [None]:
# MIGRATION WORKFLOW
print("📋 Migration Checklist:\n")

# Step 1: Create TransformationRegistry table
print("✅ Step 1: Create TransformationRegistry Table")
ddl_script = """
-- Run this SQL script.
-- T-SQL has no CREATE TABLE IF NOT EXISTS, so guard with OBJECT_ID instead.
IF OBJECT_ID('TransformationRegistry', 'U') IS NULL
CREATE TABLE TransformationRegistry (
    transformation_id VARCHAR(100) PRIMARY KEY,
    project VARCHAR(100) NOT NULL,
    environment VARCHAR(20) NOT NULL,
    layer VARCHAR(50) NOT NULL,
    entity_1 VARCHAR(100),
    entity_2 VARCHAR(100),
    module VARCHAR(255) NOT NULL,
    [function] VARCHAR(255) NOT NULL,
    inputs NVARCHAR(MAX),
    constants NVARCHAR(MAX),
    outputs NVARCHAR(MAX),
    -- ... (see sql/ddl/01_transformation_registry.sql)
);
"""
print(ddl_script)

# Step 2: Migrate data from old TransformationConfig
print("\n✅ Step 2: Migrate Existing Data")
migration_sql = """
INSERT INTO TransformationRegistry (
    transformation_id,
    project,
    environment,
    layer,
    entity_1,
    entity_2,
    module,
    [function],
    inputs,
    outputs
)
SELECT
    'energy-' + LOWER(REPLACE(plant + '-' + asset, ' ', '-')) + '-' + layer,
    project,
    env AS environment,
    layer,
    plant AS entity_1,
    asset AS entity_2,
    module,
    [function],
    JSON_QUERY('[' + QUOTENAME(input_table, '"') + ']'),
    JSON_QUERY('[{"table":"' + target_table + '","mode":"overwrite"}]')
FROM TransformationConfig
WHERE project = 'energy efficiency';
"""
print(migration_sql)

# Step 3: Create manifest
print("\n✅ Step 3: Create Project Manifest")
print("# Use initialize_project() or create manually")
print("result = initialize_project('Energy Efficiency', 'manufacturing')")

# Step 4: Update code
print("\n✅ Step 4: Update Orchestration Code")
print("""# Before (v1.x):
orchestrator = IngredionProjectOrchestrator(...)
orchestrator.run(...)

# After (v2.0):
run_project(project='Energy Efficiency', env='qat')
""")

# Step 5: Test
print("\n✅ Step 5: Test")
print("run_project(project='Energy Efficiency', env='qat', target_layers=['Bronze'])")

print("\n🎉 Migration complete! See MIGRATION_GUIDE.md for details.")
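
If you prefer to migrate rows from Python rather than in SQL, the same mapping can be sketched as below. This mirrors the `INSERT ... SELECT` in Step 2 and assumes the v1.x columns it references (`plant`, `asset`, `env`, `input_table`, `target_table`); `to_registry_row` is a hypothetical helper, not part of the framework. Using `json.dumps` avoids hand-built JSON strings.

```python
import json


def to_registry_row(old):
    """Map one v1.x TransformationConfig row (a dict) to the v2.0 column layout."""
    # Same ID recipe as the SQL: lowercase plant-asset with spaces as hyphens.
    slug = f"{old['plant']}-{old['asset']}".replace(" ", "-").lower()
    return {
        "transformation_id": f"energy-{slug}-{old['layer']}",
        "project": old["project"],
        "environment": old["env"],
        "layer": old["layer"],
        "entity_1": old["plant"],
        "entity_2": old["asset"],
        "module": old["module"],
        "function": old["function"],
        "inputs": json.dumps([old["input_table"]]),
        "outputs": json.dumps([{"table": old["target_table"], "mode": "overwrite"}]),
    }


old_row = {
    "project": "energy efficiency",
    "env": "qat",
    "layer": "Silver_1",
    "plant": "Argo",
    "asset": "Boiler 6,7,8,10",
    "module": "silver.functions",
    "function": "process_argo_boilers",
    "input_table": "qat_energy_efficiency.argo_boilers_bronze",
    "target_table": "qat_energy_efficiency.argo_boilers_silver",
}

new_row = to_registry_row(old_row)
for key, value in new_row.items():
    print(f"{key}: {value}")
```

Note that the generated ID keeps the layer's original casing, exactly as the SQL expression does.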

### Cell 8: Advanced Patterns

**Custom auth, multi-environment, caching strategies.**

In [None]:
# ADVANCED PATTERN 1: Multi-Environment Deployment
print("🌍 Pattern 1: Multi-Environment Deployment\n")

environments = ["dev", "qat", "prod"]

for env in environments:
    print(f"Deploying to {env}...")
    result = run_project(
        project="Customer Churn",
        env=env,
        target_layers=["Silver_1"],  # Deploy incrementally
        dry_run=(env == "prod")  # prod is validated only (dry run), not executed
    )
    print(f"  {env}: {result['status']}\n")

# ADVANCED PATTERN 2: Smart Caching
print("💾 Pattern 2: Layer-Specific Caching\n")

cache_strategy = {
    "Silver_1": [],  # No caching for Silver
    "Gold_1": ["large_aggregation_table"],  # Cache expensive tables
    "Gold_2": ["kpi_summary", "reporting_table"]
}

result = run_project(
    project="Energy Efficiency",
    env="qat",
    cache_plan=cache_strategy
)

# ADVANCED PATTERN 3: Custom Auth with Secrets
print("\n🔐 Pattern 3: Custom Authentication with Azure Key Vault\n")

def azure_keyvault_auth(env, repo_path, logger_metadata):
    """Fetch credentials from Azure Key Vault."""
    from azure.keyvault.secrets import SecretClient
    from azure.identity import DefaultAzureCredential
    from pyspark.sql import SparkSession
    
    # Get secrets
    credential = DefaultAzureCredential()
    client = SecretClient(
        vault_url="https://my-vault.vault.azure.net/",
        credential=credential
    )
    
    db_password = client.get_secret(f"db-password-{env}").value
    
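    # Create Spark session with secrets.
    # NOTE: "spark.sql.password" is a placeholder key, not a built-in Spark setting;
    # substitute the option your connector actually reads.
    spark = SparkSession.builder \
        .config("spark.sql.password", db_password) \
        .getOrCreate()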
    
    sql_provider = get_sql_provider(password=db_password)
    
    return {"spark": spark, "sql_provider": sql_provider}

# Use custom auth
# result = run_project(
#     project="Secure Project",
#     env="prod",
#     auth_provider=azure_keyvault_auth
# )

# ADVANCED PATTERN 4: Parameterized Transformations
print("\n⚙️ Pattern 4: Dynamic Constants per Environment\n")

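# Registry rows are scoped by `environment`, so each environment's row carries
# its own `constants` JSON. The dict below shows the two rows' constants side by side:
constants_example = {
    "qat": {"threshold": 100, "sample_rate": 0.1},
    "prod": {"threshold": 1000, "sample_rate": 1.0}
}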

print(f"QAT constants: {constants_example['qat']}")
print(f"PROD constants: {constants_example['prod']}")

# ADVANCED PATTERN 5: Incremental Layers
print("\n📈 Pattern 5: Incremental Layer Execution\n")

layers_to_run = ["Bronze", "Silver_1"]

# First run: Bronze + Silver_1
result = run_project(
    project="Energy Efficiency",
    env="qat",
    target_layers=layers_to_run
)

if result['status'] == 'SUCCESS':
    # Second run: only the new layer, since Bronze and Silver_1 already succeeded
    result = run_project(
        project="Energy Efficiency",
        env="qat",
        target_layers=["Gold_1"]
    )

print("\n💡 See docs/advanced_patterns.md for more examples!")
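
Pattern 4 can also be handled in code when you want shared defaults with per-environment overrides. The helper below is a minimal sketch of that idea; `resolve_constants` and the `unit` default are hypothetical and not part of the framework's API.

```python
ENV_CONSTANTS = {
    "qat": {"threshold": 100, "sample_rate": 0.1},
    "prod": {"threshold": 1000, "sample_rate": 1.0},
}


def resolve_constants(constants_by_env, env, defaults=None):
    """Merge shared defaults with one environment's overrides."""
    if env not in constants_by_env:
        raise KeyError(f"no constants defined for environment {env!r}")
    merged = dict(defaults or {})
    merged.update(constants_by_env[env])  # environment values win over defaults
    return merged


qat_constants = resolve_constants(ENV_CONSTANTS, "qat", defaults={"unit": "kWh"})
prod_constants = resolve_constants(ENV_CONSTANTS, "prod", defaults={"unit": "kWh"})
print(qat_constants)
print(prod_constants)
```

Failing loudly on an unknown environment is deliberate: silently falling back to defaults could let a production run use test thresholds.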

---

## 4. Examples by Industry {#4-examples}

### 4.1 Manufacturing: Energy Efficiency

**Use Case:** Track energy consumption across plants and assets.

In [None]:
# Manufacturing Example: Energy Efficiency
print("🏭 Manufacturing Example: Energy Efficiency\n")

# Entity mapping for manufacturing
entity_mapping = {
    "entity_1": "plant",      # Argo, Cedar Rapids, Winston Salem
    "entity_2": "asset",      # Boilers, Dryers, Compressors
    "entity_3": "equipment"   # Individual equipment IDs
}

print("Entity Hierarchy:")
print(f"  {entity_mapping['entity_1']} → {entity_mapping['entity_2']} → {entity_mapping['entity_3']}")
print("  Example: Argo → Boiler 6,7,8,10 → Boiler #7\n")

# Sample TransformationRegistry record
sample_config = {
    "transformation_id": "energy-argo-boilers-silver",
    "project": "energy efficiency",
    "environment": "qat",
    "layer": "Silver_1",
    "entity_1": "Argo",
    "entity_2": "Boiler 6,7,8,10",
    "module": "silver.functions",
    "function": "process_argo_boilers",
    "inputs": '["qat_energy_efficiency.argo_boilers_bronze"]',
    "outputs": '[{"table": "qat_energy_efficiency.argo_boilers_silver", "mode": "overwrite"}]'
}

print("Sample Configuration:")
for key, value in sample_config.items():
    print(f"  {key}: {value}")

# Run the pipeline
print("\n🚀 Running Energy Efficiency pipeline...")
result = run_project(
    project="Energy Efficiency",
    env="qat",
    target_layers=["Bronze", "Silver_1", "Gold_1"]
)

print(f"\nResult: {result['status']}")
print(f"Layers processed: {result['layers_run']}")
print(f"Duration: {result['duration_seconds']:.2f}s")

### 4.2 Analytics: Customer Churn

**Use Case:** Predict customer churn using behavioral features.

In [None]:
# Analytics Example: Customer Churn
print("📊 Analytics Example: Customer Churn\n")

# Entity mapping for analytics
entity_mapping = {
    "entity_1": "region",      # North America, Europe, Asia
    "entity_2": "segment",     # Retail, Enterprise, SMB
    "entity_3": "channel"      # Online, In-store, Partner
}

print("Entity Hierarchy:")
print(f"  {entity_mapping['entity_1']} → {entity_mapping['entity_2']} → {entity_mapping['entity_3']}")
print(f"  Example: North America → Retail → Online\n")

# Sample TransformationRegistry record
sample_config = {
    "transformation_id": "churn-features-silver",
    "project": "customer churn",
    "environment": "qat",
    "layer": "Silver_1",
    "entity_1": "North America",
    "entity_2": "Retail",
    "entity_3": "Online",
    "module": "churn.features",
    "function": "calculate_customer_features",
    "inputs": '["qat_churn.customer_transactions_bronze", "qat_churn.customer_profile_bronze"]',
    "constants": '{"lookback_days": 90, "min_transactions": 5}',
    "outputs": '[{"table": "qat_churn.customer_features_silver", "mode": "overwrite"}]'
}

print("Sample Configuration:")
for key, value in sample_config.items():
    print(f"  {key}: {value}")

# Sample transformation function
print("\n📝 Sample Transformation Function:")
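sample_function = '''
def calculate_customer_features(spark=None, inputs=None, constants=None, outputs=None, **kwargs):
    """Calculate customer behavior features."""
    from pyspark.sql import functions as F

    # Read inputs
    transactions = spark.table(inputs[0])
    profiles = spark.table(inputs[1])

    # Get constants (the registry column may be NULL, so fall back to an empty dict)
    constants = constants or {}
    lookback_days = constants.get('lookback_days', 90)
    min_transactions = constants.get('min_transactions', 5)

    # Keep only transactions inside the lookback window.
    # Assumption: the transactions table has a `transaction_date` column.
    cutoff = F.date_sub(F.current_date(), lookback_days)
    recent = transactions.filter(F.col("transaction_date") >= cutoff)

    # Calculate features. Parentheses replace backslash continuations, which a
    # non-raw string literal would swallow when this sample is printed.
    features = (
        recent.join(profiles, "customer_id")
        .groupBy("customer_id")
        .agg(
            F.count("transaction_id").alias("transaction_count"),
            F.sum("amount").alias("total_spend"),
            F.avg("amount").alias("avg_transaction"),
        )
        .filter(F.col("transaction_count") >= min_transactions)
    )

    # Save output
    features.write.mode(outputs[0]['mode']).saveAsTable(outputs[0]['table'])

    return features
'''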
print(sample_function)

# Run the pipeline
print("\n🚀 Running Customer Churn pipeline...")
# result = run_project(project="Customer Churn", env="qat")
print("(Uncomment to run)")

### 4.3 Custom Project: Supply Chain Optimization

**Use Case:** Optimize inventory across warehouses and distribution centers.

In [None]:
# Custom Example: Supply Chain Optimization
print("📦 Custom Example: Supply Chain Optimization\n")

# Step 1: Initialize project
result = initialize_project(
    project_name="Supply Chain Optimization",
    project_type="custom",
    description="Inventory optimization across distribution network"
)

print(f"✅ Project created at: {result['project_path']}\n")

# Step 2: Define custom entity hierarchy
entity_mapping = {
    "entity_1": "region",          # West, East, Central
    "entity_2": "warehouse",       # Warehouse ID
    "entity_3": "product_category" # Electronics, Apparel, etc.
}

print("Custom Entity Hierarchy:")
print(f"  {entity_mapping['entity_1']} → {entity_mapping['entity_2']} → {entity_mapping['entity_3']}")
print(f"  Example: West → WH-001 → Electronics\n")

# Step 3: Update manifest with custom entities
manifest_updates = {
    "entity_labels": entity_mapping,
    "layer_order": ["Bronze", "Silver_Inventory", "Silver_Demand", "Gold_Optimization"],
    "cache_plan": {
        "Gold_Optimization": ["optimal_stock_levels"]
    }
}

print("Manifest Configuration:")
import json
print(json.dumps(manifest_updates, indent=2))

# Step 4: Add transformations to TransformationRegistry
print("\n\nSample SQL to Add Transformations:")
sample_sql = """
INSERT INTO TransformationRegistry (
    transformation_id, project, environment, layer,
    entity_1, entity_2, entity_3,
    module, [function], inputs, outputs
)
VALUES (
    'supply-west-wh001-inventory',
    'supply chain optimization',
    'qat',
    'Silver_Inventory',
    'West',
    'WH-001',
    'Electronics',
    'supply_chain.transformations.inventory',
    'calculate_current_stock',
    '["qat_supply.inventory_bronze", "qat_supply.shipments_bronze"]',
    '[{"table": "qat_supply.current_stock_silver", "mode": "overwrite"}]'
);
"""
print(sample_sql)

# Step 5: Run the pipeline
print("\n🚀 Running Supply Chain pipeline...")
# result = run_project(project="Supply Chain Optimization", env="qat")
print("(Configure transformations first, then uncomment to run)")
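
Hand-written JSON inside SQL string literals is easy to mis-quote. As an alternative, the sketch below inserts the same row with bound parameters and `json.dumps`. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; with your real database driver the placeholder style and column types may differ.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real SQL database
conn.execute(
    """
    CREATE TABLE TransformationRegistry (
        transformation_id TEXT PRIMARY KEY,
        project TEXT NOT NULL,
        environment TEXT NOT NULL,
        layer TEXT NOT NULL,
        entity_1 TEXT, entity_2 TEXT, entity_3 TEXT,
        module TEXT NOT NULL,
        "function" TEXT NOT NULL,
        inputs TEXT,
        outputs TEXT
    )
    """
)

row = (
    "supply-west-wh001-inventory",
    "supply chain optimization",
    "qat",
    "Silver_Inventory",
    "West", "WH-001", "Electronics",
    "supply_chain.transformations.inventory",
    "calculate_current_stock",
    json.dumps(["qat_supply.inventory_bronze", "qat_supply.shipments_bronze"]),
    json.dumps([{"table": "qat_supply.current_stock_silver", "mode": "overwrite"}]),
)

# One "?" placeholder per column, in table order.
placeholders = ", ".join("?" * len(row))
conn.execute(f"INSERT INTO TransformationRegistry VALUES ({placeholders})", row)

stored = conn.execute(
    "SELECT inputs, outputs FROM TransformationRegistry WHERE transformation_id = ?",
    ("supply-west-wh001-inventory",),
).fetchone()
stored_inputs = json.loads(stored[0])
stored_outputs = json.loads(stored[1])
print(stored_inputs)
print(stored_outputs)
```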

---

## 5. Troubleshooting Guide {#5-troubleshooting}

### 5.1 Common Issues and Solutions

In [None]:
# TROUBLESHOOTING GUIDE

print("🔧 Common Issues and Solutions\n")
print("=" * 60)

# Issue 1: Manifest not found
print("\n❌ Issue 1: 'No manifest found for project'")
print("\n✅ Solution:")
print("""
# Option A: Create manifest
initialize_project("Project Name", "project_type")

# Option B: Specify path explicitly
run_project(
    project="Project Name",
    manifest_path="/path/to/manifest.json"
)
""")

# Issue 2: TransformationRegistry table not found
print("\n" + "=" * 60)
print("\n❌ Issue 2: 'TransformationRegistry table does not exist'")
print("\n✅ Solution:")
print("""
# Run the DDL script:
# See: odibi_de_v2/sql/ddl/01_transformation_registry.sql

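-- T-SQL has no CREATE TABLE IF NOT EXISTS; guard with OBJECT_ID instead
IF OBJECT_ID('TransformationRegistry', 'U') IS NULL
CREATE TABLE TransformationRegistry (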
    transformation_id VARCHAR(100) PRIMARY KEY,
    project VARCHAR(100) NOT NULL,
    -- ... (see DDL file for complete schema)
);
""")

# Issue 3: JSON parsing error
print("\n" + "=" * 60)
print("\n❌ Issue 3: 'Invalid JSON in inputs/constants/outputs'")
print("\n✅ Solution:")
print("""
# ❌ WRONG:
inputs: "table1"
constants: "threshold=100"

# ✅ CORRECT:
inputs: '["table1"]'                      # JSON array
constants: '{"threshold": 100}'           # JSON object
outputs: '[{"table": "...", "mode": "overwrite"}]'  # JSON array of objects

# Validate JSON online: https://jsonlint.com/
""")

# Issue 4: Module not found
print("\n" + "=" * 60)
print("\n❌ Issue 4: 'ModuleNotFoundError: No module named ...'")
print("\n✅ Solution:")
print("""
# Check module path in TransformationRegistry
# Ensure it matches your Python import structure

# Example:
# File: energy_efficiency/transformations/silver/functions.py
# Module in DB: 'energy_efficiency.transformations.silver.functions'

# Verify module exists:
import importlib
try:
    mod = importlib.import_module('energy_efficiency.transformations.silver.functions')
    print("✅ Module found")
except ModuleNotFoundError as e:
    print(f"❌ Module not found: {e}")
""")

# Issue 5: Function not found
print("\n" + "=" * 60)
print("\n❌ Issue 5: 'Function ... does not exist in module'")
print("\n✅ Solution:")
print("""
# Verify function exists in module:
import importlib
mod = importlib.import_module('silver.functions')
if hasattr(mod, 'process_argo_boilers'):
    print("✅ Function found")
else:
    print("❌ Function not found")
    print(f"Available functions: {dir(mod)}")
""")

# Issue 6: Permission denied
print("\n" + "=" * 60)
print("\n❌ Issue 6: 'Permission denied' or 'Authentication failed'")
print("\n✅ Solution:")
print("""
# Use custom auth provider:
def custom_auth(env, repo_path, logger_metadata):
    # Your authentication logic
    spark = get_authenticated_spark()
    sql_provider = get_authenticated_sql()
    return {"spark": spark, "sql_provider": sql_provider}

run_project(
    project="Project Name",
    env="qat",
    auth_provider=custom_auth
)
""")

print("\n" + "=" * 60)
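
Issue 3 is cheap to catch before a row ever reaches the database. The following is a minimal sketch of such a check, based only on the shapes described above (array, object, array of objects); `validate_io_fields` is a hypothetical helper, and the requirement that each output has a `table` key is an assumption drawn from the examples in this tutorial.

```python
import json


def validate_io_fields(inputs, constants, outputs):
    """Return a list of error strings for the three JSON columns."""
    errors = []
    decoded = {}
    expected = {
        "inputs": (inputs, list),
        "constants": (constants, dict),
        "outputs": (outputs, list),
    }
    for name, (raw, kind) in expected.items():
        if raw is None:  # NULL columns are allowed
            continue
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            errors.append(f"{name}: not valid JSON: {raw!r}")
            continue
        if not isinstance(value, kind):
            shape = "array" if kind is list else "object"
            errors.append(f"{name}: expected a JSON {shape}")
            continue
        decoded[name] = value
    for index, out in enumerate(decoded.get("outputs", [])):
        if not isinstance(out, dict) or "table" not in out:
            errors.append(f"outputs[{index}]: expected an object with a 'table' key")
    return errors


good = validate_io_fields(
    '["table1"]',
    '{"threshold": 100}',
    '[{"table": "t", "mode": "overwrite"}]',
)
bad = validate_io_fields('"table1"', "threshold=100", '["t"]')
print(good)
for error in bad:
    print(error)
```

The second call reproduces the "WRONG" examples from Issue 3: a bare string instead of an array, and `key=value` text instead of a JSON object.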

### 5.2 Debugging Pipeline Failures

In [None]:
# DEBUGGING PIPELINE FAILURES

print("🐛 Debugging Pipeline Failures\n")

# 1. Use health check utility
print("1️⃣ Run Health Check")
print("""
from odibi_de_v2.utils import quick_health_check

quick_health_check(
    sql_provider=sql_provider,
    spark=spark,
    project="Energy Efficiency",
    env="qat"
)
""")

# 2. Get recent failures
print("\n2️⃣ Check Recent Failures")
print("""
from odibi_de_v2.utils import get_failed_transformations

failures = get_failed_transformations(
    sql_provider=sql_provider,
    project="Energy Efficiency",
    env="qat",
    hours_back=24
)

for failure in failures:
    print(f"Failed: {failure['transformation_id']}")
    print(f"  Error: {failure['error_message']}")
""")

# 3. Validate configuration
print("\n3️⃣ Validate Configuration")
print("""
from odibi_de_v2.config import validate_transformation_registry

# Get configs from DB
configs = sql_provider.execute_query(
    "SELECT * FROM TransformationRegistry WHERE project='Energy Efficiency'"
)

# Validate
result = validate_transformation_registry(configs)
print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Errors: {len(result.errors)}")
""")

# 4. Run with verbose logging
print("\n4️⃣ Enable Verbose Logging")
print("""
result = run_project(
    project="Energy Efficiency",
    env="qat",
    log_level="DEBUG",  # More detailed logs
    save_logs=True
)
""")

# 5. Test individual transformations
print("\n5️⃣ Test Individual Transformations")
print("""
# Import and test directly
from silver.functions import process_argo_boilers

result = process_argo_boilers(
    spark=spark,
    inputs=['qat_energy_efficiency.argo_boilers_bronze'],
    constants={},
    outputs=[{"table": "test_output", "mode": "overwrite"}]
)
""")

# 6. Check execution logs
print("\n6️⃣ Query Execution Logs")
print("""
# Check TransformationRunLog table
logs = sql_provider.execute_query(
    '''
    SELECT TOP 10
        transformation_id,
        status,
        error_message,
        start_time,
        end_time
    FROM TransformationRunLog
    WHERE project = 'Energy Efficiency'
    ORDER BY start_time DESC
    '''
)
display(logs)
""")

print("\n💡 See docs/troubleshooting.md for more debugging tips!")
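
Step 5 above imports a transformation by hand. When all you have is the `module` and `function` strings from a registry row, a small resolver gives clearer errors than a raw traceback. This is a sketch under stated assumptions: `resolve_callable` is a hypothetical helper, and the standard library's `json.dumps` stands in for a real transformation so the example runs without project code.

```python
import importlib


def resolve_callable(module_path, function_name):
    """Import `module_path` and return its attribute `function_name` if callable."""
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        raise LookupError(f"cannot import module {module_path!r}: {exc}") from exc
    func = getattr(module, function_name, None)
    if not callable(func):
        public = sorted(n for n in dir(module) if not n.startswith("_"))
        raise LookupError(
            f"{module_path!r} has no callable {function_name!r}; found: {public}"
        )
    return func


# Stand-in registry values; a real row would name your own module and function.
func = resolve_callable("json", "dumps")
encoded = func(["table1"])
print(encoded)

try:
    resolve_callable("json", "no_such_function")
except LookupError as exc:
    message = str(exc)
    print(message[:60])
```

Raising a single `LookupError` for both failure modes lets a caller handle Issues 4 and 5 from section 5.1 with one `except` clause.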

### 5.3 Performance Tips

In [None]:
# PERFORMANCE OPTIMIZATION TIPS

print("⚡ Performance Optimization Tips\n")
print("=" * 60)

# Tip 1: Use caching for expensive operations
print("\n1️⃣ Smart Caching")
print("""
# Cache large intermediate tables
cache_plan = {
    "Silver_2": ["large_joined_table"],
    "Gold_1": ["expensive_aggregation"]
}

run_project(
    project="Energy Efficiency",
    env="qat",
    cache_plan=cache_plan
)

# Effect: Reduce recomputation, faster Gold layer execution
""")

# Tip 2: Parallel execution
print("\n2️⃣ Parallel Execution")
print("""
# Transformations with the same 'step' run in parallel.
# Give independent transformations the same step number.

# Example TransformationRegistry:
# transformation_id              layer      step
# 'energy-argo-boilers'          Silver_1   1     ← Runs in parallel
# 'energy-cedar-boilers'         Silver_1   1     ← Runs in parallel
# 'energy-combined-boilers'      Gold_1     1     ← Runs after Silver_1
""")

# Tip 3: Layer-specific optimization
print("\n3️⃣ Layer-Specific Workers")
print("""
# In manifest.json (drop the // annotations in the real file; JSON does not allow comments):
{
  "layers": {
    "Silver_1": {
      "max_workers": 8  // More parallel workers
    },
    "Gold_1": {
      "max_workers": 2  // Fewer workers for resource-intensive tasks
    }
  }
}
""")

# Tip 4: Incremental processing
print("\n4️⃣ Incremental Processing")
print("""
# Use Delta Lake's MERGE for incremental updates
# Instead of full table rewrites

def incremental_transform(spark, inputs, outputs, **kwargs):
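    new_data = spark.table(inputs[0])
    # Register the DataFrame as a temp view so the SQL below can reference it by name
    new_data.createOrReplaceTempView("new_data")
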
    spark.sql(f'''
        MERGE INTO {outputs[0]['table']} target
        USING new_data source
        ON target.id = source.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    ''')
""")

# Tip 5: Partition pruning
print("\n5️⃣ Partition Pruning")
print("""
# Partition output tables by common filter columns.
# Add "partition_by" to the output spec:
{
  "table": "qat_energy.boilers_silver",
  "mode": "overwrite",
  "partition_by": ["date", "plant"]
}

# Effect: Faster queries with WHERE date = '2024-01-01'
""")

# Tip 6: Z-ordering for Delta tables
print("\n6️⃣ Delta Lake Z-Ordering")
print("""
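# Optimize Delta tables for common queries.
# Z-order on non-partition columns only: Delta rejects ZORDER BY on a
# partition column, and Tip 5 already partitions this table by date and plant.
spark.sql('''
    OPTIMIZE qat_energy.boilers_silver
    ZORDER BY (asset)
''')

# Effect: better data skipping for queries that filter on the Z-ordered columns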
""")

# Tip 7: Monitor execution metrics
print("\n7️⃣ Monitor Execution Metrics")
print("""
from odibi_de_v2.utils import print_project_summary

# View execution stats
print_project_summary(
    sql_provider=sql_provider,
    spark=spark,
    project="Energy Efficiency",
    env="qat"
)

# Identify slow transformations and optimize
""")

print("\n" + "=" * 60)
print("\n💡 Actual gains depend on data volume and cluster setup; measure before and after each change.")
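
To make Tip 2 concrete, here is a minimal sketch of step-based scheduling: layers run in order, and transformations that share a step within a layer run concurrently. It illustrates the idea only and is not the framework's actual scheduler; `run_in_step_order` and `fake_run` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def run_in_step_order(transformations, layer_order, run_one, max_workers=4):
    """Run layer by layer, step by step; items sharing a step run concurrently."""
    completed = []
    for layer in layer_order:
        in_layer = sorted(
            (t for t in transformations if t["layer"] == layer),
            key=lambda t: t["step"],
        )
        for step, group in groupby(in_layer, key=lambda t: t["step"]):
            batch = list(group)
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # map() returns results in input order and blocks until the batch is done
                results = list(pool.map(run_one, batch))
            completed.append((layer, step, results))
    return completed

registry = [
    {"transformation_id": "energy-argo-boilers", "layer": "Silver_1", "step": 1},
    {"transformation_id": "energy-cedar-boilers", "layer": "Silver_1", "step": 1},
    {"transformation_id": "energy-combined-boilers", "layer": "Gold_1", "step": 1},
]

def fake_run(transformation):
    # Stand-in for real work; returns the id so the plan is easy to inspect
    return transformation["transformation_id"]

plan = run_in_step_order(registry, ["Silver_1", "Gold_1"], fake_run)
for layer, step, ids in plan:
    print(layer, step, ids)
```

The two Silver_1 transformations land in one batch, and the Gold_1 transformation runs only after that batch finishes.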

---

## 6. Quick Reference {#6-reference}

### 6.1 Command Reference

In [None]:
# QUICK COMMAND REFERENCE

print("""
📖 QUICK COMMAND REFERENCE
==========================

PROJECT MANAGEMENT
------------------
# Create new project
initialize_project("Project Name", "manufacturing")

# Run full pipeline
run_project(project="Project Name", env="qat")

# Run specific layers
run_project(project="Project Name", env="qat", target_layers=["Silver_1"])

# Dry run (validation only)
run_project(project="Project Name", env="qat", dry_run=True)


HEALTH & DIAGNOSTICS
--------------------
from odibi_de_v2.utils import quick_health_check, print_project_summary

# Quick health check
quick_health_check(sql_provider, spark, "Project Name", "qat")

# Project summary
print_project_summary(sql_provider, spark, "Project Name", "qat")

# List all projects
from odibi_de_v2.utils import list_projects
projects = list_projects(sql_provider, env="qat")


CONFIGURATION
-------------
from odibi_de_v2.config import TransformationRegistryUI

# Interactive config editor
ui = TransformationRegistryUI(project="Project Name", env="qat")
ui.render()

# Validate configurations
from odibi_de_v2.config import validate_transformation_registry
result = validate_transformation_registry(configs)


DEBUGGING
---------
from odibi_de_v2.utils import get_failed_transformations

# Get recent failures
failures = get_failed_transformations(
    sql_provider, "Project Name", "qat", hours_back=24
)

# Retry failed transformations
from odibi_de_v2.utils import retry_failed_transformations
retry_failed_transformations(sql_provider, spark, "Project Name", "qat")


IMPORTS
-------
from odibi_de_v2 import run_project, initialize_project
from odibi_de_v2.utils import (
    quick_health_check,
    print_project_summary,
    list_projects,
    get_failed_transformations,
    retry_failed_transformations
)
from odibi_de_v2.config import (
    TransformationRegistryUI,
    validate_transformation_registry
)
""")

### 6.2 SQL DDL Reference

In [None]:
# SQL DDL REFERENCE

print("""
📋 SQL DDL REFERENCE
====================

TRANSFORMATIONREGISTRY TABLE
----------------------------
""")

ddl_transformation_registry = '''
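-- T-SQL has no CREATE TABLE IF NOT EXISTS; guard with OBJECT_ID instead
IF OBJECT_ID('TransformationRegistry', 'U') IS NULL
CREATE TABLE TransformationRegistry (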
    -- Primary identification
    transformation_id VARCHAR(100) PRIMARY KEY,
    transformation_group_id VARCHAR(100),
    
    -- Project and environment scoping
    project VARCHAR(100) NOT NULL,
    environment VARCHAR(20) NOT NULL DEFAULT 'qat',
    
    -- Layer and execution control
    layer VARCHAR(50) NOT NULL,
    step INT NOT NULL DEFAULT 1,
    enabled BIT NOT NULL DEFAULT 1,
    
    -- Generic entity hierarchy
    entity_1 VARCHAR(100),
    entity_2 VARCHAR(100),
    entity_3 VARCHAR(100),
    
    -- Transformation logic
    module VARCHAR(255) NOT NULL,
    [function] VARCHAR(255) NOT NULL,  -- bracketed because FUNCTION is a reserved keyword in T-SQL
    
    -- Flexible I/O (JSON)
    inputs NVARCHAR(MAX),
    constants NVARCHAR(MAX),
    outputs NVARCHAR(MAX),
    
    -- Metadata
    description NVARCHAR(500),
    created_at DATETIME DEFAULT GETDATE(),
    updated_at DATETIME DEFAULT GETDATE(),
    created_by VARCHAR(100),
    updated_by VARCHAR(100),
    
    -- Indexes
    INDEX idx_project_env_layer (project, environment, layer),
    INDEX idx_enabled (enabled),
    INDEX idx_transformation_group (transformation_group_id)
);
'''

print(ddl_transformation_registry)

print("""

SAMPLE INSERT STATEMENT
-----------------------
""")

sample_insert = '''
INSERT INTO TransformationRegistry (
    transformation_id,
    project,
    environment,
    layer,
    entity_1,
    entity_2,
    module,
    [function],
    inputs,
    constants,
    outputs,
    description
)
VALUES (
    'my-transformation-id',
    'my project',
    'qat',
    'Silver_1',
    'EntityA',
    'EntityB',
    'my_project.transformations.silver',
    'my_function',
    '["input_table1", "input_table2"]',
    '{"threshold": 100, "window_days": 30}',
    '[{"table": "output_table", "mode": "overwrite"}]',
    'My transformation description'
);
'''

print(sample_insert)
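
Hand-writing JSON inside SQL string literals is easy to get wrong. One option is to build the JSON columns in Python with `json.dumps` and pass them as query parameters. This is a sketch: `build_registry_params` is a hypothetical helper, and the `?` placeholders assume a DB-API driver with qmark-style parameters (pyodbc, for example).

```python
import json

def build_registry_params(transformation_id, project, environment, layer,
                          module, function, inputs, constants, outputs):
    """Return parameter values with the I/O columns serialized as JSON text."""
    return (
        transformation_id, project, environment, layer, module, function,
        json.dumps(inputs), json.dumps(constants), json.dumps(outputs),
    )

# [function] is bracketed because FUNCTION is a reserved keyword in T-SQL
INSERT_SQL = (
    "INSERT INTO TransformationRegistry "
    "(transformation_id, project, environment, layer, module, [function], "
    "inputs, constants, outputs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)"
)

params = build_registry_params(
    transformation_id="my-transformation-id",
    project="my project",
    environment="qat",
    layer="Silver_1",
    module="my_project.transformations.silver",
    function="my_function",
    inputs=["input_table1", "input_table2"],
    constants={"threshold": 100, "window_days": 30},
    outputs=[{"table": "output_table", "mode": "overwrite"}],
)

print(params[6])  # JSON text for the inputs column
# cursor.execute(INSERT_SQL, params)  # assumption: a qmark-style DB-API cursor
```

Passing values as parameters also avoids quoting problems when a description or table name contains an apostrophe.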

### 6.3 Manifest Schema Reference

In [None]:
# MANIFEST SCHEMA REFERENCE

print("""
📄 MANIFEST SCHEMA REFERENCE
============================

Complete manifest.json structure:
""")

manifest_schema = {
    "project_name": "str - Project name",
    "project_type": "str - manufacturing|analytics|ml_pipeline|custom",
    "description": "str - Project description",
    "version": "str - Semantic version (e.g., 1.0.0)",
    
    "layer_order": [
        "Array of layer names in execution order",
        "Example: ['Bronze', 'Silver_1', 'Silver_2', 'Gold_1']"
    ],
    
    "layers": {
        "LayerName": {
            "name": "str - Layer name",
            "description": "str - Layer description",
            "depends_on": ["Array of prerequisite layers"],
            "cache_tables": ["Array of table names to cache"],
            "max_workers": "int|null - Max parallel workers"
        }
    },
    
    "environments": [
        "Array of supported environments",
        "Example: ['dev', 'qat', 'prod']"
    ],
    "default_env": "str - Default environment",
    
    "entity_labels": {
        "entity_1": "str - Label for entity_1 (e.g., 'plant')",
        "entity_2": "str - Label for entity_2 (e.g., 'asset')",
        "entity_3": "str - Label for entity_3 (e.g., 'equipment')"
    },
    
    "transformation_modules": [
        "Array of Python module paths",
        "Example: ['energy_efficiency.transformations']"
    ],
    
    "cache_plan": {
        "LayerName": [
            "Array of table names to cache for this layer"
        ]
    },
    
    "owner": "str|null - Project owner",
    "tags": ["Array of tags for categorization"],
    # A dict, not a set literal: json.dumps() cannot serialize a set
    "metadata": {"<key>": "any - Custom metadata key-value pairs"}
}

import json
print(json.dumps(manifest_schema, indent=2))

print("""

💡 Use initialize_project() to generate a valid manifest!
""")

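The schema above implies a few consistency rules: every layer in `layer_order` should have an entry under `layers`, each `depends_on` target should run earlier, and `default_env` should be one of `environments`. The sketch below checks those rules on a manifest dict. `check_manifest` is a hypothetical helper, not an odibi_de_v2 API, and the rules themselves are assumptions read off the schema rather than documented framework behavior. The second call shows what it reports when the layer order is reversed and the default environment is unknown.

```python
def check_manifest(manifest):
    """Return a list of consistency problems found in a manifest dict."""
    problems = []
    order = manifest.get("layer_order", [])
    layers = manifest.get("layers", {})
    position = {name: i for i, name in enumerate(order)}

    for name in order:
        if name not in layers:
            problems.append(f"layer_order lists '{name}' but 'layers' has no entry for it")

    for name, spec in layers.items():
        if name not in position:
            problems.append(f"'{name}' is defined in 'layers' but missing from layer_order")
            continue
        for dep in spec.get("depends_on", []):
            if dep not in position:
                problems.append(f"'{name}' depends on unknown layer '{dep}'")
            elif position[dep] >= position[name]:
                problems.append(f"'{name}' depends on '{dep}', which does not run earlier")

    default_env = manifest.get("default_env")
    if default_env is not None and default_env not in manifest.get("environments", []):
        problems.append(f"default_env '{default_env}' is not in environments")
    return problems

manifest = {
    "project_name": "Energy Efficiency",
    "layer_order": ["Silver_1", "Gold_1"],
    "layers": {
        "Silver_1": {"depends_on": [], "max_workers": 8},
        "Gold_1": {"depends_on": ["Silver_1"], "max_workers": 2},
    },
    "environments": ["dev", "qat", "prod"],
    "default_env": "qat",
}
print(check_manifest(manifest))

# Same manifest with the layer order reversed and an unknown default environment
broken = dict(manifest, layer_order=["Gold_1", "Silver_1"], default_env="uat")
for problem in check_manifest(broken):
    print(problem)
```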
---

## 🎓 Summary

**Congratulations!** You've completed the odibi_de_v2 Framework Evolution Tutorial.

### What You've Learned:

1. ✅ **Evolution Story**: Why v1.x → v2.0 and key benefits
2. ✅ **Architecture**: TransformationRegistry, Manifests, Component interaction
3. ✅ **Core APIs**: `initialize_project()` and `run_project()`
4. ✅ **Configuration**: TransformationRegistry UI and validation
5. ✅ **Migration**: Step-by-step v1.x → v2.0 migration
6. ✅ **Advanced Patterns**: Multi-env, caching, custom auth
7. ✅ **Industry Examples**: Manufacturing, Analytics, Custom domains
8. ✅ **Troubleshooting**: Common issues, debugging, performance tips
9. ✅ **Reference**: Commands, SQL DDL, Manifest schema

### Next Steps:

1. **Try it yourself**: Create a new project with `initialize_project()`
2. **Migrate existing projects**: Follow the migration guide in Cell 7
3. **Read the docs**: See `MIGRATION_GUIDE.md`, `QOL_UTILITIES_SUMMARY.md`, `QUICK_REFERENCE.md`
4. **Join the community**: Share your experience and help others

### Resources:

- 📖 **Full Documentation**: `docs/` in the odibi_de_v2 repository
- 🔀 **Migration Guide**: `MIGRATION_GUIDE.md`
- ⚡ **Quick Reference**: `QUICK_REFERENCE.md`
- 🛠️ **Quality of Life Utilities**: `QOL_UTILITIES_SUMMARY.md`
- 💻 **GitHub**: [Your Repository URL]

---

**Happy Engineering! 🚀**

*odibi_de_v2 v2.0 - Universal Data Engineering Framework*