# 03 - Spark Engine Preview (Phase 1 Scaffold)

## 🧭 Goal

Understand ODIBI's Spark engine architecture and configuration.

This notebook will:
- Explore the Spark engine scaffold (Phase 1)
- Show Spark pipeline YAML structure
- Explain Azure integration points
- Preview what's coming in Phase 3

**Note:** Spark execution is planned for Phase 3 (Q2 2026). This notebook only previews the architecture; no Spark jobs are run.

**Estimated time:** 3 minutes

## 🔧 Setup

In [None]:
# ✅ Environment Setup
import sys
import os
from pathlib import Path
import yaml

# Navigate to project root
project_root = Path.cwd().parent if Path.cwd().name == 'walkthroughs' else Path.cwd()
os.chdir(project_root)

print("✅ Environment ready")
print(f"📁 Working directory: {Path.cwd()}")

## 🏗️ Spark Engine Architecture

ODIBI's Spark engine exposes the same interface as the Pandas engine:

In [None]:
# Inspect Spark engine code
with open('odibi/engine/spark_engine.py', 'r') as f:
    content = f.read()

print("📦 Spark Engine Class Structure:\n")

# Show class definition and key methods
lines = content.split('\n')
for i, line in enumerate(lines[:50], 1):
    if 'class SparkEngine' in line or 'def ' in line:
        print(f"{i:3}: {line}")
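Matching on the text `'def '` in the first 50 lines is quick, but it can pick up comments or strings and it stops at line 50. As an alternative, here is a minimal sketch that uses the standard-library `ast` module to list the methods defined on a class. The `sample_source` string is a hypothetical stand-in for `spark_engine.py`: the method names come from the Phase 3 list in this notebook, while the signatures and the `list_class_methods` helper are assumptions made for illustration.

```python
import ast


def list_class_methods(source: str, class_name: str) -> list[str]:
    """Return the names of methods defined directly on `class_name`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
    return []


# Hypothetical stand-in for odibi/engine/spark_engine.py (signatures are assumed).
sample_source = '''
class SparkEngine:
    name = "spark"

    def read(self, connection, path):
        raise NotImplementedError("Coming in Phase 3")

    def write(self, df, connection, path):
        raise NotImplementedError("Coming in Phase 3")

    def execute_sql(self, sql):
        raise NotImplementedError("Coming in Phase 3")
'''

methods = list_class_methods(sample_source, "SparkEngine")
print(methods)
```

To run this against the real file, pass the `content` string read in the cell above instead of `sample_source`.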

In [None]:
# Try importing SparkEngine (will show helpful error if pyspark not installed)
try:
    from odibi.engine.spark_engine import SparkEngine
    print("✅ SparkEngine imported successfully")
    print(f"   Engine name: {SparkEngine.name}")
except ImportError as e:
    print(f"⚠️ Expected: {e}")
    print("\n💡 This is normal! Spark engine requires:")
    print("   pip install 'odibi[spark]'")
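The "import guard with a helpful error" idea can be sketched as follows. This is not ODIBI's actual implementation, which may differ. The helper name `require_optional` and the message wording are assumptions; only `importlib.util.find_spec` and the `odibi[spark]` extra come from the standard library and this notebook respectively.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise an ImportError with install instructions if a top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this engine but is not installed. "
            f"Install it with: pip install 'odibi[{extra}]'"
        )


# Demonstrate the guard with a module name that should not exist anywhere.
try:
    require_optional("module_that_does_not_exist_xyz", "spark")
    guard_message = ""
except ImportError as exc:
    guard_message = str(exc)

print(guard_message)
```

A guard like this would typically be called at the top of the engine module or in the engine's constructor, so that users see the install hint instead of a bare `ModuleNotFoundError`.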

## 📋 Spark Pipeline Configuration

Let's examine the `example_spark.yaml` configuration:

In [None]:
# Load Spark example configuration
with open('examples/example_spark.yaml', 'r') as f:
    spark_config = yaml.safe_load(f)

print("🔧 Spark Pipeline Configuration:\n")
print(f"Project: {spark_config['project']}")
print(f"Engine: {spark_config['engine']}")
print("\nConnections:")
for conn_name, conn_config in spark_config['connections'].items():
    print(f"  • {conn_name}: {conn_config['type']}")

In [None]:
# Show Azure ADLS connection structure
print("\n☁️ Azure ADLS Connection:")
print(yaml.dump(spark_config['connections']['adls_bronze'], default_flow_style=False))

In [None]:
# Show pipeline structure
print("\n📊 Pipeline Structure:")
for pipeline in spark_config['pipelines']:
    print(f"\nPipeline: {pipeline['name']}")
    print(f"  Nodes: {len(pipeline['nodes'])}")
    for node in pipeline['nodes']:
        print(f"    • {node['name']}")
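The cells above rely on a particular key layout: top-level `project`, `engine`, `connections`, and `pipelines`, with a `type` on each connection and a `name` on each pipeline and node. The sketch below mirrors that layout in a plain dictionary and checks it before use. All values (project name, connection type string, pipeline and node names) are hypothetical, and `find_config_problems` is an illustrative helper, not part of ODIBI.

```python
# Hypothetical config: only the key layout mirrors what the cells above access.
example_config = {
    "project": "demo_project",
    "engine": "spark",
    "connections": {
        "adls_bronze": {
            "type": "azure_adls",  # assumed type string
            "account": "mystorageaccount",
            "container": "datalake",
        },
    },
    "pipelines": [
        {
            "name": "bronze_to_silver",
            "nodes": [{"name": "load_sensors"}, {"name": "clean_sensors"}],
        },
    ],
}

REQUIRED_TOP_LEVEL_KEYS = ("project", "engine", "connections", "pipelines")


def find_config_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the layout is as expected."""
    problems = []
    for key in REQUIRED_TOP_LEVEL_KEYS:
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for conn_name, conn in config.get("connections", {}).items():
        if "type" not in conn:
            problems.append(f"connection '{conn_name}' has no 'type'")
    for index, pipeline in enumerate(config.get("pipelines", [])):
        if "name" not in pipeline:
            problems.append(f"pipeline #{index} has no 'name'")
        for node in pipeline.get("nodes", []):
            if "name" not in node:
                problems.append(f"pipeline #{index} has a node without 'name'")
    return problems


print(find_config_problems(example_config))
print(find_config_problems({"project": "demo_project"}))
```

Running a check like this on `spark_config` before printing it would turn a `KeyError` into a readable list of what is missing.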

## 🔗 Azure Connection Resolution

The Azure ADLS connection can build URIs even in Phase 1:

In [None]:
# Test ADLS URI generation (Phase 1 feature that works now)
from odibi.connections.azure_adls import AzureADLS

# Create connection
adls = AzureADLS(
    account="mystorageaccount",
    container="datalake",
    path_prefix="bronze/",
    auth_mode="managed_identity"
)

# Build URI
uri = adls.uri("sensors/2024/01/data.parquet")

print("✅ ADLS URI Construction (Phase 1 Working):")
print("\n   Input path: sensors/2024/01/data.parquet")
print(f"   Output URI: {uri}")
print("\n   This URI will be used by Spark in Phase 3!")
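For reference, ADLS Gen2 paths use the `abfss://<container>@<account>.dfs.core.windows.net/<path>` form. The standalone function below is a minimal sketch of how such a URI can be assembled from the same inputs. `build_abfss_uri` is a hypothetical helper, and the way it joins `path_prefix` and `path` is an assumption; `AzureADLS.uri()` may handle prefixes and slashes differently.

```python
def build_abfss_uri(account: str, container: str, path: str, path_prefix: str = "") -> str:
    """Assemble an abfss:// URI, joining prefix and path with single slashes."""
    # Strip stray slashes so "bronze/" + "sensors/..." does not yield "bronze//sensors/...".
    parts = [part.strip("/") for part in (path_prefix, path) if part.strip("/")]
    full_path = "/".join(parts)
    return f"abfss://{container}@{account}.dfs.core.windows.net/{full_path}"


example_uri = build_abfss_uri(
    account="mystorageaccount",
    container="datalake",
    path="sensors/2024/01/data.parquet",
    path_prefix="bronze/",
)
print(example_uri)
# abfss://datalake@mystorageaccount.dfs.core.windows.net/bronze/sensors/2024/01/data.parquet
```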

## 🪞 Reflect

**What we learned:**
- Spark engine architecture (scaffolded in Phase 1)
- Spark pipeline YAML structure
- Azure ADLS connection configuration
- URI construction works today (Phase 1)
- Spark execution coming in Phase 3

**Phase 1 (Current) Features:**
- ✅ SparkEngine class defined
- ✅ Azure connections (ADLS, SQL, DBFS)
- ✅ Path/URI resolution
- ✅ Import guards with helpful errors

**Phase 3 (Q2 2026) Features:**
- ⏳ `SparkEngine.read()` - Load Parquet/CSV from ADLS
- ⏳ `SparkEngine.write()` - Save to ADLS/Delta
- ⏳ `SparkEngine.execute_sql()` - Run Spark SQL transforms
- ⏳ Integration tests with local Spark session

**Next step:**  
Go to **`04_ci_cd_and_precommit.ipynb`** to learn about code quality automation.

## ✅ Self-Check

In [None]:
# ✅ Self-Check
try:
    import os
    print("Running self-check...")
    
    # Verify Spark scaffolding exists
    assert os.path.exists("odibi/engine/spark_engine.py"), "Missing Spark engine"
    assert os.path.exists("odibi/connections/azure_adls.py"), "Missing ADLS connection"
    assert os.path.exists("examples/example_spark.yaml"), "Missing Spark example"
    
    # Verify Azure connections work
    from odibi.connections.azure_adls import AzureADLS
    conn = AzureADLS(account="test", container="data")
    uri = conn.uri("file.csv")
    assert uri.startswith("abfss://"), "URI generation failed"
    print("✅ Azure ADLS connection works")
    
    # Verify docs exist
    assert os.path.exists("docs/setup_databricks.md"), "Missing Databricks setup guide"
    assert os.path.exists("docs/setup_azure.md"), "Missing Azure setup guide"
    
    print("🎉 Walkthrough 03 verified successfully")
except Exception as e:
    print(f"❌ Walkthrough failed self-check: {e}")
    raise