# Deep Dive 03: Odibi Engine Abstraction

## 🎯 The Problem

Data processing needs different execution engines:
- **Small data** → Pandas (fast, in-memory)
- **Big data** → Spark (distributed, scalable)
- **Analytics** → DuckDB (optimized SQL)

**Without abstraction:**
```python
# ❌ Tightly coupled to Pandas
import pandas as pd

df = pd.read_csv('data.csv')
result = df.groupby('category').sum()
result.to_parquet('output.parquet')

# To switch to Spark, rewrite EVERYTHING (`spark` is an existing SparkSession):
df = spark.read.csv('data.csv', header=True, inferSchema=True)
result = df.groupBy('category').sum()
result.write.parquet('output.parquet')
```

**With Engine abstraction:**
```python
# ✅ Engine-agnostic pipeline code
df = engine.read(conn, format='csv', path='data.csv')
# ... transformations ...
engine.write(df, conn, format='parquet', path='output.parquet')

# Switch engines with config only - ZERO code changes!
```

## 🦉 First Principles

### 1. Abstraction
Hide implementation details behind a stable interface.

### 2. Polymorphism
Treat different engines uniformly through a common base class.

### 3. Dependency Inversion
Depend on abstractions (Engine ABC), not concrete implementations (PandasEngine).

### 4. Open/Closed Principle
Open for extension (add new engines), closed for modification (ABC doesn't change).

## 📖 Part 1: The Engine ABC

Let's examine the abstract base class that defines what an engine must do:

In [None]:
from pathlib import Path

# Read the Engine ABC
base_engine_path = Path(r"c:/Users/hodibi/OneDrive - Ingredion/Desktop/Repos/Odibi/odibi/engine/base.py")

with open(base_engine_path) as f:
    engine_abc_code = f.read()

print(f"Engine ABC: {len(engine_abc_code.splitlines())} lines\n")
print(engine_abc_code)

## 🔍 Analysis: The 9 Abstract Methods

Every engine MUST implement these methods:

### Data I/O (3 methods)
1. **`read()`** - Load data from connection
2. **`write()`** - Save data to connection  
3. **`execute_sql()`** - Run SQL queries

### Operations (1 method)
4. **`execute_operation()`** - Built-in operations (pivot, etc.)

### Introspection (5 methods)
5. **`get_schema()`** - Get column names/types
6. **`get_shape()`** - Get (rows, columns)
7. **`count_rows()`** - Count rows
8. **`count_nulls()`** - Count nulls per column
9. **`validate_schema()`** - Validate DataFrame structure

**Key Insight:** Notice the method signatures use `Any` for DataFrame types. This allows:
- PandasEngine to return `pd.DataFrame`
- SparkEngine to return `pyspark.sql.DataFrame`
- DuckDBEngine to return `duckdb.DuckDBPyRelation`
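The contract above can be sketched as an abstract base class. This is a minimal illustration, not the contents of `base.py`: only the nine method names come from the list above, while the class name `EngineSketch` and every parameter list are assumptions modeled on the calls shown in this notebook.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple


class EngineSketch(ABC):
    """Illustrative stand-in for the Engine ABC (all signatures are assumed)."""

    # --- Data I/O ---
    @abstractmethod
    def read(self, connection: Any, format: str, table: Optional[str] = None,
             path: Optional[str] = None,
             options: Optional[Dict[str, Any]] = None) -> Any: ...

    @abstractmethod
    def write(self, df: Any, connection: Any, format: str,
              table: Optional[str] = None, path: Optional[str] = None,
              mode: str = "overwrite",
              options: Optional[Dict[str, Any]] = None) -> None: ...

    @abstractmethod
    def execute_sql(self, sql: str, context: Any) -> Any: ...

    # --- Operations ---
    @abstractmethod
    def execute_operation(self, operation: str, params: Dict[str, Any], df: Any) -> Any: ...

    # --- Introspection ---
    @abstractmethod
    def get_schema(self, df: Any) -> List[str]: ...

    @abstractmethod
    def get_shape(self, df: Any) -> Tuple[int, int]: ...

    @abstractmethod
    def count_rows(self, df: Any) -> int: ...

    @abstractmethod
    def count_nulls(self, df: Any, columns: List[str]) -> Dict[str, int]: ...

    @abstractmethod
    def validate_schema(self, df: Any, schema_rules: Dict[str, Any]) -> List[str]: ...


# Python refuses to instantiate a class that still has abstract methods.
try:
    EngineSketch()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = exc

abstract_names = sorted(EngineSketch.__abstractmethods__)
print(len(abstract_names), abstract_names)
```

Because every DataFrame parameter and return value is typed as `Any`, a subclass is free to return whichever DataFrame type its library provides.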

## 📖 Part 2: PandasEngine Implementation

Let's see how Pandas implements the Engine contract:

In [None]:
pandas_engine_path = Path(r"c:/Users/hodibi/OneDrive - Ingredion/Desktop/Repos/Odibi/odibi/engine/pandas_engine.py")

with open(pandas_engine_path) as f:
    pandas_code = f.read()

print(f"PandasEngine: {len(pandas_code.splitlines())} lines\n")

# Show key parts
lines = pandas_code.splitlines()
print("Class definition:")
print('\n'.join(lines[12:19]))  # Class and __init__
print("\n" + "="*60)
print("\nread() method signature:")
print('\n'.join(lines[47:66]))  # read() method

### 🔬 Deep Dive: PandasEngine.read()

Let's understand the read implementation:

In [None]:
# Extract the read() method
lines = pandas_code.splitlines()

# Find read method (lines 47-134)
read_method = '\n'.join(lines[46:135])
print(read_method)

### 🧪 Analysis: Read Method Pattern

Notice the pattern:

```python
if format == "csv":
    return pd.read_csv(full_path, **merged_options)
elif format == "parquet":
    return pd.read_parquet(full_path, **merged_options)
elif format == "delta":
    # Special handling for Delta Lake
    dt = DeltaTable(full_path, storage_options=storage_opts)
    return dt.to_pandas()
```

**Key Points:**
1. **Format dispatch** - Different code paths per format
2. **Storage options** - Merged from connection + user options
3. **Cloud support** - Works with ADLS, S3 via `fsspec`
4. **Delta Lake** - Uses `deltalake` library, converts to Pandas
5. **Error handling** - Clear ImportError messages
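The format dispatch and option merging described above can be sketched without Pandas. The example below uses only the standard library and reads from in-memory strings, so `read_text_data`, `merge_options`, `connection_options`, and `user_options` are hypothetical names rather than Odibi's API. It also assumes that user options override connection defaults on conflict, which the excerpt above does not confirm.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List, Optional


def merge_options(connection_options: Optional[Dict[str, Any]],
                  user_options: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Connection defaults first, then user options (assumed to win on conflict)."""
    return {**(connection_options or {}), **(user_options or {})}


def _read_csv(text: str, **options: Any) -> List[Dict[str, str]]:
    return list(csv.DictReader(io.StringIO(text), **options))


def _read_json(text: str, **options: Any) -> Any:
    return json.loads(text, **options)


# One reader per format: this dict plays the role of the if/elif chain.
READERS: Dict[str, Callable[..., Any]] = {"csv": _read_csv, "json": _read_json}


def read_text_data(text: str, format: str,
                   connection_options: Optional[Dict[str, Any]] = None,
                   user_options: Optional[Dict[str, Any]] = None) -> Any:
    try:
        reader = READERS[format]
    except KeyError:
        raise ValueError(f"Unsupported format: {format!r}") from None
    return reader(text, **merge_options(connection_options, user_options))


rows = read_text_data(
    "id;value\n1;10\n2;20\n",
    format="csv",
    connection_options={"delimiter": ","},  # connection-level default
    user_options={"delimiter": ";"},        # per-call override
)
print(rows)
```

A lookup table keeps unsupported formats failing in one place, while an explicit `if`/`elif` chain, as in the excerpt above, leaves room for format-specific handling such as the Delta branch.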

### 🔬 Deep Dive: PandasEngine.execute_sql()

How does Pandas run SQL queries?

In [None]:
# Extract execute_sql method (lines 254-299)
sql_method = '\n'.join(lines[253:300])
print(sql_method)

### 🧪 Analysis: SQL Execution Strategy

**Clever Two-Tier Fallback:**

1. **Prefer DuckDB** (fast, feature-rich SQL engine)
   ```python
   conn = duckdb.connect(":memory:")
   for name in context.list_names():
       conn.register(name, context.get(name))  # Zero-copy!
   return conn.execute(sql).df()
   ```

2. **Fallback to pandasql** (SQLite-backed, slower)
   ```python
   from pandasql import sqldf
   locals_dict = {name: context.get(name) for name in context.list_names()}
   return sqldf(sql, locals_dict)
   ```

**Why DuckDB?**
- Often 10-100x faster than pandasql, depending on workload (pandasql copies each DataFrame into SQLite before querying)
- Full SQL support (window functions, CTEs, etc.)
- Zero-copy integration with Pandas
- Used by many modern tools (Ibis, Hamilton, etc.)
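The two-tier preference can be expressed as a small selection function. This is a sketch under assumptions: `select_sql_backend` and `module_available` are hypothetical helpers, not part of Odibi, and the availability check is injectable so the fallback can be exercised without installing or uninstalling anything.

```python
import importlib.util
from typing import Callable, Sequence


def module_available(name: str) -> bool:
    """True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def select_sql_backend(
    preferred: Sequence[str] = ("duckdb", "pandasql"),
    is_available: Callable[[str], bool] = module_available,
) -> str:
    """Return the first available backend, in order of preference."""
    for name in preferred:
        if is_available(name):
            return name
    raise ImportError(
        "No SQL backend found; install one of: " + ", ".join(preferred)
    )


# Simulate an environment where only pandasql is installed.
chosen = select_sql_backend(is_available=lambda name: name == "pandasql")
print(chosen)  # pandasql
```

Checking availability before importing keeps the error message in one place; wrapping each import in `try`/`except ImportError` achieves the same result.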

## 📖 Part 3: SparkEngine Implementation

Now let's see how Spark implements the same interface:

In [None]:
spark_engine_path = Path(r"c:/Users/hodibi/OneDrive - Ingredion/Desktop/Repos/Odibi/odibi/engine/spark_engine.py")

with open(spark_engine_path) as f:
    spark_code = f.read()

print(f"SparkEngine: {len(spark_code.splitlines())} lines\n")

lines = spark_code.splitlines()
print("Class definition:")
print('\n'.join(lines[9:58]))  # Class and __init__

### 🔬 Deep Dive: SparkEngine.__init__()

Notice the differences from PandasEngine:

```python
def __init__(self, connections=None, spark_session=None, config=None):
    # Import guard
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        raise ImportError("Spark support requires 'pip install odibi[spark]'")
    
    # Configure Delta Lake
    from delta import configure_spark_with_delta_pip
    builder = SparkSession.builder.appName("odibi")
    self.spark = spark_session or configure_spark_with_delta_pip(builder).getOrCreate()
    
    # Configure all ADLS connections upfront
    self._configure_all_connections()
```

**Key Differences:**
1. **SparkSession** - Heavy object created once, reused
2. **Delta integration** - Configured at session level
3. **Connection config** - All credentials set upfront (Pandas does per-operation)

### 🔬 Deep Dive: SparkEngine.read()

Compare to PandasEngine:

In [None]:
# Extract read method (lines 81-118)
spark_read = '\n'.join(lines[80:119])
print(spark_read)

### 🧪 Analysis: Spark Read Pattern

**Much simpler than Pandas!**

```python
reader = self.spark.read.format(format)
for key, value in options.items():
    reader = reader.option(key, value)
return reader.load(full_path)
```

**Why simpler?**
- Spark has a **unified DataFrameReader API**
- All formats (CSV, Parquet, Delta, JSON) use same pattern
- No format-specific branches needed
- Options pass through the same `.option()` call regardless of format (the option names themselves are still format-specific)

**Trade-off:**
- ✅ More generic, extensible
- ❌ Less control over format-specific features
- ❌ Requires understanding Spark options

### 🔬 Deep Dive: SparkEngine.execute_sql()

SQL in Spark is native:

In [None]:
# Extract execute_sql method (lines 180-194)
spark_sql = '\n'.join(lines[179:195])
print(spark_sql)

### 🧪 Analysis: Spark SQL

```python
# Register DataFrames as temp views
for table_name, df in context.items():
    df.createOrReplaceTempView(table_name)

# Execute SQL
return self.spark.sql(sql)
```

**Much simpler than Pandas!**
- No external SQL engine needed
- Spark SQL is a first-class feature
- Uses Catalyst optimizer
- Can leverage distributed joins

**Context Difference:**
- Pandas: `context.list_names()` + `context.get(name)` (Context API)
- Spark: `context.items()` (simple dict)
- The engines currently expect different context types, so callers must pass the one that matches the engine in use
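One way to smooth over the two context styles is a small helper that yields `(name, dataframe)` pairs from either. This is a sketch, not Odibi code: `iter_named_frames` and `SimpleContext` (including its `register` method) are hypothetical, and only `list_names()`, `get(name)`, and `items()` come from the excerpts above.

```python
from typing import Any, Dict, Iterator, List, Tuple


class SimpleContext:
    """Hypothetical stand-in exposing the two calls quoted above."""

    def __init__(self) -> None:
        self._frames: Dict[str, Any] = {}

    def register(self, name: str, df: Any) -> None:
        self._frames[name] = df

    def list_names(self) -> List[str]:
        return list(self._frames)

    def get(self, name: str) -> Any:
        return self._frames[name]


def iter_named_frames(context: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (name, dataframe) pairs from either context style."""
    if hasattr(context, "list_names"):
        for name in context.list_names():
            yield name, context.get(name)
    elif hasattr(context, "items"):
        yield from context.items()
    else:
        raise TypeError(f"Unsupported context type: {type(context).__name__}")


ctx = SimpleContext()
ctx.register("sales", [1, 2, 3])

from_object = dict(iter_named_frames(ctx))
from_dict = dict(iter_named_frames({"sales": [1, 2, 3]}))
print(from_object == from_dict)
```

An engine that iterates through such a helper could accept both a Context object and a plain dict, which would let the same call site serve either engine.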

## 🎨 Part 4: The Power of Abstraction

Let's visualize how the same pipeline code works with different engines:

In [None]:
# Pseudo-code showing engine-agnostic pipeline

def run_pipeline(engine, connection):
    """
    This SAME code works with PandasEngine, SparkEngine, or DuckDBEngine!
    """
    # Read data
    df = engine.read(
        connection=connection,
        format='csv',
        path='sales.csv'
    )
    
    # Check schema
    schema = engine.get_schema(df)
    print(f"Columns: {schema}")
    
    # Transform with SQL
    result = engine.execute_sql(
        "SELECT category, SUM(amount) as total FROM df GROUP BY category",
        context={'df': df}
    )
    
    # Write output
    engine.write(
        df=result,
        connection=connection,
        format='parquet',
        path='output.parquet',
        mode='overwrite'
    )

# Same code, different engines:
# run_pipeline(PandasEngine(), local_conn)  # In-memory
# run_pipeline(SparkEngine(), adls_conn)    # Distributed
# run_pipeline(DuckDBEngine(), local_conn)  # Analytical

## 📊 Part 5: Side-by-Side Comparison

Let's create a comparison table:

In [None]:
import pandas as pd

comparison_data = {
    'Aspect': [
        'Execution Model',
        'Memory Model',
        'Parallelism',
        'SQL Engine',
        'Best For',
        'Cloud Support',
        'Delta Lake',
    ],
    'PandasEngine': [
        'Eager, in-memory',
        'All data in RAM',
        'Single-threaded',
        'DuckDB or pandasql',
        'Small-medium (<10GB)',
        'Via fsspec',
        'Via deltalake lib',
    ],
    'SparkEngine': [
        'Lazy, distributed',
        'Spill to disk',
        'Multi-node cluster',
        'Spark SQL (Catalyst)',
        'Large data (>100GB)',
        'Native (s3a, abfss)',
        'Native (delta-spark)',
    ],
    'DuckDBEngine (Target)': [
        'Eager, in-process',
        'Memory-mapped',
        'Multi-threaded',
        'Native DuckDB SQL',
        'Analytics (1-100GB)',
        'Via fsspec',
        'delta_scan() (delta extension)',
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df

## 🏗️ Part 6: Design Patterns in Engine Architecture

### Pattern 1: Strategy Pattern
The Engine ABC is a **Strategy** - different algorithms (engines) with the same interface.

### Pattern 2: Template Method (Implicit)
The ABC defines the "what", implementations define the "how".

### Pattern 3: Factory (in Odibi core)
```python
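def get_engine(engine_type: str) -> Engine:
    if engine_type == "pandas":
        return PandasEngine()
    elif engine_type == "spark":
        return SparkEngine()
    # Fail loudly instead of silently returning None for an unknown type
    raise ValueError(f"Unsupported engine type: {engine_type!r}")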
```
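A registry-based variant of the factory shows the Open/Closed principle more directly. This is an alternative sketch, not how Odibi core is implemented: `register_engine`, `_ENGINE_REGISTRY`, and the empty placeholder classes are all hypothetical.

```python
from typing import Callable, Dict


class Engine:
    """Placeholder base class standing in for the Engine ABC."""


class PandasEngine(Engine):
    pass


class SparkEngine(Engine):
    pass


# Maps a config name to a zero-argument callable that builds the engine.
_ENGINE_REGISTRY: Dict[str, Callable[[], Engine]] = {}


def register_engine(name: str, factory: Callable[[], Engine]) -> None:
    _ENGINE_REGISTRY[name.lower()] = factory


def get_engine(engine_type: str) -> Engine:
    try:
        factory = _ENGINE_REGISTRY[engine_type.lower()]
    except KeyError:
        known = ", ".join(sorted(_ENGINE_REGISTRY))
        raise ValueError(
            f"Unknown engine {engine_type!r}; expected one of: {known}"
        ) from None
    return factory()


register_engine("pandas", PandasEngine)
register_engine("spark", SparkEngine)


# Adding an engine later touches neither get_engine() nor existing classes.
class DuckDBEngine(Engine):
    pass


register_engine("duckdb", DuckDBEngine)

print(type(get_engine("duckdb")).__name__)
```

With a registry, a new engine is added by one `register_engine` call, whereas the `if`/`elif` factory must be edited for each new engine.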

### Pattern 4: Adapter (for Delta Lake)
```python
# PandasEngine adapts deltalake library to Engine interface
dt = DeltaTable(full_path)
return dt.to_pandas()  # Adapt to pd.DataFrame
```

## 🧪 Part 7: Testing with Mock Engines

The abstraction enables easy testing:

In [None]:
import pandas as pd  # MockEngine returns small DataFrames as canned data

class MockEngine:
    """Simple mock engine for testing pipeline logic."""
    
    def __init__(self):
        self.reads = []
        self.writes = []
        self.sql_queries = []
    
    def read(self, connection, format, table=None, path=None, options=None):
        self.reads.append({
            'connection': connection,
            'format': format,
            'path': path or table
        })
        # Return mock data
        return pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
    
    def write(self, df, connection, format, table=None, path=None, mode='overwrite', options=None):
        self.writes.append({
            'connection': connection,
            'format': format,
            'path': path or table,
            'mode': mode,
            'rows': len(df)
        })
    
    def execute_sql(self, sql, context):
        self.sql_queries.append(sql)
        return pd.DataFrame({'result': ['mocked']})
    
    def get_schema(self, df):
        return df.columns.tolist()
    
    def get_shape(self, df):
        return df.shape

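# Test a pipeline: run_pipeline() comes from Part 4, and the mock ignores the connection
mock = MockEngine()
run_pipeline(mock, connection=None)

# Inspect what the pipeline asked the engine to do
print(f"Reads: {mock.reads}")
print(f"Writes: {mock.writes}")
print(f"SQL: {mock.sql_queries}")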
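The printed call logs above can be turned into assertions so that a test fails automatically when the pipeline changes. The sketch below is self-contained and avoids Pandas by returning plain lists of dicts. `RecordingEngine` and `copy_csv_to_parquet` are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List


class RecordingEngine:
    """Records calls and returns canned rows (plain lists, no Pandas needed)."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def read(self, connection, format, path=None, **_):
        self.calls.append({"op": "read", "format": format, "path": path})
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

    def write(self, df, connection, format, path=None, mode="overwrite", **_):
        self.calls.append(
            {"op": "write", "format": format, "path": path,
             "mode": mode, "rows": len(df)}
        )


def copy_csv_to_parquet(engine, connection) -> None:
    """Tiny pipeline under test: read one file, write it back out."""
    df = engine.read(connection=connection, format="csv", path="sales.csv")
    engine.write(df=df, connection=connection, format="parquet",
                 path="output.parquet")


def test_pipeline_reads_then_writes() -> None:
    engine = RecordingEngine()
    copy_csv_to_parquet(engine, connection=None)

    # The pipeline should read exactly once, then write exactly once.
    ops = [call["op"] for call in engine.calls]
    assert ops == ["read", "write"]
    assert engine.calls[0]["path"] == "sales.csv"
    assert engine.calls[1] == {
        "op": "write", "format": "parquet", "path": "output.parquet",
        "mode": "overwrite", "rows": 2,
    }


test_pipeline_reads_then_writes()
print("pipeline test passed")
```

A test runner such as pytest would collect the `test_` function on its own; the direct call at the end is there only so the cell runs in a notebook.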

## 💡 Key Insights

### 1. ABC Stability is Critical
The Engine ABC hasn't changed in 2+ years because:
- Covers all essential operations
- Methods are small, focused (SRP)
- Interface is minimal but sufficient

### 2. Different Engines, Different Trade-offs
- **Pandas**: Simple, fast for small data, rich ecosystem
- **Spark**: Complex, essential for big data, JVM overhead
- **DuckDB**: Sweet spot for analytical workloads

### 3. Abstraction Enables Evolution
Odibi can:
- Add new engines without changing pipeline code
- Optimize engines independently
- Test with mocks
- Let users choose based on needs

### 4. Storage Options Pattern
Two approaches:
- **Pandas**: Merge per-operation (flexible)
- **Spark**: Configure once (efficient)

Both work because they're hidden behind the abstraction!

### 5. SQL is Not One-Size-Fits-All
- Pandas needs external SQL engine (DuckDB)
- Spark has native SQL
- The abstraction handles both transparently

## 🎯 Summary

The Engine abstraction is Odibi's architectural foundation:

1. **Engine ABC** - 9 methods defining the contract
2. **PandasEngine** - In-memory, flexible, feature-rich
3. **SparkEngine** - Distributed, scalable, production-grade
4. **Abstraction Benefits** - Swappable, testable, extensible

**Design Principles Applied:**
- ✅ Abstraction (ABC hides details)
- ✅ Polymorphism (treat engines uniformly)
- ✅ Dependency Inversion (depend on Engine, not PandasEngine)
- ✅ Open/Closed (add engines without modifying ABC)

**Next Steps:**
- Complete `exercises.ipynb` to build DuckDBEngine
- Study `engine_comparison.md` for detailed reference
- Proceed to `04_context_api/` to see how engines get data

## 🔗 Additional Resources

### Odibi Source Code
- [base.py](file:///c:/Users/hodibi/OneDrive%20-%20Ingredion/Desktop/Repos/Odibi/odibi/engine/base.py)
- [pandas_engine.py](file:///c:/Users/hodibi/OneDrive%20-%20Ingredion/Desktop/Repos/Odibi/odibi/engine/pandas_engine.py)
- [spark_engine.py](file:///c:/Users/hodibi/OneDrive%20-%20Ingredion/Desktop/Repos/Odibi/odibi/engine/spark_engine.py)

### Related Lessons
- `foundations/06_abc` - Abstract Base Classes fundamentals
- `odibi_deep_dive/01_config_system` - EngineType enum
- `odibi_deep_dive/02_connection_layer` - How connections work
- `odibi_deep_dive/04_context_api` - How engines receive data

### External References
- [DuckDB Python API](https://duckdb.org/docs/api/python)
- [PySpark SQL Module](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html)
- [Strategy Pattern](https://refactoring.guru/design-patterns/strategy)