# Tutorial 2: Function Registry and Decorators

Learn how to register reusable functions for both Spark and Pandas engines using the ODIBI function framework.

**Learning Objectives:**
- Use `@odibi_function` decorator for function registration
- Create engine-specific variants (Spark vs Pandas)
- Resolve functions from the global registry
- Understand the function resolution order
- Use shorthand decorators (`@spark_function`, `@pandas_function`, `@universal_function`)

**Prerequisites:**
- odibi_de_v2 installed
- pandas installed (Spark optional for full examples)

## Part 1: Basic Function Registration

In [None]:
from odibi_de_v2.odibi_functions import odibi_function, REGISTRY
import pandas as pd

# Register a universal function (works with any engine)
@odibi_function(
    engine="any",
    module="cleaning",
    description="Remove duplicate rows from DataFrame",
    version="1.0"
)
def remove_duplicates(df):
    """Remove duplicate rows - works for both Spark and Pandas."""
    return df.drop_duplicates()

print("✓ Registered 'remove_duplicates' function")

# Verify registration
all_functions = REGISTRY.get_all()
print(f"\nRegistered functions: {all_functions}")

In [None]:
# Resolve and use the function
fn = REGISTRY.resolve(
    module="cleaning",
    func="remove_duplicates",
    engine="pandas"
)

# Test with sample data
test_df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 4],
    'value': ['a', 'b', 'b', 'c', 'd', 'd']
})

print("Original DataFrame:")
print(test_df)

result = fn(test_df)
print("\nAfter remove_duplicates:")
print(result)
print(f"\nRows: {len(test_df)} → {len(result)}")

## Part 2: Engine-Specific Implementations

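Register separate implementations of the same operation for Spark and Pandas. In this example each variant has its own function name and is resolved by that name.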

In [None]:
# Pandas-specific implementation
@odibi_function(
    engine="pandas",
    module="aggregation",
    description="Calculate daily statistics using Pandas"
)
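def calculate_daily_stats_pandas(df):
    """Aggregate data by date using pandas groupby."""
    # Named aggregation yields flat column names that match the Spark variant
    return df.groupby('date').agg(
        value_count=('value', 'count'),
        value_sum=('value', 'sum'),
        value_mean=('value', 'mean'),
        id_nunique=('id', 'nunique'),
    ).reset_index()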

# Spark-specific implementation (same logical operation)
@odibi_function(
    engine="spark",
    module="aggregation",
    description="Calculate daily statistics using Spark SQL"
)
def calculate_daily_stats_spark(df):
    """Aggregate data by date using Spark SQL functions."""
    from pyspark.sql import functions as F
    
    return df.groupBy('date').agg(
        F.count('value').alias('value_count'),
        F.sum('value').alias('value_sum'),
        F.mean('value').alias('value_mean'),
        F.countDistinct('id').alias('id_nunique')
    )

print("✓ Registered engine-specific implementations")
print("  - calculate_daily_stats_pandas (engine='pandas')")
print("  - calculate_daily_stats_spark (engine='spark')")

## Part 3: Function Resolution

**Resolution Order:**
1. Exact match: `(function_name, engine)`
2. Universal fallback: `(function_name, "any")`
3. `None` if neither is registered
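
The lookup order above can be sketched with a plain dictionary keyed by `(name, engine)`. This is a minimal illustration, not ODIBI's actual implementation: the names `_functions`, `register`, and `resolve` are hypothetical, and the `module` argument is left out to keep the example short.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for the registry; not ODIBI's actual implementation.
_functions: Dict[Tuple[str, str], Callable] = {}


def register(name: str, engine: str, fn: Callable) -> None:
    """Store fn under the (name, engine) key."""
    _functions[(name, engine)] = fn


def resolve(name: str, engine: str) -> Optional[Callable]:
    """Look up (name, engine) first, then fall back to (name, 'any')."""
    exact = _functions.get((name, engine))
    if exact is not None:
        return exact
    return _functions.get((name, "any"))  # None if no fallback exists


register("clean", "any", lambda rows: "universal")
register("clean", "spark", lambda rows: "spark-specific")

spark_result = resolve("clean", "spark")([])    # exact match wins
pandas_result = resolve("clean", "pandas")([])  # falls back to "any"
missing = resolve("unknown", "pandas")          # nothing registered
```

Under these assumptions, a Spark-specific variant shadows the universal one only for Spark callers, while every other engine still gets the `"any"` implementation.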

In [None]:
# Test resolution for Pandas engine
pandas_fn = REGISTRY.resolve(
    module="aggregation",
    func="calculate_daily_stats_pandas",
    engine="pandas"
)

print(f"Resolved for pandas: {pandas_fn.__name__}")

# Test with sample data
sample_df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'id': [1, 2, 1, 3, 2],
    'value': [10, 20, 15, 25, 30]
})

print("\nSample Data:")
print(sample_df)

stats = pandas_fn(sample_df)
print("\nDaily Statistics:")
print(stats)

In [None]:
# Demonstrate fallback resolution
@odibi_function(
    engine="any",
    module="validation",
    description="Count null values"
)
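def count_nulls(df):
    """Count nulls per column (registered as "any" to demonstrate fallback).

    The body uses the pandas API; a plain pyspark.sql.DataFrame has no
    isnull() method, so this would not run on Spark as written.
    """
    return df.isnull().sum()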

# Resolve for pandas (no pandas-specific version, falls back to 'any')
resolved_fn = REGISTRY.resolve(
    module="validation",
    func="count_nulls",
    engine="pandas"
)

print(f"✓ Resolved 'count_nulls' for pandas engine: {resolved_fn.__name__}")

# Test
test_df_with_nulls = pd.DataFrame({
    'a': [1, 2, None, 4],
    'b': [None, 'x', 'y', None],
    'c': [1.0, 2.0, 3.0, 4.0]
})

null_counts = resolved_fn(test_df_with_nulls)
print("\nNull Counts:")
print(null_counts)

## Part 4: Shorthand Decorators

The shorthand decorators are aliases for `@odibi_function` with the `engine` argument already set.
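
One way such shorthands could be built is with `functools.partial` over a general decorator factory. The sketch below is an assumption about the pattern, not the library's source: `toy_registry`, `toy_function`, `toy_pandas_function`, and `toy_universal_function` are all hypothetical names.

```python
from functools import partial

# Hypothetical registry and decorator, for illustration only.
toy_registry = {}


def toy_function(engine="any", module=None, **metadata):
    """Decorator factory: records fn plus metadata, returns fn unchanged."""
    def decorator(fn):
        toy_registry[(fn.__name__, engine)] = {
            "fn": fn,
            "module": module,
            **metadata,
        }
        return fn
    return decorator


# Shorthands pre-fill the engine argument.
toy_pandas_function = partial(toy_function, engine="pandas")
toy_universal_function = partial(toy_function, engine="any")


@toy_pandas_function(module="enrichment", author="data_team")
def double(values):
    """Double every value in a list."""
    return [v * 2 for v in values]


doubled = double([1, 2, 3])  # the decorated function is still callable directly
entry = toy_registry[("double", "pandas")]
```

Because the decorator returns the original function object, registration adds an entry to the registry without changing how the function is called.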

In [None]:
from odibi_de_v2.odibi_functions import (
    spark_function,
    pandas_function,
    universal_function
)

# Equivalent to @odibi_function(engine="pandas", ...)
@pandas_function(
    module="enrichment",
    description="Add revenue column",
    author="data_team"
)
def add_revenue_column(df):
    """Calculate revenue from quantity and price."""
    df = df.copy()
    df['revenue'] = df['quantity'] * df['price']
    return df

# Equivalent to @odibi_function(engine="any", ...)
@universal_function(
    module="enrichment",
    description="Add timestamp column"
)
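def add_timestamp(df):
    """Add current timestamp to DataFrame.

    Note: copy() and column assignment are pandas APIs, so this body only
    runs on pandas DataFrames despite the universal registration.
    """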
    from datetime import datetime
    df = df.copy()
    df['processed_at'] = datetime.now()
    return df

print("✓ Registered functions using shorthand decorators:")
print("  - add_revenue_column (@pandas_function)")
print("  - add_timestamp (@universal_function)")

In [None]:
# Test the enrichment functions
sales_df = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C'],
    'quantity': [10, 5, 8],
    'price': [25.50, 40.00, 15.75]
})

print("Original Data:")
print(sales_df)

# Apply functions directly (they're still normal Python functions)
enriched = add_revenue_column(sales_df)
enriched = add_timestamp(enriched)

print("\nEnriched Data:")
print(enriched)

## Part 5: Advanced - Function Metadata

In [None]:
@odibi_function(
    engine="pandas",
    module="quality",
    description="Data quality checker with configurable thresholds",
    author="quality_team",
    version="2.0",
    tags=["quality", "validation", "monitoring"],
    category="data_quality"
)
def check_data_quality(df, null_threshold=0.1, dup_threshold=0.05):
    """
    Comprehensive data quality check.
    
    Args:
        df: Input DataFrame
        null_threshold: Max allowed null percentage (default 10%)
        dup_threshold: Max allowed duplicate percentage (default 5%)
    
    Returns:
        Quality report dictionary
    """
    total_rows = len(df)
    
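    # Calculate metrics (guard against empty frames to avoid dividing by zero)
    total_cells = total_rows * len(df.columns)
    null_pct = df.isnull().sum().sum() / total_cells if total_cells > 0 else 0
    dup_count = df.duplicated().sum()
    dup_pct = dup_count / total_rows if total_rows > 0 else 0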
    
    report = {
        'total_rows': total_rows,
        'total_columns': len(df.columns),
        'null_percentage': null_pct,
        'duplicate_count': dup_count,
        'duplicate_percentage': dup_pct,
        'null_check_passed': null_pct <= null_threshold,
        'dup_check_passed': dup_pct <= dup_threshold,
        'overall_passed': (null_pct <= null_threshold) and (dup_pct <= dup_threshold)
    }
    
    return report

# Retrieve metadata about the function
metadata = REGISTRY.get_metadata("check_data_quality", "pandas")
print("Function Metadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

In [None]:
# Test quality checker
test_data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Charlie', None, 'Eve'],
    'score': [95, 87, None, None, 92, 88]
})

quality_report = check_data_quality(test_data)

print("\nData Quality Report:")
print(f"Total Rows: {quality_report['total_rows']}")
print(f"Null %: {quality_report['null_percentage']:.2%}")
print(f"Duplicate %: {quality_report['duplicate_percentage']:.2%}")
print(f"\n✓ Overall Quality: {'PASSED' if quality_report['overall_passed'] else 'FAILED'}")
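
The report for this test data can be checked by hand. The sketch below recomputes the two metrics in plain Python on the same rows, using the default thresholds from `check_data_quality`. It assumes that repeated rows containing missing values count as duplicates, which is how `set` treats `None` here and how pandas `duplicated()` is expected to treat missing values.

```python
# Same rows as test_data, as plain tuples (None marks a missing value).
rows = [
    (1, "Alice", 95),
    (2, "Bob", 87),
    (3, "Charlie", None),
    (3, "Charlie", None),
    (4, None, 92),
    (5, "Eve", 88),
]

total_cells = len(rows) * len(rows[0])                    # 6 rows * 3 columns = 18
null_cells = sum(v is None for row in rows for v in row)  # 3 missing cells
null_pct = null_cells / total_cells                       # 3 / 18

dup_count = len(rows) - len(set(rows))                    # 1 repeated row
dup_pct = dup_count / len(rows)                           # 1 / 6

# Default thresholds: 10% nulls, 5% duplicates
overall_passed = null_pct <= 0.1 and dup_pct <= 0.05
print(f"null={null_pct:.2%} dup={dup_pct:.2%} passed={overall_passed}")
# prints: null=16.67% dup=16.67% passed=False
```

Both metrics exceed their thresholds, so the overall check fails for this data.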

## Part 6: Listing and Discovery

In [None]:
# List all registered functions
all_functions = REGISTRY.get_all()

print("All Registered Functions:")
print("=" * 70)

for func_name, engine in sorted(all_functions):
    metadata = REGISTRY.get_metadata(func_name, engine)
    module = metadata.get('module', 'N/A')
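    desc = metadata.get('description') or 'No description'
    # Truncate long descriptions, adding an ellipsis only when text was cut
    short_desc = desc if len(desc) <= 50 else desc[:50] + '...'
    print(f"\n{func_name} [engine={engine}]")
    print(f"  Module: {module}")
    print(f"  Description: {short_desc}")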

In [None]:
# Filter by module
def get_functions_by_module(module_name):
    """Get all functions in a specific module."""
    results = []
    for func_name, engine in REGISTRY.get_all():
        metadata = REGISTRY.get_metadata(func_name, engine)
        if metadata.get('module') == module_name:
            results.append((func_name, engine, metadata))
    return results

# Find all cleaning functions
cleaning_functions = get_functions_by_module('cleaning')

print("\nFunctions in 'cleaning' module:")
for func_name, engine, metadata in cleaning_functions:
    print(f"  - {func_name} [{engine}]: {metadata.get('description', 'N/A')}")
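
The same filtering idea extends to other metadata fields such as tags. The sketch below works on plain dictionaries rather than the real registry: the `records` list, the `with_tag` and `group_by_module` helpers, and the assumption that tags are stored under a `tags` key are all hypothetical.

```python
# Hypothetical metadata records, loosely shaped like the registrations above.
records = [
    {"name": "check_data_quality", "engine": "pandas", "module": "quality",
     "tags": ["quality", "validation", "monitoring"]},
    {"name": "remove_duplicates", "engine": "any", "module": "cleaning"},
    {"name": "count_nulls", "engine": "any", "module": "validation"},
]


def with_tag(items, tag):
    """Return names of records whose 'tags' list contains tag."""
    return [r["name"] for r in items if tag in r.get("tags", [])]


def group_by_module(items):
    """Map each module name to the list of function names registered in it."""
    grouped = {}
    for r in items:
        grouped.setdefault(r.get("module", "N/A"), []).append(r["name"])
    return grouped


tagged = with_tag(records, "validation")
by_module = group_by_module(records)
```

Records without a `tags` entry are simply skipped by `with_tag`, so functions registered with minimal metadata do not break the search.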

## Summary

**What You Learned:**

1. ✓ **@odibi_function Decorator**: Register functions with `engine`, `module`, and metadata
2. ✓ **Engine-Specific Variants**: Create Spark and Pandas implementations of the same logic
3. ✓ **Function Resolution**: Use `REGISTRY.resolve(module, func, engine)` with fallback
4. ✓ **Shorthand Decorators**: `@spark_function`, `@pandas_function`, `@universal_function`
5. ✓ **Metadata**: Attach author, version, tags, and custom fields
6. ✓ **Discovery**: List and filter registered functions programmatically

**Key Patterns:**

```python
# Register function
@odibi_function(engine="pandas", module="mymodule")
def my_function(df):
    return df

# Resolve and use
fn = REGISTRY.resolve("mymodule", "my_function", "pandas")
result = fn(data)
```

**Best Practices:**
- Use `engine="any"` only for functions whose body relies on APIs shared by Spark and Pandas DataFrames (for example `drop_duplicates()`)
- Organize functions by `module` for better discoverability
- Add descriptive metadata for documentation and searchability
- Functions remain normal Python callables (backward-compatible)

**Next Steps:**
- Tutorial 3: Hooks and Observability
- Tutorial 4: Complete Project Template