# Automated Delta Lake Optimization Metrics Collection

This companion notebook provides automated metrics collection and visualization capabilities for the Delta Lake optimization project. It demonstrates how to programmatically capture performance metrics and store them in a Delta table for trend analysis.

## Features
- Automated capture of query duration, row counts, and table metadata, with manual entry of Spark UI figures (files scanned, bytes read)
- Storage of metrics in a Delta table for historical tracking
- Visualization of performance improvements over time
- Comparison utilities for different optimization techniques

## Prerequisites
Run the main `project.ipynb` notebook first to create the base tables and complete at least a few optimization steps.

In [0]:
# Configuration - must match main project settings
CATALOG_NAME = "delta_optimization_project"
SCHEMA_NAME = "sales_data"
METRICS_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.optimization_metrics"

# Ensure we're using the right catalog and schema
spark.sql(f"USE CATALOG {CATALOG_NAME}")
spark.sql(f"USE SCHEMA {SCHEMA_NAME}")

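# Import required libraries (explicit names instead of a wildcard import)
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType,
    IntegerType, LongType, DoubleType,
)
import datetime
import json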

In [0]:
# Define metrics collection schema
metrics_schema = StructType([
    StructField("experiment_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("table_name", StringType(), False),
    StructField("optimization_technique", StringType(), False),
    StructField("step_number", IntegerType(), False),
    StructField("query_description", StringType(), True),
    StructField("files_scanned", LongType(), True),
    StructField("bytes_read", LongType(), True),
    StructField("duration_ms", LongType(), True),
    StructField("output_rows", LongType(), True),
    StructField("num_files_total", LongType(), True),
    StructField("table_size_bytes", LongType(), True),
    StructField("avg_file_size_mb", DoubleType(), True),
    StructField("additional_metrics", StringType(), True)  # JSON for extensibility
])

# Create metrics table if it doesn't exist
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {METRICS_TABLE} (
    experiment_id STRING NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    table_name STRING NOT NULL,
    optimization_technique STRING NOT NULL,
    step_number INT NOT NULL,
    query_description STRING,
    files_scanned BIGINT,
    bytes_read BIGINT,
    duration_ms BIGINT,
    output_rows BIGINT,
    num_files_total BIGINT,
    table_size_bytes BIGINT,
    avg_file_size_mb DOUBLE,
    additional_metrics STRING
) USING DELTA
TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
)
""")

print(f"✅ Metrics table created/verified: {METRICS_TABLE}")
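
Five of the columns above are declared `NOT NULL`, so a record that is missing one of them would be rejected at write time. A minimal sketch of a pre-write check in plain Python, with no Spark dependency; `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names introduced here and are not part of the notebook:

```python
# Hypothetical helper: mirrors the NOT NULL columns of the metrics table.
REQUIRED_FIELDS = (
    "experiment_id",
    "timestamp",
    "table_name",
    "optimization_technique",
    "step_number",
)


def missing_required_fields(record):
    """Return the required field names that are absent or None in `record`."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


example_record = {
    "experiment_id": "exp_demo",
    "timestamp": None,  # deliberately left unset
    "table_name": "sales_raw",
    "optimization_technique": "baseline",
    "step_number": 1,
}
print(missing_required_fields(example_record))  # ['timestamp']
```

A check like this could run before `spark.createDataFrame` so that a bad record produces a readable message rather than a schema error.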

In [0]:
class DeltaOptimizationMetricsCollector:
    """Automated metrics collection for Delta Lake optimization experiments."""
    
    def __init__(self, experiment_id=None):
        self.experiment_id = experiment_id or f"exp_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.metrics_table = METRICS_TABLE
        
    def capture_table_metadata(self, table_name):
        """Capture table metadata like file count, size, etc."""
        try:
            detail_df = spark.sql(f"DESCRIBE DETAIL {table_name}")
            detail = detail_df.collect()[0]
            
            num_files = detail['numFiles'] if detail['numFiles'] else 0
            size_bytes = detail['sizeInBytes'] if detail['sizeInBytes'] else 0
            avg_file_size_mb = (size_bytes / num_files / 1024 / 1024) if num_files > 0 else 0
            
            return {
                'num_files_total': num_files,
                'table_size_bytes': size_bytes,
                'avg_file_size_mb': avg_file_size_mb
            }
        except Exception as e:
            print(f"⚠️ Error capturing table metadata: {e}")
            return {'num_files_total': None, 'table_size_bytes': None, 'avg_file_size_mb': None}
    
    def record_query_metrics(self, 
                           table_name, 
                           optimization_technique, 
                           step_number,
                           query_description=None,
                           files_scanned=None,
                           bytes_read=None,
                           duration_ms=None,
                           output_rows=None,
                           additional_metrics=None):
        """Record query execution metrics."""
        
        # Capture table metadata
        table_meta = self.capture_table_metadata(table_name)
        
        # Prepare metrics record
        metrics_record = {
            'experiment_id': self.experiment_id,
            'timestamp': datetime.datetime.now(),
            'table_name': table_name,
            'optimization_technique': optimization_technique,
            'step_number': step_number,
            'query_description': query_description,
            'files_scanned': files_scanned,
            'bytes_read': bytes_read,
            'duration_ms': duration_ms,
            'output_rows': output_rows,
            'additional_metrics': json.dumps(additional_metrics) if additional_metrics else None
        }
        
        # Add table metadata
        metrics_record.update(table_meta)
        
        # Insert into metrics table
        metrics_df = spark.createDataFrame([metrics_record], metrics_schema)
        metrics_df.write.mode("append").saveAsTable(self.metrics_table)
        
        print(f"📊 Metrics recorded for {optimization_technique} on {table_name}")
        return metrics_record
    
    def benchmark_query(self, query, table_name, optimization_technique, step_number, query_description=None):
        """Execute a query and automatically capture its metrics."""
        import time
        
        print(f"🔍 Executing benchmark query: {query_description or 'Query'}")
        
        # Execute query and measure time
        start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals
        result_df = spark.sql(query)
        output_rows = result_df.count()  # count() forces execution
        end_time = time.perf_counter()
        
        duration_ms = int((end_time - start_time) * 1000)
        
        # Note: In a real Databricks environment, you would extract files_scanned 
        # and bytes_read from the Spark UI or query plan. For this demo, we'll 
        # set them as None and rely on manual input or future enhancement.
        
        # Record metrics
        metrics = self.record_query_metrics(
            table_name=table_name,
            optimization_technique=optimization_technique,
            step_number=step_number,
            query_description=query_description,
            duration_ms=duration_ms,
            output_rows=output_rows,
            additional_metrics={'query': query}
        )
        
        print(f"⏱️ Query completed in {duration_ms}ms, returned {output_rows} rows")
        return result_df, metrics

# Create a global instance for easy use
metrics_collector = DeltaOptimizationMetricsCollector()

print(f"✅ Metrics collector initialized with experiment ID: {metrics_collector.experiment_id}")
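
The derived value in `capture_table_metadata` is plain arithmetic on two `DESCRIBE DETAIL` fields, `sizeInBytes` and `numFiles`. A sketch of that calculation in isolation, using made-up input numbers; the standalone function `avg_file_size_mb` is an illustrative name, not part of the collector class:

```python
def avg_file_size_mb(size_bytes, num_files):
    """Average file size in MiB; 0 when the table has no files."""
    if not num_files:
        return 0
    return size_bytes / num_files / 1024 / 1024


# Illustrative numbers only: 512 MiB spread over 2,048 small files.
before = avg_file_size_mb(512 * 1024 * 1024, 2048)
# The same data compacted into 4 files.
after = avg_file_size_mb(512 * 1024 * 1024, 4)
print(before, after)  # 0.25 128.0
```

The total size is unchanged in both cases; only the file count moves, which is why `avg_file_size_mb` is a useful signal for compaction.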

## Usage Examples

Here are examples of how to use the automated metrics collection system in your optimization experiments:

In [0]:
# Example 1: Benchmark a query automatically
# This would typically be run after each optimization step in the main notebook

sample_query = """
SELECT country, 
       COUNT(*) as total_sales,
       SUM(amount) as total_revenue
FROM delta_optimization_project.sales_data.sales_raw 
WHERE country IN ('USA', 'Germany', 'France')
GROUP BY country
"""

# Check if the table exists before running the example
try:
    spark.sql("DESCRIBE TABLE delta_optimization_project.sales_data.sales_raw")
    table_exists = True
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    table_exists = False
    print("ℹ️ Main project tables not found. Run project.ipynb first to create sample data.")

if table_exists:
    result_df, metrics = metrics_collector.benchmark_query(
        query=sample_query,
        table_name="delta_optimization_project.sales_data.sales_raw",
        optimization_technique="baseline",
        step_number=1,
        query_description="Country aggregation baseline"
    )
    
    display(result_df)
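
A single timed run, as in `benchmark_query`, can be skewed by caching or cluster warm-up. One possible mitigation, sketched here in plain Python, is to repeat the measurement and report the median. `time_callable` is a hypothetical helper, and the workload below is only a stand-in for something like `spark.sql(query).count()`:

```python
import statistics
import time


def time_callable(fn, repeats=3):
    """Run `fn` `repeats` times; return (last result, median duration in ms)."""
    durations_ms = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations_ms.append((time.perf_counter() - start) * 1000)
    return result, statistics.median(durations_ms)


# Stand-in workload; in the notebook this would be a lambda wrapping the query.
rows, median_ms = time_callable(lambda: sum(range(100_000)))
print(rows, f"{median_ms:.2f} ms")
```

Whether repeated runs are appropriate depends on the experiment: if the goal is to measure cold-cache behavior, only the first run is representative.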

In [0]:
# Example 2: Record metrics manually (when you have Spark UI data)
# This approach allows you to input specific metrics from the Spark UI

if table_exists:
    manual_metrics = metrics_collector.record_query_metrics(
        table_name="delta_optimization_project.sales_data.sales_raw",
        optimization_technique="partitioned",
        step_number=2,
        query_description="After country partitioning",
        files_scanned=50,  # From Spark UI
        bytes_read=1024*1024*100,  # 100 MB from Spark UI
        duration_ms=2500,
        output_rows=3,
        additional_metrics={
            "scan_efficiency": "high",
            "partition_pruning": True
        }
    )
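
`additional_metrics` is stored as a JSON string, so reading it back requires decoding. A short sketch of the round trip using only the standard library; `encode_additional` and `decode_additional` are hypothetical names. Note that, following the `if additional_metrics else None` check in `record_query_metrics`, an empty dict is stored as NULL rather than as `{}`:

```python
import json


def encode_additional(additional_metrics):
    """Same rule as record_query_metrics: falsy input becomes None."""
    return json.dumps(additional_metrics) if additional_metrics else None


def decode_additional(stored):
    """Decode a stored value; NULL (None) is treated as an empty dict."""
    return json.loads(stored) if stored is not None else {}


stored = encode_additional({"scan_efficiency": "high", "partition_pruning": True})
print(stored)  # {"scan_efficiency": "high", "partition_pruning": true}
print(decode_additional(stored)["partition_pruning"])  # True
print(encode_additional({}))  # None
```

Within Spark SQL, the built-in `get_json_object` function can pull a single key out of the stored string without decoding it in Python.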

## Metrics Visualization

Visualize the performance improvements across different optimization techniques:

In [0]:
def create_performance_comparison():
    """Create a performance comparison visualization."""
    
    # Query metrics data
    metrics_df = spark.sql(f"""
    SELECT 
        optimization_technique,
        step_number,
        AVG(duration_ms) as avg_duration_ms,
        AVG(files_scanned) as avg_files_scanned,
        AVG(bytes_read / 1024 / 1024) as avg_mb_read,
        AVG(avg_file_size_mb) as avg_file_size_mb,
        COUNT(*) as measurement_count
    FROM {METRICS_TABLE}
    WHERE duration_ms IS NOT NULL
    GROUP BY optimization_technique, step_number
    ORDER BY step_number
    """)
    
    if metrics_df.count() > 0:
        display(metrics_df)
        
        print("\n📈 Performance Trends:")
        print("• Lower duration_ms = better query performance")
        print("• Lower files_scanned = better file pruning")
        print("• Higher avg_file_size_mb = better file consolidation")
    else:
        print("📊 No metrics data available yet. Run some benchmarks first!")
    
    return metrics_df

def show_file_size_trends():
    """Show how file sizes change with different optimizations."""
    
    file_trends = spark.sql(f"""
    SELECT 
        table_name,
        optimization_technique,
        step_number,
        num_files_total,
        ROUND(table_size_bytes / 1024 / 1024, 2) as table_size_mb,
        ROUND(avg_file_size_mb, 2) as avg_file_size_mb,
        timestamp
    FROM {METRICS_TABLE}
    WHERE num_files_total IS NOT NULL
    ORDER BY table_name, step_number, timestamp DESC
    """)
    
    if file_trends.count() > 0:
        print("📁 File Size Evolution:")
        display(file_trends)
        
        # Calculate improvement ratios
        baseline_files = file_trends.filter(F.col("step_number") == 1).select("num_files_total").collect()
        if baseline_files:
            baseline_count = baseline_files[0]["num_files_total"]
            print(f"\n🎯 Optimization Impact (vs baseline of {baseline_count} files):")
            
            improvements = file_trends.withColumn(
                "file_reduction_ratio", 
                F.round((F.lit(baseline_count) - F.col("num_files_total")) / F.lit(baseline_count) * 100, 1)
            ).select("optimization_technique", "num_files_total", "file_reduction_ratio")
            
            display(improvements)
    else:
        print("📁 No file size data available yet.")
    
    return file_trends

# Create visualizations
print("🔍 Generating performance analysis...\n")
perf_comparison = create_performance_comparison()
file_trends = show_file_size_trends()
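
The `file_reduction_ratio` column computed in `show_file_size_trends` is a percentage relative to the step 1 file count. The same arithmetic in plain Python, with invented file counts; `file_reduction_pct` is a hypothetical name, and a negative value means the file count grew (which can happen after partitioning):

```python
def file_reduction_pct(baseline_count, current_count):
    """Percentage reduction in file count relative to the baseline."""
    if not baseline_count:
        return None  # undefined without a non-zero baseline
    return round((baseline_count - current_count) / baseline_count * 100, 1)


print(file_reduction_pct(200, 50))   # 75.0
print(file_reduction_pct(200, 260))  # -30.0
print(file_reduction_pct(0, 10))     # None
```

Python's `round` and Spark's `round` treat exact ties differently, so values that fall precisely on a rounding boundary may differ in the last digit between this sketch and the notebook output.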

## Integration with Main Project

To integrate this automated metrics collection with the main `project.ipynb` notebook, add these code snippets after each optimization step:

### Step 1: Initialize (add to main notebook setup)
```python
# Import metrics collection
%run "./metrics_collection"

# Initialize collector
metrics = DeltaOptimizationMetricsCollector("my_experiment_2024")
```

### Step 2: After each query (replace manual tracking)
```python
# Instead of manually recording metrics, use:
result_df, metrics_data = metrics.benchmark_query(
    query="SELECT * FROM sales_table WHERE country = 'USA'",
    table_name="sales_table",
    optimization_technique="partitioned",
    step_number=3,
    query_description="Country filter after partitioning"
)
```

### Step 3: View results
```python
# Generate performance comparison
create_performance_comparison()
```

This approach provides:
- ✅ Automated data collection
- ✅ Historical trend tracking
- ✅ Visual performance comparisons
- ✅ Reproducible experiments
- ✅ Extensible metrics schema
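
For a quick check outside Spark, the grouping that `create_performance_comparison` performs in SQL can be approximated on a list of dicts. The sketch below simplifies it to group by technique only (the SQL also groups by `step_number`), and the durations are placeholder values, not measurements; `average_duration_by_technique` is a hypothetical helper:

```python
from collections import defaultdict


def average_duration_by_technique(records):
    """Average duration_ms per technique, skipping records without a duration."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("duration_ms") is not None:
            grouped[record["optimization_technique"]].append(record["duration_ms"])
    return {name: sum(values) / len(values) for name, values in grouped.items()}


# Placeholder values for illustration only, not real measurements.
sample = [
    {"optimization_technique": "baseline", "duration_ms": 4000},
    {"optimization_technique": "baseline", "duration_ms": 5000},
    {"optimization_technique": "partitioned", "duration_ms": 2500},
    {"optimization_technique": "partitioned", "duration_ms": None},
]
print(average_duration_by_technique(sample))
# {'baseline': 4500.0, 'partitioned': 2500.0}
```

Records with a missing duration are skipped, matching the `WHERE duration_ms IS NOT NULL` filter in the SQL.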