# BenchBox Databricks Benchmarking

**Complete Guide to Running Analytics Benchmarks on Databricks**

This notebook demonstrates how to use BenchBox to run industry-standard benchmarks (TPC-H, TPC-DS, ClickBench) on Databricks with Unity Catalog and Delta Lake.

**What You'll Learn:**
- Install and configure BenchBox for Databricks
- Run benchmarks at different scale factors
- Leverage Unity Catalog and Delta Lake optimizations
- Analyze and visualize performance results
- Troubleshoot common issues

**Prerequisites:**
- Active Databricks workspace
- SQL Warehouse or compute cluster with Unity Catalog enabled
- Personal Access Token or Service Principal credentials
- Unity Catalog volume for data staging

**Estimated Time:** 15-30 minutes for quick examples, 1-2 hours for comprehensive benchmarking

## 1. Installation & Setup

First, we'll install BenchBox and required dependencies, then configure authentication.

In [None]:
# Install BenchBox with Databricks support
# Use %pip in Databricks notebooks for proper installation
%pip install benchbox[databricks] matplotlib seaborn pandas --quiet

# Restart Python kernel to load new packages
dbutils.library.restartPython()

In [None]:
# Import required libraries
import os

# Visualization imports
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# BenchBox imports
from benchbox.core.config import BenchmarkConfig, DatabaseConfig
from benchbox.core.runner import LifecyclePhases, run_benchmark_lifecycle
from benchbox.platforms.databricks import DatabricksAdapter

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

print("✅ Imports successful")

### Authentication Method 1: Environment Variables (Recommended)

For production use, store credentials in Databricks Secrets or environment variables.
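
The lookup-with-defaults pattern used in the next cell can be tried outside a workspace. The sketch below is illustrative only: `normalize_host` and `resolve_settings` are hypothetical helper names (not part of BenchBox), the host value is made up, and the `main`/`benchbox`/`data` defaults simply mirror the ones in this notebook.

```python
import os
from typing import Mapping, Optional


def normalize_host(host: str) -> str:
    """Return the host with an https:// scheme and no trailing slash."""
    host = host.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read connection settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    host = env.get("DATABRICKS_HOST")
    return {
        "host": normalize_host(host) if host else None,
        "token": env.get("DATABRICKS_TOKEN"),
        "catalog": env.get("UC_CATALOG", "main"),
        "schema": env.get("UC_SCHEMA", "benchbox"),
        "volume": env.get("UC_VOLUME", "data"),
    }


# Made-up host value, passed explicitly so the example does not depend on the real environment
settings = resolve_settings({"DATABRICKS_HOST": "adb-123.azuredatabricks.net/"})
print(settings["host"])
print(settings["catalog"])
```

Passing the mapping explicitly keeps the helper easy to test; calling `resolve_settings()` with no argument falls back to `os.environ`.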

In [None]:
# Method 1: Environment Variables (Recommended for Jobs)
try:
    # In Databricks, these can be set via:
    # - Cluster environment variables
    # - Job task parameters
    # - Databricks Secrets (dbutils.secrets.get)

    DATABRICKS_HOST = os.environ.get("DATABRICKS_HOST")
    DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

    # Unity Catalog configuration
    UC_CATALOG = os.environ.get("UC_CATALOG", "main")
    UC_SCHEMA = os.environ.get("UC_SCHEMA", "benchbox")
    UC_VOLUME = os.environ.get("UC_VOLUME", "data")

    if not DATABRICKS_HOST or not DATABRICKS_TOKEN:
        print("⚠️  Environment variables not set. Using fallback method...")
        # Fallback: Use current workspace context
        DATABRICKS_HOST = dbutils.notebook.entry_point.getDbutils().notebook().getContext().browserHostName().get()
        DATABRICKS_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
        print("✅ Using workspace context for authentication")
    else:
        print("✅ Using environment variables for authentication")

except Exception as e:
    print(f"❌ Authentication error: {e}")
    print("\n💡 Troubleshooting:")
    print("  1. Set DATABRICKS_HOST and DATABRICKS_TOKEN environment variables")
    print("  2. Or run this notebook in a Databricks workspace")
    print("  3. Ensure your token has workspace access permissions")
    raise

print(f"\n📍 Databricks Host: {DATABRICKS_HOST}")
print(f"📋 Unity Catalog: {UC_CATALOG}.{UC_SCHEMA}")
print(f"📁 Volume: {UC_VOLUME}")

### Authentication Method 2: Databricks Secrets (Production)

For secure credential management in production.

In [None]:
# Method 2: Databricks Secrets (Most Secure)
# Uncomment and configure if using Databricks Secrets

# try:
#     SECRET_SCOPE = "benchbox"  # Your secret scope name
#     DATABRICKS_TOKEN = dbutils.secrets.get(scope=SECRET_SCOPE, key="databricks_token")
#     print("✅ Retrieved credentials from Databricks Secrets")
# except Exception as e:
#     print(f"⚠️ Could not retrieve secrets: {e}")
#     print("Using fallback authentication method")

print("ℹ️ Secrets method commented out - using environment variables")

### Connection Test

Verify we can connect to Databricks successfully.

In [None]:
# Test connection to Databricks
try:
    # Verify Unity Catalog volume exists
    volume_path = f"/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}"

    try:
        # Check if volume is accessible
        display(dbutils.fs.ls(volume_path))
        print(f"✅ Unity Catalog volume accessible: {volume_path}")
    except Exception:
        print(f"⚠️ Volume not found or not accessible: {volume_path}")
        print("\n💡 The volume may need to be created.")
        print("Run these SQL commands in a notebook or SQL editor:")
        print(f"  CREATE SCHEMA IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA};")
        print(f"  CREATE VOLUME IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA}.{UC_VOLUME};")

    print("\n✅ Connection test passed")

except Exception as e:
    print(f"❌ Connection test failed: {e}")
    print("\n💡 Troubleshooting steps:")
    print("  1. Verify your token is valid")
    print("  2. Ensure you have Unity Catalog access")
    print("  3. Check if the catalog and schema exist")
    raise

### Verify BenchBox Installation

In [None]:
# Verify BenchBox version and available benchmarks
import benchbox

print(f"📦 BenchBox version: {benchbox.__version__}")
print("\n🎯 Available Benchmarks:")
print("  • TPC-H: Decision support benchmark (22 queries)")
print("  • TPC-DS: Complex analytics benchmark (99 queries)")
print("  • ClickBench: Real-world analytics (43 queries)")
print("  • SSB: Star Schema Benchmark")
print("  • And more...")

print("\n✅ Setup complete! Ready to run benchmarks.")

## 2. Quick Start Example

Run a simple TPC-H benchmark to verify everything works. This will:
1. Generate ~10MB of TPC-H data (scale factor 0.01)
2. Load it into Delta tables in Unity Catalog
3. Execute the TPC-H power test (22 queries)
4. Display results

**Expected time:** 5-10 minutes
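
As a rough guide when picking a scale factor: TPC-H is sized so that scale factor 1 corresponds to roughly 1 GB of raw data, which is where the ~10MB figure for 0.01 comes from. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and compressed Delta/Parquet storage will usually be smaller than this raw estimate.

```python
def estimate_tpch_raw_size_mb(scale_factor: float) -> float:
    """Approximate raw TPC-H data volume in MB (SF 1 is about 1 GB)."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor * 1000


for sf in (0.01, 0.1, 1.0, 10.0):
    print(f"SF={sf}: ~{estimate_tpch_raw_size_mb(sf):,.0f} MB")
```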

### Configure Small Benchmark

In [None]:
# Configure a small TPC-H benchmark
db_cfg = DatabaseConfig(type="databricks", name="unity-catalog")

platform_cfg = {
    "host": DATABRICKS_HOST,
    "token": DATABRICKS_TOKEN,
    "uc_catalog": UC_CATALOG,
    "uc_schema": UC_SCHEMA,
    "uc_volume": UC_VOLUME,
    "staging_root": f"dbfs:/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/benchbox",
}

bench_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Quick Test",
    scale_factor=0.01,  # ~10MB of data
    test_execution_type="power",  # Sequential query execution
)

print("⚙️ Configuration:")
print("  Benchmark: TPC-H")
print("  Scale Factor: 0.01 (~10MB)")
print("  Test Type: Power (sequential)")
print(f"  Target: {UC_CATALOG}.{UC_SCHEMA}")

### Run Complete Benchmark

In [None]:
# Run the complete benchmark lifecycle
print("🚀 Starting TPC-H benchmark...\n")

try:
    results = run_benchmark_lifecycle(
        benchmark_config=bench_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(
            generate=True,  # Generate TPC-H data
            load=True,  # Load into Delta tables
            execute=True,  # Execute queries
        ),
    )

    print("\n✅ Benchmark completed successfully!")

except Exception as e:
    print(f"\n❌ Benchmark failed: {e}")
    print("\n💡 Common issues:")
    print("  • Check cluster is running and has sufficient resources")
    print("  • Verify Unity Catalog permissions")
    print("  • Ensure volume has write access")
    raise

### Display Results Summary

In [None]:
# Display key metrics
print("📊 Benchmark Results Summary\n")
print(f"Benchmark: {results.benchmark_name}")
print(f"Test Type: {results.test_execution_type}")
print(f"Scale Factor: {results.scale_factor}")
print("\n⏱️ Performance:")
print(f"  Total Time: {results.total_execution_time:.2f} seconds")
print(f"  Average Query Time: {results.average_query_time:.2f} seconds")
print(f"  Queries Executed: {results.successful_queries}/{results.total_queries}")

if results.failed_queries > 0:
    print(f"\n⚠️ Failed Queries: {results.failed_queries}")
    print("Check the detailed results for error information")

### Visualize Query Performance

In [None]:
# Create performance visualization
if results.query_results:
    # Extract query data
    query_names = [qr.query_name for qr in results.query_results]
    execution_times = [qr.execution_time for qr in results.query_results]

    # Create bar chart
    fig, ax = plt.subplots(figsize=(14, 6))
    bars = ax.bar(query_names, execution_times, color="steelblue", alpha=0.8)

    # Highlight queries that take more than 70% of the slowest query's time
    max_time = max(execution_times)
    for bar, time in zip(bars, execution_times):
        if time > max_time * 0.7:
            bar.set_color("coral")

    ax.set_xlabel("Query", fontsize=12, fontweight="bold")
    ax.set_ylabel("Execution Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Query Performance on Databricks", fontsize=14, fontweight="bold")
    ax.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    print("✅ Visualization complete")
else:
    print("⚠️ No query results available for visualization")

## 3. Advanced Examples

Now let's explore more advanced benchmarking scenarios:
- Multiple benchmarks (TPC-DS, ClickBench)
- Different scale factors
- Query subsets for fast iteration
- Performance tuning
- Result comparison

### Scenario 1: TPC-DS Comparison

TPC-DS is more complex than TPC-H: its 99 queries exercise more advanced SQL features.

In [None]:
# Run TPC-DS at small scale
print("🚀 Running TPC-DS benchmark...\n")

tpcds_cfg = BenchmarkConfig(
    name="tpcds",
    display_name="TPC-DS",
    scale_factor=0.01,  # Start small for TPC-DS
    test_execution_type="power",
)

try:
    tpcds_results = run_benchmark_lifecycle(
        benchmark_config=tpcds_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(generate=True, load=True, execute=True),
    )

    print(f"\n✅ TPC-DS completed in {tpcds_results.total_execution_time:.2f} seconds")
    print(f"Queries: {tpcds_results.successful_queries}/{tpcds_results.total_queries}")

except Exception as e:
    print(f"❌ TPC-DS failed: {e}")
    print("💡 Note: TPC-DS is more resource-intensive than TPC-H")

### Scenario 2: Scale Factor Comparison

Compare performance across different data sizes.
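
One way to read such a comparison is to divide runtime growth by data growth. The sketch below is an illustration with made-up timings, and `scaling_ratio` is a hypothetical helper rather than a BenchBox function.

```python
def scaling_ratio(small_sf, small_time, large_sf, large_time):
    """Ratio of runtime growth to data growth between two runs.

    A value near 1.0 means runtime grew in step with data volume; a value
    below 1.0 means sub-linear growth, which is common at small scale
    factors where fixed overhead dominates.
    """
    data_growth = large_sf / small_sf
    time_growth = large_time / small_time
    return time_growth / data_growth


# Hypothetical timings: 40s at SF=0.01 and 60s at SF=0.1
ratio = scaling_ratio(0.01, 40.0, 0.1, 60.0)
print(f"Scaling ratio: {ratio:.2f}")
```

At very small scale factors the ratio mostly reflects startup and planning overhead, so conclusions about scaling are better drawn from larger runs.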

In [None]:
# Compare different scale factors
scale_factors = [0.01, 0.1]  # 10MB and 100MB
scaling_results = {}

print("📊 Scale Factor Performance Comparison\n")

for sf in scale_factors:
    print(f"Testing SF={sf} (~{int(sf * 1000)}MB)...")

    cfg = BenchmarkConfig(name="tpch", display_name=f"TPC-H SF{sf}", scale_factor=sf, test_execution_type="power")

    try:
        result = run_benchmark_lifecycle(
            benchmark_config=cfg,
            database_config=db_cfg,
            system_profile=None,
            platform_config=platform_cfg,
            phases=LifecyclePhases(generate=True, load=True, execute=True),
        )

        scaling_results[sf] = {
            "total_time": result.total_execution_time,
            "avg_time": result.average_query_time,
            "successful": result.successful_queries,
        }

        print(f"  ✅ Completed: {result.total_execution_time:.2f}s\n")

    except Exception as e:
        print(f"  ❌ Failed: {e}\n")
        scaling_results[sf] = None

# Display comparison (skip failed runs, which are stored as None)
print("\n📋 Scaling Analysis:")
df = pd.DataFrame({sf: r for sf, r in scaling_results.items() if r is not None}).T
df.index.name = "Scale Factor"
display(df)

### Scenario 3: Query Subset for Fast Iteration

Run only specific queries for rapid testing.

In [None]:
# Run specific queries only (fast iteration)
from benchbox.tpch import TPCH

print("🎯 Running Query Subset (Q1, Q6, Q12)\n")

# Create benchmark and adapter directly
tpch = TPCH(scale_factor=0.01)
adapter = DatabricksAdapter(
    server_hostname=DATABRICKS_HOST,
    http_path="/sql/1.0/warehouses/<your-warehouse-id>",  # Update this
    access_token=DATABRICKS_TOKEN,
)

# Run specific queries
try:
    subset_results = adapter.run_benchmark(
        benchmark=tpch,
        test_execution_type="power",
        query_subset=["1", "6", "12"],  # Fast queries for smoke testing
    )

    print(f"✅ Query subset completed: {subset_results.total_execution_time:.2f}s")
    for qr in subset_results.query_results:
        print(f"  Query {qr.query_name}: {qr.execution_time:.3f}s")

except Exception as e:
    print(f"❌ Query subset failed: {e}")
    print("💡 Update the http_path with your SQL Warehouse ID")

### Scenario 4: Performance Tuning Example

Compare baseline vs optimized configurations.

In [None]:
# Example: Z-ORDER optimization for Delta tables
print("🔧 Performance Tuning Example\n")

# This is conceptual - actual Z-ORDER commands would be:
sql_examples = f"""
-- After loading TPC-H tables, optimize them:
OPTIMIZE {UC_CATALOG}.{UC_SCHEMA}.customer ZORDER BY (c_custkey);
OPTIMIZE {UC_CATALOG}.{UC_SCHEMA}.orders ZORDER BY (o_orderdate, o_custkey);
OPTIMIZE {UC_CATALOG}.{UC_SCHEMA}.lineitem ZORDER BY (l_orderkey, l_shipdate);

-- Analyze table statistics:
ANALYZE TABLE {UC_CATALOG}.{UC_SCHEMA}.customer COMPUTE STATISTICS FOR ALL COLUMNS;
ANALYZE TABLE {UC_CATALOG}.{UC_SCHEMA}.orders COMPUTE STATISTICS FOR ALL COLUMNS;
"""

print("📝 Optimization SQL Commands:")
print(sql_examples)

print("\n💡 Run these commands in a SQL notebook, then re-run benchmarks to measure improvement.")
print("\nExpected improvements:")
print("  • 10-30% faster for queries with date/key filters")
print("  • Better performance for join-heavy queries")
print("  • More efficient data skipping")

### Scenario 5: Throughput Test (Parallel Execution)

Run queries concurrently to test multi-user performance.
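
A throughput run is usually summarized as queries per hour rather than total seconds. The sketch below is modeled on the TPC-H Throughput@Size definition (streams × 22 queries × 3600 / elapsed seconds, scaled by the scale factor). It is an assumption that this matches what you want to report: BenchBox may compute its own metric differently, the official metric also involves refresh streams, and the function name and timings here are hypothetical.

```python
def throughput_at_size(num_streams, elapsed_seconds, scale_factor, queries_per_stream=22):
    """Queries per hour across all streams, scaled by the scale factor."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (num_streams * queries_per_stream * 3600 / elapsed_seconds) * scale_factor


# Hypothetical run: 2 streams finishing in 120 seconds at SF=0.01
print(f"{throughput_at_size(2, 120.0, 0.01):.2f}")
```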

In [None]:
# Throughput test with concurrent streams
print("🔀 Throughput Test (Concurrent Execution)\n")

throughput_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Throughput",
    scale_factor=0.01,
    test_execution_type="throughput",  # Parallel execution
    num_streams=2,  # Run 2 concurrent streams
)

try:
    throughput_results = run_benchmark_lifecycle(
        benchmark_config=throughput_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(execute=True),  # Data already loaded
    )

    print("✅ Throughput test completed")
    print(f"Total time: {throughput_results.total_execution_time:.2f}s")
    print(f"Streams: {throughput_cfg.num_streams}")

except Exception as e:
    print(f"❌ Throughput test failed: {e}")
    print("💡 Throughput tests require more cluster resources")

### Scenario 6: Result Comparison

In [None]:
# Compare results from different runs
comparison_data = {
    "TPC-H": results.total_execution_time,
}

# Add other results if available
if "tpcds_results" in locals() and tpcds_results:
    comparison_data["TPC-DS"] = tpcds_results.total_execution_time

# Create comparison chart
fig, ax = plt.subplots(figsize=(10, 6))
benchmarks = list(comparison_data.keys())
times = list(comparison_data.values())

ax.bar(benchmarks, times, color=["steelblue", "coral"][: len(benchmarks)], alpha=0.8)
ax.set_ylabel("Total Time (seconds)", fontsize=12, fontweight="bold")
ax.set_title("Benchmark Comparison (SF=0.01)", fontsize=14, fontweight="bold")
ax.grid(axis="y", alpha=0.3)

# Add value labels
for i, (bench, time) in enumerate(zip(benchmarks, times)):
    ax.text(i, time, f"{time:.1f}s", ha="center", va="bottom", fontweight="bold")

plt.tight_layout()
plt.show()

### Scenario 7: Export Results

Save results in multiple formats for analysis and reporting.
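
When only the query-level numbers are needed, a similar export can be approximated with the standard library. This sketch does not use or reproduce `ResultExporter` or its file layout; the row fields, values, and file names are assumptions for illustration.

```python
import csv
import json
import os
import tempfile

# Hypothetical query-level rows; in this notebook they would be built from results.query_results
rows = [
    {"query": "1", "time": 1.25, "status": "success"},
    {"query": "6", "time": 0.40, "status": "success"},
]

out_dir = tempfile.mkdtemp(prefix="benchbox_export_")
json_path = os.path.join(out_dir, "results.json")
csv_path = os.path.join(out_dir, "results.csv")

# JSON keeps the full structure of each row
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

# CSV gives one line per query with a header row
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "time", "status"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {json_path} and {csv_path}")
```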

In [None]:
# Export results to JSON and CSV
from benchbox.core.results.exporter import ResultExporter

exporter = ResultExporter(results)

# Export to JSON (complete results)
json_path = "/tmp/databricks_tpch_results.json"
exporter.export_json(json_path)
print(f"✅ Exported JSON: {json_path}")

# Export to CSV (query-level results)
csv_path = "/tmp/databricks_tpch_results.csv"
exporter.export_csv(csv_path)
print(f"✅ Exported CSV: {csv_path}")

# Export to HTML (visual report)
html_path = "/tmp/databricks_tpch_results.html"
exporter.export_html(html_path)
print(f"✅ Exported HTML: {html_path}")

print("\n💾 Results exported to /tmp directory")

## 4. Platform-Specific Features

Leverage Databricks-specific optimizations and features.

### Unity Catalog Governance

All benchmark data is stored in Unity Catalog for governance and access control.

In [None]:
# Query Unity Catalog metadata
print("📋 Unity Catalog Table Inventory\n")

# List tables created by benchmarks
tables_query = f"""
SHOW TABLES IN {UC_CATALOG}.{UC_SCHEMA}
"""

print(f"Catalog: {UC_CATALOG}")
print(f"Schema: {UC_SCHEMA}\n")

# Note: This would require executing SQL against Databricks
print("💡 Run the following SQL to see your benchmark tables:")
print(tables_query)
print("\nTypical tables created:")
print("  TPC-H: customer, lineitem, nation, orders, part, partsupp, region, supplier")
print("  TPC-DS: 24 tables (store_sales, web_sales, catalog_sales, etc.)")

### Delta Lake Features

All tables are created as Delta tables with ACID transactions.

In [None]:
# Delta Lake feature examples
print("🔷 Delta Lake Features\n")

delta_features = f"""
-- Time Travel (query historical versions)
SELECT COUNT(*) FROM {UC_CATALOG}.{UC_SCHEMA}.lineitem
VERSION AS OF 1;  -- Version numbers start at 0, so this is the second commit

-- Optimize tables (compaction)
OPTIMIZE {UC_CATALOG}.{UC_SCHEMA}.lineitem;

-- Z-ORDER for better data skipping
OPTIMIZE {UC_CATALOG}.{UC_SCHEMA}.lineitem
ZORDER BY (l_shipdate, l_orderkey);

-- Vacuum old files (after 7 days retention)
VACUUM {UC_CATALOG}.{UC_SCHEMA}.lineitem RETAIN 168 HOURS;

-- Table history
DESCRIBE HISTORY {UC_CATALOG}.{UC_SCHEMA}.lineitem;
"""

print(delta_features)
print("✅ All benchmark tables are Delta tables by default")

### COPY INTO from Unity Catalog Volumes

BenchBox uses COPY INTO for efficient data loading.

In [None]:
# COPY INTO pattern used by BenchBox
print("📎 Data Loading Pattern\n")

copy_example = f"""
-- BenchBox generates data to Unity Catalog Volumes:
/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/benchbox/tpch_sf01/lineitem/

-- Then loads using COPY INTO:
COPY INTO {UC_CATALOG}.{UC_SCHEMA}.lineitem
FROM '/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/benchbox/tpch_sf01/lineitem/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
"""

print(copy_example)
print("✅ COPY INTO provides:")
print("  • Idempotent loads (safe to retry)")
print("  • Automatic file tracking")
print("  • Schema evolution support")
print("  • Better performance than INSERT")

### Spark UI Integration

Monitor query execution in the Spark UI.

In [None]:
# Spark UI tips
print("📊 Spark UI Monitoring\n")

print("After running benchmarks, check the Spark UI for:")
print("  1. SQL tab: See all executed queries")
print("  2. Jobs tab: Understand Spark job execution")
print("  3. Stages tab: Identify slow stages")
print("  4. Storage tab: Check cached data")
print("  5. Executors tab: Monitor resource usage")

print("\n🔍 Key metrics to watch:")
print("  • Data scanned vs data output (selectivity)")
print("  • Shuffle read/write (join efficiency)")
print("  • Task execution time distribution")
print("  • Data skew indicators")

### Performance Comparison: Standard vs Optimized
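
The improvement percentages in this section are each configuration's runtime reduction relative to the standard run. A short sketch of that calculation, using the same illustrative figures as the cell that follows (`improvement_pct` is a hypothetical helper name):

```python
def improvement_pct(baseline_seconds, optimized_seconds):
    """Percentage reduction in runtime relative to the baseline."""
    return (baseline_seconds - optimized_seconds) / baseline_seconds * 100


baseline = 45.2  # illustrative "Standard" runtime in seconds
for label, seconds in [("Z-ORDER", 38.1), ("Z-ORDER + Stats", 32.5), ("Z-ORDER + Stats + Cache", 28.3)]:
    print(f"{label}: {improvement_pct(baseline, seconds):.1f}% faster")
```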

In [None]:
# Conceptual comparison (would require actual runs)
print("🔥 Optimization Impact (Typical Results)\n")

optimization_data = {
    "Configuration": ["Standard", "Z-ORDER", "Z-ORDER + Stats", "Z-ORDER + Stats + Cache"],
    "Query Time (s)": [45.2, 38.1, 32.5, 28.3],
    "Improvement (%)": [0, 15.7, 28.1, 37.4],
}

df = pd.DataFrame(optimization_data)
display(df)

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(df["Configuration"], df["Query Time (s)"], color="steelblue", alpha=0.8)
bars[-1].set_color("green")  # Highlight best

ax.set_ylabel("Average Query Time (seconds)", fontsize=12, fontweight="bold")
ax.set_title("Optimization Impact on TPC-H Performance", fontsize=14, fontweight="bold")
ax.grid(axis="y", alpha=0.3)
plt.xticks(rotation=15, ha="right")
plt.tight_layout()
plt.show()

print("\n✅ Typical improvements: 15-40% with proper optimization")

## 5. Performance Analysis

Deep dive into performance metrics and optimization opportunities.

### Load and Parse Results

In [None]:
# Convert results to DataFrame for analysis
query_data = []
for qr in results.query_results:
    query_data.append(
        {
            "query": qr.query_name,
            "time": qr.execution_time,
            "rows": qr.row_count,
            "status": "success" if qr.success else "failed",
        }
    )

df_results = pd.DataFrame(query_data)
print("📊 Query Performance Data\n")
display(df_results.head(10))

### Statistical Analysis
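
Alongside the mean and percentiles, benchmark summaries often quote a geometric mean of query times, because a single slow query dominates it less than it does the arithmetic mean. A minimal standard-library sketch with made-up timings:

```python
import statistics

# Hypothetical per-query execution times in seconds
times = [0.5, 1.0, 2.0, 16.0]

arithmetic = statistics.mean(times)
geometric = statistics.geometric_mean(times)

print(f"Arithmetic mean: {arithmetic:.3f}s")
print(f"Geometric mean:  {geometric:.3f}s")
```

With only 22 queries in TPC-H, high percentiles such as P99 sit very close to the maximum, so they are best read together with the slowest-query list.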

In [None]:
# Compute detailed statistics
print("📊 Performance Statistics\n")

stats = df_results["time"].describe(percentiles=[0.5, 0.95, 0.99])
print(stats)

print("\n🔍 Key Percentiles:")
print(f"  Median (P50): {df_results['time'].median():.3f}s")
print(f"  P95: {df_results['time'].quantile(0.95):.3f}s")
print(f"  P99: {df_results['time'].quantile(0.99):.3f}s")

# Identify outliers
mean_time = df_results["time"].mean()
std_time = df_results["time"].std()
outliers = df_results[df_results["time"] > mean_time + 2 * std_time]

if not outliers.empty:
    print("\n⚠️ Performance Outliers (>2 std dev):")
    for _, row in outliers.iterrows():
        print(f"  Query {row['query']}: {row['time']:.2f}s")

### Query Performance Breakdown

In [None]:
# Categorize queries by performance
df_sorted = df_results.sort_values("time", ascending=False)

print("🐢 Top 5 Slowest Queries:\n")
display(df_sorted.head())

print("\n⚡ Top 5 Fastest Queries:\n")
display(df_sorted.tail())

### Advanced Visualizations
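
The Pareto view asks how few queries account for most of the runtime. The sketch below shows that calculation on its own, with hypothetical timings: it counts the slowest queries needed to reach a given share of total time. `queries_covering` is an illustrative helper, not part of BenchBox.

```python
def queries_covering(times, share=0.8):
    """Smallest number of slowest queries whose combined time reaches `share` of the total."""
    ordered = sorted(times, reverse=True)
    total = sum(ordered)
    if total <= 0:
        return 0
    running = 0.0
    for count, t in enumerate(ordered, start=1):
        running += t
        if running >= share * total:
            return count
    return len(ordered)


# Hypothetical timings: one dominant query and a few fast ones
print(queries_covering([60.0, 25.0, 10.0, 5.0]))  # 2
```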

In [None]:
# Create comprehensive visualization suite
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Distribution histogram
axes[0, 0].hist(df_results["time"], bins=20, color="steelblue", alpha=0.7, edgecolor="black")
axes[0, 0].axvline(df_results["time"].mean(), color="red", linestyle="--", linewidth=2, label="Mean")
axes[0, 0].axvline(df_results["time"].median(), color="green", linestyle="--", linewidth=2, label="Median")
axes[0, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
axes[0, 0].set_ylabel("Frequency", fontweight="bold")
axes[0, 0].set_title("Query Time Distribution", fontweight="bold")
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Box plot
axes[0, 1].boxplot(df_results["time"], vert=True, patch_artist=True, boxprops=dict(facecolor="lightblue", alpha=0.7))
axes[0, 1].set_ylabel("Execution Time (seconds)", fontweight="bold")
axes[0, 1].set_title("Query Time Box Plot", fontweight="bold")
axes[0, 1].grid(alpha=0.3)

# 3. Sorted bar chart
df_sorted.plot(x="query", y="time", kind="barh", ax=axes[1, 0], color="coral", alpha=0.8, legend=False)
axes[1, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
axes[1, 0].set_ylabel("Query", fontweight="bold")
axes[1, 0].set_title("Queries Ranked by Performance", fontweight="bold")
axes[1, 0].grid(axis="x", alpha=0.3)

# 4. Cumulative time
df_sorted["cumulative_pct"] = df_sorted["time"].cumsum() / df_sorted["time"].sum() * 100
axes[1, 1].plot(
    range(len(df_sorted)), df_sorted["cumulative_pct"], marker="o", linewidth=2, markersize=6, color="purple"
)
axes[1, 1].axhline(80, color="red", linestyle="--", alpha=0.5, label="80% threshold")
axes[1, 1].fill_between(range(len(df_sorted)), df_sorted["cumulative_pct"], alpha=0.3, color="purple")
axes[1, 1].set_xlabel("Number of Queries", fontweight="bold")
axes[1, 1].set_ylabel("Cumulative Time (%)", fontweight="bold")
axes[1, 1].set_title("Cumulative Performance (Pareto Analysis)", fontweight="bold")
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Find 80/20 point: the slowest queries whose cumulative share stays within 80%
pareto_80 = df_sorted[df_sorted["cumulative_pct"] <= 80]
print(
    f"\n🎯 Pareto Principle: {len(pareto_80)} queries ({len(pareto_80) / len(df_sorted) * 100:.1f}%) account for up to 80% of total time"
)

### Regression Detection
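
Comparing averages can hide a single query that got much slower. A per-query comparison against a stored baseline is sketched below; the timings are made up and `find_regressions` is a hypothetical helper. The 10% threshold matches the one used in the cell that follows.

```python
def find_regressions(baseline, current, threshold_pct=10.0):
    """Return {query: pct_change} for queries slower than baseline by more than threshold_pct."""
    regressions = {}
    for query, base_time in baseline.items():
        if query not in current or base_time <= 0:
            continue  # skip queries without a usable baseline or current timing
        change = (current[query] - base_time) / base_time * 100
        if change > threshold_pct:
            regressions[query] = round(change, 1)
    return regressions


# Hypothetical timings in seconds, keyed by query name
baseline = {"1": 2.0, "6": 0.5, "12": 1.0}
current = {"1": 2.1, "6": 0.8, "12": 0.9}
print(find_regressions(baseline, current))
```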

In [None]:
# Compare against baseline (if available)
print("🔍 Regression Analysis\n")

# Simulated baseline for demonstration
baseline_avg = 2.1  # seconds
current_avg = df_results["time"].mean()

change_pct = ((current_avg - baseline_avg) / baseline_avg) * 100

if abs(change_pct) > 10:
    status = "❌ REGRESSION" if change_pct > 0 else "✅ IMPROVEMENT"
    print(f"{status} DETECTED")
else:
    print("✅ Performance stable")

print(f"\nBaseline: {baseline_avg:.2f}s")
print(f"Current: {current_avg:.2f}s")
print(f"Change: {change_pct:+.1f}%")

if change_pct > 10:
    print("\n💡 Investigation steps:")
    print("  1. Check cluster configuration changes")
    print("  2. Review recent Unity Catalog updates")
    print("  3. Verify data volume hasn't changed")
    print("  4. Check for table VACUUM/OPTIMIZE status")

### Optimization Recommendations

In [None]:
# Generate optimization recommendations
print("💡 Performance Optimization Recommendations\n")

recommendations = []

# Check for slow queries
slow_queries = df_results[df_results["time"] > df_results["time"].quantile(0.9)]
if not slow_queries.empty:
    recommendations.append(
        f"⚡ {len(slow_queries)} slow queries detected (>P90). Consider Z-ORDER optimization for these tables."
    )

# Check variance
cv = df_results["time"].std() / df_results["time"].mean()
if cv > 0.5:
    recommendations.append(
        f"📈 High performance variance detected (CV={cv:.2f}). Review query plans for inconsistent performance."
    )

# Check failed queries
failed = df_results[df_results["status"] == "failed"]
if not failed.empty:
    recommendations.append(f"❌ {len(failed)} failed queries. Review error logs and fix issues.")

# General recommendations
recommendations.extend(
    [
        "🔷 Run OPTIMIZE command on all Delta tables",
        "📊 Run ANALYZE TABLE for better query planning",
        "💾 Consider result caching for repeated queries",
        "🚀 Enable Photon acceleration for faster performance",
    ]
)

for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}")

print("\n✅ Run these optimizations and re-test to measure improvement")

## 6. Troubleshooting

Common issues and solutions for running benchmarks on Databricks.

### Common Issues and Solutions

In [None]:
# Common troubleshooting scenarios
print("🔧 Common Issues and Solutions\n")

issues = [
    {
        "issue": "Authentication Failed",
        "symptoms": ["401 Unauthorized", "Invalid token", "Access denied"],
        "solutions": [
            "Verify token has not expired",
            "Check token has workspace access",
            "Ensure DATABRICKS_HOST includes https://",
            "Try regenerating the Personal Access Token",
        ],
    },
    {
        "issue": "Unity Catalog Volume Not Found",
        "symptoms": ["Volume does not exist", "Path not found", "VOLUME_NOT_FOUND"],
        "solutions": [
            "Create catalog: CREATE CATALOG IF NOT EXISTS main;",
            "Create schema: CREATE SCHEMA IF NOT EXISTS main.benchbox;",
            "Create volume: CREATE VOLUME IF NOT EXISTS main.benchbox.data;",
            "Verify permissions on catalog/schema/volume",
        ],
    },
    {
        "issue": "Out of Memory",
        "symptoms": ["OOM error", "Executor lost", "GC overhead limit"],
        "solutions": [
            "Reduce scale factor (try 0.001 or 0.01)",
            "Use larger cluster",
            "Enable auto-scaling",
            "Run fewer concurrent queries",
        ],
    },
    {
        "issue": "Slow Performance",
        "symptoms": ["Queries taking 10x longer", "Timeout errors"],
        "solutions": [
            "Run OPTIMIZE on all tables",
            "Add Z-ORDER by commonly filtered columns",
            "Run ANALYZE TABLE for statistics",
            "Check cluster is not shared/overloaded",
            "Enable Photon acceleration",
        ],
    },
]

for i, item in enumerate(issues, 1):
    print(f"{i}. ❌ {item['issue']}")
    print(f"   Symptoms: {', '.join(item['symptoms'])}")
    print("   Solutions:")
    for sol in item["solutions"]:
        print(f"     • {sol}")
    print()

### Connection Troubleshooting

In [None]:
# Connection diagnostic
def diagnose_connection():
    """
    Diagnose Databricks connection issues
    """
    print("🔍 Connection Diagnostic\n")

    # Check 1: Environment variables
    print("1. Checking environment variables...")
    if DATABRICKS_HOST and DATABRICKS_TOKEN:
        print("   ✅ DATABRICKS_HOST and DATABRICKS_TOKEN are set")
        print(f"   Host: {DATABRICKS_HOST}")
    else:
        print("   ❌ Missing environment variables")
        return False

    # Check 2: Host format
    print("\n2. Validating host format...")
    if DATABRICKS_HOST.startswith("https://"):
        print("   ✅ Host includes https://")
    else:
        print("   ⚠️ Host should start with https://")

    # Check 3: Token format
    print("\n3. Checking token format...")
    if len(DATABRICKS_TOKEN) > 20:
        print("   ✅ Token appears valid (length check)")
    else:
        print("   ❌ Token seems too short")
        return False

    # Check 4: Unity Catalog config
    print("\n4. Unity Catalog configuration...")
    print(f"   Catalog: {UC_CATALOG}")
    print(f"   Schema: {UC_SCHEMA}")
    print(f"   Volume: {UC_VOLUME}")
    print("   ✅ Configuration looks good")

    print("\n✅ All checks passed")
    return True


# Run diagnostic
diagnose_connection()

### Permission Validation
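
If `ALL PRIVILEGES` is broader than you want to hand out, narrower grants can be generated per object. The sketch below only builds SQL strings. The privilege names reflect my understanding of Unity Catalog (volumes use `READ VOLUME` and `WRITE VOLUME`), so check them against the current Databricks privilege reference before running anything. The function name and the principal are placeholders.

```python
def grant_statements(catalog, schema, volume, principal):
    """Build GRANT statements covering catalog, schema, and volume access."""
    target = f"`{principal}`"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {target};",
        f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA {catalog}.{schema} TO {target};",
        f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {catalog}.{schema}.{volume} TO {target};",
    ]


# "benchmark-runners" is a placeholder group name
for stmt in grant_statements("main", "benchbox", "data", "benchmark-runners"):
    print(stmt)
```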

In [None]:
# Check permissions
print("🔐 Permission Validation\n")

print("Required permissions for benchmarking:\n")

permissions = [
    ("Catalog", UC_CATALOG, ["USE CATALOG", "CREATE SCHEMA"]),
    ("Schema", f"{UC_CATALOG}.{UC_SCHEMA}", ["USE SCHEMA", "CREATE TABLE", "CREATE VOLUME"]),
    # Volumes use READ VOLUME / WRITE VOLUME; READ FILES / WRITE FILES apply to external locations
    ("Volume", f"{UC_CATALOG}.{UC_SCHEMA}.{UC_VOLUME}", ["READ VOLUME", "WRITE VOLUME"]),
]

for resource_type, resource_name, perms in permissions:
    print(f"{resource_type}: {resource_name}")
    for perm in perms:
        print(f"  • {perm}")
    print()

print("💡 To grant permissions, run in SQL editor:")
print(f"GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {UC_CATALOG} TO `<principal>`;")
print(f"GRANT ALL PRIVILEGES ON SCHEMA {UC_CATALOG}.{UC_SCHEMA} TO `<principal>`;")

### Diagnostic Utilities

In [None]:
# Comprehensive diagnostic
import sys


def run_full_diagnostic():
    """
    Run complete diagnostic check
    """
    print("🛠️ Running Full Diagnostic...\n")
    print("=" * 60)

    # System info
    print("\n💻 System Information:")
    print(f"  Python version: {sys.version.split()[0]}")
    print(f"  BenchBox version: {benchbox.__version__}")

    # Databricks config
    print("\n☁️ Databricks Configuration:")
    print(f"  Host: {DATABRICKS_HOST}")
    print(f"  Token: {'*' * 20} (hidden)")
    print(f"  Unity Catalog: {UC_CATALOG}.{UC_SCHEMA}")
    print(f"  Volume: {UC_VOLUME}")

    # Test connection
    print("\n🔌 Connection Test:")
    try:
        # Try to list volume
        volume_path = f"/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}"
        dbutils.fs.ls(volume_path)
        print(f"  ✅ Successfully connected to {volume_path}")
    except Exception as e:
        print(f"  ❌ Connection failed: {e}")

    # Memory check
    print("\n💾 Resource Check:")
    print(f"  Cluster: {sc.getConf().get('spark.databricks.clusterUsageTags.clusterName', 'Unknown')}")
    print(f"  Driver Memory: {sc.getConf().get('spark.driver.memory', 'Unknown')}")
    print(f"  Executor Memory: {sc.getConf().get('spark.executor.memory', 'Unknown')}")

    # Separator: a newline followed by a 60-character rule
    print("\n" + "=" * 60)
    print("✅ Diagnostic complete")


# Run it
run_full_diagnostic()

## Summary

You've successfully completed the BenchBox Databricks benchmarking guide!

### What You Learned

1. ✅ **Installation & Setup**: Configured BenchBox with Unity Catalog
2. ✅ **Quick Start**: Ran TPC-H benchmark and visualized results
3. ✅ **Advanced Examples**: Multiple benchmarks, scale factors, and optimizations
4. ✅ **Platform Features**: Delta Lake, Unity Catalog, and Databricks optimizations
5. ✅ **Performance Analysis**: Statistical analysis and visualization
6. ✅ **Troubleshooting**: Diagnostic tools and common issue resolution

### Next Steps

- **Scale Up**: Try larger scale factors (1.0, 10.0) for production testing
- **Optimize**: Apply Z-ORDER and OPTIMIZE to improve performance
- **Compare**: Run benchmarks on different cluster sizes
- **Monitor**: Set up continuous performance monitoring
- **Production**: Integrate BenchBox into your CI/CD pipeline

### Resources

- [BenchBox Documentation](https://github.com/joeharris76/benchbox)
- [Databricks Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/index.html)
- [Delta Lake Optimization](https://docs.databricks.com/delta/optimize.html)
- [TPC-H Specification](http://www.tpc.org/tpch/)

**Happy Benchmarking!** 🚀