# BenchBox Snowflake Benchmarking

This notebook demonstrates how to use BenchBox to benchmark the **Snowflake** data warehouse.

**What You'll Learn:**
- Authenticate with Snowflake using multiple methods (password, key-pair, SSO)
- Run TPC-H, TPC-DS, and ClickBench benchmarks
- Compare warehouse sizes (XS, S, M, L, XL)
- Use clustering keys and micro-partitions for optimization
- Monitor credit consumption and query performance
- Leverage Time Travel, Zero-Copy Cloning, and result caching
- Troubleshoot common Snowflake issues

**Prerequisites:**
- Snowflake account
- User with appropriate privileges (SYSADMIN or ACCOUNTADMIN)
- Virtual warehouse (COMPUTE_WH or custom)
- Database for benchmark data

## 1. Installation & Setup

### 1.1 Install Required Libraries

Install BenchBox and Snowflake connector:

In [None]:
!pip install benchbox snowflake-connector-python

### 1.2 Import Libraries

Import BenchBox components and visualization libraries:

In [None]:
import os

# Visualization imports
import matplotlib.pyplot as plt
import pandas as pd

# Snowflake imports
import snowflake.connector

# BenchBox imports
from benchbox.core.config import BenchmarkConfig, DatabaseConfig
from benchbox.core.runner import LifecyclePhases, run_benchmark_lifecycle

print("✅ All imports successful")

### 1.3 Authentication

Snowflake supports multiple authentication methods:

**Method 1: Username/Password** - Simple for development
```bash
export SNOWFLAKE_ACCOUNT='xy12345'
export SNOWFLAKE_USER='username'
export SNOWFLAKE_PASSWORD='password'
```

**Method 2: Key-Pair Authentication** - Recommended for production
```bash
# Generate key pair
openssl genrsa -out rsa_key.pem 2048
openssl rsa -in rsa_key.pem -pubout -out rsa_key.pub

# Configure in Snowflake (run this as SQL in a worksheet, not in the shell):
#   ALTER USER username SET RSA_PUBLIC_KEY='MII...';

export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/rsa_key.pem'
```

**Method 3: SSO/SAML** - Enterprise authentication
```python
authenticator='externalbrowser'  # Opens browser for SSO
```
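
The three methods above map onto different keyword arguments of `snowflake.connector.connect`. The sketch below picks one from a dictionary of environment variables. `build_auth_kwargs` is a hypothetical helper (not part of BenchBox), `SNOWFLAKE_AUTHENTICATOR` is an assumed variable name, and the `private_key_file` parameter is assumed to be available in your connector version (older releases expect the decoded key bytes via `private_key` instead).

```python
def build_auth_kwargs(env):
    """Return connection keyword arguments for the first configured auth method.

    Hypothetical helper: checks key-pair first, then SSO, then password.
    """
    kwargs = {"account": env.get("SNOWFLAKE_ACCOUNT"), "user": env.get("SNOWFLAKE_USER")}
    if not kwargs["account"] or not kwargs["user"]:
        raise ValueError("SNOWFLAKE_ACCOUNT and SNOWFLAKE_USER are required")

    if env.get("SNOWFLAKE_PRIVATE_KEY_PATH"):
        # Assumption: the connector accepts a key file path via `private_key_file`
        kwargs["private_key_file"] = env["SNOWFLAKE_PRIVATE_KEY_PATH"]
    elif env.get("SNOWFLAKE_AUTHENTICATOR") == "externalbrowser":
        kwargs["authenticator"] = "externalbrowser"  # opens a browser for SSO
    elif env.get("SNOWFLAKE_PASSWORD"):
        kwargs["password"] = env["SNOWFLAKE_PASSWORD"]
    else:
        raise ValueError("No authentication method configured")
    return kwargs
```

A call such as `snowflake.connector.connect(**build_auth_kwargs(os.environ), warehouse=...)` would then work for any of the three methods without branching at the call site.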

This notebook reads credentials from environment variables and requires either a password or a private key path:

In [None]:
# Configure Snowflake connection
try:
    # Try environment variables first
    SNOWFLAKE_ACCOUNT = os.environ.get("SNOWFLAKE_ACCOUNT")
    SNOWFLAKE_USER = os.environ.get("SNOWFLAKE_USER")
    SNOWFLAKE_PASSWORD = os.environ.get("SNOWFLAKE_PASSWORD")
    SNOWFLAKE_WAREHOUSE = os.environ.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH")
    SNOWFLAKE_DATABASE = os.environ.get("SNOWFLAKE_DATABASE", "BENCHBOX")
    SNOWFLAKE_SCHEMA = os.environ.get("SNOWFLAKE_SCHEMA", "PUBLIC")

    if not SNOWFLAKE_ACCOUNT or not SNOWFLAKE_USER:
        print("⚠️  Required environment variables not set")
        print("\n💡 Set up authentication:")
        print("  Option 1 (Password):")
        print("    export SNOWFLAKE_ACCOUNT='xy12345'")
        print("    export SNOWFLAKE_USER='username'")
        print("    export SNOWFLAKE_PASSWORD='password'")
        print("  Option 2 (Key-Pair):")
        print("    export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/rsa_key.pem'")
        print("  Option 3 (SSO):")
        print("    Set authenticator='externalbrowser' in platform_cfg")
        raise ValueError("Missing required environment variables")

    if not SNOWFLAKE_PASSWORD and not os.environ.get("SNOWFLAKE_PRIVATE_KEY_PATH"):
        print("⚠️  No password or private key configured")
        print("💡 Set SNOWFLAKE_PASSWORD or SNOWFLAKE_PRIVATE_KEY_PATH")
        raise ValueError("No authentication method configured")

    print(f"✅ Using account: {SNOWFLAKE_ACCOUNT}")
    print(f"✅ User: {SNOWFLAKE_USER}")
    print(f"✅ Warehouse: {SNOWFLAKE_WAREHOUSE}")
    print(f"✅ Database: {SNOWFLAKE_DATABASE}")
    print(f"✅ Schema: {SNOWFLAKE_SCHEMA}")

except Exception as e:
    print(f"❌ Authentication error: {e}")
    raise

### 1.4 Test Connection

Verify connectivity and permissions:

In [None]:
try:
    # Initialize Snowflake connection
    conn = snowflake.connector.connect(
        user=SNOWFLAKE_USER,
        password=SNOWFLAKE_PASSWORD,
        account=SNOWFLAKE_ACCOUNT,
        warehouse=SNOWFLAKE_WAREHOUSE,
        database=SNOWFLAKE_DATABASE,
        schema=SNOWFLAKE_SCHEMA,
    )

    cursor = conn.cursor()

    # Test 1: Check connection
    print("1️⃣ Testing connection...")
    cursor.execute("SELECT CURRENT_VERSION()")
    version = cursor.fetchone()[0]
    print(f"   ✅ Connected to Snowflake version: {version}")

    # Test 2: Check warehouse status
    print("\n2️⃣ Checking warehouse status...")
    cursor.execute(f"SHOW WAREHOUSES LIKE '{SNOWFLAKE_WAREHOUSE}'")
    warehouse_info = cursor.fetchone()
    if warehouse_info:
        print(f"   ✅ Warehouse: {warehouse_info[0]}")
        print(f"   Size: {warehouse_info[3]}")
        print(f"   State: {warehouse_info[1]}")
    else:
        print(f"   ⚠️  Warehouse '{SNOWFLAKE_WAREHOUSE}' not found")

    # Test 3: Check database exists
    print("\n3️⃣ Checking database...")
    cursor.execute(f"SHOW DATABASES LIKE '{SNOWFLAKE_DATABASE}'")
    db_exists = cursor.fetchone()
    if db_exists:
        print(f"   ✅ Database exists: {SNOWFLAKE_DATABASE}")
    else:
        print(f"   ⚠️  Database '{SNOWFLAKE_DATABASE}' does not exist")
        print("   💡 Creating database...")
        cursor.execute(f"CREATE DATABASE IF NOT EXISTS {SNOWFLAKE_DATABASE}")
        print(f"   ✅ Created database: {SNOWFLAKE_DATABASE}")

    # Test 4: Run simple query
    print("\n4️⃣ Testing query execution...")
    cursor.execute("SELECT 1 as test")
    result = cursor.fetchone()
    print("   ✅ Query executed successfully")

    # Test 5: Check current role
    print("\n5️⃣ Checking user role...")
    cursor.execute("SELECT CURRENT_ROLE()")
    role = cursor.fetchone()[0]
    print(f"   ✅ Current role: {role}")

    cursor.close()
    conn.close()

    print("\n✅ All connection tests passed!")

except Exception as e:
    print(f"❌ Connection test failed: {e}")
    print("\n💡 Troubleshooting:")
    print("  1. Verify account identifier (format: xy12345.region.cloud)")
    print("  2. Check username and password")
    print("  3. Verify warehouse is running")
    print("  4. Check network policies and IP allowlists")
    raise
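
The connection test interpolates the warehouse and database names directly into SQL strings. Since those names come from environment variables, it may be worth validating them first. A minimal sketch, assuming only unquoted Snowflake identifiers are used (letters, digits, underscores, and `$`, starting with a letter or underscore). `safe_identifier` is a hypothetical helper, not a BenchBox or connector function.

```python
import re

# Unquoted Snowflake identifiers: start with a letter or underscore,
# followed by letters, digits, underscores, or dollar signs.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def safe_identifier(name):
    """Return the name unchanged if it is a plain unquoted identifier, else raise."""
    if not isinstance(name, str) or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name
```

For example, `cursor.execute(f"CREATE DATABASE IF NOT EXISTS {safe_identifier(SNOWFLAKE_DATABASE)}")` would refuse a value containing spaces, quotes, or semicolons.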

### 1.5 Warehouse Sizing Guide

Understanding Snowflake warehouse sizes and pricing:

In [None]:
print("📊 Snowflake Warehouse Sizing Guide\n")
print("Size    Credits/Hour  Servers  Use Case")
print("=" * 60)
print("X-Small      1         1      Development, small datasets")
print("Small        2         2      Small workloads, testing")
print("Medium       4         4      Medium workloads, production")
print("Large        8         8      Large workloads, analytics")
print("X-Large     16        16      Very large datasets")
print("2X-Large    32        32      Massive concurrent queries")
print("3X-Large    64        64      Extreme workloads")
print("4X-Large   128       128      Maximum performance\n")

print("💰 Pricing (On-Demand):")
print("  - Standard Edition: $2-4 per credit")
print("  - Enterprise Edition: $3-5 per credit")
print("  - Business Critical: $4-6 per credit\n")

print("💡 Sizing Guidelines:")
print("  - Start with X-Small for benchmarking (SF 0.01-0.1)")
print("  - Use Small-Medium for production workloads (SF 1-10)")
print("  - Scale up for larger datasets or faster execution")
print("  - Use multi-cluster warehouses for concurrency\n")

print("⚡ Auto-Scaling:")
print("  - Auto-suspend: Suspend after N minutes idle")
print("  - Auto-resume: Resume automatically on query")
print("  - Multi-cluster: Scale out for concurrency (1-10 clusters)")

### 1.6 Configuration Overview

Summary of your Snowflake configuration:

In [None]:
print("📊 Snowflake Configuration Summary\n")
print(f"Account:   {SNOWFLAKE_ACCOUNT}")
print(f"User:      {SNOWFLAKE_USER}")
print(f"Warehouse: {SNOWFLAKE_WAREHOUSE}")
print(f"Database:  {SNOWFLAKE_DATABASE}")
print(f"Schema:    {SNOWFLAKE_SCHEMA}")
print("\n💡 Tip: Use separate warehouses for ETL and analytics workloads")

## 2. Quick Start Example

### 2.1 Run TPC-H Power Test

Run a TPC-H power test at scale factor 0.01 (~10MB data).

**What happens:**
1. Generate TPC-H data (8 tables: customer, orders, lineitem, etc.)
2. Load data into Snowflake tables
3. Execute 22 TPC-H queries sequentially
4. Collect execution times and warehouse metrics

**Expected time:** 2-3 minutes (X-Small warehouse)

In [None]:
# Configure database connection
db_cfg = DatabaseConfig(type="snowflake", name="snowflake_benchbox")

# Snowflake platform configuration
platform_cfg = {
    "account": SNOWFLAKE_ACCOUNT,
    "user": SNOWFLAKE_USER,
    "password": SNOWFLAKE_PASSWORD,
    "warehouse": SNOWFLAKE_WAREHOUSE,
    "database": SNOWFLAKE_DATABASE,
    "schema": SNOWFLAKE_SCHEMA,
    # Optional: Use key-pair authentication
    # "private_key_path": "/path/to/rsa_key.pem"
}

# Configure TPC-H benchmark
bench_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H",
    scale_factor=0.01,  # 10MB dataset
    test_execution_type="power",
)

# Run full lifecycle: generate → load → execute
print("🚀 Starting TPC-H power test on Snowflake...\n")

results = run_benchmark_lifecycle(
    benchmark_config=bench_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(
        generate=True,  # Generate TPC-H data
        load=True,  # Load into Snowflake
        execute=True,  # Execute 22 queries
    ),
)

print("\n✅ Power test completed on Snowflake")
print(f"Total queries executed: {len(results.query_results)}")
print(f"Results saved to: {results.results_dir}")

### 2.2 Visualize Results

Create a bar chart of query execution times:

In [None]:
if results.query_results:
    # Extract query names and execution times
    query_names = [qr.query_name for qr in results.query_results]
    execution_times = [qr.execution_time for qr in results.query_results]

    # Create bar chart
    fig, ax = plt.subplots(figsize=(14, 6))
    bars = ax.bar(query_names, execution_times, color="#29B5E8", alpha=0.8)  # Snowflake Blue

    # Highlight slow queries (those taking more than 70% of the slowest query's time)
    max_time = max(execution_times)
    for bar, exec_time in zip(bars, execution_times):
        if exec_time > max_time * 0.7:
            bar.set_color("#F26B1D")  # Snowflake Orange for slow queries

    ax.set_xlabel("Query", fontsize=12, fontweight="bold")
    ax.set_ylabel("Execution Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Query Performance on Snowflake", fontsize=14, fontweight="bold")
    ax.tick_params(axis="x", rotation=45)
    ax.grid(axis="y", alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print("\n📊 Performance Summary:")
    print(f"  Total time: {sum(execution_times):.2f}s")
    print(f"  Average: {sum(execution_times) / len(execution_times):.2f}s")
    print(f"  Fastest: {min(execution_times):.2f}s ({query_names[execution_times.index(min(execution_times))]})")
    print(f"  Slowest: {max(execution_times):.2f}s ({query_names[execution_times.index(max(execution_times))]})")
else:
    print("⚠️ No query results to visualize")

### 2.3 Monitor Warehouse Usage

Query ACCOUNT_USAGE.QUERY_HISTORY to analyze resource consumption:

In [None]:
try:
    # Connect to Snowflake
    conn = snowflake.connector.connect(
        user=SNOWFLAKE_USER,
        password=SNOWFLAKE_PASSWORD,
        account=SNOWFLAKE_ACCOUNT,
        warehouse=SNOWFLAKE_WAREHOUSE,
        database=SNOWFLAKE_DATABASE,
        schema=SNOWFLAKE_SCHEMA,
    )

    # Query recent query history
    query = f"""
    SELECT 
        query_id,
        query_text,
        warehouse_name,
        warehouse_size,
        execution_time / 1000.0 as execution_seconds,
        credits_used_cloud_services,
        bytes_scanned,
        rows_produced
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE start_time > DATEADD(hour, -1, CURRENT_TIMESTAMP())
        AND warehouse_name = '{SNOWFLAKE_WAREHOUSE}'
        AND query_type = 'SELECT'
    ORDER BY start_time DESC
    LIMIT 50
    """

    df_history = pd.read_sql(query, conn)

    if not df_history.empty:
        print("📊 Recent Query History:\n")
        print(f"Total queries: {len(df_history)}")
        print(f"Total execution time: {df_history['EXECUTION_SECONDS'].sum():.2f}s")
        print(f"Cloud services credits used: {df_history['CREDITS_USED_CLOUD_SERVICES'].sum():.6f}")
        print(f"Total bytes scanned: {df_history['BYTES_SCANNED'].sum() / (1024**3):.2f} GB")
        print(f"Total rows produced: {df_history['ROWS_PRODUCED'].sum():,}")

        print("\n💡 Note: ACCOUNT_USAGE views have latency (up to 45 minutes)")
        print("   For near-real-time data, use the INFORMATION_SCHEMA.QUERY_HISTORY table function instead")
    else:
        print("⚠️ No recent query history found")

    conn.close()

except Exception as e:
    print(f"⚠️ Could not query ACCOUNT_USAGE: {e}")
    print("This requires ACCOUNTADMIN or IMPORTED PRIVILEGES on the SNOWFLAKE database")
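
Because of the `ACCOUNT_USAGE` latency, queries from a run that just finished may not appear yet. A sketch of the lower-latency alternative, assuming the `INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE` table function and the connector's default `%s` parameter binding. `recent_history_query` is a hypothetical helper and the column list is an illustrative subset.

```python
def recent_history_query(warehouse, limit=50):
    """Build (sql, params) for the most recent queries on one warehouse."""
    limit = int(limit)
    if not 1 <= limit <= 10000:
        raise ValueError("limit must be between 1 and 10000")
    sql = (
        "SELECT query_id, query_text, "
        "total_elapsed_time / 1000.0 AS elapsed_seconds, "
        "bytes_scanned, rows_produced "
        "FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE("
        "WAREHOUSE_NAME => %s, "  # bound by the connector, not interpolated here
        f"RESULT_LIMIT => {limit})) "
        "ORDER BY start_time DESC"
    )
    return sql, (warehouse,)
```

Usage would look like `sql, params = recent_history_query(SNOWFLAKE_WAREHOUSE)` followed by `cursor.execute(sql, params)`. The table function only returns queries the current role is allowed to see.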

### 2.4 Results Overview

View comprehensive results summary:

In [None]:
print("📋 Benchmark Results Summary\n")
print(f"Benchmark: {results.benchmark_name}")
print(f"Platform: {results.database_config.type}")
print(f"Scale Factor: {results.benchmark_config.scale_factor}")
print(f"Test Type: {results.benchmark_config.test_execution_type}")
print("\nExecution:")
print(f"  Start: {results.execution_metadata.start_time}")
print(f"  End: {results.execution_metadata.end_time}")
print(f"  Duration: {results.execution_metadata.total_duration:.2f}s")
print("\nQueries:")
print(f"  Total: {len(results.query_results)}")
print(f"  Successful: {sum(1 for qr in results.query_results if qr.success)}")
print(f"  Failed: {sum(1 for qr in results.query_results if not qr.success)}")
print("\nResults Location:")
print(f"  {results.results_dir}")

## 3. Advanced Examples

### 3.1 TPC-DS Benchmark

Run TPC-DS (99 queries, more complex than TPC-H):

In [None]:
# TPC-DS configuration
tpcds_cfg = BenchmarkConfig(
    name="tpcds",
    display_name="TPC-DS",
    scale_factor=0.01,  # 10MB dataset
    test_execution_type="power",
)

print("🚀 Running TPC-DS benchmark...")
print("⚠️  Note: TPC-DS has 99 queries and will take longer\n")

tpcds_results = run_benchmark_lifecycle(
    benchmark_config=tpcds_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print(f"✅ TPC-DS completed: {len(tpcds_results.query_results)} queries")

### 3.2 Warehouse Size Comparison

Compare performance across different warehouse sizes:

In [None]:
# Test multiple warehouse sizes
warehouse_sizes = ["X-Small", "Small", "Medium"]
size_results = {}

for size in warehouse_sizes:
    print(f"\n📊 Testing warehouse size: {size}...")

    # Create/resize warehouse
    test_warehouse = f"BENCHBOX_{size.upper().replace('-', '_')}"

    size_cfg = {
        "account": SNOWFLAKE_ACCOUNT,
        "user": SNOWFLAKE_USER,
        "password": SNOWFLAKE_PASSWORD,
        "warehouse": test_warehouse,
        "warehouse_size": size,  # Specify size
        "database": SNOWFLAKE_DATABASE,
        "schema": SNOWFLAKE_SCHEMA,
    }

    size_bench_cfg = BenchmarkConfig(
        name="tpch",
        display_name=f"TPC-H {size}",
        scale_factor=0.1,  # Larger dataset to see differences
        test_execution_type="power",
    )

    size_res = run_benchmark_lifecycle(
        benchmark_config=size_bench_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=size_cfg,
        phases=LifecyclePhases(generate=False, load=False, execute=True),  # Reuse data (assumes data at this scale factor is already loaded)
    )

    size_results[size] = size_res
    avg_time = sum(qr.execution_time for qr in size_res.query_results) / len(size_res.query_results)
    print(f"  Average query time: {avg_time:.2f}s")

print("\n✅ Warehouse size comparison complete")
print("\n💡 Findings:")
print("  - Larger warehouses execute faster but cost more credits")
print("  - Use right-sized warehouses for your workload")
print("  - Consider multi-cluster for concurrency, not speed")
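
To judge whether a larger warehouse is worth its price, compare the measured speedup with the increase in credit rate. A minimal sketch using the credits-per-hour figures from the sizing guide. `size_tradeoff` is a hypothetical helper, and it ignores the 60-second billing minimum and any idle time before auto-suspend.

```python
# Credits per hour by warehouse size (each step up doubles the rate)
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16}


def size_tradeoff(total_seconds_by_size, baseline="X-Small"):
    """Return {size: (speedup, credits_used, relative_cost)} versus the baseline size."""
    base_seconds = total_seconds_by_size[baseline]
    base_credits = CREDITS_PER_HOUR[baseline] * base_seconds / 3600
    report = {}
    for size, seconds in total_seconds_by_size.items():
        credits = CREDITS_PER_HOUR[size] * seconds / 3600
        report[size] = (base_seconds / seconds, credits, credits / base_credits)
    return report
```

With the loop above, the input could be built as `{size: sum(qr.execution_time for qr in res.query_results) for size, res in size_results.items()}`. A relative cost below 1.0 means the larger warehouse finished fast enough to be cheaper overall; above 1.0 means the speedup did not keep up with the doubled rate.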

### 3.3 Clustering Keys

Use clustering keys to improve query performance:

In [None]:
print("🚀 Testing clustering keys...\n")

# Configure platform with clustering
clustering_cfg = {
    "account": SNOWFLAKE_ACCOUNT,
    "user": SNOWFLAKE_USER,
    "password": SNOWFLAKE_PASSWORD,
    "warehouse": SNOWFLAKE_WAREHOUSE,
    "database": SNOWFLAKE_DATABASE,
    "schema": SNOWFLAKE_SCHEMA,
    "table_options": {
        "lineitem": {"cluster_by": "(l_shipdate, l_returnflag)"},
        "orders": {"cluster_by": "(o_orderdate)"},
    },
}

print("Clustering configuration:")
print("  - lineitem: cluster by (l_shipdate, l_returnflag)")
print("  - orders: cluster by (o_orderdate)\n")

cluster_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Clustered",
    scale_factor=1.0,  # Larger dataset to see benefits
    test_execution_type="power",
)

cluster_results = run_benchmark_lifecycle(
    benchmark_config=cluster_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=clustering_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print("✅ Clustered benchmark complete")
print("\n💡 Benefits of Clustering:")
print("  - Faster pruning (skip micro-partitions)")
print("  - Better for date range queries")
print("  - Automatic maintenance (but costs credits)")
print("  - Best for large tables (>1 TB)")
print("\n⚠️  Clustering Costs:")
print("  - Automatic reclustering consumes credits")
print("  - Monitor with AUTOMATIC_CLUSTERING_HISTORY")
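
How BenchBox applies `table_options` is not shown here. If the tables are already loaded, the same clustering can be applied by hand with `ALTER TABLE ... CLUSTER BY`. A sketch that turns a dictionary shaped like `table_options` above into those statements. `clustering_statements` is a hypothetical helper.

```python
def clustering_statements(table_options):
    """Return one ALTER TABLE ... CLUSTER BY statement per table with a cluster_by entry."""
    statements = []
    for table, options in sorted(table_options.items()):
        cluster_by = (options.get("cluster_by") or "").strip()
        if not cluster_by:
            continue  # table has no clustering configured
        # Accept "(a, b)" or "a, b" and normalise to a single parenthesised list
        if cluster_by.startswith("(") and cluster_by.endswith(")"):
            cluster_by = cluster_by[1:-1]
        statements.append(f"ALTER TABLE {table} CLUSTER BY ({cluster_by});")
    return statements
```

Each returned string can be passed to `cursor.execute` after stripping the trailing semicolon if your client requires it.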

### 3.4 Query Subset Selection

Run only specific queries for faster iteration:

In [None]:
# Run only queries 1, 6, and 14 (fast queries for CI/CD)
subset_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Subset",
    scale_factor=0.01,
    test_execution_type="power",
    query_filter=[1, 6, 14],  # Only these queries
)

print("🚀 Running query subset (1, 6, 14)...\n")

subset_results = run_benchmark_lifecycle(
    benchmark_config=subset_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),  # Reuse data
)

print(f"✅ Subset complete: {len(subset_results.query_results)} queries")
print("\n💡 Use case: Fast regression testing in CI/CD")

### 3.5 Result Caching

Leverage Snowflake's automatic result caching:

In [None]:
print("📊 Result Caching Overview\n")
print("What is Result Caching?")
print("  - Snowflake caches query results for 24 hours")
print("  - Identical queries return cached results instantly")
print("  - No compute charges for cached results")
print("  - Cache invalidated on data changes\n")

print("Requirements for Cache Hit:")
print("  1. Exact same SQL text (byte-for-byte)")
print("  2. Same role and permissions")
print("  3. Table data hasn't changed")
print("  4. Within 24-hour window\n")

print("💰 Cost Savings:")
print("  - No warehouse compute charges")
print("  - Small cloud services charge (typically $0)")
print("  - Instant response time")
print("  - Ideal for dashboards and repeated queries\n")

print("🔍 Check Cache Usage:")
print("  SELECT query_id, query_text, bytes_scanned")
print("  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY")
print("  WHERE bytes_scanned = 0  -- Suggests a cache hit (metadata-only queries also scan 0 bytes)")
print("    AND execution_time < 100;  -- Fast execution (milliseconds)")
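
For benchmarking, cached results can make repeated runs look faster than the warehouse really is. Snowflake's `USE_CACHED_RESULT` session parameter turns the result cache off for the current session. A sketch, assuming a DB-API cursor such as the one returned by `conn.cursor()`. `set_result_cache` is a hypothetical helper, and whether BenchBox already does this internally is not covered here.

```python
def set_result_cache(cursor, enabled):
    """Enable or disable the query result cache for the current session."""
    value = "TRUE" if enabled else "FALSE"
    statement = f"ALTER SESSION SET USE_CACHED_RESULT = {value}"
    cursor.execute(statement)
    return statement
```

Calling `set_result_cache(cursor, False)` before a timed run, and `set_result_cache(cursor, True)` afterwards, keeps the measurement honest without changing account-level settings.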

### 3.6 Separate Compute for Load vs Query

Use different warehouses for ETL and analytics:

In [None]:
print("🚀 Best Practice: Separate Compute for ETL and Analytics\n")

# Configure separate warehouses
load_cfg = {
    "account": SNOWFLAKE_ACCOUNT,
    "user": SNOWFLAKE_USER,
    "password": SNOWFLAKE_PASSWORD,
    "warehouse": "LOADING_WH",  # Dedicated for data loading
    "database": SNOWFLAKE_DATABASE,
    "schema": SNOWFLAKE_SCHEMA,
}

query_cfg = {
    "account": SNOWFLAKE_ACCOUNT,
    "user": SNOWFLAKE_USER,
    "password": SNOWFLAKE_PASSWORD,
    "warehouse": "ANALYTICS_WH",  # Dedicated for analytics
    "database": SNOWFLAKE_DATABASE,
    "schema": SNOWFLAKE_SCHEMA,
}

print("Warehouse Strategy:")
print("  - LOADING_WH (Medium): For data ingestion and transformation")
print("  - ANALYTICS_WH (Large): For analytics and reporting\n")

print("💡 Benefits:")
print("  - Isolate workloads (no resource contention)")
print("  - Independent scaling")
print("  - Better cost tracking (separate credit usage)")
print("  - Optimize warehouse size per workload type\n")

print("📊 Example Configuration:")
print("  ETL Warehouse:")
print("    - Size: Medium (4 servers)")
print("    - Auto-suspend: 5 minutes")
print("    - Max clusters: 1 (serial ETL)")
print("  Analytics Warehouse:")
print("    - Size: Large (8 servers)")
print("    - Auto-suspend: 10 minutes")
print("    - Max clusters: 3 (concurrent users)")
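
The example configuration can be expressed as `CREATE WAREHOUSE` statements. A sketch using standard warehouse properties (`WAREHOUSE_SIZE`, `AUTO_SUSPEND` in seconds, `AUTO_RESUME`, `MIN_CLUSTER_COUNT`, `MAX_CLUSTER_COUNT`). `create_warehouse_sql` is a hypothetical helper, and the multi-cluster settings assume an edition that supports multi-cluster warehouses.

```python
def create_warehouse_sql(name, size, auto_suspend_minutes, max_clusters=1):
    """Build a CREATE WAREHOUSE statement; Snowflake expects AUTO_SUSPEND in seconds."""
    parts = [
        f"CREATE WAREHOUSE IF NOT EXISTS {name}",
        f"WAREHOUSE_SIZE = '{size.upper()}'",
        f"AUTO_SUSPEND = {int(auto_suspend_minutes * 60)}",
        "AUTO_RESUME = TRUE",
        "INITIALLY_SUSPENDED = TRUE",  # do not consume credits until first use
    ]
    if max_clusters > 1:
        parts += ["MIN_CLUSTER_COUNT = 1", f"MAX_CLUSTER_COUNT = {int(max_clusters)}"]
    return " ".join(parts) + ";"
```

For the two warehouses above this would be `create_warehouse_sql("LOADING_WH", "Medium", 5)` and `create_warehouse_sql("ANALYTICS_WH", "Large", 10, max_clusters=3)`.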

### 3.7 Throughput Testing

Run concurrent queries to test throughput:

In [None]:
# Throughput test configuration
throughput_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Throughput",
    scale_factor=0.1,
    test_execution_type="throughput",
    throughput_streams=4,  # 4 concurrent streams
)

print("🚀 Running throughput test (4 concurrent streams)...\n")

throughput_results = run_benchmark_lifecycle(
    benchmark_config=throughput_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),
)

print(f"✅ Throughput test complete: {len(throughput_results.query_results)} queries")
print("\n💡 Snowflake Concurrency:")
print("  - Single cluster: queues after ~8-16 concurrent queries")
print("  - Multi-cluster: scales out automatically (1-10 clusters)")
print("  - Query queuing: FIFO with priority support")

### 3.8 Result Comparison

Compare results across different configurations:

In [None]:
# Load and compare results
if "results" in locals() and "tpcds_results" in locals():
    tpch_avg = sum(qr.execution_time for qr in results.query_results) / len(results.query_results)
    tpcds_avg = sum(qr.execution_time for qr in tpcds_results.query_results) / len(tpcds_results.query_results)

    # Create comparison visualization
    fig, ax = plt.subplots(figsize=(10, 6))

    benchmarks = ["TPC-H\n(22 queries)", "TPC-DS\n(99 queries)"]
    avg_times = [tpch_avg, tpcds_avg]
    total_times = [
        sum(qr.execution_time for qr in results.query_results),
        sum(qr.execution_time for qr in tpcds_results.query_results),
    ]

    x = range(len(benchmarks))
    width = 0.35

    ax.bar([i - width / 2 for i in x], avg_times, width, label="Avg Time/Query", color="#29B5E8")
    ax.bar([i + width / 2 for i in x], total_times, width, label="Total Time", color="#F26B1D")

    ax.set_ylabel("Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("Benchmark Comparison on Snowflake", fontsize=14, fontweight="bold")
    ax.set_xticks(x)
    ax.set_xticklabels(benchmarks)
    ax.legend()
    ax.grid(axis="y", alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("\n📊 Comparison Summary:")
    print(f"  TPC-H: {tpch_avg:.2f}s avg, {total_times[0]:.2f}s total")
    print(f"  TPC-DS: {tpcds_avg:.2f}s avg, {total_times[1]:.2f}s total")
else:
    print("⚠️ Run both TPC-H and TPC-DS benchmarks first")

### 3.9 Export Results

Export results in multiple formats:

In [None]:
from benchbox.core.results.exporter import ResultExporter

# Export to JSON (default)
exporter = ResultExporter(results)
json_path = exporter.export(format="json")
print(f"✅ Exported to JSON: {json_path}")

# Export to CSV
csv_path = exporter.export(format="csv")
print(f"✅ Exported to CSV: {csv_path}")

# Export to HTML
html_path = exporter.export(format="html")
print(f"✅ Exported to HTML: {html_path}")

print("\n💡 Use these exports for:")
print("  - JSON: API integration, programmatic analysis")
print("  - CSV: Excel, data science tools, Snowsight")
print("  - HTML: Shareable reports, documentation")

### 3.10 Credit Consumption Analysis

Analyze Snowflake credit usage:

In [None]:
print("💰 Snowflake Credit Consumption\n")
print("Understanding Credits:")
print("  - 1 credit = 1 hour of X-Small warehouse")
print("  - Warehouse size determines credit rate")
print("  - Charged per second (minimum 60 seconds)")
print("  - Cloud services: up to 10% of compute (free)\n")

print("Credit Rate by Warehouse Size:")
print("  X-Small:  1 credit/hour  = $2-4/hour")
print("  Small:    2 credits/hour = $4-8/hour")
print("  Medium:   4 credits/hour = $8-16/hour")
print("  Large:    8 credits/hour = $16-32/hour\n")

print("💡 Cost Optimization:")
print("  1. Right-size warehouses (don't over-provision)")
print("  2. Set auto-suspend (5-10 minutes idle)")
print("  3. Use resource monitors to set spend limits")
print("  4. Enable query result caching")
print("  5. Use clustering sparingly (reclustering costs credits)")
print("  6. Monitor with WAREHOUSE_METERING_HISTORY view\n")

print("🔍 Check Credit Usage:")
print("  SELECT warehouse_name, SUM(credits_used) as total_credits")
print("  FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY")
print("  WHERE start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())")
print("  GROUP BY warehouse_name")
print("  ORDER BY total_credits DESC;")
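
The billing rules above can be turned into a quick estimate. A sketch, assuming a single continuous run on one cluster, per-second billing with the 60-second minimum, and no cloud services charges. `estimate_cost` is a hypothetical helper, and the price per credit is an input you supply from your own contract.

```python
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8}


def estimate_cost(size, running_seconds, price_per_credit):
    """Return (credits, dollars) for one continuous warehouse run."""
    # Each time a warehouse starts, at least 60 seconds are billed
    billed_seconds = max(running_seconds, 60)
    credits = CREDITS_PER_HOUR[size] * billed_seconds / 3600
    return credits, credits * price_per_credit
```

For instance, `estimate_cost("Medium", 30, 3.0)` bills a full minute even though the warehouse ran for only 30 seconds, which is why very short benchmark runs look disproportionately expensive per query.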

## 4. Platform-Specific Features

### 4.1 Auto-Scaling and Auto-Suspend

Configure warehouse auto-scaling:

In [None]:
print("⚡ Warehouse Auto-Scaling\n")
print("Auto-Suspend:")
print("  - Automatically suspend warehouse after idle period")
print("  - Recommended: 5-10 minutes for production")
print("  - Recommended: 1 minute for development")
print("  - SQL: ALTER WAREHOUSE name SET AUTO_SUSPEND = 300;\n")

print("Auto-Resume:")
print("  - Automatically resume on query submission")
print("  - First query waits for warehouse startup (~5-10 seconds)")
print("  - SQL: ALTER WAREHOUSE name SET AUTO_RESUME = TRUE;\n")

print("Multi-Cluster Auto-Scaling:")
print("  - Scale out for high concurrency")
print("  - Min/Max clusters: 1-10")
print("  - Scaling policy: Standard or Economy")
print("  - SQL: ALTER WAREHOUSE name SET")
print("         MIN_CLUSTER_COUNT = 1")
print("         MAX_CLUSTER_COUNT = 3")
print("         SCALING_POLICY = 'STANDARD';\n")

print("Scaling Policies:")
print("  Standard: Prevent queuing, start cluster immediately")
print("  Economy: Minimize credits, allow some queuing\n")

print("💡 Best Practices:")
print("  - Use auto-suspend for all warehouses")
print("  - Multi-cluster for concurrent users, not speed")
print("  - Monitor with WAREHOUSE_LOAD_HISTORY")

### 4.2 Clustering Keys and Micro-Partitions

Understanding Snowflake's automatic micro-partitioning:

In [None]:
print("📊 Micro-Partitions and Clustering\n")
print("Micro-Partitions (Automatic):")
print("  - Snowflake automatically divides data into micro-partitions of 50-500 MB (uncompressed)")
print("  - Stores metadata (min/max values, null counts)")
print("  - Enables partition pruning (skip irrelevant micro-partitions)")
print("  - Compressed and encrypted automatically\n")

print("Clustering Keys (Manual):")
print("  - Define clustering order for large tables (>1 TB)")
print("  - Improves pruning efficiency")
print("  - Recommended: at most 3-4 columns per clustering key")
print("  - Automatic reclustering (consumes credits)\n")

print("When to Use Clustering:")
print("  ✅ Large tables (>1 TB)")
print("  ✅ Queries with range filters (WHERE date BETWEEN...)")
print("  ✅ Queries on specific columns repeatedly")
print("  ❌ Small tables (<1 GB) - not worth it")
print("  ❌ Frequently updated tables - high reclustering cost\n")

print("Example: Add Clustering Key")
print("  ALTER TABLE orders CLUSTER BY (o_orderdate);\n")

print("Monitor Clustering:")
print("  SELECT system$clustering_information('orders', '(o_orderdate)');")
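
Partition pruning can be illustrated without Snowflake. The sketch below is a toy model, not Snowflake's implementation: each micro-partition is reduced to the min/max metadata of one column, and a range filter only needs to scan partitions whose range overlaps the filter. The data values are made up for illustration. It shows why well-clustered data (narrow, non-overlapping ranges) prunes better than unclustered data.

```python
def partitions_to_scan(partitions, low, high):
    """Return indexes of partitions whose [min, max] range overlaps [low, high]."""
    return [
        i
        for i, (p_min, p_max) in enumerate(partitions)
        if p_max >= low and p_min <= high
    ]


# Toy metadata: (min, max) of an order-date column stored as YYYYMMDD integers
clustered = [(19920101, 19931231), (19940101, 19951231), (19960101, 19981231)]
unclustered = [(19920101, 19981231), (19920105, 19980801), (19920301, 19981115)]

filter_range = (19940101, 19941231)
print(partitions_to_scan(clustered, *filter_range))    # [1]
print(partitions_to_scan(unclustered, *filter_range))  # [0, 1, 2]
```

With clustered data only one of three partitions is read; with unclustered data every partition's range overlaps the filter, so nothing can be skipped.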

### 4.3 Time Travel and Zero-Copy Cloning

Leverage Snowflake's data protection features:

In [None]:
print("⏰ Time Travel\n")
print("What is Time Travel?")
print("  - Query historical data within retention period")
print("  - Standard: 1 day retention")
print("  - Enterprise: Up to 90 days retention")
print("  - Historical data counts toward storage (longer retention means higher storage cost)\n")

print("Use Cases:")
print("  - Undo accidental deletes/updates")
print("  - Audit historical changes")
print("  - Compare data at different points in time\n")

print("Examples:")
print("  -- Query table as of 5 minutes ago")
print("  SELECT * FROM orders AT(OFFSET => -60*5);")
print("")
print("  -- Query before specific timestamp")
print("  SELECT * FROM orders BEFORE(TIMESTAMP => '2025-01-01 12:00:00'::timestamp);")
print("")
print("  -- Restore deleted table")
print("  UNDROP TABLE orders;\n")

print("📋 Zero-Copy Cloning\n")
print("What is Zero-Copy Cloning?")
print("  - Create instant copy without duplicating data")
print("  - No additional storage cost (initially)")
print("  - Metadata-only operation (seconds)")
print("  - Changes diverge (copy-on-write)\n")

print("Use Cases:")
print("  - Create dev/test environments")
print("  - Snapshot before major changes")
print("  - A/B testing")
print("  - Backup before data migration\n")

print("Examples:")
print("  -- Clone database")
print("  CREATE DATABASE dev_db CLONE prod_db;")
print("")
print("  -- Clone table")
print("  CREATE TABLE orders_backup CLONE orders;")
print("")
print("  -- Clone at specific time")
print("  CREATE TABLE orders_snapshot CLONE orders AT(OFFSET => -60*60*24);  -- 1 day ago")
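
The `OFFSET` value in the examples above is a number of seconds relative to the current time, so it must be negative. A small sketch for building the clause from a `timedelta` instead of hand-written arithmetic. `at_offset` is a hypothetical helper.

```python
from datetime import timedelta


def at_offset(age):
    """Return an AT(OFFSET => ...) clause for data as it was `age` ago."""
    seconds = int(age.total_seconds())
    if seconds <= 0:
        raise ValueError("age must be a positive duration")
    return f"AT(OFFSET => -{seconds})"
```

For example, `f"SELECT * FROM orders {at_offset(timedelta(minutes=5))}"` produces the same query as the first Time Travel example. The requested age still has to fall within the table's retention period.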

### 4.4 External Stages (S3, Azure, GCS)

Load data from external cloud storage:

In [None]:
print("☁️  External Stages Overview\n")
print("What are External Stages?")
print("  - Named references to external storage locations")
print("  - Support: S3, Azure Blob Storage, Google Cloud Storage")
print("  - Load data directly from cloud storage")
print("  - No intermediate storage required\n")

print("Benefits:")
print("  - Load data from existing data lakes")
print("  - No duplication of data")
print("  - Supports common file formats (CSV, JSON, Avro, ORC, Parquet, XML)")
print("  - Integration with Snowpipe for continuous loading\n")

print("Example: Create S3 Stage")
print("""
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = 'xxx' AWS_SECRET_KEY = 'xxx')
  FILE_FORMAT = (TYPE = PARQUET);
""")

print("\nExample: Load from Stage")
print("""
COPY INTO orders
FROM @my_s3_stage/orders.parquet
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
""")

print("\n💡 BenchBox Integration:")
print("  platform_cfg = {")
print('      "stage_location": "@my_s3_stage",')
print('      "file_format": "PARQUET"')
print("  }")

### 4.5 Snowpipe and Streams

Continuous data ingestion with Snowpipe:

In [None]:
print("🔄 Snowpipe (Continuous Data Loading)\n")
print("What is Snowpipe?")
print("  - Serverless, continuous data ingestion")
print("  - Load data within minutes of availability")
print("  - Event-driven (S3 notifications, Azure Event Grid)")
print("  - Pay per file processed (separate from warehouse credits)\n")

print("Example: Create Snowpipe")
print("""
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO orders
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
""")

print("\n📊 Streams (Change Data Capture)\n")
print("What are Streams?")
print("  - Track changes to table (INSERT, UPDATE, DELETE)")
print("  - Enable CDC patterns")
print("  - No additional storage cost")
print("  - Consume stream with DML statements\n")

print("Example: Create Stream")
print("""
CREATE STREAM orders_stream ON TABLE orders;

-- Process changes
INSERT INTO orders_history
SELECT * FROM orders_stream WHERE metadata$action = 'INSERT';
""")

print("\n💡 Use Cases:")
print("  - Snowpipe: Real-time data ingestion from cloud storage")
print("  - Streams: Incremental processing, CDC, data pipeline triggers")

## 5. Performance Analysis

### 5.1 Load and Prepare Results

Load benchmark results for analysis:

In [None]:
# Load results from previous run
if "results" in locals() and results.query_results:
    # Convert to pandas DataFrame for analysis
    df_results = pd.DataFrame(
        [
            {
                "query": qr.query_name,
                "time": qr.execution_time,
                "success": qr.success,
                "rows_returned": getattr(qr, "row_count", None),
            }
            for qr in results.query_results
        ]
    )

    print("✅ Results loaded into DataFrame")
    print(f"\nShape: {df_results.shape[0]} queries, {df_results.shape[1]} columns")
    print("\nFirst 5 rows:")
    print(df_results.head())
else:
    print("⚠️ No results available. Run a benchmark first.")
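
To try the analysis cells without a Snowflake run, the same record shape can be built from stand-in objects. This is a sketch under stated assumptions: `MockQueryResult` and `to_rows` are hypothetical names, and the attributes simply mirror the ones the cell above reads (`query_name`, `execution_time`, `success`, optional `row_count`).

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class MockQueryResult:
    """Hypothetical stand-in for one query result object."""

    query_name: str
    execution_time: float  # seconds
    success: bool = True


def to_rows(query_results):
    """Flatten result objects into the records used by the analysis cells."""
    return [
        {
            "query": qr.query_name,
            "time": qr.execution_time,
            "success": qr.success,
            # row_count is optional, so fall back to None when it is absent
            "rows_returned": getattr(qr, "row_count", None),
        }
        for qr in query_results
    ]


mock_results = SimpleNamespace(
    query_results=[
        MockQueryResult("Q1", 1.42),
        MockQueryResult("Q2", 0.37),
        MockQueryResult("Q3", 5.10, success=False),
    ]
)
rows = to_rows(mock_results.query_results)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` yields a frame with the same columns as `df_results`.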

### 5.2 Statistical Analysis

Compute detailed statistics and identify outliers:

In [None]:
if "df_results" in locals():
    # Compute statistics
    stats = df_results["time"].describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99])

    print("📊 Execution Time Statistics\n")
    print(stats)

    print("\n🔍 Key Percentiles:")
    print(f"  P25 (25th percentile): {df_results['time'].quantile(0.25):.3f}s")
    print(f"  P50 (median): {df_results['time'].median():.3f}s")
    print(f"  P75 (75th percentile): {df_results['time'].quantile(0.75):.3f}s")
    print(f"  P95 (95th percentile): {df_results['time'].quantile(0.95):.3f}s")
    print(f"  P99 (99th percentile): {df_results['time'].quantile(0.99):.3f}s")

    # Identify outliers (>2 standard deviations)
    mean_time = df_results["time"].mean()
    std_time = df_results["time"].std()
    outliers = df_results[df_results["time"] > mean_time + 2 * std_time]

    if not outliers.empty:
        print("\n⚠️  Performance Outliers (>2σ):")
        for _, row in outliers.iterrows():
            z_score = (row["time"] - mean_time) / std_time
            print(f"  {row['query']}: {row['time']:.2f}s (z-score: {z_score:.2f})")

        print("\n💡 Investigation steps:")
        print("  1. Check the Query Profile in Snowsight")
        print("  2. Review query execution plan")
        print("  3. Check for data skew")
        print("  4. Consider adding clustering keys")
    else:
        print("\n✅ No significant outliers detected")

    # Coefficient of variation (CV)
    cv = (std_time / mean_time) * 100
    print("\n📈 Variability:")
    print(f"  Standard deviation: {std_time:.3f}s")
    print(f"  Coefficient of variation: {cv:.1f}%")
    if cv < 20:
        print("  Assessment: Low variability (consistent performance)")
    elif cv < 50:
        print("  Assessment: Moderate variability (typical for mixed workload)")
    else:
        print("  Assessment: High variability (investigate slow queries)")
else:
    print("⚠️ Load results first")
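
Query times are usually right-skewed, and a single very slow query inflates both the mean and the standard deviation that the 2σ rule depends on. As a complementary check, the sketch below uses Tukey's upper fence (Q3 + 1.5 × IQR) with only the standard library. The function name `iqr_outliers` and the sample timings are made up for illustration.

```python
import statistics


def iqr_outliers(times: dict[str, float], k: float = 1.5) -> dict[str, float]:
    """Return queries whose time exceeds Q3 + k * IQR (Tukey's upper fence)."""
    q1, _, q3 = statistics.quantiles(times.values(), n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return {name: t for name, t in times.items() if t > upper_fence}


# Illustrative timings in seconds (not real benchmark output)
sample_times = {"Q1": 1.0, "Q2": 1.2, "Q3": 0.9, "Q4": 1.1, "Q5": 1.3, "Q6": 9.5}
slow = iqr_outliers(sample_times)
print(slow)
```

Because quartiles barely move when one value is extreme, this check stays stable on small benchmark suites such as the 22 TPC-H queries.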

### 5.3 Comprehensive Visualizations

Multi-panel performance visualization:

In [None]:
if "df_results" in locals():
    # Create 2x2 subplot grid
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle("Snowflake Performance Analysis", fontsize=16, fontweight="bold")

    # 1. Distribution histogram
    axes[0, 0].hist(df_results["time"], bins=20, color="#29B5E8", alpha=0.7, edgecolor="black")
    axes[0, 0].axvline(df_results["time"].mean(), color="red", linestyle="--", linewidth=2, label="Mean")
    axes[0, 0].axvline(df_results["time"].median(), color="green", linestyle="--", linewidth=2, label="Median")
    axes[0, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
    axes[0, 0].set_ylabel("Frequency", fontweight="bold")
    axes[0, 0].set_title("Execution Time Distribution")
    axes[0, 0].legend()
    axes[0, 0].grid(axis="y", alpha=0.3)

    # 2. Box plot
    bp = axes[0, 1].boxplot(df_results["time"], patch_artist=True)
    bp["boxes"][0].set_facecolor("#29B5E8")
    bp["boxes"][0].set_alpha(0.7)
    axes[0, 1].set_ylabel("Execution Time (seconds)", fontweight="bold")
    axes[0, 1].set_title("Box Plot (Outlier Detection)")
    axes[0, 1].set_xticklabels(["All Queries"])
    axes[0, 1].grid(axis="y", alpha=0.3)

    # 3. Sorted horizontal bar chart (top 15)
    df_sorted = df_results.sort_values("time", ascending=True).tail(15)
    colors = ["#F26B1D" if t > df_results["time"].quantile(0.9) else "#29B5E8" for t in df_sorted["time"]]
    axes[1, 0].barh(df_sorted["query"], df_sorted["time"], color=colors, alpha=0.8)
    axes[1, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
    axes[1, 0].set_title("Slowest 15 Queries")
    axes[1, 0].grid(axis="x", alpha=0.3)

    # 4. Cumulative performance (Pareto analysis)
    df_sorted_desc = df_results.sort_values("time", ascending=False)
    df_sorted_desc["cumulative_pct"] = df_sorted_desc["time"].cumsum() / df_sorted_desc["time"].sum() * 100
    axes[1, 1].plot(
        range(len(df_sorted_desc)), df_sorted_desc["cumulative_pct"], marker="o", color="#29B5E8", linewidth=2
    )
    axes[1, 1].axhline(80, color="red", linestyle="--", linewidth=2, label="80% threshold")
    axes[1, 1].set_xlabel("Number of Queries (sorted by time)", fontweight="bold")
    axes[1, 1].set_ylabel("Cumulative % of Total Time", fontweight="bold")
    axes[1, 1].set_title("Pareto Analysis (80/20 Rule)")
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Pareto insight: count the queries needed for cumulative time to reach 80%
    # (those still below the threshold, plus the one that crosses it)
    queries_for_80pct = int((df_sorted_desc["cumulative_pct"] < 80).sum()) + 1
    print("\n📊 Pareto Insight:")
    print(
        f"  {queries_for_80pct} queries ({queries_for_80pct / len(df_results) * 100:.1f}%) account for 80% of total time"
    )
    print(f"  💡 Focus optimization efforts on these {queries_for_80pct} queries")
else:
    print("⚠️ Load results first")
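
The Pareto count can also be computed without pandas. The sketch below walks the queries from slowest to fastest and stops at the first one that pushes the running total to the threshold, so the query that crosses 80% is included. The function name `queries_to_reach` and the sample timings are illustrative assumptions.

```python
def queries_to_reach(times: dict[str, float], threshold_pct: float = 80.0) -> list[str]:
    """Return the slowest queries that together reach threshold_pct of total time."""
    total = sum(times.values())
    if total <= 0:
        return []
    selected, running = [], 0.0
    for name, t in sorted(times.items(), key=lambda item: item[1], reverse=True):
        selected.append(name)
        running += t
        # Compare without dividing to avoid rounding right at the threshold
        if running * 100 >= threshold_pct * total:
            break
    return selected


# Illustrative timings in seconds (not real benchmark output)
sample = {"Q1": 60.0, "Q2": 25.0, "Q3": 8.0, "Q4": 4.0, "Q5": 3.0}
top = queries_to_reach(sample)
print(top)
```

Returning the query names rather than a count makes it straightforward to feed the list into a follow-up investigation.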

### 5.4 Query History Analysis

Deep dive into query performance using QUERY_HISTORY:

In [None]:
try:
    conn = snowflake.connector.connect(
        user=SNOWFLAKE_USER, password=SNOWFLAKE_PASSWORD, account=SNOWFLAKE_ACCOUNT, warehouse=SNOWFLAKE_WAREHOUSE
    )

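    # Query detailed performance metrics
    # Note: ACCOUNT_USAGE.QUERY_HISTORY can lag by up to 45 minutes, so queries
    # from the last hour may not appear yet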
    query = f"""
    SELECT 
        query_id,
        query_text,
        warehouse_name,
        warehouse_size,
        execution_time / 1000.0 as execution_seconds,
        (queued_provisioning_time + queued_repair_time + queued_overload_time) / 1000.0 as queue_seconds,
        compilation_time / 1000.0 as compile_seconds,
        bytes_scanned,
        bytes_written,
        rows_produced,
        partitions_scanned,
        partitions_total
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE start_time > DATEADD(hour, -1, CURRENT_TIMESTAMP())
        AND warehouse_name = '{SNOWFLAKE_WAREHOUSE}'
        AND query_type = 'SELECT'
    ORDER BY execution_time DESC
    LIMIT 10
    """

    df_detailed = pd.read_sql(query, conn)

    if not df_detailed.empty:
        print("📊 Top 10 Slowest Queries:\n")
        for idx, row in df_detailed.iterrows():
            print(f"{idx + 1}. Query ID: {row['QUERY_ID'][:16]}...")
            print(f"   Execution: {row['EXECUTION_SECONDS']:.2f}s")
            print(f"   Queue time: {row['QUEUE_SECONDS']:.2f}s")
            print(f"   Compile time: {row['COMPILE_SECONDS']:.2f}s")
            print(f"   Bytes scanned: {row['BYTES_SCANNED'] / (1024**3):.2f} GB")
            print(f"   Rows produced: {row['ROWS_PRODUCED']:,}")

            if row["PARTITIONS_TOTAL"] > 0:
                prune_pct = (1 - row["PARTITIONS_SCANNED"] / row["PARTITIONS_TOTAL"]) * 100
                print(f"   Pruning efficiency: {prune_pct:.1f}%")
            print()

        print("💡 Optimization Tips:")
        print("  - High queue time: Increase warehouse size or use multi-cluster")
        print("  - High compile time: Simplify the query (fewer nested views and joins)")
        print("  - Low pruning: Add clustering keys on filter columns")
        print("  - High bytes scanned: Select only the columns you need instead of SELECT *")
    else:
        print("⚠️ No query history available")

    conn.close()

except Exception as e:
    print(f"⚠️ Could not analyze query history: {e}")
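
The pruning figure printed above is the share of micro-partitions the query skipped: `(1 - partitions_scanned / partitions_total) * 100`. A small helper makes the edge cases explicit. The name `pruning_efficiency` is hypothetical, and returning `None` for a missing or zero total is a design choice of this sketch, not Snowflake behavior.

```python
def pruning_efficiency(partitions_scanned, partitions_total):
    """Percent of micro-partitions skipped, or None when it cannot be computed."""
    if not partitions_total or partitions_scanned is None:
        return None
    return (1 - partitions_scanned / partitions_total) * 100


print(pruning_efficiency(25, 100))   # a query that read a quarter of the table
print(pruning_efficiency(100, 100))  # a full scan: nothing was pruned
print(pruning_efficiency(0, 0))      # no partition data (e.g. metadata-only query)
```

Values near 0% on large, filtered tables are the cases where a clustering key on the filter column is worth testing.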

### 5.5 Credit Consumption Analysis

Analyze warehouse credit usage:

In [None]:
try:
    conn = snowflake.connector.connect(
        user=SNOWFLAKE_USER, password=SNOWFLAKE_PASSWORD, account=SNOWFLAKE_ACCOUNT, warehouse=SNOWFLAKE_WAREHOUSE
    )

    # Query warehouse metering
    # Note: WAREHOUSE_METERING_HISTORY can lag by up to 3 hours
    query = f"""
    SELECT 
        warehouse_name,
        DATE_TRUNC('hour', start_time) as hour,
        SUM(credits_used) as credits_used
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    WHERE start_time > DATEADD(day, -1, CURRENT_TIMESTAMP())
        AND warehouse_name = '{SNOWFLAKE_WAREHOUSE}'
    GROUP BY warehouse_name, hour
    ORDER BY hour DESC
    """

    df_credits = pd.read_sql(query, conn)

    if not df_credits.empty:
        total_credits = df_credits["CREDITS_USED"].sum()

        print("💰 Credit Consumption (Last 24 Hours)\n")
        print(f"Total credits used: {total_credits:.4f}")
        print(f"Estimated cost: ${total_credits * 3:.2f} (assuming $3/credit)")
        print("\nHourly breakdown:")
        for _, row in df_credits.head(10).iterrows():
            print(f"  {row['HOUR']}: {row['CREDITS_USED']:.4f} credits")
    else:
        print("⚠️ No credit usage data available")

    conn.close()

except Exception as e:
    print(f"⚠️ Could not analyze credit consumption: {e}")
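
Because metering data arrives with a delay, a rough estimate from run time is often useful right after a benchmark. The sketch below assumes standard warehouses, which consume 1, 2, 4, 8, and 16 credits per hour for XS through XL, billed per second with a 60-second minimum each time the warehouse resumes. The function names are hypothetical, and the `$3` price per credit is the same assumption used in the cell above; actual rates depend on edition, region, and contract.

```python
# Credits per hour for standard warehouses, per cluster
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def estimate_credits(size: str, runtime_seconds: float, clusters: int = 1) -> float:
    """Estimate credits for one continuous run of a warehouse."""
    billed_seconds = max(runtime_seconds, 60)  # 60-second minimum per resume
    return CREDITS_PER_HOUR[size] * clusters * billed_seconds / 3600


def estimate_cost(credits: float, price_per_credit: float = 3.0) -> float:
    """Convert credits to dollars at an assumed price per credit."""
    return credits * price_per_credit


for size in CREDITS_PER_HOUR:
    credits = estimate_credits(size, runtime_seconds=900)  # a 15-minute run
    print(f"{size}: {credits:.2f} credits, about ${estimate_cost(credits):.2f}")
```

The estimate ignores idle time before auto-suspend, so treat it as a lower bound and reconcile it with `WAREHOUSE_METERING_HISTORY` once the data is available.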

### 5.6 Regression Detection

Compare against baseline to detect performance regressions:

In [None]:
if "df_results" in locals():
    # Compare against baseline (you can load from a saved baseline file)
    # For demonstration, we'll use a mock baseline
    baseline_avg = 1.8  # seconds (mock baseline)
    current_avg = df_results["time"].mean()

    # Calculate change
    change_pct = ((current_avg - baseline_avg) / baseline_avg) * 100

    print("🔍 Performance Regression Analysis\n")
    print(f"Baseline average: {baseline_avg:.2f}s")
    print(f"Current average: {current_avg:.2f}s")
    print(f"Change: {change_pct:+.1f}%\n")

    # Threshold: 10% change
    if abs(change_pct) > 10:
        if change_pct > 0:
            status = "❌ REGRESSION DETECTED"
            print(status)
            print(f"Performance degraded by {change_pct:.1f}%\n")
            print("💡 Investigation Steps:")
            print("  1. Check warehouse size and availability")
            print("  2. Review clustering health (CLUSTERING_INFORMATION)")
            print("  3. Check for data growth")
            print("  4. Verify result cache hit rate")
            print("  5. Review query execution plans in Snowsight")
        else:
            status = "✅ PERFORMANCE IMPROVEMENT"
            print(status)
            print(f"Performance improved by {abs(change_pct):.1f}%\n")
            print("💡 Possible Reasons:")
            print("  - Clustering optimization")
            print("  - Result caching")
            print("  - Warehouse size increase")
            print("  - Query optimization")
    else:
        print("✅ Performance stable (within 10% threshold)\n")

    print("\n💡 Save current run as new baseline:")
    print("   df_results.to_csv('baseline_snowflake_tpch.csv', index=False)")
else:
    print("⚠️ Load results first")
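
An average can hide a regression in one query when another query gets faster. A per-query comparison against the saved baseline catches that case. The sketch below is illustrative: `find_regressions` is a hypothetical helper, the timings are made up, and the 10% threshold matches the one used above. In practice the two dictionaries could be built from the baseline CSV and from `df_results`.

```python
def find_regressions(
    baseline: dict[str, float], current: dict[str, float], threshold_pct: float = 10.0
) -> dict[str, float]:
    """Return the percent change for each query that moved more than threshold_pct."""
    changes = {}
    for query, base_time in baseline.items():
        # Skip queries missing from the current run or with an unusable baseline
        if query not in current or base_time <= 0:
            continue
        change_pct = (current[query] - base_time) / base_time * 100
        if abs(change_pct) > threshold_pct:
            changes[query] = round(change_pct, 1)
    return changes


# Illustrative timings in seconds (not real benchmark output)
baseline_times = {"Q1": 2.0, "Q2": 1.0, "Q3": 4.0}
current_times = {"Q1": 2.1, "Q2": 1.5, "Q3": 3.0}
changes = find_regressions(baseline_times, current_times)
print(changes)
```

Positive values are slowdowns and negative values are improvements, so in this sample the overall total is unchanged even though one query is 50% slower.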

## 6. Troubleshooting

### 6.1 Connection Diagnostics

Comprehensive connection troubleshooting:

In [None]:
def diagnose_snowflake_connection():
    """Diagnose Snowflake connection issues"""
    print("🔍 Snowflake Connection Diagnostic\n")

    # Check 1: Environment variables
    print("1️⃣ Checking environment variables...")
    if SNOWFLAKE_ACCOUNT:
        print(f"   ✅ SNOWFLAKE_ACCOUNT = {SNOWFLAKE_ACCOUNT}")
    else:
        print("   ❌ SNOWFLAKE_ACCOUNT not set")

    if SNOWFLAKE_USER:
        print(f"   ✅ SNOWFLAKE_USER = {SNOWFLAKE_USER}")
    else:
        print("   ❌ SNOWFLAKE_USER not set")

    if SNOWFLAKE_PASSWORD:
        print("   ✅ SNOWFLAKE_PASSWORD is set")
    else:
        print("   ⚠️  SNOWFLAKE_PASSWORD not set")

    # Check 2: Account format
    print("\n2️⃣ Validating account identifier...")
    if SNOWFLAKE_ACCOUNT and ("." in SNOWFLAKE_ACCOUNT or "-" in SNOWFLAKE_ACCOUNT):
        print(f"   ✅ Account looks fully qualified: {SNOWFLAKE_ACCOUNT}")
    else:
        print("   ⚠️  Account may need full identifier (e.g., myorg-myaccount or xy12345.us-east-1.aws)")

    # Check 3: Connection test
    print("\n3️⃣ Testing connection...")
    try:
        test_conn = snowflake.connector.connect(
            user=SNOWFLAKE_USER, password=SNOWFLAKE_PASSWORD, account=SNOWFLAKE_ACCOUNT
        )
        print("   ✅ Connection successful")
        test_conn.close()
    except Exception as e:
        print(f"   ❌ Connection failed: {e}")

    print("\n" + "=" * 60)
    print("📚 Troubleshooting Guide:\n")
    print("If connection fails:")
    print("  1. Verify account identifier (xy12345.region.cloud)")
    print("  2. Check username and password")
    print("  3. Verify network policies (IP allowlists)")
    print("  4. Check user is not locked or expired\n")

    print("If warehouse issues:")
    print("  1. Verify warehouse exists: SHOW WAREHOUSES;")
    print("  2. Check warehouse state (suspended/running)")
    print("  3. Verify user has USAGE privilege on warehouse")


# Run diagnostics
diagnose_snowflake_connection()
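
Many "account not found" errors come from passing the wrong form of the account identifier. The sketch below is a rough heuristic, not an official validator: `classify_account` and its category labels are made up for illustration, and `myorg-myaccount` is a placeholder. It separates the organization-and-account form, a locator with a region suffix, a bare locator, and the common mistake of pasting the full URL.

```python
def classify_account(account: str) -> str:
    """Guess which identifier form was supplied (heuristic only)."""
    value = account.strip()
    if "snowflakecomputing.com" in value or "://" in value:
        return "url"  # pass only the identifier, not the full URL
    if "." in value:
        return "locator-with-region"  # e.g. xy12345.us-east-1.aws
    if "-" in value:
        return "org-account"  # e.g. myorg-myaccount
    return "bare-locator"  # e.g. xy12345; may need a region suffix


for candidate in [
    "xy12345",
    "xy12345.us-east-1.aws",
    "myorg-myaccount",
    "https://xy12345.snowflakecomputing.com",
]:
    print(f"{candidate}: {classify_account(candidate)}")
```

A `url` result is the one worth fixing first, since the connector expects only the identifier in the `account` parameter.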

### 6.2 Permission Validation

Verify required permissions:

In [None]:
def validate_snowflake_permissions():
    """Validate Snowflake permissions"""
    print("🔒 Snowflake Permission Validation\n")

    try:
        conn = snowflake.connector.connect(user=SNOWFLAKE_USER, password=SNOWFLAKE_PASSWORD, account=SNOWFLAKE_ACCOUNT)
        cursor = conn.cursor()

        # Test 1: Current role
        print("1️⃣ Checking current role...")
        cursor.execute("SELECT CURRENT_ROLE()")
        role = cursor.fetchone()[0]
        print(f"   ✅ Current role: {role}")

        # Test 2: Warehouse access
        print("\n2️⃣ Testing warehouse access...")
        try:
            cursor.execute(f"USE WAREHOUSE {SNOWFLAKE_WAREHOUSE}")
            print(f"   ✅ Can use warehouse: {SNOWFLAKE_WAREHOUSE}")
        except Exception as e:
            print(f"   ❌ Cannot use warehouse: {e}")

        # Test 3: Database access
        print("\n3️⃣ Testing database access...")
        try:
            cursor.execute(f"USE DATABASE {SNOWFLAKE_DATABASE}")
            print(f"   ✅ Can use database: {SNOWFLAKE_DATABASE}")
        except Exception as e:
            print(f"   ❌ Cannot use database: {e}")

        # Test 4: Create table
        print("\n4️⃣ Testing table creation...")
        try:
            cursor.execute("CREATE TEMPORARY TABLE test_table (id INT)")
            cursor.execute("DROP TABLE test_table")
            print("   ✅ Can create and drop tables")
        except Exception as e:
            print(f"   ❌ Cannot create tables: {e}")

        cursor.close()
        conn.close()

        print("\n" + "=" * 60)
        print("📋 Required Privileges:\n")
        print("Minimum (for benchmarking):")
        print("  - USAGE on warehouse")
        print("  - USAGE on database and schema")
        print("  - CREATE TABLE in schema")
        print("  - SELECT, INSERT on tables\n")
        print("Recommended (for full features):")
        print("  - SYSADMIN role")
        print("  - CREATE WAREHOUSE privilege")

    except Exception as e:
        print(f"❌ Permission validation failed: {e}")


# Run validation
try:
    validate_snowflake_permissions()
except Exception as e:
    print(f"❌ Validation error: {e}")
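
The minimum privileges listed above map onto a handful of `GRANT` statements. The sketch below only generates the SQL text so an administrator can review it before running it. The helper name `minimum_grants` and the role, database, and schema names (`BENCH_ROLE`, `BENCHMARK_DB`, `PUBLIC`) are placeholders, not names BenchBox requires.

```python
def minimum_grants(role: str, warehouse: str, database: str, schema: str) -> list[str]:
    """Generate GRANT statements for a dedicated benchmarking role."""
    fq_schema = f"{database}.{schema}"
    return [
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role};",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {fq_schema} TO ROLE {role};",
        f"GRANT CREATE TABLE ON SCHEMA {fq_schema} TO ROLE {role};",
        # Covers tables that already exist; tables the role creates are owned by it
        f"GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA {fq_schema} TO ROLE {role};",
    ]


grants = minimum_grants("BENCH_ROLE", "COMPUTE_WH", "BENCHMARK_DB", "PUBLIC")
print("\n".join(grants))
```

A dedicated role with only these grants keeps benchmark runs from depending on SYSADMIN or ACCOUNTADMIN.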

### 6.3 Warehouse Diagnostics

Check warehouse status and configuration:

In [None]:
try:
    conn = snowflake.connector.connect(user=SNOWFLAKE_USER, password=SNOWFLAKE_PASSWORD, account=SNOWFLAKE_ACCOUNT)
    cursor = conn.cursor()

    print("🔧 Warehouse Diagnostics\n")

    # Get warehouse details
    cursor.execute(f"SHOW WAREHOUSES LIKE '{SNOWFLAKE_WAREHOUSE}'")
    row = cursor.fetchone()

    if row:
        # Look columns up by name rather than position: the SHOW WAREHOUSES
        # column order is easy to get wrong and may change between releases
        columns = [col[0].lower() for col in cursor.description]
        wh_info = dict(zip(columns, row))
        auto_suspend = wh_info.get("auto_suspend")
        auto_resume = str(wh_info.get("auto_resume")).lower()

        print(f"Warehouse: {wh_info.get('name')}")
        print(f"State: {wh_info.get('state')}")
        print(f"Type: {wh_info.get('type')}")
        print(f"Size: {wh_info.get('size')}")
        print(f"Min Clusters: {wh_info.get('min_cluster_count')}")
        print(f"Max Clusters: {wh_info.get('max_cluster_count')}")
        print(f"Auto Suspend: {auto_suspend} seconds")
        print(f"Auto Resume: {auto_resume}")

        # Recommendations
        print("\n💡 Recommendations:")
        if wh_info.get("state") == "SUSPENDED":
            print("  ⚠️  Warehouse is suspended. It resumes on the first query only if auto-resume is enabled.")
        if auto_suspend is None or auto_suspend > 600:
            print("  ⚠️  Consider setting auto-suspend to 5-10 minutes")
        if auto_resume != "true":
            print("  ⚠️  Enable auto-resume for automatic startup")
    else:
        print(f"❌ Warehouse '{SNOWFLAKE_WAREHOUSE}' not found")

    cursor.close()
    conn.close()

except Exception as e:
    print(f"❌ Warehouse diagnostic failed: {e}")

### 6.4 Common Issues and Solutions

Quick reference for common Snowflake benchmarking issues:

In [None]:
print("🔧 Common Snowflake Benchmarking Issues\n")
print("=" * 70)

print("\n❌ Issue: 'Account not found' or connection timeout")
print("✅ Solution:")
print("   1. Use full account identifier: xy12345.us-east-1.aws")
print("   2. Check network policies (IP allowlists)")
print("   3. Verify account is not suspended")
print("   4. Test from Snowsight web UI first\n")

print("❌ Issue: 'Warehouse not started' errors")
print("✅ Solution:")
print("   1. Enable auto-resume: ALTER WAREHOUSE name SET AUTO_RESUME = TRUE;")
print("   2. Manually start: ALTER WAREHOUSE name RESUME;")
print("   3. Check warehouse privileges (USAGE)\n")

print("❌ Issue: Slow query performance")
print("✅ Solution:")
print("   1. Review query profile in Snowsight")
print("   2. Check clustering keys on large tables")
print("   3. Increase warehouse size (XS → S → M)")
print("   4. Reuse cached results for repeated queries (result cache is on by default)")
print("   5. Use EXPLAIN to analyze execution plan\n")

print("❌ Issue: High credit consumption")
print("✅ Solution:")
print("   1. Set aggressive auto-suspend (5-10 minutes)")
print("   2. Right-size warehouses (don't over-provision)")
print("   3. Use separate warehouses for ETL vs analytics")
print("   4. Monitor with WAREHOUSE_METERING_HISTORY")
print("   5. Set resource monitors to limit spend\n")

print("❌ Issue: Query queuing")
print("✅ Solution:")
print("   1. Increase warehouse size (more compute)")
print("   2. Enable multi-cluster (scale out for concurrency)")
print("   3. Use dedicated warehouses per workload")
print("   4. Optimize queries to reduce runtime\n")

print("❌ Issue: Data loading failures")
print("✅ Solution:")
print("   1. Verify stage configuration (URL, credentials)")
print("   2. Check file format matches data")
print("   3. Use VALIDATION_MODE to test before loading")
print("   4. Enable ON_ERROR = 'CONTINUE' to skip bad rows")
print("   5. Review COPY_HISTORY for error details\n")

print("=" * 70)
print("\n💡 More Help:")
print("  - Snowflake docs: https://docs.snowflake.com")
print("  - Community: https://community.snowflake.com")
print("  - BenchBox docs: https://github.com/joeharris76/benchbox")

## Next Steps

**Continue Learning:**
- Explore other cloud platforms: BigQuery, Databricks, Redshift
- Try different benchmarks: TPC-DS, ClickBench, SSB
- Compare platforms with multi-platform notebooks
- Set up CI/CD regression testing

**Platform-Specific Features to Explore:**
- Multi-cluster warehouses (scale out for concurrency)
- Snowpipe (continuous data loading)
- Streams and Tasks (CDC and ETL pipelines)
- Data Sharing (share data across accounts)
- Search Optimization Service

**Resources:**
- [BenchBox Documentation](https://github.com/joeharris76/benchbox)
- [Snowflake Documentation](https://docs.snowflake.com)
- [Snowflake Best Practices](https://docs.snowflake.com/en/user-guide/performance)
- [Snowflake Pricing](https://www.snowflake.com/pricing/)