# Amazon Redshift Benchmarking with BenchBox

This notebook demonstrates comprehensive benchmarking of Amazon Redshift data warehouses using BenchBox.

**What you'll learn:**
- Running TPC-H and TPC-DS benchmarks on Redshift clusters
- Optimizing tables with distribution keys and sort keys
- Loading data efficiently using COPY from S3
- Monitoring cluster performance with STL/SVL system tables
- Configuring Workload Management (WLM) for query prioritization
- Analyzing costs and performance trade-offs

**Prerequisites:**
- Active Amazon Redshift cluster
- AWS credentials with Redshift and S3 access
- Database user with CREATE TABLE and COPY privileges
- Security groups configured to allow connections

**Estimated time:** 15-30 minutes (scale factor 0.01-1.0)

## 1. Installation & Setup

### Install Required Packages

Install BenchBox and the Redshift connector library.

In [None]:
!pip install -q benchbox redshift_connector boto3 pandas matplotlib seaborn

### Import Libraries

Import BenchBox components and visualization libraries.

In [None]:
import os
import warnings
from datetime import datetime
from pathlib import Path

warnings.filterwarnings("ignore")

# Visualization imports
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# BenchBox imports
from benchbox.core.config import BenchmarkConfig, DatabaseConfig
from benchbox.core.results.comparison import BenchmarkComparator
from benchbox.core.results.exporter import ResultExporter
from benchbox.core.results.loader import ResultLoader
from benchbox.core.runner import LifecyclePhases, run_benchmark_lifecycle

# Redshift connector
try:
    import redshift_connector

    print("✅ redshift_connector imported successfully")
except ImportError as e:
    print(f"❌ Error importing redshift_connector: {e}")
    print("   Install with: pip install redshift_connector")

# Configure plotting
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")
%matplotlib inline

print("📦 All libraries imported successfully")

### Configure Authentication

Set up credentials for your Redshift cluster. Three methods are supported:

**Method 1: Environment Variables (Recommended)**
```bash
export REDSHIFT_HOST="mycluster.abc123.us-east-1.redshift.amazonaws.com"
export REDSHIFT_PORT="5439"
export REDSHIFT_DB="dev"
export REDSHIFT_USER="awsuser"
export REDSHIFT_PASSWORD="your-password"
export AWS_REGION="us-east-1"
export AWS_S3_BUCKET="my-benchmark-bucket"  # Optional, for COPY operations
export AWS_IAM_ROLE="arn:aws:iam::123456789012:role/RedshiftCopyRole"  # Optional
```

**Method 2: IAM Authentication (Temporary Credentials)**
```python
# Use get_cluster_credentials API for temporary credentials
conn = redshift_connector.connect(
    host='mycluster.abc123.us-east-1.redshift.amazonaws.com',
    database='dev',
    cluster_identifier='mycluster',
    iam=True,
    db_user='iamuser'
)
```

**Method 3: Secrets Manager**
```python
import boto3
import json

client = boto3.client('secretsmanager', region_name='us-east-1')
secret = json.loads(client.get_secret_value(SecretId='redshift-credentials')['SecretString'])
REDSHIFT_USER = secret['username']
REDSHIFT_PASSWORD = secret['password']
```

In [None]:
# Try environment variables first
try:
    REDSHIFT_HOST = os.environ.get("REDSHIFT_HOST")
    REDSHIFT_PORT = int(os.environ.get("REDSHIFT_PORT", "5439"))
    REDSHIFT_DB = os.environ.get("REDSHIFT_DB")
    REDSHIFT_USER = os.environ.get("REDSHIFT_USER")
    REDSHIFT_PASSWORD = os.environ.get("REDSHIFT_PASSWORD")
    AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
    AWS_S3_BUCKET = os.environ.get("AWS_S3_BUCKET", None)
    AWS_IAM_ROLE = os.environ.get("AWS_IAM_ROLE", None)

    # Validate required variables
    if not all([REDSHIFT_HOST, REDSHIFT_DB, REDSHIFT_USER, REDSHIFT_PASSWORD]):
        print("⚠️  Missing required environment variables. Please set:")
        print("   - REDSHIFT_HOST: Cluster endpoint")
        print("   - REDSHIFT_DB: Database name")
        print("   - REDSHIFT_USER: Database user")
        print("   - REDSHIFT_PASSWORD: User password")
        raise ValueError("Missing required Redshift configuration")

    print("✅ Redshift Configuration:")
    print(f"   Host: {REDSHIFT_HOST}")
    print(f"   Port: {REDSHIFT_PORT}")
    print(f"   Database: {REDSHIFT_DB}")
    print(f"   User: {REDSHIFT_USER}")
    print(f"   Region: {AWS_REGION}")
    if AWS_S3_BUCKET:
        print(f"   S3 Bucket: {AWS_S3_BUCKET}")
    if AWS_IAM_ROLE:
        print(f"   IAM Role: {AWS_IAM_ROLE[:50]}...")

except Exception as e:
    print(f"❌ Configuration error: {e}")
    raise

### Test Connection

Verify connectivity to your Redshift cluster and check cluster configuration.

In [None]:
try:
    # Connect to Redshift
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    cursor = conn.cursor()

    # Check cluster version
    cursor.execute("SELECT version();")
    version = cursor.fetchone()[0]
    print("✅ Connected to Redshift")
    print(f"   Version: {version[:80]}...")

    # Get cluster configuration. STV_SLICES has one row per slice, tagged with its node.
    # The node type is not exposed here; check the console or `aws redshift describe-clusters`.
    cursor.execute("""
        SELECT COUNT(DISTINCT node) AS node_count, COUNT(*) AS slice_count
        FROM stv_slices;
    """)
    result = cursor.fetchone()
    if result:
        node_count, slice_count = result
        print("\n📊 Cluster Configuration:")
        print(f"   Node Count: {node_count if node_count else 'Unknown'}")
        print(f"   Slice Count: {slice_count if slice_count else 'Unknown'}")

    # Check WLM configuration
    cursor.execute("""
        SELECT COUNT(*) as queue_count
        FROM stv_wlm_service_class_config
        WHERE service_class >= 6;
    """)
    queue_count = cursor.fetchone()[0]
    print(f"   WLM Queues: {queue_count}")

    # Check disk space
    cursor.execute("""
        SELECT 
            SUM(capacity)/1024 as total_gb,
            SUM(used)/1024 as used_gb,
            ROUND(SUM(used)::float/SUM(capacity)::float*100, 1) as pct_used
        FROM stv_partitions;
    """)
    total_gb, used_gb, pct_used = cursor.fetchone()
    print("\n💾 Disk Usage:")
    print(f"   Total: {total_gb:.1f} GB")
    print(f"   Used: {used_gb:.1f} GB ({pct_used}%)")

    cursor.close()
    conn.close()

    print("\n✅ Connection test successful!")

except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\n🔍 Troubleshooting steps:")
    print("   1. Verify security group allows inbound connections on port 5439")
    print("   2. Check VPC routing and network ACLs")
    print("   3. Ensure cluster is publicly accessible (if connecting from outside VPC)")
    print("   4. Verify database user credentials")
    print("   5. Check cluster is in 'available' state")
    raise

### Redshift Node Type Guide

**RA3 Nodes (Recommended for most workloads)**
- **RA3.XLPLUS**: 4 vCPUs, 32 GB RAM, 32 TB managed storage
  - Best for: General analytics, growing datasets
  - Cost: ~$1.086/hour per node
- **RA3.4XLARGE**: 12 vCPUs, 96 GB RAM, 128 TB managed storage
  - Best for: Large workloads, complex queries
  - Cost: ~$3.26/hour per node
- **RA3.16XLARGE**: 48 vCPUs, 384 GB RAM, 128 TB managed storage
  - Best for: Massive datasets, highest performance
  - Cost: ~$13.04/hour per node

**DC2 Nodes (Compute-optimized, SSD)**
- **DC2.LARGE**: 2 vCPUs, 15 GB RAM, 160 GB SSD
  - Best for: Small datasets (<1 TB), development
  - Cost: ~$0.25/hour per node
- **DC2.8XLARGE**: 32 vCPUs, 244 GB RAM, 2.56 TB SSD
  - Best for: High-performance, fixed storage needs
  - Cost: ~$4.80/hour per node

**Serverless**
- Pay-per-use with Redshift Processing Units (RPUs)
- No cluster management required
- Best for: Variable workloads, getting started
- Cost: ~$0.36/RPU-hour
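
To compare provisioned node types for a benchmark session, the per-node prices listed above can be turned into a quick estimate. The following is a minimal sketch: the price table copies the approximate figures from this guide (they vary by region and change over time), and `cluster_cost` is a hypothetical helper, not part of BenchBox.

```python
# Approximate on-demand prices (USD per node-hour) copied from the guide above.
# They vary by region and change over time, so treat them as placeholders.
NODE_PRICES_USD_PER_HOUR = {
    "ra3.xlplus": 1.086,
    "ra3.4xlarge": 3.26,
    "ra3.16xlarge": 13.04,
    "dc2.large": 0.25,
    "dc2.8xlarge": 4.80,
}


def cluster_cost(node_type: str, node_count: int, hours: float) -> float:
    """Return the estimated on-demand cost of running a provisioned cluster."""
    if node_count < 1 or hours < 0:
        raise ValueError("node_count must be >= 1 and hours must be >= 0")
    return NODE_PRICES_USD_PER_HOUR[node_type.lower()] * node_count * hours


# Hypothetical comparison: a 30-minute benchmark session on two-node clusters.
for node_type in NODE_PRICES_USD_PER_HOUR:
    print(f"{node_type:<14} ${cluster_cost(node_type, 2, 0.5):.2f}")
```

Storage, data transfer, and Serverless RPU charges are not included in this estimate.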

In [None]:
# Configure benchmark settings
config = {
    "project": "benchbox-redshift",
    "cluster_endpoint": REDSHIFT_HOST,
    "database": REDSHIFT_DB,
    "user": REDSHIFT_USER,
    "region": AWS_REGION,
    "s3_bucket": AWS_S3_BUCKET,
    "iam_role": AWS_IAM_ROLE,
    # Scale factors to test
    "scale_factors": [0.01, 0.1, 1.0],  # 10MB, 100MB, 1GB
    # Output directory
    "output_dir": "./benchmark_results",
}

# Create output directory
Path(config["output_dir"]).mkdir(parents=True, exist_ok=True)

print("✅ Configuration complete")
print(f"   Output directory: {config['output_dir']}")
print(f"   Scale factors: {config['scale_factors']}")

## 2. Quick Start Example

### Run TPC-H Power Test

Execute a TPC-H power test at scale factor 0.01 (10MB). This runs all 22 TPC-H queries sequentially.

**What happens:**
1. Generate TPC-H data (customer, orders, lineitem, etc.)
2. Create tables in Redshift with default distribution
3. Load data using COPY command
4. Execute 22 queries and measure performance

**Expected time:** ~3-5 minutes at SF 0.01
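
The power test summary reports a geometric mean of per-query times. The sketch below shows the standard definition, which damps the influence of a single slow query compared with the arithmetic mean. It is an illustration only: how BenchBox computes its own figure (for example, how it treats failed queries) is not shown here, and the timings are made up.

```python
import math


def geometric_mean(times_s):
    """Geometric mean of positive per-query times, computed via logs for stability."""
    if not times_s:
        raise ValueError("need at least one timing")
    if any(t <= 0 for t in times_s):
        raise ValueError("timings must be positive")
    return math.exp(sum(math.log(t) for t in times_s) / len(times_s))


# Made-up timings in seconds: one slow query pulls the arithmetic mean up
# far more than it does the geometric mean.
sample = [0.5, 0.5, 0.5, 8.0]
print(f"arithmetic mean: {sum(sample) / len(sample):.3f}s")
print(f"geometric mean:  {geometric_mean(sample):.3f}s")
```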

In [None]:
# Configure database connection
db_cfg = DatabaseConfig(type="redshift", name="redshift-tpch")
platform_cfg = {
    "host": REDSHIFT_HOST,
    "port": REDSHIFT_PORT,
    "database": REDSHIFT_DB,
    "user": REDSHIFT_USER,
    "password": REDSHIFT_PASSWORD,
}

# Configure TPC-H benchmark
bench_cfg = BenchmarkConfig(
    name="tpch", display_name="TPC-H Power Test", scale_factor=0.01, test_execution_type="power"
)

# Run complete lifecycle
print("🚀 Starting TPC-H power test on Redshift...\n")
results = run_benchmark_lifecycle(
    benchmark_config=bench_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print("\n✅ TPC-H power test completed!")
print(f"   Benchmark: {results.benchmark_name}")
print(f"   Total queries: {len(results.query_results)}")
print(f"   Geometric mean: {results.geometric_mean:.3f}s")
print(f"   Total execution time: {results.total_execution_time:.2f}s")

### Visualize Results

Create a bar chart showing execution time for each query.

In [None]:
if results.query_results:
    query_names = [qr.query_name for qr in results.query_results]
    execution_times = [qr.execution_time for qr in results.query_results]

    fig, ax = plt.subplots(figsize=(14, 6))
    bars = ax.bar(query_names, execution_times, color="#CC0000", alpha=0.8, edgecolor="black")

    # Highlight slowest queries
    max_time = max(execution_times)
    for i, (bar, time) in enumerate(zip(bars, execution_times)):
        if time > max_time * 0.7:  # Within 30% of the slowest query's time
            bar.set_color("#FF9900")  # AWS orange
            # Annotate with time
            ax.text(i, time + 0.01, f"{time:.2f}s", ha="center", va="bottom", fontsize=8)

    ax.set_xlabel("Query", fontsize=12, fontweight="bold")
    ax.set_ylabel("Execution Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Query Performance on Redshift (SF 0.01)", fontsize=14, fontweight="bold")
    ax.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    print("\n📊 Performance Summary:")
    print(f"   Fastest query: {query_names[execution_times.index(min(execution_times))]} ({min(execution_times):.3f}s)")
    print(f"   Slowest query: {query_names[execution_times.index(max(execution_times))]} ({max(execution_times):.3f}s)")
    print(f"   Median time: {pd.Series(execution_times).median():.3f}s")
else:
    print("⚠️  No query results to visualize")

### Monitor Cluster Activity

Check current cluster activity and query queue status.

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Check active queries (STV_INFLIGHT exposes userid and the query text, not a user name)
    query = """
        SELECT
            pid,
            userid,
            starttime,
            TRIM(text) AS query_text
        FROM stv_inflight
        WHERE userid > 1
        ORDER BY starttime DESC
        LIMIT 5;
    """

    df_active = pd.read_sql(query, conn)

    print("🔄 Active Queries:")
    if not df_active.empty:
        for _, row in df_active.iterrows():
            print(f"   PID {row['pid']}: {row['query_text'][:80]}...")
    else:
        print("   No active queries")

    # Check recent queries (STL_WLM_QUERY records execution and queue time in microseconds)
    query = """
        SELECT
            query,
            service_class,
            ROUND(total_exec_time/1000000.0, 2) as exec_time_sec,
            ROUND(total_queue_time/1000000.0, 2) as queue_sec
        FROM stl_wlm_query
        WHERE userid > 1
        ORDER BY exec_end_time DESC
        LIMIT 5;
    """

    df_recent = pd.read_sql(query, conn)

    print("\n📜 Recent Query Performance:")
    if not df_recent.empty:
        for _, row in df_recent.iterrows():
            print(f"   Query {row['query']}: {row['exec_time_sec']}s run, {row['queue_sec']}s queued")
    else:
        print("   No recent queries")

    conn.close()

except Exception as e:
    print(f"⚠️ Could not query cluster activity: {e}")

### Results Overview

Display detailed results including per-query breakdown.

In [None]:
print("📊 Detailed Results:\n")
print(f"Benchmark: {results.benchmark_name}")
print(f"Platform: {results.platform}")
print(f"Scale Factor: {results.scale_factor}")
print(f"Test Type: {results.test_execution_type}")
print(f"Timestamp: {results.start_time}")
print("\nExecution Summary:")
print(f"  Total queries: {len(results.query_results)}")
print(f"  Successful: {sum(1 for qr in results.query_results if qr.success)}")
print(f"  Failed: {sum(1 for qr in results.query_results if not qr.success)}")
print(f"  Geometric mean: {results.geometric_mean:.3f}s")
print(f"  Total time: {results.total_execution_time:.2f}s")

if results.data_generation_time:
    print(f"\nData Generation: {results.data_generation_time:.2f}s")
if results.data_loading_time:
    print(f"Data Loading: {results.data_loading_time:.2f}s")

print("\n📋 Query Breakdown:")
for qr in results.query_results[:5]:  # Show first 5
    status = "✅" if qr.success else "❌"
    print(f"  {status} {qr.query_name}: {qr.execution_time:.3f}s")
if len(results.query_results) > 5:
    print(f"  ... and {len(results.query_results) - 5} more queries")

## 3. Advanced Examples

### TPC-DS Benchmark

Run the more complex TPC-DS benchmark (99 queries) with a smaller subset for faster iteration.

In [None]:
# Run TPC-DS with query subset
tpcds_cfg = BenchmarkConfig(
    name="tpcds",
    display_name="TPC-DS Sample",
    scale_factor=0.01,
    test_execution_type="power",
    query_numbers=[1, 2, 3, 10, 25],  # Run subset for faster results
)

print("🚀 Running TPC-DS subset on Redshift...\n")
tpcds_results = run_benchmark_lifecycle(
    benchmark_config=tpcds_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print(f"\n✅ TPC-DS completed: {tpcds_results.geometric_mean:.3f}s geometric mean")
print(f"   Queries executed: {len(tpcds_results.query_results)}")

### Distribution Keys and Sort Keys

Optimize table performance using distribution and sort keys.

**Distribution Styles:**
- **KEY**: Distribute rows based on values in one column (best for joins)
- **ALL**: Copy entire table to all nodes (best for small dimension tables)
- **EVEN**: Distribute rows evenly across nodes (default, good for large fact tables)
- **AUTO**: Let Redshift choose (recommended for most cases)

**Sort Keys:**
- **Compound**: Multiple columns, order matters (best for range queries)
- **Interleaved**: Multiple columns, equal weight (best for multiple filter patterns)

**Best Practices:**
- Use KEY distribution on large tables' join columns
- Use ALL distribution for small dimension tables (<1M rows)
- Sort on date columns for time-series data
- Sort on frequently filtered columns
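
To make the options above concrete, the sketch below renders the `DISTSTYLE`, `DISTKEY`, and `SORTKEY` clauses of a Redshift `CREATE TABLE` statement from a few Python arguments. `build_create_table` is a hypothetical helper written for this illustration (it is not how BenchBox generates DDL), and the column list is abbreviated.

```python
def build_create_table(table, columns, distribution="AUTO", distribution_key=None,
                       sort_keys=(), interleaved=False):
    """Render a Redshift CREATE TABLE statement with distribution and sort keys."""
    style = distribution.upper()
    if style not in {"AUTO", "EVEN", "ALL", "KEY"}:
        raise ValueError(f"unknown distribution style: {distribution}")
    if (style == "KEY") != (distribution_key is not None):
        raise ValueError("distribution_key is required for, and only valid with, KEY")

    column_sql = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    clauses = [f"DISTSTYLE {style}"]
    if distribution_key:
        clauses.append(f"DISTKEY ({distribution_key})")
    if sort_keys:
        kind = "INTERLEAVED" if interleaved else "COMPOUND"
        clauses.append(f"{kind} SORTKEY ({', '.join(sort_keys)})")
    return f"CREATE TABLE {table} (\n{column_sql}\n)\n" + "\n".join(clauses) + ";"


# Abbreviated TPC-H orders table (only three of its columns are shown).
ddl = build_create_table(
    "orders",
    [("o_orderkey", "BIGINT"), ("o_custkey", "BIGINT"), ("o_orderdate", "DATE")],
    distribution="KEY",
    distribution_key="o_custkey",
    sort_keys=["o_orderdate", "o_custkey"],
)
print(ddl)
```

Distributing `orders` on `o_custkey` co-locates each customer's orders on one slice, while the compound sort key leads with the date column that range filters usually target.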

In [None]:
# Configure TPC-H with optimized distribution and sort keys
optimized_cfg = {
    "host": REDSHIFT_HOST,
    "port": REDSHIFT_PORT,
    "database": REDSHIFT_DB,
    "user": REDSHIFT_USER,
    "password": REDSHIFT_PASSWORD,
    "table_options": {
        "customer": {
            "distribution": "ALL",  # Small dimension table
            "sort_keys": ["c_custkey"],
        },
        "orders": {
            "distribution": "KEY",
            "distribution_key": "o_custkey",  # Join with customer
            "sort_keys": ["o_orderdate", "o_custkey"],
        },
        "lineitem": {
            "distribution": "KEY",
            "distribution_key": "l_orderkey",  # Join with orders
            "sort_keys": ["l_shipdate", "l_orderkey"],
        },
        "part": {
            "distribution": "ALL",  # Small dimension table
            "sort_keys": ["p_partkey"],
        },
        "supplier": {
            "distribution": "ALL",  # Small dimension table
            "sort_keys": ["s_suppkey"],
        },
        "partsupp": {
            "distribution": "KEY",
            "distribution_key": "ps_partkey",
            "sort_keys": ["ps_partkey", "ps_suppkey"],
        },
        "nation": {
            "distribution": "ALL",  # Very small reference table
            "sort_keys": ["n_nationkey"],
        },
        "region": {
            "distribution": "ALL",  # Very small reference table
            "sort_keys": ["r_regionkey"],
        },
    },
}

# Run with optimized settings
print("🚀 Running TPC-H with optimized distribution and sort keys...\n")
optimized_results = run_benchmark_lifecycle(
    benchmark_config=bench_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=optimized_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print(f"\n✅ Optimized run completed: {optimized_results.geometric_mean:.3f}s geometric mean")
print(f"   Baseline: {results.geometric_mean:.3f}s")
improvement = ((results.geometric_mean - optimized_results.geometric_mean) / results.geometric_mean) * 100
print(f"   Improvement: {improvement:+.1f}%")

### Scale Factor Comparison

Compare performance across different data sizes.
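
One way to summarize such a comparison is a scaling exponent: the slope of log(time) against log(scale factor). The sketch below computes it for two runs; the timings are made up and `scaling_exponent` is a hypothetical helper, not a BenchBox function.

```python
import math


def scaling_exponent(sf_a, time_a, sf_b, time_b):
    """Slope of log(time) against log(scale factor) between two runs.

    1.0 means time grows linearly with data size; values well below 1.0
    suggest fixed overhead (planning, compilation, network) dominates.
    """
    return math.log(time_b / time_a) / math.log(sf_b / sf_a)


# Made-up geometric means: 10x the data took 2.5x as long.
exponent = scaling_exponent(0.01, 0.40, 0.1, 1.00)
print(f"scaling exponent: {exponent:.2f}")
```

At very small scale factors the exponent is often far below 1, so results from SF 0.01 should not be extrapolated linearly without checking a larger run.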

In [None]:
scale_results = {}

for sf in [0.01, 0.1]:  # Test 10MB and 100MB
    print(f"\n🚀 Running TPC-H at scale factor {sf}...")

    sf_cfg = BenchmarkConfig(
        name="tpch",
        display_name=f"TPC-H SF {sf}",
        scale_factor=sf,
        test_execution_type="power",
        query_numbers=list(range(1, 11)),  # First 10 queries only
    )

    sf_results = run_benchmark_lifecycle(
        benchmark_config=sf_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(generate=True, load=True, execute=True),
    )

    scale_results[sf] = sf_results.geometric_mean
    print(f"   Geometric mean: {sf_results.geometric_mean:.3f}s")

# Visualize scaling
if len(scale_results) > 1:
    fig, ax = plt.subplots(figsize=(10, 6))
    sfs = list(scale_results.keys())
    times = list(scale_results.values())

    ax.plot(sfs, times, marker="o", linewidth=2, markersize=10, color="#CC0000")
    ax.set_xlabel("Scale Factor", fontsize=12, fontweight="bold")
    ax.set_ylabel("Geometric Mean Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Scaling on Redshift", fontsize=14, fontweight="bold")
    ax.grid(True, alpha=0.3)
    ax.set_xscale("log")
    plt.tight_layout()
    plt.show()

    print("\n📊 Scaling Analysis:")
    for i in range(1, len(sfs)):
        data_mult = sfs[i] / sfs[i - 1]
        time_mult = times[i] / times[i - 1]
        print(f"   SF {sfs[i - 1]} → {sfs[i]}: {data_mult:g}x data, {time_mult:.2f}x time")

### Query Subset Selection

Run specific queries for targeted testing or CI/CD pipelines.

In [None]:
# Fast smoke test: Run 5 representative queries
smoke_test_queries = [1, 3, 6, 10, 14]  # Mix of simple and complex

subset_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Smoke Test",
    scale_factor=0.01,
    test_execution_type="power",
    query_numbers=smoke_test_queries,
)

print(f"🚀 Running smoke test with queries: {smoke_test_queries}\n")
subset_results = run_benchmark_lifecycle(
    benchmark_config=subset_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),  # Reuse data
)

print(f"\n✅ Smoke test completed: {subset_results.geometric_mean:.3f}s geometric mean")
print(f"   Queries: {len(subset_results.query_results)}")
print(f"   Queries skipped vs full suite: ~{(1 - len(smoke_test_queries) / 22) * 100:.0f}%")

### Workload Management (WLM) Configuration

Configure query queues for different workload priorities.

**WLM Best Practices:**
- Create separate queues for ETL, reporting, and ad-hoc queries
- Allocate memory based on query complexity
- Set concurrency limits to prevent resource contention
- Use query monitoring rules to abort runaway queries

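**Example WLM Configuration (simplified for illustration; the actual `wlm_json_configuration` parameter uses keys such as `query_concurrency` and `memory_percent_to_use`):**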
```json
[
  {"name": "etl", "concurrency": 3, "memory": 40, "priority": "highest"},
  {"name": "reporting", "concurrency": 5, "memory": 30, "priority": "high"},
  {"name": "adhoc", "concurrency": 10, "memory": 20, "priority": "normal"},
  {"name": "default", "concurrency": 5, "memory": 10, "priority": "low"}
]
```
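
Before applying a queue layout like the one above, it can help to sanity-check the totals. The sketch below validates the simplified structure used in this notebook (`name`, `concurrency`, `memory`), not the real `wlm_json_configuration` schema. `validate_queues` is a hypothetical helper, and the limit of 50 is the documented cap on total concurrency across user-defined queues in manual WLM.

```python
import json

MAX_TOTAL_CONCURRENCY = 50  # cap across user-defined queues in manual WLM


def validate_queues(queues):
    """Return a list of problems found in a simplified queue definition."""
    problems = []
    total_memory = sum(q["memory"] for q in queues)
    total_slots = sum(q["concurrency"] for q in queues)
    if total_memory > 100:
        problems.append(f"memory adds up to {total_memory}%, above 100%")
    if total_slots > MAX_TOTAL_CONCURRENCY:
        problems.append(f"total concurrency {total_slots} exceeds {MAX_TOTAL_CONCURRENCY}")
    names = [q["name"] for q in queues]
    if len(names) != len(set(names)):
        problems.append("queue names must be unique")
    return problems


# The simplified example configuration from this section.
example = json.loads("""
[
  {"name": "etl", "concurrency": 3, "memory": 40, "priority": "highest"},
  {"name": "reporting", "concurrency": 5, "memory": 30, "priority": "high"},
  {"name": "adhoc", "concurrency": 10, "memory": 20, "priority": "normal"},
  {"name": "default", "concurrency": 5, "memory": 10, "priority": "low"}
]
""")
print(validate_queues(example) or "configuration looks consistent")
```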

In [None]:
# Check current WLM configuration
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    query = """
        SELECT 
            service_class,
            name,
            num_query_tasks as concurrency,
            query_working_mem as memory_mb
        FROM stv_wlm_service_class_config
        WHERE service_class >= 6
        ORDER BY service_class;
    """

    df_wlm = pd.read_sql(query, conn)

    print("🎯 Current WLM Configuration:\n")
    if not df_wlm.empty:
        print(df_wlm.to_string(index=False))
    else:
        print("   Using default WLM configuration")

    # Check queue wait times
    query = """
        SELECT 
            service_class,
            COUNT(*) as query_count,
            AVG(total_queue_time/1000000.0) as avg_queue_sec
        FROM stl_wlm_query
        WHERE service_class >= 6
        GROUP BY service_class
        ORDER BY service_class;
    """

    df_queue = pd.read_sql(query, conn)

    if not df_queue.empty:
        print("\n⏱️  Queue Statistics:\n")
        print(df_queue.to_string(index=False))

    conn.close()

except Exception as e:
    print(f"⚠️ Could not query WLM configuration: {e}")

### COPY from S3

Load data efficiently from S3 using the COPY command.

**Prerequisites:**
- S3 bucket with data files
- IAM role attached to cluster with S3 read permissions
- Or AWS credentials with S3 access
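
When a load spans many files, COPY can read a manifest that lists each S3 object explicitly. The sketch below builds such a manifest as JSON text. The bucket name and object keys are hypothetical, and `build_manifest` is a helper written for this illustration.

```python
import json


def build_manifest(bucket, keys, mandatory=True):
    """Return a COPY manifest (JSON text) listing each S3 object to load."""
    entries = [{"url": f"s3://{bucket}/{key}", "mandatory": mandatory} for key in keys]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical bucket and object keys, for illustration only.
manifest = build_manifest(
    "my-benchmark-bucket",
    ["tpch/customer/part-0000.tbl.gz", "tpch/customer/part-0001.tbl.gz"],
)
print(manifest)
```

After uploading the manifest to S3, point COPY at the manifest object and add the `MANIFEST` keyword. With `mandatory` set to true, COPY fails if a listed file is missing rather than silently loading a partial set.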

In [None]:
if AWS_S3_BUCKET and AWS_IAM_ROLE:
    print("📦 Example COPY command from S3:\n")

    copy_cmd = f"""
    COPY customer
    FROM 's3://{AWS_S3_BUCKET}/tpch/customer/'
    IAM_ROLE '{AWS_IAM_ROLE}'
    DELIMITER '|'
    REGION '{AWS_REGION}'
    COMPUPDATE ON
    STATUPDATE ON
    MAXERROR 0;
    """

    print(copy_cmd)

    print("\n💡 COPY Best Practices:")
    print("   - Split large files (100MB-1GB each)")
    print("   - Use GZIP compression to reduce I/O")
    print("   - Enable COMPUPDATE for automatic compression encoding")
    print("   - Enable STATUPDATE for automatic table statistics")
    print("   - Use manifest files for complex loads")
    print("   - Monitor STL_LOAD_ERRORS for failures")
else:
    print("⚠️  S3 bucket or IAM role not configured")
    print("   Set AWS_S3_BUCKET and AWS_IAM_ROLE environment variables")

### Throughput Test

Run concurrent queries to test cluster throughput.
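
Conceptually, a throughput test runs several query streams at once and divides the number of completed queries by wall-clock time. The sketch below shows that idea with a thread pool and a stand-in `fake_execute` function that sleeps instead of running SQL. All names here are hypothetical, and BenchBox's own throughput implementation may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stream(stream_id, queries, execute):
    """Run one stream's queries in order and return (stream_id, query, seconds) rows."""
    rows = []
    for query in queries:
        started = time.perf_counter()
        execute(query)
        rows.append((stream_id, query, time.perf_counter() - started))
    return rows


def run_throughput(streams, execute):
    """Run every stream concurrently; return all timing rows and the wall-clock time."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [pool.submit(run_stream, i, qs, execute) for i, qs in enumerate(streams)]
        rows = [row for f in futures for row in f.result()]
    return rows, time.perf_counter() - started


def fake_execute(query):
    """Stand-in for a database call; sleeps briefly instead of running SQL."""
    time.sleep(0.01)


# Two streams running the same queries in different orders.
rows, wall_s = run_throughput([[1, 2, 3], [3, 1, 2]], fake_execute)
print(f"{len(rows)} queries in {wall_s:.2f}s -> {len(rows) / wall_s * 3600:.0f} queries/hour")
```

In a real test each stream would need its own database connection, since a single connection cannot run queries concurrently.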

In [None]:
# Run throughput test with 2 concurrent streams
throughput_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Throughput",
    scale_factor=0.01,
    test_execution_type="throughput",
    query_numbers=list(range(1, 11)),  # First 10 queries
    num_streams=2,  # Concurrent query streams
)

print("🚀 Running throughput test with 2 concurrent streams...\n")
throughput_results = run_benchmark_lifecycle(
    benchmark_config=throughput_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),
)

print(f"\n✅ Throughput test completed: {throughput_results.total_execution_time:.2f}s total time")
print(
    f"   Queries per hour: {(len(throughput_results.query_results) / throughput_results.total_execution_time) * 3600:.0f}"
)

### Compare Multiple Runs

Compare different configurations or time periods.
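
The core of such a comparison is a per-query percent change with a noise threshold. The sketch below classifies queries from two runs using plain dictionaries of made-up timings. `compare_runs` and its result shape are assumptions for illustration and do not describe the `BenchmarkComparator` API.

```python
def compare_runs(baseline, candidate, threshold_pct=5.0):
    """Classify queries present in both runs by percent change in execution time."""
    report = {"improvements": [], "regressions": [], "unchanged": []}
    for query in sorted(baseline.keys() & candidate.keys()):
        change = (candidate[query] - baseline[query]) / baseline[query] * 100
        if change <= -threshold_pct:
            bucket = "improvements"
        elif change >= threshold_pct:
            bucket = "regressions"
        else:
            bucket = "unchanged"
        report[bucket].append({"query": query, "change": round(change, 1)})
    return report


# Made-up timings in seconds, keyed by query name.
baseline = {"Q1": 2.0, "Q3": 1.0, "Q6": 0.50}
candidate = {"Q1": 1.5, "Q3": 1.2, "Q6": 0.51}
report = compare_runs(baseline, candidate)
print(report)
```

The threshold keeps run-to-run noise from being reported as a change; at small scale factors a wider threshold, or several repeated runs, is usually needed.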

In [None]:
# Compare baseline vs optimized runs
try:
    comparator = BenchmarkComparator()

    comparison = comparator.compare(
        baseline=results,
        comparison=optimized_results,
        baseline_label="Default Distribution",
        comparison_label="Optimized Distribution",
    )

    print("📊 Configuration Comparison:\n")
    print(f"Baseline (Default): {results.geometric_mean:.3f}s")
    print(f"Optimized: {optimized_results.geometric_mean:.3f}s")
    print(f"Improvement: {comparison['overall_improvement']:.1f}%")

    if comparison["regressions"]:
        print(f"\n⚠️  Queries with regressions: {len(comparison['regressions'])}")
        for reg in comparison["regressions"][:3]:
            print(f"   {reg['query']}: {reg['change']:+.1f}%")

    if comparison["improvements"]:
        print(f"\n✅ Queries with improvements: {len(comparison['improvements'])}")
        for imp in comparison["improvements"][:3]:
            print(f"   {imp['query']}: {imp['change']:+.1f}%")

except Exception as e:
    print(f"⚠️ Could not compare results: {e}")

### Export Results

Export benchmark results to various formats for reporting and analysis.

In [None]:
# Export to multiple formats
try:
    exporter = ResultExporter(results)

    output_dir = Path(config["output_dir"])
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Export to JSON
    json_path = output_dir / f"redshift_tpch_{timestamp}.json"
    exporter.to_json(json_path)
    print(f"✅ Exported JSON: {json_path}")

    # Export to CSV
    csv_path = output_dir / f"redshift_tpch_{timestamp}.csv"
    exporter.to_csv(csv_path)
    print(f"✅ Exported CSV: {csv_path}")

    # Export to HTML report
    html_path = output_dir / f"redshift_tpch_{timestamp}.html"
    exporter.to_html(html_path)
    print(f"✅ Exported HTML: {html_path}")

    print(f"\n📁 All results exported to: {output_dir}")

except Exception as e:
    print(f"⚠️ Export failed: {e}")

### Cost Analysis

Estimate costs based on cluster configuration and runtime.

In [None]:
# Cost estimation (example with RA3.XLPLUS)
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Get node count
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(DISTINCT node) FROM stv_slices;")
    node_count = cursor.fetchone()[0]

    # Cost assumptions (adjust for your region and node type)
    cost_per_node_hour = 1.086  # RA3.XLPLUS in us-east-1

    total_runtime_hours = results.total_execution_time / 3600
    estimated_cost = node_count * cost_per_node_hour * total_runtime_hours

    print("💰 Cost Estimation:\n")
    print(f"   Node count: {node_count}")
    print(f"   Runtime: {total_runtime_hours:.4f} hours")
    print(f"   Cost per node-hour: ${cost_per_node_hour}")
    print(f"   Estimated cost: ${estimated_cost:.4f}")

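    # Extrapolate to SF 1.0, assuming runtime grows linearly with data size.
    # This is a rough estimate: fixed overhead dominates at small scale factors,
    # so linear extrapolation tends to overestimate.
    sf_multiplier = 1.0 / results.scale_factor
    production_runtime = total_runtime_hours * sf_multiplier
    production_cost = node_count * cost_per_node_hour * production_runtime

    print("\n📊 Extrapolated to SF 1.0:")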
    print(f"   Estimated runtime: {production_runtime:.2f} hours")
    print(f"   Estimated cost: ${production_cost:.2f}")

    cursor.close()
    conn.close()

except Exception as e:
    print(f"⚠️ Could not estimate costs: {e}")

## 4. Platform-Specific Features

### Concurrency Scaling

Monitor and configure concurrency scaling to handle burst workloads.

**How it works:**
- Automatically adds cluster capacity when queues are long
- Routes queries to transient clusters
- Charged per-second of usage
- Free credits: each cluster accrues up to 1 hour of concurrency scaling credits per day

**Best practices:**
- Enable for read-heavy workloads
- Monitor usage to control costs
- Use for unpredictable query loads
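
The per-second billing and free credits described above can be turned into a rough charge estimate. The sketch below assumes that usage beyond the available credits is billed at the cluster's on-demand rate, prorated per second. The rate comes from the approximate RA3.XLPLUS price used elsewhere in this notebook, and `concurrency_scaling_charge` is a hypothetical helper.

```python
def concurrency_scaling_charge(seconds_used, free_credit_seconds, hourly_cluster_rate):
    """Estimate the charge for concurrency scaling usage beyond the free credits."""
    if seconds_used < 0 or free_credit_seconds < 0:
        raise ValueError("durations must be non-negative")
    billable_seconds = max(0, seconds_used - free_credit_seconds)
    return billable_seconds * hourly_cluster_rate / 3600


# Hypothetical day: 90 minutes of usage, 60 minutes of credits,
# on a two-node cluster at ~$1.086 per node-hour (assumed rate).
charge = concurrency_scaling_charge(90 * 60, 60 * 60, 2 * 1.086)
print(f"estimated charge: ${charge:.3f}")
```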

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

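    # Check concurrency scaling usage. STL_QUERY has no service_class column,
    # so it is joined to STL_WLM_QUERY on the query ID.
    query = """
        SELECT
            w.service_class,
            COUNT(*) as query_count,
            SUM(CASE WHEN q.concurrency_scaling_status = 1 THEN 1 ELSE 0 END) as cs_queries
        FROM stl_query q
        JOIN stl_wlm_query w ON w.query = q.query
        WHERE q.userid > 1
        GROUP BY w.service_class;
    """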

    df_cs = pd.read_sql(query, conn)

    print("🚀 Concurrency Scaling Usage:\n")
    if not df_cs.empty:
        print(df_cs.to_string(index=False))
        total_cs = df_cs["cs_queries"].sum()
        total_queries = df_cs["query_count"].sum()
        if total_queries > 0:
            cs_pct = (total_cs / total_queries) * 100
            print(f"\n   Concurrency scaling usage: {cs_pct:.1f}%")
    else:
        print("   No concurrency scaling usage")

    conn.close()

    print("\n💡 To enable concurrency scaling:")
    print("   Set the queue's Concurrency Scaling mode to 'auto' in the WLM configuration")
    print("   and adjust max_concurrency_scaling_clusters in the cluster parameter group.")

except Exception as e:
    print(f"⚠️ Could not query concurrency scaling: {e}")

### Compression Encodings

Redshift automatically chooses compression encodings, but you can optimize manually.

**Available encodings:**
- **RAW**: No compression
- **AZ64**: Best for numeric and date/time data (the default for those types)
- **LZO**: Fast compression for text
- **ZSTD**: High compression ratio for mixed data
- **DELTA**: Best for sorted numeric columns
- **RUNLENGTH**: Best for low-cardinality columns
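
To see why run-length encoding suits low-cardinality columns, the sketch below collapses consecutive repeats into (value, count) pairs. It illustrates the idea only and does not reproduce Redshift's on-disk format.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# A sorted, low-cardinality column such as an order status flag.
column = ["F"] * 4 + ["O"] * 3 + ["P"] * 1
encoded = run_length_encode(column)
print(encoded)  # [('F', 4), ('O', 3), ('P', 1)]
```

Eight stored values shrink to three pairs here; on a column with few repeats the pairs would take more space than the raw values.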

**Check current encodings:**
```sql
SELECT
    tablename,
    "column",
    encoding,
    distkey,
    sortkey
FROM pg_table_def
WHERE schemaname = 'public'
ORDER BY tablename, "column";
```

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Check compression for TPC-H tables
    query = """
        SELECT 
            tablename,
            "column",
            encoding
        FROM pg_table_def
        WHERE schemaname = 'public'
            AND tablename IN ('customer', 'orders', 'lineitem')
        ORDER BY tablename, "column";
    """

    df_encoding = pd.read_sql(query, conn)

    print("🗜️  Compression Encodings:\n")
    if not df_encoding.empty:
        for table in df_encoding["tablename"].unique():
            table_df = df_encoding[df_encoding["tablename"] == table]
            print(f"\n{table}:")
            for _, row in table_df.head(5).iterrows():  # Show first 5 columns
                print(f"   {row['column']}: {row['encoding']}")
    else:
        print("   No tables found")

    conn.close()

    print("\n💡 To analyze compression recommendations:")
    print("   ANALYZE COMPRESSION customer;")

except Exception as e:
    print(f"⚠️ Could not query compression encodings: {e}")

### VACUUM and ANALYZE

Maintain table performance with VACUUM and ANALYZE operations.

**VACUUM:**
- Reclaims space from deleted rows
- Re-sorts rows according to sort keys
- Types: FULL, DELETE ONLY, SORT ONLY, REINDEX

**ANALYZE:**
- Updates table statistics for query planner
- Run after loading data or major changes

**Best practices:**
- Run VACUUM during maintenance windows
- Use VACUUM DELETE ONLY for frequent deletes
- Run ANALYZE after COPY or INSERT operations
- Automatic VACUUM runs in background (can be disabled)

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Check table statistics
    query = """
        SELECT 
            "table",
            unsorted,
            stats_off,
            tbl_rows
        FROM svv_table_info
        WHERE "schema" = 'public'
        ORDER BY unsorted DESC;
    """

    df_stats = pd.read_sql(query, conn)

    print("🔧 Table Maintenance Status:\n")
    if not df_stats.empty:
        print("Table                Unsorted%  Stats Off%  Rows")
        print("=" * 60)
        for _, row in df_stats.head(10).iterrows():
            print(f"{row['table']:<20} {row['unsorted']:>8.1f}%  {row['stats_off']:>10.1f}%  {row['tbl_rows']:>10,}")

        # Recommendations
        needs_vacuum = df_stats[df_stats["unsorted"] > 20]
        needs_analyze = df_stats[df_stats["stats_off"] > 10]

        if not needs_vacuum.empty:
            print(f"\n⚠️  {len(needs_vacuum)} tables need VACUUM (>20% unsorted)")
        if not needs_analyze.empty:
            print(f"⚠️  {len(needs_analyze)} tables need ANALYZE (>10% stats off)")
    else:
        print("   No tables found")

    conn.close()

    print("\n💡 Maintenance commands:")
    print("   VACUUM orders;")
    print("   ANALYZE lineitem;")
    print("   VACUUM FULL;  -- All tables")

except Exception as e:
    print(f"⚠️ Could not query table statistics: {e}")
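
The thresholds used in the cell above (more than 20% unsorted, more than 10% stats off) can be turned into concrete statements per table. The sketch below is one way to do that; `maintenance_commands` is a hypothetical helper, and its input is assumed to be rows with the `table`, `unsorted`, and `stats_off` keys selected from `svv_table_info`.

```python
def maintenance_commands(tables, unsorted_threshold=20.0, stats_off_threshold=10.0):
    """Return VACUUM/ANALYZE statements for tables that exceed the thresholds.

    `tables` is an iterable of dicts with keys "table", "unsorted", "stats_off".
    Missing or None metrics are treated as 0. Hypothetical helper for illustration.
    """
    commands = []
    for t in tables:
        name = t["table"]
        if (t.get("unsorted") or 0) > unsorted_threshold:
            commands.append(f'VACUUM "{name}";')
        if (t.get("stats_off") or 0) > stats_off_threshold:
            commands.append(f'ANALYZE "{name}";')
    return commands


# Example with made-up numbers
for cmd in maintenance_commands([
    {"table": "orders", "unsorted": 35.0, "stats_off": 2.0},
    {"table": "lineitem", "unsorted": None, "stats_off": 50.0},
]):
    print(cmd)
```

With the DataFrame from the cell above, the input could be `df_stats.to_dict("records")`. Note that Redshift does not allow VACUUM inside a transaction block, so running the generated statements would likely require autocommit on the connection.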

### Redshift Spectrum

Query external data in S3 without loading into Redshift.

**Use cases:**
- Query historical data stored in S3
- Join S3 data with Redshift tables
- Reduce storage costs
- Process data lake files

**Example:**
```sql
CREATE EXTERNAL SCHEMA spectrum
FROM data catalog
DATABASE 'mydb'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.orders (
    o_orderkey bigint,
    o_custkey bigint,
    o_orderstatus varchar(1),
    o_totalprice decimal(15,2),
    o_orderdate date
)
STORED AS PARQUET
LOCATION 's3://my-bucket/orders/';

-- Query S3 data
SELECT COUNT(*) FROM spectrum.orders;
```

In [None]:
if AWS_S3_BUCKET and AWS_IAM_ROLE:
    print("🌟 Redshift Spectrum Example\n")
    print("To create external schema:")
    print(f"""
CREATE EXTERNAL SCHEMA spectrum
FROM data catalog
DATABASE 'benchbox'
IAM_ROLE '{AWS_IAM_ROLE}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    print("\nTo create external table:")
    print(f"""
CREATE EXTERNAL TABLE spectrum.lineitem (
    l_orderkey bigint,
    l_partkey bigint,
    l_suppkey bigint,
    l_linenumber int,
    l_quantity decimal(15,2),
    l_extendedprice decimal(15,2),
    l_discount decimal(15,2),
    l_tax decimal(15,2),
    l_returnflag varchar(1),
    l_linestatus varchar(1),
    l_shipdate date,
    l_commitdate date,
    l_receiptdate date,
    l_shipinstruct varchar(25),
    l_shipmode varchar(10),
    l_comment varchar(44)
)
STORED AS PARQUET
LOCATION 's3://{AWS_S3_BUCKET}/tpch/lineitem/';
    """)

    print("\n💰 Spectrum Pricing:")
    print("   $5 per TB of data scanned from S3")
    print("   Use partitioning to reduce data scanned")
    print("   Use columnar formats (Parquet, ORC) for efficiency")
else:
    print("⚠️  S3 bucket and IAM role required for Spectrum")
    print("   Set AWS_S3_BUCKET and AWS_IAM_ROLE environment variables")
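
To see why partitioning and columnar formats matter for cost, here is a rough estimator based on the $5 per TB figure quoted above. It is a sketch under stated assumptions: `estimate_spectrum_cost` is a hypothetical helper, a terabyte is taken as 1024^4 bytes, and per-query minimums, rounding, and regional price differences are not modeled.

```python
def estimate_spectrum_cost(bytes_scanned, price_per_tb=5.0, bytes_per_tb=1024**4):
    """Estimate the Spectrum scan cost in USD for a given number of bytes scanned.

    price_per_tb defaults to the $5/TB figure quoted in this notebook;
    bytes_per_tb is an assumption about how a terabyte is counted.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / bytes_per_tb * price_per_tb


# Illustrative comparison: scanning a 200 GiB table in full versus reading
# 2 of 16 equally sized Parquet columns (both figures are made up).
table_bytes = 200 * 1024**3
print(f"Full scan:      ${estimate_spectrum_cost(table_bytes):.4f}")
print(f"2 of 16 columns: ${estimate_spectrum_cost(table_bytes * 2 / 16):.4f}")
```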

## 5. Performance Analysis

### Load and Analyze Previous Results

Load saved benchmark results for analysis.

In [None]:
# Find most recent result file
try:
    result_files = sorted(Path(config["output_dir"]).glob("redshift_tpch_*.json"))

    if result_files:
        latest_file = result_files[-1]
        print(f"📂 Loading results from: {latest_file.name}\n")

        loader = ResultLoader()
        loaded_results = loader.load_json(latest_file)

        print(f"✅ Loaded {len(loaded_results.query_results)} query results")
        print(f"   Benchmark: {loaded_results.benchmark_name}")
        print(f"   Scale factor: {loaded_results.scale_factor}")
        print(f"   Geometric mean: {loaded_results.geometric_mean:.3f}s")
    else:
        print("⚠️  No result files found. Run a benchmark first.")
        loaded_results = results  # Use current results

except Exception as e:
    print(f"⚠️ Could not load results: {e}")
    loaded_results = results
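
The cell above reports a geometric mean of query times. For reference, here is a minimal sketch of how such a value can be computed from a list of execution times. This illustrates the general formula; how BenchBox itself computes `geometric_mean` is not shown in this notebook and may differ (for example in how failed queries are handled).

```python
import math


def geometric_mean(times):
    """Geometric mean of strictly positive execution times, computed in log space."""
    times = list(times)
    if not times:
        raise ValueError("times must not be empty")
    if any(t <= 0 for t in times):
        raise ValueError("geometric mean requires strictly positive values")
    # exp(mean(log(t))) avoids overflow from multiplying many values together
    return math.exp(sum(math.log(t) for t in times) / len(times))
```

Compared with the arithmetic mean, the geometric mean is less dominated by a single very slow query, which is why it is commonly used to summarize benchmark suites.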

### Statistical Analysis

Calculate detailed statistics on query performance.

In [None]:
if loaded_results.query_results:
    times = [qr.execution_time for qr in loaded_results.query_results if qr.success]

    if times:
        # Calculate statistics
        import numpy as np

        stats = {
            "count": len(times),
            "mean": np.mean(times),
            "median": np.median(times),
            "std": np.std(times),
            "min": np.min(times),
            "max": np.max(times),
            "p25": np.percentile(times, 25),
            "p75": np.percentile(times, 75),
            "p95": np.percentile(times, 95),
            "p99": np.percentile(times, 99),
        }

        print("📊 Statistical Summary:\n")
        print(f"Count:      {stats['count']} queries")
        print(f"Mean:       {stats['mean']:.3f}s")
        print(f"Median:     {stats['median']:.3f}s")
        print(f"Std Dev:    {stats['std']:.3f}s")
        print(f"Min:        {stats['min']:.3f}s")
        print(f"Max:        {stats['max']:.3f}s")
        print("\nPercentiles:")
        print(f"P25:        {stats['p25']:.3f}s")
        print(f"P75:        {stats['p75']:.3f}s")
        print(f"P95:        {stats['p95']:.3f}s")
        print(f"P99:        {stats['p99']:.3f}s")

        # Coefficient of variation
        cv = stats["std"] / stats["mean"]
        print(f"\nCoefficient of Variation: {cv:.2f}")
        if cv < 0.5:
            print("   ✅ Low variability - consistent performance")
        elif cv < 1.0:
            print("   ⚠️  Moderate variability")
        else:
            print("   ❌ High variability - investigate outliers")

        # Identify outliers (>2 std devs from mean)
        outliers = [
            qr
            for qr in loaded_results.query_results
            if qr.success and abs(qr.execution_time - stats["mean"]) > 2 * stats["std"]
        ]

        if outliers:
            print(f"\n⚠️  {len(outliers)} outlier queries detected:")
            for qr in outliers[:5]:
                deviation = (qr.execution_time - stats["mean"]) / stats["std"]
                print(f"   {qr.query_name}: {qr.execution_time:.3f}s ({deviation:+.1f}σ)")
    else:
        print("⚠️  No successful queries to analyze")
else:
    print("⚠️  No query results available")
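
The two-standard-deviation rule used above assumes roughly symmetric data, while query times are often right-skewed, and a single extreme value inflates the standard deviation it is measured against. As an alternative, the sketch below flags outliers with the interquartile range (IQR) rule. It swaps in a different technique from the cell above; `iqr_outliers` and the `(name, time)` input shape are assumptions for illustration.

```python
import statistics


def iqr_outliers(named_times, k=1.5):
    """Return (name, time) pairs outside [Q1 - k*IQR, Q3 + k*IQR].

    named_times: iterable of (query_name, execution_time) pairs.
    k: fence multiplier; 1.5 is the conventional box-plot value.
    """
    named_times = list(named_times)
    values = [t for _, t in named_times]
    if len(values) < 2:
        return []  # Not enough data to estimate quartiles
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(name, t) for name, t in named_times if t < low or t > high]
```

With the results loaded in this notebook, the input could be built as `[(qr.query_name, qr.execution_time) for qr in loaded_results.query_results if qr.success]`.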

### Advanced Visualizations

Create comprehensive performance visualizations.

In [None]:
import numpy as np  # Imported here so this cell runs even if the statistics cell was skipped

if loaded_results.query_results:
    times = [qr.execution_time for qr in loaded_results.query_results if qr.success]
    query_names = [qr.query_name for qr in loaded_results.query_results if qr.success]

    if times:
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))

        # 1. Histogram with distribution
        ax1 = axes[0, 0]
        ax1.hist(times, bins=20, color="#CC0000", alpha=0.7, edgecolor="black")
        ax1.axvline(np.mean(times), color="#FF9900", linestyle="--", linewidth=2, label=f"Mean: {np.mean(times):.2f}s")
        ax1.axvline(
            np.median(times), color="green", linestyle="--", linewidth=2, label=f"Median: {np.median(times):.2f}s"
        )
        ax1.set_xlabel("Execution Time (seconds)", fontweight="bold")
        ax1.set_ylabel("Frequency", fontweight="bold")
        ax1.set_title("Query Execution Time Distribution", fontweight="bold")
        ax1.legend()
        ax1.grid(axis="y", alpha=0.3)

        # 2. Box plot
        ax2 = axes[0, 1]
        box = ax2.boxplot([times], vert=True, patch_artist=True, widths=0.5)
        box["boxes"][0].set_facecolor("#CC0000")
        box["boxes"][0].set_alpha(0.7)
        ax2.set_ylabel("Execution Time (seconds)", fontweight="bold")
        ax2.set_title("Query Performance Box Plot", fontweight="bold")
        ax2.set_xticklabels(["All Queries"])
        ax2.grid(axis="y", alpha=0.3)

        # 3. Sorted bar chart (top 10 slowest)
        ax3 = axes[1, 0]
        sorted_indices = np.argsort(times)[-10:]
        sorted_times = [times[i] for i in sorted_indices]
        sorted_names = [query_names[i] for i in sorted_indices]

        bars = ax3.barh(sorted_names, sorted_times, color="#CC0000", alpha=0.8, edgecolor="black")
        ax3.set_xlabel("Execution Time (seconds)", fontweight="bold")
        ax3.set_title("Top 10 Slowest Queries", fontweight="bold")
        ax3.grid(axis="x", alpha=0.3)

        # 4. Cumulative performance (Pareto)
        ax4 = axes[1, 1]
        sorted_all_indices = np.argsort(times)[::-1]
        sorted_all_times = [times[i] for i in sorted_all_indices]
        cumulative = np.cumsum(sorted_all_times)
        cumulative_pct = (cumulative / cumulative[-1]) * 100

        # Plot against a 1-based query count so the 80% marker lines up with the curve
        ax4.plot(range(1, len(cumulative_pct) + 1), cumulative_pct, marker="o", color="#CC0000", linewidth=2)
        ax4.axhline(80, color="#FF9900", linestyle="--", linewidth=2, label="80% of total time")
        ax4.set_xlabel("Number of Queries (sorted by time)", fontweight="bold")
        ax4.set_ylabel("Cumulative % of Total Time", fontweight="bold")
        ax4.set_title("Cumulative Performance (Pareto Analysis)", fontweight="bold")
        ax4.legend()
        ax4.grid(True, alpha=0.3)

        # Find how many queries account for 80% of time
        queries_80pct = np.argmax(cumulative_pct >= 80) + 1
        ax4.axvline(
            queries_80pct, color="green", linestyle="--", linewidth=2, label=f"{queries_80pct} queries = 80% time"
        )
        ax4.legend()

        plt.tight_layout()
        plt.show()

        print(
            f"\n💡 Insight: {queries_80pct} queries ({queries_80pct / len(times) * 100:.0f}%) account for 80% of total execution time"
        )
        print("   Focus optimization efforts on these queries for maximum impact")
    else:
        print("⚠️  No successful queries to visualize")
else:
    print("⚠️  No query results available")
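
The Pareto figure in the last panel can also be computed without NumPy or plotting, which is handy in scripts or tests. A minimal sketch, where `queries_for_share` is a hypothetical helper name:

```python
def queries_for_share(times, share=0.8):
    """Smallest number of slowest queries whose combined time reaches `share` of the total."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    ordered = sorted(times, reverse=True)  # Slowest first
    total = sum(ordered)
    if total <= 0:
        return 0
    running = 0.0
    for count, t in enumerate(ordered, start=1):
        running += t
        if running >= share * total:
            return count
    return len(ordered)


# Made-up timings: the two slowest queries (60s and 25s) cover 85% of 100s
print(queries_for_share([5, 60, 10, 25]))
```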

### Query History Analysis

Analyze historical query patterns using STL system tables.

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Top 10 slowest queries by elapsed time.
    # STL_QUERY records one row per query with its database, start/end time, and aborted flag.
    query = """
        SELECT
            query,
            TRIM(database) AS db,
            ROUND(DATEDIFF(millisecond, starttime, endtime) / 1000.0, 2) AS elapsed_sec,
            aborted
        FROM stl_query
        WHERE userid > 1
            AND starttime >= DATEADD(hour, -24, GETDATE())
        ORDER BY elapsed_sec DESC
        LIMIT 10;
    """

    df_slow = pd.read_sql(query, conn)

    print("🐌 Slowest Queries (Last 24h):\n")
    if not df_slow.empty:
        print("Query ID  Database         Elapsed(s)  Aborted")
        print("=" * 60)
        for _, row in df_slow.iterrows():
            print(f"{row['query']:<9} {row['db']:<16} {float(row['elapsed_sec']):>10.2f}  {row['aborted']}")
    else:
        print("   No queries found in last 24 hours")

    # Query queue wait times
    query = """
        SELECT 
            service_class,
            COUNT(*) as query_count,
            AVG(total_queue_time/1000000.0) as avg_queue_sec,
            MAX(total_queue_time/1000000.0) as max_queue_sec
        FROM stl_wlm_query
        WHERE service_class >= 6
            AND queue_start_time >= DATEADD(hour, -24, GETDATE())
        GROUP BY service_class
        ORDER BY avg_queue_sec DESC;
    """

    df_queue = pd.read_sql(query, conn)

    if not df_queue.empty:
        print("\n⏱️  Queue Wait Times:\n")
        print("Queue  Queries  Avg Wait(s)  Max Wait(s)")
        print("=" * 50)
        for _, row in df_queue.iterrows():
            print(
                f"{row['service_class']:<6} {row['query_count']:>7,}  {row['avg_queue_sec']:>11.2f}  {row['max_queue_sec']:>11.2f}"
            )

    conn.close()

except Exception as e:
    print(f"⚠️ Could not query history: {e}")

### Disk Space Analysis

Monitor table sizes and disk usage.

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    # Table sizes (svv_table_info.size is reported in 1 MB blocks)
    query = """
        SELECT 
            "table",
            ROUND(size/1024.0, 2) as size_gb,
            tbl_rows as rows,
            ROUND(size * 1024.0 * 1024.0 / NULLIF(tbl_rows, 0), 2) as bytes_per_row
        FROM svv_table_info
        WHERE "schema" = 'public'
        ORDER BY size DESC
        LIMIT 10;
    """

    df_size = pd.read_sql(query, conn)

    print("💾 Table Storage Analysis:\n")
    if not df_size.empty:
        print("Table              Size (GB)      Rows  Bytes/Row")
        print("=" * 60)
        for _, row in df_size.iterrows():
            print(f"{row['table']:<18} {row['size_gb']:>9.2f}  {row['rows']:>10,}  {row['bytes_per_row']:>9.0f}")

        total_gb = df_size["size_gb"].sum()
        print(f"\nTotal (tables listed above): {total_gb:.2f} GB")
    else:
        print("   No tables found")

    conn.close()

except Exception as e:
    print(f"⚠️ Could not analyze disk usage: {e}")
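
The bytes-per-row figure depends on a unit conversion that is easy to get wrong: `svv_table_info` reports `size` in 1 MB blocks, so it has to be multiplied up to bytes before dividing by the row count. A small sketch of that arithmetic, with `bytes_per_row` as a hypothetical helper name:

```python
BYTES_PER_MB = 1024 * 1024


def bytes_per_row(size_mb_blocks, row_count):
    """Average on-disk bytes per row, given a table size in 1 MB blocks.

    Returns None for empty tables to avoid dividing by zero.
    """
    if not row_count:
        return None
    return size_mb_blocks * BYTES_PER_MB / row_count


# Made-up example: a 1 GB table (1024 blocks) holding 1,048,576 rows
print(bytes_per_row(1024, 1_048_576))
```

Small tables can show surprisingly large values here, because Redshift allocates storage in 1 MB blocks per column per slice regardless of how few rows are stored.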

### Regression Detection

Compare current results against baseline to detect performance regressions.

In [None]:
# Compare current run against baseline (if available)
baseline_path = Path(config["output_dir"]) / "baseline_redshift.json"

if baseline_path.exists():
    try:
        loader = ResultLoader()
        baseline = loader.load_json(baseline_path)

        comparator = BenchmarkComparator()
        comparison = comparator.compare(
            baseline=baseline, comparison=loaded_results, baseline_label="Baseline", comparison_label="Current"
        )

        print("🔍 Regression Detection:\n")
        print(f"Baseline: {baseline.geometric_mean:.3f}s")
        print(f"Current:  {loaded_results.geometric_mean:.3f}s")
        print(f"Change:   {comparison['overall_improvement']:+.1f}%")

        # Flag regressions (>10% slower)
        regressions = [r for r in comparison["regressions"] if r["change"] < -10]

        if regressions:
            print(f"\n❌ {len(regressions)} queries regressed >10%:")
            for reg in regressions[:5]:
                print(f"   {reg['query']}: {reg['baseline']:.3f}s → {reg['comparison']:.3f}s ({reg['change']:+.1f}%)")
        else:
            print("\n✅ No significant regressions detected")

        # Flag improvements (>10% faster)
        improvements = [i for i in comparison["improvements"] if i["change"] > 10]

        if improvements:
            print(f"\n✅ {len(improvements)} queries improved >10%:")
            for imp in improvements[:5]:
                print(f"   {imp['query']}: {imp['baseline']:.3f}s → {imp['comparison']:.3f}s ({imp['change']:+.1f}%)")

    except Exception as e:
        print(f"⚠️ Could not compare results: {e}")
else:
    print("💡 No baseline found. To enable regression detection:")
    print("   1. Run a baseline benchmark")
    print(f"   2. Save results to: {baseline_path}")
    print("   3. Re-run this cell")

    # Optionally save current results as baseline
    save_baseline = False  # Set to True to save
    if save_baseline:
        try:
            exporter = ResultExporter(loaded_results)
            exporter.to_json(baseline_path)
            print(f"\n✅ Saved current results as baseline: {baseline_path}")
        except Exception as e:
            print(f"⚠️ Could not save baseline: {e}")
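
To make the 10% thresholds above concrete, the sketch below computes per-query percentage changes from two plain dictionaries of timings. It is independent of `BenchmarkComparator`, whose internals are not shown here; the sign convention (negative means slower than baseline) is an assumption chosen to match the filters in the cell above, and both function names are hypothetical.

```python
def percent_changes(baseline, current):
    """Per-query % change relative to baseline; positive means faster, negative means slower.

    baseline, current: dicts mapping query name -> execution time in seconds.
    Queries missing from `current` or with a non-positive baseline are skipped.
    """
    changes = {}
    for name, base_time in baseline.items():
        if name in current and base_time > 0:
            changes[name] = (base_time - current[name]) / base_time * 100.0
    return changes


def split_by_threshold(changes, threshold_pct=10.0):
    """Return (regressions, improvements) whose change exceeds the threshold."""
    regressions = {q: c for q, c in changes.items() if c < -threshold_pct}
    improvements = {q: c for q, c in changes.items() if c > threshold_pct}
    return regressions, improvements


# Made-up timings for illustration
changes = percent_changes({"Q1": 1.0, "Q2": 2.0, "Q3": 4.0}, {"Q1": 1.5, "Q2": 1.0, "Q3": 4.0})
regressions, improvements = split_by_threshold(changes)
print(regressions, improvements)
```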

## 6. Troubleshooting

### Connection Diagnostics

Comprehensive diagnostic tool for troubleshooting connection issues.

In [None]:
def diagnose_redshift_connection():
    """Diagnose Redshift connection issues"""
    print("🔍 Redshift Connection Diagnostic\n")

    # Check 1: Environment variables
    print("1. Checking environment variables...")
    required_vars = {
        "REDSHIFT_HOST": REDSHIFT_HOST,
        "REDSHIFT_DB": REDSHIFT_DB,
        "REDSHIFT_USER": REDSHIFT_USER,
        "REDSHIFT_PASSWORD": "***" if REDSHIFT_PASSWORD else None,
    }

    all_set = True
    for var, value in required_vars.items():
        if value:
            print(f"   ✅ {var} is set")
        else:
            print(f"   ❌ {var} is not set")
            all_set = False

    if not all_set:
        print("\n   Action: Set missing environment variables")
        return False

    # Check 2: Host format
    print("\n2. Validating host format...")
    if ".redshift.amazonaws.com" in REDSHIFT_HOST or ".redshift-serverless.amazonaws.com" in REDSHIFT_HOST:
        print(f"   ✅ Host format looks correct: {REDSHIFT_HOST}")
    else:
        print(f"   ⚠️  Host may be incorrect: {REDSHIFT_HOST}")
        print("   Expected format: <cluster>.<unique-id>.<region>.redshift.amazonaws.com")

    # Check 3: Port
    print("\n3. Checking port...")
    if REDSHIFT_PORT == 5439:
        print(f"   ✅ Using default Redshift port: {REDSHIFT_PORT}")
    else:
        print(f"   ⚠️  Using non-standard port: {REDSHIFT_PORT}")

    # Check 4: Test connection
    print("\n4. Testing connection...")
    try:
        conn = redshift_connector.connect(
            host=REDSHIFT_HOST,
            port=REDSHIFT_PORT,
            database=REDSHIFT_DB,
            user=REDSHIFT_USER,
            password=REDSHIFT_PASSWORD,
            timeout=10,
        )
        conn.close()
        print("   ✅ Connection successful!")
        connected = True
    except Exception as e:
        print(f"   ❌ Connection failed: {str(e)[:100]}")
        connected = False

        # Provide specific guidance
        if "timeout" in str(e).lower():
            print("\n   Likely cause: Network connectivity")
            print("   - Check security group allows inbound on port 5439")
            print("   - Verify VPC routing and network ACLs")
            print("   - Ensure cluster is publicly accessible (if connecting from outside VPC)")
        elif "authentication" in str(e).lower() or "password" in str(e).lower():
            print("\n   Likely cause: Invalid credentials")
            print("   - Verify username and password")
            print("   - Check the user exists and the password has not expired")
        elif "database" in str(e).lower():
            print("\n   Likely cause: Invalid database name")
            print("   - Verify database exists")
            print("   - Check spelling (case-sensitive)")
        else:
            print("\n   Check AWS Console for cluster status")

    # Always show the resource links, whether or not the connection succeeded
    print("\n📚 Additional Resources:")
    print("   - Amazon Redshift Documentation: https://docs.aws.amazon.com/redshift/")
    print("   - Security Groups: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-security-groups.html")
    print("   - Connection Issues: https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-troubleshooting.html")

    return connected


# Run diagnostics
diagnose_redshift_connection()
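
When the connection test fails, it helps to separate network problems from credential problems. The sketch below checks only whether a TCP connection to the endpoint can be opened, using the standard library. `port_is_reachable` is a hypothetical helper: if it returns `False`, security groups, routing, or public accessibility are the likely cause; if it returns `True` but the driver still fails, credentials or the database name are more likely.

```python
import socket


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False


# Usage sketch (assumes REDSHIFT_HOST and REDSHIFT_PORT from earlier in the notebook):
# print(port_is_reachable(REDSHIFT_HOST, int(REDSHIFT_PORT)))
```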

### Permission Validation

Check user permissions for common operations.

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    cursor = conn.cursor()

    print("🔐 Permission Check\n")

    # Check temp table creation (uses the TEMP privilege on the database)
    try:
        cursor.execute("CREATE TEMP TABLE permission_test (id INT);")
        cursor.execute("DROP TABLE permission_test;")
        print("✅ CREATE TEMP TABLE: Allowed")
    except Exception as e:
        conn.rollback()  # Reset the aborted transaction so later checks still run
        print(f"❌ CREATE TEMP TABLE: Denied ({str(e)[:50]}...)")

    # Check SELECT privilege
    try:
        cursor.execute("SELECT 1;")
        print("✅ SELECT: Allowed")
    except Exception as e:
        conn.rollback()
        print(f"❌ SELECT: Denied ({str(e)[:50]}...)")

    # Check system table access
    try:
        cursor.execute("SELECT 1 FROM stv_slices LIMIT 1;")
        print("✅ System tables: Allowed")
    except Exception as e:
        conn.rollback()
        print(f"⚠️  System tables: Limited access ({str(e)[:50]}...)")

    # Check COPY prerequisites (requires S3)
    if AWS_S3_BUCKET and AWS_IAM_ROLE:
        try:
            # No COPY statement is run here; this only confirms that a target
            # table can be created and that the S3 bucket and IAM role are set
            cursor.execute("CREATE TEMP TABLE copy_test (id INT);")
            cursor.execute("DROP TABLE copy_test;")
            print("✅ COPY prerequisites: Met (IAM role configured)")
        except Exception as e:
            conn.rollback()
            print(f"⚠️  COPY: May have issues ({str(e)[:50]}...)")
    else:
        print("⚠️  COPY: S3 configuration not set")

    cursor.close()
    conn.close()

    print("\n💡 To grant permissions:")
    print("   GRANT CREATE ON SCHEMA public TO your_user;")
    print("   GRANT SELECT ON ALL TABLES IN SCHEMA public TO your_user;")

except Exception as e:
    print(f"❌ Permission check failed: {e}")

### Cluster Health Check

Check cluster health and configuration.

In [None]:
try:
    conn = redshift_connector.connect(
        host=REDSHIFT_HOST, port=REDSHIFT_PORT, database=REDSHIFT_DB, user=REDSHIFT_USER, password=REDSHIFT_PASSWORD
    )

    print("🏥 Cluster Health Check\n")

    # Count nodes and slices. Slice numbers are unique across the cluster,
    # so group by node rather than filtering on a single slice.
    query = """
        SELECT 
            node,
            COUNT(*) AS slices
        FROM stv_slices
        GROUP BY node
        ORDER BY node;
    """

    df_nodes = pd.read_sql(query, conn)

    if not df_nodes.empty:
        print(f"✅ Cluster has {len(df_nodes)} compute nodes")
        print(f"   Slices per node: {int(df_nodes.iloc[0]['slices'])}")
    else:
        print("⚠️  Could not determine node count")

    # Check disk space
    query = """
        SELECT 
            ROUND(SUM(used)::float/SUM(capacity)::float*100, 1) as pct_used
        FROM stv_partitions;
    """

    cursor = conn.cursor()
    cursor.execute(query)
    pct_used = cursor.fetchone()[0]

    if pct_used < 75:
        print(f"✅ Disk usage: {pct_used}% (healthy)")
    elif pct_used < 90:
        print(f"⚠️  Disk usage: {pct_used}% (approaching limit)")
    else:
        print(f"❌ Disk usage: {pct_used}% (critical - add nodes or clean up)")

    # Check for disk-based queries. svl_query_summary is step-level and has no
    # start time, so join to stl_query to limit the window to the last 24 hours.
    query = """
        SELECT COUNT(DISTINCT s.query) as disk_queries
        FROM svl_query_summary s
        JOIN stl_query q ON q.query = s.query
        WHERE s.is_diskbased = 't'
            AND q.starttime >= DATEADD(hour, -24, GETDATE());
    """

    cursor.execute(query)
    disk_queries = cursor.fetchone()[0]

    if disk_queries == 0:
        print("✅ No disk-based queries (last 24h)")
    else:
        print(f"⚠️  {disk_queries} disk-based queries detected (last 24h)")
        print("   Consider increasing memory or optimizing queries")

    cursor.close()
    conn.close()

except Exception as e:
    print(f"⚠️ Health check failed: {e}")

### Common Issues and Solutions

**1. Connection Timeout**
```
Error: timeout expired
```
**Solution:**
- Check security group allows inbound TCP 5439
- Verify VPC routing table has route to internet gateway (if public)
- Check network ACLs aren't blocking traffic
- Ensure cluster is in "Available" state

**2. Authentication Failed**
```
Error: password authentication failed
```
**Solution:**
- Verify username and password are correct
- Check that the user exists and has not expired: `SELECT usename, valuntil FROM pg_user WHERE usename = 'username';`
- Try resetting password in AWS Console

**3. Insufficient Privileges**
```
Error: permission denied for schema public
```
**Solution:**
```sql
GRANT CREATE ON SCHEMA public TO username;
GRANT ALL ON ALL TABLES IN SCHEMA public TO username;
```

**4. Slow Query Performance**
**Solution:**
- Run VACUUM to reclaim space and re-sort: `VACUUM FULL;`
- Run ANALYZE to update statistics: `ANALYZE;`
- Check for disk-based queries: `SELECT * FROM svl_query_summary WHERE is_diskbased='t';`
- Review distribution and sort keys
- Consider adding nodes or using concurrency scaling

**5. Disk Space Full**
```
Error: disk full
```
**Solution:**
- Drop unused tables
- Run VACUUM to reclaim deleted row space
- Add more nodes (RA3 recommended)
- Move historical data to S3 and use Spectrum

**6. COPY from S3 Failed**
```
Error: S3ServiceException
```
**Solution:**
- Verify IAM role is attached to cluster
- Check IAM role has s3:GetObject permission
- Verify S3 bucket policy allows Redshift access
- Check S3 path is correct (case-sensitive)
- Review errors: `SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;`

**7. High Queue Times**
**Solution:**
- Review WLM configuration
- Enable concurrency scaling
- Increase query queue concurrency
- Add more nodes for additional capacity

**8. Query Monitoring Rule Abort**
```
Error: Query was aborted by a query monitoring rule
```
**Solution:**
- Review QMR settings in parameter group
- Optimize query to use less resources
- Adjust QMR thresholds if appropriate

**Need More Help?**
- Amazon Redshift Documentation: https://docs.aws.amazon.com/redshift/
- AWS Support: https://console.aws.amazon.com/support/
- BenchBox Documentation: https://github.com/joeharris76/benchbox
- Query system tables for detailed diagnostics:
  - `stl_query` - Query history
  - `stl_wlm_query` - WLM queue history
  - `svl_query_summary` - Query execution details
  - `stv_inflight` - Currently running queries
  - `stl_load_errors` - COPY command errors

## Next Steps

**Explore more features:**
1. Test different node types and cluster sizes
2. Compare RA3 vs DC2 performance and costs
3. Implement automated VACUUM and ANALYZE schedules
4. Set up query monitoring rules for runaway queries
5. Configure workload management (WLM) for your use case
6. Load data from S3 using COPY for production workflows
7. Use Spectrum to query S3 data directly
8. Monitor with CloudWatch metrics

**Related notebooks:**
- `snowflake_benchmarking.ipynb` - Compare with Snowflake
- `bigquery_benchmarking.ipynb` - Compare with BigQuery  
- `platform_comparison.ipynb` - Multi-cloud comparison
- `cost_analysis.ipynb` - Cost optimization strategies

**Resources:**
- BenchBox Documentation: https://github.com/joeharris76/benchbox
- Amazon Redshift Best Practices: https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
- TPC Benchmarks: http://www.tpc.org/