# BenchBox BigQuery Benchmarking

This notebook demonstrates how to use BenchBox to benchmark **Google BigQuery** with serverless analytics.

**What You'll Learn:**
- Authenticate with BigQuery using multiple methods (ADC, service account, secrets)
- Run TPC-H, TPC-DS, and ClickBench benchmarks
- Monitor query costs and bytes processed
- Use partitioned and clustered tables for optimization
- Compare slot reservations vs on-demand pricing
- Analyze performance with statistical visualizations
- Troubleshoot common BigQuery issues

**Prerequisites:**
- Google Cloud Platform account
- BigQuery API enabled
- Project with billing enabled
- Appropriate IAM permissions (BigQuery Admin, or BigQuery Data Editor together with BigQuery Job User so that queries can run)

## Installation & Setup

### 1.1 Install Required Libraries

Install BenchBox and Google Cloud libraries:

In [None]:
!pip install benchbox google-cloud-bigquery google-cloud-storage

### 1.2 Import Libraries

Import BenchBox components and visualization libraries:

In [None]:
import os

# Visualization imports
import matplotlib.pyplot as plt
import pandas as pd

# Google Cloud imports
from google.cloud import bigquery

# BenchBox imports
from benchbox.core.config import BenchmarkConfig, DatabaseConfig
from benchbox.core.runner import LifecyclePhases, run_benchmark_lifecycle

print("✅ All imports successful")

### 1.3 Authentication

BigQuery supports three authentication methods:

**Method 1: Application Default Credentials (ADC)** - Recommended for development
```bash
gcloud auth application-default login
```

**Method 2: Service Account Key** - Recommended for production
```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```

**Method 3: Direct Service Account** - For programmatic use
```python
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service-account-key.json'
)
```

This notebook reads the project from environment variables first, then falls back to the project detected from ADC:

In [None]:
# Configure BigQuery connection
try:
    # Try environment variables first
    BQ_PROJECT = os.environ.get("BIGQUERY_PROJECT")
    BQ_LOCATION = os.environ.get("BIGQUERY_LOCATION", "US")
    BQ_DATASET = os.environ.get("BIGQUERY_DATASET", "benchbox")

    if not BQ_PROJECT:
        print("⚠️  BIGQUERY_PROJECT environment variable not set")
        print("\n💡 Set up authentication:")
        print("  Option 1 (ADC): gcloud auth application-default login")
        print("  Option 2 (Env): export BIGQUERY_PROJECT='your-project-id'")
        print("  Option 3 (SA):  export GOOGLE_APPLICATION_CREDENTIALS='/path/to/key.json'")

        # Try to detect from ADC
        try:
            client = bigquery.Client()
            BQ_PROJECT = client.project
            print(f"\n✅ Using ADC with project: {BQ_PROJECT}")
        except Exception as e:
            print(f"❌ Could not detect project from ADC: {e}")
            raise ValueError("Please set BIGQUERY_PROJECT environment variable")
    else:
        print(f"✅ Using project: {BQ_PROJECT}")
        print(f"✅ Location: {BQ_LOCATION}")
        print(f"✅ Dataset: {BQ_DATASET}")

except Exception as e:
    print(f"❌ Authentication error: {e}")
    print("\n💡 Troubleshooting:")
    print("  1. Run: gcloud auth application-default login")
    print("  2. Set BIGQUERY_PROJECT environment variable")
    print("  3. Verify BigQuery API is enabled in your project")
    raise
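
The resolution order in the cell above (environment variables first, then a project detected from ADC) can be factored into a small pure function, which makes it checkable without touching Google Cloud. This is only a sketch: `resolve_bq_settings` and its `adc_project` argument are hypothetical names, not part of BenchBox or the BigQuery client.

```python
from typing import Dict, Mapping, Optional


def resolve_bq_settings(env: Mapping[str, str], adc_project: Optional[str] = None) -> Dict[str, str]:
    """Resolve project, location, and dataset the way the cell above does.

    env:         a mapping such as os.environ
    adc_project: project detected from ADC, if any (hypothetical argument)
    """
    # Environment variable wins; the ADC-detected project is the fallback.
    project = env.get("BIGQUERY_PROJECT") or adc_project
    if not project:
        raise ValueError("Please set BIGQUERY_PROJECT environment variable")
    return {
        "project": project,
        "location": env.get("BIGQUERY_LOCATION", "US"),      # same default as the notebook
        "dataset": env.get("BIGQUERY_DATASET", "benchbox"),  # same default as the notebook
    }
```

In the notebook this could be called as `resolve_bq_settings(os.environ, adc_project=client.project)` once an ADC client has been created.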

### 1.4 Test Connection

Verify connectivity and permissions:

In [None]:
try:
    # Initialize BigQuery client
    client = bigquery.Client(project=BQ_PROJECT, location=BQ_LOCATION)

    # Test 1: List datasets
    print("1️⃣ Testing dataset listing...")
    datasets = list(client.list_datasets())
    print(f"   ✅ Found {len(datasets)} dataset(s) in project {BQ_PROJECT}")

    # Test 2: Check if benchmark dataset exists (create it only if it is missing)
    from google.api_core.exceptions import NotFound

    print(f"\n2️⃣ Checking for dataset '{BQ_DATASET}'...")
    dataset_id = f"{BQ_PROJECT}.{BQ_DATASET}"
    try:
        dataset = client.get_dataset(dataset_id)
        print(f"   ✅ Dataset exists: {dataset_id}")
        print(f"   Location: {dataset.location}")
        print(f"   Created: {dataset.created}")
    except NotFound:
        print(f"   ⚠️  Dataset '{BQ_DATASET}' does not exist")
        print("   💡 Creating dataset...")

        dataset = bigquery.Dataset(dataset_id)
        dataset.location = BQ_LOCATION
        dataset = client.create_dataset(dataset, exists_ok=True)
        print(f"   ✅ Created dataset: {dataset_id}")

    # Test 3: Run simple query
    print("\n3️⃣ Testing query execution...")
    query = "SELECT 1 as test"
    query_job = client.query(query)
    # Wait for completion without binding `results`, which later holds benchmark results
    query_job.result()
    print("   ✅ Query executed successfully")
    print(f"   Bytes processed: {query_job.total_bytes_processed:,}")
    print(f"   Bytes billed: {query_job.total_bytes_billed:,}")

    print("\n✅ All connection tests passed!")

except Exception as e:
    print(f"❌ Connection test failed: {e}")
    print("\n💡 Troubleshooting:")
    print("  1. Verify BigQuery API is enabled")
    print("  2. Check IAM permissions (roles/bigquery.admin, or roles/bigquery.dataEditor plus roles/bigquery.jobUser)")
    print("  3. Ensure billing is enabled for the project")
    raise

### 1.5 Configuration Overview

Summary of your BigQuery configuration:

In [None]:
print("📊 BigQuery Configuration Summary\n")
print(f"Project:  {BQ_PROJECT}")
print(f"Location: {BQ_LOCATION}")
print(f"Dataset:  {BQ_DATASET}")
print("\nPricing Model: On-demand (pay per byte processed)")
print("Cost: $5 per TB processed (first 1TB/month free)")
print("\n💡 Tip: Use partitioned/clustered tables to reduce costs")

## Quick Start Example

### 2.1 Run TPC-H Power Test

Run a TPC-H power test at scale factor 0.01 (~10MB data).

**What happens:**
1. Generate TPC-H data (8 tables: customer, orders, lineitem, etc.)
2. Load data into BigQuery tables
3. Execute 22 TPC-H queries sequentially
4. Collect execution times and cost metrics

**Expected time:** 2-3 minutes

In [None]:
# Configure database connection
db_cfg = DatabaseConfig(type="bigquery", name="bigquery_benchbox")

# BigQuery platform configuration
platform_cfg = {
    "project": BQ_PROJECT,
    "location": BQ_LOCATION,
    "dataset": BQ_DATASET,
    # Optional: Use specific credentials
    # "credentials_path": "/path/to/service-account-key.json"
}

# Configure TPC-H benchmark
bench_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H",
    scale_factor=0.01,  # 10MB dataset
    test_execution_type="power",
)

# Run full lifecycle: generate → load → execute
print("🚀 Starting TPC-H power test on BigQuery...\n")

results = run_benchmark_lifecycle(
    benchmark_config=bench_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(
        generate=True,  # Generate TPC-H data
        load=True,  # Load into BigQuery
        execute=True,  # Execute 22 queries
    ),
)

print("\n✅ Power test completed on BigQuery")
print(f"Total queries executed: {len(results.query_results)}")
print(f"Results saved to: {results.results_dir}")

### 2.2 Visualize Results

Create a bar chart of query execution times:

In [None]:
if results.query_results:
    # Extract query names and execution times
    query_names = [qr.query_name for qr in results.query_results]
    execution_times = [qr.execution_time for qr in results.query_results]

    # Create bar chart
    fig, ax = plt.subplots(figsize=(14, 6))
    bars = ax.bar(query_names, execution_times, color="#4285F4", alpha=0.8)  # Google Blue

    # Highlight the slowest queries (those above 70% of the maximum time)
    max_time = max(execution_times)
    for bar, time in zip(bars, execution_times):
        if time > max_time * 0.7:
            bar.set_color("#EA4335")  # Google Red for slow queries

    ax.set_xlabel("Query", fontsize=12, fontweight="bold")
    ax.set_ylabel("Execution Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Query Performance on BigQuery", fontsize=14, fontweight="bold")
    ax.tick_params(axis="x", rotation=45)
    ax.grid(axis="y", alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print("\n📊 Performance Summary:")
    print(f"  Total time: {sum(execution_times):.2f}s")
    print(f"  Average: {sum(execution_times) / len(execution_times):.2f}s")
    print(f"  Fastest: {min(execution_times):.2f}s ({query_names[execution_times.index(min(execution_times))]})")
    print(f"  Slowest: {max(execution_times):.2f}s ({query_names[execution_times.index(max(execution_times))]})")
else:
    print("⚠️ No query results to visualize")
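
The red highlighting in the chart uses a threshold relative to the slowest query rather than a percentile. A minimal sketch of that selection rule, separated from the plotting code, might look like this; `slow_queries` is a hypothetical helper name and the timings are made-up sample values.

```python
from typing import Dict, List


def slow_queries(times: Dict[str, float], fraction_of_max: float = 0.7) -> List[str]:
    """Return query names whose time exceeds `fraction_of_max` of the slowest query."""
    if not times:
        return []
    cutoff = max(times.values()) * fraction_of_max
    # Same comparison as the chart: strictly greater than the cutoff.
    return [name for name, seconds in times.items() if seconds > cutoff]


# Illustrative timings in seconds (not measured results)
sample = {"Q1": 1.0, "Q2": 7.5, "Q3": 10.0}
print(slow_queries(sample))
```

Because the cutoff depends on the single slowest query, one extreme outlier can leave every other bar blue; a percentile-based cutoff is an alternative if that matters.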

### 2.3 Analyze Query Costs

BigQuery charges based on bytes processed. Let's analyze the cost of each query:

In [None]:
# Query INFORMATION_SCHEMA.JOBS to get bytes processed
# Note: This requires recent query execution
try:
    # Get job IDs from recent queries
    jobs_query = f"""
    SELECT 
        job_id,
        query,
        total_bytes_processed,
        total_bytes_billed,
        total_slot_ms,
        creation_time
    FROM `{BQ_PROJECT}.region-{BQ_LOCATION}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
        AND job_type = 'QUERY'
        AND state = 'DONE'
    ORDER BY creation_time DESC
    LIMIT 50
    """

    query_job = client.query(jobs_query)
    jobs_df = query_job.to_dataframe()

    if not jobs_df.empty:
        # Calculate costs ($5 per TB on-demand)
        jobs_df["cost_usd"] = jobs_df["total_bytes_billed"] / (1024**4) * 5

        print("💰 Recent Query Costs:\n")
        print(f"Total bytes processed: {jobs_df['total_bytes_processed'].sum() / (1024**3):.2f} GB")
        print(f"Total bytes billed: {jobs_df['total_bytes_billed'].sum() / (1024**3):.2f} GB")
        print(f"Estimated cost: ${jobs_df['cost_usd'].sum():.4f}")
        print("\n💡 Note: First 1 TB per month is free")
    else:
        print("⚠️ No recent jobs found")

except Exception as e:
    print(f"⚠️ Could not query INFORMATION_SCHEMA.JOBS: {e}")
    print("This may require additional permissions (roles/bigquery.resourceViewer)")
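
The cost arithmetic used above (bytes billed divided by 1024^4, times a per-TB rate) can be isolated into a small function. This is a sketch under the notebook's own assumptions: the $5 rate is the figure quoted in this notebook and should be checked against current pricing, and `estimate_on_demand_cost` with its `free_bytes` parameter is a hypothetical helper, not a BigQuery or BenchBox API.

```python
TIB = 1024 ** 4  # bytes per TiB, the same divisor the cell above uses


def estimate_on_demand_cost(bytes_billed: int, usd_per_tib: float = 5.0, free_bytes: int = 0) -> float:
    """Estimate on-demand cost in USD for a number of billed bytes.

    usd_per_tib: rate assumed by this notebook (verify against current pricing)
    free_bytes:  bytes covered by a free allowance, subtracted before pricing
    """
    billable = max(0, bytes_billed - free_bytes)
    return billable / TIB * usd_per_tib


# 3 TiB billed in a month with a 1 TiB free allowance
print(f"${estimate_on_demand_cost(3 * TIB, free_bytes=TIB):.2f}")
```

Passing the rate as a parameter keeps the estimate easy to update if the price changes.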

### 2.4 Results Overview

View comprehensive results summary:

In [None]:
print("📋 Benchmark Results Summary\n")
print(f"Benchmark: {results.benchmark_name}")
print(f"Platform: {results.database_config.type}")
print(f"Scale Factor: {results.benchmark_config.scale_factor}")
print(f"Test Type: {results.benchmark_config.test_execution_type}")
print("\nExecution:")
print(f"  Start: {results.execution_metadata.start_time}")
print(f"  End: {results.execution_metadata.end_time}")
print(f"  Duration: {results.execution_metadata.total_duration:.2f}s")
print("\nQueries:")
print(f"  Total: {len(results.query_results)}")
print(f"  Successful: {sum(1 for qr in results.query_results if qr.success)}")
print(f"  Failed: {sum(1 for qr in results.query_results if not qr.success)}")
print("\nResults Location:")
print(f"  {results.results_dir}")

## Advanced Examples

### 3.1 TPC-DS Benchmark

Run TPC-DS (99 queries, more complex than TPC-H):

In [None]:
# TPC-DS configuration
tpcds_cfg = BenchmarkConfig(
    name="tpcds",
    display_name="TPC-DS",
    scale_factor=0.01,  # 10MB dataset
    test_execution_type="power",
)

print("🚀 Running TPC-DS benchmark...")
print("⚠️  Note: TPC-DS has 99 queries and will take longer\n")

tpcds_results = run_benchmark_lifecycle(
    benchmark_config=tpcds_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print(f"✅ TPC-DS completed: {len(tpcds_results.query_results)} queries")

### 3.2 Scale Factor Comparison

Compare performance across different data sizes:

In [None]:
# Test multiple scale factors
scale_factors = [0.01, 0.1, 1.0]  # 10MB, 100MB, 1GB
scale_results = {}

for sf in scale_factors:
    print(f"\n📊 Testing scale factor {sf} (~{sf * 1000:g} MB)...")

    sf_cfg = BenchmarkConfig(name="tpch", display_name=f"TPC-H SF{sf}", scale_factor=sf, test_execution_type="power")

    sf_results = run_benchmark_lifecycle(
        benchmark_config=sf_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(generate=True, load=True, execute=True),
    )

    scale_results[sf] = sf_results
    avg_time = sum(qr.execution_time for qr in sf_results.query_results) / len(sf_results.query_results)
    print(f"  Average query time: {avg_time:.2f}s")

print("\n✅ Scale comparison complete")
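
To interpret a scale-factor sweep, it helps to compare how much the average query time grows against how much the data grows. The sketch below does that for consecutive scale factors; `scaling_steps` is a hypothetical helper and the timings are made-up sample values, not BigQuery measurements.

```python
from typing import Dict, List, Tuple


def scaling_steps(avg_times: Dict[float, float]) -> List[Tuple[float, float, float, float]]:
    """For each consecutive pair of scale factors, return
    (sf_from, sf_to, data_growth, time_growth)."""
    ordered = sorted(avg_times.items())  # sort by scale factor
    steps = []
    for (sf_a, t_a), (sf_b, t_b) in zip(ordered, ordered[1:]):
        steps.append((sf_a, sf_b, sf_b / sf_a, t_b / t_a))
    return steps


# Illustrative average query times in seconds, keyed by scale factor
sample = {0.01: 1.0, 0.1: 2.0, 1.0: 8.0}
for sf_a, sf_b, data_growth, time_growth in scaling_steps(sample):
    print(f"SF {sf_a} -> {sf_b}: data x{data_growth:.0f}, time x{time_growth:.1f}")
```

With real results, the input dictionary could be built from `scale_results` by averaging `execution_time` per scale factor, as the loop above already does. Time growth well below data growth suggests fixed per-query overhead dominates at small scale factors.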

### 3.3 Query Subset Selection

Run only specific queries for faster iteration:

In [None]:
# Run only queries 1, 6, and 14 (fast queries for CI/CD)
subset_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Subset",
    scale_factor=0.01,
    test_execution_type="power",
    query_filter=[1, 6, 14],  # Only these queries
)

print("🚀 Running query subset (1, 6, 14)...\n")

subset_results = run_benchmark_lifecycle(
    benchmark_config=subset_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),  # Reuse data
)

print(f"✅ Subset complete: {len(subset_results.query_results)} queries")
print("\n💡 Use case: Fast regression testing in CI/CD")

### 3.4 Partitioned Tables

Use date/timestamp partitioning to reduce costs:

In [None]:
# Configure platform with partitioning
partitioned_cfg = {
    "project": BQ_PROJECT,
    "location": BQ_LOCATION,
    "dataset": BQ_DATASET,
    "table_options": {
        "orders": {"partition_field": "o_orderdate", "partition_type": "DAY"},
        "lineitem": {"partition_field": "l_shipdate", "partition_type": "MONTH"},
    },
}

print("🚀 Running with partitioned tables...")
print("  - orders: partitioned by o_orderdate (daily)")
print("  - lineitem: partitioned by l_shipdate (monthly)\n")

part_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Partitioned",
    scale_factor=1.0,  # Larger dataset to see benefits
    test_execution_type="power",
)

part_results = run_benchmark_lifecycle(
    benchmark_config=part_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=partitioned_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print("✅ Partitioned benchmark complete")
print("\n💡 Benefits:")
print("  - Reduced bytes scanned for date-filtered queries")
print("  - Lower costs (pay only for partitions accessed)")
print("  - Faster query execution")
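
Partition pruning means a date-filtered query reads only the partitions that overlap the filter range. As a rough illustration of why granularity matters, the sketch below counts how many DAY or MONTH partitions an inclusive date range touches. It is a simplified model with a hypothetical helper name (`partitions_touched`), not a BigQuery API, and it ignores partition sizes.

```python
from datetime import date


def partitions_touched(start: date, end: date, granularity: str) -> int:
    """Count DAY or MONTH partitions overlapping the inclusive range [start, end]."""
    if end < start:
        return 0
    if granularity == "DAY":
        return (end - start).days + 1
    if granularity == "MONTH":
        return (end.year - start.year) * 12 + (end.month - start.month) + 1
    raise ValueError(f"unsupported granularity: {granularity}")


# A one-month filter touches 31 daily partitions but a single monthly partition
print(partitions_touched(date(2023, 1, 1), date(2023, 1, 31), "DAY"))
print(partitions_touched(date(2023, 1, 1), date(2023, 1, 31), "MONTH"))
```

Daily partitions prune more precisely for narrow filters, while monthly partitions keep the partition count lower for tables spanning many years.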

### 3.5 Clustered Tables

Use clustering to optimize filtering and aggregation:

In [None]:
# Configure platform with clustering
clustered_cfg = {
    "project": BQ_PROJECT,
    "location": BQ_LOCATION,
    "dataset": BQ_DATASET,
    "table_options": {
        "lineitem": {"cluster_fields": ["l_shipdate", "l_returnflag", "l_linestatus"]},
        "orders": {"cluster_fields": ["o_orderdate", "o_orderstatus"]},
    },
}

print("🚀 Running with clustered tables...")
print("  - lineitem: clustered by (l_shipdate, l_returnflag, l_linestatus)")
print("  - orders: clustered by (o_orderdate, o_orderstatus)\n")

cluster_cfg = BenchmarkConfig(
    name="tpch", display_name="TPC-H Clustered", scale_factor=1.0, test_execution_type="power"
)

cluster_results = run_benchmark_lifecycle(
    benchmark_config=cluster_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=clustered_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print("✅ Clustered benchmark complete")
print("\n💡 Benefits:")
print("  - Faster filtering on clustered columns")
print("  - Reduced bytes scanned")
print("  - Optimal for GROUP BY operations")

### 3.6 Slot Reservation vs On-Demand

Compare on-demand pricing with flat-rate (slot reservation):

In [None]:
# Note: This requires slot reservations to be configured in your project
print("💡 Pricing Comparison:\n")
print("On-Demand:")
print("  - Pay per byte processed ($5/TB)")
print("  - No commitment")
print("  - Best for sporadic workloads")
print("  - First 1 TB/month free")
print("\nFlat-Rate (Slot Reservations):")
print("  - Pay per slot hour (~$40/hour for 100 slots)")
print("  - Monthly/annual commitment discounts")
print("  - Best for steady workloads (>$50k/month)")
print("  - Predictable costs")
print("\nAutoscaling (Flex Slots):")
print("  - Pay per slot second ($0.04/slot/hour)")
print("  - No commitment (60-second minimum)")
print("  - Best for burst workloads")
print("  - Scale 0-2000 slots automatically")

# To use slot reservations, configure:
# slot_cfg = {
#     "project": BQ_PROJECT,
#     "location": BQ_LOCATION,
#     "dataset": BQ_DATASET,
#     "reservation_id": "your-reservation-name"
# }
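
The choice between on-demand and a reservation comes down to a break-even volume: the amount of data scanned per month at which the on-demand bill equals the fixed reservation cost. The sketch below computes it. Both helper names are hypothetical, the $5 per TB default is the rate quoted in this notebook, and the $10,000 monthly reservation cost is a made-up example value, so substitute current prices before drawing conclusions.

```python
def break_even_tib_per_month(reservation_usd_per_month: float, on_demand_usd_per_tib: float = 5.0) -> float:
    """TiB scanned per month at which on-demand cost equals the reservation cost."""
    return reservation_usd_per_month / on_demand_usd_per_tib


def cheaper_option(tib_scanned_per_month: float, reservation_usd_per_month: float,
                   on_demand_usd_per_tib: float = 5.0) -> str:
    """Return which pricing model costs less for a given monthly scan volume."""
    on_demand_cost = tib_scanned_per_month * on_demand_usd_per_tib
    return "on-demand" if on_demand_cost <= reservation_usd_per_month else "reservation"


# Hypothetical reservation costing $10,000 per month
print(break_even_tib_per_month(10_000))
print(cheaper_option(100, 10_000))
print(cheaper_option(5_000, 10_000))
```

This comparison only covers cost; a reservation also caps available slots, which can change query latency.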

### 3.7 Throughput Testing

Run concurrent queries to test throughput:

In [None]:
# Throughput test configuration
throughput_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Throughput",
    scale_factor=0.1,
    test_execution_type="throughput",
    throughput_streams=4,  # 4 concurrent streams
)

print("🚀 Running throughput test (4 concurrent streams)...\n")

throughput_results = run_benchmark_lifecycle(
    benchmark_config=throughput_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),
)

print(f"✅ Throughput test complete: {len(throughput_results.query_results)} queries")
print("\n💡 BigQuery handles concurrent queries automatically with slot allocation")
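
Conceptually, a throughput test runs several query streams at the same time and records per-query timings for each stream. The sketch below shows that idea with a thread pool. It is not how BenchBox implements `throughput_streams`; `run_streams` and the `run_query` callable are hypothetical stand-ins. The official TPC-H throughput test also gives each stream its own query ordering, which this sketch does not model.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_streams(run_query: Callable[[int], None], query_ids: Sequence[int],
                streams: int = 4) -> Dict[int, List[float]]:
    """Run the same query list in `streams` concurrent threads.

    Returns {stream_id: [elapsed seconds per query, in order]}.
    """
    def one_stream(stream_id: int) -> List[float]:
        timings = []
        for query_id in query_ids:
            start = time.perf_counter()
            run_query(query_id)  # stand-in for submitting a query and waiting for it
            timings.append(time.perf_counter() - start)
        return timings

    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = {sid: pool.submit(one_stream, sid) for sid in range(streams)}
        return {sid: future.result() for sid, future in futures.items()}


# Stand-in workload: sleep briefly instead of querying BigQuery
timings = run_streams(lambda query_id: time.sleep(0.01), [1, 6, 14], streams=4)
print({sid: len(ts) for sid, ts in timings.items()})
```

Threads are a reasonable fit here because the work is waiting on a remote service rather than local computation.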

### 3.8 Result Comparison

Compare results across different configurations:

In [None]:
# Load and compare results
if "results" in locals() and "tpcds_results" in locals():
    tpch_avg = sum(qr.execution_time for qr in results.query_results) / len(results.query_results)
    tpcds_avg = sum(qr.execution_time for qr in tpcds_results.query_results) / len(tpcds_results.query_results)

    # Create comparison visualization
    fig, ax = plt.subplots(figsize=(10, 6))

    benchmarks = ["TPC-H\n(22 queries)", "TPC-DS\n(99 queries)"]
    avg_times = [tpch_avg, tpcds_avg]
    total_times = [
        sum(qr.execution_time for qr in results.query_results),
        sum(qr.execution_time for qr in tpcds_results.query_results),
    ]

    x = range(len(benchmarks))
    width = 0.35

    ax.bar([i - width / 2 for i in x], avg_times, width, label="Avg Time/Query", color="#4285F4")
    ax.bar([i + width / 2 for i in x], total_times, width, label="Total Time", color="#EA4335")

    ax.set_ylabel("Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("Benchmark Comparison on BigQuery", fontsize=14, fontweight="bold")
    ax.set_xticks(x)
    ax.set_xticklabels(benchmarks)
    ax.legend()
    ax.grid(axis="y", alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("\n📊 Comparison Summary:")
    print(f"  TPC-H: {tpch_avg:.2f}s avg, {total_times[0]:.2f}s total")
    print(f"  TPC-DS: {tpcds_avg:.2f}s avg, {total_times[1]:.2f}s total")
else:
    print("⚠️ Run both TPC-H and TPC-DS benchmarks first")

### 3.9 Export Results

Export results in multiple formats:

In [None]:
from benchbox.core.results.exporter import ResultExporter

# Export to JSON (default)
exporter = ResultExporter(results)
json_path = exporter.export(format="json")
print(f"✅ Exported to JSON: {json_path}")

# Export to CSV
csv_path = exporter.export(format="csv")
print(f"✅ Exported to CSV: {csv_path}")

# Export to HTML
html_path = exporter.export(format="html")
print(f"✅ Exported to HTML: {html_path}")

print("\n💡 Use these exports for:")
print("  - JSON: API integration, programmatic analysis")
print("  - CSV: Excel, data science tools, dashboards")
print("  - HTML: Shareable reports, documentation")

### 3.10 Cost Optimization Strategies

Techniques to reduce BigQuery costs:

In [None]:
print("💰 BigQuery Cost Optimization Strategies\n")
print("1. Use Partitioned Tables:")
print("   - Partition by date/timestamp columns")
print("   - Scan only relevant partitions (prune others)")
print("   - Example: WHERE date BETWEEN '2023-01-01' AND '2023-01-31'")
print("   - Savings: 50-95% on filtered queries\n")

print("2. Use Clustered Tables:")
print("   - Cluster by frequently filtered columns")
print("   - Co-locate related data")
print("   - Up to 4 clustering columns")
print("   - Savings: 20-50% on filtered/grouped queries\n")

print("3. Materialized Views:")
print("   - Pre-compute expensive aggregations")
print("   - Auto-refresh on base table changes")
print("   - Savings: 90%+ on repeated aggregations\n")

print("4. BI Engine:")
print("   - In-memory analysis engine")
print("   - Cache frequently accessed data")
print("   - Best for dashboards and repeated queries")
print("   - Cost: $0.06/GB/hour + $0.75/TB scan\n")

print("5. Query Optimization:")
print("   - SELECT only needed columns (not SELECT *)")
print("   - Use WHERE clauses before JOINs")
print("   - Use APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT)")
print("   - Avoid CROSS JOINs on large tables")
print("   - Use table preview (LIMIT) for exploration\n")

print("6. Slot Management:")
print("   - On-demand: Good for <$50k/month spend")
print("   - Flat-rate: Good for >$50k/month (break-even)")
print("   - Flex slots: Good for burst workloads")
print("   - Autoscaling: Best of both worlds\n")

print("7. Storage Optimization:")
print("   - Long-term storage: 50% discount after 90 days")
print("   - Delete unused tables/partitions")
print("   - Use table expiration settings")
print("   - Archive to Cloud Storage if rarely accessed")

## Platform-Specific Features

### 4.1 Query Cost Analysis with INFORMATION_SCHEMA

Analyze costs using BigQuery's INFORMATION_SCHEMA:

In [None]:
# Query detailed job information
try:
    cost_query = f"""
    SELECT 
        user_email,
        job_id,
        query,
        total_bytes_processed,
        total_bytes_billed,
        ROUND(total_bytes_billed / POW(1024, 4) * 5, 4) as cost_usd,
        total_slot_ms,
        TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) as duration_ms,
        cache_hit,
        creation_time
    FROM `{BQ_PROJECT}.region-{BQ_LOCATION}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
        AND job_type = 'QUERY'
        AND state = 'DONE'
    ORDER BY total_bytes_billed DESC
    LIMIT 20
    """

    print("📊 Analyzing recent query costs...\n")

    query_job = client.query(cost_query)
    cost_df = query_job.to_dataframe()

    if not cost_df.empty:
        # Display top costly queries
        print("💰 Most Expensive Queries:\n")
        for idx, row in cost_df.head(5).iterrows():
            print(f"{idx + 1}. Cost: ${row['cost_usd']:.4f}")
            print(f"   Bytes processed: {row['total_bytes_processed'] / (1024**3):.2f} GB")
            print(f"   Duration: {row['duration_ms'] / 1000:.2f}s")
            print(f"   Cache hit: {row['cache_hit']}")
            print(f"   Query preview: {row['query'][:80]}...\n")

        # Summary statistics
        total_cost = cost_df["cost_usd"].sum()
        total_gb = cost_df["total_bytes_processed"].sum() / (1024**3)
        cache_hit_rate = (cost_df["cache_hit"].sum() / len(cost_df)) * 100

        print("\n📈 Summary Statistics:")
        print(f"  Total queries: {len(cost_df)}")
        print(f"  Total cost: ${total_cost:.4f}")
        print(f"  Total data processed: {total_gb:.2f} GB")
        print(f"  Cache hit rate: {cache_hit_rate:.1f}%")
        print(f"  Average cost/query: ${total_cost / len(cost_df):.4f}")
    else:
        print("⚠️ No recent queries found")

except Exception as e:
    print(f"❌ Error querying INFORMATION_SCHEMA: {e}")
    print("\n💡 Troubleshooting:")
    print("  1. Verify roles/bigquery.resourceViewer permission")
    print("  2. Check that recent queries exist (past hour)")
    print("  3. Verify region matches your BigQuery location")

### 4.2 Slot Usage Monitoring

Monitor slot allocation and usage:

In [None]:
# Query slot usage from INFORMATION_SCHEMA
try:
    slot_query = f"""
    SELECT 
        job_id,
        total_slot_ms,
        TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) as duration_ms,
        ROUND(SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)), 2) as avg_slots
    FROM `{BQ_PROJECT}.region-{BQ_LOCATION}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
        AND job_type = 'QUERY'
        AND state = 'DONE'
        AND total_slot_ms > 0
    ORDER BY total_slot_ms DESC
    LIMIT 10
    """

    print("📊 Analyzing slot usage...\n")

    query_job = client.query(slot_query)
    slot_df = query_job.to_dataframe()

    if not slot_df.empty:
        print("🔧 Top Slot Consumers:\n")
        for idx, row in slot_df.head(5).iterrows():
            print(f"{idx + 1}. Job: {row['job_id'][-20:]}")
            print(f"   Total slot-ms: {row['total_slot_ms']:,.0f}")
            print(f"   Duration: {row['duration_ms'] / 1000:.2f}s")
            print(f"   Avg slots: {row['avg_slots']:.2f}\n")

        # Summary
        total_slot_ms = slot_df["total_slot_ms"].sum()
        avg_slots = slot_df["avg_slots"].mean()

        print("📈 Slot Usage Summary:")
        print(f"  Total slot-ms: {total_slot_ms:,.0f}")
        print(f"  Average slots/query: {avg_slots:.2f}")
        print(f"  Highest per-query average slots: {slot_df['avg_slots'].max():.2f}")

        print("\n💡 Notes:")
        print("  - On-demand: 2000 slots default")
        print("  - Flat-rate: Based on your reservation")
        print("  - Slots allocated dynamically based on query complexity")
    else:
        print("⚠️ No slot usage data found")

except Exception as e:
    print(f"❌ Error querying slot usage: {e}")
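
The `avg_slots` column in the query above is the job's total slot-milliseconds divided by its wall-clock duration in milliseconds. The same calculation in Python, with a guard for zero-length jobs, might look like the sketch below; `average_slots` is a hypothetical helper name.

```python
from typing import Optional


def average_slots(total_slot_ms: float, duration_ms: float) -> Optional[float]:
    """Average number of slots a job held over its duration.

    Returns None when the duration is zero or negative, where the ratio is undefined.
    """
    if duration_ms <= 0:
        return None
    return round(total_slot_ms / duration_ms, 2)


# A job that consumed 10,000 slot-ms over 2 seconds averaged 5 slots
print(average_slots(10_000, 2_000))
```

This is an average over the whole job, so it understates short bursts in which the job held many more slots.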

### 4.3 BI Engine Acceleration

Enable and monitor BI Engine for in-memory acceleration:

In [None]:
print("⚡ BI Engine Overview\n")
print("What is BI Engine?")
print("  - In-memory analysis engine")
print("  - Accelerates SELECT queries")
print("  - Automatic cache management")
print("  - Best for repeated queries and dashboards\n")

print("Pricing:")
print("  - $0.06 per GB per hour (reserved capacity)")
print("  - $0.75 per TB scanned (on-demand)")
print("  - Minimum: 1 GB reservation\n")

print("To Enable BI Engine:")
print("  1. Go to BigQuery console")
print("  2. Click 'BI Engine' in left nav")
print("  3. Click 'Create Reservation'")
print("  4. Choose capacity (1-100 GB)")
print("  5. Select location (must match dataset)\n")

print("To Use BI Engine in BenchBox:")
print("  platform_cfg = {")
print('      "project": BQ_PROJECT,')
print('      "location": BQ_LOCATION,')
print('      "dataset": BQ_DATASET,')
print('      "use_bi_engine": True  # Enable BI Engine')
print("  }\n")

print("Performance Gains:")
print("  - 3-10x faster for cached queries")
print("  - Sub-second response times")
print("  - Automatic cache warming")
print("  - No query rewriting needed")

### 4.4 External Tables (Cloud Storage)

Query data directly from Google Cloud Storage:

In [None]:
print("☁️  External Tables Overview\n")
print("What are External Tables?")
print("  - Query data in Cloud Storage (GCS) without loading")
print("  - Supports CSV, JSON, Avro, Parquet, ORC")
print("  - Data stays in GCS (no storage cost in BigQuery)")
print("  - Best for infrequently accessed data\n")

print("Benefits:")
print("  - Lower storage costs")
print("  - No ETL required")
print("  - Data lake integration")
print("  - Automatic schema detection\n")

print("Limitations:")
print("  - Slower than native tables (no cache)")
print("  - No DML (INSERT/UPDATE/DELETE)")
print("  - No clustering; partitioning only via Hive-style directory layouts")
print("  - Limited query optimization\n")

print("Example: Create External Table")
print("""
CREATE EXTERNAL TABLE `project.dataset.table`
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/path/*.parquet'],
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = true
);
""")

print("\n💡 Use Case: Query data lake from BigQuery without moving data")

## Performance Analysis

### 5.1 Load and Prepare Results

Load benchmark results for analysis:

In [None]:
# Load results from previous run
if "results" in locals() and results.query_results:
    # Convert to pandas DataFrame for analysis
    df_results = pd.DataFrame(
        [
            {
                "query": qr.query_name,
                "time": qr.execution_time,
                "success": qr.success,
                "rows_returned": qr.row_count if hasattr(qr, "row_count") else None,
            }
            for qr in results.query_results
        ]
    )

    print("✅ Results loaded into DataFrame")
    print(f"\nShape: {df_results.shape[0]} queries, {df_results.shape[1]} columns")
    print("\nFirst 5 rows:")
    print(df_results.head())
else:
    print("⚠️ No results available. Run a benchmark first.")

### 5.2 Statistical Analysis

Compute detailed statistics and identify outliers:

In [None]:
if "df_results" in locals():
    # Compute statistics
    stats = df_results["time"].describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99])

    print("📊 Execution Time Statistics\n")
    print(stats)

    print("\n🔍 Key Percentiles:")
    print(f"  P25 (25th percentile): {df_results['time'].quantile(0.25):.3f}s")
    print(f"  P50 (median): {df_results['time'].median():.3f}s")
    print(f"  P75 (75th percentile): {df_results['time'].quantile(0.75):.3f}s")
    print(f"  P95 (95th percentile): {df_results['time'].quantile(0.95):.3f}s")
    print(f"  P99 (99th percentile): {df_results['time'].quantile(0.99):.3f}s")

    # Identify outliers (>2 standard deviations)
    mean_time = df_results["time"].mean()
    std_time = df_results["time"].std()
    outliers = df_results[df_results["time"] > mean_time + 2 * std_time]

    if not outliers.empty:
        print("\n⚠️  Performance Outliers (>2σ):")
        for _, row in outliers.iterrows():
            z_score = (row["time"] - mean_time) / std_time
            print(f"  {row['query']}: {row['time']:.2f}s (z-score: {z_score:.2f})")

        print("\n💡 Investigation steps:")
        print("  1. Review the query execution plan (Execution details in the console, or QueryJob.query_plan)")
        print("  2. Check bytes processed (INFORMATION_SCHEMA.JOBS)")
        print("  3. Consider partitioning/clustering")
        print("  4. Verify slot allocation during execution")
    else:
        print("\n✅ No significant outliers detected")

    # Coefficient of variation (CV)
    cv = (std_time / mean_time) * 100
    print("\n📈 Variability:")
    print(f"  Standard deviation: {std_time:.3f}s")
    print(f"  Coefficient of variation: {cv:.1f}%")
    if cv < 20:
        print("  Assessment: Low variability (consistent performance)")
    elif cv < 50:
        print("  Assessment: Moderate variability (typical for mixed workload)")
    else:
        print("  Assessment: High variability (investigate slow queries)")
else:
    print("⚠️ Load results first")
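
The outlier rule and the coefficient of variation used above can be reproduced with the standard library alone, which is handy for quick checks outside pandas. This is a sketch with hypothetical helper names and made-up timings. `statistics.stdev` is the sample standard deviation, which matches the pandas default used by `df_results["time"].std()`.

```python
import statistics
from typing import Dict, Iterable


def find_outliers(times: Dict[str, float], threshold: float = 2.0) -> Dict[str, float]:
    """Return {query: z_score} for queries more than `threshold` sample
    standard deviations above the mean."""
    values = list(times.values())
    if len(values) < 2:
        return {}
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        return {}
    z_scores = {name: (t - mean) / std for name, t in times.items()}
    return {name: z for name, z in z_scores.items() if z > threshold}


def coefficient_of_variation(values: Iterable[float]) -> float:
    """Standard deviation as a percentage of the mean."""
    values = list(values)
    return statistics.stdev(values) / statistics.mean(values) * 100


# Illustrative timings in seconds: nine fast queries and one slow one
sample = {f"Q{i}": 1.0 for i in range(1, 10)}
sample["Q10"] = 10.0
print(find_outliers(sample))
print(f"{coefficient_of_variation(sample.values()):.1f}%")
```

With only a few queries, a single slow query inflates the standard deviation itself, so a 2σ rule can miss it; a median-based rule is a common alternative for small samples.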

### 5.3 Comprehensive Visualizations

Multi-panel performance visualization:

In [None]:
if "df_results" in locals():
    # Create 2x2 subplot grid
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle("BigQuery Performance Analysis", fontsize=16, fontweight="bold")

    # 1. Distribution histogram
    axes[0, 0].hist(df_results["time"], bins=20, color="#4285F4", alpha=0.7, edgecolor="black")
    axes[0, 0].axvline(df_results["time"].mean(), color="red", linestyle="--", linewidth=2, label="Mean")
    axes[0, 0].axvline(df_results["time"].median(), color="green", linestyle="--", linewidth=2, label="Median")
    axes[0, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
    axes[0, 0].set_ylabel("Frequency", fontweight="bold")
    axes[0, 0].set_title("Execution Time Distribution")
    axes[0, 0].legend()
    axes[0, 0].grid(axis="y", alpha=0.3)

    # 2. Box plot
    bp = axes[0, 1].boxplot(df_results["time"], patch_artist=True, vert=True)
    bp["boxes"][0].set_facecolor("#4285F4")
    bp["boxes"][0].set_alpha(0.7)
    axes[0, 1].set_ylabel("Execution Time (seconds)", fontweight="bold")
    axes[0, 1].set_title("Box Plot (Outlier Detection)")
    axes[0, 1].set_xticklabels(["All Queries"])
    axes[0, 1].grid(axis="y", alpha=0.3)

    # 3. Sorted horizontal bar chart (top 15)
    df_sorted = df_results.sort_values("time", ascending=True).tail(15)
    colors = ["#EA4335" if t > df_results["time"].quantile(0.9) else "#4285F4" for t in df_sorted["time"]]
    axes[1, 0].barh(df_sorted["query"], df_sorted["time"], color=colors, alpha=0.8)
    axes[1, 0].set_xlabel("Execution Time (seconds)", fontweight="bold")
    axes[1, 0].set_title("Slowest 15 Queries")
    axes[1, 0].grid(axis="x", alpha=0.3)

    # 4. Cumulative performance (Pareto analysis)
    df_sorted_desc = df_results.sort_values("time", ascending=False)
    df_sorted_desc["cumulative_pct"] = df_sorted_desc["time"].cumsum() / df_sorted_desc["time"].sum() * 100
    axes[1, 1].plot(
        range(len(df_sorted_desc)), df_sorted_desc["cumulative_pct"], marker="o", color="#4285F4", linewidth=2
    )
    axes[1, 1].axhline(80, color="red", linestyle="--", linewidth=2, label="80% threshold")
    axes[1, 1].set_xlabel("Number of Queries (sorted by time)", fontweight="bold")
    axes[1, 1].set_ylabel("Cumulative % of Total Time", fontweight="bold")
    axes[1, 1].set_title("Pareto Analysis (80/20 Rule)")
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Pareto insight: smallest number of slowest queries whose cumulative share reaches 80%
    queries_for_80pct = min(int((df_sorted_desc["cumulative_pct"] < 80).sum()) + 1, len(df_sorted_desc))
    print("\n📊 Pareto Insight:")
    print(
        f"  {queries_for_80pct} queries ({queries_for_80pct / len(df_results) * 100:.1f}%) account for at least 80% of total time"
    )
    print(f"  💡 Focus optimization efforts on these {queries_for_80pct} queries")
else:
    print("⚠️ Load results first")

### 5.4 Cost Per Query Analysis

Analyze cost distribution across queries:
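
The SQL in the cell below converts bytes billed to dollars as `total_bytes_billed / POW(1024, 4) * 5`. The same arithmetic in plain Python is sketched here, assuming the $5 per TiB on-demand rate used in this notebook (check the BigQuery pricing page for the rate that applies to your project). The helper names and job records are hypothetical.

```python
TIB = 1024 ** 4  # bytes per tebibyte, matching POW(1024, 4) in the SQL


def bytes_to_usd(bytes_billed, usd_per_tib=5.0):
    """Convert bytes billed to an on-demand cost estimate in USD."""
    return bytes_billed / TIB * usd_per_tib


def summarize_costs(jobs, usd_per_tib=5.0):
    """Sum the estimated cost per query label from (label, bytes_billed) pairs."""
    totals = {}
    for label, bytes_billed in jobs:
        totals[label] = totals.get(label, 0.0) + bytes_to_usd(bytes_billed, usd_per_tib)
    return totals


# Hypothetical job records: (query label, bytes billed)
jobs = [("1", TIB // 2), ("2", TIB // 4), ("1", TIB // 4)]
for label, cost in sorted(summarize_costs(jobs).items()):
    print(f"query {label}: ${cost:.4f}")
```

Summing per label matters when a query runs more than once in the time window, since each run appears as a separate job row.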

In [None]:
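# Note: This requires INFORMATION_SCHEMA.JOBS access
# The cost column assumes an on-demand rate of $5 per TiB billed; adjust the multiplier to the current rate for your region
try:
    # Query job costs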
    cost_query = rf"""
    SELECT 
        REGEXP_EXTRACT(query, r'query(\d+)') as query_num,
        total_bytes_processed,
        total_bytes_billed,
        ROUND(total_bytes_billed / POW(1024, 4) * 5, 4) as cost_usd
    FROM `{BQ_PROJECT}.region-{BQ_LOCATION}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
        AND job_type = 'QUERY'
        AND state = 'DONE'
        AND query LIKE '%tpch%'
    ORDER BY creation_time
    """

    query_job = client.query(cost_query)
    cost_df = query_job.to_dataframe()

    if not cost_df.empty:
        # Visualize cost per query
        fig, ax = plt.subplots(figsize=(14, 6))

        bars = ax.bar(cost_df["query_num"], cost_df["cost_usd"], color="#34A853", alpha=0.8)

        # Highlight expensive queries (top 20%)
        threshold = cost_df["cost_usd"].quantile(0.8)
        for bar, cost in zip(bars, cost_df["cost_usd"]):
            if cost > threshold:
                bar.set_color("#EA4335")

        ax.set_xlabel("Query Number", fontsize=12, fontweight="bold")
        ax.set_ylabel("Cost (USD)", fontsize=12, fontweight="bold")
        ax.set_title("Cost Per Query on BigQuery", fontsize=14, fontweight="bold")
        ax.grid(axis="y", alpha=0.3)

        plt.tight_layout()
        plt.show()

        # Cost summary
        total_cost = cost_df["cost_usd"].sum()
        total_gb = cost_df["total_bytes_processed"].sum() / (1024**3)

        print("\n💰 Cost Summary:")
        print(f"  Total cost: ${total_cost:.4f}")
        print(f"  Total data processed: {total_gb:.2f} GB")
        print(f"  Average cost/query: ${total_cost / len(cost_df):.4f}")
        print(f"  Most expensive query: ${cost_df['cost_usd'].max():.4f}")
    else:
        print("⚠️ No cost data available")

except Exception as e:
    print(f"⚠️ Could not analyze costs: {e}")
    print("This requires roles/bigquery.resourceViewer permission")

### 5.5 Bytes Processed Analysis

Analyze data scanning patterns:
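
One way to see how many bytes a query would scan before running it is a dry run, which validates the query and reports `total_bytes_processed` without executing it. The sketch below assumes an authenticated `bigquery.Client` is passed in together with a `bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)`. The helper name `estimate_scan` and the $5 per TiB default are illustrative assumptions.

```python
TIB = 1024 ** 4  # bytes per tebibyte


def estimate_scan(client, sql, job_config, usd_per_tib=5.0):
    """Submit `sql` as a dry run and report the bytes it would process.

    `client` is expected to behave like google.cloud.bigquery.Client and
    `job_config` like bigquery.QueryJobConfig(dry_run=True, use_query_cache=False).
    """
    job = client.query(sql, job_config=job_config)  # dry run: nothing is executed
    scanned = job.total_bytes_processed
    return {
        "bytes": scanned,
        "gib": scanned / 1024 ** 3,
        "estimated_usd": scanned / TIB * usd_per_tib,  # assumed on-demand rate
    }


# Usage in this notebook (needs credentials, so it is left commented out):
# cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# print(estimate_scan(client, "SELECT 1", cfg))
```

Running the same helper against a partitioned and an unpartitioned copy of a table is a quick way to check whether a filter actually prunes partitions.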

In [None]:
print("📊 Bytes Processed Analysis\n")
print("Why This Matters:")
print("  - On-demand pricing is per TB processed ($5/TB assumed here; check current pricing)")
print("  - Scanning less data = lower costs")
print("  - Partitioning/clustering reduces bytes scanned\n")

print("Optimization Strategies:")
print("  1. Use partitioned tables (reduce scan range)")
print("  2. Use clustered tables (prune irrelevant blocks)")
print("  3. SELECT only needed columns (avoid SELECT *)")
print("  4. Use WHERE clauses early (partition pruning)")
print("  5. Avoid CROSS JOINs (Cartesian product scanning)\n")

print("Example Impact:")
print("  Scenario: 10 TB table, querying 1 month of data")
print("  Without partitioning: Scans 10 TB = $50")
print("  With daily partitioning: Scans 0.3 TB = $1.50")
print("  Savings: $48.50 (97% reduction)\n")

print("💡 Use the Cost Per Query Analysis cell above to see actual bytes processed")

### 5.6 Regression Detection

Compare against baseline to detect performance regressions:
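
The cell below uses a mock baseline. The file-based variant it mentions is sketched here, assuming the baseline CSV has `query` and `time` columns like `df_results`. `load_baseline` and `find_regressions` are illustrative names, the sample data is made up, and this version flags only slowdowns.

```python
import csv
import io


def load_baseline(fileobj):
    """Read {query: seconds} from a CSV with `query` and `time` columns."""
    return {row["query"]: float(row["time"]) for row in csv.DictReader(fileobj)}


def find_regressions(baseline, current, threshold_pct=20.0):
    """Return {query: change_pct} for queries slower than baseline by more than the threshold."""
    regressions = {}
    for query, seconds in current.items():
        base = baseline.get(query)
        if not base:  # skip queries without a usable (non-zero) baseline
            continue
        change_pct = (seconds - base) / base * 100
        if change_pct > threshold_pct:
            regressions[query] = change_pct
    return regressions


# Hypothetical data; in practice use open("baseline_bigquery_tpch.csv", newline="")
baseline_csv = io.StringIO("query,time\nQ1,1.00\nQ2,2.00\nQ3,0.50\n")
baseline = load_baseline(baseline_csv)
current = {"Q1": 1.05, "Q2": 3.00, "Q3": 0.45}
for query, change in find_regressions(baseline, current).items():
    print(f"{query}: {change:+.1f}% vs baseline")
```

Comparing per query rather than on the overall mean keeps one large regression from being masked by small improvements elsewhere.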

In [None]:
if "df_results" in locals():
    # Compare against baseline (you can load from a saved baseline file)
    # For demonstration, we'll use a mock baseline
    baseline_avg = 1.5  # seconds (mock baseline)
    current_avg = df_results["time"].mean()

    # Calculate change
    change_pct = ((current_avg - baseline_avg) / baseline_avg) * 100

    print("🔍 Performance Regression Analysis\n")
    print(f"Baseline average: {baseline_avg:.2f}s")
    print(f"Current average: {current_avg:.2f}s")
    print(f"Change: {change_pct:+.1f}%\n")

    # Threshold: 10% change
    if abs(change_pct) > 10:
        if change_pct > 0:
            status = "❌ REGRESSION DETECTED"
            print(status)
            print(f"Performance degraded by {change_pct:.1f}%\n")
            print("💡 Investigation Steps:")
            print("  1. Check for slot contention (INFORMATION_SCHEMA.JOBS)")
            print("  2. Verify data size hasn't increased unexpectedly")
            print("  3. Review recent schema changes (partitions, clustering)")
            print("  4. Check for concurrent workloads")
            print("  5. Verify query cache hit rate")
        else:
            status = "✅ PERFORMANCE IMPROVEMENT"
            print(status)
            print(f"Performance improved by {abs(change_pct):.1f}%\n")
            print("💡 Possible Reasons:")
            print("  - Partitioning/clustering optimization")
            print("  - BI Engine caching")
            print("  - Query optimization")
            print("  - Increased slot allocation")
    else:
        print("✅ Performance stable (within 10% threshold)\n")

    # Per-query regression analysis
    print("\n📊 Per-Query Regression (mock baseline):")
    for _, row in df_results.head(5).iterrows():
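        # Mock per-query baseline (in practice, load from saved file).
        # Note: hash() of a str differs between Python processes unless PYTHONHASHSEED is set,
        # so these mock values change from one kernel session to the next.
        query_baseline = baseline_avg * (0.8 + 0.4 * (hash(row["query"]) % 100) / 100)
        query_change = ((row["time"] - query_baseline) / query_baseline) * 100

        status_icon = "⚠️" if abs(query_change) > 20 else "✅"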
        print(
            f"{status_icon} {row['query']}: {row['time']:.2f}s (baseline: {query_baseline:.2f}s, change: {query_change:+.1f}%)"
        )

    print("\n💡 Save current run as new baseline:")
    print("   df_results.to_csv('baseline_bigquery_tpch.csv', index=False)")
else:
    print("⚠️ Load results first")

## Troubleshooting

### 6.1 Connection Diagnostics

Comprehensive connection troubleshooting:
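
The diagnostic cell below checks only that the credentials file exists. A slightly stricter structural check is sketched here, assuming `GOOGLE_APPLICATION_CREDENTIALS` points at a service account key. Other credential types, such as `authorized_user` files, are also valid for ADC and would be reported as an unexpected type. The helper name is illustrative.

```python
import json
import os

# Fields present in a service account key file
REQUIRED_KEYS = ("type", "project_id", "client_email", "private_key")


def check_service_account_file(path):
    """Return a list of problems found in a service account key file (empty if none)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    try:
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return [f"cannot read JSON: {exc}"]
    if not isinstance(info, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in info]
    if info.get("type") not in (None, "service_account"):
        problems.append(f"unexpected type: {info['type']!r}")
    return problems


creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if creds_path:
    print(check_service_account_file(creds_path) or "key file looks structurally valid")
else:
    print("GOOGLE_APPLICATION_CREDENTIALS not set; ADC will be used")
```

This validates only the file's shape. Whether the key is still active and has the right roles is what the API and dataset checks in the cell below establish.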

In [None]:
def diagnose_bigquery_connection():
    """Diagnose BigQuery connection issues"""
    print("🔍 BigQuery Connection Diagnostic\n")

    # Check 1: Environment variables
    print("1️⃣ Checking environment variables...")
    if os.environ.get("BIGQUERY_PROJECT"):
        print(f"   ✅ BIGQUERY_PROJECT = {os.environ.get('BIGQUERY_PROJECT')}")
    else:
        print("   ⚠️  BIGQUERY_PROJECT not set")

    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        print(f"   ✅ GOOGLE_APPLICATION_CREDENTIALS = {creds_path}")
        if os.path.exists(creds_path):
            print("   ✅ Credentials file exists")
        else:
            print("   ❌ Credentials file not found")
    else:
        print("   ℹ️  GOOGLE_APPLICATION_CREDENTIALS not set (using ADC)")

    # Check 2: ADC availability
    print("\n2️⃣ Checking Application Default Credentials...")
    try:
        test_client = bigquery.Client()
        print(f"   ✅ ADC available (project: {test_client.project})")
    except Exception as e:
        print(f"   ❌ ADC not available: {e}")

    # Check 3: API connectivity
    print("\n3️⃣ Testing BigQuery API connectivity...")
    try:
        client = bigquery.Client(project=BQ_PROJECT)
        query = "SELECT 1 as test"
        query_job = client.query(query)
        results = query_job.result()
        print("   ✅ API connectivity successful")
    except Exception as e:
        print(f"   ❌ API connectivity failed: {e}")

    # Check 4: Dataset access
    print("\n4️⃣ Checking dataset access...")
    try:
        client = bigquery.Client(project=BQ_PROJECT)
        dataset_id = f"{BQ_PROJECT}.{BQ_DATASET}"
        dataset = client.get_dataset(dataset_id)
        print(f"   ✅ Dataset accessible: {dataset_id}")
    except Exception as e:
        print(f"   ❌ Dataset access failed: {e}")

    print("\n" + "=" * 60)
    print("📚 Troubleshooting Guide:\n")
    print("If authentication fails:")
    print("  1. Run: gcloud auth application-default login")
    print("  2. Or set GOOGLE_APPLICATION_CREDENTIALS to service account key")
    print("  3. Verify project ID is correct\n")

    print("If API connectivity fails:")
    print("  1. Enable BigQuery API: https://console.cloud.google.com/apis/library/bigquery.googleapis.com")
    print("  2. Verify billing is enabled")
    print("  3. Check firewall/network settings\n")

    print("If dataset access fails:")
    print("  1. Verify IAM permissions (roles/bigquery.admin or roles/bigquery.dataEditor)")
    print("  2. Create dataset if it doesn't exist")
    print("  3. Check dataset location matches client location")


# Run diagnostics
diagnose_bigquery_connection()

### 6.2 Permission Validation

Verify required IAM permissions:
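
The validation cell below repeats one pattern: try an operation, report success or the error, and move on to the next check. A small helper can express that pattern once, as sketched here. `run_checks` is an illustrative name, and the stand-in checks simulate results rather than calling BigQuery.

```python
def run_checks(checks):
    """Run each zero-argument callable and record (passed, detail) per check name."""
    results = {}
    for name, check in checks.items():
        try:
            detail = check()
            results[name] = (True, "" if detail is None else str(detail))
        except Exception as exc:  # report the failure instead of stopping
            results[name] = (False, f"{type(exc).__name__}: {exc}")
    return results


def _always_fails():
    raise PermissionError("403 Access Denied (simulated)")


# Stand-in checks; in the notebook these would wrap calls such as
# client.list_datasets() or client.query("SELECT 1").result()
demo = run_checks({
    "list datasets": lambda: "2 found",
    "run query": _always_fails,
})
for name, (passed, detail) in demo.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}  {detail}")
```

Because every check runs regardless of earlier failures, a single pass shows which permissions are missing.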

In [None]:
def validate_bigquery_permissions():
    """Validate BigQuery permissions"""
    print("🔒 BigQuery Permission Validation\n")

    client = bigquery.Client(project=BQ_PROJECT)

    # Test 1: List datasets
    print("1️⃣ Testing dataset listing (bigquery.datasets.get)...")
    try:
        datasets = list(client.list_datasets())
        print(f"   ✅ Can list datasets ({len(datasets)} found)")
    except Exception as e:
        print(f"   ❌ Cannot list datasets: {e}")

    # Test 2: Create table
    print("\n2️⃣ Testing table creation (bigquery.tables.create)...")
    try:
        test_table_id = f"{BQ_PROJECT}.{BQ_DATASET}.benchbox_test_table"
        schema = [bigquery.SchemaField("test_col", "STRING")]
        table = bigquery.Table(test_table_id, schema=schema)
        table = client.create_table(table, exists_ok=True)
        print("   ✅ Can create tables")

        # Clean up
        client.delete_table(test_table_id)
        print("   ✅ Can delete tables")
    except Exception as e:
        print(f"   ❌ Cannot create/delete tables: {e}")

    # Test 3: Query execution
    print("\n3️⃣ Testing query execution (bigquery.jobs.create)...")
    try:
        query = "SELECT 1 as test"
        query_job = client.query(query)
        results = query_job.result()
        print("   ✅ Can execute queries")
    except Exception as e:
        print(f"   ❌ Cannot execute queries: {e}")

    # Test 4: INFORMATION_SCHEMA access
    print("\n4️⃣ Testing INFORMATION_SCHEMA access (bigquery.jobs.listAll)...")
    try:
        query = f"""
        SELECT job_id
        FROM `{BQ_PROJECT}.region-{BQ_LOCATION}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
        LIMIT 1
        """
        query_job = client.query(query)
        results = query_job.result()
        print("   ✅ Can access INFORMATION_SCHEMA (cost analysis available)")
    except Exception as e:
        print(f"   ⚠️  Cannot access INFORMATION_SCHEMA: {e}")
        print("   Note: This requires roles/bigquery.resourceViewer")

    print("\n" + "=" * 60)
    print("📋 Required Roles:\n")
    print("Minimum (for benchmarking):")
    print("  - roles/bigquery.dataEditor (create/delete tables)")
    print("  - roles/bigquery.jobUser (execute queries)\n")
    print("Recommended (full features):")
    print("  - roles/bigquery.admin (all operations)\n")
    print("Optional (cost analysis):")
    print("  - roles/bigquery.resourceViewer (INFORMATION_SCHEMA access)")


# Run validation
try:
    validate_bigquery_permissions()
except Exception as e:
    print(f"❌ Permission validation failed: {e}")

### 6.3 Quota and Limits Checking

Verify quotas and limits for large benchmark runs:
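
Two of the loading limits listed in the cell below interact: the maximum number of files per load job and the number of load jobs allowed per table per day. The rough planning sketch here takes both figures as quoted in this notebook, so verify them against the current quota documentation. The function names are illustrative.

```python
import math

MAX_FILES_PER_LOAD_JOB = 10_000      # "Maximum files per load" as quoted in this notebook
LOAD_JOBS_PER_TABLE_PER_DAY = 1_500  # "Load jobs per table" as quoted in this notebook


def load_jobs_needed(n_files, max_files=MAX_FILES_PER_LOAD_JOB):
    """Minimum number of load jobs needed to load `n_files` files into one table."""
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    return math.ceil(n_files / max_files)


def reloads_per_day(n_files, daily_limit=LOAD_JOBS_PER_TABLE_PER_DAY):
    """How many full reloads of one table fit within the daily load-job limit."""
    return daily_limit // load_jobs_needed(n_files)


for n_files in (8, 10_000, 25_000):
    print(n_files, load_jobs_needed(n_files), reloads_per_day(n_files))
```

For typical benchmark data sets, which are split into a handful of files per table, one load job per table is enough and the daily limit is unlikely to bind.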

In [None]:
print("📊 BigQuery Quotas and Limits\n")
print("Query Limits (per project):")
print("  - Interactive queries: 2000 concurrent")
print("  - Batch queries: 100 concurrent")
print("  - Query size: 1 MB per query (text)")
print("  - Response size: 10 GB (with pagination)\n")

print("Loading Quotas:")
print("  - Load jobs per table: 1,500 per day")
print("  - Load jobs per project: 100,000 per day")
print("  - Maximum file size: 5 TB (uncompressed)")
print("  - Maximum files per load: 10,000\n")

print("Slot Quotas (on-demand):")
print("  - Default slots: 2000")
print("  - Burst capacity: Up to 2000 slots")
print("  - Fair scheduling: Auto slot allocation\n")

print("Storage Limits:")
print("  - Maximum table size: Unlimited")
print("  - Maximum columns: 10,000 per table")
print("  - Maximum row size: 100 MB")
print("  - Maximum nested depth: 15 levels\n")

print("Rate Limits:")
print("  - API requests: 100 per second per user")
print("  - Streaming inserts: 100,000 rows/second per table")
print("  - Dataset metadata operations: 100 per second\n")

print("💡 For Large Benchmarks:")
print("  1. Use batch queries for non-urgent workloads (lower priority)")
print("  2. Request quota increases if needed (IAM console)")
print("  3. Monitor quota usage: https://console.cloud.google.com/iam-admin/quotas")
print("  4. Consider capacity-based pricing (BigQuery editions, formerly flat-rate) for sustained high usage\n")

print("🔍 Check Current Usage:")
print("  https://console.cloud.google.com/iam-admin/quotas?service=bigquery.googleapis.com")

### 6.4 Common Issues and Solutions

Quick reference for common BigQuery benchmarking issues:

In [None]:
print("🔧 Common BigQuery Benchmarking Issues\n")
print("=" * 70)

print("\n❌ Issue: 'Permission denied' errors")
print("✅ Solution:")
print("   1. Verify IAM role: roles/bigquery.admin or roles/bigquery.dataEditor")
print("   2. Check dataset-level permissions (Dataset Info → Permissions)")
print("   3. Ensure project billing is enabled")
print("   4. Re-authenticate: gcloud auth application-default login\n")

print("❌ Issue: 'Quota exceeded' errors")
print("✅ Solution:")
print("   1. Check quotas: https://console.cloud.google.com/iam-admin/quotas")
print("   2. Wait for quota reset (daily quotas reset at midnight PST)")
print("   3. Request quota increase in GCP console")
print("   4. Use batch queries (lower priority, less quota consumption)\n")

print("❌ Issue: Slow query performance")
print("✅ Solution:")
print("   1. Inspect the query plan (Execution details in the console or QueryJob.query_plan; BigQuery has no EXPLAIN statement)")
print("   2. Check INFORMATION_SCHEMA.JOBS for bytes processed")
print("   3. Add partitioning/clustering to reduce scan size")
print("   4. Verify slot allocation (may be limited by concurrent workloads)")
print("   5. Consider BI Engine for repeated queries\n")

print("❌ Issue: High costs")
print("✅ Solution:")
print("   1. Use partitioned tables (scan only needed partitions)")
print("   2. Add clustering (reduce bytes scanned)")
print("   3. Avoid SELECT * (specify columns)")
print("   4. Use materialized views for repeated aggregations")
print("   5. Consider capacity-based pricing (BigQuery editions, formerly flat-rate) for sustained workloads (>$50k/month)\n")

print("❌ Issue: 'Resources exceeded' during query")
print("✅ Solution:")
print("   1. Break query into smaller chunks")
print("   2. Use approximate aggregations (APPROX_COUNT_DISTINCT)")
print("   3. Filter early (WHERE before JOIN)")
print("   4. Increase destination table expiration")
print("   5. Use intermediate tables for complex multi-stage queries\n")

print("❌ Issue: Data loading failures")
print("✅ Solution:")
print("   1. Check file format (CSV, JSON, Parquet)")
print("   2. Verify schema matches data")
print("   3. Use autodetect=True for schema inference")
print("   4. Check for malformed rows (enable bad record handling)")
print("   5. Split large files into smaller chunks (<5 GB each)\n")

print("❌ Issue: 'Not found: Dataset' errors")
print("✅ Solution:")
print("   1. Create dataset: bq mk --dataset PROJECT_ID:DATASET_NAME")
print("   2. Verify dataset location matches client location")
print("   3. Check dataset name spelling (case-sensitive)")
print("   4. Ensure project ID is correct\n")

print("=" * 70)
print("\n💡 More Help:")
print("  - BigQuery docs: https://cloud.google.com/bigquery/docs")
print("  - Troubleshooting guide: https://cloud.google.com/bigquery/docs/troubleshoot")
print("  - BenchBox docs: https://github.com/joeharris76/benchbox")

## Next Steps

**Continue Learning:**
- Explore other cloud platforms: Snowflake, Databricks, Redshift
- Try different benchmarks: TPC-DS, ClickBench, SSB
- Compare platforms with multi-platform notebooks
- Set up CI/CD regression testing

**Platform-Specific Features to Explore:**
- BI Engine (in-memory acceleration)
- Materialized views (pre-computed aggregations)
- External tables (query data in GCS)
- BigQuery ML (machine learning integration)
- Data transfer service (automated ETL)

**Resources:**
- [BenchBox Documentation](https://github.com/joeharris76/benchbox)
- [BigQuery Documentation](https://cloud.google.com/bigquery/docs)
- [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices)
- [BigQuery Pricing](https://cloud.google.com/bigquery/pricing)