# DuckDB Local Benchmarking with BenchBox

This notebook demonstrates comprehensive benchmarking of DuckDB, an embedded analytical database designed for fast analytics on local data.

**What you'll learn:**
- Running TPC-H and TPC-DS benchmarks with DuckDB
- Optimizing memory and threading for performance
- Working with different file formats (CSV, Parquet, JSON)
- Integrating with pandas DataFrames
- Using DuckDB extensions for enhanced capabilities
- Comparing persistent vs in-memory performance

**Why DuckDB?**
- **Embedded**: No server, no configuration - runs in-process
- **Fast**: Vectorized execution, optimized for analytics
- **Portable**: Single-file database or pure in-memory
- **Versatile**: Query CSV/Parquet/JSON files directly
- **Zero-cost**: Free and open source

**Prerequisites:**
- Python 3.8+
- Sufficient disk space for test data (~100MB-10GB depending on scale)
- Recommended: 4GB+ RAM for larger scale factors

**Estimated time:** 5-15 minutes (scale factor 0.01-1.0)

## 1. Installation & Setup

### Install Required Packages

Install BenchBox and DuckDB.

In [None]:
!pip install -q benchbox duckdb pandas matplotlib seaborn psutil

### Import Libraries

Import BenchBox components and visualization libraries.

In [None]:
import os
import warnings
from datetime import datetime
from pathlib import Path

warnings.filterwarnings("ignore")

# Visualization imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# BenchBox imports
from benchbox.core.config import BenchmarkConfig, DatabaseConfig
from benchbox.core.results.exporter import ResultExporter
from benchbox.core.results.loader import ResultLoader
from benchbox.core.runner import LifecyclePhases, run_benchmark_lifecycle

# DuckDB import
try:
    import duckdb

    print(f"✅ DuckDB {duckdb.__version__} imported successfully")
except ImportError as e:
    print(f"❌ Error importing DuckDB: {e}")
    print("   Install with: pip install duckdb")

# System monitoring
try:
    import psutil

    print("✅ psutil imported for system monitoring")
except ImportError:
    print("⚠️  psutil not available - install for system monitoring: pip install psutil")
    psutil = None

# Configure plotting
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")
%matplotlib inline

print("\n📦 All libraries imported successfully")

### Configure DuckDB

DuckDB offers two modes:

**1. In-Memory Mode (Fastest)**
- All data stored in RAM
- No persistence between sessions
- Best for: Quick tests, temporary analysis
```python
conn = duckdb.connect(':memory:')
```

**2. Persistent Mode (Recommended)**
- Data stored in single database file
- Persists between sessions
- Best for: Repeated testing, data reuse
```python
conn = duckdb.connect('benchbox.duckdb')
```

**Configuration Options:**
- **Threads**: Set worker threads (default: all CPU cores)
- **Memory**: Limit memory usage (default: 80% of system RAM)
- **Temp Directory**: Location for spill-to-disk operations

In [None]:
# Configure benchmark settings
config = {
    "mode": "persistent",  # or "memory"
    "database_file": "./benchmark_runs/duckdb/benchbox.duckdb",
    "threads": os.cpu_count(),  # Use all CPU cores
    "memory_limit": "4GB",  # Adjust based on your system
    "temp_directory": "./benchmark_runs/duckdb/temp",
    # Scale factors to test
    "scale_factors": [0.01, 0.1, 1.0],  # 10MB, 100MB, 1GB
    # Output directory
    "output_dir": "./benchmark_results",
}

# Create directories
Path(config["database_file"]).parent.mkdir(parents=True, exist_ok=True)
Path(config["temp_directory"]).mkdir(parents=True, exist_ok=True)
Path(config["output_dir"]).mkdir(parents=True, exist_ok=True)

# Get system information
if psutil:
    total_ram = psutil.virtual_memory().total / (1024**3)  # GB
    available_ram = psutil.virtual_memory().available / (1024**3)  # GB
    print("💻 System Information:")
    print(f"   CPU Cores: {config['threads']}")
    print(f"   Total RAM: {total_ram:.1f} GB")
    print(f"   Available RAM: {available_ram:.1f} GB")
    print(f"   DuckDB Memory Limit: {config['memory_limit']}")
else:
    print("💻 System Information:")
    print(f"   CPU Cores: {config['threads']}")
    print(f"   DuckDB Memory Limit: {config['memory_limit']}")

print("\n✅ Configuration complete")
print(f"   Mode: {config['mode']}")
if config["mode"] == "persistent":
    print(f"   Database: {config['database_file']}")
print(f"   Output directory: {config['output_dir']}")

### Test DuckDB Connection

Verify DuckDB is working and check its capabilities.

In [None]:
try:
    # Connect to DuckDB
    if config["mode"] == "memory":
        conn = duckdb.connect(":memory:")
    else:
        conn = duckdb.connect(config["database_file"])

    # Configure settings
    conn.execute(f"SET threads TO {config['threads']};")
    conn.execute(f"SET memory_limit = '{config['memory_limit']}';")
    conn.execute(f"SET temp_directory = '{config['temp_directory']}';")

    # Check version and settings
    version = conn.execute("SELECT version();").fetchone()[0]
    print("✅ Connected to DuckDB")
    print(f"   Version: {version}")

    # Check current settings
    settings = conn.execute("""
        SELECT name, value 
        FROM duckdb_settings() 
        WHERE name IN ('threads', 'memory_limit', 'temp_directory')
        ORDER BY name;
    """).fetchall()

    print("\n⚙️  Current Settings:")
    for name, value in settings:
        print(f"   {name}: {value}")

    # Check available extensions
    extensions = conn.execute("""
        SELECT extension_name, loaded 
        FROM duckdb_extensions() 
        WHERE extension_name IN ('parquet', 'json', 'httpfs', 'fts')
        ORDER BY extension_name;
    """).fetchall()

    if extensions:
        print("\n🔌 Available Extensions:")
        for name, loaded in extensions:
            status = "loaded" if loaded else "available"
            print(f"   {name}: {status}")

    # Simple test query
    result = conn.execute("SELECT 42 as answer, 'DuckDB' as database;").fetchone()
    print(f"\n✅ Test query successful: {result}")

    conn.close()
    print("\n✅ Connection test passed!")

except Exception as e:
    print(f"❌ Connection failed: {e}")
    raise

## 2. Quick Start Example

### Run TPC-H Power Test

Execute a TPC-H power test at scale factor 0.01 (10MB). This runs all 22 TPC-H queries sequentially.

**What happens:**
1. Generate TPC-H data (customer, orders, lineitem, etc.)
2. Create tables in DuckDB
3. Load data from generated files
4. Execute 22 queries and measure performance

**Expected time:** ~30-60 seconds at SF 0.01 on modern hardware

**Note**: DuckDB is extremely fast for small datasets. You may see sub-second query times!
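
The summary printed by the next cell includes a geometric mean of the per-query times. The sketch below shows how that statistic behaves, assuming the standard definition (BenchBox's own calculation may differ in details such as how failed queries are handled). The function name and sample timings are illustrative, not measured values.

```python
import math


def geometric_mean(times):
    """Return the geometric mean of strictly positive timings (seconds)."""
    if not times:
        raise ValueError("need at least one timing")
    if any(t <= 0 for t in times):
        raise ValueError("timings must be positive")
    # Averaging in log space avoids underflow when multiplying many small values.
    return math.exp(sum(math.log(t) for t in times) / len(times))


sample_times = [0.010, 0.040, 0.160]  # illustrative values only
gm = geometric_mean(sample_times)
am = sum(sample_times) / len(sample_times)
print(f"geometric mean: {gm:.3f}s, arithmetic mean: {am:.3f}s")
```

Because it averages ratios rather than absolute values, the geometric mean is less dominated by one slow query than the arithmetic mean, which is why it is a common summary for power tests.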

In [None]:
# Configure database connection
db_cfg = DatabaseConfig(type="duckdb", name="duckdb-local")
platform_cfg = {
    "database": config["database_file"] if config["mode"] == "persistent" else ":memory:",
    "threads": config["threads"],
    "memory_limit": config["memory_limit"],
}

# Configure TPC-H benchmark
bench_cfg = BenchmarkConfig(
    name="tpch", display_name="TPC-H Power Test", scale_factor=0.01, test_execution_type="power"
)

# Track start time
start_time = datetime.now()

# Run complete lifecycle
print("🚀 Starting TPC-H power test on DuckDB...\n")
results = run_benchmark_lifecycle(
    benchmark_config=bench_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

end_time = datetime.now()
total_time = (end_time - start_time).total_seconds()

print("\n✅ TPC-H power test completed!")
print(f"   Benchmark: {results.benchmark_name}")
print(f"   Total queries: {len(results.query_results)}")
print(f"   Geometric mean: {results.geometric_mean:.3f}s")
print(f"   Total execution time: {results.total_execution_time:.2f}s")
print(f"   Wall clock time: {total_time:.2f}s")

### Visualize Results

Create a bar chart showing execution time for each query.

In [None]:
if results.query_results:
    query_names = [qr.query_name for qr in results.query_results]
    execution_times = [qr.execution_time for qr in results.query_results]

    fig, ax = plt.subplots(figsize=(14, 6))
    bars = ax.bar(query_names, execution_times, color="#FFC220", alpha=0.8, edgecolor="black")

    # Highlight slowest queries
    max_time = max(execution_times)
    for i, (bar, time) in enumerate(zip(bars, execution_times)):
        if time > max_time * 0.7:  # Within 30% of the slowest query's time
            bar.set_color("#FF6F00")  # DuckDB orange accent
            # Annotate with time
            ax.text(i, time + max_time * 0.02, f"{time:.3f}s", ha="center", va="bottom", fontsize=8)

    ax.set_xlabel("Query", fontsize=12, fontweight="bold")
    ax.set_ylabel("Execution Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Query Performance on DuckDB (SF 0.01)", fontsize=14, fontweight="bold")
    ax.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    print("\n📊 Performance Summary:")
    print(f"   Fastest query: {query_names[execution_times.index(min(execution_times))]} ({min(execution_times):.3f}s)")
    print(f"   Slowest query: {query_names[execution_times.index(max(execution_times))]} ({max(execution_times):.3f}s)")
    print(f"   Median time: {np.median(execution_times):.3f}s")

    # Calculate queries per second
    qps = len(execution_times) / results.total_execution_time
    print(f"   Throughput: {qps:.1f} queries/second")
else:
    print("⚠️  No query results to visualize")

### Monitor Resource Usage

Take a snapshot of system resource usage after the benchmark run.

In [None]:
if psutil:
    # Get current resource usage
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage(".")

    print("💻 Resource Usage:\n")
    print(f"CPU Usage: {cpu_percent}%")
    print(f"Memory: {memory.used / (1024**3):.1f} GB / {memory.total / (1024**3):.1f} GB ({memory.percent}%)")
    print(f"Disk: {disk.used / (1024**3):.1f} GB / {disk.total / (1024**3):.1f} GB ({disk.percent}%)")

    # Check database file size if persistent
    if config["mode"] == "persistent" and Path(config["database_file"]).exists():
        db_size = Path(config["database_file"]).stat().st_size / (1024**2)  # MB
        print(f"\n💾 Database File: {db_size:.1f} MB")
else:
    print("⚠️  Resource monitoring not available (install psutil)")

    # Still check database file size
    if config["mode"] == "persistent" and Path(config["database_file"]).exists():
        db_size = Path(config["database_file"]).stat().st_size / (1024**2)  # MB
        print(f"💾 Database File: {db_size:.1f} MB")

### Results Overview

Display detailed results including per-query breakdown.

In [None]:
print("📊 Detailed Results:\n")
print(f"Benchmark: {results.benchmark_name}")
print(f"Platform: {results.platform}")
print(f"Scale Factor: {results.scale_factor}")
print(f"Test Type: {results.test_execution_type}")
print(f"Timestamp: {results.start_time}")
print("\nExecution Summary:")
print(f"  Total queries: {len(results.query_results)}")
print(f"  Successful: {sum(1 for qr in results.query_results if qr.success)}")
print(f"  Failed: {sum(1 for qr in results.query_results if not qr.success)}")
print(f"  Geometric mean: {results.geometric_mean:.3f}s")
print(f"  Total time: {results.total_execution_time:.2f}s")

if results.data_generation_time:
    print(f"\nData Generation: {results.data_generation_time:.2f}s")
if results.data_loading_time:
    print(f"Data Loading: {results.data_loading_time:.2f}s")

print("\n📋 Query Breakdown:")
for qr in results.query_results[:5]:  # Show first 5
    status = "✅" if qr.success else "❌"
    print(f"  {status} {qr.query_name}: {qr.execution_time:.3f}s")
if len(results.query_results) > 5:
    print(f"  ... and {len(results.query_results) - 5} more queries")

print("\n💡 DuckDB Performance Insight:")
print(f"   DuckDB executed {len(results.query_results)} queries in {results.total_execution_time:.2f}s")
print(f"   Average query time: {results.total_execution_time / len(results.query_results):.3f}s")
print("   This is excellent performance for an embedded database!")

## 3. Advanced Examples

### TPC-DS Benchmark

Run a five-query subset of the more complex TPC-DS benchmark (99 queries) for faster iteration.

In [None]:
# Run TPC-DS with query subset
tpcds_cfg = BenchmarkConfig(
    name="tpcds",
    display_name="TPC-DS Sample",
    scale_factor=0.01,
    test_execution_type="power",
    query_numbers=[1, 2, 3, 10, 25],  # Run subset for faster results
)

print("🚀 Running TPC-DS subset on DuckDB...\n")
tpcds_results = run_benchmark_lifecycle(
    benchmark_config=tpcds_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=True, load=True, execute=True),
)

print(f"\n✅ TPC-DS completed: {tpcds_results.geometric_mean:.3f}s geometric mean")
print(f"   Queries executed: {len(tpcds_results.query_results)}")
print("   DuckDB handles TPC-DS's complex queries efficiently!")

### Scale Factor Comparison

Compare performance across different data sizes to see how DuckDB scales.
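
The cell below reports an "efficiency" figure for each step up in scale factor: the data growth divided by the time growth. A minimal sketch of that arithmetic follows. The helper names, the 5% tolerance band, and the sample timings are hypothetical and chosen only for illustration.

```python
def scaling_efficiency(sf_small, time_small, sf_large, time_large):
    """Data growth divided by time growth between two scale factors."""
    data_mult = sf_large / sf_small
    time_mult = time_large / time_small
    return data_mult / time_mult


def describe_scaling(efficiency, tolerance=0.05):
    """Label the scaling behaviour; the tolerance band is an arbitrary choice."""
    if efficiency > 1 + tolerance:
        return "sub-linear"  # time grew more slowly than data
    if efficiency < 1 - tolerance:
        return "super-linear"  # time grew faster than data
    return "linear"


# Illustrative numbers: 10x more data took 5x longer.
eff = scaling_efficiency(0.01, 0.02, 0.1, 0.10)
label = describe_scaling(eff)
print(f"efficiency: {eff:.2f}x ({label})")
```

At very small scale factors, fixed per-query overhead can dominate the timings, so a high efficiency figure between SF 0.01 and SF 0.1 does not necessarily predict behaviour at larger sizes.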

In [None]:
scale_results = {}

for sf in [0.01, 0.1]:  # Test 10MB and 100MB
    print(f"\n🚀 Running TPC-H at scale factor {sf}...")

    sf_cfg = BenchmarkConfig(
        name="tpch",
        display_name=f"TPC-H SF {sf}",
        scale_factor=sf,
        test_execution_type="power",
        query_numbers=list(range(1, 11)),  # First 10 queries only
    )

    sf_results = run_benchmark_lifecycle(
        benchmark_config=sf_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=platform_cfg,
        phases=LifecyclePhases(generate=True, load=True, execute=True),
    )

    scale_results[sf] = sf_results.geometric_mean
    print(f"   Geometric mean: {sf_results.geometric_mean:.3f}s")

# Visualize scaling
if len(scale_results) > 1:
    fig, ax = plt.subplots(figsize=(10, 6))
    sfs = list(scale_results.keys())
    times = list(scale_results.values())

    ax.plot(sfs, times, marker="o", linewidth=2, markersize=10, color="#FFC220")
    ax.set_xlabel("Scale Factor", fontsize=12, fontweight="bold")
    ax.set_ylabel("Geometric Mean Time (seconds)", fontsize=12, fontweight="bold")
    ax.set_title("TPC-H Scaling on DuckDB", fontsize=14, fontweight="bold")
    ax.grid(True, alpha=0.3)
    ax.set_xscale("log")
    plt.tight_layout()
    plt.show()

    print("\n📊 Scaling Analysis:")
    for i in range(1, len(sfs)):
        data_mult = sfs[i] / sfs[i - 1]
        time_mult = times[i] / times[i - 1]
        efficiency = data_mult / time_mult
        print(
            f"   SF {sfs[i - 1]} → {sfs[i]}: {data_mult:.0f}x data, {time_mult:.2f}x time (efficiency: {efficiency:.2f}x)"
        )

    print("\n💡 An efficiency above 1.0x means time grew more slowly than data (sub-linear scaling).")

### Query Subset Selection

Run specific queries for targeted testing or CI/CD pipelines.
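
For CI/CD use, a subset run is most useful when its result is compared against a stored baseline. The sketch below shows one possible gate. The function name, the 10% tolerance, and the sample numbers are all hypothetical and not part of BenchBox.

```python
def exceeds_baseline(current, baseline, tolerance=0.10):
    """Return True when `current` is more than `tolerance` slower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current > baseline * (1 + tolerance)


# Illustrative geometric means in seconds (not measured values).
regressed = exceeds_baseline(current=0.115, baseline=0.100)
within_budget = exceeds_baseline(current=0.105, baseline=0.100)
print(f"0.115s vs 0.100s baseline regressed: {regressed}")
print(f"0.105s vs 0.100s baseline regressed: {within_budget}")
```

In a pipeline, `current` would come from something like `subset_results.geometric_mean`. Timings at SF 0.01 are small and noisy, so a generous tolerance or several repeated runs is advisable before failing a build.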

In [None]:
# Fast smoke test: Run 5 representative queries
smoke_test_queries = [1, 3, 6, 10, 14]  # Mix of simple and complex

subset_cfg = BenchmarkConfig(
    name="tpch",
    display_name="TPC-H Smoke Test",
    scale_factor=0.01,
    test_execution_type="power",
    query_numbers=smoke_test_queries,
)

print(f"🚀 Running smoke test with queries: {smoke_test_queries}\n")
subset_results = run_benchmark_lifecycle(
    benchmark_config=subset_cfg,
    database_config=db_cfg,
    system_profile=None,
    platform_config=platform_cfg,
    phases=LifecyclePhases(generate=False, load=False, execute=True),  # Reuse data
)

print(f"\n✅ Smoke test completed: {subset_results.geometric_mean:.3f}s geometric mean")
print(f"   Queries: {len(subset_results.query_results)}")
print(f"   Time saved vs full suite: ~{(1 - len(smoke_test_queries) / 22) * 100:.0f}%")
print("\n💡 Perfect for CI/CD: Run subset tests in seconds, not minutes!")

### Memory Tuning

Test different memory limits to find optimal configuration.
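
Before sweeping limits, it can help to drop candidates that exceed the RAM actually available. The sketch below parses limit strings such as `"4GB"` and filters them. It assumes decimal units for `KB`/`MB`/`GB` and binary units for `KiB`/`MiB`/`GiB`, which is my understanding of how DuckDB reads these suffixes; verify against the DuckDB documentation for your version. The helper names are hypothetical.

```python
import re

# Assumed unit table: decimal for KB/MB/GB/TB, binary for KiB/MiB/GiB/TiB.
_UNITS = {
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KIB": 1024, "MIB": 1024**2, "GIB": 1024**3, "TIB": 1024**4,
}


def parse_memory_limit(text):
    """Convert a limit string such as '4GB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if not match:
        raise ValueError(f"unrecognised memory limit: {text!r}")
    value, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in memory limit: {text!r}")
    return int(float(value) * _UNITS[unit])


def limits_that_fit(limits, available_bytes):
    """Keep only the limits that do not exceed the available memory."""
    return [limit for limit in limits if parse_memory_limit(limit) <= available_bytes]


# Illustrative: pretend 3 GB is available.
fitting = limits_that_fit(["1GB", "2GB", "4GB"], available_bytes=3 * 1000**3)
print(fitting)
```

In the notebook, `available_bytes` could come from `psutil.virtual_memory().available` when psutil is installed.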

In [None]:
memory_limits = ["1GB", "2GB", "4GB"]
memory_results = {}

print("🧪 Testing different memory limits...\n")

for mem_limit in memory_limits:
    print(f"Testing with {mem_limit} memory limit...")

    mem_cfg = platform_cfg.copy()
    mem_cfg["memory_limit"] = mem_limit

    try:
        mem_results_obj = run_benchmark_lifecycle(
            benchmark_config=subset_cfg,  # Reuse smoke test config
            database_config=db_cfg,
            system_profile=None,
            platform_config=mem_cfg,
            phases=LifecyclePhases(generate=False, load=False, execute=True),
        )
        memory_results[mem_limit] = mem_results_obj.geometric_mean
        print(f"  ✅ {mem_limit}: {mem_results_obj.geometric_mean:.3f}s\n")
    except Exception as e:
        print(f"  ❌ {mem_limit}: Failed - {str(e)[:50]}...\n")
        memory_results[mem_limit] = None

if memory_results:
    print("📊 Memory Limit Comparison:")
    for mem_limit, time in memory_results.items():
        if time is not None:  # A 0.0s result is still a success
            print(f"   {mem_limit}: {time:.3f}s")
        else:
            print(f"   {mem_limit}: Failed")

    print("\n💡 Lower memory limits may cause disk spilling, increasing query time.")
    print("   Recommended: 2-4GB for typical workloads, 8GB+ for large datasets.")

### Parallel Execution

Test impact of thread count on performance.
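
The comparison below reports speedup relative to the single-thread run. Dividing that speedup by the thread count gives parallel efficiency, which shows where extra threads stop paying off. A minimal sketch with a hypothetical helper name and illustrative timings:

```python
def speedup_table(times_by_threads):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    baseline = times_by_threads[1]
    rows = []
    for threads in sorted(times_by_threads):
        elapsed = times_by_threads[threads]
        speedup = baseline / elapsed
        efficiency = speedup / threads  # 1.0 would be perfect linear scaling
        rows.append((threads, speedup, efficiency))
    return rows


# Illustrative timings in seconds (not measured values).
rows = speedup_table({1: 0.80, 4: 0.25, 8: 0.20})
for threads, speedup, efficiency in rows:
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

On very small datasets the work per query may be too small to keep all threads busy, so efficiency figures well below 100% are not unusual.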

In [None]:
# Test with different thread counts
max_threads = os.cpu_count() or 1  # cpu_count() can return None
thread_counts = sorted({1, max(1, max_threads // 2), max_threads})  # De-duplicate on small machines
thread_results = {}

print(f"🧵 Testing different thread counts (max: {max_threads})...\n")

for threads in thread_counts:
    print(f"Testing with {threads} threads...")

    thread_cfg = platform_cfg.copy()
    thread_cfg["threads"] = threads

    thread_results_obj = run_benchmark_lifecycle(
        benchmark_config=subset_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=thread_cfg,
        phases=LifecyclePhases(generate=False, load=False, execute=True),
    )
    thread_results[threads] = thread_results_obj.geometric_mean
    print(f"  ✅ {threads} threads: {thread_results_obj.geometric_mean:.3f}s\n")

if len(thread_results) > 1:
    print("📊 Thread Count Comparison:")
    baseline = thread_results[1]
    for threads, time in thread_results.items():
        speedup = baseline / time if time > 0 else 0
        print(f"   {threads} threads: {time:.3f}s (speedup: {speedup:.2f}x)")

    print("\n💡 DuckDB scales well with more threads for analytical queries.")
    print(f"   Recommendation: Use all available cores ({max_threads}) for best performance.")

### Persistent vs In-Memory Comparison

Compare performance between persistent and in-memory modes.

In [None]:
mode_results = {}

for mode in ["persistent", "memory"]:
    print(f"\n🚀 Testing {mode} mode...")

    mode_cfg = platform_cfg.copy()
    mode_cfg["database"] = config["database_file"] if mode == "persistent" else ":memory:"

    mode_results_obj = run_benchmark_lifecycle(
        benchmark_config=subset_cfg,
        database_config=db_cfg,
        system_profile=None,
        platform_config=mode_cfg,
        phases=LifecyclePhases(generate=False, load=True, execute=True),
    )
    mode_results[mode] = mode_results_obj.geometric_mean
    print(f"  ✅ {mode}: {mode_results_obj.geometric_mean:.3f}s")

print("\n📊 Mode Comparison:")
for mode, time in mode_results.items():
    print(f"   {mode.capitalize()}: {time:.3f}s")

if len(mode_results) == 2:
    diff_pct = abs(mode_results["persistent"] - mode_results["memory"]) / mode_results["persistent"] * 100
    print(f"\n   Difference: {diff_pct:.1f}%")

    print("\n💡 Performance Insights:")
    print("   - In-memory is typically slightly faster (no disk I/O)")
    print("   - Persistent mode allows data reuse and larger-than-RAM datasets")
    print("   - For small datasets, the difference is minimal")
    print("   - Choose persistent for production, in-memory for temporary analysis")

### Export Results

Export benchmark results to various formats for reporting and analysis.

In [None]:
# Export to multiple formats
try:
    exporter = ResultExporter(results)

    output_dir = Path(config["output_dir"])
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Export to JSON
    json_path = output_dir / f"duckdb_tpch_{timestamp}.json"
    exporter.to_json(json_path)
    print(f"✅ Exported JSON: {json_path}")

    # Export to CSV
    csv_path = output_dir / f"duckdb_tpch_{timestamp}.csv"
    exporter.to_csv(csv_path)
    print(f"✅ Exported CSV: {csv_path}")

    # Export to HTML report
    html_path = output_dir / f"duckdb_tpch_{timestamp}.html"
    exporter.to_html(html_path)
    print(f"✅ Exported HTML: {html_path}")

    print(f"\n📁 All results exported to: {output_dir}")

except Exception as e:
    print(f"⚠️ Export failed: {e}")

### File Format Comparison

DuckDB can read directly from CSV, Parquet, and JSON files. The cell below lists the reader functions and general performance guidance for each format.

In [None]:
print("📊 DuckDB File Format Capabilities\n")
print("DuckDB can query files directly without loading:")
print("\n1. CSV Files:")
print("   SELECT * FROM read_csv_auto('data.csv');")
print("\n2. Parquet Files:")
print("   SELECT * FROM read_parquet('data.parquet');")
print("\n3. JSON Files:")
print("   SELECT * FROM read_json_auto('data.json');")
print("\n4. Multiple Files (Glob patterns):")
print("   SELECT * FROM read_parquet('data/*.parquet');")
print("\n5. Remote Files (with httpfs extension):")
print("   SELECT * FROM read_parquet('https://example.com/data.parquet');")

print("\n💡 Performance Tips:")
print("   - Parquet is fastest (columnar, compressed)")
print("   - CSV is most compatible but slower")
print("   - Use read_*_auto() for automatic schema detection")
print("   - DuckDB can query files in-place without loading!")

## 4. Platform-Specific Features

### Reading from Files

DuckDB's killer feature: Query data files directly without importing.

In [None]:
# Connect to DuckDB
conn = duckdb.connect(":memory:")

print("📂 Direct File Querying Examples\n")

# Example 1: Read CSV
print("1. CSV Reading:")
print(
    "   "
    + """SELECT * FROM read_csv_auto(
       'data.csv',
       header=True,
       delim=',',
       auto_detect=True
   );"""
)

# Example 2: Read Parquet
print("\n2. Parquet Reading (Recommended):")
print("   SELECT * FROM read_parquet('data.parquet');")
print("   -- Or multiple files:")
print("   SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet']);")
print("   -- Or glob pattern:")
print("   SELECT * FROM read_parquet('data/**/*.parquet');")

# Example 3: Create table from file
print("\n3. Create Table from File:")
print("   CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv');")

# Example 4: Export to Parquet
print("\n4. Export Query Results:")
print("   COPY (SELECT * FROM customers WHERE country = 'USA')")
print("   TO 'usa_customers.parquet' (FORMAT PARQUET);")

conn.close()

print("\n💡 Why This Matters:")
print("   - No ETL required for analysis")
print("   - Work with data lakes directly")
print("   - Convert between formats easily")
print("   - Perfect for data science workflows")

### Pandas Integration

DuckDB has seamless integration with pandas DataFrames.

In [None]:
import duckdb

print("🐼 DuckDB + Pandas Integration\n")

# Create sample DataFrame
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie", "David"], "age": [25, 30, 35, 40], "salary": [50000, 60000, 70000, 80000]}
)

print("Sample DataFrame:")
print(df)

# Query DataFrame with SQL
result = duckdb.query("SELECT name, salary FROM df WHERE age > 30").to_df()

print("\nSQL Query Result:")
print(result)

print("\n📊 Integration Examples:\n")

print("1. Query DataFrame:")
print("   duckdb.query('SELECT * FROM df WHERE age > 30').to_df()")

print("\n2. Register DataFrame as a DuckDB View:")
print("   conn = duckdb.connect()")
print("   conn.register('my_table', df)")
print("   conn.execute('SELECT * FROM my_table')")

print("\n3. Convert Query Result to DataFrame:")
print("   result_df = conn.execute('SELECT * FROM my_table').df()")

print("\n4. Use Arrow for Zero-Copy Transfer:")
print("   arrow_table = conn.execute('SELECT * FROM my_table').arrow()")

print("\n💡 Performance Tips:")
print("   - DuckDB can query pandas DataFrames directly (no copying!)")
print("   - Use Arrow for zero-copy data transfer")
print("   - DuckDB is often faster than pandas for aggregations")
print("   - Perfect for data preprocessing pipelines")

### Extensions

DuckDB extensions add capabilities like spatial data, full-text search, and remote files.

In [None]:
conn = duckdb.connect(":memory:")

print("🔌 DuckDB Extensions\n")

# List available extensions
extensions = conn.execute("""
    SELECT extension_name, installed, loaded, description
    FROM duckdb_extensions()
    WHERE extension_name IN ('httpfs', 'parquet', 'json', 'fts', 'spatial')
    ORDER BY extension_name;
""").fetchall()

if extensions:
    print("Available Extensions:")
    for name, installed, loaded, desc in extensions:
        status = "loaded" if loaded else ("installed" if installed else "available")
        print(f"\n{name} ({status}):")
        print(f"  {desc}")
else:
    print("No extensions found")

print("\n📦 Popular Extensions:\n")

print("1. httpfs - Read remote files:")
print("   INSTALL httpfs; LOAD httpfs;")
print("   SELECT * FROM read_parquet('https://example.com/data.parquet');")

print("\n2. parquet - Parquet format support:")
print("   Usually auto-loaded")
print("   SELECT * FROM read_parquet('data.parquet');")

print("\n3. json - JSON support:")
print("   INSTALL json; LOAD json;")
print("   SELECT * FROM read_json_auto('data.json');")

print("\n4. fts - Full-text search:")
print("   INSTALL fts; LOAD fts;")
print("   PRAGMA create_fts_index('documents', 'id', 'content');  -- 'id' = document key column")

print("\n5. spatial - GIS functionality:")
print("   INSTALL spatial; LOAD spatial;")
print("   SELECT ST_Distance(point1, point2) FROM locations;")

conn.close()

print("\n💡 Extension Tips:")
print("   - Install once: INSTALL extension_name;")
print("   - Load each session: LOAD extension_name;")
print("   - Some extensions auto-load when needed")
print("   - Check duckdb_extensions() for full list")

### EXPLAIN and Query Optimization

Use EXPLAIN to understand query execution plans.

In [None]:
conn = duckdb.connect(config["database_file"] if config["mode"] == "persistent" else ":memory:")

print("🔍 Query Optimization with EXPLAIN\n")

# Example query
query = """
SELECT l_orderkey, SUM(l_quantity) as total_qty
FROM lineitem
WHERE l_shipdate >= DATE '1995-01-01'
GROUP BY l_orderkey
ORDER BY total_qty DESC
LIMIT 10;
"""

print("Example Query:")
print(query)

try:
    # Get query plan
    plan = conn.execute(f"EXPLAIN {query}").fetchall()

    print("\nQuery Plan:")
    for row in plan:
        print(row[1])  # explain_value column

except Exception as e:
    print(f"\n⚠️ Could not get query plan: {e}")
    print("   (This is expected if tables don't exist yet)")

conn.close()

print("\n💡 EXPLAIN Usage:\n")
print("1. Basic plan: EXPLAIN SELECT ...")
print("2. Analyzed plan: EXPLAIN ANALYZE SELECT ...")
print("3. Show all details: PRAGMA explain_output = 'all'; EXPLAIN SELECT ...")

print("\n🎯 Optimization Tips:")
print("   - 'SEQ_SCAN' is normal for analytics - check that filters are pushed into the scan")
print("   - Check join order - smaller tables first")
print("   - Filter early to reduce data scanned")
print("   - Use EXPLAIN ANALYZE to see actual timings")
print("   - DuckDB is smart - trust the optimizer!")

## 5. Performance Analysis

### Load and Analyze Previous Results

Load saved benchmark results for analysis.
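
The cell below picks the newest export with `sorted(...)[-1]`. That works because the export cell names files with a zero-padded `%Y%m%d_%H%M%S` timestamp, so lexicographic order matches chronological order. A small sketch with made-up timestamps:

```python
from datetime import datetime

# Illustrative timestamps, deliberately out of order.
stamps = [
    datetime(2024, 1, 9, 8, 5, 0),
    datetime(2024, 11, 2, 17, 30, 0),
    datetime(2024, 2, 1, 0, 0, 0),
]

# Same naming pattern as the export cell: duckdb_tpch_<timestamp>.json
names = [f"duckdb_tpch_{ts.strftime('%Y%m%d_%H%M%S')}.json" for ts in stamps]

# Zero-padded fields make string order equal to time order.
latest = sorted(names)[-1]
print(latest)
```

A format without zero padding, or one that puts the day before the year, would break this property; sorting by file modification time would then be the safer choice.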

In [None]:
# Find most recent result file
try:
    result_files = sorted(Path(config["output_dir"]).glob("duckdb_tpch_*.json"))

    if result_files:
        latest_file = result_files[-1]
        print(f"📂 Loading results from: {latest_file.name}\n")

        loader = ResultLoader()
        loaded_results = loader.load_json(latest_file)

        print(f"✅ Loaded {len(loaded_results.query_results)} query results")
        print(f"   Benchmark: {loaded_results.benchmark_name}")
        print(f"   Scale factor: {loaded_results.scale_factor}")
        print(f"   Geometric mean: {loaded_results.geometric_mean:.3f}s")
    else:
        print("⚠️  No result files found. Run a benchmark first.")
        loaded_results = results  # Use current results

except Exception as e:
    print(f"⚠️ Could not load results: {e}")
    loaded_results = results

### Statistical Analysis

Calculate detailed statistics on query performance.
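
One of the statistics computed below is the coefficient of variation (standard deviation divided by mean), which the cell classifies with thresholds of 0.5 and 1.0. The sketch reproduces that classification using only the standard library. It uses `statistics.pstdev` because `np.std` defaults to the population standard deviation. The function names and sample timings are illustrative.

```python
import statistics


def coefficient_of_variation(times):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(times)
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return statistics.pstdev(times) / mean


def classify_variability(cv):
    """Apply the same 0.5 / 1.0 thresholds used in the notebook cell."""
    if cv < 0.5:
        return "low"
    if cv < 1.0:
        return "moderate"
    return "high"


cv_flat = coefficient_of_variation([0.1, 0.1, 0.1, 0.1])
cv_skewed = coefficient_of_variation([0.01, 0.01, 0.01, 1.0])  # one outlier
print(f"flat: {cv_flat:.2f} ({classify_variability(cv_flat)})")
print(f"skewed: {cv_skewed:.2f} ({classify_variability(cv_skewed)})")
```

Benchmark suites mix cheap and expensive queries by design, so a high value here often reflects the workload rather than unstable performance.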

In [None]:
if loaded_results.query_results:
    times = [qr.execution_time for qr in loaded_results.query_results if qr.success]

    if times:
        stats = {
            "count": len(times),
            "mean": np.mean(times),
            "median": np.median(times),
            "std": np.std(times),
            "min": np.min(times),
            "max": np.max(times),
            "p25": np.percentile(times, 25),
            "p75": np.percentile(times, 75),
            "p95": np.percentile(times, 95),
            "p99": np.percentile(times, 99),
        }

        print("📊 Statistical Summary:\n")
        print(f"Count:      {stats['count']} queries")
        print(f"Mean:       {stats['mean']:.3f}s")
        print(f"Median:     {stats['median']:.3f}s")
        print(f"Std Dev:    {stats['std']:.3f}s")
        print(f"Min:        {stats['min']:.3f}s")
        print(f"Max:        {stats['max']:.3f}s")
        print("\nPercentiles:")
        print(f"P25:        {stats['p25']:.3f}s")
        print(f"P75:        {stats['p75']:.3f}s")
        print(f"P95:        {stats['p95']:.3f}s")
        print(f"P99:        {stats['p99']:.3f}s")

        # Coefficient of variation
        cv = stats["std"] / stats["mean"]
        print(f"\nCoefficient of Variation: {cv:.2f}")
        if cv < 0.5:
            print("   ✅ Low variability - consistent performance")
        elif cv < 1.0:
            print("   ⚠️  Moderate variability")
        else:
            print("   ❌ High variability - investigate outliers")

        # DuckDB-specific insights
        if stats["max"] < 1.0:
            print("\n💡 DuckDB Insight: All queries under 1 second - excellent performance!")
        if stats["mean"] < 0.1:
            print("   Sub-100ms average - DuckDB's vectorized execution shines here!")
    else:
        print("⚠️  No successful queries to analyze")
else:
    print("⚠️  No query results available")

### Advanced Visualizations

Create comprehensive performance visualizations.
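
The fourth panel below is a Pareto analysis: queries are sorted from slowest to fastest and their times accumulated to find how many account for 80% of the total. The same calculation without NumPy, using a hypothetical helper name and illustrative timings:

```python
from itertools import accumulate


def queries_for_share(times, share=0.80):
    """Count the slowest queries needed to reach `share` of total time."""
    ordered = sorted(times, reverse=True)  # slowest first
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running >= share * total:
            return count
    return len(ordered)


# Illustrative timings in seconds (not measured values).
n_queries = queries_for_share([0.5, 6.0, 0.5, 3.0])
print(f"{n_queries} of 4 queries account for 80% of total time")
```

A small count relative to the number of queries suggests that tuning effort is best spent on those few slow queries.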

In [None]:
if loaded_results.query_results:
    times = [qr.execution_time for qr in loaded_results.query_results if qr.success]
    query_names = [qr.query_name for qr in loaded_results.query_results if qr.success]

    if times:
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))

        # 1. Histogram with distribution
        ax1 = axes[0, 0]
        ax1.hist(times, bins=20, color="#FFC220", alpha=0.7, edgecolor="black")
        ax1.axvline(np.mean(times), color="#FF6F00", linestyle="--", linewidth=2, label=f"Mean: {np.mean(times):.3f}s")
        ax1.axvline(
            np.median(times), color="green", linestyle="--", linewidth=2, label=f"Median: {np.median(times):.3f}s"
        )
        ax1.set_xlabel("Execution Time (seconds)", fontweight="bold")
        ax1.set_ylabel("Frequency", fontweight="bold")
        ax1.set_title("Query Execution Time Distribution", fontweight="bold")
        ax1.legend()
        ax1.grid(axis="y", alpha=0.3)

        # 2. Box plot
        ax2 = axes[0, 1]
        box = ax2.boxplot([times], vert=True, patch_artist=True, widths=0.5)
        box["boxes"][0].set_facecolor("#FFC220")
        box["boxes"][0].set_alpha(0.7)
        ax2.set_ylabel("Execution Time (seconds)", fontweight="bold")
        ax2.set_title("Query Performance Box Plot", fontweight="bold")
        ax2.set_xticklabels(["All Queries"])
        ax2.grid(axis="y", alpha=0.3)

        # 3. Sorted bar chart (top 10 slowest)
        ax3 = axes[1, 0]
        sorted_indices = np.argsort(times)[-10:]
        sorted_times = [times[i] for i in sorted_indices]
        sorted_names = [query_names[i] for i in sorted_indices]

        bars = ax3.barh(sorted_names, sorted_times, color="#FFC220", alpha=0.8, edgecolor="black")
        ax3.set_xlabel("Execution Time (seconds)", fontweight="bold")
        ax3.set_title("Top 10 Slowest Queries", fontweight="bold")
        ax3.grid(axis="x", alpha=0.3)

        # 4. Cumulative performance (Pareto)
        ax4 = axes[1, 1]
        sorted_all_indices = np.argsort(times)[::-1]
        sorted_all_times = [times[i] for i in sorted_all_indices]
        cumulative = np.cumsum(sorted_all_times)
        cumulative_pct = (cumulative / cumulative[-1]) * 100

        # Plot against a 1-based query count so the x-axis matches queries_80pct
        ax4.plot(range(1, len(cumulative_pct) + 1), cumulative_pct, marker="o", color="#FFC220", linewidth=2)
        ax4.axhline(80, color="#FF6F00", linestyle="--", linewidth=2, label="80% of total time")
        ax4.set_xlabel("Number of Queries (sorted by time)", fontweight="bold")
        ax4.set_ylabel("Cumulative % of Total Time", fontweight="bold")
        ax4.set_title("Cumulative Performance (Pareto Analysis)", fontweight="bold")
        ax4.legend()
        ax4.grid(True, alpha=0.3)

        # Find how many queries account for 80% of time
        queries_80pct = np.argmax(cumulative_pct >= 80) + 1
        ax4.axvline(
            queries_80pct, color="green", linestyle="--", linewidth=2, label=f"{queries_80pct} queries = 80% time"
        )
        ax4.legend()

        plt.tight_layout()
        plt.show()

        print(
            f"\n💡 Insight: {queries_80pct} queries ({queries_80pct / len(times) * 100:.0f}%) account for 80% of total execution time"
        )
    else:
        print("⚠️  No successful queries to visualize")
else:
    print("⚠️  No query results available")
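
The Pareto panel above asks how many of the slowest queries are needed to reach 80% of total runtime. The same calculation can be sketched without NumPy or matplotlib. The helper name `queries_for_share` and the sample timings are hypothetical and serve only to illustrate the logic.

```python
def queries_for_share(times, share=0.80):
    """Return how many of the slowest queries are needed to reach `share` of total time."""
    if not times:
        return 0
    total = sum(times)
    if total <= 0:
        return 0
    running = 0.0
    # Walk from slowest to fastest, accumulating runtime.
    for count, t in enumerate(sorted(times, reverse=True), start=1):
        running += t
        if running / total >= share:
            return count
    return len(times)


# Hypothetical timings in seconds, for illustration only.
example_times = [6.0, 3.0, 0.5, 0.5]
print(queries_for_share(example_times))  # prints 2
```

A small count relative to the total number of queries suggests that tuning effort is best spent on those few queries first.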

### Memory Usage Tracking

Monitor memory consumption during benchmarks.

In [None]:
if psutil:
    process = psutil.Process()

    print("💾 Memory Usage Analysis\n")

    # Get process memory info
    mem_info = process.memory_info()
    mem_mb = mem_info.rss / (1024**2)

    print(f"Current Process Memory: {mem_mb:.1f} MB")

    # System memory
    sys_mem = psutil.virtual_memory()
    print(f"System Memory Usage: {sys_mem.percent}%")
    print(f"Available: {sys_mem.available / (1024**3):.1f} GB")

    # Database file size
    if config["mode"] == "persistent" and Path(config["database_file"]).exists():
        db_size = Path(config["database_file"]).stat().st_size / (1024**2)
        print(f"\nDatabase File: {db_size:.1f} MB")

    print("\n💡 Memory Optimization Tips:")
    print("   - Use persistent mode for large datasets")
    print("   - Adjust memory_limit setting if needed")
    print("   - DuckDB will spill to disk if memory is insufficient")
    print("   - Monitor temp_directory disk usage for spilling")
else:
    print("⚠️  Memory tracking not available (install psutil)")

### Performance Profiling

Use DuckDB's built-in profiling to understand query performance.

In [None]:
print("🔬 DuckDB Performance Profiling\n")
print("DuckDB provides built-in profiling:")
print("\n1. Enable profiling:")
print("   PRAGMA enable_profiling = 'json';")
print("   PRAGMA profiling_output = 'profile_output.json';")
print("\n2. Run your queries...")
print("\n3. View the profile:")
print("   Open profile_output.json, or run EXPLAIN ANALYZE <query>; for an inline plan")

print("\n📊 What Profiling Shows:")
print("   - Time spent in each operator")
print("   - Number of rows processed")
print("   - Memory usage per operator")
print("   - Bottlenecks in query execution")

print("\n💡 Quick Profiling Tips:")
print("   - Focus on operators taking >10% of time")
print("   - Look for unexpected full table scans")
print("   - Check if sorts/aggregations are slow")
print("   - Indexes mainly help highly selective point lookups, not analytical scans")
print("   - DuckDB automatically optimizes most queries!")
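
The cell above only prints instructions. As a minimal sketch of the inline route, the helper below wraps a query in `EXPLAIN ANALYZE` and returns the plan text. The name `explain_analyze` and the synthetic table `t` are hypothetical, and the helper assumes each result row ends with the rendered plan text, which may differ between DuckDB versions.

```python
try:
    import duckdb
except ImportError:  # mirrors the notebook's optional-import pattern
    duckdb = None


def explain_analyze(conn, sql):
    """Run EXPLAIN ANALYZE on `sql` and return the plan text.

    `conn` is any object with a DuckDB-style `execute(...).fetchall()` API.
    Assumption: the last column of each result row holds the rendered plan.
    """
    rows = conn.execute(f"EXPLAIN ANALYZE {sql}").fetchall()
    return "\n".join(str(row[-1]) for row in rows)


if duckdb is not None:
    conn = duckdb.connect(":memory:")
    # Small synthetic table so the plan has a scan and an aggregate to show.
    conn.execute("CREATE TABLE t AS SELECT range AS id, range % 10 AS grp FROM range(100000)")
    print(explain_analyze(conn, "SELECT grp, count(*) FROM t GROUP BY grp"))
    conn.close()
```

The printed plan lists each operator with its timing and row counts, which is usually enough to spot the dominant operator before reaching for the JSON profile.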

## 6. Troubleshooting

### Diagnostics Function

Comprehensive diagnostic tool for troubleshooting DuckDB issues.

In [None]:
def diagnose_duckdb():
    """Diagnose DuckDB setup and configuration"""
    print("🔍 DuckDB Diagnostic\n")

    # Check 1: DuckDB import
    print("1. Checking DuckDB installation...")
    try:
        import duckdb

        print(f"   ✅ DuckDB {duckdb.__version__} installed")
    except ImportError:
        print("   ❌ DuckDB not installed")
        print("   Action: pip install duckdb")
        return False

    # Check 2: Test connection
    print("\n2. Testing connection...")
    try:
        conn = duckdb.connect(":memory:")
        result = conn.execute("SELECT 42 as answer;").fetchone()
        conn.close()
        print(f"   ✅ Connection successful: {result}")
    except Exception as e:
        print(f"   ❌ Connection failed: {e}")
        return False

    # Check 3: System resources
    print("\n3. Checking system resources...")
    if psutil:
        mem = psutil.virtual_memory()
        cpu = psutil.cpu_count()
        print(f"   ✅ CPU cores: {cpu}")
        print(f"   ✅ RAM: {mem.total / (1024**3):.1f} GB ({mem.available / (1024**3):.1f} GB available)")

        if mem.available < 1024**3:  # Less than 1GB available
            print("   ⚠️  Low memory - reduce memory_limit setting")
    else:
        print("   ⚠️  psutil not available - install for resource monitoring")

    # Check 4: Database file (if persistent)
    print("\n4. Checking database configuration...")
    if config["mode"] == "persistent":
        db_path = Path(config["database_file"])
        if db_path.exists():
            size_mb = db_path.stat().st_size / (1024**2)
            print(f"   ✅ Database file exists: {size_mb:.1f} MB")
        else:
            print(f"   ℹ️  Database file will be created: {db_path}")

        # Check parent directory is writable
        parent = db_path.parent
        if parent.exists() and os.access(parent, os.W_OK):
            print(f"   ✅ Directory writable: {parent}")
        else:
            print(f"   ❌ Directory not writable: {parent}")
            return False
    else:
        print("   ℹ️  Using in-memory mode")

    # Check 5: Extensions
    print("\n5. Checking extensions...")
    try:
        conn = duckdb.connect(":memory:")
        exts = conn.execute("""
            SELECT extension_name, installed
            FROM duckdb_extensions()
            WHERE extension_name IN ('parquet', 'json', 'httpfs')
        """).fetchall()
        conn.close()

        for name, installed in exts:
            status = "installed" if installed else "available"
            print(f"   {name}: {status}")
    except Exception as e:
        print(f"   ⚠️  Could not check extensions: {e}")

    print("\n✅ All diagnostics passed!")
    print("\n📚 Additional Resources:")
    print("   - DuckDB Documentation: https://duckdb.org/docs/")
    print("   - GitHub Issues: https://github.com/duckdb/duckdb/issues")
    print("   - Discord Community: https://discord.duckdb.org/")

    return True


# Run diagnostics
diagnose_duckdb()

### Common Issues and Solutions

**1. Out of Memory Error**
```
Error: Out of Memory
```
**Solution:**
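- Lower `memory_limit` so DuckDB spills to disk sooner, e.g. `SET memory_limit = '2GB';` (if the error reports that the limit itself was reached, raise it instead)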
- Use persistent mode instead of in-memory
- Ensure `temp_directory` has sufficient disk space for spilling
- Process data in smaller batches

**2. File Lock Error (Persistent Mode)**
```
Error: Could not set lock on file
```
**Solution:**
- Close all connections to the database
- Check no other processes are using the .duckdb file
- Use `conn.close()` to properly close connections
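- Leave the `.duckdb.wal` file alone unless you are sure no process is using the database and you can afford to lose changes that have not been checkpointed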

**3. Slow Query Performance**
**Solution:**
- Increase thread count: `SET threads TO 8;`
- Increase memory limit if available
- Use EXPLAIN ANALYZE to find bottlenecks
- Consider indexes only for highly selective point lookups; they rarely help analytical scans
- Use Parquet files instead of CSV for better performance

**4. Extension Not Found**
```
Error: Extension "httpfs" not found
```
**Solution:**
```sql
INSTALL httpfs;
LOAD httpfs;
```

**5. File Not Found (Reading Data)**
```
Error: Could not open file
```
**Solution:**
- Use absolute paths: `/full/path/to/file.csv`
- Check file permissions
- Verify file exists: `Path('file.csv').exists()`
- Use forward slashes even on Windows: `data/file.csv`
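
The path checks above can be combined into one small helper. This is a sketch using only the standard library; the name `duckdb_path` is hypothetical, and the query string at the end is only printed, not executed.

```python
import tempfile
from pathlib import Path


def duckdb_path(path_like):
    """Return an absolute, forward-slash path string, or raise if the file is missing."""
    path = Path(path_like).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Data file not found: {path}")
    # as_posix() yields forward slashes on every platform, including Windows.
    return path.as_posix()


# Demo with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "file.csv"
    csv_file.write_text("id,name\n1,a\n")
    safe = duckdb_path(csv_file)
    # The checked path can then be embedded in a query string.
    # Paths containing single quotes would need escaping first.
    print(f"SELECT * FROM read_csv_auto('{safe}')")
```

Failing early with the resolved absolute path in the message makes it clear which working directory a relative path was interpreted against.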

**6. Catalog Error (Table Not Found)**
```
Error: Table with name "mytable" does not exist
```
**Solution:**
- In-memory databases don't persist between sessions
- Use persistent mode for data reuse
- Check table name spelling and schema (DuckDB identifiers are case-insensitive, so letter case alone is not the cause)
- Verify connection to correct database file

**7. Import/Export Errors**
**Solution:**
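- Use `read_csv_auto()` for automatic type detection
- Specify delimiter explicitly: `delim=','`
- Pad rows that have missing columns with NULLs: `null_padding=true`; map a NULL marker string with `nullstr`
- Check file encoding: DuckDB expects UTF-8, so convert other encodings first if your version does not accept the `encoding` option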

**8. Version Compatibility**
**Solution:**
- Update DuckDB: `pip install -U duckdb`
- Check version: `duckdb.__version__`
- Some features require newer versions
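- Database files written by releases older than 0.10 may not open in newer versions; recreate them, or move the data with `EXPORT DATABASE` and `IMPORT DATABASE`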

**Need More Help?**
- DuckDB Documentation: https://duckdb.org/docs/
- GitHub Issues: https://github.com/duckdb/duckdb/issues
- Discord Community: https://discord.duckdb.org/
- Stack Overflow: https://stackoverflow.com/questions/tagged/duckdb

### Performance Tuning Guide

**Configuration Settings:**
```sql
-- Set the thread count (defaults to the number of CPU cores)
SET threads TO 8;

-- Set memory limit (adjust for your system)
SET memory_limit = '4GB';

-- Configure temp directory for spilling
SET temp_directory = '/path/to/temp';

-- Disable a specific optimizer pass for debugging (not recommended)
SET disabled_optimizers = 'join_order';
```

**Data Loading Best Practices:**
1. **Use Parquet files** - Typically much faster to scan than CSV (columnar, typed, compressed); the speedup depends on the data and the query
2. **Parallel loading** - DuckDB loads multiple files in parallel
3. **Compression** - Parquet with Snappy or ZSTD compression
4. **Partition files** - Split large files into 100MB-1GB chunks

**Query Optimization:**
1. **Filter early** - WHERE clauses before joins
2. **Select only needed columns** - Avoid SELECT *
3. **Use appropriate joins** - Let optimizer choose join order
4. **Aggregate efficiently** - GROUP BY early if possible
5. **Trust the optimizer** - DuckDB is very good at optimization

**Memory Management:**
- Start with about 50% of available RAM, e.g. `SET memory_limit = '4GB';` on a machine with 8GB free
- Increase if queries are fast and memory available
- Decrease if getting OOM errors
- DuckDB will spill to disk if needed (check temp_directory)
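
The 50% starting point can be computed rather than hard-coded. The sketch below turns a byte count into a `memory_limit` string; the name `suggest_memory_limit` is hypothetical, and the result is expressed in MB so that small machines do not round down to zero.

```python
def suggest_memory_limit(available_bytes, fraction=0.5):
    """Return a memory_limit string such as '4000MB' for a share of available RAM."""
    if available_bytes <= 0:
        raise ValueError("available_bytes must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Decimal megabytes; never suggest less than 1MB.
    limit_mb = max(int(available_bytes * fraction) // (1000**2), 1)
    return f"{limit_mb}MB"


# In the notebook the input could come from psutil.virtual_memory().available.
# A fixed value is used here so the example is reproducible.
limit = suggest_memory_limit(8_000_000_000)
print(f"SET memory_limit = '{limit}';")  # prints SET memory_limit = '4000MB';
```

The printed statement can be passed to `conn.execute(...)` on an open connection; raise or lower `fraction` according to the guidance above.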

**When to Use Each Mode:**

**In-Memory Mode:**
- ✅ Temporary analysis
- ✅ Small datasets (<1GB)
- ✅ Maximum speed
- ❌ No data persistence
- ❌ Limited by RAM

**Persistent Mode:**
- ✅ Data reuse
- ✅ Larger-than-RAM datasets
- ✅ Production workloads
- ✅ Crash recovery
- ❌ Slightly slower (minimal difference)

**Benchmarking Tips:**
1. **Warm-up run** - First execution may be slower (cold caches; data not yet in memory)
2. **Multiple runs** - Average 3-5 runs for stability
3. **Clear caches** - Close and reopen the connection between tests (this resets DuckDB's buffer cache, not the OS page cache)
4. **Monitor resources** - Check CPU, memory, and disk usage
5. **Compare apples-to-apples** - Same hardware, same data, same queries
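
The warm-up and multiple-run tips can be sketched as a small timing harness. The name `time_runs` is hypothetical and the workload is a stand-in; in this notebook the callable would be something like `lambda: conn.execute(query).fetchall()`.

```python
import statistics
import time


def time_runs(fn, warmup=1, runs=5):
    """Call `fn` `warmup` times untimed, then `runs` times timed; return summary stats."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # warm caches; result and timing are discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples) if runs > 1 else 0.0,
    }


# Stand-in workload so the example runs without a database.
stats = time_runs(lambda: sum(range(100_000)), warmup=1, runs=5)
print(stats)
```

Reporting the median alongside the mean helps when a single run is disturbed by background activity on the machine.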

## Next Steps

**Try these next:**
1. Run with larger scale factors (1.0, 10, 100)
2. Test with your own datasets (CSV, Parquet)
3. Compare DuckDB vs other databases
4. Integrate with your Python data pipeline
5. Explore DuckDB extensions (spatial, full-text search)

**Related notebooks:**
- `sqlite_benchmarking.ipynb` - Compare with SQLite
- `platform_comparison.ipynb` - Compare cloud vs local
- `visualization_examples.ipynb` - Advanced plotting

**Resources:**
- BenchBox Documentation: https://github.com/joeharris76/benchbox
- DuckDB Documentation: https://duckdb.org/docs/
- TPC Benchmarks: http://www.tpc.org/