# iSamples Comprehensive Schema Benchmark

**Purpose**: A like-for-like comparison of the iSamples Parquet formats, with every format covering the same source samples.

## Formats Compared

| Format | Source | Samples | Has H3 | Notes |
|--------|--------|---------|--------|-------|
| **Export** | Zenodo | 6.7M | No | One row per sample, nested STRUCTs |
| **Narrow** | Zenodo | 6.7M | No | Graph with edge rows |
| **Wide** | Zenodo | 6.7M | No | Entity-centric, p__* arrays |
| **Frontend** | Generated | 6.7M | **Yes** | Export + H3 columns |

## Equivalence Verified

Tests in `pqg/tests/test_format_equivalence.py` confirm:
- All formats have identical sample counts
- 100% material category concept overlap
- Same sample identifiers
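
The identifier check can be pictured as a set comparison. The sketch below is an illustration only, not the code in `pqg/tests/test_format_equivalence.py`; the function name `diff_identifiers` and the sample identifiers are hypothetical.

```python
def diff_identifiers(reference, candidate):
    """Return identifiers missing from, and unexpected in, `candidate`."""
    ref, cand = set(reference), set(candidate)
    return {
        "missing": sorted(ref - cand),      # in reference only
        "unexpected": sorted(cand - ref),   # in candidate only
    }

# Hypothetical identifiers, for illustration only
export_ids = ["ark:/1/a", "ark:/1/b", "ark:/1/c"]
wide_ids = ["ark:/1/c", "ark:/1/a", "ark:/1/b"]

report = diff_identifiers(export_ids, wide_ids)
print(report)
```

Two formats would count as equivalent on identifiers when both lists in the report are empty, regardless of row order.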

## 1. Setup

In [1]:
import duckdb
import pandas as pd
import time
from pathlib import Path

# All file paths
PATHS = {
    # Zenodo formats (same source coverage)
    'export': Path.home() / 'Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'narrow': Path.home() / 'Data/iSample/pqg_refining/zenodo_narrow_strict.parquet',
    'wide': Path.home() / 'Data/iSample/pqg_refining/zenodo_wide_strict.parquet',
    # Frontend bundle (export + H3)
    'frontend': Path.home() / 'Data/iSample/frontend_bundle/samples_frontend.parquet',
    'h3_cache': Path.home() / 'Data/iSample/frontend_bundle/h3_cache.parquet',
    'lookup_agents': Path.home() / 'Data/iSample/frontend_bundle/lookup_agents.parquet',
}

# Verify files exist and show sizes
print("File Inventory:")
print("=" * 60)
for name, path in PATHS.items():
    if path.exists():
        size_mb = path.stat().st_size / 1e6
        print(f"✅ {name:15} {size_mb:>8.1f} MB  {path.name}")
    else:
        print(f"❌ {name:15} NOT FOUND      {path}")

File Inventory:
✅ export             297.0 MB  isamples_export_2025_04_21_16_23_46_geo.parquet
✅ narrow             743.4 MB  zenodo_narrow_strict.parquet
✅ wide               253.9 MB  zenodo_wide_strict.parquet
✅ frontend           250.1 MB  samples_frontend.parquet
✅ h3_cache             7.4 MB  h3_cache.parquet
✅ lookup_agents      160.9 MB  lookup_agents.parquet


In [2]:
# Benchmark utilities
con = duckdb.connect()

def benchmark(sql, name="Query", iterations=3):
    """Run query multiple times and return (result, avg_ms, min_ms).

    Timings include converting the result to a pandas DataFrame.
    """
    times = []
    result = None
    for _ in range(iterations):
        # perf_counter is monotonic and higher resolution than time.time()
        start = time.perf_counter()
        result = con.sql(sql).fetchdf()
        times.append((time.perf_counter() - start) * 1000)
    avg_ms = sum(times) / len(times)
    min_ms = min(times)
    print(f"{name}: {avg_ms:.1f}ms avg, {min_ms:.1f}ms min, {len(result):,} rows")
    return result, avg_ms, min_ms

# Store results
results = {}
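
With only three iterations, the first (cold) run can dominate the average. One possible variant of the timing loop, sketched here under the assumption that warm-up runs and a median are wanted, takes any callable. The helper name `time_callable` and its `warmup` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import statistics
import time

def time_callable(fn, iterations=3, warmup=1):
    """Time fn() with a monotonic clock; return (median_ms, min_ms)."""
    for _ in range(warmup):
        fn()  # warm caches; these runs are not recorded
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), min(samples)

# Stand-in workload; in the notebook this would wrap a query call
median_ms, min_ms = time_callable(lambda: sum(range(100_000)))
print(f"{median_ms:.2f}ms median, {min_ms:.2f}ms min")
```

The median is less sensitive to a single slow run than the mean, which matters most when the iteration count is small.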

## 2. Row Count Verification

In [3]:
# Verify all formats have same sample count
counts = {}

# Export/Frontend: direct count
for name in ['export', 'frontend']:
    if PATHS[name].exists():
        counts[name] = con.sql(f"SELECT COUNT(*) FROM read_parquet('{PATHS[name]}')").fetchone()[0]

# Narrow/Wide: count MaterialSampleRecord only
for name in ['narrow', 'wide']:
    if PATHS[name].exists():
        counts[name] = con.sql(f"""
            SELECT COUNT(*) FROM read_parquet('{PATHS[name]}')
            WHERE otype = 'MaterialSampleRecord'
        """).fetchone()[0]

print("Sample Counts:")
for name, count in counts.items():
    print(f"  {name}: {count:,}")

# Verify equivalence
unique_counts = set(counts.values())
if len(unique_counts) == 1:
    print(f"\n✅ All formats have identical sample count: {list(unique_counts)[0]:,}")
else:
    print(f"\n⚠️ Sample counts differ: {unique_counts}")

Sample Counts:
  export: 6,680,932
  frontend: 6,680,932
  narrow: 6,680,932
  wide: 6,680,932

✅ All formats have identical sample count: 6,680,932


## 3. Benchmark: Map Coordinates Query

**Use case**: Render all sample points on a map

In [4]:
print("=" * 60)
print("BENCHMARK: Get All Coordinates")
print("=" * 60)

# Export
_, export_ms, _ = benchmark(f"""
    SELECT sample_location_latitude as lat, sample_location_longitude as lon
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_location_latitude IS NOT NULL
""", "Export")

# Frontend (same as export but sorted)
_, frontend_ms, _ = benchmark(f"""
    SELECT sample_location_latitude as lat, sample_location_longitude as lon
    FROM read_parquet('{PATHS['frontend']}')
    WHERE sample_location_latitude IS NOT NULL
""", "Frontend")

# Narrow
_, narrow_ms, _ = benchmark(f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "Narrow")

# Wide
_, wide_ms, _ = benchmark(f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "Wide")

results['map_coords'] = {
    'export': export_ms,
    'frontend': frontend_ms,
    'narrow': narrow_ms,
    'wide': wide_ms,
}

BENCHMARK: Get All Coordinates
Export: 23.4ms avg, 21.7ms min, 5,980,282 rows
Frontend: 30.6ms avg, 28.9ms min, 5,980,282 rows
Narrow: 51.5ms avg, 48.7ms min, 5,980,282 rows
Wide: 31.7ms avg, 30.6ms min, 5,980,282 rows


## 4. Benchmark: H3 Hexbin Aggregation

**Use case**: Initial map render with hexbin aggregation

In [5]:
print("=" * 60)
print("BENCHMARK: H3 Hexbin Aggregation (Resolution 6)")
print("=" * 60)

# Frontend with pre-computed H3
_, frontend_h3_ms, _ = benchmark(f"""
    SELECT h3_06, COUNT(*) as cnt
    FROM read_parquet('{PATHS['frontend']}')
    WHERE h3_06 IS NOT NULL
    GROUP BY h3_06
""", "Frontend (pre-computed H3)")

# H3 Cache (even faster - pre-aggregated)
_, h3_cache_ms, _ = benchmark(f"""
    SELECT h3_index, SUM(sample_count) as cnt
    FROM read_parquet('{PATHS['h3_cache']}')
    WHERE resolution = 6
    GROUP BY h3_index
""", "H3 Cache (pre-aggregated)")

# Export, Narrow and Wide carry no H3 column, so they are skipped here;
# their cells would have to be computed at query time
print("Export: N/A (no H3 column, would need runtime computation)")
print("Narrow/Wide: N/A (no H3 column)")

results['h3_agg'] = {
    'frontend': frontend_h3_ms,
    'h3_cache': h3_cache_ms,
}

BENCHMARK: H3 Hexbin Aggregation (Resolution 6)
Frontend (pre-computed H3): 18.6ms avg, 17.6ms min, 111,681 rows
H3 Cache (pre-aggregated): 8.3ms avg, 8.0ms min, 111,613 rows
Export: N/A (no H3 column, would need runtime computation)
Narrow/Wide: N/A (no H3 column)
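
The pre-aggregated cache is faster because the grouping has already been done once, ahead of time. How `h3_cache.parquet` was actually built is not shown in this notebook; the sketch below only illustrates the idea. The output keys mirror the columns queried above (`h3_index`, `resolution`, `sample_count`), while the function name `build_h3_cache` and the cell ids are hypothetical.

```python
from collections import Counter

def build_h3_cache(sample_cells, resolution):
    """Collapse per-sample cell ids into one row per cell with a count."""
    counts = Counter(cell for cell in sample_cells if cell is not None)
    return [
        {"h3_index": cell, "resolution": resolution, "sample_count": n}
        for cell, n in sorted(counts.items())
    ]

# Hypothetical cell ids standing in for the h3_06 column; None = no coordinates
h3_06 = ["cell_a", "cell_b", "cell_a", None, "cell_a"]
cache_rows = build_h3_cache(h3_06, resolution=6)
print(cache_rows)
```

A client that only needs hexbin counts can then read one small row per cell instead of one row per sample.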


## 5. Benchmark: Faceted Search

**Use case**: Show material category counts in search UI

In [6]:
print("=" * 60)
print("BENCHMARK: Material Category Facets")
print("=" * 60)

# Export
_, export_facet_ms, _ = benchmark(f"""
    SELECT mat.identifier as material, COUNT(*) as cnt
    FROM (
        SELECT unnest(has_material_category) as mat
        FROM read_parquet('{PATHS['export']}')
        WHERE has_material_category IS NOT NULL AND len(has_material_category) > 0
    )
    GROUP BY mat.identifier
    ORDER BY cnt DESC
    LIMIT 10
""", "Export")

# Frontend (same as export)
_, frontend_facet_ms, _ = benchmark(f"""
    SELECT mat.identifier as material, COUNT(*) as cnt
    FROM (
        SELECT unnest(has_material_category) as mat
        FROM read_parquet('{PATHS['frontend']}')
        WHERE has_material_category IS NOT NULL AND len(has_material_category) > 0
    )
    GROUP BY mat.identifier
    ORDER BY cnt DESC
    LIMIT 10
""", "Frontend")

# Narrow
_, narrow_facet_ms, _ = benchmark(f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, pid
        FROM read_parquet('{PATHS['narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.pid as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.pid
    ORDER BY cnt DESC
    LIMIT 10
""", "Narrow")

# Wide
_, wide_facet_ms, _ = benchmark(f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, pid
        FROM read_parquet('{PATHS['wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.pid as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.pid
    ORDER BY cnt DESC
    LIMIT 10
""", "Wide")

results['facets'] = {
    'export': export_facet_ms,
    'frontend': frontend_facet_ms,
    'narrow': narrow_facet_ms,
    'wide': wide_facet_ms,
}

BENCHMARK: Material Category Facets
Export: 53.7ms avg, 52.5ms min, 10 rows
Frontend: 41.8ms avg, 35.2ms min, 10 rows
Narrow: 246.5ms avg, 244.1ms min, 10 rows
Wide: 200.0ms avg, 195.4ms min, 10 rows


## 6. Benchmark: Agent Lookup (with Inverted Index)

**Use case**: Find all samples by a specific agent

In [7]:
AGENT_NAME = 'Jacob Freeman'  # Common agent in data

print("=" * 60)
print(f"BENCHMARK: Find Samples by Agent '{AGENT_NAME}'")
print("=" * 60)

# Export (scan nested struct - O(n))
_, export_agent_ms, _ = benchmark(f"""
    SELECT sample_identifier, label
    FROM read_parquet('{PATHS['export']}')
    WHERE list_contains(
        [r.name FOR r IN produced_by.responsibility],
        '{AGENT_NAME}'
    )
    LIMIT 100
""", "Export (O(n) scan)")

# Lookup table (inverted index - O(log n))
_, lookup_agent_ms, _ = benchmark(f"""
    SELECT sample_identifier
    FROM read_parquet('{PATHS['lookup_agents']}')
    WHERE agent_name = '{AGENT_NAME}'
    LIMIT 100
""", "Lookup Table (O(log n))")

results['agent_lookup'] = {
    'export': export_agent_ms,
    'lookup': lookup_agent_ms,
}

print(f"\nSpeedup: {export_agent_ms / lookup_agent_ms:.1f}x faster with inverted index")

BENCHMARK: Find Samples by Agent 'Jacob Freeman'
Export (O(n) scan): 482.8ms avg, 479.0ms min, 100 rows
Lookup Table (O(log n)): 9.3ms avg, 8.5ms min, 100 rows

Speedup: 52.0x faster with inverted index
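
The difference between the two queries can be sketched in plain Python. This is a conceptual illustration only: the real lookup is a flat Parquet table with `agent_name` and `sample_identifier` columns, not a dictionary, and the records and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; "agents" stands in for the names nested in each row
records = [
    {"sample_identifier": "s1", "agents": ["Jacob Freeman", "A. Person"]},
    {"sample_identifier": "s2", "agents": ["B. Person"]},
    {"sample_identifier": "s3", "agents": ["Jacob Freeman"]},
]

def scan_for_agent(records, name):
    """Check every record's nested agent list (work grows with record count)."""
    return [r["sample_identifier"] for r in records if name in r["agents"]]

def build_agent_index(records):
    """Invert the data once: agent name -> list of sample identifiers."""
    index = defaultdict(list)
    for r in records:
        for agent in r["agents"]:
            index[agent].append(r["sample_identifier"])
    return dict(index)

agent_index = build_agent_index(records)
print(agent_index["Jacob Freeman"])
```

Both approaches return the same identifiers; the index pays the cost of walking the nested lists once, at build time, rather than on every search.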


## 7. Benchmark Summary

In [8]:
print("=" * 70)
print("BENCHMARK SUMMARY (times in ms, lower is better)")
print("=" * 70)

summary_data = []

# Map coordinates
if 'map_coords' in results:
    r = results['map_coords']
    summary_data.append({
        'Query': 'Map Coordinates',
        'Export': f"{r['export']:.0f}",
        'Frontend': f"{r['frontend']:.0f}",
        'Narrow': f"{r['narrow']:.0f}",
        'Wide': f"{r['wide']:.0f}",
        'Winner': min(r, key=r.get),
    })

# H3 aggregation
if 'h3_agg' in results:
    r = results['h3_agg']
    summary_data.append({
        'Query': 'H3 Hexbin Agg',
        'Export': 'N/A',
        'Frontend': f"{r['frontend']:.0f}",
        'Narrow': 'N/A',
        'Wide': 'N/A',
        'Winner': 'frontend',
    })

# Facets
if 'facets' in results:
    r = results['facets']
    summary_data.append({
        'Query': 'Material Facets',
        'Export': f"{r['export']:.0f}",
        'Frontend': f"{r['frontend']:.0f}",
        'Narrow': f"{r['narrow']:.0f}",
        'Wide': f"{r['wide']:.0f}",
        'Winner': min(r, key=r.get),
    })

# Agent lookup
if 'agent_lookup' in results:
    r = results['agent_lookup']
    summary_data.append({
        'Query': 'Agent Lookup',
        'Export': f"{r['export']:.0f}",
        'Frontend': f"{r['export']:.0f}",  # Not benchmarked separately; export timing reused
        'Narrow': 'N/A',
        'Wide': 'N/A',
        'Winner': 'lookup table',
    })

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))

BENCHMARK SUMMARY (times in ms, lower is better)
          Query Export Frontend Narrow Wide       Winner
Map Coordinates     23       31     52   32       export
  H3 Hexbin Agg    N/A       19    N/A  N/A     frontend
Material Facets     54       42    246  200     frontend
   Agent Lookup    483      483    N/A  N/A lookup table
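
Absolute milliseconds are easier to compare when expressed relative to the fastest format for each query. The sketch below does this for the two queries that all four formats ran, using the average timings printed in the benchmark outputs above; the helper name `relative_to_best` is hypothetical.

```python
# Average timings (ms) copied from the benchmark outputs above
timings = {
    "Map Coordinates": {"export": 23.4, "frontend": 30.6, "narrow": 51.5, "wide": 31.7},
    "Material Facets": {"export": 53.7, "frontend": 41.8, "narrow": 246.5, "wide": 200.0},
}

def relative_to_best(row):
    """Express each timing as a multiple of the fastest one in the row."""
    best = min(row.values())
    return {fmt: round(ms / best, 1) for fmt, ms in row.items()}

for query, row in timings.items():
    print(query, relative_to_best(row))
```

On these numbers the spread is modest for the coordinate scan and much larger for facets, where the graph formats need a JOIN to resolve concept identifiers.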


## 8. Storage Comparison

In [9]:
storage_data = []
for name, path in PATHS.items():
    if path.exists():
        size_mb = path.stat().st_size / 1e6
        storage_data.append({
            'Format': name,
            'Size (MB)': f"{size_mb:.1f}",
            'Has H3': 'Yes' if name == 'frontend' else ('Cache' if name == 'h3_cache' else 'No'),
        })

storage_df = pd.DataFrame(storage_data)
print("Storage Comparison:")
print(storage_df.to_string(index=False))

# Total frontend bundle
bundle_size = sum(p.stat().st_size for name, p in PATHS.items() 
                  if name in ['frontend', 'h3_cache', 'lookup_agents'] and p.exists())
print(f"\nFrontend Bundle Total: {bundle_size/1e6:.1f} MB")

Storage Comparison:
       Format Size (MB) Has H3
       export     297.0     No
       narrow     743.4     No
         wide     253.9     No
     frontend     250.1    Yes
     h3_cache       7.4  Cache
lookup_agents     160.9     No

Frontend Bundle Total: 418.4 MB


## 9. Conclusions

### Format Recommendations by Use Case

| Use Case | Recommended Format | Reason |
|----------|-------------------|--------|
| **Frontend UI** | Frontend + Lookup Tables | H3 for instant hexbins, inverted index for agent search |
| **Map First Paint** | H3 Cache | Pre-aggregated, ~7MB download |
| **Sample Detail** | Export/Frontend | All data in one row, no JOINs |
| **Entity Listing** | Narrow/Wide | Direct otype filter |
| **Graph Traversal** | Narrow | Full edge flexibility |
| **Archival** | Narrow | Highest fidelity |

### H3 Column Impact

Adding H3 columns enables:
- Instant hexbin aggregation without runtime computation
- Multiple resolution support (h3_05, h3_06, h3_07)
- Small size cost: 3 columns × 8 bytes × 6.7M rows ≈ 160 MB uncompressed and much less once compressed (the frontend file is 250.1 MB on disk, against 297.0 MB for export)
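
The uncompressed estimate can be reproduced from the sample count reported in the row count verification. The 8-byte width assumes each H3 index is stored as a 64-bit integer, which this notebook does not confirm.

```python
ROWS = 6_680_932        # sample count from the row count verification
H3_COLUMNS = 3          # h3_05, h3_06, h3_07
BYTES_PER_CELL = 8      # assumption: one 64-bit integer per H3 index

uncompressed_mb = ROWS * H3_COLUMNS * BYTES_PER_CELL / 1e6
print(f"{uncompressed_mb:.1f} MB uncompressed")  # 160.3 MB uncompressed
```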

### Inverted Index Impact

Agent lookup table provides:
- O(log n) instead of O(n) for "samples by agent" queries
- A 52x speedup in the one query measured here (483 ms vs. 9 ms); the expected range for common search patterns is ~10-100x