# iSamples Parquet Schema Comparison

**Goal**: Understand the tradeoffs among five parquet formats for iSamples data.

| Format | Philosophy | Sources | Relationships |
|--------|-----------|---------|---------------|
| **Export** | Sample-centric (flat) | All 4 sources | Nested STRUCTs |
| **Zenodo Narrow** | Graph (nodes + edges) | All 4 sources | Separate `_edge_` rows |
| **Zenodo Wide** | Entity-centric | All 4 sources | `p__*` arrays → row_ids |
| **Eric's Narrow** | Graph (nodes + edges) | OpenContext only | Separate `_edge_` rows |
| **Eric's Wide** | Entity-centric | OpenContext only | `p__*` arrays → row_ids |

**Key insight**: There is no universal best format. Each optimizes for different query patterns.
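
As a rough illustration of the table above, the same toy sample can be written in each of the three layouts. The records below are invented for illustration: the field names (`produced_by`, `responsibility`, `otype`, `row_id`, `s`, `p`, `o`, `p__produced_by`) follow the columns queried later in this notebook, but the values and the simplified shapes are assumptions, not actual iSamples rows. `narrow_to_wide` is a hypothetical helper, not part of any iSamples tooling.

```python
# Toy data only: one sample, one sampling event, one agent.

# Export: one flat row per sample, related entities nested inside it.
export_rows = [
    {
        "sample_identifier": "ark:/example/1",
        "produced_by": {"responsibility": [{"name": "A. Collector", "role": "collector"}]},
    }
]

# Narrow: every entity is a node row, every relationship is an `_edge_` row.
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "sample_identifier": "ark:/example/1"},
    {"row_id": 2, "otype": "SamplingEvent"},
    {"row_id": 3, "otype": "Agent", "name": "A. Collector"},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 2, "p": "responsibility", "o": [3]},
]

# Wide: the same nodes, but edges are folded into `p__*` arrays of row_ids.
wide_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord",
     "sample_identifier": "ark:/example/1", "p__produced_by": [2]},
    {"row_id": 2, "otype": "SamplingEvent", "p__responsibility": [3]},
    {"row_id": 3, "otype": "Agent", "name": "A. Collector"},
]


def narrow_to_wide(rows):
    """Fold `_edge_` rows into `p__<predicate>` arrays on their subject node."""
    nodes = {r["row_id"]: dict(r) for r in rows if r["otype"] != "_edge_"}
    for edge in (r for r in rows if r["otype"] == "_edge_"):
        nodes[edge["s"]].setdefault(f"p__{edge['p']}", []).extend(edge["o"])
    return list(nodes.values())


# Row counts differ even though the information content is the same.
print(len(export_rows), len(narrow_rows), len(wide_rows))
```

Even in this tiny case the narrow layout needs the most rows, because each relationship becomes a row of its own.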

---

## Portability

This notebook works in multiple environments:

| Environment | Behavior |
|-------------|----------|
| **Raymond's laptop** | Uses local files in `~/Data/iSample/` |
| **mybinder.org** | Downloads to `/tmp/pqgfiles/` cache |
| **Other users** | Downloads to `/tmp/pqgfiles/` when `/tmp` exists and `~/Data/iSample/` does not; otherwise to `~/Data/iSample/pqg_cache/` |

**Configuration options** (in cell 2):
- `CACHE_DIR`: Override with `ISAMPLES_CACHE_DIR` env var
- `USE_REMOTE=True`: Skip downloads, query remote parquet via HTTP (slower but no disk)
- `DOWNLOAD_MISSING=False`: Error instead of downloading missing files
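
The cache-directory choice can be summarized as a small decision function. This is a sketch of the rule used by the setup cell, rewritten as a pure function so it can be tried without touching the filesystem. `pick_cache_dir` and its boolean parameters are hypothetical names introduced here.

```python
from pathlib import Path


def pick_cache_dir(env, home, tmp_exists, home_data_exists):
    """Return the cache directory for a given environment (hypothetical helper).

    env: mapping of environment variables (e.g. os.environ)
    home: the user's home directory as a Path
    tmp_exists: whether /tmp exists
    home_data_exists: whether ~/Data/iSample exists
    """
    override = env.get("ISAMPLES_CACHE_DIR")
    if override is not None:
        # An explicit override always wins.
        return Path(override)
    if tmp_exists and not home_data_exists:
        # Typical for mybinder.org and other machines without ~/Data/iSample.
        return Path("/tmp/pqgfiles")
    return home / "Data/iSample/pqg_cache"


print(pick_cache_dir({}, Path("/home/user"), tmp_exists=True, home_data_exists=False))
```

Passing the environment and filesystem facts in as arguments keeps the rule easy to check for each row of the table above.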

---

## Data Source Coverage

| Format | Sources | Description |
|--------|---------|-------------|
| **Export, Zenodo Narrow, Zenodo Wide** | SESAR, OpenContext, GEOME, Smithsonian | Full iSamples (~6.7M samples) |
| **Eric's Narrow, Eric's Wide** | OpenContext only | Subset (~1.1M samples) |

This allows fair comparisons:
- **Apples-to-apples**: Export vs Zenodo Narrow vs Zenodo Wide (same data)
- **Structure comparison**: Eric's Narrow vs Eric's Wide (same data, different structure)

## 1. Setup & Load Data

In [None]:
import duckdb
import pandas as pd
import time
import os
import urllib.request
from pathlib import Path

# =============================================================================
# CONFIGURATION - Edit these paths for your environment
# =============================================================================

# Cache directory for downloaded files (used when local paths don't exist)
# - On mybinder.org: uses /tmp/pqgfiles
# - Locally: uses ~/Data/iSample/pqg_cache (or override with ISAMPLES_CACHE_DIR env var)
CACHE_DIR = Path(os.environ.get('ISAMPLES_CACHE_DIR', 
                                '/tmp/pqgfiles' if Path('/tmp').exists() and not Path.home().joinpath('Data/iSample').exists()
                                else Path.home() / 'Data/iSample/pqg_cache'))

# Local paths (Raymond's setup) - these are checked first
LOCAL_PATHS = {
    'export': Path.home() / 'Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'zenodo_narrow': Path.home() / 'Data/iSample/pqg_refining/zenodo_narrow_strict.parquet',
    'zenodo_wide': Path.home() / 'Data/iSample/pqg_refining/zenodo_wide_strict.parquet',
    'eric_narrow': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg.parquet',
    'eric_wide': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet',
}

# Remote URLs - fallback when local files don't exist
URLS = {
    'export': 'https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'zenodo_narrow': 'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_narrow.parquet',
    'zenodo_wide': 'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_wide.parquet',
    'eric_narrow': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet',
    'eric_wide': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg_wide.parquet',
}

# =============================================================================
# PATH RESOLUTION - Automatically finds or downloads files
# =============================================================================

def resolve_path(name: str, local_paths: dict, urls: dict, cache_dir: Path, 
                 download: bool = True, use_remote: bool = False) -> Path:
    """
    Resolve file path: check local first, then cache, optionally download.
    
    Args:
        name: File identifier (e.g., 'export', 'zenodo_wide')
        local_paths: Dict of local file paths to check first
        urls: Dict of remote URLs for downloading
        cache_dir: Directory for cached downloads
        download: If True, download missing files to cache
        use_remote: If True, return URL for DuckDB remote access (no download)
    
    Returns:
        Path to local file, or URL string if use_remote=True
    """
    # Option 1: Local file exists
    if name in local_paths and local_paths[name].exists():
        return local_paths[name]
    
    # Option 2: Return URL for remote access (DuckDB can read directly)
    if use_remote and name in urls:
        return urls[name]
    
    # Option 3: Check cache
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_file = cache_dir / f"{name}.parquet"
    
    if cached_file.exists():
        return cached_file
    
    # Option 4: Download to cache
    if download and name in urls:
        url = urls[name]
        print(f"Downloading {name} from {url}...")
        print(f"  -> {cached_file}")
        
        # Download with progress
        def progress_hook(block_num, block_size, total_size):
            downloaded = block_num * block_size
            if total_size > 0:
                pct = min(100, downloaded * 100 // total_size)
                mb = downloaded / 1e6
                total_mb = total_size / 1e6
                print(f"\r  Progress: {pct}% ({mb:.1f}/{total_mb:.1f} MB)", end='', flush=True)
        
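        # Download to a temporary name first so an interrupted transfer
        # is not later mistaken for a complete cached file
        partial_file = cached_file.with_name(cached_file.name + '.part')
        urllib.request.urlretrieve(url, partial_file, reporthook=progress_hook)
        print()  # newline after progress
        partial_file.replace(cached_file)
        return cached_file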
    
    # No file available
    raise FileNotFoundError(f"File '{name}' not found locally or in cache, and no download was performed")

# =============================================================================
# RESOLVE ALL PATHS
# =============================================================================

# Set to True to skip downloads and use DuckDB's remote parquet reading
# (Slower queries but no disk usage - good for quick exploration)
USE_REMOTE = False

# Set to False to skip downloading missing files (will error if not found)
DOWNLOAD_MISSING = True

print(f"Cache directory: {CACHE_DIR}")
print(f"Use remote: {USE_REMOTE}, Download missing: {DOWNLOAD_MISSING}\n")

PATHS = {}
for name in ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    try:
        path = resolve_path(name, LOCAL_PATHS, URLS, CACHE_DIR, 
                           download=DOWNLOAD_MISSING, use_remote=USE_REMOTE)
        PATHS[name] = path
    except FileNotFoundError as e:
        print(f"⚠️ {name}: {e}")
        PATHS[name] = None

# =============================================================================
# VERIFY FILES
# =============================================================================

def get_file_info(path):
    """Get file info - works for both local paths and URLs."""
    if path is None:
        return '❌', 'Not available'
    if isinstance(path, str) and path.startswith('http'):
        return '🌐', 'Remote URL'
    if Path(path).exists():
        size_mb = Path(path).stat().st_size / 1e6
        return '✅', f'{size_mb:.1f} MB'
    return '❌', 'Not found'

print("=== Full iSamples (all sources) ===")
for name in ['export', 'zenodo_narrow', 'zenodo_wide']:
    status, info = get_file_info(PATHS.get(name))
    source = "local" if PATHS.get(name) and Path(PATHS[name]).exists() and PATHS[name] in LOCAL_PATHS.values() else "cache/remote"
    print(f'{status} {name}: {info} ({source})')

print("\n=== OpenContext only (Eric's) ===")
for name in ['eric_narrow', 'eric_wide']:
    status, info = get_file_info(PATHS.get(name))
    source = "local" if PATHS.get(name) and Path(PATHS[name]).exists() and PATHS[name] in LOCAL_PATHS.values() else "cache/remote"
    print(f'{status} {name}: {info} ({source})')

In [None]:
# Helper functions for timing queries
import statistics

def timed_query(con, sql, name="Query"):
    """Execute query and return (result_df, elapsed_ms)"""
    start = time.perf_counter()  # monotonic, high-resolution timer
    result = con.sql(sql).fetchdf()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name}: {elapsed:.1f}ms, {len(result):,} rows")
    return result, elapsed

def timed_query_multirun(con, sql, name="Query", runs=3):
    """Execute query multiple times and return (result_df, mean_ms, stddev_ms)"""
    times = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        result = con.sql(sql).fetchdf()
        elapsed = (time.perf_counter() - start) * 1000
        times.append(elapsed)
    
    mean_ms = statistics.mean(times)
    stddev_ms = statistics.stdev(times) if len(times) > 1 else 0
    print(f"{name}: {mean_ms:.1f}ms ± {stddev_ms:.1f}ms (n={runs}), {len(result):,} rows")
    return result, mean_ms, stddev_ms

# Create connection
con = duckdb.connect()

## 2. Schema Inspection

Understanding what columns exist and their types.

In [None]:
# Helper to check if path is available (works for Path objects and URL strings)
def path_available(path):
    """Check if a path is available (local file exists or is a URL)."""
    if path is None:
        return False
    if isinstance(path, str) and path.startswith('http'):
        return True  # URLs are assumed available
    return Path(path).exists()

# Get schema for each format
schemas = {}
for name, path in PATHS.items():
    if path_available(path):
        result = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchdf()
        schemas[name] = result
        print(f"\n=== {name.upper()} ({len(result)} columns) ===")
        # Show just first 15 columns to keep output manageable
        print(result[['column_name', 'column_type']].head(15).to_string())
        if len(result) > 15:
            print(f"  ... and {len(result) - 15} more columns")
    else:
        print(f"\n=== {name.upper()} ===")
        print(f"  ⚠️ Not available")

In [None]:
# Compare column counts and key structural differences (computed from schemas)
def check_schema_features(schema_df):
    """Analyze schema DataFrame for structural features."""
    if schema_df is None or len(schema_df) == 0:
        return {'columns': 0, 'has_edge_cols': False, 'has_p__cols': False, 
                'has_nested_structs': False, 'has_otype': False}
    
    cols = set(schema_df['column_name'].tolist())
    types = dict(zip(schema_df['column_name'], schema_df['column_type']))
    
    return {
        'columns': len(schema_df),
        'has_edge_cols': all(c in cols for c in ['s', 'p', 'o']),
        'has_p__cols': any(c.startswith('p__') for c in cols),
        'has_nested_structs': any('STRUCT' in str(t) for t in types.values()),
        'has_otype': 'otype' in cols,
    }

# Compute features for each format
format_order = ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']
features = {name: check_schema_features(schemas.get(name)) for name in format_order}

# Build comparison table
comparison = pd.DataFrame([
    {
        'Format': name.replace('_', ' ').title(),
        'Data': 'Full' if name in ['export', 'zenodo_narrow', 'zenodo_wide'] else 'OC only',
        'Columns': features[name]['columns'],
        'Edge cols (s,p,o)': '✓' if features[name]['has_edge_cols'] else '',
        'p__* cols': '✓' if features[name]['has_p__cols'] else '',
        'Nested STRUCTs': '✓' if features[name]['has_nested_structs'] else '',
        'otype col': '✓' if features[name]['has_otype'] else '',
    }
    for name in format_order
])
comparison

## 3. Row Count Analysis

Understanding what's IN each format.

In [None]:
# Total row counts
row_counts = {}
print("=== Full iSamples ===")
for name in ['export', 'zenodo_narrow', 'zenodo_wide']:
    path = PATHS.get(name)
    if path_available(path):
        count = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
        row_counts[name] = count
        print(f"{name}: {count:,} rows")
    else:
        print(f"{name}: ⚠️ Not available")

print("\n=== OpenContext only ===")
for name in ['eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        count = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
        row_counts[name] = count
        print(f"{name}: {count:,} rows")
    else:
        print(f"{name}: ⚠️ Not available")

In [None]:
# For PQG formats: breakdown by otype
for name in ['zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        print(f"=== {name.upper()}: Rows by otype ===")
        result = con.sql(f"""
            SELECT otype, COUNT(*) as cnt 
            FROM read_parquet('{path}')
            GROUP BY otype ORDER BY cnt DESC
        """).fetchdf()
        print(result.to_string())
        print()

In [None]:
# For Export: breakdown by source_collection
print("=== EXPORT: Rows by source_collection ===")
if path_available(PATHS.get('export')):
    result = con.sql(f"""
        SELECT source_collection, COUNT(*) as cnt 
        FROM read_parquet('{PATHS['export']}')
        GROUP BY source_collection ORDER BY cnt DESC
    """).fetchdf()
    print(result.to_string())
else:
    print("⚠️ Export file not available")

## 4. Query Benchmark Suite

Testing common query patterns across all five formats.

### 4.1 Map Visualization: Get All Coordinates

**Use case**: Render points on a Cesium/Leaflet map

In [None]:
# EXPORT: Direct column access
print("=== EXPORT (full iSamples) ===")
export_coords, export_coords_time = timed_query(con, f"""
    SELECT sample_location_latitude as lat, sample_location_longitude as lon
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_location_latitude IS NOT NULL
""", "All coordinates")

In [None]:
# WIDE formats: Filter by otype
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_coords, zenodo_wide_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_coords, eric_wide_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['eric_wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

In [None]:
# NARROW formats: Filter by otype  
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_coords, zenodo_narrow_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['zenodo_narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_coords, eric_narrow_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['eric_narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

In [None]:
# Summary - Map query comparison
print("=== MAP QUERY SUMMARY ===")
print("\nFull iSamples (apples-to-apples comparison):")
print(f"  Export:        {export_coords_time:6.1f}ms ({len(export_coords):,} points)")
print(f"  Zenodo Wide:   {zenodo_wide_coords_time:6.1f}ms ({len(zenodo_wide_coords):,} points)")
print(f"  Zenodo Narrow: {zenodo_narrow_coords_time:6.1f}ms ({len(zenodo_narrow_coords):,} points)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_coords_time:6.1f}ms ({len(eric_wide_coords):,} points)")
print(f"  Eric Narrow:   {eric_narrow_coords_time:6.1f}ms ({len(eric_narrow_coords):,} points)")

print("\n💡 Key insight: Export returns coords directly; PQG formats need otype filter")

### 4.2 Faceted Search: Count by Material Category

**Use case**: Show facet counts in a search UI

In [None]:
# EXPORT: Unnest nested struct array
# SQL Complexity: 1 subquery, 0 JOINs - simple unnest
print("=== EXPORT (full iSamples) ===")
export_facets, export_facets_time = timed_query(con, f"""
    SELECT 
        mat.identifier as material,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(has_material_category) as mat
        FROM read_parquet('{PATHS['export']}')
        WHERE has_material_category IS NOT NULL AND len(has_material_category) > 0
    )
    GROUP BY mat.identifier
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(export_facets.to_string())

In [None]:
# WIDE formats: JOIN via p__has_material_category
# SQL Complexity: 2 CTEs, 1 JOIN - requires row_id lookup
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_facets, zenodo_wide_facets_time = timed_query(con, f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(zenodo_wide_facets.to_string())

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_facets, eric_wide_facets_time = timed_query(con, f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['eric_wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['eric_wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(eric_wide_facets.to_string())

In [None]:
# NARROW formats: Follow edges with predicate='has_material_category'
# SQL Complexity: 2 CTEs, 1 JOIN - requires edge traversal
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_facets, zenodo_narrow_facets_time = timed_query(con, f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['zenodo_narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['zenodo_narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(zenodo_narrow_facets.to_string())

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_facets, eric_narrow_facets_time = timed_query(con, f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['eric_narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['eric_narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(eric_narrow_facets.to_string())

In [None]:
# Facet query summary
print("=== FACET QUERY SUMMARY ===")
print("\nFull iSamples (apples-to-apples):")
print(f"  Export:        {export_facets_time:6.1f}ms (SQL: 1 subquery, 0 JOINs)")
print(f"  Zenodo Wide:   {zenodo_wide_facets_time:6.1f}ms (SQL: 2 CTEs, 1 JOIN)")
print(f"  Zenodo Narrow: {zenodo_narrow_facets_time:6.1f}ms (SQL: 2 CTEs, 1 JOIN)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_facets_time:6.1f}ms")
print(f"  Eric Narrow:   {eric_narrow_facets_time:6.1f}ms")

print("\n💡 Key insight: Export is simplest (no JOINs), but PQG returns human-readable labels")

### 4.3 Entity Listing: Get All Unique Agents

**Use case**: Populate a dropdown, show "who collected samples"

**Key tradeoff**: Export cannot do this efficiently!

In [None]:
# WIDE formats: Direct query on Agent rows
# SQL Complexity: 0 CTEs, 0 JOINs - simple otype filter
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_agents, zenodo_wide_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(zenodo_wide_agents.to_string())

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_agents, eric_wide_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['eric_wide']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(eric_wide_agents.to_string())

In [None]:
# NARROW formats: Same approach - otype filter
# SQL Complexity: 0 CTEs, 0 JOINs - simple otype filter
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_agents, zenodo_narrow_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['zenodo_narrow']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(zenodo_narrow_agents.to_string())

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_agents, eric_narrow_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['eric_narrow']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(eric_narrow_agents.to_string())

In [None]:
# EXPORT: Must scan all samples and extract from nested structs
# SQL Complexity: 1 subquery, 0 JOINs - but FULL TABLE SCAN required
# This is MUCH slower because agents are embedded in every sample row
print("=== EXPORT (full iSamples) ===")
export_agents, export_agents_time = timed_query(con, f"""
    SELECT 
        resp.name as name,
        resp.role as role,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(produced_by.responsibility) as resp
        FROM read_parquet('{PATHS['export']}')
        WHERE produced_by IS NOT NULL 
          AND produced_by.responsibility IS NOT NULL
    )
    GROUP BY resp.name, resp.role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents (from nested)")
print(export_agents.to_string())

In [None]:
# Agent listing summary
print("=== ENTITY LISTING SUMMARY ===")
print("\nFull iSamples (apples-to-apples):")
print(f"  Zenodo Wide:   {zenodo_wide_agents_time:6.1f}ms (SQL: 0 JOINs, otype filter)")
print(f"  Zenodo Narrow: {zenodo_narrow_agents_time:6.1f}ms (SQL: 0 JOINs, otype filter)")
print(f"  Export:        {export_agents_time:6.1f}ms (SQL: 0 JOINs, FULL SCAN)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_agents_time:6.1f}ms")
print(f"  Eric Narrow:   {eric_narrow_agents_time:6.1f}ms")

print("\n⚠️ Export is 10-100x SLOWER for entity listing!")
print("   Reason: Agents are embedded in every sample row, requiring full scan")
print("   PQG: Agents are separate rows, filtered by otype = 'Agent'")

### 4.4 Reverse Lookup: Samples by Agent

**Use case**: "Show me all samples collected by Agent X"

In [20]:
# First, pick an agent name that exists in all formats
# Using a common agent from the data
AGENT_NAME = 'Vance Vredenburg'  # Adjust based on your data

print(f"Looking for samples by: {AGENT_NAME}")

Looking for samples by: Vance Vredenburg


In [21]:
# EXPORT: Filter on nested struct
print("=== EXPORT ===")
export_by_agent, export_time = timed_query(con, f"""
    SELECT sample_identifier, label
    FROM read_parquet('{PATHS['export']}')
    WHERE list_contains(
        [r.name FOR r IN produced_by.responsibility],
        '{AGENT_NAME}'
    )
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(export_by_agent.to_string())

=== EXPORT ===
Samples by Vance Vredenburg: 4.9ms, 10 rows
     sample_identifier label
0   ark:/21547/DSz2757   757
1   ark:/21547/DSz2779   779
2   ark:/21547/DSz2806   806
3   ark:/21547/DSz2807   807
4   ark:/21547/DSz2759   759
5   ark:/21547/DSz2761   761
6   ark:/21547/DSz2967   967
7   ark:/21547/DSz2763   763
8   ark:/21547/DSz2979   979
9  ark:/21547/DSz21792  1792


In [None]:
# WIDE: Find agent row_id, then find samples with that row_id in p__responsibility
# Note: Agent may not exist in Eric's OC-only data, so use Zenodo Wide for full coverage
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_by_agent, zenodo_wide_by_agent_time = timed_query(con, f"""
    WITH agent AS (
        SELECT row_id 
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'Agent' AND name = '{AGENT_NAME}'
        LIMIT 1
    ),
    events AS (
        SELECT w.row_id as event_id
        FROM read_parquet('{PATHS['zenodo_wide']}') w, agent
        WHERE w.otype = 'SamplingEvent' 
          AND list_contains(w.p__responsibility, agent.row_id)
    )
    SELECT s.sample_identifier, s.label
    FROM read_parquet('{PATHS['zenodo_wide']}') s, events
    WHERE s.otype = 'MaterialSampleRecord'
      AND list_contains(s.p__produced_by, events.event_id)
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(zenodo_wide_by_agent.to_string())

In [None]:
# Summary
print("\n=== REVERSE LOOKUP SUMMARY ===")
print(f"Export:      {export_time:.1f}ms ({len(export_by_agent)} rows)")
print(f"Zenodo Wide: {zenodo_wide_by_agent_time:.1f}ms ({len(zenodo_wide_by_agent)} rows)")
print("\nNote: Export's nested list_contains is efficient for this pattern")

### 4.5 Sample Detail: Get Full Info for One Sample

**Use case**: User clicks on a sample, show all details
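
In the entity-centric layouts a sample's related entities sit on other rows, so a detail view has to follow the `p__*` row_id arrays. The sketch below does that on invented toy rows in plain Python rather than SQL. `resolve_detail` is a hypothetical helper, and the rows only mimic the `row_id`/`otype`/`p__*` columns used elsewhere in this notebook.

```python
# Toy wide-format rows (invented values).
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord",
     "sample_identifier": "ark:/example/1", "p__produced_by": [2]},
    {"row_id": 2, "otype": "SamplingEvent", "p__responsibility": [3]},
    {"row_id": 3, "otype": "Agent", "name": "A. Collector"},
]


def resolve_detail(rows, sample_identifier, max_depth=3):
    """Expand a sample row by replacing `p__*` row_id arrays with the rows they point to."""
    by_id = {r["row_id"]: r for r in rows}

    def expand(row, depth):
        out = {}
        for key, value in row.items():
            if not key.startswith("p__"):
                out[key] = value
            elif depth < max_depth:
                # Follow each referenced row_id; unknown ids are skipped.
                out[key[3:]] = [expand(by_id[i], depth + 1) for i in value if i in by_id]
        return out

    for row in rows:
        if row.get("sample_identifier") == sample_identifier:
            return expand(row, 0)
    return None  # no row with that identifier


detail = resolve_detail(rows, "ark:/example/1")
print(detail["produced_by"][0]["responsibility"][0]["name"])
```

Each level of nesting here corresponds to one extra JOIN (or array lookup) in the SQL version, which is why a single-row Export lookup is simpler for this use case.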

In [24]:
# Pick a sample identifier
SAMPLE_ID = con.sql(f"""
    SELECT sample_identifier FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier IS NOT NULL LIMIT 1
""").fetchone()[0]
print(f"Sample: {SAMPLE_ID}")

Sample: ark:/21547/DSz2757


In [25]:
# EXPORT: Everything on one row
print("=== EXPORT ===")
start = time.time()
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
export_time = (time.time() - start) * 1000
print(f"Time: {export_time:.1f}ms")
print(f"Columns returned: {len(result.columns)}")
print(result.T)  # Transpose for readability

=== EXPORT ===
Time: 61.9ms
Columns returned: 19
                                                                           0
sample_identifier                                         ark:/21547/DSz2757
@id                                                   metadata/21547/DSz2757
label                                                                    757
description                                 basisOfRecord: PreservedSpecimen
source_collection                                                      GEOME
has_sample_object_type     [{'identifier': 'https://w3id.org/isample/voca...
has_material_category      [{'identifier': 'https://w3id.org/isample/voca...
has_context_category       [{'identifier': 'https://w3id.org/isample/biol...
informal_classification                                 [Taricha, granulosa]
keywords                     [{'keyword': 'California'}, {'keyword': 'USA'}]
produced_by                {'description': 'expeditionCode: newts | proje...
last_modified_time         

In [None]:
# ZENODO WIDE: Need to JOIN related entities
print("=== ZENODO WIDE ===")
start = time.time()
# This is more complex - would need multiple JOINs to get full picture
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
zenodo_wide_detail_time = (time.time() - start) * 1000
print(f"Time: {zenodo_wide_detail_time:.1f}ms")
print(f"Rows returned: {len(result)}")
if len(result) > 0:
    print(f"Columns returned: {len(result.columns)}")
    print("Note: This only returns the sample row, not related entities")
    print(result[['sample_identifier', 'label']].T)
else:
    print("Note: Sample not found (may be from GEOME source, not in this sample_identifier format)")

## 5. Storage Comparison

In [None]:
# File sizes and efficiency
def get_file_size_mb(path):
    """Get file size - returns None for URLs (size unknown without HEAD request)."""
    if path is None:
        return None
    if isinstance(path, str) and path.startswith('http'):
        return None  # Can't easily get URL size
    p = Path(path)
    if p.exists():
        return p.stat().st_size / 1e6
    return None

storage = []
for name in ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        size_mb = get_file_size_mb(path)
        rows = row_counts.get(name, 0)
        cols = len(schemas.get(name, []))
        bytes_per_row = (size_mb * 1e6) / rows if (size_mb and rows > 0) else None
        data_scope = 'Full' if name in ['export', 'zenodo_narrow', 'zenodo_wide'] else 'OC only'
        is_remote = isinstance(path, str) and path.startswith('http')
        storage.append({
            'Format': name.replace('_', ' ').title(),
            'Data': data_scope,
            'Size (MB)': f'{size_mb:.1f}' if size_mb else 'Remote',
            'Rows': f'{rows:,}',
            'Columns': cols,
            'Bytes/Row': f'{bytes_per_row:.1f}' if bytes_per_row else 'N/A',
        })

pd.DataFrame(storage)

## 6. Benchmark Summary

### Benchmark Results Summary

**Data Coverage Verification:**
- ✅ Export, Zenodo Narrow, Zenodo Wide all contain **6,680,932 samples** from all 4 sources
- ✅ Eric's Narrow/Wide contain OpenContext subset (~1.1M samples)

| Query Pattern | Best For | SQL Complexity | Notes |
|--------------|----------|----------------|-------|
| **Map (all coords)** | Export ≈ Zenodo Wide | Simple SELECT | Both ~30ms for 6M points |
| **Facets (material counts)** | Export | 1 subquery vs 2 CTEs + JOIN | Export has URIs, PQG has labels |
| **Entity listing (agents)** | PQG formats | 0 JOINs (otype filter) | Export requires full scan |
| **Reverse lookup by agent** | Export | list_contains() | Only works if agent exists |
| **Sample detail (one row)** | Export | Simple WHERE | All data on single row |

**Key tradeoffs:**
- **Export**: Best for UI (map + facets + detail) but slow for entity listing
- **PQG Wide**: Good balance - entities queryable, reasonable JOIN complexity
- **PQG Narrow**: Most flexible but slower (92M rows including edges)

## 7. Conclusions: When to Use Each Format

### Export Format
**Best for:**
- UI queries (map, search, facets)
- Sample-centric analysis
- When you don't need to query entities independently

**Avoid when:**
- You need to list all agents/sites/concepts
- You need graph traversal flexibility
- You need incremental updates

### Wide Format
**Best for:**
- Entity-centric queries ("all agents", "all sites")
- Analytical dashboards
- When you need both samples AND other entity types

**Avoid when:**
- Pure sample queries (Export is faster)
- Complex multi-hop traversals (Narrow is more natural)

### Narrow Format
**Best for:**
- Archival/preservation (full fidelity)
- Graph algorithms
- Relationship exploration
- When you need to traverse in any direction

**Avoid when:**
- Interactive UI (too slow)
- Simple sample queries (overkill)

## 8. Key Insights

### What Export Gains
1. **No JOINs** - Everything on one row
2. **Pre-extracted coords** - `sample_location_latitude/longitude` at top level
3. **Fewer rows** - 6.7M (Export) vs 19.5M (Wide) vs 92M (Narrow)

### What Export Loses
1. **Entity independence** - Can't query agents without scanning all samples
2. **Graph flexibility** - Can't traverse in arbitrary directions
3. **Incremental updates** - Must regenerate entire file

### The `list_contains()` Problem
Both Wide (p__* arrays) and Export (nested structs) suffer from O(n) scans when searching within arrays. Neither has index support in DuckDB/Parquet.
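
To make the cost concrete: without an index, every array-membership lookup must visit every row, whereas a mapping built once ahead of time answers repeated lookups directly. The sketch below uses plain Python lists and dicts as a stand-in for parquet rows. `scan_lookup` and `build_inverted_index` are hypothetical names, and building such a mapping is a manual pre-processing step, not something DuckDB or Parquet provides here.

```python
from collections import defaultdict

# Toy stand-in for wide-format rows that carry an array of agent row_ids.
rows = [
    {"row_id": 10, "p__responsibility": [1, 2]},
    {"row_id": 11, "p__responsibility": [2]},
    {"row_id": 12, "p__responsibility": []},
    {"row_id": 13, "p__responsibility": [1]},
]


def scan_lookup(rows, agent_id):
    """Linear scan: checks every row's array, like list_contains() over a column."""
    return [r["row_id"] for r in rows if agent_id in r["p__responsibility"]]


def build_inverted_index(rows):
    """One pass over the data to map agent row_id -> referencing row_ids."""
    index = defaultdict(list)
    for r in rows:
        for agent_id in r["p__responsibility"]:
            index[agent_id].append(r["row_id"])
    return index


index = build_inverted_index(rows)
print(scan_lookup(rows, 1))  # visits all rows on every call
print(index.get(1, []))      # single dictionary lookup after the one-time build
```

The trade-off is the usual one: the inverted mapping costs a full pass and extra storage up front, and it has to be rebuilt whenever the data changes.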

### Recommendation for Eric's UI
For the iSamples Central UI requirements:
- **Start with Export format** - fastest for map + facets + click-to-detail
- **Pre-compute H3 aggregations** - for initial map render
- **Pre-compute facet counts** - avoid runtime aggregation
- **Keep Wide/Narrow for advanced queries** - entity exploration, graph traversal
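
The two pre-computation steps can be sketched as follows. This is a minimal illustration on invented points: it bins coordinates on a fixed-size latitude/longitude grid as a simple stand-in for H3 cells (it does not use the H3 library), and it tallies facet values with `collections.Counter`. The names `grid_cell`, `precompute`, and the `cell_deg` parameter are assumptions made for the example.

```python
import json
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) index of the square grid cell containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def precompute(samples, cell_deg=1.0):
    """Aggregate once so a UI can load small summaries instead of raw rows."""
    cells = Counter()
    facets = Counter()
    for s in samples:
        if s.get("lat") is not None and s.get("lon") is not None:
            cells[grid_cell(s["lat"], s["lon"], cell_deg)] += 1
        for material in s.get("materials", []):
            facets[material] += 1
    return cells, facets


# Invented sample points; one has no coordinates.
samples = [
    {"lat": 37.8, "lon": -122.3, "materials": ["rock"]},
    {"lat": 37.2, "lon": -122.9, "materials": ["rock", "mineral"]},
    {"lat": None, "lon": None, "materials": ["organic"]},
]
cells, facets = precompute(samples)

# Serialize for a static file a UI could fetch; tuple keys become strings.
summary = {
    "cells": {f"{r},{c}": n for (r, c), n in cells.items()},
    "facets": dict(facets),
}
print(json.dumps(summary, sort_keys=True))
```

A square grid distorts cell area toward the poles, which is one reason a hexagonal scheme such as H3 is the stated target; the aggregation pattern itself is the same.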