# iSamples Parquet Schema Comparison

**Goal**: Understand the tradeoffs among five parquet formats for iSamples data.

| Format | Philosophy | Sources | Relationships |
|--------|-----------|---------|---------------|
| **Export** | Sample-centric (flat) | All 4 sources | Nested STRUCTs |
| **Zenodo Narrow** | Graph (nodes + edges) | All 4 sources | Separate `_edge_` rows |
| **Zenodo Wide** | Entity-centric | All 4 sources | `p__*` arrays → row_ids |
| **Eric's Narrow** | Graph (nodes + edges) | OpenContext only | Separate `_edge_` rows |
| **Eric's Wide** | Entity-centric | OpenContext only | `p__*` arrays → row_ids |

**Key insight**: There is no universal best format. Each optimizes for different query patterns.
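
To make the "Relationships" column concrete, here is a toy sketch in plain Python of one sample with one material category in each of the three layouts. The field names `has_material_category`, `p__has_material_category`, `s`/`p`/`o`, `row_id`, `otype`, and `label` are the ones the queries in this notebook use. The values, the row_ids, and the `pid` key are invented for illustration, and the real files carry many more columns.

```python
# Toy data only: row_ids, identifiers, labels and the "pid" key are invented.

# Export: one row per sample, related entities nested inside the row.
export_rows = [
    {"sample_identifier": "S1",
     "has_material_category": [{"identifier": "material/rock"}]},
]

# Narrow: every entity is a row, and every relationship is its own _edge_ row.
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "S1"},
    {"row_id": 2, "otype": "IdentifiedConcept", "label": "material/rock"},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "has_material_category", "o": [2]},
]

# Wide: entity rows only; relationships are p__* arrays of target row_ids.
wide_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "S1",
     "p__has_material_category": [2]},
    {"row_id": 2, "otype": "IdentifiedConcept", "label": "material/rock"},
]


def materials_export(rows):
    """Read the category straight out of the nested list."""
    return [m["identifier"] for r in rows for m in r["has_material_category"]]


def materials_narrow(rows):
    """Follow _edge_ rows, then look up each target row by row_id."""
    by_id = {r["row_id"]: r for r in rows}
    return [by_id[target]["label"]
            for r in rows
            if r["otype"] == "_edge_" and r["p"] == "has_material_category"
            for target in r["o"]]


def materials_wide(rows):
    """Read the p__* array on the sample row, then look up each row_id."""
    by_id = {r["row_id"]: r for r in rows}
    return [by_id[target]["label"]
            for r in rows
            for target in r.get("p__has_material_category", [])]


print(materials_export(export_rows))
print(materials_narrow(narrow_rows))
print(materials_wide(wide_rows))
```

All three functions reach the same category, but the export reads it in place while the two graph layouts need a lookup by `row_id`. That lookup is the JOIN that appears in the facet queries.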

---

## Portability

This notebook works in multiple environments:

| Environment | Behavior |
|-------------|----------|
| **Raymond's laptop** | Uses local files in `~/Data/iSample/` |
| **mybinder.org** (or any machine with `/tmp` but no `~/Data/iSample/`) | Downloads to `/tmp/pqgfiles/` cache |
| **Other users** (`~/Data/iSample/` exists, or no `/tmp`) | Downloads to `~/Data/iSample/pqg_cache/` |

**Configuration options** (in cell 2):
- `CACHE_DIR`: Override with `ISAMPLES_CACHE_DIR` env var
- `USE_REMOTE=True`: For files not found locally, skip downloads and query remote parquet via HTTP (slower but no disk); local files still take precedence
- `DOWNLOAD_MISSING=False`: Error instead of downloading missing files
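
The cache location follows from a single expression in the configuration cell. The sketch below restates that expression as a function with its inputs made explicit, so the three outcomes are easier to see. `choose_cache_dir` is a hypothetical helper name, not part of the notebook.

```python
import os
from pathlib import Path


def choose_cache_dir(env, home, tmp=Path("/tmp")):
    """Restate the notebook's CACHE_DIR expression with explicit inputs.

    env:  mapping of environment variables (e.g. os.environ)
    home: the user's home directory
    tmp:  the temp directory to probe (the notebook hard-codes /tmp)
    """
    override = env.get("ISAMPLES_CACHE_DIR")
    if override is not None:
        return Path(override)                 # env var always wins
    if tmp.exists() and not (home / "Data/iSample").exists():
        return tmp / "pqgfiles"               # e.g. mybinder.org
    return home / "Data/iSample/pqg_cache"    # local data directory present


print(choose_cache_dir(os.environ, Path.home()))
```

To force a location, set `ISAMPLES_CACHE_DIR` in the environment before the configuration cell runs.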

---

## Data Source Coverage

| Format | Sources | Description |
|--------|---------|-------------|
| **Export, Zenodo Narrow, Zenodo Wide** | SESAR, OpenContext, GEOME, Smithsonian | Full iSamples (~6.7M samples) |
| **Eric's Narrow, Eric's Wide** | OpenContext only | Subset (~1.1M samples) |

This allows fair comparisons:
- **Apples-to-apples**: Export vs Zenodo Narrow vs Zenodo Wide (same data)
- **Structure comparison**: Eric's Narrow vs Eric's Wide (same data, different structure)

## 1. Setup & Load Data

In [1]:
import duckdb
import pandas as pd
import time
import os
import urllib.request
from pathlib import Path

# =============================================================================
# CONFIGURATION - Edit these paths for your environment
# =============================================================================

# Cache directory for downloaded files (used when local paths don't exist)
# - ISAMPLES_CACHE_DIR env var, if set, always wins
# - Otherwise /tmp/pqgfiles when /tmp exists and ~/Data/iSample does not (e.g. mybinder.org)
# - Otherwise ~/Data/iSample/pqg_cache
CACHE_DIR = Path(os.environ.get('ISAMPLES_CACHE_DIR', 
                                '/tmp/pqgfiles' if Path('/tmp').exists() and not Path.home().joinpath('Data/iSample').exists()
                                else Path.home() / 'Data/iSample/pqg_cache'))

# Local paths (Raymond's setup) - these are checked first
# Updated 2026-01-09: zenodo_wide now points to January 9 conversion
# which fixes issue #8 ([null] array bug in p__* columns)
LOCAL_PATHS = {
    'export': Path.home() / 'Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'zenodo_narrow': Path.home() / 'Data/iSample/pqg_refining/zenodo_narrow_2025-12-12.parquet',
    'zenodo_wide': Path.home() / 'Data/iSample/pqg_refining/zenodo_wide_2026-01-09.parquet',
    'eric_narrow': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg.parquet',
    'eric_wide': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet',
}

# Remote URLs - fallback when local files don't exist
# Updated 2026-01-09: R2 bucket contains January 9 wide conversion (fixes issue #8)
URLS = {
    'export': 'https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'zenodo_narrow': 'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_narrow.parquet',
    'zenodo_wide': 'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet',
    'eric_narrow': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet',
    'eric_wide': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg_wide.parquet',
}

# =============================================================================
# PATH RESOLUTION - Automatically finds or downloads files
# =============================================================================

def resolve_path(name: str, local_paths: dict, urls: dict, cache_dir: Path,
                 download: bool = True, use_remote: bool = False) -> "Path | str":
    """
    Resolve file path: check local first, then cache, optionally download.
    
    Args:
        name: File identifier (e.g., 'export', 'zenodo_wide')
        local_paths: Dict of local file paths to check first
        urls: Dict of remote URLs for downloading
        cache_dir: Directory for cached downloads
        download: If True, download missing files to cache
        use_remote: If True, return URL for DuckDB remote access (no download)
    
    Returns:
        Path to local file, or URL string if use_remote=True
    """
    # Option 1: Local file exists
    if name in local_paths and local_paths[name].exists():
        return local_paths[name]
    
    # Option 2: Return URL for remote access (DuckDB can read directly)
    if use_remote and name in urls:
        return urls[name]
    
    # Option 3: Check cache
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_file = cache_dir / f"{name}.parquet"
    
    if cached_file.exists():
        return cached_file
    
    # Option 4: Download to cache
    if download and name in urls:
        url = urls[name]
        print(f"Downloading {name} from {url}...")
        print(f"  -> {cached_file}")
        
        # Download with progress
        def progress_hook(block_num, block_size, total_size):
            downloaded = block_num * block_size
            if total_size > 0:
                pct = min(100, downloaded * 100 // total_size)
                mb = downloaded / 1e6
                total_mb = total_size / 1e6
                print(f"\r  Progress: {pct}% ({mb:.1f}/{total_mb:.1f} MB)", end='', flush=True)
        
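        # Download to a temporary name first, so an interrupted transfer
        # is not mistaken for a complete cached file on the next run
        tmp_file = cached_file.with_name(cached_file.name + '.part')
        urllib.request.urlretrieve(url, tmp_file, reporthook=progress_hook)
        tmp_file.replace(cached_file)
        print()  # newline after progress
        return cached_file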
    
    # No local file, no cached copy, and no download possible
    raise FileNotFoundError(
        f"File '{name}' not found locally or in cache "
        f"(download={download}, URL known: {name in urls})")

# =============================================================================
# RESOLVE ALL PATHS
# =============================================================================

# Set to True to skip downloads and use DuckDB's remote parquet reading
# (Slower queries but no disk usage - good for quick exploration)
USE_REMOTE = False

# Set to False to skip downloading missing files (will error if not found)
DOWNLOAD_MISSING = True

print(f"Cache directory: {CACHE_DIR}")
print(f"Use remote: {USE_REMOTE}, Download missing: {DOWNLOAD_MISSING}\n")

PATHS = {}
for name in ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    try:
        path = resolve_path(name, LOCAL_PATHS, URLS, CACHE_DIR, 
                           download=DOWNLOAD_MISSING, use_remote=USE_REMOTE)
        PATHS[name] = path
    except FileNotFoundError as e:
        print(f"⚠️ {name}: {e}")
        PATHS[name] = None

# =============================================================================
# VERIFY FILES
# =============================================================================

def get_file_info(path):
    """Get file info - works for both local paths and URLs."""
    if path is None:
        return '❌', 'Not available'
    if isinstance(path, str) and path.startswith('http'):
        return '🌐', 'Remote URL'
    if Path(path).exists():
        size_mb = Path(path).stat().st_size / 1e6
        return '✅', f'{size_mb:.1f} MB'
    return '❌', 'Not found'

print("=== Full iSamples (all sources) ===")
for name in ['export', 'zenodo_narrow', 'zenodo_wide']:
    status, info = get_file_info(PATHS.get(name))
    source = "local" if PATHS.get(name) and Path(PATHS[name]).exists() and PATHS[name] in LOCAL_PATHS.values() else "cache/remote"
    print(f'{status} {name}: {info} ({source})')

print("\n=== OpenContext only (Eric's) ===")
for name in ['eric_narrow', 'eric_wide']:
    status, info = get_file_info(PATHS.get(name))
    source = "local" if PATHS.get(name) and Path(PATHS[name]).exists() and PATHS[name] in LOCAL_PATHS.values() else "cache/remote"
    print(f'{status} {name}: {info} ({source})')

Cache directory: /Users/raymondyee/Data/iSample/pqg_cache
Use remote: False, Download missing: True

=== Full iSamples (all sources) ===
✅ export: 297.0 MB (local)
✅ zenodo_narrow: 860.1 MB (local)
✅ zenodo_wide: 291.8 MB (local)

=== OpenContext only (Eric's) ===
✅ eric_narrow: 724.5 MB (local)
✅ eric_wide: 288.7 MB (local)
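
The order in which `resolve_path` tries its options matters, in particular that an existing local file wins even when `USE_REMOTE` is `True`. The sketch below restates only that branch order as a pure function with no file or network access. `resolution_step` and its boolean inputs are hypothetical names introduced for illustration.

```python
def resolution_step(local_exists, cached_exists, has_url,
                    download=True, use_remote=False):
    """Return which branch resolve_path would take, in the same order it checks."""
    if local_exists:
        return "local"        # Option 1: local file always wins
    if use_remote and has_url:
        return "remote-url"   # Option 2: hand the URL to DuckDB
    if cached_exists:
        return "cache"        # Option 3: previously downloaded copy
    if download and has_url:
        return "download"     # Option 4: fetch into the cache
    return "error"            # FileNotFoundError in the notebook


for flags in [(True, True, True), (False, True, True), (False, False, True)]:
    print(flags, "->", resolution_step(*flags),
          "| with use_remote:", resolution_step(*flags, use_remote=True))
```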


In [2]:
# Helper functions for timing queries
import statistics

def timed_query(con, sql, name="Query"):
    """Execute query and return (result_df, elapsed_ms)"""
    start = time.perf_counter()  # monotonic, high-resolution clock for benchmarking
    result = con.sql(sql).fetchdf()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name}: {elapsed:.1f}ms, {len(result):,} rows")
    return result, elapsed

def timed_query_multirun(con, sql, name="Query", runs=3):
    """Execute query multiple times and return (result_df, mean_ms, stddev_ms)"""
    times = []
    result = None
    for i in range(runs):
        start = time.perf_counter()
        result = con.sql(sql).fetchdf()
        elapsed = (time.perf_counter() - start) * 1000
        times.append(elapsed)
    
    mean_ms = statistics.mean(times)
    stddev_ms = statistics.stdev(times) if len(times) > 1 else 0
    print(f"{name}: {mean_ms:.1f}ms ± {stddev_ms:.1f}ms (n={runs}), {len(result):,} rows")
    return result, mean_ms, stddev_ms

# Create connection
con = duckdb.connect()
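
Single-run timings can be skewed by first-access effects such as the operating system's file cache. One common mitigation, sketched here as an assumption rather than as what this notebook does, is to discard a warm-up run and report the median. `benchmark` is a hypothetical helper that times any zero-argument callable.

```python
import statistics
import time


def benchmark(fn, runs=5, warmup=1):
    """Time fn() in milliseconds: discard warm-up runs, report median and range."""
    for _ in range(warmup):
        fn()  # result and timing deliberately ignored
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
        "runs": runs,
    }


# Stand-in workload so the sketch runs without any data files.
stats = benchmark(lambda: sum(range(100_000)))
print(f"median {stats['median_ms']:.2f}ms "
      f"(min {stats['min_ms']:.2f}, max {stats['max_ms']:.2f}, n={stats['runs']})")
```

With the connection above, a query could be timed the same way by passing `lambda: con.sql(sql).fetchdf()` as `fn`.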

## 2. Schema Inspection

Understanding what columns exist and their types.

In [3]:
# Helper to check if path is available (works for Path objects and URL strings)
def path_available(path):
    """Check if a path is available (local file exists or is a URL)."""
    if path is None:
        return False
    if isinstance(path, str) and path.startswith('http'):
        return True  # URLs are assumed available
    return Path(path).exists()

# Get schema for each format
schemas = {}
for name, path in PATHS.items():
    if path_available(path):
        result = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchdf()
        schemas[name] = result
        print(f"\n=== {name.upper()} ({len(result)} columns) ===")
        # Show just first 15 columns to keep output manageable
        print(result[['column_name', 'column_type']].head(15).to_string())
        if len(result) > 15:
            print(f"  ... and {len(result) - 15} more columns")
    else:
        print(f"\n=== {name.upper()} ===")
        print(f"  ⚠️ Not available")


=== EXPORT (19 columns) ===
                column_name column_type
0         sample_identifier     VARCHAR
1                       @id

In [4]:
# Compare column counts and key structural differences (computed from schemas)
def check_schema_features(schema_df):
    """Analyze schema DataFrame for structural features."""
    if schema_df is None or len(schema_df) == 0:
        return {'columns': 0, 'has_edge_cols': False, 'has_p__cols': False, 
                'has_nested_structs': False, 'has_otype': False}
    
    cols = set(schema_df['column_name'].tolist())
    types = dict(zip(schema_df['column_name'], schema_df['column_type']))
    
    return {
        'columns': len(schema_df),
        'has_edge_cols': all(c in cols for c in ['s', 'p', 'o']),
        'has_p__cols': any(c.startswith('p__') for c in cols),
        'has_nested_structs': any('STRUCT' in str(t) for t in types.values()),
        'has_otype': 'otype' in cols,
    }

# Compute features for each format
format_order = ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']
features = {name: check_schema_features(schemas.get(name)) for name in format_order}

# Build comparison table
comparison = pd.DataFrame([
    {
        'Format': name.replace('_', ' ').title(),
        'Data': 'Full' if name in ['export', 'zenodo_narrow', 'zenodo_wide'] else 'OC only',
        'Columns': features[name]['columns'],
        'Edge cols (s,p,o)': '✓' if features[name]['has_edge_cols'] else '',
        'p__* cols': '✓' if features[name]['has_p__cols'] else '',
        'Nested STRUCTs': '✓' if features[name]['has_nested_structs'] else '',
        'otype col': '✓' if features[name]['has_otype'] else '',
    }
    for name in format_order
])
comparison

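| | Format | Data | Columns | Edge cols (s,p,o) | `p__*` cols | Nested STRUCTs | otype col |
|---|--------|------|---------|-------------------|-------------|----------------|-----------|
| 0 | Export | Full | 19 | | | ✓ | |
| 1 | Zenodo Narrow | Full | 40 | ✓ | | | ✓ |
| 2 | Zenodo Wide | Full | 49 | | ✓ | | ✓ |
| 3 | Eric Narrow | OC only | 40 | ✓ | | | ✓ |
| 4 | Eric Wide | OC only | 47 | | ✓ | | ✓ |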


## 3. Row Count Analysis

Understanding what's IN each format.

In [5]:
# Total row counts
row_counts = {}
print("=== Full iSamples ===")
for name in ['export', 'zenodo_narrow', 'zenodo_wide']:
    path = PATHS.get(name)
    if path_available(path):
        count = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
        row_counts[name] = count
        print(f"{name}: {count:,} rows")
    else:
        print(f"{name}: ⚠️ Not available")

print("\n=== OpenContext only ===")
for name in ['eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        count = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
        row_counts[name] = count
        print(f"{name}: {count:,} rows")
    else:
        print(f"{name}: ⚠️ Not available")

=== Full iSamples ===
export: 6,680,932 rows
zenodo_narrow: 101,387,180 rows
zenodo_wide: 20,729,358 rows

=== OpenContext only ===
eric_narrow: 11,637,144 rows
eric_wide: 2,464,690 rows


In [6]:
# For PQG formats: breakdown by otype
for name in ['zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        print(f"=== {name.upper()}: Rows by otype ===")
        result = con.sql(f"""
            SELECT otype, COUNT(*) as cnt 
            FROM read_parquet('{path}')
            GROUP BY otype ORDER BY cnt DESC
        """).fetchdf()
        print(result.to_string())
        print()

=== ZENODO_NARROW: Rows by otype ===
                     otype       cnt
0                   _edge_  80657822
1     MaterialSampleRecord   6680932
2            SamplingEvent   6354171
3  GeospatialCoordLocation   5980282
4   MaterialSampleCuration    720254
5           SampleRelation    501579
6             SamplingSite    386160
7        IdentifiedConcept     55893
8                    Agent     50087

=== ZENODO_WIDE: Rows by otype ===
                     otype      cnt
0     MaterialSampleRecord  6680932
1            SamplingEvent  6354171
2  GeospatialCoordLocation  5980282
3   MaterialSampleCuration   720254
4           SampleRelation   501579
5             SamplingSite   386160
6        IdentifiedConcept    55893
7                    Agent    50087

=== ERIC_NARROW: Rows by otype ===
                     otype      cnt
0                   _edge_  9201451
1     MaterialSampleRecord  1096352
2            SamplingEvent  1096352
3  GeospatialCoordLocation   198433
4        Identifi
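
A quick arithmetic check on the Zenodo counts printed above (nothing here reads the Parquet files). The numbers are consistent with the narrow file holding the same entity rows as the wide file plus one `_edge_` row per relationship.

```python
# Row counts copied from the outputs above.
narrow_total = 101_387_180
wide_total = 20_729_358
narrow_edge_rows = 80_657_822

wide_by_otype = {
    "MaterialSampleRecord": 6_680_932,
    "SamplingEvent": 6_354_171,
    "GeospatialCoordLocation": 5_980_282,
    "MaterialSampleCuration": 720_254,
    "SampleRelation": 501_579,
    "SamplingSite": 386_160,
    "IdentifiedConcept": 55_893,
    "Agent": 50_087,
}

# The per-otype counts add up to the wide file's total row count.
assert sum(wide_by_otype.values()) == wide_total

# Removing the edge rows from the narrow file leaves exactly the wide row count.
assert narrow_total - narrow_edge_rows == wide_total

edge_share = narrow_edge_rows / narrow_total
print(f"Edges are {edge_share:.1%} of Zenodo Narrow rows")
```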

In [7]:
# For Export: breakdown by source_collection
print("=== EXPORT: Rows by source_collection ===")
if path_available(PATHS.get('export')):
    result = con.sql(f"""
        SELECT source_collection, COUNT(*) as cnt 
        FROM read_parquet('{PATHS['export']}')
        GROUP BY source_collection ORDER BY cnt DESC
    """).fetchdf()
    print(result.to_string())
else:
    print("⚠️ Export file not available")

=== EXPORT: Rows by source_collection ===
  source_collection      cnt
0             SESAR  4688386
1       OPENCONTEXT  1064831
2             GEOME   605554
3       SMITHSONIAN   322161


## 4. Query Benchmark Suite

Testing common query patterns across all five formats.

### 4.1 Map Visualization: Get All Coordinates

**Use case**: Render points on a Cesium/Leaflet map

In [8]:
# EXPORT: Direct column access
print("=== EXPORT (full iSamples) ===")
export_coords, export_coords_time = timed_query(con, f"""
    SELECT sample_location_latitude as lat, sample_location_longitude as lon
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_location_latitude IS NOT NULL
""", "All coordinates")

=== EXPORT (full iSamples) ===
All coordinates: 27.1ms, 5,980,282 rows


In [9]:
# WIDE formats: Filter by otype
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_coords, zenodo_wide_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_coords, eric_wide_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['eric_wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

=== ZENODO WIDE (full iSamples) ===
All coordinates: 31.9ms, 5,980,282 rows

=== ERIC WIDE (OpenContext only) ===
All coordinates: 4.5ms, 199,146 rows


In [10]:
# NARROW formats: Filter by otype  
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_coords, zenodo_narrow_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['zenodo_narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_coords, eric_narrow_coords_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['eric_narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

=== ZENODO NARROW (full iSamples) ===
All coordinates: 65.0ms, 5,980,282 rows

=== ERIC NARROW (OpenContext only) ===
All coordinates: 6.6ms, 198,432 rows


In [11]:
# Summary - Map query comparison
print("=== MAP QUERY SUMMARY ===")
print("\nFull iSamples (apples-to-apples comparison):")
print(f"  Export:        {export_coords_time:6.1f}ms ({len(export_coords):,} points)")
print(f"  Zenodo Wide:   {zenodo_wide_coords_time:6.1f}ms ({len(zenodo_wide_coords):,} points)")
print(f"  Zenodo Narrow: {zenodo_narrow_coords_time:6.1f}ms ({len(zenodo_narrow_coords):,} points)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_coords_time:6.1f}ms ({len(eric_wide_coords):,} points)")
print(f"  Eric Narrow:   {eric_narrow_coords_time:6.1f}ms ({len(eric_narrow_coords):,} points)")

print("\n💡 Key insight: Export returns coords directly; PQG formats need otype filter")

=== MAP QUERY SUMMARY ===

Full iSamples (apples-to-apples comparison):
  Export:          27.1ms (5,980,282 points)
  Zenodo Wide:     31.9ms (5,980,282 points)
  Zenodo Narrow:   65.0ms (5,980,282 points)

OpenContext only (Eric's files):
  Eric Wide:        4.5ms (199,146 points)
  Eric Narrow:      6.6ms (198,432 points)

💡 Key insight: Export returns coords directly; PQG formats need otype filter
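
To put the full-iSamples timings in proportion, the sketch below expresses each one relative to Export and sets it beside the file's total row count from the row count analysis. The timings are the single-run values printed above, so the ratios are indicative only.

```python
# Timings (ms) and row counts copied from the outputs in this notebook.
timings_ms = {"Export": 27.1, "Zenodo Wide": 31.9, "Zenodo Narrow": 65.0}
rows_in_file = {"Export": 6_680_932,
                "Zenodo Wide": 20_729_358,
                "Zenodo Narrow": 101_387_180}

baseline_ms = timings_ms["Export"]
baseline_rows = rows_in_file["Export"]

# Slowdown and file-size ratio, both relative to Export.
relative = {
    name: (timings_ms[name] / baseline_ms, rows_in_file[name] / baseline_rows)
    for name in timings_ms
}

for name, (time_ratio, row_ratio) in relative.items():
    print(f"{name:14s} {time_ratio:5.2f}x the time, {row_ratio:5.1f}x the rows")
```

In this run the slowdown grows far more slowly than the row count. One plausible explanation, not verified here, is that the Parquet reader can skip much of the data that does not match the `otype` filter.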


### 4.2 Faceted Search: Count by Material Category

**Use case**: Show facet counts in a search UI

In [12]:
# EXPORT: Unnest nested struct array
# SQL Complexity: 1 subquery, 0 JOINs - simple unnest
print("=== EXPORT (full iSamples) ===")
export_facets, export_facets_time = timed_query(con, f"""
    SELECT 
        mat.identifier as material,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(has_material_category) as mat
        FROM read_parquet('{PATHS['export']}')
        WHERE has_material_category IS NOT NULL AND len(has_material_category) > 0
    )
    GROUP BY mat.identifier
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(export_facets.to_string())

=== EXPORT (full iSamples) ===
Material facets: 530.0ms, 10 rows
                                                                      material      cnt
0               https://w3id.org/isample/vocabulary/material/1.0/earthmaterial  2261513
1             https://w3id.org/isample/vocabulary/material/1.0/organicmaterial  1265560
2                        https://w3id.org/isample/vocabulary/material/1.0/rock  1208585
3  https://w3id.org/isample/vocabulary/material/1.0/biogenicnonorganicmaterial  1091781
4       https://w3id.org/isample/vocabulary/material/1.0/mixedsoilsedimentrock   838805
5                    https://w3id.org/isample/vocabulary/material/1.0/material   673018
6                     https://w3id.org/isample/vocabulary/material/1.0/mineral   390797
7          https://w3id.org/isample/vocabulary/material/1.0/anthropogenicmetal   270040
8                https://w3id.org/isample/opencontext/material/0.1/ceramicclay   100573
9                    https://w3id.org/isample/vocabular

In [13]:
# WIDE formats: JOIN via p__has_material_category
# SQL Complexity: 2 CTEs, 1 JOIN - requires row_id lookup
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_facets, zenodo_wide_facets_time = timed_query(con, f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(zenodo_wide_facets.to_string())

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_facets, eric_wide_facets_time = timed_query(con, f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['eric_wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['eric_wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(eric_wide_facets.to_string())

=== ZENODO WIDE (full iSamples) ===
Material facets: 527.1ms, 10 rows
                                                                      material      cnt
0               https://w3id.org/isample/vocabulary/material/1.0/earthmaterial  2261513
1             https://w3id.org/isample/vocabulary/material/1.0/organicmaterial  1265560
2                        https://w3id.org/isample/vocabulary/material/1.0/rock  1208585
3  https://w3id.org/isample/vocabulary/material/1.0/biogenicnonorganicmaterial  1091781
4       https://w3id.org/isample/vocabulary/material/1.0/mixedsoilsedimentrock   838805
5                    https://w3id.org/isample/vocabulary/material/1.0/material   673018
6                     https://w3id.org/isample/vocabulary/material/1.0/mineral   390797
7          https://w3id.org/isample/vocabulary/material/1.0/anthropogenicmetal   270040
8                https://w3id.org/isample/opencontext/material/0.1/ceramicclay   100573
9                    https://w3id.org/isample/voca

In [14]:
# NARROW formats: Follow edges with predicate='has_material_category'
# SQL Complexity: 2 CTEs, 1 JOIN - requires edge traversal
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_facets, zenodo_narrow_facets_time = timed_query(con, f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['zenodo_narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['zenodo_narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(zenodo_narrow_facets.to_string())

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_facets, eric_narrow_facets_time = timed_query(con, f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['eric_narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['eric_narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(eric_narrow_facets.to_string())

=== ZENODO NARROW (full iSamples) ===
Material facets: 743.0ms, 10 rows
                                                                      material      cnt
0               https://w3id.org/isample/vocabulary/material/1.0/earthmaterial  2261513
1             https://w3id.org/isample/vocabulary/material/1.0/organicmaterial  1265560
2                        https://w3id.org/isample/vocabulary/material/1.0/rock  1208585
3  https://w3id.org/isample/vocabulary/material/1.0/biogenicnonorganicmaterial  1091781
4       https://w3id.org/isample/vocabulary/material/1.0/mixedsoilsedimentrock   838805
5                    https://w3id.org/isample/vocabulary/material/1.0/material   673018
6                     https://w3id.org/isample/vocabulary/material/1.0/mineral   390797
7          https://w3id.org/isample/vocabulary/material/1.0/anthropogenicmetal   270040
8                https://w3id.org/isample/opencontext/material/0.1/ceramicclay   100573
9                    https://w3id.org/isample/vo

In [15]:
# Facet query summary
print("=== FACET QUERY SUMMARY ===")
print("\nFull iSamples (apples-to-apples):")
print(f"  Export:        {export_facets_time:6.1f}ms (SQL: 1 subquery, 0 JOINs)")
print(f"  Zenodo Wide:   {zenodo_wide_facets_time:6.1f}ms (SQL: 2 CTEs, 1 JOIN)")
print(f"  Zenodo Narrow: {zenodo_narrow_facets_time:6.1f}ms (SQL: 2 CTEs, 1 JOIN)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_facets_time:6.1f}ms")
print(f"  Eric Narrow:   {eric_narrow_facets_time:6.1f}ms")

print("\n💡 Key insight: Export is simplest (no JOINs), but PQG returns human-readable labels")

=== FACET QUERY SUMMARY ===

Full iSamples (apples-to-apples):
  Export:         530.0ms (SQL: 1 subquery, 0 JOINs)
  Zenodo Wide:    527.1ms (SQL: 2 CTEs, 1 JOIN)
  Zenodo Narrow:  743.0ms (SQL: 2 CTEs, 1 JOIN)

OpenContext only (Eric's files):
  Eric Wide:      102.2ms
  Eric Narrow:    109.9ms

💡 Key insight: Export is simplest (no JOINs), but PQG returns human-readable labels


### 4.3 Entity Listing: Get All Unique Agents

**Use case**: Populate a dropdown, show "who collected samples"

**Key tradeoff**: Export cannot do this efficiently!

In [16]:
# WIDE formats: Direct query on Agent rows
# SQL Complexity: 0 CTEs, 0 JOINs - simple otype filter
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_agents, zenodo_wide_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(zenodo_wide_agents.to_string())

print("\n=== ERIC WIDE (OpenContext only) ===")
eric_wide_agents, eric_wide_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['eric_wide']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(eric_wide_agents.to_string())

=== ZENODO WIDE (full iSamples) ===
All agents: 20.3ms, 10 rows
                       name        role  cnt
0                      KUIT     curator    1
1               Haddock Lab     curator    1
2                      NMNZ     curator    1
3                    VNM-ZI     curator    1
4  Allegra Hosford Scheirer  registrant    1
5             Andra Bobbitt  registrant    1
6              Claude Payri  registrant    1
7                      UWBM     curator    1
8                      NSMT     curator    1
9                        AC     curator    1

=== ERIC WIDE (OpenContext only) ===
All agents: 3.9ms, 10 rows
                            name                                                                                                                                 role  cnt
0                Arianne Boileau                                                                        Participated in: Household Zooarchaeology of Colonial Lamanai    2
1                             LJ  

In [17]:
# NARROW formats: Same approach - otype filter
# SQL Complexity: 0 CTEs, 0 JOINs - simple otype filter
print("=== ZENODO NARROW (full iSamples) ===")
zenodo_narrow_agents, zenodo_narrow_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['zenodo_narrow']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(zenodo_narrow_agents.to_string())

print("\n=== ERIC NARROW (OpenContext only) ===")
eric_narrow_agents, eric_narrow_agents_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['eric_narrow']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(eric_narrow_agents.to_string())

=== ZENODO NARROW (full iSamples) ===
All agents: 48.2ms, 10 rows
                      name        role  cnt
0           Christine Chan  registrant    1
1       Alexandra Belinsky  registrant    1
2            Nicolas Perez  registrant    1
3              Jeffrey Alt  registrant    1
4           Susan Brantley  registrant    1
5         Sebastian Zapata  registrant    1
6            Brady Foreman  registrant    1
7            Takeshi Hanyu  registrant    1
8  Sarah Penniston-Dorland  registrant    1
9         Katherine Kelley  registrant    1

=== ERIC NARROW (OpenContext only) ===
All agents: 6.5ms, 10 rows
                name                                                                            role  cnt
0    Arianne Boileau                   Participated in: Household Zooarchaeology of Colonial Lamanai    2
1        Peter Grave                                           Participated in: Asian Stoneware Jars    1
2       Levent Atici                                Participated 

In [18]:
# EXPORT: Must scan all samples and extract from nested structs
# SQL Complexity: 1 subquery, 0 JOINs - but FULL TABLE SCAN required
# This is MUCH slower because agents are embedded in every sample row
print("=== EXPORT (full iSamples) ===")
export_agents, export_agents_time = timed_query(con, f"""
    SELECT 
        resp.name as name,
        resp.role as role,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(produced_by.responsibility) as resp
        FROM read_parquet('{PATHS['export']}')
        WHERE produced_by IS NOT NULL 
          AND produced_by.responsibility IS NOT NULL
    )
    GROUP BY resp.name, resp.role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents (from nested)")
print(export_agents.to_string())

=== EXPORT (full iSamples) ===


All agents (from nested): 3082.6ms, 10 rows
                                                             name     role      cnt
0                                              Curator,,Collector     None  3516917
1  Curator Integrated Ocean Drilling Program (TAMU),,Sample Owner     None  3516905
2                                       Adam Mansur,,Sample Owner     None   383835
3                                    Edward Gilbert,,Sample Owner     None   258790
4                                                     Emma Loftus  creator   161623
5                                                 Robert L. Kelly  creator   161623
6                                                     Lux Miranda  creator   161623
7                                                 Eugenia M. Gayo  creator   161623
8                                              Judson Byrd Finley  creator   161623
9                                            Jade d’Alpoim Guedes  creator   161623


In [19]:
# Agent listing summary
print("=== ENTITY LISTING SUMMARY ===")
print("\nFull iSamples (apples-to-apples):")
print(f"  Zenodo Wide:   {zenodo_wide_agents_time:6.1f}ms (SQL: 0 JOINs, otype filter)")
print(f"  Zenodo Narrow: {zenodo_narrow_agents_time:6.1f}ms (SQL: 0 JOINs, otype filter)")
print(f"  Export:        {export_agents_time:6.1f}ms (SQL: 0 JOINs, FULL SCAN)")

print("\nOpenContext only (Eric's files):")
print(f"  Eric Wide:     {eric_wide_agents_time:6.1f}ms")
print(f"  Eric Narrow:   {eric_narrow_agents_time:6.1f}ms")

print("\n⚠️ Export is roughly 60-150x SLOWER for entity listing!")
print("   Reason: Agents are embedded in every sample row, requiring full scan")
print("   PQG: Agents are separate rows, filtered by otype = 'Agent'")

=== ENTITY LISTING SUMMARY ===

Full iSamples (apples-to-apples):
  Zenodo Wide:     20.3ms (SQL: 0 JOINs, otype filter)
  Zenodo Narrow:   48.2ms (SQL: 0 JOINs, otype filter)
  Export:        3082.6ms (SQL: 0 JOINs, FULL SCAN)

OpenContext only (Eric's files):
  Eric Wide:        3.9ms
  Eric Narrow:      6.5ms

⚠️ Export is roughly 60-150x SLOWER for entity listing!
   Reason: Agents are embedded in every sample row, requiring full scan
   PQG: Agents are separate rows, filtered by otype = 'Agent'
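
The slowdown factor can be derived directly from the timings printed above. A minimal sketch, with the recorded values hard-coded for illustration; `timings_ms` and `slowdown` are hypothetical names, not part of the notebook, and the numbers will differ from run to run:

```python
# Timings (ms) copied from the summary output above; they vary per run and machine.
timings_ms = {
    "zenodo_wide": 20.3,
    "zenodo_narrow": 48.2,
    "export": 3082.6,
}

def slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times slower the candidate is than the baseline."""
    return candidate_ms / baseline_ms

for name in ("zenodo_wide", "zenodo_narrow"):
    factor = slowdown(timings_ms[name], timings_ms["export"])
    print(f"Export vs {name}: {factor:.0f}x slower")
```

Computing the ratio from the measured variables, rather than quoting a fixed range, keeps the summary honest when the notebook is re-run on different hardware.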


### 4.4 Reverse Lookup: Samples by Agent

**Use case**: "Show me all samples collected by Agent X"

In [20]:
# First, pick an agent name that exists in all formats
# Using a common agent from the data
AGENT_NAME = 'Vance Vredenburg'  # Adjust based on your data

print(f"Looking for samples by: {AGENT_NAME}")

Looking for samples by: Vance Vredenburg
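
The queries in this section interpolate `AGENT_NAME` straight into the SQL text, which breaks as soon as a name contains a straight apostrophe. A small sketch of one way to guard against that, assuming standard SQL quoting rules; `sql_literal`, the example name, and the `export.parquet` path are hypothetical and not part of the notebook:

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"

agent_name = "Patrick O'Brien"  # hypothetical name containing an apostrophe

# Same shape as the Export reverse-lookup query, with a placeholder file path.
query = f"""
    SELECT sample_identifier, label
    FROM read_parquet('export.parquet')
    WHERE list_contains(
        [r.name FOR r IN produced_by.responsibility],
        {sql_literal(agent_name)}
    )
    LIMIT 10
"""
print(query)
```

DuckDB's Python API also accepts `?` placeholders with a parameter list (`con.execute(sql, [value])`), which avoids manual quoting altogether; the helper above is only for cases where the SQL has to be built as a string.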


In [21]:
# EXPORT: Filter on nested struct
print("=== EXPORT ===")
export_by_agent, export_time = timed_query(con, f"""
    SELECT sample_identifier, label
    FROM read_parquet('{PATHS['export']}')
    WHERE list_contains(
        [r.name FOR r IN produced_by.responsibility],
        '{AGENT_NAME}'
    )
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(export_by_agent.to_string())

=== EXPORT ===
Samples by Vance Vredenburg: 198.2ms, 10 rows
     sample_identifier label
0   ark:/21547/DSz2757   757
1   ark:/21547/DSz2779   779
2   ark:/21547/DSz2806   806
3   ark:/21547/DSz2807   807
4   ark:/21547/DSz2759   759
5   ark:/21547/DSz2761   761
6   ark:/21547/DSz2967   967
7   ark:/21547/DSz2763   763
8   ark:/21547/DSz2979   979
9  ark:/21547/DSz21792  1792


In [22]:
# WIDE: Find agent row_id, then find samples with that row_id in p__responsibility
# Note: Agent may not exist in Eric's OC-only data, so use Zenodo Wide for full coverage
print("=== ZENODO WIDE (full iSamples) ===")
zenodo_wide_by_agent, zenodo_wide_by_agent_time = timed_query(con, f"""
    WITH agent AS (
        SELECT row_id 
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'Agent' AND name = '{AGENT_NAME}'
        LIMIT 1
    ),
    events AS (
        SELECT w.row_id as event_id
        FROM read_parquet('{PATHS['zenodo_wide']}') w, agent
        WHERE w.otype = 'SamplingEvent' 
          AND list_contains(w.p__responsibility, agent.row_id)
    )
    SELECT s.sample_identifier, s.label
    FROM read_parquet('{PATHS['zenodo_wide']}') s, events
    WHERE s.otype = 'MaterialSampleRecord'
      AND list_contains(s.p__produced_by, events.event_id)
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(zenodo_wide_by_agent.to_string())

=== ZENODO WIDE (full iSamples) ===


Samples by Vance Vredenburg: 6105.2ms, 10 rows
     sample_identifier label
0   ark:/21547/DSz2757   757
1   ark:/21547/DSz2779   779
2   ark:/21547/DSz2806   806
3   ark:/21547/DSz2807   807
4   ark:/21547/DSz2759   759
5   ark:/21547/DSz2761   761
6   ark:/21547/DSz2967   967
7   ark:/21547/DSz2763   763
8   ark:/21547/DSz2979   979
9  ark:/21547/DSz21792  1792


In [23]:
# Summary
print("\n=== REVERSE LOOKUP SUMMARY ===")
print(f"Export:      {export_time:.1f}ms ({len(export_by_agent)} rows)")
print(f"Zenodo Wide: {zenodo_wide_by_agent_time:.1f}ms ({len(zenodo_wide_by_agent)} rows)")
print("\nNote: Export's nested list_contains is efficient for this pattern")


=== REVERSE LOOKUP SUMMARY ===
Export:      198.2ms (10 rows)
Zenodo Wide: 6105.2ms (10 rows)

Note: Export's nested list_contains is efficient for this pattern


### 4.5 Sample Detail: Get Full Info for One Sample

**Use case**: User clicks on a sample, show all details

In [24]:
# Pick a sample identifier
SAMPLE_ID = con.sql(f"""
    SELECT sample_identifier FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier IS NOT NULL LIMIT 1
""").fetchone()[0]
print(f"Sample: {SAMPLE_ID}")

Sample: ark:/21547/DSz2757


In [25]:
# EXPORT: Everything on one row
print("=== EXPORT ===")
start = time.time()
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
export_time = (time.time() - start) * 1000
print(f"Time: {export_time:.1f}ms")
print(f"Columns returned: {len(result.columns)}")
print(result.T)  # Transpose for readability

=== EXPORT ===
Time: 56.6ms
Columns returned: 19
                                                                           0
sample_identifier                                         ark:/21547/DSz2757
@id                                                   metadata/21547/DSz2757
label                                                                    757
description                                 basisOfRecord: PreservedSpecimen
source_collection                                                      GEOME
has_sample_object_type     [{'identifier': 'https://w3id.org/isample/voca...
has_material_category      [{'identifier': 'https://w3id.org/isample/voca...
has_context_category       [{'identifier': 'https://w3id.org/isample/biol...
informal_classification                                 [Taricha, granulosa]
keywords                     [{'keyword': 'California'}, {'keyword': 'USA'}]
produced_by                {'description': 'expeditionCode: newts | proje...
last_modified_time         

In [26]:
# ZENODO WIDE: Need to JOIN related entities
print("=== ZENODO WIDE ===")
start = time.time()
# This is more complex - would need multiple JOINs to get full picture
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['zenodo_wide']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
zenodo_wide_detail_time = (time.time() - start) * 1000
print(f"Time: {zenodo_wide_detail_time:.1f}ms")
print(f"Rows returned: {len(result)}")
if len(result) > 0:
    print(f"Columns returned: {len(result.columns)}")
    print("Note: This only returns the sample row, not related entities")
    print(result[['sample_identifier', 'label']].T)
else:
    print("Note: Sample not found (may be from GEOME source, not in this sample_identifier format)")

=== ZENODO WIDE ===
Time: 29.5ms
Rows returned: 1
Columns returned: 49
Note: This only returns the sample row, not related entities
                                    0
sample_identifier  ark:/21547/DSz2757
label                             757


## 5. Storage Comparison

In [27]:
# File sizes and efficiency
def get_file_size_mb(path):
    """Get file size - returns None for URLs (size unknown without HEAD request)."""
    if path is None:
        return None
    if isinstance(path, str) and path.startswith('http'):
        return None  # Can't easily get URL size
    p = Path(path)
    if p.exists():
        return p.stat().st_size / 1e6
    return None

storage = []
for name in ['export', 'zenodo_narrow', 'zenodo_wide', 'eric_narrow', 'eric_wide']:
    path = PATHS.get(name)
    if path_available(path):
        size_mb = get_file_size_mb(path)
        rows = row_counts.get(name, 0)
        cols = len(schemas.get(name, []))
        bytes_per_row = (size_mb * 1e6) / rows if (size_mb and rows > 0) else None
        data_scope = 'Full' if name in ['export', 'zenodo_narrow', 'zenodo_wide'] else 'OC only'
        is_remote = isinstance(path, str) and path.startswith('http')
        storage.append({
            'Format': name.replace('_', ' ').title(),
            'Data': data_scope,
            'Size (MB)': f'{size_mb:.1f}' if size_mb else 'Remote',
            'Rows': f'{rows:,}',
            'Columns': cols,
            'Bytes/Row': f'{bytes_per_row:.1f}' if bytes_per_row else 'N/A',
        })

pd.DataFrame(storage)

          Format     Data Size (MB)       Rows  Columns Bytes/Row
0         Export     Full     297.0    6680932       19      44.5
1  Zenodo Narrow     Full     860.1  101387180       40       8.5
2    Zenodo Wide     Full     291.8   20729358       49      14.1
3    Eric Narrow  OC only     724.5   11637144       40      62.3
4      Eric Wide  OC only     288.7    2464690       47     117.1
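
The Bytes/Row column is simply file size divided by row count. A short sketch that recomputes it from the rounded sizes and row counts reported above; `recorded` and `bytes_per_row` are illustrative names, and the inputs are copied from this run rather than read from disk:

```python
# (size in MB, row count) copied from the storage table above.
recorded = {
    "Export": (297.0, 6_680_932),
    "Zenodo Narrow": (860.1, 101_387_180),
    "Zenodo Wide": (291.8, 20_729_358),
    "Eric Narrow": (724.5, 11_637_144),
    "Eric Wide": (288.7, 2_464_690),
}

def bytes_per_row(size_mb: float, rows: int) -> float:
    """Average on-disk bytes per row, using 1 MB = 1e6 bytes as in the cell above."""
    return size_mb * 1e6 / rows

for name, (size_mb, rows) in recorded.items():
    print(f"{name:14s} {bytes_per_row(size_mb, rows):6.1f} bytes/row")
```

Note that a low bytes-per-row figure does not mean a small file: Zenodo Narrow has the smallest rows but by far the most of them, so it is still the largest file.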


## 6. Benchmark Summary

### Benchmark Results Summary

**Data Coverage Verification:**
- ✅ Export, Zenodo Narrow, Zenodo Wide all contain **6,680,932 samples** from all 4 sources
- ✅ Eric's Narrow/Wide contain OpenContext subset (~1.1M samples)

| Query Pattern | Best For | SQL Complexity | Notes |
|--------------|----------|----------------|-------|
| **Map (all coords)** | Export ≈ Zenodo Wide | Simple SELECT | Both ~30ms for 6M points |
| **Facets (material counts)** | Export | 1 subquery vs 2 CTEs + JOIN | Export has URIs, PQG has labels |
| **Entity listing (agents)** | PQG formats | 0 JOINs (otype filter) | Export requires full scan |
| **Reverse lookup by agent** | Export | list_contains() | Only works if agent exists |
| **Sample detail (one row)** | Export | Simple WHERE | All data on single row |

**Key tradeoffs:**
- **Export**: Best for UI (map + facets + detail) but slow for entity listing
- **PQG Wide**: Good balance - entities queryable, reasonable JOIN complexity
- **PQG Narrow**: Most flexible but slower (101M rows including edges)

## 7. Conclusions: When to Use Each Format

### Export Format
**Best for:**
- UI queries (map, search, facets)
- Sample-centric analysis
- When you don't need to query entities independently

**Avoid when:**
- You need to list all agents/sites/concepts
- You need graph traversal flexibility
- You need incremental updates

### Wide Format
**Best for:**
- Entity-centric queries ("all agents", "all sites")
- Analytical dashboards
- When you need both samples AND other entity types

**Avoid when:**
- Pure sample queries (Export is faster)
- Complex multi-hop traversals (Narrow is more natural)

### Narrow Format
**Best for:**
- Archival/preservation (full fidelity)
- Graph algorithms
- Relationship exploration
- When you need to traverse in any direction

**Avoid when:**
- Interactive UI (too slow)
- Simple sample queries (overkill)

## 8. Key Insights

### What Export Gains
1. **No JOINs** - Everything on one row
2. **Pre-extracted coords** - `sample_location_latitude/longitude` at top level
3. **Fewer rows** - 6.7M (Export) vs 20.7M (Zenodo Wide) vs 101M (Zenodo Narrow)

### What Export Loses
1. **Entity independence** - Can't query agents without scanning all samples
2. **Graph flexibility** - Can't traverse in arbitrary directions
3. **Incremental updates** - Must regenerate entire file

### The `list_contains()` Problem
Both Wide (p__* arrays) and Export (nested structs) suffer from O(n) scans when searching within arrays. Neither has index support in DuckDB/Parquet.
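
One way around repeated array scans is to unnest the arrays once into an agent-to-samples lookup and reuse it. A toy sketch in plain Python, with made-up rows standing in for the parquet data; `samples`, `scan_for_agent`, and `build_agent_index` are illustrative names only:

```python
from collections import defaultdict

# Toy stand-in for Export rows: each sample carries a list of agent names.
samples = [
    {"sample_identifier": "s1", "agents": ["Agent A", "Agent B"]},
    {"sample_identifier": "s2", "agents": ["Agent B"]},
    {"sample_identifier": "s3", "agents": []},
]

def scan_for_agent(rows, agent):
    """O(n) per lookup: every row's list is inspected, like list_contains()."""
    return [r["sample_identifier"] for r in rows if agent in r["agents"]]

def build_agent_index(rows):
    """One pass over all rows; later lookups are plain dictionary hits."""
    index = defaultdict(list)
    for r in rows:
        for agent in r["agents"]:
            index[agent].append(r["sample_identifier"])
    return index

agent_index = build_agent_index(samples)
print(scan_for_agent(samples, "Agent B"))
print(agent_index["Agent B"])
```

The trade-off is the usual one: the index costs one full pass and extra storage up front, which only pays off when the same kind of lookup is repeated.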

### Recommendation for Eric's UI
For the iSamples Central UI requirements:
- **Start with Export format** - fastest for map + facets + click-to-detail
- **Pre-compute H3 aggregations** - for initial map render
- **Pre-compute facet counts** - avoid runtime aggregation
- **Keep Wide/Narrow for advanced queries** - entity exploration, graph traversal

## 9. Visualization with Lonboard

Now let's visualize the coordinate data we queried earlier using **Lonboard** - a high-performance WebGL-based mapping library for Jupyter.

**Key considerations for 6M+ points:**
- Use sampling to avoid memory issues
- Color by source collection for insight
- Compare visualization speed across formats

In [28]:
# Import visualization libraries
try:
    from lonboard import Map, ScatterplotLayer
    import geopandas as gpd
    from shapely.geometry import Point
    import numpy as np
    LONBOARD_AVAILABLE = True
    print("✅ Lonboard available for visualization")
except ImportError as e:
    LONBOARD_AVAILABLE = False
    print(f"⚠️ Lonboard not available: {e}")
    print("   Install with: pip install lonboard geopandas")

✅ Lonboard available for visualization


In [29]:
# Visualize a sample of points from EXPORT format (includes source_collection for coloring)
if LONBOARD_AVAILABLE:
    SAMPLE_SIZE = 50000  # Adjust based on your system's memory
    
    print(f"Querying {SAMPLE_SIZE:,} random samples from Export format...")
    start = time.time()
    
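    # Export has source_collection and pre-extracted coords - ideal for visualization.
    # Note: DuckDB applies USING SAMPLE before the WHERE filter, so rows with NULL
    # coordinates are dropped after sampling and fewer than SAMPLE_SIZE points come back.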
    viz_data = con.sql(f"""
        SELECT 
            sample_location_longitude as lon,
            sample_location_latitude as lat,
            source_collection,
            sample_identifier
        FROM read_parquet('{PATHS['export']}')
        WHERE sample_location_latitude IS NOT NULL
        USING SAMPLE {SAMPLE_SIZE}
    """).fetchdf()
    
    query_time = (time.time() - start) * 1000
    print(f"Query time: {query_time:.1f}ms, {len(viz_data):,} points")
    
    # Show distribution by source
    print("\nSample distribution by source:")
    print(viz_data['source_collection'].value_counts().to_string())
else:
    print("Skipping visualization (Lonboard not available)")

Querying 50,000 random samples from Export format...
Query time: 56.5ms, 44,872 points

Sample distribution by source:
source_collection
SESAR          31734
OPENCONTEXT     8593
GEOME           3324
SMITHSONIAN     1221
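
The query asked for 50,000 rows but returned 44,872 because the sample is drawn before the `WHERE` filter runs. A plain-Python sketch of the two orderings, using a made-up table in which 10% of rows lack a latitude (the row counts and seed are arbitrary):

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy table: every tenth row has no latitude.
rows = [{"lat": None if i % 10 == 0 else 1.0} for i in range(1000)]

# Sample first, filter second: mirrors what the query above does.
sample_then_filter = [r for r in random.sample(rows, 100) if r["lat"] is not None]

# Filter first, sample second: always yields the requested number of rows.
with_coords = [r for r in rows if r["lat"] is not None]
filter_then_sample = random.sample(with_coords, 100)

print(len(sample_then_filter), len(filter_then_sample))
```

In DuckDB the filter-first ordering should be achievable by putting the filtered `SELECT` in a subquery and applying `USING SAMPLE` to that subquery, at the cost of scanning all matching rows before sampling.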


In [30]:
# Create the Lonboard visualization with color by source collection
if LONBOARD_AVAILABLE and len(viz_data) > 0:
    from IPython.display import display
    
    # Define colors for each source collection
    SOURCE_COLORS = {
        'SESAR': [255, 99, 71, 200],      # Tomato red
        'OPENCONTEXT': [65, 105, 225, 200], # Royal blue  
        'GEOME': [50, 205, 50, 200],       # Lime green
        'SMITHSONIAN': [255, 215, 0, 200], # Gold
    }
    DEFAULT_COLOR = [128, 128, 128, 200]  # Gray for unknown
    
    # Create geometry from coordinates
    geometry = gpd.points_from_xy(viz_data['lon'], viz_data['lat'])
    gdf = gpd.GeoDataFrame(viz_data, geometry=geometry, crs="EPSG:4326")
    
    # Create color array (RGBA as uint8)
    colors = np.array([
        SOURCE_COLORS.get(src, DEFAULT_COLOR) 
        for src in viz_data['source_collection']
    ], dtype=np.uint8)
    
    # Create ScatterplotLayer
    layer = ScatterplotLayer.from_geopandas(
        gdf,
        get_fill_color=colors,
        get_radius=3000,  # meters
        radius_min_pixels=2,
        radius_max_pixels=10,
        opacity=0.8,
        pickable=True,
    )
    
    # Create map
    m = Map(layer)
    
    print(f"🗺️ Visualizing {len(gdf):,} points colored by source:")
    for src, color in SOURCE_COLORS.items():
        count = (viz_data['source_collection'] == src).sum()
        if count > 0:
            print(f"   {src}: {count:,} points")
    
    # Display the map explicitly
    display(m)
else:
    print("No data to visualize")

🗺️ Visualizing 44,872 points colored by source:
   SESAR: 31,734 points
   OPENCONTEXT: 8,593 points
   GEOME: 3,324 points
   SMITHSONIAN: 1,221 points


[Interactive Lonboard map: 44,872 points colored by source collection]

### 9.1 Visualizing from Wide Format

The PQG Wide format stores coordinates in rows whose `otype` is `GeospatialCoordLocation`, so the query filters on `otype`.
The `n` column contains the source collection (named graph).

In [31]:
# Visualize from WIDE format (uses `n` column for source, `otype` for filtering)
if LONBOARD_AVAILABLE:
    from IPython.display import display
    
    print(f"Querying {SAMPLE_SIZE:,} samples from Wide format...")
    start = time.time()
    
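    # Wide format uses `n` for named graph (source collection) and otype filter.
    # Note: USING SAMPLE runs before the WHERE filter, so only the sampled rows that
    # happen to be GeospatialCoordLocation survive (hence far fewer than SAMPLE_SIZE).
    # `pid` here identifies the location row, not a sample.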
    wide_viz_data = con.sql(f"""
        SELECT 
            longitude as lon,
            latitude as lat,
            n as source_collection,  -- Named graph contains source
            pid as sample_identifier
        FROM read_parquet('{PATHS['zenodo_wide']}')
        WHERE otype = 'GeospatialCoordLocation' 
          AND latitude IS NOT NULL
        USING SAMPLE {SAMPLE_SIZE}
    """).fetchdf()
    
    query_time = (time.time() - start) * 1000
    print(f"Query time: {query_time:.1f}ms, {len(wide_viz_data):,} points")
    
    # Create geometry and colors
    geometry = gpd.points_from_xy(wide_viz_data['lon'], wide_viz_data['lat'])
    wide_gdf = gpd.GeoDataFrame(wide_viz_data, geometry=geometry, crs="EPSG:4326")
    
    colors = np.array([
        SOURCE_COLORS.get(src, DEFAULT_COLOR) 
        for src in wide_viz_data['source_collection']
    ], dtype=np.uint8)
    
    # Create layer and map
    wide_layer = ScatterplotLayer.from_geopandas(
        wide_gdf,
        get_fill_color=colors,
        get_radius=3000,
        radius_min_pixels=2,
        radius_max_pixels=10,
        opacity=0.8,
        pickable=True,
    )
    
    wide_map = Map(wide_layer)
    
    print(f"\n🗺️ Wide format: {len(wide_gdf):,} points")
    print(wide_viz_data['source_collection'].value_counts().to_string())
    
    # Display the map explicitly
    display(wide_map)
else:
    print("Skipping (Lonboard not available)")

Querying 50,000 samples from Wide format...
Query time: 166.9ms, 12,579 points

🗺️ Wide format: 12,579 points
source_collection
SESAR          9247
OPENCONTEXT    2219
GEOME           605
SMITHSONIAN     508


[Interactive Lonboard map: 12,579 points from the Wide format, colored by source collection]

### 9.2 Visualization Tips

**Memory Management for 6M+ Points:**
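- Use `USING SAMPLE N` to limit points (shown above); DuckDB samples before applying `WHERE`, so the filtered result can hold fewer than N rows
- Or use `ORDER BY random()` with `LIMIT N`, which samples after filtering; fix the seed first (for example with `setseed()`) if the sample must be reproducible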
- For full dataset: consider H3 hexbin aggregation first
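
To illustrate the aggregation option, here is a minimal sketch that bins points into a fixed latitude/longitude grid in plain Python. This is a simple stand-in for H3 hexbins (no H3 library is used), and the cell size, function names, and sample points are all made up for the example:

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) index of the square grid cell containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def aggregate(points, cell_deg=1.0):
    """Count points per grid cell so the map draws one marker per cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)

# Made-up points: two near PKAP (Cyprus), one near Poggio Civitate (Italy).
points = [(34.987, 33.708), (34.991, 33.712), (43.15, 11.40)]
counts = aggregate(points)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

A square grid distorts cell area toward the poles, which is one reason hexagonal systems such as H3 are usually preferred for real maps; the counting logic stays the same.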

**Format Comparison for Visualization:**

| Format | Query | Source Color | Notes |
|--------|-------|--------------|-------|
| **Export** | Direct `lat/lon` columns | `source_collection` | Fastest, simplest |
| **Wide** | Filter `otype='GeospatialCoordLocation'` | `n` (named graph) | Slightly slower |
| **Narrow** | Same as Wide | Same as Wide | Slowest (most rows) |

**Color Scheme Used:**
- 🔴 SESAR (geological): Tomato red
- 🔵 OPENCONTEXT (archaeological): Royal blue  
- 🟢 GEOME (biological): Lime green
- 🟡 SMITHSONIAN (museum): Gold

**Next Steps:**
- See `geoparquet.ipynb` for more advanced memory-efficient strategies
- See `isample-archive.ipynb` for remote parquet visualization patterns

## 10. Browser Visualization with Cesium

For web-based 3D globe visualization, use **CesiumJS** with **DuckDB-WASM**. This enables:
- No server required - runs entirely in browser
- 3D globe with terrain
- Click-to-query sample details
- Works with remote parquet files via HTTP range requests

**Reference implementations:**
- `isamplesorg.github.io/tutorials/parquet_cesium_isamples_wide.qmd` - Quarto tutorial with live demo
- Remote parquet URLs work directly in browser:
  ```javascript
  // Assumes `db` is an instantiated duckdb-wasm AsyncDuckDB (worker/bundle setup omitted)
  const conn = await db.connect();
  const result = await conn.query(`
    SELECT latitude, longitude, pid
    FROM read_parquet('https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet')
    WHERE otype = 'GeospatialCoordLocation'
    LIMIT 10000
  `);
  ```

**Lonboard vs Cesium:**

| Feature | Lonboard (Jupyter) | Cesium (Browser) |
|---------|-------------------|------------------|
| Environment | Jupyter notebooks | Web pages/Quarto |
| Rendering | 2D WebGL | 3D Globe |
| Best for | Data exploration | Public demos |
| Max points | ~500K comfortable | ~100K with clustering |
| Interactivity | Pan/zoom, hover | Click, terrain, 3D |

## 11. Focus Sites: PKAP and Poggio Civitate

To make the 6M+ sample dataset more tangible, let's explore two well-documented OpenContext archaeological sites:

| Site | Location | Coordinates | Scale |
|------|----------|-------------|-------|
| **PKAP** | Pyla-Koutsopetria, Cyprus | 34.99°N, 33.71°E | 544 locations, 15K+ events |
| **Poggio Civitate** | Murlo, Tuscany, Italy | 43.15°N, 11.40°E | 11K+ locations, 30K events |

These sites demonstrate:
- How coordinates cluster around archaeological excavations
- The relationship between samples, events, and locations
- Real-world query patterns for site-specific analysis

In [None]:
# Define focus sites
FOCUS_SITES = {
    'PKAP': {
        'name': 'Pyla-Koutsopetria Archaeological Project',
        'location': 'Cyprus',
        'lat': 34.987406,
        'lon': 33.708047,
        'radius_deg': 0.05,  # bbox half-width in degrees (~5.5 km north-south)
    },
    'Poggio': {
        'name': 'Poggio Civitate',
        'location': 'Murlo, Tuscany, Italy', 
        'lat': 43.15,
        'lon': 11.40,
        'radius_deg': 0.1,  # bbox half-width in degrees (~11 km north-south)
    }
}

def get_site_bbox(site):
    """Get bounding box for a focus site."""
    return {
        'min_lat': site['lat'] - site['radius_deg'],
        'max_lat': site['lat'] + site['radius_deg'],
        'min_lon': site['lon'] - site['radius_deg'],
        'max_lon': site['lon'] + site['radius_deg'],
    }

# Display site info
for key, site in FOCUS_SITES.items():
    bbox = get_site_bbox(site)
    print(f"📍 {key}: {site['name']}")
    print(f"   Location: {site['location']}")
    print(f"   Center: {site['lat']:.4f}°N, {site['lon']:.4f}°E")
    print(f"   Bbox: [{bbox['min_lat']:.2f}, {bbox['min_lon']:.2f}] to [{bbox['max_lat']:.2f}, {bbox['max_lon']:.2f}]")
    print()
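
Because `radius_deg` is expressed in degrees, the box is not square on the ground: a degree of longitude shrinks with the cosine of the latitude. A rough sketch of the conversion, assuming a spherical Earth with about 111.32 km per degree of latitude; `KM_PER_DEG_LAT` and `half_width_km` are illustrative names, not part of the notebook:

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

def half_width_km(lat_deg, radius_deg):
    """Return (north-south, east-west) half-widths in km for a +/- radius_deg box."""
    ns = radius_deg * KM_PER_DEG_LAT
    ew = radius_deg * KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))
    return ns, ew

# Site centers and radii copied from FOCUS_SITES above.
for name, lat, radius in [("PKAP", 34.987406, 0.05), ("Poggio", 43.15, 0.1)]:
    ns, ew = half_width_km(lat, radius)
    print(f"{name}: +/-{ns:.1f} km N-S, +/-{ew:.1f} km E-W")
```

For site-scale boxes like these the approximation is adequate; anything needing true distances would be better served by a projected CRS.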

### 11.1 Query Site Data from Export Format

The Export format makes spatial queries simple - just filter on lat/lon columns.

In [None]:
# Query samples from each focus site using Export format
site_data = {}

for key, site in FOCUS_SITES.items():
    bbox = get_site_bbox(site)
    
    print(f"=== {key}: {site['name']} ===")
    start = time.time()
    
    df = con.sql(f"""
        SELECT 
            sample_identifier,
            label,
            sample_location_latitude as lat,
            sample_location_longitude as lon,
            source_collection
        FROM read_parquet('{PATHS['export']}')
        WHERE sample_location_latitude BETWEEN {bbox['min_lat']} AND {bbox['max_lat']}
          AND sample_location_longitude BETWEEN {bbox['min_lon']} AND {bbox['max_lon']}
    """).fetchdf()
    
    elapsed = (time.time() - start) * 1000
    site_data[key] = df
    
    print(f"Found {len(df):,} samples in {elapsed:.1f}ms")
    print(f"Coordinate range: [{df['lat'].min():.4f}, {df['lon'].min():.4f}] to [{df['lat'].max():.4f}, {df['lon'].max():.4f}]")
    print(f"Unique locations: {df.groupby(['lat', 'lon']).ngroups:,}")
    print()

### 11.2 Visualize PKAP (Cyprus)

Zoomed view of the Pyla-Koutsopetria Archaeological Project survey area.

In [None]:
# Visualize PKAP site
if LONBOARD_AVAILABLE and 'PKAP' in site_data and len(site_data['PKAP']) > 0:
    from IPython.display import display
    
    pkap_df = site_data['PKAP']
    site = FOCUS_SITES['PKAP']
    
    # Create geometry
    geometry = gpd.points_from_xy(pkap_df['lon'], pkap_df['lat'])
    pkap_gdf = gpd.GeoDataFrame(pkap_df, geometry=geometry, crs="EPSG:4326")
    
    # Single color for site-specific view (blue)
    colors = np.full((len(pkap_gdf), 4), [65, 105, 225, 200], dtype=np.uint8)
    
    # Create layer
    pkap_layer = ScatterplotLayer.from_geopandas(
        pkap_gdf,
        get_fill_color=colors,
        get_radius=50,  # smaller radius for zoomed view
        radius_min_pixels=3,
        radius_max_pixels=8,
        opacity=0.8,
        pickable=True,
    )
    
    # Create map centered on site
    pkap_map = Map(pkap_layer)
    pkap_map.set_view_state(latitude=site['lat'], longitude=site['lon'], zoom=14)
    
    print(f"🗺️ PKAP: {len(pkap_gdf):,} samples at {pkap_gdf.groupby(['lat', 'lon']).ngroups:,} unique locations")
    print(f"   Center: {site['lat']:.4f}°N, {site['lon']:.4f}°E")
    
    display(pkap_map)
else:
    print("PKAP data not available or Lonboard not installed")

### 11.3 Visualize Poggio Civitate (Tuscany)

Zoomed view of the Poggio Civitate excavation site in Murlo, Italy.

In [None]:
# Visualize Poggio Civitate site
if LONBOARD_AVAILABLE and 'Poggio' in site_data and len(site_data['Poggio']) > 0:
    from IPython.display import display
    
    poggio_df = site_data['Poggio']
    site = FOCUS_SITES['Poggio']
    
    # Create geometry
    geometry = gpd.points_from_xy(poggio_df['lon'], poggio_df['lat'])
    poggio_gdf = gpd.GeoDataFrame(poggio_df, geometry=geometry, crs="EPSG:4326")
    
    # Single color for site-specific view (tomato red)
    colors = np.full((len(poggio_gdf), 4), [255, 99, 71, 200], dtype=np.uint8)
    
    # Create layer
    poggio_layer = ScatterplotLayer.from_geopandas(
        poggio_gdf,
        get_fill_color=colors,
        get_radius=20,  # even smaller for dense site
        radius_min_pixels=2,
        radius_max_pixels=6,
        opacity=0.7,
        pickable=True,
    )
    
    # Create map centered on site
    poggio_map = Map(poggio_layer)
    poggio_map.set_view_state(latitude=site['lat'], longitude=site['lon'], zoom=15)
    
    print(f"🗺️ Poggio Civitate: {len(poggio_gdf):,} samples at {poggio_gdf.groupby(['lat', 'lon']).ngroups:,} unique locations")
    print(f"   Center: {site['lat']:.4f}°N, {site['lon']:.4f}°E")
    
    display(poggio_map)
else:
    print("Poggio Civitate data not available or Lonboard not installed")

### 11.4 Site-Specific Material Analysis

What materials were found at each site? This demonstrates practical site-level queries.

In [None]:
# Material categories at each site
# Load official iSamples vocabulary labels from material_hierarchy.json
# Source: https://github.com/isamplesorg/isamples_inabox/blob/develop/isb_web/static/controlled_vocabulary/material_hierarchy.json

import json
from pathlib import Path

VOCAB_PATH = Path.home() / 'C/src/iSamples/isamples_inabox/isb_web/static/controlled_vocabulary/material_hierarchy.json'

def extract_labels_from_hierarchy(node, result=None):
    """Recursively extract URI -> label mappings from vocabulary hierarchy."""
    if result is None:
        result = {}
    
    for uri, data in node.items():
        if isinstance(data, dict):
            if 'label' in data and 'en' in data['label']:
                # Store both 0.9 and 1.0 versions (data uses 1.0)
                uri_1_0 = uri.replace('/0.9/', '/1.0/')
                result[uri_1_0] = data['label']['en']
                result[uri] = data['label']['en']
            if 'children' in data:
                for child in data['children']:
                    extract_labels_from_hierarchy(child, result)
    return result

# Load vocabulary and build lookup
if VOCAB_PATH.exists():
    with open(VOCAB_PATH) as f:
        vocab_hierarchy = json.load(f)
    URI_TO_LABEL = extract_labels_from_hierarchy(vocab_hierarchy)
    print(f"Loaded {len(URI_TO_LABEL)} material labels from vocabulary file")
else:
    print(f"⚠️ Vocabulary file not found: {VOCAB_PATH}")
    URI_TO_LABEL = {}

def get_material_label(uri):
    """Get human-readable label for a material URI."""
    return URI_TO_LABEL.get(uri, uri.split('/')[-1])

# Query and display materials for each site
for key, site in FOCUS_SITES.items():
    bbox = get_site_bbox(site)
    
    print(f"\n=== {key}: Material Categories ===")
    
    result = con.sql(f"""
        SELECT 
            mat.identifier as uri,
            COUNT(*) as cnt
        FROM (
            SELECT unnest(has_material_category) as mat
            FROM read_parquet('{PATHS['export']}')
            WHERE sample_location_latitude BETWEEN {bbox['min_lat']} AND {bbox['max_lat']}
              AND sample_location_longitude BETWEEN {bbox['min_lon']} AND {bbox['max_lon']}
              AND has_material_category IS NOT NULL
        )
        GROUP BY mat.identifier
        ORDER BY cnt DESC
        LIMIT 8
    """).fetchdf()
    
    # Add friendly label from vocabulary
    result['material'] = result['uri'].apply(get_material_label)
    
    # Display with friendly labels
    print(result[['material', 'cnt']].to_string())
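
To make the expected shape of the vocabulary file concrete, here is a compact restatement of the recursive label extraction run against a made-up two-level hierarchy. The `example.org` URIs are placeholders rather than real vocabulary terms, and `extract_labels` and `label_for` are illustrative names that mirror the helpers above:

```python
def extract_labels(node, result=None):
    """Walk {uri: {"label": {...}, "children": [...]}} nodes recursively."""
    if result is None:
        result = {}
    for uri, data in node.items():
        if isinstance(data, dict):
            label = data.get("label", {}).get("en")
            if label is not None:
                # Register both the 0.9 URI and its 1.0 counterpart.
                result[uri] = label
                result[uri.replace("/0.9/", "/1.0/")] = label
            for child in data.get("children", []):
                extract_labels(child, result)
    return result

# Made-up hierarchy with placeholder URIs.
toy_hierarchy = {
    "https://example.org/vocab/material/0.9/material": {
        "label": {"en": "Material"},
        "children": [
            {
                "https://example.org/vocab/material/0.9/rock": {
                    "label": {"en": "Rock"},
                    "children": [],
                }
            }
        ],
    }
}

labels = extract_labels(toy_hierarchy)

def label_for(uri):
    """Fall back to the last path segment when the URI is not in the lookup."""
    return labels.get(uri, uri.split("/")[-1])

print(label_for("https://example.org/vocab/material/1.0/rock"))
print(label_for("https://example.org/vocab/material/1.0/unlisted"))
```

The fallback to the last path segment means the material table stays readable even when the vocabulary file is missing, which is why the cell above still produces output with an empty `URI_TO_LABEL`.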

### 11.5 Data Format Question: Should PQG Include Labels?

**Current state:**
- **Export format**: `has_material_category` only contains `identifier` (URI), no label
- **Zenodo Wide/Narrow**: `IdentifiedConcept.label` = URI (not human-readable)
- **Eric's Wide**: `IdentifiedConcept.label` = human-readable (vocabulary lookup applied)

**The question:** Should Zenodo Wide/Narrow `IdentifiedConcept` rows include:
1. Just the URI (current) - requires external vocabulary lookup
2. Just the label - loses precise identifier
3. Both URI and label - redundant but self-contained

**Tradeoffs:**

| Approach | File Size | Query Simplicity | Vocabulary Updates |
|----------|-----------|------------------|-------------------|
| URI only | Smaller | Need JOIN to vocab | Easy to re-label |
| Label only | Smaller | Direct display | Stuck with old labels |
| Both | Larger | Best of both | Must regenerate |

**Recommendation:** Include both `pid` (URI) and `label` (human-readable) in `IdentifiedConcept` rows. The ~50K concept rows are tiny compared to 6M+ samples, so the size increase is negligible.

**Vocabulary source:** 
- Local: `~/C/src/iSamples/isamples_inabox/isb_web/static/controlled_vocabulary/material_hierarchy.json`
- GitHub: https://github.com/isamplesorg/isamples_inabox/blob/develop/isb_web/static/controlled_vocabulary/material_hierarchy.json