# iSamples Parquet Schema Comparison

**Goal**: Understand the tradeoffs among three parquet formats for iSamples data.

| Format | Philosophy | Rows | Relationships |
|--------|-----------|------|---------------|
| **Narrow** | Graph (nodes + edges) | 92M | Separate `_edge_` rows |
| **Wide** | Entity-centric | 19.5M | `p__*` arrays → row_ids |
| **Export** | Sample-centric | 6.7M | Nested STRUCTs |

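**Note**: The row counts in this table do not match the files benchmarked below (Narrow: 11.6M rows, Wide: 2.5M rows), which cover OpenContext only. See the source coverage section that follows.

**Key insight**: There is no universal best format. Each optimizes for different query patterns.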
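
To make the three philosophies concrete, here is a toy sketch of one sample and its collector in each layout, using plain Python structures instead of Parquet. The row_ids, labels, agent name, and helper function names are invented for illustration; only the column conventions (`otype`, `s`/`p`/`o`, `p__*` arrays, nested structs) come from the formats compared in this notebook.

```python
# Toy data: one sample (row_id 1), one sampling event (2), one agent (3).
# All values are invented for illustration.

# Narrow: every entity is a row, and every relationship is its own `_edge_` row.
narrow = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "label": "sample A"},
    {"row_id": 2, "otype": "SamplingEvent"},
    {"row_id": 3, "otype": "Agent", "name": "Jane Doe"},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 2, "p": "responsibility", "o": [3]},
]

# Wide: same entity rows, but edges are folded into p__* array columns.
wide = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "label": "sample A",
     "p__produced_by": [2]},
    {"row_id": 2, "otype": "SamplingEvent", "p__responsibility": [3]},
    {"row_id": 3, "otype": "Agent", "name": "Jane Doe"},
]

# Export: one row per sample, with related entities nested inside it.
export = [
    {"label": "sample A",
     "produced_by": {"responsibility": [{"name": "Jane Doe"}]}},
]


def follow_narrow(rows, start, predicate):
    """Return target row_ids of edges leaving `start` with the given predicate."""
    return [target for r in rows
            if r["otype"] == "_edge_" and r["s"] == start and r["p"] == predicate
            for target in r["o"]]


def collectors_narrow(rows, sample_id):
    """Two edge hops: sample -> event -> agent."""
    by_id = {r["row_id"]: r for r in rows}
    events = follow_narrow(rows, sample_id, "produced_by")
    agents = [a for e in events for a in follow_narrow(rows, e, "responsibility")]
    return [by_id[a]["name"] for a in agents]


def collectors_wide(rows, sample_id):
    """Two row_id lookups through the p__* arrays."""
    by_id = {r["row_id"]: r for r in rows}
    events = by_id[sample_id].get("p__produced_by", [])
    agents = [a for e in events for a in by_id[e].get("p__responsibility", [])]
    return [by_id[a]["name"] for a in agents]


def collectors_export(rows, label):
    """No lookups: the agent is already nested in the sample row."""
    return [resp["name"] for r in rows if r["label"] == label
            for resp in r["produced_by"]["responsibility"]]


print(collectors_narrow(narrow, 1))
print(collectors_wide(wide, 1))
print(collectors_export(export, "sample A"))
```

All three calls answer the same question; what differs is how many rows and hops each layout needs to get there.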

---

## ⚠️ Important: Source Coverage Mismatch

**The files compared here have DIFFERENT source coverage:**

| Format | Sources Included | Samples |
|--------|-----------------|---------|
| **Export** | SESAR, OpenContext, GEOME, Smithsonian | 6.7M |
| **Narrow/Wide** | OpenContext ONLY | ~1.1M |

This means:
- Export returns **~6M coordinate points** vs **~200K** for the PQG formats (Narrow and Wide)
- Benchmark comparisons are NOT apples-to-apples for row counts
- Timings still show how each structure handles a query pattern, but absolute times also scale with rows scanned, so read each timing alongside its row count

For fair benchmarks, either:
1. Filter Export to `source_collection = 'OPENCONTEXT'`
2. Or use PQG files that include all sources
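
Option 1 can be sketched as a small helper that wraps the Export file in a filtered subquery, so later benchmark queries can swap it in for `read_parquet(...)`. The helper name `export_relation`, the alias `export_oc`, and the file name used here are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path


def export_relation(path, source=None, alias="export_oc"):
    """Return a SQL FROM-clause fragment over the Export file.

    When `source` is given, rows are limited to that source_collection.
    """
    sql = f"SELECT * FROM read_parquet('{path}')"
    if source is not None:
        safe = source.replace("'", "''")  # escape single quotes for SQL
        sql += f" WHERE source_collection = '{safe}'"
    return f"({sql}) AS {alias}"


# Hypothetical file name; substitute PATHS['export'] in the notebook.
EXPORT_OC = export_relation(Path("isamples_export.parquet"), source="OPENCONTEXT")
print(EXPORT_OC)
```

With the notebook's connection, `con.sql(f"SELECT COUNT(*) FROM {EXPORT_OC}")` should then count only OpenContext rows, which can be checked against the `OPENCONTEXT` count in the source_collection breakdown in Section 3.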

## 1. Setup & Load Data

In [1]:
import duckdb
import pandas as pd
import time
from pathlib import Path

# File paths (local)
PATHS = {
    'export': Path.home() / 'Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'narrow': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg.parquet',
    'wide': Path.home() / 'Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet',
}

# Remote URLs (for reference)
URLS = {
    'export': 'https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet',
    'narrow': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet',
    'wide': 'https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg_wide.parquet',
}

# Verify files exist
for name, path in PATHS.items():
    exists = '✅' if path.exists() else '❌'
    size = f'{path.stat().st_size / 1e6:.1f} MB' if path.exists() else 'N/A'
    print(f'{exists} {name}: {size}')

✅ export: 297.0 MB
✅ narrow: 724.5 MB
✅ wide: 288.7 MB


In [2]:
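# Helper function for timing queries
def timed_query(con, sql, name="Query"):
    """Execute a query once and return (result_df, elapsed_ms).

    Single-run wall-clock timing: includes result materialization and any cold-cache cost.
    """
    start = time.perf_counter()  # monotonic clock, better suited to short intervals than time.time()
    result = con.sql(sql).fetchdf()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name}: {elapsed:.1f}ms, {len(result):,} rows")
    return result, elapsed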

# Create connection
con = duckdb.connect()
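
Because `timed_query` measures a single run, millisecond-scale results can be noisy. A minimal sketch of a repeated-run variant follows; the `bench` name, its `repeats` and `warmup` parameters, and the stand-in workload are assumptions added here so the example runs without the Parquet files.

```python
import statistics
import time


def bench(run, repeats=5, warmup=1):
    """Call `run()` several times and return the median elapsed time in ms."""
    for _ in range(warmup):
        run()  # discard warm-up runs (cold caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload; in the notebook this could be
# lambda: con.sql(sql).fetchdf()
median_ms = bench(lambda: sum(range(100_000)))
print(f"median: {median_ms:.2f}ms")
```

The median is used rather than the mean so that one slow outlier run does not dominate the reported figure.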

## 2. Schema Inspection

Understanding what columns exist and their types.

In [3]:
# Get schema for each format
schemas = {}
for name, path in PATHS.items():
    if path.exists():
        result = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchdf()
        schemas[name] = result
        print(f"\n=== {name.upper()} ({len(result)} columns) ===")
        print(result[['column_name', 'column_type']].to_string())


=== EXPORT (19 columns) ===
                  column_name column_type
0           sample_identifier     VARCHAR
1                         @id

In [4]:
# Compare column counts and key structural differences
comparison = pd.DataFrame({
    'Format': ['Export', 'Narrow', 'Wide'],
    'Columns': [len(schemas.get('export', [])), len(schemas.get('narrow', [])), len(schemas.get('wide', []))],
    'Has edge cols (s,p,o)': ['No', 'Yes', 'No'],
    'Has p__* cols': ['No', 'No', 'Yes'],
    'Has nested STRUCTs': ['Yes', 'No', 'No'],
    'Has otype col': ['No', 'Yes', 'Yes'],
})
comparison

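| Format | Columns | Has edge cols (s,p,o) | Has p__* cols | Has nested STRUCTs | Has otype col |
|--------|---------|-----------------------|---------------|--------------------|---------------|
| Export | 19 | No | No | Yes | No |
| Narrow | 40 | Yes | No | No | Yes |
| Wide | 47 | No | Yes | No | Yes |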


## 3. Row Count Analysis

Understanding what's IN each format.

In [5]:
# Total row counts
row_counts = {}
for name, path in PATHS.items():
    if path.exists():
        count = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0]
        row_counts[name] = count
        print(f"{name}: {count:,} rows")

export: 6,680,932 rows
narrow: 11,637,144 rows
wide: 2,464,690 rows


In [6]:
# For PQG formats: breakdown by otype
print("=== NARROW: Rows by otype ===")
if PATHS['narrow'].exists():
    result = con.sql(f"""
        SELECT otype, COUNT(*) as cnt 
        FROM read_parquet('{PATHS['narrow']}')
        GROUP BY otype ORDER BY cnt DESC
    """).fetchdf()
    print(result.to_string())

print("\n=== WIDE: Rows by otype ===")
if PATHS['wide'].exists():
    result = con.sql(f"""
        SELECT otype, COUNT(*) as cnt 
        FROM read_parquet('{PATHS['wide']}')
        GROUP BY otype ORDER BY cnt DESC
    """).fetchdf()
    print(result.to_string())

=== NARROW: Rows by otype ===
                     otype      cnt
0                   _edge_  9201451
1            SamplingEvent  1096352
2     MaterialSampleRecord  1096352
3  GeospatialCoordLocation   198433
4        IdentifiedConcept    25778
5             SamplingSite    18213
6                    Agent      565

=== WIDE: Rows by otype ===
                     otype      cnt
0     MaterialSampleRecord  1110412
1            SamplingEvent  1110412
2  GeospatialCoordLocation   199147
3        IdentifiedConcept    25929
4             SamplingSite    18213
5                    Agent      577


In [7]:
# For Export: breakdown by source_collection
print("=== EXPORT: Rows by source_collection ===")
if PATHS['export'].exists():
    result = con.sql(f"""
        SELECT source_collection, COUNT(*) as cnt 
        FROM read_parquet('{PATHS['export']}')
        GROUP BY source_collection ORDER BY cnt DESC
    """).fetchdf()
    print(result.to_string())

=== EXPORT: Rows by source_collection ===
  source_collection      cnt
0             SESAR  4688386
1       OPENCONTEXT  1064831
2             GEOME   605554
3       SMITHSONIAN   322161


## 4. Query Benchmark Suite

Testing common query patterns across all three formats.

### 4.1 Map Visualization: Get All Coordinates

**Use case**: Render points on a Cesium/Leaflet map

In [8]:
# EXPORT: Direct column access
print("=== EXPORT ===")
export_coords, export_time = timed_query(con, f"""
    SELECT sample_location_latitude as lat, sample_location_longitude as lon
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_location_latitude IS NOT NULL
""", "All coordinates")

=== EXPORT ===
All coordinates: 32.0ms, 5,980,282 rows


In [9]:
# WIDE: Filter by otype
print("=== WIDE ===")
wide_coords, wide_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['wide']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

=== WIDE ===
All coordinates: 4.6ms, 199,146 rows


In [10]:
# NARROW: Same as Wide
print("=== NARROW ===")
narrow_coords, narrow_time = timed_query(con, f"""
    SELECT latitude as lat, longitude as lon
    FROM read_parquet('{PATHS['narrow']}')
    WHERE otype = 'GeospatialCoordLocation' AND latitude IS NOT NULL
""", "All coordinates")

=== NARROW ===
All coordinates: 6.3ms, 198,432 rows


In [11]:
# Summary
print("\n=== MAP QUERY SUMMARY ===")
print(f"Export: {export_time:.1f}ms ({len(export_coords):,} points)")
print(f"Wide:   {wide_time:.1f}ms ({len(wide_coords):,} points)")
print(f"Narrow: {narrow_time:.1f}ms ({len(narrow_coords):,} points)")
print(f"\nWide took {wide_time/export_time:.1f}x as long as Export, on {len(export_coords)/len(wide_coords):.0f}x fewer points")


=== MAP QUERY SUMMARY ===
Export: 32.0ms (5,980,282 points)
Wide:   4.6ms (199,146 points)
Narrow: 6.3ms (198,432 points)

Wide took 0.1x as long as Export, on 30x fewer points
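
Since the three files return very different numbers of points, raw milliseconds are hard to compare. One rough way to normalize is points returned per millisecond, sketched below with the timings and counts copied from the recorded output above. These are single-run figures, so treat the resulting rates as indicative only; the `points_per_ms` helper is introduced here for illustration.

```python
# Timings and point counts copied from the recorded output above.
results = {
    "export": {"ms": 32.0, "points": 5_980_282},
    "wide":   {"ms": 4.6,  "points": 199_146},
    "narrow": {"ms": 6.3,  "points": 198_432},
}


def points_per_ms(entry):
    """Throughput: rows returned per millisecond of query time."""
    return entry["points"] / entry["ms"]


throughput = {name: points_per_ms(entry) for name, entry in results.items()}

# Print formats from highest to lowest throughput.
for name, rate in sorted(throughput.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:,.0f} points/ms")
```

On these numbers Export has the highest throughput even though its wall-clock time is the longest, which is consistent with the coverage mismatch described at the top of the notebook.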


### 4.2 Faceted Search: Count by Material Category

**Use case**: Show facet counts in a search UI

In [12]:
# EXPORT: Unnest nested struct array
print("=== EXPORT ===")
export_facets, export_time = timed_query(con, f"""
    SELECT 
        mat.identifier as material,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(has_material_category) as mat
        FROM read_parquet('{PATHS['export']}')
        WHERE has_material_category IS NOT NULL AND len(has_material_category) > 0
    )
    GROUP BY mat.identifier
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(export_facets.to_string())

=== EXPORT ===
Material facets: 64.0ms, 10 rows
                                                                      material      cnt
0               https://w3id.org/isample/vocabulary/material/1.0/earthmaterial  2261513
1             https://w3id.org/isample/vocabulary/material/1.0/organicmaterial  1265560
2                        https://w3id.org/isample/vocabulary/material/1.0/rock  1208585
3  https://w3id.org/isample/vocabulary/material/1.0/biogenicnonorganicmaterial  1091781
4       https://w3id.org/isample/vocabulary/material/1.0/mixedsoilsedimentrock   838805
5                    https://w3id.org/isample/vocabulary/material/1.0/material   673018
6                     https://w3id.org/isample/vocabulary/material/1.0/mineral   390797
7          https://w3id.org/isample/vocabulary/material/1.0/anthropogenicmetal   270040
8                https://w3id.org/isample/opencontext/material/0.1/ceramicclay   100573
9                    https://w3id.org/isample/vocabulary/material/1.0/se

In [13]:
# WIDE: JOIN via p__has_material_category
# This requires finding IdentifiedConcept rows by row_id
print("=== WIDE ===")
wide_facets, wide_time = timed_query(con, f"""
    WITH samples AS (
        SELECT unnest(p__has_material_category) as concept_rowid
        FROM read_parquet('{PATHS['wide']}')
        WHERE otype = 'MaterialSampleRecord' 
          AND p__has_material_category IS NOT NULL
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['wide']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM samples s
    JOIN concepts c ON s.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(wide_facets.to_string())

=== WIDE ===


Material facets: 30.4ms, 10 rows
                        material     cnt
0  Biogenic non-organic material  532843
1               Organic material  217562
2                       Material  159434
3   Other anthropogenic material  145431
4                           Rock   37948
5   Anthropogenic metal material   11694
6    Mixed soil sediment or rock    3207
7                        Mineral    2233
8         Natural Solid Material      58
9                       Sediment       2


In [14]:
# NARROW: Follow edges with predicate='has_material_category'
print("=== NARROW ===")
narrow_facets, narrow_time = timed_query(con, f"""
    WITH edges AS (
        SELECT s as sample_rowid, unnest(o) as concept_rowid
        FROM read_parquet('{PATHS['narrow']}')
        WHERE otype = '_edge_' AND p = 'has_material_category'
    ),
    concepts AS (
        SELECT row_id, label
        FROM read_parquet('{PATHS['narrow']}')
        WHERE otype = 'IdentifiedConcept'
    )
    SELECT c.label as material, COUNT(*) as cnt
    FROM edges e
    JOIN concepts c ON e.concept_rowid = c.row_id
    GROUP BY c.label
    ORDER BY cnt DESC
    LIMIT 10
""", "Material facets")
print(narrow_facets.to_string())

=== NARROW ===


Material facets: 36.3ms, 10 rows
                        material     cnt
0  Biogenic non-organic material  532675
1               Organic material  212584
2                       Material  158586
3   Other anthropogenic material  145316
4                           Rock   30186
5   Anthropogenic metal material   11659
6    Mixed soil sediment or rock    3207
7                        Mineral    2080
8         Natural Solid Material      58
9                       Sediment       1


In [15]:
# Summary
print("\n=== FACET QUERY SUMMARY ===")
print(f"Export: {export_time:.1f}ms")
print(f"Wide:   {wide_time:.1f}ms")
print(f"Narrow: {narrow_time:.1f}ms")


=== FACET QUERY SUMMARY ===
Export: 64.0ms
Wide:   30.4ms
Narrow: 36.3ms


### 4.3 Entity Listing: Get All Unique Agents

**Use case**: Populate a dropdown, show "who collected samples"

**Key tradeoff**: Export cannot do this efficiently!

In [16]:
# WIDE: Direct query on Agent rows
print("=== WIDE ===")
wide_agents, wide_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['wide']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(wide_agents.to_string())

=== WIDE ===
All agents: 3.5ms, 10 rows
                       name                                                                                                           role  cnt
0           Arianne Boileau                                                  Participated in: Household Zooarchaeology of Colonial Lamanai    2
1                Mila Hover                                                                                    Participated in: Kenan Tepe    1
2           Justin Jennings                                                     Participated in: Andean Geochemistry Visualization Project    1
3                  Mohammed                                                                Participated in: Petra Great Temple Excavations    1
4           Madeline Mackie                                                                       Participated in: Cross-referenced p3k14c    1
5           Liora K Horwitz  Participated in: Biometry of Iron Age II and Hellenistic Period Dog

In [17]:
# NARROW: Same approach
print("=== NARROW ===")
narrow_agents, narrow_time = timed_query(con, f"""
    SELECT name, role, COUNT(*) as cnt
    FROM read_parquet('{PATHS['narrow']}')
    WHERE otype = 'Agent'
    GROUP BY name, role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents")
print(narrow_agents.to_string())

=== NARROW ===
All agents: 4.8ms, 10 rows
                name                                                                            role  cnt
0    Arianne Boileau                   Participated in: Household Zooarchaeology of Colonial Lamanai    2
1        Dusan Boric                                      Participated in: Çatalhöyük Zooarchaeology    1
2    Elizabeth Clark                                                     Participated in: Kenan Tepe    1
3     Marisa Lazzari                      Participated in: Andean Geochemistry Visualization Project    1
4    Darryl B. Sneag                                 Participated in: Petra Great Temple Excavations    1
5    Pamela Crabtree                                  Participated in: West Stow West Zooarchaeology    1
6        Chris Hills                                      Participated in: Çatalhöyük Zooarchaeology    1
7  P. Nick Kardulias  Participated in: Pyla-Koutsopetria Archaeological Project I: Pedestrian Survey    1
8   

In [18]:
# EXPORT: Must scan all samples and extract from nested structs
# This is MUCH slower because agents are embedded in every sample row
print("=== EXPORT ===")
export_agents, export_time = timed_query(con, f"""
    SELECT 
        resp.name as name,
        resp.role as role,
        COUNT(*) as cnt
    FROM (
        SELECT unnest(produced_by.responsibility) as resp
        FROM read_parquet('{PATHS['export']}')
        WHERE produced_by IS NOT NULL 
          AND produced_by.responsibility IS NOT NULL
    )
    GROUP BY resp.name, resp.role
    ORDER BY cnt DESC
    LIMIT 10
""", "All agents (from nested)")
print(export_agents.to_string())

=== EXPORT ===


All agents (from nested): 272.3ms, 10 rows
                                                             name     role      cnt
0                                              Curator,,Collector     None  3516917
1  Curator Integrated Ocean Drilling Program (TAMU),,Sample Owner     None  3516905
2                                       Adam Mansur,,Sample Owner     None   383835
3                                    Edward Gilbert,,Sample Owner     None   258790
4                                                   Jacob Freeman  creator   161623
5                                                      Andrea Kay  creator   161623
6                                                 Eugenia M. Gayo  creator   161623
7                                               Julie A. Hoggarth  creator   161623
8                                                 Madeline Mackie  creator   161623
9                                                 Steinar Solheim  creator   161623


In [19]:
# Summary
print("\n=== ENTITY LISTING SUMMARY ===")
print(f"Wide:   {wide_time:.1f}ms")
print(f"Narrow: {narrow_time:.1f}ms")
print(f"Export: {export_time:.1f}ms")
print("\n⚠️ Export is SLOWER for entity listing because agents are embedded in every sample row!")


=== ENTITY LISTING SUMMARY ===
Wide:   3.5ms
Narrow: 4.8ms
Export: 272.3ms

⚠️ Export is SLOWER for entity listing because agents are embedded in every sample row!


### 4.4 Reverse Lookup: Samples by Agent

**Use case**: "Show me all samples collected by Agent X"

In [20]:
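# Pick an agent name to look up.
# NOTE: this agent is found in Export (on GEOME samples) but yields 0 rows in Wide below,
# which covers OpenContext only. For a like-for-like test, pick an agent from the OpenContext data.
AGENT_NAME = 'Vance Vredenburg'  # Adjust based on your data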

print(f"Looking for samples by: {AGENT_NAME}")

Looking for samples by: Vance Vredenburg


In [21]:
# EXPORT: Filter on nested struct
print("=== EXPORT ===")
export_by_agent, export_time = timed_query(con, f"""
    SELECT sample_identifier, label
    FROM read_parquet('{PATHS['export']}')
    WHERE list_contains(
        [r.name FOR r IN produced_by.responsibility],
        '{AGENT_NAME}'
    )
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(export_by_agent.to_string())

=== EXPORT ===
Samples by Vance Vredenburg: 4.9ms, 10 rows
     sample_identifier label
0   ark:/21547/DSz2757   757
1   ark:/21547/DSz2779   779
2   ark:/21547/DSz2806   806
3   ark:/21547/DSz2807   807
4   ark:/21547/DSz2759   759
5   ark:/21547/DSz2761   761
6   ark:/21547/DSz2967   967
7   ark:/21547/DSz2763   763
8   ark:/21547/DSz2979   979
9  ark:/21547/DSz21792  1792


In [22]:
# WIDE: Find agent row_id, then find samples with that row_id in p__responsibility
print("=== WIDE ===")
wide_by_agent, wide_time = timed_query(con, f"""
    WITH agent AS (
        SELECT row_id 
        FROM read_parquet('{PATHS['wide']}')
        WHERE otype = 'Agent' AND name = '{AGENT_NAME}'
        LIMIT 1
    ),
    events AS (
        SELECT w.row_id as event_id
        FROM read_parquet('{PATHS['wide']}') w, agent
        WHERE w.otype = 'SamplingEvent' 
          AND list_contains(w.p__responsibility, agent.row_id)
    )
    SELECT s.sample_identifier, s.label
    FROM read_parquet('{PATHS['wide']}') s, events
    WHERE s.otype = 'MaterialSampleRecord'
      AND list_contains(s.p__produced_by, events.event_id)
    LIMIT 10
""", f"Samples by {AGENT_NAME}")
print(wide_by_agent.to_string())

=== WIDE ===
Samples by Vance Vredenburg: 18.2ms, 0 rows
Empty DataFrame
Columns: [sample_identifier, label]
Index: []


In [23]:
# Summary
print("\n=== REVERSE LOOKUP SUMMARY ===")
print(f"Export: {export_time:.1f}ms")
print(f"Wide:   {wide_time:.1f}ms")
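print("\nNote: not comparable here. Wide matched 0 rows, most likely because this agent is a GEOME contributor and the Wide file covers OpenContext only")


=== REVERSE LOOKUP SUMMARY ===
Export: 4.9ms
Wide:   18.2ms

Note: not comparable here. Wide matched 0 rows, most likely because this agent is a GEOME contributor and the Wide file covers OpenContext only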
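
A timing is only meaningful to compare when every format actually found rows. The sketch below shows one way to guard against comparing a hit with a miss; the `compare_timings` helper is hypothetical, and the input values are the ones from the recorded run above.

```python
def compare_timings(results):
    """Pick the fastest format from {name: (elapsed_ms, row_count)}.

    Returns None when any query matched no rows, because a miss and a hit
    are not measuring the same work.
    """
    empty = [name for name, (_, rows) in results.items() if rows == 0]
    if empty:
        print(f"Not comparable: no rows returned for {', '.join(empty)}")
        return None
    return min(results, key=lambda name: results[name][0])


# Values from the recorded run: Export found 10 rows, Wide found none.
winner = compare_timings({"export": (4.9, 10), "wide": (18.2, 0)})
print(winner)
```

Rerunning the lookup with an agent that exists in the OpenContext data would give both queries real work to do and make the comparison valid.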


### 4.5 Sample Detail: Get Full Info for One Sample

**Use case**: User clicks on a sample, show all details

In [24]:
# Pick a sample identifier (first Export row; here a GEOME sample, so it has no match in the OpenContext-only Wide file)
SAMPLE_ID = con.sql(f"""
    SELECT sample_identifier FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier IS NOT NULL LIMIT 1
""").fetchone()[0]
print(f"Sample: {SAMPLE_ID}")

Sample: ark:/21547/DSz2757


In [25]:
# EXPORT: Everything on one row
print("=== EXPORT ===")
start = time.time()
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['export']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
export_time = (time.time() - start) * 1000
print(f"Time: {export_time:.1f}ms")
print(f"Columns returned: {len(result.columns)}")
print(result.T)  # Transpose for readability

=== EXPORT ===
Time: 61.9ms
Columns returned: 19
                                                                           0
sample_identifier                                         ark:/21547/DSz2757
@id                                                   metadata/21547/DSz2757
label                                                                    757
description                                 basisOfRecord: PreservedSpecimen
source_collection                                                      GEOME
has_sample_object_type     [{'identifier': 'https://w3id.org/isample/voca...
has_material_category      [{'identifier': 'https://w3id.org/isample/voca...
has_context_category       [{'identifier': 'https://w3id.org/isample/biol...
informal_classification                                 [Taricha, granulosa]
keywords                     [{'keyword': 'California'}, {'keyword': 'USA'}]
produced_by                {'description': 'expeditionCode: newts | proje...
last_modified_time         

In [26]:
# WIDE: Need to JOIN related entities
print("=== WIDE ===")
start = time.time()
# This is more complex - would need multiple JOINs to get full picture
result = con.sql(f"""
    SELECT *
    FROM read_parquet('{PATHS['wide']}')
    WHERE sample_identifier = '{SAMPLE_ID}'
""").fetchdf()
wide_time = (time.time() - start) * 1000
print(f"Time: {wide_time:.1f}ms")
print(f"Columns returned: {len(result.columns)}")
print("Note: This only returns the sample row, not related entities")
print(result[['sample_identifier', 'label', 'p__produced_by', 'p__has_material_category']].T)

=== WIDE ===
Time: 8.2ms
Columns returned: 47
Note: This only returns the sample row, not related entities
Empty DataFrame
Columns: []
Index: [sample_identifier, label, p__produced_by, p__has_material_category]

The empty result is expected: `SAMPLE_ID` is a GEOME sample (see `source_collection` in the Export output above) and the Wide file covers OpenContext only, so the 8.2ms measures a lookup miss rather than a detail fetch.


## 5. Storage Comparison

In [27]:
# File sizes and efficiency
storage = []
for name, path in PATHS.items():
    if path.exists():
        size_mb = path.stat().st_size / 1e6
        rows = row_counts.get(name, 0)
        cols = len(schemas.get(name, []))
        bytes_per_row = (size_mb * 1e6) / rows if rows > 0 else 0
        storage.append({
            'Format': name,
            'Size (MB)': f'{size_mb:.1f}',
            'Rows': f'{rows:,}',
            'Columns': cols,
            'Bytes/Row': f'{bytes_per_row:.1f}',
        })

pd.DataFrame(storage)

| Format | Size (MB) | Rows | Columns | Bytes/Row |
|--------|-----------|------|---------|-----------|
| export | 297.0 | 6,680,932 | 19 | 44.5 |
| narrow | 724.5 | 11,637,144 | 40 | 62.3 |
| wide | 288.7 | 2,464,690 | 47 | 117.1 |


## 6. Benchmark Summary

In [28]:
# Collect all benchmark results
# (Run this after all benchmarks above)

# You can fill this in after running the benchmarks
benchmark_results = {
    'Query Pattern': [
        'Map (all coords)',
        'Facets (material counts)',
        'Entity listing (all agents)',
        'Reverse lookup (samples by agent)',
        'Sample detail (one sample)',
    ],
    'Best Format': [
        'Export (direct columns)',
        'Export (unnest struct)',
        'Wide/Narrow (otype filter)',
        'Depends on data',
        'Export (all-in-one row)',
    ],
    'Notes': [
        'No JOINs needed in Export',
        'Export avoids row_id lookups',
        'Export must scan all samples',
        'list_contains() is O(n) in all formats',
        'Wide needs JOINs for related entities',
    ],
}

pd.DataFrame(benchmark_results)

| Query Pattern | Best Format | Notes |
|---------------|-------------|-------|
| Map (all coords) | Export (direct columns) | No JOINs needed in Export |
| Facets (material counts) | Export (unnest struct) | Export avoids row_id lookups |
| Entity listing (all agents) | Wide/Narrow (otype filter) | Export must scan all samples |
| Reverse lookup (samples by agent) | Depends on data | list_contains() is O(n) in all formats |
| Sample detail (one sample) | Export (all-in-one row) | Wide needs JOINs for related entities |

**Note**: These picks reflect query structure, not the raw timings above. Export scanned 6.7M rows from four sources while Wide and Narrow scanned OpenContext only, so the measured milliseconds (for example, Export 64.0ms vs Wide 30.4ms on facets) are confounded by the coverage mismatch.


## 7. Conclusions: When to Use Each Format

### Export Format
**Best for:**
- UI queries (map, search, facets)
- Sample-centric analysis
- When you don't need to query entities independently

**Avoid when:**
- You need to list all agents/sites/concepts
- You need graph traversal flexibility
- You need incremental updates

### Wide Format
**Best for:**
- Entity-centric queries ("all agents", "all sites")
- Analytical dashboards
- When you need both samples AND other entity types

**Avoid when:**
- Pure sample queries (Export is faster)
- Complex multi-hop traversals (Narrow is more natural)

### Narrow Format
**Best for:**
- Archival/preservation (full fidelity)
- Graph algorithms
- Relationship exploration
- When you need to traverse in any direction

**Avoid when:**
- Interactive UI (relationship lookups need edge JOINs, and the file is the largest of the three)
- Simple sample queries (overkill)

## 8. Key Insights

### What Export Gains
1. **No JOINs** - Everything on one row
2. **Pre-extracted coords** - `sample_location_latitude/longitude` at top level
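3. **Fewer rows** - 6.7M vs 19.5M (Wide) vs 92M (Narrow) per the overview table; the OpenContext-only Wide and Narrow files benchmarked here have 2.5M and 11.6M rows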

### What Export Loses
1. **Entity independence** - Can't query agents without scanning all samples
2. **Graph flexibility** - Can't traverse in arbitrary directions
3. **Incremental updates** - Must regenerate entire file

### The `list_contains()` Problem
Both Wide (p__* arrays) and Export (nested structs) suffer from O(n) scans when searching within arrays. Neither has index support in DuckDB/Parquet.
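
One common workaround, not benchmarked in this notebook, is to explode the arrays once into a flat lookup keyed by the searched value, so that later lookups avoid rescanning every row. The sketch below illustrates the idea on toy in-memory data; the sample identifiers, agent names, and function names are invented, and in practice the lookup would more likely be a separate Parquet table than a Python dict.

```python
from collections import defaultdict

# Toy Export-style rows; identifiers and names are invented.
samples = [
    {"sample_identifier": "s1", "agents": ["Ann", "Bo"]},
    {"sample_identifier": "s2", "agents": ["Bo"]},
    {"sample_identifier": "s3", "agents": []},
]


def scan(rows, agent):
    """O(n) approach: inspect every row's array, as list_contains() does."""
    return [r["sample_identifier"] for r in rows if agent in r["agents"]]


def build_index(rows):
    """One-time pass that maps each agent to the samples that mention it."""
    index = defaultdict(list)
    for r in rows:
        for agent in r["agents"]:
            index[agent].append(r["sample_identifier"])
    return index


index = build_index(samples)
print(scan(samples, "Bo"))
print(index["Bo"])
```

The trade-off is an extra artifact to build and keep in sync with the source file, in exchange for lookups that no longer depend on the total number of samples.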

### Recommendation for Eric's UI
For the iSamples Central UI requirements:
- **Start with Export format** - fastest for map + facets + click-to-detail
- **Pre-compute H3 aggregations** - for initial map render
- **Pre-compute facet counts** - avoid runtime aggregation
- **Keep Wide/Narrow for advanced queries** - entity exploration, graph traversal
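
As a minimal sketch of the pre-computed facet counts idea, the aggregation can be run once and stored as a small JSON document that the UI reads directly. The toy rows, category values, and the `facet_counts` helper below are invented for illustration; a real pipeline would compute the counts from the Export file.

```python
import json
from collections import Counter

# Toy sample rows; category values are invented placeholders.
samples = [
    {"has_material_category": ["rock"]},
    {"has_material_category": ["rock", "mineral"]},
    {"has_material_category": []},
]


def facet_counts(rows, field):
    """Count how often each value appears in a list-valued field."""
    counts = Counter()
    for row in rows:
        counts.update(row.get(field) or [])
    return dict(counts.most_common())


# Serialize once; the UI can then load this instead of aggregating at runtime.
precomputed = json.dumps(facet_counts(samples, "has_material_category"))
print(precomputed)
```

Pre-computed counts only cover the unfiltered view; facet counts that depend on the user's current filters would still need a runtime query.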