# OpenContext Parquet Analysis - Enhanced Version

This notebook provides a comprehensive analysis of the OpenContext iSamples property graph parquet file.

## Key Distinction: Generic PQG vs OpenContext-Specific

This analysis works with two conceptual layers:

1. **Generic PQG (Property Graph) Framework**: A domain-agnostic way to represent graphs in tabular format
   - Core fields: `row_id`, plus `s`, `p`, `o`, `n` (subject, predicate, object, name)
   - Edge representation: Rows with `otype = '_edge_'`
   - Graph traversal patterns applicable to any domain

2. **OpenContext-Specific Implementation**: Archaeological domain model built on PQG
   - Entity types: `MaterialSampleRecord`, `SamplingEvent`, `GeospatialCoordLocation`, etc.
   - Predicates: `produced_by`, `sample_location`, `has_material_category`, etc.
   - Domain fields: `latitude`, `longitude`, `label`, `description`, etc.

## Setup and Data Loading

In [1]:
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path
import urllib.request
import os

# Configuration
file_url = "https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet"
LOCAL_PATH = "/Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet"

In [2]:
# Check if local file exists, download if not
if not os.path.exists(LOCAL_PATH):
    print(f"Local file not found at {LOCAL_PATH}")
    
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)
    
    print(f"Downloading {file_url} to {LOCAL_PATH}...")
    urllib.request.urlretrieve(file_url, LOCAL_PATH)
    print("Download completed!")
else:
    print(f"Local file already exists at {LOCAL_PATH}")

# Use local path for parquet operations
parquet_path = LOCAL_PATH
print(f"Using parquet file: {parquet_path}")

Local file already exists at /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet
Using parquet file: /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet


## Understanding the Data Structure

### Generic PQG Framework
The parquet file uses a **property graph model** where both entities (nodes) and relationships (edges) are stored in a single table. This is a generic framework that could represent any graph data.

**Core PQG fields (framework-level)**:
- `row_id`: Unique identifier for each row
- `s` (subject): Source node in an edge
- `p` (predicate): Relationship type in an edge  
- `o` (object): Target node(s) in an edge (array)
- `n` (name): Graph context/namespace
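
To make the single-table layout concrete, here is a minimal pure-Python sketch (it does not use DuckDB). The `PQGRow` class, the pids, and the row values are invented for illustration; only the field names and the `_edge_` marker come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PQGRow:
    """One row of a PQG-style table (hypothetical helper, not a library class)."""
    row_id: int
    otype: str                  # entity type, or '_edge_' for relationships
    pid: Optional[str] = None   # identifier for node rows
    s: Optional[int] = None     # edge only: row_id of the source node
    p: Optional[str] = None     # edge only: relationship type
    o: Optional[list] = None    # edge only: row_ids of the target node(s)
    n: Optional[str] = None     # graph context, left as None here


# Toy data: two nodes and one edge stored side by side in the same "table".
table = [
    PQGRow(row_id=1, otype="MaterialSampleRecord", pid="sample-1"),
    PQGRow(row_id=2, otype="SamplingEvent", pid="event-1"),
    PQGRow(row_id=3, otype="_edge_", s=1, p="produced_by", o=[2]),
]

# Nodes and edges are told apart only by the otype value.
nodes = [r for r in table if r.otype != "_edge_"]
edges = [r for r in table if r.otype == "_edge_"]

print(f"{len(nodes)} node rows, {len(edges)} edge rows")
for e in edges:
    print(f"row {e.s} --{e.p}--> rows {e.o}")
```

The point of the sketch is that nothing but `otype` separates a node row from an edge row, which is why every query in this notebook filters on it.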

### OpenContext Domain Implementation
OpenContext uses this generic framework to model archaeological data:

**OpenContext-specific entity types** (values in `otype` field):
- `MaterialSampleRecord`: Physical samples/specimens
- `SamplingEvent`: Collection events
- `GeospatialCoordLocation`: Geographic locations
- `SamplingSite`: Archaeological sites
- `IdentifiedConcept`: Classifications/categories
- `Agent`: People/organizations
- `_edge_`: Relationships (generic PQG concept)

Key insight: To get meaningful archaeological data, you'll need to JOIN through edges to connect samples to their locations, events, or other properties.

In [3]:
# Create a DuckDB connection
conn = duckdb.connect()

# Create view for the parquet file
conn.execute(f"CREATE VIEW oc_pqg AS SELECT * FROM read_parquet('{parquet_path}');")

# Count records
result = conn.execute("SELECT COUNT(*) FROM oc_pqg;").fetchone()
print(f"Total records: {result[0]:,}")

Total records: 11,637,144


In [4]:
# Schema information
print("Schema information:")
schema_result = conn.execute("DESCRIBE oc_pqg;").fetchall()
for row in schema_result[:10]:  # Show first 10 columns
    print(f"{row[0]:25} | {row[1]}")
print(f"... and {len(schema_result) - 10} more columns")

Schema information:
row_id                    | INTEGER
pid                       | VARCHAR
tcreated                  | INTEGER
tmodified                 | INTEGER
otype                     | VARCHAR
s                         | INTEGER
p                         | VARCHAR
o                         | INTEGER[]
n                         | VARCHAR
altids                    | VARCHAR[]
... and 30 more columns


In [None]:
# Examine the distribution of entity types in detail
# Note: The `otype` values are OpenContext-specific, not part of generic PQG
entity_stats = conn.execute("""
    SELECT
        otype,
        COUNT(*) as count,
        COUNT(DISTINCT pid) as unique_pids,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM oc_pqg
    GROUP BY otype
    ORDER BY count DESC
""").fetchdf()

print("Entity Type Distribution (OpenContext-specific types):")
print(entity_stats)

### Graph Structure Fields (Generic PQG)

The fields `s`, `p`, `o`, `n` are part of the **generic PQG framework** for representing graphs:
- **s** (subject): row_id of the source entity
- **p** (predicate): the type of relationship
- **o** (object): array of target row_ids
- **n** (name): graph context (usually null)

This is a domain-agnostic pattern that could represent any graph. OpenContext uses it specifically for archaeological relationships like:
- A sample (s) has_material_category (p) pointing to a concept (o)
- A sample (s) produced_by (p) pointing to a sampling event (o)
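
As a rough illustration of how an `s`/`p`/`o` row is read, the following pure-Python sketch follows one edge from a sample to a concept. The rows and the `follow` helper are invented for this example and are not part of the parquet file or of any library.

```python
# Toy PQG rows as plain dicts (invented values, for illustration only).
rows = [
    {"row_id": 10, "otype": "MaterialSampleRecord", "label": "Bone A"},
    {"row_id": 20, "otype": "IdentifiedConcept", "label": "bone"},
    {"row_id": 30, "otype": "_edge_", "s": 10,
     "p": "has_material_category", "o": [20]},
]

# Lookup from row_id to row, standing in for a join on row_id.
by_id = {r["row_id"]: r for r in rows}


def follow(source_row_id, predicate):
    """Return the rows reached from source_row_id via edges with this predicate."""
    targets = []
    for r in rows:
        if (r["otype"] == "_edge_"
                and r["s"] == source_row_id
                and r["p"] == predicate):
            targets.extend(by_id[t] for t in r["o"])
    return targets


concepts = follow(10, "has_material_category")
print([c["label"] for c in concepts])
```

Each call to `follow` corresponds to one edge join plus one node join in the SQL queries later in the notebook.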

In [None]:
# Explore edge predicates (OpenContext-specific relationships)
# These predicate values are specific to the archaeological domain
edge_predicates = conn.execute("""
    SELECT
        p as predicate,
        COUNT(*) as usage_count,
        COUNT(DISTINCT s) as unique_subjects
    FROM oc_pqg
    WHERE otype = '_edge_'  -- Generic PQG concept: edges
    GROUP BY p
    ORDER BY usage_count DESC
    LIMIT 15
""").fetchdf()

print("Most common relationship types (OpenContext domain predicates):")
print(edge_predicates)

## Practical Query Examples

The following queries demonstrate both:
1. **Generic PQG patterns**: How to traverse graphs using s/p/o relationships
2. **OpenContext specifics**: The actual entity types and predicates for archaeological data

### Query 1: Find Samples with Geographic Coordinates

This query demonstrates:
- **Generic PQG pattern**: Multi-hop graph traversal through edges
- **OpenContext specifics**: Archaeological entity types and relationships
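
Before the SQL version, a minimal pure-Python sketch of the multi-hop idea may help. It walks one predicate per hop over a toy table; the rows, coordinates, and the `traverse` helper are assumptions made up for this illustration.

```python
# Toy PQG rows (invented values): one sample, one event, one location.
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event-1"},
    {"row_id": 3, "otype": "GeospatialCoordLocation",
     "latitude": 46.2, "longitude": 7.4},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 2, "p": "sample_location", "o": [3]},
]
by_id = {r["row_id"]: r for r in rows}


def traverse(start_ids, predicates):
    """Follow one predicate per hop; return the row_ids reached after the last hop."""
    frontier = set(start_ids)
    for predicate in predicates:
        next_frontier = set()
        for r in rows:
            if (r["otype"] == "_edge_"
                    and r["p"] == predicate
                    and r["s"] in frontier):
                next_frontier.update(r["o"])
        frontier = next_frontier
    return frontier


sample_ids = [r["row_id"] for r in rows if r["otype"] == "MaterialSampleRecord"]
location_ids = traverse(sample_ids, ["produced_by", "sample_location"])
locations = [by_id[i] for i in sorted(location_ids)]
print(locations)
```

Swapping the predicate list for `["produced_by", "sampling_site", "site_location"]` would express the longer site-based path with the same traversal code.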

In [None]:
# Find samples with geographic coordinates (CORRECTED - through SamplingEvent)
# Generic PQG pattern: Traverse graph by joining edges (s/p/o relationships)
# OpenContext specifics: MaterialSampleRecord -> produced_by -> SamplingEvent -> sample_location -> GeospatialCoordLocation

# Ensure we have a working connection
try:
    conn.execute("SELECT 1").fetchone()
except Exception:  # connection missing or closed: recreate it and the view
    conn = duckdb.connect()
    conn.execute(f"CREATE VIEW oc_pqg AS SELECT * FROM read_parquet('{parquet_path}');")

samples_with_coords = conn.execute("""
    SELECT
        s.pid as sample_id,
        s.label as sample_label,
        s.description,  -- OpenContext-specific field
        g.latitude,     -- OpenContext-specific field
        g.longitude,    -- OpenContext-specific field
        g.place_name,   -- OpenContext-specific field
        'direct_event_location' as location_type
    FROM oc_pqg s
    -- Generic PQG pattern: Join through edges using s/p/o
    JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'  -- OpenContext predicate
    JOIN oc_pqg event ON e1.o[1] = event.row_id
    JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sample_location'  -- OpenContext predicate
    JOIN oc_pqg g ON e2.o[1] = g.row_id
    -- OpenContext-specific entity type filters
    WHERE s.otype = 'MaterialSampleRecord'
      AND event.otype = 'SamplingEvent'
      AND g.otype = 'GeospatialCoordLocation'
      AND g.latitude IS NOT NULL
    LIMIT 100
""").fetchdf()

print(f"Found {len(samples_with_coords)} samples with direct event coordinates")
samples_with_coords.head()

### Using Ibis for Cleaner Multi-Step Joins

Ibis provides a more Pythonic interface for the same **generic PQG graph traversal patterns**, while making **OpenContext-specific** entity filtering clearer.

In [8]:
# Import Ibis for cleaner data manipulation
import ibis
from ibis import _

# Configure Ibis to use DuckDB backend
ibis.options.interactive = True

# Create Ibis connection using DuckDB
ibis_conn = ibis.duckdb.connect()

# Register the parquet file as a table in Ibis
oc_pqg = ibis_conn.read_parquet(parquet_path, table_name='oc_pqg')

print("Ibis setup complete!")
print(f"Table schema: {oc_pqg.columns}")
print(f"Total records: {oc_pqg.count().execute():,}")

Ibis setup complete!
Table schema: ('row_id', 'pid', 'tcreated', 'tmodified', 'otype', 's', 'p', 'o', 'n', 'altids', 'geometry', 'authorized_by', 'has_feature_of_interest', 'affiliation', 'sampling_purpose', 'complies_with', 'project', 'alternate_identifiers', 'relationship', 'elevation', 'sample_identifier', 'dc_rights', 'result_time', 'contact_information', 'latitude', 'target', 'role', 'scheme_uri', 'is_part_of', 'scheme_name', 'name', 'longitude', 'obfuscated', 'curation_location', 'last_modified_time', 'access_constraints', 'place_name', 'description', 'label', 'thumbnail_url')
Total records: 11,637,144


In [None]:
# Ibis version: Find samples with geographic coordinates through SamplingEvent
# This demonstrates the same generic PQG pattern with cleaner syntax

# Step 1: Define our base tables with OpenContext-specific entity type filters
samples = oc_pqg.filter(_.otype == 'MaterialSampleRecord').alias('samples')  # OpenContext entity
events = oc_pqg.filter(_.otype == 'SamplingEvent').alias('events')          # OpenContext entity
locations = oc_pqg.filter(_.otype == 'GeospatialCoordLocation').alias('locations')  # OpenContext entity
edges = oc_pqg.filter(_.otype == '_edge_').alias('edges')  # Generic PQG concept

# Step 2: Build the chain of joins step by step (Generic PQG graph traversal)
# Sample -> produced_by -> SamplingEvent (OpenContext-specific relationship)
sample_to_event = (
    samples
    .join(
        edges.filter(_.p == 'produced_by'),  # OpenContext predicate
        samples.row_id == edges.s  # Generic PQG: edge source
    )
    .join(
        events,
        edges.o[0] == events.row_id  # Generic PQG: edge target (first element of array)
    )
)

# Step 3: SamplingEvent -> sample_location -> GeospatialCoordLocation (OpenContext relationship)
location_edges = edges.filter(_.p == 'sample_location').alias('location_edges')  # OpenContext predicate
event_to_location = (
    sample_to_event
    .join(
        location_edges,
        events.row_id == location_edges.s  # Generic PQG: edge source
    )
    .join(
        locations.filter(_.latitude.notnull()),  # OpenContext-specific field
        location_edges.o[0] == locations.row_id  # Generic PQG: edge target
    )
)

# Step 4: Select OpenContext-specific fields and limit results
samples_with_coords_ibis = (
    event_to_location
    .select(
        sample_id=samples.pid,
        sample_label=samples.label,       # OpenContext field
        description=samples.description,   # OpenContext field
        latitude=locations.latitude,       # OpenContext field
        longitude=locations.longitude,     # OpenContext field
        place_name=locations.place_name,   # OpenContext field
        location_type=ibis.literal('direct_event_location')
    )
    .limit(100)
)

# Execute and display results
result_ibis = samples_with_coords_ibis.execute()
print(f"Found {len(result_ibis)} samples with direct event coordinates (Ibis version)")
result_ibis.head()

In [10]:
# Ibis version: Find samples via site location path
# This shows how Ibis makes the longer join chain more readable

# Define additional table filters we need
sites = oc_pqg.filter(_.otype == 'SamplingSite').alias('sites')

# Build the join chain: Sample -> Event -> Site -> Location
# Define edge tables separately to avoid alias reference issues
event_edges = edges.filter(_.p == 'produced_by').alias('event_edges')
site_edges = edges.filter(_.p == 'sampling_site').alias('site_edges')
location_edges = edges.filter(_.p == 'site_location').alias('location_edges')

samples_via_sites_ibis = (
    samples
    # Sample -> produced_by -> Event
    .join(
        event_edges, 
        samples.row_id == event_edges.s
    )
    .join(
        events,
        event_edges.o[0] == events.row_id
    )
    # Event -> sampling_site -> Site
    .join(
        site_edges,
        events.row_id == site_edges.s
    )
    .join(
        sites,
        site_edges.o[0] == sites.row_id
    )
    # Site -> site_location -> Location
    .join(
        location_edges,
        sites.row_id == location_edges.s
    )
    .join(
        locations.filter(_.latitude.notnull()),
        location_edges.o[0] == locations.row_id
    )
    # Select final columns
    .select(
        sample_id=samples.pid,
        sample_label=samples.label,
        site_name=sites.label,
        latitude=locations.latitude,
        longitude=locations.longitude,
        location_type=ibis.literal('via_site_location')
    )
    .limit(100)
)

result_via_sites_ibis = samples_via_sites_ibis.execute()
print(f"Found {len(result_via_sites_ibis)} samples with site-based coordinates (Ibis version)")
result_via_sites_ibis.head()

Found 100 samples with site-based coordinates (Ibis version)


|   | sample_id | sample_label | site_name | latitude | longitude | location_type |
|---|---|---|---|---|---|---|
| 0 | ark:/28722/k26w9pb6h | Bone 6273 | Sion-Avenue Ritz | 46.231666 | 7.370449 | via_site_location |
| 1 | ark:/28722/r2p3k14c/t_233 | T-233 | Finnmark | 70.466695 | 25.140892 | via_site_location |
| 2 | ark:/28722/r2p3k14c/nsrl_2664 | NSRL-2664 | 16OU175 | 32.324245 | -92.197266 | via_site_location |
| 3 | ark:/28722/r2p3k14c/har_6907 | HAR-6907 | East Yorkshire | 54.12978 | -0.496022 | via_site_location |
| 4 | ark:/28722/r2p3k14c/gu_5461 | GU-5461 | Wharram Percy | 54.0675 | -0.689722 | via_site_location |


In [11]:
# Ibis version: get_sample_locations_for_viz function
# This shows how Ibis handles CTEs and UNION operations elegantly

def get_sample_locations_for_viz_ibis(limit=10000):
    """Extract sample locations optimized for visualization using Ibis"""
    
    # Define edge tables to avoid alias reference issues
    event_edges = edges.filter(_.p == 'produced_by').alias('event_edges')
    sample_location_edges = edges.filter(_.p == 'sample_location').alias('sample_location_edges')
    site_edges = edges.filter(_.p == 'sampling_site').alias('site_edges')
    site_location_edges = edges.filter(_.p == 'site_location').alias('site_location_edges')
    
    # Define the direct locations path: Sample -> Event -> sample_location -> Location
    direct_locations = (
        samples
        .join(
            event_edges, 
            samples.row_id == event_edges.s
        )
        .join(
            events,
            event_edges.o[0] == events.row_id
        )
        .join(
            sample_location_edges,
            events.row_id == sample_location_edges.s
        )
        .join(
            locations.filter(
                (_.latitude.notnull()) & 
                (_.longitude.notnull()) & 
                (~_.obfuscated)  # Exclude obfuscated locations
            ),
            sample_location_edges.o[0] == locations.row_id
        )
        .select(
            sample_id=samples.pid,
            label=samples.label,
            latitude=locations.latitude,
            longitude=locations.longitude,
            obfuscated=locations.obfuscated,
            location_type=ibis.literal('direct')
        )
    )
    
    # Define the site locations path: Sample -> Event -> Site -> site_location -> Location  
    site_locations = (
        samples
        .join(
            event_edges, 
            samples.row_id == event_edges.s
        )
        .join(
            events,
            event_edges.o[0] == events.row_id
        )
        .join(
            site_edges,
            events.row_id == site_edges.s
        )
        .join(
            sites,
            site_edges.o[0] == sites.row_id
        )
        .join(
            site_location_edges,
            sites.row_id == site_location_edges.s
        )
        .join(
            locations.filter(
                (_.latitude.notnull()) & 
                (_.longitude.notnull()) & 
                (~_.obfuscated)  # Exclude obfuscated locations
            ),
            site_location_edges.o[0] == locations.row_id
        )
        .select(
            sample_id=samples.pid,
            label=samples.label,
            latitude=locations.latitude,
            longitude=locations.longitude,
            obfuscated=locations.obfuscated,
            location_type=ibis.literal('via_site')
        )
    )
    
    # Union the two location types and apply limit
    combined_locations = (
        direct_locations
        .union(site_locations)
        .limit(limit)
    )
    
    return combined_locations.execute()

# Get visualization-ready data using Ibis
viz_data_ibis = get_sample_locations_for_viz_ibis(5000)
print(f"Prepared {len(viz_data_ibis)} samples for visualization (Ibis version)")
if len(viz_data_ibis) > 0:
    print(f"Coordinate bounds: Lat [{viz_data_ibis.latitude.min():.2f}, {viz_data_ibis.latitude.max():.2f}], "
          f"Lon [{viz_data_ibis.longitude.min():.2f}, {viz_data_ibis.longitude.max():.2f}]")
    print(f"Location types: {viz_data_ibis.location_type.value_counts().to_dict()}")
else:
    print("No samples found with valid coordinates")

viz_data_ibis.head()

Prepared 5000 samples for visualization (Ibis version)
Coordinate bounds: Lat [-49.20, 71.04], Lon [-159.78, 153.17]
Location types: {'direct': 5000}


|   | sample_id | label | latitude | longitude | obfuscated | location_type |
|---|---|---|---|---|---|---|
| 0 | ark:/28722/k2cc12g7p | 17176A (3) | 30.3287 | 35.4421 | False | direct |
| 1 | ark:/28722/k28p6327s | 83038 (77) | 30.3287 | 35.4421 | False | direct |
| 2 | ark:/28722/k2xw4nt8z | S1267-A10 | 40.566317 | 35.282996 | False | direct |
| 3 | ark:/28722/k2154p229 | 98244 (31) | 30.3287 | 35.4421 | False | direct |
| 4 | ark:/28722/k2jq16t0f | S1285-A01 | 40.565613 | 35.285816 | False | direct |


### Comparison: Raw SQL vs Ibis

Both approaches implement the same **generic PQG graph traversal patterns**. The Ibis versions offer several advantages:

#### **Readability Benefits:**
1. **Clear separation**: Generic PQG operations (joins on s/p/o) vs OpenContext filters (entity types)
2. **Meaningful aliases**: `samples`, `events`, `locations` make the domain model clear
3. **Method chaining**: Natural Python syntax that reads left-to-right
4. **Type safety**: Ibis can catch column reference errors at definition time

#### **Maintainability Benefits:**
1. **Modular queries**: Easy to swap OpenContext predicates without changing graph traversal logic
2. **Reusable components**: Base table filters separate framework from domain
3. **IDE support**: Auto-completion works for both PQG fields and domain fields
4. **Debugging**: Can inspect intermediate results by executing partial chains

#### **Performance Considerations:**
- Ibis compiles each expression to SQL that DuckDB executes, so both approaches go through DuckDB's query optimizer
- The graph traversal pattern (joining through edges) is the same
- Performance is determined by the underlying PQG structure, not the query interface
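
A single timed run can be noisy, since the first execution may include file reads that later runs avoid. The sketch below shows one way to compare two query callables more fairly: repeat each, keep the fastest time, and check that the results agree. The `best_of` helper and the stand-in workloads are hypothetical; in this notebook the callables would wrap the raw SQL query and the equivalent Ibis expression.

```python
import time


def best_of(fn, repeats=5):
    """Run fn several times; return (last result, fastest wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workloads that compute the same value in two different ways.
def query_a():
    return sum(range(100_000))


def query_b():
    return sum(i for i in range(100_000))


result_a, time_a = best_of(query_a)
result_b, time_b = best_of(query_b)

print(f"Results match: {result_a == result_b}")
print(f"A: {time_a:.4f}s, B: {time_b:.4f}s")
```

Checking that the results match before comparing times guards against benchmarking two queries that do not compute the same thing.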

In [12]:
# Quick performance and correctness comparison
import time

print("=== PERFORMANCE COMPARISON ===")

# Time the original DuckDB query
# Create a fresh connection for performance testing
perf_conn = duckdb.connect()
perf_conn.execute(f"CREATE VIEW oc_pqg AS SELECT * FROM read_parquet('{parquet_path}');")

start_time = time.time()
sql_result = perf_conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT s.pid as sample_id
        FROM oc_pqg s
        JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
        JOIN oc_pqg g ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND event.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.latitude IS NOT NULL
    )
""").fetchone()[0]
sql_time = time.time() - start_time

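# Time the Ibis query
# NOTE: samples_with_coords_ibis ends in .limit(100), so this count is capped at 100
# and will not equal the unlimited SQL count above. For a like-for-like comparison,
# count the unlimited join instead: event_to_location.count().execute()
start_time = time.time()
ibis_count = samples_with_coords_ibis.count().execute()
ibis_time = time.time() - start_time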

print(f"Raw SQL result count: {sql_result}")
print(f"Raw SQL execution time: {sql_time:.3f} seconds")
print(f"Ibis result count: {ibis_count}")
print(f"Ibis execution time: {ibis_time:.3f} seconds")
print(f"Results match: {sql_result == ibis_count}")
print(f"Performance ratio: {ibis_time/sql_time:.2f}x")

perf_conn.close()

print("\n=== KEY TAKEAWAYS ===")
print("✓ Ibis provides much more readable code for complex joins")
print("✓ Performance is comparable (compiles to same SQL)")
print("✓ Better for maintenance and debugging")
print("✓ More Pythonic and integrates well with data science workflows")
print("✓ Type safety and IDE support make development faster")

=== PERFORMANCE COMPARISON ===
Raw SQL result count: 1096274
Raw SQL execution time: 0.084 seconds
Ibis result count: 100
Ibis execution time: 0.106 seconds
Results match: False
Performance ratio: 1.26x

=== KEY TAKEAWAYS ===
✓ Ibis provides much more readable code for complex joins
✓ Performance is comparable (compiles to same SQL)
✓ Better for maintenance and debugging
✓ More Pythonic and integrates well with data science workflows
✓ Type safety and IDE support make development faster


## Summary

**✅ Fixed Issues:**
- Resolved `AttributeError: 'Table' object has no attribute 'location_edges'` by properly defining aliased edge tables separately
- Fixed duplicate CTE names in the visualization function by using unique aliases
- All Ibis queries now execute successfully

**Key Improvements with Ibis:**
1. **Much cleaner syntax** for multi-step joins - no more cryptic SQL aliases
2. **Step-by-step query building** makes complex logic easier to understand
3. **Reusable components** - define edge tables once, use multiple times
4. **Better debugging** - can inspect intermediate results easily
5. **IDE support** - auto-completion and type checking work better

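**Performance:** Ibis compiles to SQL that DuckDB executes, so performance should be close to hand-written queries. The comparison cell above is not a like-for-like benchmark: its Ibis count runs on the 100-row-limited expression, which is why the two counts differ.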

In [13]:
# Helper function to ensure we have a working DuckDB connection
def ensure_connection():
    """Ensure we have a working DuckDB connection with the parquet view"""
    global conn
    try:
        # Test if connection is still alive
        conn.execute("SELECT 1").fetchone()
    except Exception:  # also covers NameError when conn was never defined
        # Connection doesn't exist or is closed, recreate it
        print("Recreating DuckDB connection...")
        conn = duckdb.connect()
        conn.execute(f"CREATE VIEW oc_pqg AS SELECT * FROM read_parquet('{parquet_path}');")
        print("Connection restored!")
    return conn

# Test the connection
ensure_connection()
print("DuckDB connection is ready!")

DuckDB connection is ready!


In [14]:
# Let's also get samples via the site location path for comparison
# Ensure we have a working connection
ensure_connection()

samples_via_sites = conn.execute("""
    SELECT
        s.pid as sample_id,
        s.label as sample_label,
        site.label as site_name,
        g.latitude,
        g.longitude,
        'via_site_location' as location_type
    FROM oc_pqg s
    JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
    JOIN oc_pqg event ON e1.o[1] = event.row_id
    JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
    JOIN oc_pqg site ON e2.o[1] = site.row_id
    JOIN oc_pqg e3 ON site.row_id = e3.s AND e3.p = 'site_location'
    JOIN oc_pqg g ON e3.o[1] = g.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND event.otype = 'SamplingEvent'
      AND site.otype = 'SamplingSite'
      AND g.otype = 'GeospatialCoordLocation'
      AND g.latitude IS NOT NULL
    LIMIT 100
""").fetchdf()

print(f"Found {len(samples_via_sites)} samples with site-based coordinates")
samples_via_sites.head()

Found 100 samples with site-based coordinates


|   | sample_id | sample_label | site_name | latitude | longitude | location_type |
|---|---|---|---|---|---|---|
| 0 | ark:/28722/k2m334c3d | Bone 6276 | Sion-Avenue Ritz | 46.231666 | 7.370449 | via_site_location |
| 1 | ark:/28722/r2p3k14c/wk_17739 | WK-17739 | Finnmark | 70.466695 | 25.140892 | via_site_location |
| 2 | ark:/28722/r2p3k14c/beta_72670 | BETA-72670 | 16OU175 | 32.324245 | -92.197266 | via_site_location |
| 3 | ark:/28722/r2p3k14c/oxa_13365 | OXA-13365 | East Yorkshire | 54.12978 | -0.496022 | via_site_location |
| 4 | ark:/28722/r2p3k14c/har_4950 | HAR-4950 | Wharram Percy | 54.0675 | -0.689722 | via_site_location |


### Query 2: Trace Samples Through Events to Sites

This demonstrates a more complex **generic PQG traversal pattern** with **OpenContext-specific** archaeological hierarchies.
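
The shape of this query (two hops, then a group-by) can be sketched in pure Python on a toy table. All rows and labels below are invented, and the sketch assumes at most one edge per source and predicate, following only the first target as the SQL does with `o[1]`.

```python
from collections import Counter

# Toy PQG rows (invented values): three samples, two events, two sites.
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord"},
    {"row_id": 2, "otype": "MaterialSampleRecord"},
    {"row_id": 3, "otype": "MaterialSampleRecord"},
    {"row_id": 4, "otype": "SamplingEvent"},
    {"row_id": 5, "otype": "SamplingEvent"},
    {"row_id": 6, "otype": "SamplingSite", "label": "Site A"},
    {"row_id": 7, "otype": "SamplingSite", "label": "Site B"},
    {"row_id": 8, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [4]},
    {"row_id": 9, "otype": "_edge_", "s": 2, "p": "produced_by", "o": [4]},
    {"row_id": 10, "otype": "_edge_", "s": 3, "p": "produced_by", "o": [5]},
    {"row_id": 11, "otype": "_edge_", "s": 4, "p": "sampling_site", "o": [6]},
    {"row_id": 12, "otype": "_edge_", "s": 5, "p": "sampling_site", "o": [7]},
]
by_id = {r["row_id"]: r for r in rows}

# (source row_id, predicate) -> first target row_id.
# Assumption: at most one edge per key; a later edge would overwrite an earlier one.
first_target = {
    (r["s"], r["p"]): r["o"][0] for r in rows if r["otype"] == "_edge_"
}

site_counts = Counter()
for r in rows:
    if r["otype"] != "MaterialSampleRecord":
        continue
    event_id = first_target.get((r["row_id"], "produced_by"))
    site_id = first_target.get((event_id, "sampling_site"))
    if site_id is not None:
        site_counts[by_id[site_id]["label"]] += 1

print(site_counts.most_common())
```

Samples without an event or site simply drop out, which mirrors the inner joins used in the SQL.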

In [15]:
# Trace samples through events to sites
sample_site_hierarchy = conn.execute("""
    WITH sample_to_site AS (
        SELECT
            samp.pid as sample_id,
            samp.label as sample_label,
            event.pid as event_id,
            site.pid as site_id,
            site.label as site_name
        FROM oc_pqg samp
        JOIN oc_pqg e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id AND event.otype = 'SamplingEvent'
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN oc_pqg site ON e2.o[1] = site.row_id AND site.otype = 'SamplingSite'
        WHERE samp.otype = 'MaterialSampleRecord'
    )
    SELECT
        site_name,
        COUNT(*) as sample_count
    FROM sample_to_site
    GROUP BY site_name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Top archaeological sites by sample count:")
print(sample_site_hierarchy)

Top archaeological sites by sample count:
                    site_name  sample_count
0                  Çatalhöyük        145900
1          Petra Great Temple        108846
2           Polis Chrysochous         52252
3                  Kenan Tepe         42295
4                    Ilıpınar         36951
5             Poggio Civitate         29985
6                    Čḯxwicən         29793
7              Heit el-Ghurab         28940
8                   Domuztepe         22394
9                       Emden         20238
10  Forcello Bagnolo San Vito         18573
11                Chogha Mish         16827
12                       Pi-1         16351
13           PKAP Survey Area         15446
14                     Malyan         15146
15                     Ulucak         10685
16                    OGSE-80         10477
17               Erbaba Höyük          8428
18                      Hazor          8356
19                 Köşk Höyük          7884


### Query 3: Explore Material Types and Categories

This query shows how **OpenContext domain concepts** (material classifications) are modeled using the **generic PQG framework**.
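
One caveat worth illustrating: the query below follows only the first element of `o`. Assuming an edge can list several targets (the schema types `o` as an integer array), counting every element gives a different tally. The toy rows here are invented to show the difference; in DuckDB SQL the analogous step would be to expand the array with `UNNEST(e.o)`.

```python
from collections import Counter

# Toy PQG rows (invented values): sample 1 has two material categories.
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord"},
    {"row_id": 2, "otype": "MaterialSampleRecord"},
    {"row_id": 3, "otype": "IdentifiedConcept", "label": "bone"},
    {"row_id": 4, "otype": "IdentifiedConcept", "label": "organic material"},
    {"row_id": 5, "otype": "_edge_", "s": 1,
     "p": "has_material_category", "o": [3, 4]},
    {"row_id": 6, "otype": "_edge_", "s": 2,
     "p": "has_material_category", "o": [3]},
]
by_id = {r["row_id"]: r for r in rows}

category_edges = [
    r for r in rows
    if r["otype"] == "_edge_" and r["p"] == "has_material_category"
]

# Only the first target per edge (what o[1] does in the SQL).
first_only = Counter(by_id[e["o"][0]]["label"] for e in category_edges)

# Every target per edge (what expanding the array would do).
all_targets = Counter(
    by_id[t]["label"] for e in category_edges for t in e["o"]
)

print("first target only:", dict(first_only))
print("all targets:      ", dict(all_targets))
```

Whether this matters for the real file depends on how often `o` holds more than one element, which is worth checking before relying on the counts.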

In [None]:
# Explore material types and categories
# Generic PQG pattern: Follow edges from nodes
# OpenContext specifics: MaterialSampleRecord -> has_material_category -> IdentifiedConcept
material_analysis = conn.execute("""
    SELECT
        c.label as material_type,    -- OpenContext-specific field
        c.name as category_name,      -- OpenContext-specific field
        COUNT(DISTINCT s.row_id) as sample_count
    FROM oc_pqg s
    -- Generic PQG: Join through edges
    JOIN oc_pqg e ON s.row_id = e.s
    JOIN oc_pqg c ON e.o[1] = c.row_id
    -- OpenContext-specific filters
    WHERE s.otype = 'MaterialSampleRecord'      -- OpenContext entity type
      AND e.otype = '_edge_'                    -- Generic PQG edge marker
      AND e.p = 'has_material_category'         -- OpenContext predicate
      AND c.otype = 'IdentifiedConcept'         -- OpenContext entity type
    GROUP BY c.label, c.name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Most common material types (OpenContext archaeological categories):")
print(material_analysis)

## Query Performance Tips

These tips apply to both **generic PQG patterns** and **OpenContext-specific** queries:

### Generic PQG Optimization:
1. **Filter edges first**: Use `otype = '_edge_'` early in WHERE clauses
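2. **Use array indexing carefully**: DuckDB lists are 1-based, so `o[1]` is the first target in SQL; the Ibis examples use the 0-based `o[0]` for the same element
3. **Join on `row_id`**: `s` and the elements of `o` hold integer `row_id` values, so joins compare integers; Parquet files carry no database-style indexes, so the benefit comes from cheap comparisons rather than index lookups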
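
The idea behind filtering edges first can be sketched in pure Python: scan the table once, keep only the edge rows, and key them by source and predicate so later hops become lookups. This is an analogue for intuition only, not a description of how DuckDB executes the joins; the rows and the `build_edge_index` helper are invented for the example.

```python
from collections import defaultdict

# Toy PQG rows (invented values).
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord"},
    {"row_id": 2, "otype": "SamplingEvent"},
    {"row_id": 3, "otype": "IdentifiedConcept"},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 1,
     "p": "has_material_category", "o": [3]},
]


def build_edge_index(rows):
    """Scan once, keep edge rows only, keyed by (source row_id, predicate)."""
    index = defaultdict(list)
    for r in rows:
        if r["otype"] == "_edge_":
            index[(r["s"], r["p"])].extend(r["o"])
    return index


edge_index = build_edge_index(rows)

# Each hop is now a dictionary lookup instead of a scan over all rows.
print(edge_index[(1, "produced_by")])
print(edge_index.get((2, "produced_by"), []))
```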

### OpenContext-Specific Optimization:
1. **Filter by entity type early**: e.g., `otype = 'MaterialSampleRecord'`
2. **Use domain predicates**: Filter edges by specific predicates like `produced_by`
3. **Limit geographic queries**: Add bounds when querying latitude/longitude
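
For the geographic tip, a small helper can build a parameterized bounding-box condition. This is a sketch under assumptions: `bbox_clause` is a hypothetical name, the default alias `g` matches the location alias used in this notebook's queries, and boxes that cross the antimeridian are not handled.

```python
def bbox_clause(min_lat, max_lat, min_lon, max_lon, alias="g"):
    """Build a parameterized WHERE fragment restricting rows to a bounding box."""
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("min bound exceeds max bound")
    sql = (
        f"{alias}.latitude BETWEEN ? AND ? "
        f"AND {alias}.longitude BETWEEN ? AND ?"
    )
    return sql, [min_lat, max_lat, min_lon, max_lon]


# Example box (arbitrary values chosen for illustration).
clause, params = bbox_clause(30.0, 31.0, 35.0, 36.0)
print(clause)
print(params)
```

The fragment could be appended to a query's `WHERE` conditions and the parameter list passed to `conn.execute`, in the same way `export_site_subgraph` binds its `LIKE ?` pattern.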

### Memory Management for Large Graphs:
- Simple node counts: Fast (<1 second)
- Single-hop edge traversal: Moderate (1-5 seconds)
- Multi-hop graph traversal: Can be slow (5-30 seconds)
- Full graph scans: Avoid without filters

## Visualization Preparation

In [17]:
def get_sample_locations_for_viz(conn, limit=10000):
    """Extract sample locations optimized for visualization (CORRECTED)"""
    
    return conn.execute(f"""
        WITH direct_locations AS (
            -- Direct path: Sample -> Event -> sample_location -> Location
            SELECT
                s.pid as sample_id,
                s.label as label,
                g.latitude,
                g.longitude,
                g.obfuscated,
                'direct' as location_type
            FROM oc_pqg s
            JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN oc_pqg event ON e1.o[1] = event.row_id
            JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
            JOIN oc_pqg g ON e2.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
              AND g.longitude IS NOT NULL
        ),
        site_locations AS (
            -- Indirect path: Sample -> Event -> Site -> site_location -> Location
            SELECT
                s.pid as sample_id,
                s.label as label,
                g.latitude,
                g.longitude,
                g.obfuscated,
                'via_site' as location_type
            FROM oc_pqg s
            JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN oc_pqg event ON e1.o[1] = event.row_id
            JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
            JOIN oc_pqg site ON e2.o[1] = site.row_id
            JOIN oc_pqg e3 ON site.row_id = e3.s AND e3.p = 'site_location'
            JOIN oc_pqg g ON e3.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND site.otype = 'SamplingSite'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
              AND g.longitude IS NOT NULL
        )
        SELECT
            sample_id,
            label,
            latitude,
            longitude,
            obfuscated,
            location_type
        FROM (
            SELECT * FROM direct_locations
            UNION ALL
            SELECT * FROM site_locations
        )
        WHERE NOT obfuscated  -- Exclude obfuscated locations for public viz
        LIMIT {limit}
    """).fetchdf()

# Get visualization-ready data
viz_data = get_sample_locations_for_viz(conn, 5000)
print(f"Prepared {len(viz_data)} samples for visualization")
if len(viz_data) > 0:
    print(f"Coordinate bounds: Lat [{viz_data.latitude.min():.2f}, {viz_data.latitude.max():.2f}], "
          f"Lon [{viz_data.longitude.min():.2f}, {viz_data.longitude.max():.2f}]")
    print(f"Location types: {viz_data.location_type.value_counts().to_dict()}")
else:
    print("No samples found with valid coordinates")

Prepared 5000 samples for visualization
Coordinate bounds: Lat [-52.59, 71.04], Lon [-159.78, 153.17]
Location types: {'direct': 5000}


## Data Export Options

In [18]:
def export_site_subgraph(conn, site_name_pattern, output_prefix):
    """Export all data related to a specific site"""
    
    # Find the site
    site_info = conn.execute("""
        SELECT row_id, pid, label
        FROM oc_pqg
        WHERE otype = 'SamplingSite'
        AND label LIKE ?
        LIMIT 1
    """, [f'%{site_name_pattern}%']).fetchdf()
    
    if site_info.empty:
        print(f"No site found matching '{site_name_pattern}'")
        return None
    
    # Convert numpy int to a Python int to avoid DuckDB parameter type issues
    site_row_id = int(site_info.iloc[0]['row_id'])
    print(f"Found site: {site_info.iloc[0]['label']}")
    
    # Get all related entities (simplified version - not recursive)
    related_data = conn.execute("""
        WITH site_related AS (
            -- Get the site itself
            SELECT * FROM oc_pqg WHERE row_id = ?
            
            UNION ALL
            
            -- Get edges from the site
            SELECT * FROM oc_pqg e
            WHERE e.otype = '_edge_' AND e.s = ?
            
            UNION ALL
            
            -- Get entities connected to the site
            SELECT n.* FROM oc_pqg e
            JOIN oc_pqg n ON n.row_id = e.o[1]
            WHERE e.otype = '_edge_' AND e.s = ?
        )
        SELECT * FROM site_related
    """, [site_row_id, site_row_id, site_row_id]).fetchdf()
    
    # Save to parquet
    output_file = f"{output_prefix}_{site_info.iloc[0]['pid']}.parquet"
    related_data.to_parquet(output_file)
    print(f"Exported {len(related_data)} rows to {output_file}")
    
    return related_data

# Example usage (commented out to avoid creating files)
# pompeii_data = export_site_subgraph(conn, "Pompeii", "pompeii_subgraph")

## Data Quality Analysis

In [19]:
# Check for location data quality
location_quality = conn.execute("""
    SELECT
        CASE 
            WHEN obfuscated THEN 'Obfuscated'
            ELSE 'Precise'
        END as location_type,
        COUNT(*) as count,
        AVG(CASE WHEN latitude IS NOT NULL THEN 1.0 ELSE 0.0 END) * 100 as pct_with_coords
    FROM oc_pqg
    WHERE otype = 'GeospatialCoordLocation'
    GROUP BY location_type
""").fetchdf()

print("Location Data Quality:")
print(location_quality)

Location Data Quality:
  location_type   count  pct_with_coords
0       Precise  196507        99.999491
1    Obfuscated    1926       100.000000


In [20]:
# Check for orphaned nodes (nodes not connected by any edge)
orphan_check = conn.execute("""
    WITH connected_nodes AS (
        SELECT DISTINCT s as row_id FROM oc_pqg WHERE otype = '_edge_'
        UNION
        SELECT DISTINCT unnest(o) as row_id FROM oc_pqg WHERE otype = '_edge_'
    )
    SELECT
        n.otype,
        COUNT(*) as orphan_count
    FROM oc_pqg n
    LEFT JOIN connected_nodes c ON n.row_id = c.row_id
    WHERE n.otype != '_edge_' AND c.row_id IS NULL
    GROUP BY n.otype
""").fetchdf()

print("\nOrphaned Nodes by Type:")
print(orphan_check if not orphan_check.empty else "No orphaned nodes found!")


Orphaned Nodes by Type:
               otype  orphan_count
0  IdentifiedConcept         16961
1              Agent             1


## Summary Statistics

In [21]:
# Generate comprehensive summary
summary = conn.execute("""
    WITH stats AS (
        SELECT
            COUNT(*) as total_rows,
            COUNT(DISTINCT pid) as unique_pids,
            COUNT(CASE WHEN otype = '_edge_' THEN 1 END) as edge_count,
            COUNT(CASE WHEN otype != '_edge_' THEN 1 END) as node_count,
            COUNT(DISTINCT CASE WHEN otype != '_edge_' THEN otype END) as entity_types,
            COUNT(DISTINCT p) as relationship_types
        FROM oc_pqg
    )
    SELECT * FROM stats
""").fetchdf()

print("Dataset Summary:")
for col in summary.columns:
    print(f"{col}: {summary[col].iloc[0]:,}")

Dataset Summary:
total_rows: 11,637,144
unique_pids: 11,637,144
edge_count: 9,201,451
node_count: 2,435,693
entity_types: 6
relationship_types: 10


## Debug: Specific Geo Point Analysis

Testing queries for parquet_cesium.qmd debugging. This section demonstrates:
- **Generic PQG debugging**: How to trace edge connections
- **OpenContext validation**: Verifying archaeological data relationships
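
The generic edge-tracing step can be illustrated without the parquet file. The sketch below uses a hypothetical in-memory list of rows (`TOY_ROWS`) that mimics the PQG layout; the helper names `incoming_edges` and `outgoing_edges` are assumptions made for illustration and are not part of the notebook or of any PQG library.

```python
# Hypothetical toy rows mirroring the PQG layout: edges have otype '_edge_',
# a subject row_id in s, a predicate in p, and a list of target row_ids in o.
TOY_ROWS = [
    {"row_id": 1, "otype": "SamplingSite", "s": None, "p": None, "o": None},
    {"row_id": 2, "otype": "GeospatialCoordLocation", "s": None, "p": None, "o": None},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "site_location", "o": [2]},
]


def incoming_edges(rows, target_row_id):
    """Return (subject_row_id, predicate) for each edge whose o list contains the target."""
    return [
        (r["s"], r["p"])
        for r in rows
        if r["otype"] == "_edge_" and target_row_id in r["o"]
    ]


def outgoing_edges(rows, source_row_id):
    """Return (predicate, target_row_ids) for each edge whose subject is the source."""
    return [
        (r["p"], list(r["o"]))
        for r in rows
        if r["otype"] == "_edge_" and r["s"] == source_row_id
    ]


print(incoming_edges(TOY_ROWS, 2))  # edges pointing at the geo location
print(outgoing_edges(TOY_ROWS, 1))  # edges leaving the site
```

The SQL cells in this section perform the same lookups against `oc_pqg`, using `? = ANY(o)` for incoming edges and `e.s = ?` for outgoing ones.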

In [23]:
# Debug specific geo location from parquet_cesium.qmd
target_geo_pid = "geoloc_7ea562cce4c70e4b37f7915e8384880c86607729"

print(f"=== Debugging geo location: {target_geo_pid} ===\n")

# 1. First, let's find the geo location record
geo_record = conn.execute("""
    SELECT row_id, pid, otype, latitude, longitude 
    FROM oc_pqg 
    WHERE pid = ? AND otype = 'GeospatialCoordLocation'
""", [target_geo_pid]).fetchdf()

print("1. Geo Location Record:")
if not geo_record.empty:
    print(geo_record.to_dict('records')[0])
    geo_row_id = geo_record.iloc[0]['row_id']
    print(f"   Row ID: {geo_row_id}")
else:
    print("   ❌ Geo location not found!")
    geo_row_id = None

=== Debugging geo location: geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===

1. Geo Location Record:
{'row_id': 191480, 'pid': 'geoloc_7ea562cce4c70e4b37f7915e8384880c86607729', 'otype': 'GeospatialCoordLocation', 'latitude': 28.058084, 'longitude': -81.146851}
   Row ID: 191480


In [25]:
# 2. Check what edges point to this geo location (what uses it)
if geo_row_id is not None:
    # Convert numpy int to python int to avoid DuckDB type issues
    geo_row_id_int = int(geo_row_id)
    
    edges_to_geo = conn.execute("""
        SELECT s, p, otype as edge_type, pid as edge_pid
        FROM oc_pqg 
        WHERE otype = '_edge_' AND ? = ANY(o)
    """, [geo_row_id_int]).fetchdf()
    
    print(f"\n2. Edges pointing to this geo location ({len(edges_to_geo)} found):")
    if not edges_to_geo.empty:
        edge_summary = edges_to_geo.groupby('p').size().reset_index()
        edge_summary.columns = ['predicate', 'count']
        print(edge_summary)
        print("\nDetailed edges:")
        for _, edge in edges_to_geo.iterrows():
            print(f"   {edge['p']}: row_id {edge['s']} -> geo location")
    else:
        print("   ❌ No edges point to this geo location!")
else:
    print("\n2. Skipping edge analysis - geo location not found")


2. Edges pointing to this geo location (1 found):
       predicate  count
0  site_location      1

Detailed edges:
   site_location: row_id 209521 -> geo location


In [26]:
# 3. Query for direct event samples (Path 1 from parquet_cesium.qmd)
# Sample -> produced_by -> SamplingEvent -> sample_location -> GeospatialCoordLocation
if geo_row_id is not None:
    direct_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            s.name as sample_name,
            event.pid as event_id,
            event.label as event_label,
            'direct_event_location' as location_path
        FROM oc_pqg s
        JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
        JOIN oc_pqg g ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND event.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 20
    """, [target_geo_pid]).fetchdf()
    
    print(f"\n3. Direct Event Samples ({len(direct_samples)} found):")
    if not direct_samples.empty:
        print(direct_samples[['sample_id', 'sample_label', 'event_id', 'event_label']].head())
    else:
        print("   ❌ No direct event samples found!")
else:
    print("\n3. Skipping direct samples query - geo location not found")


3. Direct Event Samples (0 found):
   ❌ No direct event samples found!


In [27]:
# 4. Query for site-associated samples (Path 2 from parquet_cesium.qmd)
# Sample -> produced_by -> SamplingEvent -> sampling_site -> SamplingSite -> site_location -> GeospatialCoordLocation
if geo_row_id is not None:
    site_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            s.name as sample_name,
            event.pid as event_id,
            event.label as event_label,
            site.label as site_name,
            'via_site_location' as location_path
        FROM oc_pqg s
        JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN oc_pqg site ON e2.o[1] = site.row_id
        JOIN oc_pqg e3 ON site.row_id = e3.s AND e3.p = 'site_location'
        JOIN oc_pqg g ON e3.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND event.otype = 'SamplingEvent'
          AND site.otype = 'SamplingSite'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 20
    """, [target_geo_pid]).fetchdf()
    
    print(f"\n4. Site-Associated Samples ({len(site_samples)} found):")
    if not site_samples.empty:
        print(site_samples[['sample_id', 'sample_label', 'site_name', 'event_id']].head())
    else:
        print("   ❌ No site-associated samples found!")
else:
    print("\n4. Skipping site samples query - geo location not found")


4. Site-Associated Samples (1 found):
              sample_id    sample_label       site_name  \
0  ark:/28722/k2x63t42w  Assemblage 364  Osceola County   

                                            event_id  
0  sampevent_b19416f025a0b804563976f00aa78a8524c2...  


In [28]:
# 5. If we found samples, get detailed metadata for the first sample
all_samples = []
if 'direct_samples' in locals() and not direct_samples.empty:
    all_samples.extend(direct_samples.to_dict('records'))
if 'site_samples' in locals() and not site_samples.empty:
    all_samples.extend(site_samples.to_dict('records'))

if all_samples:
    first_sample = all_samples[0]
    sample_pid = first_sample['sample_id']
    
    print(f"\n5. Detailed metadata for sample: {sample_pid}")
    print(f"   Sample label: {first_sample.get('sample_label', 'N/A')}")
    print(f"   Location path: {first_sample.get('location_path', 'N/A')}")
    
    # Get material categories for this sample
    materials = conn.execute("""
        SELECT DISTINCT
            mat.pid as material_id,
            mat.label as material_type,
            mat.name as material_category
        FROM oc_pqg s
        JOIN oc_pqg e ON s.row_id = e.s AND e.p = 'has_material_category'
        JOIN oc_pqg mat ON e.o[1] = mat.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND s.pid = ?
          AND e.otype = '_edge_'
          AND mat.otype = 'IdentifiedConcept'
    """, [sample_pid]).fetchdf()
    
    print(f"\n   Materials ({len(materials)} found):")
    if not materials.empty:
        for _, mat in materials.iterrows():
            print(f"     - {mat['material_type']} ({mat['material_id']})")
    else:
        print("     ❌ No materials found!")
        
    # Get agents responsible for this sample
    agents = conn.execute("""
        SELECT DISTINCT
            agent.pid as agent_id,
            agent.label as agent_name,
            agent.name as agent_role
        FROM oc_pqg s
        JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'has_responsibility_actor'
        JOIN oc_pqg agent ON e2.o[1] = agent.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND s.pid = ?
          AND e1.otype = '_edge_'
          AND event.otype = 'SamplingEvent'
          AND e2.otype = '_edge_'
          AND agent.otype = 'Agent'
        LIMIT 10
    """, [sample_pid]).fetchdf()
    
    print(f"\n   Responsible Agents ({len(agents)} found):")
    if not agents.empty:
        for _, agent in agents.iterrows():
            print(f"     - {agent['agent_name']} ({agent['agent_id']})")
    else:
        print("     ❌ No agents found!")
        
else:
    print("\n5. No samples found to analyze metadata")


5. Detailed metadata for sample: ark:/28722/k2x63t42w
   Sample label: Assemblage 364
   Location path: via_site_location

   Materials (1 found):
     - Material (https://w3id.org/isample/vocabulary/material/1.0/material)

   Responsible Agents (0 found):
     ❌ No agents found!


In [29]:
# 6. Summary of findings for this geo location
print(f"\n=== SUMMARY for {target_geo_pid} ===")
if geo_row_id is not None:
    print(f"✅ Geo location found (row_id: {geo_row_id})")
    print(f"📍 Coordinates: {geo_record.iloc[0]['latitude']}, {geo_record.iloc[0]['longitude']}")
    
    total_samples = len(all_samples)
    direct_count = len([s for s in all_samples if s.get('location_path') == 'direct_event_location'])
    site_count = len([s for s in all_samples if s.get('location_path') == 'via_site_location'])
    
    print(f"🔬 Total samples found: {total_samples}")
    print(f"   - Direct event samples: {direct_count}")
    print(f"   - Site-associated samples: {site_count}")
    
    if total_samples > 0:
        print("✅ Sample metadata retrieval successful!")
        print("   - Materials and agents can be extracted for each sample")
    else:
        print("❌ No samples found - this explains the issue in parquet_cesium.qmd")
        print("   - The location exists but has no associated sample data")
        
else:
    print("❌ Geo location not found in dataset!")

print(f"\n=== END DEBUG for {target_geo_pid} ===\n")


=== SUMMARY for geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===
✅ Geo location found (row_id: 191480)
📍 Coordinates: 28.058084, -81.146851
🔬 Total samples found: 1
   - Direct event samples: 0
   - Site-associated samples: 1
✅ Sample metadata retrieval successful!
   - Materials and agents can be extracted for each sample

=== END DEBUG for geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===



In [30]:
# 7. Test with a different geo location to see if we can find direct event samples
# Let's find a geo location that has sample_location edges pointing to it
sample_location_geos = conn.execute("""
    SELECT g.pid, g.latitude, g.longitude, COUNT(*) as edge_count
    FROM oc_pqg e
    JOIN oc_pqg g ON e.o[1] = g.row_id
    WHERE e.otype = '_edge_' 
      AND e.p = 'sample_location'
      AND g.otype = 'GeospatialCoordLocation'
    GROUP BY g.pid, g.latitude, g.longitude
    ORDER BY edge_count DESC
    LIMIT 3
""").fetchdf()

print("=== Testing with geo locations that have direct sample_location edges ===")
print(sample_location_geos)

if not sample_location_geos.empty:
    test_geo_pid = sample_location_geos.iloc[0]['pid']
    print(f"\nTesting direct samples query with: {test_geo_pid}")
    
    test_direct_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            event.pid as event_id,
            event.label as event_label
        FROM oc_pqg s
        JOIN oc_pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
        JOIN oc_pqg g ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND event.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 5
    """, [test_geo_pid]).fetchdf()
    
    print(f"Direct samples found: {len(test_direct_samples)}")
    if not test_direct_samples.empty:
        print("✅ Direct event samples DO exist in the dataset!")
        print(test_direct_samples[['sample_id', 'sample_label', 'event_id']].head())
    else:
        print("❌ Still no direct event samples found")
else:
    print("❌ No geo locations with sample_location edges found")

=== Testing with geo locations that have direct sample_location edges ===
                                               pid   latitude  longitude  \
0  geoloc_35842a4fa478ae28c68f54d1db36c8e968d62dcb  37.668196  32.827191   
1  geoloc_17bae610b87227ef806161bdb40ac97b4cd8ef5e  30.328700  35.442100   
2  geoloc_045c25c9e19aeac434ef19616cf2130175cfd130  35.034889  32.421841   

   edge_count  
0      131022  
1      108846  
2       52252  

Testing direct samples query with: geoloc_35842a4fa478ae28c68f54d1db36c8e968d62dcb
Direct samples found: 5
✅ Direct event samples DO exist in the dataset!
              sample_id sample_label  \
0  ark:/28722/k2ng4kg81   17047.F301   
1  ark:/28722/k2ks6m734   Bone 15919   
2  ark:/28722/k22b8xr93    13418.F56   
3  ark:/28722/k2nz82v1q   Bone 12324   
4  ark:/28722/k2cv4fp0h    4879.F417   

                                            event_id  
0  sampevent_aa2d34f76d9c3476ddf6e4bb96ff765a621a...  
1  sampevent_75decb8ede7bc114c052ce80191504f9080

## Debug Analysis Results

### Key Findings for parquet_cesium.qmd

1. **Geo Location Structure**: The target geo location `geoloc_7ea562cce4c70e4b37f7915e8384880c86607729` exists in the dataset (row_id 191480) with coordinates 28.058084, -81.146851.

2. **Sample Association**: This specific location has **1 site-associated sample** but **0 direct event samples**.

3. **Query Validation**: Both query paths work correctly:
   - **Direct path**: `Sample → produced_by → SamplingEvent → sample_location → GeospatialCoordLocation`
   - **Site path**: `Sample → produced_by → SamplingEvent → sampling_site → SamplingSite → site_location → GeospatialCoordLocation`

4. **Data Availability**: The dataset contains both types of sample associations, but not every geo location has both types.
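
The two paths in finding 3 can be sketched as a plain-Python traversal over a toy graph, which makes it easier to see why one geo location can have site-associated samples but no direct ones. Everything here is hypothetical: the row values, the `node` and `edge` builders, and the `samples_by_path` helper are illustrative only. The sketch is also simplified: it follows every target in `o` rather than only `o[1]`, and it does not check the `otype` of intermediate nodes as the SQL does.

```python
from collections import defaultdict


def node(row_id, otype):
    """Build a hypothetical node row."""
    return {"row_id": row_id, "otype": otype, "s": None, "p": None, "o": None}


def edge(row_id, s, p, o):
    """Build a hypothetical edge row with subject s, predicate p, targets o."""
    return {"row_id": row_id, "otype": "_edge_", "s": s, "p": p, "o": list(o)}


def samples_by_path(rows, geo_row_id):
    """Group sample row_ids by the path that links them to geo_row_id."""
    # Index edges by (subject, predicate) -> list of target row_ids
    targets = defaultdict(list)
    for r in rows:
        if r["otype"] == "_edge_":
            targets[(r["s"], r["p"])].extend(r["o"])

    found = {"direct_event_location": [], "via_site_location": []}
    for r in rows:
        if r["otype"] != "MaterialSampleRecord":
            continue
        for event_id in targets[(r["row_id"], "produced_by")]:
            # Direct path: event -> sample_location -> geo
            if geo_row_id in targets[(event_id, "sample_location")]:
                found["direct_event_location"].append(r["row_id"])
            # Site path: event -> sampling_site -> site -> site_location -> geo
            for site_id in targets[(event_id, "sampling_site")]:
                if geo_row_id in targets[(site_id, "site_location")]:
                    found["via_site_location"].append(r["row_id"])
    return found


TOY_GRAPH = [
    node(10, "MaterialSampleRecord"), node(11, "SamplingEvent"),
    node(12, "SamplingSite"), node(13, "GeospatialCoordLocation"),
    edge(100, 10, "produced_by", [11]),
    edge(101, 11, "sampling_site", [12]),
    edge(102, 12, "site_location", [13]),
    node(20, "MaterialSampleRecord"), node(21, "SamplingEvent"),
    node(14, "GeospatialCoordLocation"),
    edge(103, 20, "produced_by", [21]),
    edge(104, 21, "sample_location", [14]),
]

print(samples_by_path(TOY_GRAPH, 13))  # reached only through the site
print(samples_by_path(TOY_GRAPH, 14))  # reached only through the event
```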

### Recommendations for parquet_cesium.qmd

- Both query paths taken from parquet_cesium.qmd return results when run in this notebook, so the structure of the JavaScript queries appears sound
- Some geo locations may only have site-associated samples (like our test case)
- Consider showing both direct and site-associated samples in the UI
- Add debug logging to identify when no samples are found vs. query errors
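
The last recommendation, separating "no samples found" from "query failed", can be sketched in Python, the language of this notebook; the same three-way distinction would need to be ported to the JavaScript in parquet_cesium.qmd. The helper name `run_checked` and the logger name `pqg_debug` are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("pqg_debug")  # hypothetical logger name


def run_checked(label, run_query):
    """Run a zero-argument query callable and report one of three outcomes.

    Returns a (status, rows) pair where status is 'ok', 'empty', or 'error'.
    """
    try:
        rows = run_query()
    except Exception as exc:
        # The query itself failed (bad SQL, missing column, type error, ...)
        logger.error("%s: query failed: %s", label, exc)
        return "error", []
    if len(rows) == 0:
        # The query ran, but this location simply has no matching samples
        logger.warning("%s: query ran but returned no rows", label)
        return "empty", rows
    logger.info("%s: %d rows", label, len(rows))
    return "ok", rows


def failing_query():
    """Stand-in for a query that raises, used only to demonstrate the error branch."""
    raise RuntimeError("simulated query failure")


print(run_checked("direct", lambda: [])[0])            # query ran, no rows
print(run_checked("site", lambda: ["sample-1"])[0])    # query ran, rows found
print(run_checked("broken", failing_query)[0])         # query raised
```

In this notebook the callable could wrap one of the cells above, for example `lambda: conn.execute(sql, [target_geo_pid]).fetchdf()`, since `len()` also works on the returned DataFrame.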

In [22]:
# Analysis complete!
print("\nAnalysis complete!")
print("Note: DuckDB connection remains open for interactive use")


Analysis complete!
Note: DuckDB connection remains open for interactive use
