> Note: If you have a different iSamples PQG parquet file from another provider, set `file_url` and `LOCAL_PATH` accordingly. All queries below will still work because they rely on PQG structure and iSamples model semantics.

# iSamples PQG Parquet Analysis (using OpenContext dataset)

This notebook analyzes an iSamples Property Graph (PQG) parquet file. The sample file we use happens to be produced from OpenContext, but the schema, node types, and graph patterns are iSamples‚Äëgeneric.

## Key Distinction: PQG framework vs iSamples model vs provider data

We‚Äôll keep these layers straight:

1. Generic PQG (Property Graph) framework
   - Core graph fields: `s` (subject), `p` (predicate), `o` (object array), `n` (graph name)
   - Edges are rows with `otype = '_edge_'`
   - Graph traversal patterns (joins on s/p/o) are domain‚Äëagnostic

2. iSamples metadata model (provider‚Äëagnostic domain schema)
   - Entity types: `MaterialSampleRecord`, `SamplingEvent`, `GeospatialCoordLocation`, `SamplingSite`, `IdentifiedConcept`, `Agent`, etc.
   - Predicates like `produced_by`, `sample_location`, `sampling_site`, `has_material_category`, etc.
   - These are defined by the iSamples model, not specific to OpenContext

3. Provider data (e.g., OpenContext)
   - A particular provider‚Äôs content fills the iSamples model
   - The dataset URL we load is from OpenContext, but the analysis is reusable for any iSamples PQG parquet

## Setup and Data Loading

In [1]:
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path
import urllib.request
import os

# Configuration
file_url = "https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet"
LOCAL_PATH = "/Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet"

In [2]:
# Check if local file exists, download if not
if not os.path.exists(LOCAL_PATH):
    print(f"Local file not found at {LOCAL_PATH}")
    
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)
    
    print(f"Downloading {file_url} to {LOCAL_PATH}...")
    urllib.request.urlretrieve(file_url, LOCAL_PATH)
    print("Download completed!")
else:
    print(f"Local file already exists at {LOCAL_PATH}")

# Use local path for parquet operations
parquet_path = LOCAL_PATH
print(f"Using parquet file: {parquet_path}")

Local file already exists at /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet
Using parquet file: /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet


## Understanding the Data Structure

### PQG framework (generic)
The parquet file uses a property graph model where both entities (nodes) and relationships (edges) are stored in one table. This pattern is generic and reusable across providers.

Core PQG fields:
- `s` (subject): source node row_id for an edge
- `p` (predicate): relationship type
- `o` (object): array of target row_ids
- `n` (name): graph context/namespace (often null)

Edges are rows with `otype = '_edge_'`.

### iSamples metadata model (provider‚Äëagnostic)
Values in `otype` and `p` map to the iSamples domain schema, independent of the specific provider:
- Entity types: `MaterialSampleRecord`, `SamplingEvent`, `GeospatialCoordLocation`, `SamplingSite`, `IdentifiedConcept`, `Agent`, `_edge_`
- Common predicates: `produced_by`, `sample_location`, `sampling_site`, `site_location`, `has_material_category`, `has_responsibility_actor`, etc.

We‚Äôll demonstrate queries that traverse the generic PQG structure while filtering/labeling using the iSamples model.

Note: The example parquet we load is produced from OpenContext content, but the analysis patterns apply to any iSamples PQG parquet.

In [3]:
# Create a DuckDB connection
conn = duckdb.connect()

# Create view for the parquet file
conn.execute(f"CREATE VIEW pqg AS SELECT * FROM read_parquet('{parquet_path}');")

# Count records
result = conn.execute("SELECT COUNT(*) FROM pqg;").fetchone()
print(f"Total records: {result[0]:,}")

Total records: 11,637,144


In [4]:
# Schema information
print("Schema information:")
schema_result = conn.execute("DESCRIBE pqg;").fetchall()
for row in schema_result[:10]:  # Show first 10 columns
    print(f"{row[0]:25} | {row[1]}")
print(f"... and {len(schema_result) - 10} more columns")

Schema information:
row_id                    | INTEGER
pid                       | VARCHAR
tcreated                  | INTEGER
tmodified                 | INTEGER
otype                     | VARCHAR
s                         | INTEGER
p                         | VARCHAR
o                         | INTEGER[]
n                         | VARCHAR
altids                    | VARCHAR[]
... and 30 more columns


In [5]:
# Examine the distribution of entity types (iSamples model types)
entity_stats = conn.execute("""
    SELECT
        otype,
        COUNT(*) as count,
        COUNT(DISTINCT pid) as unique_pids,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM pqg
    GROUP BY otype
    ORDER BY count DESC
""").fetchdf()

print("Entity Type Distribution (iSamples model types):")
print(entity_stats)

Entity Type Distribution (iSamples model types):
                     otype    count  unique_pids  percentage
0                   _edge_  9201451      9201451       79.07
1     MaterialSampleRecord  1096352      1096352        9.42
2            SamplingEvent  1096352      1096352        9.42
3  GeospatialCoordLocation   198433       198433        1.71
4        IdentifiedConcept    25778        25778        0.22
5             SamplingSite    18213        18213        0.16
6                    Agent      565          565        0.00


### Graph structure fields (PQG)

The fields `s`, `p`, `o`, `n` are part of the generic PQG representation:
- s (subject): row_id of the source entity
- p (predicate): relationship type
- o (object): array of target row_ids
- n (name): graph context (usually null)

These patterns are provider‚Äëagnostic. The iSamples model provides the semantics for common predicates such as:
- MaterialSampleRecord (s) produced_by (p) SamplingEvent (o)
- SamplingEvent (s) sample_location (p) GeospatialCoordLocation (o)

In [6]:
# Explore edge predicates (iSamples model predicates)
edge_predicates = conn.execute("""
    SELECT
        p as predicate,
        COUNT(*) as usage_count,
        COUNT(DISTINCT s) as unique_subjects
    FROM pqg
    WHERE otype = '_edge_'
    GROUP BY p
    ORDER BY usage_count DESC
    LIMIT 15
""").fetchdf()

print("Most common relationship types (iSamples predicates):")
print(edge_predicates)

Most common relationship types (iSamples predicates):
                predicate  usage_count  unique_subjects
0    has_context_category      1096352          1096352
1           sampling_site      1096352          1096352
2             produced_by      1096352          1096352
3  has_sample_object_type      1096352          1096352
4   has_material_category      1096352          1096352
5                keywords      1096297          1096297
6         sample_location      1096274          1096274
7          responsibility      1095272          1095272
8              registrant       413635           413635
9           site_location        18213            18213


## Practical Query Examples

The following queries demonstrate both:
1. **Generic PQG patterns**: How to traverse graphs using s/p/o relationships
2. **OpenContext specifics**: The actual entity types and predicates for archaeological data

## Understanding Geographic Paths in the iSamples Property Graph

### Path 1 and Path 2: Complementary, Not Alternative

The iSamples model provides **two complementary paths** from samples to geographic coordinates. They serve different purposes and provide different levels of geographic granularity.

### Path 1 (Direct Event Location) - Precise Field Coordinates

**What it is**: The **exact GPS coordinates** where a specific sampling event occurred.

```
MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation
```

**Example**: "This pottery shard was collected at latitude 35.123, longitude 33.456"

**Characteristics**:
- Precise, field-recorded GPS point
- Specific to each sampling event
- Different events at the same site typically have different Path 1 coordinates

**Use case**: "Show me the exact spot where this sample was collected"

### Path 2 (Via Sampling Site) - Administrative Site Location

**What it is**: The **representative or administrative location** for a named archaeological site that groups related samples.

```
MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sampling_site ‚Üí SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation
```

**Example**: "This sample came from the PKAP Survey Area, whose general location is lat 34.987, lon 33.708"

**Characteristics**:
- One representative point for the entire site
- Administrative/reference location that groups related samples
- Many events at the same site share the **same** Path 2 location but have **different** Path 1 locations

**Use case**: "Show me the general area/site where this sample came from"

### CRITICAL: Complementary Levels of Granularity, Not Alternatives

‚ùå **WRONG**: "Use Path 1 OR Path 2 to get the coordinates" (implies they return the same result)

‚úÖ **CORRECT**: 
- **Path 1** = precise individual sample location (fine-grained)
- **Path 2** = administrative site grouping (coarse-grained)
- Both are valid; which you use depends on whether you want precise points or site groupings

### Real-World Example: PKAP Survey Area (Large Regional Survey)

**PKAP Survey Area** demonstrates why both paths are needed:

```sql
-- Path 2: ONE administrative site location
Site: PKAP Survey Area
site_location: geoloc_ff64156b... (34.987406, 33.708047)

-- Path 1: 544 DIFFERENT precise sample locations within that site!
Top sample_location geos by event count:
- geoloc_04d6e816...: 2,019 events at this precise spot
- geoloc_9797bec3...: 754 events at this precise spot  
- geoloc_67f077ed...: 577 events at this precise spot
... (541 more unique field locations)
- geoloc_ff64156b... (matches site_location): only 106 events
```

**Interpretation**: 
- **Path 2** tells you: "All these samples belong to PKAP Survey Area at (34.987, 33.708)"
- **Path 1** tells you: "But they were actually collected at 544 different specific GPS points within that survey area"
- Both pieces of information are useful for different purposes!

### Contrast: Suberde (Small Compact Site)

Not all sites have many different locations. **Suberde** shows when Path 1 and Path 2 converge:

```sql
Site: Suberde  
site_location: geoloc_4f3b18c2... (coordinates)

Events at this site: 384
All 384 events use the SAME coordinate for both Path 1 and Path 2
```

For small, compact sites, the precise field location and administrative site location are essentially the same point.

### When to Use Each Path

**Use Path 1 when you need**:
- Precise GPS points for mapping individual samples
- Fine-grained spatial analysis
- "Show me exactly where each sample was found"

**Use Path 2 when you need**:
- Grouping samples by named site/project
- Understanding administrative/project context
- "Show me all samples from this archaeological site"

**Use BOTH when you need**:
- Complete geographic context (precise point + site affiliation)
- "This sample was found at (35.123, 33.456) within the larger PKAP Survey Area"
- This is what Eric's `get_sample_data_via_sample_pid()` does!

## Full Relationship Map: Beyond Just Geographic Data

The iSamples property graph contains many types of relationships beyond the two geographic paths:

```
                                    Agent
                                      ‚Üë
                                      | {responsibility, registrant}
                                      |
MaterialSampleRecord ‚îÄ‚îÄ‚îÄ‚îÄproduced_by‚îÄ‚îÄ‚Üí SamplingEvent ‚îÄ‚îÄ‚îÄ‚îÄsample_location‚îÄ‚îÄ‚Üí GeospatialCoordLocation
    |                                       |                                         ‚Üë
    |                                       |                                         |
    | {keywords,                            ‚îî‚îÄ‚îÄ‚îÄ‚îÄsampling_site‚îÄ‚îÄ‚Üí SamplingSite ‚îÄ‚îÄsite_location‚îÄ‚îò
    |  has_sample_object_type,                                      
    |  has_material_category}                                    
    |                                                             
    ‚îî‚îÄ‚îÄ‚Üí IdentifiedConcept
```

**Relationship Categories:**
- **PATH 1**: MaterialSampleRecord ‚Üí SamplingEvent ‚Üí GeospatialCoordLocation (precise field location)
- **PATH 2**: MaterialSampleRecord ‚Üí SamplingEvent ‚Üí SamplingSite ‚Üí GeospatialCoordLocation (administrative site location)
- **AGENT PATH**: MaterialSampleRecord ‚Üí SamplingEvent ‚Üí Agent (who collected/registered)
- **CONCEPT PATH**: MaterialSampleRecord ‚Üí IdentifiedConcept (types, keywords - direct, bypasses SamplingEvent!)

**Key Insight**: SamplingEvent is the central hub for most relationships (Paths 1, 2, and Agent), but concepts attach directly to MaterialSampleRecord.

## Eric's Query Functions: Understanding Path Usage

The query functions in cell 59 (from Eric Kansa's `open-context-py`) demonstrate different path traversal patterns and how Path 1 and Path 2 are used.

### 1. `get_sample_data_via_sample_pid(sample_pid)` - Uses BOTH Path 1 AND Path 2

**What it returns**: Complete geographic context for a sample - both precise location AND site affiliation.

**Graph traversal**:
```
MaterialSampleRecord (WHERE pid = sample_pid)
  ‚Üí produced_by ‚Üí SamplingEvent
    ‚îú‚îÄ‚Üí sample_location ‚Üí GeospatialCoordLocation [PATH 1: precise coordinates]
    ‚îî‚îÄ‚Üí sampling_site ‚Üí SamplingSite ‚Üí site_location [PATH 2: site context]
```

**Returns**: `sample_pid`, `sample_label`, `latitude`, `longitude` (from Path 1), `sample_site_label`, `sample_site_pid` (from Path 2)

**Important**: Uses INNER JOIN on BOTH paths - sample must have BOTH precise coordinates AND site affiliation to appear in results.

---

### 2. `get_sample_data_agents_sample_pid(sample_pid)` - Uses AGENT PATH

**What it returns**: Who collected or registered the sample.

**Graph traversal**:
```
MaterialSampleRecord (WHERE pid = sample_pid)
  ‚Üí produced_by ‚Üí SamplingEvent
    ‚Üí {responsibility, registrant} ‚Üí Agent
```

**Returns**: `sample_pid`, `agent_pid`, `agent_name`, `predicate` (responsibility/registrant)

**Independent of**: Path 1 and Path 2 - you get agents even if sample has no geographic data.

---

### 3. `get_sample_types_and_keywords_via_sample_pid(sample_pid)` - Uses CONCEPT PATH

**What it returns**: Material types, keywords, and classifications.

**Graph traversal**:
```
MaterialSampleRecord (WHERE pid = sample_pid)
  ‚Üí {keywords, has_sample_object_type, has_material_category} ‚Üí IdentifiedConcept
```

**Returns**: `sample_pid`, `keyword_pid`, `keyword`, `predicate` (which type of classification)

**Bypasses SamplingEvent**: Goes DIRECTLY from sample to concepts. Independent of all geographic and agent data.

---

### 4. `get_samples_at_geo_cord_location_via_sample_event(geo_pid)` - REVERSE Path 1, ENRICHED with Path 2

**What it returns**: All samples collected at a specific geographic coordinate (reverse query).

**Graph traversal** (starts at geo, walks backward to samples):
```
GeospatialCoordLocation (WHERE pid = geo_pid)  ‚Üê START HERE
  ‚Üê sample_location ‚Üê SamplingEvent [REVERSE PATH 1: events at this precise coordinate]
    ‚îú‚îÄ‚Üí sampling_site ‚Üí SamplingSite [PATH 2: enrich with site name]
    ‚îî‚îÄ‚Üê produced_by ‚Üê MaterialSampleRecord [get the samples]
```

**Returns**: `latitude`, `longitude`, `sample_pid`, `sample_label`, `sample_site_label`, `sample_site_pid`

**Critical understanding**:
- Uses **Path 1 in reverse** (`sample_location`) to find events at THIS PRECISE GPS point
- Uses **Path 2 forward** (`sampling_site`) to enrich results with site names
- This is NOT using `site_location` to find samples - it finds samples WHERE THE EVENT HAPPENED at `geo_pid`
- The site information is added for context: "These samples were found at this precise point, and they belong to Site X"

---

### Summary Table: Path Usage

| Function | Path 1 | Path 2 | Agent Path | Concept Path | Direction |
|----------|--------|--------|------------|--------------|-----------|
| `get_sample_data_via_sample_pid` | ‚úÖ Required | ‚úÖ Required | ‚ùå | ‚ùå | Forward (sample ‚Üí geo) |
| `get_sample_data_agents_sample_pid` | ‚ùå | ‚ùå | ‚úÖ | ‚ùå | N/A |
| `get_sample_types_and_keywords_via_sample_pid` | ‚ùå | ‚ùå | ‚ùå | ‚úÖ | N/A |
| `get_samples_at_geo_cord_location_via_sample_event` | ‚úÖ Reverse | ‚úÖ Enrichment | ‚ùå | ‚ùå | Reverse (geo ‚Üí samples) |

### Key Takeaway: Path 1 vs Path 2 Usage Patterns

**Path 1** (`sample_location`):
- Used when you need **precise GPS coordinates** for individual samples
- Used in reverse to find "what was sampled at this specific GPS point?"

**Path 2** (`site_location`):  
- Used to provide **site context and grouping** for samples
- Used to answer "what named site does this sample belong to?"
- Often used to ENRICH Path 1 results with administrative context

**Together**: They provide complete geographic context - precise field location + site affiliation.

### Graph Traversal Patterns Demonstrated Below

The queries below use two complementary graph traversal paths for geographic data:

**Path 1 - Direct event location (precise field coordinates)**:
```
MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation
```

**Path 2 - Via sampling site (administrative site location)**:
```
MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sampling_site ‚Üí SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation
```

**Key point**: These provide different levels of geographic granularity (precise vs. site-level), and are often used together to provide complete context.

In [7]:
# PROOF STEP 4: Conclusion - Enumerate ALL paths

print("="*70)
print("CONCLUSION: Mathematical Proof of Exactly 2 Paths")
print("="*70)

print("\nüìä Graph Structure Facts:")
print("   1. GeospatialCoordLocation has ONLY 2 incoming edge types:")
print("      - SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation")
print("      - SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation")
print()
print("   2. MaterialSampleRecord has NO direct edge to GeospatialCoordLocation (0 edges)")
print()
print("   3. MaterialSampleRecord connects to SamplingEvent via 'produced_by' (1,096,352 edges)")
print("      This is the ONLY path from MaterialSampleRecord toward geo data")
print()
print("   4. SamplingEvent connects to:")
print("      - GeospatialCoordLocation (via sample_location) - Path 1")
print("      - SamplingSite (via sampling_site)")
print()  
print("   5. SamplingSite connects to:")
print("      - GeospatialCoordLocation (via site_location) - Path 2")
print()

print("üîí Therefore, exactly TWO paths exist:")
print()
print("   PATH 1: MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation")
print("   PATH 2: MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sampling_site ‚Üí SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation")
print()
print("   Any other path is MATHEMATICALLY IMPOSSIBLE given the graph topology.")
print()

print("üí° This is a structural constraint of the iSamples metadata model,")
print("   not just a data observation!")
print("="*70)

CONCLUSION: Mathematical Proof of Exactly 2 Paths

üìä Graph Structure Facts:
   1. GeospatialCoordLocation has ONLY 2 incoming edge types:
      - SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation
      - SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation

   2. MaterialSampleRecord has NO direct edge to GeospatialCoordLocation (0 edges)

   3. MaterialSampleRecord connects to SamplingEvent via 'produced_by' (1,096,352 edges)
      This is the ONLY path from MaterialSampleRecord toward geo data

   4. SamplingEvent connects to:
      - GeospatialCoordLocation (via sample_location) - Path 1
      - SamplingSite (via sampling_site)

   5. SamplingSite connects to:
      - GeospatialCoordLocation (via site_location) - Path 2

üîí Therefore, exactly TWO paths exist:

   PATH 1: MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation
   PATH 2: MaterialSampleRecord ‚Üí produced_by ‚Üí SamplingEvent ‚Üí sampling_site ‚Üí S

In [8]:
# PROOF STEP 3: What does MaterialSampleRecord connect to?

print("="*70)
print("STEP 3: ALL outbound edges FROM MaterialSampleRecord")
print("="*70)

edges_from_sample = conn.execute("""
    SELECT 
        e.p as predicate,
        target.otype as target_type,
        COUNT(*) as count
    FROM pqg sample
    JOIN pqg e ON (sample.row_id = e.s AND e.otype = '_edge_')
    JOIN pqg target ON (list_extract(e.o, 1) = target.row_id)
    WHERE sample.otype = 'MaterialSampleRecord'
    GROUP BY e.p, target.otype
    ORDER BY count DESC
""").fetchdf()

print("\nAll outbound predicates from MaterialSampleRecord:")
print(edges_from_sample)

print("\n‚úÖ FINDING: MaterialSampleRecord connects to these entity types:")
for _, row in edges_from_sample.iterrows():
    print(f"   - {row['target_type']} (via {row['predicate']}): {row['count']:,} edges")

print("\nüéØ KEY: Only 'produced_by ‚Üí SamplingEvent' can lead to geographic data")
print("   (IdentifiedConcept and Agent don't connect to GeospatialCoordLocation)")

STEP 3: ALL outbound edges FROM MaterialSampleRecord

All outbound predicates from MaterialSampleRecord:
                predicate        target_type    count
0             produced_by      SamplingEvent  1096352
1   has_material_category  IdentifiedConcept  1096352
2    has_context_category  IdentifiedConcept  1096352
3  has_sample_object_type  IdentifiedConcept  1096352
4                keywords  IdentifiedConcept  1096297
5              registrant              Agent   413635

‚úÖ FINDING: MaterialSampleRecord connects to these entity types:
   - SamplingEvent (via produced_by): 1,096,352 edges
   - IdentifiedConcept (via has_material_category): 1,096,352 edges
   - IdentifiedConcept (via has_context_category): 1,096,352 edges
   - IdentifiedConcept (via has_sample_object_type): 1,096,352 edges
   - IdentifiedConcept (via keywords): 1,096,297 edges
   - Agent (via registrant): 413,635 edges

üéØ KEY: Only 'produced_by ‚Üí SamplingEvent' can lead to geographic data
   (IdentifiedConc

In [9]:
# PROOF STEP 2: Does MaterialSampleRecord have a DIRECT edge to GeospatialCoordLocation?

print("="*70)
print("STEP 2: Direct MaterialSampleRecord ‚Üí GeospatialCoordLocation edges?")
print("="*70)

direct_edges = conn.execute("""
    SELECT COUNT(*) as count
    FROM pqg sample
    JOIN pqg e ON (sample.row_id = e.s AND e.otype = '_edge_')
    JOIN pqg geo ON (list_extract(e.o, 1) = geo.row_id AND geo.otype = 'GeospatialCoordLocation')
    WHERE sample.otype = 'MaterialSampleRecord'
""").fetchdf()

print(f"\nDirect MaterialSampleRecord ‚Üí GeospatialCoordLocation edges: {direct_edges['count'].iloc[0]}")

if direct_edges['count'].iloc[0] == 0:
    print("\n‚úÖ FINDING: MaterialSampleRecord has ZERO direct edges to GeospatialCoordLocation")
    print("   Therefore, MaterialSampleRecord MUST go through intermediate entities")

STEP 2: Direct MaterialSampleRecord ‚Üí GeospatialCoordLocation edges?

Direct MaterialSampleRecord ‚Üí GeospatialCoordLocation edges: 0

‚úÖ FINDING: MaterialSampleRecord has ZERO direct edges to GeospatialCoordLocation
   Therefore, MaterialSampleRecord MUST go through intermediate entities


In [10]:
# PROOF STEP 1: What entity types connect TO GeospatialCoordLocation?
# This query finds ALL incoming edges to GeospatialCoordLocation

print("="*70)
print("STEP 1: What connects TO GeospatialCoordLocation?")
print("="*70)

edges_to_geo = conn.execute("""
    SELECT 
        source.otype as source_type,
        e.p as predicate,
        COUNT(*) as count
    FROM pqg geo
    JOIN pqg e ON (geo.row_id = list_extract(e.o, 1) AND e.otype = '_edge_')
    JOIN pqg source ON (e.s = source.row_id)
    WHERE geo.otype = 'GeospatialCoordLocation'
    GROUP BY source.otype, e.p
    ORDER BY count DESC
""").fetchdf()

print("\nALL entity types with edges TO GeospatialCoordLocation:")
print(edges_to_geo)

print("\n‚úÖ FINDING: ONLY two entity types connect to GeospatialCoordLocation:")
print("   - SamplingEvent (via sample_location)")
print("   - SamplingSite (via site_location)")

STEP 1: What connects TO GeospatialCoordLocation?

ALL entity types with edges TO GeospatialCoordLocation:
     source_type        predicate    count
0  SamplingEvent  sample_location  1096274
1   SamplingSite    site_location    18213

‚úÖ FINDING: ONLY two entity types connect to GeospatialCoordLocation:
   - SamplingEvent (via sample_location)
   - SamplingSite (via site_location)


## Mathematical Proof: Path 1 and Path 2 Are the ONLY Paths

**Key Discovery**: Path 1 and Path 2 are not just "common patterns" - they are the **ONLY two possible paths** from MaterialSampleRecord to GeospatialCoordLocation in the iSamples graph model.

This is a **structural constraint** of the iSamples metadata model, proven by analyzing the graph topology.

### The Proof

The following queries demonstrate that there are exactly two paths and no others are mathematically possible:

**Step 1**: What entity types connect TO GeospatialCoordLocation?
- Query the graph to find ALL incoming edges to GeospatialCoordLocation

**Step 2**: How does MaterialSampleRecord connect to those entities?
- MaterialSampleRecord has NO direct edge to GeospatialCoordLocation
- MaterialSampleRecord ONLY connects to SamplingEvent (via `produced_by`)

**Step 3**: Enumerate all paths
- Since MaterialSampleRecord MUST go through SamplingEvent
- And GeospatialCoordLocation is ONLY reachable from SamplingEvent and SamplingSite
- And SamplingSite is ONLY reachable from SamplingEvent
- Therefore: exactly **2 paths** exist, no more, no less

### Why This Matters

- This is an **architectural invariant** of the iSamples model
- Not just an observation about the OpenContext data
- Future iSamples implementations MUST follow this structure
- Can confidently state "Path 1 and Path 2 are the only ways..." without caveats
- Validates that our Path 1/Path 2 framework is **complete and exhaustive**

### Query 1: Find MaterialSampleRecords with Geographic Coordinates

This query demonstrates:
- **Generic PQG pattern**: Multi-hop graph traversal through edges
- **OpenContext specifics**: Archaeological entity types and relationships

In [11]:
# Find samples with geographic coordinates (via SamplingEvent)
# PQG: traverse edges by joining on s/p/o; iSamples: filter types/predicates

# Ensure we have a working connection
try:
    conn.execute("SELECT 1").fetchone()
except:
    conn = duckdb.connect()
    conn.execute(f"CREATE VIEW pqg AS SELECT * FROM read_parquet('{parquet_path}');")

samples_with_coords = conn.execute("""
    SELECT
        s.pid as sample_id,
        s.label as sample_label,
        s.description,
        g.latitude,
        g.longitude,
        g.place_name,
        'direct_event_location' as location_type
    FROM pqg s
    JOIN pqg e1   ON s.row_id = e1.s AND e1.p = 'produced_by'
    JOIN pqg evt  ON e1.o[1] = evt.row_id
    JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sample_location'
    JOIN pqg g    ON e2.o[1] = g.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND evt.otype = 'SamplingEvent'
      AND g.otype = 'GeospatialCoordLocation'
      AND g.latitude IS NOT NULL
    LIMIT 100
""").fetchdf()

print(f"Found {len(samples_with_coords)} samples with direct event coordinates")
samples_with_coords.head()

Found 100 samples with direct event coordinates


Unnamed: 0,sample_id,sample_label,description,latitude,longitude,place_name,location_type
0,ark:/28722/k2zs2s76j,C. glaucum 12,"Open Context published ""Shell"" sample record f...",39.95733,26.238606,,direct_event_location
1,ark:/28722/k2377fk0m,Unident. medium(b.22699),"Open Context published ""Non Diagnostic Bone"" s...",32.9792,35.5433,,direct_event_location
2,ark:/28722/r2p24/pc_20090012,PC 20090012,"Open Context published ""Pottery"" sample record...",43.15334,11.399649,,direct_event_location
3,ark:/28722/k28g9252c,Flint Bag 21 (1972),"Open Context published ""Bulk Lithic"" sample re...",35.867136,38.398981,,direct_event_location
4,ark:/28722/r2p24/pc_19960045,PC 19960045,"Open Context published ""Object"" sample record ...",43.151234,11.403251,,direct_event_location


### Using Ibis for Cleaner Multi-Step Joins

Ibis provides a more Pythonic interface for the same **generic PQG graph traversal patterns**, while making **OpenContext-specific** entity filtering clearer.

In [12]:
# Import Ibis for cleaner data manipulation
import ibis
from ibis import _

ibis.options.interactive = True

# Create Ibis connection using DuckDB
ibis_conn = ibis.duckdb.connect()

# Register the parquet file as a table in Ibis
pqg = ibis_conn.read_parquet(parquet_path, table_name='pqg')

print("Ibis setup complete!")
print(f"Table columns: {pqg.columns}")
print(f"Total records: {pqg.count().execute():,}")

Ibis setup complete!
Table columns: ('row_id', 'pid', 'tcreated', 'tmodified', 'otype', 's', 'p', 'o', 'n', 'altids', 'geometry', 'authorized_by', 'has_feature_of_interest', 'affiliation', 'sampling_purpose', 'complies_with', 'project', 'alternate_identifiers', 'relationship', 'elevation', 'sample_identifier', 'dc_rights', 'result_time', 'contact_information', 'latitude', 'target', 'role', 'scheme_uri', 'is_part_of', 'scheme_name', 'name', 'longitude', 'obfuscated', 'curation_location', 'last_modified_time', 'access_constraints', 'place_name', 'description', 'label', 'thumbnail_url')
Total records: 11,637,144


In [13]:
# Ibis version: Find samples with geographic coordinates through SamplingEvent

# Base tables with iSamples model type filters
samples = pqg.filter(_.otype == 'MaterialSampleRecord').alias('samples')
events = pqg.filter(_.otype == 'SamplingEvent').alias('events')
locations = pqg.filter(_.otype == 'GeospatialCoordLocation').alias('locations')
edges = pqg.filter(_.otype == '_edge_').alias('edges')

# Sample -> produced_by -> SamplingEvent
sample_to_event = (
    samples
    .join(
        edges.filter(_.p == 'produced_by'),
        samples.row_id == edges.s
    )
    .join(
        events,
        edges.o[0] == events.row_id
    )
)

# SamplingEvent -> sample_location -> GeospatialCoordLocation
location_edges = edges.filter(_.p == 'sample_location').alias('location_edges')
event_to_location = (
    sample_to_event
    .join(
        location_edges,
        events.row_id == location_edges.s
    )
    .join(
        locations.filter(_.latitude.notnull()),
        location_edges.o[0] == locations.row_id
    )
)

samples_with_coords_ibis = (
    event_to_location
    .select(
        sample_id=samples.pid,
        sample_label=samples.label,
        description=samples.description,
        latitude=locations.latitude,
        longitude=locations.longitude,
        place_name=locations.place_name,
        location_type=ibis.literal('direct_event_location')
    )
    .limit(100)
)

result_ibis = samples_with_coords_ibis.execute()
print(f"Found {len(result_ibis)} samples with direct event coordinates (Ibis)")
result_ibis.head()

Found 100 samples with direct event coordinates (Ibis)


Unnamed: 0,sample_id,sample_label,description,latitude,longitude,place_name,location_type
0,ark:/28722/k2zs2s76j,C. glaucum 12,"Open Context published ""Shell"" sample record f...",39.95733,26.238606,,direct_event_location
1,ark:/28722/k2377fk0m,Unident. medium(b.22699),"Open Context published ""Non Diagnostic Bone"" s...",32.9792,35.5433,,direct_event_location
2,ark:/28722/r2p24/pc_20090012,PC 20090012,"Open Context published ""Pottery"" sample record...",43.15334,11.399649,,direct_event_location
3,ark:/28722/k28g9252c,Flint Bag 21 (1972),"Open Context published ""Bulk Lithic"" sample re...",35.867136,38.398981,,direct_event_location
4,ark:/28722/r2p24/pc_19960045,PC 19960045,"Open Context published ""Object"" sample record ...",43.151234,11.403251,,direct_event_location


In [14]:
# Ibis version: Find samples via site location path

sites = pqg.filter(_.otype == 'SamplingSite').alias('sites')

# Define edge tables
event_edges = edges.filter(_.p == 'produced_by').alias('event_edges')
site_edges = edges.filter(_.p == 'sampling_site').alias('site_edges')
site_location_edges = edges.filter(_.p == 'site_location').alias('site_location_edges')

samples_via_sites_ibis = (
    samples
    .join(event_edges, samples.row_id == event_edges.s)
    .join(events, event_edges.o[0] == events.row_id)
    .join(site_edges, events.row_id == site_edges.s)
    .join(sites, site_edges.o[0] == sites.row_id)
    .join(site_location_edges, sites.row_id == site_location_edges.s)
    .join(
        locations.filter(_.latitude.notnull()),
        site_location_edges.o[0] == locations.row_id
    )
    .select(
        sample_id=samples.pid,
        sample_label=samples.label,
        site_name=sites.label,
        latitude=locations.latitude,
        longitude=locations.longitude,
        location_type=ibis.literal('via_site_location')
    )
    .limit(100)
)

result_via_sites_ibis = samples_via_sites_ibis.execute()
print(f"Found {len(result_via_sites_ibis)} samples with site-based coordinates (Ibis)")
result_via_sites_ibis.head()

Found 100 samples with site-based coordinates (Ibis)


Unnamed: 0,sample_id,sample_label,site_name,latitude,longitude,location_type
0,ark:/28722/k26w9pb6h,Bone 6273,Sion-Avenue Ritz,46.231666,7.370449,via_site_location
1,ark:/28722/r2p3k14c/t_233,T-233,Finnmark,70.466695,25.140892,via_site_location
2,ark:/28722/r2p3k14c/nsrl_2664,NSRL-2664,16OU175,32.324245,-92.197266,via_site_location
3,ark:/28722/r2p3k14c/har_10225,HAR-10225,East Yorkshire,54.12978,-0.496022,via_site_location
4,ark:/28722/r2p3k14c/gu_5461,GU-5461,Wharram Percy,54.0675,-0.689722,via_site_location


In [15]:
# Ibis version: get_sample_locations_for_viz function

def get_sample_locations_for_viz_ibis(limit=10000):
    """Extract sample locations optimized for visualization using Ibis"""

    event_edges = edges.filter(_.p == 'produced_by').alias('event_edges')
    sample_location_edges = edges.filter(_.p == 'sample_location').alias('sample_location_edges')
    site_edges = edges.filter(_.p == 'sampling_site').alias('site_edges')
    site_location_edges = edges.filter(_.p == 'site_location').alias('site_location_edges')

    # Direct locations: Sample -> Event -> sample_location -> Location
    direct_locations = (
        samples
        .join(event_edges, samples.row_id == event_edges.s)
        .join(events, event_edges.o[0] == events.row_id)
        .join(sample_location_edges, events.row_id == sample_location_edges.s)
        .join(
            locations.filter((_.latitude.notnull()) & (_.longitude.notnull()) & (~_.obfuscated)),
            sample_location_edges.o[0] == locations.row_id
        )
        .select(
            sample_id=samples.pid,
            label=samples.label,
            latitude=locations.latitude,
            longitude=locations.longitude,
            obfuscated=locations.obfuscated,
            location_type=ibis.literal('direct')
        )
    )

    # Site locations: Sample -> Event -> Site -> site_location -> Location
    site_locations = (
        samples
        .join(event_edges, samples.row_id == event_edges.s)
        .join(events, event_edges.o[0] == events.row_id)
        .join(site_edges, events.row_id == site_edges.s)
        .join(sites, site_edges.o[0] == sites.row_id)
        .join(site_location_edges, sites.row_id == site_location_edges.s)
        .join(
            locations.filter((_.latitude.notnull()) & (_.longitude.notnull()) & (~_.obfuscated)),
            site_location_edges.o[0] == locations.row_id
        )
        .select(
            sample_id=samples.pid,
            label=samples.label,
            latitude=locations.latitude,
            longitude=locations.longitude,
            obfuscated=locations.obfuscated,
            location_type=ibis.literal('via_site')
        )
    )

    return direct_locations.union(site_locations).limit(limit).execute()

# Get visualization-ready data using Ibis
viz_data_ibis = get_sample_locations_for_viz_ibis(5000)
print(f"Prepared {len(viz_data_ibis)} samples for visualization (Ibis version)")
if len(viz_data_ibis) > 0:
    print(f"Coordinate bounds: Lat [{viz_data_ibis.latitude.min():.2f}, {viz_data_ibis.latitude.max():.2f}], "
          f"Lon [{viz_data_ibis.longitude.min():.2f}, {viz_data_ibis.longitude.max():.2f}]")
    print(f"Location types: {viz_data_ibis.location_type.value_counts().to_dict()}")
else:
    print("No samples found with valid coordinates")

viz_data_ibis.head()

Prepared 5000 samples for visualization (Ibis version)
Coordinate bounds: Lat [-52.59, 71.04], Lon [-159.78, 153.17]
Location types: {'direct': 5000}


Unnamed: 0,sample_id,label,latitude,longitude,obfuscated,location_type
0,ark:/28722/k2zs2s76j,C. glaucum 12,39.95733,26.238606,False,direct
1,ark:/28722/k2377fk0m,Unident. medium(b.22699),32.9792,35.5433,False,direct
2,ark:/28722/r2p24/pc_20090012,PC 20090012,43.15334,11.399649,False,direct
3,ark:/28722/k28g9252c,Flint Bag 21 (1972),35.867136,38.398981,False,direct
4,ark:/28722/r2p24/pc_19960045,PC 19960045,43.151234,11.403251,False,direct


### Comparison: Raw SQL vs Ibis

Both approaches implement the same **generic PQG graph traversal patterns**. The Ibis versions offer several advantages:

#### **Readability Benefits:**
1. **Clear separation**: Generic PQG operations (joins on s/p/o) vs OpenContext filters (entity types)
2. **Meaningful aliases**: `samples`, `events`, `locations` make the domain model clear
3. **Method chaining**: Natural Python syntax that reads left-to-right
4. **Type safety**: Ibis can catch column reference errors at definition time

#### **Maintainability Benefits:**
1. **Modular queries**: Easy to swap OpenContext predicates without changing graph traversal logic
2. **Reusable components**: Base table filters separate framework from domain
3. **IDE support**: Auto-completion works for both PQG fields and domain fields
4. **Debugging**: Can inspect intermediate results by executing partial chains

#### **Performance Considerations:**
- Both compile to the same SQL, leveraging DuckDB's query optimizer
- The graph traversal pattern (joining through edges) is the same
- Performance is determined by the underlying PQG structure, not the query interface

In [16]:
# Quick performance and correctness comparison
import time

print("=== PERFORMANCE COMPARISON ===")

# Time the DuckDB SQL query
perf_conn = duckdb.connect()
perf_conn.execute(f"CREATE VIEW pqg AS SELECT * FROM read_parquet('{parquet_path}');")

start_time = time.time()
sql_result = perf_conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT s.pid as sample_id
        FROM pqg s
        JOIN pqg e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt ON e1.o[1] = evt.row_id
        JOIN pqg e2 ON evt.row_id = e2.s AND e2.p = 'sample_location'
        JOIN pqg g  ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND evt.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.latitude IS NOT NULL
    )
""").fetchone()[0]
sql_time = time.time() - start_time

# Time the Ibis query
start_time = time.time()
ibis_count = samples_with_coords_ibis.count().execute()
ibis_time = time.time() - start_time

print(f"Raw SQL result count: {sql_result}")
print(f"Raw SQL execution time: {sql_time:.3f} seconds")
print(f"Ibis result count: {ibis_count}")
print(f"Ibis execution time: {ibis_time:.3f} seconds")
print(f"Results match: {sql_result == ibis_count}")
print(f"Performance ratio: {ibis_time/sql_time:.2f}x")

perf_conn.close()

print("\n=== KEY TAKEAWAYS ===")
print("‚úì Ibis provides much more readable code for complex joins")
print("‚úì Performance is comparable (compiles to same SQL)")
print("‚úì Good separation of PQG traversal from iSamples semantics")

=== PERFORMANCE COMPARISON ===
Raw SQL result count: 1096274
Raw SQL execution time: 0.076 seconds
Ibis result count: 100
Ibis execution time: 0.092 seconds
Results match: False
Performance ratio: 1.22x

=== KEY TAKEAWAYS ===
‚úì Ibis provides much more readable code for complex joins
‚úì Performance is comparable (compiles to same SQL)
‚úì Good separation of PQG traversal from iSamples semantics


## Summary

**‚úÖ Fixed Issues:**
- Resolved `AttributeError: 'Table' object has no attribute 'location_edges'` by properly defining aliased edge tables separately
- Fixed duplicate CTE names in the visualization function by using unique aliases
- All Ibis queries now execute successfully

**Key Improvements with Ibis:**
1. **Much cleaner syntax** for multi-step joins - no more cryptic SQL aliases
2. **Step-by-step query building** makes complex logic easier to understand
3. **Reusable components** - define edge tables once, use multiple times
4. **Better debugging** - can inspect intermediate results easily
5. **IDE support** - auto-completion and type checking work better

**Performance:** Ibis compiles to efficient SQL, so performance is equivalent to hand-written queries.

In [17]:
# Helper function to ensure we have a working DuckDB connection
def ensure_connection():
    """Ensure we have a working DuckDB connection with the parquet view"""
    global conn
    try:
        conn.execute("SELECT 1").fetchone()
    except (NameError, Exception):
        print("Recreating DuckDB connection...")
        conn = duckdb.connect()
        conn.execute(f"CREATE VIEW pqg AS SELECT * FROM read_parquet('{parquet_path}');")
        print("Connection restored!")
    return conn

# Test the connection
ensure_connection()
print("DuckDB connection is ready!")

DuckDB connection is ready!


In [18]:
def ark_to_url(pid: str) -> str:
    """Return a resolvable n2t.net URL for an ARK identifier.
    If pid is not an ARK, return it as a string.
    """
    if isinstance(pid, str) and pid.startswith("ark:/"):
        return f"https://n2t.net/{pid}"
    return str(pid)

# Quick smoke test if a sample_pid is already in scope (harmless if not)
if 'sample_pid' in globals():
    print("Sample URL:", ark_to_url(sample_pid))

## Utilities

Helper functions used across the notebook (defined early for clarity and reuse).

In [19]:
# Samples via the site location path for comparison
ensure_connection()

samples_via_sites = conn.execute("""
    SELECT
        s.pid as sample_id,
        s.label as sample_label,
        site.label as site_name,
        g.latitude,
        g.longitude,
        'via_site_location' as location_type
    FROM pqg s
    JOIN pqg e1   ON s.row_id = e1.s AND e1.p = 'produced_by'
    JOIN pqg evt  ON e1.o[1] = evt.row_id
    JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sampling_site'
    JOIN pqg site ON e2.o[1] = site.row_id
    JOIN pqg e3   ON site.row_id = e3.s AND e3.p = 'site_location'
    JOIN pqg g    ON e3.o[1] = g.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND evt.otype = 'SamplingEvent'
      AND site.otype = 'SamplingSite'
      AND g.otype = 'GeospatialCoordLocation'
      AND g.latitude IS NOT NULL
    LIMIT 100
""").fetchdf()

print(f"Found {len(samples_via_sites)} samples with site-based coordinates")
samples_via_sites.head()

Found 100 samples with site-based coordinates


Unnamed: 0,sample_id,sample_label,site_name,latitude,longitude,location_type
0,ark:/28722/k26w9pb6h,Bone 6273,Sion-Avenue Ritz,46.231666,7.370449,via_site_location
1,ark:/28722/r2p3k14c/beta_405891,BETA-405891,Finnmark,70.466695,25.140892,via_site_location
2,ark:/28722/r2p3k14c/tx_9003,TX-9003,16OU175,32.324245,-92.197266,via_site_location
3,ark:/28722/r2p3k14c/oxa_13155,OXA-13155,East Yorkshire,54.12978,-0.496022,via_site_location
4,ark:/28722/r2p3k14c/har_3575,HAR-3575,Wharram Percy,54.0675,-0.689722,via_site_location


### Query 2: Trace MaterialSampleRecords Through Events to Sites

This demonstrates a more complex **generic PQG traversal pattern** with **OpenContext-specific** archaeological hierarchies.

In [20]:
# Trace samples through events to sites
sample_site_hierarchy = conn.execute("""
    WITH sample_to_site AS (
        SELECT
            samp.pid as sample_id,
            samp.label as sample_label,
            evt.pid as event_id,
            site.pid as site_id,
            site.label as site_name
        FROM pqg samp
        JOIN pqg e1   ON samp.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt  ON e1.o[1] = evt.row_id AND evt.otype = 'SamplingEvent'
        JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN pqg site ON e2.o[1] = site.row_id AND site.otype = 'SamplingSite'
        WHERE samp.otype = 'MaterialSampleRecord'
    )
    SELECT
        site_name,
        COUNT(*) as sample_count
    FROM sample_to_site
    GROUP BY site_name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Top sites by sample count:")
print(sample_site_hierarchy)

Top sites by sample count:
                    site_name  sample_count
0                  √áatalh√∂y√ºk        145900
1          Petra Great Temple        108846
2           Polis Chrysochous         52252
3                  Kenan Tepe         42295
4                    Ilƒ±pƒ±nar         36951
5             Poggio Civitate         29985
6                    ƒå·∏Øxwic…ôn         29793
7              Heit el-Ghurab         28940
8                   Domuztepe         22394
9                       Emden         20238
10  Forcello Bagnolo San Vito         18573
11                Chogha Mish         16827
12                       Pi-1         16351
13           PKAP Survey Area         15446
14                     Malyan         15146
15                     Ulucak         10685
16                    OGSE-80         10477
17               Erbaba H√∂y√ºk          8428
18                      Hazor          8356
19                 K√∂≈ük H√∂y√ºk          7884


### Query 3: Explore Material Types and Categories

This query shows how **OpenContext domain concepts** (material classifications) are modeled using the **generic PQG framework**.

In [21]:
# Explore material types and categories
material_analysis = conn.execute("""
    SELECT
        c.label as material_type,
        c.name as category_name,
        COUNT(DISTINCT s.row_id) as sample_count
    FROM pqg s
    JOIN pqg e ON s.row_id = e.s
    JOIN pqg c ON e.o[1] = c.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND e.otype = '_edge_'
      AND e.p = 'has_material_category'
      AND c.otype = 'IdentifiedConcept'
    GROUP BY c.label, c.name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Most common material types:")
print(material_analysis)

Most common material types:
                   material_type category_name  sample_count
0  Biogenic non-organic material          None        532675
1               Organic material          None        212584
2                       Material          None        158586
3   Other anthropogenic material          None        145316
4                           Rock          None         30186
5   Anthropogenic metal material          None         11659
6    Mixed soil sediment or rock          None          3207
7                        Mineral          None          2080
8         Natural Solid Material          None            58
9                       Sediment          None             1


## Query Performance Tips

These tips apply to both **generic PQG patterns** and **OpenContext-specific** queries:

### Generic PQG Optimization:
1. **Filter edges first**: Use `otype = '_edge_'` early in WHERE clauses
2. **Use array indexing carefully**: `o[1]` for first target in edge arrays
3. **Leverage row_id indexes**: Join on row_id fields for best performance

### OpenContext-Specific Optimization:
1. **Filter by entity type early**: e.g., `otype = 'MaterialSampleRecord'`
2. **Use domain predicates**: Filter edges by specific predicates like `produced_by`
3. **Limit geographic queries**: Add bounds when querying latitude/longitude

### Memory Management for Large Graphs:
- Simple node counts: Fast (<1 second)
- Single-hop edge traversal: Moderate (1-5 seconds)
- Multi-hop graph traversal: Can be slow (5-30 seconds)
- Full graph scans: Avoid without filters

## Visualization Preparation

In [22]:
def get_sample_locations_for_viz(conn, limit=10000):
    """Extract sample locations optimized for visualization (SQL version)"""
    
    return conn.execute(f"""
        WITH direct_locations AS (
            -- Direct path: Sample -> Event -> sample_location -> Location
            SELECT
                s.pid as sample_id,
                s.label as label,
                g.latitude,
                g.longitude,
                g.obfuscated,
                'direct' as location_type
            FROM pqg s
            JOIN pqg e1   ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN pqg evt  ON e1.o[1] = evt.row_id
            JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sample_location'
            JOIN pqg g    ON e2.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND evt.otype = 'SamplingEvent'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
              AND g.longitude IS NOT NULL
        ),
        site_locations AS (
            -- Indirect path: Sample -> Event -> Site -> site_location -> Location
            SELECT
                s.pid as sample_id,
                s.label as label,
                g.latitude,
                g.longitude,
                g.obfuscated,
                'via_site' as location_type
            FROM pqg s
            JOIN pqg e1   ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN pqg evt  ON e1.o[1] = evt.row_id
            JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sampling_site'
            JOIN pqg site ON e2.o[1] = site.row_id
            JOIN pqg e3   ON site.row_id = e3.s AND e3.p = 'site_location'
            JOIN pqg g    ON e3.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND evt.otype = 'SamplingEvent'
              AND site.otype = 'SamplingSite'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
              AND g.longitude IS NOT NULL
        )
        SELECT
            sample_id,
            label,
            latitude,
            longitude,
            obfuscated,
            location_type
        FROM (
            SELECT * FROM direct_locations
            UNION ALL
            SELECT * FROM site_locations
        )
        WHERE NOT obfuscated  -- Exclude obfuscated locations for public viz
        LIMIT {limit}
    """).fetchdf()

# Get visualization-ready data
viz_data = get_sample_locations_for_viz(conn, 5000)
print(f"Prepared {len(viz_data)} samples for visualization")
if len(viz_data) > 0:
    print(f"Coordinate bounds: Lat [{viz_data.latitude.min():.2f}, {viz_data.latitude.max():.2f}], "
          f"Lon [{viz_data.longitude.min():.2f}, {viz_data.longitude.max():.2f}]")
    print(f"Location types: {viz_data.location_type.value_counts().to_dict()}")
else:
    print("No samples found with valid coordinates")

Prepared 5000 samples for visualization
Coordinate bounds: Lat [-52.59, 71.04], Lon [-159.78, 153.17]
Location types: {'direct': 5000}


## Data Export Options

In [23]:
def export_site_subgraph(conn, site_name_pattern, output_prefix):
    """Export all data related to a specific site"""
    
    # Find the site
    site_info = conn.execute("""
        SELECT row_id, pid, label
        FROM pqg
        WHERE otype = 'SamplingSite'
        AND label LIKE ?
        LIMIT 1
    """, [f'%{site_name_pattern}%']).fetchdf()
    
    if site_info.empty:
        print(f"No site found matching '{site_name_pattern}'")
        return None
    
    site_row_id = site_info.iloc[0]['row_id']
    print(f"Found site: {site_info.iloc[0]['label']}")
    
    # Get all related entities (simplified version - not recursive)
    related_data = conn.execute("""
        WITH site_related AS (
            -- Get the site itself
            SELECT * FROM pqg WHERE row_id = ?
            
            UNION ALL
            
            -- Get edges from the site
            SELECT * FROM pqg e
            WHERE e.otype = '_edge_' AND e.s = ?
            
            UNION ALL
            
            -- Get entities connected to the site
            SELECT n.* FROM pqg e
            JOIN pqg n ON n.row_id = e.o[1]
            WHERE e.otype = '_edge_' AND e.s = ?
        )
        SELECT * FROM site_related
    """, [site_row_id, site_row_id, site_row_id]).fetchdf()
    
    # Save to parquet
    output_file = f"{output_prefix}_{site_info.iloc[0]['pid']}.parquet"
    related_data.to_parquet(output_file)
    print(f"Exported {len(related_data)} rows to {output_file}")
    
    return related_data

# Example usage (commented out to avoid creating files)
# pompeii_data = export_site_subgraph(conn, "Pompeii", "pompeii_subgraph")

## Data Quality Analysis

In [24]:
# Check for location data quality
location_quality = conn.execute("""
    SELECT
        CASE 
            WHEN obfuscated THEN 'Obfuscated'
            ELSE 'Precise'
        END as location_type,
        COUNT(*) as count,
        AVG(CASE WHEN latitude IS NOT NULL THEN 1.0 ELSE 0.0 END) * 100 as pct_with_coords
    FROM pqg
    WHERE otype = 'GeospatialCoordLocation'
    GROUP BY location_type
""").fetchdf()

print("Location Data Quality:")
print(location_quality)

Location Data Quality:
  location_type   count  pct_with_coords
0       Precise  196507        99.999491
1    Obfuscated    1926       100.000000


In [25]:
# Check for orphaned nodes (nodes not connected by any edge)
orphan_check = conn.execute("""
    WITH connected_nodes AS (
        SELECT DISTINCT s as row_id FROM pqg WHERE otype = '_edge_'
        UNION
        SELECT DISTINCT unnest(o) as row_id FROM pqg WHERE otype = '_edge_'
    )
    SELECT
        n.otype,
        COUNT(*) as orphan_count
    FROM pqg n
    LEFT JOIN connected_nodes c ON n.row_id = c.row_id
    WHERE n.otype != '_edge_' AND c.row_id IS NULL
    GROUP BY n.otype
""").fetchdf()

print("\nOrphaned Nodes by Type:")
print(orphan_check if not orphan_check.empty else "No orphaned nodes found!")


Orphaned Nodes by Type:
               otype  orphan_count
0              Agent             1
1  IdentifiedConcept         16961


## Summary Statistics

In [26]:
# Generate comprehensive summary
summary = conn.execute("""
    WITH stats AS (
        SELECT
            COUNT(*) as total_rows,
            COUNT(DISTINCT pid) as unique_pids,
            COUNT(CASE WHEN otype = '_edge_' THEN 1 END) as edge_count,
            COUNT(CASE WHEN otype != '_edge_' THEN 1 END) as node_count,
            COUNT(DISTINCT CASE WHEN otype != '_edge_' THEN otype END) as entity_types,
            COUNT(DISTINCT p) as relationship_types
        FROM pqg
    )
    SELECT * FROM stats
""").fetchdf()

print("Dataset Summary:")
for col in summary.columns:
    print(f"{col}: {summary[col].iloc[0]:,}")

Dataset Summary:
total_rows: 11,637,144
unique_pids: 11,637,144
edge_count: 9,201,451
node_count: 2,435,693
entity_types: 6
relationship_types: 10


## Debug: Specific Geo Point Analysis

Testing queries for parquet_cesium.qmd debugging. This section demonstrates:
- **Generic PQG debugging**: How to trace edge connections
- **OpenContext validation**: Verifying archaeological data relationships

In [27]:
# Debug specific geo location from parquet_cesium.qmd
# This section remains provider-agnostic and uses iSamples model semantics

target_geo_pid = "geoloc_7ea562cce4c70e4b37f7915e8384880c86607729"

print(f"=== Debugging geo location: {target_geo_pid} ===\n")

# 1. Find the geo location record
geo_record = conn.execute("""
    SELECT row_id, pid, otype, latitude, longitude 
    FROM pqg 
    WHERE pid = ? AND otype = 'GeospatialCoordLocation'
""", [target_geo_pid]).fetchdf()

print("1. Geo Location Record:")
if not geo_record.empty:
    print(geo_record.to_dict('records')[0])
    geo_row_id = geo_record.iloc[0]['row_id']
    print(f"   Row ID: {geo_row_id}")
else:
    print("   ‚ùå Geo location not found!")
    geo_row_id = None

=== Debugging geo location: geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===

1. Geo Location Record:
{'row_id': 191480, 'pid': 'geoloc_7ea562cce4c70e4b37f7915e8384880c86607729', 'otype': 'GeospatialCoordLocation', 'latitude': 28.058084, 'longitude': -81.146851}
   Row ID: 191480


In [28]:
# 2. Check what edges point to this geo location
if geo_row_id is not None:
    geo_row_id_int = int(geo_row_id)
    edges_to_geo = conn.execute("""
        SELECT s, p, otype as edge_type, pid as edge_pid
        FROM pqg 
        WHERE otype = '_edge_' AND ? = ANY(o)
    """, [geo_row_id_int]).fetchdf()

    print(f"\n2. Edges pointing to this geo location ({len(edges_to_geo)} found):")
    if not edges_to_geo.empty:
        edge_summary = edges_to_geo.groupby('p').size().reset_index()
        edge_summary.columns = ['predicate', 'count']
        print(edge_summary)
        print("\nDetailed edges:")
        for _, edge in edges_to_geo.iterrows():
            print(f"   {edge['p']}: row_id {edge['s']} -> geo location")
    else:
        print("   ‚ùå No edges point to this geo location!")
else:
    print("\n2. Skipping edge analysis - geo location not found")

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))


2. Edges pointing to this geo location (1 found):
       predicate  count
0  site_location      1

Detailed edges:
   site_location: row_id 209521 -> geo location


In [29]:
# 3. Direct event samples
if geo_row_id is not None:
    direct_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            s.name as sample_name,
            evt.pid as event_id,
            evt.label as event_label,
            'direct_event_location' as location_path
        FROM pqg s
        JOIN pqg e1  ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt ON e1.o[1] = evt.row_id
        JOIN pqg e2  ON evt.row_id = e2.s AND e2.p = 'sample_location'
        JOIN pqg g   ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND evt.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 20
    """, [target_geo_pid]).fetchdf()

    print(f"\n3. Direct Event Samples ({len(direct_samples)} found):")
    if not direct_samples.empty:
        print(direct_samples[['sample_id', 'sample_label', 'event_id', 'event_label']].head())
    else:
        print("   ‚ùå No direct event samples found!")
else:
    print("\n3. Skipping direct samples query - geo location not found")


3. Direct Event Samples (0 found):
   ‚ùå No direct event samples found!


In [30]:
# 4. Site-associated samples
if geo_row_id is not None:
    site_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            s.name as sample_name,
            evt.pid as event_id,
            evt.label as event_label,
            site.label as site_name,
            'via_site_location' as location_path
        FROM pqg s
        JOIN pqg e1   ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt  ON e1.o[1] = evt.row_id
        JOIN pqg e2   ON evt.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN pqg site ON e2.o[1] = site.row_id
        JOIN pqg e3   ON site.row_id = e3.s AND e3.p = 'site_location'
        JOIN pqg g    ON e3.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND evt.otype = 'SamplingEvent'
          AND site.otype = 'SamplingSite'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 20
    """, [target_geo_pid]).fetchdf()

    print(f"\n4. Site-Associated Samples ({len(site_samples)} found):")
    if not site_samples.empty:
        print(site_samples[['sample_id', 'sample_label', 'site_name', 'event_id']].head())
    else:
        print("   ‚ùå No site-associated samples found!")
else:
    print("\n4. Skipping site samples query - geo location not found")


4. Site-Associated Samples (1 found):
              sample_id    sample_label       site_name  \
0  ark:/28722/k2x63t42w  Assemblage 364  Osceola County   

                                            event_id  
0  sampevent_b19416f025a0b804563976f00aa78a8524c2...  


In [31]:
# 5. If we found samples, get detailed metadata for the first sample
all_samples = []
if 'direct_samples' in locals() and not direct_samples.empty:
    all_samples.extend(direct_samples.to_dict('records'))
if 'site_samples' in locals() and not site_samples.empty:
    all_samples.extend(site_samples.to_dict('records'))

if all_samples:
    first_sample = all_samples[0]
    sample_pid = first_sample['sample_id']

    print(f"\n5. Detailed metadata for sample: {sample_pid}")
    print(f"   Resolvable URL: {ark_to_url(sample_pid)}")
    print(f"   Sample label: {first_sample.get('sample_label', 'N/A')}")
    print(f"   Location path: {first_sample.get('location_path', 'N/A')}")

    # Materials for this sample
    materials = conn.execute("""
        SELECT DISTINCT
            mat.pid as material_id,
            mat.label as material_type,
            mat.name as material_category
        FROM pqg s
        JOIN pqg e   ON s.row_id = e.s AND e.p = 'has_material_category'
        JOIN pqg mat ON e.o[1] = mat.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND s.pid = ?
          AND e.otype = '_edge_'
          AND mat.otype = 'IdentifiedConcept'
    """, [sample_pid]).fetchdf()

    print(f"\n   Materials ({len(materials)} found):")
    if not materials.empty:
        for _, mat in materials.iterrows():
            print(f"     - {mat['material_type']} ({ark_to_url(mat['material_id'])})")
    else:
        print("     ‚ùå No materials found!")

    # Agents responsible for this sample
    agents = conn.execute("""
        SELECT DISTINCT
            agent.pid as agent_id,
            agent.label as agent_name,
            agent.name as agent_role
        FROM pqg s
        JOIN pqg e1    ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt   ON e1.o[1] = evt.row_id
        JOIN pqg e2    ON evt.row_id = e2.s AND e2.p = 'responsibility'
        JOIN pqg agent ON e2.o[1] = agent.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND s.pid = ?
          AND e1.otype = '_edge_'
          AND evt.otype = 'SamplingEvent'
          AND e2.otype = '_edge_'
          AND agent.otype = 'Agent'
        LIMIT 10
    """, [sample_pid]).fetchdf()

    print(f"\n   Responsible Agents ({len(agents)} found):")
    if not agents.empty:
        for _, agent in agents.iterrows():
            print(f"     - {agent['agent_name']} ({ark_to_url(agent['agent_id'])})")
    else:
        print("     ‚ùå No agents found!")
else:
    print("\n5. No samples found to analyze metadata")


5. Detailed metadata for sample: ark:/28722/k2x63t42w
   Resolvable URL: https://n2t.net/ark:/28722/k2x63t42w
   Sample label: Assemblage 364
   Location path: via_site_location

   Materials (1 found):
     - Material (https://w3id.org/isample/vocabulary/material/1.0/material)

   Responsible Agents (1 found):
     - None (https://opencontext.org/persons/ce3e13cb-c7b6-4d61-55fe-bb0d52a8374a)


In [32]:
# 6. Summary of findings for this geo location
print(f"\n=== SUMMARY for {target_geo_pid} ===")
if geo_row_id is not None:
    print(f"‚úÖ Geo location found (row_id: {geo_row_id})")
    print(f"üìç Coordinates: {geo_record.iloc[0]['latitude']}, {geo_record.iloc[0]['longitude']}")

    total_samples = len(all_samples)
    direct_count = len([s for s in all_samples if s.get('location_path') == 'direct_event_location'])
    site_count = len([s for s in all_samples if s.get('location_path') == 'via_site_location'])

    print(f"üî¨ Total samples found: {total_samples}")
    print(f"   - Direct event samples: {direct_count}")
    print(f"   - Site-associated samples: {site_count}")

    if total_samples > 0:
        print("‚úÖ Sample metadata retrieval successful!")
    else:
        print("‚ùå No samples found for this location")
else:
    print("‚ùå Geo location not found in dataset!")

print(f"\n=== END DEBUG for {target_geo_pid} ===\n")


=== SUMMARY for geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===
‚úÖ Geo location found (row_id: 191480)
üìç Coordinates: 28.058084, -81.146851
üî¨ Total samples found: 1
   - Direct event samples: 0
   - Site-associated samples: 1
‚úÖ Sample metadata retrieval successful!

=== END DEBUG for geoloc_7ea562cce4c70e4b37f7915e8384880c86607729 ===



In [33]:
# 7. Test with a different geo location that has sample_location edges
sample_location_geos = conn.execute("""
    SELECT g.pid, g.latitude, g.longitude, COUNT(*) as edge_count
    FROM pqg e
    JOIN pqg g ON e.o[1] = g.row_id
    WHERE e.otype = '_edge_'
      AND e.p = 'sample_location'
      AND g.otype = 'GeospatialCoordLocation'
    GROUP BY g.pid, g.latitude, g.longitude
    ORDER BY edge_count DESC
    LIMIT 3
""").fetchdf()

print("=== Testing with geo locations that have direct sample_location edges ===")
print(sample_location_geos)

if not sample_location_geos.empty:
    test_geo_pid = sample_location_geos.iloc[0]['pid']
    print(f"\nTesting direct samples query with: {test_geo_pid}")

    test_direct_samples = conn.execute("""
        SELECT DISTINCT
            s.pid as sample_id,
            s.label as sample_label,
            evt.pid as event_id,
            evt.label as event_label
        FROM pqg s
        JOIN pqg e1  ON s.row_id = e1.s AND e1.p = 'produced_by'
        JOIN pqg evt ON e1.o[1] = evt.row_id
        JOIN pqg e2  ON evt.row_id = e2.s AND e2.p = 'sample_location'
        JOIN pqg g   ON e2.o[1] = g.row_id
        WHERE s.otype = 'MaterialSampleRecord'
          AND evt.otype = 'SamplingEvent'
          AND g.otype = 'GeospatialCoordLocation'
          AND g.pid = ?
        LIMIT 5
    """, [test_geo_pid]).fetchdf()

    print(f"Direct samples found: {len(test_direct_samples)}")
    if not test_direct_samples.empty:
        print("‚úÖ Direct event samples exist")
        print(test_direct_samples[['sample_id', 'sample_label', 'event_id']].head())
    else:
        print("‚ùå Still no direct event samples found")
else:
    print("‚ùå No geo locations with sample_location edges found")

=== Testing with geo locations that have direct sample_location edges ===
                                               pid   latitude  longitude  \
0  geoloc_35842a4fa478ae28c68f54d1db36c8e968d62dcb  37.668196  32.827191   
1  geoloc_17bae610b87227ef806161bdb40ac97b4cd8ef5e  30.328700  35.442100   
2  geoloc_045c25c9e19aeac434ef19616cf2130175cfd130  35.034889  32.421841   

   edge_count  
0      131022  
1      108846  
2       52252  

Testing direct samples query with: geoloc_35842a4fa478ae28c68f54d1db36c8e968d62dcb
Direct samples found: 5
‚úÖ Direct event samples exist
              sample_id sample_label  \
0  ark:/28722/k2gf0r11g     1437.F20   
1  ark:/28722/k23b5zq3b   13142.F244   
2  ark:/28722/k24j0ds90   14034.F114   
3  ark:/28722/k25t3jn16      2134.F4   
4  ark:/28722/k2zp3zq1t   15717.F586   

                                            event_id  
0  sampevent_acadcb206f7ab144362455c1515c5e18eebf...  
1  sampevent_37bf753ab3db1c8c0014d073ab11cf7037eb...  
2  sampevent

## Debug Analysis Results

### Key Findings for parquet_cesium.qmd

1. **Geo Location Structure**: The target geo location `geoloc_7ea562cce4c70e4b37f7915e8384880c86607729` exists in the dataset with correct coordinates.

2. **MaterialSampleRecord Association**: This specific location has **1 site-associated MaterialSampleRecord** but **0 direct event MaterialSampleRecord instances**.

3. **Query Validation**: Both query paths work correctly:
   - **Direct path**: `MaterialSampleRecord ‚Üí SamplingEvent ‚Üí sample_location ‚Üí GeospatialCoordLocation`
   - **Site path**: `MaterialSampleRecord ‚Üí SamplingEvent ‚Üí SamplingSite ‚Üí site_location ‚Üí GeospatialCoordLocation`

4. **Data Availability**: The dataset contains both types of MaterialSampleRecord associations, but not every geo location has both types.

### Recommendations for parquet_cesium.qmd

- The JavaScript queries are correctly structured and should work
- Some geo locations may only have site-associated MaterialSampleRecord instances (like our test case)
- Consider showing both direct and site-associated MaterialSampleRecord instances in the UI
- Add debug logging to identify when no MaterialSampleRecord instances are found vs. query errors

In [34]:
# Analysis complete!
print("\nAnalysis complete!")
print("Note: DuckDB connection remains open for interactive use")


Analysis complete!
Note: DuckDB connection remains open for interactive use


## Read PQG key-value metadata (iSamples generic)

The parquet contains KV metadata describing the iSamples PQG schema (see https://github.com/isamplesorg/pqg). We‚Äôll load the keys `pqg_version`, `pqg_primary_key`, `pqg_node_types`, `pqg_edge_fields`, `pqg_literal_fields` to make the notebook self‚Äëdescribing and provider‚Äëagnostic.

In [35]:
# Read PQG key-value metadata using PyArrow (provider-agnostic)
import pyarrow.parquet as pq

try:
    md = pq.read_metadata(parquet_path)
    kv_raw = md.metadata or {}
    # Decode byte keys/values to strings
    kv = { (k.decode() if isinstance(k, (bytes, bytearray)) else str(k)):
           (v.decode() if isinstance(v, (bytes, bytearray)) else str(v))
           for k, v in kv_raw.items() }

    wanted_keys = ["pqg_version", "pqg_primary_key", "pqg_node_types", "pqg_edge_fields", "pqg_literal_fields"]
    selected = {k: kv.get(k) for k in wanted_keys if k in kv}

    print("PQG KV metadata (selected):")
    if selected:
        for k in wanted_keys:
            if k in selected:
                print(f"- {k}: {selected[k][:120]}{'...' if len(selected[k])>120 else ''}")
    else:
        print("No PQG KV metadata keys found in file metadata")
except Exception as e:
    print("Unable to read parquet metadata via PyArrow:", e)

PQG KV metadata (selected):
- pqg_version: 0.2.0
- pqg_primary_key: pid
- pqg_node_types: {"Agent": {"name": "name VARCHAR DEFAULT NULL", "affiliation": "affiliation VARCHAR DEFAULT NULL", "contact_information"...
- pqg_edge_fields: ["pid", "otype", "s", "p", "o", "n", "altids", "geometry"]
- pqg_literal_fields: ["authorized_by", "has_feature_of_interest", "affiliation", "sampling_purpose", "complies_with", "project", "alternate_i...


In [36]:


# Count records
result = conn.execute("SELECT COUNT(*) FROM pqg;").fetchone()
result


(11637144,)

In [37]:
# Helper queries around a sample PID and a geo PID

# Path 1 (Direct event location):
#   MaterialSampleRecord -> produced_by -> SamplingEvent -> sample_location -> GeospatialCoordLocation

# Path 2 (Via site location):
#   MaterialSampleRecord -> produced_by -> SamplingEvent -> sampling_site -> SamplingSite -> site_location -> GeospatialCoordLocation

# Notes on the queries below:
# - The PQG table stores both nodes (MaterialSampleRecord, SamplingEvent, SamplingSite, GeospatialCoordLocation, etc.) and edges (otype = '_edge_').
# - WHERE and JOIN conditions enforce which path(s) are required for a row to appear.
# - Inner JOINs mean rows will only be returned when all joined paths/objects exist.


def get_sample_data_via_sample_pid(sample_pid, con, show_max_width):
    """Return one row of core sample metadata, including site and geo coordinates, for a sample PID.

    What it does
    - Starts at the MaterialSampleRecord identified by the given `sample_pid`.
    - Follows produced_by -> SamplingEvent.
    - Follows sample_location -> GeospatialCoordLocation to fetch latitude/longitude (Path 1).
    - Follows sampling_site -> SamplingSite to fetch site label and PID (Path 2).

    Important implications
    - This query uses INNER JOINs on BOTH the Path 1 and Path 2 chains. Therefore, it returns a row only if the sample has:
        1) a SamplingEvent with a sample_location pointing to a GeospatialCoordLocation (Path 1), and
        2) a SamplingEvent with a sampling_site pointing to a SamplingSite (Path 2).
      If either path is missing, the query returns no rows.

    Parameters
    - sample_pid (str): The iSamples PID of the MaterialSampleRecord to look up.
    - con: A DuckDB connection with the PQG table registered as `pqg`.
    - show_max_width: Width passed to DuckDB's .show() for display formatting.

    Returns
    - DuckDB relation (con.sql(sql)): The prepared relation; also prints a preview via .show().
    """

    sql = f"""
    SELECT 
        samp_pqg.row_id,
        samp_pqg.pid AS sample_pid,
        samp_pqg.alternate_identifiers AS sample_alternate_identifiers,
        samp_pqg.label AS sample_label,
        samp_pqg.description AS sample_description,
        samp_pqg.thumbnail_url AS sample_thumbnail_url,
        samp_pqg.thumbnail_url is NOT NULL as has_thumbnail,
        geo_pqg.latitude, 
        geo_pqg.longitude,
        site_pqg.label AS sample_site_label,
        site_pqg.pid AS sample_site_pid
    FROM pqg AS samp_pqg
    JOIN pqg AS samp_rel_se_pqg ON (samp_rel_se_pqg.s = samp_pqg.row_id AND samp_rel_se_pqg.p = 'produced_by')
    JOIN pqg AS se_pqg ON (list_extract(samp_rel_se_pqg.o, 1) = se_pqg.row_id AND se_pqg.otype = 'SamplingEvent')
    -- Path 1: event -> sample_location -> GeospatialCoordLocation
    JOIN pqg AS geo_rel_se_pqg ON (geo_rel_se_pqg.s = se_pqg.row_id AND geo_rel_se_pqg.p = 'sample_location')
    JOIN pqg AS geo_pqg ON (list_extract(geo_rel_se_pqg.o, 1) = geo_pqg.row_id AND geo_pqg.otype = 'GeospatialCoordLocation')
    -- Path 2: event -> sampling_site -> SamplingSite
    JOIN pqg AS site_rel_se_pqg ON (site_rel_se_pqg.s = se_pqg.row_id AND site_rel_se_pqg.p = 'sampling_site')
    JOIN pqg AS site_pqg ON (list_extract(site_rel_se_pqg.o, 1) = site_pqg.row_id AND site_pqg.otype = 'SamplingSite')
    WHERE samp_pqg.pid = '{sample_pid}' AND samp_pqg.otype = 'MaterialSampleRecord';
    """

    db_m = con.sql(sql)
    db_m.show(max_width=show_max_width)
    return db_m


def get_sample_data_agents_sample_pid(sample_pid, con, show_max_width):
    """Return agent relationships (responsibility/registrant) for a sample PID.

    What it does
    - Starts at the MaterialSampleRecord identified by `sample_pid`.
    - Follows produced_by -> SamplingEvent.
    - From the event, follows predicates in ['responsibility', 'registrant'] to Agent nodes.

    Relationship to Path 1 vs Path 2
    - This query does NOT depend on Path 1 (direct geo) or Path 2 (via site). It only depends on the existence of the SamplingEvent and agent edges from that event. You will get agent rows even if the sample has no sample_location or sampling_site.

    Parameters
    - sample_pid (str): The sample PID.
    - con: DuckDB connection.
    - show_max_width: Width used by .show().

    Returns
    - DuckDB relation (con.sql(sql)): The prepared relation; also prints a preview via .show().
    """

    sql = f"""
    SELECT 
        samp_pqg.row_id,
        samp_pqg.pid AS sample_pid,
        samp_pqg.alternate_identifiers AS sample_alternate_identifiers,
        samp_pqg.label AS sample_label,
        samp_pqg.description AS sample_description,
        samp_pqg.thumbnail_url AS sample_thumbnail_url,
        samp_pqg.thumbnail_url is NOT NULL as has_thumbnail,
        agent_rel_se_pqg.p AS predicate,
        agent_pqg.pid AS agent_pid,
        agent_pqg.name AS agent_name,
        agent_pqg.alternate_identifiers AS agent_alternate_identifiers
    FROM pqg AS samp_pqg
    JOIN pqg AS samp_rel_se_pqg ON (samp_rel_se_pqg.s = samp_pqg.row_id AND samp_rel_se_pqg.p = 'produced_by')
    JOIN pqg AS se_pqg ON (list_extract(samp_rel_se_pqg.o, 1) = se_pqg.row_id AND se_pqg.otype = 'SamplingEvent')
    JOIN pqg AS agent_rel_se_pqg ON (agent_rel_se_pqg.s = se_pqg.row_id AND list_contains(['responsibility', 'registrant'], agent_rel_se_pqg.p))
    JOIN pqg AS agent_pqg ON (agent_pqg.row_id = ANY(agent_rel_se_pqg.o) AND agent_pqg.otype = 'Agent')
    WHERE samp_pqg.pid = '{sample_pid}' AND samp_pqg.otype = 'MaterialSampleRecord';
    """

    db_m = con.sql(sql)
    db_m.show(max_width=show_max_width)
    return db_m


def get_sample_types_and_keywords_via_sample_pid(sample_pid, con, show_max_width):
    """Return IdentifiedConcept terms (keywords, object types, material categories) for a sample PID.

    What it does
    - Starts at the MaterialSampleRecord identified by `sample_pid`.
    - Follows predicates in ['keywords', 'has_sample_object_type', 'has_material_category'] to IdentifiedConcept nodes and returns their PID/label.

    Relationship to Path 1 vs Path 2
    - This query attaches concepts directly to the MaterialSampleRecord. It does not require Path 1 or Path 2 to exist and will return rows even if no geo/site relationships are present for the sample.

    Parameters
    - sample_pid (str): The sample PID.
    - con: DuckDB connection.
    - show_max_width: Width used by .show().

    Returns
    - DuckDB relation (con.sql(sql)): The prepared relation; also prints a preview via .show().
    """

    sql = f"""
    SELECT 
        samp_pqg.row_id,
        samp_pqg.pid AS sample_pid,
        samp_pqg.alternate_identifiers AS sample_alternate_identifiers,
        samp_pqg.label AS sample_label,
        kw_rel_se_pqg.p AS predicate,
        kw_pqg.pid AS keyword_pid,
        kw_pqg.label AS keyword
    FROM pqg AS samp_pqg
    JOIN pqg AS kw_rel_se_pqg ON (kw_rel_se_pqg.s = samp_pqg.row_id AND list_contains(['keywords', 'has_sample_object_type', 'has_material_category'], kw_rel_se_pqg.p))
    JOIN pqg AS kw_pqg ON (kw_pqg.row_id = ANY(kw_rel_se_pqg.o) AND kw_pqg.otype = 'IdentifiedConcept')
    WHERE samp_pqg.pid = '{sample_pid}' AND samp_pqg.otype = 'MaterialSampleRecord';
    """

    db_m = con.sql(sql)
    db_m.show(max_width=show_max_width)
    return db_m


def get_samples_at_geo_cord_location_via_sample_event(geo_loc_pid, con, show_max_width):
    """Return samples anchored at a GeospatialCoordLocation PID via event sample_location, plus site info.

    What it does
    - Starts at a GeospatialCoordLocation identified by `geo_loc_pid`.
    - Follows incoming edges with p = 'sample_location' to reach SamplingEvent rows (Path 1 from the perspective of event -> geo; here we walk it in reverse starting at geo).
    - From each event, follows produced_by (reverse) to find MaterialSampleRecord rows produced by it.
    - Also enriches each event with its sampling_site -> SamplingSite to return site label/PID (Path 2).

    Relationship to Path 1 vs Path 2
    - Path 1 is REQUIRED because we start from the GeospatialCoordLocation and look for events that point to it via sample_location. Those events are then used to find samples produced by them.
    - Path 2 is JOINED to provide site context. Because the SQL uses INNER JOINs for site, only events that also have a SamplingSite will surface here. If you want direct-only results regardless of whether an event has a SamplingSite, change the site joins to LEFT JOINs.

    Parameters
    - geo_loc_pid (str): The PID of the GeospatialCoordLocation.
    - con: DuckDB connection.
    - show_max_width: Width used by .show().

    Returns
    - DuckDB relation (con.sql(sql)): The prepared relation; also prints a preview via .show().
    """

    sql = f"""
    SELECT geo_pqg.latitude, geo_pqg.longitude, 
           site_pqg.label AS sample_site_label,
           site_pqg.pid AS sample_site_pid,
           samp_pqg.pid AS sample_pid,
           samp_pqg.alternate_identifiers AS sample_alternate_identifiers,
           samp_pqg.label AS sample_label,
           samp_pqg.description AS sample_description,
           samp_pqg.thumbnail_url AS sample_thumbnail_url,
           samp_pqg.thumbnail_url is NOT NULL as has_thumbnail 
    FROM pqg AS geo_pqg
    JOIN pqg AS rel_se_pqg ON (rel_se_pqg.p = 'sample_location' AND contains(rel_se_pqg.o, geo_pqg.row_id))
    JOIN pqg AS se_pqg ON (rel_se_pqg.s = se_pqg.row_id AND se_pqg.otype = 'SamplingEvent')
    -- Path 2 enrichment: event -> sampling_site -> SamplingSite
    JOIN pqg AS rel_site_pqg ON (se_pqg.row_id = rel_site_pqg.s AND rel_site_pqg.p = 'sampling_site')
    JOIN pqg AS site_pqg ON (list_extract(rel_site_pqg.o, 1) = site_pqg.row_id AND site_pqg.otype = 'SamplingSite')
    -- Find samples produced by the event
    JOIN pqg AS rel_samp_pqg ON (rel_samp_pqg.p = 'produced_by' AND contains(rel_samp_pqg.o, se_pqg.row_id))
    JOIN pqg AS samp_pqg ON (rel_samp_pqg.s = samp_pqg.row_id AND samp_pqg.otype = 'MaterialSampleRecord')
    WHERE geo_pqg.pid = '{geo_loc_pid}' AND geo_pqg.otype = 'GeospatialCoordLocation'
    ORDER BY has_thumbnail DESC
    """

    db_m = con.sql(sql)
    db_m.show(max_width=show_max_width)
    return db_m



In [38]:
sample_pid = "geoloc_7ea562cce4c70e4b37f7915e8384880c86607729"
sample_pid = "ark:/28722/k2xd0t39r"
get_sample_data_via_sample_pid(sample_pid, conn, 120)


‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ sample_alternate_i‚Ä¶  ‚îÇ ‚Ä¶ ‚îÇ longitude ‚îÇ sample_site_label ‚îÇ   sample_site_pid    ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ      varchar[]       ‚îÇ   ‚îÇ  double   ‚îÇ      varchar      ‚îÇ       varchar        ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ sample_alternate_i‚Ä¶  ‚îÇ ‚Ä¶ ‚îÇ longitude ‚îÇ sample_site_label ‚îÇ   sample_site_pid    ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ      varchar[]       ‚îÇ   ‚îÇ  double   ‚îÇ      varchar      ‚îÇ       varchar        ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0

In [39]:
get_sample_data_agents_sample_pid(sample_pid, conn, 120)

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ ‚Ä¶ ‚îÇ      agent_pid       ‚îÇ   agent_name   ‚îÇ agent_alternate_id‚Ä¶  ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ   ‚îÇ       varchar        ‚îÇ    varchar     ‚îÇ      varchar[]       ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0t39r ‚îÇ ‚Ä¶ ‚îÇ https://opencontex‚Ä¶  ‚îÇ Arek Marciniak ‚îÇ NULL                 ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ ‚Ä¶ ‚îÇ      agent_pid       ‚îÇ   agent_name   ‚îÇ agent_alternate_id‚Ä¶  ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ   ‚îÇ       varchar        ‚îÇ    varchar     ‚îÇ      varchar[]       ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0t39r ‚îÇ ‚Ä¶ ‚îÇ https://opencontex‚Ä¶  ‚îÇ Arek Marciniak ‚îÇ NULL                 ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ

In [40]:
get_sample_types_and_keywords_via_sample_pid(sample_pid, conn, 120)

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ ‚Ä¶ ‚îÇ      predicate       ‚îÇ     keyword_pid      ‚îÇ       keyword        ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ   ‚îÇ       varchar        ‚îÇ       varchar        ‚îÇ       varchar        ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0t39r ‚îÇ ‚Ä¶ ‚îÇ has_material_categ‚Ä¶  ‚îÇ https://w3id.org/i‚Ä¶  ‚îÇ Biogeni

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ row_id  ‚îÇ      sample_pid      ‚îÇ ‚Ä¶ ‚îÇ      predicate       ‚îÇ     keyword_pid      ‚îÇ       keyword        ‚îÇ
‚îÇ  int32  ‚îÇ       varchar        ‚îÇ   ‚îÇ       varchar        ‚îÇ       varchar        ‚îÇ       varchar        ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îº‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ 1319143 ‚îÇ ark:/28722/k2xd0t39r ‚îÇ ‚Ä¶ ‚îÇ has_material_categ‚Ä¶  ‚îÇ https://w3id.org/i‚Ä¶  ‚îÇ Biogeni

In [41]:
get_samples_at_geo_cord_location_via_sample_event(sample_pid, conn, 120)

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ latitude ‚îÇ longitude ‚îÇ sample_site_label ‚îÇ ‚Ä¶ ‚îÇ sample_description ‚îÇ sample_thumbnail_url ‚îÇ has_thumbnail ‚îÇ
‚îÇ  double  ‚îÇ  double   ‚îÇ      varchar      ‚îÇ   ‚îÇ      varchar       ‚îÇ       varchar        ‚îÇ    boolean    ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ                                                  0 rows                                       

‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ latitude ‚îÇ longitude ‚îÇ sample_site_label ‚îÇ ‚Ä¶ ‚îÇ sample_description ‚îÇ sample_thumbnail_url ‚îÇ has_thumbnail ‚îÇ
‚îÇ  double  ‚îÇ  double   ‚îÇ      varchar      ‚îÇ   ‚îÇ      varchar       ‚îÇ       varchar        ‚îÇ    boolean    ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ                                                  0 rows                                       

In [42]:
%load_ext sql

In [43]:
# Connect to an in-memory DuckDB instance using %sql magic
%sql duckdb:///:memory:

# Create a view for the Parquet file (run this only once per session)
%sql CREATE VIEW pqg AS SELECT * FROM '/Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet';


Count


In [44]:
%%sql

# count the number of rows in pqg
SELECT COUNT(*) FROM pqg;

count_star()
11637144


In [45]:
%%sql

SELECT * from pqg WHERE otype = 'MaterialSampleRecord' LIMIT 5;


row_id,pid,tcreated,tmodified,otype,s,p,o,n,altids,geometry,authorized_by,has_feature_of_interest,affiliation,sampling_purpose,complies_with,project,alternate_identifiers,relationship,elevation,sample_identifier,dc_rights,result_time,contact_information,latitude,target,role,scheme_uri,is_part_of,scheme_name,name,longitude,obfuscated,curation_location,last_modified_time,access_constraints,place_name,description,label,thumbnail_url
1319143,ark:/28722/k2xd0t39r,,,MaterialSampleRecord,,,,,"['https://opencontext.org/subjects/6e845e64-38c3-408d-efed-379d4ea82c4c', 'ark:/28722/k2xd0t39r']",,,,,,,,"['https://opencontext.org/subjects/6e845e64-38c3-408d-efed-379d4ea82c4c', 'ark:/28722/k2xd0t39r']",,,Bone 8679,,,,,,,,,,,,,,2025-04-02T05:21:51Z,,,"Open Context published ""Animal Bone"" sample record from: Asia/Turkey/√áatalh√∂y√ºk/Mound East/Area TP/Unit 7899/Bone 8679",Bone 8679,
1319144,ark:/28722/k26976w2b,,,MaterialSampleRecord,,,,,"['https://opencontext.org/subjects/73adb9ea-47d3-42c2-efc3-7c8ee7f7c07c', 'ark:/28722/k26976w2b']",,,,,,,,"['https://opencontext.org/subjects/73adb9ea-47d3-42c2-efc3-7c8ee7f7c07c', 'ark:/28722/k26976w2b']",,,105334 (1),,,,,,,,,,,,,,2025-04-04T05:18:35Z,,,"Open Context published ""Object"" sample record from: Asia/Jordan/Petra Great Temple/Upper Temenos/Trench 105-106/Locus 7/Seq. 105334/105334 (1)",105334 (1),
1319145,ark:/28722/k2j38nr9q,,,MaterialSampleRecord,,,,,"['https://opencontext.org/subjects/b85d7399-fe1c-4cb0-dfb6-82bf6dd97347', 'ark:/28722/k2j38nr9q']",,,,,,,,"['https://opencontext.org/subjects/b85d7399-fe1c-4cb0-dfb6-82bf6dd97347', 'ark:/28722/k2j38nr9q']",,,Bone 2836,,,,,,,,,,,,,,2025-04-02T05:03:54Z,,,"Open Context published ""Animal Bone"" sample record from: Asia/Turkey/√áatalh√∂y√ºk/Mound East/Area TP/Unit 7325/Bone 2836",Bone 2836,
1319146,ark:/28722/k2db7xt49,,,MaterialSampleRecord,,,,,"['https://opencontext.org/subjects/b5a9ad58-4d3a-4ff0-174e-6e218df059b5', 'ark:/28722/k2db7xt49']",,,,,,,,"['https://opencontext.org/subjects/b5a9ad58-4d3a-4ff0-174e-6e218df059b5', 'ark:/28722/k2db7xt49']",,,Bone 15001,,,,,,,,,,,,,,2025-04-02T05:41:08Z,,,"Open Context published ""Animal Bone"" sample record from: Asia/Turkey/√áatalh√∂y√ºk/Mound East/Area TP/Unit 13522/Bone 15001",Bone 15001,
1319147,ark:/28722/k2s181r0d,,,MaterialSampleRecord,,,,,"['https://opencontext.org/subjects/4956a2ba-0414-4b68-115d-c1c5f888c70a', 'ark:/28722/k2s181r0d']",,,,,,,,"['https://opencontext.org/subjects/4956a2ba-0414-4b68-115d-c1c5f888c70a', 'ark:/28722/k2s181r0d']",,,106059 (6),,,,,,,,,,,,,,2025-04-04T05:21:30Z,,,"Open Context published ""Object"" sample record from: Asia/Jordan/Petra Great Temple/Upper Temenos/Trench 105-106/Locus 30/Seq. 106059/106059 (6)",106059 (6),


In [46]:
%%sql
# all otypes of edges that lead from MaterialSampleRecord
SELECT DISTINCT p, COUNT(*) as count
FROM pqg AS s
JOIN pqg AS e ON s.row_id = e.s
WHERE s.otype = 'MaterialSampleRecord' AND e.otype = '_edge_'
GROUP BY p
ORDER BY count DESC
LIMIT 20;

RuntimeError: (duckdb.duckdb.BinderException) Binder Error: Ambiguous reference to column name "p" (use: "s.p" or "e.p")
[SQL: SELECT DISTINCT p, COUNT(*) as count
FROM pqg AS s
JOIN pqg AS e ON s.row_id = e.s
WHERE s.otype = 'MaterialSampleRecord' AND e.otype = '_edge_'
GROUP BY p
ORDER BY count DESC
LIMIT 20;]
(Background on this error at: https://sqlalche.me/e/20/f405)
