# Typed Edge Queries with OpenContext PQG Data

This notebook demonstrates the new **typed edge functionality** in the PQG (Property Query Graph) library. 

**What's new in this version:**
- Type-safe edge operations for 14 iSamples edge types
- Automatic edge type inference from (subject_type, predicate, object_type)
- Specialized query methods for each edge type
- Edge validation against iSamples schema
- Edge type statistics and analysis
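
To make "automatic edge type inference" concrete, here is a minimal sketch of how an enum keyed on `(subject_type, predicate, object_type)` triples could work. It is not the PQG implementation: `EdgeTypeSketch` and `infer_edge_type` are hypothetical names, and only three of the 14 types are included.

```python
from enum import Enum
from typing import Optional


class EdgeTypeSketch(Enum):
    """Hypothetical stand-in for a typed-edge enum (three types only)."""

    MSR_PRODUCED_BY = ("MaterialSampleRecord", "produced_by", "SamplingEvent")
    MSR_KEYWORDS = ("MaterialSampleRecord", "keywords", "IdentifiedConcept")
    EVENT_SAMPLING_SITE = ("SamplingEvent", "sampling_site", "SamplingSite")

    @property
    def subject_type(self) -> str:
        return self.value[0]

    @property
    def predicate(self) -> str:
        return self.value[1]

    @property
    def object_type(self) -> str:
        return self.value[2]

    def __str__(self) -> str:
        return f"{self.subject_type} --{self.predicate}--> {self.object_type}"


def infer_edge_type(
    subject_type: str, predicate: str, object_type: str
) -> Optional[EdgeTypeSketch]:
    """Return the member whose triple matches, or None if nothing matches."""
    try:
        # Enum lookup by value: the triple itself is the member's value.
        return EdgeTypeSketch((subject_type, predicate, object_type))
    except ValueError:
        return None


inferred = infer_edge_type("MaterialSampleRecord", "produced_by", "SamplingEvent")
print(inferred.name, "|", inferred)
```

Because the triple is the enum value, inference is a single lookup and an unknown combination simply yields `None`.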

**Data source:** OpenContext archaeological data (~11.6M records) in property graph format

**PQG Library:** https://github.com/isamplesorg/pqg (PR #6 - Typed Edges)

## Setup: Load OpenContext Parquet File

In [1]:
import duckdb
import sys
from pathlib import Path
import pandas as pd

# Add PQG library to path
pqg_path = Path.home() / "C" / "src" / "iSamples" / "pqg"
if str(pqg_path) not in sys.path:
    sys.path.insert(0, str(pqg_path))

# Path to the OpenContext parquet file; adjust to wherever your copy lives.
# expanduser() resolves "~", which Path does not do on its own.
oc_parquet_path = Path("~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet").expanduser()

print("Loading OpenContext PQG data...")
print(f"  File: {oc_parquet_path}")
print(f"  Size: {oc_parquet_path.stat().st_size / (1024**2):.1f} MB")

# Create DuckDB connection
conn = duckdb.connect(':memory:')

# Load parquet into a table
conn.execute(f"""
    CREATE TABLE pqg_data AS 
    SELECT * FROM read_parquet('{oc_parquet_path}')
""")

# Create view for PQG
conn.execute("CREATE VIEW pqg AS SELECT * FROM pqg_data")

# Quick stats
total_records = conn.execute("SELECT COUNT(*) FROM pqg").fetchone()[0]
entity_types = conn.execute("""
    SELECT otype, COUNT(*) as count
    FROM pqg
    GROUP BY otype
    ORDER BY count DESC
""").df()

print(f"\n✅ Loaded {total_records:,} records")
print(f"\nEntity type distribution:")
print(entity_types.to_string(index=False))

Loading OpenContext PQG data...
  File: oc_isamples_pqg.parquet
  Size: 690.9 MB

✅ Loaded 11,637,144 records

Entity type distribution:
                  otype   count
                 _edge_ 9201451
   MaterialSampleRecord 1096352
          SamplingEvent 1096352
GeospatialCoordLocation  198433
      IdentifiedConcept   25778
           SamplingSite   18213
                  Agent     565


## Initialize PQG with Typed Edge Support

In [2]:
from pqg import pqg_singletable as pqg
from pqg.typed_edges import TypedEdgeQueries
from pqg.edge_types import ISamplesEdgeType

# Create PQG instance
parquet_source = f"read_parquet('{oc_parquet_path}')"
pqg_instance = pqg.PQG(dbinstance=conn, source=parquet_source)
pqg_instance._table = 'pqg'
pqg_instance._isparquet = True  # Read-only mode
pqg_instance._node_pk = 'pid'

# Initialize basic types manually
pqg_instance._types = {
    'MaterialSampleRecord': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
    'SamplingEvent': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
    'GeospatialCoordLocation': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'latitude': 'DOUBLE', 'longitude': 'DOUBLE'},
    'SamplingSite': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
    'IdentifiedConcept': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
    'Agent': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
    '_edge_': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 's': 'INTEGER', 'p': 'VARCHAR', 'o': 'INTEGER[]'}
}

# Create TypedEdgeQueries wrapper
typed_queries = TypedEdgeQueries(pqg_instance)

print("✅ PQG instance initialized")
print(f"   Table: {pqg_instance._table}")
print(f"   Read-only: {pqg_instance._isparquet}")
print(f"\n✅ TypedEdgeQueries wrapper ready")
print(f"   Supports {len(ISamplesEdgeType)} iSamples edge types")

✅ PQG instance initialized
   Table: pqg
   Read-only: True

✅ TypedEdgeQueries wrapper ready
   Supports 14 iSamples edge types


## Edge Type Discovery: What's in the Data?

The iSamples schema defines **14 theoretical edge types**. Let's see which ones actually exist in the OpenContext data.

In [3]:
print("Analyzing edge types in OpenContext data...")
print("(This may take 1-2 minutes for 9.2M edges)\n")

# Get edge type statistics
edge_stats = typed_queries.get_edge_type_statistics()

print(f"Found {len(edge_stats)} edge types out of {len(ISamplesEdgeType)} possible:\n")
print(f"{'Edge Type':<50} {'Count':>12} {'Subject → Object'}")
print("=" * 100)

for edge_type, count in edge_stats:
    print(f"{edge_type.name:<50} {count:>12,} {str(edge_type)}")

# Show which edge types are missing
found_types = {et for et, _ in edge_stats}
missing_types = set(ISamplesEdgeType) - found_types

if missing_types:
    print(f"\n\nEdge types NOT found in OpenContext data ({len(missing_types)}):")
    for et in sorted(missing_types, key=lambda x: x.name):
        print(f"  - {et.name}: {str(et)}")

Analyzing edge types in OpenContext data...
(This may take 1-2 minutes for 9.2M edges)

Found 10 edge types out of 14 possible:

Edge Type                                                 Count Subject → Object
MSR_HAS_CONTEXT_CATEGORY                              1,096,352 MaterialSampleRecord --has_context_category--> IdentifiedConcept
MSR_HAS_MATERIAL_CATEGORY                             1,096,352 MaterialSampleRecord --has_material_category--> IdentifiedConcept
MSR_HAS_SAMPLE_OBJECT_TYPE                            1,096,352 MaterialSampleRecord --has_sample_object_type--> IdentifiedConcept
MSR_PRODUCED_BY                                       1,096,352 MaterialSampleRecord --produced_by--> SamplingEvent
EVENT_SAMPLING_SITE                                   1,096,352 SamplingEvent --sampling_site--> SamplingSite
MSR_KEYWORDS                                          1,096,297 MaterialSampleRecord --keywords--> IdentifiedConcept
EVENT_SAMPLE_LOCATION                                 1,0

## Example 1: Query Edges by Type

**Use case:** Find all samples and their sampling events

**Edge type:** `MaterialSampleRecord --produced_by--> SamplingEvent`

In [4]:
print("Finding samples and their sampling events...\n")

# Get first 10 MSR_PRODUCED_BY edges
samples = []
for subject_pid, predicate, object_pids, named_graph, edge_type in \
    typed_queries.get_edges_by_type(ISamplesEdgeType.MSR_PRODUCED_BY, limit=10):
    
    samples.append({
        'sample_pid': subject_pid,
        'event_pid': object_pids[0] if object_pids else None,
        'edge_type': edge_type.name
    })

df = pd.DataFrame(samples)
print(f"Found {len(df)} sample → event relationships:\n")
print(df.to_string(index=False))

# Show edge type info
et = ISamplesEdgeType.MSR_PRODUCED_BY
print(f"\nEdge type details:")
print(f"  Name: {et.name}")
print(f"  Subject type: {et.subject_type}")
print(f"  Predicate: {et.predicate}")
print(f"  Object type: {et.object_type}")
print(f"  Pattern: {str(et)}")

Finding samples and their sampling events...

Found 10 sample → event relationships:

                    sample_pid                                          event_pid       edge_type
          ark:/28722/k2ng4nj6s sampevent_633406f642145d31ce5b492242cbdf83053da8e5 MSR_PRODUCED_BY
          ark:/28722/k21z4812t sampevent_d75c926a2f79c69a93c95bb73be2e8fe0fbb74db MSR_PRODUCED_BY
          ark:/28722/k26t0m190 sampevent_553bcbd591a6aca368d8eb95f6124cbf0dee1c5b MSR_PRODUCED_BY
          ark:/28722/k2p84qc16 sampevent_a37dd62f5317e6b997b422790639ed409d0e72f7 MSR_PRODUCED_BY
          ark:/28722/k20v8vv6q sampevent_f76c36965b349e793745f2930d877fb16d6d72ed MSR_PRODUCED_BY
          ark:/28722/k26t11289 sampevent_fcbab45bc0072379719f9045207ecfed90b1b70c MSR_PRODUCED_BY
          ark:/28722/k2z60v955 sampevent_f7ae224293a582796305c789d091294d0c191229 MSR_PRODUCED_BY
  ark:/28722/r2p3k14c/wis_1797 sampevent_14b06d706519c1cff8a8ea979c7a5ff441ee1d3f MSR_PRODUCED_BY
ark:/28722/r2p3k14c/iaaa_82877 s

## Example 2: Multi-Hop Traversal with Typed Edges

**Use case:** Find where a sample was collected (geographic coordinates)

**Graph path:** 
```
MaterialSampleRecord 
  --produced_by--> SamplingEvent 
    --sample_location--> GeospatialCoordLocation
```

In [5]:
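def find_sample_location_typed(typed_queries, sample_pid):
    """
    Find the geographic location for a sample using typed edge queries.

    Each hop scans at most the first 10,000 edges of its type, so a sample
    whose edges fall outside that window is reported as not found.

    Returns: dict with location info, or None if no location is found
    """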
    # Step 1: Find sampling event (MSR_PRODUCED_BY)
    event_pid = None
    for s, p, o_list, n, et in typed_queries.get_edges_by_type(
        ISamplesEdgeType.MSR_PRODUCED_BY, limit=10000
    ):
        if s == sample_pid:
            event_pid = o_list[0] if o_list else None
            break
    
    if not event_pid:
        return None
    
    # Step 2: Find location from event (EVENT_SAMPLE_LOCATION)
    location_pid = None
    for s, p, o_list, n, et in typed_queries.get_edges_by_type(
        ISamplesEdgeType.EVENT_SAMPLE_LOCATION, limit=10000
    ):
        if s == event_pid:
            location_pid = o_list[0] if o_list else None
            break
    
    if not location_pid:
        # Try path 2: via sampling site
        site_pid = None
        for s, p, o_list, n, et in typed_queries.get_edges_by_type(
            ISamplesEdgeType.EVENT_SAMPLING_SITE, limit=10000
        ):
            if s == event_pid:
                site_pid = o_list[0] if o_list else None
                break
        
        if site_pid:
            for s, p, o_list, n, et in typed_queries.get_edges_by_type(
                ISamplesEdgeType.SITE_LOCATION, limit=10000
            ):
                if s == site_pid:
                    location_pid = o_list[0] if o_list else None
                    break
    
    if not location_pid:
        return None
    
    # Step 3: Get location coordinates
    location = typed_queries.pqg.getNode(location_pid, max_depth=0)
    
    return {
        'sample_pid': sample_pid,
        'event_pid': event_pid,
        'location_pid': location_pid,
        'latitude': location.get('latitude'),
        'longitude': location.get('longitude')
    }

# Test with a few samples
print("Finding geographic locations for samples...\n")

# Get some sample PIDs
sample_pids = conn.execute("""
    SELECT DISTINCT pid 
    FROM pqg 
    WHERE otype = 'MaterialSampleRecord' 
    LIMIT 5
""").fetchall()

results = []
for (sample_pid,) in sample_pids:
    location = find_sample_location_typed(typed_queries, sample_pid)
    if location:
        results.append(location)

if results:
    df = pd.DataFrame(results)
    print(f"Found locations for {len(df)} samples:\n")
    print(df.to_string(index=False))
else:
    print("No locations found for the selected samples")

Finding geographic locations for samples...

No locations found for the selected samples
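
The empty result is most likely a side effect of the scan limit rather than missing data: each hop inspects only the first 10,000 edges of a type, while about 1.1M `MSR_PRODUCED_BY` edges exist. One way around this is to index the edges once and then resolve each hop with a dictionary lookup. The sketch below uses made-up PIDs and assumes edges arrive as `(subject_pid, predicate, object_pids)` tuples. `build_index` and `find_location` are hypothetical helpers, and the `site_location` predicate name is an assumption.

```python
from collections import defaultdict

# Toy edges shaped like (subject_pid, predicate, object_pids); PIDs are made up.
EDGES = [
    ("sample:1", "produced_by", ["event:1"]),
    ("event:1", "sample_location", ["geo:1"]),
    ("sample:2", "produced_by", ["event:2"]),
    ("event:2", "sampling_site", ["site:2"]),
    ("site:2", "site_location", ["geo:2"]),  # predicate name is an assumption
    ("sample:3", "produced_by", ["event:3"]),  # event:3 has no location edges
]


def build_index(edges):
    """Map predicate -> {subject_pid: first object_pid} in one pass."""
    index = defaultdict(dict)
    for subject_pid, predicate, object_pids in edges:
        if object_pids:
            index[predicate].setdefault(subject_pid, object_pids[0])
    return index


def find_location(index, sample_pid):
    """Follow sample -> event -> location, falling back to event -> site -> location."""
    event_pid = index["produced_by"].get(sample_pid)
    if event_pid is None:
        return None
    location_pid = index["sample_location"].get(event_pid)
    if location_pid is None:
        site_pid = index["sampling_site"].get(event_pid)
        if site_pid is not None:
            location_pid = index["site_location"].get(site_pid)
    return location_pid


index = build_index(EDGES)
for pid in ("sample:1", "sample:2", "sample:3"):
    print(pid, "->", find_location(index, pid))
```

Building the index costs one pass over the edges. After that, each lookup is constant time and does not depend on where an edge sits in the scan order.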


## Example 3: Explore Sample Keywords

**Use case:** Find what concepts/keywords are associated with samples

**Edge type:** `MaterialSampleRecord --keywords--> IdentifiedConcept`

In [6]:
print("Analyzing sample keywords...\n")

# Get first 20 keyword edges
keyword_data = []
for subject_pid, predicate, object_pids, named_graph, edge_type in \
    typed_queries.get_edges_by_type(ISamplesEdgeType.MSR_KEYWORDS, limit=20):
    
    # Get sample label
    sample = pqg_instance.getNode(subject_pid, max_depth=0)
    
    # Get keyword concepts
    keywords = []
    for kw_pid in object_pids:
        concept = pqg_instance.getNode(kw_pid, max_depth=0)
        if concept:
            keywords.append(concept.get('label', kw_pid))
    
    keyword_data.append({
        'sample': sample.get('label', subject_pid[:30]),
        'keywords': ', '.join(keywords[:3]) + ('...' if len(keywords) > 3 else '')
    })

df = pd.DataFrame(keyword_data)
print(f"Sample keywords (first 20 samples):\n")
print(df.to_string(index=False))

Analyzing sample keywords...

Sample keywords (first 20 samples):

           sample                                                                keywords
 2012-18-147.13-1                                                         long bone, Aves
      Fish F17621                                                 Rajidae, odontode scale
      Fish F13570                                             Pleuronectiformes, vertebra
       Reg. 42244                    lamps (lighting devices), terracotta (clay material)
Arch. Frag. 14485                                                               sandstone
        Bone 4501                                          caudal vertebra, Sheep or goat
       20112C (3)                      amphorae (storage vessels), pottery (visual works)
       AM662:1326                                                             iron, metal
        Bone 3334          metacarpal bone of digit 3, Distal epiphysis fused, Ovis aries
       Reg. 36230                

## Example 4: Material Type Distribution

**Use case:** What material types are in the dataset?

**Edge type:** `MaterialSampleRecord --has_material_category--> IdentifiedConcept`

In [7]:
print("Analyzing material type distribution...\n")

# Collect material categories
material_counts = {}
for subject_pid, predicate, object_pids, named_graph, edge_type in \
    typed_queries.get_edges_by_type(ISamplesEdgeType.MSR_HAS_MATERIAL_CATEGORY, limit=10000):
    
    for material_pid in object_pids:
        # Get material label
        material = pqg_instance.getNode(material_pid, max_depth=0)
        label = material.get('label', material_pid) if material else material_pid
        material_counts[label] = material_counts.get(label, 0) + 1

# Sort by count
sorted_materials = sorted(material_counts.items(), key=lambda x: x[1], reverse=True)

print(f"Material type distribution (top 10 from first 10K samples):\n")
print(f"{'Material Type':<50} {'Count':>10}")
print("=" * 65)
for material, count in sorted_materials[:10]:
    print(f"{material:<50} {count:>10,}")

Analyzing material type distribution...

Material type distribution (top 10 from first 10K samples):

Material Type                                           Count
Biogenic non-organic material                           6,635
Organic material                                        1,852
Material                                                  865
Other anthropogenic material                              648
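
The loop above looks up a concept node for every edge, even though only a handful of distinct material concepts appear. A minimal sketch of the same tally with cached label lookups follows. It uses toy data, and `label_for` is a hypothetical stand-in for the node lookup, not a PQG call.

```python
from collections import Counter
from functools import lru_cache

# Toy concept labels and edges; PIDs are made up for illustration.
LABELS = {"concept:1": "Organic material", "concept:2": "Material"}
EDGES = [
    ("sample:1", ["concept:1"]),
    ("sample:2", ["concept:1"]),
    ("sample:3", ["concept:2"]),
    ("sample:4", ["concept:9"]),  # unknown concept falls back to its PID
]

lookups = 0  # counts how often the underlying lookup really runs


@lru_cache(maxsize=None)
def label_for(concept_pid):
    """Resolve a concept PID to its label, caching repeated PIDs."""
    global lookups
    lookups += 1
    return LABELS.get(concept_pid, concept_pid)


counts = Counter(label_for(pid) for _, object_pids in EDGES for pid in object_pids)

for label, count in counts.most_common():
    print(f"{label:<20} {count:>5}")
print(f"lookups performed: {lookups}")
```

With four edges and three distinct concepts, the lookup runs three times. On real data the saving grows with the ratio of edges to distinct concepts.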


## Example 5: Edge Validation

**Use case:** Validate that edges match iSamples schema constraints

The typed edge system can validate that edges conform to expected types.

In [8]:
print("Testing edge validation...\n")

# Get a sample edge to validate
sample_pid = conn.execute("""
    SELECT pid FROM pqg WHERE otype = 'MaterialSampleRecord' LIMIT 1
""").fetchone()[0]

# Bind the PID as a query parameter instead of interpolating it into the SQL
event_pid = conn.execute("""
    SELECT p.pid
    FROM pqg e
    JOIN pqg p ON p.row_id = e.o[1]
    WHERE e.otype = '_edge_'
      AND e.s = (SELECT row_id FROM pqg WHERE pid = ?)
      AND e.p = 'produced_by'
    LIMIT 1
""", [sample_pid]).fetchone()

if event_pid:
    event_pid = event_pid[0]
    
    # Test valid edge
    is_valid, error = typed_queries.validate_edge(
        sample_pid, 
        'produced_by', 
        event_pid,
        expected_type=ISamplesEdgeType.MSR_PRODUCED_BY
    )
    
    print(f"✅ Valid edge:")
    print(f"   {sample_pid[:50]}")
    print(f"   --produced_by-->")
    print(f"   {event_pid[:50]}")
    print(f"   Valid: {is_valid}")
    
    # Test invalid expectation
    is_valid, error = typed_queries.validate_edge(
        sample_pid,
        'produced_by',
        event_pid,
        expected_type=ISamplesEdgeType.MSR_KEYWORDS  # Wrong type!
    )
    
    print(f"\n❌ Invalid edge (wrong expected type):")
    print(f"   Valid: {is_valid}")
    print(f"   Error: {error}")
else:
    print("Could not find sample with produced_by edge for validation test")

Testing edge validation...

✅ Valid edge:
   ark:/28722/k2xd0t39r
   --produced_by-->
   sampevent_ea34d607c59db0543f948d21c2fb2ae0279e035a
   Valid: True

❌ Invalid edge (wrong expected type):
   Valid: False
   Error: Expected MaterialSampleRecord --keywords--> IdentifiedConcept, but inferred MaterialSampleRecord --produced_by--> SamplingEvent
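
The validation result above is consistent with a simple rule: infer the edge type from the actual triple, then compare it with the expected one. The sketch below illustrates that rule under stated assumptions. `KNOWN_TYPES` and `validate_edge_sketch` are hypothetical, operate on type names rather than PIDs, and are not the library's `validate_edge`.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]  # (subject_type, predicate, object_type)

# Hypothetical registry holding two of the 14 edge types.
KNOWN_TYPES = {
    ("MaterialSampleRecord", "produced_by", "SamplingEvent"): "MSR_PRODUCED_BY",
    ("MaterialSampleRecord", "keywords", "IdentifiedConcept"): "MSR_KEYWORDS",
}


def describe(triple: Triple) -> str:
    """Render a triple in the arrow notation used throughout this notebook."""
    subject_type, predicate, object_type = triple
    return f"{subject_type} --{predicate}--> {object_type}"


def validate_edge_sketch(actual: Triple, expected: Triple) -> Tuple[bool, Optional[str]]:
    """Return (is_valid, error); error is None when the edge is valid."""
    if actual not in KNOWN_TYPES:
        return False, f"No known edge type matches {describe(actual)}"
    if actual != expected:
        return False, f"Expected {describe(expected)}, but inferred {describe(actual)}"
    return True, None


produced_by = ("MaterialSampleRecord", "produced_by", "SamplingEvent")
keywords = ("MaterialSampleRecord", "keywords", "IdentifiedConcept")

print(validate_edge_sketch(produced_by, produced_by))
print(validate_edge_sketch(produced_by, keywords))
```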


## Example 6: Query All Edges from a Subject Type

**Use case:** Find all outgoing edges from MaterialSampleRecord nodes

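This shows all the different relationship types a sample can have. Because the loop inspects only the first 1,000 edges before filtering by PID, it can come back empty for an arbitrary sample.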

In [9]:
print("Finding all edge types from MaterialSampleRecord...\n")

# Get a sample to explore
sample_pid = conn.execute("""
    SELECT pid FROM pqg WHERE otype = 'MaterialSampleRecord' LIMIT 1
""").fetchone()[0]

print(f"Sample: {sample_pid}\n")

# Find all edges from this sample
edges_found = []
for s, p, o, edge_type in typed_queries.get_edges_by_subject_type(
    'MaterialSampleRecord', limit=1000
):
    if s == sample_pid:
        edges_found.append({
            'predicate': p,
            'edge_type': edge_type.name,
            'target': o[:50] + '...' if len(o) > 50 else o,
            'target_type': edge_type.object_type
        })

if edges_found:
    df = pd.DataFrame(edges_found)
    print(f"Found {len(df)} outgoing edges:\n")
    print(df.to_string(index=False))
else:
    print("No edges found (may need to increase limit)")

Finding all edge types from MaterialSampleRecord...

Sample: ark:/28722/k2xd0t39r

No edges found (may need to increase limit)


## Performance Comparison: Typed vs Raw SQL

Let's compare the performance of typed edge queries vs raw SQL for bulk operations.

In [10]:
import time

print("Comparing performance: Typed Edge Query vs Raw SQL\n")
print("Task: Find all MaterialSampleRecord → SamplingEvent edges\n")

# Method 1: Typed edge query
start = time.time()
typed_count = sum(1 for _ in typed_queries.get_edges_by_type(
    ISamplesEdgeType.MSR_PRODUCED_BY, limit=10000
))
typed_time = time.time() - start

print(f"Typed Edge Query (10K limit):")
print(f"  Time: {typed_time*1000:.1f}ms")
print(f"  Edges found: {typed_count:,}")

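# Method 2: Raw SQL
# Note: LIMIT applies to the single row that COUNT(*) returns, so this query
# counts every matching edge rather than stopping at 10,000.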
start = time.time()
sql_count = conn.execute("""
    SELECT COUNT(*)
    FROM pqg e
    JOIN pqg s ON s.row_id = e.s
    JOIN pqg o ON o.row_id = e.o[1]
    WHERE e.otype = '_edge_'
      AND e.p = 'produced_by'
      AND s.otype = 'MaterialSampleRecord'
      AND o.otype = 'SamplingEvent'
    LIMIT 10000
""").fetchone()[0]
sql_time = time.time() - start

print(f"\nRaw SQL (10K limit):")
print(f"  Time: {sql_time*1000:.1f}ms")
print(f"  Edges found: {sql_count:,}")

# Analysis
print(f"\nPerformance difference: {typed_time/sql_time:.1f}x")
print(f"\nConclusion:")
if typed_time < sql_time * 1.5:
    print("  Typed edge queries have comparable performance to raw SQL")
    print("  Use typed edges for better code clarity and type safety")
else:
    print("  Raw SQL is faster for bulk operations")
    print("  Use typed edges for exploration, SQL for production")

Comparing performance: Typed Edge Query vs Raw SQL

Task: Find all MaterialSampleRecord → SamplingEvent edges

Typed Edge Query (10K limit):
  Time: 55.3ms
  Edges found: 10,000

Raw SQL (10K limit):
  Time: 23.5ms
  Edges found: 1,096,352

Performance difference: 2.4x

Conclusion:
  Raw SQL is faster for bulk operations
  Use typed edges for exploration, SQL for production
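
The two methods are not doing the same amount of work: the SQL result of 1,096,352 shows that `LIMIT 10000` did not cap the count. `LIMIT` restricts the rows a query returns, and a `COUNT(*)` query returns a single row. The sketch below reproduces the effect with the standard-library `sqlite3` module, used here only as a stand-in so the example runs without the parquet file. To cap the rows before counting, put the `LIMIT` inside a subquery.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE edges (p TEXT)")
demo.executemany("INSERT INTO edges VALUES (?)", [("produced_by",)] * 50)

# LIMIT is applied to the one aggregate row, so all 50 edges are counted.
count_all = demo.execute(
    "SELECT COUNT(*) FROM edges WHERE p = 'produced_by' LIMIT 10"
).fetchone()[0]

# Limiting inside a subquery caps the rows before they are counted.
count_capped = demo.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM edges WHERE p = 'produced_by' LIMIT 10)"
).fetchone()[0]

print(count_all, count_capped)  # 50 10
```

Read this way, the raw SQL counted all matching edges in less time than the typed query took for 10,000, so the gap is larger than the 2.4x figure suggests.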


## Summary: When to Use Typed Edges

### ✅ Use Typed Edge Queries when:
1. **Exploring data structure** - Edge types make relationships explicit
2. **Type safety matters** - Validate edges against schema
3. **Learning the graph model** - Clear edge type names and patterns
4. **Development/prototyping** - More readable than SQL joins
5. **Documentation** - Self-documenting code with edge type names

### ⚠️ Consider Raw SQL when:
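1. **Bulk operations** (10K+ edges) - in the single run above, SQL counted all 1,096,352 matching edges in 23.5 ms, while the typed query took 55.3 ms for 10,000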
2. **Complex aggregations** - GROUP BY, HAVING, etc.
3. **Production performance** - Every millisecond counts
4. **Custom queries** - Need flexibility beyond typed patterns

### 🎯 Best Practice: Hybrid Approach
- Use typed edges for initial exploration and understanding
- Optimize critical paths with SQL
- Document both approaches for learning value
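
One way to combine the two approaches is to keep the edge-type triple as the single source of truth and generate the SQL from it. The sketch below is a hypothetical helper, `typed_edge_sql`, which is not part of PQG. It mirrors the join pattern used in the raw SQL cells of this notebook and returns the query with its bound parameters.

```python
from typing import List, Tuple


def typed_edge_sql(
    subject_type: str, predicate: str, object_type: str, limit: int
) -> Tuple[str, List[str]]:
    """Build a parameterized query for one edge type (hypothetical helper).

    Only the first object of each edge (e.o[1]) is joined, as in the
    raw SQL cells above.
    """
    sql = (
        "SELECT s.pid AS subject_pid, o.pid AS object_pid\n"
        "FROM pqg e\n"
        "JOIN pqg s ON s.row_id = e.s\n"
        "JOIN pqg o ON o.row_id = e.o[1]\n"
        "WHERE e.otype = '_edge_'\n"
        "  AND e.p = ?\n"
        "  AND s.otype = ?\n"
        "  AND o.otype = ?\n"
        f"LIMIT {int(limit)}"  # int() keeps the limit from carrying arbitrary SQL
    )
    return sql, [predicate, subject_type, object_type]


sql, params = typed_edge_sql(
    "MaterialSampleRecord", "produced_by", "SamplingEvent", limit=10
)
print(sql)
print(params)
```

The resulting pair can be passed to a DuckDB connection as `conn.execute(sql, params)`. The same triple still documents the edge's meaning in code.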

## The 14 iSamples Edge Types

**MaterialSampleRecord edges (8 types):**
1. `MSR_CURATION` → MaterialSampleCuration
2. `MSR_HAS_CONTEXT_CATEGORY` → IdentifiedConcept
3. `MSR_HAS_MATERIAL_CATEGORY` → IdentifiedConcept
4. `MSR_HAS_SAMPLE_OBJECT_TYPE` → IdentifiedConcept
5. `MSR_KEYWORDS` → IdentifiedConcept
6. `MSR_PRODUCED_BY` → SamplingEvent
7. `MSR_REGISTRANT` → Agent
8. `MSR_RELATED_RESOURCE` → SampleRelation

**SamplingEvent edges (4 types):**
1. `EVENT_HAS_CONTEXT_CATEGORY` → IdentifiedConcept
2. `EVENT_RESPONSIBILITY` → Agent
3. `EVENT_SAMPLE_LOCATION` → GeospatialCoordLocation
4. `EVENT_SAMPLING_SITE` → SamplingSite

**MaterialSampleCuration edges (1 type):**
1. `CURATION_RESPONSIBILITY` → Agent

**SamplingSite edges (1 type):**
1. `SITE_LOCATION` → GeospatialCoordLocation
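
The list above can also be held as plain data, which makes the grouping easy to check. The following sketch copies the names and target types from the list. The dictionary `EDGE_TARGETS` is illustrative only and is not how the library stores its edge types.

```python
from collections import Counter

# Edge type name -> object (target) type, copied from the list above.
EDGE_TARGETS = {
    "MSR_CURATION": "MaterialSampleCuration",
    "MSR_HAS_CONTEXT_CATEGORY": "IdentifiedConcept",
    "MSR_HAS_MATERIAL_CATEGORY": "IdentifiedConcept",
    "MSR_HAS_SAMPLE_OBJECT_TYPE": "IdentifiedConcept",
    "MSR_KEYWORDS": "IdentifiedConcept",
    "MSR_PRODUCED_BY": "SamplingEvent",
    "MSR_REGISTRANT": "Agent",
    "MSR_RELATED_RESOURCE": "SampleRelation",
    "EVENT_HAS_CONTEXT_CATEGORY": "IdentifiedConcept",
    "EVENT_RESPONSIBILITY": "Agent",
    "EVENT_SAMPLE_LOCATION": "GeospatialCoordLocation",
    "EVENT_SAMPLING_SITE": "SamplingSite",
    "CURATION_RESPONSIBILITY": "Agent",
    "SITE_LOCATION": "GeospatialCoordLocation",
}

# The name prefix (MSR, EVENT, CURATION, SITE) identifies the subject type.
by_subject = Counter(name.split("_", 1)[0] for name in EDGE_TARGETS)
by_object = Counter(EDGE_TARGETS.values())

print(len(EDGE_TARGETS), dict(by_subject))
print(by_object.most_common(3))
```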

## Next Steps

- **PQG Repository:** https://github.com/isamplesorg/pqg
- **PR #6 (Typed Edges):** https://github.com/isamplesorg/pqg/pull/6
- **Documentation:** `pqg/docs/typed-edges.md`
- **Example script:** `pqg/examples/typed_edges_demo.py`
- **Tests:** `pqg/tests/test_typed_edges.py`