# PQG Demo: Property Graph Queries with OpenContext Data

This notebook demonstrates using the `pqg` library to query OpenContext archaeological data stored as a property graph in parquet format.

**What you'll learn:**
- How to set up PQG with parquet files
- Basic graph operations (node retrieval, edge traversal)
- When to use PQG vs raw SQL
- Graph traversal patterns for iSamples data

**Data source:** `~/Data/iSample/oc_isamples_pqg.parquet` (691MB, 11.6M records)

**PQG Library:** https://github.com/isamplesorg/pqg

## Setup: Load the Parquet File

In [1]:
import duckdb
import time
from pathlib import Path
import pandas as pd

# Path to OpenContext parquet file
oc_parquet_path = Path.home() / "Data" / "iSample" / "oc_isamples_pqg.parquet"

print("🔄 Creating fresh DuckDB connection and loading parquet...")
print(f"   File: {oc_parquet_path}")
print(f"   Size: {oc_parquet_path.stat().st_size / (1024**2):.1f} MB")

# Force completely fresh connection and clear any caches
conn = duckdb.connect(':memory:')

# Load data directly into a table (not a view) to avoid caching issues
conn.execute(f"""
    CREATE TABLE pqg_data AS 
    SELECT * FROM read_parquet('{oc_parquet_path}')
""")

# Create a view that references the table
conn.execute("CREATE VIEW pqg AS SELECT * FROM pqg_data")

# Verify the setup works
total_records = conn.execute("SELECT COUNT(*) FROM pqg").fetchone()[0]
sample_check = conn.execute("SELECT COUNT(*) FROM pqg WHERE otype = 'MaterialSampleRecord'").fetchone()[0]

print(f"✅ Successfully loaded {total_records:,} records")
print(f"✅ Found {sample_check:,} MaterialSampleRecord entries")
print("✅ Ready for PQG operations")

🔄 Creating fresh DuckDB connection and loading parquet...
   File: /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet
   Size: 690.9 MB
✅ Successfully loaded 11,637,144 records
✅ Found 1,096,352 MaterialSampleRecord entries
✅ Ready for PQG operations


## Quick Data Overview (SQL)

First, let's understand what's in the data using raw SQL.

In [2]:
# Total record count
total = conn.execute("SELECT COUNT(*) as total FROM pqg").fetchone()[0]
print(f"Total records: {total:,}")

# Entity type distribution
print("\nEntity types:")
result = conn.execute("""
    SELECT otype, COUNT(*) as count
    FROM pqg
    GROUP BY otype
    ORDER BY count DESC
""").df()
result

Total records: 11,637,144

Entity types:


| | otype | count |
|---|-------|-------|
| 0 | `_edge_` | 9201451 |
| 1 | `MaterialSampleRecord` | 1096352 |
| 2 | `SamplingEvent` | 1096352 |
| 3 | `GeospatialCoordLocation` | 198433 |
| 4 | `IdentifiedConcept` | 25778 |
| 5 | `SamplingSite` | 18213 |
| 6 | `Agent` | 565 |
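
The counts reflect a single-table layout in which nodes and edges share one table: `_edge_` rows carry a subject `s` (a `row_id`), a predicate `p`, and a list of object `row_id`s in `o`. The following is a minimal pure-Python sketch of that layout, assuming toy rows with made-up identifiers (`sample:1`, `event:1`, `loc:1` and the helper `resolve_edges` are hypothetical). It illustrates the `row_id` to `pid` resolution step and is not PQG's implementation.

```python
from collections import Counter

# Toy rows mimicking the single-table layout (all identifiers are hypothetical).
ROWS = [
    {"row_id": 1, "pid": "sample:1", "otype": "MaterialSampleRecord"},
    {"row_id": 2, "pid": "event:1", "otype": "SamplingEvent"},
    {"row_id": 3, "pid": "loc:1", "otype": "GeospatialCoordLocation"},
    {"row_id": 4, "pid": "edge:1", "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "pid": "edge:2", "otype": "_edge_", "s": 2, "p": "sample_location", "o": [3]},
]

# row_id -> pid lookup for node rows; raw SQL has to do this conversion by hand.
PID_BY_ROW_ID = {row["row_id"]: row["pid"] for row in ROWS if row["otype"] != "_edge_"}


def resolve_edges(rows):
    """Return (subject_pid, predicate, [object_pids]) for every edge row."""
    triples = []
    for row in rows:
        if row["otype"] == "_edge_":
            subject = PID_BY_ROW_ID[row["s"]]
            objects = [PID_BY_ROW_ID[o] for o in row["o"]]
            triples.append((subject, row["p"], objects))
    return triples


type_counts = Counter(row["otype"] for row in ROWS)
triples = resolve_edges(ROWS)
print(type_counts)
print(triples)
```

Edges outnumber nodes in the real data for the same reason they do here: every relationship is its own row.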


## Initialize PQG Instance

Now let's wrap the parquet data with PQG's Python API.

In [3]:
# Import PQG
import sys
sys.path.insert(0, str(Path.home() / "C" / "src" / "iSamples" / "pqg"))

from pqg import pqg_singletable as pqg

# Create PQG instance with the fresh connection
def create_pqg_instance(conn, table_name='pqg'):
    """Initialize PQG wrapper around parquet data"""
    # For parquet files, we need to use the read_parquet source format
    parquet_path = Path.home() / "Data" / "iSample" / "oc_isamples_pqg.parquet"
    parquet_source = f"read_parquet('{parquet_path}')"
    
    pqg_instance = pqg.PQG(dbinstance=conn, source=parquet_source)
    pqg_instance._table = table_name
    pqg_instance._isparquet = True  # Read-only mode
    pqg_instance._node_pk = 'pid'   # Primary lookup field
    
    # Load metadata from parquet file to initialize _types
    try:
        pqg_instance.loadMetadataParquet()
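        print("✅ Loaded PQG metadata from parquet")
    except Exception as e:
        # In the recorded run below, this branch fired because a view named "pqg"
        # already existed on the connection, not because metadata was missing.
        print(f"⚠️ Could not load parquet metadata: {e}")
        print("   This may be normal if the parquet doesn't have PQG metadata")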
        # Initialize basic types manually for all entity types in the data
        pqg_instance._types = {
            'MaterialSampleRecord': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
            'SamplingEvent': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
            'GeospatialCoordLocation': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'latitude': 'DOUBLE', 'longitude': 'DOUBLE'},
            'SamplingSite': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
            'IdentifiedConcept': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
            'Agent': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 'label': 'VARCHAR'},
            '_edge_': {'pid': 'VARCHAR', 'otype': 'VARCHAR', 's': 'INTEGER', 'p': 'VARCHAR', 'o': 'INTEGER[]'}
        }
        print(f"   Initialized basic types: {list(pqg_instance._types.keys())}")
    
    return pqg_instance

# Recreate PQG instance with the fresh connection
pqg_instance = create_pqg_instance(conn)

print("✅ PQG instance created with fresh connection")
print(f"   Table: {pqg_instance._table}")
print(f"   Read-only mode: {pqg_instance._isparquet}")
print(f"   Primary key: {pqg_instance._node_pk}")
print(f"   Types loaded: {len(pqg_instance._types)} entity types")

⚠️ Could not load parquet metadata: Catalog Error: View with name "pqg" already exists!
   This may be normal if the parquet doesn't have PQG metadata
   Initialized basic types: ['MaterialSampleRecord', 'SamplingEvent', 'GeospatialCoordLocation', 'SamplingSite', 'IdentifiedConcept', 'Agent', '_edge_']
✅ PQG instance created with fresh connection
   Table: pqg
   Read-only mode: True
   Primary key: pid
   Types loaded: 7 entity types


## Example 1: Single Node Retrieval

**Use Case:** Get details about a specific sample, location, or event

**When to use PQG:** ✅ Single node lookups - PQG handles row_id conversion automatically

In [4]:
# First, find a sample PID to work with
sample_pid = conn.execute("""
    SELECT pid 
    FROM pqg 
    WHERE otype = 'MaterialSampleRecord' 
    LIMIT 1
""").fetchone()[0]

print(f"Sample PID: {sample_pid}")

Sample PID: ark:/28722/k2xd0t39r


In [5]:
# SQL approach (raw query)
start = time.time()
sql_result = conn.execute(f"""
    SELECT *
    FROM pqg
    WHERE pid = '{sample_pid}'
""").df()
sql_time = time.time() - start

print(f"SQL approach: {sql_time*1000:.2f}ms")
print(f"Columns returned: {len(sql_result.columns)}")
print(f"Many NULL columns: {sql_result.isnull().sum().sum()} nulls out of {sql_result.size} values")

SQL approach: 6.10ms
Columns returned: 40
Many NULL columns: 31 nulls out of 40 values
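
The query above interpolates `sample_pid` into the SQL string with an f-string. A hedged alternative is to bind the value as a parameter. The sketch below uses Python's built-in `sqlite3` with a hypothetical toy table so that it runs without the parquet file; DuckDB's Python API also accepts `?` placeholders with a parameter list, so the same pattern should carry over to `conn.execute(...)`. The helper name `fetch_by_pid` is an assumption for illustration.

```python
import sqlite3

# Toy stand-in for the pqg table (rows are hypothetical).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pqg (pid TEXT, otype TEXT, label TEXT)")
db.executemany(
    "INSERT INTO pqg VALUES (?, ?, ?)",
    [
        ("sample:1", "MaterialSampleRecord", "Bone A"),
        ("it's:odd", "MaterialSampleRecord", "Quote in pid"),
    ],
)


def fetch_by_pid(connection, pid):
    """Look up one row by pid using a bound parameter instead of string formatting."""
    return connection.execute(
        "SELECT pid, otype, label FROM pqg WHERE pid = ?", (pid,)
    ).fetchone()


# A quote character inside the value does not break the query.
row = fetch_by_pid(db, "it's:odd")
print(row)
```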


In [6]:
# PQG approach (cleaner API)
start = time.time()
pqg_result = pqg_instance.getNode(sample_pid, max_depth=0)
pqg_time = time.time() - start

print(f"PQG approach: {pqg_time*1000:.2f}ms")
print(f"\nNode details:")
print(pqg_result)

PQG approach: 2.46ms

Node details:
{'pid': 'ark:/28722/k2xd0t39r', 'otype': 'MaterialSampleRecord', 'label': 'Bone 8679'}


**Comparison:**
- **SQL**: Returns all columns (many NULL), requires manual filtering
- **PQG**: Returns only populated fields, cleaner dict interface
- **Performance**: Comparable for a single node (6.10 ms for SQL vs 2.46 ms for PQG in this run; single-run timings this small are noisy)
- **Winner**: 🏆 PQG for single node retrieval (cleaner API)
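
The "only populated fields" behaviour can be approximated on the SQL side by dropping NULL columns after fetching. A minimal sketch, assuming a row that has already been converted to a dict (the column values are illustrative and `compact_row` is a hypothetical helper, not part of PQG):

```python
def compact_row(row):
    """Return a copy of the row without None-valued columns."""
    return {key: value for key, value in row.items() if value is not None}


# A wide row as raw SQL would return it: most columns do not apply to this otype.
wide_row = {
    "pid": "sample:1",
    "otype": "MaterialSampleRecord",
    "label": "Bone A",
    "latitude": None,
    "longitude": None,
    "s": None,
}
compact = compact_row(wide_row)
print(compact)
```

With the DataFrame from the SQL cell, `sql_result.iloc[0].dropna().to_dict()` should give a comparable result.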

## Example 2: Node with Relationships

**Use Case:** Get a node AND all its immediate connections

**PQG feature:** `max_depth=1` automatically expands related nodes

In [7]:
# PQG with depth=1 (auto-expand relationships)
start = time.time()
expanded = pqg_instance.getNode(sample_pid, max_depth=1)
pqg_expanded_time = time.time() - start

print(f"PQG with max_depth=1: {pqg_expanded_time*1000:.2f}ms")
print(f"\nNode type: {expanded.get('otype')}")
print(f"Node PID: {expanded.get('pid')}")
print(f"\nRelated entities:")
for key, value in expanded.items():
    if key not in ['pid', 'otype', 'row_id', 'label', 'description']:
        print(f"  {key}: {value}")

PQG with max_depth=1: 50.17ms

Node type: MaterialSampleRecord
Node PID: ark:/28722/k2xd0t39r

Related entities:
  produced_by: {'pid': 'sampevent_ea34d607c59db0543f948d21c2fb2ae0279e035a', 'otype': 'SamplingEvent', 'label': 'Sampling event for: Bone 8679', 'sampling_site': {'pid': 'https://opencontext.org/subjects/e44a115a-dfcb-4971-6750-40955df2c062', 'otype': 'SamplingSite', 'label': 'Çatalhöyük', 'site_location': {'pid': 'geoloc_b8a942487844671e9f7343397454258529381489', 'otype': 'GeospatialCoordLocation', 'latitude': 37.6675, 'longitude': 32.828333}}, 'sample_location': {'pid': 'geoloc_35842a4fa478ae28c68f54d1db36c8e968d62dcb', 'otype': 'GeospatialCoordLocation', 'latitude': 37.668196, 'longitude': 32.827191}, 'responsibility': {'pid': 'https://opencontext.org/persons/fd2d702f-1ec6-4865-cccc-da8af166cc83', 'otype': 'Agent', 'label': None}}
  keywords: [{'pid': 'https://purl.obolibrary.org/obo/UBERON_0001684', 'otype': 'IdentifiedConcept', 'label': 'mandible'}, {'pid': 'https://

In [8]:
# SQL equivalent (complex multi-step query)
start = time.time()

# Step 1: Get the node
node = conn.execute(f"SELECT * FROM pqg WHERE pid = '{sample_pid}'").fetchone()

# Step 2: Find edges from this node
edges = conn.execute(f"""
    SELECT p, o
    FROM pqg
    WHERE otype = '_edge_'
      AND s = (SELECT row_id FROM pqg WHERE pid = '{sample_pid}')
""").fetchall()

# Step 3: Resolve each target
related = {}
for predicate, obj_ids in edges:
    if obj_ids:
        targets = conn.execute(f"""
            SELECT pid
            FROM pqg
            WHERE row_id = ANY({obj_ids})
        """).fetchall()
        related[predicate] = [t[0] for t in targets]

sql_expanded_time = time.time() - start

print(f"SQL multi-step approach: {sql_expanded_time*1000:.2f}ms")
print(f"Related entities found: {len(related)}")
for pred, targets in related.items():
    print(f"  {pred}: {targets}")

SQL multi-step approach: 13.28ms
Related entities found: 6
  keywords: ['https://purl.obolibrary.org/obo/UBERON_0001684', 'https://eol.org/pages/32609438#gbif-sub']
  has_sample_object_type: ['https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/organismpart']
  registrant: ['https://opencontext.org/persons/fd2d702f-1ec6-4865-cccc-da8af166cc83', 'https://opencontext.org/persons/94aa8533-450d-4fc7-4dc1-3cb12a3fe52c', 'https://opencontext.org/persons/7cdf91a1-3230-41fb-274a-e927734f8de6']
  produced_by: ['sampevent_ea34d607c59db0543f948d21c2fb2ae0279e035a']
  has_material_category: ['https://w3id.org/isample/vocabulary/material/1.0/biogenicnonorganicmaterial']
  has_context_category: ['https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite']


**Comparison:**
- **SQL**: Requires 3+ separate queries, manual row_id→pid resolution, and returns only the PIDs of related nodes
- **PQG**: Single method call, automatic resolution, related nodes returned as nested dicts
- **Performance**: PQG was slower in this run (50.17 ms vs 13.28 ms) but much more readable
- **Winner**: 🏆 PQG for development/exploration (SQL for production if speed is critical)
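
The idea of expanding related nodes up to a depth limit can be sketched on a toy graph. This is an illustration only: `NODES`, `EDGES`, and `expand` are hypothetical names, and how PQG itself interprets `max_depth` is not shown in this notebook (the recorded output nests several levels even at `max_depth=1`), so the depth semantics below are the toy's own.

```python
# Toy graph: node attributes plus outgoing edges (all identifiers hypothetical).
NODES = {
    "sample:1": {"otype": "MaterialSampleRecord", "label": "Bone A"},
    "event:1": {"otype": "SamplingEvent", "label": "Event for Bone A"},
    "loc:1": {"otype": "GeospatialCoordLocation", "latitude": 37.6, "longitude": 32.8},
}
EDGES = {
    "sample:1": {"produced_by": ["event:1"]},
    "event:1": {"sample_location": ["loc:1"]},
}


def expand(pid, max_depth):
    """Return the node as a dict, nesting related nodes up to max_depth hops."""
    node = {"pid": pid, **NODES[pid]}
    for predicate, targets in EDGES.get(pid, {}).items():
        if max_depth > 0:
            # Recurse with one hop less; the depth limit guarantees termination.
            node[predicate] = [expand(target, max_depth - 1) for target in targets]
        else:
            # Out of depth: keep unexpanded references as plain pids.
            node[predicate] = list(targets)
    return node


shallow = expand("sample:1", 0)
deep = expand("sample:1", 2)
print(shallow)
print(deep)
```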

## Example 3: Graph Traversal - Sample to Geographic Location

**Use Case:** Find where a sample was collected (multi-hop traversal)

**Graph path:** `Sample → produced_by → SamplingEvent → sample_location → Location`

In [9]:
# PQG approach (readable graph navigation)
def get_sample_location_pqg(pqg_instance, sample_pid):
    """Get geographic location for a sample using PQG"""
    
    # Get sample with immediate edges
    sample = pqg_instance.getNode(sample_pid, max_depth=1)
    if not sample:
        return None
    
    # Navigate to sampling event
    # When max_depth=1, related objects are expanded as full dictionaries
    produced_by = sample.get('produced_by')
    if not produced_by:
        return None
    
    # Extract the event - it's already expanded, so use it directly
    event = produced_by
    
    # Extract location (Path 1: direct location)
    sample_location = event.get('sample_location')
    if sample_location:
        return {
            'path': 'direct',
            'location_pid': sample_location.get('pid'),
            'location': sample_location
        }
    
    # Try Path 2: via sampling site
    sampling_site = event.get('sampling_site')
    if sampling_site:
        site_location = sampling_site.get('site_location')
        if site_location:
            return {
                'path': 'via_site',
                'site_pid': sampling_site.get('pid'),
                'location_pid': site_location.get('pid'),
                'location': site_location
            }
    
    return None

# Test with a sample that has location
sample_with_location = conn.execute("""
    SELECT DISTINCT s.pid
    FROM pqg s
    WHERE s.otype = 'MaterialSampleRecord'
      AND EXISTS (
        SELECT 1 FROM pqg e1
        WHERE e1.otype = '_edge_' 
          AND e1.s = s.row_id 
          AND e1.p = 'produced_by'
      )
    LIMIT 1
""").fetchone()[0]

print(f"Testing with sample: {sample_with_location}")

start = time.time()
pqg_location = get_sample_location_pqg(pqg_instance, sample_with_location)
pqg_traversal_time = time.time() - start

print(f"\nPQG traversal: {pqg_traversal_time*1000:.2f}ms")
if pqg_location:
    print(f"Path used: {pqg_location['path']}")
    print(f"Location PID: {pqg_location['location_pid']}")
    if pqg_location['location']:
        print(f"Coordinates: {pqg_location['location'].get('latitude')}, {pqg_location['location'].get('longitude')}")
else:
    print("No location found")

Testing with sample: ark:/28722/k24x5c63r

PQG traversal: 43.88ms
Path used: direct
Location PID: geoloc_17bae610b87227ef806161bdb40ac97b4cd8ef5e
Coordinates: 30.3287, 35.4421


In [10]:
# SQL approach (complex joins)
start = time.time()

sql_location = conn.execute(f"""
    WITH sample_event AS (
        SELECT e.o[1] as event_row_id
        FROM pqg e
        WHERE e.otype = '_edge_'
          AND e.s = (SELECT row_id FROM pqg WHERE pid = '{sample_with_location}')
          AND e.p = 'produced_by'
    ),
    -- Path 1: Direct location
    direct_location AS (
        SELECT 'direct' as path, e.o[1] as location_row_id
        FROM pqg e, sample_event se
        WHERE e.otype = '_edge_'
          AND e.s = se.event_row_id
          AND e.p = 'sample_location'
    ),
    -- Path 2: Via site
    site_location AS (
        SELECT 'via_site' as path, e2.o[1] as location_row_id
        FROM pqg e1, sample_event se, pqg e2
        WHERE e1.otype = '_edge_'
          AND e1.s = se.event_row_id
          AND e1.p = 'sampling_site'
          AND e2.otype = '_edge_'
          AND e2.s = e1.o[1]
          AND e2.p = 'site_location'
    ),
    combined AS (
        SELECT * FROM direct_location
        UNION ALL
        SELECT * FROM site_location
    )
    SELECT c.path, l.pid as location_pid, l.latitude, l.longitude
    FROM combined c
    JOIN pqg l ON l.row_id = c.location_row_id
    LIMIT 1
""").fetchone()

sql_traversal_time = time.time() - start

print(f"SQL traversal: {sql_traversal_time*1000:.2f}ms")
if sql_location:
    print(f"Path used: {sql_location[0]}")
    print(f"Location PID: {sql_location[1]}")
    print(f"Coordinates: {sql_location[2]}, {sql_location[3]}")
else:
    print("No location found")

SQL traversal: 12.71ms
Path used: via_site
Location PID: geoloc_17bae610b87227ef806161bdb40ac97b4cd8ef5e
Coordinates: 30.3287, 35.4421


**Comparison:**
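- **SQL**: Single query but complex CTEs, hard to understand
- **PQG**: Step-by-step navigation, very readable
- **Performance**: SQL was faster in this run (12.71 ms vs 43.88 ms), as expected for a single query versus multiple lookups
- **Result difference**: PQG reported the `direct` path and SQL reported `via_site`, yet both returned the same location PID. The SQL query applies `LIMIT 1` to a `UNION ALL` without an `ORDER BY`, so which path it reports is not deterministic.
- **Winner**: 🏆 PQG for learning/development, SQL for production bulk queries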

**Key insight:** For 1-3 hop traversals on single nodes, PQG's clarity wins. For bulk operations (10K+ samples), use SQL.
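
The two-path lookup (direct location first, then the site's location) can be written as a generic "follow a chain of predicates" helper with an explicit priority order. A minimal sketch on toy data, assuming each subject has at most one object per predicate; `EDGE_MAP`, `follow`, and `locate` are hypothetical names and the identifiers are made up.

```python
# (subject_pid, predicate) -> object_pid; toy data with hypothetical identifiers.
EDGE_MAP = {
    ("sample:1", "produced_by"): "event:1",
    ("event:1", "sample_location"): "loc:direct",
    ("event:1", "sampling_site"): "site:1",
    ("site:1", "site_location"): "loc:site",
    ("sample:2", "produced_by"): "event:2",
    ("event:2", "sampling_site"): "site:1",
}

# Paths in priority order: the direct location wins over the site's location.
PATHS = [
    ("direct", ["produced_by", "sample_location"]),
    ("via_site", ["produced_by", "sampling_site", "site_location"]),
]


def follow(start, predicates):
    """Follow a chain of predicates; return the final pid or None if a hop is missing."""
    current = start
    for predicate in predicates:
        current = EDGE_MAP.get((current, predicate))
        if current is None:
            return None
    return current


def locate(sample_pid):
    """Return (path_name, location_pid) for the first path that resolves, else None."""
    for name, predicates in PATHS:
        location = follow(sample_pid, predicates)
        if location is not None:
            return name, location
    return None


print(locate("sample:1"))
print(locate("sample:2"))
```

Because the paths are tried in a fixed order, the reported path is deterministic even when both resolve.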

## Example 4: Reverse Traversal - Find Samples at a Location

**Use Case:** Given a location, find all samples collected there

**Graph path (reversed):** `Location ← sample_location ← Event ← produced_by ← Sample`

In [11]:
# Pick a location with coordinates (this query does not check that any samples reference it)
location_pid = conn.execute("""
    SELECT DISTINCT l.pid
    FROM pqg l
    WHERE l.otype = 'GeospatialCoordLocation'
      AND l.latitude IS NOT NULL
      AND l.longitude IS NOT NULL
    LIMIT 1
""").fetchone()[0]

print(f"Testing with location: {location_pid}")

Testing with location: geoloc_9ed0e5532e46ec6e60c00dadad969f10a2d2e945


In [12]:
# PQG approach (using getRelations for reverse lookup)
def find_samples_at_location_pqg(pqg_instance, location_pid):
    """Find samples at a location using PQG reverse traversal"""
    samples = []
    
    # Find events that reference this location
    for subj, pred, obj in pqg_instance.getRelations(obj=location_pid, predicate='sample_location', maxrows=1000):
        event_pid = subj  # Event that has this location
        
        # Find samples produced by this event
        for s_subj, s_pred, s_obj in pqg_instance.getRelations(obj=event_pid, predicate='produced_by', maxrows=1000):
            sample_pid = s_subj
            samples.append(sample_pid)
    
    return samples

start = time.time()
pqg_samples = find_samples_at_location_pqg(pqg_instance, location_pid)
pqg_reverse_time = time.time() - start

print(f"PQG reverse traversal: {pqg_reverse_time*1000:.2f}ms")
print(f"Samples found: {len(pqg_samples)}")
if pqg_samples:
    print(f"First 3 samples: {pqg_samples[:3]}")

PQG reverse traversal: 3.83ms
Samples found: 0


In [13]:
# SQL approach (optimized for reverse lookup)
start = time.time()

sql_samples = conn.execute(f"""
    SELECT DISTINCT s.pid
    FROM pqg l
    -- Find edges pointing TO location
    JOIN pqg e1 ON e1.otype = '_edge_' 
                AND e1.p = 'sample_location'
                AND e1.o[1] = l.row_id
    -- Find edges pointing TO event
    JOIN pqg e2 ON e2.otype = '_edge_'
                AND e2.p = 'produced_by'
                AND e2.o[1] = e1.s
    -- Get sample node
    JOIN pqg s ON s.row_id = e2.s
                AND s.otype = 'MaterialSampleRecord'
    WHERE l.pid = '{location_pid}'
""").fetchall()

sql_reverse_time = time.time() - start

print(f"SQL reverse traversal: {sql_reverse_time*1000:.2f}ms")
print(f"Samples found: {len(sql_samples)}")
if sql_samples:
    print(f"First 3 samples: {[s[0] for s in sql_samples[:3]]}")

SQL reverse traversal: 6.26ms
Samples found: 0


**Comparison:**
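- **SQL**: Single join query, no per-event round trips
- **PQG**: Multiple API calls via getRelations(), one nested call per matching event
- **Performance**: Not established by this run. Both approaches returned 0 samples (3.83 ms for PQG, 6.26 ms for SQL), so the chosen location exercised neither traversal. SQL is expected to scale better as the number of matching events grows.
- **Winner**: 🏆 SQL for reverse traversal ("what points to X" queries)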

**Key insight:** PQG's getRelations() is less optimized for reverse lookups. Use SQL for these patterns.
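
Reverse lookups become cheap once edges are indexed by their object rather than their subject. The sketch below builds such an index over toy triples; it is an illustration of the pattern, not of how PQG or DuckDB store edges, and `TRIPLES`, `build_reverse_index`, and `samples_at` are hypothetical names.

```python
from collections import defaultdict

# (subject, predicate, object) triples; identifiers are hypothetical.
TRIPLES = [
    ("sample:1", "produced_by", "event:1"),
    ("sample:2", "produced_by", "event:2"),
    ("event:1", "sample_location", "loc:1"),
    ("event:2", "sample_location", "loc:1"),
    ("sample:3", "produced_by", "event:3"),
    ("event:3", "sample_location", "loc:2"),
]


def build_reverse_index(triples):
    """Map (predicate, object) -> list of subjects pointing at that object."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].append(subject)
    return index


def samples_at(index, location_pid):
    """Walk Location <- sample_location <- Event <- produced_by <- Sample."""
    samples = []
    for event in index.get(("sample_location", location_pid), []):
        samples.extend(index.get(("produced_by", event), []))
    return samples


reverse_index = build_reverse_index(TRIPLES)
print(samples_at(reverse_index, "loc:1"))
print(samples_at(reverse_index, "loc:unknown"))
```

A location with no incoming `sample_location` edges yields an empty list, which is the situation the run above appears to have hit.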

## Example 5: Entity Counting - When SQL Wins

**Use Case:** Count how many of each entity type exist

**This is a bulk aggregation - SQL's sweet spot**

In [14]:
# SQL approach (instant)
start = time.time()
sql_counts = conn.execute("""
    SELECT otype, COUNT(*) as count
    FROM pqg
    GROUP BY otype
    ORDER BY count DESC
""").df()
sql_count_time = time.time() - start

print(f"SQL aggregation: {sql_count_time*1000:.2f}ms")
print(sql_counts)

SQL aggregation: 13.68ms
                     otype    count
0                   _edge_  9201451
1            SamplingEvent  1096352
2     MaterialSampleRecord  1096352
3  GeospatialCoordLocation   198433
4        IdentifiedConcept    25778
5             SamplingSite    18213
6                    Agent      565


In [15]:
# PQG approach (iterating through records)
start = time.time()

pqg_counts = {}
for pid, otype in pqg_instance.getIds(maxrows=10000):  # Limited to 10K for demo
    pqg_counts[otype] = pqg_counts.get(otype, 0) + 1

pqg_count_time = time.time() - start

print(f"PQG iteration (10K rows only): {pqg_count_time*1000:.2f}ms")
print(f"\nPartial counts:")
for otype, count in sorted(pqg_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {otype}: {count:,}")

print(f"\n⚠️ Note: This is only 10K out of {total:,} records")
print(f"   Full dataset would take ~{(pqg_count_time * total / 10000):.1f}s with PQG")
print(f"   SQL took {sql_count_time*1000:.2f}ms for all records")

PQG iteration (10K rows only): 5.29ms

Partial counts:
  geoloc_f401f04667bf510a353d06b7025a7c66e13ea56b: 1
  geoloc_a133d0d8e1ca8888388c7a22073d2b6441fe3fe1: 1
  geoloc_09edf9bc6e1a3588eec87b7a538dc16ea9790c4c: 1
  geoloc_fdaa27592833b5745dbabd9138b12e3d9eef0c6f: 1
  geoloc_210c6b24821fb1618d50d7fb00a40e5f84c0be73: 1
  geoloc_01efb1b15b61bb5a709ce6c9d6f7328b498142d5: 1
  geoloc_4f5491c0f318bdc7cf7d1fa4394abff83aabc5be: 1
  geoloc_15c5a7434a62ce59128a5da660282b4f757d4a55: 1
  geoloc_40a0a5e7a80854a2aab3ec3a7ba8cdefbecf4513: 1
  geoloc_a20b62a269c10205eb6b202ec5d608cfef03a5bd: 1
  geoloc_d4aab0c40ed12254798bccf8e685f782c78059d3: 1
  geoloc_e0c9c5e581aaa786af9197a205b5fd4afa2da442: 1
  geoloc_7deec984bc3310f1f41838cee0f5790a739ecee3: 1
  geoloc_6552d7a5de431f7841d5966a7f9bdeb9e9820a5c: 1
  geoloc_1952cdcdb8f3c565a5c8a36a9f88767d2117e010: 1
  geoloc_1bf5a02dc6484e38d824b5ab68a9ff11260dcdfb: 1
  geoloc_d7e6837bb122dfcf0058242e8ac001837b1c1d02: 1
  geoloc_17eaccaccc28994ea556bc2cee0ab360e67

**Comparison:**
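- **SQL**: 13.68 ms for all 11.6M records, using columnar storage optimizations
- **PQG**: Must iterate through records in Python. Extrapolating the 10K-row timing (5.29 ms) to the full table gives roughly 6 s, several hundred times slower.
- **Caveat**: The partial counts above are keyed by PID rather than by `otype`, which suggests `getIds()` does not yield `(pid, otype)` pairs in the order the loop assumes. Check the method's return shape before relying on these counts.
- **Winner**: 🏆 SQL for bulk aggregations (no contest)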

**Key insight:** Never use PQG for GROUP BY operations on large datasets.

## Decision Matrix: When to Use PQG vs SQL

| Use Case | PQG | SQL | Rationale |
|----------|-----|-----|----------|
| Single node lookup | ✅ | ⚠️ | PQG handles row_id conversion, cleaner API |
| Multi-hop traversal (1-3 hops) | ✅ | ⚠️ | PQG more readable, acceptable performance |
| Reverse graph traversal | ⚠️ | ✅ | SQL more efficient for "what points to X" |
| Bulk aggregations (10K+ rows) | ❌ | ✅ | SQL dramatically faster |
| Visualization queries | ❌ | ✅ | Need specific projections, performance-critical |
| Data quality analysis | ❌ | ✅ | Requires full table scans |
| Learning/prototyping | ✅ | ⚠️ | PQG clearer for understanding graph structure |
| Production web queries | ❌ | ✅ | SQL already optimized and tested |

**Legend:**
- ✅ Recommended
- ⚠️ Works but not optimal
- ❌ Not recommended

## Key Takeaways

1. **PQG excels at:**
   - Interactive exploration and learning
   - Single node operations with relationship expansion
   - Development and prototyping
   - Making complex graph patterns more readable

2. **SQL excels at:**
   - Bulk operations and aggregations
   - Reverse traversals ("what points to X")
   - Production performance-critical queries
   - Full table scans and statistics

3. **Best practice: Hybrid approach**
   - Use PQG for exploratory analysis
   - Identify performance bottlenecks
   - Optimize critical sections with SQL
   - Document both approaches for learning value

4. **Performance rule of thumb:**
   - <100 nodes: PQG is fine
   - 100-1000 nodes: PQG acceptable, profile first
   - 1000+ nodes: Strongly prefer SQL
   - Aggregations: Always use SQL
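
The timings in this notebook come from single runs with `time.time()`, which is noisy at the millisecond scale. When profiling before choosing between PQG and SQL, a small helper that repeats the call and reports the median may give steadier numbers. A minimal sketch, where `time_call` is a hypothetical helper and not part of PQG or DuckDB:

```python
import statistics
import time


def time_call(func, *args, repeats=5, **kwargs):
    """Run func several times and return (median_ms, last_result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append((time.perf_counter() - start) * 1000)
    return statistics.median(durations), result


# Stand-in workload so the sketch runs without the parquet file.
median_ms, value = time_call(sum, range(1000), repeats=7)
print(f"median: {median_ms:.3f} ms, result: {value}")
```

In the notebook it could wrap the existing calls, for example `time_call(pqg_instance.getNode, sample_pid, max_depth=1)`. Repeated runs also warm any caches, so the first run is worth inspecting separately.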

## Next Steps

To explore further:

1. **Try modifying the traversal functions** - Add more hops, different predicates
2. **Explore other entity types** - Try with SamplingSite, IdentifiedConcept, etc.
3. **Compare with existing notebook** - See `oc_parquet_analysis_enhanced.ipynb` for production SQL patterns
4. **Contribute back to PQG** - If you find API gaps or performance improvements

## Resources

- **PQG Repository:** https://github.com/isamplesorg/pqg
- **PQG Documentation:** https://github.com/isamplesorg/pqg/tree/main/docs
- **Integration Plan:** `/Users/raymondyee/C/src/iSamples/isamples-python/PQG_INTEGRATION_PLAN.md`
- **Production Examples:** `oc_parquet_analysis_enhanced.ipynb`