# Understanding PQG Schema: Narrow vs Wide

**Learning Objective**: Build intuition for the two ways to store property graph relationships in Parquet format.

**Key Mental Model**: Both schemas represent the **exact same relationships** - they're just different *serializations* of the same semantic grammar.

---

## The Core Question

How do we store this relationship in a flat table?

```
Sample "ark:/28722/k2ng4nj6s" --produced_by--> Event "sampevent_633406f6..."
```

Two answers:
1. **Narrow**: Create a separate "edge" row
2. **Wide**: Add a column to the sample row


## Setup: Load Both Schemas

In [1]:
import duckdb
from pathlib import Path
import pandas as pd

# Local parquet files (stored under the home directory)
narrow_path = Path("~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet")
wide_path = Path("~/Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet")

# Verify files exist (Path.stat() does not expand "~", so expand it explicitly)
narrow_size = narrow_path.expanduser().stat().st_size
wide_size = wide_path.expanduser().stat().st_size
print("File sizes:")
print(f"  Narrow: {narrow_size / (1024**2):.1f} MB")
print(f"  Wide:   {wide_size / (1024**2):.1f} MB")
print(f"\n  Difference: {(1 - wide_size / narrow_size) * 100:.0f}% smaller")

# Create connection
con = duckdb.connect(':memory:')

File sizes:
  Narrow: 690.9 MB
  Wide:   275.3 MB

  Difference: 60% smaller


---

# Part 1: What Does "Narrow" Look Like?

In the narrow schema, relationships are stored as **separate edge rows** with:
- `otype = '_edge_'` (marks this as an edge, not an entity)
- `s` = source row_id (integer)
- `p` = predicate name (string like 'produced_by')
- `o` = object row_id(s) (integer array)
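
To make the `s`/`p`/`o` convention concrete, here is a minimal pure-Python sketch of following one edge in a narrow-style table. The rows, identifiers, and the `follow` helper are invented for illustration; they are not part of the PQG files or any library.

```python
# Toy narrow-format table: entity rows and edge rows live in the same list.
# All row ids and pids here are made up for illustration.
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event-1"},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
]

def follow(rows, source_row_id, predicate):
    """Return the entity rows reached from source_row_id via predicate."""
    by_id = {r["row_id"]: r for r in rows if r["otype"] != "_edge_"}
    targets = []
    for r in rows:
        is_match = (
            r["otype"] == "_edge_"
            and r["s"] == source_row_id
            and r["p"] == predicate
        )
        if is_match:
            # `o` is an array, so one edge row can point at several targets
            targets.extend(by_id[t] for t in r["o"])
    return targets

events = follow(narrow_rows, 1, "produced_by")
print([e["pid"] for e in events])  # ['event-1']
```

Each hop needs two lookups: one to find the edge row, one to find the entity it points to. The SQL joins later in this notebook have the same shape.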

In [2]:
# What's in the narrow schema?
print("=" * 60)
print("NARROW SCHEMA: Row type distribution")
print("=" * 60)

con.sql(f"""
    SELECT 
        otype,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as pct
    FROM read_parquet('{narrow_path}')
    GROUP BY otype
    ORDER BY count DESC
""").show()

NARROW SCHEMA: Row type distribution
┌─────────────────────────┬─────────┬────────┐
│          otype          │  count  │  pct   │
│         varchar         │  int64  │ double │
├─────────────────────────┼─────────┼────────┤
│ _edge_                  │ 9201451 │   79.1 │
│ MaterialSampleRecord    │ 1096352 │    9.4 │
│ SamplingEvent           │ 1096352 │    9.4 │
│ GeospatialCoordLocation │  198433 │    1.7 │
│ IdentifiedConcept       │   25778 │    0.2 │
│ SamplingSite            │   18213 │    0.2 │
│ Agent                   │     565 │    0.0 │
└─────────────────────────┴─────────┴────────┘



### 💡 Key Insight #1: Edge Rows Dominate

Notice that **79% of all rows are `_edge_` rows**! 

For every entity (sample, event, location), there are multiple edge rows connecting it to other entities.

This is the main source of file size and query complexity in narrow format.

In [3]:
# Let's look at actual edge rows
print("Example edge rows (showing s, p, o fields):")
print()

con.sql(f"""
    SELECT 
        row_id,
        s as source_row_id,
        p as predicate,
        o as object_row_ids
    FROM read_parquet('{narrow_path}')
    WHERE otype = '_edge_'
    LIMIT 10
""").show()

Example edge rows (showing s, p, o fields):

┌─────────┬───────────────┬───────────────┬────────────────┐
│ row_id  │ source_row_id │   predicate   │ object_row_ids │
│  int32  │     int32     │    varchar    │    int32[]     │
├─────────┼───────────────┼───────────────┼────────────────┤
│ 2435694 │        212310 │ site_location │ [28766]        │
│ 2435695 │        209300 │ site_location │ [28809]        │
│ 2435696 │        210422 │ site_location │ [28836]        │
│ 2435697 │        203465 │ site_location │ [28905]        │
│ 2435698 │        214874 │ site_location │ [28930]        │
│ 2435699 │        210742 │ site_location │ [28991]        │
│ 2435700 │        205667 │ site_location │ [28998]        │
│ 2435701 │        210726 │ site_location │ [29287]        │
│ 2435702 │        205451 │ site_location │ [29307]        │
│ 2435703 │        215175 │ site_location │ [29441]        │
├─────────┴───────────────┴───────────────┴────────────────┤
│ 10 rows                                        4 columns │
└──────────────────────────────────────────────────────────┘

In [4]:
# What predicate types exist?
print("Predicate types in narrow schema:")
print()

con.sql(f"""
    SELECT 
        p as predicate,
        COUNT(*) as edge_count
    FROM read_parquet('{narrow_path}')
    WHERE otype = '_edge_'
    GROUP BY p
    ORDER BY edge_count DESC
""").show()

Predicate types in narrow schema:

┌────────────────────────┬────────────┐
│       predicate        │ edge_count │
│        varchar         │   int64    │
├────────────────────────┼────────────┤
│ has_context_category   │    1096352 │
│ has_material_category  │    1096352 │
│ has_sample_object_type │    1096352 │
│ produced_by            │    1096352 │
│ sampling_site          │    1096352 │
│ keywords               │    1096297 │
│ sample_location        │    1096274 │
│ responsibility         │    1095272 │
│ registrant             │     413635 │
│ site_location          │      18213 │
├────────────────────────┴────────────┤
│ 10 rows                   2 columns │
└─────────────────────────────────────┘



### 💡 Key Insight #2: The 10 Predicate Types

These 10 predicates cover every relationship present in this OpenContext iSamples export:

| Predicate | Meaning | Count |
|-----------|---------|-------|
| `produced_by` | Sample → SamplingEvent | ~1.1M |
| `keywords` | Sample → Concept | ~1.1M |
| `has_context_category` | Sample → Concept | ~1.1M |
| `has_material_category` | Sample → Concept | ~1.1M |
| `has_sample_object_type` | Sample → Concept | ~1.1M |
| `sampling_site` | Event → Site | ~1.1M |
| `sample_location` | Event → GeoLocation | ~1.1M |
| `responsibility` | Event → Agent | ~1.1M |
| `registrant` | Sample → Agent | ~413K |
| `site_location` | Site → GeoLocation | ~18K |

---

# Part 2: What Does "Wide" Look Like?

In the wide schema:
- **No `_edge_` rows exist** (they've been eliminated!)
- Each predicate becomes a **column** named `p__<predicate>`
- The column contains an array of target row_ids
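
The folding of edge rows into `p__<predicate>` columns can be sketched in plain Python. This is only an assumed outline of the transformation: the converter that produced the wide file is not shown in this notebook, and `narrow_to_wide` and the toy rows are hypothetical.

```python
from collections import defaultdict

def narrow_to_wide(rows):
    """Fold `_edge_` rows into `p__<predicate>` lists on their source entity rows."""
    # edges[source_row_id]["p__<predicate>"] -> list of target row_ids
    edges = defaultdict(lambda: defaultdict(list))
    for r in rows:
        if r["otype"] == "_edge_":
            edges[r["s"]]["p__" + r["p"]].extend(r["o"])

    wide = []
    for r in rows:
        if r["otype"] != "_edge_":
            # Entities with no outgoing edges simply get no p__ keys
            # (these would be NULL in the Parquet columns).
            wide.append({**r, **edges.get(r["row_id"], {})})
    return wide

# Toy input, invented for illustration
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event-1"},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
]
wide_rows = narrow_to_wide(narrow_rows)
print(wide_rows[0]["p__produced_by"])  # [2]
```

Three input rows become two output rows, which matches the row-count drop seen in the real files.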

In [5]:
# What's in the wide schema?
print("=" * 60)
print("WIDE SCHEMA: Row type distribution")
print("=" * 60)

con.sql(f"""
    SELECT 
        otype,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as pct
    FROM read_parquet('{wide_path}')
    GROUP BY otype
    ORDER BY count DESC
""").show()

WIDE SCHEMA: Row type distribution
┌─────────────────────────┬─────────┬────────┐
│          otype          │  count  │  pct   │
│         varchar         │  int64  │ double │
├─────────────────────────┼─────────┼────────┤
│ MaterialSampleRecord    │ 1110412 │   45.1 │
│ SamplingEvent           │ 1110412 │   45.1 │
│ GeospatialCoordLocation │  199147 │    8.1 │
│ IdentifiedConcept       │   25929 │    1.1 │
│ SamplingSite            │   18213 │    0.7 │
│ Agent                   │     577 │    0.0 │
└─────────────────────────┴─────────┴────────┘



### 💡 Key Insight #3: No Edge Rows!

The `_edge_` type is **completely gone**. We went from 11.6M rows to 2.5M rows.

But wait - where did the relationship information go?

In [6]:
# Show the new p__* columns
print("Wide schema columns (p__* columns highlighted):")
print()

schema = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{wide_path}')").df()

# Highlight the predicate columns
for _, row in schema.iterrows():
    name = row['column_name']
    dtype = row['column_type']
    if name.startswith('p__'):
        print(f"  ⭐ {name:<30} {dtype}")
    else:
        print(f"     {name:<30} {dtype}")

Wide schema columns (p__* columns highlighted):

     row_id                         INTEGER
     pid                            VARCHAR
     tcreated                       INTEGER
     tmodified                      INTEGER
     otype                          VARCHAR
     n                              VARCHAR
     altids                         VARCHAR[]
     geometry                       BLOB
     authorized_by                  VARCHAR[]
     has_feature_of_interest        VARCHAR
     affiliation                    VARCHAR
     sampling_purpose               VARCHAR
     complies_with                  VARCHAR[]
     project                        VARCHAR
     alternate_identifiers          VARCHAR[]
     relationship                   VARCHAR
     elevation                      VARCHAR
     sample_identifier              VARCHAR
     dc_rights                      VARCHAR
     result_time                    VARCHAR
     contact_information            VARCHAR
     latitude         

In [7]:
# Look at a sample row with its embedded relationships
print("Example: A MaterialSampleRecord with embedded relationships")
print()

con.sql(f"""
    SELECT 
        pid,
        label,
        p__produced_by,
        p__keywords,
        p__has_material_category
    FROM read_parquet('{wide_path}')
    WHERE otype = 'MaterialSampleRecord'
    AND p__produced_by IS NOT NULL
    LIMIT 5
""").show(max_width=200)

Example: A MaterialSampleRecord with embedded relationships

┌──────────────────────┬──────────────┬────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────┐
│         pid          │    label     │ p__produced_by │                                                    p__keywords                                                     │ p__has_material_category │
│       varchar        │   varchar    │    int32[]     │                                                      int32[]                                                       │         int32[]          │
├──────────────────────┼──────────────┼────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────┤
│ ark:/28722/k2wq0f07p │ Batch 13     │ [892343]       │ [2438661, 2439143]                                                            

### 💡 Key Insight #4: Relationships Are Now Columns

Instead of:
```
Row 1: {otype: 'sample', pid: 'ark:/123', ...}
Row 2: {otype: '_edge_', s: 1, p: 'produced_by', o: [456]}
```

We now have:
```
Row 1: {otype: 'sample', pid: 'ark:/123', p__produced_by: [456], ...}
```

The edge information is **denormalized** into the entity row itself.

---

# Part 3: Side-by-Side Query Comparison

**Task**: Find samples with their geographic coordinates

This requires traversing:
```
MaterialSampleRecord → SamplingEvent → GeospatialCoordLocation
```
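
As a rough illustration of this two-hop path, the sketch below walks wide-style rows held in a Python list. The rows, coordinates, and the `sample_coordinates` helper are made up for the example; only the `p__produced_by` and `p__sample_location` column names come from the schema described above.

```python
# Toy wide-format rows; values are invented for illustration
wide_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample-1",
     "p__produced_by": [2]},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event-1",
     "p__sample_location": [3]},
    {"row_id": 3, "otype": "GeospatialCoordLocation",
     "latitude": 37.67, "longitude": 32.83},
]

def sample_coordinates(rows, sample_pid):
    """Follow sample -> event -> location and return (lat, lon) pairs."""
    by_id = {r["row_id"]: r for r in rows}
    sample = next(r for r in rows if r.get("pid") == sample_pid)
    coords = []
    # A missing or NULL p__ column is treated as "no relationship"
    for event_id in sample.get("p__produced_by") or []:
        event = by_id[event_id]
        for geo_id in event.get("p__sample_location") or []:
            geo = by_id[geo_id]
            coords.append((geo["latitude"], geo["longitude"]))
    return coords

print(sample_coordinates(wide_rows, "sample-1"))  # [(37.67, 32.83)]
```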

In [8]:
import time

# NARROW SCHEMA QUERY
print("=" * 60)
print("NARROW SCHEMA QUERY")
print("=" * 60)
print()

narrow_sql = f"""
SELECT 
    samp.pid as sample_pid,
    samp.label as sample_label,
    geo.latitude,
    geo.longitude
FROM read_parquet('{narrow_path}') AS samp
-- Join to edge row: sample -> event
JOIN read_parquet('{narrow_path}') AS e1 
    ON e1.s = samp.row_id 
    AND e1.p = 'produced_by'
    AND e1.otype = '_edge_'
-- Join to event entity
JOIN read_parquet('{narrow_path}') AS event 
    ON event.row_id = e1.o[1]
-- Join to edge row: event -> location
JOIN read_parquet('{narrow_path}') AS e2 
    ON e2.s = event.row_id 
    AND e2.p = 'sample_location'
    AND e2.otype = '_edge_'
-- Join to location entity
JOIN read_parquet('{narrow_path}') AS geo 
    ON geo.row_id = e2.o[1]
WHERE samp.otype = 'MaterialSampleRecord'
LIMIT 5
"""

print("SQL Query (note: 5 table references, 4 joins, 2 of them through edge rows):")
print(narrow_sql)

start = time.time()
result_narrow = con.sql(narrow_sql).df()
narrow_time = time.time() - start

print(f"\nExecution time: {narrow_time*1000:.1f}ms")
print(f"\nResults:")
print(result_narrow.to_string(index=False))

NARROW SCHEMA QUERY

SQL Query (note: 5 table references, 4 joins, 2 of them through edge rows):

SELECT 
    samp.pid as sample_pid,
    samp.label as sample_label,
    geo.latitude,
    geo.longitude
FROM read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet') AS samp
-- Join to edge row: sample -> event
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet') AS e1 
    ON e1.s = samp.row_id 
    AND e1.p = 'produced_by'
    AND e1.otype = '_edge_'
-- Join to event entity
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet') AS event 
    ON event.row_id = e1.o[1]
-- Join to edge row: event -> location
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet') AS e2 
    ON e2.s = event.row_id 
    AND e2.p = 'sample_location'
    AND e2.otype = '_edge_'
-- Join to location entity
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg.parquet') AS geo 
    ON geo.row_id = e2.o[1]
WHERE samp.otype = 'MaterialSampleRecord'
LIMIT 5


In [9]:
# WIDE SCHEMA QUERY
print("=" * 60)
print("WIDE SCHEMA QUERY")
print("=" * 60)
print()

wide_sql = f"""
SELECT 
    samp.pid as sample_pid,
    samp.label as sample_label,
    geo.latitude,
    geo.longitude
FROM read_parquet('{wide_path}') AS samp
-- Direct join via p__produced_by column
JOIN read_parquet('{wide_path}') AS event 
    ON event.row_id = samp.p__produced_by[1]
-- Direct join via p__sample_location column  
JOIN read_parquet('{wide_path}') AS geo 
    ON geo.row_id = event.p__sample_location[1]
WHERE samp.otype = 'MaterialSampleRecord'
LIMIT 5
"""

print("SQL Query (note: 3 table references, direct column access):")
print(wide_sql)

start = time.time()
result_wide = con.sql(wide_sql).df()
wide_time = time.time() - start

print(f"\nExecution time: {wide_time*1000:.1f}ms")
print(f"\nResults:")
print(result_wide.to_string(index=False))

WIDE SCHEMA QUERY

SQL Query (note: 3 table references, direct column access):

SELECT 
    samp.pid as sample_pid,
    samp.label as sample_label,
    geo.latitude,
    geo.longitude
FROM read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet') AS samp
-- Direct join via p__produced_by column
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet') AS event 
    ON event.row_id = samp.p__produced_by[1]
-- Direct join via p__sample_location column  
JOIN read_parquet('~/Data/iSample/pqg_refining/oc_isamples_pqg_wide.parquet') AS geo 
    ON geo.row_id = event.p__sample_location[1]
WHERE samp.otype = 'MaterialSampleRecord'
LIMIT 5


Execution time: 67.2ms

Results:
          sample_pid sample_label  latitude  longitude
ark:/28722/k24q7t61v    3587.F234 37.668196  32.827191
ark:/28722/k29p2wj9b     6515.F47 37.666389  32.822500
ark:/28722/k2n878m09   Bone 30518 30.011600  52.408600
ark:/28722/k2x354n03       4004-4 34.981240  33.707123
ark:/28722/k2rf

In [10]:
# Performance summary
print("=" * 60)
print("PERFORMANCE COMPARISON")
print("=" * 60)
print()
print(f"{'Metric':<25} {'Narrow':>15} {'Wide':>15} {'Improvement':>15}")
print("-" * 70)
print(f"{'Query time':<25} {narrow_time*1000:>12.1f}ms {wide_time*1000:>12.1f}ms {narrow_time/wide_time:>14.1f}x")
print(f"{'Table references':<25} {'5':>15} {'3':>15} {'40% fewer':>15}")
print(f"{'Edge row joins':<25} {'2':>15} {'0':>15} {'100% fewer':>15}")

PERFORMANCE COMPARISON

Metric                             Narrow            Wide     Improvement
----------------------------------------------------------------------
Query time                       100.6ms         67.2ms            1.5x
Table references                        5               3       40% fewer
Edge row joins                          2               0      100% fewer
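
A single timed run can be noisy, since the first query may include file-open and metadata costs. One hedged way to get a steadier number is to repeat the call and take the median. The `time_call` helper below is a hypothetical utility, not part of DuckDB, and the workload is a stand-in.

```python
import statistics
import time

def time_call(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-in workload. In this notebook the callable could instead be
# something like `lambda: con.sql(narrow_sql).df()`.
elapsed_ms = time_call(lambda: sum(range(100_000)))
print(f"median: {elapsed_ms:.2f} ms")
```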


---

# Part 4: The Tradeoffs

## ✅ Advantages of Wide Schema

| Benefit | Why It Matters |
|---------|----------------|
| **60% smaller files** | Less storage, faster downloads |
| **79% fewer rows** | Less I/O, faster scans |
| **Faster queries (about 1.5x in the test above)** | No joins through edge rows |
| **Simpler SQL** | Easier to write and understand |

## ❌ Disadvantages of Wide Schema

| Drawback | Why It Matters |
|----------|----------------|
| **Schema rigidity** | Adding a new predicate means adding a column, which requires rewriting the Parquet file |
| **Sparse columns** | Many NULL values if predicates vary by entity type |
| **Shared predicate ambiguity** | A column such as `p__responsibility` can be populated by more than one entity type (e.g. Event and Curation), so queries must also filter on `otype` |
| **No edge metadata** | Can't store timestamps/confidence on relationships |
| **Harder updates** | Modifying array columns vs insert/delete rows |

In [11]:
# Demonstrate sparse columns issue
print("Sparse columns demonstration:")
print("(How many entities actually USE each p__* column?)")
print()

con.sql(f"""
    SELECT
        'p__produced_by' as column_name,
        COUNT(*) FILTER (WHERE p__produced_by IS NOT NULL) as non_null,
        COUNT(*) as total,
        ROUND(COUNT(*) FILTER (WHERE p__produced_by IS NOT NULL) * 100.0 / COUNT(*), 1) as pct_used
    FROM read_parquet('{wide_path}')
    UNION ALL
    SELECT
        'p__site_location',
        COUNT(*) FILTER (WHERE p__site_location IS NOT NULL),
        COUNT(*),
        ROUND(COUNT(*) FILTER (WHERE p__site_location IS NOT NULL) * 100.0 / COUNT(*), 1)
    FROM read_parquet('{wide_path}')
    UNION ALL
    SELECT
        'p__registrant',
        COUNT(*) FILTER (WHERE p__registrant IS NOT NULL),
        COUNT(*),
        ROUND(COUNT(*) FILTER (WHERE p__registrant IS NOT NULL) * 100.0 / COUNT(*), 1)
    FROM read_parquet('{wide_path}')
""").show()

print("\n💡 Most p__* columns are NULL for most rows!")
print("   This is expected - not all entity types use all predicates.")

Sparse columns demonstration:
(How many entities actually USE each p__* column?)

┌──────────────────┬──────────┬─────────┬──────────┐
│   column_name    │ non_null │  total  │ pct_used │
│     varchar      │  int64   │  int64  │  double  │
├──────────────────┼──────────┼─────────┼──────────┤
│ p__produced_by   │  1110412 │ 2464690 │     45.1 │
│ p__site_location │    18213 │ 2464690 │      0.7 │
│ p__registrant    │   421521 │ 2464690 │     17.1 │
└──────────────────┴──────────┴─────────┴──────────┘


💡 Most p__* columns are NULL for most rows!
   This is expected - not all entity types use all predicates.


In [12]:
# Demonstrate shared predicate ambiguity
print("Shared predicate demonstration:")
print("(Which entity types populate p__responsibility?)")
print()

con.sql(f"""
    SELECT
        otype,
        COUNT(*) as has_responsibility
    FROM read_parquet('{wide_path}')
    WHERE p__responsibility IS NOT NULL
    GROUP BY otype
    ORDER BY has_responsibility DESC
""").show()

print("\n💡 When querying p__responsibility, also filter on otype.")
print("   Only SamplingEvent uses it in this snapshot, but other types (e.g. Curation) could share the column.")

Shared predicate demonstration:
(Which entity types populate p__responsibility?)

┌───────────────┬────────────────────┐
│     otype     │ has_responsibility │
│    varchar    │       int64        │
├───────────────┼────────────────────┤
│ SamplingEvent │            1109332 │
└───────────────┴────────────────────┘


💡 When querying p__responsibility, also filter on otype.
   Only SamplingEvent uses it in this snapshot, but other types (e.g. Curation) could share the column.


---

# Part 5: When to Use Which?

## Use Narrow Schema When:
- Schema is evolving (new predicates added frequently)
- You need edge metadata (timestamps, confidence, provenance)
- Write-heavy workloads (frequent relationship updates)
- Mixed entity types with very different predicate sets

## Use Wide Schema When:
- **Read-heavy analytics** ✓ (our use case!)
- Schema is stable and well-defined
- Query performance is critical
- Working with browser-based tools (DuckDB-WASM)
- File size matters (remote access, downloads)

---

# Part 6: The Mental Model

## Key Takeaway: Grammar vs Serialization

```
┌─────────────────────────────────────────────────────────────┐
│                   iSamples Edge Grammar                      │
│                   (14 typed relationships)                   │
│                                                              │
│   MaterialSampleRecord --produced_by--> SamplingEvent       │
│   MaterialSampleRecord --keywords--> IdentifiedConcept      │
│   SamplingEvent --sample_location--> GeospatialCoordLocation│
│   ...etc (14 total)                                         │
└─────────────────────────────────────────────────────────────┘
                           │
                           │ Can be serialized as:
                           ▼
        ┌──────────────────┴──────────────────┐
        │                                      │
        ▼                                      ▼
┌───────────────────┐                ┌───────────────────┐
│   NARROW FORMAT   │                │    WIDE FORMAT    │
│                   │                │                   │
│ Edge rows with    │                │ p__* columns on   │
│ s/p/o fields      │                │ entity rows       │
│                   │                │                   │
│ ✓ Flexible        │                │ ✓ Fast reads      │
│ ✓ Edge metadata   │                │ ✓ Small files     │
│ ✗ More rows       │                │ ✗ Schema rigid    │
│ ✗ Complex queries │                │ ✗ Sparse columns  │
└───────────────────┘                └───────────────────┘
```

**The typed edge types (ISamplesEdgeType) define WHAT relationships are valid.**

**Narrow vs Wide defines HOW those relationships are stored.**
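
One way to check the "same grammar, different serialization" claim is to reduce both formats to a set of (subject, predicate, object) triples and compare them. The sketch below does this on toy rows; the function names and data are illustrative assumptions, not part of the PQG tooling.

```python
def triples_from_narrow(rows):
    """Expand each `_edge_` row into one (s, p, o) triple per target."""
    return {
        (r["s"], r["p"], o)
        for r in rows
        if r["otype"] == "_edge_"
        for o in r["o"]
    }

def triples_from_wide(rows):
    """Expand each non-empty `p__<predicate>` column into (s, p, o) triples."""
    triples = set()
    for r in rows:
        for key, targets in r.items():
            if key.startswith("p__") and targets:
                predicate = key[len("p__"):]
                triples.update((r["row_id"], predicate, o) for o in targets)
    return triples

# Toy data, invented for illustration
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord"},
    {"row_id": 2, "otype": "SamplingEvent"},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
]
wide_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "p__produced_by": [2]},
    {"row_id": 2, "otype": "SamplingEvent", "p__produced_by": None},
]

print(triples_from_narrow(narrow_rows) == triples_from_wide(wide_rows))  # True
```

The triples match, but the edge row's own `row_id` (3 here) has no counterpart in the wide rows. That is the concrete form of the "no edge metadata" drawback.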

---

# Part 7: Data Snapshot Comparison (June vs December 2025)

**Important Discovery**: Our narrow and wide files are from **different dates**!

- **Narrow**: June 9, 2025
- **Wide**: December 1, 2025

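This lets us see how OpenContext data evolved over roughly 6 months. It also means the size, row-count, and timing comparisons above mix the format change with a small amount of data growth (about 1% more samples, as shown below).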

In [13]:
import os
from datetime import datetime

print("=" * 70)
print("FILE METADATA")
print("=" * 70)

for name, path in [("Narrow (June)", narrow_path), ("Wide (December)", wide_path)]:
    stat = os.stat(path.expanduser())  # os.stat() does not expand "~"
    mtime = datetime.fromtimestamp(stat.st_mtime)
    print(f"{name}: {stat.st_size / (1024**2):.1f} MB, modified {mtime.strftime('%Y-%m-%d')}")

FILE METADATA
Narrow (June): 690.9 MB, modified 2025-06-09
Wide (December): 275.3 MB, modified 2025-12-01


## Entity Count Changes Over Time

In [14]:
# Compare entity counts between snapshots
print("=" * 70)
print("ENTITY COUNT CHANGES: June 2025 → December 2025")
print("=" * 70)

comparison_df = con.sql(f"""
    WITH narrow AS (
        SELECT otype, COUNT(*) as june_count
        FROM read_parquet('{narrow_path}')
        WHERE otype != '_edge_'
        GROUP BY otype
    ),
    wide AS (
        SELECT otype, COUNT(*) as dec_count
        FROM read_parquet('{wide_path}')
        GROUP BY otype
    )
    SELECT
        COALESCE(n.otype, w.otype) as entity_type,
        COALESCE(n.june_count, 0) as june_2025,
        COALESCE(w.dec_count, 0) as dec_2025,
        COALESCE(w.dec_count, 0) - COALESCE(n.june_count, 0) as change,
        ROUND((COALESCE(w.dec_count, 0) - COALESCE(n.june_count, 0)) * 100.0 / NULLIF(n.june_count, 0), 2) as pct_change
    FROM narrow n
    FULL OUTER JOIN wide w ON n.otype = w.otype
    ORDER BY june_2025 DESC
""").df()

print()
print(comparison_df.to_string(index=False))

ENTITY COUNT CHANGES: June 2025 → December 2025

            entity_type  june_2025  dec_2025  change  pct_change
   MaterialSampleRecord    1096352   1110412   14060        1.28
          SamplingEvent    1096352   1110412   14060        1.28
GeospatialCoordLocation     198433    199147     714        0.36
      IdentifiedConcept      25778     25929     151        0.59
           SamplingSite      18213     18213       0        0.00
                  Agent        565       577      12        2.12


## Sample Overlap Analysis

How many samples are in both snapshots vs added/removed?

In [15]:
# Analyze sample overlap
print("=" * 70)
print("SAMPLE OVERLAP ANALYSIS")
print("=" * 70)

overlap = con.sql(f"""
    WITH june AS (
        SELECT pid FROM read_parquet('{narrow_path}')
        WHERE otype = 'MaterialSampleRecord'
    ),
    dec AS (
        SELECT pid FROM read_parquet('{wide_path}')
        WHERE otype = 'MaterialSampleRecord'
    )
    SELECT
        (SELECT COUNT(*) FROM june) as june_total,
        (SELECT COUNT(*) FROM dec) as dec_total,
        (SELECT COUNT(*) FROM june j JOIN dec d ON j.pid = d.pid) as in_both,
        (SELECT COUNT(*) FROM june j LEFT JOIN dec d ON j.pid = d.pid WHERE d.pid IS NULL) as june_only,
        (SELECT COUNT(*) FROM dec d LEFT JOIN june j ON d.pid = j.pid WHERE j.pid IS NULL) as dec_only
""").df()

print(f"""
  June 2025 samples:     {overlap['june_total'].iloc[0]:>10,}
  December 2025 samples: {overlap['dec_total'].iloc[0]:>10,}

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  In BOTH snapshots:     {overlap['in_both'].iloc[0]:>10,}  ({overlap['in_both'].iloc[0]/overlap['june_total'].iloc[0]*100:.2f}% stable)
  June only (removed):   {overlap['june_only'].iloc[0]:>10,}
  December only (added): {overlap['dec_only'].iloc[0]:>10,}
""")

SAMPLE OVERLAP ANALYSIS

  June 2025 samples:      1,096,352
  December 2025 samples:  1,110,412

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  In BOTH snapshots:      1,096,347  (100.00% stable)
  June only (removed):            5
  December only (added):     14,065
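
The overlap logic is plain set arithmetic. The sketch below shows the same three buckets on toy identifiers and then checks that the totals reported above are mutually consistent. `snapshot_overlap` is a hypothetical helper, and it assumes identifiers are unique within each snapshot (the SQL joins would count duplicates more than once).

```python
def snapshot_overlap(june_pids, dec_pids):
    """Count identifiers in both snapshots, June only, and December only."""
    june, dec = set(june_pids), set(dec_pids)
    return {
        "in_both": len(june & dec),
        "june_only": len(june - dec),
        "dec_only": len(dec - june),
    }

# Toy identifiers, not real PIDs
toy = snapshot_overlap(["a", "b", "c"], ["b", "c", "d", "e"])
print(toy)  # {'in_both': 2, 'june_only': 1, 'dec_only': 2}

# Consistency check on the reported totals:
# June total - removed + added should equal the December total.
assert 1_096_352 - 5 + 14_065 == 1_110_412
```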



## Samples Removed (5 records)

These samples existed in June but not December - likely test/duplicate records.

In [16]:
# What was removed?
print("Samples in June but NOT in December:")
print()

removed = con.sql(f"""
    SELECT j.pid, j.label
    FROM read_parquet('{narrow_path}') j
    LEFT JOIN read_parquet('{wide_path}') d ON j.pid = d.pid AND d.otype = 'MaterialSampleRecord'
    WHERE j.otype = 'MaterialSampleRecord'
    AND d.pid IS NULL
""").df()

for _, row in removed.iterrows():
    print(f"  {row['pid']}: {row['label']}")

Samples in June but NOT in December:

  ark:/28722/r2p24/pc_0_a: PC 0(a)
  ark:/28722/r2p24/pc_0_b: PC 0(b)
  ark:/28722/r2p24/pc_0_c: PC 0(c)
  ark:/28722/r2p24/pc_0_e: PC 0(e)
  ark:/28722/r2p24/pc_0_d: PC 0(d)


## Samples Added (~14,000 new records)

New samples from ongoing archaeological excavations.

In [17]:
# What was added?
print("New samples in December (showing first 15):")
print()

added = con.sql(f"""
    SELECT d.pid, d.label
    FROM read_parquet('{wide_path}') d
    LEFT JOIN read_parquet('{narrow_path}') j ON d.pid = j.pid AND j.otype = 'MaterialSampleRecord'
    WHERE d.otype = 'MaterialSampleRecord'
    AND j.pid IS NULL
    LIMIT 15
""").df()

for _, row in added.iterrows():
    label = str(row['label'])[:50] + "..." if len(str(row['label'])) > 50 else row['label']
    print(f"  {row['pid']}: {label}")

New samples in December (showing first 15):

  ark:/28722/k23v02m5g: Lithic ID: 6682
  ark:/28722/k2c82sr98: Bot. Spec. 38486
  ark:/28722/k2cz3qh26: Bot. Spec. 39337
  ark:/28722/k2db8j12v: Bot. Spec. 35974
  ark:/28722/k2697nb7z: Lithic ID: 2256
  ark:/28722/k2fx7s95k: Bot. Spec. 41169
  ark:/28722/k2417gd6x: Lithic ID: 5878
  ark:/28722/k2qf96x70: Lithic ID: 3071
  ark:/28722/k24j10c72: Lithic ID: 7692
  ark:/28722/k28630r3m: Lithic ID: 1719
  ark:/28722/k22523w4j: Bot. Spec. 36558
  ark:/28722/k2br98t4q: Bot. Spec. 39198
  ark:/28722/k27p9gr17: Bot. Spec. 38657
  ark:/28722/k2vq3fp03: Bot. Spec. 41110
  ark:/28722/k2wh32w0g: Lithic ID: 5452


## What Types of Samples Were Added?

In [18]:
# Categorize new samples by label pattern
print("New samples by category (based on label patterns):")
print()

new_sample_types = con.sql(f"""
    WITH new_samples AS (
        SELECT d.pid, d.label
        FROM read_parquet('{wide_path}') d
        LEFT JOIN read_parquet('{narrow_path}') j ON d.pid = j.pid AND j.otype = 'MaterialSampleRecord'
        WHERE d.otype = 'MaterialSampleRecord'
        AND j.pid IS NULL
    )
    SELECT
        CASE
            WHEN label LIKE 'Lithic ID%' THEN 'Lithic (stone tools)'
            WHEN label LIKE 'Bot. Spec%' THEN 'Botanical specimens'
            WHEN label LIKE 'SF T%' THEN 'Special Finds (SF)'
            WHEN label LIKE '%Tile%' THEN 'Tiles'
            WHEN label LIKE 'VdM%' THEN 'VdM collection'
            WHEN label LIKE 'bf%' THEN 'bf collection'
            ELSE 'Other'
        END as sample_category,
        COUNT(*) as count
    FROM new_samples
    GROUP BY sample_category
    ORDER BY count DESC
""").df()

print(new_sample_types.to_string(index=False))

New samples by category (based on label patterns):

     sample_category  count
Lithic (stone tools)   7903
 Botanical specimens   4977
  Special Finds (SF)    381
      VdM collection    268
               Other    247
       bf collection    221
               Tiles     68
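
The SQL `CASE` above is a first-match-wins rule list. A rough Python equivalent, assuming case-sensitive matching like SQL `LIKE`, is sketched below. The `categorize` function is illustrative, and the test labels are taken from outputs earlier in this notebook.

```python
from collections import Counter

def categorize(label):
    """Map a sample label to a category; the first matching rule wins."""
    if label.startswith("Lithic ID"):
        return "Lithic (stone tools)"
    if label.startswith("Bot. Spec"):
        return "Botanical specimens"
    if label.startswith("SF T"):
        return "Special Finds (SF)"
    if "Tile" in label:  # LIKE '%Tile%' matches anywhere in the label
        return "Tiles"
    if label.startswith("VdM"):
        return "VdM collection"
    if label.startswith("bf"):
        return "bf collection"
    return "Other"

labels = ["Lithic ID: 6682", "Bot. Spec. 38486", "Lithic ID: 2256", "Batch 13"]
counts = Counter(categorize(label) for label in labels)
print(counts.most_common())
```

Because rules are checked in order, a label such as "Lithic ID: Tile" would count as lithic, not as a tile. The same holds for the SQL version.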


## New Vocabulary Terms (IdentifiedConcepts)

151 new concepts were added - mostly lithic analysis terminology.

In [19]:
# New vocabulary terms
print("New IdentifiedConcepts (vocabulary terms) - first 15:")
print()

new_concepts = con.sql(f"""
    SELECT d.label
    FROM read_parquet('{wide_path}') d
    LEFT JOIN read_parquet('{narrow_path}') j ON d.pid = j.pid AND j.otype = 'IdentifiedConcept'
    WHERE d.otype = 'IdentifiedConcept'
    AND j.pid IS NULL
    LIMIT 15
""").df()

for _, row in new_concepts.iterrows():
    print(f"  • {row['label']}")

New IdentifiedConcepts (vocabulary terms) - first 15:

  • Age (period) :: Late Iron 2
  • Termination Type :: Plunge (ID:4)
  • Termination Type :: Step (ID:3)
  • Tool Type :: Scraper - Thumbnail (ID:26)
  • Tool Type :: Flake (ID:1)
  • Vessel Part Present :: Spout
  • Unifacial/Bifacial :: Not Recorded (ID:4)
  • Tool Type :: Drill (ID:16)
  • Width Location :: Proximal (ID:1)
  • Age (period) :: Late Iron 1
  • Subtype :: Drill Bit (ID:2)
  • Raw Material :: Chert (ID:1)
  • Striking Platform :: No (ID:2)
  • Termination Type :: Hinge (ID:2)
  • Raw Material :: Jasper - Red (ID:7)


## Data Enrichment: Changes to Existing Samples

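Samples that existed in June also changed. In the 1,000-sample comparison below, keyword links increase while registrant links decrease. The three `has_*` predicates drop to 0 only because the wide-side query counts just `p__produced_by`, `p__keywords`, and `p__registrant`; that drop does not indicate removed relationships.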

In [20]:
# Compare relationship counts for overlapping samples
print("=" * 70)
print("RELATIONSHIP ENRICHMENT (sampling 1000 overlapping samples)")
print("=" * 70)

# Get edge counts from narrow
narrow_edges = con.sql(f"""
    WITH sample_subset AS (
        SELECT j.pid, j.row_id
        FROM read_parquet('{narrow_path}') j
        JOIN read_parquet('{wide_path}') d ON j.pid = d.pid
        WHERE j.otype = 'MaterialSampleRecord' AND d.otype = 'MaterialSampleRecord'
        LIMIT 1000
    )
    SELECT
        s.pid,
        e.p as predicate,
        array_length(e.o) as target_count
    FROM sample_subset s
    JOIN read_parquet('{narrow_path}') e ON e.s = s.row_id AND e.otype = '_edge_'
""").df()

narrow_agg = narrow_edges.groupby(['pid', 'predicate'])['target_count'].sum().reset_index()
narrow_agg.columns = ['pid', 'predicate', 'june_count']

# Get relationship counts from wide
wide_rels = con.sql(f"""
    WITH sample_subset AS (
        SELECT j.pid
        FROM read_parquet('{narrow_path}') j
        JOIN read_parquet('{wide_path}') d ON j.pid = d.pid
        WHERE j.otype = 'MaterialSampleRecord' AND d.otype = 'MaterialSampleRecord'
        LIMIT 1000
    )
    SELECT w.pid, 'produced_by' as predicate, COALESCE(array_length(w.p__produced_by), 0) as count
    FROM read_parquet('{wide_path}') w
    WHERE w.pid IN (SELECT pid FROM sample_subset) AND w.otype = 'MaterialSampleRecord'
    UNION ALL
    SELECT w.pid, 'keywords', COALESCE(array_length(w.p__keywords), 0)
    FROM read_parquet('{wide_path}') w
    WHERE w.pid IN (SELECT pid FROM sample_subset) AND w.otype = 'MaterialSampleRecord'
    UNION ALL
    SELECT w.pid, 'registrant', COALESCE(array_length(w.p__registrant), 0)
    FROM read_parquet('{wide_path}') w
    WHERE w.pid IN (SELECT pid FROM sample_subset) AND w.otype = 'MaterialSampleRecord'
""").df()

wide_agg = wide_rels.groupby(['pid', 'predicate'])['count'].sum().reset_index()
wide_agg.columns = ['pid', 'predicate', 'dec_count']

# Merge and compare
enrichment_comparison = pd.merge(narrow_agg, wide_agg, on=['pid', 'predicate'], how='outer').fillna(0)
enrichment_comparison['change'] = enrichment_comparison['dec_count'] - enrichment_comparison['june_count']

# Summary by predicate
print("\nRelationship changes by predicate type:")
print("(Positive change = more relationships in December)")
print()

summary = enrichment_comparison.groupby('predicate').agg({
    'june_count': 'sum',
    'dec_count': 'sum',
    'change': 'sum'
}).reset_index()

for _, row in summary.iterrows():
    change_str = f"+{int(row['change'])}" if row['change'] > 0 else str(int(row['change']))
    print(f"  {row['predicate']:<20} {int(row['june_count']):>8} → {int(row['dec_count']):>8}  ({change_str})")

RELATIONSHIP ENRICHMENT (sampling 1000 overlapping samples)

Relationship changes by predicate type:
(Positive change = more relationships in December)

  has_context_category     1000 →        0  (-1000)
  has_material_category     1000 →        0  (-1000)
  has_sample_object_type     1000 →        0  (-1000)
  keywords                 2920 →     3673  (+753)
  produced_by              1000 →     1000  (0)
  registrant                657 →      524  (-133)
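
To keep predicates that only one side queried out of the totals, the merge can be limited to predicates present in both frames. A minimal sketch on toy data (the frames `narrow_toy` and `wide_toy` and their values are made up for illustration, standing in for `narrow_agg` and `wide_agg`):

```python
import pandas as pd

# Toy stand-ins for narrow_agg and wide_agg (illustrative values only)
narrow_toy = pd.DataFrame({
    "pid": ["a", "a", "b"],
    "predicate": ["keywords", "has_material_category", "keywords"],
    "june_count": [2, 1, 1],
})
wide_toy = pd.DataFrame({
    "pid": ["a", "b", "c"],
    "predicate": ["keywords", "keywords", "keywords"],
    "dec_count": [3, 1, 2],
})

# Only predicates measured on both sides can be compared fairly
shared = set(narrow_toy["predicate"]) & set(wide_toy["predicate"])
skipped = sorted(set(narrow_toy["predicate"]) ^ set(wide_toy["predicate"]))

merged = pd.merge(
    narrow_toy[narrow_toy["predicate"].isin(shared)],
    wide_toy[wide_toy["predicate"].isin(shared)],
    on=["pid", "predicate"],
    how="outer",
).fillna(0)  # a pid missing on one side genuinely has zero links there
merged["change"] = merged["dec_count"] - merged["june_count"]

print(merged)
print("Predicates not compared:", skipped)
```

With this filter, a sample that had no keyword edges in June still counts as a gain, while a predicate that was never read from the wide file is reported separately instead of as a loss.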


## Example: Keyword Enrichment

Some samples gained additional keyword tags between snapshots.

In [21]:
# Show a specific example of keyword enrichment
sample_pid = "ark:/28722/k2wq0f07p"

print(f"Example: Sample {sample_pid}")
print()

# June keywords
narrow_kw = con.sql(f"""
    SELECT t.label
    FROM read_parquet('{narrow_path}') s
    JOIN read_parquet('{narrow_path}') e ON e.s = s.row_id AND e.otype = '_edge_' AND e.p = 'keywords'
    -- o is an array of target row_ids; match every element, not just the first
    JOIN read_parquet('{narrow_path}') t ON list_contains(e.o, t.row_id)
    WHERE s.pid = '{sample_pid}'
""").df()

print("June 2025 keywords:")
for _, row in narrow_kw.iterrows():
    print(f"  • {row['label']}")

# December keywords
wide_kw = con.sql(f"""
    WITH sample AS (
        SELECT p__keywords
        FROM read_parquet('{wide_path}')
        WHERE pid = '{sample_pid}'
    )
    SELECT t.label
    FROM read_parquet('{wide_path}') t
    WHERE t.row_id IN (SELECT UNNEST(p__keywords) FROM sample)
""").df()

print("\nDecember 2025 keywords:")
for _, row in wide_kw.iterrows():
    print(f"  • {row['label']}")

print("\n💡 The sample was enriched with additional vocabulary terms!")

Example: Sample ark:/28722/k2wq0f07p

June 2025 keywords:
  • pottery

December 2025 keywords:
  • ceramic (material)
  • pottery

💡 The sample was enriched with additional vocabulary terms!
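
The two lists can also be compared as sets to name exactly which terms were gained or lost. A minimal sketch using the labels printed above (in the notebook, the inputs would presumably come from `narrow_kw['label']` and `wide_kw['label']`):

```python
# Labels copied from the printed output above
june_labels = {"pottery"}
dec_labels = {"ceramic (material)", "pottery"}

gained = sorted(dec_labels - june_labels)  # present only in December
lost = sorted(june_labels - dec_labels)    # present only in June

print("Gained:", gained)
print("Lost:  ", lost)
```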


## Summary: What Changed in 6 Months?

| Category | Change | Details |
|----------|--------|---------|
| **Samples** | +14,060 (+1.3%) | Mostly lithics & botanical specimens |
| **Concepts** | +151 | Lithic analysis vocabulary terms |
| **Agents** | +12 | New researchers/registrants |
| **Removed** | 5 | Test records (PC 0 a-e) |
| **Enriched** | Mixed | In the 1,000-sample subset: keyword links +753, registrant links -133 |

### Key Insight

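The sample, concept, agent, and keyword differences between our narrow and wide files are **real data changes**, not schema encoding differences. (The `has_*_category` rows that fall to 0 are the exception: those columns were not read from the wide file.) OpenContext is actively: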
1. Adding samples from ongoing excavations
2. Enriching existing records with better vocabulary tagging
3. Expanding the controlled vocabulary (IdentifiedConcepts)

**For schema validation**: We would need files generated from the same source data on the same date.
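
Given two same-date files, one possible check is to normalize both schemas to a set of `(source pid, predicate, target pid)` triples and compare the sets. The sketch below uses small in-memory stand-ins rather than the real Parquet files. The helper names `narrow_triples` and `wide_triples` are hypothetical, the rows are toy data, and it assumes that `row_id` values are not shared across files, so triples are keyed on `pid`.

```python
# Toy narrow rows: entities plus '_edge_' rows with s / p / o (o is a list of row_ids)
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event-1"},
    {"row_id": 3, "otype": "IdentifiedConcept", "pid": "concept-1"},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 1, "p": "keywords", "o": [3]},
]

# Toy wide rows: same entities, relationships held in p__* list columns
wide_rows = [
    {"row_id": 10, "otype": "MaterialSampleRecord", "pid": "sample-1",
     "p__produced_by": [11], "p__keywords": [12]},
    {"row_id": 11, "otype": "SamplingEvent", "pid": "event-1"},
    {"row_id": 12, "otype": "IdentifiedConcept", "pid": "concept-1"},
]

def narrow_triples(rows):
    """Expand every edge row into one triple per target in its o array."""
    pid_of = {r["row_id"]: r["pid"] for r in rows if r["otype"] != "_edge_"}
    return {
        (pid_of[r["s"]], r["p"], pid_of[target])
        for r in rows if r["otype"] == "_edge_"
        for target in r["o"]
    }

def wide_triples(rows):
    """Expand every non-empty p__* column into one triple per target."""
    pid_of = {r["row_id"]: r["pid"] for r in rows}
    triples = set()
    for r in rows:
        for column, targets in r.items():
            if column.startswith("p__") and targets:
                predicate = column[len("p__"):]
                triples.update((r["pid"], predicate, pid_of[t]) for t in targets)
    return triples

n, w = narrow_triples(narrow_rows), wide_triples(wide_rows)
print("Only in narrow:", n - w)
print("Only in wide:  ", w - n)
print("Schemas agree: ", n == w)
```

Any triple that shows up on only one side would then point at an encoding problem rather than a data change.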

---

# Exercises

Try these to solidify your understanding:

### Exercise 1: Count Samples by Material Type
Write queries for BOTH schemas to count samples grouped by material category.

### Exercise 2: Find Samples at a Specific Site
Query: Given a SamplingSite label, find all MaterialSampleRecords collected there.
Path: `Sample → Event → Site`

### Exercise 3: Add a Hypothetical New Predicate
Imagine iSamples adds `msr_analyzed_by` (Sample → Agent).
- How would you add this to narrow schema?
- How would you add this to wide schema?
- Which is easier?

In [22]:
# Your exercise code here



---

# Summary

| Aspect | Narrow | Wide |
|--------|--------|------|
| **Storage** | Edge rows (s/p/o) | p__* columns |
| **File size** | 691 MB | 275 MB (60% smaller) |
| **Row count** | 11.6M | 2.5M (79% fewer) |
| **Query joins** | 7 (through edges) | 3 (direct) |
| **Query speed** | Baseline | 2-3x faster |
| **Schema changes** | Easy (new rows) | Hard (new columns) |
| **Edge metadata** | Supported | Not supported |
| **Best for** | Writes, flexibility | Reads, analytics |

**For iSamples analytics (our use case): Wide schema is the clear winner.**

---

## References

- **PQG Repository**: https://github.com/isamplesorg/pqg
- **PR #6 (Typed Edges)**: https://github.com/isamplesorg/pqg/pull/6
- **Eric's Wide Schema Code**: https://github.com/ekansa/open-context-py/.../isamples_pqg.py
- **Previous Demo**: `pqg_demo.ipynb` (typed edges with narrow schema)
- **Exploration Notes**: `PQG_WIDE_EXPLORATION.md`