# OpenContext Parquet Analysis - Enhanced Version

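This notebook explores the OpenContext iSamples property graph parquet file with DuckDB: its schema, entity and edge types, example multi-hop queries, visualization and export helpers, and data quality checks.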

## Setup and Data Loading

In [1]:
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path
import urllib.request
import os

# Configuration
file_url = "https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet"
LOCAL_PATH = "/Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet"

In [2]:
# Check if local file exists, download if not
if not os.path.exists(LOCAL_PATH):
    print(f"Local file not found at {LOCAL_PATH}")
    
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)
    
    print(f"Downloading {file_url} to {LOCAL_PATH}...")
    urllib.request.urlretrieve(file_url, LOCAL_PATH)
    print("Download completed!")
else:
    print(f"Local file already exists at {LOCAL_PATH}")

# Use local path for parquet operations
parquet_path = LOCAL_PATH
print(f"Using parquet file: {parquet_path}")

Local file already exists at /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet
Using parquet file: /Users/raymondyee/Data/iSample/oc_isamples_pqg.parquet


## Understanding the Data Structure

This parquet file uses a **property graph model** where both entities (nodes) and relationships (edges) are stored in a single table. The `otype` field determines whether a row is:
- An entity (e.g., `MaterialSampleRecord`, `GeospatialCoordLocation`)
- A relationship (`_edge_`) connecting entities

Key insight: entity rows do not reference each other directly, so connecting samples to their locations, events, or other properties usually requires a JOIN through the edge rows.
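
To make the single-table layout concrete, here is a minimal pure-Python sketch of the same idea. The rows, the `row_id` values, and the `follow` helper are invented for illustration and are not taken from the parquet file; only the column names `row_id`, `otype`, `s`, `p`, and `o` mirror the real schema.

```python
# Toy rows mimicking the single-table layout: nodes and edges side by side.
# All values here are made up for illustration.
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "label": "Sample A"},
    {"row_id": 2, "otype": "SamplingEvent", "label": "Event A"},
    {"row_id": 3, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
]

# Index the non-edge rows by row_id so edge targets can be looked up.
nodes = {r["row_id"]: r for r in rows if r["otype"] != "_edge_"}
edges = [r for r in rows if r["otype"] == "_edge_"]


def follow(source_id, predicate):
    """Return the nodes reached from source_id via edges with the given predicate."""
    return [
        nodes[target]
        for e in edges
        if e["s"] == source_id and e["p"] == predicate
        for target in e["o"]
    ]


print([n["label"] for n in follow(1, "produced_by")])
```

The SQL queries later in the notebook do the same thing with self-joins on `oc_pqg`: one alias for the source entity, one for the edge, one for the target.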

In [3]:
# Create a DuckDB connection
conn = duckdb.connect()

# Create view for the parquet file
conn.execute(f"CREATE VIEW oc_pqg AS SELECT * FROM read_parquet('{parquet_path}');")

# Count records
result = conn.execute("SELECT COUNT(*) FROM oc_pqg;").fetchone()
print(f"Total records: {result[0]:,}")

Total records: 11,637,144


In [4]:
# Schema information
print("Schema information:")
schema_result = conn.execute("DESCRIBE oc_pqg;").fetchall()
for row in schema_result[:10]:  # Show first 10 columns
    print(f"{row[0]:25} | {row[1]}")
print(f"... and {len(schema_result) - 10} more columns")

Schema information:
row_id                    | INTEGER
pid                       | VARCHAR
tcreated                  | INTEGER
tmodified                 | INTEGER
otype                     | VARCHAR
s                         | INTEGER
p                         | VARCHAR
o                         | INTEGER[]
n                         | VARCHAR
altids                    | VARCHAR[]
... and 30 more columns


In [5]:
# Examine the distribution of entity types in detail
entity_stats = conn.execute("""
    SELECT
        otype,
        COUNT(*) as count,
        COUNT(DISTINCT pid) as unique_pids,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM oc_pqg
    GROUP BY otype
    ORDER BY count DESC
""").fetchdf()

print("Entity Type Distribution:")
print(entity_stats)

Entity Type Distribution:
                     otype    count  unique_pids  percentage
0                   _edge_  9201451      9201451       79.07
1            SamplingEvent  1096352      1096352        9.42
2     MaterialSampleRecord  1096352      1096352        9.42
3  GeospatialCoordLocation   198433       198433        1.71
4        IdentifiedConcept    25778        25778        0.22
5             SamplingSite    18213        18213        0.16
6                    Agent      565          565        0.00


### Graph Structure Fields

The fields `s`, `p`, `o`, `n` are used for edges:
- **s** (subject): row_id of the source entity
- **p** (predicate): the type of relationship
- **o** (object): array of target row_ids
- **n** (name): graph context (usually null)

Example: A sample (s) has_material_category (p) pointing to a concept (o).
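
Because `o` is an array, an edge can in principle point at more than one target. The queries in this notebook use `o[1]`, which in DuckDB is the first element, since DuckDB list indexing is 1-based. The sketch below contrasts taking only the first target with expanding all of them; the edge values are hypothetical, and whether any real edge carries more than one target is not shown by the outputs here.

```python
# A hypothetical edge whose object array holds two targets.
edge = {"s": 10, "p": "keywords", "o": [101, 102]}

# Equivalent of DuckDB's o[1]: the first element only (Python lists are 0-based).
first_target = edge["o"][0]

# Equivalent of DuckDB's unnest(o): one (subject, predicate, target) triple per element.
triples = [(edge["s"], edge["p"], target) for target in edge["o"]]

print(first_target)
print(triples)
```

If a predicate in the real data does carry multi-element arrays, a join on `o[1]` would skip the remaining targets, whereas `unnest(o)` (used in the orphan check later in this notebook) would follow all of them.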

In [6]:
# Explore edge predicates
edge_predicates = conn.execute("""
    SELECT
        p as predicate,
        COUNT(*) as usage_count,
        COUNT(DISTINCT s) as unique_subjects
    FROM oc_pqg
    WHERE otype = '_edge_'
    GROUP BY p
    ORDER BY usage_count DESC
    LIMIT 15
""").fetchdf()

print("Most common relationship types:")
print(edge_predicates)

Most common relationship types:
                predicate  usage_count  unique_subjects
0   has_material_category      1096352          1096352
1  has_sample_object_type      1096352          1096352
2    has_context_category      1096352          1096352
3           sampling_site      1096352          1096352
4             produced_by      1096352          1096352
5                keywords      1096297          1096297
6         sample_location      1096274          1096274
7          responsibility      1095272          1095272
8              registrant       413635           413635
9           site_location        18213            18213
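
The table above shows how often each predicate is used, but not which entity types it connects. A sketch of one way to find out is to join each edge to its subject and first object and group by their `otype` values. The function name `predicate_endpoints` is hypothetical; it expects the `conn` object and the `oc_pqg` view created earlier.

```python
def predicate_endpoints(conn):
    """Count edges per (predicate, subject otype, object otype) combination."""
    query = """
        SELECT
            e.p AS predicate,
            src.otype AS subject_type,
            dst.otype AS object_type,
            COUNT(*) AS edge_count
        FROM oc_pqg e
        JOIN oc_pqg src ON e.s = src.row_id
        JOIN oc_pqg dst ON e.o[1] = dst.row_id
        WHERE e.otype = '_edge_'
        GROUP BY e.p, src.otype, dst.otype
        ORDER BY e.p, edge_count DESC
    """
    return conn.execute(query).fetchdf()


# In the notebook, with the connection created above:
# print(predicate_endpoints(conn))
```

Knowing the subject and object type of each predicate tells you which hops a query needs, for example whether a location hangs off the sample itself or off an intermediate event.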


## Practical Query Examples

### Query 1: Find Samples with Geographic Coordinates

In [7]:
# Find all samples with geographic coordinates
samples_with_coords = conn.execute("""
    SELECT
        s.pid as sample_id,
        s.label as sample_label,
        s.description,
        g.latitude,
        g.longitude,
        g.place_name
    FROM oc_pqg s
    JOIN oc_pqg e ON s.row_id = e.s
    JOIN oc_pqg g ON e.o[1] = g.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND e.otype = '_edge_'
      AND e.p = 'sample_location'
      AND g.otype = 'GeospatialCoordLocation'
      AND g.latitude IS NOT NULL
    LIMIT 100
""").fetchdf()

print(f"Found {len(samples_with_coords)} samples with coordinates")
samples_with_coords.head()

Found 0 samples with coordinates


Empty DataFrame with columns: sample_id, sample_label, description, latitude, longitude, place_name
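
The query returns no rows even though the predicate table lists over a million `sample_location` edges, which points to the join path rather than to missing data. Query 2 below shows that `sampling_site` edges start at `SamplingEvent` rows; a plausible explanation is that `sample_location` edges do too, so a sample reaches its coordinates only through its `produced_by` event. The sketch below encodes that assumption, which is not confirmed by the outputs in this notebook. The function name is hypothetical.

```python
def samples_with_coords_via_event(conn, limit=100):
    """Join sample -> produced_by -> event -> sample_location -> location.

    Assumes sample_location edges start at SamplingEvent rows (unverified here).
    """
    query = f"""
        SELECT
            samp.pid AS sample_id,
            samp.label AS sample_label,
            g.latitude,
            g.longitude
        FROM oc_pqg samp
        JOIN oc_pqg e1 ON samp.row_id = e1.s
            AND e1.otype = '_edge_' AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id
            AND event.otype = 'SamplingEvent'
        JOIN oc_pqg e2 ON event.row_id = e2.s
            AND e2.otype = '_edge_' AND e2.p = 'sample_location'
        JOIN oc_pqg g ON e2.o[1] = g.row_id
            AND g.otype = 'GeospatialCoordLocation'
        WHERE samp.otype = 'MaterialSampleRecord'
          AND g.latitude IS NOT NULL
        LIMIT {int(limit)}
    """
    return conn.execute(query).fetchdf()


# In the notebook, with the connection created above:
# samples_with_coords_via_event(conn, limit=100).head()
```

If this also returns nothing, checking which `otype` the subjects of `sample_location` edges actually have would be the next step.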


### Query 2: Trace Samples Through Events to Sites

In [8]:
# Trace samples through events to sites
sample_site_hierarchy = conn.execute("""
    WITH sample_to_site AS (
        SELECT
            samp.pid as sample_id,
            samp.label as sample_label,
            event.pid as event_id,
            site.pid as site_id,
            site.label as site_name
        FROM oc_pqg samp
        JOIN oc_pqg e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
        JOIN oc_pqg event ON e1.o[1] = event.row_id AND event.otype = 'SamplingEvent'
        JOIN oc_pqg e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN oc_pqg site ON e2.o[1] = site.row_id AND site.otype = 'SamplingSite'
        WHERE samp.otype = 'MaterialSampleRecord'
    )
    SELECT
        site_name,
        COUNT(*) as sample_count
    FROM sample_to_site
    GROUP BY site_name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Top archaeological sites by sample count:")
print(sample_site_hierarchy)

Top archaeological sites by sample count:
                    site_name  sample_count
0                  Çatalhöyük        145900
1          Petra Great Temple        108846
2           Polis Chrysochous         52252
3                  Kenan Tepe         42295
4                    Ilıpınar         36951
5             Poggio Civitate         29985
6                    Čḯxwicən         29793
7              Heit el-Ghurab         28940
8                   Domuztepe         22394
9                       Emden         20238
10  Forcello Bagnolo San Vito         18573
11                Chogha Mish         16827
12                       Pi-1         16351
13           PKAP Survey Area         15446
14                     Malyan         15146
15                     Ulucak         10685
16                    OGSE-80         10477
17               Erbaba Höyük          8428
18                      Hazor          8356
19                 Köşk Höyük          7884


### Query 3: Explore Material Types and Categories

In [9]:
# Explore material types and categories
material_analysis = conn.execute("""
    SELECT
        c.label as material_type,
        c.name as category_name,
        COUNT(DISTINCT s.row_id) as sample_count
    FROM oc_pqg s
    JOIN oc_pqg e ON s.row_id = e.s
    JOIN oc_pqg c ON e.o[1] = c.row_id
    WHERE s.otype = 'MaterialSampleRecord'
      AND e.otype = '_edge_'
      AND e.p = 'has_material_category'
      AND c.otype = 'IdentifiedConcept'
    GROUP BY c.label, c.name
    ORDER BY sample_count DESC
    LIMIT 20
""").fetchdf()

print("Most common material types:")
print(material_analysis)

Most common material types:
                   material_type category_name  sample_count
0  Biogenic non-organic material          None        532675
1               Organic material          None        212584
2                       Material          None        158586
3   Other anthropogenic material          None        145316
4                           Rock          None         30186
5   Anthropogenic metal material          None         11659
6    Mixed soil sediment or rock          None          3207
7                        Mineral          None          2080
8         Natural Solid Material          None            58
9                       Sediment          None             1


## Query Performance Tips

1. **Filter by `otype` on every table alias** - Edges make up about 79% of the rows, so restricting each side of a join to the type it needs sharply reduces the rows being joined
2. **Use CTEs (WITH clauses)** for complex multi-hop queries
3. **Limit results during exploration** - Add `LIMIT 1000` while testing queries
4. **Create views for common patterns** - Reuse complex joins
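
As a sketch of tip 4, the helper below creates two per-type views so later queries do not have to repeat the `otype` filters. The view names `oc_edges` and `oc_samples` and the function name are hypothetical; the function expects the `conn` object and the `oc_pqg` view created earlier.

```python
def create_helper_views(conn):
    """Create per-type views so later queries can skip repeating otype filters."""
    statements = [
        "CREATE OR REPLACE VIEW oc_edges AS "
        "SELECT row_id, s, p, o FROM oc_pqg WHERE otype = '_edge_'",
        "CREATE OR REPLACE VIEW oc_samples AS "
        "SELECT * FROM oc_pqg WHERE otype = 'MaterialSampleRecord'",
    ]
    for statement in statements:
        conn.execute(statement)
    # Return the view names so callers can see what was created.
    return ["oc_edges", "oc_samples"]


# In the notebook, with the connection created above:
# create_helper_views(conn)
# conn.execute("SELECT COUNT(*) FROM oc_edges").fetchone()
```

Views in DuckDB are not materialized, so this mainly shortens the queries rather than caching results.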

### Memory Management
Approximate query times on the full 11.6M-row dataset (these vary with hardware and caching):
- Simple counts and filters: Fast (<1 second)
- Single-hop joins: Moderate (1-5 seconds)
- Multi-hop joins: Can be slow (5-30 seconds)
- Full table scans: Avoid without filters
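
Since these timings depend on the machine, it can help to measure them directly. Below is a small timing helper using only the standard library; the name `timed` is hypothetical, and the `sum` call is a stand-in workload so the sketch runs on its own.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result


# Stand-in workload; in the notebook one might instead pass a lambda that
# wraps a call such as conn.execute("SELECT COUNT(*) FROM oc_pqg").fetchone().
total = timed("sum of 0..999", sum, range(1000))
print(total)
```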

## Visualization Preparation

In [10]:
def get_sample_locations_for_viz(conn, limit=10000):
    """Extract sample locations optimized for visualization"""
    
    return conn.execute(f"""
        WITH located_samples AS (
            SELECT
                s.pid as sample_id,
                s.label as label,
                g.latitude,
                g.longitude,
                g.obfuscated,
                e.p as location_type
            FROM oc_pqg s
            JOIN oc_pqg e ON s.row_id = e.s
            JOIN oc_pqg g ON e.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND e.otype = '_edge_'
              AND e.p IN ('sample_location', 'sampling_site')
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
              AND g.longitude IS NOT NULL
        )
        SELECT
            sample_id,
            label,
            latitude,
            longitude,
            obfuscated,
            location_type
        FROM located_samples
        WHERE NOT obfuscated  -- Exclude obfuscated locations for public viz
        LIMIT {limit}
    """).fetchdf()

# Get visualization-ready data
viz_data = get_sample_locations_for_viz(conn, 5000)
print(f"Prepared {len(viz_data)} samples for visualization")
print(f"Coordinate bounds: Lat [{viz_data.latitude.min():.2f}, {viz_data.latitude.max():.2f}], "
      f"Lon [{viz_data.longitude.min():.2f}, {viz_data.longitude.max():.2f}]")

Prepared 0 samples for visualization
Coordinate bounds: Lat [nan, nan], Lon [nan, nan]
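
The function returned no rows, so `min()` and `max()` on the empty columns produce `nan`. Two things likely contribute: in Query 2 the `sampling_site` edges lead to `SamplingSite` rows, so the `g.otype = 'GeospatialCoordLocation'` filter cannot match them, and the `sample_location` join starts from the sample row just as Query 1 does. Independent of the cause, a guard for the empty case avoids printing meaningless bounds. A minimal sketch, with the hypothetical helper name `coordinate_bounds`:

```python
def coordinate_bounds(latitudes, longitudes):
    """Return ((lat_min, lat_max), (lon_min, lon_max)), or None when there are no points."""
    lats = list(latitudes)
    lons = list(longitudes)
    if not lats or not lons:
        return None
    return (min(lats), max(lats)), (min(lons), max(lons))


bounds = coordinate_bounds([], [])
if bounds is None:
    print("No located samples returned; check the join path before plotting")
else:
    (lat_min, lat_max), (lon_min, lon_max) = bounds
    print(f"Lat [{lat_min:.2f}, {lat_max:.2f}], Lon [{lon_min:.2f}, {lon_max:.2f}]")
```

In the notebook the inputs would be `viz_data.latitude` and `viz_data.longitude`.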


## Data Export Options

In [11]:
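def export_site_subgraph(conn, site_name_pattern, output_prefix):
    """Export a site row, its outgoing edges, and the entities those edges point to.

    Only edges whose subject is the site are followed (one hop, first target only),
    so samples and events that point to the site are not included.
    """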
    
    # Find the site
    site_info = conn.execute("""
        SELECT row_id, pid, label
        FROM oc_pqg
        WHERE otype = 'SamplingSite'
        AND label LIKE ?
        LIMIT 1
    """, [f'%{site_name_pattern}%']).fetchdf()
    
    if site_info.empty:
        print(f"No site found matching '{site_name_pattern}'")
        return None
    
    # Convert the NumPy integer to a plain int before using it as a query parameter
    site_row_id = int(site_info.iloc[0]['row_id'])
    print(f"Found site: {site_info.iloc[0]['label']}")
    
    # Get all related entities (simplified version - not recursive)
    related_data = conn.execute("""
        WITH site_related AS (
            -- Get the site itself
            SELECT * FROM oc_pqg WHERE row_id = ?
            
            UNION ALL
            
            -- Get edges from the site
            SELECT * FROM oc_pqg e
            WHERE e.otype = '_edge_' AND e.s = ?
            
            UNION ALL
            
            -- Get entities connected to the site
            SELECT n.* FROM oc_pqg e
            JOIN oc_pqg n ON n.row_id = e.o[1]
            WHERE e.otype = '_edge_' AND e.s = ?
        )
        SELECT * FROM site_related
    """, [site_row_id, site_row_id, site_row_id]).fetchdf()
    
    # Save to parquet
    output_file = f"{output_prefix}_{site_info.iloc[0]['pid']}.parquet"
    related_data.to_parquet(output_file)
    print(f"Exported {len(related_data)} rows to {output_file}")
    
    return related_data

# Example usage (commented out to avoid creating files)
# pompeii_data = export_site_subgraph(conn, "Pompeii", "pompeii_subgraph")

## Data Quality Analysis

In [12]:
# Check for location data quality
location_quality = conn.execute("""
    SELECT
        CASE 
            WHEN obfuscated THEN 'Obfuscated'
            ELSE 'Precise'
        END as location_type,
        COUNT(*) as count,
        AVG(CASE WHEN latitude IS NOT NULL THEN 1.0 ELSE 0.0 END) * 100 as pct_with_coords
    FROM oc_pqg
    WHERE otype = 'GeospatialCoordLocation'
    GROUP BY location_type
""").fetchdf()

print("Location Data Quality:")
print(location_quality)

Location Data Quality:
  location_type   count  pct_with_coords
0    Obfuscated    1926       100.000000
1       Precise  196507        99.999491


In [13]:
# Check for orphaned nodes (nodes not connected by any edge)
orphan_check = conn.execute("""
    WITH connected_nodes AS (
        SELECT DISTINCT s as row_id FROM oc_pqg WHERE otype = '_edge_'
        UNION
        SELECT DISTINCT unnest(o) as row_id FROM oc_pqg WHERE otype = '_edge_'
    )
    SELECT
        n.otype,
        COUNT(*) as orphan_count
    FROM oc_pqg n
    LEFT JOIN connected_nodes c ON n.row_id = c.row_id
    WHERE n.otype != '_edge_' AND c.row_id IS NULL
    GROUP BY n.otype
""").fetchdf()

print("\nOrphaned Nodes by Type:")
print(orphan_check if not orphan_check.empty else "No orphaned nodes found!")


Orphaned Nodes by Type:
               otype  orphan_count
0              Agent             1
1  IdentifiedConcept         16961
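
The orphan check treats a node as connected if it appears as the subject of any edge or anywhere in an edge's object array. The same logic on a toy graph, with made-up ids, looks like this in plain Python:

```python
# Toy graph: node 3 is never a subject or an object of any edge.
node_ids = {1, 2, 3}
edges = [{"s": 1, "o": [2]}]

connected = set()
for e in edges:
    connected.add(e["s"])      # edge sources
    connected.update(e["o"])   # every target in the object array

orphans = sorted(node_ids - connected)
print(orphans)
```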


## Summary Statistics

In [14]:
# Generate comprehensive summary
summary = conn.execute("""
    WITH stats AS (
        SELECT
            COUNT(*) as total_rows,
            COUNT(DISTINCT pid) as unique_pids,
            COUNT(CASE WHEN otype = '_edge_' THEN 1 END) as edge_count,
            COUNT(CASE WHEN otype != '_edge_' THEN 1 END) as node_count,
            COUNT(DISTINCT CASE WHEN otype != '_edge_' THEN otype END) as entity_types,
            COUNT(DISTINCT p) as relationship_types
        FROM oc_pqg
    )
    SELECT * FROM stats
""").fetchdf()

print("Dataset Summary:")
for col in summary.columns:
    print(f"{col}: {summary[col].iloc[0]:,}")

Dataset Summary:
total_rows: 11,637,144
unique_pids: 11,637,144
edge_count: 9,201,451
node_count: 2,435,693
entity_types: 6
relationship_types: 10
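
As a quick consistency check, the per-type counts from the entity distribution should add up to the node count here, and nodes plus edges should equal the total. The sketch below hard-codes the numbers printed in this notebook's outputs:

```python
# Counts copied from the outputs shown in this notebook.
node_counts = {
    "SamplingEvent": 1_096_352,
    "MaterialSampleRecord": 1_096_352,
    "GeospatialCoordLocation": 198_433,
    "IdentifiedConcept": 25_778,
    "SamplingSite": 18_213,
    "Agent": 565,
}
edge_count = 9_201_451
total_rows = 11_637_144

node_count = sum(node_counts.values())
assert node_count == 2_435_693
assert node_count + edge_count == total_rows
print(f"nodes={node_count:,} edges={edge_count:,} total={total_rows:,}")
```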


In [15]:
# Close the connection
conn.close()
print("\nAnalysis complete!")


Analysis complete!
