# RDFSolve: PDB RDF - no counts

This notebook demonstrates faster schema discovery:
1. Setting up an endpoint and graph
2. Generating fast VoID descriptions using CONSTRUCT queries **without** COUNT aggregations
3. Extracting schema from the VoID description
4. Analyzing the results as DataFrame and JSON

In [None]:
import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint
import warnings
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the PDB dataset with its SPARQL endpoint and metadata.

In [None]:
# PDB configuration
endpoint_url = "https://rdfportal.org/pdb/sparql"
dataset_name = "pdb"
void_iri = "http://rdfportal.org/dataset/pdbj"
graph_uri = "http://rdfportal.org/dataset/pdbj"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")
print("Mode: Fast (without COUNT aggregations)")

Dataset: pdb
Endpoint: https://rdfportal.org/pdb/sparql
VoID IRI: http://rdfportal.org/dataset/pdbj
Graph URI: http://rdfportal.org/dataset/pdbj
Mode: Fast (without COUNT aggregations)


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [None]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    
    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://rdfportal.org/pdb/sparql
Dataset: pdb


## Step 3: Generate Fast VoID Description

Generate VoID **without** COUNT aggregations for fast discovery. This is much faster but doesn't provide count statistics.

Three CONSTRUCT queries retrieve the class, property, and datatype partitions. Each one selects distinct values (SELECT DISTINCT) rather than aggregating with COUNT.

In [4]:
try:
    # Generate fast VoID without count aggregations
    fast_void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void.ttl",
        counts=False,  # Fast discovery without counts
    )
    
    print(f"Graph contains {len(fast_void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating VoID from endpoint: https://rdfportal.org/pdb/sparql
Using graph URI: http://rdfportal.org/dataset/pdbj
Fast mode: Skipping COUNT aggregations
🚀 Starting VoID extraction from SPARQL endpoint
📡 Endpoint: https://rdfportal.org/pdb/sparql
🎯 Graph: http://rdfportal.org/dataset/pdbj
🔧 Mode: Traditional VoID (SPARQL processing)
🔄 Starting query: class_partitions


✅ Finished query: class_partitions (took 2.29s)
📊 Parsing results for class_partitions...
🔄 Starting query: property_partitions


KeyboardInterrupt: 
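
As a rough illustration of what a count-free partition query could look like, the sketch below builds a CONSTRUCT query whose subquery uses SELECT DISTINCT instead of COUNT. The helper `class_partition_query` is hypothetical and the query text is an assumption, not the query rdfsolve actually sends; `void:classPartition` and `void:class` are standard VoID terms.

```python
def class_partition_query(graph_uri: str) -> str:
    """Return a CONSTRUCT query listing the classes used in one named graph.

    Hypothetical helper: the inner SELECT DISTINCT returns each class once,
    so no per-class instance count is computed.
    """
    return f"""
PREFIX void: <http://rdfs.org/ns/void#>
CONSTRUCT {{
    <{graph_uri}> void:classPartition [ void:class ?class ] .
}}
WHERE {{
    {{
        SELECT DISTINCT ?class
        WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class . }} }}
    }}
}}
"""


query = class_partition_query("http://rdfportal.org/dataset/pdbj")
print(query)
```

The trade-off matches the one described above: the result says which classes exist, not how many instances each has.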

## Quick Dataset Structure Check

Before running raw extraction, let's check if the dataset has the required structure (rdf:type relations).

In [6]:
# Quick check of dataset structure for raw extraction compatibility
try:
    from SPARQLWrapper import SPARQLWrapper, JSON
    
    sparql = SPARQLWrapper(endpoint_url)
    
    # Test 1: Check if we have rdf:type relations
    type_check_query = f"""
    SELECT (COUNT(*) as ?count)
    WHERE {{
        GRAPH <{graph_uri}> {{
            ?s a ?type .
        }}
    }}
    LIMIT 1
    """
    
    sparql.setQuery(type_check_query)
    sparql.setReturnFormat(JSON)
    
    print("Checking dataset structure...")
    results = sparql.query().convert()
    type_count = int(results["results"]["bindings"][0]["count"]["value"])
    
    print(f"Dataset has {type_count:,} rdf:type statements")
    
    if type_count > 0:
        print("✓ Dataset is compatible with raw extraction mode")
    else:
        print("⚠ Dataset may not be compatible with raw extraction")
        print("  (requires subjects to have rdf:type relations)")
    
    # Test 2: Sample a few triples to see structure
    sample_query = f"""
    SELECT ?s ?p ?o
    WHERE {{
        GRAPH <{graph_uri}> {{
            ?s ?p ?o .
        }}
    }}
    LIMIT 5
    """
    
    sparql.setQuery(sample_query)
    print(f"\nSample triples from {graph_uri}:")
    
    sample_results = sparql.query().convert()
    for i, binding in enumerate(sample_results["results"]["bindings"][:3]):
        s = binding["s"]["value"]
        p = binding["p"]["value"]
        o = binding["o"]["value"]
        print(f"  {i+1}. {s[:50]}... -> {p.split('/')[-1]} -> {o[:50]}...")
    
except Exception as e:
    print(f"Structure check failed: {e}")
    print("Proceeding anyway...")

Checking dataset structure...
Dataset has 1,165,797,810 rdf:type statements
✓ Dataset is compatible with raw extraction mode

Sample triples from http://rdfportal.org/dataset/pdbj:
  1. http://rdf.wwpdb.org/pdb/100D/atom_sites/100D... -> 22-rdf-syntax-ns#type -> http://rdf.wwpdb.org/schema/pdbx-v50.owl#atom_site...
  2. http://rdf.wwpdb.org/pdb/300D/atom_sites/300D... -> 22-rdf-syntax-ns#type -> http://rdf.wwpdb.org/schema/pdbx-v50.owl#atom_site...
  3. http://rdf.wwpdb.org/pdb/400D/atom_sites/400D... -> 22-rdf-syntax-ns#type -> http://rdf.wwpdb.org/schema/pdbx-v50.owl#atom_site...


In [None]:
try:
    print("Using NEW raw extraction mode...")
    
    # Use VoidParser.from_sparql with raw_extraction=True
    print("Note: Raw extraction requires subjects to have rdf:type relations")
    print("Attempting raw extraction...")
    
    raw_parser = VoidParser.from_sparql(
        endpoint_url=endpoint_url,
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_raw_void.ttl",
        raw_extraction=True,      # NEW: Enable raw extraction
        preserve_values=True      # NEW: Preserve actual values (not just Resource/Literal)
    )
    
    print("Raw extraction VoID completed!")
    
except Exception as e:
    print(f"Raw extraction error: {e}")
    print("\nThis might happen if:")
    print("1. The SPARQL endpoint doesn't support some query features")
    print("2. The dataset doesn't have rdf:type relations for subjects")
    print("3. There are SPARQL syntax compatibility issues")
    print("\nTrying fallback to traditional mode...")
    
    try:
        # Fallback to traditional mode
        raw_parser = VoidParser.from_sparql(
            endpoint_url=endpoint_url,
            graph_uri=graph_uri,
            output_file=f"{dataset_name}_traditional_void.ttl",
            raw_extraction=False     # Traditional mode
        )
        print("Fallback to traditional mode succeeded!")
        
    except Exception as fallback_error:
        print(f"Fallback also failed: {fallback_error}")
        raw_parser = None

Using NEW raw extraction mode...
🚀 Starting VoID extraction from SPARQL endpoint
📡 Endpoint: https://rdfportal.org/pdb/sparql
🎯 Graph: http://rdfportal.org/dataset/pdbj
⚡ Mode: Raw extraction (Python post-processing)
🔄 Starting query: class_partitions
✅ Finished query: class_partitions (took 3.02s)
📊 Parsing results for class_partitions...
⚡ Raw extraction: processing will be done in Python
🔄 Starting query: raw_triples
❌ Query raw_triples failed after 1.38s: QueryBadFormed: A bad request has been sent to the endpoint: probably the SPARQL query is badly formed.

Response:
b"Virtuoso 37000 Error SP030: SPARQL compiler, line 37: syntax error at 'BIND' before '('\n\nSPARQL query:\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX void-ext: <http://ldf.fi/void-ext#>\nC

Traceback (most recent call last):
  File "/home/javi/rdfsolve-1/.venv/lib/python3.13/site-packages/SPARQLWrapper/Wrapper.py", line 924, in _query
    response = urlopener(request, timeout=self.timeout)
  File "/home/javi/.local/share/uv/python/cpython-3.13.2-linux-x86_64-gnu/lib/python3.13/urllib/request.py", line 189, in urlopen
    return opener.open(url, data, timeout)
           ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/javi/.local/share/uv/python/cpython-3.13.2-linux-x86_64-gnu/lib/python3.13/urllib/request.py", line 495, in open
    response = meth(req, response)
  File "/home/javi/.local/share/uv/python/cpython-3.13.2-linux-x86_64-gnu/lib/python3.13/urllib/request.py", line 604, in http_response
    response = self.parent.error(
        'http', request, response, code, msg, hdrs)
  File "/home/javi/.local/share/uv/python/cpython-3.13.2-linux-x86_64-gnu/lib/python3.13/urllib/request.py", line 533, in error
    return self._call_chain(*args)
           ~~~~~~~~~~~~~~~~^^^^^^^
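
The nested try/except in the cell above amounts to "try each extraction mode in order and keep the first that works". A minimal sketch of that pattern, with stand-in functions in place of the real extraction calls (`first_successful`, `raw_mode`, and `traditional_mode` are all hypothetical names):

```python
from typing import Callable, List, Optional, Tuple


def first_successful(
    strategies: List[Tuple[str, Callable[[], object]]],
) -> Tuple[Optional[str], Optional[object], List[str]]:
    """Run strategies in order; return (name, result, errors) for the first success."""
    errors: List[str] = []
    for name, run in strategies:
        try:
            return name, run(), errors
        except Exception as exc:  # broad on purpose: any failure moves to the next strategy
            errors.append(f"{name}: {exc}")
    return None, None, errors


def raw_mode():
    # Stand-in for a raw extraction call that the endpoint rejects.
    raise ValueError("syntax error at 'BIND'")


def traditional_mode():
    # Stand-in for the traditional extraction call.
    return "traditional result"


name, result, errors = first_successful(
    [("raw", raw_mode), ("traditional", traditional_mode)]
)
print(name, result, errors)
```

Collecting the error messages rather than printing them inline keeps the reason for each fallback available for later inspection.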

## Alternative: Simplified Raw Extraction

If the standard raw extraction fails, we can try a simplified approach that doesn't require all subjects to have rdf:type relations.

In [7]:
# If the standard raw extraction failed, try a simplified approach
# This creates a custom simplified raw extraction query

try:
    print("Trying simplified raw extraction approach...")
    
    # Create a simplified query manually
    from SPARQLWrapper import SPARQLWrapper, TURTLE
    from rdflib import Graph
    
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setReturnFormat(TURTLE)
    
    # Simple query to get some triples with basic type info
    simple_query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX void-ext: <http://ldf.fi/void-ext#>
    CONSTRUCT {{
        ?triple void-ext:subject ?subject ;
                void-ext:predicate ?predicate ;
                void-ext:object ?object ;
                void-ext:subjectType "Resource" ;
                void-ext:objectType ?object_type .
    }}
    WHERE {{
        GRAPH <{graph_uri}> {{
            ?subject ?predicate ?object .
            
            BIND(IF(isLiteral(?object), "Literal", "Resource") AS ?object_type)
            
            # Generate unique triple identifier  
            BIND(IRI(CONCAT('{graph_uri}/void/triple_',
                           MD5(CONCAT(STR(?subject), STR(?predicate), STR(?object))))) AS ?triple)
        }}
    }}
    LIMIT 1000
    """
    
    print("Executing simplified extraction query...")
    sparql.setQuery(simple_query)
    
    results = sparql.query().convert()
    
    # Create VoID graph and parser
    if results:
        void_graph = Graph()
        if isinstance(results, bytes):
            void_graph.parse(data=results.decode('utf-8'), format="turtle")
        else:
            void_graph.parse(data=str(results), format="turtle")
        
        print(f"Simplified extraction successful! Got {len(void_graph)} triples")
        
        # Create parser from the simplified data
        simple_parser = VoidParser(void_graph)
        simple_parser._process_raw_triples(preserve_values=True)
        
        print("Simplified raw extraction completed!")
        
    else:
        print("No results from simplified query")
        simple_parser = None
        
except Exception as e:
    print(f"Simplified extraction also failed: {e}")
    simple_parser = None

Trying simplified raw extraction approach...
Executing simplified extraction query...
Simplified extraction successful! Got 5000 triples
Post-processed 1000 raw triples into 1 schema triples
Found 1 unique properties
Preserved values: True
Simplified raw extraction completed!
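
The output above reports 1000 raw triples collapsing into a single schema triple. One way this can happen is that every sampled triple shares the same predicate and object, so deduplication leaves one pattern. The sketch below illustrates that on made-up data; `schema_patterns` is a hypothetical helper and is not the implementation of `_process_raw_triples`.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up sample: three subjects, all typed with the same class.
raw_triples = [
    ("http://example.org/s1", RDF_TYPE, "http://example.org/Thing"),
    ("http://example.org/s2", RDF_TYPE, "http://example.org/Thing"),
    ("http://example.org/s3", RDF_TYPE, "http://example.org/Thing"),
]


def schema_patterns(triples):
    """Count how many raw triples fall under each (predicate, object) pattern."""
    patterns = Counter()
    for _subject, predicate, obj in triples:
        patterns[(predicate, obj)] += 1
    return patterns


patterns = schema_patterns(raw_triples)
print(f"{len(raw_triples)} raw triples -> {len(patterns)} schema pattern(s)")
```

A `LIMIT 1000` sample taken without ordering can easily consist of one repeated pattern, so a result like this says little about the full dataset.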


## Step 3c: Extract Schema from Raw Extraction Mode

The raw extraction preserves actual values instead of just classifying them as "Resource" or "Literal".

In [None]:
try:
    print("Extracting schema from raw extraction...")
    
    # Get schema as DataFrame (with preserved values!)
    raw_schema_df = raw_parser.to_schema(filter_void_nodes=True)
    
    print("Raw extraction schema completed")
    print(f"Total schema triples: {len(raw_schema_df)}")
    print(f"Unique classes: {raw_schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {raw_schema_df['property'].nunique()}")
    
    # Compare value preservation
    print(f"\nValue preservation comparison:")
    print(f"Traditional mode object classes: Resource, Literal")
    print(f"Raw mode unique objects: {raw_schema_df['object_class'].nunique()}")
    
    # Show sample of preserved values
    print(f"\nSample preserved object values:")
    unique_objects = raw_schema_df['object_class'].unique()
    for obj in unique_objects[:10]:
        if obj not in ['Resource', 'Literal']:
            print(f"  - {obj}")
    
except Exception as e:
    print(f"Raw schema extraction failed: {e}")
    import traceback
    traceback.print_exc()

## Step 3d: Compare Traditional vs Raw Extraction

See the difference in data preservation between the two modes. This cell needs `fast_schema_df`, which is created in Step 4, so run Step 4 first.

In [None]:
# Compare the two approaches if both were successful
if 'fast_schema_df' in globals() and 'raw_schema_df' in globals():
    print("COMPARISON: Traditional vs Raw Extraction")
    print("=" * 50)
    
    print(f"Traditional mode:")
    print(f"  - Schema triples: {len(fast_schema_df):,}")
    print(f"  - Unique objects: {fast_schema_df['object_class'].nunique()}")
    print(f"  - Object types: {set(fast_schema_df['object_class'].unique())}")
    
    print(f"\nRaw extraction mode:")
    print(f"  - Schema triples: {len(raw_schema_df):,}")  
    print(f"  - Unique objects: {raw_schema_df['object_class'].nunique()}")
    print(f"  - Sample objects: {list(raw_schema_df['object_class'].unique())[:5]}...")
    
    # Show side-by-side comparison for same property
    print(f"\nSide-by-side comparison (same property):")
    if len(fast_schema_df) > 0 and len(raw_schema_df) > 0:
        # Find a common property
        common_props = set(fast_schema_df['property'].unique()) & set(raw_schema_df['property'].unique())
        if common_props:
            prop = list(common_props)[0]
            print(f"Property: {prop}")
            
            fast_objects = set(fast_schema_df[fast_schema_df['property'] == prop]['object_class'].unique())
            raw_objects = set(raw_schema_df[raw_schema_df['property'] == prop]['object_class'].unique())
            
            print(f"  Traditional: {fast_objects}")
            print(f"  Raw mode:    {list(raw_objects)[:3]}...")
            
else:
    print("Run both traditional and raw extraction modes to compare")

# Display sample from raw extraction
if 'raw_schema_df' in globals():
    print(f"\nSample from raw extraction (preserves actual values):")
    display(raw_schema_df[~raw_schema_df.object_class.isin(["Class", "Resource", "Literal"])].head(10))

## Alternative: Raw Extraction with Traditional Classification

You can also use raw extraction for performance benefits while still classifying values as Resource/Literal for traditional VoID compatibility.

In [None]:
# Alternative: Raw extraction with traditional VoID classification
# This gives you performance benefits while maintaining VoID compatibility

try:
    print("Raw extraction with traditional classification...")
    
    # Raw extraction but with preserve_values=False for traditional classification
    classified_parser = VoidParser.from_sparql(
        endpoint_url=endpoint_url,
        graph_uri=graph_uri,
        raw_extraction=True,      # Fast extraction
        preserve_values=False     # Traditional Resource/Literal classification
    )
    
    classified_schema = classified_parser.to_schema(filter_void_nodes=True)
    
    print(f"Fast extraction + traditional classification:")
    print(f"  - Schema triples: {len(classified_schema):,}")
    print(f"  - Object types: {set(classified_schema['object_class'].unique())}")
    print(f"  - Performance: Fast (raw extraction)")
    print(f"  - Compatibility: Full VoID compatibility")
    
except Exception as e:
    print(f"Classified raw extraction error: {e}")
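
To make the Resource/Literal classification concrete, the sketch below maps terms from a SPARQL JSON result row to those two labels using each term's `type` field. The helper `classify_term` and the sample row are assumptions for illustration; how rdfsolve classifies values internally is not shown in this notebook.

```python
def classify_term(term: dict) -> str:
    """Map a SPARQL JSON result term to 'Resource' or 'Literal'."""
    kind = term.get("type")
    if kind in ("uri", "bnode"):
        return "Resource"
    # "typed-literal" is a non-standard type that some endpoints still emit.
    if kind in ("literal", "typed-literal"):
        return "Literal"
    raise ValueError(f"unknown RDF term type: {kind!r}")


# Made-up result row with one IRI-valued and one literal-valued variable.
row = {
    "o1": {"type": "uri", "value": "http://example.org/Thing"},
    "o2": {"type": "literal", "value": "some label"},
}

labels = {var: classify_term(term) for var, term in row.items()}
print(labels)
```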

## Step 4: Extract Schema from Fast VoID

Extract schema structure from the fast-generated VoID description.

In [None]:
try:
    print("Extracting schema from fast VoID...")
    fast_parser = VoidParser(fast_void_graph)
    
    # Get schema as DataFrame
    fast_schema_df = fast_parser.to_schema(filter_void_nodes=True)
    
    print("Fast schema extraction completed")
    print(f"Total schema triples: {len(fast_schema_df)}")
    print(f"Unique classes: {fast_schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {fast_schema_df['property'].nunique()}")
    
except Exception as e:
    print(f"Fast schema extraction failed: {e}")

Extracting schema from fast VoID...
Fast schema extraction failed: name 'fast_void_graph' is not defined


## Step 5: Schema Visualization

Display a sample of the extracted schema from fast discovery.

In [None]:
# Show sample of the fast schema (excluding generic classes)
display(fast_schema_df[~fast_schema_df.object_class.isin(["Class", "Resource"])].head(10))

NameError: name 'fast_schema_df' is not defined

## Step 6: Analyze AOP Wiki RDF Key Event

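Examine the `KeyEvent` class as an example of detailed analysis. `KeyEvent` is an AOP-Wiki class, and this cell and its saved output appear to be carried over from an AOP-Wiki notebook. For PDB, substitute a class listed in `fast_schema_df["subject_class"]`.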

In [None]:
try:
    # Focus on the KeyEvent class
    di_schema = fast_schema_df[fast_schema_df["subject_class"] == "KeyEvent"]

    print(f"DirectedInteraction Analysis (Complete Mode):")
    print(f"Properties found: {len(di_schema)}")

    if len(di_schema) > 0:
        print(f"\nKeyEvent Properties:")
        for _, row in di_schema.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")

        # Look for database cross-references (bdb*)
        bdb_props = di_schema[di_schema["property"].str.contains("bdb", na=False)]
        if len(bdb_props) > 0:
            print(f"\nDatabase Cross-References (bdb*):")
            print(f"Found {len(bdb_props)} bdb properties")
            for _, row in bdb_props.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in KeyEvent")
    else:
        print("\nKeyEvent class not found in schema")
        print("Available classes:")
        for cls in fast_schema_df["subject_class"].unique()[:10]:
            print(f"  - {cls}")

except Exception as e:
    print(f"KeyEvent analysis failed: {e}")

DirectedInteraction Analysis (Complete Mode):
Properties found: 35

KeyEvent Properties:
  PATO_0001241         -> GO_0008150
  PATO_0001241         -> Literal
  PATO_0001241         -> PATO_0001241
  PATO_0001241         -> OrganContext
  PATO_0001241         -> CellTypeContext
  label                -> Literal
  identifier           -> KeyEvent
  source               -> Literal
  PATO_0000047         -> Literal
  title                -> Literal
  alternative          -> Literal
  isPartOf             -> AdverseOutcomePathway
  CellTypeContext      -> PATO_0001241
  CellTypeContext      -> CellTypeContext
  CellTypeContext      -> OrganContext

No bdb* properties found in KeyEvent


## Step 7: Export Fast Discovery Results

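Export the fast discovery schema as JSON and CSV files named after `dataset_name` (here `pdb_schema.json` and `pdb_schema.csv`). The saved output below still shows `aopwikirdf_complete_*` file names, apparently from an earlier AOP-Wiki run.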

In [None]:
try:
    # Export as JSON
    print("Generating JSON schema (fast discovery)...")
    fast_schema_json = fast_parser.to_json(filter_void_nodes=True)
    
    print("Fast JSON export completed")
    print(f"Total triples: {fast_schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(fast_schema_json['metadata']['classes'])}")
    print(f"Properties: {len(fast_schema_json['metadata']['properties'])}")
    print(f"Object types: {len(fast_schema_json['metadata']['objects'])}")
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(fast_schema_json, f, indent=2)
    print(f"\nFast JSON schema saved to: {dataset_name}_schema.json")
    
    # Export as CSV
    fast_schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)
    print(f"Fast CSV schema saved to: {dataset_name}_schema.csv")
    
except Exception as e:
    print(f"Fast export failed: {e}")

Generating JSON schema (fast discovery)...
Fast JSON export completed
Total triples: 240
Classes: 26
Properties: 65
Object types: 28

Fast JSON schema saved to: aopwikirdf_complete_schema.json
Fast CSV schema saved to: aopwikirdf_complete_schema.csv
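
The export cell reads `total_triples`, `classes`, `properties`, and `objects` from a `metadata` block. The sketch below builds a dictionary with that shape from a few made-up schema rows. The `metadata` keys and column names come from the cells above; the top-level `triples` key, the helper `schema_to_json`, and the sample rows are assumptions and may differ from what `to_json` actually returns.

```python
import json

# Made-up schema rows using the column names seen in the DataFrame above.
rows = [
    {"subject_class": "ClassA", "property": "propX", "object_class": "ClassB"},
    {"subject_class": "ClassA", "property": "propY", "object_class": "Literal"},
    {"subject_class": "ClassB", "property": "propX", "object_class": "Literal"},
]


def schema_to_json(rows):
    """Build a JSON-serializable summary of schema rows (hypothetical layout)."""
    return {
        "triples": [
            [r["subject_class"], r["property"], r["object_class"]] for r in rows
        ],
        "metadata": {
            "total_triples": len(rows),
            "classes": sorted({r["subject_class"] for r in rows}),
            "properties": sorted({r["property"] for r in rows}),
            "objects": sorted({r["object_class"] for r in rows}),
        },
    }


schema_json = schema_to_json(rows)
print(json.dumps(schema_json["metadata"], indent=2))
```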


## Optional: Sample Limiting for Very Large Datasets

For extremely large datasets, you can add a sample limit for even faster discovery:

In [None]:
# Example: Ultra-fast discovery with sample limit
# Uncomment to try with a sample of 1000 triples for very fast exploration

# try:
#     print("Ultra-fast discovery with sampling...")
#     
#     sampled_void_graph = solver.void_generator(
#         graph_uri=graph_uri,
#         output_file=f"{dataset_name}_sampled_void.ttl",
#         counts=False,
#         sample_limit=1000  # Only sample 1000 triples
#     )
#     
#     print(f"Sampled VoID contains {len(sampled_void_graph)} triples")
#     
# except Exception as e:
#     print(f"Sampled mode error: {e}")

## JSON-LD Export

Export the VoID description and schema as JSON-LD with automatic prefix extraction.

In [None]:
# Export PDB (fast) data as JSON-LD (automatic prefix extraction)
print("Exporting PDB (fast) VoID and schema as JSON-LD...")

# Export complete VoID with automatic context
void_jsonld_file = f"{dataset_name}_fast_void.jsonld"
void_jsonld = solver.export_void_jsonld(
    output_file=void_jsonld_file,
    indent=2
)

# Export schema only with automatic context
schema_jsonld_file = f"{dataset_name}_fast_schema.jsonld"
schema_jsonld = solver.export_schema_jsonld(
    output_file=schema_jsonld_file,
    indent=2,
    filter_void_nodes=True
)

print("Exported files:")
print(f"  - {void_jsonld_file} ({len(void_jsonld)} chars)")
print(f"  - {schema_jsonld_file} ({len(schema_jsonld)} chars)")

# Show automatically extracted prefixes
prefixes = solver._extract_prefixes_from_void()
print(f"\nAuto-extracted prefixes: {', '.join(sorted(prefixes.keys()))}")

print(f"\nSchema Preview:")
print(schema_jsonld[:300] + "..." if len(schema_jsonld) > 300 else schema_jsonld)
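
For readers unfamiliar with JSON-LD contexts, the sketch below shows the general idea behind prefix extraction: find the namespace of each IRI, keep the ones with a known prefix, and place them in `@context`. The helpers `namespace_of` and `build_context` and the `KNOWN_PREFIXES` table are hypothetical; this is not how `_extract_prefixes_from_void` is implemented.

```python
import json

# Small, hand-picked prefix table (assumption for this sketch).
KNOWN_PREFIXES = {
    "http://rdfs.org/ns/void#": "void",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
}


def namespace_of(iri: str):
    """Return the IRI up to and including its last '#' or '/', or None."""
    for sep in ("#", "/"):
        if sep in iri:
            head, tail = iri.rsplit(sep, 1)
            if tail:
                return head + sep
    return None


def build_context(iris):
    """Collect a JSON-LD @context from the namespaces found in `iris`."""
    context = {}
    for iri in iris:
        ns = namespace_of(iri)
        if ns in KNOWN_PREFIXES:
            context[KNOWN_PREFIXES[ns]] = ns
    return dict(sorted(context.items()))


iris = [
    "http://rdfs.org/ns/void#Dataset",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://rdfs.org/ns/void#classPartition",
]

document = {
    "@context": build_context(iris),
    "@id": "http://rdfportal.org/dataset/pdbj",
    "@type": "void:Dataset",
}
print(json.dumps(document, indent=2))
```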