# RDFSolve: WikiPathways

This notebook demonstrates the RDFSolve workflow:
1. Setting up an endpoint and graph
2. Generating VoID descriptions using CONSTRUCT queries
3. Extracting schema from the VoID description
4. Analyzing the results as DataFrame and JSON

We'll use WikiPathways as an example dataset.

In [12]:
import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint
import warnings
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the WikiPathways dataset with its SPARQL endpoint and metadata.

In [13]:
# WikiPathways configuration
endpoint_url = "https://sparql.wikipathways.org/sparql"
dataset_name = "wikipathways"
void_iri = "https://wikipathways.org/void"
graph_uri = "http://rdf.wikipathways.org/"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")

Dataset: wikipathways
Endpoint: https://sparql.wikipathways.org/sparql
VoID IRI: https://wikipathways.org/void
Graph URI: http://rdf.wikipathways.org/
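Before running the longer VoID queries, it can help to confirm that the configured graph URI actually holds triples on the endpoint. The sketch below is not part of RDFSolve; `build_ask_url` and `graph_has_triples` are hypothetical helpers that use only the standard library and the SPARQL protocol's `query` parameter. Only the URL builder is exercised here, since the second function needs network access.

```python
import json
import urllib.parse
import urllib.request


def build_ask_url(endpoint: str, graph_uri: str) -> str:
    """Return a GET URL asking whether the named graph holds any triple."""
    query = f"ASK {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    return f"{endpoint}?{urllib.parse.urlencode({'query': query})}"


def graph_has_triples(endpoint: str, graph_uri: str, timeout: float = 30.0) -> bool:
    """Send the ASK query and read the boolean from the JSON result (needs network)."""
    request = urllib.request.Request(
        build_ask_url(endpoint, graph_uri),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response)["boolean"])


# Same endpoint and graph URI as configured above.
url = build_ask_url("https://sparql.wikipathways.org/sparql", "http://rdf.wikipathways.org/")
print(url)
```

Calling `graph_has_triples(endpoint_url, graph_uri)` in the notebook would then return `True` or `False`, assuming the endpoint answers ASK queries in the standard SPARQL JSON results format.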


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [14]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

Endpoint: https://sparql.wikipathways.org/sparql
Dataset: wikipathways


## Step 3: Generate VoID Description

`void_generator` runs three CONSTRUCT queries against the specified graph to retrieve the class, property, and datatype partitions.
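To give a feel for what such a query can look like, here is a minimal sketch of a class-partition CONSTRUCT query built as a Python string. It is an illustration only: the query text, the `class_partition_query` helper, and the partition IRI scheme (an MD5 hash under the dataset IRI) are assumptions, not the queries RDFSolve actually sends. The VoID terms `void:classPartition`, `void:class`, and `void:entities` come from the VoID vocabulary.

```python
VOID_PREFIX = "PREFIX void: <http://rdfs.org/ns/void#>"


def class_partition_query(graph_uri: str, dataset_iri: str) -> str:
    """Build an illustrative CONSTRUCT query for void:classPartition triples."""
    return f"""{VOID_PREFIX}
CONSTRUCT {{
  <{dataset_iri}> void:classPartition ?partition .
  ?partition void:class ?class ;
             void:entities ?entities .
}}
WHERE {{
  {{
    SELECT ?class (COUNT(DISTINCT ?s) AS ?entities)
    WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class }} }}
    GROUP BY ?class
  }}
  BIND(IRI(CONCAT("{dataset_iri}/class/", MD5(STR(?class)))) AS ?partition)
}}"""


# Same graph URI and VoID IRI as configured in Step 1.
query = class_partition_query("http://rdf.wikipathways.org/", "https://wikipathways.org/void")
print(query)
```

The subquery does the counting per class, and the outer CONSTRUCT template turns each result row into a small VoID partition description.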

In [15]:
try:    
    # Generate VoID using CONSTRUCT query approach with explicit graph URI
    void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void_final.ttl"
    )
    
    print(f"Graph contains {len(void_graph)} triples")
    print(f"Saved to: {dataset_name}_void_final.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating VoID from endpoint: https://sparql.wikipathways.org/sparql
Using graph URI: http://rdf.wikipathways.org/
Starting query: class_partitions
Finished query: class_partitions (took 0.11s)
Starting query: property_partitions
Finished query: property_partitions (took 1.87s)
Starting query: datatype_partitions
Finished query: datatype_partitions (took 35.04s)
VoID description saved to wikipathways_void_final.ttl
VoID generation completed successfully
Graph contains 3788 triples
Saved to: wikipathways_void_final.ttl


## Step 4: Extract Schema from VoID

`solver.extract_schema()` returns a `VoidParser` that extracts the schema structure from the generated VoID. The full-mode extraction cells follow the optional fast-mode detour (Step 3b).

## Step 3b: Fast VoID Generation (Optional)

For quicker discovery without count aggregations, pass `counts=False`. This skips the COUNT queries, which is useful for exploration or for large datasets. In the run below, the optional `datatype_partitions` query timed out after about 30 s and was skipped.
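As a rough illustration of what skipping the aggregation means at the query level, the sketch below builds a class-discovery subquery in both modes. The helper `class_discovery_pattern` and both query shapes are assumptions made for this example; RDFSolve's own queries may differ.

```python
def class_discovery_pattern(graph_uri: str, counts: bool = True) -> str:
    """Return a subquery listing classes, with or without entity counts."""
    graph_pattern = f"GRAPH <{graph_uri}> {{ ?s a ?class }}"
    if counts:
        # Aggregated form: one row per class plus the number of distinct instances.
        return (
            "SELECT ?class (COUNT(DISTINCT ?s) AS ?entities) "
            f"WHERE {{ {graph_pattern} }} GROUP BY ?class"
        )
    # Discovery-only form: the distinct class IRIs, with no counting.
    return f"SELECT DISTINCT ?class WHERE {{ {graph_pattern} }}"


full = class_discovery_pattern("http://rdf.wikipathways.org/", counts=True)
fast = class_discovery_pattern("http://rdf.wikipathways.org/", counts=False)
print(full)
print(fast)
```

The discovery-only form still finds every class, so the schema structure can be recovered; only the instance counts are missing.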

In [16]:
try:
    # Example: Fast VoID generation without count aggregations
    print("Fast mode (counts=False)...")
    
    fast_void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void_fast.ttl",
        counts=False  # Skip COUNT queries for faster execution
    )
    
    print(f"Fast VoID contains {len(fast_void_graph)} triples")
    
except Exception as e:
    print(f"Fast mode error: {e}")
    
# You can also use sample_limit for even faster discovery:
# solver.void_generator(graph_uri=graph_uri, sample_limit=1000)

Demonstrating fast mode (counts=False)...
Generating VoID from endpoint: https://sparql.wikipathways.org/sparql
Using graph URI: http://rdf.wikipathways.org/
Fast mode: Skipping COUNT aggregations
Starting query: class_partitions
Finished query: class_partitions (took 0.09s)
Starting query: property_partitions
Finished query: property_partitions (took 7.28s)
Starting query: datatype_partitions
Query datatype_partitions failed after 30.05s: The read operation timed out
Query datatype_partitions timed out - common with complex queries
Skipping optional query: datatype_partitions
VoID description saved to wikipathways_void_fast.ttl
VoID generation completed successfully
Fast VoID contains 2552 triples

Extract and analyze the schema from the fast VoID generation to compare with the full mode.

In [17]:
# Extract schema from fast VoID
try:
    print("\nExtracting schema from fast VoID...")
    fast_parser = VoidParser(fast_void_graph)
    
    # Get schema as DataFrame
    fast_schema_df = fast_parser.to_schema(filter_void_nodes=True)
    
    print("Fast schema extraction completed")
    print(f"Total schema triples: {len(fast_schema_df)}")
    print(f"Unique classes: {fast_schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {fast_schema_df['property'].nunique()}")
    
    # Show sample of the schema
    print("\nSample schema (fast mode):")
    display(fast_schema_df[~fast_schema_df.object_class.isin(["Class", "Resource"])].head())
    
except Exception as e:
    print(f"Fast schema extraction failed: {e}")


Extracting schema from fast VoID...
Fast schema extraction completed
Total schema triples: 1247
Unique classes: 40
Unique properties: 133

Sample schema (fast mode):


| | subject_class | subject_uri | property | property_uri | object_class | object_uri |
|---|---|---|---|---|---|---|
| 28 | shape | http://vocabularies.wikipathways.org/gpml#shape | type | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | DatatypeProperty | http://www.w3.org/2002/07/owl#DatatypeProperty |
| 56 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 57 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 61 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 62 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |


In [18]:
# Analyze DirectedInteraction class in fast mode
try:
    print("DirectedInteraction Analysis (Fast Mode):")
    di_schema_fast = fast_schema_df[fast_schema_df['subject_class'] == 'DirectedInteraction']
    
    print(f"Properties found: {len(di_schema_fast)}")
    
    if len(di_schema_fast) > 0:
        print("\nDirectedInteraction Properties (Fast Mode):")
        for _, row in di_schema_fast.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")
        
        # Look for database cross-references (bdb*)
        bdb_props_fast = di_schema_fast[di_schema_fast['property'].str.contains('bdb', na=False)]
        if len(bdb_props_fast) > 0:
            print("\nDatabase Cross-References (bdb*) - Fast Mode:")
            print(f"Found {len(bdb_props_fast)} bdb properties")
            for _, row in bdb_props_fast.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in DirectedInteraction (fast mode)")
    else:
        print("\nDirectedInteraction class not found in fast schema")
        print("Available classes in fast mode:")
        for cls in fast_schema_df['subject_class'].unique()[:10]:
            print(f"  - {cls}")
            
except Exception as e:
    print(f"Fast DirectedInteraction analysis failed: {e}")

DirectedInteraction Analysis (Fast Mode):
Properties found: 60

DirectedInteraction Properties (Fast Mode):
  type                 -> Class
  type                 -> Resource
  seeAlso              -> Resource
  source               -> Literal
  identifier           -> Literal
  isPartOf             -> Pathway
  isPartOf             -> Collection
  references           -> PublicationXref
  references           -> PublicationReference
  bdbChEBI             -> Metabolite
  bdbChEBI             -> DataNode
  bdbChemspider        -> Resource
  bdbHmdb              -> Metabolite
  bdbHmdb              -> DataNode
  bdbInChIKey          -> Resource

Database Cross-References (bdb*) - Fast Mode:
Found 13 bdb properties
  bdbChEBI        -> Metabolite
  bdbChEBI        -> DataNode
  bdbChemspider   -> Resource
  bdbHmdb         -> Metabolite
  bdbHmdb         -> DataNode
  bdbInChIKey     -> Resource
  bdbKeggCompound -> Resource
  bdbReactome     -> DataNode
  bdbReactome     -> Pathway
  bd

In [19]:
# Export fast mode schema as JSON and CSV
try:
    # Export as JSON
    print("Generating JSON schema (fast mode)...")
    fast_schema_json = fast_parser.to_json(filter_void_nodes=True)
    
    print("Fast JSON export completed")
    print(f"Total triples: {fast_schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(fast_schema_json['metadata']['classes'])}")
    print(f"Properties: {len(fast_schema_json['metadata']['properties'])}")
    print(f"Object types: {len(fast_schema_json['metadata']['objects'])}")
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema_fast.json", "w") as f:
        json.dump(fast_schema_json, f, indent=2)
    print(f"\nFast JSON schema saved to: {dataset_name}_schema_fast.json")
    
    # Export as CSV
    fast_schema_df.to_csv(f"{dataset_name}_schema_fast.csv", index=False)
    print(f"Fast CSV schema saved to: {dataset_name}_schema_fast.csv")
    
    # Compare with full mode (if available)
    if 'schema_df' in globals():
        print("\nComparison:")
        print(f"Full mode triples: {len(schema_df)}")
        print(f"Fast mode triples: {len(fast_schema_df)}")
        print(f"Difference: {len(schema_df) - len(fast_schema_df)} triples")
    
except Exception as e:
    print(f"Fast export failed: {e}")

Generating JSON schema (fast mode)...
Fast JSON export completed
Total triples: 1247
Classes: 42
Properties: 139
Object types: 39

Fast JSON schema saved to: wikipathways_schema_fast.json
Fast CSV schema saved to: wikipathways_schema_fast.csv

Comparison:
Full mode triples: 1247
Fast mode triples: 1247
Difference: 0 triples
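A difference of 0 in row counts shows that the two schemas have the same size, not that they contain the same triples. A set comparison on the `subject_class`, `property`, and `object_class` columns is a stricter check. The sketch below uses plain dictionaries as stand-ins for `schema_df.to_dict("records")` and `fast_schema_df.to_dict("records")`; the toy rows and the `schema_triples` helper are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def schema_triples(rows: Iterable[Dict[str, str]]) -> Set[Triple]:
    """Reduce schema rows to a set of (subject_class, property, object_class)."""
    return {(r["subject_class"], r["property"], r["object_class"]) for r in rows}


# Toy rows: both lists have two entries, yet they do not describe the same schema.
full_rows = [
    {"subject_class": "DirectedInteraction", "property": "isPartOf", "object_class": "Pathway"},
    {"subject_class": "DirectedInteraction", "property": "source", "object_class": "Literal"},
]
fast_rows = [
    {"subject_class": "DirectedInteraction", "property": "isPartOf", "object_class": "Pathway"},
    {"subject_class": "DirectedInteraction", "property": "isPartOf", "object_class": "Collection"},
]

only_full = schema_triples(full_rows) - schema_triples(fast_rows)
only_fast = schema_triples(fast_rows) - schema_triples(full_rows)
print(f"Only in full mode: {sorted(only_full)}")
print(f"Only in fast mode: {sorted(only_fast)}")
```

If both differences are empty, the two modes produced the same schema triples.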


In [21]:
# Extract schema
parser = solver.extract_schema()
# Get schema as DataFrame
schema_df = parser.to_schema(filter_void_nodes=True)

print(f"Total schema triples: {len(schema_df)}")
print(f"Unique classes: {schema_df['subject_class'].nunique()}")
print(f"Unique properties: {schema_df['property'].nunique()}")


Total schema triples: 1247
Unique classes: 40
Unique properties: 133


In [22]:
schema_df[~schema_df.object_class.isin(["Class", "Resource"])].head()

| | subject_class | subject_uri | property | property_uri | object_class | object_uri |
|---|---|---|---|---|---|---|
| 28 | shape | http://vocabularies.wikipathways.org/gpml#shape | type | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | DatatypeProperty | http://www.w3.org/2002/07/owl#DatatypeProperty |
| 56 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 57 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 61 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 62 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |


## Step 5: Analyze WikiPathways DirectedInteraction

Examine the `DirectedInteraction` class as an example:

In [23]:
try:
    # Focus on DirectedInteraction class
    di_schema = schema_df[schema_df['subject_class'] == 'DirectedInteraction']
    
    print("DirectedInteraction Analysis:")
    print(f"Properties found: {len(di_schema)}")
    
    if len(di_schema) > 0:
        print("\nDirectedInteraction Properties:")
        for _, row in di_schema.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")
        
        # Look for database cross-references (bdb*)
        bdb_props = di_schema[di_schema['property'].str.contains('bdb', na=False)]
        if len(bdb_props) > 0:
            print("\nDatabase Cross-References (bdb*):")
            print(f"Found {len(bdb_props)} bdb properties")
            for _, row in bdb_props.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in DirectedInteraction")
    else:
        print("\nDirectedInteraction class not found in schema")
        print("Available classes:")
        for cls in schema_df['subject_class'].unique()[:10]:
            print(f"  - {cls}")
            
except Exception as e:
    print(f"DirectedInteraction analysis failed: {e}")

DirectedInteraction Analysis:
Properties found: 60

DirectedInteraction Properties:
  type                 -> Class
  type                 -> Resource
  seeAlso              -> Resource
  source               -> Literal
  identifier           -> Literal
  isPartOf             -> Pathway
  isPartOf             -> Collection
  references           -> PublicationXref
  references           -> PublicationReference
  bdbChEBI             -> Metabolite
  bdbChEBI             -> DataNode
  bdbChemspider        -> Resource
  bdbHmdb              -> Metabolite
  bdbHmdb              -> DataNode
  bdbInChIKey          -> Resource

Database Cross-References (bdb*):
Found 13 bdb properties
  bdbChEBI        -> Metabolite
  bdbChEBI        -> DataNode
  bdbChemspider   -> Resource
  bdbHmdb         -> Metabolite
  bdbHmdb         -> DataNode
  bdbInChIKey     -> Resource
  bdbKeggCompound -> Resource
  bdbReactome     -> DataNode
  bdbReactome     -> Pathway
  bdbReactome     -> Resource
  bdbWikid
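
The printed list repeats a property once per object class, so "Found 13 bdb properties" counts property/object-class rows, not distinct properties. A small grouping step makes that easier to read. The sketch below uses pairs copied from the output above; `group_object_classes` is a hypothetical helper, and with the DataFrame the pairs could come from `zip(di_schema["property"], di_schema["object_class"])`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_object_classes(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each property to the sorted list of object classes it points to."""
    grouped = defaultdict(set)
    for prop, object_class in pairs:
        grouped[prop].add(object_class)
    return {prop: sorted(classes) for prop, classes in grouped.items()}


# First six bdb* rows from the printed DirectedInteraction output.
pairs = [
    ("bdbChEBI", "Metabolite"),
    ("bdbChEBI", "DataNode"),
    ("bdbChemspider", "Resource"),
    ("bdbHmdb", "Metabolite"),
    ("bdbHmdb", "DataNode"),
    ("bdbInChIKey", "Resource"),
]

for prop, classes in group_object_classes(pairs).items():
    print(f"{prop:15} -> {', '.join(classes)}")
```

Six rows collapse to four distinct properties here, which is the number to report when the question is how many cross-reference properties a class uses.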

## Step 6: Schema Export

In [24]:
try:
    # Export as JSON
    print("Generating JSON schema...")
    schema_json = parser.to_json(filter_void_nodes=True)
    
    print("JSON export completed")
    print(f"Total triples: {schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(schema_json['metadata']['classes'])}")
    print(f"Properties: {len(schema_json['metadata']['properties'])}")
    print(f"Object types: {len(schema_json['metadata']['objects'])}")
    
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(schema_json, f, indent=2)
    print(f"\nJSON schema saved to: {dataset_name}_schema.json")
    
except Exception as e:
    print(f"JSON export failed: {e}")

Generating JSON schema...
JSON export completed
Total triples: 1247
Classes: 42
Properties: 139
Object types: 39

JSON schema saved to: wikipathways_schema.json
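
To reuse the exported file later, it can be reloaded and summarized from its `metadata` block. The sketch below writes a toy document to a temporary directory and reads it back. The `summarize_schema` helper and the toy values are assumptions, and the `classes`, `properties`, and `objects` entries are assumed to be lists; only the key names are taken from the export cell above.

```python
import json
import tempfile
from pathlib import Path


def summarize_schema(path: Path) -> dict:
    """Load an exported schema file and return the sizes reported in its metadata."""
    with path.open(encoding="utf-8") as handle:
        metadata = json.load(handle)["metadata"]
    return {
        "total_triples": metadata["total_triples"],
        "classes": len(metadata["classes"]),
        "properties": len(metadata["properties"]),
        "objects": len(metadata["objects"]),
    }


# Toy document with the same metadata keys the notebook reads; values are made up.
toy_schema = {
    "metadata": {
        "total_triples": 2,
        "classes": ["DirectedInteraction"],
        "properties": ["isPartOf", "source"],
        "objects": ["Pathway", "Literal"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "toy_schema.json"
    schema_path.write_text(json.dumps(toy_schema, indent=2), encoding="utf-8")
    summary = summarize_schema(schema_path)

print(summary)
```

In the notebook, `summarize_schema(Path(f"{dataset_name}_schema.json"))` would be the equivalent call on the real export.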
