# RDFSolve: WikiPathways - no counts

This notebook demonstrates fast schema discovery that skips COUNT aggregations. It covers:
1. Setting up an endpoint and graph
2. Generating fast VoID descriptions using CONSTRUCT queries **without** COUNT aggregations
3. Extracting schema from the VoID description
4. Analyzing the results as DataFrame and JSON

In [1]:
import warnings

import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the WikiPathways dataset with its SPARQL endpoint and metadata.

In [2]:
# WikiPathways configuration
endpoint_url = "https://sparql.wikipathways.org/sparql"
dataset_name = "wikipathways_fast"
void_iri = "https://wikipathways.org/void"
graph_uri = "http://rdf.wikipathways.org/"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")
print("Mode: Fast Discovery (no COUNT aggregations)")

Dataset: wikipathways_fast
Endpoint: https://sparql.wikipathways.org/sparql
VoID IRI: https://wikipathways.org/void
Graph URI: http://rdf.wikipathways.org/
Mode: Fast Discovery (no COUNT aggregations)


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [3]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    
    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://sparql.wikipathways.org/sparql
Dataset: wikipathways_fast


## Step 3: Generate Fast VoID Description

Generate the VoID description **without** COUNT aggregations. Skipping the aggregations makes discovery much faster, but the resulting description carries no count statistics.

Three CONSTRUCT queries retrieve the class, property, and datatype partitions, selecting distinct values (SELECT DISTINCT) instead of aggregating with COUNT.
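
To make the difference concrete, the sketch below builds one such query in both modes. The query text, the helper name `build_class_partition_query`, and the partition IRI scheme are assumptions for illustration; the queries rdfsolve actually sends are not shown in this notebook and may differ. Only the VoID terms (`void:classPartition`, `void:class`, `void:entities`) come from the VoID vocabulary itself.

```python
from string import Template
from textwrap import dedent

# Hypothetical fast query: distinct classes only, no aggregation.
FAST_CLASS_QUERY = Template(dedent("""\
    PREFIX void: <http://rdfs.org/ns/void#>
    CONSTRUCT {
      <$void> void:classPartition ?partition .
      ?partition void:class ?class .
    }
    WHERE {
      { SELECT DISTINCT ?class
        WHERE { GRAPH <$graph> { ?s a ?class } } }
      BIND(IRI(CONCAT("$void/class/", MD5(STR(?class)))) AS ?partition)
    }
"""))

# Hypothetical counted query: same shape plus an instance count per class.
COUNTED_CLASS_QUERY = Template(dedent("""\
    PREFIX void: <http://rdfs.org/ns/void#>
    CONSTRUCT {
      <$void> void:classPartition ?partition .
      ?partition void:class ?class ;
                 void:entities ?entities .
    }
    WHERE {
      { SELECT ?class (COUNT(DISTINCT ?s) AS ?entities)
        WHERE { GRAPH <$graph> { ?s a ?class } }
        GROUP BY ?class }
      BIND(IRI(CONCAT("$void/class/", MD5(STR(?class)))) AS ?partition)
    }
"""))


def build_class_partition_query(graph_uri, void_iri, counts=False):
    """Fill in the graph and VoID IRIs for the chosen mode."""
    template = COUNTED_CLASS_QUERY if counts else FAST_CLASS_QUERY
    return template.substitute(graph=graph_uri, void=void_iri)


print(build_class_partition_query(
    "http://rdf.wikipathways.org/", "https://wikipathways.org/void"
))
```

The fast variant lets the endpoint stop at the set of distinct classes, while the counted variant has to group and count every matching subject.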

In [4]:
try:
    # Generate fast VoID without count aggregations
    
    fast_void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void.ttl",
        counts=False  # Fast discovery without counts
    )
    
    print(f"Graph contains {len(fast_void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating VoID from endpoint: https://sparql.wikipathways.org/sparql
Using graph URI: http://rdf.wikipathways.org/
Fast mode: Skipping COUNT aggregations
Starting query: class_partitions
Finished query: class_partitions (took 0.14s)
Starting query: property_partitions
Finished query: property_partitions (took 6.62s)
Starting query: datatype_partitions
Finished query: datatype_partitions (took 35.32s)
VoID description saved to wikipathways_fast_void.ttl
VoID generation completed successfully
Graph contains 2557 triples
Saved to: wikipathways_fast_void.ttl
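
The log above reports one timing per query. A small sketch for totalling them, assuming the `Finished query: ... (took ...s)` line format shown in the output stays stable (the helper `query_timings` is illustrative, not part of rdfsolve):

```python
import re

# The three timing lines from the output above
LOG = """\
Finished query: class_partitions (took 0.14s)
Finished query: property_partitions (took 6.62s)
Finished query: datatype_partitions (took 35.32s)
"""

PATTERN = re.compile(r"Finished query: (\w+) \(took ([\d.]+)s\)")


def query_timings(log_text):
    """Map each query name to its reported duration in seconds."""
    return {name: float(seconds) for name, seconds in PATTERN.findall(log_text)}


timings = query_timings(LOG)
total = sum(timings.values())
slowest = max(timings, key=timings.get)
print(f"total: {total:.2f}s, slowest: {slowest}")
# prints "total: 42.08s, slowest: datatype_partitions"
```

In this run the datatype partition query accounts for most of the roughly 42 seconds, so it is the first place to look if fast mode is still too slow.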


## Step 4: Extract Schema from Fast VoID

Extract schema structure from the fast-generated VoID description.

In [5]:
try:
    print("Extracting schema from fast VoID...")
    fast_parser = VoidParser(fast_void_graph)
    
    # Get schema as DataFrame
    fast_schema_df = fast_parser.to_schema(filter_void_nodes=True)
    
    print("Fast schema extraction completed")
    print(f"Total schema triples: {len(fast_schema_df)}")
    print(f"Unique classes: {fast_schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {fast_schema_df['property'].nunique()}")
    
except Exception as e:
    print(f"Fast schema extraction failed: {e}")

Extracting schema from fast VoID...
Fast schema extraction completed
Total schema triples: 1247
Unique classes: 40
Unique properties: 133


## Step 5: Schema Visualization

Display a sample of the extracted schema from fast discovery.

In [6]:
# Show sample of the fast schema (excluding generic classes)
display(fast_schema_df[~fast_schema_df.object_class.isin(["Class", "Resource"])].head(10))

|  | subject_class | subject_uri | property | property_uri | object_class | object_uri |
| --- | --- | --- | --- | --- | --- | --- |
| 28 | shape | http://vocabularies.wikipathways.org/gpml#shape | type | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | DatatypeProperty | http://www.w3.org/2002/07/owl#DatatypeProperty |
| 56 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 57 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 61 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 62 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 67 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Complex | http://vocabularies.wikipathways.org/wp#Complex |
| 68 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 69 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Rna | http://vocabularies.wikipathways.org/wp#Rna |
| 70 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Pathway | http://vocabularies.wikipathways.org/wp#Pathway |
| 71 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Protein | http://vocabularies.wikipathways.org/wp#Protein |
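
In the sample, each short label (`subject_class`, `property`, `object_class`) matches the last part of the corresponding URI. The sketch below reproduces that mapping under the assumption that labels are the URI fragment after `#`, or otherwise the last path segment; `local_name` is an illustrative helper, and the way rdfsolve derives its labels is not shown here.

```python
def local_name(uri: str) -> str:
    """Return the fragment after '#', or else the last path segment."""
    if "#" in uri:
        return uri.rsplit("#", 1)[1]
    return uri.rstrip("/").rsplit("/", 1)[-1]


# URIs taken from the table above
for uri in (
    "http://vocabularies.wikipathways.org/wp#DataNode",
    "http://purl.org/dc/elements/1.1/identifier",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
):
    print(f"{local_name(uri):12} <- {uri}")
```

Because labels drop the namespace, two terms from different vocabularies can share a label; the `*_uri` columns are the safer join keys.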


## Step 6: Analyze DirectedInteraction (Fast Mode)

Examine the `DirectedInteraction` class using the fast discovery results:

In [7]:
try:
    print("DirectedInteraction Analysis (Fast Discovery):")
    di_schema_fast = fast_schema_df[fast_schema_df['subject_class'] == 'DirectedInteraction']
    
    print(f"Properties found: {len(di_schema_fast)}")
    
    if len(di_schema_fast) > 0:
        print("\nDirectedInteraction Properties:")
        for _, row in di_schema_fast.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")
        
        # Look for database cross-references (bdb*)
        bdb_props_fast = di_schema_fast[di_schema_fast['property'].str.contains('bdb', na=False)]
        if len(bdb_props_fast) > 0:
            print("\nDatabase Cross-References (bdb*):")
            print(f"Found {len(bdb_props_fast)} bdb properties")
            for _, row in bdb_props_fast.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in DirectedInteraction")
    else:
        print("\nDirectedInteraction class not found in fast schema")
        print("Available classes in fast discovery:")
        for cls in fast_schema_df['subject_class'].unique()[:10]:
            print(f"  - {cls}")
            
except Exception as e:
    print(f"Fast DirectedInteraction analysis failed: {e}")

DirectedInteraction Analysis (Fast Discovery):
Properties found: 60

DirectedInteraction Properties:
  type                 -> Class
  type                 -> Resource
  seeAlso              -> Resource
  source               -> Literal
  identifier           -> Literal
  isPartOf             -> Pathway
  isPartOf             -> Collection
  references           -> PublicationXref
  references           -> PublicationReference
  bdbChEBI             -> Metabolite
  bdbChEBI             -> DataNode
  bdbChemspider        -> Resource
  bdbHmdb              -> Metabolite
  bdbHmdb              -> DataNode
  bdbInChIKey          -> Resource

Database Cross-References (bdb*):
Found 13 bdb properties
  bdbChEBI        -> Metabolite
  bdbChEBI        -> DataNode
  bdbChemspider   -> Resource
  bdbHmdb         -> Metabolite
  bdbHmdb         -> DataNode
  bdbInChIKey     -> Resource
  bdbKeggCompound -> Resource
  bdbReactome     -> DataNode
  bdbReactome     -> Pathway
  bdbReactome     -> Re
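
The counts printed above are row counts: each row is one (property, object class) pair, so a property such as `bdbChEBI` is counted once per target class. A sketch of grouping the pairs by property, using only the complete lines from the output (the helper `group_targets` is illustrative):

```python
from collections import defaultdict

# (property, object_class) pairs copied from the complete output lines above
pairs = [
    ("bdbChEBI", "Metabolite"),
    ("bdbChEBI", "DataNode"),
    ("bdbChemspider", "Resource"),
    ("bdbHmdb", "Metabolite"),
    ("bdbHmdb", "DataNode"),
    ("bdbInChIKey", "Resource"),
    ("bdbKeggCompound", "Resource"),
    ("bdbReactome", "DataNode"),
    ("bdbReactome", "Pathway"),
]


def group_targets(pairs):
    """Collect the object classes seen for each property, in input order."""
    grouped = defaultdict(list)
    for prop, obj in pairs:
        grouped[prop].append(obj)
    return dict(grouped)


for prop, targets in group_targets(pairs).items():
    print(f"{prop:15} -> {', '.join(targets)}")
```

On the DataFrame itself, `di_schema_fast['property'].nunique()` gives the number of distinct properties directly.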

## Step 7: Export Fast Discovery Results

Export the fast discovery schema as JSON and CSV files.

In [8]:
try:
    # Export as JSON
    print("Generating JSON schema (fast discovery)...")
    fast_schema_json = fast_parser.to_json(filter_void_nodes=True)
    
    print("Fast JSON export completed")
    print(f"Total triples: {fast_schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(fast_schema_json['metadata']['classes'])}")
    print(f"Properties: {len(fast_schema_json['metadata']['properties'])}")
    print(f"Object types: {len(fast_schema_json['metadata']['objects'])}")
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(fast_schema_json, f, indent=2)
    print(f"\nFast JSON schema saved to: {dataset_name}_schema.json")
    
    # Export as CSV
    fast_schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)
    print(f"Fast CSV schema saved to: {dataset_name}_schema.csv")
    
except Exception as e:
    print(f"Fast export failed: {e}")

Generating JSON schema (fast discovery)...
Fast JSON export completed
Total triples: 1247
Classes: 42
Properties: 139
Object types: 39

Fast JSON schema saved to: wikipathways_fast_schema.json
Fast CSV schema saved to: wikipathways_fast_schema.csv
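
The exported CSV can be read back without pandas. The sketch below assumes the file has the six columns shown in the Step 5 sample (written with `index=False`, so no index column) and uses two rows from that sample in place of the real file; `summarize_schema` is an illustrative helper.

```python
import csv
import io

# Two rows in the assumed export layout, taken from the Step 5 sample
SAMPLE_CSV = """\
subject_class,subject_uri,property,property_uri,object_class,object_uri
shape,http://vocabularies.wikipathways.org/gpml#shape,type,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,DatatypeProperty,http://www.w3.org/2002/07/owl#DatatypeProperty
DataNode,http://vocabularies.wikipathways.org/wp#DataNode,seeAlso,http://www.w3.org/2000/01/rdf-schema#seeAlso,DataNode,http://vocabularies.wikipathways.org/wp#DataNode
"""


def summarize_schema(handle):
    """Count rows, distinct subject classes, and distinct properties."""
    rows = list(csv.DictReader(handle))
    return {
        "rows": len(rows),
        "classes": len({row["subject_class"] for row in rows}),
        "properties": len({row["property"] for row in rows}),
    }


print(summarize_schema(io.StringIO(SAMPLE_CSV)))
# prints {'rows': 2, 'classes': 2, 'properties': 2}
```

For the real export, pass `open(f"{dataset_name}_schema.csv", newline="")` instead of the in-memory sample.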


## Optional: Sample Limiting for Very Large Datasets

For extremely large datasets, you can add a sample limit for even faster discovery:

In [9]:
# Example: Ultra-fast discovery with sample limit
# Uncomment to try with a sample of 1000 triples for very fast exploration

# try:
#     print("Ultra-fast discovery with sampling...")
#     
#     sampled_void_graph = solver.void_generator(
#         graph_uri=graph_uri,
#         output_file=f"{dataset_name}_sampled_void.ttl",
#         counts=False,
#         sample_limit=1000  # Only sample 1000 triples
#     )
#     
#     print(f"Sampled VoID contains {len(sampled_void_graph)} triples")
#     
# except Exception as e:
#     print(f"Sampled mode error: {e}")
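
Sampling trades completeness for speed: any class or property that does not occur in the sample is absent from the resulting schema. The toy sketch below illustrates this, assuming the limit simply restricts how many triples are inspected, as the cell comment suggests. How rdfsolve selects its sample is not shown in this notebook, and `classes_seen` and the `ex:` data are made up for the illustration.

```python
def classes_seen(triples, sample_limit=None):
    """Collect classes from rdf:type triples, optionally from the first N only."""
    subset = triples if sample_limit is None else triples[:sample_limit]
    return {obj for (_, pred, obj) in subset if pred == "rdf:type"}


# Toy data: five Pathway instances first, then a single rare Complex
triples = [(f"ex:p{i}", "rdf:type", "wp:Pathway") for i in range(5)]
triples.append(("ex:c1", "rdf:type", "wp:Complex"))

print(sorted(classes_seen(triples)))                  # both classes
print(sorted(classes_seen(triples, sample_limit=3)))  # the rare class is missed
```

A sampled schema is therefore best treated as a quick preview and confirmed with a full fast run before relying on it.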

## JSON-LD Export

Export the VoID description and schema as JSON-LD with automatic prefix extraction.

In [None]:
# Export WikiPathways RDF (Fast) data as JSON-LD (automatic prefix extraction)
print("Exporting WikiPathways RDF (Fast) VoID and Schema as JSON-LD...")

# Export complete VoID with automatic context
void_jsonld = solver.export_void_jsonld(
    output_file="wikipathways_fast_void.jsonld",
    indent=2
)

# Export schema only with automatic context
schema_jsonld = solver.export_schema_jsonld(
    output_file="wikipathways_fast_schema.jsonld",
    indent=2,
    filter_void_nodes=True
)

print("Exported files:")
print(f"  - wikipathways_fast_void.jsonld ({len(void_jsonld)} chars)")
print(f"  - wikipathways_fast_schema.jsonld ({len(schema_jsonld)} chars)")

# Show automatically extracted prefixes
prefixes = solver._extract_prefixes_from_void()
print(f"\nAuto-extracted prefixes: {', '.join(sorted(prefixes.keys()))}")

print("\nSchema Preview:")
# Truncate long output to the first 300 characters
preview = schema_jsonld[:300] + "..." if len(schema_jsonld) > 300 else schema_jsonld
print(preview)
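
For orientation, a JSON-LD context maps prefixes to namespace IRIs so that full URIs can be written as compact IRIs such as `wp:DataNode`. The sketch below shows the idea with a hand-written prefix map and an illustrative `compact` helper. The prefix names and the document shape are assumptions; the output of `export_schema_jsonld` is not shown in this notebook and may be structured differently.

```python
import json

# Hand-written prefix map; rdfsolve extracts its own from the VoID graph
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "wp": "http://vocabularies.wikipathways.org/wp#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def compact(uri, prefixes=PREFIXES):
    """Rewrite a full URI as prefix:local, preferring the longest namespace match."""
    by_length = sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True)
    for prefix, namespace in by_length:
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known namespace: leave the URI untouched


# One schema-level statement from Step 5: DataNode --seeAlso--> DataNode
doc = {
    "@context": PREFIXES,
    "@id": compact("http://vocabularies.wikipathways.org/wp#DataNode"),
    compact("http://www.w3.org/2000/01/rdf-schema#seeAlso"): {
        "@id": compact("http://vocabularies.wikipathways.org/wp#DataNode")
    },
}
print(json.dumps(doc, indent=2))
```

Any JSON-LD processor can expand the compact IRIs back to full URIs using the `@context`, so the shorter form loses no information.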