# RDFSolve: WikiPathways - Complete Analysis

This notebook demonstrates VoID generation with full count aggregations:
1. Setting up an endpoint and graph
2. Generating comprehensive VoID descriptions using CONSTRUCT queries with COUNT aggregations
3. Extracting detailed schema from the VoID description
4. Analyzing the results as DataFrame and JSON


In [1]:
import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint
import warnings
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the WikiPathways dataset with its SPARQL endpoint and metadata.

In [2]:
# WikiPathways configuration
endpoint_url = "https://sparql.wikipathways.org/sparql"
dataset_name = "wikipathways_complete"
void_iri = "https://wikipathways.org/void"
graph_uri = "http://rdf.wikipathways.org/"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")
print("Mode: Complete (with COUNT aggregations)")

Dataset: wikipathways_complete
Endpoint: https://sparql.wikipathways.org/sparql
VoID IRI: https://wikipathways.org/void
Graph URI: http://rdf.wikipathways.org/
Mode: Complete (with COUNT aggregations)


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [3]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    
    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://sparql.wikipathways.org/sparql
Dataset: wikipathways_complete


## Step 3: Generate Complete VoID Description

Generate VoID with full COUNT aggregations (`counts=True`). This yields complete statistics but takes longer than a run without counts; in the run below, the datatype partition query alone took about 35 s.

Three CONSTRUCT queries retrieve the class, property, and datatype partitions from the specified graph, each with complete count information.
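
To make the idea of a counting CONSTRUCT query concrete, here is a minimal sketch of what a class-partition query could look like. The helper `class_partition_query` and the partition IRI scheme are assumptions made for illustration; the queries that rdfsolve actually sends may be structured differently. Only the VoID terms `void:classPartition`, `void:class`, and `void:entities` come from the VoID vocabulary itself.

```python
def class_partition_query(graph_uri: str, void_iri: str) -> str:
    """Build a CONSTRUCT query that emits one void:classPartition per class.

    Hypothetical helper for illustration; not part of rdfsolve.
    """
    return f"""
PREFIX void: <http://rdfs.org/ns/void#>
CONSTRUCT {{
  <{void_iri}> void:classPartition ?partition .
  ?partition void:class ?class ;
             void:entities ?count .
}}
WHERE {{
  {{
    # Aggregates cannot appear in a CONSTRUCT template,
    # so the COUNT is computed in a subquery.
    SELECT ?class (COUNT(DISTINCT ?s) AS ?count)
    WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class }} }}
    GROUP BY ?class
  }}
  # Mint a partition IRI from a hash of the class IRI (an arbitrary naming choice).
  BIND(IRI(CONCAT("{void_iri}/class/", MD5(STR(?class)))) AS ?partition)
}}
""".strip()


query = class_partition_query(
    graph_uri="http://rdf.wikipathways.org/",
    void_iri="https://wikipathways.org/void",
)
print(query)
```

The `GROUP BY` with `COUNT` is the part that makes complete mode expensive: the endpoint has to scan every matching triple in the graph rather than just check that a pattern exists.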

In [4]:
try:    
    # Generate VoID using CONSTRUCT query approach with full counts
    print("Generating complete VoID with COUNT aggregations...")
    
    void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void.ttl",
        counts=True  # Full count aggregations
    )
    
    print(f"Graph contains {len(void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating complete VoID with COUNT aggregations...
Generating VoID from endpoint: https://sparql.wikipathways.org/sparql
Using graph URI: http://rdf.wikipathways.org/
Starting query: class_partitions
Finished query: class_partitions (took 1.91s)
Starting query: property_partitions
Finished query: property_partitions (took 2.73s)
Starting query: datatype_partitions
Finished query: datatype_partitions (took 35.23s)
VoID description saved to wikipathways_complete_void.ttl
VoID generation completed successfully
Graph contains 3788 triples
Saved to: wikipathways_complete_void.ttl


## Step 4: Extract Schema from Complete VoID

The schema structure is extracted from the generated VoID by `VoidParser`, accessed via `solver.extract_schema()`.

In [5]:
try:
    # Extract schema
    parser = solver.extract_schema()

    # Get schema as DataFrame
    schema_df = parser.to_schema(filter_void_nodes=True)

    print(f"Total schema triples: {len(schema_df)}")
    print(f"Unique classes: {schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {schema_df['property'].nunique()}")
    
except Exception as e:
    print(f"Schema extraction failed: {e}")

Total schema triples: 1247
Unique classes: 40
Unique properties: 133


## Step 5: Schema Visualization

Display a sample of the extracted schema, excluding rows whose object class is the generic `Class` or `Resource`.

In [6]:
# Display schema sample (excluding generic classes)
display(schema_df[~schema_df.object_class.isin(["Class", "Resource"])].head(10))

| | subject_class | subject_uri | property | property_uri | object_class | object_uri |
|---|---|---|---|---|---|---|
| 28 | shape | http://vocabularies.wikipathways.org/gpml#shape | type | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | DatatypeProperty | http://www.w3.org/2002/07/owl#DatatypeProperty |
| 55 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 56 | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 60 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | Metabolite | http://vocabularies.wikipathways.org/wp#Metabo... |
| 61 | DataNode | http://vocabularies.wikipathways.org/wp#DataNode | seeAlso | http://www.w3.org/2000/01/rdf-schema#seeAlso | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 67 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Complex | http://vocabularies.wikipathways.org/wp#Complex |
| 68 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | DataNode | http://vocabularies.wikipathways.org/wp#DataNode |
| 69 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Rna | http://vocabularies.wikipathways.org/wp#Rna |
| 70 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Pathway | http://vocabularies.wikipathways.org/wp#Pathway |
| 71 | GeneProduct | http://vocabularies.wikipathways.org/wp#GenePr... | identifier | http://purl.org/dc/elements/1.1/identifier | Protein | http://vocabularies.wikipathways.org/wp#Protein |
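
In the rows shown, each short label (`subject_class`, `property`, `object_class`) matches the part of the corresponding URI after the last `#` or `/`. The sketch below reproduces that relationship with a hypothetical helper, `local_name`; it is an assumption about how such labels can be derived, not a description of `VoidParser` internals. The two example triples use full URIs taken from rows 61 and 28.

```python
def local_name(uri: str) -> str:
    """Return the part of a URI after its last '#' or '/' (hypothetical helper)."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    # Fall back to the whole URI if there is no separator or nothing follows it.
    return uri[cut + 1:] or uri


# (subject URI, property URI, object URI) taken from rows 61 and 28 above.
triples = [
    (
        "http://vocabularies.wikipathways.org/wp#DataNode",
        "http://www.w3.org/2000/01/rdf-schema#seeAlso",
        "http://vocabularies.wikipathways.org/wp#DataNode",
    ),
    (
        "http://vocabularies.wikipathways.org/gpml#shape",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/2002/07/owl#DatatypeProperty",
    ),
]

rows = [
    {
        "subject_class": local_name(s),
        "property": local_name(p),
        "object_class": local_name(o),
    }
    for s, p, o in triples
]

for row in rows:
    print(row)
```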


## Step 6: Analyze WikiPathways DirectedInteraction

Examine the `DirectedInteraction` class as an example of detailed analysis:

In [7]:
try:
    # Focus on DirectedInteraction class
    di_schema = schema_df[schema_df['subject_class'] == 'DirectedInteraction']
    
    print("DirectedInteraction Analysis (Complete Mode):")
    print(f"Properties found: {len(di_schema)}")
    
    if len(di_schema) > 0:
        print("\nDirectedInteraction Properties:")
        for _, row in di_schema.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")
        
        # Look for database cross-references (bdb*)
        bdb_props = di_schema[di_schema['property'].str.contains('bdb', na=False)]
        if len(bdb_props) > 0:
            print("\nDatabase Cross-References (bdb*):")
            print(f"Found {len(bdb_props)} bdb properties")
            for _, row in bdb_props.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in DirectedInteraction")
    else:
        print("\nDirectedInteraction class not found in schema")
        print("Available classes:")
        for cls in schema_df['subject_class'].unique()[:10]:
            print(f"  - {cls}")
            
except Exception as e:
    print(f"DirectedInteraction analysis failed: {e}")

DirectedInteraction Analysis (Complete Mode):
Properties found: 60

DirectedInteraction Properties:
  type                 -> Class
  type                 -> Resource
  seeAlso              -> Resource
  source               -> Literal
  identifier           -> Literal
  isPartOf             -> Pathway
  isPartOf             -> Collection
  references           -> PublicationXref
  references           -> PublicationReference
  bdbChEBI             -> Metabolite
  bdbChEBI             -> DataNode
  bdbChemspider        -> Resource
  bdbHmdb              -> Metabolite
  bdbHmdb              -> DataNode
  bdbInChIKey          -> Resource

Database Cross-References (bdb*):
Found 13 bdb properties
  bdbChEBI        -> Metabolite
  bdbChEBI        -> DataNode
  bdbChemspider   -> Resource
  bdbHmdb         -> Metabolite
  bdbHmdb         -> DataNode
  bdbInChIKey     -> Resource
  bdbKeggCompound -> Resource
  bdbReactome     -> DataNode
  bdbReactome     -> Pathway
  bdbReactome     -> Res
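
Several `bdb*` properties appear once per target class, so it can be easier to read the result grouped by property. The sketch below does that with the standard library, using only the property/class pairs that are fully visible in the output above (the truncated last line is left out). The variable names are illustrative.

```python
from collections import defaultdict

# (property, object_class) pairs copied from the complete lines of the output above.
bdb_rows = [
    ("bdbChEBI", "Metabolite"),
    ("bdbChEBI", "DataNode"),
    ("bdbChemspider", "Resource"),
    ("bdbHmdb", "Metabolite"),
    ("bdbHmdb", "DataNode"),
    ("bdbInChIKey", "Resource"),
    ("bdbKeggCompound", "Resource"),
    ("bdbReactome", "DataNode"),
    ("bdbReactome", "Pathway"),
]

# Collect the target classes of each property, preserving first-seen order.
targets = defaultdict(list)
for prop, object_class in bdb_rows:
    targets[prop].append(object_class)

for prop, classes in targets.items():
    print(f"{prop:15} -> {', '.join(classes)}")
```

On the full DataFrame, `di_schema.groupby("property")["object_class"].agg(list)` should give an equivalent grouping.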

## Step 7: Export Complete Schema

Export the complete schema as JSON and CSV files with detailed statistics.

In [8]:
try:
    # Export as JSON
    print("Generating JSON schema (complete mode)...")
    schema_json = parser.to_json(filter_void_nodes=True)
    
    print("Complete JSON export completed")
    print(f"Total triples: {schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(schema_json['metadata']['classes'])}")
    print(f"Properties: {len(schema_json['metadata']['properties'])}")
    print(f"Object types: {len(schema_json['metadata']['objects'])}")
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(schema_json, f, indent=2)
    print(f"\nComplete JSON schema saved to: {dataset_name}_schema.json")
    
    # Export as CSV
    schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)
    print(f"Complete CSV schema saved to: {dataset_name}_schema.csv")
    
except Exception as e:
    print(f"Export failed: {e}")

Generating JSON schema (complete mode)...
Complete JSON export completed
Total triples: 1247
Classes: 42
Properties: 139
Object types: 39

Complete JSON schema saved to: wikipathways_complete_schema.json
Complete CSV schema saved to: wikipathways_complete_schema.csv
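
A quick way to check an exported file is to load it back and recompute the same summary. The sketch below assumes only the `metadata` keys that the export cell prints (`total_triples`, `classes`, `properties`, `objects`); the helper `summarize_schema` and the tiny sample document are made up for the demonstration and say nothing about the rest of the JSON layout.

```python
import json
import os
import tempfile


def summarize_schema(path: str) -> dict:
    """Load a schema JSON file and return counts from its metadata (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)["metadata"]
    return {
        "total_triples": metadata["total_triples"],
        "classes": len(metadata["classes"]),
        "properties": len(metadata["properties"]),
        "objects": len(metadata["objects"]),
    }


# Tiny stand-in document; the values are invented for the demo.
sample = {
    "metadata": {
        "total_triples": 2,
        "classes": ["DataNode", "Pathway"],
        "properties": ["isPartOf"],
        "objects": ["Pathway", "Literal"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_schema.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)
    summary = summarize_schema(sample_path)

print(summary)
```

Pointing `summarize_schema` at `wikipathways_complete_schema.json` should reproduce the counts printed by the export cell, provided the file has the `metadata` structure assumed here.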


## Step 8: JSON-LD Export

Export the VoID description and schema as JSON-LD with automatic prefix extraction.

In [9]:
# Export WikiPathways RDF data as JSON-LD (automatic prefix extraction)
print("Exporting WikiPathways RDF VoID and Schema as JSON-LD...")

# Export complete VoID with automatic context
void_jsonld = solver.export_void_jsonld(
    output_file="wikipathways_complete_void.jsonld",
    indent=2
)

# Export schema only with automatic context
schema_jsonld = solver.export_schema_jsonld(
    output_file="wikipathways_complete_schema.jsonld",
    indent=2,
    filter_void_nodes=True
)

print("Exported files:")
print(f"  - wikipathways_complete_void.jsonld ({len(void_jsonld)} chars)")
print(f"  - wikipathways_complete_schema.jsonld ({len(schema_jsonld)} chars)")

# Show automatically extracted prefixes
prefixes = solver._extract_prefixes_from_void()
print(f"\nAuto-extracted prefixes: {', '.join(sorted(prefixes.keys()))}")

print("\nSchema Preview:")
# Truncate long output to the first 300 characters
print((schema_jsonld[:300] + "...") if len(schema_jsonld) > 300 else schema_jsonld)

Exporting WikiPathways RDF VoID and Schema as JSON-LD...
JSON-LD exported to: wikipathways_complete_void.jsonld
Schema JSON-LD exported to: wikipathways_complete_schema.jsonld
Exported files:
  - wikipathways_complete_void.jsonld (256802 chars)
  - wikipathways_complete_schema.jsonld (79777 chars)

Auto-extracted prefixes: brick, csvw, dc, dcam, dcmitype, dcterms, doap, foaf, geo, gpml, ns0, ns1, ns11, ns15, ns16, ns17, ns18, obo, odrl, org, owl, pav, prof, prov, qb, rdf, rdfs, schema, sh, skos, sosa, ssn, time, vann, void, void-ext, wgs, wp, xml, xsd

Schema Preview:
{
  "@context": {
    "@context": {
      "brick": "https://brickschema.org/schema/Brick#",
      "csvw": "http://www.w3.org/ns/csvw#",
      "dc": "http://purl.org/dc/elements/1.1/",
      "dcam": "http://purl.org/dc/dcam/",
      "dcmitype": "http://purl.org/dc/dcmitype/",
      "dcterms": "http:/...
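
The prefix map in the `@context` is what lets a JSON-LD document write `dc:identifier` instead of a full IRI. The sketch below shows that compaction with a hypothetical helper, `compact_iri`. The `dc` and `dcam` namespaces are copied from the preview above; the `wp` and `gpml` namespaces are assumptions inferred from the URIs in the schema table and may not match the exported context exactly.

```python
def compact_iri(iri: str, prefixes: dict) -> str:
    """Shorten an IRI to prefix:local form using the longest matching namespace."""
    best_prefix, best_ns = None, ""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and len(namespace) > len(best_ns):
            best_prefix, best_ns = prefix, namespace
    if best_prefix is None:
        return iri  # no known namespace: leave the IRI untouched
    return f"{best_prefix}:{iri[len(best_ns):]}"


prefixes = {
    "dc": "http://purl.org/dc/elements/1.1/",    # from the preview above
    "dcam": "http://purl.org/dc/dcam/",          # from the preview above
    "wp": "http://vocabularies.wikipathways.org/wp#",      # assumed from table URIs
    "gpml": "http://vocabularies.wikipathways.org/gpml#",  # assumed from table URIs
}

for iri in (
    "http://purl.org/dc/elements/1.1/identifier",
    "http://vocabularies.wikipathways.org/wp#Pathway",
    "http://example.org/unknown",
):
    print(compact_iri(iri, prefixes))
```

Choosing the longest matching namespace matters when one namespace is a prefix of another; otherwise the result would depend on dictionary order.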