# RDFSolve: AOP-Wiki RDF - no counts

This notebook demonstrates fast schema discovery on the AOP-Wiki RDF dataset by skipping COUNT aggregations:
1. Setting up an endpoint and graph
2. Generating fast VoID descriptions using CONSTRUCT queries **without** COUNT aggregations
3. Extracting schema from the VoID description
4. Analyzing the results as DataFrame and JSON

In [1]:
import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint
import warnings
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the AOP-Wiki RDF dataset with its SPARQL endpoint and metadata.

In [2]:
# AOPWIKI configuration
endpoint_url = "https://aopwiki.rdf.bigcat-bioinformatics.org/sparql/"
dataset_name = "aopwikirdf_complete"
void_iri = "http://aopwiki.org/"
graph_uri = "http://aopwiki.org/"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")
print("Mode: Fast (without COUNT aggregations)")

Dataset: aopwikirdf_complete
Endpoint: https://aopwiki.rdf.bigcat-bioinformatics.org/sparql/
VoID IRI: http://aopwiki.org/
Graph URI: http://aopwiki.org/
Mode: Fast (without COUNT aggregations)
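
Before running the discovery queries, it can be useful to confirm that the named graph actually holds data. The following is a minimal sketch, not part of RDFSolve: the helper name `build_ask_url` is hypothetical, and it only builds a SPARQL-protocol GET URL for an `ASK` query without sending it.

```python
from urllib.parse import urlencode


def build_ask_url(endpoint: str, graph_uri: str) -> str:
    """Build a GET URL that asks whether the graph contains at least one triple."""
    # Hypothetical helper: the ASK query returns true if any triple exists in the graph.
    query = f"ASK {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    return f"{endpoint}?{urlencode({'query': query})}"


ask_url = build_ask_url(
    "https://aopwiki.rdf.bigcat-bioinformatics.org/sparql/",
    "http://aopwiki.org/",
)
print(ask_url)
```

Opening the resulting URL in a browser or an HTTP client should return a boolean result if the endpoint supports query-via-GET, which most public SPARQL endpoints do.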


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [3]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    
    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://aopwiki.rdf.bigcat-bioinformatics.org/sparql/
Dataset: aopwikirdf_complete


## Step 3: Generate Fast VoID Description

Generate the VoID description **without** COUNT aggregations. Skipping the aggregations makes discovery much faster, but the resulting description carries no count statistics.

Three CONSTRUCT queries retrieve the class, property, and datatype partitions, selecting distinct values (SELECT DISTINCT) rather than computing COUNT aggregates.

In [4]:
try:
    # Generate fast VoID without count aggregations
    
    fast_void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void.ttl",
        counts=False  # Fast discovery without counts
    )
    
    print(f"Graph contains {len(fast_void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating VoID from endpoint: https://aopwiki.rdf.bigcat-bioinformatics.org/sparql/
Using graph URI: http://aopwiki.org/
Fast mode: Skipping COUNT aggregations
Starting query: class_partitions
Finished query: class_partitions (took 0.10s)
Starting query: property_partitions
Finished query: property_partitions (took 0.43s)
Starting query: datatype_partitions
Finished query: datatype_partitions (took 0.73s)
VoID description saved to aopwikirdf_complete_void.ttl
VoID generation completed successfully
Graph contains 686 triples
Saved to: aopwikirdf_complete_void.ttl
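
To make the count-free approach concrete, here is a sketch of what a class-partition query of this kind might look like. It is an illustration under assumptions, not the query RDFSolve actually sends: the function name `class_partition_query` and the `class_partition/` IRI scheme are hypothetical, while `void:classPartition` and `void:class` come from the VoID vocabulary.

```python
VOID_NS = "http://rdfs.org/ns/void#"


def class_partition_query(graph_uri: str, dataset_iri: str) -> str:
    """Return a CONSTRUCT query listing class partitions without any aggregation."""
    # Hypothetical query shape: a SELECT DISTINCT subquery finds each class once,
    # then BIND mints a partition IRI from a hash of the class IRI.
    return f"""PREFIX void: <{VOID_NS}>
CONSTRUCT {{
  <{dataset_iri}> void:classPartition ?partition .
  ?partition void:class ?class .
}}
WHERE {{
  {{
    SELECT DISTINCT ?class
    WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class }} }}
  }}
  BIND(IRI(CONCAT("{dataset_iri}", "class_partition/", MD5(STR(?class)))) AS ?partition)
}}"""


query = class_partition_query("http://aopwiki.org/", "http://aopwiki.org/")
print(query)
```

Because the subquery only deduplicates classes, the endpoint does not have to group and count instances, which is presumably where the speed-up of the fast mode comes from.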


## Step 4: Extract Schema from Fast VoID

Extract schema structure from the fast-generated VoID description.

In [5]:
try:
    print("Extracting schema from fast VoID...")
    fast_parser = VoidParser(fast_void_graph)
    
    # Get schema as DataFrame
    fast_schema_df = fast_parser.to_schema(filter_void_nodes=True)
    
    print("Fast schema extraction completed")
    print(f"Total schema triples: {len(fast_schema_df)}")
    print(f"Unique classes: {fast_schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {fast_schema_df['property'].nunique()}")
    
except Exception as e:
    print(f"Fast schema extraction failed: {e}")

Extracting schema from fast VoID...
Fast schema extraction completed
Total schema triples: 240
Unique classes: 26
Unique properties: 65


## Step 5: Schema Visualization

Display a sample of the extracted schema from fast discovery.

In [6]:
# Show sample of the fast schema (excluding generic classes)
display(fast_schema_df[~fast_schema_df.object_class.isin(["Class", "Resource"])].head(10))

| | subject_class | subject_uri | property | property_uri | object_class | object_uri |
|---|---|---|---|---|---|---|
| 0 | KeyEvent | http://aopkb.org/aop_ontology#KeyEvent | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 | GO_0008150 | http://purl.obolibrary.org/obo/GO_0008150 |
| 1 | KeyEvent | http://aopkb.org/aop_ontology#KeyEvent | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |
| 2 | KeyEvent | http://aopkb.org/aop_ontology#KeyEvent | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 |
| 3 | KeyEvent | http://aopkb.org/aop_ontology#KeyEvent | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 | OrganContext | http://aopkb.org/aop_ontology#OrganContext |
| 4 | KeyEvent | http://aopkb.org/aop_ontology#KeyEvent | PATO_0001241 | http://purl.obolibrary.org/obo/PATO_0001241 | CellTypeContext | http://aopkb.org/aop_ontology#CellTypeContext |
| 5 | AdverseOutcomePathway | http://aopkb.org/aop_ontology#AdverseOutcomePa... | created | http://purl.org/dc/terms/created | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |
| 6 | KeyEventRelationship | http://aopkb.org/aop_ontology#KeyEventRelation... | created | http://purl.org/dc/terms/created | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |
| 7 | C54571 | http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus... | created | http://purl.org/dc/terms/created | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |
| 9 | KeyEventRelationship | http://aopkb.org/aop_ontology#KeyEventRelation... | modified | http://purl.org/dc/terms/modified | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |
| 10 | AdverseOutcomePathway | http://aopkb.org/aop_ontology#AdverseOutcomePa... | modified | http://purl.org/dc/terms/modified | Literal | http://www.w3.org/2000/01/rdf-schema#Literal |


## Step 6: Analyze the AOP-Wiki RDF KeyEvent Class

Examine the `KeyEvent` class as an example of detailed analysis:

In [10]:
try:
    # Focus on the KeyEvent class
    ke_schema = fast_schema_df[fast_schema_df["subject_class"] == "KeyEvent"]

    print("KeyEvent Analysis (Fast Mode):")
    # Each row is one (property, object class) pair, so this counts pairs,
    # not distinct properties
    print(f"Properties found: {len(ke_schema)}")

    if len(ke_schema) > 0:
        print("\nKeyEvent Properties:")
        for _, row in ke_schema.head(15).iterrows():
            print(f"  {row['property']:20} -> {row['object_class']}")

        # Look for database cross-references (bdb*)
        bdb_props = ke_schema[ke_schema["property"].str.contains("bdb", na=False)]
        if len(bdb_props) > 0:
            print("\nDatabase Cross-References (bdb*):")
            print(f"Found {len(bdb_props)} bdb properties")
            for _, row in bdb_props.iterrows():
                print(f"  {row['property']:15} -> {row['object_class']}")
        else:
            print("\nNo bdb* properties found in KeyEvent")
    else:
        print("\nKeyEvent class not found in schema")
        print("Available classes:")
        for cls in fast_schema_df["subject_class"].unique()[:10]:
            print(f"  - {cls}")

except Exception as e:
    print(f"KeyEvent analysis failed: {e}")

KeyEvent Analysis (Fast Mode):
Properties found: 35

KeyEvent Properties:
  PATO_0001241         -> GO_0008150
  PATO_0001241         -> Literal
  PATO_0001241         -> PATO_0001241
  PATO_0001241         -> OrganContext
  PATO_0001241         -> CellTypeContext
  label                -> Literal
  identifier           -> KeyEvent
  source               -> Literal
  PATO_0000047         -> Literal
  title                -> Literal
  alternative          -> Literal
  isPartOf             -> AdverseOutcomePathway
  CellTypeContext      -> PATO_0001241
  CellTypeContext      -> CellTypeContext
  CellTypeContext      -> OrganContext

No bdb* properties found in KeyEvent
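
The listing above shows that one property can point to several object classes. As a follow-up, this sketch flags properties whose objects are both literals and resources, which often signals inconsistent modelling. It works on (property, object class) pairs copied from the output above; the helper name `mixed_range_properties` is hypothetical.

```python
from collections import defaultdict

# (property, object_class) pairs copied from the KeyEvent output above
pairs = [
    ("PATO_0001241", "GO_0008150"),
    ("PATO_0001241", "Literal"),
    ("PATO_0001241", "PATO_0001241"),
    ("PATO_0001241", "OrganContext"),
    ("PATO_0001241", "CellTypeContext"),
    ("label", "Literal"),
    ("identifier", "KeyEvent"),
    ("source", "Literal"),
    ("PATO_0000047", "Literal"),
    ("title", "Literal"),
    ("alternative", "Literal"),
    ("isPartOf", "AdverseOutcomePathway"),
    ("CellTypeContext", "PATO_0001241"),
    ("CellTypeContext", "CellTypeContext"),
    ("CellTypeContext", "OrganContext"),
]


def mixed_range_properties(pairs):
    """Return properties that have both Literal and non-Literal object classes."""
    ranges = defaultdict(set)
    for prop, object_class in pairs:
        ranges[prop].add(object_class)
    return sorted(
        prop for prop, classes in ranges.items()
        if "Literal" in classes and len(classes) > 1
    )


mixed = mixed_range_properties(pairs)
print(mixed)  # ['PATO_0001241']
```

On the full DataFrame, the same grouping could be built with `fast_schema_df.groupby("property")["object_class"].apply(set)`.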


## Step 7: Export Fast Discovery Results

Export the fast discovery schema as JSON and CSV files.

In [8]:
try:
    # Export as JSON
    print("Generating JSON schema (fast discovery)...")
    fast_schema_json = fast_parser.to_json(filter_void_nodes=True)
    
    print("Fast JSON export completed")
    print(f"Total triples: {fast_schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(fast_schema_json['metadata']['classes'])}")
    print(f"Properties: {len(fast_schema_json['metadata']['properties'])}")
    print(f"Object types: {len(fast_schema_json['metadata']['objects'])}")
    
    # Save JSON to file
    import json
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(fast_schema_json, f, indent=2)
    print(f"\nFast JSON schema saved to: {dataset_name}_schema.json")
    
    # Export as CSV
    fast_schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)
    print(f"Fast CSV schema saved to: {dataset_name}_schema.csv")
    
except Exception as e:
    print(f"Fast export failed: {e}")

Generating JSON schema (fast discovery)...
Fast JSON export completed
Total triples: 240
Classes: 26
Properties: 65
Object types: 28

Fast JSON schema saved to: aopwikirdf_complete_schema.json
Fast CSV schema saved to: aopwikirdf_complete_schema.csv
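
The exported CSV can be consumed without pandas or RDFSolve. The sketch below counts distinct properties per subject class using only the standard library. It reads an in-memory stand-in that keeps just three of the columns, with rows taken from the Step 5 sample; the function name `properties_per_class` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the exported CSV, reduced to three columns
sample_csv = """subject_class,property,object_class
KeyEvent,PATO_0001241,GO_0008150
AdverseOutcomePathway,created,Literal
KeyEventRelationship,created,Literal
KeyEventRelationship,modified,Literal
AdverseOutcomePathway,modified,Literal
"""


def properties_per_class(handle):
    """Count distinct properties for each subject class in a schema CSV."""
    seen = defaultdict(set)
    for row in csv.DictReader(handle):
        seen[row["subject_class"]].add(row["property"])
    return {cls: len(props) for cls, props in seen.items()}


counts = properties_per_class(io.StringIO(sample_csv))
print(counts)
```

To run it on the real export, pass `open("aopwikirdf_complete_schema.csv", newline="")` in place of the `io.StringIO` object; extra columns are ignored because rows are accessed by column name.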


## Optional: Sample Limiting for Very Large Datasets

For extremely large datasets, you can add a sample limit for even faster discovery:

In [9]:
# Example: Ultra-fast discovery with sample limit
# Uncomment to try with a sample of 1000 triples for very fast exploration

# try:
#     print("Ultra-fast discovery with sampling...")
#     
#     sampled_void_graph = solver.void_generator(
#         graph_uri=graph_uri,
#         output_file=f"{dataset_name}_sampled_void.ttl",
#         counts=False,
#         sample_limit=1000  # Only sample 1000 triples
#     )
#     
#     print(f"Sampled VoID contains {len(sampled_void_graph)} triples")
#     
# except Exception as e:
#     print(f"Sampled mode error: {e}")

## JSON-LD Export

Export the VoID description and schema as JSON-LD with automatic prefix extraction.

In [None]:
# Export AOP-Wiki RDF (Fast) data as JSON-LD (automatic prefix extraction)
print("Exporting AOP-Wiki RDF (Fast) VoID and Schema as JSON-LD...")

# Export complete VoID with automatic context
void_jsonld = solver.export_void_jsonld(
    output_file="aopwikirdf_fast_void.jsonld",
    indent=2
)

# Export schema only with automatic context
schema_jsonld = solver.export_schema_jsonld(
    output_file="aopwikirdf_fast_schema.jsonld",
    indent=2,
    filter_void_nodes=True
)

print("Exported files:")
print(f"  - aopwikirdf_fast_void.jsonld ({len(void_jsonld)} chars)")
print(f"  - aopwikirdf_fast_schema.jsonld ({len(schema_jsonld)} chars)")

# Show automatically extracted prefixes
prefixes = solver._extract_prefixes_from_void()
print(f"\nAuto-extracted prefixes: {', '.join(sorted(prefixes.keys()))}")

print("\nSchema Preview:")
# Truncate long output to the first 300 characters
print((schema_jsonld[:300] + "...") if len(schema_jsonld) > 300 else schema_jsonld)
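
To illustrate what prefix extraction buys in a JSON-LD export, the sketch below compacts full IRIs against a prefix map and places that map in `@context`. It is a simplified illustration, not RDFSolve's export logic: the prefix names `aopo` and `obo` and the helper `compact_iri` are assumptions, and the single triple comes from the Step 5 sample.

```python
import json

# Hypothetical prefix map; the real one would come from the VoID graph
prefixes = {
    "aopo": "http://aopkb.org/aop_ontology#",
    "obo": "http://purl.obolibrary.org/obo/",
}


def compact_iri(iri: str, prefixes: dict) -> str:
    """Shorten an IRI to prefix:local form, preferring the longest matching namespace."""
    matches = [(ns, prefix) for prefix, ns in prefixes.items() if iri.startswith(ns)]
    if not matches:
        return iri  # no known namespace: keep the full IRI
    ns, prefix = max(matches, key=lambda match: len(match[0]))
    return f"{prefix}:{iri[len(ns):]}"


subject = compact_iri("http://aopkb.org/aop_ontology#KeyEvent", prefixes)
predicate = compact_iri("http://purl.obolibrary.org/obo/PATO_0001241", prefixes)
obj = compact_iri("http://aopkb.org/aop_ontology#OrganContext", prefixes)

document = {
    "@context": prefixes,
    "@graph": [{"@id": subject, predicate: {"@id": obj}}],
}
print(json.dumps(document, indent=2))
```

A JSON-LD processor expands `aopo:KeyEvent` back to the full IRI using the `@context`, so the compact form loses no information.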