# RDFSolve: PubChem MeSH - Complete Analysis

This notebook demonstrates VoID generation with full count aggregations:
1. Setting up an endpoint and graph
2. Generating comprehensive VoID descriptions using CONSTRUCT queries with COUNT aggregations
3. Extracting detailed schema from the VoID description
4. Analyzing the results as DataFrame and JSON


In [1]:
import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser, generate_void_from_endpoint
import warnings
warnings.filterwarnings('ignore')

## Step 1: Configure Dataset Parameters

We'll configure the PubChem MeSH dataset with its SPARQL endpoint and metadata.

In [2]:
# MeSH configuration
endpoint_url = "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
dataset_name = "mesh_headers"
void_iri = "http://id.nlm.nih.gov/mesh/heading"
graph_uri = "http://id.nlm.nih.gov/mesh/heading"  # Specify the correct graph URI
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"VoID IRI: {void_iri}")
print(f"Graph URI: {graph_uri}")
print("Mode: Complete (with COUNT aggregations)")

Dataset: mesh_headers
Endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
VoID IRI: http://id.nlm.nih.gov/mesh/heading
Graph URI: http://id.nlm.nih.gov/mesh/heading
Mode: Complete (with COUNT aggregations)


## Step 2: Initialize RDFSolver

Create an RDFSolver instance with our configuration.

In [3]:
try:
    # Initialize RDFSolver with our configuration
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name
    )
    
    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")
    
except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
Dataset: mesh_headers


## Step 3: Generate Complete VoID Description

Generate the VoID description with full COUNT aggregations (`counts=True`). This yields complete statistics, but the aggregation queries take longer to execute.

Three CONSTRUCT queries retrieve the class, property, and datatype partitions from the specified graph, each with its count information.
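
To make the shape of such a query concrete, here is a minimal sketch of what a class-partition CONSTRUCT query with a COUNT aggregation could look like. It is not taken from rdfsolve: the helper name `build_class_partition_query` and the partition IRI scheme are assumptions made for illustration. Only the terms `void:classPartition`, `void:class`, and `void:entities` come from the VoID vocabulary.

```python
from textwrap import dedent

VOID_NS = "http://rdfs.org/ns/void#"


def build_class_partition_query(graph_uri: str, void_iri: str) -> str:
    """Build a CONSTRUCT query for class partitions with entity counts.

    Hypothetical helper for illustration; not part of the rdfsolve API.
    """
    return dedent(f"""\
        PREFIX void: <{VOID_NS}>
        CONSTRUCT {{
          <{void_iri}> void:classPartition ?partition .
          ?partition void:class ?class ;
                     void:entities ?count .
        }}
        WHERE {{
          {{
            SELECT ?class (COUNT(DISTINCT ?s) AS ?count)
            WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class }} }}
            GROUP BY ?class
          }}
          BIND(IRI(CONCAT("{void_iri}/class/", ENCODE_FOR_URI(STR(?class)))) AS ?partition)
        }}
        """)


# Same graph and VoID IRI as configured in Step 1
query = build_class_partition_query(
    graph_uri="http://id.nlm.nih.gov/mesh/heading",
    void_iri="http://id.nlm.nih.gov/mesh/heading",
)
print(query)
```

The aggregation sits in a subquery so that each class is counted once before the partition triples are constructed. The property and datatype queries would presumably follow the same pattern with a different grouping variable.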

In [4]:
try:
    # Generate VoID using the CONSTRUCT query approach with full counts
    void_graph = solver.void_generator(
        graph_uri=graph_uri,
        output_file=f"{dataset_name}_void.ttl",
        counts=True  # Full count aggregations
    )
    
    print(f"Graph contains {len(void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")
    
except Exception as e:
    print(f"Error: {e}")

Generating VoID from endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
Using graph URI: http://id.nlm.nih.gov/mesh/heading
Starting query: class_partitions
Finished query: class_partitions (took 0.87s)
Starting query: property_partitions


KeyboardInterrupt: 
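
The traceback above appears instead of an `Error: ...` line because `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so the `except Exception` clause in the cell does not catch it. A minimal sketch of that behavior, using a hypothetical `run_step` wrapper and stand-in functions that are not part of rdfsolve:

```python
def run_step(step):
    """Run a callable; report ordinary errors, let interrupts propagate."""
    try:
        return step()
    except Exception as e:  # does not match KeyboardInterrupt
        print(f"Error: {e}")
        return None


def interrupted_query():
    # Stand-in for a long-running query that the user interrupts
    raise KeyboardInterrupt


def failing_query():
    # Stand-in for a query that fails with an ordinary error
    raise ValueError("endpoint returned an error")


run_step(failing_query)  # handled inside run_step

try:
    run_step(interrupted_query)
    outcome = "swallowed"
except KeyboardInterrupt:
    outcome = "propagated"

print(outcome)
```

Because the run was interrupted during `property_partitions`, no VoID description was produced, and the cells that depend on it cannot succeed until this step is rerun to completion.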

## Step 4: Extract Schema from Complete VoID

`solver.extract_schema()` returns a `VoidParser`, which extracts the schema structure from the generated VoID.

In [None]:
try:
    # Extract schema
    parser = solver.extract_schema()

    # Get schema as DataFrame
    print("Extracting complete schema as DataFrame...")
    schema_df = parser.to_schema(filter_void_nodes=True)

    print("Schema extraction complete")
    print(f"Total schema triples: {len(schema_df)}")
    print(f"Unique classes: {schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {schema_df['property'].nunique()}")
    
except Exception as e:
    print(f"Schema extraction failed: {e}")

Schema extraction failed: No VoID description available. Run void_generator() first.


## Step 5: Schema Visualization

Display a sample of the extracted schema, filtering out generic classes.

In [None]:
# Display schema sample (excluding generic classes)
display(schema_df[~schema_df.object_class.isin(["Class", "Resource"])].head(10))

NameError: name 'schema_df' is not defined
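
The `NameError` appears because `schema_df` was never created in the failed Step 4 cell. To illustrate what the filter does once the schema is available, here is a small sketch on a toy DataFrame. The rows are invented for illustration and are not results from the endpoint. Only the column names `subject_class`, `property`, and `object_class` are taken from the cells in this notebook.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy_schema_df = pd.DataFrame(
    {
        "subject_class": ["ExampleClassA", "ExampleClassA", "ExampleClassB"],
        "property": ["exampleP1", "exampleP2", "exampleP3"],
        "object_class": ["Class", "ExampleClassB", "Resource"],
    }
)

# Drop rows whose object is one of the generic classes
generic_classes = ["Class", "Resource"]
filtered = toy_schema_df[~toy_schema_df["object_class"].isin(generic_classes)]
print(filtered)
```

Only the row pointing at `ExampleClassB` survives, since the other two objects are in the generic list.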

## Step 6: Analyze PubChem MeSH Classes

Examine the available classes and analyze MeSH-specific structures:

In [None]:
try:
    print("PubChem MeSH Schema Analysis (Complete Mode):")
    print(f"Total unique classes: {schema_df['subject_class'].nunique()}")
    
    # Show the classes that occur most often as subject in the schema
    print("\nTop 10 classes by number of schema rows:")
    class_counts = schema_df['subject_class'].value_counts().head(10)
    for cls, count in class_counts.items():
        print(f"  {cls:30} ({count} rows)")
    
    # Look for MeSH-specific classes (case-insensitive match on "mesh")
    is_mesh = schema_df['subject_class'].str.contains('mesh', case=False, na=False)
    mesh_classes = schema_df.loc[is_mesh, 'subject_class'].unique()
    if len(mesh_classes) > 0:
        print("\nMeSH-specific classes found:")
        for cls in mesh_classes[:10]:
            print(f"  - {cls}")
            
        # Analyze first MeSH class in detail
        first_mesh_class = mesh_classes[0]
        mesh_schema = schema_df[schema_df['subject_class'] == first_mesh_class]
        print(f"\n{first_mesh_class} Properties:")
        for _, row in mesh_schema.head(10).iterrows():
            print(f"  {row['property']:25} -> {row['object_class']}")
    else:
        print("\nNo MeSH-specific classes found")
        print("Available classes sample:")
        for cls in schema_df['subject_class'].unique()[:15]:
            print(f"  - {cls}")
            
except Exception as e:
    print(f"MeSH analysis failed: {e}")

## Step 7: Export Complete Schema

Export the complete schema as JSON and CSV files with detailed statistics.

In [None]:
import json

try:
    # Export as JSON
    print("Generating JSON schema (complete mode)...")
    schema_json = parser.to_json(filter_void_nodes=True)

    print("JSON export complete")
    print(f"Total triples: {schema_json['metadata']['total_triples']}")
    print(f"Classes: {len(schema_json['metadata']['classes'])}")
    print(f"Properties: {len(schema_json['metadata']['properties'])}")
    print(f"Object types: {len(schema_json['metadata']['objects'])}")

    # Save JSON to file
    with open(f"{dataset_name}_schema.json", "w") as f:
        json.dump(schema_json, f, indent=2)
    print(f"\nComplete JSON schema saved to: {dataset_name}_schema.json")
    
    # Export as CSV
    schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)
    print(f"Complete CSV schema saved to: {dataset_name}_schema.csv")
    
except Exception as e:
    print(f"Export failed: {e}")
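
The export cell reads four keys under `metadata`. As a sketch of the structure that code assumes, the following builds a toy dictionary with those keys and round-trips it through `json`. The values are placeholders, and any other keys that `to_json()` may return are not shown. Only `metadata`, `total_triples`, `classes`, `properties`, and `objects` are taken from the cell above.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values, invented for illustration only
toy_schema_json = {
    "metadata": {
        "total_triples": 2,
        "classes": ["ExampleClassA", "ExampleClassB"],
        "properties": ["exampleProperty"],
        "objects": ["ExampleClassB", "Literal"],
    }
}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_schema.json"
    out_path.write_text(json.dumps(toy_schema_json, indent=2), encoding="utf-8")
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))

print(len(reloaded["metadata"]["classes"]))
```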