# RDFSolve: PubChem Compound Analysis

This notebook analyzes the PubChem Compound graph using RDFSolve:
- **Graph URI**: http://rdf.ncbi.nlm.nih.gov/pubchem/compound
- **SPARQL Endpoint**: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
- **Dataset**: PubChem Compound

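It generates a VoID description of the graph, extracts the class and property schema from that description, and exports the results as Turtle, JSON, CSV, and JSON-LD.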

In [1]:
import warnings

import pandas as pd
from rdfsolve.rdfsolve import RDFSolver
from rdfsolve.void_parser import VoidParser

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Step 1: Configure Dataset Parameters

In [2]:
# PubChem Compound configuration
endpoint_url = "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
graph_uri = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound"
void_iri = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound"
dataset_name = "pubchem_compound"
working_path = "."

print(f"Dataset: {dataset_name}")
print(f"Endpoint: {endpoint_url}")
print(f"Graph URI: {graph_uri}")
print(f"VoID IRI: {void_iri}")

Dataset: pubchem_compound
Endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
Graph URI: http://rdf.ncbi.nlm.nih.gov/pubchem/compound
VoID IRI: http://rdf.ncbi.nlm.nih.gov/pubchem/compound
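Before handing these values to RDFSolve, it can help to probe the endpoint by hand. The sketch below only builds the request URL for a small class-listing query scoped to the graph; it sends nothing over the network. The helper name `build_class_probe_url` and the query text are illustrative assumptions, not part of RDFSolve.

```python
from urllib.parse import urlencode

ENDPOINT_URL = "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
GRAPH_URI = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound"


def build_class_probe_url(endpoint: str, graph: str, limit: int = 10) -> str:
    """Return a GET URL for a small class-listing query (hypothetical helper)."""
    query = (
        "SELECT DISTINCT ?class "
        f"WHERE {{ GRAPH <{graph}> {{ ?s a ?class }} }} "
        f"LIMIT {limit}"
    )
    # The SPARQL 1.1 Protocol carries the query text in the `query` parameter.
    return f"{endpoint}?{urlencode({'query': query})}"


probe_url = build_class_probe_url(ENDPOINT_URL, GRAPH_URI, limit=5)
print(probe_url)
```

Pasting the printed URL into a browser or passing it to an HTTP client is a quick way to confirm that the endpoint is reachable and that the graph URI is spelled correctly.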


## Step 2: Initialize RDFSolver Instance

In [3]:
try:
    solver = RDFSolver(
        endpoint=endpoint_url,
        path=working_path,
        void_iri=void_iri,
        dataset_name=dataset_name,
    )

    print("RDFSolver initialized successfully")
    print(f"Endpoint: {solver.endpoint}")
    print(f"Dataset: {solver.dataset_name}")

except Exception as e:
    print(f"Error: {e}")

RDFSolver initialized successfully
Endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
Dataset: pubchem_compound


## Step 3: Generate VoID Description

In [None]:
try:
    print("Generating VoID description...")

    # counts=False selects fast mode, which skips COUNT aggregations
    void_graph = solver.void_generator(
        graph_uri=graph_uri, output_file=f"{dataset_name}_void.ttl", counts=False
    )

    print("VoID generation completed!")
    print(f"Graph contains {len(void_graph)} triples")
    print(f"Saved to: {dataset_name}_void.ttl")

except Exception as e:
    print(f"VoID generation failed: {e}")

Generating VoID description...
Generating VoID from endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
Using graph URI: http://rdf.ncbi.nlm.nih.gov/pubchem/compound
Fast mode: Skipping COUNT aggregations
🚀 Starting VoID extraction from SPARQL endpoint
📡 Endpoint: https://idsm.elixir-czech.cz/sparql/endpoint/idsm
🎯 Graph: http://rdf.ncbi.nlm.nih.gov/pubchem/compound
🔧 Mode: Traditional VoID (SPARQL processing)
🔄 Starting query: class_partitions
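The log shows the run starting a `class_partitions` query in fast mode. As a rough illustration of what skipping COUNT aggregations can mean, the sketch below builds two variants of a class-partition query. These are assumed query shapes written for this example, not the queries RDFSolve actually sends, and `class_partition_query` is a hypothetical helper.

```python
def class_partition_query(graph_uri: str, counts: bool) -> str:
    """Build an illustrative class-partition query (not RDFSolve's own)."""
    if counts:
        # Full mode: one row per class together with its instance count.
        select = "SELECT ?class (COUNT(?s) AS ?entities)"
        tail = "GROUP BY ?class"
    else:
        # Fast mode: only list the classes and avoid the aggregation.
        select = "SELECT DISTINCT ?class"
        tail = ""
    body = f"WHERE {{ GRAPH <{graph_uri}> {{ ?s a ?class }} }}"
    return " ".join(part for part in (select, body, tail) if part)


graph = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound"
fast_query = class_partition_query(graph, counts=False)
full_query = class_partition_query(graph, counts=True)
print(fast_query)
print(full_query)
```

On a graph as large as PubChem Compound, the aggregated variant can be much more expensive for the endpoint, which is a plausible reason to run with `counts=False`.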


## Step 4: Extract Schema Information

In [None]:
try:
    print("Extracting schema from VoID...")
    parser = VoidParser(void_graph)

    schema_df = parser.to_schema(filter_void_nodes=True)

    print("Schema extraction completed")
    print(f"Total schema triples: {len(schema_df)}")
    print(f"Unique classes: {schema_df['subject_class'].nunique()}")
    print(f"Unique properties: {schema_df['property'].nunique()}")

except Exception as e:
    print(f"Schema extraction failed: {e}")

Extracting schema from VoID...
Schema extraction failed: name 'void_graph' is not defined
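The error above means `void_graph` was never bound, most likely because the Step 3 cell did not run to completion (its captured output stops at the `class_partitions` query). A small guard can make that dependency explicit before a later cell runs. The helper name `missing_names` is hypothetical; this is a sketch, not part of RDFSolve.

```python
def missing_names(required, namespace):
    """Return the required variable names that are absent from namespace."""
    return [name for name in required if name not in namespace]


# Simulated notebook state: Step 2 ran, Step 3 did not finish.
notebook_state = {"solver": object(), "graph_uri": "http://example.org/graph"}

missing = missing_names(["solver", "void_graph"], notebook_state)
if missing:
    print(f"Run the earlier steps first; missing: {', '.join(missing)}")
```

In the notebook itself, `globals()` would take the place of `notebook_state`.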


## Step 5: Schema Visualization and Analysis

In [None]:
# Display schema sample
if "schema_df" in locals():
    print("Schema Sample (first 10 rows):")
    display(schema_df.head(10))

    print("\nTop 10 Classes by Property Count:")
    class_counts = schema_df["subject_class"].value_counts().head(10)
    for cls, count in class_counts.items():
        print(f"  {cls}: {count} properties")
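Note that `value_counts()` on `subject_class` counts schema rows, so a property that links a class to several object classes may be counted once per row. If distinct properties per class are wanted, they can be tallied separately. The sketch below uses plain Python and made-up rows: the `subject_class` and `property` keys mirror the columns used above, while `object_class` and all values are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Made-up schema rows for illustration only.
rows = [
    {"subject_class": "ex:Compound", "property": "ex:hasAttribute", "object_class": "ex:Descriptor"},
    {"subject_class": "ex:Compound", "property": "ex:hasAttribute", "object_class": "ex:Formula"},
    {"subject_class": "ex:Compound", "property": "ex:hasParent", "object_class": "ex:Compound"},
    {"subject_class": "ex:Descriptor", "property": "ex:hasValue", "object_class": "ex:Literal"},
]

# Equivalent of value_counts(): number of rows per subject class.
row_counts = Counter(row["subject_class"] for row in rows)

# Distinct properties per subject class.
properties_by_class = defaultdict(set)
for row in rows:
    properties_by_class[row["subject_class"]].add(row["property"])
distinct_counts = {cls: len(props) for cls, props in properties_by_class.items()}

for cls, n_rows in row_counts.most_common():
    print(f"{cls}: {n_rows} rows, {distinct_counts[cls]} distinct properties")
```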

## Step 6: Domain-Specific Analysis

TODO: Add PubChem Compound-specific analysis.

In [None]:
# TODO: Implement compound-specific analysis
# - Chemical structure analysis
# - Molecular formula distribution
# - Descriptor analysis
print("TODO: Add compound analysis")

TODO: Add compound analysis


## Step 7: Export Results

In [None]:
import json

try:
    if "parser" in locals() and "schema_df" in locals():
        # Export as JSON
        schema_json = parser.to_json(filter_void_nodes=True)

        with open(f"{dataset_name}_schema.json", "w") as f:
            json.dump(schema_json, f, indent=2)

        # Export as CSV
        schema_df.to_csv(f"{dataset_name}_schema.csv", index=False)

        print("Results exported:")
        print(f"  - {dataset_name}_void.ttl")
        print(f"  - {dataset_name}_schema.json")
        print(f"  - {dataset_name}_schema.csv")
    else:
        # Steps 3 and 4 must have completed for there to be anything to export
        print("Nothing to export: run Steps 3 and 4 first")

except Exception as e:
    print(f"Export failed: {e}")

## JSON-LD Export

Export the VoID description and schema as JSON-LD with automatic prefix extraction.

In [None]:
# Export PubChem Compound data as JSON-LD (automatic prefix extraction)
print("Exporting PubChem Compound VoID and Schema as JSON-LD...")

# Export complete VoID with automatic context
void_jsonld = solver.export_void_jsonld(
    output_file="pubchem_compound_void.jsonld",
    indent=2
)

# Export schema only with automatic context
schema_jsonld = solver.export_schema_jsonld(
    output_file="pubchem_compound_schema.jsonld",
    indent=2,
    filter_void_nodes=True
)

print("Exported files:")
print(f"  - pubchem_compound_void.jsonld ({len(void_jsonld)} chars)")
print(f"  - pubchem_compound_schema.jsonld ({len(schema_jsonld)} chars)")

# Show automatically extracted prefixes
prefixes = solver._extract_prefixes_from_void()
print(f"\nAuto-extracted prefixes: {', '.join(sorted(prefixes.keys()))}")

# Preview at most the first 300 characters of the schema
print("\nSchema Preview:")
preview = (schema_jsonld[:300] + "...") if len(schema_jsonld) > 300 else schema_jsonld
print(preview)

Exporting PubChem Compound VoID and Schema as JSON-LD...


NameError: name 'solver' is not defined
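
For readers unfamiliar with JSON-LD contexts, the sketch below shows one way a prefix map could be turned into an `@context` and used to shorten IRIs. It is a simplified illustration with a hand-written prefix map and a hypothetical `compact_iri` helper, and it does not reproduce RDFSolve's own prefix extraction.

```python
import json

# Hand-written prefix map; the notebook's export extracts prefixes automatically.
prefixes = {
    "void": "http://rdfs.org/ns/void#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def compact_iri(iri: str, prefix_map: dict) -> str:
    """Shorten iri to prefix:local using the longest matching namespace."""
    best = max(
        (item for item in prefix_map.items() if iri.startswith(item[1])),
        key=lambda item: len(item[1]),
        default=None,
    )
    if best is None:
        return iri  # no known namespace, keep the full IRI
    prefix, namespace = best
    return f"{prefix}:{iri[len(namespace):]}"


document = {
    "@context": prefixes,
    "@id": "http://rdf.ncbi.nlm.nih.gov/pubchem/compound",
    "@type": compact_iri("http://rdfs.org/ns/void#Dataset", prefixes),
}
jsonld_text = json.dumps(document, indent=2)
print(jsonld_text)
```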