In [16]:
# %% [markdown]
# # Querying The World Avatar Chemistry Knowledge Graphs
# This notebook retrieves up to 10 instances from the OntoSpecies Blazegraph endpoint
# (the OntoCompChem endpoint is kept below, commented out)

# %%
# Install required packages if needed
!pip install SPARQLWrapper pandas

# %%
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the SPARQL endpoint
# endpoint_url = "https://theworldavatar.io/chemistry/blazegraph/namespace/ontocompchem/sparql"
endpoint_url = "https://theworldavatar.io/chemistry/blazegraph/namespace/ontospecies/sparql"



# %%
# Define the query to get up to 10 instances
# (LIMIT without ORDER BY returns an arbitrary subset, not a random sample)
query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?instance ?type ?label
WHERE {
  ?instance a ?type .
  OPTIONAL { ?instance rdfs:label ?label }
}
LIMIT 10
"""


# %%
# Execute the query
def get_results(endpoint_url, query):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return results["results"]["bindings"]

results = get_results(endpoint_url, query)

# %%
# Convert to a pandas DataFrame; OPTIONAL variables such as ?label may be
# absent from a binding, so safe_get returns None instead of raising
def safe_get(value_dict, key='value'):
    if isinstance(value_dict, dict):
        return value_dict.get(key, None)
    return None

df = pd.DataFrame([{
    'instance': safe_get(r.get('instance')),
    'type': safe_get(r.get('type')),
    'label': safe_get(r.get('label'))
} for r in results])

# Display the results
print(f"Retrieved {len(df)} instances")
df.head()

Retrieved 10 instances


Unnamed: 0,instance,type,label
0,http://www.theworldavatar.com/kb/ontospecies/R...,http://www.theworldavatar.com/ontology/ontokin...,https://education.jlab.org/itselemental/ele083...
1,http://www.theworldavatar.com/kb/ontospecies/R...,http://www.theworldavatar.com/ontology/ontokin...,https://periodic.lanl.gov/83.shtml
2,http://www.theworldavatar.com/kb/ontospecies/R...,http://www.theworldavatar.com/ontology/ontokin...,https://physics.nist.gov/cgi-bin/Elements/elIn...
3,http://www.theworldavatar.com/kb/ontospecies/R...,http://www.theworldavatar.com/ontology/ontokin...,DOI:10.1364/JOSAB.6.001627
4,http://www.theworldavatar.com/kb/ontospecies/R...,http://www.theworldavatar.com/ontology/ontokin...,https://education.jlab.org/itselemental/ele084...


The next cell downloads **all RDF triples** from the Blazegraph namespace `ontocompchem` and saves them in **Turtle (TTL) format** to a local file. Here's a step-by-step breakdown:

---

### **What the Code Does**
1. **SPARQL Endpoint Target**  
   - Connects to:  
     `https://theworldavatar.io/chemistry/blazegraph/namespace/ontocompchem/sparql`  
   - This is a Blazegraph database containing chemical computation data.

2. **SPARQL Query**  
   - Uses a `CONSTRUCT` query to retrieve **all triples** (`?s ?p ?o`) in the dataset.  
   - Equivalent to "Give me every subject-predicate-object combination."

3. **Request Configuration**  
   - `headers={"Accept": "text/turtle"}`: Requests the response in **Turtle format** (a compact RDF format).  
   - `params`: Passes the SPARQL query as a URL parameter.

4. **Save Results**  
   - Writes the raw Turtle response to `ontocompchem-triples.ttl`.  
   - Uses UTF-8 encoding to handle special characters.

5. **Error Handling**  
   - `resp.raise_for_status()`: Raises an error if the HTTP request fails (e.g., 404 or 500).

---

### **Key Notes**
- **Scope**: Fetches **every triple** in the namespace (could be very large!).  
- **Format**: Output is in `.ttl` (Turtle), which is human-readable RDF.  
- **No Pagination**: Unlike the previous solution, this tries to get all data in one request (may fail for huge datasets).  
- **No Error Handling for Invalid Data**: Unlike the custom `LenientSPARQLResult` approach, this assumes the server returns valid Turtle.

---

### **When to Use This Code**
- For **small to medium-sized datasets** where a single HTTP request is sufficient.  
- When you need a **quick export** without custom parsing.  
- When you trust the server to return valid Turtle-formatted RDF.

---

### **Potential Issues**
1. **Timeout or Crash**: Large datasets may exceed server/client memory or timeout limits.  
2. **Invalid Literals**: If the data contains malformed `xsd:date` or `xsd:integer` values (as in your earlier errors), the file might still save, but parsing it later could fail.  
3. **Rate Limiting**: Public endpoints may block excessive requests.
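
One way to reduce the memory and timeout risk is to stream the response to disk instead of holding it all in `resp.text`. The following is a minimal sketch, not a tested export tool: the helper names `write_chunks` and `download_turtle`, the 64 KiB chunk size, and the 60-second timeout are all assumptions chosen for illustration.

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_turtle(endpoint: str, query: str, path: str,
                    timeout: float = 60.0) -> int:
    """Stream a CONSTRUCT result to disk (hypothetical helper)."""
    # Imported here so write_chunks stays usable without requests installed.
    import requests

    # stream=True defers the body download; timeout bounds the connect and
    # each read, not the total transfer time.
    with requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "text/turtle"},
        stream=True,
        timeout=timeout,
    ) as resp:
        resp.raise_for_status()
        return write_chunks(resp.iter_content(chunk_size=1 << 16), path)
```

Calling `download_turtle(endpoint, params["query"], "ontocompchem-triples.ttl")` with the values from the export cell would write the same file without buffering the whole response in memory. It does not help if the server itself times out while evaluating the query.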

---

### **Alternatives**
If this fails, use the **paginated approach** from the previous solution, which:  
1. Handles large datasets with `LIMIT`/`OFFSET`.  
2. Gracefully processes invalid literals.  
3. Provides progress feedback.
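
The `LIMIT`/`OFFSET` idea can be sketched as follows. This is an illustrative outline under stated assumptions, not the earlier paginated solution itself: `build_page_query` and `iter_pages` are hypothetical names, the page size of 1000 is arbitrary, and the fetch step is passed in as a callable so the loop can be exercised without a network connection.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, dict]


def build_page_query(limit: int, offset: int) -> str:
    """Return a SELECT query for one page of triples."""
    # SPARQL does not guarantee a stable order without ORDER BY, so pages
    # may overlap or miss rows on some servers; adding ORDER BY ?s ?p ?o
    # fixes that at a cost in query time.
    return f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o }} LIMIT {limit} OFFSET {offset}"


def iter_pages(fetch_page: Callable[[str], List[Row]],
               page_size: int = 1000,
               max_pages: Optional[int] = None) -> Iterator[List[Row]]:
    """Yield pages of bindings until a short or empty page is returned."""
    page_index = 0
    while max_pages is None or page_index < max_pages:
        rows = fetch_page(build_page_query(page_size, page_index * page_size))
        if not rows:
            return
        yield rows
        if len(rows) < page_size:
            return  # a short page means the data is exhausted
        page_index += 1
```

With the notebook's own `get_results` function, a call such as `iter_pages(lambda q: get_results(endpoint_url, q), page_size=1000)` would yield one list of bindings per request, and printing the running row count inside the loop gives simple progress feedback.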


In [17]:
# This code retrieves the raw Turtle text directly, bypassing RDFLib’s datatype parsing.
# You get exactly what the SPARQL endpoint returns, with no conversion errors.

import requests

endpoint = "https://theworldavatar.io/chemistry/blazegraph/namespace/ontocompchem/sparql"
params = {
    "query": """
    CONSTRUCT { ?s ?p ?o }
    WHERE   { ?s ?p ?o }
    """
}
headers = {"Accept": "text/turtle"}

resp = requests.get(endpoint, params=params, headers=headers)
resp.raise_for_status()

with open("ontocompchem-triples.ttl", "w", encoding="utf-8") as f:
    f.write(resp.text)

print("All triples saved to ontocompchem-triples.ttl")


All triples saved to ontocompchem-triples.ttl
