# ★★★★★ Five-Star Battery Data

In this notebook, we demonstrate how to complete the Five-Star Battery Data journey by linking a structured battery dataset to **external knowledge sources**. This reflects the goal of star 5: **linked data**.

We will:
- Load ontology-annotated metadata from Zenodo
- Parse it using RDF tools
- Extract identifiers for materials used in the test object
- Query Wikidata to retrieve additional information

---

In [30]:
import requests
from time import sleep
from IPython.display import display, Image, Markdown
from rdflib import Graph
from rdflib.namespace import RDF
from ontopy import get_ontology
import pandas as pd


---

## Loading JSON-LD Metadata into an RDF Graph

This code block retrieves and parses structured metadata encoded in JSON-LD (JavaScript Object Notation for Linked Data). The metadata file describes a battery dataset using RDF (Resource Description Framework) and vocabulary terms from battery-specific ontologies.

### Step-by-step explanation:

1. The variable `metadata_url` is assigned the URL of a JSON-LD file hosted on Zenodo.
2. An empty RDF graph is created using `rdflib.Graph()`. This graph stores triples in the form subject–predicate–object.
3. The `g.parse(...)` function loads and parses the JSON-LD file directly from the specified URL into the RDF graph. The `format="json-ld"` argument explicitly informs the parser that the content is JSON-LD.
4. After parsing, the code checks the length of the graph.
   - If the graph contains one or more RDF triples, it prints a success message showing the total number of triples loaded.
   - If the graph is empty, it prints a warning message.

This RDF graph is essential for subsequent semantic queries (using SPARQL), which allow us to locate and extract information such as dataset distributions, data schemas, and semantic annotations. The RDF graph represents the structured metadata in a format that is both machine-readable and semantically meaningful.


In [31]:
metadata_url = "https://zenodo.org/records/15553919/files/metadata.jsonld"

# Create an RDF graph
g = Graph()

# Parse the remote JSON-LD file hosted on Zenodo
g.parse(metadata_url, format="json-ld")

# Print how many triples were loaded
if len(g) > 0:
    print(f"✅ Loaded {len(g)} triples.")
else:
    print("⚠️ No triples were loaded from the file.")



✅ Loaded 455 triples.


---

## Loading the Battery Ontology into Memory and Parsing it into the RDF Graph

This code block loads the battery ontology into two parallel representations: one as a Python-accessible object model using `EMMOntoPy`, and one as RDF triples into the existing `rdflib` graph.

### Step-by-step explanation:

1. The variable `battinfo_url` is assigned the URL of the inferred version of the Battery Ontology. This ontology is hosted at a persistent identifier managed by w3id.org and contains semantic definitions for battery-related concepts.

2. The line `battinfo = get_ontology(battinfo_url).load()` uses the `EMMOntoPy` interface to load the ontology into a structured Python object model. This allows programmatic access to ontology classes, properties, and relationships using methods like `.classes()`, `.get_by_label()`, or `.search()`.

3. Simultaneously, the same ontology file is parsed into the RDF graph `g` using `g.parse(...)` with the format specified as `"turtle"`. This ensures that the semantic content of the ontology is available for SPARQL queries in the same graph that contains the dataset metadata.

4. After parsing, the code checks the size of the graph.
   - If the graph contains one or more RDF triples, it prints a success message with the total number of triples. Because `g` already holds the dataset metadata, this total covers both the metadata and the ontology.
   - If the graph is empty, it prints a warning message. Since the metadata was loaded earlier, this branch does not by itself detect a failed ontology load.

This dual loading approach (into both `EMMOntoPy` and `rdflib.Graph`) provides flexible access to the ontology: human-readable and programmatic via Python classes, and queryable via SPARQL in the RDF graph.


In [32]:
battinfo_url = "https://w3id.org/emmo/domain/battery/inferred"

# Loading from web
battinfo = get_ontology(battinfo_url).load()
g.parse(battinfo_url, format="turtle")

# Print the total number of triples now in the graph (metadata + ontology)
if len(g) > 0:
    print(f"✅ Loaded {len(g)} triples.")
else:
    print("⚠️ No triples were loaded from the file.")

✅ Loaded 52253 triples.


---

## Querying the RDF Graph for Active Materials and Their Semantic Annotations

This code block defines and executes a SPARQL query that retrieves information about the active materials present in a battery dataset, as described by the RDF metadata and associated ontologies.

### Step-by-step explanation:

1. A multi-line SPARQL query is defined using a Python f-string, allowing dynamic insertion of property IRIs from the `battinfo` ontology object. The query performs the following:

   - It selects three variables: `?material`, `?type`, and `?wikidata`.
   - It matches any triple in the RDF graph where a subject `?cell` has a `hasActiveMaterial` relationship to an object `?material`. The `hasActiveMaterial` property IRI is injected from the ontology model using `battinfo.hasActiveMaterial.iri`.
   - It optionally retrieves the RDF type (`rdf:type`) of each material using `OPTIONAL { ?material rdf:type ?type }`. This ensures the query does not fail if the type is missing.
   - It optionally retrieves a `wikidataReference` IRI associated with the material's type using `OPTIONAL { ?type battinfo:wikidataReference ?wikidata }`.

2. The SPARQL query is executed using `g.query(query)`, where `g` is the RDF graph containing both the dataset metadata and the battery ontology.

3. The results are iterated over and printed to the console. For each row:
   - The material IRI is printed.
   - If an RDF type is available, it is shown; otherwise, a placeholder string `"(no rdf:type)"` is used.
   - If a Wikidata reference is available, it is shown; otherwise, `"(no Wikidata ID)"` is printed.

This step allows for tracing each active material in the dataset to its corresponding class and external identifier (e.g. Wikidata), enabling interoperability with global knowledge graphs and material databases.


In [33]:
# Define and run the SPARQL query using the resolved IRI
query = f"""
SELECT DISTINCT ?material ?type ?wikidata
WHERE {{
  ?cell <{battinfo.hasActiveMaterial.iri}> ?material .
  OPTIONAL {{ ?material <{RDF.type}> ?type }}
  OPTIONAL {{ ?type <{battinfo.wikidataReference.iri}> ?wikidata }}
}}
"""

results = g.query(query)

# Display results
print("🔍 Materials assigned via hasActiveMaterial, their types, and Wikidata references:")
for row in results:
    material = row.material
    typ = row.type if row.type else "(no rdf:type)"
    qid = row.wikidata if row.wikidata else "(no Wikidata ID)"
    print(f"- {material} (type: {typ}, Wikidata: {qid})")


🔍 Materials assigned via hasActiveMaterial, their types, and Wikidata references:
- https://zenodo.org/records/15553919#81867a7d-25e2-437c-8aff-104dc6aa9c45 (type: https://w3id.org/emmo/domain/chemical-substance#substance_4c62d334_a124_40b3_9fd1_fe713d01a6af, Wikidata: https://www.wikidata.org/wiki/Q415891)
- https://zenodo.org/records/15553919#68ff065e-bf7e-47fa-abde-42b82e8e2d54 (type: https://w3id.org/emmo/domain/chemical-substance#substance_d53259a7_0d9c_48b9_a6c1_4418169df303, Wikidata: https://www.wikidata.org/wiki/Q5309)


---

## Querying Wikidata for Material Properties and Images

This code block queries the Wikidata SPARQL endpoint to retrieve additional semantic information about materials used in a battery dataset. Specifically, it attempts to extract the following for each material with a known Wikidata reference:

- The human-readable label (English name)
- The density of the material (`wdt:P2054`)
- A structural or schematic image (`wdt:P8224`)
- A general photographic image (`wdt:P18`)

### Step-by-step explanation:

1. The SPARQL endpoint for Wikidata is defined as `https://query.wikidata.org/sparql`.

2. A message is printed to indicate the start of the Wikidata lookup process.

3. The loop iterates over the `results` obtained from a prior SPARQL query against the local RDF graph, where each `row` may contain a `wikidata` URI identifying a material class.

4. For each `row`:
   - It checks if a `wikidata_uri` is present. If it is not, the row is skipped.
   - The last segment of the URI (e.g., `Q42512`) is extracted and stored as `wikidata_id`.

5. A SPARQL query is constructed to fetch:
   - The English label of the Wikidata entity using `rdfs:label` filtered by `lang="en"`.
   - The density property (`wdt:P2054`) if it exists.
   - A structural or schematic image (`wdt:P8224`) if available.
   - A photographic image (`wdt:P18`) if available.

6. The query is sent to the Wikidata endpoint using an HTTP GET request, with the response format specified as JSON.

7. If the HTTP response is successful (`status_code == 200`):
   - The JSON results are parsed.
   - If at least one result is returned:
     - The material's label, density, and image URLs (if present) are extracted from the result.
     - The label and density are printed.
     - If image URLs exist:
       - The images are displayed directly in the notebook using `IPython.display.Image`, each scaled to 300 pixels width.
     - If no images are available, a message indicates this.
   - If no data is returned for the material, a message is printed indicating the failure.

8. If the response is not successful, an error message is printed with the HTTP status code.

9. A `sleep(1)` call is used to pause for one second between requests. This is done to avoid sending queries too rapidly and overwhelming the public SPARQL endpoint, which could lead to throttling or blocking.

This code enriches local dataset metadata by linking it to publicly maintained knowledge in Wikidata, making the dataset more informative and connected within the global web of data.
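
Since the extracted identifier is interpolated directly into a SPARQL query string, it can be worth validating it first. The sketch below is one possible approach using only the standard library; the helper name `extract_qid` and the pattern `QID_PATTERN` are hypothetical and not part of the notebook.

```python
import re

# A Wikidata item ID is the letter Q followed by a positive integer
QID_PATTERN = re.compile(r"Q[1-9]\d*")

def extract_qid(uri):
    """Return the QID at the end of a Wikidata URI, or None if the
    last path segment does not look like a QID."""
    candidate = str(uri).rstrip("/").split("/")[-1]
    return candidate if QID_PATTERN.fullmatch(candidate) else None

print(extract_qid("https://www.wikidata.org/wiki/Q5309"))
print(extract_qid("http://www.wikidata.org/entity/Q415891"))
print(extract_qid("https://example.org/not-a-qid"))
```

Rows for which the helper returns `None` could then be skipped in the same way as rows without a Wikidata reference.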


In [34]:
# Define the SPARQL endpoint for Wikidata
wikidata_endpoint = "https://query.wikidata.org/sparql"

print("🔍 Querying Wikidata for material density (P2054) and images (P8224, P18):")

# Loop through each row of SPARQL query results from your RDF graph
for row in results:
    wikidata_uri = row["wikidata"]
    
    # Continue only if the material has a valid Wikidata URI
    if wikidata_uri:
        # Extract the QID (e.g., "Q42512") from the full Wikidata URL
        wikidata_id = str(wikidata_uri).split('/')[-1]

        # Define a SPARQL query to fetch:
        # - the English label (human-readable name)
        # - the density (P2054)
        # - a structure image (P8224)
        # - a general photo image (P18)
        query = f"""
        SELECT ?label ?density ?img1 ?img2 WHERE {{
          wd:{wikidata_id} rdfs:label ?label .
          FILTER (lang(?label) = "en")

          OPTIONAL {{ wd:{wikidata_id} wdt:P2054 ?density . }}
          OPTIONAL {{ wd:{wikidata_id} wdt:P8224 ?img1 . }}
          OPTIONAL {{ wd:{wikidata_id} wdt:P18 ?img2 . }}
        }}
        """

        # Send the query to the Wikidata SPARQL endpoint
        response = requests.get(wikidata_endpoint, params={'query': query, 'format': 'json'})

        if response.status_code == 200:
            # Parse the JSON response from Wikidata
            bindings = response.json().get('results', {}).get('bindings', [])

            # If results are found for the material
            if bindings:
                b = bindings[0]  # Get the first result row
                label = b['label']['value']  # Material label (e.g., "Graphite")
                density = b.get('density', {}).get('value', 'N/A')  # Density value if present
                img1 = b.get('img1', {}).get('value', None)  # Structure image URL
                img2 = b.get('img2', {}).get('value', None)  # Photo image URL

                # Print the label and density (the g/cm³ unit is assumed; the query does not retrieve the unit)
                print(f"\n🧪 {label}")
                print(f"  - Density: {density} g/cm³" if density != 'N/A' else "  - Density: ❌ not available")

                # Display structure image (chemical diagram or schematic)
                if img1:
                    print(f"  - Structure image (P8224):")
                    display(Image(url=img1, width=300))  # Scaled to 300 px width

                # Display photo image (e.g., macro photograph of the substance)
                if img2:
                    print(f"  - Photo image (P18):")
                    display(Image(url=img2, width=300))  # Scaled to 300 px width

                # If neither image is available
                if not img1 and not img2:
                    print("  - Images: ❌ none available")
            else:
                print(f"- {wikidata_id}: ❌ no data returned from SPARQL query")
        else:
            print(f"- {wikidata_id}: ❌ SPARQL query failed (HTTP {response.status_code})")

        # Pause between requests to avoid hitting the endpoint too frequently
        sleep(1)


🔍 Querying Wikidata for material density (P2054) and images (P8224, P18):

🧪 lithium cobalt oxide
  - Density: 1.68 g/cm³
  - Structure image (P8224):



🧪 graphite
  - Density: 2.16 g/cm³
  - Photo image (P18):
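
The `.get(...)` chains in the cell above rely on the shape of the SPARQL JSON results format, in which each row is a dictionary that simply omits unbound variables. The sketch below shows that shape on a hand-written sample; the sample values, the `example.org` image URL, and the helper `flatten_binding` are hypothetical and are not real Wikidata output.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (not real Wikidata data).
# "density" and "img1" are unbound, so they are absent from the binding.
sample = json.loads("""
{
  "head": {"vars": ["label", "density", "img1", "img2"]},
  "results": {"bindings": [
    {
      "label": {"type": "literal", "xml:lang": "en", "value": "graphite"},
      "img2": {"type": "uri", "value": "https://example.org/graphite.jpg"}
    }
  ]}
}
""")

def flatten_binding(binding, variables):
    """Map each variable name to its plain value, or None if unbound."""
    return {name: binding.get(name, {}).get("value") for name in variables}

bindings = sample.get("results", {}).get("bindings", [])
row = flatten_binding(bindings[0], sample["head"]["vars"]) if bindings else {}
print(row)
```

Flattening each binding once keeps the later display logic free of repeated nested lookups.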


## Summary

In this notebook, you learned how to satisfy the requirements for **5-star battery data** by linking your dataset to an external data source, **Wikidata**.

| Step              | What You Did                                               |
|-------------------|------------------------------------------------------------|
| Load metadata     | Parsed JSON-LD metadata from a published Zenodo record     |
| Extract materials | Identified semantic IRIs for active materials              |
| Link externally   | Queried Wikidata using SPARQL to retrieve additional info  |

This allows your dataset to become part of a larger **knowledge graph**, enabling automated reasoning, richer search, and future integration with AI agents.


---

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b7/Flag_of_Europe.svg" alt="EU Flag" width="100"/>

**This work has received funding from the European Union under the Horizon Europe programme.**  
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them.