# GeoCroissant to GeoDCAT Conversion

<img src="GeoCroissant.jpg" alt="GeoCroissant" width="150" style="float: right; margin-left: 50px;">

This notebook demonstrates how to convert metadata from **GeoCroissant**, a geospatial extension of MLCommons Croissant, into **GeoDCAT** (DCAT-AP for geospatial datasets).

GeoDCAT is a standardized RDF-based metadata model for publishing geospatial datasets, enabling:
-  Metadata interoperability (with CKAN, INSPIRE, EU portals)
-  Semantic web support via RDF/JSON-LD
-  Cataloging of spatial, temporal, and distribution metadata

| **GeoCroissant Field**     | **GeoDCAT Field**              |
|----------------------------|--------------------------------|
| `@id`                      | N/A                            |
| `@type`                    | `@type`                        |
| `name`                     | `title`                        |
| `description`              | `description`                  |
| `dct:temporal`             | `temporalExtent`              |
| `geocr:BoundingBox`        | `spatialExtent` / `bbox`       |
| `geocr:Geometry`           | N/A                            |
| `distribution`             | `distribution`                 |
| `contentUrl`               | `url`                          |
| `encodingFormat`           | `format`                       |
| N/A                        | `temporalExtent/start` + `end` |
| N/A                        | `spatialExtent`                |

## Install Required Libraries

We use:
- `rdflib` for manipulating RDF graphs
- `pyshacl` for validating metadata using SHACL constraints

In [1]:
!pip install rdflib pyshacl



## Define Conversion Function

We write a function to:
- Parse GeoCroissant metadata (`croissant.json`)
- Map it to GeoDCAT concepts like:
  - `dcat:Dataset`
  - `dcat:Distribution`
  - `dct:creator`, `dct:license`, `dcat:accessURL`, etc.
- Save outputs in JSON-LD(`.jsonld`) and Turtle (`.ttl`) formats


In [2]:
import json
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import DCTERMS, DCAT, FOAF, XSD, RDF


def croissant_to_geodcat_jsonld(croissant_json, output_file="geodcat.jsonld"):
    g = Graph()

    # Namespaces
    GEO = Namespace("http://www.opengis.net/ont/geosparql#")
    SCHEMA = Namespace("https://schema.org/")
    SPDX = Namespace("http://spdx.org/rdf/terms#")
    ADMS = Namespace("http://www.w3.org/ns/adms#")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    g.bind("dct", DCTERMS)
    g.bind("dcat", DCAT)
    g.bind("foaf", FOAF)
    g.bind("geo", GEO)
    g.bind("schema", SCHEMA)
    g.bind("spdx", SPDX)
    g.bind("adms", ADMS)
    g.bind("prov", PROV)

    dataset_id = croissant_json.get("identifier", "dataset")
    dataset_uri = URIRef(f"https://{dataset_id}")
    g.add((dataset_uri, RDF.type, DCAT.Dataset))
    g.add((dataset_uri, RDF.type, SCHEMA.Dataset))
    g.add((dataset_uri, DCTERMS.identifier, Literal(dataset_id)))
    g.add((dataset_uri, DCTERMS.title, Literal(croissant_json["name"])))
    g.add((dataset_uri, DCTERMS.description, Literal(croissant_json["description"])))
    g.add((dataset_uri, DCTERMS.license, URIRef(croissant_json["license"])))
    if "conformsTo" in croissant_json:
        g.add((dataset_uri, DCTERMS.conformsTo, URIRef(croissant_json["conformsTo"])))

    for alt in croissant_json.get("alternateName", []):
        g.add((dataset_uri, SCHEMA.alternateName, Literal(alt)))

    if croissant_json.get("sameAs"):
        g.add((dataset_uri, SCHEMA.sameAs, URIRef(croissant_json["sameAs"])))

    creator = croissant_json.get("creator", {})
    if isinstance(creator, dict):
        creator_uri = URIRef(creator.get("url", f"https://example.org/agent/{dataset_id}"))
        g.add((creator_uri, RDF.type, FOAF.Agent))
        g.add((creator_uri, FOAF.name, Literal(creator["name"])))
        g.add((dataset_uri, DCTERMS.creator, creator_uri))

    for kw in croissant_json.get("keywords", []):
        g.add((dataset_uri, DCAT.keyword, Literal(kw)))

    # Temporal extent (hardcoded or extracted if available)
    temporal_uri = URIRef(f"{dataset_uri}/period")
    g.add((dataset_uri, DCTERMS.temporal, temporal_uri))
    g.add((temporal_uri, RDF.type, DCAT.PeriodOfTime))
    g.add((temporal_uri, DCAT.startDate, Literal("2018-01-01", datatype=XSD.date)))
    g.add((temporal_uri, DCAT.endDate, Literal("2021-12-31", datatype=XSD.date)))

    # Spatial extent (optional example)
    spatial_uri = URIRef("http://sws.geonames.org/6252001/")  # USA
    g.add((dataset_uri, DCTERMS.spatial, spatial_uri))

    # Distributions
    for dist in croissant_json.get("distribution", []):
        dist_id = dist.get("@id", "dist")
        dist_uri = URIRef(f"{dataset_uri}/distribution/{dist_id}")
        g.add((dataset_uri, DCAT.distribution, dist_uri))
        g.add((dist_uri, RDF.type, DCAT.Distribution))
        g.add((dist_uri, DCTERMS.title, Literal(dist.get("name", ""))))
        g.add((dist_uri, DCTERMS.description, Literal(dist.get("description", ""))))
        g.add((dist_uri, DCAT.accessURL, URIRef(dist.get("contentUrl", "https://example.org/data"))))
        g.add((dist_uri, DCAT.mediaType, Literal(dist.get("encodingFormat", "application/octet-stream"))))

        if "sha256" in dist:
            checksum_node = URIRef(f"{dist_uri}/checksum")
            g.add((dist_uri, SPDX.checksum, checksum_node))
            g.add((checksum_node, RDF.type, SPDX.Checksum))
            g.add((checksum_node, SPDX.algorithm, Literal("SHA256")))
            g.add((checksum_node, SPDX.checksumValue, Literal(dist["sha256"])))

        if "containedIn" in dist:
            parent_id = dist["containedIn"].get("@id")
            if parent_id:
                parent_uri = URIRef(f"{dataset_uri}/distribution/{parent_id}")
                g.add((dist_uri, DCTERMS.isPartOf, parent_uri))

        if "includes" in dist:
            g.add((dist_uri, SCHEMA.hasPart, Literal(dist["includes"])))

    if croissant_json.get("url"):
        g.add((dataset_uri, DCAT.landingPage, URIRef(croissant_json["url"])))

    g.serialize(destination=output_file, format="json-ld", indent=2)
    print(f"GeoDCAT JSON-LD metadata written to {output_file}")


if __name__ == "__main__":
    with open("croissant.json", "r") as f:
        croissant = json.load(f)

    croissant_to_geodcat_jsonld(croissant, output_file="geodcat.jsonld")

GeoDCAT JSON-LD metadata written to geodcat.jsonld


  g.add((temporal_uri, RDF.type, DCAT.PeriodOfTime))


##  Load Metadata and Generate GeoDCAT RDF

We now load the `croissant.json` file and convert it using our function. This will produce:
- `geodcat.jsonld`: GeoDCAT in JSON-LD
- `geodcat.ttl`: GeoDCAT in Turtle

In [3]:
import json
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import DCTERMS, DCAT, FOAF, XSD, RDF


def croissant_to_geodcat_jsonld(croissant_json, output_file="geodcat.jsonld"):
    g = Graph()

    # Namespaces
    GEO = Namespace("http://www.opengis.net/ont/geosparql#")
    SCHEMA = Namespace("https://schema.org/")
    SPDX = Namespace("http://spdx.org/rdf/terms#")
    ADMS = Namespace("http://www.w3.org/ns/adms#")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    g.bind("dct", DCTERMS)
    g.bind("dcat", DCAT)
    g.bind("foaf", FOAF)
    g.bind("geo", GEO)
    g.bind("schema", SCHEMA)
    g.bind("spdx", SPDX)
    g.bind("adms", ADMS)
    g.bind("prov", PROV)

    dataset_id = croissant_json.get("identifier", "dataset")
    dataset_uri = URIRef(f"https://{dataset_id}")
    g.add((dataset_uri, RDF.type, DCAT.Dataset))
    g.add((dataset_uri, RDF.type, SCHEMA.Dataset))
    g.add((dataset_uri, DCTERMS.identifier, Literal(dataset_id)))
    g.add((dataset_uri, DCTERMS.title, Literal(croissant_json["name"])))
    g.add((dataset_uri, DCTERMS.description, Literal(croissant_json["description"])))
    g.add((dataset_uri, DCTERMS.license, URIRef(croissant_json["license"])))
    if "conformsTo" in croissant_json:
        g.add((dataset_uri, DCTERMS.conformsTo, URIRef(croissant_json["conformsTo"])))

    for alt in croissant_json.get("alternateName", []):
        g.add((dataset_uri, SCHEMA.alternateName, Literal(alt)))

    if croissant_json.get("sameAs"):
        g.add((dataset_uri, SCHEMA.sameAs, URIRef(croissant_json["sameAs"])))

    creator = croissant_json.get("creator", {})
    if isinstance(creator, dict):
        creator_uri = URIRef(creator.get("url", f"https://example.org/agent/{dataset_id}"))
        g.add((creator_uri, RDF.type, FOAF.Agent))
        g.add((creator_uri, FOAF.name, Literal(creator["name"])))
        g.add((dataset_uri, DCTERMS.creator, creator_uri))

    for kw in croissant_json.get("keywords", []):
        g.add((dataset_uri, DCAT.keyword, Literal(kw)))

    # Temporal extent (hardcoded or extracted if available)
    temporal_uri = URIRef(f"{dataset_uri}/period")
    g.add((dataset_uri, DCTERMS.temporal, temporal_uri))
    g.add((temporal_uri, RDF.type, DCAT.PeriodOfTime))
    g.add((temporal_uri, DCAT.startDate, Literal("2018-01-01", datatype=XSD.date)))
    g.add((temporal_uri, DCAT.endDate, Literal("2021-12-31", datatype=XSD.date)))

    # Spatial extent (optional example)
    spatial_uri = URIRef("http://sws.geonames.org/6252001/")  # USA
    g.add((dataset_uri, DCTERMS.spatial, spatial_uri))

    # Distributions
    for dist in croissant_json.get("distribution", []):
        dist_id = dist.get("@id", "dist")
        dist_uri = URIRef(f"{dataset_uri}/distribution/{dist_id}")
        g.add((dataset_uri, DCAT.distribution, dist_uri))
        g.add((dist_uri, RDF.type, DCAT.Distribution))
        g.add((dist_uri, DCTERMS.title, Literal(dist.get("name", ""))))
        g.add((dist_uri, DCTERMS.description, Literal(dist.get("description", ""))))
        g.add((dist_uri, DCAT.accessURL, URIRef(dist.get("contentUrl", "https://example.org/data"))))
        g.add((dist_uri, DCAT.mediaType, Literal(dist.get("encodingFormat", "application/octet-stream"))))

        if "sha256" in dist:
            checksum_node = URIRef(f"{dist_uri}/checksum")
            g.add((dist_uri, SPDX.checksum, checksum_node))
            g.add((checksum_node, RDF.type, SPDX.Checksum))
            g.add((checksum_node, SPDX.algorithm, Literal("SHA256")))
            g.add((checksum_node, SPDX.checksumValue, Literal(dist["sha256"])))

        if "containedIn" in dist:
            parent_id = dist["containedIn"].get("@id")
            if parent_id:
                parent_uri = URIRef(f"{dataset_uri}/distribution/{parent_id}")
                g.add((dist_uri, DCTERMS.isPartOf, parent_uri))

        if "includes" in dist:
            g.add((dist_uri, SCHEMA.hasPart, Literal(dist["includes"])))

    if croissant_json.get("url"):
        g.add((dataset_uri, DCAT.landingPage, URIRef(croissant_json["url"])))

    g.serialize(destination=output_file, format="json-ld", indent=2)
    print(f"GeoDCAT JSON-LD metadata written to {output_file}")

    g.serialize(destination="geodcat.ttl", format="turtle")
    print("GeoDCAT Turtle metadata written to geodcat.ttl")


if __name__ == "__main__":
    with open("croissant.json", "r") as f:
        croissant = json.load(f)

    croissant_to_geodcat_jsonld(croissant, output_file="geodcat.jsonld")

GeoDCAT JSON-LD metadata written to geodcat.jsonld
GeoDCAT Turtle metadata written to geodcat.ttl


## Inspect GeoDCAT JSON-LD

We reload and pretty-print the generated RDF in JSON-LD format to verify key fields like:
- Dataset identifiers
- Distributions and access URLs
- Creator, license, and temporal coverage

In [4]:
from rdflib import Graph

# Load and print the GeoDCAT JSON-LD content
g = Graph()
g.parse("geodcat.jsonld", format="json-ld")
print(g.serialize(format="json-ld", indent=2))

[
  {
    "@id": "https://10.57967/hf/0956/distribution/repo/checksum",
    "@type": [
      "http://spdx.org/rdf/terms#Checksum"
    ],
    "http://spdx.org/rdf/terms#algorithm": [
      {
        "@value": "SHA256"
      }
    ],
    "http://spdx.org/rdf/terms#checksumValue": [
      {
        "@value": "https://github.com/mlcommons/croissant/issues/80"
      }
    ]
  },
  {
    "@id": "https://huggingface.co/ibm-nasa-geospatial",
    "@type": [
      "http://xmlns.com/foaf/0.1/Agent"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@value": "IBM-NASA Prithvi Models Family"
      }
    ]
  },
  {
    "@id": "https://10.57967/hf/0956/period",
    "@type": [
      "http://www.w3.org/ns/dcat#PeriodOfTime"
    ],
    "http://www.w3.org/ns/dcat#endDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2021-12-31"
      }
    ],
    "http://www.w3.org/ns/dcat#startDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date

## SHACL Validation of RDF Metadata

To ensure that the generated RDF complies with expected structure and semantics, we use `pyshacl` to validate the data graph using the generated `.ttl` file as shape graph.

This checks:
- Class and property constraints (e.g., `dcat:Dataset`, `dcat:Distribution`)
- Value types and cardinalities


In [5]:
from pyshacl import validate
from rdflib import Graph

# Load your GeoDCAT JSON-LD
data_graph = Graph()
data_graph.parse("geodcat.jsonld", format="json-ld")

# Load SHACL shape
shacl_graph = Graph()
shacl_graph.parse("geodcat.ttl", format="turtle")

# Validate
conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shacl_graph,
    inference='rdfs',
    abort_on_first=False,
    meta_shacl=False,
    debug=False,
)

print("Conforms:", conforms)
print(results_text)

Conforms: True
Validation Report
Conforms: True



## Full SHACL Validation Report (Turtle)

We print the detailed validation results in Turtle for debugging and verification purposes.

In [6]:
from pyshacl import validate
from rdflib import Graph

# Load your GeoDCAT JSON-LD
data_graph = Graph()
data_graph.parse("geodcat.jsonld", format="json-ld")

# Load SHACL shape
shacl_graph = Graph()
shacl_graph.parse("geodcat.ttl", format="turtle")

# Validate
conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shacl_graph,
    inference='rdfs',
    abort_on_first=False,
    meta_shacl=False,
    debug=False,
)

# Print summary
print("Conforms:", conforms)
print(results_text)

# Print full RDF report as Turtle
print("\n--- Full SHACL Validation Report (Turtle) ---")
print(results_graph.serialize(format="turtle"))

Conforms: True
Validation Report
Conforms: True


--- Full SHACL Validation Report (Turtle) ---
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a sh:ValidationReport ;
    sh:conforms true .




## List Distribution URLs

We use `rdflib` to extract all `dcat:distribution` and their `dcat:accessURL` values for visibility and audit.

In [7]:
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCAT

# Load the JSON-LD metadata
g = Graph()
g.parse("geodcat.jsonld", format="json-ld")

# Get all distributions and their access URLs
for s, p, o in g.triples((None, DCAT.distribution, None)):
    dist_uri = o
    access_url = g.value(dist_uri, DCAT.accessURL)
    print(f"Distribution: {dist_uri}")
    print(f"Access URL: {access_url}")

Distribution: https://10.57967/hf/0956/distribution/repo
Access URL: https://huggingface.co/datasets/ibm-nasa-geospatial/hls_burn_scars/tree/refs%2Fconvert%2Fparquet
Distribution: https://10.57967/hf/0956/distribution/parquet-files-for-config-hls_burn_scars
Access URL: https://example.org/data
