# GeoCroissant to STAC Conversion

<img src="GeoCroissant.jpg" alt="GeoCroissant" width="150" style="float: right; margin-left: 50px;">


This notebook demonstrates how to convert **GeoCroissant metadata** — a geospatial extension of the Croissant metadata format — into a **STAC (SpatioTemporal Asset Catalog) Item**.

### GeoCroissant includes:
- Spatial information (geometry, bounding boxes)
- Temporal coverage
- Dataset structure (distributions, record sets)

### By converting GeoCroissant to STAC:
- Datasets become interoperable with geospatial tools and catalogs.
- Metadata is structured using the **STAC specification**, enabling better discovery and analysis of spatial datasets.

### We use Python and `pystac` to:
- Parse the GeoCroissant JSON.
- Create a valid STAC Item with spatial and temporal context.
- Add assets and structured data (e.g., imagery, annotations).
- Save and validate the result with `stac-validator`.
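
The input file itself is not shown in this notebook, so as a rough illustration of the parsing step, here is a hypothetical, heavily trimmed GeoCroissant-style record and the fields the conversion later reads from it. All values (`example-dataset`, the `example.org` URL) are placeholders, not the contents of the real `croissant.json`, and `list_assets` is an illustrative helper. Only the key names (`name`, `description`, `license`, `distribution`, `contentUrl`, `encodingFormat`) are taken from the code in this notebook.

```python
import json

# Hypothetical, trimmed-down record; every value is a placeholder.
SAMPLE_CROISSANT = """
{
  "name": "example-dataset",
  "description": "Example imagery for the years 2018-2021.",
  "license": "https://choosealicense.com/licenses/cc-by-4.0/",
  "distribution": [
    {
      "name": "parquet files",
      "contentUrl": "https://example.org/data.parquet",
      "encodingFormat": "application/x-parquet"
    },
    {"name": "no-url-entry"}
  ]
}
"""

def list_assets(metadata):
    """Return (key, href, media_type) for each distribution entry that has a URL."""
    assets = []
    for dist in metadata.get("distribution", []):
        href = dist.get("contentUrl")
        if not href:
            continue  # entries without a URL cannot become STAC assets
        key = dist.get("name", "asset").replace(" ", "_")
        assets.append((key, href, dist.get("encodingFormat")))
    return assets

metadata = json.loads(SAMPLE_CROISSANT)
assets = list_assets(metadata)
print(assets)
```

Each tuple corresponds to one STAC asset: the key, its `href`, and its media type.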

### Field mapping: STAC vs GeoCroissant

| **STAC Field**             | **GeoCroissant Field**      | **Notes**                                                               |
|----------------------------|------------------------------|-------------------------------------------------------------------------|
| `id`                       | `@id`                        | Unique identifier for the dataset/item                                 |
| `type`                     | `@type`                      | Always `"Feature"` for a STAC Item                                     |
| `title`                    | `name`                       | Title of the dataset                                                   |
| `description`              | `description`                | Dataset description                                                    |
| `datetime`                 | `dct:temporal`               | Temporal coverage (inferred or computed)                               |
| `bbox`                     | `geocr:BoundingBox`          | Spatial extent (bounding box)                                          |
| `geometry`                 | `geocr:Geometry`             | Full spatial geometry (GeoJSON)                                        |
| `assets`                   | `distribution`               | Resources related to the dataset                                       |
| `assets[<key>].href`       | `contentUrl`                 | Link to the data asset                                                 |
| `assets[<key>].type`       | `encodingFormat`             | Media type of the asset (e.g., `image/png`, `application/parquet`)     |
| `properties["datetime"]`   | *N/A*                        | Typically midpoint of the date range                                   |
| `properties["spatial"]`    | *N/A*                        | Not standardized in GeoCroissant; often inferred manually              |
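
The `bbox` and `geometry` rows are closely related: when only a GeoJSON geometry is available, the bounding box can be derived from it. The following is a minimal sketch under simplifying assumptions (coordinates are longitude/latitude, the geometry does not cross the antimeridian, and it is not a `GeometryCollection`). The helper names `iter_positions` and `bbox_from_geometry` are illustrative and not part of `pystac`.

```python
def iter_positions(coords):
    """Yield every [x, y, ...] position from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords  # a single position, e.g. [-125.0, 24.0]
    else:
        for part in coords:
            yield from iter_positions(part)

def bbox_from_geometry(geometry):
    """Return [min_x, min_y, max_x, max_y] for a GeoJSON geometry dict."""
    xs, ys = [], []
    for position in iter_positions(geometry["coordinates"]):
        xs.append(position[0])
        ys.append(position[1])
    return [min(xs), min(ys), max(xs), max(ys)]

# Rectangle over the contiguous United States, as used later in this notebook
conus = {
    "type": "Polygon",
    "coordinates": [[
        [-125.0, 24.0],
        [-125.0, 50.0],
        [-66.0, 50.0],
        [-66.0, 24.0],
        [-125.0, 24.0],
    ]],
}
bbox = bbox_from_geometry(conus)
print(bbox)  # [-125.0, 24.0, -66.0, 50.0]
```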


# Install Required Libraries

In [1]:
!pip install pystac
!pip install stac-validator



# Basic GeoCroissant to STAC Conversion

In [2]:
import json
import re
from datetime import datetime
from pystac import Item, Asset

def croissant_to_stac_item(croissant_json):
    if isinstance(croissant_json, str):
        metadata = json.loads(croissant_json)
    else:
        metadata = croissant_json

    # Metadata fields
    item_id = metadata.get("identifier", metadata.get("name", "unknown-id"))
    title = metadata.get("name", "")
    description = metadata.get("description", "")
    license_link = metadata.get("license", "proprietary")
    keywords = metadata.get("keywords", [])
    dataset_url = metadata.get("url", "")
    alternate_names = metadata.get("alternateName", [])

    creator = metadata.get("creator", {})
    creator_name = creator.get("name") if isinstance(creator, dict) else None

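    # Temporal coverage: inferred from the first pair of four-digit numbers in
    # the description (e.g. "2018-2021"), so unrelated numbers can mislead it.
    date_match = re.search(r"(\d{4})\D+(\d{4})", description)
    if date_match:
        start_year, end_year = int(date_match.group(1)), int(date_match.group(2))
        start_datetime = datetime(start_year, 1, 1)
        end_datetime = datetime(end_year, 12, 31)
        # Approximate midpoint: mid-year of the average of the two years
        midpoint_datetime = datetime((start_year + end_year) // 2, 6, 30)
    else:
        # No year range found: fall back to an arbitrary placeholder date
        start_datetime = end_datetime = midpoint_datetime = datetime(2020, 1, 1)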


    item = Item(
        id=item_id,
        geometry=None,
        bbox=None,
        datetime=midpoint_datetime,
        properties={
            "start_datetime": start_datetime.isoformat() + "Z",
            "end_datetime": end_datetime.isoformat() + "Z",
            "title": title,
            "description": description,
            "keywords": keywords,
            "license": license_link,
            "alternate_names": alternate_names,
            "creator": creator_name,
            "dataset_url": dataset_url,
        }
    )

    for dist in metadata.get("distribution", []):
        asset_id = dist.get("name", "asset").replace(" ", "_")
        href = dist.get("contentUrl")
        media_type = dist.get("encodingFormat")
        asset_title = dist.get("description", asset_id)

        if href:
            asset = Asset(
                href=href,
                media_type=media_type,
                title=asset_title
            )
            item.add_asset(asset_id, asset)


    stac_dict = item.to_dict()
    print(json.dumps(stac_dict, indent=2))
    return item

if __name__ == "__main__":
    with open("croissant.json", "r") as f:
        croissant_data = json.load(f)

    croissant_to_stac_item(croissant_data)

{
  "type": "Feature",
  "stac_version": "1.1.0",
  "stac_extensions": [],
  "id": "10.57967/hf/0956",
  "geometry": null,
  "properties": {
    "start_datetime": "2018-01-01T00:00:00Z",
    "end_datetime": "2021-12-31T00:00:00Z",
    "title": "hls_burn_scars",
    "description": "This dataset contains Harmonized Landsat and Sentinel-2 imagery of burn scars and the associated masks for the years 2018-2021 over the contiguous United States. There are 804 512x512 scenes. Its primary purpose is for training geospatial machine learning models.",
    "keywords": [
      "English",
      "cc-by-4.0",
      "1K - 10K",
      "Image",
      "Datasets",
      "Croissant",
      "doi:10.57967/hf/0956",
      "\ud83c\uddfa\ud83c\uddf8 Region: US"
    ],
    "license": "https://choosealicense.com/licenses/cc-by-4.0/",
    "alternate_names": [
      "ibm-nasa-geospatial/hls_burn_scars"
    ],
    "creator": "IBM-NASA Prithvi Models Family",
    "dataset_url": "https://huggingface.co/datasets/ibm-na

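
The year-range regex in the cell above accepts any two four-digit numbers separated by non-digits, so a description such as "1024 tiles of 2048 pixels" would be read as the years 1024 to 2048. As one possible tightening, the sketch below only accepts years starting with 19 or 20 that are joined by a hyphen, an en dash, or the word "to", and it computes the midpoint as the true halfway point of the interval rather than the middle of the average year. The name `temporal_extent`, the pattern, and the choice to return `None` instead of a placeholder date are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone

# Two plausible years joined by "-", an en dash (\u2013), or "to"
YEAR_RANGE = re.compile(r"\b((?:19|20)\d{2})\s*(?:[-\u2013]|to)\s*((?:19|20)\d{2})\b")

def temporal_extent(description):
    """Return (start, end, midpoint) as UTC datetimes, or None if no year range is found."""
    match = YEAR_RANGE.search(description)
    if not match:
        return None
    start_year, end_year = int(match.group(1)), int(match.group(2))
    if start_year > end_year:
        return None  # reversed range, treat as "not found"
    start = datetime(start_year, 1, 1, tzinfo=timezone.utc)
    end = datetime(end_year, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    midpoint = start + (end - start) / 2  # halfway through the interval
    return start, end, midpoint

result = temporal_extent(
    "imagery for the years 2018-2021 over the contiguous United States. "
    "There are 804 512x512 scenes."
)
print(result)
```

For the 2018-2021 range the midpoint falls on 1 January 2020, whereas the year-averaging rule in the cell above gives 30 June 2019.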
# Refined GeoCroissant to STAC Conversion 
# STAC Item Generator

In [3]:
import json
from datetime import datetime
from pystac import Item, Asset, MediaType
from pystac.extensions.table import TableExtension

# License mapping from URL
KNOWN_LICENSES = {
    "https://choosealicense.com/licenses/cc-by-4.0/": "CC-BY-4.0",
    "https://opensource.org/licenses/mit": "MIT",
    "https://www.apache.org/licenses/license-2.0": "Apache-2.0",
    "cc-by-4.0": "CC-BY-4.0",
}

def croissant_to_stac_item(croissant_json, output_path=None):
    """Convert Croissant metadata to STAC Item."""
    if isinstance(croissant_json, str):
        metadata = json.loads(croissant_json)
    else:
        metadata = croissant_json

    # Extract basic metadata
    item_id = metadata.get("identifier", metadata.get("name", "unknown-id")).replace("/", "_")
    title = metadata.get("name", "")
    description = metadata.get("description", "")
    license_raw = metadata.get("license", "proprietary")
    keywords = metadata.get("keywords", [])
    dataset_url = metadata.get("url", "")
    alternate_names = metadata.get("alternateName", [])

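    # Normalize license: look up known URLs/ids, upper-case bare "cc-by-*" ids
    # (e.g. "cc-by-sa-4.0" -> "CC-BY-SA-4.0"), otherwise fall back to "proprietary"
    license_key = str(license_raw).strip().lower()
    if license_key in KNOWN_LICENSES:
        license_normalized = KNOWN_LICENSES[license_key]
    elif license_key.startswith("cc-by"):
        license_normalized = license_key.upper()
    else:
        license_normalized = "proprietary"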

    # Handle creator information
    creator = metadata.get("creator", {})
    if isinstance(creator, list):
        creator = creator[0] if creator else {}
    creator_name = creator.get("name", "Unknown") if isinstance(creator, dict) else str(creator)
    creator_url = creator.get("url", "") if isinstance(creator, dict) else ""

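    # Temporal coverage: hard-coded for hls_burn_scars (2018-2021, per its
    # description), not parsed from the metadata in this version
    start_datetime = datetime(2018, 1, 1)
    end_datetime = datetime(2021, 12, 31)
    midpoint_datetime = datetime(2019, 6, 30)  # approximate midpoint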

    # Create STAC Item
    item = Item(
        id=item_id,
        geometry={
            "type": "Polygon",
            "coordinates": [[
                [-125.0, 24.0],  # SW
                [-125.0, 50.0],   # NW
                [-66.0, 50.0],    # NE
                [-66.0, 24.0],    # SE
                [-125.0, 24.0]    # SW (close polygon)
            ]]
        },
        bbox=[-125.0, 24.0, -66.0, 50.0],  # CONUS bbox
        datetime=midpoint_datetime,
        properties={
            "title": title,
            "description": description,
            "license": license_normalized,
            "start_datetime": start_datetime.isoformat() + "Z",
            "end_datetime": end_datetime.isoformat() + "Z",
            "keywords": keywords,
            "msft:region": "US",
            "msft:short_description": "HLS burn scars imagery and masks for US (2018-2021)",
            "providers": [{
                "name": creator_name,
                "roles": ["producer"],
                "url": creator_url
            }]
        }
    )

    # Declare the table extension. The core Item schema URL is not an
    # extension, so it does not belong in stac_extensions.
    item.stac_extensions.append(
        "https://stac-extensions.github.io/table/v1.2.0/schema.json"
    )

    # Add only the actual assets from Croissant distribution
    for dist in metadata.get("distribution", []):
        href = dist.get("contentUrl")
        if not href:
            continue
            
        asset_id = dist.get("@id", dist.get("name", "asset")).replace(" ", "_").lower()
        media_type = dist.get("encodingFormat", MediaType.JSON)
        desc = dist.get("description", asset_id)
        
        # Determine media type
        if "parquet" in asset_id or "parquet" in media_type:
            media_type = MediaType.PARQUET
        elif "git" in href:
            media_type = "application/git"

        item.add_asset(
            asset_id,
            Asset(
                href=href,
                media_type=media_type,
                title=desc,
                roles=["metadata"] if "git" in href else ["data"]
            )
        )

    # Add documentation asset (skipped when the Croissant record has no "url")
    if dataset_url:
        item.add_asset(
            "documentation",
            Asset(
                href=dataset_url,
                title="Dataset Documentation",
                media_type=MediaType.HTML,
                roles=["metadata", "documentation"]
            )
        )

    # Process record sets to add table schema
    for record_set in metadata.get("recordSet", []):
        if record_set.get("@id") == "hls_burn_scars":
            TableExtension.ext(item, add_if_missing=True).columns = [
                {
                    "name": "image",
                    "type": "binary",
                    "description": "Harmonized Landsat and Sentinel-2 imagery"
                },
                {
                    "name": "annotation",
                    "type": "binary",
                    "description": "Associated burn scar annotations"
                },
                {
                    "name": "split",
                    "type": "string",
                    "description": "Dataset split (train/validation/test)"
                }
            ]

    # Optionally save to disk; always return the item as a dictionary
    if output_path:
        item.save_object(dest_href=output_path)
        print(f"STAC item saved to {output_path}")
    return item.to_dict()

if __name__ == "__main__":
    # Example usage
    with open("croissant.json", "r") as f:
        croissant_data = json.load(f)

    stac_item = croissant_to_stac_item(croissant_data, output_path="stac_item.json")

STAC item saved to stac_item.json
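
The license normalization in the cell above assumes that `license` is a single string. As a hedged sketch of a slightly more defensive variant, the helper below also accepts a list of licenses (taking the first), ignores a trailing slash, and returns the default for anything it does not recognize. The function name `normalize_license` and its `default` parameter are hypothetical; the lookup table repeats the entries used in this notebook, with the trailing slash removed from the first key.

```python
KNOWN_LICENSES = {
    "https://choosealicense.com/licenses/cc-by-4.0": "CC-BY-4.0",
    "https://opensource.org/licenses/mit": "MIT",
    "https://www.apache.org/licenses/license-2.0": "Apache-2.0",
    "cc-by-4.0": "CC-BY-4.0",
}

def normalize_license(raw, default="proprietary"):
    """Map a Croissant license value to a short identifier, or return the default."""
    if isinstance(raw, (list, tuple)):
        raw = raw[0] if raw else None  # assumption: the first entry is the primary license
    if not isinstance(raw, str):
        return default
    key = raw.strip().lower().rstrip("/")  # case- and trailing-slash-insensitive lookup
    return KNOWN_LICENSES.get(key, default)

print(normalize_license("https://choosealicense.com/licenses/cc-by-4.0/"))  # CC-BY-4.0
print(normalize_license(None))  # proprietary
```

Unlike the cell above, this variant does not upper-case unknown `cc-by-*` identifiers; whether that shortcut is wanted depends on how strictly the identifiers should be checked.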


# Validation

In [4]:
!stac-validator stac_item.json


Thanks for using STAC version 1.1.0!

[
    {
        "version": "1.1.0",
        "path": "stac_item.json",
        "schema": [
            "https://stac-extensions.github.io/table/v1.2.0/schema.json",
            "https://schemas.stacspec.org/v1.1.0/item-spec/json-schema/item.json"
        ],
        "valid_stac": true,
        "asset_type": "ITEM",
        "validation_method": "default"
    }
]
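
Before running the full validator, a quick structural check can catch the most common mistakes. The sketch below is not a substitute for `stac-validator`: it tests only a handful of rules from the STAC Item specification (the required top-level fields, `type` being `"Feature"`, `bbox` being present when `geometry` is not null, and the `datetime` rules). The function name `quick_item_check` is hypothetical.

```python
REQUIRED_ITEM_FIELDS = (
    "type", "stac_version", "id", "geometry", "properties", "links", "assets",
)

def quick_item_check(item_dict):
    """Return a list of problems found; an empty list means none were detected."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_ITEM_FIELDS
        if name not in item_dict
    ]
    if item_dict.get("type") != "Feature":
        problems.append('type must be "Feature"')
    if item_dict.get("geometry") is not None and "bbox" not in item_dict:
        problems.append("bbox is required when geometry is not null")
    props = item_dict.get("properties") or {}
    if "datetime" not in props:
        problems.append("properties.datetime is required (it may be null)")
    elif props["datetime"] is None and not (
        "start_datetime" in props and "end_datetime" in props
    ):
        problems.append("null datetime requires start_datetime and end_datetime")
    return problems

# Minimal hand-written item used only to exercise the check
example_item = {
    "type": "Feature",
    "stac_version": "1.1.0",
    "id": "example",
    "geometry": None,
    "properties": {"datetime": "2019-06-30T00:00:00Z"},
    "links": [],
    "assets": {},
}
print(quick_item_check(example_item))  # []
```

To check the saved file, load it with `json.load` and pass the resulting dictionary to the function.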
