## Publish an ML Training Dataset on Radiant MLHub

<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>


In this tutorial, we will walk through the process of creating a self-contained STAC Catalog and its child Collections for the labels and source imagery in an example machine learning (ML) training dataset. We will then describe the process for getting the dataset ready for submission to [Radiant MLHub](https://mlhub.earth/) for review and publication.

For this example, we will use the sample training dataset from the [SpaceNet 6: Multi-Sensor All-Weather Mapping](https://spacenet.ai/sn6-challenge/) challenge.

### Setup

Let's start by installing `rio-stac` and importing the libraries we will use throughout the rest of the tutorial.

In [None]:
!pip install rio-stac==0.3.2

In [None]:
import enum
import os
import pathlib
import re
import shutil
import tarfile
import tempfile
import datetime as dt
from typing import List, Dict, Tuple

import pystac
import rasterio
from pystac.utils import str_to_datetime
from pystac.extensions.label import LabelExtension
from rio_stac.stac import create_stac_item
import geopandas as gpd
from pystac import (
    Catalog,
    Collection,
    Item,
    MediaType,
    Asset,
    Link,
    Extent,
    SpatialExtent,
    TemporalExtent,
    CatalogType,
)
from pystac.extensions.scientific import ScientificExtension
from shapely.geometry import Polygon, mapping

from pprint import PrettyPrinter

pp = PrettyPrinter(indent=2)

### Data Exploration

First, we will download the sample subset of training data provided by SpaceNet and extract the tar archive. This sample does not include the full set of labels for the dataset, but it will give us enough to work with for this example.

In [None]:
# Get the TMP directory for this system
tmp_dir = pathlib.Path(tempfile.gettempdir())

tar_url = (
    "https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings"
    "/tarballs/SN6_buildings_AOI_11_Rotterdam_train_sample.tar.gz"
)
tar_root = "https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/"
# tar_path = tmp_dir / "sample_data.tar.gz"
# data_dir = tmp_dir / "sample_data"
tar_path = tmp_dir / "SN6_buildings_AOI_11_Rotterdam_train_sample.tar.gz"
untar_path = tmp_dir / "SN6_buildings_AOI_11_Rotterdam_train_sample"
data_dir = tmp_dir / "spacenet_6_rotterdam"

If the archive `SN6_buildings_AOI_11_Rotterdam_train_sample.tar.gz` does not already exist in our temporary directory, we will download it using the `curl` command.

In [None]:
if tar_path.exists():
    print(f"File {tar_path} already exists, skipping download")
else:
    !curl {tar_url} -o {tar_path}

Next we extract the archive. To make the directory name more meaningful, we then rename the extracted directory to `spacenet_6_rotterdam`, which matches the ID of the catalog we create later.

In [None]:
if untar_path.exists():
    print("Data already extracted from archive; skipping extract.")
else:
    os.makedirs(untar_path)
    !tar -zxf {tar_path} -C {tmp_dir}

    if os.path.exists(untar_path):
        print(f"Extracted data to {untar_path}")

        os.makedirs(data_dir, exist_ok=True)
        !mv {untar_path}/* {data_dir}
        print(f"Renamed folder to {data_dir}")

        !rm -rf {untar_path}
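
The shell commands above rely on notebook magics. Outside a notebook, the same extract-and-rename step could be sketched with the standard library alone. The function name `extract_and_rename` is hypothetical, and the sketch assumes the archive contains exactly one top-level directory.

```python
import pathlib
import shutil
import tarfile


def extract_and_rename(archive_path, work_dir, final_name):
    """Extract a .tar.gz into work_dir and rename its single top-level folder.

    Assumes the archive has exactly one top-level directory.
    """
    work_dir = pathlib.Path(work_dir)
    final_dir = work_dir / final_name
    if final_dir.exists():
        # Mirror the notebook cell: skip the work if the data is already there
        return final_dir

    with tarfile.open(archive_path, "r:gz") as tar:
        top_levels = {
            pathlib.PurePosixPath(name).parts[0]
            for name in tar.getnames()
            if pathlib.PurePosixPath(name).parts
        }
        if len(top_levels) != 1:
            raise ValueError(f"Expected one top-level directory, found {top_levels}")
        tar.extractall(work_dir)

    extracted = work_dir / top_levels.pop()
    if extracted != final_dir:
        shutil.move(str(extracted), str(final_dir))
    return final_dir
```

Calling `extract_and_rename(tar_path, tmp_dir, "spacenet_6_rotterdam")` would then play the role of the cell above, provided the archive layout matches that assumption.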

Next, let's take a look at the directory structure within the sample data directory.

In [None]:
for root, _, files in os.walk(data_dir):
    print(root)
    if files:
        print("\t" + "\n\t".join(sorted(files)))

We can see from the directory layout that our sample data has a single AOI directory (`AOI_11_Rotterdam`), which in turn has sub-directories containing GeoJSON labels and various types of source imagery. Based on the naming convention of the files, we can guess that each GeoJSON label can be matched to the corresponding source imagery based on the filename. Furthermore, the last part of the filename (before `tile_*`) looks like a timestamp range, probably representing the datetime of the imagery capture.

For example, the `SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson` label could be applied to the pansharpened RGB imagery in `SN6_Train_AOI_11_Rotterdam_PS-RGB_20190804120223_20190804120456_tile_55.tif` or the SAR intensity data in `SN6_Train_AOI_11_Rotterdam_SAR-Intensity_20190804120223_20190804120456_tile_55.tif`.

Based on this observation, we can come up with a regular expression to capture the relevant parts of the label filename and use them to find different source images for those labels.

In [None]:
aoi_name = "AOI_11_Rotterdam"
aoi_dir = data_dir / aoi_name
os.chdir(data_dir)

labels_pattern = re.compile(
    r"^(?P<prefix>SN6_Train_AOI_11_Rotterdam)"
    r"_Buildings_"
    r"(?P<start_datetime>\d{14})"
    r"_"
    r"(?P<end_datetime>\d{14})"
    r"_tile_"
    r"(?P<tile>\d+)"
    r"\.geojson$"
)


class SourceType(str, enum.Enum):
    """Enumerates the possible source types."""

    RGBNIR = "RGBNIR"
    PS_RGBNIR = "PS-RGBNIR"
    SAR_Intensity = "SAR-Intensity"
    PAN = "PAN"
    PS_RGB = "PS-RGB"


def strip_meta_matches(label_path: str) -> Tuple[str, str, str, str]:
    """Uses the regex pattern above to extract relevant metadata from the filename"""
    label_path = pathlib.Path(os.fspath(label_path))
    match = labels_pattern.match(label_path.name)
    if match is None:
        # report the filename that failed to match
        raise ValueError(f"Invalid filename {label_path.name}")

    prefix = match.group("prefix")
    start_datetime = match.group("start_datetime")
    end_datetime = match.group("end_datetime")
    tile = match.group("tile")

    return prefix, start_datetime, end_datetime, tile


def get_source_info(label_path: str) -> List[Dict[str, str]]:
    """Gets a list of dictionaries, one per source type, each holding the relative
    href and metadata of the source image associated with the given label file path.
    """

    prefix, start_datetime, end_datetime, tile = strip_meta_matches(label_path)

    return [
        {
            # We will use relative paths when archiving the entire catalog with the dataset
            "href": f"{aoi_name}/{source_type.value}/{prefix}_{source_type.value}"
            f"_{start_datetime}_{end_datetime}_tile_{tile}.tif",
            "type": source_type.value,
            "start_datetime": start_datetime,
            "end_datetime": end_datetime,
        }
        for source_type in SourceType
    ]


def get_label_info(label_path: str) -> Dict[str, str]:
    """Gets the relative href and metadata attributes for the given label path"""

    prefix, start_datetime, end_datetime, tile = strip_meta_matches(label_path)

    return {
        "href": f"{aoi_name}/geojson_buildings/{prefix}_Buildings"
        f"_{start_datetime}_{end_datetime}_tile_{tile}.geojson",
        "type": "Buildings",
        "start_datetime": start_datetime,
        "end_datetime": end_datetime,
    }

For example, we can see what information our regex pattern above extracts from a GeoJSON label filename:

In [None]:
label = (
    "SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson"
)
source_info = get_source_info(label)
source_info

In [None]:
label_info = get_label_info(label)
label_info
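
As a quick check on the earlier guess that the two 14-digit fields are capture timestamps, the sketch below parses them with the standard library. The pattern is a simplified variant of `labels_pattern`, and the helper name `parse_capture_window` is hypothetical rather than part of the tutorial's code.

```python
import datetime as dt
import re

# Simplified, hypothetical variant of labels_pattern that accepts any product name
# (Buildings, PS-RGB, SAR-Intensity, ...) and either file extension.
FILENAME_PATTERN = re.compile(
    r"^SN6_Train_AOI_11_Rotterdam_(?P<product>[A-Za-z-]+)"
    r"_(?P<start>\d{14})_(?P<end>\d{14})_tile_(?P<tile>\d+)\.(?:tif|geojson)$"
)


def parse_capture_window(filename):
    """Return (product, start, end, tile) parsed from a SpaceNet 6 style filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Invalid filename {filename}")
    start = dt.datetime.strptime(match.group("start"), "%Y%m%d%H%M%S")
    end = dt.datetime.strptime(match.group("end"), "%Y%m%d%H%M%S")
    return match.group("product"), start, end, int(match.group("tile"))


product, start, end, tile = parse_capture_window(
    "SN6_Train_AOI_11_Rotterdam_SAR-Intensity_20190804120223_20190804120456_tile_55.tif"
)
# Show the product, the start of the window, its duration, and the tile number
print(product, start.isoformat(), end - start, tile)
```

Both fields parse as valid datetimes a few minutes apart, which is consistent with reading them as the start and end of an acquisition window.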

### Create Catalog Source Items

Since each of the sources for a given label covers the same spatial and temporal extents, we can combine them into a single STAC Item, with each source represented as a distinct Asset. We will create helper functions that let us build a STAC Item from just the label filename, based on the source imagery in our dataset directory.

In [None]:
def get_item_id(source_href: str, source_type: str, item_type: str) -> str:
    """Helper function to return the appropriate Item ID"""
    return (
        source_href.split("/")[-1]
        .replace(f"_{source_type}", "")
        .replace(".tif", f"_{item_type}")
        .replace(".geojson", f"_{item_type}")
    )
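
To make the ID convention concrete, the sketch below repeats the helper so it runs on its own and applies it to one source href and one label href built from the example filenames used earlier.

```python
def get_item_id(source_href, source_type, item_type):
    """Same logic as the helper above, repeated so this example is self-contained."""
    return (
        source_href.split("/")[-1]
        .replace(f"_{source_type}", "")
        .replace(".tif", f"_{item_type}")
        .replace(".geojson", f"_{item_type}")
    )


stem = "20190804120223_20190804120456_tile_55"

source_id = get_item_id(
    f"AOI_11_Rotterdam/PS-RGB/SN6_Train_AOI_11_Rotterdam_PS-RGB_{stem}.tif",
    "PS-RGB",
    "source",
)
label_id = get_item_id(
    f"AOI_11_Rotterdam/geojson_buildings/SN6_Train_AOI_11_Rotterdam_Buildings_{stem}.geojson",
    "Buildings",
    "labels",
)

print(source_id)
print(label_id)
```

The two IDs share the same stem and differ only in their `_source` or `_labels` suffix, which makes it easy to pair a label Item with its source Item by name.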

In [None]:
def create_source_item(label_path: str) -> Item:
    """Helper function that leverages rio-stac to create a STAC item
    from a source image Asset, and adds the rest of the images as Assets
    """
    sources = get_source_info(label_path)

    # we need the first source object
    first_source = sources[0]

    # rio-stac by default provides the filepath, so we override the item id
    item_id = get_item_id(first_source["href"], first_source["type"], "source")

    # Bootstrap the source item using rio-stac based on the first asset
    with rasterio.open(sources[0]["href"]) as src:

        item = create_stac_item(
            id=item_id,
            source=src,
            asset_name=first_source["type"],
            asset_roles=["data"],
            # Note that we use a single datetime here instead of the range from the filename
            input_datetime=str_to_datetime(first_source["start_datetime"]),
            with_proj=True,
        )

    # rio-stac does not add the Asset "type" or "title" fields, so we add them manually
    #  (all assets are Cloud-Optimized GeoTIFFs)
    item.assets[first_source["type"]].type = MediaType.COG
    item.assets[first_source["type"]].title = first_source["type"]

    # Since the spatiotemporal metadata is the same for all assets, we do not need to read
    # each one.
    for source in sources[1:]:
        asset = pystac.Asset.from_dict(
            {
                "href": source["href"],
                "roles": ["data"],
                "type": str(MediaType.COG),
                "title": source["type"],
            }
        )
        item.add_asset(source["type"], asset)

    return item

We can examine the output of our helper function `create_source_item` above to see that it has populated the required attributes for a generic source item. However, per the [STAC Item Specification](https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md), it is recommended to add more properties to the Item and its Assets, such as the [EOExtension](https://github.com/stac-extensions/eo) for electro-optical bands, e.g. RGB. For now we will stick with the core required properties for a source item.

In [None]:
source_item = create_source_item(label)
pp.pprint(source_item.to_dict())
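
As a rough illustration of the EO recommendation mentioned above, the sketch below adds an `eo:bands` list to one asset of an Item represented as a plain dictionary. The band list is an assumption for a three-band RGB asset and should be checked against the dataset documentation. The helper name `with_eo_bands` and the trimmed-down Item are hypothetical, and in practice you would likely use pystac's EO extension classes instead.

```python
import copy

EO_SCHEMA_URI = "https://stac-extensions.github.io/eo/v1.0.0/schema.json"

# Assumed band order for a 3-band RGB asset; verify against the dataset documentation
ASSUMED_RGB_BANDS = [
    {"name": "red", "common_name": "red"},
    {"name": "green", "common_name": "green"},
    {"name": "blue", "common_name": "blue"},
]


def with_eo_bands(item_dict, asset_key, bands):
    """Return a copy of a STAC Item dict with eo:bands set on one asset."""
    updated = copy.deepcopy(item_dict)
    extensions = updated.setdefault("stac_extensions", [])
    if EO_SCHEMA_URI not in extensions:
        extensions.append(EO_SCHEMA_URI)
    updated["assets"][asset_key]["eo:bands"] = [dict(band) for band in bands]
    return updated


# A trimmed-down Item dictionary, for illustration only
minimal_item = {
    "id": "example_source",
    "stac_extensions": [],
    "assets": {"PS-RGB": {"href": "AOI_11_Rotterdam/PS-RGB/example.tif"}},
}
eo_item = with_eo_bands(minimal_item, "PS-RGB", ASSUMED_RGB_BANDS)
print(eo_item["assets"]["PS-RGB"]["eo:bands"])
```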

### Create Catalog Label Items

Similar to the helper functions created above, we need some functions to more easily create a label STAC Item for the catalog.

In [None]:
def get_item_datetime(dt_str: str) -> dt.datetime:
    """Returns an item's datetime based on the ID string pattern"""
    return dt.datetime.strptime(str(dt_str), "%Y%m%d%H%M%S")

In [None]:
def get_geojson_extent(fname: str) -> Polygon:
    """Takes a path to GeoJSON vector file and returns
    the Polygon geometry for an Item reprojected
    """

    gdf = gpd.read_file(fname)
    gdf = gdf.to_crs("EPSG:4326")
    bounds = gdf.total_bounds
    geometry = Polygon(
        (
            (bounds[0], bounds[1]),
            (bounds[0], bounds[3]),
            (bounds[2], bounds[3]),
            (bounds[2], bounds[1]),
            (bounds[0], bounds[1]),
        )
    )
    return geometry

In [None]:
def add_label_extension(label: Item, label_meta: Dict[str, str]) -> Item:
    """This applies the STAC LabelExtension to the label item and related properties"""
    # apply the Label Extension
    label_ext = LabelExtension.ext(label, add_if_missing=True)

    label_ext.apply(
        label_description="SpaceNet 6 Building Footprints", label_type="vector"
    )

    # instantiate GeoJSON Asset
    asset = Asset(
        href=label_meta["href"],
        media_type=MediaType.GEOJSON,
    )

    # add GeoJSON Asset to item
    label.add_asset(key="buildings", asset=asset)

    return label

In [None]:
def create_label_item(label_path: str) -> Item:
    """Helper function that creates a STAC label item
    from a geojson label path and adds it as the Asset
    """
    label_meta = get_label_info(label_path)

    # derive the item id from the label filename
    item_id = get_item_id(label_meta["href"], label_meta["type"], "labels").replace(
        "_" + label_meta["type"], ""
    )
    item_geometry = get_geojson_extent(label_meta["href"])

    return add_label_extension(
        Item(
            id=item_id,
            datetime=get_item_datetime(label_meta["start_datetime"]),
            geometry=mapping(item_geometry),
            bbox=item_geometry.bounds,
            properties={},
        ),
        label_meta,
    )

In [None]:
def add_label_source_link(source: Item, label: Item) -> None:
    """Takes a 1:1 source to label item relationship,
    and adds the source link to the label Item in place
    """

    source_link = Link(rel="source", target=source, media_type=MediaType.COG)
    label.add_link(source_link)

Now we can examine the label Item output of our function `create_label_item` above after adding the source Item object reference to the Links in the label Item. This is a necessary step so that the label items can point to the appropriate source imagery Items and related Assets in our Catalog. 

In [None]:
label_item = create_label_item(label)
add_label_source_link(source_item, label_item)
pp.pprint(label_item.to_dict())

Similar to `EOExtension`, there are other best practices that can be employed when creating a STAC Item. For example, since this is a label Item, we could add `label:overviews`, `label:classes` and `file:values` properties to store more information about the labels and improve indexing of the Catalog:

* `label:overviews` summarizes the unique classes in the label file, using [Count Objects](https://github.com/stac-extensions/label#count-object) to record how often each class occurs
* `label:classes` is a list of all [Class Objects](https://github.com/stac-extensions/label#class-object) representing possible classes across the labels found in a dataset
* `file:values` can be used to store the [Mapping Object](https://github.com/stac-extensions/file#mapping-object) between numeric classification values and the descriptive string text equivalent 
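
As a minimal sketch of the first two properties, the function below derives `label:classes` and `label:overviews` values from a GeoJSON FeatureCollection held as a dictionary. The property key `class`, the sample features, and the function name `summarize_labels` are all hypothetical. The actual attribute names in the SpaceNet 6 label files may differ.

```python
from collections import Counter


def summarize_labels(feature_collection, property_key):
    """Build label:classes and label:overviews values for one property key."""
    values = [
        (feature.get("properties") or {})[property_key]
        for feature in feature_collection["features"]
        if property_key in (feature.get("properties") or {})
    ]
    counts = Counter(values)
    class_names = sorted(counts)

    label_classes = [{"name": property_key, "classes": class_names}]
    label_overviews = [
        {
            "property_key": property_key,
            "counts": [{"name": name, "count": counts[name]} for name in class_names],
        }
    ]
    return label_classes, label_overviews


# Made-up features, for illustration only (geometries omitted for brevity)
example_labels = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"class": "building"}, "geometry": None},
        {"type": "Feature", "properties": {"class": "building"}, "geometry": None},
        {"type": "Feature", "properties": None, "geometry": None},
    ],
}
label_classes, label_overviews = summarize_labels(example_labels, "class")
print(label_classes)
print(label_overviews)
```

The resulting lists could then be stored under `label:classes` and `label:overviews` in the properties of the label Item.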

### Define Catalog and Collection metadata properties

Now that we have all the helper functions in place to create both our source and label Items, we need to create the actual Catalog and its child Collections. There will be two Collections in this Catalog, one for the source imagery and one for the labels. Per the [STAC Collection Specification](https://github.com/radiantearth/stac-spec/tree/master/collection-spec), Collections group logically related Items and store the metadata they share. In this example, the first clear delineation between the Collections is that one set is raster source images in `.tif` files, while the other set is vector building footprints in `.geojson` files. The second is that the rasters are the source data while the vectors are the label data.

All of the metadata defined below, except for the Catalog and Collection names, comes from the [SpaceNet 6 Challenge](https://spacenet.ai/sn6-challenge/) webpage.

In [None]:
# catalog specific properties
catalog_id = "spacenet_6_rotterdam"
catalog_title = "SpaceNet Multi-Sensor All-Weather Mapping Challenge - Rotterdam"
catalog_description = """
In this challenge, the training dataset contained both SAR and EO imagery, however,
the testing and scoring datasets contained only SAR data. Consequently, the EO data
could be used for pre-processing the SAR data in some fashion, such as colorization,
domain adaptation, or image translation, but cannot be used to directly map buildings.
The dataset was structured to mimic real-world scenarios where historical EO data
may be available, but concurrent EO collection with SAR is often not possible due to
inconsistent orbits of the sensors, or cloud cover that will render the EO data unusable.
"""

We can create a barebones Catalog with the above required properties:

In [None]:
sn6_catalog = Catalog(
    id=catalog_id, title=catalog_title, description=catalog_description
)

In [None]:
# collection specific properties
source_collection_id = "spacenet_6_rotterdam_source"
source_collection_title = "SpaceNet 6 Rotterdam Source Imagery"

labels_collection_id = "spacenet_6_rotterdam_labels"
labels_collection_title = "SpaceNet 6 Rotterdam Labels"

citation = """
Shermeyer, J., Hogan, D., Brown, J., Etten, A.V., Weir, N., Pacifici, F.,
Hänsch, R., Bastidas, A., Soenen, S., Bacastow, T.M., & Lewis, R. (2020).
SpaceNet 6: Multi-Sensor All Weather Mapping Dataset. 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 768-777.
"""

license = "CC-BY-SA-4.0"

Next we define a helper function that assigns a default spatial and temporal extent to each Collection as it is created, since the extent is a required attribute. The extent can be defined manually if it is known up front, or it can be derived from the spatial and temporal attributes of the Items in each Collection using the `Collection.update_extent_from_items` method, as seen below.

In [None]:
def get_default_extent():
    """Returns a default spatial and temporal Extent STAC object"""
    # default spatial extent is the entire globe
    default_spatial_extent = SpatialExtent([[-180, -90, 180, 90]])

    # default temporal extent starts at the current date and is open-ended
    default_temporal_extent = TemporalExtent([[dt.datetime.now(dt.timezone.utc), None]])

    return Extent(default_spatial_extent, default_temporal_extent)

In [None]:
def create_collection(id, description, license, citation):
    """Creates a skeleton Collection with required properties"""
    collection = Collection(
        id=id, license=license, extent=get_default_extent(), description=description
    )

    sci_ext = ScientificExtension.ext(collection, add_if_missing=True)
    sci_ext.apply(citation=citation)

    return collection

In [None]:
sn6_source_collection = create_collection(
    source_collection_id, source_collection_title, license, citation
)

In [None]:
sn6_labels_collection = create_collection(
    labels_collection_id, labels_collection_title, license, citation
)

### Iteratively add items to Source and Label Collections

There are many ways to do this next step, but given our dataset is so small, we can just use a non-parallelized iterative loop to create the related source and label items at the same time, and then add them to their respective Collections.

In [None]:
label_paths = [
    f for f in os.listdir(aoi_dir / "geojson_buildings") if f.endswith("geojson")
]

In [None]:
for label_path in label_paths:
    # get the geojson label filename
    label_filename = label_path.split("/")[-1]
    print(f"Creating source and label items from {label_filename}")

    # create the source and label items for a given label path
    source_item = create_source_item(label_filename)
    label_item = create_label_item(label_filename)

    # add the source link to label item
    add_label_source_link(source_item, label_item)

    # add the source and label items to collections
    sn6_source_collection.add_item(source_item)
    sn6_labels_collection.add_item(label_item)

In [None]:
sn6_source_collection.update_extent_from_items()
sn6_labels_collection.update_extent_from_items()
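
Conceptually, updating the extent from the Items amounts to taking the union of their bounding boxes and the earliest and latest of their datetimes. The sketch below illustrates that idea with plain tuples. It is not pystac's implementation, the function name `union_extent` is hypothetical, and the coordinates and times are made up.

```python
import datetime as dt


def union_extent(items):
    """Combine (bbox, datetime) pairs into one bounding box and time interval.

    Each bbox is (west, south, east, north) in degrees.
    """
    items = list(items)
    if not items:
        raise ValueError("at least one item is required")
    wests, souths, easts, norths = zip(*(bbox for bbox, _ in items))
    times = [when for _, when in items]
    bbox = (min(wests), min(souths), max(easts), max(norths))
    return bbox, (min(times), max(times))


# Made-up footprints and capture times, for illustration only
example_items = [
    ((4.00, 51.80, 4.10, 51.90), dt.datetime(2019, 8, 4, 12, 2, 23)),
    ((4.05, 51.85, 4.20, 51.95), dt.datetime(2019, 8, 4, 13, 0, 0)),
]
bbox, interval = union_extent(example_items)
print(bbox, interval)
```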

### Add children Collections to Catalog

With all the Items added to the source and labels Collections, we can add the two Collections as children of the Catalog.

In [None]:
sn6_catalog.add_children([sn6_source_collection, sn6_labels_collection])
sn6_catalog.describe()

### Normalize Links, validate Catalog and save to file

The last few steps in creating the Catalog are normalizing all of the links between the related Items and Collections, validating that the result is a valid STAC Catalog, and then saving it as JSON files in our temporary `spacenet_6_rotterdam` directory.

In [None]:
sn6_catalog.normalize_hrefs(data_dir.as_posix())

In [None]:
sn6_catalog.validate_all()

In [None]:
sn6_catalog.save(catalog_type=CatalogType.SELF_CONTAINED)
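
A self-contained catalog is meant to be portable, so the links in its JSON files should be relative. One way to spot-check that after saving is sketched below using only the standard library. The function name `find_absolute_links` is hypothetical, and the check covers only the `links` arrays, not asset hrefs.

```python
import json
import pathlib
from urllib.parse import urlparse


def find_absolute_links(catalog_dir):
    """Return (file name, rel, href) for every link whose href is not relative."""
    problems = []
    for json_path in sorted(pathlib.Path(catalog_dir).rglob("*.json")):
        document = json.loads(json_path.read_text())
        if not isinstance(document, dict):
            continue
        for link in document.get("links", []):
            href = link.get("href", "")
            # A URL scheme or a leading slash means the link is not relative
            if urlparse(href).scheme or href.startswith("/"):
                problems.append((json_path.name, link.get("rel"), href))
    return problems
```

Running `find_absolute_links(data_dir)` on the saved catalog should ideally return an empty list. Any entries it does return point at links worth inspecting by hand.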

### Compress catalog with dataset source images and labels into single archive

The very last step in the Catalog creation process before submitting to Radiant MLHub is compressing the entire archive we just created, so that we have a self-contained catalog bundled with all the source imagery and label files together in a single place. This will speed up processing for the Radiant team downstream.

In [None]:
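def create_tar_gz(archive_name, target_dir):
    """Compresses target_dir into a gzipped tar archive"""
    with tarfile.open(archive_name, "w:gz") as tar:
        # store paths relative to the dataset directory rather than the full temp path
        tar.add(target_dir, arcname=os.path.basename(target_dir))
    print(f"Archive file {archive_name} created")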

In [None]:
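# NOTE: this path is specific to the environment this notebook was written for;
# adjust it to a writable working directory on your system
os.chdir("/home/jovyan/PlanetaryComputerExamples/tutorials")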

In [None]:
output_archive_filename = f"{data_dir.name}.tar.gz"
create_tar_gz(output_archive_filename, data_dir.as_posix())
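
Before sharing the archive, it can be worth confirming that it contains the catalog JSON files alongside the imagery and labels. The sketch below counts the files in a `.tar.gz` archive by extension. The function name `count_archive_files` is hypothetical.

```python
import pathlib
import tarfile
from collections import Counter


def count_archive_files(archive_path):
    """Count the regular files in a .tar.gz archive, grouped by file extension."""
    with tarfile.open(archive_path, "r:gz") as tar:
        file_names = [member.name for member in tar.getmembers() if member.isfile()]
    return Counter(pathlib.PurePosixPath(name).suffix for name in file_names)
```

For the archive created above, `count_archive_files(output_archive_filename)` should report `.json`, `.tif` and `.geojson` entries if the catalog, source imagery and labels were all bundled.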

### Submit to Radiant MLHub

Now that the archive of your dataset and the Catalog has been created, you should see the tar file titled `spacenet_6_rotterdam.tar.gz` in the file browser to the left. You would need to generate a similar archive for your own dataset if you want to publish it on [Radiant MLHub](https://mlhub.earth/). This is the file you will share with the Radiant Earth engineering team to streamline the process of publishing your dataset to Radiant MLHub.

To start the process, go to the [Contribute](https://mlhub.earth/contribute) page on the Radiant MLHub website and click on General Dataset Inquiry Form (you need to create an account on Radiant MLHub to access this page). Submit the form with as much detail as possible. This will automatically notify the Radiant team of your request. When we're ready to process and ingest your dataset, we will ask that you share this archive file with us on a cloud storage solution, such as Azure, AWS, Google Cloud/Drive or Dropbox.

### Garbage Cleanup

The following commands simply clean up the instance environment by removing the archive files and directories you created in this notebook. Running them is optional; however, note that anything kept in the `tmp` directory will be flushed when the notebook server instance is shut down. Therefore, make sure to back up or download any files you wish to keep.

In [None]:
# Remove the downloaded archive and the extracted dataset directory
tar_path.unlink(missing_ok=True)
shutil.rmtree(data_dir, ignore_errors=True)