## Publish an ML Training Dataset on Radiant MLHub

<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>


In this tutorial, we will walk through the process of creating STAC Collections for the labels and source imagery in an example machine learning (ML) training dataset. We will then describe the process for submitting this dataset to [Radiant MLHub](https://mlhub.earth/) for publication.

For this example, we will use the sample training dataset from the [SpaceNet 6: Multi-Sensor All-Weather Mapping Challenge](https://spacenet.ai/sn6-challenge/).

### Setup

Let's start by installing `rio-stac` and importing the libraries we will use throughout the rest of the tutorial.

In [1]:
!pip install rio-stac==0.3.2



In [18]:
import enum
import os
import pathlib
import re
import shutil
import tarfile
import tempfile
import urllib.parse

import pystac
import rasterio
from pystac.utils import str_to_datetime
from pystac.extensions.eo import Band, EOExtension
from rio_stac.stac import create_stac_item

### Data Exploration

First, we will download the sample subset of training data provided by SpaceNet and extract the tar archive. This sample does not include the full set of labels for the dataset, but it will give us enough to work with for this example.

In [3]:
# Get the TMP directory for this system
tmp_dir = pathlib.Path(tempfile.gettempdir())

tar_url = "https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/tarballs/SN6_buildings_AOI_11_Rotterdam_train_sample.tar.gz"
tar_path = tmp_dir / "sample_data.tar.gz"
data_dir = tmp_dir / "sample_data"

if tar_path.exists():
    print(f"File {tar_path} already exists, skipping download")
else:
    !curl {tar_url} -o {tar_path}
    
if data_dir.exists():
    print("Data already extracted from archive; skipping extract.")
else:
    os.makedirs(data_dir, exist_ok=True)
    !tar -xzf {tar_path} -C {tmp_dir} --transform s/SN6_buildings_AOI_11_Rotterdam_train_sample/{data_dir.name}/
    print(f"Extracted data to {data_dir}")

File /tmp/sample_data.tar.gz already exists, skipping download
Data already extracted from archive; skipping extract.
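
The `!curl` and `!tar --transform` commands above rely on shell tools, and `--transform` is specific to GNU tar. As a hedged alternative, the sketch below does the extract-and-rename step with the standard-library `tarfile` module. The helper name `extract_with_renamed_root` is hypothetical and not part of the original tutorial, and the sketch assumes the archive has a single top-level folder and contains no hard links.

```python
import pathlib
import tarfile


def extract_with_renamed_root(tar_path, dest_dir, old_root, new_root):
    """Extract a .tar.gz into dest_dir, renaming the top-level folder.

    Hypothetical helper: entries under `old_root` are written under
    `new_root` instead; all other entries keep their original names.
    """
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    with tarfile.open(tar_path, "r:gz") as tar:
        members = []
        for member in tar.getmembers():
            parts = pathlib.PurePosixPath(member.name).parts
            # Only rewrite entries that live under the expected top-level folder
            if parts and parts[0] == old_root:
                member.name = str(pathlib.PurePosixPath(new_root, *parts[1:]))
            members.append(member)
        tar.extractall(path=dest_dir, members=members)

    return dest_dir / new_root
```

With the variables defined above, a call might look like `extract_with_renamed_root(tar_path, tmp_dir, "SN6_buildings_AOI_11_Rotterdam_train_sample", data_dir.name)`.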


Next, let's take a look at the directory structure within the sample data directory.

In [4]:
for root, _, files in os.walk(data_dir):
    print(root)
    if files:
        print("\t" + "\n\t".join(sorted(files)))

/tmp/sample_data
/tmp/sample_data/AOI_11_Rotterdam
/tmp/sample_data/AOI_11_Rotterdam/SummaryData
	SN6_TrainSample_AOI_11_Rotterdam_Buildings.csv
/tmp/sample_data/AOI_11_Rotterdam/geojson_buildings
	SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_69.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190804133131_20190804133356_tile_783.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190822075219_20190822075510_tile_8137.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190822082538_20190822082826_tile_4164.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190822091156_20190822091502_tile_108.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190823082625_20190823082938_tile_442.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190823091132_20190823091448_tile_7924.geojson
	SN6_Train_AOI_11_Rotterdam_Buildings_20190823123151_20190823123459_tile_2317.geojson
	SN6_Train_AOI_11_Rotterdam_Building

We can see from the directory layout that our sample data has a single AOI directory (`AOI_11_Rotterdam`), which in turn has sub-directories containing GeoJSON labels and various types of source imagery. Based on the naming convention of the files, we can guess that each GeoJSON label can be matched to the corresponding source imagery based on the filename. Furthermore, the last part of the filename (before `tile_*`) looks like a timestamp range, probably representing the datetime of the imagery capture.

For example, the `SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson` label could be applied to the pansharpened RGB imagery in `SN6_Train_AOI_11_Rotterdam_PS-RGB_20190804120223_20190804120456_tile_55.tif` or the SAR intensity data in `SN6_Train_AOI_11_Rotterdam_SAR-Intensity_20190804120223_20190804120456_tile_55.tif`.
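
As a quick illustration of this naming convention, the sketch below derives a source image filename from a label filename with plain string substitution. It assumes the only differences between the two names are the `Buildings` token and the file extension, which holds for the example above but is not guaranteed for every file. The function name `source_name_for` is hypothetical.

```python
import pathlib


def source_name_for(label_name, source_type):
    """Swap the "Buildings" token for a source type and use a .tif extension."""
    stem = pathlib.PurePath(label_name).stem
    # Replace only the first occurrence so the rest of the name is untouched
    return stem.replace("_Buildings_", f"_{source_type}_", 1) + ".tif"


label_name = (
    "SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson"
)
print(source_name_for(label_name, "PS-RGB"))
# SN6_Train_AOI_11_Rotterdam_PS-RGB_20190804120223_20190804120456_tile_55.tif
```

Plain substitution like this does not validate the filename or expose its individual parts, such as the timestamps and tile number.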

Based on this observation, we can come up with a regular expression to capture the relevant parts of the label filename and use them to find different source images for those labels.

In [21]:
aoi_name = "AOI_11_Rotterdam"
aoi_dir = data_dir / aoi_name

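# Match any label file in this AOI: two 14-digit timestamps followed by a numeric tile ID.
# Raw strings keep the backslashes in the pattern from being treated as string escapes.
labels_pattern = re.compile(
    r"^(?P<prefix>SN6_Train_AOI_11_Rotterdam)"
    r"_Buildings_"
    r"(?P<start_datetime>\d{14})"
    r"_"
    r"(?P<end_datetime>\d{14})"
    r"_tile_"
    r"(?P<tile>\d+)"
    r"\.geojson$"
)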

class SourceType(str, enum.Enum):
    """Enumerates the possible source types.
    """
    RGBNIR = "RGBNIR"
    PS_RGBNIR = "PS-RGBNIR"
    SAR_Intensity = "SAR-Intensity"
    PAN = "PAN"
    PS_RGB = "PS-RGB"

def get_source_info(label_path):
    """Gets a list of dictionaries describing the source data associated with the
    given label file path. Each dictionary has "href", "type", "start_datetime", and
    "end_datetime" keys.
    """
    label_path = pathlib.Path(label_path)
    match = labels_pattern.match(label_path.name)
    if match is None:
        raise ValueError(f"Invalid filename {label_path.name}")
    
    prefix = match.group("prefix")
    start_datetime = match.group("start_datetime")
    end_datetime = match.group("end_datetime")
    tile = match.group("tile")
    
    return [
        {
            # We use the path on S3 instead of the local path here
            "href": (
                "https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/"
                f"{aoi_name}/{source_type.value}/"
                f"{prefix}_{source_type.value}_{start_datetime}_{end_datetime}_tile_{tile}.tif"
            ),
            "type": source_type.value,
            "start_datetime": start_datetime,
            "end_datetime": end_datetime,
        }
        for source_type in SourceType
    ]
    
    

For example, here are the sources for one of the sample labels:

In [22]:
label = "SN6_Train_AOI_11_Rotterdam_Buildings_20190804120223_20190804120456_tile_55.geojson"
sources = get_source_info(label)
sources

[{'href': 'https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/AOI_11_Rotterdam/RGBNIR/SN6_Train_AOI_11_Rotterdam_RGBNIR_20190804120223_20190804120456_tile_55.tif',
  'type': 'RGBNIR',
  'start_datetime': '20190804120223',
  'end_datetime': '20190804120456'},
 {'href': 'https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/AOI_11_Rotterdam/PS-RGBNIR/SN6_Train_AOI_11_Rotterdam_PS-RGBNIR_20190804120223_20190804120456_tile_55.tif',
  'type': 'PS-RGBNIR',
  'start_datetime': '20190804120223',
  'end_datetime': '20190804120456'},
 {'href': 'https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/AOI_11_Rotterdam/SAR-Intensity/SN6_Train_AOI_11_Rotterdam_SAR-Intensity_20190804120223_20190804120456_tile_55.tif',
  'type': 'SAR-Intensity',
  'start_datetime': '20190804120223',
  'end_datetime': '20190804120456'},
 {'href': 'https://s3.amazonaws.com/spacenet-dataset/spacenet/SN6_buildings/train/AOI_11_Rotterdam/PAN/SN6_Train_AOI_11_Rotterdam_PAN
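
The `start_datetime` and `end_datetime` values in this output are still raw strings taken from the filename. The sketch below shows one way to turn them into `datetime` objects with the standard library. It assumes the 14-digit stamps follow a `YYYYMMDDHHMMSS` layout and are in UTC; neither is documented in the sample data, so treat both as assumptions. The helper name `parse_stamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_stamp(stamp):
    """Parse a YYYYMMDDHHMMSS string into a timezone-aware datetime (UTC assumed)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


start = parse_stamp("20190804120223")
end = parse_stamp("20190804120456")

print(start.isoformat())  # 2019-08-04T12:02:23+00:00
print(end - start)        # 0:02:33
```

Parsing with an explicit format string fails loudly on malformed stamps, which is useful when a filename does not follow the expected convention.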

### Catalog Source Imagery

Since each of the sources for a given label covers the same spatial and temporal extent, we can combine them into a single STAC Item, with each source represented as a distinct Asset. We will create the helper function below to build that Item.

In [27]:
def create_source_item(label_path):
    """Creates a single STAC Item with one Asset per source type for the given label."""
    sources = get_source_info(label_path)

    first_source = sources[0]

    # Bootstrap the source item using rio-stac based on the first asset
    with rasterio.open(first_source["href"]) as src:
        item = create_stac_item(
            source=src,
            asset_name=first_source["type"],
            asset_roles=["data"],
            # Note that we use a single datetime here instead of the range from the filename
            input_datetime=str_to_datetime(first_source["start_datetime"]),
            with_proj=True,
        )
        
    # rio-stac does not add the Asset "type" or "title" fields, so we add them manually
    #  (all assets are Cloud-Optimized GeoTIFFs)
    item.assets[first_source["type"]].type = pystac.MediaType.COG
    item.assets[first_source["type"]].title = first_source["type"]
    
    # Since the spatiotemporal metadata is the same for all assets, we do not need to read 
    # each one.
    for source in sources[1:]:
        asset = pystac.Asset.from_dict({
            "href": source["href"],
            "roles": ["data"],
            "type": str(pystac.MediaType.COG),
            "title": source["type"]
        })
        item.add_asset(source["type"], asset)
    
    return item
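
To make the resulting structure concrete without needing pystac or network access, the sketch below builds the same kind of `assets` mapping as plain dictionaries, keyed by source type. The helper name `build_assets` and the `example.com` URLs are hypothetical placeholders; the media type string is the one STAC uses for Cloud-Optimized GeoTIFFs.

```python
# STAC media type for Cloud-Optimized GeoTIFFs
COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def build_assets(sources):
    """Map each source type to an asset dictionary (hypothetical helper)."""
    return {
        source["type"]: {
            "href": source["href"],
            "type": COG_MEDIA_TYPE,
            "title": source["type"],
            "roles": ["data"],
        }
        for source in sources
    }


# Placeholder inputs; real values would come from get_source_info()
example_sources = [
    {"type": "PS-RGB", "href": "https://example.com/PS-RGB/tile_55.tif"},
    {"type": "PAN", "href": "https://example.com/PAN/tile_55.tif"},
]

assets = build_assets(example_sources)
print(sorted(assets))  # ['PAN', 'PS-RGB']
```

Keying assets by source type means a consumer can pick the imagery it needs, for example `assets["PS-RGB"]["href"]`, without parsing filenames.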

### Catalog Labels

### Submit to Radiant MLHub

In [None]:
tar_path.unlink(missing_ok=True)
shutil.rmtree(data_dir, ignore_errors=True)