# pytdml tutorial

This tutorial will guide you through the basic functionalities of the pytdml library.

The pytdml library is a Python library that provides tools to create, read, and convert datasets in the TrainingDML-AI format.

## TrainingDML-AI

The full name of the standard is Training Data Markup Language for Artificial Intelligence. The name could be even longer, because at the moment the standard focuses only on modelling training datasets for deep learning models in the Earth Observation domain. It is a UML model with encodings in JSON and XML.

The fundamental problem that the TrainingDML-AI standard tries to solve is the lack of a common format for training datasets; its goal is to improve their reusability, provenance, and interoperability. Training data is not uniform or universal: depending on the task and the nature of the observed objects, the usability of the data can vary greatly. For example, a dataset of rooftops from Lisbon will most likely not be useful for a model that tries to detect rooftops in Bangalore, while it could be quite useful for detecting rooftops in Madrid. Both the task and the applicability, and ideally the usage in past models, should be easy to access and understand.

TrainingDML builds upon already existing standards, mainly the ISO 19100 family of standards.

<img src="https://raw.githubusercontent.com/opengeospatial/TrainingDML-AI_SWG/main/standard/part1/figures/uml_model_iso.jpg" alt="ISO standards used in TDML" width="60%" height="100%">


The following graph shows the most important concepts in the TrainingDML-AI standard:

<img src="https://raw.githubusercontent.com/opengeospatial/TrainingDML-AI_SWG/main/standard/part1/figures/overview_modularization.jpg" alt="TrainingDML-AI Modularization" width="60%" height="100%">

For more information about the standard, please visit the official [OGC Publication page](https://www.ogc.org/publications/standard/trainingdml-ai/) or the [GitHub repository](https://github.com/opengeospatial/TrainingDML-AI_SWG).

## Installation

First, let's manually check the Python version, as the library is only compatible with Python 3.9 and 3.10:

In [1]:
import sys

# pytdml supports Python 3.9 and 3.10 only; stop early on any other version
if not ((3, 9) <= sys.version_info < (3, 11)):
    sys.exit("Python version should be 3.9 or 3.10")

# We will also need to install the wget library to download some files
!pip install wget

Collecting wget
  Using cached wget-3.2-py3-none-any.whl
Installing collected packages: wget
Successfully installed wget-3.2


We will install the version from GitHub, as it includes many improvements and bug fixes that have not yet been released on PyPI.

The first command will install the library and the basic dependencies:

In [None]:
!pip install pytdml@git+https://github.com/openrsgis/pytdml.git

The following command will install the library with the torch and tensorflow dependencies, which are required
for the extended functionalities in `pytdml.ml`:

In [2]:
!pip install pytdml[torch]@git+https://github.com/openrsgis/pytdml.git

Collecting pytdml@ git+https://github.com/openrsgis/pytdml.git (from pytdml[torch]@ git+https://github.com/openrsgis/pytdml.git)
  Cloning https://github.com/openrsgis/pytdml.git to /tmp/pip-install-x6digtt3/pytdml_ad503ab2790e429fa52af724f18b315a
  Running command git clone --filter=blob:none --quiet https://github.com/openrsgis/pytdml.git /tmp/pip-install-x6digtt3/pytdml_ad503ab2790e429fa52af724f18b315a
  Resolved https://github.com/openrsgis/pytdml.git to commit 20b5b4355ed2ea9c33d4cf7c80d8dfb4be30c834
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting geojson~=3.1.0 (from pytdml@ git+https://github.com/openrsgis/pytdml.git->pytdml[torch]@ git+https://github.com/openrsgis/pytdml.git)
  Using cached geojson-3.1.0-py3-none-any.whl.metadata (16 kB)
Collecting Pillow~=10.4.0 (from pytdml@ git+https://github.com/openrsgis/pytdml.git->pytdml[torch]@ g

## Basic Usage

There are two ways to create a dataset with pytdml: building a new dataset directly in code, or reading an existing dataset from a file.

In [3]:
from pytdml.type import EOTrainingDataset, AI_EOTrainingData, AI_EOTask, AI_SceneLabel

dataset = EOTrainingDataset(
    id="eotrainingdataset_1",
    name="EO Training Dataset Example",
    description="This is an example of a training dataset for the pytdml tutorial",
    license="CC-BY-SA",
    type="AI_EOTrainingDataset",
    data=[AI_EOTrainingData(
        id="eotrainingdata_1",
        type="AI_EOTrainingData",
        data_url=["https://example.com/data.tif"],
        labels=[
            AI_SceneLabel(
                type="AI_SceneLabel",
                label_class="label_1"
            )
        ]
    )],
    tasks=[AI_EOTask(
        id="task_1",
        type="AI_EOTask",
        task_type="classification"
    )]
)

dataset.to_dict()

2025-03-25 11:09:22.637871: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-25 11:09:22.647038: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-25 11:09:22.658629: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-25 11:09:22.662020: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-25 11:09:22.670991: I tensorflow/core/platform/cpu_feature_guar

{'id': 'eotrainingdataset_1',
 'name': 'EO Training Dataset Example',
 'description': 'This is an example of a training dataset for the pytdml tutorial',
 'license': 'CC-BY-SA',
 'tasks': [{'id': 'task_1',
   'type': 'AI_EOTask',
   'taskType': 'classification'}],
 'data': [{'type': 'AI_EOTrainingData',
   'id': 'eotrainingdata_1',
   'labels': [{'type': 'AI_SceneLabel',
     'isNegative': False,
     'confidence': 1.0,
     'class': 'label_1'}],
   'dataURL': ['https://example.com/data.tif']}],
 'type': 'AI_EOTrainingDataset'}
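
Because `to_dict()` returns a plain Python dictionary, it can be inspected with standard tools. The following is a minimal sketch, using only the standard library, that tallies label classes and collects data URLs from a dictionary shaped like the output above. The helper names `count_label_classes` and `list_data_urls` are hypothetical and not part of pytdml.

```python
from collections import Counter

# Copy of the dictionary printed above (the output of dataset.to_dict()).
dataset_dict = {
    "id": "eotrainingdataset_1",
    "name": "EO Training Dataset Example",
    "description": "This is an example of a training dataset for the pytdml tutorial",
    "license": "CC-BY-SA",
    "tasks": [{"id": "task_1", "type": "AI_EOTask", "taskType": "classification"}],
    "data": [
        {
            "type": "AI_EOTrainingData",
            "id": "eotrainingdata_1",
            "labels": [
                {"type": "AI_SceneLabel", "isNegative": False,
                 "confidence": 1.0, "class": "label_1"}
            ],
            "dataURL": ["https://example.com/data.tif"],
        }
    ],
    "type": "AI_EOTrainingDataset",
}


def count_label_classes(tdml_dict):
    """Count how many labels of each class appear across all training samples."""
    counts = Counter()
    for sample in tdml_dict.get("data", []):
        for label in sample.get("labels", []):
            counts[label["class"]] += 1
    return counts


def list_data_urls(tdml_dict):
    """Collect every dataURL entry, in document order."""
    return [
        url
        for sample in tdml_dict.get("data", [])
        for url in sample.get("dataURL", [])
    ]


label_counts = count_label_classes(dataset_dict)
data_urls = list_data_urls(dataset_dict)
print(dict(label_counts))
print(data_urls)
```

Note that the JSON keys (`dataURL`, `taskType`, `class`) differ from the Python keyword arguments used in the constructor (`data_url`, `task_type`, `label_class`), so code that reads the dictionary has to use the JSON names.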

## Writing and reading from files

### Writing to a TrainingDML-AI JSON file
We can write the dataset directly to a JSON file and validate its content against the TrainingDML-AI JSON schema using `jsonschema`.

In [4]:
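import jsonschema
import requests
import pytdml

# Serialize the dataset to a TrainingDML-AI JSON file
pytdml.io.write_to_json(dataset, "dataset.json")

# Download the JSON schema published in the TrainingDML-AI repository
remote_schema_url = "https://raw.githubusercontent.com/opengeospatial/TrainingDML-AI_SWG/main/schemas/1.0/json_schema/ai_eoTrainingDataset.json"
response = requests.get(remote_schema_url)
response.raise_for_status()  # fail early if the schema could not be downloaded
remote_schema = response.json()

# Raises jsonschema.ValidationError if the dataset does not conform to the schema
jsonschema.validate(instance=dataset.to_dict(), schema=remote_schema)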

### Reading from a TrainingDML-AI JSON file


In [5]:
dataset2 = pytdml.io.read_from_json("dataset.json")

dataset == dataset2

True
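
The equality check above is a round-trip test: what is written to disk should come back unchanged. The same idea can be sketched with only the standard `json` module. This is an illustration of the pattern, not a description of how `write_to_json` and `read_from_json` are implemented, and the `record` dictionary is a trimmed, made-up example.

```python
import json
import os
import tempfile

# Trimmed, illustrative record; not a complete TrainingDML-AI document.
record = {
    "id": "eotrainingdataset_1",
    "type": "AI_EOTrainingDataset",
    "data": [
        {"id": "eotrainingdata_1", "dataURL": ["https://example.com/data.tif"]}
    ],
}

# Write the record to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

# The round trip is lossless if both dictionaries compare equal.
round_trip_ok = restored == record
print(round_trip_ok)
```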

## Converting from non-TDML formats

### Converting from COCO format

COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset.
We can convert a COCO dataset to a TDML dataset using the `convert_coco_to_tdml` function. For more information about this dataset, check the [COCO dataset website](https://cocodataset.org/).

In [6]:
from pytdml.io.coco_converter import convert_coco_to_tdml
import wget

coco_url = "https://raw.githubusercontent.com/openrsgis/pytdml/refs/heads/main/tests/data/coco/panoptic_val2017.json"
coco_file = "panoptic_val2017.json"
wget.download(coco_url, coco_file)

tdml_dataset = convert_coco_to_tdml(coco_file)

print(f"Number of elements in the dataset: {tdml_dataset.amount_of_training_data}")

Number of elements in the dataset: 5000
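Before converting a COCO file it can help to look at its structure. A COCO panoptic annotation file keeps `images`, `annotations`, and `categories` in top-level lists, and each annotation carries a `segments_info` list. The sketch below walks a tiny, made-up file of that shape with the standard library only; the file names and categories are invented, and the code says nothing about how `convert_coco_to_tdml` maps these fields internally.

```python
import json
from collections import Counter

# Tiny, made-up COCO panoptic file; real files contain many more fields.
coco_text = """
{
  "images": [
    {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480},
    {"id": 2, "file_name": "000000000002.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"image_id": 1, "file_name": "000000000001.png",
     "segments_info": [{"id": 101, "category_id": 1}, {"id": 102, "category_id": 2}]},
    {"image_id": 2, "file_name": "000000000002.png",
     "segments_info": [{"id": 201, "category_id": 1}]}
  ],
  "categories": [
    {"id": 1, "name": "person"},
    {"id": 2, "name": "bicycle"}
  ]
}
"""

coco = json.loads(coco_text)

# Map category ids to readable names.
category_names = {c["id"]: c["name"] for c in coco["categories"]}

# Count segments per category across all annotated images.
segments_per_category = Counter(
    category_names[segment["category_id"]]
    for annotation in coco["annotations"]
    for segment in annotation["segments_info"]
)

num_images = len(coco["images"])
print(num_images)
print(dict(segments_per_category))
```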


### Converting from STAC format

STAC (SpatioTemporal Asset Catalog) is a standard for describing geospatial assets in a way that is easy to index and discover.
We can convert a STAC collection to a TDML dataset using the `convert_stac_to_tdml` function. For more information about this specification, check the [STAC website](https://stacspec.org/).

In [9]:
from pytdml.io.stac_converter import convert_stac_to_tdml
import os

stac_base_url = "https://raw.githubusercontent.com/openrsgis/pytdml/refs/heads/main/tests/data/stac/{}.json"

directory = "tests/data/stac"
os.makedirs(directory, exist_ok=True)
for url in ["collection", "core-item", "extended-item", "simple-item"]:
    stac_url = stac_base_url.format(url)
    stac_file = os.path.join(directory, f"{url}.json")
    wget.download(stac_url, stac_file)
stac_file = "collection.json"
tdml_dataset = convert_stac_to_tdml(stac_file)

print(f"Number of elements in the dataset: {len(tdml_dataset.data)}")

Downloaded tests/data/stac/collection.json
Downloaded tests/data/stac/core-item.json
Downloaded tests/data/stac/extended-item.json
Downloaded tests/data/stac/simple-item.json
Number of elements in the dataset: 3
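The item files are downloaded next to the collection because a STAC Collection refers to its items through `links` entries with `rel` set to `"item"`, often as relative paths. The sketch below resolves such links with the standard library only. The collection document is a trimmed, made-up example, `item_paths` is a hypothetical helper, and whether `convert_stac_to_tdml` resolves links in this way is an assumption.

```python
import json
import posixpath

# Trimmed, made-up STAC Collection; real collections have more required fields.
collection_text = """
{
  "type": "Collection",
  "id": "simple-collection",
  "links": [
    {"rel": "root", "href": "./collection.json"},
    {"rel": "item", "href": "./simple-item.json"},
    {"rel": "item", "href": "./core-item.json"},
    {"rel": "item", "href": "./extended-item.json"}
  ]
}
"""


def item_paths(collection, collection_path):
    """Resolve the relative hrefs of all 'item' links against the collection file."""
    base_dir = posixpath.dirname(collection_path)
    return [
        posixpath.normpath(posixpath.join(base_dir, link["href"]))
        for link in collection.get("links", [])
        if link.get("rel") == "item"
    ]


collection = json.loads(collection_text)
paths = item_paths(collection, "tests/data/stac/collection.json")
for path in paths:
    print(path)
```

Three item links in the collection are consistent with the three elements reported above.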


### Converting from YAML format

pytdml can also convert a dataset from a YAML file to a TDML dataset using the `yaml_to_eo_tdml` function. This is not an official encoding of TrainingDML-AI, but it is often preferred for human-readable files.

In [None]:
from pytdml.io.yaml_converter import yaml_to_eo_tdml

yaml_url = "https://raw.githubusercontent.com/openrsgis/pytdml/refs/heads/main/tests/data/yaml/UiT_HCD_California_2017.yml"
yaml_file = "UiT_HCD_California_2017.yml"
wget.download(yaml_url, yaml_file)

tdml_dataset = yaml_to_eo_tdml(yaml_file)

print(f"Number of elements in the dataset: {tdml_dataset.amount_of_training_data}")

# Advanced Usage

The `pytdml.ml` module provides a set of tools to help you train and evaluate machine learning models using the datasets created using any of the methods described above.

This advanced usage is also not part of the standard, but it is an example of what the standard can be used for. During the code sprint we will work on decoupling this functionality from the main library so that it can later be moved to a separate package.

## Semantic segmentation

In [75]:
import pytdml
import torch
from torchvision import transforms

training_dataset_url = "https://raw.githubusercontent.com/openrsgis/pytdml/refs/heads/main/tests/data/semantic_segmentation/GID-5C.json"
training_dataset_file = "GID-5C.json"
wget.download(training_dataset_url, training_dataset_file)
training_dataset = pytdml.io.read_from_json(training_dataset_file)
print("Load training dataset: " + training_dataset.name)
print("Number of training samples: " + str(training_dataset.amount_of_training_data))
print("Number of classes: " + str(training_dataset.number_of_classes))

class_map = pytdml.ml.create_class_map(training_dataset)
train_set, val_set, test_set = pytdml.ml.split_train_valid_test(training_dataset, 0.7, 0.2, 0.1)  # split dataset
train_dataset = pytdml.ml.TorchEOImageSegmentationTD(
    train_set,
    class_map,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
)

val_dataset = pytdml.ml.TorchEOImageSegmentationTD(
    val_set,
    class_map,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
)

# Now train_dataset and val_dataset are instances of torch.utils.data.Dataset
isinstance(train_dataset, torch.utils.data.Dataset) and \
isinstance(val_dataset, torch.utils.data.Dataset)

Load training dataset: GID-Large-scale Classification
Number of training samples: 150
Number of classes: 5


True
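
The call `split_train_valid_test(training_dataset, 0.7, 0.2, 0.1)` divides the samples into three disjoint subsets by ratio. The sketch below shows one way such a split can be done with the standard library. It is an illustration under stated assumptions, not pytdml's implementation: `split_by_ratio` is a hypothetical helper, the shuffling and rounding rules are choices made here, and the sample names are placeholders.

```python
import random


def split_by_ratio(items, train_ratio, valid_ratio, test_ratio, seed=0):
    """Shuffle items reproducibly and cut them into train/valid/test parts."""
    if abs(train_ratio + valid_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_ratio)
    n_valid = round(len(shuffled) * valid_ratio)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the test set takes whatever remains
    return train, valid, test


# 150 placeholder samples, the same count as reported for the dataset above.
samples = [f"sample_{i:03d}" for i in range(150)]
train, valid, test = split_by_ratio(samples, 0.7, 0.2, 0.1)
print(len(train), len(valid), len(test))  # prints: 105 30 15
```

Giving the remainder to the last subset guarantees that every sample lands in exactly one subset, even when the ratios do not divide the dataset size evenly.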

For more details about the advanced usage, please visit the official [pytdml tutorial](https://htmlpreview.github.io/?https://github.com/opengeospatial/TrainingDML-AI_SWG/blob/main/pytdml-tutorial/pytdml_tutorial.html).

# The tutorial starts now

In this tutorial you may find some inconsistencies in the code of the library, or things that we could improve. Can you find them?

Trying to run the official tutorial is also an interesting task: it might contain references to old versions of the library, and we could use the chance to open a pull request to update it.