# Work with a DGP Dataset

The standard output format of Parallel Domain's synthetic data is [Dataset Governance Policy (DGP)](https://github.com/TRI-ML/dgp/blob/master/dgp/proto/README.md).
In general, the PD SDK can load data from any format, as long as a custom decoder exists that adheres to the `DatasetDecoderProtocol`.
Out of the box, the PD SDK ships with a pre-configured `DGPDatasetDecoder`, which we can use to load data.

In this tutorial, we are going to load and access a dataset and its scenes.


## Load Dataset

First, we select the appropriate decoder (here: `DGPDatasetDecoder`) and tell it where our dataset is stored. The location can be either a local filesystem path or an S3 address.

In [1]:
from paralleldomain.decoding.dgp.decoder import DGPDatasetDecoder
from paralleldomain.model.dataset import Dataset  # optional import, will also be imported by DGPDatasetDecoder

dataset_path = "s3://pd-sdk-c6b4d2ea-0301-46c9-8b63-ef20c0d014e9/testset_dgp"
dgp_decoder = DGPDatasetDecoder(dataset_path=dataset_path)

dataset = dgp_decoder.get_dataset()
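
Since the location can be either a local path or an S3 address, it can help to check which kind a given string is before handing it to the decoder. The sketch below uses only the Python standard library. `describe_location` is a hypothetical helper for illustration and not part of the PD SDK, and the local path in the usage lines is made up.

```python
from urllib.parse import urlparse


def describe_location(dataset_path: str) -> str:
    """Classify a dataset location as 's3' or 'local' (hypothetical helper, not a PD SDK function)."""
    parsed = urlparse(dataset_path)
    if parsed.scheme == "s3":
        return "s3"
    # Anything without an s3:// scheme is treated as a local filesystem path here.
    return "local"


print(describe_location("s3://pd-sdk-c6b4d2ea-0301-46c9-8b63-ef20c0d014e9/testset_dgp"))  # s3
print(describe_location("/data/testset_dgp"))  # local
```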

## Dataset Information

Now that the dataset information has been loaded, we can query some of its metadata:

In [2]:
print("Dataset Metadata:")
print("Name:", dataset.meta_data.name)
print("Available Annotation Types:", *[f"\t{a}" for a in dataset.available_annotation_types], sep="\n")
print("Custom Attributes:", *[f"\t{k}: {v}" for k, v in dataset.meta_data.custom_attributes.items()], sep="\n")

Dataset Metadata:
Name: DefaultDatasetName
Available Annotation Types:
	<class 'paralleldomain.model.annotation.BoundingBoxes2D'>
	<class 'paralleldomain.model.annotation.BoundingBoxes3D'>
	<class 'paralleldomain.model.annotation.SemanticSegmentation2D'>
	<class 'paralleldomain.model.annotation.SemanticSegmentation3D'>
	<class 'paralleldomain.model.annotation.InstanceSegmentation2D'>
	<class 'paralleldomain.model.annotation.InstanceSegmentation3D'>
	<class 'paralleldomain.model.annotation.Depth'>
	<class 'paralleldomain.model.annotation.Annotation'>
	<class 'paralleldomain.model.annotation.Annotation'>
	<class 'paralleldomain.model.annotation.OpticalFlow'>
	<class 'paralleldomain.model.annotation.Annotation'>
Custom Attributes:
	origin: INTERNAL
	name: DefaultDatasetName
	creator: 
	available_annotation_types: [0, 1, 2, 3, 4, 5, 6, 10, 7, 8, 9]
	creation_date: 2021-06-22T15:16:21.317Z
	version: 
	description: 


As you can see, the property `.available_annotation_types` contains classes from `paralleldomain.model.annotation`. The tutorials on reading annotations from a dataset reuse these exact classes, which allows for consistent type checks across objects.
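
A minimal sketch of such a type check follows. So that it runs without the SDK installed, it uses empty stand-in classes named after three of the annotation types listed above. With the SDK you would import the real classes from `paralleldomain.model.annotation` and pass `dataset.available_annotation_types` instead. `supports` is a hypothetical helper, not an SDK function.

```python
from typing import Iterable


# Stand-in classes for illustration only; the real ones live in
# paralleldomain.model.annotation.
class BoundingBoxes2D:
    pass


class Depth:
    pass


class OpticalFlow:
    pass


def supports(available: Iterable[type], wanted: type) -> bool:
    """Return True if the exact class `wanted` is among the available annotation types."""
    return any(candidate is wanted for candidate in available)


# Assumed example list; a real dataset reports its own available types.
available_annotation_types = [BoundingBoxes2D, Depth]

print(supports(available_annotation_types, Depth))  # True
print(supports(available_annotation_types, OpticalFlow))  # False
```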

## Access Available Scenes

Every dataset consists of scenes. These can contain ordered (usually: by time) or unordered data.
In this example, we list the names of the scenes, grouped by type, that were found in the loaded dataset.

In [3]:
for sn in dataset.scene_names:
    print(f"Found scene {sn}")

for usn in dataset.unordered_scene_names:
    print(f"Found unordered scene {usn}")

Found scene pd-sdk_test_set
Found unordered scene pd-sdk_test_set
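
In the output above, the same name appears in both lists. One quick way to see how the two lists relate is to compare them as sets. The sketch below hard-codes the names printed above so that it runs without the dataset. With a loaded dataset you would use `dataset.scene_names` and `dataset.unordered_scene_names` directly.

```python
# Names copied from the output above; replace with the dataset properties in real use.
scene_names = ["pd-sdk_test_set"]
unordered_scene_names = ["pd-sdk_test_set"]

# Names that appear in both lists, and names that only appear as unordered scenes.
in_both = sorted(set(scene_names) & set(unordered_scene_names))
only_unordered = sorted(set(unordered_scene_names) - set(scene_names))

print("In both lists:", in_both)
print("Only unordered:", only_unordered)
```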


Once we know which scenes are available, we can ask the dataset to return a specific scene, which also starts lazy loading it.

In [4]:
selected_scene = dataset.scene_names[0]  # keep the name for later use
scene = dataset.get_scene(scene_name=selected_scene)

print(scene)

<paralleldomain.model.scene.Scene object at 0x7fc434b9f310>
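
To make the idea of lazy loading concrete, here is a generic sketch using `functools.cached_property`. It does not show how the PD SDK implements `Scene`. `LazyScene`, `frame_ids`, and `load_count` are hypothetical names used only to illustrate that expensive work is deferred until first access and then cached.

```python
from functools import cached_property
from typing import List


class LazyScene:
    """Hypothetical stand-in for a lazily loaded scene (not the PD SDK class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.load_count = 0  # counts how often the expensive load actually runs

    @cached_property
    def frame_ids(self) -> List[str]:
        # The body runs on first access only; the result is cached afterwards.
        self.load_count += 1
        return ["0", "1", "2"]  # placeholder data, not taken from a real dataset


lazy_scene = LazyScene("pd-sdk_test_set")
print(lazy_scene.load_count)  # 0: nothing has been loaded yet
lazy_scene.frame_ids
lazy_scene.frame_ids
print(lazy_scene.load_count)  # 1: loaded once, then served from the cache
```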
