## Load and manage TFDS COCO Datasets

The [TFDS COCO dataset](https://www.tensorflow.org/datasets/catalog/coco) defines the following feature structure for data serialization:

In [None]:
import numpy as np
from tensorflow_datasets import features as f

tf_features_dict = f.FeaturesDict({
  "image": f.Image(shape=(None, None, 3), dtype=np.uint8),
  "objects": f.Sequence({
    "area": np.int64,
    "bbox": f.BBoxFeature(),
    "id": np.int64,
    # np.bool was removed in NumPy 1.24; np.bool_ is the supported scalar type.
    "is_crowd": np.bool_,
    "label": f.ClassLabel(num_classes=80),
  }),
})

This example creates a Space dataset for COCO and copies the `objects` feature into a columnar format for analysis. The Space dataset's schema is:

In [None]:
import pyarrow as pa
from space import TfFeatures

object_schema = pa.struct([
  ("area", pa.int64()),
  ("bbox", pa.list_(pa.float32())),  # TODO: to use fixed size list.
  ("id", pa.int64()),
  ("is_crowd", pa.bool_()),
  ("label", pa.int64()),
])

ds_schema = pa.schema([
  ("id", pa.int64()),
  ("filename", pa.string()),
  ("objects", pa.list_(object_schema)),
  ("features", TfFeatures(tf_features_dict))
])

Create a new Space dataset:

In [None]:
from space import Dataset

ds = Dataset.create("/path/to/space/<mybucket>/demo",
                    ds_schema, primary_keys=["id"])

Then load TFDS's ArrayRecord files into Space without copying the files:

In [None]:
from typing import Any, Dict

def index_fn(example: Dict[str, Any]) -> Dict[str, Any]:
  example = example["features"][0]
  return {
    "id": example["image/id"],
    "filename": example["image/filename"],
    "objects": coco_utils.tf_objects_to_pylist(example["objects"]),
  }

runner = ds.local()
runner.load_array_record("/path/to/tfds/coco/files", index_fn)
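
The `index_fn` above relies on `coco_utils.tf_objects_to_pylist`, which is not shown in this example. As a rough illustration of the kind of conversion such a helper likely performs, the sketch below turns a column-oriented `objects` dict (one list per field) into a list of per-object dicts shaped like `object_schema`. The function name `objects_to_pylist` and the sample values are hypothetical and are not part of Space or TFDS; the real helper may behave differently.

```python
from typing import Any, Dict, List


def objects_to_pylist(objects: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
  """Transposes a dict of equal-length lists into a list of dicts.

  Hypothetical stand-in for coco_utils.tf_objects_to_pylist; the real helper
  may differ (for example, in how it converts bbox values or dtypes).
  """
  if not objects:
    return []
  lengths = {len(values) for values in objects.values()}
  if len(lengths) != 1:
    raise ValueError(f"Fields have different lengths: {sorted(lengths)}")
  keys = list(objects)
  # zip(*columns) walks the columns in lockstep, yielding one tuple per object.
  return [dict(zip(keys, row)) for row in zip(*(objects[k] for k in keys))]


# Made-up sample: one image with two annotated objects.
sample_objects = {
  "area": [1200, 340],
  "bbox": [[0.1, 0.2, 0.5, 0.6], [0.3, 0.3, 0.4, 0.9]],
  "id": [11, 12],
  "is_crowd": [False, True],
  "label": [3, 17],
}
rows = objects_to_pylist(sample_objects)
```

Each element of `rows` is a dict with one value per field, which is the row-oriented shape that a `pa.list_(object_schema)` column accepts.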

Now the `objects` field in TFDS becomes a columnar field that can be analyzed via SQL:

In [None]:
import duckdb

# Load the "objects" column into memory as PyArrow and query using DuckDB.
objects = runner.read_all(fields=["objects"])
duckdb.sql(
  "SELECT MAX(objs.area) FROM (SELECT unnest(objects) AS objs FROM objects)"
).fetchall()

Space supports data mutations and time travel back to previous versions. There is no longer any need to rewrite an ML dataset to insert, delete, or update data.

In [None]:
import pyarrow.compute as pc

# Delete a row from a Space dataset.
# The mutation creates a new snapshot and sets it as the current snapshot.
runner.delete(pc.field("id") == pc.scalar(361586))

# Time travel back to before the deletion, by setting a read "snapshot_id".
# Initial snapshot ID is 0, after loading TFDS it becomes 1, after deletion it
# is 2.
runner.read(snapshot_id=1)
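
The snapshot numbering in the comments above (0 after creation, 1 after the load, 2 after the deletion) can be illustrated with a toy in-memory model. This is a conceptual sketch only: `ToySnapshotStore` and its methods are hypothetical names, it copies rows on every mutation, and it says nothing about how Space actually stores snapshots.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


class ToySnapshotStore:
  """Keeps every version of the rows; illustrative only, not Space's design."""

  def __init__(self) -> None:
    # Snapshot 0 is the empty dataset right after creation.
    self._snapshots: List[List[Row]] = [[]]

  @property
  def current_snapshot_id(self) -> int:
    return len(self._snapshots) - 1

  def append(self, rows: List[Row]) -> None:
    # Every mutation adds a new snapshot instead of editing an old one.
    self._snapshots.append(self._snapshots[-1] + rows)

  def delete(self, predicate: Callable[[Row], bool]) -> None:
    self._snapshots.append(
      [row for row in self._snapshots[-1] if not predicate(row)])

  def read(self, snapshot_id: Optional[int] = None) -> List[Row]:
    # Reading without an ID returns the current (latest) snapshot.
    if snapshot_id is None:
      snapshot_id = self.current_snapshot_id
    return list(self._snapshots[snapshot_id])


store = ToySnapshotStore()                     # Snapshot 0: empty.
store.append([{"id": 361586}, {"id": 42}])     # Snapshot 1: after loading.
store.delete(lambda row: row["id"] == 361586)  # Snapshot 2: after deletion.

latest_ids = [row["id"] for row in store.read()]
before_delete_ids = [row["id"] for row in store.read(snapshot_id=1)]
```

Reading snapshot 1 still returns the deleted row because earlier snapshots are never modified, which is the property that makes time travel possible.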

A Space dataset is also an ML training data source. It integrates with JAX, TensorFlow, PyTorch, and Ray:

In [None]:
from space.tf.data_sources import SpaceDataSource

# TensorFlow random access interface:
# https://www.tensorflow.org/datasets/tfless_tfds
# `feature_fields` lists the feature fields to read.
tf_ds = SpaceDataSource(ds, feature_fields=["features"])

# Returns a Ray dataset.
ray_ds = ds.ray_dataset()