## Manage TensorFlow COCO dataset

The [TFDS COCO dataset](https://www.tensorflow.org/datasets/catalog/coco) defines the following feature structure, `tf_features_dict`. It is used to serialize complex nested data into bytes and to deserialize it back.

In [None]:
import numpy as np
from tensorflow_datasets import features as f

tf_features_dict = f.FeaturesDict({
 "image": f.Image(shape=(None, None, 3), dtype=np.uint8),
 "objects": f.Sequence({
   "area": np.int64,
   "bbox": f.BBoxFeature(),
   "id": np.int64,
   "is_crowd": np.bool_,
   "label": f.ClassLabel(num_classes=80),
  }),
})

In this example, we move the COCO dataset from TFDS to Space. In addition, we copy the `objects` field above from the row-oriented files to Parquet files, so we can run SQL queries on it.

The Space dataset's schema is:

In [None]:
import pyarrow as pa
from space import TfFeatures  # A custom PyArrow type.

object_schema = pa.struct([
  ("area", pa.int64()),
  ("bbox", pa.list_(pa.float32())),  # TODO: to use fixed size list.
  ("id", pa.int64()),
  ("is_crowd", pa.bool_()),
  ("label", pa.int64()),
])

ds_schema = pa.schema([
  ("id", pa.int64()),
  ("filename", pa.string()),
  ("objects", pa.list_(object_schema)),
  ("features", TfFeatures(tf_features_dict))
])

Create a new Space dataset:

In [None]:
from space import Dataset

# Fields listed in record_fields are stored in ArrayRecord files;
# the remaining fields are stored in Parquet files.
ds = Dataset.create("/path/to/space/mybucket/demo",
  ds_schema, primary_keys=["id"], record_fields=["features"])

The following code defines a function `index_fn` that reads records from ArrayRecord files and builds an index for them. The function returns three index fields (`id`, `filename`, `objects`) that are written into the Space dataset's Parquet files. The address of each row in the input ArrayRecord files is persisted at the same time.

Calling `load_array_record` processes all ArrayRecord files in the folder `/path/to/tfds/coco/files` using this function. After the call completes, the COCO dataset is under Space's management.

In [None]:
from typing import Any, Dict

def index_fn(example: Dict[str, Any]) -> Dict[str, Any]:
  example = example["features"][0]
  return {
    "id": example["image/id"],
    "filename": example["image/filename"],
    "objects": coco_utils.tf_objects_to_pylist(example["objects"]),
  }

runner = ds.local()
# "/path/to/tfds/coco/files" is where TFDS saves the downloaded
# ArrayRecord files.
runner.load_array_record("/path/to/tfds/coco/files", index_fn)
ds.add_tag("initialized")  # Tag the current version.
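
The notebook does not show `coco_utils.tf_objects_to_pylist`, so here is a minimal sketch of the kind of conversion `index_fn` relies on. It assumes the decoded `objects` feature is a dict of equal-length columns (the usual layout of a TFDS `Sequence` of dicts) and turns it into one dict per object, matching `pa.list_(object_schema)`. The function name `objects_to_rows` and the toy values are hypothetical; the real helper may also handle tensors or NumPy arrays.

```python
from typing import Any, Dict, List


def objects_to_rows(objects: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turns a dict of equal-length columns into a list of per-object dicts."""
    if not objects:
        return []
    lengths = {len(column) for column in objects.values()}
    if len(lengths) > 1:
        raise ValueError(f"Columns have different lengths: {sorted(lengths)}")
    keys = list(objects)
    # zip(*columns) walks the columns in lockstep, yielding one tuple per object.
    return [dict(zip(keys, values)) for values in zip(*objects.values())]


# Toy input shaped like a decoded `objects` feature (values are made up).
example_objects = {
    "area": [1200, 450],
    "bbox": [[0.1, 0.2, 0.5, 0.6], [0.3, 0.3, 0.4, 0.9]],
    "id": [11, 12],
    "is_crowd": [False, True],
    "label": [3, 17],
}
rows = objects_to_rows(example_objects)
```

Each element of `rows` carries the five fields of `object_schema`, so the list can be stored as one value of the `objects` column.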

The `objects` field from TFDS is now a columnar field that can be analyzed with SQL:

In [None]:
import duckdb

# Load the "objects" column into memory as PyArrow and query using DuckDB.
# The SQL query returns the largest object bbox area in the dataset.
objects = runner.read_all(fields=["objects"])
duckdb.sql(
  "SELECT MAX(objs.area) FROM (SELECT unnest(objects) AS objs FROM objects)"
).fetchall()
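
To make the query's semantics concrete, the following sketch computes the same result in plain Python on a toy column: it flattens the per-image object lists (the `unnest` step) and takes the maximum `area`. The function name `max_object_area` and the toy data are hypothetical, and the sketch is only meant to illustrate what the SQL computes, not how DuckDB executes it.

```python
from typing import Any, Dict, List, Optional


def max_object_area(
    objects_column: List[Optional[List[Dict[str, Any]]]]
) -> Optional[int]:
    """Flattens a list-of-structs column and returns the largest `area`."""
    # Empty or missing lists contribute no rows, like `unnest` in the query.
    areas = [obj["area"] for row in objects_column for obj in (row or [])]
    # MAX over zero rows is NULL in SQL; None plays that role here.
    return max(areas) if areas else None


# Toy `objects` column: one list of objects per image (values are made up).
toy_column = [
    [{"area": 1200}, {"area": 450}],
    [],
    [{"area": 98000}],
]
largest = max_object_area(toy_column)
```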

Space supports data mutations. Each modification generates a new version, identified by a `snapshot_id`, and any previous version can still be read (time travel).

In [None]:
import pyarrow.compute as pc

# Delete a row from a Space dataset. The mutation creates a new snapshot.
runner.delete(pc.field("id") == pc.scalar(361586))

# Read the current version:
runner.read()

# Time travel back to before the deletion, by setting a read version.
runner.read(version="initialized")
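
The snapshot and tag behavior used above can be pictured with a small in-memory model. This is a conceptual sketch only: the class `ToyVersionedTable` is hypothetical, it copies all rows on every mutation, and it makes no claim about how Space stores snapshots internally.

```python
from typing import Callable, Dict, List, Optional, Union

Row = Dict[str, int]


class ToyVersionedTable:
    """Keeps every version of a table; a snapshot ID is an index into a list."""

    def __init__(self, rows: List[Row]):
        self._snapshots: List[List[Row]] = [list(rows)]
        self._tags: Dict[str, int] = {}

    @property
    def current_snapshot_id(self) -> int:
        return len(self._snapshots) - 1

    def add_tag(self, tag: str) -> None:
        """Gives the current snapshot a human-readable name."""
        self._tags[tag] = self.current_snapshot_id

    def delete(self, predicate: Callable[[Row], bool]) -> int:
        """Writes a new snapshot without the matching rows; old ones stay intact."""
        kept = [row for row in self._snapshots[-1] if not predicate(row)]
        self._snapshots.append(kept)
        return self.current_snapshot_id

    def read(self, version: Optional[Union[int, str]] = None) -> List[Row]:
        """Reads the latest snapshot, or an older one by snapshot ID or tag."""
        if version is None:
            snapshot_id = self.current_snapshot_id
        elif isinstance(version, str):
            snapshot_id = self._tags[version]
        else:
            snapshot_id = version
        return list(self._snapshots[snapshot_id])


table = ToyVersionedTable([{"id": 361586}, {"id": 42}])
table.add_tag("initialized")
table.delete(lambda row: row["id"] == 361586)

current = table.read()                       # Latest snapshot, after the delete.
before = table.read(version="initialized")   # Time travel via the tag.
```

Because a mutation only appends a new snapshot, a tag keeps pointing at the data as it was when the tag was added.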

Read data from the Space dataset through a [random access data source interface](https://www.tensorflow.org/datasets/tfless_tfds).

In [None]:
from space import RandomAccessDataSource

datasource = RandomAccessDataSource(
  # field-name: storage-location, for reading data from ArrayRecord files.
  {
    "features": "/path/to/space/mybucket/demo",
  },
  # Auto deserialize data using `tf_features_dict`.
  deserialize=True)
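
As a rough illustration of what a random access data source provides, the sketch below implements a sequence-like object with `__len__` and `__getitem__` that deserializes a record on each lookup. The class `ToyRandomAccessSource` is hypothetical, and JSON stands in for the `tf_features_dict` deserialization; it assumes the interface is sequence-like and does not reproduce the `RandomAccessDataSource` API.

```python
import json
from typing import Any, Callable, Sequence


class ToyRandomAccessSource:
    """Sequence-like source: supports len(source) and source[i]."""

    def __init__(self, records: Sequence[bytes], decode: Callable[[bytes], Any]):
        self._records = records
        self._decode = decode

    def __len__(self) -> int:
        return len(self._records)

    def __getitem__(self, index: int) -> Any:
        # Records are decoded lazily, only when they are requested.
        return self._decode(self._records[index])


# Toy serialized records (values are made up).
serialized = [
    json.dumps({"id": 7, "filename": "a.jpg"}).encode("utf-8"),
    json.dumps({"id": 42, "filename": "b.jpg"}).encode("utf-8"),
]
source = ToyRandomAccessSource(serialized, lambda raw: json.loads(raw.decode("utf-8")))

num_records = len(source)
second = source[1]
```

Index-based access like this is what lets a training input pipeline shuffle and shard records without scanning the files sequentially.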