## Manage TensorFlow COCO dataset

This example loads the [TFDS COCO dataset](https://www.tensorflow.org/datasets/catalog/coco) into Space without copying ArrayRecord files. It demonstrates how to modify data and use a SQL engine to analyze annotations.

First, download the COCO dataset in ArrayRecord format by following the [TFDS docs](https://www.tensorflow.org/datasets/tfless_tfds).

In [None]:
import tensorflow_datasets as tfds

tfds.data_source('coco/2017')

The TFDS COCO dataset defines the following feature structure, `tf_features_dict`, which is used to serialize complex nested data into bytes:

In [None]:
import numpy as np
from tensorflow_datasets import features as f

tf_features_dict = f.FeaturesDict({
 "image": f.Image(shape=(None, None, 3), dtype=np.uint8),
 "image/filename": f.Text(),
 "image/id": np.int64,
 "objects": f.Sequence({
   "area": np.int64,
   "bbox": f.BBoxFeature(),
   "id": np.int64,
   "is_crowd": np.bool_,
   "label": f.ClassLabel(num_classes=80),
  }),
})

We will copy the `objects` field above into new Parquet files. The field will thus exist both in the ArrayRecord files (the original TFDS data, for feeding into a training framework) and in Parquet files (for SQL queries). Note that the bulky image data is not copied to Parquet.

The Space dataset's schema is:

In [None]:
import pyarrow as pa
from space import TfFeatures  # A custom PyArrow type.

# Equivalent to the `objects` field in the above FeaturesDict.
object_schema = pa.struct([
  ("area", pa.int64()),
  ("bbox", pa.list_(pa.float32())),  # TODO: to use fixed size list.
  ("id", pa.int64()),
  ("is_crowd", pa.bool_()),
  ("label", pa.int64()),
])

ds_schema = pa.schema([
  ("id", pa.int64()),
  ("filename", pa.string()),
  ("objects", pa.list_(object_schema)),  # A copy of `objects` in Parquet files.
  ("features", TfFeatures(tf_features_dict))  # The original TFDS data.
])

Create a new Space dataset:

In [None]:
from space import Dataset

ds_location = "/directory/coco_demo"  # Change it to your preferred location

ds = Dataset.create(ds_location, ds_schema,
  primary_keys=["id"],
  record_fields=["features"])  # The `features` field is stored in ArrayRecord files.

The following code defines a function `index_fn` that builds index fields for each record read from the ArrayRecord files. It returns three index fields (`id`, `filename`, `objects`), which are written to Parquet files together with the row's address in the ArrayRecord files. See the [storage design](../docs/design.md) for details.

In [None]:
from typing import Any, Dict, List

def pydict_to_pylist(objects: Dict[str, Any]) -> List[Dict[str, Any]]:
  return [
    {"id": id_, "area": area, "bbox": boxes, "is_crowd": is_crowds, "label": labels}
    for area, id_, boxes, is_crowds, labels in
    zip(objects["area"], objects["id"], objects["bbox"], objects["is_crowd"], objects["label"])
  ]

def index_fn(example: Dict[str, Any]) -> Dict[str, Any]:
  # Input format:
  #   key: Space record field name, value: [deserialized TFDS value] (size is 1)
  #    e.g., {"features": [{"image": v, "image/id": v, "image/filename": v, "objects": v}]}
  example = example["features"][0]
  return {
    "id": example["image/id"],
    "filename": example["image/filename"],
    "objects": pydict_to_pylist(example["objects"]),
  }
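
To make the columnar-to-row conversion concrete, here is a minimal, self-contained sketch run on made-up values. The helper name `columns_to_rows` and the toy annotation data are assumptions for illustration; they are not part of Space or TFDS.

```python
from typing import Any, Dict, List

def columns_to_rows(objects: Dict[str, Any]) -> List[Dict[str, Any]]:
  """Converts a dict of equal-length columns into a list of per-object dicts."""
  keys = ["id", "area", "bbox", "is_crowd", "label"]
  return [dict(zip(keys, values)) for values in zip(*(objects[k] for k in keys))]

# Made-up annotations for one image with two objects (hypothetical values).
toy_objects = {
  "area": [1200, 345],
  "bbox": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.9, 0.8]],
  "id": [11, 12],
  "is_crowd": [False, True],
  "label": [27, 0],
}

rows = columns_to_rows(toy_objects)
print(rows[0])
```

Each element of `rows` describes one object, which matches the list-of-structs shape declared for `objects` in `ds_schema`.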

Calling `append_array_record` processes all input ArrayRecord files using `index_fn` to obtain the index fields. Loading completes once the index fields have been written for all ArrayRecord records.

In [None]:
# Files downloaded by TFDS; replace with your own path.
input_pattern = "/tensorflow_datasets/coco/2017/1.1.0/coco-validation.array_record*"

# Index the ArrayRecord files in place using `index_fn`.
ds.local().append_array_record(input_pattern, index_fn)
# >>>
# JobResult(state=<State.SUCCEEDED: 1>, storage_statistics_update=num_rows: 5000
# index_compressed_bytes: 31842
# index_uncompressed_bytes: 47048
# record_uncompressed_bytes: 816568313
# , error_message=None)

ds.add_tag("initialized")  # Tag the current version.

# Check loaded image IDs.
image_ids = ds.local().read_all(fields=["id"])
image_ids.num_rows

Objects are now stored in the columnar field `objects`. Read it into memory as a PyArrow table, and use [DuckDB](https://github.com/duckdb/duckdb) SQL to query it:

In [None]:
import duckdb

# Compute the min/max object bbox area in the dataset.
sql = """
SELECT MIN(objs.area), MAX(objs.area) FROM (
  SELECT UNNEST(objects) AS objs FROM objects)
"""

# DuckDB resolves the table name `objects` in the query to this Python variable.
objects = ds.local().read_all(fields=["objects"])
duckdb.sql(sql).fetchall()
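
For readers less familiar with `UNNEST`, the following pure-Python sketch mirrors what the query computes: it flattens the per-image object lists into a single stream and then takes the minimum and maximum area. The rows and area values are made up for illustration and are not taken from COCO.

```python
from itertools import chain

# Made-up rows: each row holds the `objects` list of one image.
toy_rows = [
  {"objects": [{"area": 1200}, {"area": 345}]},
  {"objects": [{"area": 98765}]},
]

# Equivalent of UNNEST: flatten the per-image lists into one stream of objects.
all_objects = chain.from_iterable(row["objects"] for row in toy_rows)
areas = [obj["area"] for obj in all_objects]

# Equivalent of MIN(objs.area), MAX(objs.area).
min_area, max_area = min(areas), max(areas)
print(min_area, max_area)
# >>>
# 345 98765
```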

Space datasets are mutable. You can run append, insert, upsert, and delete operations either locally (`ds.local()`) or in a distributed fashion (`ds.ray()`). Each mutation generates a new snapshot of the dataset; you can read previous snapshots by providing a snapshot ID or a tag.

In [None]:
import pyarrow.compute as pc

# Delete a row from a Space dataset. The mutation creates a new snapshot
ds.local().delete(pc.field("id") == pc.scalar(361586))
ds.add_tag("delete_some_data")  # Tag the new snapshot

# Check total rows
ds.local().read_all(fields=["id"]).num_rows
# >>>
# 4999

# Time travel back to before the deletion, by setting a read version.
ds.local().read_all(version="initialized", fields=["id"]).num_rows
# >>>
# 5000
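
The relationship between mutations, snapshots, and tags can be illustrated with a toy in-memory model. This is only a conceptual sketch: the class `ToySnapshotStore` and its methods are hypothetical and say nothing about how Space actually stores snapshots on disk.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Union

class ToySnapshotStore:
  """Hypothetical in-memory model of snapshots and tags; not Space's implementation."""

  def __init__(self) -> None:
    self._snapshots: List[FrozenSet[int]] = [frozenset()]  # Snapshot 0 is empty.
    self._tags: Dict[str, int] = {}

  def append(self, ids: Iterable[int]) -> None:
    # Every mutation adds a new immutable snapshot; older ones are kept.
    self._snapshots.append(self._snapshots[-1] | frozenset(ids))

  def delete(self, ids: Iterable[int]) -> None:
    self._snapshots.append(self._snapshots[-1] - frozenset(ids))

  def add_tag(self, tag: str) -> None:
    # A tag is a readable name for the current snapshot ID.
    self._tags[tag] = len(self._snapshots) - 1

  def read_all(self, version: Optional[Union[int, str]] = None) -> FrozenSet[int]:
    if version is None:
      snapshot_id = len(self._snapshots) - 1  # Latest snapshot.
    elif isinstance(version, str):
      snapshot_id = self._tags[version]  # Resolve a tag to a snapshot ID.
    else:
      snapshot_id = version
    return self._snapshots[snapshot_id]

store = ToySnapshotStore()
store.append(range(5))
store.add_tag("initialized")
store.delete([3])
store.add_tag("delete_some_data")

latest_rows = len(store.read_all())
tagged_rows = len(store.read_all(version="initialized"))
print(latest_rows, tagged_rows)
# >>>
# 4 5
```

Because snapshots are never modified after creation, reading by tag returns the data as it was when the tag was added, regardless of later mutations.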

Read data (the `features` field) in the original TFDS format via a [random access data source interface](https://www.tensorflow.org/datasets/tfless_tfds), as the input of training frameworks:

In [None]:
from space import RandomAccessDataSource

datasource = RandomAccessDataSource(
  # field-name: storage-location, for reading data from ArrayRecord files.
  {"features": ds_location},
  deserialize=True,  # Auto deserialize data using `tf_features_dict`.
  use_array_record_data_source=False)

len(datasource)
# >>>
# 4999

# Read the original TFDS data.
datasource[11]
# >>>
# {'image': array([[[239, 239, 237],
#        [239, 239, 241],
#        [239, 239, 239],
#        ...
#        [245, 246, 240]]], dtype=uint8),
#  'image/filename': b'000000292082.jpg', ...
#  'bbox': array([[0.51745313, 0.30523437, 0.6425156 , 0.3789375 ], ...
#  'id': array([ 294157,  467036, 1219153, 1840967, 1937564]),
#  'is_crowd': array([False, False, False, False, False]),
#   'label': array([27,  0,  0, 27, 56])}}
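
Because the data source supports `len()` and integer indexing as shown above, a training loop can draw examples in any order. The sketch below shows one way to iterate such a source in shuffled batches. The helper `shuffled_batches` is hypothetical, and a plain Python list stands in for the real data source so that the example runs on its own.

```python
import random
from typing import Any, Iterator, List, Sequence

def shuffled_batches(source: Sequence[Any], batch_size: int,
                     seed: int = 0) -> Iterator[List[Any]]:
  """Yields batches of examples from `source` in a seeded random order."""
  indices = list(range(len(source)))
  random.Random(seed).shuffle(indices)
  for start in range(0, len(indices), batch_size):
    yield [source[i] for i in indices[start:start + batch_size]]

# Stand-in for a random access data source (made-up examples).
toy_source = [{"image/id": i} for i in range(10)]

batches = list(shuffled_batches(toy_source, batch_size=4))
print([len(batch) for batch in batches])
# >>>
# [4, 4, 2]
```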