## Labeling training data with Space as the database: a LabelStudio example

Space offers several advantages as the storage layer for ML data labeling services. It provides simple APIs to add and remove data entries (rows), and supports deduplication and overwriting data. Version management features (snapshots, tags) let you time travel to a previous version. Branch support, which will allow modifying a previous version, is in progress.

Space supports analytical queries on the annotations to gain insights into the data. Space's transforms and materialized views (MVs) help you build a data preprocessing pipeline in a few lines. The source dataset can be transformed into a training-ready format to feed into training frameworks. When annotations change (e.g., annotations are added, dropped, or modified), you can refresh the MVs to incrementally synchronize the changes.

This example uses [LabelStudio](https://github.com/HumanSignal/label-studio) and Space to:
- Label objects for training image object detection models
- Store the labeling result in Space
- Analyze labels using SQL
- Build a data processing pipeline with Space transforms and MVs, and produce training data in [TensorFlow format](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeaturesDict)

### Label object bounding boxes in images

Follow the [Label Studio guide](https://labelstud.io/guide/install) to install and run `label-studio`, then set up the labeling interface and import the data to label.

We use a minimal [RectangleLabels](https://labelstud.io/tags/rectanglelabels) configuration as the labeling interface:

```html
<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="building" background="green"/>
    <Label value="car" background="red"/>
    <Label value="human" background="yellow"/>
    <Label value="airplane" background="gray"/>
    <Label value="boat" background="blue"/>
  </RectangleLabels>
</View>
```
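
The label names in this configuration later determine the number of classes of the detector. As a minimal sketch (not part of the original notebook), the names can be read back from the config with the standard library; `LABEL_CONFIG` and `config_labels` are hypothetical names introduced only for this illustration:

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical inline copy of the labeling config shown above.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="building" background="green"/>
    <Label value="car" background="red"/>
    <Label value="human" background="yellow"/>
    <Label value="airplane" background="gray"/>
    <Label value="boat" background="blue"/>
  </RectangleLabels>
</View>
"""

def config_labels(config_xml: str) -> List[str]:
  """Returns the `value` of every <Label> tag, in document order."""
  root = ET.fromstring(config_xml.strip())
  return [label.get("value") for label in root.iter("Label")]

labels_in_config = config_labels(LABEL_CONFIG)
# This count should match the number of classes used for training.
num_classes = len(labels_in_config)
print(labels_in_config, num_classes)
```

Checking the count early helps catch a mismatch between the labeling interface and the class mapping used in the training pipeline.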

Next, import the images to label into LabelStudio. This example simply uses [image files in a local directory](https://labelstud.io/guide/tasks#Import-data-from-a-local-directory). After manual labeling is done, export the annotations as JSON files. The annotations look like this (fields pruned):

```json
[
  {
    "id": 8,
    "annotations": [
      {
        "result": [
          {
            "original_width": 1500,
            "original_height": 2060,
            "value": {
              "x": 0.9688418577307466,
              "y": 0.7054673721340388,
              "width": 97.36860670194004,
              "height": 92.68077601410934,
              "rectanglelabels": [
                "building"
              ]
            }
          }
        ],
        "unique_id": "24dd2656-41dd-4029-9ec0-aa8ce2e4c2d0"
      }
    ],
    "data": {
      "image": "/segment_anything/sa_000000/sa_1.jpg"
    },
    "created_at": "2024-01-11T16:22:23.587074Z"
  }
]
```
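
The nesting is task → `annotations` → `result` → `value`. To make that structure concrete, here is a minimal sketch that flattens an export into one record per rectangle using only the standard library. `EXPORT_JSON` is a hypothetical inline copy of the pruned sample above, and `iter_boxes` is a helper name invented for this illustration:

```python
import json

# Hypothetical inline copy of the pruned sample above.
EXPORT_JSON = """
[
  {
    "id": 8,
    "annotations": [
      {
        "result": [
          {
            "original_width": 1500,
            "original_height": 2060,
            "value": {
              "x": 0.9688418577307466,
              "y": 0.7054673721340388,
              "width": 97.36860670194004,
              "height": 92.68077601410934,
              "rectanglelabels": ["building"]
            }
          }
        ]
      }
    ],
    "data": {"image": "/segment_anything/sa_000000/sa_1.jpg"}
  }
]
"""

def iter_boxes(tasks):
  """Yields one flat dict per labeled rectangle."""
  for task in tasks:
    for annotation in task.get("annotations", []):
      for result in annotation.get("result", []):
        value = result["value"]
        yield {
          "task_id": task["id"],
          "image": task["data"]["image"],
          "label": value["rectanglelabels"][0],
          # x, y, width, height are percentages of the image size.
          "x": value["x"],
          "y": value["y"],
          "width": value["width"],
          "height": value["height"],
        }

boxes = list(iter_boxes(json.loads(EXPORT_JSON)))
print(boxes)
```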

### Write annotations into a Space dataset

Read the JSON file and convert annotations to a PyArrow table `labels`:

```python
import json
import pandas as pd
import pyarrow as pa

# The exported LabelStudio annotation file.
exported_ls_json = "project-2-at-2024-01-11-16-27-38c5377f.json"

# Preprocess before loading into Space.
# Drop fields that are empty in the JSON, because their types cannot be inferred.
with open(exported_ls_json) as json_file:
  labels_json = json.load(json_file)

empty_annotation_fields = [
  "draft_created_at", "prediction", "import_id", "last_action",
  "parent_prediction", "parent_annotation", "last_created_by",
]
empty_entry_fields = [
  "drafts", "predictions", "meta", "last_comment_updated_at",
  "comment_authors",
]

for entry in labels_json:
  for annotation in entry["annotations"]:
    for field in empty_annotation_fields:
      # `pop` with a default tolerates exports that lack the field.
      annotation.pop(field, None)
  for field in empty_entry_fields:
    entry.pop(field, None)

# Convert the JSON array to a PyArrow table `labels`.
labels = pa.Table.from_pandas(pd.json_normalize(labels_json))

# TODO: remove once Space supports timestamp fields (current Space limitation).
labels = labels.drop(["created_at", "updated_at"])

# Check total number of rows.
labels.num_rows

# Check all fields.
labels.schema.names
# >>>
# ['id', 'annotations', 'inner_id', 'total_annotations', 'cancelled_annotations',
# 'total_predictions', 'comment_count', 'unresolved_comment_count', 'project', 'updated_by',
# 'data.image']
```
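
The list of fields to drop depends on the export. As an alternative to hardcoding names, the following sketch detects keys whose value is empty (`None`, `[]`, or `{}`) in every entry and removes them. `is_empty` and `drop_always_empty` are hypothetical helpers, and the sample entries are made up for illustration:

```python
from typing import Any, Dict, List

def is_empty(value: Any) -> bool:
  """True for values whose type cannot be inferred from the data."""
  return value is None or value == [] or value == {}

def drop_always_empty(entries: List[Dict[str, Any]]) -> List[str]:
  """Removes, in place, keys that are empty in every entry.

  Returns the sorted list of dropped keys.
  """
  keys = set().union(*(entry.keys() for entry in entries))
  dropped = sorted(
    key for key in keys if all(is_empty(e.get(key)) for e in entries))
  for entry in entries:
    for key in dropped:
      entry.pop(key, None)
  return dropped

# Made-up entries that mimic the shape of an export.
sample = [
  {"id": 8, "drafts": [], "meta": {}, "last_comment_updated_at": None,
   "annotations": [{"id": 1, "prediction": {}, "import_id": None}]},
  {"id": 9, "drafts": [], "meta": {}, "last_comment_updated_at": None,
   "annotations": [{"id": 2, "prediction": {}, "import_id": None}]},
]

dropped_entry_keys = drop_always_empty(sample)
# Apply the same rule one level down, across all annotations.
dropped_annotation_keys = drop_always_empty(
  [a for entry in sample for a in entry["annotations"]])
print(dropped_entry_keys, dropped_annotation_keys)
```

Dropping a key only when it is empty in every entry keeps the schema identical across rows, which matters when the rows are later converted to a single table.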

Create an empty Space dataset, using the `labels` PyArrow table's schema:

```python
from space import Dataset

label_ds_location = "/space/labelstudio/label_ds"

# Create an empty dataset using the schema of `labels`.
label_ds = Dataset.create(label_ds_location, labels.schema,
  primary_keys=["id"], record_fields=[])

# Load the dataset after creation:
# label_ds = Dataset.load(label_ds_location)
```

#### Data mutations and version management

Append `labels` into the empty Space dataset, then delete some rows. Add a tag after each mutation, and read old data versions using tags.

Branch features (work in progress) will support modifying a previous version.

```python
import pyarrow.compute as pc

# Append `labels` into the dataset.
label_ds.local().append(labels)
# Tag this version.
label_ds.add_tag("after_append")

# Check all `id`s we have, and delete 2 rows.
label_ds.local().read_all(fields=["id"])
label_ds.local().delete((pc.field("id") == 8) | (pc.field("id") == 9))
label_ds.add_tag("after_delete")

# Read an old version.
label_ds.local().read_all(version="after_append", fields=["id"])
```
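
To illustrate the semantics of tags and time travel, here is a toy in-memory model. It is not how Space is implemented and does not use Space's API; `ToyVersionedTable` and its methods are hypothetical, and every mutation simply copies the previous snapshot:

```python
from typing import Dict, Iterable, List, Optional

class ToyVersionedTable:
  """Toy stand-in keyed by `id`; each mutation creates a new snapshot."""

  def __init__(self) -> None:
    self._snapshots: List[Dict[int, dict]] = [{}]
    self._tags: Dict[str, int] = {}

  def append(self, rows: Iterable[dict]) -> None:
    current = dict(self._snapshots[-1])
    for row in rows:
      current[row["id"]] = row
    self._snapshots.append(current)

  def delete(self, ids: Iterable[int]) -> None:
    to_delete = set(ids)
    current = {
      k: v for k, v in self._snapshots[-1].items() if k not in to_delete}
    self._snapshots.append(current)

  def add_tag(self, tag: str) -> None:
    # A tag is a stable name for the latest snapshot.
    self._tags[tag] = len(self._snapshots) - 1

  def read_ids(self, version: Optional[str] = None) -> List[int]:
    index = self._tags[version] if version is not None else -1
    return sorted(self._snapshots[index])

table = ToyVersionedTable()
table.append([{"id": 8}, {"id": 9}, {"id": 10}])
table.add_tag("after_append")
table.delete([8, 9])
table.add_tag("after_delete")

latest_ids = table.read_ids()
old_ids = table.read_ids(version="after_append")
print(latest_ids, old_ids)
```

The point of the sketch is that a tag pins a snapshot, so reading with a tag returns the rows as they were when the tag was added, regardless of later deletes.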

#### Analyzing data using SQL queries

The following example uses SQL queries to analyze the distribution of annotations:

```python
import duckdb

# Load the `annotations` column into memory.
data = label_ds.local().read_all(fields=["annotations"])

# Run a SQL query to compute the max object size.
sql = """
SELECT MAX(result.value.width * result.value.height) FROM (
  SELECT UNNEST(annotations.result) AS result FROM (
    SELECT UNNEST(annotations) AS annotations FROM data
  )
)
"""
duckdb.sql(sql).fetchall()
```
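
The two nested `UNNEST` calls first expand each row into its annotations, then each annotation into its results. A pure-Python sketch of the same computation may make that clearer; `max_box_area` is a hypothetical helper and `rows` is made-up data with the same nesting as the `annotations` column:

```python
from typing import Any, Dict, List, Optional

def max_box_area(rows: List[Dict[str, Any]]) -> Optional[float]:
  """Largest width * height over all results, or None if there are none."""
  areas = [
    result["value"]["width"] * result["value"]["height"]
    for row in rows                         # FROM data
    for annotation in row["annotations"]    # UNNEST(annotations)
    for result in annotation["result"]      # UNNEST(annotations.result)
  ]
  return max(areas) if areas else None

# Made-up rows; width and height are percentages of the image size.
rows = [
  {"annotations": [{"result": [{"value": {"width": 10.0, "height": 5.0}}]}]},
  {"annotations": [{"result": [{"value": {"width": 20.0, "height": 4.0}}]}]},
]

largest = max_box_area(rows)
print(largest)
```

Because `width` and `height` are percentages, the result is a relative size, not a pixel area.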

### Transform annotations to a training-ready TensorFlow dataset

[TensorFlow Datasets](https://www.tensorflow.org/datasets/tfless_tfds) (TFDS) stores training data in [ArrayRecord](https://github.com/google/array_record) files. It serializes each example into bytes using a [FeaturesDict](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeaturesDict), and an ArrayRecord file stores an array of these serialized examples.

We define a simple `FeaturesDict` for training an object detector:

```python
import numpy as np
from tensorflow_datasets import features as f

tf_features_dict = f.FeaturesDict({
  "image": f.Image(shape=(None, None, 3), dtype=np.uint8),
  "objects": f.Sequence({
    "bbox": f.BBoxFeature(),
    "label": f.ClassLabel(num_classes=5),
  }),
})
```

`preprocess_labels` is a user-defined function that takes a batch of the `id`, `data.image`, and `annotations` columns of the dataset `label_ds`. It reads and resizes each image and converts the bounding boxes to the TFDS format. Finally, it serializes the features and returns the converted batch.

```python
from typing import Any, Dict
import cv2

label_classes = {
  "airplane": 0,
  "boat": 1,
  "building": 2,
  "human": 3,
  "car": 4
}

def preprocess_labels(data: Dict[str, Any]) -> Dict[str, Any]:
  features = []
  for file_path, annotations in zip(data["data.image"], data["annotations"]):
    # NOTE: may need to modify file_path in your case.
    # NOTE: cv2.imread returns pixels in BGR channel order.
    im = cv2.imread(file_path)
    im = cv2.resize(im, dsize=(100, 100), interpolation=cv2.INTER_CUBIC)

    objects = []
    for annotation in annotations:
      for result in annotation["result"]:
        v = result["value"]
        # LabelStudio stores coordinates as percentages; TFDS expects [0, 1].
        ymin, xmin = v["y"] / 100, v["x"] / 100
        w, h = v["width"] / 100, v["height"] / 100
        ymax, xmax = ymin + h, xmin + w
        objects.append({
          "bbox": f.BBox(ymin, xmin, ymax, xmax),
          "label": label_classes[v["rectanglelabels"][0]]
        })

    # Serialize the TFDS feature to bytes to write to storage.
    features.append(
      tf_features_dict.serialize_example({"image": im, "objects": objects}))

  return {"id": data["id"], "features": features}
```
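
The bounding box arithmetic can be checked in isolation. The sketch below repeats the percentage-to-fraction conversion without any TFDS or OpenCV dependency, using the values from the exported sample. `NormalizedBox`, `to_normalized_box`, and `to_pixel_box` are hypothetical names, and the optional clamping is an addition for consumers that reject coordinates outside [0, 1]:

```python
from typing import Any, Dict, NamedTuple, Tuple

class NormalizedBox(NamedTuple):
  ymin: float
  xmin: float
  ymax: float
  xmax: float

def to_normalized_box(value: Dict[str, Any], clamp: bool = True) -> NormalizedBox:
  """Converts percentage-based x/y/width/height to fractions in [0, 1]."""
  ymin, xmin = value["y"] / 100, value["x"] / 100
  ymax = ymin + value["height"] / 100
  xmax = xmin + value["width"] / 100
  coords = (ymin, xmin, ymax, xmax)
  if clamp:
    # Guard against small overshoots caused by floating point rounding.
    coords = tuple(min(max(c, 0.0), 1.0) for c in coords)
  return NormalizedBox(*coords)

def to_pixel_box(box: NormalizedBox, width: int, height: int) -> Tuple[int, int, int, int]:
  """Scales a normalized box to pixel coordinates of the original image."""
  return (round(box.ymin * height), round(box.xmin * width),
          round(box.ymax * height), round(box.xmax * width))

# Values copied from the exported sample.
sample_value = {
  "x": 0.9688418577307466,
  "y": 0.7054673721340388,
  "width": 97.36860670194004,
  "height": 92.68077601410934,
}

box = to_normalized_box(sample_value)
pixel_box = to_pixel_box(box, width=1500, height=2060)
print(box, pixel_box)
```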

Use a `map_batches` transform to convert the dataset `label_ds` to a view `training_view`, then create a materialized view from it. When `label_ds` is modified, simply `refresh` the MV to incrementally synchronize the changes.

```python
from space import TfFeatures

training_view = label_ds.map_batches(
  fn=preprocess_labels,
  input_fields=["id", "data.image", "annotations"],  # Input of fn
  output_schema=pa.schema([   # Schema of the view
    ("id", pa.int64()),
    ("features", TfFeatures(tf_features_dict))
  ]),
  output_record_fields=["features"]  # Store this field in ArrayRecord
)

training_mv_location = "/space/labelstudio/training_mv"
training_mv = training_view.materialize(training_mv_location)

training_mv.ray().refresh("after_append")

# from space import MaterializedView
# training_mv = MaterializedView.load(training_mv_location)
# training_mv.ray().read_all()
```

Use Space's `RandomAccessDataSource` to feed data to the training framework:

```python
from space import RandomAccessDataSource

datasource = RandomAccessDataSource(
  # field-name: storage-location, for reading data from ArrayRecord files.
  {"features": training_mv_location},
  # Auto deserialize data using `tf_features_dict`.
  deserialize=True)

# Read the data source.
len(datasource)

datasource[1]
# >>>
# {'image': array([[[234, 227, 224],
#         [243, 237, 238],
#         [245, 239, 236],
#         ...,
#         [187, 176, 178]]], dtype=uint8),
#  'objects': {'bbox': array([[0.00705467, 0.00968842, 0.93386245, 0.9833745 ]], dtype=float32),
#              'label': array([2])}}
```
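
Since the data source supports `len()` and integer indexing, a training loop can draw shuffled batches from it. The following sketch shows one way to do that with the standard library; `iter_batches` is a hypothetical helper, and a plain list stands in for the real data source so the example runs without Space installed:

```python
import random
from typing import Any, Iterator, List, Sequence

def iter_batches(source: Sequence[Any], batch_size: int,
                 seed: int = 0) -> Iterator[List[Any]]:
  """Yields shuffled batches from any object with len() and indexing."""
  indices = list(range(len(source)))
  # A seeded generator keeps the shuffle reproducible across runs.
  random.Random(seed).shuffle(indices)
  for start in range(0, len(indices), batch_size):
    yield [source[i] for i in indices[start:start + batch_size]]

# Stand-in for the data source: five made-up examples.
stand_in_source = [{"id": i} for i in range(5)]

batches = list(iter_batches(stand_in_source, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
seen_ids = sorted(example["id"] for batch in batches for example in batch)
print(batch_sizes, seen_ids)
```

The last batch is smaller when the number of examples is not a multiple of `batch_size`; drop it in the training loop if the model requires fixed-size batches.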