## Parallel Data Ingestion from WebDataset

This example ingests data from a [WebDataset](https://github.com/webdataset/webdataset) into a Space dataset. The ingestion uses the Ray runner to distribute the workload across a Ray cluster.

We use [img2dataset](https://github.com/rom1504/img2dataset) to download popular ML datasets in the WebDataset format. Install the packages and download the COCO dataset following the [guide](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/mscoco.md):

In [None]:
pip install webdataset img2dataset 

wget https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT/resolve/main/mscoco.parquet

img2dataset --url_list mscoco.parquet --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format "webdataset" \
    --output_folder "mscoco" --processes_count 16 --thread_count 64 --image_size 256 \
    --enable_wandb True

Place the downloaded COCO WebDataset in `COCO_DIR` (a local directory or a Cloud Storage location). Then create an empty Space dataset:

In [None]:
import pyarrow as pa

from space import DirCatalog
from space import ArrayRecordOptions, FileOptions, RayOptions

# The schema of a new Space dataset to create.
schema = pa.schema([
  ("key", pa.string()),
  ("caption", pa.binary()),
  ("jpg", pa.binary())])

catalog = DirCatalog("/path/to/my/tables")
ds = catalog.create_dataset("coco", schema, primary_keys=["key"],
  record_fields=["jpg"]) # Store "jpg" in ArrayRecord files

# Or load an existing dataset.
# ds = catalog.dataset("coco")

Connect to a Ray cluster and create a Ray runner for the dataset. See the [setup and performance doc](/docs/performance.md#ray-runner-setup) for how to configure the options.

In [None]:
import ray

# Connect to a Ray cluster. Or skip it to use a local Ray instance.
ray.init(address="ray://12.34.56.78:10001")

runner = ds.ray(
  ray_options=RayOptions(max_parallelism=4),
  file_options=FileOptions(
    array_record_options=ArrayRecordOptions(options="group_size:64")
  ))

A WebDataset consists of a directory of tar files. A URL of the form `something-{000000..012345}.tar` represents a shard of the dataset, i.e., a subset of the tar files. The following `read_webdataset` function returns an iterator that scans the shard described by a URL and yields batches of up to `BATCH_SIZE` samples.

In [None]:
import webdataset as wds

# The size of a batch returned by the iterator.
BATCH_SIZE = 64

def read_webdataset(shard_url: str):
  print(f"Processing URL: {shard_url}")

  def to_dict(keys, captions, jpgs):
    return {"key": keys, "caption": captions, "jpg": jpgs}

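  # Named `dataset` to avoid confusion with the Space dataset `ds` created above.
  dataset = wds.WebDataset(shard_url)
  keys, captions, jpgs = [], [], []
  for sample in dataset:
    keys.append(sample["__key__"])
    captions.append(sample["txt"])
    jpgs.append(sample["jpg"])

    if len(keys) == BATCH_SIZE:
      yield to_dict(keys, captions, jpgs)
      # Start new lists instead of clearing the old ones, so a batch is never
      # mutated after it has been handed to the consumer.
      keys, captions, jpgs = [], [], []

  # Yield the final, possibly smaller, batch.
  if keys:
    yield to_dict(keys, captions, jpgs)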
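
To make the shard URL notation concrete, the sketch below expands a single numeric `{start..end}` range into the file names it covers. WebDataset performs this expansion itself when it opens a URL; `expand_numeric_range` is a hypothetical helper written only for illustration, not the library's implementation, and it handles just one zero-padded numeric range.

```python
import re


def expand_numeric_range(url: str) -> list[str]:
  """Expands one `{start..end}` numeric range in `url` (illustrative only)."""
  match = re.search(r"\{(\d+)\.\.(\d+)\}", url)
  if match is None:
    return [url]  # No range: the URL names a single file.

  start, end = match.group(1), match.group(2)
  width = len(start)  # Preserve the zero padding of the start bound.
  prefix, suffix = url[:match.start()], url[match.end():]
  return [
      f"{prefix}{i:0{width}d}{suffix}"
      for i in range(int(start), int(end) + 1)
  ]


# The path below is a placeholder, not a real location.
print(expand_numeric_range("/data/coco/{00000..00003}.tar"))
```

Under these assumptions, a URL ending in `{00000..00003}.tar` stands for four tar files, `00000.tar` through `00003.tar`, which `read_webdataset` scans in sequence.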

The whole COCO dataset has 60 tar files. We split them into multiple shards, one per worker, to ingest data in parallel:

In [None]:
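NUM_SHARDS = 60  # Total number of tar files in the dataset.
NUM_WORKERS = 4  # Number of shard URLs to create, one per parallel reader.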

COCO_DIR = "/path/to/my/downloaded/coco"

shards_per_worker = NUM_SHARDS // NUM_WORKERS
shard_urls = []
for i in range(NUM_WORKERS):
  start = f"{i * shards_per_worker:05d}"
  end = f"{min(NUM_SHARDS-1, (i + 1) * shards_per_worker - 1):05d}"
  shard_urls.append(COCO_DIR + f"/{{{start}..{end}}}.tar")
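
The loop above covers every file only when `NUM_SHARDS` is divisible by `NUM_WORKERS`; with 62 files and 4 workers, for example, the last two files would not appear in any URL. A minimal sketch of a more general split follows. `make_shard_urls` is a hypothetical helper, not part of Space or WebDataset, and it assumes the five-digit file names used above.

```python
def make_shard_urls(base_dir: str, num_files: int, num_workers: int) -> list[str]:
  """Splits files 0..num_files-1 into at most `num_workers` contiguous ranges."""
  urls = []
  # The first `extra` workers each take one additional file.
  base, extra = divmod(num_files, num_workers)
  start = 0
  for i in range(num_workers):
    count = base + (1 if i < extra else 0)
    if count == 0:
      break  # Fewer files than workers: the remaining workers get nothing.
    end = start + count - 1
    urls.append(f"{base_dir}/{{{start:05d}..{end:05d}}}.tar")
    start = end + 1
  return urls


# The directory below is a placeholder path.
for url in make_shard_urls("/path/to/coco", 62, 4):
  print(url)
```

For the 60-file, 4-worker case this produces the same four ranges as the loop above; the difference only shows when the division leaves a remainder.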

The `append_from` method takes a list of no-argument functions as input. Each function creates a new iterator that reads one shard of the input data. The iterators are consumed in parallel on Ray nodes.

In [None]:
from functools import partial

# A Ray remote fn to call `append_from` from the Ray cluster.
@ray.remote
def run():
  runner.append_from([partial(read_webdataset, url) for url in shard_urls])

# Start ingestion
ray.get(run.remote())
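
The role of `partial` in the call above can be seen without Ray or Space. In the sketch below, `read_numbers` is a hypothetical stand-in for `read_webdataset`: each `partial` object is a no-argument callable, and calling it creates a fresh generator that does no work until it is iterated.

```python
from functools import partial


def read_numbers(start: int, stop: int):
  """A stand-in for `read_webdataset`: yields one batch built from a range."""
  yield {"value": list(range(start, stop))}


# Each partial is a no-argument callable bound to its own arguments.
readers = [partial(read_numbers, 0, 3), partial(read_numbers, 3, 6)]

# Calling a reader creates a new generator; iterating it produces the batches.
batches = [batch for make_iter in readers for batch in make_iter()]
print(batches)  # [{'value': [0, 1, 2]}, {'value': [3, 4, 5]}]
```

Using `partial` rather than a `lambda` inside the list comprehension also avoids Python's late-binding behavior, where every `lambda` would end up reading the last value of the loop variable.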