## Preview WebDataset Shards

Quick notebook to mirror the validation workflow before uploading shards to the Hub. It borrows the streaming checks and audio sniff test from `HF_UPLOAD_STEPS.md` so you can run them in one place.

### 1. Spot-check metadata with streaming mode

`datasets` will stream each tar member without materialising the full audio column. Iterate over a few samples to confirm titles, durations, and other metadata line up with `metadata.csv`.


In [1]:
from itertools import islice
from datasets import load_dataset

ds_stream = load_dataset(
    "webdataset",
    data_files={"2025": "webdataset/2025/*.tar"},
    streaming=True,
)

for sample in islice(ds_stream["2025"], 3):
    print(sample["json"]["title"], sample["json"]["duration"])

《2024年行車隧道(政府)(修訂)條例草案》委員會會議 (2025/01/03) 02:15:45
內務委員會例行會議(2025/01/03) 00:36:45
內務委員會會議 (2025/01/03) 00:22:20
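
The visual comparison against `metadata.csv` can also be automated. The following is a minimal sketch, not part of the original workflow: the helper name `find_metadata_mismatches` is hypothetical, and it assumes `metadata.csv` has `title` and `duration` columns that match the keys in each sample's `json` member. The stand-in data at the bottom replaces the real stream and CSV so the snippet runs on its own.

```python
import csv
import io
from itertools import islice


def find_metadata_mismatches(samples, csv_rows, fields=("title", "duration"), limit=3):
    """Compare the first `limit` streamed samples against CSV rows keyed by title.

    Hypothetical helper: the `title`/`duration` column names are assumptions.
    """
    by_title = {row["title"]: row for row in csv_rows}
    mismatches = []
    for sample in islice(samples, limit):
        meta = sample["json"]
        row = by_title.get(meta["title"])
        if row is None:
            mismatches.append((meta["title"], "missing from CSV"))
            continue
        for field in fields:
            # CSV values are always strings, so compare on the string form.
            if str(meta[field]) != row[field]:
                mismatches.append(
                    (meta["title"], f"{field}: {meta[field]!r} != {row[field]!r}")
                )
    return mismatches


# Stand-in data so the sketch is self-contained (not real shard contents).
demo_csv = io.StringIO("title,duration\nMeeting A,00:36:45\nMeeting B,00:22:20\n")
demo_rows = list(csv.DictReader(demo_csv))
demo_samples = [
    {"json": {"title": "Meeting A", "duration": "00:36:45"}},
    {"json": {"title": "Meeting B", "duration": "00:22:21"}},
    {"json": {"title": "Meeting C", "duration": "00:10:00"}},
]
demo_mismatches = find_metadata_mismatches(demo_samples, demo_rows)
print(demo_mismatches)
```

In the notebook, you would pass `ds_stream["2025"]` as `samples` and the rows of `csv.DictReader(open("metadata.csv", newline="", encoding="utf-8"))` as `csv_rows`, adjusting the column names if your CSV differs.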


### 2. Decode the first clip locally

Grab one example, decode ~30 seconds via the `AudioDecoder`, and hand it to `IPython.display.Audio`. The array is shaped `(channels, samples)`; convert to mono with `mean(axis=0)` if you prefer.


In [1]:
from datasets import load_dataset
from IPython.display import Audio, display

# Stream the shards and pull only the first example.
example = next(iter(load_dataset(
    "webdataset",
    data_files={"2025": "webdataset/2025/*.tar"},
    streaming=True,
)["2025"]))

decoder = example["opus"]  # AudioDecoder
clip = decoder.get_samples_played_in_range(0, 30)  # first 30 seconds
audio_np = clip.data.cpu().numpy()  # shape: (channels, samples)
sample_rate = clip.sample_rate

# Notebook preview; display() is required because a cell only renders its last expression
display(Audio(audio_np, rate=sample_rate))
# Optional mono preview
display(Audio(audio_np.mean(axis=0), rate=sample_rate))
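
Before listening, it can help to confirm that the decoded array has the expected layout and length. The sketch below is an illustration under assumptions, not part of the original notebook: `to_mono` and `clip_seconds` are hypothetical helper names, and the synthetic stereo array and 16 kHz rate stand in for `audio_np` and `sample_rate`.

```python
import numpy as np


def to_mono(audio):
    """Average a (channels, samples) array down to a 1-D mono signal."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
    return audio.mean(axis=0)


def clip_seconds(audio, sample_rate):
    """Duration in seconds, taking the last axis as the sample axis."""
    return np.asarray(audio).shape[-1] / sample_rate


# Synthetic stand-in: 2 seconds of stereo at an assumed 16 kHz rate.
demo_rate = 16_000
demo_stereo = np.stack([np.ones(demo_rate * 2), np.zeros(demo_rate * 2)])
demo_mono = to_mono(demo_stereo)
print(demo_mono.shape, clip_seconds(demo_stereo, demo_rate))
```

With the real clip, `clip_seconds(audio_np, sample_rate)` should come out close to 30 for any recording at least that long, which is a quick way to catch a transposed `(samples, channels)` array.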