
Warn before large downloads in long-running commands (build, quantize) + live elapsed-time counter  #444

@hi-brenda

Description

Long-running commands that trigger large downloads (calibration datasets, model weights) start the download silently with no upfront warning, no size estimate, and no time estimate. A developer on slow or metered connectivity has no chance to abort before the transfer is underway.

Confirmed reproduction with winml build + calibration dataset:

winml build downloads the entire timm/mini-imagenet dataset (~7 GB: 13 train + 3 validation + 2 test parquet files) even though the config specifies "samples": 10. The quantize stage took 896 s (~15 min), almost all of it spent downloading. The user sees no warning and no time estimate before this begins.

Steps to Reproduce

winml build -c config.json -m ProsusAI/finbert -o output/

Where config.json specifies:

{ "quant": { "dataset_name": "timm/mini-imagenet", "samples": 10 } }

Expected Behavior

Before any large download begins, the CLI prints a warning with size and estimated time:

⚠  Downloading calibration dataset timm/mini-imagenet (~7.0 GB).
   Estimated time on 10 Mbps: ~95 min  |  100 Mbps: ~10 min
   Press Ctrl+C to cancel.

If the dataset size cannot be determined ahead of time, at minimum print the dataset name and that it may be large, before streaming begins.
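The warning text above could be produced by a small formatter along these lines. This is only a sketch: the function names are hypothetical (not existing winml code), sizes are treated as decimal GB, and rounding estimates up to the next 5 minutes is an assumption chosen because it reproduces the sample figures. When the size is unknown, it falls back to the name-only message.

```python
import math
from typing import Optional


def estimate_minutes(size_bytes: int, mbps: float, round_to: int = 5) -> int:
    """Transfer time in minutes, rounded up to the next `round_to` minutes."""
    seconds = size_bytes * 8 / (mbps * 1_000_000)
    return math.ceil(seconds / 60 / round_to) * round_to


def format_download_warning(kind: str, name: str, size_bytes: Optional[int]) -> str:
    """Build the pre-download warning; `kind` is e.g. 'calibration dataset'."""
    if size_bytes is None:
        # Size could not be determined: still name the download up front.
        return (
            f"⚠  Downloading {kind} {name} (size unknown, may be large).\n"
            "   Press Ctrl+C to cancel."
        )
    size_gb = size_bytes / 1_000_000_000  # decimal GB (assumption)
    slow = estimate_minutes(size_bytes, 10)
    fast = estimate_minutes(size_bytes, 100)
    return (
        f"⚠  Downloading {kind} {name} (~{size_gb:.1f} GB).\n"
        f"   Estimated time on 10 Mbps: ~{slow} min  |  100 Mbps: ~{fast} min\n"
        "   Press Ctrl+C to cancel."
    )


warning = format_download_warning(
    "calibration dataset", "timm/mini-imagenet", 7_000_000_000
)
print(warning)
```

For 7 GB the raw figures are about 93 min and 9.3 min; the round-up gives the ~95 min and ~10 min shown in the expected output.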

The same pre-download warning should apply to:

  • Model weight downloads (winml build -m <huggingface_id>)
  • Any other command that triggers a network fetch larger than a configurable threshold (e.g., 500 MB)
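A threshold gate for these cases might look like the sketch below. All names are hypothetical, the 500 MB default comes from the example above, and the 3 s pause mirrors the grace period proposed under Root Cause. An unknown size is treated as "warn anyway", which is an assumption consistent with the fallback behavior described earlier.

```python
import time
from typing import Callable, Optional

DEFAULT_THRESHOLD_BYTES = 500 * 1_000_000  # 500 MB, decimal (assumption)


def needs_warning(
    size_bytes: Optional[int],
    threshold_bytes: int = DEFAULT_THRESHOLD_BYTES,
) -> bool:
    """Warn when the fetch exceeds the threshold or its size is unknown."""
    return size_bytes is None or size_bytes > threshold_bytes


def wait_for_abort(
    grace_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pause so the user can press Ctrl+C.

    Returns True to proceed with the download, False if the user aborted.
    `sleep` is injectable so the pause can be tested without waiting.
    """
    try:
        sleep(grace_seconds)
    except KeyboardInterrupt:
        return False
    return True
```

A caller would print the warning only when `needs_warning(...)` is true, then start the download only if `wait_for_abort()` returns True.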

Actual Behavior

No warning is printed. The download starts immediately and silently inside the quantize StageLive block. The first visible signal is the spinner; the user has no indication of how long it will run or how much data will be transferred.

Root Cause (initial analysis)

The datasets library streams or caches parquet shards for the full dataset split regardless of how many samples are consumed downstream. The quantize_onnx call in _run_quantize_stage (build.py) does not query dataset size before fetching, and StageLive suppresses datasets progress bars to keep the display clean, which removes the only secondary signal the user might have seen.
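The missing size query could be sketched with huggingface_hub as follows. The helper names are hypothetical; the sketch assumes that `HfApi.dataset_info` and `HfApi.model_info` called with `files_metadata=True` populate a `size` on each entry of `siblings`. The sum covers every file in the repo, so it is an upper bound on what a given command actually fetches.

```python
from typing import Iterable, Optional


def total_size_bytes(sizes: Iterable[Optional[int]]) -> Optional[int]:
    """Sum per-file sizes; return None if nothing is listed or any size is missing."""
    total = 0
    seen = False
    for size in sizes:
        if size is None:
            return None  # treat the whole repo size as unknown
        total += size
        seen = True
    return total if seen else None


def hub_repo_size(repo_id: str, repo_type: str = "dataset") -> Optional[int]:
    """Query the Hub for the total file size of a dataset or model repo."""
    # Imported lazily so the pure helper above works without the dependency.
    from huggingface_hub import HfApi

    api = HfApi()
    if repo_type == "dataset":
        info = api.dataset_info(repo_id, files_metadata=True)
    else:
        info = api.model_info(repo_id, files_metadata=True)
    return total_size_bytes(f.size for f in (info.siblings or []))
```

For the reproduction case this would be called as `hub_repo_size("timm/mini-imagenet")` before the quantize stage starts; a `None` result maps to the "size unknown" fallback message.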

Two independent fixes are needed:

  1. Pre-download warning (UX): Query the Hugging Face Hub API for dataset / model size before fetching and print a structured warning with size + time estimate. Block for 3 s (or until Ctrl+C) to give the user a chance to abort.
  2. Lazy / partial download (efficiency): Investigate whether datasets streaming mode or shard-level access can be used to fetch only the N calibration samples without pulling all parquet files first. If feasible, this eliminates the problem for the samples-bounded case entirely.
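For the second fix, one option to investigate is the streaming mode of datasets combined with an early stop after N rows. The sketch below uses hypothetical helper names and assumes a "train" split; it does not confirm that streaming avoids pulling whole parquet shards for this dataset, which is the open question the investigation would need to answer.

```python
from itertools import islice
from typing import Any, Iterable, List


def take_samples(rows: Iterable[Any], n: int) -> List[Any]:
    """Consume at most n rows from an iterable and stop pulling after that."""
    return list(islice(rows, n))


def load_calibration_samples(
    dataset_name: str, n: int, split: str = "train"
) -> List[Any]:
    """Fetch only the first n rows of a split via streaming mode."""
    # Imported lazily so take_samples stays usable without the dependency.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on
    # iteration instead of downloading and caching the full split first.
    stream = load_dataset(dataset_name, split=split, streaming=True)
    return take_samples(stream, n)
```

With the config from the reproduction, this would be `load_calibration_samples("timm/mini-imagenet", 10)`. Taking the first N rows of a stream is not a random sample, so calibration quality may differ from sampling across the full split.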

Environment

Additional Context

This issue affects any command with a slow first-run experience:

Command                     Download trigger         Typical size
winml build (HF model)      Model weights            0.1 – 10 GB
winml build (calibration)   Dataset parquet shards   1 – 50 GB
winml quantize              Same as above            1 – 50 GB
winml eval                  Eval dataset             variable

A developer running on coffee shop WiFi or a metered mobile hotspot will abandon the tool after one silent 15-minute hang. The pre-download warning is a low-cost, high-trust fix that should be prioritized independently of the lazy-download optimization.

Metadata

Labels

P1: High, major feature broken or significant UX impact
dev experience: Developer experience improvements

Type

No fields configured for Bug.
