Strategy proposal: data-dependent output-shape ops (unique, nonzero, boolean indexing) via a static size= argument #3685

katlun-lgtm · 2026-06-14T17:54:20Z

A consistent strategy for data-dependent output-shape ops in MLX

Draft design discussion for ml-explore/mlx, covering unique*, nonzero, boolean-mask read indexing a[mask], unary where(cond), compress/extract, and repeat with a tensor argument.

Every framework-comparison claim below is attributed to a primary source — official JAX / PyTorch / OpenXLA / ONNX / array-API / MLX docs and the linked GitHub issues (full list in the appendix).

1. Problem statement

MLX fixes every array's shape at graph-build time. Concretely (mlx source):

Primitive::output_shapes(const std::vector<array>& inputs) derives output shapes from input shapes only, never input values.
An array is constructed with its shape fixed; UnaryPrimitive::eval_cpu/eval_gpu(inputs, array& output) receive an already-allocated, already-shaped output and merely fill it.

So any op whose output element-count depends on input data has nowhere to express its shape. This is why the maintainer deferred the whole class (issue #856: "ops whose output shape depends on input data … we will not implement it until we have figured out a consistent strategy") and why a[mask] read indexing raises "boolean indices are not yet supported" (#865, #246) and nonzero / unary where are absent (MLX indexing docs).
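
To make the constraint concrete, here is a minimal pure-Python sketch. It is not MLX code: plain lists stand in for arrays, and `add_output_shape` and `nonzero_count` are hypothetical names used only for illustration. It shows that an elementwise op's output shape follows from the input shapes alone, while two inputs of identical shape can yield nonzero results of different lengths.

```python
# Illustrative model only: plain Python lists stand in for arrays.

def add_output_shape(shape_a, shape_b):
    """Shape of an elementwise op: a function of the input shapes only."""
    if shape_a != shape_b:
        raise ValueError("this sketch does not model broadcasting")
    return shape_a


def nonzero_count(values):
    """Length of a nonzero result: a function of the input values."""
    return sum(1 for v in values if v != 0)


a = [0, 3, 0, 7]
b = [1, 2, 3, 4]

# Shape-only rule: the answer is available before any data is looked at.
shape_of_sum = add_output_shape((len(a),), (len(b),))

# Same input shape, different output lengths: a rule that sees only
# shapes cannot tell these two cases apart.
count_a = nonzero_count(a)
count_b = nonzero_count(b)
```

A shape function that is only ever handed `(4,)` would have to return one answer for both `a` and `b`, which is the gap the rest of this proposal tries to close.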

This document surveys how other major frameworks handle the problem and proposes a consistent, phased strategy for MLX.

2. The op class (array API)

The Python array API standard has a dedicated design note, "Data-dependent output shapes", and a capabilities()["data-dependent shapes"] boolean flag so a conforming library can advertise that it does not support them ([array-api design topic], [array-api info.capabilities]). The ops it flags:

unique_all/unique_counts/unique_inverse/unique_values
nonzero
boolean-array indexing x[mask] (explicitly optional; the standard says graph-building libraries like JAX and Dask find it hard to implement, and a library may omit it and still conform) ([array-api indexing]).

MLX already returns False for that capability flag in our new __array_namespace_info__ — so MLX is conformant today by declining these. The question is whether/how to offer them.
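
As a rough illustration of how downstream code could consume that flag, the sketch below queries the array-API inspection namespace. `supports_data_dependent_shapes` is a hypothetical helper, and `_FakeNamespace` is a stand-in so the example runs without any array library installed; it is an assumption that a real namespace would be passed in its place.

```python
def supports_data_dependent_shapes(xp):
    """Return True if namespace `xp` advertises data-dependent output shapes.

    Falls back to False when the namespace has no inspection API at all.
    """
    info_factory = getattr(xp, "__array_namespace_info__", None)
    if info_factory is None:
        return False
    capabilities = info_factory().capabilities()
    return bool(capabilities.get("data-dependent shapes", False))


# Hypothetical stand-in for a static-shape library's namespace.
class _FakeInfo:
    def capabilities(self):
        return {"boolean indexing": False, "data-dependent shapes": False}


class _FakeNamespace:
    @staticmethod
    def __array_namespace_info__():
        return _FakeInfo()


can_filter = supports_data_dependent_shapes(_FakeNamespace)
```

Portable code can branch on the result, for example by falling back to a size-bounded call when the flag is False.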

3. How other frameworks handle data-dependent shapes

3.1 The array API standard — make it optional + advertise a capability

The spec notes that, for boolean-mask filtering, "the output array shape is data-dependent"; it makes boolean-array indexing optional and names compute-graph libraries (JAX, Dask) as the ones that legitimately omit it ([array-api indexing], [array-api design topic]). → Precedent: declining is conformant; the decision is about ergonomics, not standards compliance.

3.2 JAX / XLA — static `size=` argument + `fill_value` padding (the pragmatic winner)

jnp.unique is "not by default compatible with jit()" because its output size is data-dependent ([jax unique]).
JAX resolves this with a static size argument that fixes output length at trace time; without it under jit you get a concrete-value/abstract-tracer error ([jax unique], [jax gotchas]).
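When size exceeds the true count, the tail is padded with fill_value (default for jnp.unique: the minimum unique value), so the buffer always matches the statically declared shape; when size is smaller than the true count, only the first size sorted unique values are returned ([jax unique]). jnp.nonzero takes the same size/fill_value pair ([jax nonzero]).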
This exists because XLA mandates static shapes; jax2tf shape-polymorphism explicitly cannot handle output shapes that depend on input values, only symbolic expressions of input dimensions ([jax2tf README]).

→ This pattern maps onto MLX with zero core changes: size is an argument, so output_shapes(inputs) can return [{size}] — known at build time. (More in §5.)
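
The pad-or-truncate contract described above can be modelled in a few lines of plain Python. This is a sketch of the semantics for 1-D input, not JAX's implementation; `unique_fixed_size` is a hypothetical name, and falling back to 0 for an empty input is an assumption made only so the function is total.

```python
def unique_fixed_size(values, size, fill_value=None):
    """Sorted unique values in a list of exactly `size` entries."""
    uniques = sorted(set(values))
    if fill_value is None:
        # Default described above: pad with the smallest unique value.
        # The 0 used for empty input is this sketch's own assumption.
        fill_value = uniques[0] if uniques else 0
    kept = uniques[:size]  # truncate when there are more than `size` uniques
    return kept + [fill_value] * (size - len(kept))


padded = unique_fixed_size([3, 1, 3, 2, 1], size=5)
sentinel = unique_fixed_size([3, 1, 3, 2, 1], size=5, fill_value=-1)
truncated = unique_fixed_size([3, 1, 3, 2, 1], size=2)
```

Whatever the data, the result has length `size`, which is why a shape function can report it without looking at values. Passing an explicit sentinel such as -1 makes the padding easy to tell apart from real entries.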

3.3 XLA / StableHLO — bounded dynamic shapes (the general fix)

XLA supports bounded shapes: a dimension with a static upper bound but unknown actual size, written s32[<=4]. The HLO primitive is set-dimension-size, which takes a static-shape array + a scalar size and yields the bounded-dynamic dim ([jax #26265]).
StableHLO documents this as "data-dependent dynamism" (e.g. nonzeros) and recommends modeling it via bounded dynamism — specify an upper bound, hardware implements it via tensor padding ([stablehlo dynamism]).

→ This is the "real" answer (true dynamic dims) but it's a deep core change: shape inference, allocation, compile specialization, and Metal kernels all must learn about bounded dims.
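
One way to picture a bounded dimension is a fixed-capacity buffer paired with a runtime length. The sketch below is a toy model under that assumption: `Bounded1D` is a hypothetical type, and this `set_dimension_size` function only borrows the name of the HLO primitive; it is not how XLA or StableHLO represent bounded shapes internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded1D:
    """A 1-D value with a static upper bound and a runtime logical size."""
    buffer: tuple  # always exactly `bound` entries; the tail is padding
    size: int      # logical length, 0 <= size <= bound

    @property
    def bound(self):
        return len(self.buffer)

    def logical(self):
        """The entries that are actually valid."""
        return list(self.buffer[: self.size])


def set_dimension_size(static_values, size):
    """Attach a runtime size to a statically shaped buffer (model only)."""
    if not 0 <= size <= len(static_values):
        raise ValueError("size must lie within the static bound")
    return Bounded1D(tuple(static_values), size)


# A nonzero-style result over 4 inputs: the bound is 4, the actual count is 2.
result = set_dimension_size([1, 3, 0, 0], size=2)
```

The allocation is driven by `bound`, which is static, while consumers that care about exact contents read `size`. Every layer that touches shapes would need to understand that pair, which is where the cost of this option comes from.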

3.4 PyTorch — eager natively; `torch.compile`/export via unbacked SymInts

Eager PyTorch just runs data-dependent ops (it allocates at runtime).
Under torch.compile/export, output sizes that can't be known at compile time are represented as unbacked SymInts — symbolic ints with no concrete value/hint, introduced for nonzero()/item() ([pt backed-unbacked], [pt guardon]).
They can't drive control flow without a hint → GuardOnDataDependentSymNode, fixed with torch._check(...) / mark_unbacked ([pt guardon]).
PyTorch/XLA gates nonzero/masked_select/masked_scatter behind XLA_EXPERIMENTAL=... and uses bounded dynamic shapes — torch.nonzero() returns torch.Size([<=25, 2]) ([pt/xla dynamic_shape], [pt/xla #3884]).

→ Confirms the universal split: eager = trivial; ahead-of-time/graph = needs symbolic-or-bounded dims. MLX's lazy graph sits on the graph side.

3.5 GPU implementation reality — stream compaction

The actual GPU kernel for unique/nonzero/compress is stream compaction: a prefix-sum (scan) over a predicate mask to compute output offsets, then a scatter. NVIDIA CUB exposes exactly this as cub::DeviceSelect::Unique/Flagged, which writes a device-side d_num_selected_out count ([cub DeviceSelect]); the count must be read back to host to know the final size ([stream-compaction blog], [arxiv 2311.02103]). On Metal this is a scan + scatter over the sorted array (for unique) or the mask (for nonzero) — implementable with MLX's existing scan/scatter machinery, but the output size still has to come back to the host unless it's fixed by a size argument.
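
The scan-then-scatter structure can be sketched sequentially in plain Python. This is a CPU model of the data flow only, not a Metal or CUB kernel, and `compact` is a hypothetical name. It returns the count separately to mirror the device-side counter that would otherwise need a host readback.

```python
from itertools import accumulate


def compact(values, flags):
    """Stream compaction: keep values[i] where flags[i] is true.

    Returns (out, count). `out` has the same length as the input; only the
    first `count` entries are meaningful.
    """
    # Inclusive prefix sum over the 0/1 flags. At a flagged position,
    # inclusive[i] - 1 is the exclusive scan, i.e. that element's output slot.
    inclusive = list(accumulate(int(bool(f)) for f in flags))
    out = [None] * len(values)
    for i, (value, flag) in enumerate(zip(values, flags)):
        if flag:
            out[inclusive[i] - 1] = value  # scatter to the scanned offset
    count = inclusive[-1] if inclusive else 0
    return out, count


data = [5, 0, 7, 0, 0, 9]
out, count = compact(data, [v != 0 for v in data])
```

The buffer length is known up front; only `count` is data-dependent. A fixed `size` argument removes the need to read `count` before building the rest of the graph.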

4. MLX-specific constraints any solution must satisfy

Lazy graph: downstream consumers are built against the declared output shape before eval.
compile() specializes on shapes (it has a shapeless path); a data-dependent dim defeats specialization unless represented symbolically.
vmap/jvp/vjp: every primitive must provide them (or throw). Batched data-dependent sizes differ per row — only the size=-bounded form vmaps cleanly (uniform size).
export: serializes shapes.
Metal (eval_gpu): writes into a pre-sized buffer; a count-then-compact op needs either a host sync to read the count, or a fixed size.
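
The vmap constraint is easiest to see on a small batch. In this pure-Python sketch (hypothetical helper names, lists standing in for arrays, and -1 chosen arbitrarily as the padding value), exact-length results are ragged across rows, while size-bounded results stack into a rectangular batch.

```python
def nonzero_indices(row):
    """Exact-length result: the length depends on the row's values."""
    return [i for i, v in enumerate(row) if v != 0]


def nonzero_indices_sized(row, size, fill_value=-1):
    """Fixed-length result: truncate or pad to exactly `size` entries."""
    found = nonzero_indices(row)[:size]
    return found + [fill_value] * (size - len(found))


batch = [
    [0, 4, 0, 2],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]

# Per-row lengths differ, so the results cannot form one dense array.
ragged_lengths = [len(nonzero_indices(row)) for row in batch]

# With a shared size every row has the same length and the batch is dense.
stacked = [nonzero_indices_sized(row, size=3) for row in batch]
```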

5. Options for MLX

| Option | Output shape | Core change | Lazy/compile/vmap/grad | Verdict |
| --- | --- | --- | --- | --- |
| A. Static `size=` + `fill_value` (JAX-style) | `f(arg)` → known at build | None | ✅ all work (size is uniform) | Recommended Phase 1 |
| B. Eager host-materialization | computed after a forced `eval` | small, but adds a sync + breaks purity | ❌ no compile/vmap/grad; sync | Convenience-only, Phase 2 |
| C. Bounded dynamic dims (`<=N`, StableHLO/XLA-style) | symbolic, bounded | Large (shape system, alloc, compile, Metal) | ⚠️ partial; most general | Phase 3, maintainer-owned |

Why A is the right first step

size is a function argument, so output_shapes returns [{size, ...}] — fully compatible with today's build-time shape model. No change to the array/shape/eval core.
unique/nonzero/compress become compositions of existing primitives once size is fixed:
- unique_values(x, size, fill_value) = sort(x) → adjacent-diff mask → cumsum offsets → scatter first-occurrences into a size-length buffer pre-filled with fill_value (clamp/truncate to size). unique_counts/inverse/all add the inverse map and run-length counts (also static once size is fixed).
- nonzero(x, size, fill_value) = mask → cumsum offsets → scatter indices.
It mirrors JAX's size/fill_value convention, so it's familiar and array-API-adjacent (the array-API functions take no size argument, but a size-extended superset is the usual escape hatch for static-shape backends).
It composes with compile, vmap (uniform size), export, and even has a sensible vjp (gather/scatter are differentiable).
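
The composition listed above (sort, adjacent-difference mask, cumsum offsets, scatter) can be walked through in plain Python. This is a sequential model of the data flow for 1-D input, not an MLX implementation; `unique_sized` is a hypothetical name, and reporting 0 as the count for padded slots is this sketch's own choice.

```python
from itertools import accumulate


def unique_sized(x, size, fill_value):
    """Model of sort -> mask -> cumsum -> scatter, returning (values, counts)."""
    s = sorted(x)
    # Adjacent-difference mask: 1 where an element starts a new run.
    is_first = [1 if i == 0 or s[i] != s[i - 1] else 0 for i in range(len(s))]
    # Inclusive cumsum minus one gives the output slot of each element's run.
    slots = [c - 1 for c in accumulate(is_first)]

    values = [fill_value] * size  # buffer pre-filled with fill_value
    counts = [0] * size           # padded slots keep a count of 0
    for value, first, slot in zip(s, is_first, slots):
        if slot >= size:
            continue              # writes past `size` are dropped (truncation)
        if first:
            values[slot] = value  # scatter the first occurrence of each run
        counts[slot] += 1         # run-length count via scatter-add
    return values, counts


values, counts = unique_sized([3, 1, 3, 2, 1], size=4, fill_value=-1)
short_values, short_counts = unique_sized([3, 1, 3, 2, 1], size=2, fill_value=-1)
```

Every intermediate here has a length fixed by either the input length or `size`, so each step has a build-time shape.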

What A does not give

True no-size unique(x) / x[mask] returning exactly-N elements. That genuinely needs Option C. But A unblocks ~all real use cases (you almost always have an upper bound) and gives MLX a conformant, documented answer instead of "not supported."

6. Recommended phased plan

Phase 1 — size=-bounded ops (no core change). Add unique_values/unique_counts/unique_inverse/unique_all(x, *, size, fill_value=...), nonzero(x, *, size, fill_value=...), optionally compress. Pure compositions of sort+cumsum+scatter; works on CPU and Metal today. Document the size/fill_value contract (copy JAX's wording).
Phase 2 — explicit eager escape hatch (optional). A clearly-named, documented host-sync helper (e.g. mx.unique(x) with no size that internally evals and returns a concrete-shape array) for notebooks/REPL, explicitly marked as breaking laziness and unavailable under compile/vmap/grad. Mirrors PyTorch-eager and the maintainer's current "convert to NumPy" advice, but in-framework. (Include only if the team is comfortable with one eager op; otherwise skip.)
Phase 3 — bounded dynamic dimensions (core). Introduce a <=N bounded-dim concept (à la XLA set-dimension-size / StableHLO bounded dynamism / PyTorch-XLA), letting the no-size forms and a[mask] return bounded-dynamic outputs. Largest change; this is the "consistent strategy" the maintainer referenced and is theirs to own — but Phases 1–2 deliver value immediately and Phase 1's ops become the size-pinned fast path under it.

7. Concrete ask for the MLX team

Is a size=/fill_value API (Phase 1) acceptable as the sanctioned pattern for this op class? If so I can open a PR for unique_* + nonzero built on existing sort/cumsum/scatter (CPU + Metal, with tests).
Do you want the eager escape hatch (Phase 2) at all, or keep the core strictly lazy?
Are bounded dynamic dims (Phase 3) on any roadmap, or explicitly out of scope?

Appendix — sources (all primary unless noted)

MLX: indexing docs (usage/indexing.html); issues #856 ("Need for implementing .unique() for arrays"), #865 ("requirement for implementing boolean indices for mx.array"), #246 ("boolean mask or filter?").
Array API: design_topics/data_dependent_output_shapes.html; API_specification/indexing.html; 2024.12 … info.capabilities.
JAX: jax.numpy.unique, jax.numpy.nonzero, Common_Gotchas_in_JAX, jax2tf/README.md, jax issue #26265 (bounded shapes / set-dimension-size).
PyTorch: torch.compiler_dynamic_shapes, dynamic_shapes_backed_unbacked, dynamic_shapes_troubleshooting_guardon_errors; PyTorch/XLA dynamic_shape + issue #3884.
OpenXLA StableHLO: openxla.org/stablehlo/dynamism.
ONNX: ShapeInference.html.
GPU compaction: NVIDIA CUB DeviceSelect; arXiv 2311.02103; "stream compaction using wave intrinsics" (blog).
