Problem
Once Tensor exists, almost every realistic workload runs into shape pairs that are obviously compatible without being identical: subtract a per-feature mean from every row, add a per-channel bias to a stack of images, scale a batch by a per-sample weight. Without broadcasting, the user has to allocate an expanded intermediate by hand — verbose, and exactly the kind of allocation that wipes out the benefit of running the op in C in the first place.
Matrix already has rank-2-only shortcuts for the row-vector and column-vector cases. The Tensor surface needs a general answer that subsumes them without regressing them.
Desired functionality
NumPy-style N-D broadcasting on every Tensor binary op. Shape pairs NumPy accepts are accepted; shape pairs NumPy rejects are rejected with an error message that names both operands' shapes and the axis where compatibility fails. The result of a broadcast op is a freshly-allocated contiguous tensor with the broadcast shape.
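A minimal sketch of the shape rule and the error format described above. The function name `broadcast_shape` and the exact wording of the message are illustrative assumptions, not part of the existing bocpy API. The rule itself is NumPy's: right-align the two shapes, pad the shorter with 1s, and require each axis pair to be equal or to contain a 1.

```python
from typing import Sequence, Tuple


def broadcast_shape(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Return the NumPy-style broadcast of shapes a and b, or raise ValueError.

    Hypothetical helper for illustration; not an existing bocpy function.
    """
    rank = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    pa = (1,) * (rank - len(a)) + tuple(a)
    pb = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for axis, (x, y) in enumerate(zip(pa, pb)):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            # Name both operand shapes and the failing axis, as required above.
            raise ValueError(
                f"cannot broadcast shapes {tuple(a)} and {tuple(b)}: "
                f"sizes {x} and {y} differ at result axis {axis}"
            )
    return tuple(out)


# Per-feature mean subtracted from every row.
print(broadcast_shape((8, 3), (3,)))
# Per-channel bias added to a stack of images.
print(broadcast_shape((4, 3, 32, 32), (3, 1, 1)))
```

The axis in the message is counted in the padded (result) shape, which is one reasonable convention; the note does not say which numbering the real error should use.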
Constraints
- NumPy semantics, exactly. No bocpy-specific broadcasting rules.
- No views and no strides on the public Tensor surface — broadcasting is a compute-time concern, not a storage-time one.
- No measurable regression on rank-2 Matrix workloads. Row-vector and column-vector broadcast shapes that hit a fast path today must still hit a fast path.
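To illustrate the "compute-time, not storage-time" constraint, the sketch below broadcasts inside the op itself: inputs stay plain contiguous buffers, and the only output is a freshly allocated contiguous buffer. Everything here is an assumption for illustration: the name `broadcast_binary`, the flat-list representation, and the use of zero steps as a purely internal device. The real kernel would run in C, and the rank-2 fast paths are not modelled.

```python
from itertools import product
from operator import add, sub


def _steps(shape, rank):
    """Row-major element steps for `shape` padded to `rank`; 0 on size-1 axes."""
    padded = (1,) * (rank - len(shape)) + tuple(shape)
    steps, run = [0] * rank, 1
    for axis in range(rank - 1, -1, -1):
        # A size-1 axis never advances through the buffer, so its step is 0.
        steps[axis] = 0 if padded[axis] == 1 else run
        run *= padded[axis]
    return steps


def broadcast_binary(op, a, a_shape, b, b_shape):
    """Apply `op` elementwise with broadcasting over flat row-major buffers.

    Returns (flat_result, result_shape). Hypothetical helper, not bocpy API.
    """
    rank = max(len(a_shape), len(b_shape))
    pa = (1,) * (rank - len(a_shape)) + tuple(a_shape)
    pb = (1,) * (rank - len(b_shape)) + tuple(b_shape)
    out_shape = []
    for x, y in zip(pa, pb):
        if x != y and x != 1 and y != 1:
            raise ValueError(
                f"cannot broadcast shapes {tuple(a_shape)} and {tuple(b_shape)}"
            )
        out_shape.append(y if x == 1 else x)
    sa, sb = _steps(a_shape, rank), _steps(b_shape, rank)
    out = []  # the only allocation: the contiguous result
    for idx in product(*(range(n) for n in out_shape)):
        ia = sum(i * s for i, s in zip(idx, sa))
        ib = sum(i * s for i, s in zip(idx, sb))
        out.append(op(a[ia], b[ib]))
    return out, tuple(out_shape)


# Subtract a per-feature mean (shape (3,)) from every row of a (2, 3) tensor.
rows = [1, 2, 3, 4, 5, 6]
mean = [10, 20, 30]
print(broadcast_binary(sub, rows, (2, 3), mean, (3,)))
```

No expanded intermediate is materialised for either operand, and no stride or view concept reaches the caller; the steps exist only inside the loop.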
Out of scope
Reductions, batched matmul, stride-based views, broadcast_to / broadcast_arrays-style helpers, F-order layout, new ops.
Open questions
- Behaviour on zero-size shapes (any axis equal to zero).
- Whether broadcasting also applies to Tensor.__setitem__ right-hand sides, or whether scalar-broadcast remains the only accepted case there. Interacts with M4.
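For the zero-size question, one candidate answer is to apply the ordinary per-axis rule unchanged, which is what NumPy does: a size-0 axis is compatible with size 0 or size 1 and yields size 0, and is incompatible with anything else. The sketch below only illustrates that rule; `broadcast_axis` is a hypothetical name, and whether bocpy adopts this behaviour is still the open question.

```python
def broadcast_axis(x: int, y: int) -> int:
    """NumPy's per-axis rule: sizes must be equal, or one of them must be 1."""
    if x == y or y == 1:
        return x
    if x == 1:
        return y
    raise ValueError(f"axis sizes {x} and {y} are incompatible")


# A size-1 axis stretches to match a size-0 axis, giving an empty result axis.
print(broadcast_axis(0, 1))
print(broadcast_axis(1, 0))

# A size-0 axis against a size-3 axis is rejected like any other mismatch.
try:
    broadcast_axis(0, 3)
except ValueError as err:
    print(err)
```

Under this rule, zero needs no special case in the shape check; the cost is that an op can legitimately return an empty tensor, which the allocation and kernel paths would have to tolerate.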