
Buffer indices should be unsigned #1135

Closed
litherum opened this issue Oct 6, 2020 · 64 comments
Labels: wgsl WebGPU Shading Language Issues

@litherum
Contributor

litherum commented Oct 6, 2020

If indices are unsigned, bounds checks only need to have one conditional.

There was a resolution for this on 2020-10-6
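A minimal C++ sketch of the claim (function names are illustrative, not from any real implementation):

    #include <cstdint>

    // With a signed index, the obvious bounds check needs two comparisons.
    bool inBoundsSigned(int32_t i, int32_t n) {
        return i >= 0 && i < n;
    }

    // With an unsigned index, a single comparison suffices.
    bool inBoundsUnsigned(uint32_t i, uint32_t n) {
        return i < n;
    }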

@litherum litherum added the wgsl WebGPU Shading Language Issues label Oct 6, 2020
@grorg grorg added this to Under Discussion in WGSL Oct 13, 2020
@grorg
Contributor

grorg commented Oct 13, 2020

I assume that since we have a resolution we don't need to discuss this? We only need a PR?

@grorg grorg moved this from Under Discussion to Resolved: Needs Specification Work in WGSL Oct 13, 2020
@dneto0
Contributor

dneto0 commented Oct 19, 2020

I don't think we had finalized this.

Background:

In the meeting I mentioned that SPIR-V / Vulkan treat array indices as signed.

Since the meeting, I learned that LLVM's LangRef is explicit that indices are treated as signed integers.
See the getelementptr instruction:

If the inbounds keyword is present ... with infinitely precise signed arithmetic ...

...

If the inbounds keyword is not present, the offsets are added to the base address with silently-wrapping two’s complement arithmetic. If the offsets have a different width from the pointer, they are sign-extended or truncated to the width of the pointer.

An implication: For arrays with more than 2**31-1 elements, you can't access the upper elements. If you want something in that range, you have to use a wider int type for the index. That's perfectly reasonable.

Also, I dispute the implication that signed index checks are necessarily less efficient.
For example, LLVM IR has an "icmp ule" instruction (unsigned integer less-than-or-equal comparison) that can be used to tell if the index is in the range 0..MAX_INT or 0..MY_ARRAY_SIZE.
That's only one comparison.
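For concreteness, the same fold expressed in C++ (a hedged sketch; the function name is invented):

    #include <cstdint>

    // Casting the signed index to unsigned first makes any negative value
    // wrap to something huge, so one unsigned comparison (LLVM's icmp ult/ule)
    // rejects both negative and too-large indices.
    bool inBoundsOneCompare(int32_t i, uint32_t n) {
        return static_cast<uint32_t>(i) < n;
    }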

@dneto0
Contributor

dneto0 commented Oct 19, 2020

I think in the long term we should allow either signed or unsigned indices.

In any case, we would need the restriction that index values be limited to 0 .. max-signed-int.

I'd support limiting it to one of the options (signed or unsigned), as a work-reduction choice for MVP.

@dj2 dj2 moved this from Resolved: Needs Specification Work to Under Discussion in WGSL Oct 20, 2020
@kvark
Contributor

kvark commented Oct 21, 2020

Thanks @dneto0 for the elaboration and the info!

First, let's talk about the limit. If the user wants to access anything at index 2^31 or higher, what would be the size of the buffer containing that data? Even for half-floats, at 2 bytes each, the buffer has to be at least 4GB, and all of it must be visible to shaders.

I don't think WebGPU has to support storage buffer bindings that large:

  • Vulkan's maxStorageBufferRange limit starts at 128MB. We'll need to add it to our limits, similar to the existing maxUniformBufferBindingSize that we have.
  • Metal has a runtime limit for the maximum buffer length that we have to query. It's been reported that the limit can easily be as low as 256MB on some configurations.

With that in mind, WebGPU will have a limit on the storage buffer binding size, with the baseline likely well below the 4GB mark. Therefore, implementations can be safe in knowing that index 2^31 and higher is never going to be valid anyway.

So this addresses the note that 2^31 would have to be a limit even if unsigned indices were used: in fact, our limit will be even lower anyway. And it makes unsigned indices the logical choice going forward, I think.
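A quick check of that arithmetic (illustrative C++ only):

    #include <cstdint>

    // Index 2^31 into an array of 2-byte half-floats implies a binding of
    // at least 2^31 * 2 bytes = 4GiB.
    static_assert((uint64_t{1} << 31) * 2 == 4ull * 1024 * 1024 * 1024,
                  "index 2^31 at 2 bytes/element needs a 4GiB binding");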

@litherum
Contributor Author

litherum commented Jan 9, 2021

I think this argument would convince me if we knew the maximum size of a resource at shader compilation time.

GPU memory sizes are increasing; indeed, modern GPUs regularly have > 4GB of memory, so we can't really use the device memory size as a signal that it's okay to cast the signed index to unsigned and use a single comparison.

However, I'd guess that the average resource (view?) size in common applications isn't growing past 4GB any time soon (in accordance with @kvark's comment above). So I think the fundamental point is correct: casting a signed index to unsigned before running the conditional should work for most resources. If there was a way to know at shader generation time that it was safe to take this shortcut in a future-proof way, I think that would solve this problem.

@dneto0
Contributor

dneto0 commented Jan 26, 2021

See also the recently filed #1371

@dneto0
Contributor

dneto0 commented May 11, 2021

FYI. The spec already allows you to use either signed or unsigned indices.

https://gpuweb.github.io/gpuweb/wgsl/#array-access-expr

Maybe it's hard to see because of the "Int" metavariable.

@kvark
Contributor

kvark commented May 11, 2021

A quick round-up on what I was saying on the call:

  1. The main problem with uint, which conceptually matches indexing better, is the inconvenience of typing the u suffix: foo[1u]. This should be solved in the planned future (unable to find a proper reference to this consensus).
  2. Current index limit is effectively 128M, coming from maxStorageBufferBindingSize on the host side. The sign of the index doesn't matter for sizes less than 2GB.
  3. A Vulkan implementation that wishes to expose this limit higher than 2GB can do so only with the ShaderInt64 feature. If it sees the limit requested higher than 2GB, and the structure size in the shader would permit indices higher than 2GB, it would internally convert the indices (only for storage run-time arrays, nothing else) to 64-bit before bounds checking and passing them to the driver (a sketch follows this list).
  4. Supporting both int and uint seems to be a strange choice (that nobody asked for?).
  5. Supporting only uint would require a tiny bit more complexity for 32-bit indices of large arrays (2.5 code paths instead of 2), but nothing major.
  6. If we go with signed int only, it seems to be both ergonomic and simple to implement. (<- proposal?)
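A hedged sketch of the widening in point 3 above (the helper name is invented; a real implementation would differ in detail):

    #include <cstdint>

    // Zero-extend a 32-bit index to 64 bits before the bounds check, so a
    // u32 above INT32_MAX can never be misread as negative. Assumes
    // elementCount > 0, which robust-access clamping requires anyway.
    uint64_t boundsCheckedIndex(uint32_t index, uint64_t elementCount) {
        uint64_t i = index;  // zero-extension is always exact
        return i < elementCount ? i : elementCount - 1;  // clamp out-of-bounds
    }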

@dneto0
Contributor

dneto0 commented May 12, 2021

Supporting both int and uint seems to be a strange choice (that nobody asked for?).

It's no worse than having the addition operator (+) work on both signed and unsigned operands. It's going to be very, very common in users' source code.

I think supporting both int and uint has way higher benefit-to-cost ratio compared to #1588 (scalar weight on the mix builtin-function).

Note: texture sizes (dimensions and levels) are provided or returned as signed integers. It would be weird to disallow signed array indices.

@litherum
Contributor Author

This should be solved in the planned future (unable to find a proper reference to this consensus).

#739

@litherum
Contributor Author

litherum commented May 12, 2021

I don't see where pipeline layout includes information about the maximum size a buffer can be; all I can see is the minimum size a buffer can be: https://gpuweb.github.io/gpuweb/#dom-gpubufferbindinglayout-minbindingsize.

Regardless, I think this is what the generated code would hold, if we could somehow know:

| Number of elements | Signed N-bit Index (e.g. i16, i32, i64) | Unsigned N-bit Index |
| --- | --- | --- |
| Known to be < 2^(N-1) | Cast signed index to unsigned index. Single unsigned comparison. | Single unsigned comparison. |
| Known to be between 2^(N-1) and 2^N | A single signed comparison with 0, just to tell if the index is negative. Edit: see below | Single unsigned comparison. |
| Known to be > 2^N | A single signed comparison with 0, just to tell if the index is negative. Edit: see below | No comparison! |
| Unknown | Two comparisons. Edit: see below | Single unsigned comparison. |

I'm worried about the "Cast signed index to unsigned index" cell in Metal, as this is technically undefined in C++. (Edit: Turns out this is defined! See #1135 (comment).)

@kainino0x
Contributor

kainino0x commented May 12, 2021

It's not a pipeline layout option, it's two global limits on uniform and storage buffer binding sizes:
https://gpuweb.github.io/gpuweb/#dom-supported-limits-maxuniformbufferbindingsize
These limits are applied in createBindGroup.

@litherum
Contributor Author

litherum commented May 12, 2021

Right.

If the "maximum buffer size" concept is per-device, rather than per-buffer, then we can't tell which of the first 3 rows of the table above applies, so we have to pessimize when we generate code and use the 4th row.

@kainino0x
Contributor

The maximum storage buffer binding size default is only 128MiB. Wouldn't it have to go >2GiB (actually >8GiB when only ≥4-byte loads are supported (default), >4GiB when only ≥2-byte loads are supported) before we start falling into the 4th row? Sorry if I'm totally missing something.

@kainino0x
Contributor

I'm worried about the "Cast signed index to unsigned index" cell in Metal, as this is technically undefined in C++.

From a quick bit of research it appears that signed-to-unsigned is well-defined in C++, and only unsigned-to-signed is implementation-defined. https://stackoverflow.com/a/43336256
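In C++ terms, an illustrative check of that rule:

    #include <cstdint>

    // Signed-to-unsigned conversion is defined as reduction modulo 2^32,
    // so the "cast signed index to unsigned" cell has a guaranteed result.
    static_assert(static_cast<uint32_t>(int32_t{-1}) == 0xFFFFFFFFu,
                  "signed-to-unsigned conversion is modulo 2^32");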

@litherum
Contributor Author

litherum commented May 12, 2021

 >4GiB when only ≥2-byte loads are supported

2-byte loads are important for us; we'll probably implement them at the same time as implementing 4-byte loads.

But your general point is correct - we will hit this when we want larger buffers, which the industry is moving towards. Big graphics cards you can buy today often already have > 10GB of memory.

I expect the industry will want to go beyond 128MB quite soon, and I don't think it will be that long before the industry wants to go beyond 4GB.

So, the question is: When that happens, and our software runs on a device which has lots of memory, and we want to expose large buffers, what are we going to do? Are we going to:

  1. Start injecting two comparisons instead of one, for every access of every buffer, just because we don't know at compilation time whether or not we're hitting the first row of the table or the second.
    1a. Same as 1, but allow authors to supply unsigned indices if they want, too. Evangelize the fact that unsigned indices are going to be faster than signed indices.
  2. Give developers a way of telling us their (minimum and) maximum buffer size, for each buffer, at compilation time. Also, hope that both the minimum and the maximum fall within the same row of the table above.
  3. Just make array indices unsigned and be done with it.
  4. Maybe there's a 4th option? I'd be interested to hear ideas.

signed-to-unsigned is well-defined in C++

I did not know this! Good to know.

@kainino0x
Contributor

2-byte loads are important for us; we'll probably implement them at the same time as implementing 4-byte loads.

They'll still be gated on an optional feature, so you can use the device configuration to determine when it's needed.

  • Just make array indices unsigned and be done with it

I haven't been watching this issue closely enough but my uninformed opinion is that this sounds great.

@kdashg
Contributor

kdashg commented May 17, 2021

WGSL meeting minutes 2021-05-11
  • MM: One Q: How does this work in VK? Can you use a uint to index into array? (yes) If the value is too large, does it wrap around? (yes, unless the buffer is big enough) If there’s an array of size 3B, and I use uint=2.9B, does that work?
  • DN: Hmm, in LLVM address info is signed, so what I said above is probably wrong.
  • AB: Well, could use i64 if supported, and unlikely to have a system support 3B addressable that doesn’t support i64, so this is probably a perf question.
  • MM: I understand that buffers are usually smaller than 2GB, but it would be nice if we didn’t have to pick between perf and capability up-front.
  • TR: Is this a backend/implementation question? (maybe)
  • AB: VK doesn’t usually care about signedness of ints, but addressing is defined in signed ints, which limits us.
  • MM: So my 2.9B might be interpreted as negative i32? (yes)
  • AB: What’s the current limit? (128MB?) So implementations could choose to offer <=2GB and just not have this problem.
  • TR: Seems preferable not to decide that unsigned ints aren’t interpreted as signed for the purposes of addressing.
  • DM: So in the future, we could offer >2GB limits only if we support the i64 support we need. And since limits are opt-in, we could know when authors expect to need >2GB, and do the right thing.
  • DN: Is there a portability concern?
  • JG: Probably no?
  • DM: We are free to not expose anything above the baseline limits.
  • DN: Do we still need uint and sint overloads for addressing?
  • MM: If I have a buffer with 3.9B indices, and I ask for -0.1B, would that work?
  • MM: So is it, if the author doesn’t claim a max buffer size, they get the 128MB limit. Only if they ask for >2B would we have to emit the extra ops during shader translation.
  • DM: Why are we investigating negative indices? Don’t we not want them?
  • DN: I don’t know how much of a hit it would be to usability to offer i32 support here.
  • JB: Rust always uses uints for indices, but it has good enough ergonomics to support this.
  • DN: Do we really need the full >2GB?
  • JG: <missing>
  • MM: In real-pointer languages, they really are signed, because you can move them up or down.
  • JG: But fundamentally i32 and u32 add are the same bit results
  • <missing>
  • AB: Can we just limit the size?
  • MM: Yeah but what happens with out-of-bounds inputs?

@dneto0
Contributor

dneto0 commented May 18, 2021

We're discussing the options around:

Initially:

  • allow both i32 and u32 indices
  • limit array element count to i32 max value

With the question of: what is the plan for very large arrays, i.e. element count larger than i32 max value.

One option is:

  • only allow very large arrays when the implementation also supports i64 in WGSL
    • Suggest requiring explicit opt-in enable i64_sized_array;.
    • Don't kick in just because i64 is available.
  • Array element count is at most i64 max value.
    • Note: arrayLength returns u32, so we would need a new builtin anyway, e.g. arrayLength64 (ha!)
  • an array is "very large" if and only if:
    • it has a declared size larger than i32 max value, or
    • it is a runtime-sized array
  • Index into a very large array must be either i64 or u64

The biggest problem I see with this is:

  • Code written without i64 in mind no longer compiles. Most troublesome (to me): any code that indexes into a runtime-sized array:

    [[block]] struct Foo { data : array<i32>; }

    [[...]] var<storage> buf: Foo;
    fn ... {
        let i: i32 = ...
        // Won't compile if large arrays are used anywhere.
        ... buf.data[i];
    }
    

I think trying to make this case smoother starts to get into automatic promotion from smaller scalar types to larger scalar types, and I don't want to go in that direction.

@Kangz
Contributor

Kangz commented May 19, 2021

If the indices can be both i32 or u32, does it mean that we need a clamp operation to implement robust access in the shader for i32 accesses, or are we going to rely on min(static_cast<u32>(index), arrayLength-1) instead?

@alan-baker
Contributor

@dneto0, your example seems wrong. Did you mean u32 as the type of i?

@dneto0
Contributor

dneto0 commented May 19, 2021

If the indices can be both i32 or u32, does it mean that we need a clamp operation to implement robust access in the shader for i32 accesses, or are we going to rely on min(static_cast<u32>(index), arrayLength-1) instead?

It's materially the same, for example, when i = -5:

  • clamp( -5, 0, INT_MAX) = min( max(-5,0), arrayLen-1) = 0
  • min(static_cast<u32>(-5), arrayLength-1) = min(4-billion-ish, arrayLength-1) = arrayLength-1
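A hedged C++ transcription of the two strategies (function names are invented; assumes 0 < arrayLength <= INT_MAX):

    #include <algorithm>
    #include <cstdint>

    uint32_t clampedSigned(int32_t i, uint32_t arrayLength) {
        int32_t hi = static_cast<int32_t>(arrayLength) - 1;
        return static_cast<uint32_t>(std::clamp(i, 0, hi));  // -5 -> 0
    }

    uint32_t clampedViaCast(int32_t i, uint32_t arrayLength) {
        // -5 wraps to 4294967291, so min() picks arrayLength - 1.
        return std::min(static_cast<uint32_t>(i), arrayLength - 1);
    }

    // Different clamped indices (0 vs arrayLength - 1), but both are in
    // bounds, which is all robust buffer access requires.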

@dneto0
Contributor

dneto0 commented May 19, 2021

@dneto0, your example seems wrong. Did you mean u32 as the type of i?

I meant i32. But it could have been u32 as well.

@alan-baker
Contributor

i32 is never problematic though. The range for u32 between SIGNED_INT_MAX and UNSIGNED_INT_MAX is the part that needs to be spelled out for expansion.

@dneto0
Contributor

dneto0 commented May 19, 2021

Thinking about this more, we can make life easier on the programmer, and get rid of the weird "code doesn't compile" case. Modify the proposal to:

  • allow any signed integer type that's already valid in the language.
  • allow any unsigned integer type that's already valid in the language. (Assume that when n-bit integers are introduced into the language, that both signed and unsigned forms are valid.)
  • always interpret the indices as signed. (So if it's unsigned, reinterpret the bits as a signed integer)

This is exactly what both LLVM and SPIR-V do.

So restating completely:

Initially:

  • allow both i32 and u32 indices
  • limit array element count to i32 max value

Future expansion:

  • allow both signed and unsigned indices
  • max array size is the max signed integer value (over any signed integer type)
  • always interpret array indices as signed. Bitcast an unsigned value to signed, if needed

This is simpler for the user, and a closer match to both LLVM and SPIR-V. That meshes well with the compiler stack underlying MSL, DXC/DXIL, and Vulkan.
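A sketch of that reinterpretation in C++ (illustrative; std::bit_cast is C++20, and as noted earlier in the thread a plain unsigned-to-signed cast is only implementation-defined):

    #include <bit>
    #include <cstdint>

    // "Always interpret the indices as signed": reinterpret the bits, so a
    // u32 above INT32_MAX comes out negative and fails the bounds check.
    int32_t asSignedIndex(uint32_t u) {
        return std::bit_cast<int32_t>(u);
    }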

@litherum
Contributor Author

litherum commented Jun 23, 2021

I wrote this comment earlier in the issue but it bears repeating in case people missed it. I've clarified the comment to make it less ambiguous.

For D3D12, indices only work for values up to MAX_UINT32. I suggest we limit WebGPU to that range, or a subset of that range if we need to.

Well, WGSL doesn't actually have any 64-bit types yet, so we are de facto already limiting ourselves to the range 0 .. MAX_UINT32. (For now! Using 64-bit types at all will have to require an extension anyway. I don't see any 64-bit integral types on https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-scalar so it looks like D3D wouldn't be able to offer this extension when we get around to spec'ing it.)

I believe this discussion is regarding "what happens when indices are between MAX_INT32 .. MAX_UINT32."

@tex3d

tex3d commented Jun 29, 2021

I don't see any 64-bit integral types on https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-scalar so it looks like D3D wouldn't be able to offer this extension when we get around to spec'ing it.)

It's unfortunate, but these docs are quite out-of-date and have not been updated for what DXC supports. There is a more up-to-date table on this DXC Wiki page, but it isn't titled well: https://github.com/microsoft/DirectXShaderCompiler/wiki/16-Bit-Scalar-Types.

This includes 64-bit signed and unsigned int types, which are supported under an optional feature flag. It also includes explicit 16-bit int and float types (rather than fuzzy min-precision ones), also supported under an optional feature flag.

This doesn't mean that HLSL supports 64-bit array indexing or buffer offsets however. Those are still limited by the API and in HLSL to 32-bit unsigned sizes/indexes/offsets.

@kdashg
Contributor

kdashg commented Jun 29, 2021

WGSL meeting minutes 2021-06-29
  • MM: I’m still working more on this. Two attacks: 1) Try in a number of shading languages; 2) Experimenting with zero-extending vs perf. I found I didn’t fully understand DN’s proposal. Could you elaborate?
  • DN: I think the problem is for large arrays, and I think I want a large array to require a wide int to index it. I want the max size of the array to be no larger than the max signed integer size.
  • MM: Consider, what if I index with -10. Is that “too big”? How should it work?
  • DN: It would wrap around.
  • MM: If the base of the array lies above 4 billion, then that should still work?
  • DN: That’s an implementation detail. The addressing that happens should be 32-bit friendly/well-defined. The fact that it has byte addresses that are very large is an implementation detail.
  • MM: I’m looking at the cost of a zero extension, and an array implementation involves a multiply and an add. On 32-bit pointer device, this is moot, so ignore this. On a 64-bit pointer device (that has enough memory), the multiply and the add would then be on the wider 64-bit types. So the question is what we do to promote the 32-bit signed indexing to our 64-bit internals? Assuming data lies at addresses, and arrays have a base pointer…
  • DN: That’s an implementation question.
  • JG: Is this sign-extend vs zero-extend?
  • MM: On many processors, the 32-bit operations will automatically zero the top half of the 64bit result register. So if that’s the goal, it’s generally free. For sign-extension, sometimes we’d need an extra instruction to get that behavior.
  • DN: I think I’m saying, if you’re thinking byte addressing has to happen a certain way, I’m saying it’s an implementation choice.
  • DN: For example, OpenCL didn’t start out with byte addressable memory, it had to be an extension. It’s often in hardware now, but not everywhere.
  • MM: If I write a program that uses either signed or unsigned for the implementation, it will be the same instructions. Instead, in order to get a perf comparison, I have done all my own pointer math. If I present that, would it be valuable to others?
  • DN: I suspect that the memory access overhead/bandwidth will dominate the pointer math, even with caching. I would start with testing on a CPU first.
  • MM: I have some results, but not ready to share.
  • DN: I also think that given our portability requirements, we shouldn’t admit this comparison.
  • [missing]
  • DN: I think if authors want >2GB they should author using 64bit types
  • MM: I think the math might still be expensive, which would mean that it would be useful to use a 32bit index
  • MM: We could potentially give the authors both options: u32 could be cheaply zero extended if users wanted it, or they could use i32, or i64 if they want
  • DN: Going back to automatic promotion to the widest signed type.
  • MM: If I have an array, I want to access the 3 billionth element, I would then need a wider type. Devices that support a 3 billion array probably do support i64. The compiler could see an unsigned array index and output i64 math. For the devices that don’t support i64, they would have lower max-size limits, and that would be ok.
  • DN: Could have a rule, if 32bit index, we’ll use a 64bit index if the device supports it, and you’ve opted into it.
  • MM: I think what you’re saying is: In WGSL, if the author has not enabled the 64bit extension, max buffer sizes will be artificially low, but only for spir-v devices. If they opt in instead, the max size of buffers would be higher
  • DN: Not quite: No special case for spir-v. If you want to have a 2B+ element array, you’d need to opt in to the i64 extension.
  • MM: I think that’s over-restrictive. Could make it up to implementations, where some implementations could handle u32 indices properly, even without i64.
  • DN: Right now spec says index can be i32 or u32, so we have that already.
  • JG: I think the proposal is that implementations that don’t want to support 3B arrays without i64, could choose to limit the max size to 2B and not have to deal with it. What’s the portability issue exactly?
  • DN: It would be clear that e.g. a device is a spir-v device, because it's limited to 2B if it doesn’t have i64.
  • JG: So this is a problem with non-portable limits? (yes) I think that’s not a philosophical problem/blocker for us.
  • DM: Are we debating whether to put the max array size in the spec?
  • DN: Right now we limit them in WGSL based on the max int size.
  • DM: Is MM unhappy with this? (yes)
  • MM: I just want my u32(3B) to work
  • DN: I want you to have to use i64(3B)
  • MM: I think that’s unreasonable. Is it valuable if other shading language would support this?
  • DN: No, because we still need spir-v to work.
  • JG: I think we’re at the point where we just need to choose qualitatively, and that there’s not more data to be gathered. I think we should take another week to think to ourselves about it before forcing a decision (at worst, voting) next week.
  • GR: As a note, DX doesn’t support 64bit indexing; there are 64bit capability bits, but not for indexing.
  • GR: We may want to reconsider taking a consequential vote so close after the US holiday

@Kangz
Contributor

Kangz commented Jun 29, 2021

I don't really understand the subtleties of this discussion but note that WebGPU has a maxStorageBufferBindingSize limit with a guaranteed value of 128MB only. An implementation could use this value to control the maximum size of arrays indexable by the shader.

@litherum
Contributor Author

litherum commented Jul 10, 2021

I tried to understand the behavior that @dneto0 was describing in Vulkan, and came away more confused than when I started. I wrote a Vulkan program that's as simple as possible to test out the behavior of indices > 2B. This is the entire shader:

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

layout(std430, binding = 0) buffer BigBuffer {
	uint8_t bigBuffer[];
};

layout(push_constant) uniform PushConstants {
	uint value;
};

layout(local_size_x = 1) in;
void main() {
	uint index = (1u << 31u) + 4u;
	bigBuffer[index] = uint8_t(value);
}

The API code dispatches 1 thread to execute this, and then runs vkCmdCopyBuffer() to copy the result to the CPU to see what was written. I expect the vkCmdCopyBuffer() is copying from the correct place, because the byte offset is represented as a VkDeviceSize, which is a uint64_t. I also enabled VK_LAYER_KHRONOS_validation, RobustBufferAccess, and RobustBufferAccess2 to make sure I wasn't doing anything wrong.

Running the test app, it appears that unsigned indices > 2B work correctly on Vulkan, which seems to contradict @dneto0's previous statements. I tested on an NVIDIA GeForce RTX 2080 Ti and a Radeon RX 560 Series (though the Radeon doesn't support VK_EXT_robustness2), and they both seem to agree.

The shader, when run through glslangValidator, produces this SPIR-V:

...
               6:             TypeInt 32 0 // "0" means unsigned
               7:             TypePointer Function 6(uint)
               9:     6(uint) Constant 2147483652
...
              14:     13(ptr) Variable Uniform
...
              16:     15(int) Constant 0
...
        8(index):      7(ptr) Variable Function
                              Store 8(index) 9
              17:     6(uint) Load 8(index)
...
              26:     25(ptr) AccessChain 14 16 17
                              Store 26 24

This appears like it's passing the expected unsigned value to the AccessChain operation.

Here is the test program: VulkanLargeArrayTest.zip

@dneto0 Do you think you could take a look and help me understand what's going on?

@litherum
Contributor Author

litherum commented Jul 10, 2021

I transcribed this program to a bunch of other APIs, too. Here are the results:

| API | GPU | OS | Result | Notes |
| --- | --- | --- | --- | --- |
| Metal | AMD Radeon Pro Vega 56 | macOS | | |
| Metal | Apple M1 | macOS | | |
| Metal | AMD Radeon RX 570 | macOS | | |
| Metal | AMD Radeon Pro 560 | macOS | | |
| Metal | Intel(R) HD Graphics 630 | macOS | | Cannot create a buffer big enough to test |
| OpenCL | AMD Radeon Pro Vega 56 | macOS | | |
| OpenCL | Apple M1 | macOS | | |
| OpenCL | AMD Radeon RX 570 Compute Engine | macOS | | |
| OpenCL | AMD Radeon Pro 560 Compute Engine | macOS | | |
| OpenCL | Intel(R) HD Graphics 630 | macOS | | Cannot create a buffer big enough to test |
| D3D12 | NVIDIA GeForce RTX 2080 Ti | Windows | ❌✅ | HLSL doesn't have any 1-byte types, so I can't test this on a buffer that's smaller than 4GB. Using a larger buffer, the debug layer outputs Root descriptor access out of bounds (results undefined): ... Highest byte offset from view start accessed: 0xffffffff (max 32-bit offset or overflow). However, it produces the correct result. |
| OpenCL | NVIDIA GeForce RTX 2080 Ti | Windows | | |
| OpenGL | NVIDIA GeForce RTX 2080 Ti/PCIe/SSE2 | Windows | | glLinkProgram() fails with this error: error C1068: array index out of bounds |
| Vulkan | NVIDIA GeForce RTX 2080 Ti | Windows | ✅⭐️ | |
| CUDA | NVIDIA GeForce RTX 2080 Ti | Windows | | |
| D3D12 | Radeon RX 560 Series | Windows | | Debug layer outputs same error as above, and does not produce the correct result. |
| OpenCL | Baffin | Windows | | |
| OpenGL | Radeon RX 560 Series | Windows | | No error messages or debug output, but doesn't produce the correct result |
| Vulkan | Radeon RX 560 Series | Windows | ✅⭐️ | |

Here are the test apps:

⭐: See above

@dneto0
Contributor

dneto0 commented Jul 13, 2021

@dneto0 Do you think you could take a look and help me understand what's going on?

The simple answer is that you've exercised unchecked undefined behaviour. This is akin to the "works on NVIDIA" problem.

@dneto0
Contributor

dneto0 commented Jul 13, 2021

I'll reassert that those large arrays should be allowed when the largest signed integer type can express their element count.

Here's a scenario:

  • we expand the pointer functionality in a future version (or extension) of WGSL
  • we allow pointer-difference within the same array
  • then what's the result type of that pointer diff, i.e. what corresponds to std::ptrdiff_t?
    C++14 says that behaviour is undefined if the resulting difference overflows. (5.7 para 6)

@dneto0
Contributor

dneto0 commented Jul 13, 2021

Some more notes (possibly repeating what I've said verbally):

  • in the OpenCL cases, I think you're exercising implementations that support 64bit integers
  • in the D3D12 NVIDIA case, you're (I think) proving my point that, without 64bit ints, and without a byte type that is addressable in memory, the use case is very niche. A 2-byte scalar type with 2 giga-elements already spans a 4GB space. Beyond that, you're hitting validation errors. These are huge warning signs.

@kdashg
Contributor

kdashg commented Jul 13, 2021

WGSL meeting minutes 2021-07-13
  • DM: What if we have a u32(3B)?
  • DN: OOB access behavior
  • DM: What about i64(3B)?
  • DN: That would be fine
  • [what about large runtime sized arrays as a different type]
  • DN: I don’t have a proposal for that.
  • MM: I think this would be a design mistake
  • MM: I think we should accept i32/u32/i64, and if it’s a valid offset, it’ll work regardless of type.
  • (let’s let gears turn and come back to this)

@litherum
Contributor Author

litherum commented Jul 13, 2021

I'd like to be clear here about what the proposal is. I think this is the same as @jdashg's proposal during the call today, but he can correct me if I'm mistaken.

There are 3 (or 4) 'overloads' of array accesses in WGSL: array[i32], array[u32], array[i64] (and possibly array[u64]).

On Vulkan:

  • For array[i32], the index is passed directly to OpAccessChain. It just works.
  • For array[i64], the index is also passed directly to OpAccessChain. It just works (because the only way for there to be an i64 in the shader is if the i64 extension is enabled, which would automatically trigger the Int64 SPIR-V "Capability").
  • For array[u64], the index is just cast to an i64 and behaves as above, because there is no buffer big enough to be able to tell the difference
  • For array[u32], there are 2 options:
    • If the device doesn't support the Int64 SPIR-V "Capability", then the WebGPU device would set the maxUniformBufferBindingSize and maxStorageBufferBindingSize to at most 0x1FFFFFFFC (~8GB). This way, even if the author attaches the biggest possible binding to the smallest possible WGSL type, the highest valid index is 0x7FFFFFFF. So, the implementation can simply cast the u32 to an i32 and it will be correct.
      • 8GB is a totally reasonable limit for these devices, because if they don't support the Int64 SPIR-V "Capability," they probably support less than 4GB of memory anyway. Also, D3D12 already doesn't support any bindings larger than 4GB, so this probably isn't a problem.
    • On the other hand, if the device does support the Int64 SPIR-V "Capability", then the implementation can choose to either take the above path and restrict maxUniformBufferBindingSize and maxStorageBufferBindingSize, or promote the u32 to an i64 internally before passing it to SPIR-V.
      • On devices with > 4GB of memory, promotion is almost certainly not a problem, because (almost?) all implementations would be doing this promotion under the hood anyway. In order to add the element byte offset to the array's base address, that addition will (almost?) always need to be done on 64-bit values, because the array's base address could totally be higher than UINT_MAX. If the WGSL -> SPIR-V compiler does the promotion, the SPIR-V -> hardware compiler doesn't have to, so there's no performance lost.
      • And, on devices with < 4GB memory, the max binding size will already be less than 8GB, so nothing else needs to be done. The index is just cast to an i32 and used.

On D3D12, the largest valid index will already have to be 0x3FFFFFFF (1B) because of that "Highest byte offset from view start" error message above. Anything higher would end up with a byte offset higher than the max 32-bit byte offset. So, D3D12 can just cast the u32 to an i32 and it will be correct.

On Metal, the index is just passed directly into the index operation in the generated MSL (after a bounds check, of course). Everything just works.
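A hedged sketch of the array[u32] lowering described above (the names are invented, not from any real compiler):

    #include <cstdint>

    int64_t lowerU32Index(uint32_t u, bool deviceHasInt64) {
        if (deviceHasInt64) {
            return static_cast<int64_t>(u);  // zero-extend to i64; always exact
        }
        // Otherwise binding sizes were capped at 0x1FFFFFFFC bytes, so every
        // valid index fits in i32 and this narrowing is exact in practice.
        return static_cast<int32_t>(u);
    }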

@kvark
Contributor

kvark commented Jul 13, 2021

@litherum thank you for writing this down!
I'm wondering though, if we consider the subject of this issue ("Buffer indices should be unsigned"), isn't this logic equally compatible with having unsigned-only indices? I.e. if we wanted to only have unsigned indices, it appears to me that this would work just as well. FWIW, Rust indexing is strictly unsigned, and it works fine :)

@dneto0
Contributor

dneto0 commented Jul 14, 2021

What, exactly is the proposed WGSL spec change?

https://gpuweb.github.io/gpuweb/wgsl/#array-access-expr allows both i32 and u32 as array indices.

dneto0 added a commit to dneto0/gpuweb that referenced this issue Jul 14, 2021
dneto0 added a commit to dneto0/gpuweb that referenced this issue Jul 14, 2021
- rename "array size" in many places to the more specific term
  "element count"
- element count may be an unsigned integer literal, as per gpuweb#1135
- describe array type matching in the positive sense, and as an
  if-and-only-if set of rules.
- update example to show array size with unsigned integer literal.
- Reword array layout to make "element stride" a defined term, and
  separately write out how its value is determined.
@kvark
Contributor

kvark commented Jul 14, 2021

@litherum @dneto0 just a reminder that Office Hours are not official meetings, and are not scribed.
I know this has been discussed, so if there is any change to the story, please write it down loud and clear for everybody.

@litherum
Contributor Author

litherum commented Jul 14, 2021

During the office hours, we got to a point where (I think!!) we agreed on the following:

  1. Indices into array access operations can be both signed or unsigned
  2. Unsigned values > 2B do not get silently implicitly wrapped around to be negative
  3. The type of the maxUniformBufferBindingSize and maxStorageBufferBindingSize limits in the API should be a 64-bit type instead of a 32-bit type
  4. The type declaration of a sized array should be able to have its size parameter be a signed value or an unsigned value
  5. The WGSL spec will not state any explicit limit that sized arrays' size parameter must stay below. Instead, such a limit would be device-specific and shader-specific. (And unqueryable, I think.) If authors want to know if their array size is too big, they can compile the shader to see if it compiled successfully.

@kvark
Contributor

kvark commented Jul 14, 2021

Thank you! For clarity, this was an agreement between you and @dneto0 specifically, not a group consensus. But I'm sure it's very close to it.

If authors want to know if their array size is too big, they can compile the shader to see if it compiled successfully.

The limits we have apply to uniform and storage buffers. But what if I have a big array in the function or private storage class? Would shader compilation fail, and if so, how?

@dneto0
Contributor

dneto0 commented Jul 15, 2021

+1 thanks for the summary @litherum

Regarding status of the spec and necessary changes to implement:

  1. already in the spec, Array Access Expression
  2. regarding wrapping: I think there is no change here. First, there is no implicit conversion of the index type. Second, the out-of-array-bounds condition kicks in. For indexing into an array reference, you get an invalid memory reference. For indexing into an array value you get any valid value for the element type.
  3. That was Use 64bit limits for max uniform/storage buffer binding sizes. #1941 (landed) and I hear it was just an oversight that it was originally only 32-bits.
  4. Is Allow unsigned wgsl array sizes. #1942 (and incorporated into wgsl: array size may be module-scope constant (possibly overridable) #1792 )
  5. Agree. To answer @kvark: I assert shader creation will fail or pipeline creation will fail. There is no reliable way to anticipate that because it's very dependent on the target machine and the software stack beneath the browser. This is an aspect of "can pipeline creation fail in the driver compiler? particularly due to shader compilation failure" #1872

@kdashg
Contributor

kdashg commented Jul 27, 2021

WGSL meeting minutes 2021-07-27
  • Allow unsigned wgsl array sizes. #1942
  • wgsl: array size may be module-scope constant (possibly overridable) #1792
  • JG: We reached agreement on accepting both signed and unsigned. We are confident that we can implement these in a way that works for all of our platforms.
  • JG: I wrote up 1942 so grammatically you could use a uint instead of an int. 1792 includes this and more. Maybe that’s what we actually want.
  • DN: 1792 didn’t reach consensus. I want to land yours instead.
  • DN: For 1792, there was a discussion of a more constrained feature
  • JG: Is there more to talk about here?
  • DN: No. We all agree here.
  • DN: I like 1942 that just updates the grammar. That’s the only change to WGSL that’s required.
  • AB: I want to get back to 1792 later.
  • RESOLVED: Accept Allow unsigned wgsl array sizes. #1942

@kdashg kdashg closed this as completed Aug 10, 2021
WGSL automation moved this from Needs Decision to Done Aug 10, 2021
@kdashg
Contributor

kdashg commented Aug 10, 2021

WGSL meeting minutes 2021-08-10
  • (are we done here?)
  • Yes

dneto0 added a commit to dneto0/gpuweb that referenced this issue Aug 10, 2021
- rename "array size" in many places to the more specific term
  "element count"
- element count may be an unsigned integer literal, as per gpuweb#1135
- describe array type matching in the positive sense, and as an
  if-and-only-if set of rules.
- update example to show array size with unsigned integer literal.
- Reword array layout to make "element stride" a defined term, and
  separately write out how its value is determined.
dneto0 added a commit that referenced this issue Aug 11, 2021
…le (#2031)

* wgsl: array size may be overridable constant identifier

It's still an i32; it was only ever an INT_LITERAL in the first place.

Partially addresses #1431
(Later commits remove ability to override that constant)

* Array size can be a module-scope constant

Not just pipeline-overridable

* Spell out when array types are different

Also give examples.

* Simplify IO-shareable

Remove matrix, array, and nested structs

Fixes: #1877

* Rename "sized array" to "fixed-size array"

* Various updates and cleanups:

- rename "array size" in many places to the more specific term
  "element count"
- element count may be an unsigned integer literal, as per #1135
- describe array type matching in the positive sense, and as an
  if-and-only-if set of rules.
- update example to show array size with unsigned integer literal.
- Reword array layout to make "element stride" a defined term, and
  separately write out how its value is determined.

* Array element count can't be overridable

* A sized array is host-shareable, even if the element count is a module-scope constant

* Fix the note about IO-shareable types, regarding bool

* Remove stale (and incorrect) note.

Change-Id: Ic68a52b994b42ef3e7b12d9eecd8ab6e83cb61eb

* Relax element-count condition for array type equality

The element-count condition for fixed-size array types
depends only on element-count value.

Change-Id: I7452375741160412071b5f0fe2e6e615264a4b11
ben-clayton pushed a commit to ben-clayton/gpuweb that referenced this issue Sep 6, 2022
The current interpolation tests will not attempt to validate an
interpolation with just the type (e.g. `@interpolate(flat)`). This is
due to the `,` always being appended, so `@interpolate(flat, )` is
generated, which is not a valid value.

This PR updates the generation to only add the `,` if there is a
sampling value to be appended. 

Issue: gpuweb#1135