
WebGPU Compatibility Mode #4266

Open
SenorBlanco opened this issue Aug 8, 2023 · 30 comments
Labels: api (WebGPU API), compat (WebGPU Compatibility Mode), wgsl (WebGPU Shading Language Issues)

@SenorBlanco
Contributor

SenorBlanco commented Aug 8, 2023

Problem

WebGPU is a good match for modern explicit graphics APIs such as Vulkan, Metal and D3D12. However, a large number of devices do not yet support those APIs. In particular, 31% of Chrome users on Windows do not have D3D11.1 or higher. On Android, 23% of users do not have Vulkan 1.1 (15% do not have Vulkan at all). On ChromeOS, Vulkan penetration is still quite low, while OpenGL ES 3.1 is ubiquitous.

Goals

The primary goal of WebGPU Compatibility mode is to increase the reach of WebGPU by providing an opt-in, slightly restricted subset of WebGPU which will run on older APIs such as D3D11 and OpenGL ES. This will increase adoption of WebGPU applications via a wider userbase.

Since WebGPU Compatibility mode is a subset of WebGPU, all valid Compatibility mode applications are also valid WebGPU applications. Consequently, Compatibility mode applications will also run on user agents which do not support Compatibility mode. Such user agents will simply ignore the option requesting a Compatibility mode Adapter and return a Core WebGPU Adapter instead.

WebGPU Spec Changes

partial dictionary GPURequestAdapterOptions {
    boolean compatibilityMode = false;
}

When calling GPU.requestAdapter(), passing compatibilityMode = true in the GPURequestAdapterOptions indicates to the User Agent that it should select the Compatibility subset of WebGPU. Any Devices created from the resulting Adapter on supporting UAs will support only Compatibility mode. Calls to APIs unsupported by Compatibility mode will result in validation errors.

Note that a supporting User Agent may return a compatibilityMode = true Adapter which is backed by a fully WebGPU-capable hardware adapter, such as D3D12, Metal or Vulkan, so long as it validates all subsequent API calls made on the Adapter and the objects it vends against the Compatibility subset.

partial interface GPUAdapter {
    readonly attribute boolean isCompatibilityMode;
}

As a convenience to the developer, the Adapter returned will have the isCompatibilityMode property set to true.
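
As an illustration only, here is a minimal sketch of the opt-in, assuming the proposed compatibilityMode option and isCompatibilityMode attribute (everything else is existing WebGPU API):

const adapter = await navigator.gpu.requestAdapter({ compatibilityMode: true });
// A UA that does not implement Compatibility mode simply ignores the option and
// returns a Core adapter; Compatibility-conformant code runs on it unchanged.
if (adapter.isCompatibilityMode) {
    // Stay within the Compatibility subset; calls outside it produce validation errors.
}
const device = await adapter.requestDevice();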

partial dictionary GPUTextureDescriptor {
    GPUTextureViewDimension textureBindingViewDimension;
}

See "Texture view dimension can be specified", below.

Compatibility mode restrictions

1. Texture view dimension may be specified

When creating a texture, the textureBindingViewDimension property determines the views which can be bound from that texture for sampling (see "WebGPU Spec Changes", above). Binding a view of a different dimension for sampling than specified at texture creation time will cause a validation error. If textureBindingViewDimension is unspecified, use the same algorithm as createView():

if desc.dimension is "1d":
  set textureBindingViewDimension to "1d"
if desc.dimension is "2d":
  if desc.size.depthOrArrayLayers is 1:
    set textureBindingViewDimension to "2d"
  else:
    set textureBindingViewDimension to "2d-array"
if desc.dimension is "3d":
  set textureBindingViewDimension to "3d"

Justification: OpenGL ES does not support texture views.
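
For illustration, a hedged sketch of creating a texture intended to be sampled as a cube map under this proposal (textureBindingViewDimension is the proposed field; the rest is existing WebGPU API):

const cubeTexture = device.createTexture({
    size: [512, 512, 6],
    format: "rgba8unorm",
    usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST,
    textureBindingViewDimension: "cube", // proposed hint; Core WebGPU would ignore it
});
// In Compatibility mode, binding a view of any other dimension from this
// texture for sampling would be a validation error.
const cubeView = cubeTexture.createView({ dimension: "cube" });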

Alternatives considered:

  • add viewDimension to GPUTextureDescriptor, as above, but make it mandatory; not specifying a viewDimension is a validation error

    • pros:
      • ensures no unexpected behaviour from existing apps
    • cons:
      • more verbose than a default
  • make a view dimension guess at texture creation time, and perform a texture-to-texture copy at bind time if the guess was incorrect.

    • pros:
      • wider support of existing WebGPU content without modification
    • cons:
      • unexpected performance cliff for developers
      • potentially increased VRAM usage (two+ copies of texture data)
  • make a view dimension guess at texture creation time, and perform a texture-to-texture copy on first binding if the guess was incorrect. All subsequent views bound for sampling from that texture must have the same dimension as the first use, else a validation error occurs.

    • pros:
      • wider support of existing WebGPU content without modification
      • at worst, a one-time performance penalty
    • cons:
      • unexpected performance cliff for developers
      • potentially increased VRAM usage (at worst, two copies of texture data)
  • disallow 6-layer 2D arrays (always cube maps)

    • cons:
      • poor compatibility, limits applications
  • disallow cube maps (always create 6-layer 2D arrays)

    • cons:
      • poor compatibility, limits applications

2. Disallow CommandEncoder.copyTextureToBuffer() and CommandEncoder.copyTextureToTexture() for compressed texture formats

CommandEncoder.copyTextureToBuffer() and CommandEncoder.copyTextureToTexture() of a compressed texture is disallowed, and will result in a validation error.

Justification: Compressed texture formats are non-renderable in OpenGL ES, and
glReadPixels() only works on a framebuffer-complete FBO. Additionally, because ES 3.1 does not support glCopyImageSubData(), texture-to-texture copies must be worked around with glBlitFramebuffer(). Since compressed textures cannot be bound for rendering, they cannot use the glBlitFramebuffer() workaround.

Alternatives considered:

  • implement a shadow copy buffer, and upload the compressed data to both a buffer and a texture
    • pros:
      • good compatibility
    • cons:
      • performance overhead, even when readbacks are not required
      • VRAM overhead

3. Views of the same texture used in a single draw may not differ in mip level or array layer parameters.

A draw call may not reference the same texture with two views differing in baseMipLevel, mipLevelCount, baseArrayLayer, or arrayLayerCount. Only a single mip level range and array layer range per texture is supported. This is enforced via validation at encode time.

Justification: OpenGL ES does not support texture views.
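
As a hypothetical illustration of what this rules out (plain WebGPU API; the texture variable is assumed):

// Two views of the same texture that differ only in baseMipLevel:
const mip0View = texture.createView({ baseMipLevel: 0, mipLevelCount: 1 });
const mip1View = texture.createView({ baseMipLevel: 1, mipLevelCount: 1 });
// Referencing both views from bind groups used by the same draw would be
// rejected at encode time under this restriction.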

Alternatives considered:

  • when two bindings exist with different mip levels or array layers, do a texture-to-texture copy
    • pros:
      • good compatibility
    • cons:
      • a performance cliff for developers
      • higher VRAM usage

4. Color state alphaBlend, colorBlend and writeMask may not differ between color attachments in a single draw.

Color state descriptors used in a single draw must have the same alphaBlend, colorBlend and writeMask, or else an encode-time validation error will occur.

Justification: OpenGL ES 3.1 does not support indexed draw buffer state.

Alternatives considered

  • require GL_EXT_draw_buffers_indexed
  • expose as a WebGPU extension when the OpenGL ES extension is present (this could be a followup change)
    • pros:
      • ease of implementation
      • good performance
    • cons:
      • if this is the only implementation, it has poor reach

5. Disallow sample_mask builtin in WGSL.

Justification: OpenGL ES 3.1 does not support gl_SampleMask, gl_SampleMaskIn.

Alternatives considered

  • require GL_OES_sample_variables
  • expose as a WebGPU extension when the OpenGL ES extension is present (this could be a followup change)
    • pros:
      • ease of implementation
    • cons:
      • poor reach, unless this is built on top of the proposed solution

6. Disallow GPUTextureViewDimension "CubeArray" via validation

Justification: OpenGL ES does not support Cube Array textures.

Alternatives Considered:

  • none

7. Disallow textureLoad() of depth textures in WGSL via validation.

Justification: OpenGL ES does not support texelFetch() of a depth texture.

Alternatives considered:

  • bind to an RGBA8 binding point and use shader ALU
    • pros:
      • compatibility, performance
    • cons:
      • untried (does this work?)
  • use texture() with quantized texture coordinates; massage the results
    • pros:
      • compatibility, performance
    • cons:
      • untried
      • complexity of implementation

8. Disallow texture*() of a texture_depth_2d_array with an offset

Justification: OpenGL ES does not support textureOffset() on a sampler2DArrayShadow.

Alternatives considered:

  • emulate with a texture() call and use ALU for offset
    • pros:
      • compatibility, performance
    • cons:
      • untried

9. Emit dpdx() and dpdy() for all derivative functions (including Coarse and Fine variants).

Justification: GLSL does not support dFd*Coarse() or dFd*Fine() functions. However, these variants can be interpreted as a hint in WGSL, and emitted as dFd*().

Alternatives considered:

  • disallow Coarse and Fine variants via validation in WGSL
    • cons:
      • poor compatibility
  • Coarse is allowed; Fine is disallowed via validation

10. Disallow bgra8unorm-srgb textures.

Justification: OpenGL ES does not support sRGB BGRA texture formats.

Alternatives considered:

  • use a compute shader to swizzle bgra8unorm-srgb to rgba8unorm-srgb on copyBufferToTexture() and the reverse on copyTextureToBuffer()
    • pros:
      • wide compatibility
    • cons:
      • a performance cliff for developers
      • increased VRAM usage

Compatibility mode workarounds

The features below are not supported natively in OpenGL ES, but it is proposed to implement them in the User Agent via workarounds.

1. Emulate copyTextureToBuffer() of depth/stencil textures with a compute shader

Justification: OpenGL ES does not support glReadPixels() of depth/stencil textures.

Alternatives considered:

  • use CPU readback and re-upload
    • pros:
      • wide support
    • cons:
      • large performance cliff
  • disallow via validation
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility
  • require GL_NV_read_depth_stencil
    • pros:
      • good performance
    • cons:
      • poor support (<1% on gpuinfo.org)

2. Emulate copyTextureToBuffer() of SNORM textures with a compute shader

Justification: OpenGL ES does not support glReadPixels() of SNORM textures

Alternatives considered:

  • disallow via validation
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility; limits applications

3. Emulate separate sampler and texture objects with a cache of combined texture/samplers.

Justification: OpenGL ES does not support separate sampler and texture objects.
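
This is not proposal text, just a rough sketch of what such a cache might look like inside a GLES-backed implementation (all names are illustrative):

// Memoize one backend object per (texture, sampler) pair actually used together.
const combinedCache = new Map();

function getCombinedTexture(textureId, samplerId, createCombined) {
    const key = `${textureId}:${samplerId}`;
    let combined = combinedCache.get(key);
    if (combined === undefined) {
        // createCombined builds whatever backend state represents this texture
        // sampled with this sampler (e.g. a texture with matching parameters).
        combined = createCombined(textureId, samplerId);
        combinedCache.set(key, combined);
    }
    return combined;
}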

Alternatives considered:

  • allow only a single sampler to be used with a given texture
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

4. Inject hidden uniforms for textureNumLevels() and textureNumSamples() where required.

Justification: OpenGL ES 3.1 does not support textureQueryLevels() (only added to desktop GL in OpenGL 4.3).
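
Sketching the idea only (nothing here is spec text; gl, metadataBuffer and textureDesc are hypothetical names): the emitted GLSL reads the level count from a small uniform block, and the implementation fills it in from the value it already knows at texture-creation time.

// Hypothetical GLSL the translator might emit in place of a level query:
const emittedGlsl = `
    uniform TextureMetadata { uint numLevels_tex0; };
    // WGSL 'textureNumLevels(tex0)' is lowered to a read of numLevels_tex0
`;

// At bind time the implementation writes the known mip count into that block:
gl.bindBuffer(gl.UNIFORM_BUFFER, metadataBuffer);
gl.bufferSubData(gl.UNIFORM_BUFFER, 0, new Uint32Array([textureDesc.mipLevelCount]));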

Alternatives Considered:

  • disallow textureNumLevels() and textureNumSamples() in WGSL via validation.
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

5. Emulate 1D textures with 2D textures.

Justification: OpenGL ES does not support 1D textures.

Alternatives Considered:

  • disallow 1D textures in WGSL and API
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

6. Manually pad out GLSL structs and interface blocks to support explicit @align or @size decorations.

Justification: OpenGL ES does not support offset= interface block decorations on anything but atomic_uint.

Alternatives considered:

  • disallow @align and @size on WGSL structs via validation
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

7. Use GL_EXT_texture_format_BGRA8888 to support BGRA copyBufferToTexture(); where the extension is unavailable, use swizzle workarounds and RGBA textures.

Justification: OpenGL ES does not support BGRA texture formats.

GL_EXT_texture_format_BGRA8888 supports texture uploads and the BGRA8888 texture format, and has 99%+ support. The vast majority of devices which do not support it are GLES 3.0 implementations, and so would not support Compatibility mode anyway; if an important device emerges, a CPU- or GPU-based swizzle workaround with RGBA textures should be implemented.

Alternatives considered

  • disallow BGRA8888 as a texture format through validation
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

8. Work around lack of BGRA support in copyTextureToBuffer() via compute or sampling.

Justification: OpenGL ES does not support BGRA texture formats for glReadPixels(), even with the GL_EXT_texture_format_BGRA8888 extension.

Alternatives considered:

  • disallow copyTextureToBuffer() for BGRA formats.
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility

9. Use emulation workaround to support BaseVertex / BaseInstance in direct draws. Disallow via validation in indirect draws.

Justification: OpenGL ES 3.1 does not support baseVertex or baseInstance parameters in Draw calls.

Alternatives considered

  • require OES_draw_elements_base_vertex (21% support) or EXT_draw_elements_base_vertex (21%) and GL_EXT_base_instance (1.7%)
    • pros:
      • ease of implementation
    • cons:
      • poor compatibility
@SenorBlanco
Contributor Author

It was asked why we did not consider extending the reach further to include D3D FL10_0. However, the limitations imposed in compute shaders seem overly restrictive, and would prevent many useful compute features. In particular (if I'm reading it correctly), only a single UAV is allowed, meaning only a single storage buffer. Storage textures would be unavailable. There are no atomic instructions.

@greggman
Contributor

greggman commented Aug 17, 2023

An idea came up today in our meeting which can be summed up as 'Make WebGPU work as-is for compat'.

The suggested solutions were

  • emulate more things by shader generation
  • add more restrictions to WebGPU itself

That sounds great! ... if possible

Some points were made

  • Devs that really want perf will write 2 renderers (WebGL and WebGPU); therefore, perf of WebGPU on compat-level hardware is not important

    I think Chrome would vehemently disagree with this POV. The class of hardware that would need these compat features is the slowest hardware; it therefore needs the best perf, not the worst. If solutions take more memory or run XX% slower they're not useful. These devices are also likely to have the least amount of memory, so solutions that require creating shadow buffers or other large temporary copies seem problematic.

    I also don't agree that devs will write both. Devs have limited time and budget.

  • Worried that there will be 2 WebGPUs if compat exists

    I didn't understand this point. Compat is designed to disappear. Compat is
    designed to be 100% forward compatible with WebGPU. It's designed
    so browsers don't have to implement it if they don't want to, and any app
    that sticks to the compat limits will run just fine in full WebGPU, no changes.

    It's unclear how this is different from WebGPU features. Take fp16:
    it sounds like the existence of fp16 causes 2 WebGPUs, those
    with and those without it. What makes that acceptable and this
    proposal not acceptable?

All that said, any ways in which we can make "compat" more compatible with full WebGPU and keep high performance would be super great IMO.

With that in mind, some things that don't seem emulatable without a large perf hit (note: for the points below, assume we're discussing a compat implementation on top of OpenGL ES 3.1).

  • 2d vs 2d-array vs cube-maps vs cube-map arrays

    WebGPU has 1 type of "2d" texture which can be used 4 different ways:
    as a 2d texture, as a 2d-array, as a cube, and as a cube-array, whereas OpenGL ES 3.1
    is not as flexible.

    It sounds like the suggestion here is that all 2d textures
    be stored as 2d-arrays in OpenGL ES. Then, if the user
    uses texture_cube or texture_cube_array in their code, emit GLSL
    to emulate cubemaps on top of 2d-arrays. (2d can always(?) be emulated as
    a 1 layer 2d-array)

    You can find example GLSL for emulating cubemaps from 6 2d textures here

    My guess is this would be unacceptably slow. On top of the linked example being
    a bunch of GLSL, it doesn't handle sampling across faces for seamless cubemaps
    and adding support would make it even slower.

    You'd also have the issue that wrap modes can be set per-axis which seems like
    you'd potentially need 3 samplers to emulate a single cubemap, which would mean
    if the user used the maximum number of cubemaps you'd run out of samplers to
    emulate with. Another solution would be to emit GLSL that emulates the samplers.

    I'm only guessing this was the suggested solution. Looking forward to hearing more

    A solution here would be to add the limits suggested at the top into full WebGPU.
    That would likely break existing content. The most common issue is generating mip
    levels. In full WebGPU, it's easy to view each mip as single level, single layer
    2d texture. With the limits suggested above, if you wanted to generate mip levels
    for a cube or 2d-array then your mip level generation shader needs
    to use texture_cube, texture_2d_array and not texture_2d.

  • separate writeMask, alphaBlend, colorBlend

    It's not clear how this would be emulated efficiently. Off the top of my head
    it would require a separate render per render-attachment and dealing with
    z-buffer updates, stencil-updates, occlusion-query pixel counts.

    This sounds like it would be a large perf hit. Maybe there are other ways to
    accomplish this that don't have a large perf hit?

    Also, you can write to storage buffers from a fragment shader, and use atomics,
    which seems like it would make running a shader multiple times impossible since
    each iteration would see any storage buffers in a different state.

    A solution here would be to add the limits suggested at the top to full WebGPU

  • emulating sample_mask

    Can this be emulated with gl_FragCoord?

  • emulating textureView.baseArrayLayer and textureView.baseArrayCount

    IIUC no corresponding setting exists in OpenGL ES 3.1, so emulating it seems
    like it would require passing in some per-sampler uniforms and then adjusting/clamping
    the layer parameter to GLSL functions that take one.

    This might require lowering the maximum allowed uniform buffers?

    Another solution is to remove those settings from WebGPU.

It seems like most of the remaining issues don't matter (they're emulated already) with the following exceptions

  • OpenGL ES 3.1 does not support bgra8unorm-srgb

    Remove it from WebGPU?

  • OpenGL ES 3.1 does not allow copying compressed textures, nor does it allow copying
    RGB9_E5 textures

    Disallow via validation in WebGPU as well?

@litherum
Contributor

litherum commented Aug 22, 2023

Devs that really want perf will write 2 renderers (WebGL and WebGPU) therefore, perf of WebGPU on compat level hardware is not important

This is a miscommunication. In one proposal, a bunch of features won't work. In another proposal, those features will work, albeit more slowly than they would in non-compat mode.

Any application which is okay with missing features must necessarily be okay if those features exist but are slower than they would in non-compat mode. In both scenarios, the author has to know which features are missing/polyfilled, so they can know to avoid them if they're missing, or decrease their use if polyfilled. So it's totally okay to decrease performance for features which otherwise would be missing.

(We'd be totally OK telling applications what is slower than it would be in non-compat mode (and telling them if they are in compat mode or not). Either in the JS console, in the spec, in MDN, or elsewhere.)

Compat is designed to disappear.

I'm kind of bewildered by this idea, for 2 reasons:

  1. Why should anyone spend any time implementing/standardizing a feature which is expected to not be relevant in the future?
  2. Web specs don't disappear. Validation will live forever. Everything we standardize lasts forever.

It sounds like the existence of fp16 causes 2 WebGPUs

Our expectation is that fp16 will become ubiquitous over time. The difference is that usage of fp16 will increase to ubiquity, whereas usage of compat mode will decrease to irrelevance.

@litherum
Contributor

litherum commented Aug 22, 2023

Here are some ideas about polyfilling:

  1. Texture view dimension must be specified

This is a situation where late compilation can help.

It sounds like the suggestion here is that all 2d textures
be stored as 2d-arrays in OpenGL ES. Then, if the user
uses texture_cube or texture_cube_array in their code, emit GLSL
to emulate cubemaps on top of 2d-arrays. (2d can always(?) be emulated as
a 1 layer 2d-array)

As a first-order approximation, that's generally right. As long as there is no data loss, and all the information necessary to be stored is, in fact, stored somewhere in the texture, it will always be possible to emulate any kind of lookup (cube map or otherwise) in software. It will be slower - you may need to texelFetch multiple texels and blend them together in software, but that's better than the feature being missing entirely.

  2. Disallow CommandEncoder.copyTextureToBuffer() and CommandEncoder.copyTextureToTexture() for compressed texture formats

This seems possible with a compute shader. These operations are already only valid between passes, where the implementation can inject compute shaders. (And, of course, compute shader support is already necessary in general.)

(Also, compressed texture format support is already optional. Another option is just to not support them.)

  3. Views of the same texture used in a single draw may not differ in mip level or array layer parameters.

Similar to (1) above. If we can recompile whenever we want, we can emit code that adds a constant to any miplevel accesses or array layers.

  4. Color state alphaBlend, colorBlend and writeMask may not differ between color attachments in a single draw.

This one seems difficult. Depending on the use cases, there may be a few possible paths forward: a) require the GL extension, b) enforce the requirement in WebGPU proper, c) run the render pass N times for N color targets with different states. Maybe there's another way, too.

  5. Disallow sample_mask builtin in WGSL.

gl_SampleMaskIn should be polyfillable with enough knowledge of sampling locations and texture sizes.

sample_mask is pretty poorly defined in the first place because we don't allow the samplePositions to be specified (or we don't specify what they are) so that means results are already inconsistent between vendors. It might be possible to get something good enough with a post-process compute shader. Or maybe there's another way by augmenting the fragment shader with another render target output.

  6. Disallow GPUTextureViewDimension "CubeArray" via validation

Again, late compilation and emulating sample operations in software can polyfill this.

  7. Disallow textureLoad() of depth textures in WGSL via validation.

It should be possible to emulate loading by crafting sample positions and sampling.

  8. Disallow texture*() of a texture_depth_2d_array with an offset

This is trivial to polyfill if you know how big the texture is.

  9. Emit dpdx() and dpdy() for all derivative functions (including Coarse and Fine variants).

There is no spec change necessary for this. Implementations can just do it.

  10. Disallow bgra8unorm-srgb textures.

Being able to recompile whenever you want alleviates the need for this too. WebGPU BGRA textures can be represented as OpenGL ES RGBA textures, and at draw call time it's possible to re-compile the shader to do the necessary swizzling. This uses the fact that every texture access in WGSL knows which texture it's accessing.

(We are intentionally not commenting on the "Compatibility mode workarounds" section, as they are in the spirit of this proposal we are making.)

(As an aside: it strikes me that the above set of polyfills is way smaller than the amount of polyfills required to get WebGL running on D3D on Windows machines.)

@litherum
Contributor

litherum commented Aug 22, 2023

Reiterating some of the points I made on the previous call here, just for posterity:

  1. It's beneficial for WebGPU in general to have wide reach. Wide reach means more people use WebGPU. Wide reach means more WebGPU content. That's better for all implementations, even if they are running on a device that supports Metal / D3D12 / Vulkan.
  2. It's okay for features to be polyfilled with less performance than they would otherwise have in non-compat mode. There cannot be an application that would be okay if a feature was missing, but not okay if the feature was present but slower than it would be on another device. In both scenarios, the author has to know which features are missing/polyfilled, so they can know to avoid them if they're missing, or decrease their use if polyfilled.
    • If an author wants every last ounce of performance, they can get it, either by rewriting their WebGPU code to not use the polyfilled features, or by having a WebGL implementation of their webapp too.
    • We agree that it's still not okay to destroy performance; the polyfilled operations can't be so slow they're impossible to use by any content. Doing a full texture-to-texture copy is probably not acceptable, because usually full textures are not read on every frame. Doing a bunch of ALU and calling texelFetch() 4 times, instead of a single call to sample(), though, would be acceptable.
  3. The benefit of compat mode is to get the API to run at all, so developers (who don't need every last ounce of performance) don't have to write their webapp twice (once in WebGL and once in WebGPU) for wide reach.
  4. OpenGL famously will recompile shaders whenever it feels like, under the hood. I'm assuming D3D 9/10/11 will do so too (though I am less familiar with them). Therefore, if compat mode WebGPU is targeting being run on those APIs, it must be acceptable for compat mode WebGPU to recompile WGSL shaders whenever it feels like.
  5. Our primary goal here is to avoid fragmenting the ecosystem, potentially forever, for what we can all agree is a temporary situation. Increasing the reach of WebGPU is great, but adding a mode switch and forking a web spec in 10 different places because of a temporary situation is not.
  6. We'd be totally okay telling the webapp that it's in compat mode. Specifically, what this would mean to a developer is "you can no longer reason about which calls will be fast and which will be slow" (which is already going to be true for any implementation running on OpenGL). This could either be a single boolean exposed from the browser to the page, or it could be phrased similar to failIfMajorPerformanceCaveat (or something else).

@greggman
Contributor

greggman commented Aug 23, 2023

Thank you for all the details. I was in the middle of writing up thoughts but it sounds like you covered most of them above.

I personally like this direction. Recompiling shaders doesn't seem bad to me. It will only come up if users go outside the proposed limits and the browser can warn them "this operation is slower, consider this alternative"

The question still comes up what specifically to do on some features

  • non-uniform colorState, alphaState, writeMask

    It's not clear if it's possible to emulate this, especially given writable storage
    buffers in fragment shaders. What's the solution here? Make non-uniform color target
    state be an optional feature in WebGPU? Is there some other solution?

  • (new) r16float, rg16float, rgba16float

    These are renderable in WebGPU but only optionally renderable in OpenGL ES 3.1.
    Maybe someone with more stats can say if the devices we want to include support
    EXT_color_buffer_half_float. If not, does this become an optional feature
    in WebGPU?

  • (new) maxVertexAttributes

    In OpenGL, gl_VertexID and gl_InstanceID end up taking 1 or 2 attribute slots,
    reducing the maximum vertex attributes. One solution is to add this limit to WebGPU.
    maxVertexAttributes stays >= 16 but @builtin(vertex_index) and
    @builtin(instance_index) each count for 1 slot. Is there another solution?

  • copying compressed textures

    It wasn't clear to me how a compute shader helps here. Are you suggesting sampling
    the texture (so effectively uncompressed data) and compressing into the target
    format on the fly, or am I missing how you can otherwise access the compressed
    texture data as raw data in a compute shader? Just asking for clarification.

@greggman
Contributor

I want to add, I still see the compat mode validation limits as a viable solution. In particular, it doesn't require removing features from shipping WebGPU (assuming that ends up being a requirement to take the "more emulation" path). So, I wanted to comment on these points

Compat is designed to disappear.

I'm kind of bewildered by this idea, for 2 reasons:

  1. Why should anyone spend any time implementing/standardizing a feature which is expected to not be relevant in the future?
  2. Web specs don't disappear. Validation will live forever. Everything we standardize lasts forever.

This doesn't seem true to me. Any browser that doesn't want to implement compat mode doesn't have to. Devs may set the compatibilityMode: true property when calling requestAdapter but any browser is free to ignore it.

What the devs get still works with their code. There is no validation that needs to live forever.

It sounds like the existence of fp16 causes 2 WebGPUs

Our expectation is that fp16 will become ubiquitous over time. The difference is that usage of fp16 will increase to ubiquity, whereas usage of compat mode will decrease to irrelevance.

I'd say those two are the same thing. Full WebGPU will increase to ubiquity exactly as fp16 will increase to ubiquity. I don't understand the difference given the design.

@greggman
Contributor

greggman commented Aug 25, 2023

More things to consider:

Given 2 possible implementations

  1. compat = extra validation + viewDimension hint
  2. compat = emulation

emulation may be slow for things that would be fast using validation+hint

It's not clear emulation will always be fast for paths that would be fast under the validation+hint scheme. For example, 6 layer 2d-array vs cube-map.

In the validation+hint impl, the user provides a viewDimension hint that full webgpu ignores, and compat uses. On a compat impl the backend allocates either a TEXTURE_2D_ARRAY or TEXTURE_CUBE_MAP based on the hint, both of which are fast for their use cases. The user is not allowed to use one as the other.

In the emulation impl, the impl probably guesses? If 6 layers then TEXTURE_CUBE_MAP (because more common?), if not 6 layers then TEXTURE_2D_ARRAY. If the user then uses a 6-layer texture as a 2d-array, the implementation would have to emulate 2d-array logic on top of a TEXTURE_CUBE_MAP. That emulation appears to be pretty slow (I wrote it as an exercise); in particular, it's slow because you need to emulate wrapping and clamping at the edges. So this path, which would be faster with the validation+hint solution, is significantly slower with the emulation solution.

Maybe you could keep a copy of the texture data and generate both TEXTURE_2D_ARRAY and TEXTURE_CUBE_MAP when used. It seems unacceptable to keep a CPU copy of the texture data to generate 2d-array or cube-map as needed because that would use 2x-3x the memory, nor can you convert the texture from cube-map to 2d-array since if the texture is a non-copyable format (like a compressed texture) then you have no way to do the conversion.

Further, it doesn't appear you can wait until first use to decide what to do on the backend (TEXTURE_2D_ARRAY or TEXTURE_CUBE_MAP). For one, that would require more memory, as any texture yet to be used needs to keep a CPU copy of the data while waiting for first use. A user could easily upload lots of textures first, only use them later, and run out of CPU memory as the backend holds its copy. Further, all of the functions that operate on a texture would have to have 2 paths: one to work on the CPU copy and the existing ones that work on a real texture (writeTexture, copyTextureToTexture, copyBufferToTexture, copyTextureToBuffer, etc.). That's because the only signal that actually lets the impl know which type of texture to make on the backend is TEXTURE_ATTACHMENT usage.

Another issue with deciding on first use is that first use is often generating mipmaps. So, the user loads a 6-layer texture, generates mips, and the implementation now has to choose: is it TEXTURE_2D_ARRAY or TEXTURE_CUBE_MAP? Whichever one it picks, if it picks wrong, the user gets the slow path. We could warn the user, but then the user will need some really strange workaround where they use the texture as TEXTURE_ATTACHMENT with the view they really want to see in the end, before actually generating mips, all to hopefully coerce the impl into choosing the faster path. It seems better to let the user just tell it their intent via a hint.

Yet another memory issue is shader variations. Doing the emulation requires baking emulated sampler parameters into the shader PER emulated texture. In practice I suspect there wouldn't be that many variations used, but as it stands there are 18 bits of info needed per emulated texture, so 2 ** (18 * numEmulatedTextures) shader variations per shader that needs emulated textures.

Emulation is potentially a waste of time

If we go about implementing slow emulation solutions for compat, we'll likely want to tell users they're hitting the slow path. Concrete examples

  • (6 layer 2d-array vs cube-map) above
  • cube-arrays (don't exist on GLES 3.1 so would have to be emulated via 2d-arrays)

So, user uses a 6 layer texture as a 2d-array, browser prints to dev-console "this is really slow, consider a different solution".

It seems strange to spend a bunch of time writing slow emulation just to then tell the user "don't use this".

There's an assumption here that emulation is not slow because it has to compile a shader; it's slow because the resulting shader doing the emulation is slow.

@SenorBlanco
Contributor Author

SenorBlanco commented Aug 29, 2023

Thanks to all for your thoughts and ideas! While I do think the Compatibility mode is still the better option, I think there are merits to the proposal to implement as much of WebGPU as possible. If further core functionality can be made to work and the community can agree to unship those features which cannot, it would obviate the need for the compatibility flag. This would make it possible to run the large majority of existing core WebGPU content without modification.

However, one baseline condition I feel we must meet is that any further core functionality implemented would not hinder performance on the common use cases that the Compatibility proposal already allows. If emulation for the rare cases impedes performance in the common case, it's a non-starter. I believe that makes the viewDimension on GPUTextureDescriptor a requirement, even if it is only a hint. Otherwise, in the case of the 6-layer array <> cube map ambiguity for example, creating a view of non-default dimension would incur emulation, regardless of the default chosen, resulting in more texture samples and more ALU (or texture copying) than would be needed with a specified viewDimension. If I'm correct, then perhaps a first step would be to debate and standardize the viewDimension hint on its own.

Another concern I have is that providing warnings on slow paths in the dev console would have to be implemented on all browsers and all backends, even on those platforms which do not need to incur them. Otherwise, the developer may hit those unexpected performance cliffs. It would be best if those warnings were incorporated in the spec.

That said, I'd like to focus on the issues that we think are difficult to implement efficiently, such as per-attachment blend state and copying of compressed textures. If we cannot agree to unship those features, then a compatibility flag becomes the better option and we can debate the other issues (such as texture views) with knowledge that we are heading towards consensus rather than stumble on possible deal-breakers. Another one I'd like to draw your attention to is the lack of baseVertex/baseInstance in indirect draws. (My fault: I placed it in the list of workarounds, since direct draws can be emulated. Will fix!)

@teoxoy
Member

teoxoy commented Aug 30, 2023

Want to mention a few more restrictions imposed by trying to run on top of OpenGL ES 3.1:

  • GPURenderPassDepthStencilAttachment.{depthReadOnly,stencilReadOnly} must be false (since it's undefined behavior in OpenGL ES to have the same texture bound in the shader while also using it as an attachment; see "Rendering Feedback Loops" spec chapter)
  • GPUTextureDescriptor.viewFormats must be empty (since OpenGL ES doesn't have texture views)
  • GPUDepthStencilState.depthBiasClamp must be 0 (without GL_EXT_polygon_offset_clamp; its reach being only 7.15%)
  • the maximum index value (within an index buffer) must be at most 2^31 - 1 (since GL_MAX_ELEMENT_INDEX is only at most i32::MAX); sounds expensive to validate, I wonder how the driver/hardware react in practice

Limits:

Features:

  • we currently require either BC or (ETC2 and ASTC) compressed formats; ETC2 formats are part of core OpenGL ES 3.1 but ASTC formats only become core in 3.2, so the GL_KHR_texture_compression_astc_ldr ext would be required for OpenGL ES 3.1, which according to my calculations is reported as being supported by 89.87% of GLES 3.1 reports (we should probably exclude reports on old drivers and see what the support is then)

There might be more restrictions but these are the additional ones I found so far.

It would be helpful to have a tool similar to https://github.com/kainino0x/gpuinfo-vulkan-query but for OpenGL ES.

@teoxoy
Member

teoxoy commented Aug 30, 2023

Also, has anyone checked if OpenGL ES 3.1+ has the same level of support for the texture formats we currently have in the spec (and more importantly their capabilities)?

@kdashg
Contributor

kdashg commented Aug 30, 2023

GPU Web 2023-08-16 (Atlantic-timed)
  • SW: take 2. Couple updates from last week. meta-discussion: is everyone on board with compatibility mode being in the spec. There was a question about whether D3D10 should be in the supported spec: its compute restrictions are overly restrictive.
  • SW: 2 edits to the proposal. One - needed to restrict limits on texture views at bind time rather than creation time. Otherwise couldn't render to slices of 2D arrays, for example. And copies between compressed textures don't work.
  • MM: we did another review of this.
  • MM: important to start by saying - increasing the reach of WebGPU is good for everybody. We like the goal. The more devices that can run WebGPU, the more authors will target it.
  • MM: from Apple's standpoint, the win of this compat work is allowing people to not have to write their app twice, once for WebGL and once for WebGPU. One audience - cares about perf. They can write their app twice. Another group, interested in using WebGPU but don't want to have to write app twice, and maybe WebGPU's reach isn't broad enough, so will use WebGL. nobody wants that. Those are the people we're trying to convert over.
  • MM: main thrust of compat work from our standpoint is compat, not perf. Of course it's possible to construct such a badly performing impl that it wouldn't work. Main thrust - make WebGPU work, full stop, not necessarily get every last bit of perf.
  • MM: GL famously will run compilations whenever it wants to. One of the reasons the new APIs exist. If GL under the hood runs compilations under the hood, then we should be able, too - when running on GL. (On Metal/D3D12/Vk, have more perf guarantees.)
  • MM: if you accept that premise, being able to run compilation later than createPipelineStateObject gets you a lot of mileage. Looking at big list of validation, most if not all could be alleviated if we could compile later. Example: lack of texture views. In WebGPU, possible to have tex viewed as both 2D array and cube map. In WGSL we know which accesses come from which resources. Compiling late, we could recompile the program at binding time. Change the program's code. At the place which accesses the cube map view, emulate those operations in software. Will have perf impact, true. However the goal isn't to get perfect perf, it's to get the API to work.
  • MM: to close: the reason I give this speech is that we're worried that adding this mode will fragment WebGPU usage. Having two WebGPUs is not a good design going forward. Also, the reason spending so much time - Vk/D3D12/Metal will be the way of the future, and OpenGL/D3D11 will decrease over time. Creating this mode which needs to be maintained forever, the burden outweighs the benefit.
  • MM: of the remaining problems that can't be solved with late compilation - think it's worth researching how to handle them.
  • SW: thanks for your in-depth thoughts on this.
  • CW: you mentioned devs that care most about perf would be OK supporting both WebGL and WebGPU, so compat mode doesn't need to be as performant as WebGL. With small amounts of restrictions mentioned here, the compat mode would be much more performant than WebGL, and that's a win for developers. If your premise is that this mode can be slower than WebGL, then I understand your point of view.
  • KR: Goes beyond late binding, certain operations that are incompatible with OpenGL, no practical way to emulate. Lot of texture copying behind the scenes to emulate those things. Very expensive, bandwidth intensive. Have carefully thought through - though have we done profiling?
    • SW: No
  • KR: Our team’s experience that we want to avoid those large data copies behind the scenes as much as possible
  • MM: agree with you there. In this doc it was helpful to list alternatives considered. We agree that doing a full texture copy would probably be unacceptable in most cases - when you attach a texture to a shader, the shader usually won't read all the texture. So a full texture copy - much more bandwidth than the shader would use. But for the items listed in this doc, we think we have ways to polyfill all of the APIs without full texture copies. Think I should write those ideas up.
  • SW: would be great - or craft another doc.
  • MM: proposal we're making - WebGPU compat would support the full WebGPU API without restrictions, and if necessary, some way for the browser to tell the website you're running in compat mode, you may be surprised which calls are slow/fast. We don't have all the answers yet - think there are 2 items in this list that are more science projects.
  • CW: if we can find ways to minimize API differences, esp. efficiently, seems fine. Concerned about the claim that we don't care about performance. For adoption, people want to use WebGPU, but only if they get the reach and performance. Won't use it if it doesn't perform on X% of devices they care about. Compat was also important. That's why compat mode is optional - browsers don't need to implement it. A compat mode app in one browser will work the same way on other browsers that don't implement compat mode. Compat mode is designed to be unshipped.
  • MW: can we get both compat & perf by making these things warnings instead of restrictions?
  • SW: assuming we can polyfill everything? Could consider it then. If there were perf cliffs we would want to notify developers. Interested in some of the tougher cases. Compressed texture readbacks, cube map arrays - seems like a lot. And what restrictions does that impose on sizes of uniforms? Will need to use uniforms for passing data.
  • KR: Wanted to ask SW if all of the implementation options we considered internally were published in this GitHub issue.
  • SW: Yes.
  • MM: another thing: on fragmentation: one thing that could help is to enforce some of these compat-mode restrictions in real WebGPU, too. It is consistent with the idea of not fragmenting the API. Make WebGPU a little less expressive to make it able to run on OpenGL today. Only reasonable to do for non-commonly-used features.
  • CW: we should come back to this soon. Maybe figure out other alternatives for polyfilling to see if some more of the restrictions can be lifted.
  • KR: Want to confirm we haven’t come to a resolution yet. And underscore the fact that it’s designed to be unshipped.
  • CW: Correct, more information will help reach a better compromise.
  • MM: we want to be constructive, not destructive. We believe that more reach for WebGPU is good for the ecosystem.
  • (Out of time)

@SenorBlanco
Contributor Author

As requested, here's a doc which sketches out the spec changes required for a single issue, in this case per-attachment blend state, from both the "unship and make optional Feature" approach and the "compatibilityMode" approach:

https://github.com/SenorBlanco/webgpu-per-attachment-blend-state/blob/main/proposal.md

(Please forgive me for non-standard or weird formatting or terminology; I'm not a spec editor.)

@mwyrzykowski
Copy link

We discussed internally and would prefer the compat flag approach over unshipping features from v1.

Per-attachment blend state is something that cannot easily be polyfilled, so it seems reasonable to move this behind the compat flag.

@SenorBlanco
Contributor Author

Note: removed the 6-layer cube map default from the viewDimension defaults (to match the consensus proposal).

@mwyrzykowski

mwyrzykowski commented Oct 18, 2023

We discussed some of the remaining open compat mode issues:

  • baseVertex in DrawIndirect (workaround # 9)

    • Would propose we leave this up to the implementation to workaround in the shader or otherwise
  • Texture layers in views + Mip levels in views (restriction # 3)

    • We would be ok with this restriction in compat mode: "Only a single mip level range and array layer range per texture is supported."
  • Sample_mask (restriction # 5)

    • We would be ok making this unavailable in compat mode
  • (stretch) bgra8snorm

    • would prefer swizzle workaround

@teoxoy
Member

teoxoy commented Nov 8, 2023

8. Disallow texture*() of a texture_depth_2d_array with an offset

Justification: OpenGL ES does not support textureOffset() on a sampler2DArrayShadow.

Looking at the specs:

WGSL fn                                        GLSL ES 3.1 fn w/o offset   GLSL ES 3.1 fn w/ offset
textureSample/textureSampleCompare             texture                     textureOffset ❌
textureGather/textureGatherCompare             textureGather               textureGatherOffset
textureSampleLevel/textureSampleCompareLevel   textureLod ❌                textureLodOffset

❌ = fn with sampler2DArrayShadow overload missing

It's odd that textureGatherOffset exists but not textureOffset?
textureLod is also missing, so we should also disallow textureSampleLevel/textureSampleCompareLevel.

10. Disallow bgra8unorm-srgb textures.

Justification: OpenGL ES does not support sRGB BGRA texture formats.

The bgra8unorm format seems to also not be available in OpenGL ES (not just its sRGB counterpart).

@mwyrzykowski

We are ok with the current proposal and restrictions for compat mode; specifically, restrictions (7), (8), and (10) would be fine to have as restrictions.

@teoxoy
Member

teoxoy commented Nov 13, 2023

@mwyrzykowski in #3838 (comment) asked if we can still support Apple3 under compat.

Quoting my comment here #3838 (comment):

I checked our codebase and the Metal feature set tables and it seems that the spec has dropped support for the Apple3 family a long time ago. The following features are currently needed by the spec and are missing from the Apple3 family but present in the Apple4 family.

  • "Cube map texture arrays" allows usage of texturecube_array
  • "Read/write textures in functions" previously named "Function texture read-writes" (ref for what it enables)

    Both vertex and fragment functions can now write to textures. Writable textures must be declared with the access::write or access::read_write qualifier.

We already covered the lack of cube map texture arrays and agreed to validate those out since OpenGL ES 3.1 also doesn't have those.

The remaining item is Apple3's lack of write-only or read-write access to storage textures in fragment shaders. For which I couldn't find an equivalent restriction in OpenGL ES 3.1; @SenorBlanco do you know if such a restriction exists?


@mwyrzykowski it would be worth double-checking that I haven't missed any other capabilities that Apple3 doesn't have and the core spec requires.

@SenorBlanco
Contributor Author

The remaining item is Apple3's lack of write-only or read-write access to storage textures in fragment shaders. For which I couldn't find an equivalent restriction in OpenGL ES 3.1; @SenorBlanco do you know if such a restriction exists?

GL_MAX_FRAGMENT_IMAGE_UNIFORMS has a minimum value of 0, as does GL_MAX_VERTEX_IMAGE_UNIFORMS. Only GL_MAX_COMPUTE_IMAGE_UNIFORMS has a non-zero minimum (4).

We might have to look at the values on actual devices, but if we were to set the Compat limits based on the ES 3.1 limits, devices lacking storage textures in fragment shaders would be compliant.

@mwyrzykowski

@teoxoy I did not find any other capabilities that Apple3 is missing and the core spec requires.

@SenorBlanco
Contributor Author

GL_MAX_FRAGMENT_IMAGE_UNIFORMS has a minimum value of 0, as does GL_MAX_VERTEX_IMAGE_UNIFORMS. Only GL_MAX_COMPUTE_IMAGE_UNIFORMS has a non-zero minimum (4).

We might have to look at the values on actual devices, but if we were to set the Compat limits based on the ES 3.1 limits, devices lacking storage textures in fragment shaders would be compliant.

Just to follow up: there are no reports with zero GL_MAX_FRAGMENT_IMAGE_UNIFORMS or GL_MAX_VERTEX_IMAGE_UNIFORMS on ES 3.1 on gpuinfo.org: https://opengles.gpuinfo.org/displaycapability.php?name=GL_MAX_FRAGMENT_IMAGE_UNIFORMS&esversion=31. I'm not sure this is conclusive; I need to look at the spec sheets for some devices to verify that there aren't any other popular devices with zero maximums.

The limitation I see above in Apple3 is with respect to read/write storage textures; are we sure that applies to write-only storage textures as well?

@teoxoy
Member

teoxoy commented Nov 16, 2023

The limitation I see above in Apple3 is with respect to read/write storage textures; are we sure that applies to write-only storage textures as well?

It was previously called "Function Texture Read-Writes". The article below talks about writable textures, which include read-write ones.

Function Texture Read-Writes
Available in: OSX_GPUFamily1_v2

Both vertex and fragment functions can now write to textures. Writable textures must be declared with the access::write or access::read_write qualifier. Use an appropriate variant of the write() function to write to a texture (where lod is always constant and equal to zero).

from What’s New in iOS 10, tvOS 10, and macOS 10.12

@greggman
Contributor

I think this is another thing that needs to be added to the list. WebGPU supports a stride of 0 for attributes. OpenGL ES does not. Stride of 0 in GL = advance by the size of the attribute. Stride of 0 in WebGPU = don't advance the attribute; just keep reading the same value.

I guess in Compat a stride of 0 in an attribute would generate a validation error? Otherwise you could read the value in the buffer, turn off the attribute (gl.disableVertexAttribArray), and set the attribute's constant value (gl.vertexAttrib4f(...valueReadFromBuffer))

@kainino0x
Contributor

The limitation I see above in Apple3 is with respect to read/write storage textures; are we sure that applies to write-only storage textures as well?

I am pretty sure we only ever disallowed storage textures (which at that point were always write-only) in vertex shaders, which would mean Apple3 must have supported write-only in fragment shaders.

You can verify this from the Metal Feature Set Tables: in the "texture capabilities by pixel format" table, a lot of formats have the "All" or "Write" capabilities all the way back to Apple2 (Apple1 if you look at older copies of the PDF). I think these apply to both fragment and compute.

Both vertex and fragment functions can now write to textures.

I think this means "Before, only fragment functions could write to textures. Now, both vertex and fragment can write to textures."

@mwyrzykowski

@kainino0x yes that appears to be correct and is my understanding as well.

@teoxoy
Member

teoxoy commented Nov 18, 2023

I see. It's an interesting choice of wording; I would have expected something along the lines of: "Vertex functions can now write to textures (in addition to fragment and kernel functions).".

@teoxoy
Member

teoxoy commented Nov 18, 2023

It would be worth double-checking with someone from the Metal team as declaring a write-only texture in a fragment function didn't use to work prior to MSL 1.2 (which came out at the same time as the "Function Texture Read-Writes" capability).

See https://shader-playground.timjones.io/f08d3d49dcb231d15b062c614baf6002; found in gfx-rs/naga#2486 (comment).

It could be that this was just a compiler limitation and not a hardware one meaning that if you are using MSL 1.2+ and running on any GPU family (even Apple1) you can write to textures from a fragment function.

@greggman
Contributor

I think this is another thing that needs to be added to the list. WebGPU supports a stride of 0 for attributes. OpenGL ES does not. Stride of 0 in GL = advance by the size of the attribute. Stride of 0 in WebGPU = don't advance the attribute; just keep reading the same value.

I guess in Compat a stride of 0 in an attribute would generate a validation error? Otherwise you could read the value in the buffer, turn off the attribute (gl.disableVertexAttribArray), and set the attribute's constant value (gl.vertexAttrib4f(...valueReadFromBuffer))

Checking into this: The CTS tests this here: https://github.com/gpuweb/cts/blob/6e75f19212e3deaebd5bd8542fe10a6fdedc0cdf/src/webgpu/api/operation/vertex_state/correctness.spec.ts#L728

In OpenGL this can be emulated by setting the divisor for the attribute to a high number

glVertexAttribDivisor(attribLocation, 0xffffffff)

Which is what Dawn is currently doing. So I guess no spec changes are needed for this. Just passing on the existing solution.
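
For reference, a hedged WebGL2-flavored sketch of that trick (attribLocation and vertexBuffer are assumed): the huge divisor means the attribute effectively never advances, so every vertex/instance reads the same value, matching WebGPU's stride-0 semantics.

gl.bindBuffer(gl.ARRAY_BUFFER, vertexBuffer);
gl.enableVertexAttribArray(attribLocation);
gl.vertexAttribPointer(attribLocation, 4, gl.FLOAT, false, 0, 0);
// Advance the attribute only once every 0xffffffff instances, i.e. effectively
// never within a draw.
gl.vertexAttribDivisor(attribLocation, 0xffffffff);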

@kdashg
Contributor

kdashg commented Nov 30, 2023

GPU Web 2023-11-29 Atlantic-time
  • Add Compat Limits to Proposal #4397
  • Gregg: Only big question is the 3D texture limit, whether we can keep it at 1k instead of just 256; otherwise don't think there's controversy.
  • Gregg: basically spec limits vs. what's actually available in hardware, for example the 3D texture limit. Also maxStorageTexturesPerShaderStage, which can be 0 but no device does that.
  • Corentin: Please take a look offline.
