Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling Errors Synchronously #2478

Open
greggman opened this issue Jan 5, 2022 · 13 comments
Open

Handling Errors Synchronously #2478

greggman opened this issue Jan 5, 2022 · 13 comments
Labels
api WebGPU API
Milestone

Comments

@greggman
Copy link
Contributor

greggman commented Jan 5, 2022

I understand that WebGPU was designed to be 100% async and that includes errors but I'm curious, what's the recommended way to debug?

WebGL, for the most part, is also designed to be async. The only non-async parts are state queries (gl.getXXX) and readPixels. Given most apps don't call readPixels and except for uniform locations, many apps don't need to do any queries.

But, given the design, you can easily add synchronous error checking. Two examples are the Khronos debug context and webgl-lint. They both wrap the context and call gl.getError after each WebGL api call.

The nice thing about these is almost no changes are needed to your code.

I can't see any way to do this effectively in WebGPU without either code-generators or really ugly code.

The issue is in order to be "synchronous like" you'd have to do await but you can't await conditionally in an invisible way

If you have code like this

    const commandEncoder = device.createCommandEncoder();
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setBindGroup(0, bindGroup);
    passEncoder.setVertexBuffer(0, positionBuffer);
    passEncoder.setVertexBuffer(1, normalBuffer);
    passEncoder.setVertexBuffer(2, texcoordBuffer);
    passEncoder.setIndexBuffer(indicesBuffer, 'uint16');
    passEncoder.drawIndexed(indices.length);
    passEncoder.endPass();
    device.queue.submit([commandEncoder.finish()]);

IIUC the way to check for errors is this

device.pushErrorScope();
    const commandEncoder = device.createCommandEncoder();
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setBindGroup(0, bindGroup);
    passEncoder.setVertexBuffer(0, positionBuffer);
    passEncoder.setVertexBuffer(1, normalBuffer);
    passEncoder.setVertexBuffer(2, texcoordBuffer);
    passEncoder.setIndexBuffer(indicesBuffer, 'uint16');
    passEncoder.drawIndexed(indices.length);
    passEncoder.endPass();
    device.queue.submit([commandEncoder.finish()]);
device.popErrorScope().then(error => { assert(!error) }

But if you want the error synchronously then you need this

device.pushErrorScope();
    const commandEncoder = device.createCommandEncoder();
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setBindGroup(0, bindGroup);
    passEncoder.setVertexBuffer(0, positionBuffer);
    passEncoder.setVertexBuffer(1, normalBuffer);
    passEncoder.setVertexBuffer(2, texcoordBuffer);
    passEncoder.setIndexBuffer(indicesBuffer, 'uint16');
    passEncoder.drawIndexed(indices.length);
    passEncoder.endPass();
    device.queue.submit([commandEncoder.finish()]);
const error = await device.popErrorScope();
assert(!error);

Ideally, to be as convenient as WebGL, you'd like to be able to wrap the WebGPU api where appropriate. For example

device.queue.submit = function (origFn) {
  return function(...args) {
     const result = origFn.call(this, ...args);
     // somehow synchronously check for error here
     ...
     return result;
  };
}(device.queue.submit);

If you put await in the "check for error here" part that's not useful for the user. They need to change

device.queue.submit(...);

to

await device.queue.submit(...);

You don't want the await there in production, only in development. Except of course if you need a user to help debug.

To make it possible to switch at runtime with synchronous like checking on it seems like they'd have to do something like this

const result = device.queue.submit(...);
if (debug && result) {  await result; }

Now your code is littered with this debug code vs WebGL where it can all be invisible.

Why do you need synchronous debugging? Because with async debugging, although you know you got an error, even with a detailed error message, what you really want is to break in the code exactly where the error occurred. That way you can inspect the state of your program. Check the stack, check your data structures, etc and see why your inputs were wrong. With async debugging, at the time you get the error it's too late to check your stack and state, it's long gone.

Is there a suggested solution here? I guess maybe you could re-implement WebGPU in JavaScript, parse the shaders in JavaScript to extract the inputs and outputs, and then validate all the inputs, outputs, bindings etc synchronously as one, possible solution.

That still probably wouldn't catch all errors synchronously. For example out-of-memory.

@kvark
Copy link
Contributor

kvark commented Jan 5, 2022

The browsers can (and likely will) provide a mode in which all WebGPU errors are synchronous (and/or trigger a breakpoint). It sounds like it would address the core of your issue?

@kainino0x
Copy link
Contributor

I don't think we should do a mode where WebGPU errors are synchronous (as it would change the application's actual behavior, essentially creating a heisenbug). However I would really like to have a "break on error" option in our dev tools. But you have a lot more experience with WebGL debug tools so maybe you can poke holes in this idea and find out it's insufficient. (Or maybe you're interested in prototyping it? 😛 )

Here's where this is described (if very briefly):
https://github.com/gpuweb/gpuweb/blob/main/design/ErrorHandling.md#debugging-dev-tools

@magcius
Copy link

magcius commented Jan 7, 2022

My personal experience so far is that the debugging experience in WebGPU is much more painful. WebGL has its troubles too, but it is part of my workflow to add if (gl.getError() != gl.NO_ERROR) debugger; my backend if I'm encountering a call that shows an error in the devtools. Something similar in WebGPU might be nice. Not all of my backend methods are designed to be async, so it's not like I can easily drop an await device.popErrorScope() in there. It would be nice to have a syncAwait() for debugging, but that's a JS/browser tech/scheduler problem more than it is a WebGPU one.

@greggman
Copy link
Contributor Author

greggman commented Jan 13, 2022

A bunch of random thoughts

  • I could potentially make a sync API by wrapping the entire webgpu API in JavaScript, compiling/validating the shaders in JavaScript, and doing all the checks in JavaScript and throwing errors.

  • IIUC Devtools is written in JavaScript so if it's going to be allowed to enable synchronous errors, why not pass that on to users?

  • WebGL mostly manages to by async as long as you don't ask for errors. Would it be so bad to do the same for WebGPU (ie, have a device.getErrors) that freezes the world and gets the errors? Just throwing that out there.

  • getting stack traces. Maybe all of these are not needed but I tried wrapping everything to get a stack trace on error ,meaning a stack trace for the call that caused the error.

    {
      function report(error, stack) {
        console.error(error, stack);
      }
    
      const objectToDeviceMap = new Map();
    
      function mapToDevice(object, device) {
        objectToDeviceMap.set(object, device);
      }
    
      function wrapGPUOtherFn(api, fnName) {
        const origFn = api[fnName];
        const afterFn = (fnName.startsWith('begin') || fnName.startsWith('create'))
            ? mapToDevice
            : undefined;
        api[fnName] = function(...args) {
          const stack = new Error().stack;
          const device = objectToDeviceMap.get(this);
          try {
            device.pushErrorScope('validation');
            const result = origFn.call(this, ...args);
            if (afterFn) {
              afterFn(result, device);
            }
            device.popErrorScope().then((error) => {
              if (error) {
                report(error, stack);
              }
            });
            return result;
          } catch (e) {
            report(e, stack);
          }
        }
      }
      function wrapGPUDeviceFn(api, fnName) {
        const origFn = api[fnName];
        const afterFn = fnName.startsWith('create')
            ? mapToDevice
            : undefined;
        api[fnName] = function(...args) {
          const stack = new Error().stack;
          try {
            this.pushErrorScope('validation');
            const result = origFn.call(this, ...args);
            if (afterFn) {
              afterFn(result, this);
            }
            this.popErrorScope().then((error) => {
              if (error) {
                report(error, stack);
              }
            });
            return result;
          } catch(e) {
            report(e, stack);
          }
        }
      }
    
      function wrapAPI(api, wrapFn, special = [])
      {
        const prototype = api.prototype;
        for (const k of Object.keys(prototype)) {
          if (special.includes(k)) {
            continue;
          }
          const desc = Object.getOwnPropertyDescriptor(prototype, k);
          if (!desc.writable) {
            continue;
          }
          const v = prototype[k];
          if (typeof v !== 'function') {
            continue;
          }
          wrapFn(prototype, k);
        }
      }
    
      wrapAPI(GPUDevice, wrapGPUDeviceFn, [
        'pushErrorScope',
        'popErrorScope',
      ]);
      wrapAPI(GPUCommandEncoder, wrapGPUOtherFn);
      wrapAPI(GPUComputePassEncoder, wrapGPUOtherFn);
      wrapAPI(GPURenderPassEncoder, wrapGPUOtherFn);
      wrapAPI(GPURenderBundleEncoder, wrapGPUOtherFn);
      wrapAPI(GPUQueue, wrapGPUOtherFn);
      //wrapAPI(GPUBuffer, wrapGPUOtherFn);
      wrapAPI(GPUTexture, wrapGPUOtherFn);
    
      GPUCanvasContext.prototype.configure = function(origFn) {
        return function(configuration, ...args) {
          const stack = new Error().stack;
          try {
            const {device} = configuration;
            device.pushErrorScope('validation');
            const result = origFn.call(this, configuration, ...args);
            mapToDevice(this, device);
            device.popErrorScope().then(error => {
              if (error) {
                report(error, stack);
              }
            });
            return result;
          } catch (e) {
            report(e, stack);
          }
        };
      }(GPUCanvasContext.prototype.configure);
    
      GPUCanvasContext.prototype.getCurrentTexture = function(origFn) {
        return function(...args) {
          const stack = new Error().stack;
          try {
            const device = objectToDeviceMap.get(this);
            device.pushErrorScope('validation');
            const result = origFn.call(this, ...args);
            mapToDevice(result, device);
            device.popErrorScope().then(error => {
              if (error) {
                report(error, stack);
              }
            });
            return result;
          } catch (e) {
            report(e, stack);
          }
        };
      }(GPUCanvasContext.prototype.getCurrentTexture);
    
      GPUAdapter.prototype.requestDevice = function(origFn) {
        return function(...args) {
          return origFn.call(this, ...args)
              .then(device => {
                mapToDevice(device.queue, device);
                return device;
              });
        };
      }(GPUAdapter.prototype.requestDevice);
    }
    

    which seems functionally equivalent to what people do on WebGL? It's definitely slower than WebGL's similar wrapping given it has to preemptively create an Error object and a promise for every call whereas with gl.getError nothing is created.

    Some things that came up writing that. It feels strange to push/pop error scope on device when calling functions not on the device like passEncoder.setScissorRect. It feels even stranger for canvasContext.getCurrentTexture().createView()

@greggman
Copy link
Contributor Author

Let me also add, if I wanted that wrapper to be invisible I'd need to track error scopes so that I can return the errors I captured to their outer scopes. Not sure how much work that would be.

@Kangz
Copy link
Contributor

Kangz commented Jan 13, 2022

A synchronous getError could be ideal for developers but I don't think that can be added to the WebGPU specification itself because IPC should not be exposed via synchronous operations in JS. (WebGL was grandparented into that) See https://www.w3.org/TR/design-principles/#synchronous However a browser could decide to expose such a call when developers add a command line flag.

I think that the best solution is what @kainino0x described where there is a devtools checkbox that causes the devtools to break on WebGPU validation errors (or raise an exception or ...). This would be running the same validation code as the browser so it would be "perfect".

In Chromium for example this could be done one of two ways:

  1. Instead of calling the dawn_wire client procs in Blink, we would call a "passthrough" dawn_native WebGPU backend that takes WebGPU commands, validates them and issues equivalent WebGPU commands. That passthrough step would be configured to call the dawn_wire client procs after validation, and to call a callback when there is any validation error.
  2. Next to every dawn_wire client proc call, add a potential dawn_native null backend call that's equivalent. Every JS object would have both a "real" WebGPU texture and a "null" one. The null calls would be configured to synchronously call a callback that breaks in the devtools.

Actually option 1) also gives a way to do this without browser support:

  1. Add a dawn_native "passthrough" WebGPU backend.
  2. Compile it to WASM using emscripten.
  3. Make an implementation of the WebGPU API that calls into the emscripten WebGPU implementation. (it looks big, but it's mostly just marshalling of arguments).
  4. Configure the passthrough WebGPU backend to call a callback immediately on validation errors and throw an exception into that.

That WebGPU interceptor could even be tested by making it run the WebGPU CTS!

@greggman
Copy link
Contributor Author

I don't think that can be added to the WebGPU specification itself because IPC should not be exposed via synchronous operations in JS. (WebGL was grandparented into that) See https://www.w3.org/TR/design-principles/#synchronous

I must be mis-understanding. This issue seems to be wanting to add a sync API and @kainino0x was arguing it should be on the main thread.

@Kangz
Copy link
Contributor

Kangz commented Jan 13, 2022

I must be mis-understanding. This issue seems to be wanting to add a sync API and @kainino0x was arguing it should be on the main thread.

On workers. @kainino0x is suggesting that we could also do it on the main thread but that seems completely against the guidelines for Web APIs. It is always possible to bypass that but it needs to be extremely carefully considered and could cause major standardization and implementation issues.

@greggman
Copy link
Contributor Author

greggman commented Jan 13, 2022

I feel like no one has actually used the current WebGPU error reporting in a production app and I feel like if async error checking is so easy and trivial then Dawn should try using Vulkan/Metal/DX12 with nothing but async error reporting as proof that it's trivial and easy to implement graphics with nothing but async reporting. I think you'd find it's extremely non-trivial to use and that you'd give up and beg for sync error reporting.

The spec is basically saying that malloc never returns an error and you get the error off thread in a callback.

Few if any of the trivial WebGPU samples up there check any errors. It would be good to get more feedback from devs doing production work where they feel they need to check errors and see what their feedback is

@greggman
Copy link
Contributor Author

greggman commented Jan 13, 2022

I'm sorry if my last comment seemed aggressive. I just want to make sure that using the API in a robust way (checking for errors) makes sense. It makes sense to me that fetch is not sync as it can take multiple seconds to return. It makes sense to desire 100% of webgpu to be async. My only question is does it make sense in practice. The mapSync issue seems to suggest maybe we can't be 100% async. I'm only wondering if apps actually try to check for out-of-memory and validation errors if the present async reporting works in practice.

Just as an example, let's say you wanted to implement OpenGL on top of webgpu (for example use regal). Is it possible? How would you report out-of-memory synchronously to make that work? If you can't make it work does that suggest that maybe other apps will struggle here? What is this like using webgpu in C++ (native or wasm) land?

@kainino0x
Copy link
Contributor

kainino0x commented Jan 13, 2022

The mapSync issue is somewhat different. I'm proposing allowing it (even on the main thread) not just because developers want it, but because developers inevitably will work around it by writing to a canvas and using synchronous canvas readback. Since the ugly solution already exists we're just giving people something that looks reasonable instead of a hack. (Though Corentin suggested to me this isn't a good reason, as if people use a hack they know they're using a hack, and if they use mapSync then it seems valid despite still being bad for users.)

Fundamentally though, I'm sure it's true that developers would really benefit from actual synchronous errors and not just breakpoints. If that exists only when enabled via devtools, then I'm fine with it - just was weakly against it due to the heisenbug problem. My intuition is that popErrorScopeSync() is not the most convenient debugging tool though. Maybe a getError style thing (peekErrorScope?) that returns the error held by the current scope - so if you have
(push oom, createBuffer(INT_MAX), pop, getError)
then you won't see the error; if you have
(push oom, invalid command, pop, getError)
or
(push val, invalid command, getError, pop)
or just have no scopes on the stack
(invalid command, getError)
then you will see the error.

Just as an example, let's say you wanted to implement OpenGL on top of webgpu (for example use regal). Is it possible? How would you report out-of-memory synchronously to make that work? If you can't make it work does that suggest that maybe other apps will struggle here? What is this like using webgpu in C++ (native or wasm) land?

I'm not sure we really want to cater to that, but my first solution would be Emscripten's Asyncify (for OpenGL-on-WebGPU, not WebGL-on-WebGPU), and my second would be having the application run GL in a worker, which remotes everything to second worker that sends errors back with an Atomics.notify to the first worker to receive by Atomics.wait.

@Kangz
Copy link
Contributor

Kangz commented Jan 14, 2022

My intuition is that popErrorScopeSync() is not the most convenient debugging tool though. Maybe a getError style thing (peekErrorScope?) that returns the error held by the current scope - so if you have (push oom, createBuffer(INT_MAX), pop, getError), you won't see the error; if you have (push oom, invalid command, pop, getError) or (push val, invalid command, getError, pop) or just have no scopes on the stack, you will see the error.

This could be an extension exposed behind a flag in workers but that could still be briefly described in the extensions/ subdirectory you suggest adding @kainino0x. This way it is not part of the Web platform, but developers can expect all browsers to expose similarly when the command line flag is added.

@greggman
Copy link
Contributor Author

I want to make it clear I'm not saying async errors are not sufficient. I just want to verify that's the case.

I also want to take a moment and ask that you please really consider how easy it would be to implement dawn or other GPU projects you may have worked on in the past if you couldn't check for errors synchronously. I think if you really do the thought experiment you might find it's difficult. I don't know all the use cases of babylon.js but in my limited understanding, most things made with it are relatively small? I think bigger projects (Photoshop, Google Maps, Figma, MapBox, ...) are probably more relevant to ask for input?

Further, I want to emphasize that devs will want to check errors on users machines. Going with the experience that it's hard to get users to copy and paste info from about:gpu or other system info, the same is arguably even worse for "please run your browser with these command line options so we can see the errors". I suspect the majority of users don't even know what a command line is. Similarly "please open devtools and check this box so we can see the errors" also seems like a hard request to ask of non-dev users. In other words, it's not clear this is just a case of "we only need this in devtools" though maybe async is enough.

As for using workers, my experience of WebGL is that workers aren't useful because they are only supported in Chrome. Neither Firefox nor Safari have shipped the required features to use WebGL from workers and even checking now, Firefox doesn't support webgpu in workers (Safari seems to have removed WebGPU at the moment so I can't check).

I agree that being able to stop synchronously in the debugger will be great. I'm just not (yet) sure it's sufficient.

Still, this isn't a V1 feature. If it turns out this is really needed then it can certainly be added later.

@kainino0x kainino0x added this to the post-V1 milestone Jan 18, 2022
@kainino0x kainino0x added the api WebGPU API label Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api WebGPU API
Projects
None yet
Development

No branches or pull requests

5 participants