The simplest design for error handling would be synchronous, for example with JavaScript exceptions. However, this would introduce a lot of round-trip synchronization points for multi-threaded/multi-process WebGPU implementations, making it too slow to be useful.
There are a number of cases that developers or applications need error handling for:
- Debugging: Getting errors synchronously during development, to break into the debugger.
- Fatal Errors: Handling device/adapter loss, either by restoring WebGPU or by fallback to non-WebGPU content.
- Fallible Allocation: Making fallible resource allocations (detecting out-of-memory).
- Testing: Checking success of WebGPU calls, for conformance testing or application unit testing.
- Telemetry: Collecting error logs in deployment, for bug reporting and telemetry.
There is one other use case that is closely related to error handling:
- Waiting for Completion: Waiting for completion of off-queue GPU operations (like object creation).
Meanwhile, error handling should not make the API clunky to use.
Implementations should provide a way to enable synchronous validation, for example via a "break on WebGPU error" option in the developer tools. The extra overhead needs to be low enough that applications can still run while being debugged.
interface GPU {
Promise<GPUAdapter?> requestAdapter(optional GPURequestAdapterOptions options = {});
};
interface GPUDeviceLostInfo {
readonly attribute DOMString message;
};
partial interface GPUDevice {
readonly attribute Promise<GPUDeviceLostInfo> lost;
};
GPU.requestAdapter requests an adapter from the user agent. It returns a Promise which resolves when an adapter is ready. The Promise may not resolve for a long time; for example, the browser could delay until a background tab is foregrounded, to make sure the right adapter is chosen for the system state at that time (in case the system state, such as battery state, has changed). If the Promise resolves to null, the app knows for sure that its request could not be fulfilled (at least, in the current system state...); it does not need to retry with the same GPURequestAdapterOptions. If the options are invalid (currently impossible), requestAdapter() rejects.
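For example, a minimal sketch of handling the null case (initFallback here stands in for whatever non-WebGPU path the app has):

const adapter = await gpu.requestAdapter({ /* options */ });
if (adapter === null) {
  // No adapter can satisfy these options in the current system state, so retrying
  // with the same options is pointless; fall back to non-WebGPU content instead.
  initFallback(); // hypothetical fallback path
}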
GPUAdapter.requestDevice requests a device from the adapter. It returns a Promise which resolves when a device is ready. The Promise may not resolve for a long time; for example, even if the adapter is still valid, the browser could delay until a background tab is foregrounded, to make sure that system resources are conserved until then. If the adapter has been lost and is therefore unable to create a device, requestDevice() resolves to an already-lost device. If the descriptor is invalid (e.g. it exceeds the limits of the adapter), requestDevice() rejects.
The GPUDevice may be lost if something goes fatally wrong on the device (e.g. an unexpected driver error, crash, or native device loss). The GPUDevice provides a promise, device.lost, which resolves when the device is lost. It never rejects and may remain pending forever. Once lost resolves, the GPUDevice cannot be used anymore. The device and all objects created from the device have become invalid. All further operations on the device and its objects are errors. The "uncapturederror" event will no longer fire. (This makes all further operations no-ops.)
An app should never give up on getting WebGPU access just because GPUDevice.lost resolved. Instead of giving up, the app should try again starting with requestAdapter. It should give up only when requestAdapter returns null or rejects. (It should also give up on a requestDevice rejection, as that indicates an app programming error: the request was invalid, e.g. not compatible with the adapter.)
NOTE: this example (and possibly the init API) still needs significant rework!
class MyRenderer {
constructor() {
this.adapter = null;
this.device = null;
}
async begin() {
const usingWebGPU = await this.initWebGPU();
if (!usingWebGPU) {
this.initFallback();
}
}
initFallback() {
// Try WebGL, 2D Canvas, or other fallback.
}
async initWebGPU() {
// Stop rendering. (If there was already a device, WebGPU calls made before
// the app notices the device is lost are okay - they are no-ops.)
this.device = null;
// Keep current adapter (but make a new one if there isn't a current one.)
await this.tryEnsureDeviceOnCurrentAdapter();
if (!this.adapter) return false;
// If the device is null, the adapter was lost. Try a new adapter.
// Continue doing this until one is found or an error is thrown.
while (!this.device) {
this.adapter = null;
await this.tryEnsureDeviceOnCurrentAdapter();
if (!this.adapter) return false;
}
// ... Upload resources, etc.
return true;
}
// TODO: This example should not retry on the current adapter, it should get a new adapter.
async tryEnsureDeviceOnCurrentAdapter() {
// If no adapter, get one.
// If we can't get one (requestAdapter resolves to null), return so the app falls back.
if (!this.adapter) {
// If no adapter, get one.
this.adapter = await gpu.requestAdapter({ /* options */ });
// If requestAdapter resolves to null, no matching adapter is available.
// Exit to fallback.
if (!this.adapter) return;
}
// Try to get a device.
// rejection => options were invalid (app programming error)
this.device = await this.adapter.requestDevice({ /* options */ });
// When the device is lost, just try to get a device again.
this.device.lost.then((info) => {
console.error("Device was lost.", info);
this.initWebGPU();
});
}
}
What signals should the app get, and when?
Two independent applications are running on the same webpage against two devices on the same adapter. The tab is in the background, and one device is using a lot of resources.
- The browser chooses to lose the heavier device.
  - device.lost resolves, message = reclaiming device resources
  - (If the app calls requestDevice on the same adapter, or requestAdapter, it does not resolve until the tab is foregrounded.)
- Later, the browser might choose to lose the smaller device too.
  - device.lost resolves, message = reclaiming device resources
  - (If the app calls requestDevice on the same adapter, or requestAdapter, it does not resolve until the tab is foregrounded.)
- The system configuration changes (e.g. the laptop is unplugged).
  - Since the adapter is no longer used, the UA may choose to lose it and reject any outstanding requestDevice promises. (Perhaps not until the tab is foregrounded.)
  - (If the app calls requestAdapter, it does not resolve until the tab is foregrounded.)
A page begins loading in a tab, but then the tab is backgrounded.
- On load, the page attempts creation of an adapter.
  - The browser may or may not provide a WebGPU adapter yet. If it doesn't, then the requestAdapter Promise will resolve when the page is foregrounded. (This allows the browser to choose an adapter based on the latest system state.)
A device's adapter is physically unplugged from the system (but an integrated GPU is still available).
- The same adapter, or a new adapter, is plugged back in.
  - A later requestAdapter call may return the new adapter. (In the future, it might fire a "gpuadapterschanged" event.)
An app is running on an integrated adapter.
- A new, discrete adapter is plugged in.
  - A later requestAdapter call may return the new adapter. (In the future, it might fire a "gpuadapterschanged" event.)
An app is running on a discrete adapter.
- The adapter is physically unplugged from the system. An integrated GPU is still available.
  - device.lost resolves, requestDevice on the same adapter rejects, and requestAdapter gives the new adapter.
- The same adapter, or a new adapter, is plugged back in.
  - A later requestAdapter call may return the new adapter. (In the future, it might fire a "gpuadapterschanged" event.)
The device is lost because of an unexpected error in the implementation.
- device.lost resolves, message = whatever the unexpected thing was.
A TDR-like scenario occurs.
- The adapter is lost, which loses all devices on the adapter.
  - device.lost resolves on every device, message = adapter reset. The application must request an adapter again.
  - (TODO: alternatively, the adapter could be retained, but all devices on it are lost.)
All devices and adapters are lost (except for software?) because GPU access has been disabled by the browser (for this page or globally, e.g. due to unexpected GPU process crashes).
- device.lost resolves on every device, message = whatever.
- Since WebGPU access has been disabled for the page, requestAdapter returns null (or a software adapter).
The device is lost right as it's being returned by requestDevice, or otherwise couldn't be created due to non-deterministic/internal reasons.
- device.lost resolves.
WebGPU objects are opaque handles. On creation, such a handle is "pending" until the backing object is created by the implementation. Asynchronously, a handle may refer to a successfully created object (called a "valid object"), or an internally-empty/unsuccessful object (called an "invalid object"). The status of an object is opaque to JavaScript, except that any errors during object creation can be captured (see below).
If a WebGPU object handle A is passed to a WebGPU API call C that requires a valid object, that API call opaquely accepts the object regardless of its status (pending, valid, or invalid). However, internally and asynchronously, C will not be validated and executed until A's status has resolved. If A resolves to invalid, C will fail (asynchronously).
Errors in operations or creation will generate an error into the current scope. An error may be captured by a surrounding Error Scope (described below). If an error is not captured, it may fire the Device's "unhandlederror" event (below).
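As an illustrative sketch (the descriptor values here are hypothetical), an error during creation makes the handle invalid, objects created from it become invalid in turn, and each failure is reported asynchronously rather than as an exception:

// createBuffer returns a handle immediately; validation happens asynchronously.
const buffer = device.createBuffer({ size: 16, usage: 0xFFFF /* assume these usage bits are invalid */ });
// The handle can still be passed to later calls; they accept it opaquely.
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout, // hypothetical layout
  bindings: [{ binding: 0, resource: { buffer } }],
});
// Asynchronously, `buffer` resolves to invalid, so `bindGroup` also ends up invalid,
// and each failed creation generates an error into the current scope.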
Creation of the adapter and device (gpu.requestAdapter, GPUAdapter.requestDevice):
- Handled by "Fatal Errors" above.

WebGPU object creation and getters (GPUDevice.createTexture, GPUDevice.createBuffer, GPUDevice.createBufferMapped, GPUTexture.createView, GPUTexture.createDefaultView, GPUCommandEncoder.finish, GPUDevice.getQueue, GPUSwapChain.getCurrentTexture):
- If there is an error, the returned object is invalid, and an error is generated into the current scope.

Recording of GPU commands in GPUCommandEncoder (GPUCommandEncoder.*, GPURenderPassEncoder.*, GPUComputePassEncoder.*):
- These commands do not report errors. Instead, GPUCommandEncoder.finish returns an invalid object and generates an error into the current scope. (See the sketch after this list.)

Promise-returning operations (GPUDevice.createBufferMappedAsync, GPUCanvasContext.getSwapChainPreferredFormat, GPUFence.onCompletion, GPUBuffer.mapReadAsync, GPUBuffer.mapWriteAsync):
- If there is an error, the returned Promise rejects.

Other operations (GPUQueue.submit, GPUQueue.signal, GPUBuffer.unmap, GPUBuffer.destroy, GPUTexture.destroy):
- If there is an error, an error is generated into the current scope.

GPUFence.getCompletedValue:
- This call cannot fail.
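As noted above for command recording, here is a sketch (with hypothetical pass descriptor and pipeline objects) of how encoding errors are deferred to finish:

const encoder = device.createCommandEncoder({});
const pass = encoder.beginRenderPass(renderPassDescriptor); // hypothetical descriptor
pass.setPipeline(renderPipeline); // no error is reported here, even if renderPipeline is invalid
pass.draw(3, 1, 0, 0);
pass.endPass();
// Any validation failure during recording surfaces here: `commands` is an invalid object,
// and an error is generated into the current scope.
const commands = encoder.finish();
device.getQueue().submit([commands]); // submitting an invalid command buffer also generates an error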
Each device* maintains a persistent "error scope" stack state.
Initially, the device's error scope stack is empty.
GPUDevice.pushErrorScope(filter) creates an error scope and pushes it onto the stack.
GPUDevice.popErrorScope() pops an error scope from the stack and returns a Promise<GPUError?>, which resolves once the enclosed operations are complete. It resolves to null if no errors were captured; otherwise it resolves to the first error that occurred in the scope, either a GPUOutOfMemoryError or a GPUValidationError object containing information about the validation failure.
An error scope captures an error if its filter matches the type of the error: pushErrorScope('out-of-memory') captures GPUOutOfMemoryErrors; pushErrorScope('validation') captures GPUValidationErrors.
The filter mechanism prevents developers from, e.g., accidentally silencing validation errors when trying to do fallible allocation.
If an error scope captures an error, the error is not passed down to the enclosing error scope. Each error scope stores only the first error it captures, and returns that error when the scope is popped. Any further errors it captures are silently ignored.
If an error is not captured by an error scope, it is passed out to the enclosing error scope.
If there are no error scopes on the stack, popErrorScope() throws OperationError. If the device is lost, popErrorScope() always rejects with OperationError.
* Error scope state is per-device, per-execution-context. That is, when a GPUDevice is posted to a Worker for the first time, the new GPUDevice copy's error scope stack is empty. (If a GPUDevice is copied back to an execution context it already existed on, it shares its error scope state with all other copies on that execution context.)
enum GPUErrorFilter {
"out-of-memory",
"validation"
};
interface GPUOutOfMemoryError {};
interface GPUValidationError {
readonly attribute DOMString message;
};
typedef (GPUOutOfMemoryError or GPUValidationError) GPUError;
partial interface GPUDevice {
void pushErrorScope(GPUErrorFilter filter);
Promise<GPUError?> popErrorScope();
};
An out-of-memory error scope can be used to detect allocation failure.
async function tryCreateBuffer(device, desc) {
device.pushErrorScope('out-of-memory');
const buffer = device.createBuffer(desc);
if (await device.popErrorScope() !== null) {
return null;
}
return buffer;
}
Using a validation error scope can tell an application when validation has completed, but it is otherwise not intended to signal completion. (On-queue operation completion can be detected with GPUFence.)
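For example, a sketch (with a hypothetical pipelineDescriptor) of using a validation scope to learn when validation of a pipeline descriptor has completed, and whether it passed:

device.pushErrorScope('validation');
const pipeline = device.createRenderPipeline(pipelineDescriptor); // hypothetical descriptor
device.popErrorScope().then((error) => {
  if (error === null) {
    // Validation has completed, and the pipeline is valid.
  } else {
    console.error('Pipeline failed validation:', error.message);
  }
});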
For pipeline creation, there are createReadyComputePipeline and createReadyRenderPipeline, which are asynchronous.
Note that requestAnimationFrame's callback is not treated as asynchronous: only the first task is guaranteed to occur before the frame is displayed.
class Renderer {
init() {
const fastPipeline = device.createRenderPipeline(...);
this.pipeline = fastPipeline;
}
prepareSlowPipeline() {
device.createReadyRenderPipeline(...).then((slowPipeline) => {
this.pipeline = slowPipeline;
});
}
draw() {
if (wantSlowPipeline) {
this.prepareSlowPipeline();
}
// draw object with this.pipeline.
// It switches to the "slowPipeline" when it becomes available.
}
}
const renderer = new Renderer();
renderer.init();
const frame = () => {
requestAnimationFrame(frame);
renderer.draw();
};
requestAnimationFrame(frame);
Tests need to be able to reliably detect both expected and unexpected errors.
device.pushErrorScope('out-of-memory');
device.pushErrorScope('validation');
{
// Do stuff that shouldn't produce errors.
{
device.pushErrorScope('validation');
device.doOperationThatErrors();
device.popErrorScope().then(error => { assert(error !== null); });
}
// More stuff that shouldn't produce errors
}
// Detect unexpected errors.
device.popErrorScope().then(error => { assert(error === null); });
device.popErrorScope().then(error => { assert(error === null); });
If an error is not captured by an explicit error scope, it bubbles up to the device and may fire its uncapturederror event.
This mechanism is like a programmatic way to access the warnings that appear in the developer tools. Errors reported via the uncapturederror event should also appear in the developer tools console as warnings (like in WebGL). However, some developer tools warnings might not necessarily fire the event, and message strings could be different (e.g. some details omitted for security).
The WebGPU implementation may choose not to fire the uncapturederror event for a given error, for example if it has fired too many times, too many times in a row, or with too many errors of the same kind.
This is similar to how console warnings would work, and work today for WebGL.
(In badly-formed applications, this mechanism can prevent the events from having a significant performance impact on the system.)
Unlike error scoping, the uncapturederror event can only fire on the main thread (Window) event loop.
[
Constructor(DOMString type, GPUUncapturedErrorEventInit gpuUncapturedErrorEventInitDict),
Exposed=Window
]
interface GPUUncapturedErrorEvent : Event {
readonly attribute GPUError error;
};
dictionary GPUUncapturedErrorEventInit : EventInit {
required GPUError error;
};
// TODO: is it possible to expose the EventTarget only on the main thread?
partial interface GPUDevice : EventTarget {
[Exposed=Window]
attribute EventHandler onuncapturederror;
};
const device = await adapter.requestDevice({});
device.addEventListener('uncapturederror', (event) => {
appendToTelemetryReport(event.error);
});
- Is there a need for synchronous, programmatic capture of errors during development? (E.g. an option to throw an exception on error instead of surfacing the error asynchronously. Asynchronous error handling APIs are not enough to polyfill this.) This would only be needed for printf-style debugging; a "break on WebGPU error" option would be used for Dev Tools debugging.
- How can a synchronous application (e.g. a WASM port) handle all of these asynchronous errors? A synchronous version of popErrorScope and other entry points would need to be exposed on Workers. (A more general solution for using asynchronous APIs synchronously would also solve this.)
- Should there be a maximum error scope depth?
- Or should error scope balance be enforced by changing the API to e.g. device.withErrorScope('validation', () => { device.stuff(); /*...*/ })?
- Should the error scope filter be a bitfield?
- Should the error scope filter have a default value?
- Should errors beyond the first in an error scope be silently ignored, bubble up to the parent error scope, or be immediately given to the uncapturederror event?
  - (Currently, they are silently ignored.)
- Should there be codes for different error types, to slightly improve testing fidelity? (e.g. invalid-object, invalid-value, invalid-state)
- Should developers be able to self-impose a memory limit (in order to emulate lower-memory devices)? Should implementations automatically impose a lower memory limit (to improve stability and portability)?
- To help developers, should GPUUncapturedErrorEvent.message contain some sort of "stack trace" taking advantage of object debug labels? For example:
  - <myQueue>.submit failed:
  - commands[0] (<mainColorPass>) was invalid:
  - in setIndexBuffer, indexBuffer (<mesh3.indices>) was invalid:
  - in createBuffer, desc.usage was invalid (0x89)
- How do applications handle the case where they've allocated a lot of optional memory, but want to make another required allocation (which could fail due to OOM)? How do they know when to free an optional allocation first?
  - For now, applications wanting to handle this kind of case must always use fallible allocations.
  - (We will likely improve this with a GPUResourceHeap, once we figure out what that looks like.)
- Should attempting to use a buffer or texture in the "out-of-memory" state (a) result in immediate device loss, (b) result in device loss when used in a device-level operation (submit, map, etc.), or (c) just produce a validation error?
  - Currently described: none; implicitly (c).
- In a world with persistent object "usage" state: if an invalid command buffer is submitted, and its transitions become no-ops, the usage state won't update. Will this cause future command buffer submits to become invalid because of a usage validation error?
  - Tentatively resolved: WebGPU is expected not to require explicit usage transitions.
- Should an object creation error immediately log an error to the error log? Or should it only log if the error propagates to a device-level operation?
  - Tentatively resolved: errors should be logged immediately.
- Should applications be able to intentionally create graphs of potentially-invalid objects, and recover from this late? E.g. create a large buffer, create a bind group from that, create a command buffer from that, then choose whether to submit based on whether the buffer was successfully allocated.
  - For non-OOM, tentatively resolved: they can, inside of an error scope. Any subsequent errors can be suppressed. Not sure if it's useful.
  - For OOM, see the other questions about OOM.
- Should there be an API that exposes object status?
  - Resolved: No, but errors during object creation can be detected.
- Should there be a way to capture out-of-memory errors without capturing validation errors? (And vice versa?)
  - Resolved: Yes, so applications don't accidentally silence validation errors.