Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the "internal" error type #3123

Merged
merged 9 commits into from Aug 31, 2022
Merged

Add the "internal" error type #3123

merged 9 commits into from Aug 31, 2022

Conversation

toji
Copy link
Member

@toji toji commented Jun 30, 2022

FixesIssue: #2308 (EDIT(@kainino0x): still need to keep that issue open about exposing non-uncategorized errors)

This error type captures non-validation errors that occur at a system
or implementation level. (Such as shaders failing because they are
too complex.) Initially only pipeline creation may produce operation
errors.

@github-actions
Copy link
Contributor

Previews, as seen when this build job started (4cd80ef):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

Copy link
Contributor

@kainino0x kainino0x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, modulo any possible changes to the error scope API which are pretty much orthogonal.

spec/index.bs Outdated
[=shader-creation error|shader creation=] of
|descriptor|.{{GPUComputePipelineDescriptor/compute}}.{{GPUProgrammableStage/module}}, or during
the [=pipeline-creation error|pipeline creation=] of |pipeline|:
1. [$generate an operation error$] and make |pipeline| [=invalid=].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should also be something in createShaderModule, I think along the lines of "if there's an uncategorized shader error, make the GPUShaderModule invalid without generating any error"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... that would cause create*Pipeline() to fail validation, unless we had a special case "invalid due to uncategorized error" state. Alternatively, we could check for the errors in the shader modules prior to pipeline validation, but my impression was that we wanted to treat validation errors as "highest priority", and only start looking at whether or not other errors were generated when we knew validation was successful.

@kainino0x kainino0x added the tacit resolution candidate Editors may be able to resolve and move to tacit resolution queue label Jul 25, 2022
@toji toji changed the title Add the "operation" error type Replace "out-of-memory" with the "internal" error type Aug 10, 2022
@toji toji added tacit resolution queue Editors have agreed and intend to land if no feedback is given and removed tacit resolution candidate Editors may be able to resolve and move to tacit resolution queue labels Aug 10, 2022
@github-actions
Copy link
Contributor

Previews, as seen when this build job started (1cc2145):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@toji
Copy link
Member Author

toji commented Aug 10, 2022

After editors discussion on Monday, we determined that there wasn't a solid need for a separate "operation" and "out-of-memory" error type, as the situations in which they could arise were operation specific and didn't overlap. As a result, it's cleaner to simply extend the existing "out-of-memory" errors to encompass the types of uncategorized errors described in the WGSL spec and rename that error type to "internal" to reflect it's new scope. This PR has been updated accordingly.

In the future should we see a need to differentiate OOM errors from other types of internal errors explicitly, we can add enums, flags, or other metadata to the GPUInternalError interface. It's not clear at this point, however, if that's actually beneficial.

@kainino0x
Copy link
Contributor

Notably it didn't seem to make sense to have a generic "internal" error as well as an "out-of-memory" error. It would make more sense to have "pipeline-failure" and "out-of-memory" or just "internal". We went with just "internal" for now because it's easy to expose the difference later as a field of GPUInternalError.

@kdashg
Copy link
Contributor

kdashg commented Aug 10, 2022

I think it's useful to preserve out-of-memory whenever we can tell that that's a reasonable possibility.

spec/index.bs Outdated Show resolved Hide resolved
spec/index.bs Outdated Show resolved Hide resolved
spec/index.bs Outdated Show resolved Hide resolved
spec/index.bs Outdated Show resolved Hide resolved
@kainino0x
Copy link
Contributor

I think it's useful to preserve out-of-memory whenever we can tell that that's a reasonable possibility.

@kdashg out of memory errors can only come from createBuffer, createTexture, and createQuerySet. If you capture an internal error from one of those calls then you know it's an out of memory condition. Do you think that doesn't work well?

@github-actions
Copy link
Contributor

Previews, as seen when this build job started (b7d46dd):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@github-actions
Copy link
Contributor

Previews, as seen when this build job started (6f71ca9):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@toji
Copy link
Member Author

toji commented Aug 10, 2022

Updated to add a GPUInternalErrorReason enum, of which the only value is currently "out-of-memory" to satisfy Kelsey's request today.

Also, adding that created just enough clutter in that section that I felt the urge to restructure it for clarity. So now there's one subsection that just describes the different GPUError variants, their purpose, and their members, and another subsection that just describes error scopes. I think it's a lot easier to follow now, but it does make the diff a bit messy. Sorry!

Copy link
Contributor

@kainino0x kainino0x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % nit

spec/index.bs Outdated Show resolved Hide resolved
system or implementation-specific reason even when all validation requirements have been satisfied.
For example, the operation may exceede the capabilities of the implementation in a way not easily
captured by the [=supported limits=]. The same operation may succeed on other devices or under
difference circumstances.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should say that internal errors can only come from certain operations described in the spec, not any arbitrary operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and also added an algorithm for generating an internal error for future use (and while I was at it cleaned up some of the algorithms to use the new timeline styles.)

Haven't actually added any instances of an operation generating an internal error yet, because this CL is already messy enough as is and I'd like those to be more easily reviewed.

Out of curiosity, what's the expected behavior for an operation that we don't explicitly say can generate internal error experiencing an internal error? Is that just forced to be a lost context every time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize we didn't currently specify that out of memory can come from createBuffer/Texture/QuerySet. That'll be nice to have written down.

Yeah, the behavior I have in mind is:

  • Implementation does any heroics it wants to try to do to make the error go away (like waiting for the queue to finish and then deleting any temporary resources) and tries again
  • If it still fails, lose the device

@kainino0x kainino0x removed the tacit resolution queue Editors have agreed and intend to land if no feedback is given label Aug 17, 2022
spec/index.bs Outdated Show resolved Hide resolved
spec/index.bs Outdated Show resolved Hide resolved
@toji toji changed the title Replace "out-of-memory" with the "internal" error type Add the "internal" error type Aug 22, 2022
@toji
Copy link
Member Author

toji commented Aug 22, 2022

Editors meeting: After discussion about usage and addressing concerns raised by @kdashg and others, decided to break up "internal" and "out-of-memory" errors into separate types once again.

One motivating example for this change was the idea of developers who may want to capture an entire frame in an OOM error scope and use it as a signal to take some general action to reduce memory use, such as flushing caches. Combining internal and OOM errors would make that more difficult.

Change has already been implemented in the PR.

@github-actions
Copy link
Contributor

Previews, as seen when this build job started (573f678):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@kainino0x kainino0x added the tacit resolution candidate Editors may be able to resolve and move to tacit resolution queue label Aug 22, 2022
@toji toji added tacit resolution queue Editors have agreed and intend to land if no feedback is given and removed tacit resolution candidate Editors may be able to resolve and move to tacit resolution queue labels Aug 24, 2022
toji and others added 9 commits August 31, 2022 12:06
Fixes #2308

This error type captures non-validation errors that occur at a system
or implementation level. (Such as shaders failing because they are
too complex.)
Co-authored-by: Kai Ninomiya <kainino@chromium.org>
Co-authored-by: Kai Ninomiya <kainino@chromium.org>
@github-actions
Copy link
Contributor

Previews, as seen when this build job started (94bf9a4):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@toji
Copy link
Member Author

toji commented Aug 31, 2022

This approach was agreed on in today's meeting with the understanding that it satisfies @kdashg's previously voiced requirement (easily differentiating OOM errors). Merging.

@toji toji merged commit fbc4139 into main Aug 31, 2022
@toji toji deleted the operation-error branch August 31, 2022 19:40
juj added a commit to juj/wasm_webgpu that referenced this pull request Sep 6, 2022
@Kangz Kangz removed the tacit resolution queue Editors have agreed and intend to land if no feedback is given label Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants