Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update LPMS with panic for CUDA_ERROR_ILLEGAL_ADDRESS stuck sessions #2057

Merged
merged 3 commits into from
Oct 25, 2021

Conversation

jailuthra
Copy link
Contributor

@jailuthra jailuthra commented Oct 15, 2021

What does this pull request do? Explain your changes. (required)

See livepeer/lpms#267

Specific updates (required)

See commits

How did you test each of these updates (required)

Does this pull request close any open issues?

Fixes #1921

Checklist:

We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error.
See #1921 for details.
Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions:

  • How will B behave if an O/T crashes during a transcode as a result of this change?
  • Does the HTTP conn get closed cleanly with an error causing B to immediately switch to another O/T?
  • Was there any progress with distinguishing CUDA illegal addresses so Bs could mark those errors as non-retryable as mentioned in Crash when we get CUDA_ERROR_ILLEGAL_ADDRESS #1921 (comment)?

@yondonfu
Copy link
Member

I haven't tested this, but a few thoughts below...

Does the HTTP conn get closed cleanly with an error causing B to immediately switch to another O/T?

I don't think the HTTP conn will be closed cleanly if the O/T panics in the middle of a transcode.

How will B behave if an O/T crashes during a transcode as a result of this change?

I think the B will wait for a timeout and then switch to a different O/T.

It would be preferable if the O/T could return an error immediately to B so B can switch before the O/T panics so that B doesn't have to wait for a timeout. Maybe panic recovery as described in this post could help here? The O/T could recover from the panic, return the error and then trigger another panic to crash. An alternative would be for go-livepeer to check for the unrecoverable LPMS error introduced in https://github.com/livepeer/lpms/pull/267/files and trigger the panic only in go-livepeer instead of triggering it in LPMS.

Was there any progress with distinguishing CUDA illegal addresses so Bs could mark those errors as non-retryable as mentioned in #1921 (comment)?

I'm not sure how confident we are that the "Unknown error" that is currently classified as unrecoverable in LPMS is always attributable as a fault of the source segment as opposed to the O/T (as already mentioned in this comment). There is some discussion around propagating an HTTP status code to the client of the broadcaster to tell them to stop retrying if the broadcaster hits the max # of retries - if that goes through then distinguishing CUDA illegal address errors on Bs would be less important.

@yondonfu
Copy link
Member

Since this is blocking #2066 I'm going to approve/merge this for now. Created #2070 to track the improvement to O/T behavior.

Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yondonfu yondonfu merged commit eb0c1bf into master Oct 25, 2021
yondonfu pushed a commit that referenced this pull request Oct 25, 2021
@yondonfu yondonfu deleted the jai/illegal-addr-err branch October 25, 2021 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash when we get CUDA_ERROR_ILLEGAL_ADDRESS
2 participants