-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update LPMS with panic
for CUDA_ERROR_ILLEGAL_ADDRESS
stuck sessions
#2057
Conversation
We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error. See #1921 for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions:
- How will B behave if an O/T crashes during a transcode as a result of this change?
- Does the HTTP conn get closed cleanly with an error causing B to immediately switch to another O/T?
- Was there any progress with distinguishing CUDA illegal addresses so Bs could mark those errors as non-retryable as mentioned in Crash when we get CUDA_ERROR_ILLEGAL_ADDRESS #1921 (comment)?
I haven't tested this, but a few thoughts below...
I don't think the HTTP conn will be closed cleanly if the O/T panics in the middle of a transcode.
I think the B will wait for a timeout and then switch to a different O/T. It would be preferable if the O/T could return an error immediately to B so B can switch before the O/T panics so that B doesn't have to wait for a timeout. Maybe panic recovery as described in this post could help here? The O/T could recover from the panic, return the error and then trigger another panic to crash. An alternative would be for go-livepeer to check for the unrecoverable LPMS error introduced in https://github.com/livepeer/lpms/pull/267/files and trigger the panic only in go-livepeer instead of triggering it in LPMS.
I'm not sure how confident we are that the "Unknown error" that is currently classified as unrecoverable in LPMS is always attributable as a fault of the source segment as opposed to the O/T (as already mentioned in this comment). There is some discussion around propagating an HTTP status code to the client of the broadcaster to tell them to stop retrying if the broadcaster hits the max # of retries - if that goes through then distinguishing CUDA illegal address errors on Bs would be less important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What does this pull request do? Explain your changes. (required)
See livepeer/lpms#267
Specific updates (required)
See commits
How did you test each of these updates (required)
Does this pull request close any open issues?
Fixes #1921
Checklist:
make
runs successfully./test.sh
pass