Update LPMS with `panic` for `CUDA_ERROR_ILLEGAL_ADDRESS` stuck sessions #2057

jailuthra · 2021-10-15T06:46:45Z

What does this pull request do? Explain your changes. (required)

See livepeer/lpms#267

Specific updates (required)

See commits

How did you test each of these updates (required)

Does this pull request close any open issues?

Fixes #1921

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error. See #1921 for details.

yondonfu

A few questions:

How will B behave if an O/T crashes during a transcode as a result of this change?
Does the HTTP conn get closed cleanly with an error causing B to immediately switch to another O/T?
Was there any progress with distinguishing CUDA illegal address errors so Bs could mark those errors as non-retryable as mentioned in #1921 (comment)?

yondonfu · 2021-10-25T14:25:11Z

I haven't tested this, but a few thoughts below...

> Does the HTTP conn get closed cleanly with an error causing B to immediately switch to another O/T?

I don't think the HTTP conn will be closed cleanly if the O/T panics in the middle of a transcode.

> How will B behave if an O/T crashes during a transcode as a result of this change?

I think the B will wait for a timeout and then switch to a different O/T.

It would be preferable for the O/T to return an error to B immediately, before it panics, so that B can switch without waiting for a timeout. Maybe panic recovery as described in this post could help here? The O/T could recover from the panic, return the error, and then trigger another panic to crash. An alternative would be for go-livepeer to check for the unrecoverable LPMS error introduced in https://github.com/livepeer/lpms/pull/267/files and trigger the panic only in go-livepeer instead of in LPMS.

Was there any progress with distinguishing CUDA illegal addresses so Bs could mark those errors as non-retryable as mentioned in #1921 (comment)?

I'm not sure how confident we are that the "Unknown error" currently classified as unrecoverable in LPMS is always the fault of the source segment rather than the O/T (as already mentioned in this comment). There is some discussion around propagating an HTTP status code to the broadcaster's client to tell it to stop retrying once the broadcaster hits the maximum number of retries. If that goes through, distinguishing CUDA illegal address errors on Bs would be less important.

yondonfu · 2021-10-25T17:32:57Z

Since this is blocking #2066 I'm going to approve/merge this for now. Created #2070 to track the improvement to O/T behavior.

yondonfu

LGTM

jailuthra added 3 commits October 15, 2021 12:10

mod: Update to livepeer/lpms@4b4244c

0372297

We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error. See #1921 for details.

changelog: Update for #2057

6ca0049

core/transcoder_test: Fix Nvidia test

006c950

jailuthra requested review from yondonfu, darkdarkdragon and iameli October 18, 2021 10:31

yondonfu reviewed Oct 18, 2021

yondonfu mentioned this pull request Oct 25, 2021

Return error to B from O before panic due to unrecoverable LPMS error #2070

Closed

yondonfu approved these changes Oct 25, 2021

yondonfu merged commit eb0c1bf into master Oct 25, 2021

yondonfu pushed a commit that referenced this pull request Oct 25, 2021

changelog: Update for #2057

2be4dba

yondonfu deleted the jai/illegal-addr-err branch October 25, 2021 17:33

yondonfu mentioned this pull request Nov 14, 2022

Only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable errors livepeer/lpms#356

Open
