segfault in the latest version of livepeer_bench #2211

Closed
iameli opened this issue Jan 25, 2022 · 8 comments

@iameli
Member

iameli commented Jan 25, 2022

This seems to show up when there are a certain number of concurrent sessions. This is on v0.5.26:

./livepeer_bench   -in bbb/source.m3u8   -transcodingOptions transcodingOptions.json   -nvidia 0,1,2,3,4,5,6   -concurrentSessions 70

[ ... lots of successful transcoding omitted ... ]

2022-01-25 17:47:52.6708,3,2,2,0.1172
2022-01-25 17:47:52.6761,29,1,2,0.1268
[h264_cuvid @ 0x7f0aedc61a80] ctx->cvdl->cuvidCreateDecoder(&ctx->cudecoder, &cuinfo) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[h264_cuvid @ 0x7f0aedc61a80] ctx->cvdl->cuvidDecodePicture(ctx->cudecoder, picparams) failed -> CUDA_ERROR_INVALID_HANDLE: invalid resource handle
[h264_cuvid @ 0x7f0aedc61a80] cuvid decode callback error
ERROR: decoder.c:64] Error sending packet to decoder : Generic error in an external library
ERROR: transcoder.c:236] Could not decode; stopping : Generic error in an external library
E0125 17:47:52.741422   19402 ffmpeg.go:536] Transcoder Return : Generic error in an external library
F0125 17:47:52.741478   19402 livepeer_bench.go:207] Transcoding failed for session 33 segment 0: Generic error in an external library
fatal error: unexpected signal during runtime execution

That same command works successfully with livepeer_bench v0.5.17. The test case is reproducible; happy to bisect and find the source of the issue if that's helpful!

github-actions bot added the status: triage label on Jan 25, 2022
@iameli added the area: orchestrator, QoL, type: bug, and status: triage labels and removed the status: triage label on Jan 25, 2022
@hthillman removed the status: triage label on Jan 25, 2022
@cyberj0g
Contributor

Couldn't reproduce it on a 2-GPU server - though I had to reduce the number of sessions to 40, because there's not enough VRAM for 70x4. I've noticed that the OOM error doesn't always have the correct message; it may just fail anywhere in the CUDA code. Maybe the reason this test is not passing anymore is some new feature which slightly increased VRAM usage?

@AlexKordic
Contributor

Should we rename this issue to "handle OOM gracefully"?

@AlexKordic
Contributor

We can reproduce the error using this command:

./livepeer_bench -in ../bbb/source.m3u8 -transcodingOptions transcodingOptions.json -nvidia 0,1 -concurrentSessions 45

2022-02-10 12:58:14.8896,20,3,2,0.3392
[h264_cuvid @ 0x7f66655fcd00] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps8) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[h264_cuvid @ 0x7f6a55a8d6c0] ctx->cvdl->cuvidCreateDecoder(&ctx->cudecoder, &cuinfo) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
[h264_cuvid @ 0x7f6a55a8d6c0] ctx->cvdl->cuvidDecodePicture(ctx->cudecoder, picparams) failed -> CUDA_ERROR_INVALID_HANDLE: invalid resource handle
[h264_cuvid @ 0x7f6a55a8d6c0] cuvid decode callback error
ERROR: decoder.c:64] Error sending packet to decoder : Generic error in an external library
ERROR: transcoder.c:236] Could not decode; stopping : Generic error in an external library
E0210 12:58:15.085218   93101 ffmpeg.go:609] Transcoder Return : Generic error in an external library
F0210 12:58:15.085247   93101 livepeer_bench.go:205] Transcoding failed for session 40 segment 0: Generic error in an external library

The error code and message can vary depending on where the out-of-memory happens. Here is an example of OOM during rescaling:

[h264_cuvid @ 0x7fd58011eb40] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps8) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-10 12:52:59.2408,39,0,2,0.7784
[h264_cuvid @ 0x7fd58011eb40] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps10) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
[Parsed_scale_cuda_0 @ 0x7fe457404640] cu->cuModuleLoadData(&s->cu_module, scaler_ptx) failed -> CUDA_ERROR_JIT_COMPILER_NOT_FOUND: PTX JIT compiler library not found
[Parsed_scale_cuda_0 @ 0x7fe457404640] Failed to configure output pad on Parsed_scale_cuda_0
ERROR: filter.c:124] Unable configure video filtergraph : Generic error in an external library
ERROR: transcoder.c:331] Error encoding : Error number -1381256262 occurred
E0210 12:52:59.247897   90508 ffmpeg.go:609] Transcoder Return : Error initializing filtergraph
F0210 12:52:59.247978   90508 livepeer_bench.go:205] Transcoding failed for session 40 segment 0: Error initializing filtergraph

@cyberj0g Is the fix needed in the code, or should we handle this with proper configuration of the session limit?
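(For illustration only: a rough sketch of deriving a session cap from per-GPU VRAM instead of hard-coding -concurrentSessions. All numbers below are placeholders, not measurements; real figures would have to come from profiling the actual profiles in transcodingOptions.json.)

package main

import "fmt"

// Placeholder figures (assumptions, not measured); real values would come
// from profiling the VRAM usage of one session with its renditions.
const (
	usableVRAMPerGPUMiB = 7000 // VRAM available after driver/context overhead
	vramPerSessionMiB   = 300  // cost of one concurrent session
	safetyMarginMiB     = 500  // headroom for transient allocations
)

// maxSessionsPerGPU is a naive capacity estimate for a single GPU.
func maxSessionsPerGPU() int {
	usable := usableVRAMPerGPUMiB - safetyMarginMiB
	if usable <= 0 {
		return 0
	}
	return usable / vramPerSessionMiB
}

func main() {
	gpus := 2 // e.g. the -nvidia 0,1 run above
	fmt.Printf("estimated max concurrent sessions: %d\n", gpus*maxSessionsPerGPU())
}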

@yondonfu
Member

yondonfu commented Feb 10, 2022

Seems like there are two things worth looking into here:

  1. Why did @iameli's livepeer_bench run fail with a CUDA OOM on v0.5.26, but pass on v0.5.17?
  2. Is there a way to make it clearer that a CUDA OOM occurred regardless of where it happened (i.e. during decoding, scaling, etc.)? I'm thinking it would be nice if the error returned to the caller was something like Error initializing filtergraph: CUDA_ERROR_OUT_OF_MEMORY if the OOM occurs during scaling and Error sending packet to decoder: CUDA_ERROR_OUT_OF_MEMORY if it occurs during decoding (see the sketch at the end of this comment).

For 1:

Maybe the reason this test is not passing anymore is some new feature which slightly increased VRAM usage?

From glancing at the "Transcoder" notes for the releases going back to v0.5.17, I noticed that v0.5.19 enabled B-frames in encoded outputs - perhaps that could increase VRAM usage?

We could check if the livepeer_bench run passes with https://github.com/livepeer/go-livepeer/releases/tag/v0.5.17 and fails with https://github.com/livepeer/go-livepeer/releases/tag/v0.5.19
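For 2, a minimal sketch of what the wrapping could look like (not the actual lpms/ffmpeg API; the sentinel error, the stage strings, and the substring check on the message are all assumptions). The idea is that the stage-level error keeps the CUDA cause visible instead of collapsing into "Generic error in an external library":

package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrCUDAOutOfMemory is a hypothetical sentinel; callers could test for it
// with errors.Is instead of parsing log text.
var ErrCUDAOutOfMemory = errors.New("CUDA_ERROR_OUT_OF_MEMORY")

// wrapStageError tags the failing stage (decode, scale, encode, ...) and,
// when the underlying message mentions an out-of-memory condition, swaps in
// the OOM sentinel so the root cause surfaces at the top level.
func wrapStageError(stage string, err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(strings.ToLower(err.Error()), "out of memory") {
		return fmt.Errorf("%s: %w", stage, ErrCUDAOutOfMemory)
	}
	return fmt.Errorf("%s: %w", stage, err)
}

func main() {
	// Simulated failure from the scaling stage, similar to the logs above.
	raw := errors.New("cuvidGetDecoderCaps failed: CUDA_ERROR_OUT_OF_MEMORY: out of memory")
	err := wrapStageError("Error initializing filtergraph", raw)

	fmt.Println(err)                                // Error initializing filtergraph: CUDA_ERROR_OUT_OF_MEMORY
	fmt.Println(errors.Is(err, ErrCUDAOutOfMemory)) // true
}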

@AlexKordic
Contributor

Regarding point 1:

There is no meaningful difference in memory consumption between master (68% mem) and v0.5.17 (67% mem):

[screenshot: GPU memory usage on master]

[screenshot: GPU memory usage on v0.5.17]

To check my setup: v0.5.17 is statically linked to ffmpeg, while the master version is dynamically linked:

[screenshot: ffmpeg linkage of the two builds]

@cyberj0g
Contributor

@AlexKordic great analysis. On 2: yes, I linked it dynamically. You can get a static build by calling install_ffmpeg.sh /root/ (without BUILD_TAGS) and removing libavcodec.so from the ffmpeg dir prior to that, but I'm not sure it will yield a different result. If it's not OOM, it may be something GPU-specific that we don't see with the 1070 Ti.

@AlexKordic
Contributor

@cyberj0g thanks for the confirmation. I wasn't sure whether the two installed go-livepeer repos would use the same ffmpeg code.

@yondonfu
Member

Closing because it doesn't seem like the issue was reproducible. Feel free to re-open if this issue re-emerges.
