segfault in the latest version of livepeer_bench #2211

Closed
iameli opened this issue Jan 25, 2022 · 8 comments

@iameli
Member

iameli commented Jan 25, 2022

This seems to show up when there are a certain number of concurrent sessions. This is on v0.5.26:

./livepeer_bench   -in bbb/source.m3u8   -transcodingOptions transcodingOptions.json   -nvidia 0,1,2,3,4,5,6   -concurrentSessions 70

[ ... lots of successful transcoding omitted ... ]

2022-01-25 17:47:52.6708,3,2,2,0.1172
2022-01-25 17:47:52.6761,29,1,2,0.1268
[h264_cuvid @ 0x7f0aedc61a80] ctx->cvdl->cuvidCreateDecoder(&ctx->cudecoder, &cuinfo) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[h264_cuvid @ 0x7f0aedc61a80] ctx->cvdl->cuvidDecodePicture(ctx->cudecoder, picparams) failed -> CUDA_ERROR_INVALID_HANDLE: invalid resource handle
[h264_cuvid @ 0x7f0aedc61a80] cuvid decode callback error
ERROR: decoder.c:64] Error sending packet to decoder : Generic error in an external library
ERROR: transcoder.c:236] Could not decode; stopping : Generic error in an external library
E0125 17:47:52.741422   19402 ffmpeg.go:536] Transcoder Return : Generic error in an external library
F0125 17:47:52.741478   19402 livepeer_bench.go:207] Transcoding failed for session 33 segment 0: Generic error in an external library
fatal error: unexpected signal during runtime execution

That same command works successfully with livepeer_bench v0.5.17. The test case is reproducible; happy to bisect and find the source of the issue if that's helpful!

github-actions bot added the status: triage label on Jan 25, 2022
@iameli added the area: orchestrator, QoL, type: bug, and status: triage labels and removed the status: triage label on Jan 25, 2022
@hthillman removed the status: triage label on Jan 25, 2022
@cyberj0g
Contributor

Couldn't reproduce it on a 2-GPU server - though I had to reduce the number of sessions to 40, because there's not enough VRAM for 70x4. I've noticed that the OOM error doesn't always have the correct message; it may just fail anywhere in the CUDA code. Maybe the reason this test is not passing anymore is some new feature which slightly increased VRAM usage?

@AlexKordic
Contributor

Should we rename this issue to "handle OOM gracefully"?

@AlexKordic
Contributor

We can reproduce the error using this command:

./livepeer_bench -in ../bbb/source.m3u8 -transcodingOptions transcodingOptions.json -nvidia 0,1 -concurrentSessions 45

2022-02-10 12:58:14.8896,20,3,2,0.3392
[h264_cuvid @ 0x7f66655fcd00] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps8) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[h264_cuvid @ 0x7f6a55a8d6c0] ctx->cvdl->cuvidCreateDecoder(&ctx->cudecoder, &cuinfo) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
[h264_cuvid @ 0x7f6a55a8d6c0] ctx->cvdl->cuvidDecodePicture(ctx->cudecoder, picparams) failed -> CUDA_ERROR_INVALID_HANDLE: invalid resource handle
[h264_cuvid @ 0x7f6a55a8d6c0] cuvid decode callback error
ERROR: decoder.c:64] Error sending packet to decoder : Generic error in an external library
ERROR: transcoder.c:236] Could not decode; stopping : Generic error in an external library
E0210 12:58:15.085218   93101 ffmpeg.go:609] Transcoder Return : Generic error in an external library
F0210 12:58:15.085247   93101 livepeer_bench.go:205] Transcoding failed for session 40 segment 0: Generic error in an external library

The error code and message can vary depending on where the out-of-memory happens. Here is an example of OOM during rescaling:

[h264_cuvid @ 0x7fd58011eb40] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps8) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-10 12:52:59.2408,39,0,2,0.7784
[h264_cuvid @ 0x7fd58011eb40] ctx->cvdl->cuvidGetDecoderCaps(&ctx->caps10) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
[Parsed_scale_cuda_0 @ 0x7fe457404640] cu->cuModuleLoadData(&s->cu_module, scaler_ptx) failed -> CUDA_ERROR_JIT_COMPILER_NOT_FOUND: PTX JIT compiler library not found
[Parsed_scale_cuda_0 @ 0x7fe457404640] Failed to configure output pad on Parsed_scale_cuda_0
ERROR: filter.c:124] Unable configure video filtergraph : Generic error in an external library
ERROR: transcoder.c:331] Error encoding : Error number -1381256262 occurred
E0210 12:52:59.247897   90508 ffmpeg.go:609] Transcoder Return : Error initializing filtergraph
F0210 12:52:59.247978   90508 livepeer_bench.go:205] Transcoding failed for session 40 segment 0: Error initializing filtergraph

@cyberj0g Is the fix needed in the code, or should we handle this with proper configuration of the session limit?
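(For illustration only: a rough sketch of deriving a session cap from per-GPU VRAM instead of hard-coding -concurrentSessions. All numbers below are placeholders, not measurements; real figures would have to come from profiling the actual profiles in transcodingOptions.json.)

package main

import "fmt"

// Placeholder figures (assumptions, not measured); real values would come
// from profiling the VRAM usage of one session with its renditions.
const (
	usableVRAMPerGPUMiB = 7000 // VRAM available after driver/context overhead
	vramPerSessionMiB   = 300  // cost of one concurrent session
	safetyMarginMiB     = 500  // headroom for transient allocations
)

// maxSessionsPerGPU is a naive capacity estimate for a single GPU.
func maxSessionsPerGPU() int {
	usable := usableVRAMPerGPUMiB - safetyMarginMiB
	if usable <= 0 {
		return 0
	}
	return usable / vramPerSessionMiB
}

func main() {
	gpus := 2 // e.g. the -nvidia 0,1 run above
	fmt.Printf("estimated max concurrent sessions: %d\n", gpus*maxSessionsPerGPU())
}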

@yondonfu
Member

yondonfu commented Feb 10, 2022

Seems like there are two things worth looking into here:

  1. Why did @iameli's livepeer_bench run fail with a CUDA OOM on v0.5.26, but pass on v0.5.17?
  2. Is there a way to make it clearer that a CUDA OOM occurred regardless of where it happened (i.e. during decoding, scaling, etc.)? I'm thinking it would be nice if the error returned to the caller was something like Error initializing filtergraph: CUDA_ERROR_OUT_OF_MEMORY if the OOM occurs during scaling and Error sending packet to decoder: CUDA_ERROR_OUT_OF_MEMORY if it occurs during decoding (see the sketch at the end of this comment).

For 1:

Maybe the reason this test is not passing anymore is some new feature which slightly increased VRAM usage?

From glancing at the "Transcoder" notes for the releases going back to v0.5.17, I noticed that v0.5.19 enabled B-frames in encoded outputs - perhaps that could increase VRAM usage?

We could check if the livepeer_bench run passes with https://github.com/livepeer/go-livepeer/releases/tag/v0.5.17 and fails with https://github.com/livepeer/go-livepeer/releases/tag/v0.5.19
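For 2, a minimal sketch of what the wrapping could look like (not the actual lpms/ffmpeg API; the sentinel error, the stage strings, and the substring check on the message are all assumptions). The idea is that the stage-level error keeps the CUDA cause visible instead of collapsing into "Generic error in an external library":

package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrCUDAOutOfMemory is a hypothetical sentinel; callers could test for it
// with errors.Is instead of parsing log text.
var ErrCUDAOutOfMemory = errors.New("CUDA_ERROR_OUT_OF_MEMORY")

// wrapStageError tags the failing stage (decode, scale, encode, ...) and,
// when the underlying message mentions an out-of-memory condition, swaps in
// the OOM sentinel so the root cause surfaces at the top level.
func wrapStageError(stage string, err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(strings.ToLower(err.Error()), "out of memory") {
		return fmt.Errorf("%s: %w", stage, ErrCUDAOutOfMemory)
	}
	return fmt.Errorf("%s: %w", stage, err)
}

func main() {
	// Simulated failure from the scaling stage, similar to the logs above.
	raw := errors.New("cuvidGetDecoderCaps failed: CUDA_ERROR_OUT_OF_MEMORY: out of memory")
	err := wrapStageError("Error initializing filtergraph", raw)

	fmt.Println(err)                                // Error initializing filtergraph: CUDA_ERROR_OUT_OF_MEMORY
	fmt.Println(errors.Is(err, ErrCUDAOutOfMemory)) // true
}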

@AlexKordic
Contributor

Regarding point 1:

There is no meaningful difference in memory consumption between master (68% mem) and v0.5.17 (67% mem):

[screenshot: GPU memory usage on master]

[screenshot: GPU memory usage on v0.5.17]

To check my setup: v0.5.17 is statically linked to ffmpeg, while the master version is dynamically linked:

[screenshot: ffmpeg linkage of the two builds]

@cyberj0g
Contributor

@AlexKordic great analysis. On 2: yes, I linked it dynamically. You can get a static build by calling install_ffmpeg.sh /root/ (without BUILD_TAGS) and removing libavcodec.so from the ffmpeg dir prior to that, but I'm not sure it will yield a different result. If it's not OOM, it may be something GPU-specific that we don't see with the 1070 Ti.

@AlexKordic
Contributor

@cyberj0g thanks for the confirmation. I wasn't sure whether the two installed go-livepeer repos would use the same ffmpeg code.

@yondonfu
Member

Closing because it doesn't seem like the issue was reproducible. Feel free to re-open if this issue re-emerges.
