Suffer GPU hang by specific HEVC transcoding in CML #992

Closed
zcwang opened this issue Jul 3, 2020 · 15 comments
Labels: Decode (video decode related); P2 (Medium priority); verifying (PR: fix ready and verifying with build/test)

@zcwang

zcwang commented Jul 3, 2020

I need help with a GPU hang during HEVC transcoding on CML.

The following command causes a GPU hang with a specific HEVC video (a sample video of about 5xx MB is here).

  • Command
    ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i target-HEVC-video.mkv -vf 'deinterlace_vaapi=rate=field:auto=1,scale_vaapi=w=1920:h=1080' -c:v hevc_vaapi output.mp4

  • Test Environment
    OS: Ubuntu 18.04 with kernel v5.7 or the latest i915 drm-tip kernel (v5.8-rc2 on 06-29).
    Open Source Media Stack: 2020’Q1 release or the latest upstream on 7/1/2020
    FFmpeg version: the latest upstream code on 7/1 (commit id --> e409262837 avutil/common: Fix integer overflow in av_ceil_log2_c())
    vainfo: VA-API version: 1.8 (libva 2.8.0.pre1)
    vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.3.pre (adc2326)

  • GPU Hang in vcs,
    ...
    Jul 1 11:50:46 intel-NUC kernel: [ 9831.062462] i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out
    Jul 1 11:50:46 intel-NUC kernel: [ 9831.062468] i915 0000:00:02.0: [drm] ffmpeg[3208] context reset due to GPU hang
    Jul 1 11:50:46 intel-NUC kernel: [ 9831.062510] i915 0000:00:02.0: [drm:__i915_request_reset [i915]] client ffmpeg[3208]: gained 1 ban score, now 1
    Jul 1 11:50:46 intel-NUC kernel: [ 9831.063554] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:4:a8fffffd, in ffmpeg [3208]

ERROR: 0x00000000
DONE_REG: 0xffffffff
FAULT_TLB_DATA: 0x00000011 0xb442c1b0
Address 0x00001b442c1b0000 GGTT
GTT_CACHE_EN: 0xf0007fff
vcs0 command stream:
CCID: 0x00000000
START: 0x00011000
HEAD: 0x00000268 [0x00000230]
head = 0x00000268, wraps = 0
TAIL: 0x00000ee0 [0x00000270, 0x00000298]
CTL: 0x00003001
len=16384, enabled
MODE: 0x00000000
HWS: 0xfffe3000
ACTHD: 0x00000000 000b3924
at ring: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x13000002
ESR: 0x00000000
INSTDONE: 0xbbffffff
batch: [0x00000000_000b3000, 0x00000000_000bb000]
BBADDR: 0x00000000_000b3925
BB_STATE: 0x00000020
INSTPS: 0x00009080
INSTPM: 0x00000000
FADDR: 0x00000000 000b3b00
RC PSMI: 0x00000010
FAULT_REG: 0x00000000
GFX_MODE: 0x00008000
PDP0: 0x00000006237ef000
PDP1: 0x0000000000000000
PDP2: 0x0000000000000000
PDP3: 0x0000000000000000
engine reset count: 0
ELSP[0]: pid 2486, seqno 18:00000044, prio 0, head 00000e70, tail 00000ee0
ELSP[1]: pid 2485, seqno 1c:00000002, prio 0, head 00000000, tail 00000068
Active context: ffmpeg[2486] prio 0, guilty 1 active 0, runtime total 4540598ns, avg 3970720ns
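
For triage, the kernel log lines quoted above can be reduced to one summary record per hang. The following is a minimal Python sketch, not part of any i915 tooling: the regular expressions are assumptions derived only from the four log lines in this report, and `summarize_hangs` is a hypothetical helper name.

```python
import re

# Assumed format, taken from the log above:
# "[drm] GPU HANG: ecode 9:4:a8fffffd, in ffmpeg [3208]"
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-f:]+), in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)
# Assumed format: "[drm] Resetting vcs0 for preemption time out"
RESET_RE = re.compile(r"Resetting (?P<engine>\w+) for (?P<reason>.+)$")


def summarize_hangs(lines):
    """Return one dict per GPU hang, paired with the most recent engine reset."""
    hangs = []
    last_reset = None
    for line in lines:
        reset = RESET_RE.search(line)
        if reset:
            last_reset = reset.groupdict()
            continue
        hang = HANG_RE.search(line)
        if hang:
            entry = hang.groupdict()
            entry["pid"] = int(entry["pid"])
            entry["engine"] = last_reset["engine"] if last_reset else None
            entry["reason"] = last_reset["reason"] if last_reset else None
            hangs.append(entry)
    return hangs


sample = [
    "Jul 1 11:50:46 intel-NUC kernel: [ 9831.062462] i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out",
    "Jul 1 11:50:46 intel-NUC kernel: [ 9831.062468] i915 0000:00:02.0: [drm] ffmpeg[3208] context reset due to GPU hang",
    "Jul 1 11:50:46 intel-NUC kernel: [ 9831.062510] i915 0000:00:02.0: [drm:__i915_request_reset [i915]] client ffmpeg[3208]: gained 1 ban score, now 1",
    "Jul 1 11:50:46 intel-NUC kernel: [ 9831.063554] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:4:a8fffffd, in ffmpeg [3208]",
]

for hang in summarize_hangs(sample):
    print(hang)
```

Feeding it the output of `journalctl -k` or `dmesg` line by line should work the same way, as long as the message wording matches the lines above.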

Please refer to the log files:
ffmpeg-gpu-hang-gary-0701.zip

@zcwang
Author

zcwang commented Jul 6, 2020

The issue cannot be reproduced with MSDK’s transcoding sample using the command “sample_multi_transcode -i::h265 ~/input.h265 -deinterlace -o::h265 test-output.h265 -w 1920 -h 1080”.

I put the successfully transcoded 1080p video (downscaled from 2160p) here.

@fulinjie
Contributor

Ping.
This GPU hang occasionally occurs in the decoding procedure for some clips with missing references.

@dmitryermilov
Contributor

> The issue cannot be reproduced with MSDK’s transcoding sample using the command “sample_multi_transcode -i::h265 ~/input.h265 -deinterlace -o::h265 test-output.h265 -w 1920 -h 1080”.
>
> I put the successfully transcoded 1080p video (downscaled from 2160p) here.

@fulinjie, if the MSDK decoder can handle the stream, perhaps a workaround (WA) is possible on the FFmpeg side?

@fulinjie
Contributor

Hi @dmitryermilov ,

The main reasons for this issue are:
• The clip doesn’t start from an IRAP (intra random access point) frame.
    o Hence the first 50 frames lack a valid reference list and cannot be decoded correctly.
    o The missing references at the application level also lead to a null pointer in the driver; however, that should not lead to a GPU hang.

It seems to be related to error tolerance/handling for the null-pointer case in the driver.
• Note that the hang only reproduces in multi-thread mode; “-threads 1” does not trigger it.

• The reason MSDK works:
    o Sample decode seems to check the reference list dependency and simply skip the first 50 invalid frames;
      hence it only decodes the last 50 decodable frames:
    $ ./sample_decode h265 -i input-100frames.h265 -o /dev/null
    Decoding started
    Frame number: 50, fps: 12.097, fread_fps: 0.000, fwrite_fps: 12.712
    Decoding finished
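
Whether a clip starts at an IRAP picture can be checked directly on a raw Annex B bitstream (a .h265 file, not the .mkv container). The Python sketch below is only a diagnostic under stated assumptions and is not how sample_decode or FFmpeg implement their checks. It relies on three facts from the HEVC specification: `nal_unit_type` occupies bits 1 to 6 of the first NAL header byte, types 16 to 23 are IRAP, and `first_slice_segment_in_pic_flag` is the first bit after the two-byte NAL header of a slice. The function names are hypothetical.

```python
IRAP_TYPES = range(16, 24)  # BLA_W_LP (16) .. RSV_IRAP_VCL23 (23)
VCL_TYPES = range(0, 32)    # slice-carrying NAL unit types

START_CODE = b"\x00\x00\x01"


def iter_nal_units(data):
    """Yield (nal_unit_type, first_payload_byte) for each NAL unit in Annex B data."""
    pos = data.find(START_CODE)
    while pos >= 0 and pos + 5 < len(data):
        nal_type = (data[pos + 3] >> 1) & 0x3F
        yield nal_type, data[pos + 5]
        pos = data.find(START_CODE, pos + 3)


def pictures_before_first_irap(data):
    """Count pictures that precede the first IRAP picture; None if there is no IRAP."""
    count = 0
    for nal_type, first_payload_byte in iter_nal_units(data):
        if nal_type in IRAP_TYPES:
            return count
        # Count a picture once, on its first slice segment only.
        if nal_type in VCL_TYPES and first_payload_byte & 0x80:
            count += 1
    return None


def nal(nal_type, first_slice=True):
    """Build a tiny synthetic NAL unit (start code, 2-byte header, 2 payload bytes)."""
    flag = 0x80 if first_slice else 0x00
    return START_CODE + bytes([nal_type << 1, 0x01, flag, 0xAA])


# Three TRAIL_R pictures (one with two slices), parameter sets, then an IDR picture.
stream = (nal(1) + nal(1) + nal(1, first_slice=False) + nal(1)
          + nal(32) + nal(33) + nal(34) + nal(19) + nal(1))
print(pictures_before_first_irap(stream))
```

On a real file the call would be `pictures_before_first_irap(open(path, "rb").read())`; a non-zero result means the stream has leading pictures that cannot be decoded without earlier references.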

> @fulinjie, if the MSDK decoder can handle the stream, perhaps a workaround (WA) is possible on the FFmpeg side?

Yes, I'm working on a WA in FFmpeg to skip the invalid frames (which deviates from the native decoding pipeline), but IMHO it would be better to prevent the GPU hang regardless of whether that validity check is in place.
(Note that only some of the bitstreams with missing references lead to this GPU hang.)
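
The earlier observation that “-threads 1” does not trigger the hang can be tried against the original reproduction command. The Python sketch below only assembles the argument list; placing `-threads` before `-i` so that it applies to the decoder is an assumption, since the thread does not show the exact single-thread command line, and `build_transcode_cmd` is a hypothetical helper.

```python
import shlex


def build_transcode_cmd(src, dst, decoder_threads=None):
    """Return the reproduction command, optionally limiting decoder threads."""
    cmd = [
        "ffmpeg",
        "-hwaccel", "vaapi",
        "-hwaccel_device", "/dev/dri/renderD128",
        "-hwaccel_output_format", "vaapi",
    ]
    if decoder_threads is not None:
        # Assumption: given as an input option, i.e. before -i.
        cmd += ["-threads", str(decoder_threads)]
    cmd += [
        "-i", src,
        "-vf", "deinterlace_vaapi=rate=field:auto=1,scale_vaapi=w=1920:h=1080",
        "-c:v", "hevc_vaapi",
        dst,
    ]
    return cmd


print(shlex.join(build_transcode_cmd("target-HEVC-video.mkv", "output.mp4", 1)))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` unchanged.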

P.S. FYI, the internal discussion is accessible at:
https://jira.devtools.intel.com/browse/VIZ-16147

@dmitryermilov
Contributor

dmitryermilov commented Jul 24, 2020

Yes, I fully understand you, @fulinjie . It goes without saying that UMD should attempt to prevent GPU hangs.
My point is that, ideally, each component in the media stack should be error tolerant. Then problems that one component in the media layer can't handle will be handled by another component.

> simply skipped the first 50 invalid frames

The motivation here is not just to "simply" skip as many frames as possible :) There should be a balance between:

  • following the decoding process as described in the spec
  • user experience. I mean, even if we can output these 50 frames (which will be fully corrupted) without a GPU hang, does a user really want to watch them on the screen?
  • error tolerance and error recovery

@fulinjie
Contributor

> simply skipped the first 50 invalid frames
>
> The motivation here is not just to "simply" skip as many frames as possible :) There should be a balance between:
>
>   • following the decoding process as described in the spec
>   • user experience. I mean, even if we can output these 50 frames (which will be fully corrupted) without a GPU hang, does a user really want to watch them on the screen?
>   • error tolerance and error recovery

Yep, agreed. The skipped frames are useless and contain only garbage in this clip, so it is better to skip them.
That's the reason I'm working on a WA in FFmpeg to start decoding from IRAP frames:
fulinjie/ffmpeg@8926ae4

The GPU hang is hidden after applying the above patch.
However, since we've caught this hang issue, IMHO it would be good to add the corresponding error tolerance in media-driver.

@XinfengZhang
Contributor

XinfengZhang commented Aug 13, 2020

@wangyan-intel, could we add a check when EndPicture is called? If there is no reference frame, media-driver should return a failure and not send a real command buffer to the GPU, to avoid the GPU hang.
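
The proposed check can be illustrated with a toy model. The Python sketch below is not media-driver code: `DriverModel`, `PictureParams`, and the `MISSING_REFERENCE` status are hypothetical names, and "submitting" is just appending to a list. It only shows the control flow of failing in EndPicture before anything is sent to the hardware.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    SUCCESS = auto()
    MISSING_REFERENCE = auto()  # hypothetical error code for this sketch


@dataclass
class PictureParams:
    target_surface: int
    is_intra: bool
    reference_surfaces: list = field(default_factory=list)


class DriverModel:
    """Toy decode context; stands in for the driver only for illustration."""

    def __init__(self):
        self.decoded_surfaces = set()  # surfaces that hold a decoded picture
        self.submitted = []            # surfaces whose commands reached the "GPU"

    def end_picture(self, params):
        if not params.is_intra:
            refs = params.reference_surfaces
            if not refs or any(r not in self.decoded_surfaces for r in refs):
                # Fail early: no command buffer is submitted for this picture.
                return Status.MISSING_REFERENCE
        self.submitted.append(params.target_surface)
        self.decoded_surfaces.add(params.target_surface)
        return Status.SUCCESS


driver = DriverModel()
# An inter picture arrives first, referencing a surface that was never decoded.
print(driver.end_picture(PictureParams(0, False, [7])))
# An intra picture needs no references and is submitted.
print(driver.end_picture(PictureParams(1, True)))
# A later inter picture that references the intra picture is submitted too.
print(driver.end_picture(PictureParams(2, False, [1])))
print(driver.submitted)
```

The design choice is that a rejected picture leaves no trace in `decoded_surfaces`, so pictures that depend on it are rejected as well until a decodable picture arrives.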

@zcwang
Author

zcwang commented Aug 27, 2020

@XinfengZhang, sorry for bothering you. Is there any possible direction on this issue?

@wangyan-intel

I will take a look. Sorry for the slow response.

@wangyan-intel wangyan-intel removed their assignment Aug 28, 2020
@wangyan-intel

@weizhu-intel Could you please help take a look? Thanks.

@wangyan-intel wangyan-intel self-assigned this Aug 28, 2020
@weizhu-intel
Contributor

Hi Linjie & zcwang,
I gave it a try on my side and found that FFmpeg still passes a ref_pic_id even when the reference is missing. This causes unexpected issues.
Sometimes there is no GMM resource info, so we can detect the problem in EndPicture and return an error. Sometimes the GMM resource info is wrong, and that leads to the hang.

So could you pass an invalid surface ID (in_valid_surfaceid) instead of a regular ref_pic_id when the reference is missing? Then our driver can detect this.
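
On the application side, the request amounts to substituting a sentinel for every reference that is not actually available. A minimal Python sketch follows, assuming the sentinel is libva's `VA_INVALID_SURFACE` (defined as `VA_INVALID_ID`, 0xffffffff); the `dpb` mapping from picture order count to surface ID and the function name are hypothetical and do not mirror FFmpeg's data structures.

```python
VA_INVALID_SURFACE = 0xFFFFFFFF  # same value as libva's VA_INVALID_ID


def build_reference_ids(wanted_pocs, dpb):
    """Map each wanted picture order count to a surface ID.

    dpb is a dict {poc: surface_id} of pictures that were really decoded.
    References missing from it get the invalid-surface sentinel instead of
    a stale or made-up ID, so a driver can detect them.
    """
    return [dpb.get(poc, VA_INVALID_SURFACE) for poc in wanted_pocs]


dpb = {8: 3, 16: 4}
print([hex(s) for s in build_reference_ids([8, 12, 16], dpb)])
```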

Thanks
wayne zhu

@zcwang
Author

zcwang commented Oct 21, 2020

The issue was fixed by the following patch (i.e. the intel-ffmpeg patchset included in the Media-Driver 2020Q3 release, but not in upstream):
https://github.com/intel-media-ci/intel-ffmpeg-patch/blob/master/0057-lavc-vaapi_hevc-add-skip_frame-invalid-to-skip-inval.patch

@zcwang
Author

zcwang commented Oct 21, 2020

@weizhu-intel and @dmitryermilov,
Do you think this issue should be fixed by the FFmpeg patch or in media-driver? Thanks!
https://patchwork.ffmpeg.org/project/ffmpeg/list/?series=2021
Gary

@XinfengZhang XinfengZhang added P2 Medium priority verifying PR: fix ready and verifying with build/test labels Jan 8, 2021
@Jexu Jexu assigned Jexu and unassigned wangyan-intel Dec 2, 2022
@Jexu Jexu added the Decode video decode related label Dec 2, 2022
@Jexu
Contributor

Jexu commented Mar 6, 2023

This issue should have been fixed in the latest media driver; could you try it again?
The hang is gone on my side with the latest driver.
By the way, the driver fix is to skip decoding if a reference frame is missing.

@Jexu
Contributor

Jexu commented Apr 21, 2023

Let me close this issue now, since it is fixed in the media driver; you can also add a strict check for invalid reference frames in FFmpeg or VPL as an option.
Please re-open it if you have any other questions.

@Jexu Jexu closed this as completed Apr 21, 2023