Enabling mb_rate_control kills whole machine (Skylake GT2) #172

Open
fhvwy opened this issue May 20, 2017 · 14 comments

@fhvwy
Contributor

fhvwy commented May 20, 2017

Build ffmpeg git master with @mypopydev's patch to add the mb_rate_control option: https://lists.ffmpeg.org/pipermail/ffmpeg-devel/2017-May/211334.html.

Input file doesn't seem to matter much. To be consistent I am using the Big Buck Bunny 1080p file here.

Take steps to avoid data loss (remount all data mounts readonly, sync).

Run:

./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_1080_264.mp4 -an -c:v h264_vaapi -b 1M -mb_rate_control 1 /tmp/out.h264

After some frames (not repeatable between runs, but at most a few hundred) the machine becomes completely unresponsive.

On some runs I get a GPU hang log on the console (transcribed) before it locks up, but not consistently:

[drm] GPU HANG ecode 9:0:0x8fd0ffff, in ffmpeg_g [2669], reason: Hang on render ring, action: reset
[drm] {the usual GPU hang bug warning}
[drm] drm/i915: Resetting chip after gpu hang
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110

Power-cycle to recover the machine.

Setup:

There are probably at least two issues here: in the VAAPI driver (because enabling mb_rate_control has broken the GPU) and in the kernel (because it didn't recover). I've only sent this here because the reproducer is here, but please do forward this if appropriate.

Possibly relevant: The same ffmpeg command with the mb_rate_control option works fine on a Skylake 6260U (GT3, 48 EUs). Could there be something about the proprietary shader binaries which only works on the larger GPU and breaks horribly on the smaller one?

@fhvwy
Contributor Author

fhvwy commented May 20, 2017

Behaviour is identical with the 1.8.1 release.

Whether console output appears seems to depend on whether the full DRM framebuffer is being used. If it is, then taking out the GPU kills the output entirely and I don't get anything. If not, the output doesn't die and gives the log above before locking up. Maybe a serial console would be able to get more output if there is any (a panic log, perhaps)?

@xhaihao xhaihao added the bug label May 24, 2017
@fhvwy
Contributor Author

fhvwy commented Jun 16, 2017

Has anyone been able to reproduce this? The failure is completely consistent for me, always killing the whole machine when running as above.

Is there anything else I can do to help debug it?

@xhaihao
Contributor

xhaihao commented Jun 18, 2017

@fhvwy we will give your patch a try.

@xhaihao xhaihao self-assigned this Jun 25, 2017
@Brainiarc7

I'll test this on a similar workstation and report back.

@FocusLuo FocusLuo added the P3 label Jul 26, 2017
@xhaihao xhaihao removed the P3 label Jul 26, 2017
@FocusLuo FocusLuo added the P3 label Jul 26, 2017
@wangzj0601

wangzj0601 commented Dec 8, 2017

Cannot reproduce this issue after applying the patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch (applied by copying the code line by line, because the patch is too old) to ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915.
cmdline: ./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i /media/h264_container/720p.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 ./out.h264

Env
Processor: Skylake ULX (Intel(R) Core(TM) m5-6Y57 CPU)
GT info: GT2 (0x191E)
Kernel version: 4.12.0-rc2
ffmpeg: repo https://git.ffmpeg.org/ffmpeg.git commit c + patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch (applied by copying the code line by line, because the patch is too old)
Libva: 2.0.1.pre1 master branch commit 51e98b1224794a44ba097baa7a1b4e35c3596d0c
intel_driver:  2.0.1.pre1 master branch commit 35fc70f repo: https://github.com/01org/intel-vaapi-driver.git

@wangzj0601

wangzj0601 commented Dec 8, 2017

I have uploaded my patched vaapi_encode_h264.c. Change the attachment's extension to .c and use it in place of the original vaapi_encode_h264.c in ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915.
vaapi_encode_h264.txt

@fhvwy
Contributor Author

fhvwy commented Dec 8, 2017

I tried this again on the same machine (Skylake 6300), with slightly newer software. The problem persists, but the machine is no longer hard-reset by the operation so I am able to extract some debug information. The graphics core is still completely dead, and doesn't work at all until the machine is rebooted.

Using:

Kernel output:

[ 2249.401011] [drm] GPU HANG: ecode 9:0:0x8fd0fffe, in ffmpeg_g [9317], reason: Hang on rcs0, action: reset
[ 2249.401012] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2249.401012] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2249.401012] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2249.401013] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2249.401013] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2249.401028] drm/i915: Resetting chip after gpu hang
[ 2250.107308] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ 2250.107433] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

DRM error dump: http://ixia.jkqxz.net/~mrt/i965/bug172_drm_error.

@wangzj0601

I tried another SKL unit, and the issue still cannot be reproduced with ffmpeg commit 991eca0f8729043724ae4574be0eb4c20bdba915 + patch FFmpeg-devel-V3-lavc-vaapi_encode_h264-Enable-MB-rate-control..patch

CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
VGA: VGA compatible controller [0300]: Intel Corporation Sky Lake Integrated Graphics [8086:1912] (rev 06)
ffmpeg compilation cmd: --enable-vaapi --prefix=/opt/yami/ffmpeg

Full output from running the ffmpeg command with the mb_rate_control option:
root@yami-skl:/build/ffmpeg# ./ffmpeg_g -y -threads 1 -hwaccel vaapi -hwaccel_output_format vaapi -i /media/h264_container/720p.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 ./out.h264
ffmpeg version N-88605-g991eca0 Copyright (c) 2000-2017 the FFmpeg developers
built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 20160609
configuration: --enable-vaapi --prefix=/opt/yami/ffmpeg
libavutil 56. 0.100 / 56. 0.100
libavcodec 58. 1.100 / 58. 1.100
libavformat 58. 2.100 / 58. 2.100
libavdevice 58. 0.100 / 58. 0.100
libavfilter 7. 0.101 / 7. 0.101
libswscale 5. 0.101 / 5. 0.101
libswresample 3. 0.101 / 3. 0.101
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/media/h264_container/720p.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf57.26.100
Duration: 00:00:03.34, start: 0.000000, bitrate: 4096 kb/s
Stream #0:0(eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 4092 kb/s, 29.98 fps, 29.97 tbr, 16016 tbn, 60.67 tbc (default)
Metadata:
handler_name : VideoHandler
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
Output #0, h264, to './out.h264':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.2.100
Stream #0:0(eng): Video: h264 (h264_vaapi) (High), vaapi_vld, 1280x720 [SAR 1:1 DAR 16:9], q=0-31, 1000 kb/s, 29.97 fps, 29.97 tbn, 29.97 tbc (default)
Metadata:
handler_name : VideoHandler
encoder : Lavc58.1.100 h264_vaapi
frame= 100 fps=0.0 q=-0.0 Lsize= 396kB time=00:00:03.30 bitrate= 981.0kbits/s speed=13.6x
video:396kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000000%

@fhvwy
Contributor Author

fhvwy commented Dec 12, 2017

@wangzj0601 What input file are you using? Have you tried encoding more than 100 frames? The failure is very consistent for me, but how long it takes varies by file and other settings (though usually around 200 frames).

E.g. with the 1080p "Big Buck Bunny" file running:

./ffmpeg_g -v 55 -y -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_1080_264.mp4 -an -c:v h264_vaapi -b:v 1M -mb_rate_control 1 out.h264

the GPU always dies when encoding frame 234.

Wrt the SKU you are using, have you tried one with 23 EUs rather than 24? That is one possible difference which I suggested above and haven't been able to check. (I think both the 6Y57 and 6600K will have 24, though do correct me if I'm wrong.)

@xhaihao
Contributor

xhaihao commented Dec 13, 2017

@wangzj0601 could you try ffmpeg 2fdc9f7c4939f83a6c9d1f9d85b6d37ce0bab714 + http://ixia.jkqxz.net/~mrt/i965/mb_rc.patch? Mark has rebased the ffmpeg patch against a newer version of FFmpeg.

@fhvwy I think your SKL should have 24 EUs; the PCI ID is 0x1912 in your DRM error dump. Why do you think your machine has 23 EUs?

@fhvwy
Contributor Author

fhvwy commented Dec 13, 2017

@xhaihao See table and notes in https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol04-configurations.pdf - "[a] Particular SKUs produced by Intel may have one EU disabled.". It's visible at runtime in Beignet, which indicates that it has 23 compute units while other similar machines have 24. (I assume there is an ioctl() somewhere which will return how many there are.)

@fhvwy
Contributor Author

fhvwy commented Dec 13, 2017

$ uname -s -v
Linux #1 SMP Debian 4.13.13-1 (2017-11-16)
$ cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-6300 CPU @ 3.80GHz
$ cat eu_count.c 
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <intel_bufmgr.h>

int main(int argc, const char **argv)
{
    const char *device;
    int err;

    if (argc == 1)
        device = "/dev/dri/renderD128";
    else if (argc == 2)
        device = argv[1];
    else {
        fprintf(stderr, "Usage: %s <drm-device>\n", argv[0]);
        return 1;
    }

    err = open(device, O_RDWR);
    if (err < 0) {
        fprintf(stderr, "Failed to open device %s: %m.\n", device);
        return 1;
    }
    int fd = err;

    unsigned int eu_total = 0;
    err = drm_intel_get_eu_total(fd, &eu_total);
    if (err < 0) {
        fprintf(stderr, "Failed to get EU total: %m.\n");
        return 1;
    }

    printf("EU total: %u\n", eu_total);

    close(fd);

    return 0;
}
$ gcc eu_count.c $(pkg-config --libs --cflags libdrm libdrm_intel)
$ ./a.out 
EU total: 23

@yakuizhao
Contributor

Thanks for sharing the detailed info.
The current code already tries to query the EU_count by using drm ioctl.

intel->eu_total = 0;
if (intel_driver_get_param(intel, LOCAL_I915_PARAM_EU_TOTAL, &ret_value)) {
    intel->eu_total = ret_value;
}

@lizhong1008
Contributor

I also tried to reproduce this issue on my KBL (i7-7567U) but could not. It looks like it only happens on specific CPUs. @wangzj0601 could you try to find a Skylake 6300 (or some other version with 23 EUs) to reproduce it?
