Which OpenCL platforms does the OpenCL VP9 encoder work with? #2

Open
olidietzel opened this Issue · 13 comments

5 participants

@olidietzel

Hi, is it possible to use your OpenCL-based VP9 encoder in ffmpeg on a regular x86-64 Linux install?

I tried Fedora with Nvidia OpenCL on a 960 (Maxwell 2) GPU. I was able to install and test OpenCL, but encoding VP9 with your libvpx version compiled into ffmpeg crashed with errors.

Sorry for asking, I'm not enough of a coder to debug this on my own! :)

@ittiamvpx
Owner

Hi,

This project supports only Mali-T6xx GPUs (OpenCL). All performance optimization, validation, etc. was done only for Mali GPUs. It should work functionally on any OpenCL platform with an integrated GPU, such as Intel, though performance is not guaranteed on those platforms. It will not work on OpenCL platforms based on discrete cards, such as Nvidia or AMD graphics cards.
Please note that this project is now obsolete, as WebM libvpx has improved its quality significantly by changing its algorithms.
We are now working on a new OpenCL project, libvpx-1, based on the latest WebM libvpx quality. libvpx-1 is not yet complete; it is a work in progress. You can track that project for the latest updates.

@Kagami

Hi, @ittiamvpx.

You have issues closed at libvpx-1, so I hope you don't mind me asking here.

Could you please describe the current state of the libvpx-1 project? Is it possible to build and run that encoder on a machine with a discrete GPU (e.g. Nvidia)? Are you going to support discrete cards in the future, or do you have only a specific list of cards to support (like in this project)?

Thanks!

@ram-mohan
Collaborator

Hi Kagami,

The GPU acceleration of the VP9 encoder in the libvpx-1 repository targets real-time encoding presets only, and in particular specific cpu speeds. The workspace is under development, but the package as is was tested on integrated GPUs (Mali and Intel HD Graphics) for quality and performance and is stable. We did not test on discrete graphics cards, but we believe that we did not do anything in particular that limits its usage to integrated GPUs. As of now we do not have a road map for supporting discrete cards.

Thanks
Ram.

@Kagami

Hi, @ram-mohan.

Thanks for the answer.

We did not test on discrete graphics cards, but we believe that we did not do anything in particular that limits its usage to integrated GPUs

I built the most recent commit of the libvpx-1 repo (ittiamvpx/libvpx-1@14a8f3e) and it segfaults immediately when run with the --gpu option (without it everything works):

./configure --enable-opencl --opencl-lib=/opt/cuda/lib64/libOpenCL.so --disable-unit-tests --disable-vp8 --enable-debug
make -j8
./vpxenc park_joy_420_720p50.y4m --gpu --codec=vp9 -o test.webm

Trace:

Program received signal SIGSEGV, Segmentation fault.
end (worker=0x0) at vpx_util/vpx_thread.c:148
148   if (worker->impl_ != NULL) {
(gdb) bt
#0  end (worker=0x0) at vpx_util/vpx_thread.c:148
#1  0x0000000000485f87 in vp9_remove_compressor (cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encoder.c:2121
#2  0x0000000000486480 in vp9_create_compressor (oxcf=oxcf@entry=0x835ca8, pool=0x83e810)
    at vp9/encoder/vp9_encoder.c:1674
#3  0x00000000004770f7 in encoder_init (ctx=<optimized out>, data=<optimized out>) at vp9/vp9_cx_iface.c:812
#4  0x0000000000473e60 in vpx_codec_enc_init_ver (ctx=ctx@entry=0x824040, iface=<optimized out>, 
    cfg=cfg@entry=0x823c70, flags=<optimized out>, ver=ver@entry=11) at vpx/src/vpx_encoder.c:54
#5  0x0000000000403c69 in initialize_encoder (global=0x7fffffffdd60, stream=0x823c60) at vpxenc.c:1526
#6  main (argc=<optimized out>, argv_=<optimized out>) at vpxenc.c:2076

I have an Nvidia GTX 970 with proprietary drivers. I also built a version without multithreading, and in that case it segfaults inside vp9_aq_cyclicrefresh.c:

Program received signal SIGSEGV, Segmentation fault.
vp9_cyclic_refresh_free (cr=0x0) at vp9/encoder/vp9_aq_cyclicrefresh.c:47
47    vpx_free(cr->map);
(gdb) bt
#0  vp9_cyclic_refresh_free (cr=0x0) at vp9/encoder/vp9_aq_cyclicrefresh.c:47
#1  0x0000000000485c81 in dealloc_compressor_data (cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encoder.c:372
#2  vp9_remove_compressor (cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encoder.c:2131
#3  0x0000000000485fc0 in vp9_create_compressor (oxcf=oxcf@entry=0x833ca8, pool=0x83c810)
    at vp9/encoder/vp9_encoder.c:1674
#4  0x0000000000476c40 in encoder_init (ctx=<optimized out>, data=<optimized out>) at vp9/vp9_cx_iface.c:812
#5  0x00000000004739c0 in vpx_codec_enc_init_ver (ctx=ctx@entry=0x822040, iface=<optimized out>, 
    cfg=cfg@entry=0x821c70, flags=<optimized out>, ver=ver@entry=11) at vpx/src/vpx_encoder.c:54
#6  0x00000000004037c9 in initialize_encoder (global=0x7fffffffdd60, stream=0x821c60) at vpxenc.c:1526
#7  main (argc=<optimized out>, argv_=<optimized out>) at vpxenc.c:2076
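
In both traces vp9_remove_compressor() is called from inside vp9_create_compressor() and then dereferences a NULL pointer (worker=0x0 in the first trace, cr=0x0 in the second), which suggests a teardown running on a partially initialized encoder. A minimal sketch of a teardown that tolerates this, with hypothetical struct and function names that are not the libvpx-1 types:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for encoder state; not the libvpx-1 structs. */
typedef struct {
  unsigned char *map;
} refresh_state;

typedef struct {
  refresh_state *refresh; /* stays NULL if setup failed before allocating it */
} encoder_state;

/* Free a refresh_state; a NULL argument is a no-op, like free(NULL). */
static void refresh_state_free(refresh_state *rs) {
  if (rs == NULL) return;
  free(rs->map);
  free(rs);
}

/* Tear down an encoder that may be only partially initialized. */
static void encoder_state_destroy(encoder_state *enc) {
  if (enc == NULL) return;
  refresh_state_free(enc->refresh);
  enc->refresh = NULL;
  free(enc);
}
```

With guards like these, the error path would report the original setup failure instead of crashing during cleanup.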

As of now we do not have a road map for supporting discrete cards

OK, I understand. I can provide additional debug info about my configuration/build if needed, though.

Regards.

@ram-mohan
Collaborator

Looking at the failure, it seems that the application you are running is unable to open the kernel files for compilation. In the file "vp9_eopencl.c" there is a macro called PREFIX_PATH, which is used to locate the OpenCL kernel files. Try modifying this relative path so that the *.cl files can be opened, then check whether the kernel build calls made in vp9_eopencl_init() succeed.
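
A relative prefix is resolved against the directory the program is started from, not against the location of the binary or of the source file. A quick way to check a candidate prefix is a small probe like the sketch below; the helper name is hypothetical and is not part of libvpx-1:

```c
#include <stdio.h>

/* Returns 1 if prefix + file_name can be opened for reading, 0 otherwise.
 * Hypothetical helper for checking a PREFIX_PATH candidate. */
static int kernel_file_readable(const char *prefix, const char *file_name) {
  char path[1024];
  FILE *fp;
  const int n = snprintf(path, sizeof(path), "%s%s", prefix, file_name);
  if (n < 0 || (size_t)n >= sizeof(path)) return 0; /* path too long */
  fp = fopen(path, "r");
  if (fp == NULL) return 0;
  fclose(fp);
  return 1;
}
```

Calling it with the prefix under test and the name of one of the *.cl files, from the directory where vpxenc is launched, shows whether that prefix resolves there.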

We recommend the following configuration for encoding: "./vpxenc --target-bitrate=1000 --ivf --rt --cpu-used=-6 --end-usage=cbr --undershoot-pct=50 --overshoot-pct=50 --buf-sz=1000 --buf-initial-sz=500 --buf-optimal-sz=600 --max-intra-rate=300 --limit=1000 --profile=0 --lag-in-frames=0 --min-q=2 --max-q=52 --passes=1 --kf-max-dist=99999 --kf-min-dist=0 --drop-frame=0 --static-thresh=0 --sharpness=0 --error-resilient=1 --codec=vp9 --gf-cbr-boost=200 --frame-parallel=0 --aq-mode=3 /home/testclips/gipsrestat720p.y4m --threads=1 -o out.ivf"

@Kagami

Thanks for your help! With that change:

diff --git a/vp9/encoder/opencl/vp9_eopencl.c b/vp9/encoder/opencl/vp9_eopencl.c
index 8e3fabf..f560155 100644
--- a/vp9/encoder/opencl/vp9_eopencl.c
+++ b/vp9/encoder/opencl/vp9_eopencl.c
@@ -17,7 +17,7 @@
 #if ARCH_ARM
 #define PREFIX_PATH "./"
 #else
-#define PREFIX_PATH "../../vp9/encoder/opencl/"
+#define PREFIX_PATH "./vp9/encoder/opencl/"
 #endif

 static const int pixel_rows_per_workitem_log2_pro_me = 4;

I was able to successfully encode 1 frame of video with the --gpu option. Videos with more than 1 frame fail with a different error:

(gdb) run
Starting program: vpxenc --gpu --target-bitrate=1000 --ivf --rt --cpu-used=-6 --end-usage=cbr --undershoot-pct=50 --overshoot-pct=50 --buf-sz=1000 --buf-initial-sz=500 --buf-optimal-sz=600 --max-intra-rate=300 --limit=1000 --profile=0 --lag-in-frames=0 --min-q=2 --max-q=52 --passes=1 --kf-max-dist=99999 --kf-min-dist=0 --drop-frame=0 --static-thresh=0 --sharpness=0 --error-resilient=1 --codec=vp9 --gf-cbr-boost=200 --frame-parallel=0 --aq-mode=3 2frames.y4m --threads=1 -o test.ivf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Pass 1/1 frame    2/1      24101B   46612 us 42.91 fps [ETA  0:00:46] vpxenc: vp9/encoder/vp9_egpu.c:395: vp9_enc_sync_gpu: Assertion `gpu_output_buffer - cpi->gpu_output_pro_me_base == size' failed.
[New Thread 0x7fffe078f700 (LWP 22058)]
[New Thread 0x7fffe3fff700 (LWP 22057)]
[New Thread 0x7fffe8b1b700 (LWP 22056)]
[New Thread 0x7fffe931c700 (LWP 22055)]
[New Thread 0x7fffe9b1d700 (LWP 22054)]
[New Thread 0x7fffea31e700 (LWP 22053)]
[New Thread 0x7fffeabff700 (LWP 22052)]
[New Thread 0x7ffff397d700 (LWP 22051)]

Program received signal SIGABRT, Aborted.
0x00007ffff6dc3167 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6dc3167 in raise () from /lib64/libc.so.6
#1  0x00007ffff6dc44ca in abort () from /lib64/libc.so.6
#2  0x00007ffff6dbc296 in ?? () from /lib64/libc.so.6
#3  0x00007ffff6dbc342 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004a0b19 in vp9_enc_sync_gpu (cpi=cpi@entry=0x7ffff6ad2020, td=td@entry=0x7ffff6ade020, 
    mi_row=mi_row@entry=32, mi_row_step=mi_row_step@entry=8) at vp9/encoder/vp9_egpu.c:395
#5  0x000000000052f9bb in encode_sb_rows (mi_row_start=0, mi_row_step=8, mi_row_end=90, td=0x7ffff6ade020, 
    cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encodeframe.c:4046
#6  encode_tiles (cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encodeframe.c:4137
#7  encode_frame_internal (cpi=cpi@entry=0x7ffff6ad2020) at vp9/encoder/vp9_encodeframe.c:4349
#8  0x0000000000530551 in vp9_encode_frame (cpi=cpi@entry=0x7ffff6ad2020) at vp9/encoder/vp9_encodeframe.c:4554
#9  0x0000000000489dc8 in encode_without_recode_loop (cpi=0x7ffff6ad2020) at vp9/encoder/vp9_encoder.c:3366
#10 encode_frame_to_data_rate (cpi=cpi@entry=0x7ffff6ad2020, size=size@entry=0x7fffffffd7a8, 
    dest=dest@entry=0x7fffe8077010 "\203I\203B", frame_flags=frame_flags@entry=0x7fffffffd794)
    at vp9/encoder/vp9_encoder.c:3870
#11 0x000000000048c1ba in Pass0Encode (frame_flags=<optimized out>, dest=<optimized out>, size=<optimized out>, 
    cpi=<optimized out>) at vp9/encoder/vp9_encoder.c:4022
#12 vp9_get_compressed_data (cpi=cpi@entry=0x7ffff6ad2020, frame_flags=frame_flags@entry=0x7fffffffd794, 
    size=size@entry=0x7fffffffd7a8, dest=dest@entry=0x7fffe8077010 "\203I\203B", 
    time_stamp=time_stamp@entry=0x7fffffffd798, time_end=time_end@entry=0x7fffffffd7a0, flush=1)
    at vp9/encoder/vp9_encoder.c:4472
#13 0x000000000048338c in encoder_encode (ctx=0x834e40, img=0x0, pts=<optimized out>, duration=<optimized out>, 
    flags=<optimized out>, deadline=<optimized out>) at vp9/vp9_cx_iface.c:1060
#14 0x0000000000474340 in vpx_codec_encode (ctx=ctx@entry=0x824110, img=img@entry=0x0, pts=pts@entry=20, 
    duration=duration@entry=20, flags=flags@entry=0, deadline=<optimized out>) at vpx/src/vpx_encoder.c:223
#15 0x0000000000403f40 in encode_frame (global=0x7fffffffdaf0, global=0x7fffffffdaf0, global=0x7fffffffdaf0, 
    frames_in=2, img=0x0, stream=0x823d30) at vpxenc.c:1642
#16 main (argc=<optimized out>, argv_=<optimized out>) at vpxenc.c:2169

We recommend the following configuration for encoding

Nice, thanks. It doesn't seem to include the --gpu flag though?

@ram-mohan
Collaborator

Yes, I noticed that the --gpu flag is missing. Sorry about that. There seems to be an assertion failure in the function vp9_enc_sync_gpu (file: vp9_egpu.c, line 395). Can you share the left-hand and right-hand sides of the comparison being made?

@Kagami

(gdb) break vp9_egpu.c:395
Breakpoint 1 at 0x4a0a2c: file vp9/encoder/vp9_egpu.c, line 395.
(gdb) run
Starting program: vpxenc --gpu --target-bitrate=1000 --ivf --rt --cpu-used=-6 --end-usage=cbr --undershoot-pct=50 --overshoot-pct=50 --buf-sz=1000 --buf-initial-sz=500 --buf-optimal-sz=600 --max-intra-rate=300 --limit=1000 --profile=0 --lag-in-frames=0 --min-q=2 --max-q=52 --passes=1 --kf-max-dist=99999 --kf-min-dist=0 --drop-frame=0 --static-thresh=0 --sharpness=0 --error-resilient=1 --codec=vp9 --gf-cbr-boost=200 --frame-parallel=0 --aq-mode=3 2frames.y4m --threads=1 -o test.ivf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Pass 1/1 frame    2/1      24101B   47116 us 42.45 fps [ETA  0:00:47] [New Thread 0x7fffe078f700 (LWP 23165)]
[New Thread 0x7fffe3fff700 (LWP 23164)]
[New Thread 0x7fffe8b1b700 (LWP 23163)]
[New Thread 0x7fffe931c700 (LWP 23162)]
[New Thread 0x7fffe9b1d700 (LWP 23161)]
[New Thread 0x7fffea31e700 (LWP 23160)]
[New Thread 0x7fffeabff700 (LWP 23159)]
[New Thread 0x7ffff397d700 (LWP 23158)]

Breakpoint 1, vp9_enc_sync_gpu (cpi=cpi@entry=0x7ffff6ad2020, td=td@entry=0x7ffff6ade020, mi_row=mi_row@entry=0, 
    mi_row_step=mi_row_step@entry=8) at vp9/encoder/vp9_egpu.c:395
395           assert(gpu_output_buffer - cpi->gpu_output_pro_me_base == size);
(gdb) print gpu_output_buffer
$1 = (GPU_OUTPUT_PRO_ME *) 0x205e42200
(gdb) print cpi->gpu_output_pro_me_base
$2 = (GPU_OUTPUT_PRO_ME *) 0x205e42200
(gdb) print size
$3 = 0
(gdb) cont
Continuing.

Breakpoint 1, vp9_enc_sync_gpu (cpi=cpi@entry=0x7ffff6ad2020, td=td@entry=0x7ffff6ade020, mi_row=mi_row@entry=32, 
    mi_row_step=mi_row_step@entry=8) at vp9/encoder/vp9_egpu.c:395
395           assert(gpu_output_buffer - cpi->gpu_output_pro_me_base == size);
(gdb) print gpu_output_buffer
$4 = (GPU_OUTPUT_PRO_ME *) 0x205e49000
(gdb) print cpi->gpu_output_pro_me_base
$5 = (GPU_OUTPUT_PRO_ME *) 0x205e42200
(gdb) print size
$6 = 80
(gdb) cont
Continuing.
vpxenc: vp9/encoder/vp9_egpu.c:395: vp9_enc_sync_gpu: Assertion `gpu_output_buffer - cpi->gpu_output_pro_me_base == size' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff6dc3167 in raise () from /lib64/libc.so.6

@ram-mohan
Collaborator

For 720p content, the right-hand side, 80, is as expected, but I am unable to make much of the left-hand side. Can you please share sizeof(GPU_OUTPUT_PRO_ME) on your platform and the actual difference 'gpu_output_buffer - cpi->gpu_output_pro_me_base' you are seeing?

The memory needed for the GPU interface buffers is allocated in vp9_eopencl_alloc_buffers(). Lines 431-465 allocate the part of the GPU output buffers that is under consideration here. Looking at the buffer/sub-buffer creation and their CPU-side map pointers is the key to solving this issue. As of now I do not have a setup similar to yours to reproduce it. Once I get hold of one, I will look into it.
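
The expected value of 80 can be reproduced by hand. A minimal sketch, assuming the usual VP9 geometry (8x8-pixel mode-info units, 64x64-pixel superblocks) and assuming that sb_row in vp9_enc_sync_gpu() is mi_row divided by 8; the helper names are illustrative only:

```c
/* Number of 64x64 superblock columns for a frame of the given pixel width.
 * Assumes 8-pixel mode-info units and 8 mode-info units per superblock. */
static int sb_cols_for_width(int width) {
  const int mi_cols = (width + 7) >> 3; /* 8-pixel units, rounded up */
  return (mi_cols + 7) >> 3;            /* superblocks, rounded up */
}

/* Number of superblocks that precede mode-info row mi_row in raster order. */
static int sb_offset_for_mi_row(int width, int mi_row) {
  const int sb_row = mi_row >> 3; /* assumption: 8 mode-info rows per superblock row */
  return sb_cols_for_width(width) * sb_row;
}
```

For a 1280-pixel-wide frame this gives 20 superblock columns, so mi_row=0 yields 0 and mi_row=32 yields 20 * 4 = 80, matching the two values of size printed in the gdb session.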

Thanks,
Ram.

@Kagami

Can you please share sizeof(GPU_OUTPUT_PRO_ME) on your platform and the actual difference 'gpu_output_buffer - cpi->gpu_output_pro_me_base' you are seeing?

I added debug prints near this line:

diff --git a/vp9/encoder/vp9_egpu.c b/vp9/encoder/vp9_egpu.c
index cb0e945..4610c75 100644
--- a/vp9/encoder/vp9_egpu.c
+++ b/vp9/encoder/vp9_egpu.c
@@ -390,8 +390,20 @@ void vp9_enc_sync_gpu(VP9_COMP *cpi, ThreadData *td, int mi_row, int mi_row_step
           const int size = cm->sb_cols * sb_row;

           (void) size;
+          printf("BEFORE p1=%p p2=%p diff=%ld size=%d sizeof=%zu\n",
+                 gpu_output_buffer,
+                 cpi->gpu_output_pro_me_base,
+                 (gpu_output_buffer - cpi->gpu_output_pro_me_base),
+                 size,
+                 sizeof(GPU_OUTPUT_PRO_ME));
           egpu->acquire_output_pro_me_buffer(cpi, (void **) &gpu_output_buffer,
                                              subframe_idx);
+          printf("AFTER  p1=%p p2=%p diff=%ld size=%d sizeof=%zu\n",
+                 gpu_output_buffer,
+                 cpi->gpu_output_pro_me_base,
+                 (gpu_output_buffer - cpi->gpu_output_pro_me_base),
+                 size,
+                 sizeof(GPU_OUTPUT_PRO_ME));
           assert(gpu_output_buffer - cpi->gpu_output_pro_me_base == size);
         }
         if (mi_row - mi_row_step == subframe.mi_row_start &&

Output:

BEFORE p1=0x6cab146c5e78fa00 p2=0x205e42200 diff=-6067348286818819520 size=0 sizeof=96
AFTER  p1=0x205e42200 p2=0x205e42200 diff=0 size=0 sizeof=96
BEFORE p1=0x205e44000 p2=0x205e42200 diff=80 size=80 sizeof=96
AFTER  p1=0x205e49000 p2=0x205e42200 diff=-6148914691236516912 size=80 sizeof=96

It seems the pointers are correct before the second acquire_output_pro_me_buffer call, but the call then changes gpu_output_buffer slightly, and the difference in bytes is no longer equal to 96*80.

The memory needed for the GPU interface buffers is allocated in vp9_eopencl_alloc_buffers(). Lines 431-465 allocate the part of the GPU output buffers that is under consideration here. Looking at the buffer/sub-buffer creation and their CPU-side map pointers is the key to solving this issue.
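
The raw addresses in the last two output lines make the mismatch concrete. The expected byte offset is 80 * 96 = 7680 (0x1E00), which is exactly where the BEFORE pointer sits. The AFTER pointer is 0x6E00 = 28160 bytes past the base, and 28160 is not a multiple of 96, so subtracting the two GPU_OUTPUT_PRO_ME pointers has no meaningful result, which would explain the huge negative diff. A small check on plain integers, using only the values printed above (the function names are illustrative):

```c
#include <stdint.h>

/* Byte distance between two addresses, computed on integers so that no
 * pointer arithmetic between unrelated mappings is involved. */
static int64_t byte_offset(uint64_t p, uint64_t base) {
  return (int64_t)(p - base);
}

/* Returns 1 and stores the element index if the byte offset is a whole
 * number of elements of size elem_size; returns 0 otherwise. */
static int element_index(uint64_t p, uint64_t base, int64_t elem_size,
                         int64_t *index) {
  const int64_t bytes = byte_offset(p, base);
  if (elem_size <= 0 || bytes % elem_size != 0) return 0;
  *index = bytes / elem_size;
  return 1;
}
```

With base 0x205e42200 and an element size of 96, element_index() accepts 0x205e44000 (index 80) and rejects 0x205e49000.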

I'll try to look into it, thanks.

@mingtotti

Hi @ram-mohan ,

I got the same assertion error in vp9_enc_sync_gpu(). My understanding is that vp9_opencl_map_buffer() doesn't produce contiguous host-memory addresses for the mapped pointers of different sub-frames. The cause might be clEnqueueMapBuffer() itself, or other host memory allocations made between the two map calls.

Is there any particular reason to treat those pointers as contiguous?

Thanks,
mingtotti

@ram-mohan
Collaborator

Hi mingtotti,

Yes, we were able to reproduce this issue. As you pointed out, the host pointers for different sub-buffers were not contiguous. We made that assumption out of general intuition, but it is not guaranteed by the OpenCL specification. We have made the necessary changes on our side and will push them soon.
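
One way to drop the contiguity assumption is to keep a separate host pointer for each sub-buffer and index through that table. The sketch below is plain C with ordinary memory standing in for the mapped regions. It is not the actual change referred to above, which this thread does not show, and every name in it is hypothetical:

```c
#include <stddef.h>

/* Hypothetical per-superblock record; stands in for GPU_OUTPUT_PRO_ME. */
typedef struct {
  int value;
} sb_output;

enum { MAX_SUBFRAMES = 8 };

/* One independently mapped host pointer per sub-frame. */
typedef struct {
  sb_output *mapped[MAX_SUBFRAMES]; /* host pointer from each map call */
  int first_sb[MAX_SUBFRAMES];      /* first superblock index of each sub-frame */
  int sb_count[MAX_SUBFRAMES];      /* superblocks in each sub-frame */
  int num_subframes;
} subframe_table;

/* Look up superblock sb_index through the mapping of its own sub-frame.
 * Returns NULL if the index is out of range or the sub-frame is unmapped. */
static sb_output *lookup_sb(const subframe_table *t, int sb_index) {
  int i;
  for (i = 0; i < t->num_subframes; ++i) {
    const int local = sb_index - t->first_sb[i];
    if (local >= 0 && local < t->sb_count[i]) {
      return t->mapped[i] != NULL ? &t->mapped[i][local] : NULL;
    }
  }
  return NULL;
}
```

Because each lookup stays inside a single mapping, it does not matter where the runtime places the individual mapped regions in host memory.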

Thanks
Ram.

@mingtotti

Hi Ram,

That would be great!

Thanks,
Totti
