
Are OpenCL/GPU schedule for blur of apps #1568

Open
xingjinglu opened this issue Oct 25, 2016 · 11 comments

@xingjinglu

Is there a GPU schedule for blur? I want to generate the OpenCL version of the blur app. By the way, can I get the generated OpenCL code for blur?

@zvookin
Member

zvookin commented Oct 25, 2016

If you set the HL_DEBUG_CODEGEN environment variable to 1 or higher, the
compiler will print the generated OpenCL code as debugging output. (There
are other ways to set the debug level, including the following line of C++
code: Halide::Internal::debug::debug_level = 1;)
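
For concreteness, here is a minimal sketch (not the shipped apps/blur source) of compiling the two-stage blur for an OpenCL target so the kernel source shows up in the debug output; the file and function names are placeholders, and the 4-argument gpu_tile form matches the 2016-era API used throughout this thread (newer Halide spells it with explicit inner Vars):

#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y");
    Func blur_x("blur_x"), blur_y("blur_y");

    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3;

    blur_y.gpu_tile(x, y, 8, 8);  // any GPU schedule; see the comments below

    // Run with HL_DEBUG_CODEGEN=1 in the environment (or set the debug level in
    // code as above) and the OpenCL C source is printed during compilation.
    Target t = get_host_target().with_feature(Target::OpenCL);
    blur_y.compile_to_file("halide_blur", {input}, "halide_blur", t);
    return 0;
}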


@abadams
Member

abadams commented Oct 25, 2016

There isn't a GPU schedule for it currently. I think a decent GPU schedule would be:

blur_y.gpu_tile(x, y, 8, 8);

and then possibly also

blur_x.compute_at(blur_y, Var::gpu_blocks).gpu_threads(x, y);

but uint16 math kinda sucks on GPUs, so you might want to change it to
floats.
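
As a concrete illustration of the "change it to floats" suggestion (my interpretation, not code from this thread), the math could be staged through a float Func and cast back to uint16 at the end:

ImageParam input(UInt(16), 2);
Func in_f("in_f"), blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y");

// Do the averaging in float, then narrow back to uint16 on the way out.
in_f(x, y) = cast<float>(input(x, y));
blur_x(x, y) = (in_f(x, y) + in_f(x + 1, y) + in_f(x + 2, y)) / 3.0f;
blur_y(x, y) = cast<uint16_t>((blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3.0f);

// The suggested schedule, in the 4-arg gpu_tile form used in this thread:
blur_y.gpu_tile(x, y, 8, 8);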


@jrk
Member

jrk commented Oct 26, 2016

On a GPU I'd guess with 95% certainty that you just want to let blur_x
inline into blur_y (the default, if you don't schedule blur_x at all), and
then blur_y.gpu_tile(x, y, K, L) as Andrew said, with reasonable values of
K, L being 8, 8.

Blur is generally a very uninteresting workload, especially for a GPU: it's
mostly meant as the bare minimum thing that shows the basic tradeoffs that
matter, but even on modern CPUs it's increasingly becoming best just to
inline the two stages. At the very least, we should make the kernel larger
so that more advanced producer-consumer locality strategies actually matter.
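
In schedule form, the fully inlined variant is just the single gpu_tile call (a sketch, with K = L = 8 as suggested):

// blur_x gets no schedule at all, so it is inlined into blur_y by default.
blur_y.gpu_tile(x, y, 8, 8);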


@darkbuck
Contributor

GPU Schedule Study

Just curious how those two schedules perform on a GPU, I studied them on an NVIDIA
GTX 970. Let's put the numbers first before going into the details. The following
lists the runtime of different tile layouts under several GPU schedules (the first
two are the ones suggested above):

Schedule / Runtime (s)   32x8       64x8
Inline                   0.001910   0.001986
Cache                    0.001839   0.002046
Slide                    0.001127   0.000981
SlideVector              0.001248   0.000875

Here's the same performance expressed as a data transfer rate, so we can compare
against the hardware's peak copy throughput and see how efficiently we use it:

Schedule / Bandwidth (GB/s)   32x8     64x8
Inline                        59.85    57.62
Cache                         62.26    55.93
Slide                         101.54   116.66
SlideVector                   91.70    130.79

The bandwidth is roughly calculated as

(sizeof(ushort) * 6400 * 4800 * 2) / runtime

i.e. one read of the input and one write of the output per pixel.

  • Inline is the schedule that simply invokes gpu_tile:

    blur_y.gpu_tile(x, y, tilex, tiley);

    In my tests, I tried two tile layouts: 32x8 and 64x8.

  • Cache is the schedule that caches the calculation of blur_x per GPU block:

    blur_y.gpu_tile(x, y, tilex, tiley);
    blur_x.compute_at(blur_y, Var::gpu_blocks).gpu_threads(x, y);
    
  • Slide is a schedule with nested tiling in the y dimension. After unrolling,
    it reduces the recomputation of blur_x per GPU thread:

    blur_y.tile(x, y, xo, yo, xi, yi, tile_x, tile_y);
    blur_y.split(yi, yio, yii, tile_y);
    blur_y.reorder(yii, xi, yio);
    blur_y.gpu_blocks(xo, yo);
    blur_y.gpu_threads(xi, yio);
    blur_y.unroll(yii);
    
  • Furthermore, building on Slide, SlideVector packs two calculations of
    blur_x into a single GPU thread to fully utilize the memory bandwidth per
    GPU thread:

    blur_y.tile(x, y, xo, yo, xi, yi, tile_x, tile_y);
    blur_y.tile(xi, yi, xio, yio, xii, yii, 2, tile_y);
    blur_y.gpu_blocks(xo, yo);
    blur_y.gpu_threads(xio, yio);
    blur_y.unroll(yii);
    blur_y.vectorize(xii, 2);
    

The GTX 970 has a theoretical memory bandwidth of up to 192 GB/s, but it's very
difficult to reach that. The best data copy rate I could measure is close to
140 GB/s using 'bandwidthTest' from the CUDA SDK. Compared with the results listed
above, SlideVector reaches up to 93.42% of that copy bandwidth. If we account for
the per-tile overhead, the result is pretty good in terms of efficiency.
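
As a quick sanity check of the tables (my own reproduction, not from the thread): the listed figures line up with the formula above if the result is expressed per 2^30 bytes (GiB/s):

#include <cstdio>
#include <cstdint>

int main() {
    // One read of the input plus one write of the output, two bytes per pixel.
    const double bytes_moved = sizeof(uint16_t) * 6400.0 * 4800.0 * 2;
    const double runtime_s = 0.000875;  // SlideVector, 64x8 tiles (OpenCL table above)
    std::printf("%.2f GiB/s\n", bytes_moved / runtime_s / (1 << 30));  // prints ~130.79
    return 0;
}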

Proposal for a GPU schedule interface change

Since putting more work on each GPU thread is a well-known optimization, shall we
add a new interface for GPU schedules? Taking the 2D version as an example:

Func &Func::gpu_tile2(VarOrRVar x, VarOrRVar y, int x_size, int y_size, VarOrRVar xi, VarOrRVar yi, int xi_size, int yi_size, TailStrategy tail, DeviceAPI device_api);

which could be implemented as

tile(x, y, xt, yt, x_size, y_size);
tile(xt, yt, xi, yi, xi_size, yi_size);
gpu_blocks(x, y);
gpu_threads(xt, yt);

With that, Slide and SlideVector could be simplified to

blur_y.gpu_tile2(x, y, tilex, tiley, xi, yi, 1, tiley);
blur_y.unroll(yi);

and

blur_y.gpu_tile2(x, y, tilex, tiley, xi, yi, 2, tiley);
blur_y.unroll(yi);
blur_y.vectorize(xi, 2);
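
Here is a sketch of what that could look like as a free helper function, written against the 2016-era scheduling API used in this thread (gpu_tile2 itself is only a proposal, not an existing Func method):

using namespace Halide;

// Hypothetical helper with the proposed gpu_tile2 behavior.
void gpu_tile2(Func f, Var x, Var y, int x_size, int y_size,
               Var xi, Var yi, int xi_size, int yi_size) {
    Var xt("xt"), yt("yt");
    f.tile(x, y, xt, yt, x_size, y_size);      // x, y remain the outer (block) loops
    f.tile(xt, yt, xi, yi, xi_size, yi_size);  // xt, yt become the per-thread loops
    f.gpu_blocks(x, y);
    f.gpu_threads(xt, yt);
}

// Usage, matching the simplified Slide schedule above:
//   gpu_tile2(blur_y, x, y, tilex, tiley, xi, yi, 1, tiley);
//   blur_y.unroll(yi);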

@darkbuck
Contributor

Here's the performance with CUDA (the previous results were from OpenCL):

Schedule / Runtime (s)   32x8       64x8
Inline                   0.002519   0.002601
Cache                    0.001777   0.001985
Slide                    0.001423   0.001077
SlideVector              0.001678   0.001247

Schedule / Bandwidth (GB/s)   32x8    64x8
Inline                        45.43   44.00
Cache                         64.40   57.65
Slide                         80.42   106.26
SlideVector                   68.20   91.77

Note that the performance difference between OpenCL and CUDA is mainly due to the different backends used: for OpenCL we only generate OpenCL C source code and the full NVIDIA compiler does the rest, whereas CUDA uses the open-source PTX backend in LLVM.

@zvookin
Member

zvookin commented Dec 13, 2016 via email

@darkbuck
Contributor

The major issue with the PTX backend is the cache hint used on loads: it's quite different from what the NVIDIA compiler generates, and the PTX backend in LLVM doesn't exploit that hint at all.

@darkbuck
Contributor

The example blur in Halide uses ushort, which is more complicated since a warp's worth of ushort loads doesn't fill a whole cache line on GM204 (GTX 970). To simplify things a little, I changed the type to 'uint' so that each load from a warp fills a single cache line, which makes the cache hint matter less. The results (runtime in ms of Slide with 32x8 tiles over a 16384 x 16384 image) from OpenCL and CUDA now match quite closely:

CUDA     OpenCL
15.410   14.750

Translated into bandwidth (GB/s):

CUDA     OpenCL
129.78   135.59

@jrk
Member

jrk commented Dec 13, 2016

Sorry, just to be sure I follow: did the last tests change the entire pipeline and input/output data to operate on uint32_t instead of uint16_t?

That's not very satisfying, since we should still be able to get good performance using 16-bit values; it may just be necessary to do loads/stores via 2-vectors, for 32 contiguous bits per lane, to get peak bus utilization, as you showed in your SlideVector schedule earlier. If the PTX backend can't match OpenCL on that benchmark, we'd like to look into it more.

@darkbuck
Contributor

Yeah, I just replaced uint16_t with uint32_t. The motivation was just to figure out which direction to look into.

Unfortunately, loading/storing via 2-vectors requires alignment, i.e. <2 x i16> requires 4-byte alignment. Making that alignment assumption would require changing the image storage format.

SlideVector still has two loads of i16, but with the cache hint set so that the second one hits L1$.

@darkbuck
Contributor

The proposed API is requested in #1690.
