
Are OpenCL/GPU schedule for blur of apps #1568

Open
xingjinglu opened this issue Oct 25, 2016 · 11 comments

@xingjinglu

Is there a GPU schedule for blur? I want to generate the OpenCL version of the blur app. By the way, can I get the generated OpenCL code for blur?

@zvookin
Member

zvookin commented Oct 25, 2016

If you set the HL_DEBUG_CODEGEN environment variable to 1 or higher, the
compiler will print the generated OpenCL code as debugging output. (There
are other ways to set the debug level, including the following line of C++
code: Halide::Internal::debug::debug_level = 1;)
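
For concreteness, here is a minimal sketch (not the shipped apps/blur source) of compiling the two-stage blur for an OpenCL target so the kernel source shows up in the debug output; the file and function names are placeholders, and the 4-argument gpu_tile form matches the 2016-era API used throughout this thread (newer Halide spells it with explicit inner Vars):

#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y");
    Func blur_x("blur_x"), blur_y("blur_y");

    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3;

    blur_y.gpu_tile(x, y, 8, 8);  // any GPU schedule; see the comments below

    // Run with HL_DEBUG_CODEGEN=1 in the environment (or set the debug level in
    // code as above) and the OpenCL C source is printed during compilation.
    Target t = get_host_target().with_feature(Target::OpenCL);
    blur_y.compile_to_file("halide_blur", {input}, "halide_blur", t);
    return 0;
}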


@abadams
Member

abadams commented Oct 25, 2016

There isn't a GPU schedule for it currently. I think a decent GPU schedule would be:

blur_y.gpu_tile(x, y, 8, 8);

and then possibly also

blur_x.compute_at(blur_y, Var::gpu_blocks).gpu_threads(x, y);

but uint16 math kinda sucks on GPUs, so you might want to change it to
floats.
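
As a concrete illustration of the "change it to floats" suggestion (my interpretation, not code from this thread), the math could be staged through a float Func and cast back to uint16 at the end:

ImageParam input(UInt(16), 2);
Func in_f("in_f"), blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y");

// Do the averaging in float, then narrow back to uint16 on the way out.
in_f(x, y) = cast<float>(input(x, y));
blur_x(x, y) = (in_f(x, y) + in_f(x + 1, y) + in_f(x + 2, y)) / 3.0f;
blur_y(x, y) = cast<uint16_t>((blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3.0f);

// The suggested schedule, in the 4-arg gpu_tile form used in this thread:
blur_y.gpu_tile(x, y, 8, 8);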


@jrk
Member

jrk commented Oct 26, 2016

On a GPU I'd guess with 95% certainty that you just want to let blur_x
inline into blur_y (the default, if you don't schedule blur_x at all), and
then blur_y.gpu_tile(x, y, K, L) as Andrew said, with reasonable values of
K, L being 8, 8.

Blur is generally a very uninteresting workload, especially for a GPU: it's
mostly meant as the bare minimum thing that shows the basic tradeoffs that
matter, but even on modern CPUs it's increasingly becoming best just to
inline the two stages. At the very least, we should make the kernel larger
so that more advanced producer-consumer locality strategies actually matter.
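
In schedule form, the fully inlined variant is just the single gpu_tile call (a sketch, with K = L = 8 as suggested):

// blur_x gets no schedule at all, so it is inlined into blur_y by default.
blur_y.gpu_tile(x, y, 8, 8);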


@darkbuck
Contributor

GPU Schedule Study

Just curious how those two schedules perform on a GPU, I studied them on an NVIDIA
GTX 970. Let's put the numbers first before going into the details. The following
lists the runtime of different tile layouts under several GPU schedules (the first
two are the ones suggested above):

Schedule / Runtime (s)   32x8       64x8
Inline                   0.001910   0.001986
Cache                    0.001839   0.002046
Slide                    0.001127   0.000981
SlideVector              0.001248   0.000875

Here's the same performance expressed as a data transfer rate, so we can compare
against the hardware's peak copy throughput and see how efficiently we use it:

Schedule / Bandwidth (GB/s)   32x8     64x8
Inline                        59.85    57.62
Cache                         62.26    55.93
Slide                         101.54   116.66
SlideVector                   91.70    130.79

The bandwidth is roughly calculated as

(sizeof(ushort) * 6400 * 4800 * 2) / runtime

i.e. one read of the input and one write of the output per pixel.

  • Inline is the schedule that simply invokes gpu_tile:

    blur_y.gpu_tile(x, y, tilex, tiley);

    In my tests, I tried two tile layouts: 32x8 and 64x8.

  • Cache is the schedule that caches the calculation of blur_x per GPU block:

    blur_y.gpu_tile(x, y, tilex, tiley);
    blur_x.compute_at(blur_y, Var::gpu_blocks).gpu_threads(x, y);
    
  • Slide is a schedule with nested tiling in the y dimension. After unrolling,
    it reduces the recomputation of blur_x per GPU thread:

    blur_y.tile(x, y, xo, yo, xi, yi, tile_x, tile_y);
    blur_y.split(yi, yio, yii, tile_y);
    blur_y.reorder(yii, xi, yio);
    blur_y.gpu_blocks(xo, yo);
    blur_y.gpu_threads(xi, yio);
    blur_y.unroll(yii);
    
  • Furthermore, building on Slide, SlideVector packs two calculations of
    blur_x into a single GPU thread to fully utilize the memory bandwidth per
    GPU thread:

    blur_y.tile(x, y, xo, yo, xi, yi, tile_x, tile_y);
    blur_y.tile(xi, yi, xio, yio, xii, yii, 2, tile_y);
    blur_y.gpu_blocks(xo, yo);
    blur_y.gpu_threads(xio, yio);
    blur_y.unroll(yii);
    blur_y.vectorize(xii, 2);
    

The GTX 970 has a theoretical memory bandwidth of up to 192 GB/s, but it's very
difficult to reach that. The best data copy rate I could measure is close to
140 GB/s using 'bandwidthTest' from the CUDA SDK. Compared with the results listed
above, SlideVector reaches up to 93.42% of that copy bandwidth. If we account for
the per-tile overhead, the result is pretty good in terms of efficiency.
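
As a quick sanity check of the tables (my own reproduction, not from the thread): the listed figures line up with the formula above if the result is expressed per 2^30 bytes (GiB/s):

#include <cstdio>
#include <cstdint>

int main() {
    // One read of the input plus one write of the output, two bytes per pixel.
    const double bytes_moved = sizeof(uint16_t) * 6400.0 * 4800.0 * 2;
    const double runtime_s = 0.000875;  // SlideVector, 64x8 tiles (OpenCL table above)
    std::printf("%.2f GiB/s\n", bytes_moved / runtime_s / (1 << 30));  // prints ~130.79
    return 0;
}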

Proposal for a GPU schedule interface change

Since putting more work on each GPU thread is a well-known optimization, shall we
add a new interface for GPU schedules? Taking the 2D version as an example:

Func &Func::gpu_tile2(VarOrRVar x, VarOrRVar y, int x_size, int y_size, VarOrRVar xi, VarOrRVar yi, int xi_size, int yi_size, TailStrategy tail, DeviceAPI device_api);

which could be implemented as

tile(x, y, xt, yt, x_size, y_size);
tile(xt, yt, xi, yi, xi_size, yi_size);
gpu_blocks(x, y);
gpu_threads(xt, yt);

With that, Slide and SlideVector could be simplified to

blur_y.gpu_tile2(x, y, tilex, tiley, xi, yi, 1, tiley);
blur_y.unroll(yi);

and

blur_y.gpu_tile2(x, y, tilex, tiley, xi, yi, 2, tiley);
blur_y.unroll(yi);
blur_y.vectorize(xi, 2);
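
Here is a sketch of what that could look like as a free helper function, written against the 2016-era scheduling API used in this thread (gpu_tile2 itself is only a proposal, not an existing Func method):

using namespace Halide;

// Hypothetical helper with the proposed gpu_tile2 behavior.
void gpu_tile2(Func f, Var x, Var y, int x_size, int y_size,
               Var xi, Var yi, int xi_size, int yi_size) {
    Var xt("xt"), yt("yt");
    f.tile(x, y, xt, yt, x_size, y_size);      // x, y remain the outer (block) loops
    f.tile(xt, yt, xi, yi, xi_size, yi_size);  // xt, yt become the per-thread loops
    f.gpu_blocks(x, y);
    f.gpu_threads(xt, yt);
}

// Usage, matching the simplified Slide schedule above:
//   gpu_tile2(blur_y, x, y, tilex, tiley, xi, yi, 1, tiley);
//   blur_y.unroll(yi);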

@darkbuck
Contributor

Here's the performance with CUDA (the previous results were from OpenCL):

Schedule / Runtime (s)   32x8       64x8
Inline                   0.002519   0.002601
Cache                    0.001777   0.001985
Slide                    0.001423   0.001077
SlideVector              0.001678   0.001247

Schedule / Bandwidth (GB/s)   32x8    64x8
Inline                        45.43   44.00
Cache                         64.40   57.65
Slide                         80.42   106.26
SlideVector                   68.20   91.77

Note that the performance difference between OpenCL and CUDA is mainly due to the different backends used: for OpenCL we only generate OpenCL C source code and the full NVIDIA compiler does the rest, whereas CUDA uses the open-source PTX backend in LLVM.

@zvookin
Member

zvookin commented Dec 13, 2016 via email

@darkbuck
Contributor

The major issue with the PTX backend is the cache hint used on loads: it's quite different from what the NVIDIA compiler generates, and the PTX backend in LLVM doesn't exploit that hint at all.

@darkbuck
Contributor

The example blur in Halide uses ushort, which is more complicated since a warp's worth of ushort loads doesn't fill a whole cache line on GM204 (GTX 970). To simplify things a little, I changed the type to 'uint' so that each load from a warp fills a single cache line, which makes the cache hint matter less. The results (runtime in ms of Slide with 32x8 tiles over a 16384 x 16384 image) from OpenCL and CUDA now match quite closely:

CUDA     OpenCL
15.410   14.750

Translated into bandwidth (GB/s):

CUDA     OpenCL
129.78   135.59

@jrk
Member

jrk commented Dec 13, 2016

Sorry, just to be sure I follow: did the last tests change the entire pipeline and input/output data to operate on uint32_t instead of uint16_t?

That's not very satisfying, since we should still be able to get good performance using 16-bit values; it may just be necessary to do loads/stores via 2-vectors, for 32 contiguous bits per lane, to get peak bus utilization, as you showed in your SlideVector schedule earlier. If the PTX backend can't match OpenCL on that benchmark, we'd like to look into it more.

@darkbuck
Contributor

Yeah, I just replaced uint16_t with uint32_t. The motivation was just to figure out which direction to look into.

Unfortunately, loading/storing via 2-vectors requires alignment, i.e. <2 x i16> requires 4-byte alignment. Making that alignment assumption would require changing the image storage format.

SlideVector still has two loads of i16, but with the cache hint set so that the second one hits L1$.

@darkbuck
Contributor

The proposed API is requested in #1690.
