Are there OpenCL/GPU schedules for the blur app? #1568
Comments
If you set the HL_DEBUG_CODEGEN environment variable to 1 or higher, the generated code will be printed as it is compiled.

-Z-
|
There isn't a GPU schedule for it currently. I think a decent GPU schedule would be blur_y.gpu_tile(x, y, 8, 8); and then possibly also blur_x.compute_at(blur_y, Var::gpu_blocks).gpu_threads(x, y); but uint16 math kinda sucks on GPUs, so you might want to change it to uint32.
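For reference, the pipeline being scheduled is the separable 3x3 box blur from Halide's apps/blur: a horizontal 3-tap average (blur_x) followed by a vertical one (blur_y). A plain-Python sketch of its semantics (illustration only; the real pipeline is written in Halide C++, and boundary handling is omitted here):

```python
def blur_x(img):
    # Horizontal 3-tap average; shrinks width by 2 (no boundary handling).
    return [[(row[x - 1] + row[x] + row[x + 1]) // 3
             for x in range(1, len(row) - 1)]
            for row in img]

def blur_y(img):
    # Vertical 3-tap average over the blur_x result; shrinks height by 2.
    return [[(img[y - 1][x] + img[y][x] + img[y + 1][x]) // 3
             for x in range(len(img[0]))]
            for y in range(1, len(img) - 1)]

def blur(img):
    return blur_y(blur_x(img))

# A constant image stays constant under an averaging blur.
out = blur([[9] * 5 for _ in range(5)])  # 3x3 result, all 9s
```

The scheduling question in this thread is only about where blur_x is computed relative to blur_y, not about these semantics.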
|
On a GPU I'd guess with 95% certainty that you just want to let blur_x inline into blur_y. Blur is generally a very uninteresting workload, especially for a GPU; it's almost entirely limited by memory bandwidth.
|
**GPU Schedule Study**

Just curious how those two schedules perform on a GPU, I studied them on an NVIDIA GTX 970.

Here's the same performance presented as a data transfer rate, so that we can compare it against the hardware's peak memory bandwidth.
The bandwidth is roughly calculated as:
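The formula itself did not survive the formatting, but the numbers in the CUDA table quoted later in the thread are consistent with counting one uint16 read of the input plus one uint16 write of the output per pixel, for a 6400x4800 image, divided by runtime and reported in GiB. A sketch of that reconstruction (the image size, which matches Halide's apps/blur output, and the GiB units are my assumptions):

```python
# Hypothetical reconstruction of the bandwidth estimate: one uint16 read
# plus one uint16 write per pixel, over an assumed 6400x4800 image,
# divided by the measured runtime, reported in GiB/s.
WIDTH, HEIGHT = 6400, 4800
BYTES_PER_PIXEL = 2  # uint16

def bandwidth_gib_s(runtime_s):
    bytes_moved = 2 * BYTES_PER_PIXEL * WIDTH * HEIGHT  # read + write
    return bytes_moved / runtime_s / 2**30

# Spot-check against a measured pair from the CUDA table below:
# Inline at 32x8 ran in 0.002519 s and was reported as 45.43 GB/s.
print(round(bandwidth_gib_s(0.002519), 2))  # prints 45.43
```

The same formula also reproduces the Slide 64x8 entry (0.001077 s -> 106.26 GB/s), which suggests only compulsory traffic is being counted, not redundant loads.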
GTX 970 has a theoretical memory bandwidth of up to 192 GB/s, but it's very difficult to get close to that in practice.

**Proposal on GPU schedule interface change**

Since putting more work per GPU thread is a well-known optimization, shall we add an interface for it?
which could be implemented as
With that,
and
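The payoff of more work per thread can be sketched with a rough cost model (my own illustration, not the proposed API): a thread producing N consecutive outputs in y needs N + taps - 1 rows of the intermediate blur_x, and with a sliding window each row is computed only once, so the per-output cost of blur_x falls toward 1 as N grows.

```python
# Rough cost model for work-per-thread with a sliding window: a thread
# producing N consecutive outputs in y touches N + taps - 1 rows of the
# 3-tap intermediate (blur_x), each computed once and reused.
def blur_x_evals_per_output(outputs_per_thread, taps=3):
    total_rows = outputs_per_thread + taps - 1
    return total_rows / outputs_per_thread

print(blur_x_evals_per_output(1))  # 3.0  (recomputed for every output)
print(blur_x_evals_per_output(8))  # 1.25 (amortized by the sliding window)
```

This is consistent with the Slide schedule beating Inline and Cache in the measurements above.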
|
Here's the performance of CUDA (the previous results were from OpenCL):

| Schedule | Runtime(s), 32x8 | Runtime(s), 64x8 |
| --- | --- | --- |
| Inline | 0.002519 | 0.002601 |
| Cache | 0.001777 | 0.001985 |
| Slide | 0.001423 | 0.001077 |
| SlideVector | 0.001678 | 0.001247 |

| Schedule | Bandwidth(GB/s), 32x8 | Bandwidth(GB/s), 64x8 |
| --- | --- | --- |
| Inline | 45.43 | 44.00 |
| Cache | 64.40 | 57.65 |
| Slide | 80.42 | 106.26 |
| SlideVector | 68.20 | 91.77 |

Note: the performance difference between OpenCL and CUDA is mainly due to the different backends used. For OpenCL we only generate OpenCL C source code and the full NVIDIA compiler is used, while for CUDA we use the open-source PTX backend in LLVM. |
Might be interesting to dig into this a bit and see what the differences are. The claim is that LLVM's PTX backend is getting a lot more competitive with nvcc.

-Z-

On Tue, Dec 13, 2016 at 2:44 PM, darkbuck ***@***.***> wrote:
> Here's the performance of CUDA (the previous results were from OpenCL):
>
> | Schedule | Runtime(s), 32x8 | Runtime(s), 64x8 |
> | --- | --- | --- |
> | Inline | 0.002519 | 0.002601 |
> | Cache | 0.001777 | 0.001985 |
> | Slide | 0.001423 | 0.001077 |
> | SlideVector | 0.001678 | 0.001247 |
>
> | Schedule | Bandwidth(GB/s), 32x8 | Bandwidth(GB/s), 64x8 |
> | --- | --- | --- |
> | Inline | 45.43 | 44.00 |
> | Cache | 64.40 | 57.65 |
> | Slide | 80.42 | 106.26 |
> | SlideVector | 68.20 | 91.77 |
>
> Note: the performance difference between OpenCL and CUDA is mainly due to the different backends used. For OpenCL we only generate OpenCL C source code and the full NVIDIA compiler is used, while for CUDA we use the open-source PTX backend in LLVM.
|
The major issue with the PTX backend is the cache hint used in loads. That's quite different from the one generated by NVIDIA's compiler; the PTX backend in LLVM doesn't exploit that hint at all. |
Example blur in Halide uses
Translated into bandwidth (GB/s):
|
Sorry, to be sure I follow: the last tests change the entire pipeline and the input/output data to operate on uint32? That's not very satisfying, since we should still be able to get good performance using 16-bit values; it may just be necessary to do loads/stores via 2-vectors, for 32 contiguous bits per lane, to get peak bus utilization, as you showed in your SlideVector schedule. |
Yeah, I just replaced uint16_t with uint32_t. The motivation was just to figure out which direction to look into. Unfortunately, load/store via 2-vector requires alignment, i.e. <2 x i16> requires 4-byte alignment, and making that alignment assumption would require changing the image storage format. SlideVector still has two loads of i16, but with the cache hint set so that the second one hits the L1 cache. |
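A back-of-envelope for why 32 contiguous bits per lane matters for bus utilization, under a simplified model where each warp-wide load maps onto 128-byte memory transactions (the transaction size is my assumption, typical of NVIDIA GPUs of that era, not a number from this thread):

```python
# Simplified bus-utilization model: a warp of 32 lanes issues one
# contiguous, aligned load; traffic rounds up to 128-byte transactions.
WARP_SIZE = 32
TRANSACTION_BYTES = 128

def utilization(bytes_per_lane):
    requested = WARP_SIZE * bytes_per_lane
    transactions = -(-requested // TRANSACTION_BYTES)  # ceiling division
    return requested / (transactions * TRANSACTION_BYTES)

print(utilization(2))  # 0.5: scalar uint16 loads use half a transaction
print(utilization(4))  # 1.0: <2 x i16> or uint32 loads fill it
```

Under this model a warp of scalar uint16 loads requests only 64 of the 128 bytes moved, while pairing lanes up to 32 bits fills the transaction, which is the intuition behind the 2-vector suggestion above.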
The proposed API is requested in #1690. |
Are there GPU schedules for blur? I want to generate the OpenCL version of the blur app. By the way, can I get the OpenCL source code of blur?