How to use single GPU thread to handle a tile #3114

Open · sumriver8 opened this issue Jul 10, 2018 · 4 comments
sumriver8 commented Jul 10, 2018

Hello,

For a mobile GPU, if I want to set the tile size to 8x8 and use a single thread to handle each tile, how can I do that with Halide? Alternatively, how can I set the work-group count to one but still use multiple GPU threads? Such a setting is very useful on mobile GPUs.

I know this is easy with plain OpenCL, but with Halide, gpu_tile does not seem to meet my requirement: it uses multiple GPU threads to handle each tile, and each tile is a work group.

Thanks,
-Owen

abadams (Member) commented Jul 10, 2018
I'm not 100% sure what you're asking, because I'm used to CUDA terminology, but you should be able to get any combination of sizes using Func::tile calls plus direct calls to Func::gpu_blocks and Func::gpu_threads to mark which dimensions map to blocks and which to threads. E.g. the following gives you 8x8 thread blocks with one thread per block:

f.tile(x, y, xi, yi, 8, 8).gpu_blocks(x, y);

So that one thread will do a little serial 8x8 loop inside its lonely thread block.

One thread block with an 8x8 group of threads iterating over the image serially would be something like:

f.tile(x, y, xi, yi, 8, 8).gpu_threads(xi, yi).gpu_blocks(Var::outermost);

Var::outermost is a synthetic variable which is a dummy outermost loop of size 1. It can be useful for marking device transitions that don't actually have loops associated with them. You could equivalently do something like:

f.tile(x, y, xi, yi, width, height).tile(xi, yi, xii, yii, 8, 8).gpu_blocks(x, y).gpu_threads(xii, yii);

where "width" and "height" are Exprs that equal the output size.

I'd suggest turning on HL_DEBUG_CODEGEN=1 and inspecting the pseudocode generated by things like the above.


sumriver8 commented Jul 10, 2018
Thanks, Andrew, for the very detailed help.

On an ARM Mali GPU, the one below works for me, and the result is what I am looking for:
f.tile(x, y, xi, yi, width, height).tile(xi, yi, xii, yii, 8, 8).gpu_blocks(x, y).gpu_threads(xii, yii);

The other solution, "f.tile(x, y, xi, yi, 8, 8).gpu_threads(xi, yi).gpu_blocks(Var::outermost);", throws a compile error: "error: no matching function for call to ‘Halide::Func::gpu_blocks(Halide::Var (&)())"


abadams (Member) commented Jul 10, 2018
Oops, I think outermost is a function that returns the magic variable, not the variable itself:

f.tile(x, y, xi, yi, 8, 8).gpu_threads(xi, yi).gpu_blocks(Var::outermost());

but if the other thing works for you, that sounds good to me.


sumriver8 commented Jul 10, 2018
Thanks, this one compiles without error, and it is the clearer option for single work-group dispatch, which is important on mobile GPUs.

