Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU schedule for Blur (targeted GTX 970) #1690

Merged
merged 1 commit into from Dec 18, 2016
Merged

Add GPU schedule for Blur (targeted GTX 970) #1690

merged 1 commit into from Dec 18, 2016

Conversation

darkbuck
Copy link
Contributor

Add GPU schedules based on different strategy for blur.

@abadams
Copy link
Member

abadams commented Dec 14, 2016

I'm a fan of improvements to the blur schedule, but I disagree with the new gpu_tile overloads. I think they do too much in one call. I'd prefer people just say .tile(x, y, xi, yi, 2, 2).gpu_tile(x, y, 8, 8) to get that kind of nested tiling. I also think that you typically want the thread axis to be the innermost (apart from the vectorization) to get dense loads. So for a 16-bit type the sort of schedule I'd use is something like:

.vectorize(x, 2)
.tile(x, y, xi, yi, 32, 2)
.tile(x, y, xo, yo, 2, 8)
.gpu(x, y, xi, yo)

That says (or at least is intended to say) that each thread does a 2x2 block of work, the work done by one thread is adjacent in y, but separated by 32x2 elements (one full cache line) in x. E.g. thread tx, ty in block 0, 0 takes care of four pairs of elements: ([2tx, 2tx+1], 2ty) ([2tx+64, 2tx+65], 2ty) ([2tx, 2tx+1], 2ty+1) ([2tx+64, 2tx+65], 2ty+1) I think that makes all the loads and stores dense across the warp.

@darkbuck
Copy link
Contributor Author

I see your pointer to schedule innermost outwards while most CPU schedule is made outermost inwards. That will make the current GPU interface more straight-forward. That's fine as that's just short-cut and won't matter too much as long as people figure out the difference between optimal GPU and CPU schedules.

I rewrote the schedule following your point. They achieve the same code generation and performance. Shall we merge the GPU schedule example for blur?

// blur_x calculation is re-used implicitly. This achieves
// the similar schedule of sliding window.
Var yi("yi");
blur_y.split(y, y, yi, tile_y);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our usual convention for chaining scheduling calls across multiple lines is to not repeat the Func name:

blur_y.split(...)
    .reorder(...)
    .unroll(...)
    .gpu_tile(...);

@abadams
Copy link
Member

abadams commented Dec 15, 2016

Yeah, looks good. Just one style nit.

@darkbuck
Copy link
Contributor Author

Sure, style is revised.

@darkbuck darkbuck changed the title Add GPU nested tiling support Add GPU schedule for Blur (targeted GTX 970) Dec 17, 2016
@abadams abadams merged commit 16e771d into halide:master Dec 18, 2016
@darkbuck darkbuck deleted the darkbuck/master/gpu-blur branch January 25, 2017 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants