
optimization: support for 2-D/3-D arrays with strides >= for better memory bandwidth utilization #70

Open · nevion opened this issue Jan 10, 2015 · 2 comments

nevion commented Jan 10, 2015

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173

You can see that, depending on the graphics card model, there are stride values to avoid - funnily enough, often strides that are powers of 2, which are fairly common numbers. If you ignore this, you under-utilize the memory channels available in hardware and increase bank conflicts, so the hit is twofold. The fix is simply to pad the strides slightly to avoid these conflicts (the bad values are found in vendor/model documentation). This should be of use to any GPU implementation and likely some accelerators. It should not be done when the device type is CPU, or in code that does vector operations directly on memory where the hardware has alignment restrictions.

As such, I tried to use a padded array with pyopencl, but manipulation/printing/inspection don't work out very well:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as clarray

    dtype_of = np.dtype  # helper to normalize dtype arguments

    def make_cl_image_buffer(queue, img_dims, img_dtype, linestep=None):
        img_dtype = dtype_of(img_dtype)
        if linestep is None:
            # packed row pitch in bytes
            linestep = img_dims[1] * img_dtype.itemsize
            if queue.device.type == cl.device_type.GPU:
                # make sure to have an uneven line width for better memory
                # channel utilization, as per AMD recommendation
                if linestep % 2 == 0:
                    linestep += 1
        buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE,
                        img_dims[0] * linestep)
        return buf, linestep

    def make_cl_image(queue, img_dims, img_dtype, linestep=None):
        img_dtype = dtype_of(img_dtype)
        img_buf, linestep = make_cl_image_buffer(queue, img_dims,
                                                 img_dtype, linestep)
        # (row pitch in bytes, element size in bytes)
        strides = (linestep, img_dtype.itemsize)
        img = clarray.Array(queue, img_dims, img_dtype,
                            strides=strides, data=img_buf)
        return img


In [3]: img=com.make_cl_image(queue, (2044, 2044), np.uint32)

In [12]: img.shape
Out[12]: (2044, 2044)

In [13]: img.strides
Out[13]: (8177, 4)
In [14]: img.get()

    /home/jason/tmp/pyopencl.git/pyopencl/array.py in get(self, queue, ary, async)
        685                 raise TypeError("'ary' has non-matching type")
        686
    --> 687         assert self.flags.forc, "Array in get() must be contiguous"
        688
        689         if self.size:

    AssertionError: Array in get() must be contiguous

I searched around and found someone else curious about this last year - he also put some work into this: #54

I wouldn't mind adding some lines to the library to get good, no-surprises manipulation, but I was unsure of the best way to do it, since it seems everything relies on the element-wise kernels. I think the best/most compatible way of doing this is to have the element-wise kernels take an additional stride parameter, which they use to calculate the offset of the element they process.
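For illustration, here is roughly what I mean - a hand-rolled strided variant of an element-wise operation. The kernel name `scale_strided` and the scaling itself are made up for this sketch; the point is the `row_pitch` parameter and the byte-offset arithmetic (it also assumes the pitch keeps elements aligned to their size):

    import numpy as np
    import pyopencl as cl

    # strided element-wise kernel: each work-item maps its flat global id
    # to (row, col) and uses the byte row pitch to address its element
    SRC = """
    __kernel void scale_strided(__global uchar *base, ulong row_pitch,
                                uint rows, uint cols, float alpha)
    {
        uint gid = get_global_id(0);
        if (gid >= rows * cols)
            return;
        uint r = gid / cols;
        uint c = gid % cols;
        __global float *elem = (__global float *)(base + r * row_pitch) + c;
        *elem = alpha * *elem;
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, SRC).build()

    # launched over the logical element count; scalar arguments go in as
    # sized numpy types, e.g.:
    # prg.scale_strided(queue, (rows * cols,), None, buf,
    #                   np.uint64(row_pitch), np.uint32(rows),
    #                   np.uint32(cols), np.float32(2.0))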

What do you think? Is it possible for you to add this? If you don't have the time, can you outline the way to get it done?

PS: I tried bringing this up for discussion on the mailing list, but it got stuck in moderation.

inducer (Owner) commented Jan 18, 2015

PyOpenCL's arrays are, for now, restricted to being contiguous for most operations.

To my mind, the correct way to implement operations on multi-dimensional arrays would use loopy.
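For reference, a minimal loopy kernel (following the loopy tutorial) looks like this; loopy generates the index arithmetic from the loop domain and the argument declarations, so non-contiguous layouts become its business rather than hand-written kernel code. How non-default strides get declared for the arguments is a separate question:

    import loopy as lp

    # element-wise kernel over a 2-D domain, as in the loopy tutorial;
    # the loop domain is given in ISL set notation
    knl = lp.make_kernel(
        "{ [i, j]: 0<=i<n and 0<=j<m }",
        "out[i, j] = 2*a[i, j]")

    # called against pyopencl arrays, e.g.:
    # evt, (out,) = knl(queue, a=a_dev)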

nevion (Author) commented Jan 18, 2015

loopy is a cool thing to have learned about - I see libraries like that pop up from academic labs in a few different contexts.

But what we're talking about on the kernel side probably requires no code changes to anything that already takes in pitches/strides in bytes - i.e. it assumes an unpacked layout. If the stride is a bad value for certain GPUs, then one would want to set it >= the packed value, and this can greatly change memory bandwidth utilization / bank conflicts.

We just need a little more support to make allocating those arrays not too painful - of course, it's best if this works on everything that is currently supported. The most important things are (a sketch follows the list):

- get matrix values
- set values from or to a packed numpy array or another clarray
- set to a constant value
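For the get/set cases, something like the following might already work today; a minimal sketch, assuming pyopencl's rectangular-copy form of enqueue_copy (which maps to clEnqueueReadBufferRect) and a buffer that starts at offset 0:

    import numpy as np
    import pyopencl as cl

    def get_pitched(queue, buf, shape, dtype, linestep):
        # copy a pitched 2-D region out of `buf` into a packed numpy
        # array via the rectangular-copy form of enqueue_copy
        dtype = np.dtype(dtype)
        out = np.empty(shape, dtype)
        row_bytes = shape[1] * dtype.itemsize
        cl.enqueue_copy(
            queue, out, buf,
            region=(row_bytes, shape[0], 1),  # (width in bytes, rows, slices)
            buffer_origin=(0, 0, 0),
            host_origin=(0, 0, 0),
            buffer_pitches=(linestep, 0),     # device row pitch in bytes
            host_pitches=(row_bytes, 0),      # packed host layout
            is_blocking=True)
        return out

The set direction would be the same call with the numpy array as source and the buffer as destination, and filling with a constant could go through cl.enqueue_fill_buffer on OpenCL 1.2 devices (pad bytes included).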
