Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bindings for clEnqueueNDRangeKernel different to c++ version #21

Closed
zcaudate opened this issue Feb 8, 2018 · 8 comments
Closed

Bindings for clEnqueueNDRangeKernel different to c++ version #21

zcaudate opened this issue Feb 8, 2018 · 8 comments

Comments

@zcaudate
Copy link

zcaudate commented Feb 8, 2018

I'm looking to reuse opencl code for a sobel filter that I got working with opencv.

I'm getting a CL_INVALID_WORK_GROUP_SIZE when using a local worksize of anything other than [1,1]. The project is here and I'm also asking on the clojurecl repo.

I'm hoping to figure out why this is happening.

---- Project (image-bench:1) ----

image-bench.sobel-test

 Failure  [sobel_test.clj:22]
    Info  "running a grayscale kernel at [345, 912] [13, 3]"
    Form  (gray/grayscale img [13 3])
   Check  java.awt.image.BufferedImage
  Actual  clojure.lang.ExceptionInfo: OpenCL error: CL_INVALID_WORK_GROUP_SIZE. {:name "CL_INVALID_WORK_GROUP_SIZE", :code -54, :type :opencl-error, :details "queue"}

 Failure  [sobel_test.clj:39]
    Info  "running a sobel kernel at [345, 912] [13, 3]"
    Form  (sobel/sobel-invalid-work-group-error img [13 3])
   Check  java.awt.image.BufferedImage
  Actual  clojure.lang.ExceptionInfo: OpenCL error: CL_INVALID_WORK_GROUP_SIZE. {:name "CL_INVALID_WORK_GROUP_SIZE", :code -54, :type :opencl-error, :details "queue"}

 Summary (1)
   Files  1
   Facts  4
  Checks  4
  Passed  2
  Thrown  0

 Failed  (2)
@gpu
Copy link
Owner

gpu commented Feb 9, 2018

As already discussed in the issue that you linked to: It's hard to figure out where exactly in the stack of

image-bench -> clojurecl -> JOCL -> OpenCL

this error is introduced. As a first, very basic step, I created a (quick and dirty) test based on one of the JOCL Samples, and plugged in your kernel, and it basically seems to work with larger "local work sizes":

package org.jocl.test;

import static org.jocl.CL.*;

import java.awt.BorderLayout;
import java.awt.Graphics;
import java.awt.GridLayout;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;

import javax.imageio.ImageIO;
import javax.swing.*;

import org.jocl.*;

// Quick + Dirty test for https://github.com/gpu/JOCL/issues/21
// Based on the "Simple Image" example, using the kernel from 
// https://github.com/zcaudate-me/image-bench/blob/master/source/opencl/sobel.cl
public class JOCLSobelTest
{
    // Adjust these for your test
    private static final String KERNEL_FILE_NAME = "kernels/sobel.cl";
    private static final String IMAGE_FILE_NAME = 
        "src/main/resources/data/lena512color.png";
    private static final long[] LOCAL_WORK_SIZE = { 10, 10 };

    public static void main(String args[])
    {
        SwingUtilities.invokeLater(new Runnable()
        {
            @Override
            public void run()
            {
                new JOCLSobelTest();
            }
        });
    }

    /**
     * The input image
     */
    private BufferedImage inputImage;

    /**
     * The output image
     */
    private BufferedImage outputImage;

    /**
     * The OpenCL context
     */
    private cl_context context;

    /**
     * The OpenCL command queue
     */
    private cl_command_queue commandQueue;

    /**
     * The OpenCL kernel
     */
    private cl_kernel kernel;

    /**
     * The memory object for the input image
     */
    private cl_mem inputImageMem;

    /**
     * The memory object for the output image
     */
    private cl_mem outputImageMem;

    /**
     * The width of the image
     */
    private int imageSizeX;

    /**
     * The height of the image
     */
    private int imageSizeY;

    /**
     * Creates the JOCLSimpleImage sample
     */
    public JOCLSobelTest()
    {
        inputImage = createBufferedImage(IMAGE_FILE_NAME);
        imageSizeX = inputImage.getWidth();
        imageSizeY = inputImage.getHeight();

        outputImage = new BufferedImage(
            imageSizeX, imageSizeY, BufferedImage.TYPE_INT_RGB);

        // Create the panel showing the input and output images
        JPanel mainPanel = new JPanel(new GridLayout(1,0));
        JLabel inputLabel = new JLabel(new ImageIcon(inputImage));
        mainPanel.add(inputLabel, BorderLayout.CENTER);
        JLabel outputLabel = new JLabel(new ImageIcon(outputImage));
        mainPanel.add(outputLabel, BorderLayout.CENTER);

        // Create the main frame
        JFrame frame = new JFrame("JOCL Sobel Test");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setLayout(new BorderLayout());
        frame.add(mainPanel, BorderLayout.CENTER);
        frame.pack();
        frame.setVisible(true);

        initCL();
        initImageMem();
        updateImage();
    }



    /**
     * Initialize the OpenCL context, command queue and kernel
     */
    void initCL()
    {
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        
        // Check if images are supported
        int imageSupport[] = new int[1];
        clGetDeviceInfo (device, CL.CL_DEVICE_IMAGE_SUPPORT,
            Sizeof.cl_int, Pointer.to(imageSupport), null);
        System.out.println("Images supported: "+(imageSupport[0]==1));
        if (imageSupport[0]==0)
        {
            System.out.println("Images are not supported");
            System.exit(1);
            return;
        }

        // Create a command-queue
        System.out.println("Creating command queue...");
        long properties = 0;
        properties |= CL_QUEUE_PROFILING_ENABLE;
        properties |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
        commandQueue = clCreateCommandQueue(context, device, properties, null);

        String programSource = "";
        try
        {
            Path path = Paths.get(KERNEL_FILE_NAME);
            programSource = Files.readAllLines(path).stream()
                .collect(Collectors.joining("\n"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        
        // Create the program
        System.out.println("Creating program...");
        cl_program program = clCreateProgramWithSource(context,
            1, new String[]{ programSource }, null, null);

        // Build the program
        System.out.println("Building program...");
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        System.out.println("Creating kernel...");
        kernel = clCreateKernel(program, "sobel_uchar", null);

    }

    /**
     * Initialize the memory objects for the input and output images
     */
    private void initImageMem()
    {
        // Create the memory object for the input- and output image
        DataBufferInt dataBufferSrc =
            (DataBufferInt)inputImage.getRaster().getDataBuffer();
        int dataSrc[] = dataBufferSrc.getData();
        byte[] bytePixels = new byte[dataSrc.length];
        for (int i=0; i<dataSrc.length; i++)
        {
            bytePixels[i] = (byte)(dataSrc[i] & 0xFF);
        }
        inputImageMem = clCreateBuffer(context, 
            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
            imageSizeX * imageSizeY, Pointer.to(bytePixels), null);

        outputImageMem = clCreateBuffer(context, CL_MEM_READ_WRITE, 
            imageSizeX * imageSizeY, null, null);
    }


    /**
     * Rotate the input image by the given angle, and write it into
     * the output image
     *
     * @param angle The rotation angle
     */
    void updateImage()
    {
        // Set up the work size and arguments, and execute the kernel
        long globalWorkSize[] = new long[2];
        globalWorkSize[0] = imageSizeX;
        globalWorkSize[1] = imageSizeY;
        int a = 0;
        clSetKernelArg(kernel, a++, Sizeof.cl_mem, Pointer.to(inputImageMem));
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { 0 })); // not used
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { 0 })); // not used
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { imageSizeY })); 
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { imageSizeX }));
        clSetKernelArg(kernel, a++, Sizeof.cl_mem, Pointer.to(outputImageMem));
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { 0 })); // not used
        clSetKernelArg(kernel, a++, Sizeof.cl_int, Pointer.to(new int[] { 0 })); // not used

        
        clEnqueueNDRangeKernel(commandQueue, kernel, 2, LOCAL_WORK_SIZE,
            globalWorkSize, null, 0, null, null);

        // Read the pixel data into the output image
        DataBufferInt dataBufferDst =
            (DataBufferInt)outputImage.getRaster().getDataBuffer();
        int dataDst[] = dataBufferDst.getData();
        byte[] bytePixels = new byte[dataDst.length];
        clEnqueueReadBuffer(commandQueue, outputImageMem, CL_TRUE, 0, 
            imageSizeX * imageSizeY, Pointer.to(bytePixels), 0,  null,  null);
        for (int i=0; i<dataDst.length; i++)
        {
            dataDst[i] = bytePixels[i];
        }
    }
    
    
    /**
     * Creates a BufferedImage of with type TYPE_INT_RGB from the
     * file with the given name.
     *
     * @param fileName The file name
     * @return The image, or null if the file may not be read
     */
    private static BufferedImage createBufferedImage(String fileName)
    {
        BufferedImage image = null;
        try
        {
            image = ImageIO.read(new File(fileName));
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return null;
        }

        int sizeX = image.getWidth();
        int sizeY = image.getHeight();

        BufferedImage result = new BufferedImage(
            sizeX, sizeY, BufferedImage.TYPE_INT_RGB);
        Graphics g = result.createGraphics();
        g.drawImage(image, 0, 0, null);
        g.dispose();
        return result;
    }
    
}

The output is probably not the desired one...

sobel01

... but I think the kernel is still preliminary, and in any case, this is unrelated to the original issue.

Sorry, this may not immediately help you, but may be a first step to narrow down the search space...

@zcaudate
Copy link
Author

@gpu: wow. thanks for the example. the output is a little weird but it shows that it's working and way better than all zeros.

Let me verify this on my end and get back to you.

@zcaudate
Copy link
Author

I'm running this example on a Macbook Pro OSX 10.13.2. In order to get the code working, I've had to disable:

// properties |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;

otherwise this happens:

Images supported: true
Creating command queue...
CLException CL_INVALID_VALUE  org.jocl.CL.checkResult (CL.java:808)

In the example, for a LOCAL_WORK_SIZE of {1, 1} I'm getting this:

screen shot 2018-02-10 at 8 47 26 am

In the example, for a LOCAL_WORK_SIZE of {10, 10} I'm getting this:

Images supported: true
Creating command queue...
Creating program...
Building program...
Creating kernel...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000131303c90, pid=2604, tid=0x0000000000007403
#
# JRE version: Java(TM) SE Runtime Environment (8.0_152-b16) (build 1.8.0_152-b16)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.152-b16 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [cl_kernels+0xc90]  _target_.sobel_uchar_wrapper+0x970
#

@gpu
Copy link
Owner

gpu commented Feb 11, 2018

I just tried it out on my...

  • NVIDIA Platform with GPU device
  • AMD platform with GPU device
  • AMD platform with CPU device

and also encountered a crash for the last one. However, this crash might in fact be unrelated to the local work size: Some write operations are not supported in older OpenCL versions. See https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/restrictions.html :

Built-in types that are less than 32-bits in size i.e. char, uchar, char2, uchar2, short, ushort, and half have the following restriction:
Writes to a pointer (or arrays) of type char, uchar, char2, uchar2, short, ushort, and half or to elements of a struct that are of type char, uchar, char2, uchar2, short, and ushort are not supported.

And you are writing to such a pointer. So if you're running this on an "old" platform, arbitrary things may go wrong...

Can you run a device query on your platform?

(Note: This is still a guess. If you edit your kernel to omit all write operations, then I'm pretty sure it would "work", which would (not confirm, but at least) be a hint that it might not the binding or launch configuration or work sizes, but indeed the fact that you're doing invalid write operations...)

@zcaudate
Copy link
Author

zcaudate commented Feb 12, 2018

@gpu: okay, I've found the mistake.

I'd always thought that the GPU was the first device on the platform and so I was making the call:

 (first (cl/devices (first (cl/platforms))))

which, I realised only after running the device query that I was selecting the CPU

(map info/name-info (cl/devices (first (cl/platforms))))
=> ("Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz"
    "Intel(R) Iris(TM) Graphics 540")

Selecting the GPU works fine, although for an image of 500x500, [8 8] workgroup still does not work but [10 10] does.

@zcaudate
Copy link
Author

zcaudate commented Feb 12, 2018

Is there a way to determine the type of device that is available to be selected?

Like as in - select the gpu first, or else select the cpu.

@gpu
Copy link
Owner

gpu commented Feb 12, 2018

Good to hear that it is (basically) resolved.

For the device type, there are two options:

  1. You could query the device type, using clDeviceGetInfo and CL_DEVICE_TYPE

  2. You could list only GPU devices in the first place. When the devices are listed, the list can basically be "filtered" based on the type.

For the latter, you may have a look at the

final long deviceType = CL_DEVICE_TYPE_ALL;

in the example above. (This could be changed to CL_DEVICE_TYPE_GPU. Not sure how this translates to Clojure, though...)

The fact that it does not work for (500,500)/(8,8) is then more likely due to the divisibility issue. I think your kernel is already prepared to handle that, basically: You can add a "padding" for the image, i.e. use a global work size that is larger than the image, and just do a bounds check at the beginning of the kernel. (This may be particularly relevant when the size of the image is a prime number, and no sensible local work size except for (1,1) can be found).

BTW, it may be worth mentioning here: When passing in null as the local work size, then the OpenCL implementation is free to choose the local work size that it finds appropriate. (This, of course, only makes sense when you don't have shared memory that depends on the local work size - and it still does not solve the potential issue of the global size being a prime number...)

@zcaudate
Copy link
Author

@gpu: Thanks so much for your help and patience. I'll close this issue and will experiment a bit more with the above info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants