Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

./lib/llvmopencl/Kernel.cc:129: pocl::ParallelRegion* pocl::Kernel::createParallelRegionBefore(llvm::BasicBlock*): Assertion `region_entry_barrier != NULL' failed #1435

Closed
picca opened this issue Mar 11, 2024 · 14 comments
Labels
Kernel compiler Regression A problem that wasn't there with an older commit.
Milestone

Comments

@picca
Copy link

picca commented Mar 11, 2024

Hello, I am affected by this bug while preparing the Debian package of silx

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1060318

reading a bunch of your issues I found that adding

POCL_WORK_GROUP_METHOD=cbs helps saolve the issue or at leat dos not trigger the core dump.

so my question is what is the right way to solve this issue ?

thanks for considering

@pjaaskel
Copy link
Member

Hi. If you can create a reproducer in C/C++ with as small kernel as possible that causes it, it wold be helpful in trying to tackle the underlying problem. Meanwhile you can use the CBS for this test case.

@picca
Copy link
Author

picca commented Mar 11, 2024

Here some information

silx-kit/silx#4073

It seems that this is a llvm issue...

llvm 15 ok
llvm 16 ko

@kif
Copy link
Contributor

kif commented Mar 12, 2024

I simplified the culprit kernel to a certain extent ... This does not do any sensible thing anymore but crashes the compiler:

__kernel void medfilt2d(__global float *image,  // input image
                        __global float *result, // output array
                        __local  float4 *l_data,// local storage 4x the number of threads
                                 int khs1,      // Kernel half-size along dim1 (nb lines)
                                 int khs2,      // Kernel half-size along dim2 (nb columns)
                                 int height,    // Image size along dim1 (nb lines)
                                 int width)     // Image size along dim2 (nb columns)
{
    int threadid = get_local_id(0);
    int x = get_global_id(1);

    if (x < width)
    {
        union
        {
            float  ary[8];
            float8 vec;
        } output, input;
        input.vec = (float8)(MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT);
        int kfs1 = 2 * khs1 + 1; 
        int kfs2 = 2 * khs2 + 1;
        int nbands = (kfs1 + 7) / 8; 
        for (int y=0; y<height; y++)
        {
            //Select only the active threads, some may remain inactive
            int nb_threads =  (nbands * kfs2);
            int band_nr = threadid / kfs2;
            int band_id = threadid % kfs2;
            int pos_x = clamp((int)(x + band_id - khs2), (int) 0, (int) width-1);
            int max_vec = clamp(kfs1 - 8 * band_nr, 0, 8);
            if (y == 0)
            {
                for (int i=0; i<max_vec; i++)
                {
                    if (threadid<nb_threads)
                    {
                        int pos_y = clamp((int)(y + 8 * band_nr + i - khs1), (int) 0, (int) height-1);
                        input.ary[i] = image[pos_x + width * pos_y];
                    }
                }
            }
            else
            {
                //store storage.s0 to some shared memory to retrieve it from another thread.
                l_data[threadid].s0 = input.vec.s0;

                //Offset to the bottom
                input.vec = (float8)(input.vec.s1,
                        input.vec.s2,
                        input.vec.s3,
                        input.vec.s4,
                        input.vec.s5,
                        input.vec.s6,
                        input.vec.s7,
                        MAXFLOAT);

                barrier(CLK_LOCAL_MEM_FENCE);

                int read_from = threadid + kfs2;
                if (read_from < nb_threads)
                    input.vec.s7 = l_data[read_from].s0;
                else if (threadid < nb_threads) //we are on the last band
                {
                    int pos_y = clamp((int)(y + 8 * band_nr + max_vec - 1 - khs1), (int) 0, (int) height-1);
                    input.ary[max_vec - 1] = image[pos_x + width * pos_y];
                }

            }

            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}

It is the last barrier which is the cause of the crash.

If one removes the if (y==0) or the else, they do compile.

@pjaaskel
Copy link
Member

And this compiles with later LLVMs, but not older than 15?

@kif
Copy link
Contributor

kif commented Mar 12, 2024

This compiles with LLVM15 and elder (as old as we remember ... the code is from 2017) but it fails with LLVM16.

@anbe42
Copy link
Contributor

anbe42 commented Mar 12, 2024

This is unrelated to the LLVM version being used.
In Debian bookworm we have PoCL 3.1 built against LLVM 15 not showing the bug.
In Debian sid we have PoCL 5.0 built against LLVM 16 showing the bug.
I just rebuilt PoCL 5.0 locally against LLVM 15 and it still shows the bug.

@anbe42
Copy link
Contributor

anbe42 commented Mar 12, 2024

I bisected it to commit b165b29 "LLVM 17 support", unfortunately that has
77 files changed, 4410 insertions(+), 2341 deletions(-)
I just rebuilt PoCL 5.0 against LLVM 17 locally and that still shows the bug.

@pjaaskel pjaaskel added Kernel compiler Regression A problem that wasn't there with an older commit. labels Mar 12, 2024
@anbe42
Copy link
Contributor

anbe42 commented Mar 12, 2024

minimized nonsensical kernel to reproduce the problem (reduced with cvise and some manual postprocessing)

__kernel void a(__local float2 *b) {
  struct {
    int c[1];
    float2 d;
  } e;
  for (int f = 0; f < 2; f++) {
    if (f)
      for (int g; g < b; g++)
        e.c[g] = 0;
    else if (b)
      e.d.s0 = b[0].s0;
    barrier(0);
  }
}

@pjaaskel
Copy link
Member

pjaaskel commented Mar 13, 2024

Thanks! Very helpful. Also a great pointer to the cvise tool - looks very useful. @franz this is likely something to do with the moving to the new PM by default? Do we still have the ability to run with the old PM to test?

@pjaaskel pjaaskel added this to the 5.1/6.0 milestone Mar 13, 2024
@franz
Copy link
Contributor

franz commented Mar 13, 2024

Since SVMOffset pass was added without support for legacy PM, main branch does not support the legacy PM anymore.

PoCL 5.0 should be able to run with old PM, it's enough to change MIN_LLVM_NEW_PASSMANAGER in config.h.in.cmake to some higher number (this will only work up to LLVM 16).

@pjaaskel
Copy link
Member

The SVMOffset pass is not used in the client-side kernel compilation pipeline, but only called from command line from the pocld server side, so it shouldn't affect this.

franz added a commit to franz/pocl that referenced this issue Apr 4, 2024
franz added a commit to franz/pocl that referenced this issue Apr 4, 2024
@franz
Copy link
Contributor

franz commented Apr 4, 2024

@picca @kif - could any of you test the "real" kernel in silx with this branch - it should fix the crash, but i'm interested if it produces correct results.

@kif
Copy link
Contributor

kif commented Apr 5, 2024 via email

franz added a commit to franz/pocl that referenced this issue Apr 12, 2024
franz added a commit to franz/pocl that referenced this issue Apr 12, 2024
franz added a commit to franz/pocl that referenced this issue Apr 15, 2024
@franz
Copy link
Contributor

franz commented Apr 16, 2024

fixed in main branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kernel compiler Regression A problem that wasn't there with an older commit.
Projects
None yet
Development

No branches or pull requests

5 participants