[OpenMP] Inner reduction crashes at runtime #66708

Open · doru1004 opened this issue Sep 18, 2023 · 2 comments

doru1004 commented Sep 18, 2023

In the following example:

#include <omp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_N (64 * 1024 * 4)

void reset_input(double *a, double *a_h, double *b, double *c) {
  for(int i = 0 ; i < MAX_N ; i++) {
    a[i] = a_h[i] = 0;
    b[i] = i;
    c[i] = i;
  }
}

int main(int argc, char *argv[]) {
  double * a = (double *) malloc(MAX_N * sizeof(double));
  double * a_h = (double *) malloc(MAX_N * sizeof(double));
  double * b = (double *) malloc(MAX_N * sizeof(double));
  double * c = (double *) malloc(MAX_N * sizeof(double));

  int n = MAX_N;
#pragma omp target enter data map(to:a[:MAX_N],b[:MAX_N],c[:MAX_N])

  reset_input(a, a_h, b, c);

#pragma omp target update to(a[:n],b[:n],c[:n])
#pragma omp target teams distribute parallel for
  for (uint64_t sample=0; sample < 64 * 1024; sample++) {
    double partial_sum = 0.0;
    #pragma omp parallel for reduction(+:partial_sum)
    for (int i = 0; i < 4; ++i) {
      partial_sum += b[sample * 4 + i] + c[sample * 4 + i];
    }
    a[sample * 4] = partial_sum;
  }

  for (int i = 0; i < 64 * 1024; ++i) {
    double p_sum = 0.0;
    for (int j = 0; j < 4; ++j) {
      p_sum += b[4*i + j] + c[4*i + j];
    }
    a_h[i*4] = p_sum;
  }
#pragma omp target update from(a[:n])

  for (int i = 0; i < MAX_N; ++i) {
    if (a_h[i] != a[i]) {
      printf("A Error at n = %d, i = %d: host = %f, device = %f\n", n, i, a_h[i], a[i]);
      return 1;
    }
  }
  printf("Succeeded\n");
  return 0;
}

the inner reduction crashes at runtime.

Compile line:

clang++ -O3 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a test.cpp -o test -save-temps

The postlink LLVM IR looks fine. The postopt LLVM IR looks suspicious: function bodies are replaced with `unreachable`, yet the functions are still called from the kernel. This could well be a red herring, and the real error could be elsewhere, but it is a plausible starting point. At first glance it looks like an issue in the OpenMPOpt pass.
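To make that observation concrete, here is a rough C++ analogue of the suspicious pattern (illustrative only; the actual post-opt IR contains outlined `__omp_outlined__*` device functions whose bodies have been folded to an `unreachable` terminator, not source-level code like this):

// Illustrative analogue of the post-opt IR described above: the optimizer
// has concluded the outlined parallel body can never execute and folded it
// away, yet the kernel still calls it. Reaching the call at runtime is
// undefined behavior and crashes in practice.
static void outlined_parallel_body() {
  __builtin_unreachable(); // stand-in for a body reduced to `unreachable`
}

void kernel_entry() {
  outlined_parallel_body(); // the kernel still calls the folded function
}

The postlink/postopt bitcode that `-save-temps` leaves behind can be disassembled with `llvm-dis` to check for this pattern.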

llvmbot (Collaborator) commented Sep 19, 2023

@llvm/issue-subscribers-openmp


josemonsalve2 (Contributor) commented:

@rodrigo-ceccato This may be of interest to you.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 1, 2023
If we update the state, or indicate a pessimistic fixpoint, we need to
consider NestedParallelism too.

Fixes llvm#66708

That said, the reproducer still needs `malloc`, which we don't support on AMD GPUs.
jdoerfert added a commit that referenced this issue Oct 21, 2023
If we update the state, or indicate a pessimistic fixpoint, we need to
consider NestedParallelism too.

Fixes part of #66708

That said, the reproducer still needs `malloc`, which we don't support on AMD GPUs; support will be added later.
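
For readers skimming the thread, this is the shape the fix is about, distilled from the reproducer above (a minimal sketch, not code from the patch):

#include <stdint.h>

// A `parallel for` reduction nested inside a `target teams distribute
// parallel for` region: the nested-parallelism case that the OpenMPOpt
// state tracking must not lose.
void nested_reduction(double *out, const double *in, int64_t n) {
#pragma omp target teams distribute parallel for map(to : in[0 : n * 4]) map(from : out[0 : n])
  for (int64_t s = 0; s < n; ++s) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum) // the nested (inner) parallel region
    for (int i = 0; i < 4; ++i)
      sum += in[s * 4 + i];
    out[s] = sum;
  }
}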
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 21, 2023
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should revisit this choice; for now, this allocator is better than none.
It traps if it runs out of memory, making it easy to debug. The heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512 MB.
Allocation statistics can be tracked via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing llvm#66708
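
For intuition, a minimal sketch of a bump allocator of the kind the commit message describes (the real implementation lives in the device runtime and must handle device-wide atomics and the NVIDIA fallback; all names below are illustrative, not actual libomptarget symbols):

#include <atomic>
#include <cstddef>

namespace sketch {

// A bump allocator: one atomically advanced offset into a fixed-size heap.
// Individual frees are not supported; exhaustion traps loudly, which is the
// easy-to-debug behavior the commit message describes.
constexpr std::size_t kHeapSize = 1 << 20; // sketch-sized; the real default is 512 MB (LIBOMPTARGET_HEAP_SIZE)
alignas(16) char Heap[kHeapSize];
std::atomic<std::size_t> Top{0};

void *bump_malloc(std::size_t Size) {
  std::size_t Aligned = (Size + 15) & ~std::size_t(15); // keep 16-byte alignment
  std::size_t Old = Top.fetch_add(Aligned, std::memory_order_relaxed);
  if (Old + Aligned > kHeapSize)
    __builtin_trap(); // out of memory
  return Heap + Old;
}

} // namespace sketch

On a real GPU the offset would be a device-wide atomic, and the statistics exposed via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` would be updated alongside each bump.
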
jdoerfert added a commit that referenced this issue Oct 21, 2023 (same commit message as above)
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Oct 23, 2023 (same commit message as above)
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 2, 2023 (same commit message as above)