[OpenMP] Inner reduction crashes at runtime #66708

Open · doru1004 opened this issue Sep 18, 2023 · 2 comments

doru1004 commented Sep 18, 2023

In the following example:

#include <omp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_N (64 * 1024 * 4)

void reset_input(double *a, double *a_h, double *b, double *c) {
  for(int i = 0 ; i < MAX_N ; i++) {
    a[i] = a_h[i] = 0;
    b[i] = i;
    c[i] = i;
  }
}

int main(int argc, char *argv[]) {
  double * a = (double *) malloc(MAX_N * sizeof(double));
  double * a_h = (double *) malloc(MAX_N * sizeof(double));
  double * b = (double *) malloc(MAX_N * sizeof(double));
  double * c = (double *) malloc(MAX_N * sizeof(double));

  int n = MAX_N;
#pragma omp target enter data map(to:a[:MAX_N],b[:MAX_N],c[:MAX_N])

  reset_input(a, a_h, b, c);

#pragma omp target update to(a[:n],b[:n],c[:n])
#pragma omp target teams distribute parallel for
  for (uint64_t sample=0; sample < 64 * 1024; sample++) {
    double partial_sum = 0.0;
    #pragma omp parallel for reduction(+:partial_sum)
    for (int i = 0; i < 4; ++i) {
      partial_sum += b[sample * 4 + i] + c[sample * 4 + i];
    }
    a[sample * 4] = partial_sum;
  }

  for (int i = 0; i < 64 * 1024; ++i) {
    double p_sum = 0.0;
    for (int j = 0; j < 4; ++j) {
      p_sum += b[4*i + j] + c[4*i + j];
    }
    a_h[i*4] = p_sum;
  }
#pragma omp target update from(a[:n])

  for (int i = 0; i < MAX_N; ++i) {
    if (a_h[i] != a[i]) {
      printf("A Error at n = %d, i = %d: host = %f, device = %f\n", n, i, a_h[i], a[i]);
      return 1;
    }
  }
  printf("Succeeded\n");
  return 0;
}

the inner reduction crashes at runtime.

Compile line:

clang++ -O3 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a test.cpp -o test -save-temps

The postlink LLVM IR looks fine. The postopt LLVM IR looks suspicious: function bodies are replaced with `unreachable`, yet the functions are still called from the kernel. This could well be a red herring, and the real error could be elsewhere, but it is a plausible starting point. At first glance it looks like an issue in the OpenMPOpt pass.
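To make that observation concrete, here is a rough C++ analogue of the suspicious pattern (illustrative only; the actual post-opt IR contains outlined `__omp_outlined__*` device functions whose bodies have been folded to an `unreachable` terminator, not source-level code like this):

// Illustrative analogue of the post-opt IR described above: the optimizer
// has concluded the outlined parallel body can never execute and folded it
// away, yet the kernel still calls it. Reaching the call at runtime is
// undefined behavior and crashes in practice.
static void outlined_parallel_body() {
  __builtin_unreachable(); // stand-in for a body reduced to `unreachable`
}

void kernel_entry() {
  outlined_parallel_body(); // the kernel still calls the folded function
}

The postlink/postopt bitcode that `-save-temps` leaves behind can be disassembled with `llvm-dis` to check for this pattern.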

llvmbot (Collaborator) commented Sep 19, 2023

@llvm/issue-subscribers-openmp


josemonsalve2 (Contributor) commented:

@rodrigo-ceccato This may be of interest to you.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 1, 2023
If we update the state, or indicate a pessimistic fixpoint, we need to
consider NestedParallelism too.

Fixes llvm#66708

That said, the reproducer still needs `malloc`, which we don't support on AMD GPUs.
jdoerfert added a commit that referenced this issue Oct 21, 2023
If we update the state, or indicate a pessimistic fixpoint, we need to
consider NestedParallelism too.

Fixes part of #66708

That said, the reproducer still needs `malloc`, which we don't support on AMD GPUs; support will be added later.
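
For readers skimming the thread, this is the shape the fix is about, distilled from the reproducer above (a minimal sketch, not code from the patch):

#include <stdint.h>

// A `parallel for` reduction nested inside a `target teams distribute
// parallel for` region: the nested-parallelism case that the OpenMPOpt
// state tracking must not lose.
void nested_reduction(double *out, const double *in, int64_t n) {
#pragma omp target teams distribute parallel for map(to : in[0 : n * 4]) map(from : out[0 : n])
  for (int64_t s = 0; s < n; ++s) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum) // the nested (inner) parallel region
    for (int i = 0; i < 4; ++i)
      sum += in[s * 4 + i];
    out[s] = sum;
  }
}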
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 21, 2023
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should revisit this choice; for now, this allocator is better than none.
It traps if it runs out of memory, making it easy to debug. The heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512 MB.
Allocation statistics can be tracked via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing llvm#66708
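
For intuition, a minimal sketch of a bump allocator of the kind the commit message describes (the real implementation lives in the device runtime and must handle device-wide atomics and the NVIDIA fallback; all names below are illustrative, not actual libomptarget symbols):

#include <atomic>
#include <cstddef>

namespace sketch {

// A bump allocator: one atomically advanced offset into a fixed-size heap.
// Individual frees are not supported; exhaustion traps loudly, which is the
// easy-to-debug behavior the commit message describes.
constexpr std::size_t kHeapSize = 1 << 20; // sketch-sized; the real default is 512 MB (LIBOMPTARGET_HEAP_SIZE)
alignas(16) char Heap[kHeapSize];
std::atomic<std::size_t> Top{0};

void *bump_malloc(std::size_t Size) {
  std::size_t Aligned = (Size + 15) & ~std::size_t(15); // keep 16-byte alignment
  std::size_t Old = Top.fetch_add(Aligned, std::memory_order_relaxed);
  if (Old + Aligned > kHeapSize)
    __builtin_trap(); // out of memory
  return Heap + Old;
}

} // namespace sketch

On a real GPU the offset would be a device-wide atomic, and the statistics exposed via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` would be updated alongside each bump.
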
jdoerfert added a commit that referenced this issue Oct 21, 2023 (same commit message as above)
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Oct 23, 2023 (same commit message as above)
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 2, 2023 (same commit message as above)