
[OpenMP] incorrect concurrent target reduction #70249

Closed
ye-luo opened this issue Oct 25, 2023 · 7 comments · Fixed by #70401 or #70752

Comments

@ye-luo
Contributor

ye-luo commented Oct 25, 2023

```
#include <iostream>
#include <vector>

#define N 4

int main()
{
  std::vector<int> avec(N);
  int* a = avec.data();
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
  {
    a[i] = 0;
    #pragma omp target teams distribute parallel for reduction(+:a[i])
    for (int j = 0; j < N; j++)
      a[i] += 1;
  }

  std::cout << "results:";
  for (int i = 0; i < N; i++)
    std::cout << " " << a[i];
  std::cout << std::endl;
}
```

Running `clang++ -fopenmp --offload-arch=sm_80 main.cpp && ./a.out`, I expect

results: 4 4 4 4

but I get random failures:

results: 0 4 0 0
results: 4 0 4 0
results: 4 0 4 0

Turning on debugging info

OMP_TARGET_OFFLOAD=mandatory OMP_NUM_THREADS=32 LIBOMPTARGET_DEBUG=1 ./a.out >& out && grep "Moving 4 bytes\|result" out
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c27c) -> (tgt:0x00007f7112600000)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c278) -> (tgt:0x00007f7112600200)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c274) -> (tgt:0x00007f7112600400)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c270) -> (tgt:0x00007f7112600600)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600000) -> (hst:0x00005572a0d9c27c)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600400) -> (hst:0x00005572a0d9c274)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600600) -> (hst:0x00005572a0d9c270)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600200) -> (hst:0x00005572a0d9c278)
results: 4 0 4 0

Mappings and transfers seem OK to me; the failure is puzzling.

Setting `OMP_NUM_THREADS=1`, the test passes reliably.

@llvmbot
Collaborator

llvmbot commented Oct 25, 2023

@llvm/issue-subscribers-openmp

Author: Ye Luo (ye-luo)

(quotes the issue body above)

@jdoerfert
Member

This is a known issue. I'll put a bandage on it to make it work for now.

@jdoerfert jdoerfert self-assigned this Oct 26, 2023
@ye-luo
Contributor Author

ye-luo commented Oct 26, 2023

> This is a known issue. I'll put a bandage on it to make it work for now.

Any idea why it went wrong? Mappings all seem OK.

@jdoerfert
Member

No, the runtime has always been broken in this case.
Since the beginning it has used two globals to synchronize the teams; that works fine as long as only one team reduction is happening at any point in time.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 27, 2023
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: llvm#70249
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 31, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 31, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit that referenced this issue Nov 1, 2023
…70401)

The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: #70249
@ye-luo
Contributor Author

ye-luo commented Nov 1, 2023

My test is still failing sporadically. Tested with 954af75

yeluo@epyc-server:~/temp$ ./a.out 
results: 4 0 4 4
yeluo@epyc-server:~/temp$ ./a.out 
results: 4 4 4 4

@ye-luo ye-luo reopened this Nov 1, 2023
@shiltian
Contributor

shiltian commented Nov 1, 2023

#70752 is on the way

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Nov 1, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit that referenced this issue Nov 1, 2023
)

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: #70249
@llvmbot
Collaborator

llvmbot commented Nov 1, 2023

@llvm/issue-subscribers-clang-codegen

Author: Ye Luo (ye-luo)

(quotes the issue body above)

searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 3, 2023
…lvm#70401)

The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: llvm#70249

Change-Id: I06ce8c63cf5020be778e1a9e06053a1950dfb18e
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 3, 2023
…m#70752)

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249

Change-Id: Id8a5932a1cde8cfcbb0e17655ef3f390f6f4d050