
[OpenMP] incorrect concurrent target reduction #70249

Closed
ye-luo opened this issue Oct 25, 2023 · 7 comments · Fixed by #70401 or #70752

Comments

@ye-luo
Contributor

ye-luo commented Oct 25, 2023

```
#include <iostream>
#include <vector>

#define N 4

int main()
{
  std::vector<int> avec(N);
  int* a = avec.data();
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
  {
    a[i] = 0;
    #pragma omp target teams distribute parallel for reduction(+:a[i])
    for (int j = 0; j < N; j++)
      a[i] += 1;
  }

  std::cout << "results:";
  for (int i = 0; i < N; i++)
    std::cout << " " << a[i];
  std::cout << std::endl;
}
```

Running `clang++ -fopenmp --offload-arch=sm_80 main.cpp && ./a.out`, I expect

results: 4 4 4 4

but I get random failures:

results: 0 4 0 0
results: 4 0 4 0
results: 4 0 4 0

Turning on debugging info

OMP_TARGET_OFFLOAD=mandatory OMP_NUM_THREADS=32 LIBOMPTARGET_DEBUG=1 ./a.out >& out && grep "Moving 4 bytes\|result" out
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c27c) -> (tgt:0x00007f7112600000)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c278) -> (tgt:0x00007f7112600200)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c274) -> (tgt:0x00007f7112600400)
Libomptarget --> Moving 4 bytes (hst:0x00005572a0d9c270) -> (tgt:0x00007f7112600600)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600000) -> (hst:0x00005572a0d9c27c)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600400) -> (hst:0x00005572a0d9c274)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600600) -> (hst:0x00005572a0d9c270)
Libomptarget --> Moving 4 bytes (tgt:0x00007f7112600200) -> (hst:0x00005572a0d9c278)
results: 4 0 4 0

Mappings and transfers seem OK to me; the failure is puzzling.

Setting `OMP_NUM_THREADS=1`, the test passes reliably.

@llvmbot
Collaborator

llvmbot commented Oct 25, 2023

@llvm/issue-subscribers-openmp

Author: Ye Luo (ye-luo)

(quotes the issue body above)

@jdoerfert
Member

This is a known issue. I'll put a bandage on it to make it work for now.

@jdoerfert jdoerfert self-assigned this Oct 26, 2023
@ye-luo
Contributor Author

ye-luo commented Oct 26, 2023

> This is a known issue. I'll put a bandage on it to make it work for now.

Any idea why it went wrong? Mappings all seem OK.

@jdoerfert
Member

No, the runtime has always been broken in this case.
Since the beginning it has used two globals to synchronize the teams; that works fine as long as only one team reduction is happening at any point in time.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 27, 2023
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: llvm#70249
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 31, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Oct 31, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit that referenced this issue Nov 1, 2023
…70401)

The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: #70249
@ye-luo
Contributor Author

ye-luo commented Nov 1, 2023

My test is still failing sporadically. Tested with 954af75

yeluo@epyc-server:~/temp$ ./a.out 
results: 4 0 4 4
yeluo@epyc-server:~/temp$ ./a.out 
results: 4 4 4 4

@ye-luo ye-luo reopened this Nov 1, 2023
@shiltian
Contributor

shiltian commented Nov 1, 2023

#70752 is on the way

jdoerfert added a commit to jdoerfert/llvm-project that referenced this issue Nov 1, 2023
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249
jdoerfert added a commit that referenced this issue Nov 1, 2023
)

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: #70249
@llvmbot
Collaborator

llvmbot commented Nov 1, 2023

@llvm/issue-subscribers-clang-codegen

Author: Ye Luo (ye-luo)

(quotes the issue body above)

searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 3, 2023
…lvm#70401)

The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per launch memory pools.

Fixes: llvm#70249

Change-Id: I06ce8c63cf5020be778e1a9e06053a1950dfb18e
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Nov 3, 2023
…m#70752)

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249

Change-Id: Id8a5932a1cde8cfcbb0e17655ef3f390f6f4d050