Possible translation for OpenACC loop seq #24

Open
Lyphion opened this issue Jun 18, 2024 · 7 comments

@Lyphion
Contributor

Lyphion commented Jun 18, 2024

OpenMP currently has no equivalent of the OpenACC loop seq construct, so no direct translation exists.
A possible translation could be to use the loop construct with the bind(thread) clause instead. According to this paper and my own tests, the following code snippets produce correct results with comparable performance.

OpenACC:

!$acc parallel
!$acc loop seq
do j = 1, n
  !$acc loop
  do i = 1, n
    b(i) = b(i) / j + a(i,j)
  end do
end do
!$acc end parallel

OpenMP:

!$omp target teams
!$omp loop bind(thread)
do j = 1, n
  !$omp loop
  do i = 1, n
    b(i) = b(i) / j + a(i,j)
  end do
end do
!$omp end target teams

For better transparency, this translation should be placed behind a feature flag.

@Lyphion
Contributor Author

Lyphion commented Jun 23, 2024

After further investigation, the correctness of the translation turns out to depend on the compiler and the hardware used. With Nvidia tools and hardware the translation is correct. With Intel the result doesn't match the expected one.
For that reason, the possible translation should be included in the experimental section.

@hservatg
Contributor

Hey @Lyphion -- do you mind sharing which Intel compiler you tried? Thanks

I'm a bit swamped these days -- but I'll try to work on this when I have some time.

@Lyphion
Contributor Author

Lyphion commented Jun 25, 2024

All my tests are done with Fortran.

  • ifx 2024.1.2 or 2024.2 for Intel (the old Fortran Compiler Classic doesn't support my hardware)
  • nvfortran 24.3 for Nvidia

This was just an idea. If you like it but don't have much time, I could also draft an implementation.

@hservatg
Contributor

hservatg commented Jul 1, 2024

Hello,

I'm not sure about this proposal. According to the OpenACC spec for loop construct / seq:

2.9.5 seq clause
The seq clause specifies that the associated loop or loops are to be executed sequentially by the accelerator. This clause will override any automatic parallelization or vectorization.

However, a !$omp loop bind(thread) would parallelize the loop over the threads, which would not honor the OpenACC semantics of the original code.

The example you posted works because the parallel region does not spawn threads (or workers in OpenACC jargon). However, what if threads/workers are spawned? I'm not sure the translation you suggest would still be valid in that case.

@Lyphion
Contributor Author

Lyphion commented Jul 2, 2024

I know that this is more of a shortcut or hack. As I already mentioned, it doesn't work on all platforms for that reason. But in some instances it really helps performance, and with the Nvidia compiler it prints the same debug log when compiling. Converting an outer sequential loop into an OpenMP construct would require spawning a new kernel on each iteration, which hurts performance.

Thanks for investigating my idea. The OpenMP and OpenACC documentation is a bit confusing and open to interpretation in some parts.

If you are skeptical about it, we can leave it as it is and I will refactor my code on my side without tool support.

@hservatg
Contributor

I've been thinking about the topic and discussing it with some colleagues. I think that the appropriate solution would be to translate the !$acc loop seq into a no-op (currently it is translated as !$omp loop, which is wrong). Basically, !$acc loop seq prevents a loop from being parallelized by the OpenACC compiler, so the loop is run serially by a given thread.

Sorry if this does not align with your expectations, but this should be the most semantically equivalent translation.
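
For illustration, here is a minimal sketch of what the no-op translation could look like when applied to the example from the first comment. It assumes the surrounding !$acc parallel is still translated to !$omp target teams, as in the original snippet. The program name, the array size, the test data, the map clauses, and the final check are hypothetical additions made only so the sketch is self-contained. It has not been validated against any particular compiler or device, and it does not settle how the inner loop is distributed across teams.

```fortran
program loop_seq_noop
  implicit none
  integer, parameter :: n = 64          ! hypothetical problem size
  real :: a(n, n), b(n), b_ref(n)
  integer :: i, j

  ! Arbitrary test data (not taken from the issue).
  do j = 1, n
    do i = 1, n
      a(i, j) = real(i + j)
    end do
  end do
  b = 1.0
  b_ref = 1.0

  ! Serial reference result on the host.
  do j = 1, n
    do i = 1, n
      b_ref(i) = b_ref(i) / j + a(i, j)
    end do
  end do

  ! No-op translation: the "!$acc loop seq" directive is simply dropped,
  ! so the outer j loop is a plain sequential loop inside the region.
  ! Only the inner loop keeps a directive.
  !$omp target teams map(to: a) map(tofrom: b)
  do j = 1, n
    !$omp loop
    do i = 1, n
      b(i) = b(i) / j + a(i, j)
    end do
  end do
  !$omp end target teams

  ! Compare against the serial reference with a relative tolerance.
  if (maxval(abs(b - b_ref)) > 1.0e-5 * maxval(abs(b_ref))) then
    error stop "mismatch between translated loop and serial reference"
  end if
  print *, "results match"
end program loop_seq_noop
```

When compiled without OpenMP support, the directives are treated as comments and the program runs serially, so the check at the end mainly becomes meaningful once offloading is enabled.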

hservatg self-assigned this Jul 25, 2024
@Lyphion
Contributor Author

Lyphion commented Jul 25, 2024

I totally agree with you about the solution. For my own testing I also tried translating it into a no-op, and it worked well enough for me. The user must keep in mind that all instructions between the outer sequential loop (!$acc loop seq) and an inner parallel one are most likely run by all threads, so nothing should be calculated or saved there.

I'd like to thank you again for checking and researching. Your tool and feedback really helped me.
