Writing the coordination with openACC #1075

Open

Iximiel opened this issue May 14, 2024 · 3 comments

Comments
@Iximiel
Member

Iximiel commented May 14, 2024

As I did with CUDA (#1028), and as I tried to do with ArrayFire (#1049) and PyTorch, I tried to rewrite the COORDINATION CV using OpenACC as the accelerator.

Here is the result, obtained with the new benchmark tool:

[benchmark plot sc_100: timings for the sc atom distribution over 100 steps]
It is slower than CUDA, but writing in OpenACC may be more familiar, because it looks like OpenMP and because you can leave it to the compiler to work out how to parallelize the loops; you do not have to use the <<<>>> syntax to launch kernels as in CUDA. It is also far more flexible than the tensor libraries.
On the compilation side I have some mixed feelings, as you can read in the details below.
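Just to give an idea of the style, here is a self-contained toy, not the actual plugin code; the rational switching function (the NN=6, MM=12 default) and the flat coordinate layout are assumptions on my side:

#include <cstddef>
#include <cstdio>
#include <vector>

// switching function s(r) = 1/(1+(r/r0)^6), written so it can run on the device
#pragma acc routine seq
inline double switching(double d2, double r0) {
  const double x2 = d2 / (r0 * r0);
  const double x6 = x2 * x2 * x2;
  return 1.0 / (1.0 + x6);
}

// coordination number over all pairs; x holds flattened coordinates {x0,y0,z0,x1,...}
double coordination(const double* x, std::size_t natoms, double r0) {
  double cn = 0.0;
  // the pragma is enough to generate the kernel: no <<<grid,block>>> launch as in
  // CUDA, and without -acc the same source still builds and runs serially
  #pragma acc parallel loop reduction(+:cn) copyin(x[0:3*natoms])
  for (std::size_t i = 0; i < natoms; ++i) {
    #pragma acc loop reduction(+:cn)
    for (std::size_t j = i + 1; j < natoms; ++j) {
      const double dx = x[3*i]   - x[3*j];
      const double dy = x[3*i+1] - x[3*j+1];
      const double dz = x[3*i+2] - x[3*j+2];
      cn += switching(dx*dx + dy*dy + dz*dz, r0);
    }
  }
  return cn;
}

int main() {
  std::vector<double> pos(3 * 1000);
  for (std::size_t i = 0; i < pos.size(); ++i) pos[i] = 0.01 * double(i % 97);
  std::printf("coordination = %f\n", coordination(pos.data(), pos.size() / 3, 0.5));
}

Compiling this with nvc++ and something like -acc -O3 should be enough to offload the loop, while a plain g++ build just ignores the pragmas (possibly with a warning).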

Details about compilation and script used

I ran everything on my workstation (NVIDIA T1000 8 GB + AMD Ryzen 5 PRO 5650G).
I used nvhpc 24.3, downloaded pre-compiled from the NVIDIA site.

The environment I used is actually slightly convoluted:
I compiled PLUMED from master with plain gcc+MPI.
Then I compiled the plugin with my wild Makefile, which uses nvc++ for the accelerated part and g++ for the main body of the CV.
Then I ran the benchmark without nvhpc in the environment, because it conflicts with the MPI I used to build PLUMED:

nsteps=100
list_of_natoms="500 2000 4000 6000 8000 10000 12000 14000 16000"
export PLUMED_NUM_THREADS=8
useDistr="line sc"
useDistr="sc" # overrides the line above: only the sc distribution was used for this run

for distr in $useDistr; do
  for natoms in $list_of_natoms; do
    fname="${distr}_wACC_${PLUMED_NUM_THREADS}threads_${natoms}_Steps${nsteps}"
    # run the plain, CUDA and OpenACC inputs side by side in the same benchmark
    plumed benchmark --plumed="plumed.dat:cudasingleplumed.dat:accplumed.dat" \
      --natoms=${natoms} --nsteps=${nsteps} --atom-distribution=${distr} >"${fname}.out"
    grep -B1 Comparative "${fname}.out"
  done
done
rm -f bck.* # clean up PLUMED backup files

(I still have to try to make everything run compiled with plain nvhpc.
But since nvhpc does not like the keyword auto for deducing return types (as used in tools/MergeVectorTools.h:54), that needs some massaging of the PLUMED sources, and I did not want to touch src for this project.)
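For context, the construct in question and one possible workaround look roughly like this (illustrative snippet, not the actual MergeVectorTools.h code):

#include <vector>

// a C++14 deduced return type: the pattern nvc++ reportedly chokes on
template <typename C>
auto firstElement(const C& c) { return c[0]; }

// spelling the return type explicitly is one possible way to massage the source
// so that nvhpc accepts it
template <typename C>
typename C::value_type firstElementExplicit(const C& c) { return c[0]; }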


If you look at the code, I also added a few extra headers:

  • LoopUnroller.h, Tensor.h and Vector.h, which are variants of the original headers that make it possible to declare Tensors and Vectors of any underlying type, plus some splashes of C++17 refactoring where I did not manage to convince nvc++ to deduce the template arguments as I wanted
  • Tools_pow.h, which templatizes the type in the runtime version of fastpow
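As a rough sketch of both ideas (the name VectorTyped and the exact code below are illustrative choices of mine, not necessarily what the patched headers actually do):

#include <array>
#include <cstddef>

// a fixed-size vector templated on the scalar type instead of hard-coding double,
// so that the accelerated path can also instantiate float variants
template <typename T, std::size_t N>
class VectorTyped {
  std::array<T, N> d{};
public:
  T& operator[](std::size_t i) { return d[i]; }
  const T& operator[](std::size_t i) const { return d[i]; }
  T modulo2() const {             // squared norm
    T s = T(0);
    for (std::size_t i = 0; i < N; ++i) s += d[i] * d[i];
    return s;
  }
};
// a double-precision alias keeps the existing code source-compatible
using Vector = VectorTyped<double, 3>;

// a type-generic runtime fastpow via exponentiation by squaring, in the spirit of
// the Tools_pow.h change (not the actual implementation)
template <typename T>
T fastpow(T base, int exp) {
  if (exp < 0) { base = T(1) / base; exp = -exp; }
  T result = T(1);
  while (exp) {
    if (exp & 1) result *= base;
    base *= base;
    exp >>= 1;
  }
  return result;
}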

These modifications are a prerequisite for the OpenACC version but are completely independent from it. If you are ok with this, I would like to open a PR with a patch to the original .h files.

@GiovanniBussi
Member

Regarding the vector and tensor with a generic type, I tried to do the same a few years ago and I remember that with the Intel compiler the performance was measurably affected (to my surprise). Maybe you can double check this. If that turns out to be true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.

@GiovanniBussi
Member

(I still have to try to make everything run compiled with plain nvhpc.
But since nvhpc does not like the keyword auto for deducing return types (as used in tools/MergeVectorTools.h:54), that needs some massaging of the PLUMED sources, and I did not want to touch src for this project.)

If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one GitHub Actions job to test for this.

@Iximiel
Member Author

Iximiel commented May 15, 2024

Regarding the vector and tensor with a generic type, I tried to do the same a few years ago and I remember that with the Intel compiler the performance was measurably affected (to my surprise). Maybe you can double check this. If that turns out to be true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.

Ok, so I will set up the PR as a WIP; then I will produce some benchmarks.

If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one GitHub Actions job to test for this.

I'm trying to do it in #1076

Iximiel mentioned this issue Jul 1, 2024