CCSD(T) Performance problem using GPU #79

Closed
xiexiguo opened this issue Nov 11, 2018 · 9 comments

@xiexiguo

Dear developers,

I encountered a performance problem when running CCSD(T) calculations with a GPU build of NWChem.
I ran the CCSD(T) calculation (using TCE) on a server with two NVIDIA P100 GPUs.
I varied the number of MPI processes and the number of GPUs, and found that the CCSD(T)
run times do not scale as expected. The results are as follows:

  • Test1: 1 process, 1 GPU: CPU/wall time 42.6 s / 77.5 s
  • Test2: 2 processes, 1 GPU: CPU/wall time 774.6 s / 913.2 s
  • Test3: 2 processes, 2 GPUs: CPU/wall time 45.2 s / 67.4 s
  • Test4: 8 processes, 2 GPUs: CPU/wall time 454.6 s / 507.9 s

Test2 takes much longer than Test1 even though it uses one more MPI process.
Test3 takes about the same time as Test1 even though it uses one more GPU.
So the scaling is abnormal: adding CPU cores or GPUs does not shorten the running time.
Do you have any suggestions for running TCE calculations on GPUs?

The following is the test case I used. It is a benzene molecule.
start benene
echo
memory stack 64000 mb heap 64000 mb global 64000 mb noverify

geometry units angstrom noprint
C 0.20718593 -0.66876032 0.00068200
C 1.60234593 -0.66876032 0.00068200
C 2.29988393 0.53899068 0.00068200
C 1.60222993 1.74749968 -0.00051700
C 0.20740493 1.74742168 -0.00099600
C -0.49019607 0.53921568 0.00000000
H -0.34257307 -1.62107732 0.00113200
H 2.15185393 -1.62127332 0.00199700
H 3.39956393 0.53907068 0.00131600
H 2.15242993 2.69964268 -0.00057600
H -0.34271707 2.69970268 -0.00194900
H -1.58980007 0.53939868 -0.00018000
end

basis spherical
C library aug-cc-pvdz
H library aug-cc-pvdz
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 24
cuda 1
end

task tce energy
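
The runs were launched with plain mpirun, varying only the process count. Roughly like this (the binary path and input file name here are placeholders):

# placeholder launch commands; adjust binary and input names to your setup
mpirun -np 1 ./nwchem benzene.nw    # Test1
mpirun -np 2 ./nwchem benzene.nw    # Test2 / Test3
mpirun -np 8 ./nwchem benzene.nw    # Test4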

@xiexiguo
Author

I compiled NWChem with the GNU compilers and Intel MKL. The following are the environment variables I used.
export NWCHEM_TOP=/mnt/nwchem/nwchem-6.8.1-cuda-gnu
export NWCHEM_TARGET=LINUX64

export ARMCI_NETWORK=MPI-TS
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/public/software/mpi/intelmpi/2017.4.239/intel64
export MPI_INCLUDE=$MPI_LOC/include
export MPI_LIB="$MPI_LOC/lib -L$MPI_LOC/lib/release_mt"
export LIBMPI="-lmpifort -lmpi -lmpigi -ldl -lrt -lpthread"

export NWCHEM_MODULES=all
export USE_NOIO=TRUE

export USE_SCALAPACK=y
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_gf_ilp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_gf_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"

export TCE_CUDA=y
export CUDA_INCLUDE="-I/usr/local/cuda-8.0/include"
export CUDA_LIBS="-L/usr/local/cuda-8.0/lib64 -lcudart"

export FC=gfortran
export CC=gcc
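
With these variables set, the build itself followed the usual NWChem sequence, roughly (a sketch; the parallel make width is arbitrary):

cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make -j 8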

@jeffhammond
Collaborator

It seems that using more MPI ranks than GPUs per node is a bad idea. I suspect this is caused by some broken logic in how the processes that own GPUs are detected.
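
One way to rule that out is to pin each rank to its own device explicitly before NWChem starts, e.g. with a small wrapper script. A sketch only, not something NWChem ships; the local-rank variable depends on the MPI library (Intel MPI exports MPI_LOCALRANKID, Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK):

#!/bin/bash
# gpu_bind.sh (hypothetical): give each local MPI rank one GPU, round-robin
# usage: mpirun -np 2 ./gpu_bind.sh ./nwchem input.nw
lrank=${MPI_LOCALRANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}
ngpus=$(nvidia-smi -L | wc -l)              # GPUs visible on this node
export CUDA_VISIBLE_DEVICES=$(( lrank % ngpus ))
exec "$@"

Whether this interacts cleanly with the device selection done by the cuda keyword is something you would have to test.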

@xiexiguo
Author

Thanks. But using just one MPI rank makes the CCSD iterations take much longer, so the total calculation time is still very long.

@xiexiguo
Author

I read the JCTC paper (2013, 9 (4), 1949-1957,
"Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems"). The paper says NWChem employs a dynamic load-balancing scheme, and the calculation can proceed on both the CPU side and the GPU side. So using more MPI ranks should still work.

@xiexiguo
Author

I changed ARMCI_NETWORK to MPI-PR (or ARMCI-MPI) and recompiled NWChem. Now the performance looks fine. I think GA has some problem when ARMCI_NETWORK is set to MPI-TS.
Thanks, Jeff!
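
For anyone who hits the same problem, the only change to the build environment posted above was the ARMCI setting, followed by a rebuild, roughly as below (depending on your tree you may also need to clean the tools directory where GA is built):

# replace the earlier MPI-TS setting, then rebuild
export ARMCI_NETWORK=MPI-PR
cd $NWCHEM_TOP/src
make clean
make nwchem_config NWCHEM_MODULES=all
make -j 8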

@jeffhammond
Collaborator

jeffhammond commented Nov 30, 2018 via email

@xiexiguo
Author

xiexiguo commented Dec 2, 2018

Hi Jeff, I found that there are two CUDA source files (sd_t_total.cu and sd_t_total_ttlg.cu) in the tce/ccsd_t directory. The code in sd_t_total.cu has not changed for 5 years. Does the TTLG version have better performance on newer GPUs?

@jeffhammond
Collaborator

jeffhammond commented Dec 2, 2018 via email

@ajaypanyala
Collaborator

@xiexiguo I would not use TTLG for now. It is a work in progress and not fully tested. The new GPU code will be available in the next few months.

@xiexiguo xiexiguo closed this as completed Jan 7, 2019