CCSD(T) Performance problem using GPU #79

Closed
xiexiguo opened this issue Nov 11, 2018 · 9 comments

@xiexiguo

Dear developers,

I encountered a performance problem when running CCSD(T) calculations with a GPU build of NWChem.
I ran the CCSD(T) calculation (using TCE) on a server with two NVIDIA P100 GPUs.
I varied the number of MPI processes and the number of GPUs, and found that the CCSD(T)
run times do not scale as expected. The results are as follows:

  • Test1: 1 process, 1 GPU: CPU/wall time 42.6 s / 77.5 s
  • Test2: 2 processes, 1 GPU: CPU/wall time 774.6 s / 913.2 s
  • Test3: 2 processes, 2 GPUs: CPU/wall time 45.2 s / 67.4 s
  • Test4: 8 processes, 2 GPUs: CPU/wall time 454.6 s / 507.9 s

Test2 takes much longer than Test1 even though it uses one more MPI process.
Test3 takes about the same time as Test1 even though it uses one more GPU.
So the scaling is abnormal: adding CPU cores or GPUs does not shorten the running time.
Do you have any suggestions for running TCE calculations on GPUs?

The following is the test case I used. It is a benzene molecule.
start benene
echo
memory stack 64000 mb heap 64000 mb global 64000 mb noverify

geometry units angstrom noprint
C 0.20718593 -0.66876032 0.00068200
C 1.60234593 -0.66876032 0.00068200
C 2.29988393 0.53899068 0.00068200
C 1.60222993 1.74749968 -0.00051700
C 0.20740493 1.74742168 -0.00099600
C -0.49019607 0.53921568 0.00000000
H -0.34257307 -1.62107732 0.00113200
H 2.15185393 -1.62127332 0.00199700
H 3.39956393 0.53907068 0.00131600
H 2.15242993 2.69964268 -0.00057600
H -0.34271707 2.69970268 -0.00194900
H -1.58980007 0.53939868 -0.00018000
end

basis spherical
C library aug-cc-pvdz
H library aug-cc-pvdz
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 24
cuda 1
end

task tce energy
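
The runs were launched with plain mpirun, varying only the process count. Roughly like this (the binary path and input file name here are placeholders):

# placeholder launch commands; adjust binary and input names to your setup
mpirun -np 1 ./nwchem benzene.nw    # Test1
mpirun -np 2 ./nwchem benzene.nw    # Test2 / Test3
mpirun -np 8 ./nwchem benzene.nw    # Test4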

@xiexiguo
Author

I compiled NWChem with the GNU compilers and Intel MKL. The following are the environment variables I used.
export NWCHEM_TOP=/mnt/nwchem/nwchem-6.8.1-cuda-gnu
export NWCHEM_TARGET=LINUX64

export ARMCI_NETWORK=MPI-TS
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/public/software/mpi/intelmpi/2017.4.239/intel64
export MPI_INCLUDE=$MPI_LOC/include
export MPI_LIB="$MPI_LOC/lib -L$MPI_LOC/lib/release_mt"
export LIBMPI="-lmpifort -lmpi -lmpigi -ldl -lrt -lpthread"

export NWCHEM_MODULES=all
export USE_NOIO=TRUE

export USE_SCALAPACK=y
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_gf_ilp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_gf_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"

export TCE_CUDA=y
export CUDA_INCLUDE="-I/usr/local/cuda-8.0/include"
export CUDA_LIBS="-L/usr/local/cuda-8.0/lib64 -lcudart"

export FC=gfortran
export CC=gcc
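
With these variables set, the build itself followed the usual NWChem sequence, roughly (a sketch; the parallel make width is arbitrary):

cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make -j 8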

@jeffhammond
Collaborator

It seems that using more MPI ranks than GPUs per node is a bad idea. I suspect this is caused by some broken logic in how the processes that own GPUs are detected.
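
One way to rule that out is to pin each rank to its own device explicitly before NWChem starts, e.g. with a small wrapper script. A sketch only, not something NWChem ships; the local-rank variable depends on the MPI library (Intel MPI exports MPI_LOCALRANKID, Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK):

#!/bin/bash
# gpu_bind.sh (hypothetical): give each local MPI rank one GPU, round-robin
# usage: mpirun -np 2 ./gpu_bind.sh ./nwchem input.nw
lrank=${MPI_LOCALRANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}
ngpus=$(nvidia-smi -L | wc -l)              # GPUs visible on this node
export CUDA_VISIBLE_DEVICES=$(( lrank % ngpus ))
exec "$@"

Whether this interacts cleanly with the device selection done by the cuda keyword is something you would have to test.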

@xiexiguo
Author

Thanks. But using just one MPI rank makes the CCSD iterations take much longer, so the total calculation time is still very long.

@xiexiguo
Author

I read the JCTC paper (2013, 9 (4), 1949-1957,
"Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems"). The paper says NWChem employs a dynamic load-balancing scheme, and the calculation can proceed on both the CPU side and the GPU side. So using more MPI ranks should still work.

@xiexiguo
Author

I changed ARMCI_NETWORK to MPI-PR (or ARMCI-MPI) and recompiled NWChem. Now the performance looks fine. I think GA has some problem when ARMCI_NETWORK is set to MPI-TS.
Thanks, Jeff!
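
For anyone who hits the same problem, the only change to the build environment posted above was the ARMCI setting, followed by a rebuild, roughly as below (depending on your tree you may also need to clean the tools directory where GA is built):

# replace the earlier MPI-TS setting, then rebuild
export ARMCI_NETWORK=MPI-PR
cd $NWCHEM_TOP/src
make clean
make nwchem_config NWCHEM_MODULES=all
make -j 8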

@jeffhammond
Collaborator

jeffhammond commented Nov 30, 2018 via email

@xiexiguo
Author

xiexiguo commented Dec 2, 2018

Hi Jeff, I found that there are two CUDA source files (sd_t_total.cu and sd_t_total_ttlg.cu) in the tce/ccsd_t directory. The code in sd_t_total.cu has not changed for 5 years. Does the TTLG version have better performance on newer GPUs?

@jeffhammond
Collaborator

jeffhammond commented Dec 2, 2018 via email

@ajaypanyala
Collaborator

@xiexiguo I would not use TTLG for now. It is a work in progress and not fully tested. The new GPU code will be available in the next few months.

@xiexiguo xiexiguo closed this as completed Jan 7, 2019