CCSD(T) Performance problem using GPU #79
Comments
I compiled NWChem using the GNU compilers and Intel MKL. These are the environment variables I used:

```shell
export ARMCI_NETWORK=MPI-TS
export NWCHEM_MODULES=all
export USE_SCALAPACK=y
export TCE_CUDA=y
export FC=gfortran
```
It seems that using more MPI ranks than GPUs per node is a bad idea. I suspect this is caused by some broken logic in how the processes that own GPUs are detected.
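One common workaround for this kind of mismatch (a sketch, not NWChem's own detection logic) is to pin each local MPI rank to a distinct GPU via `CUDA_VISIBLE_DEVICES` before the binary starts. This assumes OpenMPI's `OMPI_COMM_WORLD_LOCAL_RANK` variable; other MPI implementations export a different local-rank variable, and the wrapper name is hypothetical:

```shell
# Hypothetical per-rank wrapper: map local MPI ranks to GPUs round-robin.
# OMPI_COMM_WORLD_LOCAL_RANK is OpenMPI-specific; it defaults to 0 here
# when the script is run outside of mpirun.
NGPUS=2                                        # GPUs per node (2x P100)
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS ))
echo "local rank ${LOCAL_RANK} -> GPU ${CUDA_VISIBLE_DEVICES}"
```

Launched as something like `mpirun -np 2 ./wrapper.sh nwchem input.nw` (with an `exec "$@"` at the end of the wrapper), each rank would then see exactly one device.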
Thanks. But using just 1 MPI rank leads to much more time spent in the CCSD iterations, so the total calculation time is still very long.
I read the JCTC paper (2013, 9(4):1949–1957).
I changed ARMCI_NETWORK to MPI-PR or ARMCI_MPI and recompiled NWChem. Now the performance seems OK. I think GA has some problems when ARMCI_NETWORK is set to MPI-TS.
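For reference, that rebuild differs from the original settings only in ARMCI_NETWORK; a sketch of the adjusted build environment (the other variables repeated from above):

```shell
# Same build settings as before, with the ARMCI runtime switched away
# from MPI-TS; MPI-PR dedicates one rank per node to communication progress.
export ARMCI_NETWORK=MPI-PR   # alternatively: ARMCI_MPI
export NWCHEM_MODULES=all
export USE_SCALAPACK=y
export TCE_CUDA=y
export FC=gfortran
echo "ARMCI_NETWORK set to ${ARMCI_NETWORK}"
```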
MPI-TS doesn’t make asynchronous progress, which is a problem in some modules.
Hi Jeff, I found there are two CUDA source files (sd_t_total.cu and sd_t_total_ttlg.cu) in the tce/ccsd_t directory. The code in sd_t_total.cu has not changed for 5 years. Does TTLG have better performance on newer GPUs?
> Hi Jeff, I found there are two CUDA source files (sd_t_total.cu and sd_t_total_ttlg.cu) in the tce/ccsd_t directory. The code in sd_t_total.cu has not changed for 5 years.
Correct. I think it was written by Oreste Villa for NVIDIA Kepler series GPUs.
> Does TTLG have better performance on newer GPUs?
I don’t know. I haven’t used it. My focus is OpenMP optimization for CPUs. Data transfer overheads are a major issue for GPU offload...
@xiexiguo I would not use TTLG for now. It's a work in progress and not fully tested. The new GPU code will be available in the next few months.
Dear developers,
I encountered some performance problems when running CCSD(T) calculations with a GPU build of NWChem.
I ran the CCSD(T) calculation (using TCE) on a server that has 2 NVIDIA P100 GPUs.
I varied the number of MPI processes and the number of GPUs, and found that the CCSD(T)
calculation time is not stable. The results are as follows.
The time of test2 is much longer than that of test1, although one more MPI process is used.
The time of test3 is almost equal to that of test1, although one more GPU is added.
So the results are abnormal: adding CPU cores or GPUs does not shorten the running time.
Do you have any suggestions for running TCE calculations on GPUs?
The following is the test case I used. It is a benzene molecule.
```
start benene
echo
memory stack 64000 mb heap 64000 mb global 64000 mb noverify
geometry units angstrom noprint
C 0.20718593 -0.66876032 0.00068200
C 1.60234593 -0.66876032 0.00068200
C 2.29988393 0.53899068 0.00068200
C 1.60222993 1.74749968 -0.00051700
C 0.20740493 1.74742168 -0.00099600
C -0.49019607 0.53921568 0.00000000
H -0.34257307 -1.62107732 0.00113200
H 2.15185393 -1.62127332 0.00199700
H 3.39956393 0.53907068 0.00131600
H 2.15242993 2.69964268 -0.00057600
H -0.34271707 2.69970268 -0.00194900
H -1.58980007 0.53939868 -0.00018000
end
basis spherical
C library aug-cc-pvdz
H library aug-cc-pvdz
end
tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 24
cuda 1
end
task tce energy
```
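With `cuda 1` in the `tce` block, a launch along these lines would match the rank count to the available GPUs on the 2-GPU node discussed above (the binary path and file names are illustrative):

```shell
# Illustrative launch; adjust -np to the number of GPU-owning ranks you
# want, per the rank-vs-GPU discussion in this thread.
mpirun -np 2 nwchem benene.nw > benene.out
```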