-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for ERI gradient performance issue #347
Conversation
…t kernels, moved store2 intialization out of primitive function loops
How big of a performance hit comes from larger MAXPRIM values? The cc-pVDZ basis set has contracted s functions with 14 primitives for a few elements (e.g. Ca and Ge, no f functions). MAXPRIM=10 would not work. |
The cc-pVDZ basis set has contraction level up to 14 for a few elements like Ca, Ge etc.
The value of MAXPRIM does not seem to affect performance. Timings for taxol (tight SCF settings) on Expanse A100 nodes, compiled without f function support. master-2257ba84
v23.08b
324-eri-gradient-performance-issue, MAXPRIM=10
324-eri-gradient-performance-issue, MAXPRIM=14
I will change MAXPRIM to a default of 14 without f functions. The 2e gradient code is still a bit slower compare to version 23.08 but it's pretty close. |
AWG resolved conflicts in src/cuda/gpu_get2e_grad_ffff.cuh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I resolved a merge conflict and tested the code on Expanse A100 nodes. All tests serial, mpi, cuda, and cudampi pass (full test suite).
Waiting for feedback from @ohearnk who is testing on MSU HPCC with both K80's and V100's to confirm expected behavior. |
Performances numbers I'm getting on the Intel16 and AMD20 nodes on MSU HPCC (Kepler K80s and Volta V100s) are below. Note that tests are using the tight SCF benchmarks and are testing both current HEAD commit on master (2257ba8) and commits applied on top of master for PR-347, both without and with f-functions support enabled. The summary is that performance looks okay (slightly better for PR-347), but I did not go back to compare against the code in QUICK v23.08b so there may still be some performance lost from back there (but which will come back in coming ERI optimizations).
|
So, I think this is good to merge now. |
The enclosed changes resolve #324. There were two problems. 1) I increased the maximum number of primitives (maxprim) per basis function to 20, which was originally 10., 2) For some reason, nvcc takes longer to compile the ERI code when we have the assignment operator (=) instead of the assign-add compound operator (+=) in the VRR code. To lower the compile time, I used += in the VRR code and in the gradient driver, since we have to scale the primitive integrals, I initialized the primitive integral matrix (store2) to zero inside the primitive loops. This is the major culprit for performance hit in the gradient code. I have updated the VRR grad code to use = when saving the primitive integral values into the store2 matrix and eliminated initialization inside primitive loops. The maxprim variable is set to 10 unless we compile the code with F functions.
Here is the new timing for taxol example.
v23:
324-eri-gradient-performance-issue branch: