Fix for ERI gradient performance issue #347

Merged
merged 9 commits into from
Mar 20, 2024

Conversation


Madu86 commented Mar 17, 2024

The enclosed changes resolve #324. There were two problems:

1) I had increased the maximum number of primitives per basis function (maxprim) from 10 to 20.
2) For some reason, nvcc takes longer to compile the ERI code when the VRR code uses the assignment operator (=) instead of the compound add-assign operator (+=). To lower the compile time, I had used += in the VRR code. Because the gradient driver has to scale the primitive integrals, I then initialized the primitive integral matrix (store2) to zero inside the primitive loops. This initialization is the major culprit for the performance hit in the gradient code.

I have updated the VRR gradient code to use = when saving the primitive integral values into the store2 matrix and eliminated the initialization inside the primitive loops. The maxprim variable is now set to 10 unless we compile the code with f functions.
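To make the difference between the two patterns concrete, here is a minimal host-side C++ sketch. It is not the QUICK code: `kStoreSize`, `kMaxPrim`, `primitive_value`, and the single `scale` factor are hypothetical stand-ins, and the real store2 layout and VRR routines are not shown in this thread. It only illustrates why accumulating with += forces a zeroing pass per primitive, while assigning with = does not.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes; the real store2 layout in QUICK is not shown in this thread.
constexpr std::size_t kStoreSize = 8;
constexpr std::size_t kMaxPrim = 14;
using Store = std::array<double, kStoreSize>;

// Stand-in for one primitive integral produced by the VRR step (illustrative only).
inline double primitive_value(std::size_t prim, std::size_t idx) {
    return 0.5 * static_cast<double>(prim + 1) + 0.25 * static_cast<double>(idx);
}

// Pattern A: the VRR step uses +=, so store2 has to be cleared for every primitive.
inline Store contract_accumulate(std::size_t nprim, double scale) {
    assert(nprim <= kMaxPrim);
    Store result{};  // value-initialized to zeros
    Store store2{};
    for (std::size_t p = 0; p < nprim; ++p) {
        store2.fill(0.0);  // extra pass over store2 inside the primitive loop
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            store2[i] += primitive_value(p, i);
        }
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            result[i] += scale * store2[i];  // scale, then add to the contracted result
        }
    }
    return result;
}

// Pattern B: the VRR step uses =, so every entry is overwritten and no clearing is needed.
inline Store contract_assign(std::size_t nprim, double scale) {
    assert(nprim <= kMaxPrim);
    Store result{};
    Store store2{};
    for (std::size_t p = 0; p < nprim; ++p) {
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            store2[i] = primitive_value(p, i);
        }
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            result[i] += scale * store2[i];
        }
    }
    return result;
}
```

Both functions return the same contracted values; pattern B simply skips the `fill` call on every primitive iteration, which is the kind of work this PR removes from the gradient driver.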

Here is the new timing for taxol example.

v23:

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.809394000(  2.47%)
| DFT GRID OPERATIONS =     0.776835000(  0.68%)
| TOTAL SCF TIME      =    55.049241000( 48.48%)
|       TOTAL OP TIME      =    53.129157000( 46.79%)
|             TOTAL 1e TIME      =     0.465957000(  0.41%)
|             TOTAL 2e TIME      =    40.730061000( 35.87%)
|             TOTAL EXC TIME     =    11.223443000(  9.88%)
|       TOTAL DII TIME      =     1.873671000(  1.65%)
|             TOTAL DIAG TIME    =     0.453374000(  0.40%)
| TOTAL GRADIENT TIME      =    51.141289000( 45.04%)
|       TOTAL 1e GRADIENT TIME      =     4.066855000( 3.71%)
|       TOTAL 2e GRADIENT TIME      =    43.730815000(38.51%)
|       TOTAL EXC GRADIENT TIME     =     3.197379000(  2.82%)
| TOTAL TIME          =   113.555557000
------------------------------------

324-eri-gradient-performance-issue branch:

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.941698000(  2.57%)
| DFT GRID OPERATIONS =     0.776506000(  0.68%)
| TOTAL SCF TIME      =    54.886646000( 47.87%)
|       TOTAL OP TIME      =    53.292129000( 46.48%)
|             TOTAL 1e TIME      =     0.469765000(  0.41%)
|             TOTAL 2e TIME      =    40.998447000( 35.76%)
|             TOTAL EXC TIME     =    11.141979000(  9.72%)
|       TOTAL DII TIME      =     1.550480000(  1.35%)
|             TOTAL DIAG TIME    =     0.429817000(  0.37%)
| TOTAL GRADIENT TIME      =    53.245226000( 46.44%)
|       TOTAL 1e GRADIENT TIME      =     4.034990000( 3.64%)
|       TOTAL 2e GRADIENT TIME      =    45.878091000(40.01%)
|       TOTAL EXC GRADIENT TIME     =     3.188262000(  2.78%)
| TOTAL TIME          =   114.660209000
------------------------------------

Madu86 added the Code cleanup and Bug fix labels Mar 17, 2024
Madu86 requested a review from agoetz March 17, 2024 23:49

agoetz commented Mar 19, 2024

How big of a performance hit comes from larger MAXPRIM values? The cc-pVDZ basis set has contracted s functions with 14 primitives for a few elements (e.g. Ca and Ge, no f functions). MAXPRIM=10 would not work.
Does MAXPRIM need to be set at compile time? Otherwise it would make sense to set it at runtime based on the basis set that is used.
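One way to picture the concern is a check, done once the basis set is known, that the largest contraction length fits within the compiled-in limit. The sketch below is only an illustration under assumptions: `max_contraction` and `fits_maxprim` are hypothetical helpers, not QUICK functions, and the limit is passed as a plain argument rather than read from the real MAXPRIM definition.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: largest number of primitives in any contracted basis function.
inline int max_contraction(const std::vector<int>& prims_per_function) {
    if (prims_per_function.empty()) {
        return 0;
    }
    return *std::max_element(prims_per_function.begin(), prims_per_function.end());
}

// True when every contracted function fits within the given primitive limit.
inline bool fits_maxprim(const std::vector<int>& prims_per_function, int maxprim) {
    return max_contraction(prims_per_function) <= maxprim;
}
```

For a basis set whose longest contraction has 14 primitives, `fits_maxprim(prims, 10)` is false and `fits_maxprim(prims, 14)` is true, which matches the point that MAXPRIM=10 would not work for those elements.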

The cc-pVDZ basis set has contraction level up to 14 for a few elements
like Ca, Ge etc.

agoetz commented Mar 19, 2024

The value of MAXPRIM does not seem to affect performance. Timings for taxol (tight SCF settings) on Expanse A100 nodes, compiled without f function support.

master-2257ba84

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.225195000(  2.60%)
| DFT GRID OPERATIONS =     0.950536000(  0.77%)
| TOTAL SCF TIME      =    57.044406000( 45.99%)
|       TOTAL OP TIME      =    54.033858000( 43.56%)
|             TOTAL 1e TIME      =     0.421890000(  0.34%)
|             TOTAL 2e TIME      =    41.661156000( 33.59%)
|             TOTAL EXC TIME     =    11.232105000(  9.05%)
|       TOTAL DII TIME      =     2.883073000(  2.32%)
|             TOTAL DIAG TIME    =     1.254617000(  1.01%)
| TOTAL GRADIENT TIME      =    61.636344000( 49.69%)
|       TOTAL 1e GRADIENT TIME      =     3.704015000( 3.17%)
|       TOTAL 2e GRADIENT TIME      =    54.499291000(43.94%)
|       TOTAL EXC GRADIENT TIME     =     3.201951000(  2.58%)
| TOTAL TIME          =   124.044814000
------------------------------------
| Job cpu time:  0 days  0 hours  2 minutes  4.0 seconds.

v23.08b

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.722562000(  2.35%)
| DFT GRID OPERATIONS =     0.881157000(  0.76%)
| TOTAL SCF TIME      =    57.194307000( 49.35%)
|       TOTAL OP TIME      =    54.239142000( 46.80%)
|             TOTAL 1e TIME      =     0.418605000(  0.36%)
|             TOTAL 2e TIME      =    41.881871000( 36.14%)
|             TOTAL EXC TIME     =    11.233137000(  9.69%)
|       TOTAL DII TIME      =     2.825800000(  2.44%)
|             TOTAL DIAG TIME    =     1.243106000(  1.07%)
| TOTAL GRADIENT TIME      =    51.847473000( 44.74%)
|       TOTAL 1e GRADIENT TIME      =     3.684577000( 3.38%)
|       TOTAL 2e GRADIENT TIME      =    44.729103000(38.59%)
|       TOTAL EXC GRADIENT TIME     =     3.204255000(  2.76%)
| TOTAL TIME          =   115.895565000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 55.9 seconds.

324-eri-gradient-performance-issue, MAXPRIM=10

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.214959000(  2.75%)
| DFT GRID OPERATIONS =     0.891579000(  0.76%)
| TOTAL SCF TIME      =    57.270866000( 48.97%)
|       TOTAL OP TIME      =    54.282211000( 46.41%)
|             TOTAL 1e TIME      =     0.413180000(  0.35%)
|             TOTAL 2e TIME      =    41.887565000( 35.81%)
|             TOTAL EXC TIME     =    11.275406000(  9.64%)
|       TOTAL DII TIME      =     2.861102000(  2.45%)
|             TOTAL DIAG TIME    =     1.250597000(  1.07%)
| TOTAL GRADIENT TIME      =    54.398082000( 46.51%)
|       TOTAL 1e GRADIENT TIME      =     3.665321000( 3.33%)
|       TOTAL 2e GRADIENT TIME      =    47.301101000(40.44%)
|       TOTAL EXC GRADIENT TIME     =     3.201653000(  2.74%)
| TOTAL TIME          =   116.960426000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 57.0 seconds.

324-eri-gradient-performance-issue, MAXPRIM=14

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.195861000(  2.73%)
| DFT GRID OPERATIONS =     0.900699000(  0.77%)
| TOTAL SCF TIME      =    57.441711000( 49.06%)
|       TOTAL OP TIME      =    54.476833000( 46.53%)
|             TOTAL 1e TIME      =     0.420327000(  0.36%)
|             TOTAL 2e TIME      =    42.058335000( 35.92%)
|             TOTAL EXC TIME     =    11.290336000(  9.64%)
|       TOTAL DII TIME      =     2.837329000(  2.42%)
|             TOTAL DIAG TIME    =     1.263158000(  1.08%)
| TOTAL GRADIENT TIME      =    54.361989000( 46.43%)
|       TOTAL 1e GRADIENT TIME      =     3.657468000( 3.29%)
|       TOTAL 2e GRADIENT TIME      =    47.300566000(40.40%)
|       TOTAL EXC GRADIENT TIME     =     3.204332000(  2.74%)
| TOTAL TIME          =   117.075072000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 57.1 seconds.

I will change MAXPRIM to a default of 14 without f functions.

The 2e gradient code is still a bit slower compared to version 23.08b, but it's pretty close.
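For reference, the relative differences can be computed directly from the TOTAL 2e GRADIENT TIME lines above. The helper `percent_change` and the constant names are illustrative only; the numbers are copied from the timing tables in this comment. With these values the branch comes out roughly 5.7% slower than v23.08b and roughly 13.2% faster than master-2257ba84.

```cpp
// Relative change of `after` with respect to `before`, in percent.
// Positive means slower (more time), negative means faster.
inline double percent_change(double before, double after) {
    return 100.0 * (after - before) / before;
}

// TOTAL 2e GRADIENT TIME values (seconds) copied from the tables above.
constexpr double kMaster2eGrad = 54.499291;  // master-2257ba84
constexpr double kV2308b2eGrad = 44.729103;  // v23.08b
constexpr double kBranch2eGrad = 47.300566;  // 324-eri-gradient-performance-issue, MAXPRIM=14

inline double branch_vs_v2308b() {
    return percent_change(kV2308b2eGrad, kBranch2eGrad);
}

inline double branch_vs_master() {
    return percent_change(kMaster2eGrad, kBranch2eGrad);
}
```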

AWG resolved conflicts in src/cuda/gpu_get2e_grad_ffff.cuh

@agoetz agoetz left a comment

This looks good to me. I resolved a merge conflict and tested the code on Expanse A100 nodes. All tests in the full test suite (serial, mpi, cuda, and cudampi) pass.


agoetz commented Mar 20, 2024

Waiting for feedback from @ohearnk, who is testing on MSU HPCC with both K80s and V100s to confirm expected behavior.


ohearnk commented Mar 20, 2024

The performance numbers I'm getting on the Intel16 and AMD20 nodes on MSU HPCC (Kepler K80s and Volta V100s) are below. Note that the tests use the tight SCF benchmarks and cover both the current HEAD commit on master (2257ba8) and the commits applied on top of master for PR-347, both without and with f-function support enabled. The summary is that performance looks okay (slightly better for PR-347), but I did not go back and compare against the code in QUICK v23.08b, so there may still be some performance lost relative to that version (which will come back in the coming ERI optimizations).

K80:
  TEST: master-HEAD-2257ba84:
    psb5:          47 seconds =      0.8 minutes
    morphine:     285 seconds =      4.8 minutes
    taxol:       1891 seconds =     31.5 minutes
    valinomycin: 3685 seconds =     61.4 minutes
  TEST: pr-347:
    psb5:          46 seconds =      0.8 minutes
    morphine:     270 seconds =      4.5 minutes
    taxol:       1760 seconds =     29.3 minutes
    valinomycin: 3429 seconds =     57.1 minutes
  TEST: master-HEAD-2257ba84-enablef:
    psb5:          56 seconds =      0.9 minutes
    morphine:     281 seconds =      4.7 minutes
    taxol:       1853 seconds =     30.9 minutes
    valinomycin: 3607 seconds =     60.1 minutes
  TEST: pr-347-enablef:
    psb5:          58 seconds =      1.0 minutes
    morphine:     275 seconds =      4.6 minutes
    taxol:       1806 seconds =     30.1 minutes
    valinomycin: 3520 seconds =     58.7 minutes

V100:
  TEST: master-HEAD-2257ba84:
    psb5:          23 seconds =      0.4 minutes
    morphine:      31 seconds =      0.5 minutes
    taxol:        197 seconds =      3.3 minutes
    valinomycin:  395 seconds =      6.6 minutes
  TEST: pr-347:
    psb5:           8 seconds =      0.1 minutes
    morphine:      30 seconds =      0.5 minutes
    taxol:        184 seconds =      3.1 minutes
    valinomycin:  373 seconds =      6.2 minutes
  TEST: master-HEAD-2257ba84-enablef:
    psb5:          10 seconds =      0.2 minutes
    morphine:      33 seconds =      0.6 minutes
    taxol:        201 seconds =      3.4 minutes
    valinomycin:  399 seconds =      6.7 minutes
  TEST: pr-347-enablef:
    psb5:          10 seconds =      0.2 minutes
    morphine:      32 seconds =      0.5 minutes
    taxol:        187 seconds =      3.1 minutes
    valinomycin:  377 seconds =      6.3 minutes


ohearnk commented Mar 20, 2024

So, I think this is good to merge now.

Labels
Bug fix, Code cleanup
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

ERI gradient performance issue
3 participants