Fix for ERI gradient performance issue #347

Madu86 · 2024-03-17T23:49:46Z

The enclosed changes resolve #324. There were two problems:

1) I had increased the maximum number of primitives per basis function (maxprim) from 10 to 20.
2) For some reason, nvcc takes longer to compile the ERI code when the VRR code uses the assignment operator (=) instead of the add-assign compound operator (+=). To lower the compile time, I had switched the VRR code to +=. Because the gradient driver has to scale the primitive integrals, this meant initializing the primitive integral matrix (store2) to zero inside the primitive loops, which is the major culprit for the performance hit in the gradient code.

I have updated the VRR gradient code to use = when saving the primitive integral values into the store2 matrix and eliminated the initialization inside the primitive loops. The maxprim variable is now set to 10 unless the code is compiled with F functions.
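To make the difference between the two storage patterns concrete, here is a minimal host-side C++ sketch. It is not the QUICK kernel code: `kStoreSize`, `primitive_value`, `contract_with_add`, and `contract_with_assign` are hypothetical names, and the real `store2` handling lives in CUDA device code. It only illustrates why += forces a zeroing pass per primitive while = does not.

```cpp
#include <array>
#include <cstddef>

// All names and sizes here are hypothetical stand-ins, not QUICK's actual code.
constexpr std::size_t kStoreSize = 16;  // entries in the scratch matrix
using Store = std::array<double, kStoreSize>;

// Stand-in for the value a VRR routine would compute for primitive p, entry i.
inline double primitive_value(std::size_t p, std::size_t i) {
    return 0.5 * static_cast<double>(p + 1) + static_cast<double>(i);
}

// Pattern with +=: the scratch matrix must be zeroed for every primitive
// before values are accumulated into it and then scaled.
double contract_with_add(std::size_t nprim, const double* scale) {
    Store store2{};
    double total = 0.0;
    for (std::size_t p = 0; p < nprim; ++p) {
        store2.fill(0.0);  // extra pass over the scratch matrix per primitive
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            store2[i] += primitive_value(p, i);
        }
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            total += scale[p] * store2[i];
        }
    }
    return total;
}

// Pattern with =: every entry is overwritten, so no zeroing pass is needed.
double contract_with_assign(std::size_t nprim, const double* scale) {
    Store store2{};
    double total = 0.0;
    for (std::size_t p = 0; p < nprim; ++p) {
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            store2[i] = primitive_value(p, i);
        }
        for (std::size_t i = 0; i < kStoreSize; ++i) {
            total += scale[p] * store2[i];
        }
    }
    return total;
}
```

Both functions return the same contracted value. The = variant is only safe when every entry that is later read gets written for each primitive, which is the assumption this sketch makes.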

Here are the new timings for the taxol example.

v23:

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.809394000(  2.47%)
| DFT GRID OPERATIONS =     0.776835000(  0.68%)
| TOTAL SCF TIME      =    55.049241000( 48.48%)
|       TOTAL OP TIME      =    53.129157000( 46.79%)
|             TOTAL 1e TIME      =     0.465957000(  0.41%)
|             TOTAL 2e TIME      =    40.730061000( 35.87%)
|             TOTAL EXC TIME     =    11.223443000(  9.88%)
|       TOTAL DII TIME      =     1.873671000(  1.65%)
|             TOTAL DIAG TIME    =     0.453374000(  0.40%)
| TOTAL GRADIENT TIME      =    51.141289000( 45.04%)
|       TOTAL 1e GRADIENT TIME      =     4.066855000( 3.71%)
|       TOTAL 2e GRADIENT TIME      =    43.730815000(38.51%)
|       TOTAL EXC GRADIENT TIME     =     3.197379000(  2.82%)
| TOTAL TIME          =   113.555557000
------------------------------------

324-eri-gradient-performance-issue branch:

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.941698000(  2.57%)
| DFT GRID OPERATIONS =     0.776506000(  0.68%)
| TOTAL SCF TIME      =    54.886646000( 47.87%)
|       TOTAL OP TIME      =    53.292129000( 46.48%)
|             TOTAL 1e TIME      =     0.469765000(  0.41%)
|             TOTAL 2e TIME      =    40.998447000( 35.76%)
|             TOTAL EXC TIME     =    11.141979000(  9.72%)
|       TOTAL DII TIME      =     1.550480000(  1.35%)
|             TOTAL DIAG TIME    =     0.429817000(  0.37%)
| TOTAL GRADIENT TIME      =    53.245226000( 46.44%)
|       TOTAL 1e GRADIENT TIME      =     4.034990000( 3.64%)
|       TOTAL 2e GRADIENT TIME      =    45.878091000(40.01%)
|       TOTAL EXC GRADIENT TIME     =     3.188262000(  2.78%)
| TOTAL TIME          =   114.660209000
------------------------------------

agoetz · 2024-03-19T06:36:22Z

How big of a performance hit comes from larger MAXPRIM values? The cc-pVDZ basis set has contracted s functions with 14 primitives for a few elements (e.g. Ca and Ge, no f functions). MAXPRIM=10 would not work.
Does MAXPRIM need to be set at compile time? Otherwise it would make sense to set it at runtime based on the basis set that is used.
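As a rough illustration of the runtime idea, the sketch below scans a basis set for its largest contraction length and compares it with a compile-time limit. The `ContractedFunction` type, `kMaxPrim`, and both functions are hypothetical and do not correspond to QUICK's data structures. The value 14 is taken from the cc-pVDZ example above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical compile-time limit playing the role of MAXPRIM.
constexpr std::size_t kMaxPrim = 14;

// Hypothetical representation of one contracted basis function.
struct ContractedFunction {
    std::vector<double> exponents;     // one entry per primitive
    std::vector<double> coefficients;  // contraction coefficients
};

// Largest number of primitives in any contracted function of the basis set.
std::size_t max_contraction(const std::vector<ContractedFunction>& basis) {
    std::size_t longest = 0;
    for (const auto& f : basis) {
        longest = std::max(longest, f.exponents.size());
    }
    return longest;
}

// True if the basis set can be handled by code compiled with kMaxPrim.
bool fits_compile_time_limit(const std::vector<ContractedFunction>& basis) {
    return max_contraction(basis) <= kMaxPrim;
}
```

A check like this could either size buffers at runtime or, if the limit must stay a compile-time constant, reject an unsupported basis set with a clear error before any kernel is launched.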

agoetz · 2024-03-19T17:57:22Z

The value of MAXPRIM does not seem to affect performance. Timings for taxol (tight SCF settings) on Expanse A100 nodes, compiled without f function support.

master-2257ba84

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.225195000(  2.60%)
| DFT GRID OPERATIONS =     0.950536000(  0.77%)
| TOTAL SCF TIME      =    57.044406000( 45.99%)
|       TOTAL OP TIME      =    54.033858000( 43.56%)
|             TOTAL 1e TIME      =     0.421890000(  0.34%)
|             TOTAL 2e TIME      =    41.661156000( 33.59%)
|             TOTAL EXC TIME     =    11.232105000(  9.05%)
|       TOTAL DII TIME      =     2.883073000(  2.32%)
|             TOTAL DIAG TIME    =     1.254617000(  1.01%)
| TOTAL GRADIENT TIME      =    61.636344000( 49.69%)
|       TOTAL 1e GRADIENT TIME      =     3.704015000( 3.17%)
|       TOTAL 2e GRADIENT TIME      =    54.499291000(43.94%)
|       TOTAL EXC GRADIENT TIME     =     3.201951000(  2.58%)
| TOTAL TIME          =   124.044814000
------------------------------------
| Job cpu time:  0 days  0 hours  2 minutes  4.0 seconds.

v23.08b

------------- TIMING ---------------
| INITIAL GUESS TIME  =     2.722562000(  2.35%)
| DFT GRID OPERATIONS =     0.881157000(  0.76%)
| TOTAL SCF TIME      =    57.194307000( 49.35%)
|       TOTAL OP TIME      =    54.239142000( 46.80%)
|             TOTAL 1e TIME      =     0.418605000(  0.36%)
|             TOTAL 2e TIME      =    41.881871000( 36.14%)
|             TOTAL EXC TIME     =    11.233137000(  9.69%)
|       TOTAL DII TIME      =     2.825800000(  2.44%)
|             TOTAL DIAG TIME    =     1.243106000(  1.07%)
| TOTAL GRADIENT TIME      =    51.847473000( 44.74%)
|       TOTAL 1e GRADIENT TIME      =     3.684577000( 3.38%)
|       TOTAL 2e GRADIENT TIME      =    44.729103000(38.59%)
|       TOTAL EXC GRADIENT TIME     =     3.204255000(  2.76%)
| TOTAL TIME          =   115.895565000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 55.9 seconds.

324-eri-gradient-performance-issue, MAXPRIM=10

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.214959000(  2.75%)
| DFT GRID OPERATIONS =     0.891579000(  0.76%)
| TOTAL SCF TIME      =    57.270866000( 48.97%)
|       TOTAL OP TIME      =    54.282211000( 46.41%)
|             TOTAL 1e TIME      =     0.413180000(  0.35%)
|             TOTAL 2e TIME      =    41.887565000( 35.81%)
|             TOTAL EXC TIME     =    11.275406000(  9.64%)
|       TOTAL DII TIME      =     2.861102000(  2.45%)
|             TOTAL DIAG TIME    =     1.250597000(  1.07%)
| TOTAL GRADIENT TIME      =    54.398082000( 46.51%)
|       TOTAL 1e GRADIENT TIME      =     3.665321000( 3.33%)
|       TOTAL 2e GRADIENT TIME      =    47.301101000(40.44%)
|       TOTAL EXC GRADIENT TIME     =     3.201653000(  2.74%)
| TOTAL TIME          =   116.960426000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 57.0 seconds.

324-eri-gradient-performance-issue, MAXPRIM=14

------------- TIMING ---------------
| INITIAL GUESS TIME  =     3.195861000(  2.73%)
| DFT GRID OPERATIONS =     0.900699000(  0.77%)
| TOTAL SCF TIME      =    57.441711000( 49.06%)
|       TOTAL OP TIME      =    54.476833000( 46.53%)
|             TOTAL 1e TIME      =     0.420327000(  0.36%)
|             TOTAL 2e TIME      =    42.058335000( 35.92%)
|             TOTAL EXC TIME     =    11.290336000(  9.64%)
|       TOTAL DII TIME      =     2.837329000(  2.42%)
|             TOTAL DIAG TIME    =     1.263158000(  1.08%)
| TOTAL GRADIENT TIME      =    54.361989000( 46.43%)
|       TOTAL 1e GRADIENT TIME      =     3.657468000( 3.29%)
|       TOTAL 2e GRADIENT TIME      =    47.300566000(40.40%)
|       TOTAL EXC GRADIENT TIME     =     3.204332000(  2.74%)
| TOTAL TIME          =   117.075072000
------------------------------------
| Job cpu time:  0 days  0 hours  1 minutes 57.1 seconds.

I will change MAXPRIM to a default of 14 without f functions.
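A compile-time default of this kind could look roughly like the sketch below. The macro `ENABLEF_SKETCH` and the constant `kMaxPrim` are hypothetical names, and the value 20 for builds with f functions is an assumption based on the PR description, not something confirmed by the actual source.

```cpp
// Hypothetical names; the real QUICK macros and values may differ.
#ifdef ENABLEF_SKETCH
constexpr int kMaxPrim = 20;  // assumed value when f functions are compiled
#else
constexpr int kMaxPrim = 14;  // default without f functions
#endif

// cc-pVDZ has contracted s functions with 14 primitives (e.g. Ca, Ge),
// so any default below 14 would not support that basis set.
static_assert(kMaxPrim >= 14, "limit too small for cc-pVDZ contractions");
```

The `static_assert` documents the constraint in the code itself, so lowering the default later would fail at compile time rather than at runtime.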

The 2e gradient code is still a bit slower compared to version 23.08b (about 47.3 s versus 44.7 s for taxol), but it's pretty close.

AWG resolved conflicts in src/cuda/gpu_get2e_grad_ffff.cuh

agoetz

This looks good to me. I resolved a merge conflict and tested the code on Expanse A100 nodes. All tests (serial, mpi, cuda, and cudampi) pass with the full test suite.

agoetz · 2024-03-20T00:11:24Z

Waiting for feedback from @ohearnk, who is testing on MSU HPCC with both K80s and V100s to confirm expected behavior.

ohearnk · 2024-03-20T19:26:07Z

Performance numbers I'm getting on the Intel16 and AMD20 nodes on MSU HPCC (Kepler K80s and Volta V100s) are below. Note that the tests use the tight SCF benchmarks and cover both the current HEAD commit on master (2257ba8) and the PR-347 commits applied on top of master, each without and with f-function support enabled. In summary, performance looks okay (slightly better for PR-347), but I did not go back to compare against the code in QUICK v23.08b, so some performance may still be lost relative to that version (which should come back with the coming ERI optimizations).

K80:
  TEST: master-HEAD-2257ba84:
    psb5:          47 seconds =      0.8 minutes
    morphine:     285 seconds =      4.8 minutes
    taxol:       1891 seconds =     31.5 minutes
    valinomycin: 3685 seconds =     61.4 minutes
  TEST: pr-347:
    psb5:          46 seconds =      0.8 minutes
    morphine:     270 seconds =      4.5 minutes
    taxol:       1760 seconds =     29.3 minutes
    valinomycin: 3429 seconds =     57.1 minutes
  TEST: master-HEAD-2257ba84-enablef:
    psb5:          56 seconds =      0.9 minutes
    morphine:     281 seconds =      4.7 minutes
    taxol:       1853 seconds =     30.9 minutes
    valinomycin: 3607 seconds =     60.1 minutes
  TEST: pr-347-enablef:
    psb5:          58 seconds =      1.0 minutes
    morphine:     275 seconds =      4.6 minutes
    taxol:       1806 seconds =     30.1 minutes
    valinomycin: 3520 seconds =     58.7 minutes

V100:
  TEST: master-HEAD-2257ba84:
    psb5:          23 seconds =      0.4 minutes
    morphine:      31 seconds =      0.5 minutes
    taxol:        197 seconds =      3.3 minutes
    valinomycin:  395 seconds =      6.6 minutes
  TEST: pr-347:
    psb5:           8 seconds =      0.1 minutes
    morphine:      30 seconds =      0.5 minutes
    taxol:        184 seconds =      3.1 minutes
    valinomycin:  373 seconds =      6.2 minutes
  TEST: master-HEAD-2257ba84-enablef:
    psb5:          10 seconds =      0.2 minutes
    morphine:      33 seconds =      0.6 minutes
    taxol:        201 seconds =      3.4 minutes
    valinomycin:  399 seconds =      6.7 minutes
  TEST: pr-347-enablef:
    psb5:          10 seconds =      0.2 minutes
    morphine:      32 seconds =      0.5 minutes
    taxol:        187 seconds =      3.1 minutes
    valinomycin:  377 seconds =      6.3 minutes

ohearnk · 2024-03-20T19:27:03Z

So, I think this is good to merge now.

Madu86 added 7 commits March 17, 2024 10:01

disabled including unused kernel

6055276

added a preprocessor variable useful for primitive integral value storage in vrr code

710e7de

updated to store primitive integral values into store2 directly

63f8ce0

updated to initialize store2 array

fe39fd2

eliminated store add compound operation in primitive integral gradient kernels, moved store2 intialization out of primitive function loops

b1897cc

setting maximum number of primitive functions for 10 unless F functions are compiled.

5bd04d3

cleaned up

91f9aad

Madu86 added the Code cleanup and Bug fix labels Mar 17, 2024

Madu86 requested a review from agoetz March 17, 2024 23:49

Madu86 assigned Madu86 and agoetz Mar 17, 2024

AWG - increase MAXPRIM to 14

2269f86

The cc-pVDZ basis set has contraction level up to 14 for a few elements like Ca, Ge etc.

Merge branch 'master' into 324-eri-gradient-performance-issue

e985907

AWG resolved conflicts in src/cuda/gpu_get2e_grad_ffff.cuh

agoetz approved these changes Mar 20, 2024

ohearnk mentioned this pull request Mar 20, 2024

Add error checks for builds enabling f-functions with GPU targets requiring legacy atomics (-DUSE_LEGACY_ATOMICS). #345

Merged

agoetz merged commit 45db608 into master Mar 20, 2024
8 checks passed
