Copying spectral radii back to main memory after blockette routines … #37
... if updateDt is true. Without this change, reverse mode AD routines use outdated spectral radii values, which results in inaccurate sensitivities.
Purpose
This PR addresses issues #32 and #36.
With this change, the spectral radii values are copied back from cached memory to main memory after a blockette residual computation that also updates the time step. The copy is done inside the updateDt check because these values are not needed for matrix-free matrix-vector products; however, they must be copied after every ANK and NK step.
The extra copy has a cost because it increases memory access. This cost is incurred only by blockette calls that also update the time step, which matrix-free operations do not require.
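The conditional copy-back described above can be illustrated with a minimal sketch. This is not the code changed in this PR: the function `blockette_residual`, the arguments `main_rad`, `block_start`, `block_size`, and `update_dt`, and the doubling used as a stand-in for the spectral-radius update are all hypothetical and exist only to show the control flow.

```python
def blockette_residual(main_rad, block_start, block_size, update_dt):
    """Toy stand-in for a cache-blocked residual call (all names hypothetical)."""
    block_end = block_start + block_size
    # Work on a small, cache-resident copy of this block's spectral radii.
    cached_rad = main_rad[block_start:block_end]
    if update_dt:
        # Placeholder for the time-step update, which recomputes the radii.
        cached_rad = [2.0 * r for r in cached_rad]
        # Copy back only in this branch, so later routines that read the
        # main array (e.g. reverse-mode AD) see the updated values.
        main_rad[block_start:block_end] = cached_rad
    # A real routine would return a residual; the toy returns the cached radii.
    return cached_rad


main_rad = [1.0, 1.0, 1.0, 1.0]
blockette_residual(main_rad, 0, 2, update_dt=False)  # main_rad is unchanged
blockette_residual(main_rad, 0, 2, update_dt=True)   # first block is written back
print(main_rad)
```

Calls with `update_dt=False`, which stand in for matrix-free products here, skip the write to the main array entirely and so avoid the extra memory traffic.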
I ran tests to measure the performance difference this change causes for one residual evaluation. The tests were run on one Skylake node of Stampede2, compiled with the AVX2 instruction set, using 48 processors, a blockette size of 8, and test block sizes of 32^3 and 48^3 per processor. The metrics reported are:
Base speed: Millions of cells processed per processor per second with the default residual routines that operate on the main arrays.
Blockette speed: Same speed metric, but with cache-blocked residual routines.
Speedup: Blockette speed / base speed
The timing results:
Old results before this change:
32^3:
Base speed: 0.676747811395
Blockette speed: 1.24815149274
Speedup: 1.844337686393324
48^3:
Base speed: 0.629510355349
Blockette speed: 1.29542792502
Speedup: 2.057834178600312
New results with this change:
32^3:
Base speed: 0.676336855482
Blockette speed: 1.21488916591
Speedup: 1.7962782244717301
48^3:
Base speed: 0.627697421135
Blockette speed: 1.25598801018
Speedup: 2.00094499019755
The extra memory access reduces the speedup from 1.84 to 1.80 for the 32^3 test blocks and from 2.06 to 2.00 for the 48^3 test blocks, a drop of roughly 3% in each case. This is a very minor decrease in performance. Furthermore, this "slowdown" occurs only in calls that update the time step, and the most computationally intensive calls (matrix-free operations and preconditioner computations) do not update the time step. Therefore, the overall effect of this change on performance will be negligible.
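The figures quoted above can be reproduced from the reported timings. The short script below is only an arithmetic check; the helper names `speedup` and `relative_drop` are made up for illustration, and the numbers are copied from the results listed in this description.

```python
# (base speed, blockette speed), in millions of cells per processor per second
timings = {
    "32^3": {"old": (0.676747811395, 1.24815149274),
             "new": (0.676336855482, 1.21488916591)},
    "48^3": {"old": (0.629510355349, 1.29542792502),
             "new": (0.627697421135, 1.25598801018)},
}


def speedup(base, blockette):
    """Speedup as defined above: blockette speed divided by base speed."""
    return blockette / base


def relative_drop(old, new):
    """Fractional decrease going from old to new."""
    return (old - new) / old


for size, t in timings.items():
    s_old = speedup(*t["old"])
    s_new = speedup(*t["new"])
    print(f"{size}: speedup {s_old:.2f} -> {s_new:.2f}, "
          f"relative drop {100 * relative_drop(s_old, s_new):.1f}%")
```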
Type of change
Testing
Regression tests pass. The test in #32 also passes.
Checklist