[warps] measure cuda thread divergence #202
Merged
valassi merged 8 commits into madgraph5:master on May 13, 2021
…oughput12.sh
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.747784 sec
2,606,062,103 cycles # 2.648 GHz
3,536,734,749 instructions # 1.36 insn per cycle
1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 868,352
: smsp__sass_branch_targets_threads_uniform.sum 868,352
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 16,384
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397452e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.608068 sec
2,198,704,176 cycles # 2.652 GHz
2,956,510,323 instructions # 1.34 insn per cycle
0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 9,053,696
: smsp__sass_branch_targets_threads_uniform.sum 9,053,696
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 512
=========================================================================
…lly profile more details
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.410966e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.755021 sec
2,631,624,147 cycles # 2.658 GHz
3,536,753,184 instructions # 1.34 insn per cycle
1.052640869 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.401510e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.604015 sec
2,186,419,449 cycles # 2.652 GHz
2,935,826,967 instructions # 1.34 insn per cycle
0.887381812 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
… divergence and measure it
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.012377 sec
3,103,258,724 cycles # 2.652 GHz
4,387,995,862 instructions # 1.41 insn per cycle
1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 1,785,856
: smsp__sass_branch_targets_threads_uniform.sum 1,720,320
: smsp__sass_branch_targets_threads_divergent.sum 65,536
: smsp__warps_launched.sum 16,384
=========================================================================
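For readers unfamiliar with the metric, the following is a minimal Python model of what a "divergent" branch target means, assuming a warp of 32 threads that are all active. It is only an illustration of the concept: the function name `classify_branch` is hypothetical, and this is not the code used in this PR to introduce divergence.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs such as the V100S

def classify_branch(predicates):
    """Classify one branch as seen by one warp.

    `predicates` holds the branch condition evaluated by each of the
    32 lanes. The branch is uniform if every lane takes the same side,
    and divergent if the warp splits across both sides.
    """
    assert len(predicates) == WARP_SIZE
    return "uniform" if len(set(predicates)) == 1 else "divergent"

# All lanes agree: the whole warp follows a single path.
print(classify_branch([True] * WARP_SIZE))

# Branching on the parity of the lane index splits the warp in two.
print(classify_branch([lane % 2 == 0 for lane in range(WARP_SIZE)]))
```

In this model the first call reports a uniform branch and the second a divergent one, which is the distinction that the `threads_uniform` and `threads_divergent` counters in the log above keep track of.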
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.677619e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.748096 sec
2,610,682,594 cycles # 2.651 GHz
3,539,886,609 instructions # 1.36 insn per cycle
1.051570589 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109
: smsp__sass_branch_targets_threads_uniform.sum 105
: smsp__sass_branch_targets_threads_divergent.sum 4
: smsp__warps_launched.sum 1
=========================================================================
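The counters in the two logs above can be cross-checked with a few lines of arithmetic. This sketch assumes that the reported percentage is the plain ratio of uniform to total branch targets (an interpretation that these numbers happen to satisfy, not something the log states); the helper name `uniform_pct` is hypothetical.

```python
def uniform_pct(uniform: int, total: int) -> float:
    """Percentage of branch targets on which the whole warp agreed."""
    return 100.0 * uniform / total

# Counter values copied from the two logs above (build with divergence).
grid = {"targets": 1_785_856, "uniform": 1_720_320,
        "divergent": 65_536, "warps": 16_384}
warp = {"targets": 109, "uniform": 105, "divergent": 4, "warps": 1}

for name, c in (("full grid", grid), ("one warp", warp)):
    # Uniform and divergent targets add up to the total.
    assert c["uniform"] + c["divergent"] == c["targets"]
    print(f"{name}: {uniform_pct(c['uniform'], c['targets']):.2f}% uniform")

# Scaling the one-warp counts by the number of warps launched
# reproduces the full-grid sums.
assert warp["targets"] * grid["warps"] == grid["targets"]
assert warp["divergent"] * grid["warps"] == grid["divergent"]
```

Both cases give 96.33%, and the full-grid sums are exactly the one-warp counts times 16,384, so profiling a single warp carries the same information about divergence as profiling the whole grid here.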
…ion to add it as an example
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.433666e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.747238 sec
2,606,742,547 cycles # 2.653 GHz
3,539,351,466 instructions # 1.36 insn per cycle
1.051872183 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53
: smsp__sass_branch_targets_threads_uniform.sum 53
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.456046e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.604293 sec
2,202,232,601 cycles # 2.651 GHz
2,955,309,807 instructions # 1.34 insn per cycle
0.889275197 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683
: smsp__sass_branch_targets_threads_uniform.sum 17,683
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 1
=========================================================================
Note that I tried the 'stalled_barrier' metrics and they do not seem interesting
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.745683 sec
2,603,540,638 cycles # 2.655 GHz
3,537,849,260 instructions # 1.36 insn per cycle
1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109 4.18/usecond
: smsp__sass_branch_targets_threads_uniform.sum 105 4.03/usecond
: smsp__sass_branch_targets_threads_divergent.sum 4 153.37/msecond
: smsp__warps_launched.sum 1
=========================================================================
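The extra column in the log above gives each counter as a rate. Assuming the rate is simply the counter divided by the duration of the profiled kernel (an inference from the numbers, not something the log states), the three rates should all imply the same duration. A small sketch, with the hypothetical helper `duration_us`:

```python
def duration_us(count: float, rate_per_us: float) -> float:
    """Elapsed time in microseconds implied by a counter and its rate."""
    return count / rate_per_us

durations = [
    duration_us(109, 4.18),           # branch targets, 4.18/usecond
    duration_us(105, 4.03),           # uniform targets, 4.03/usecond
    duration_us(4, 153.37 / 1000.0),  # divergent targets, 153.37/msecond
]
for d in durations:
    print(f"{d:.1f} us")
```

All three values come out at about 26 microseconds, so under this assumption the rate column adds no information beyond the raw counts and the kernel duration.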
…ce test
Note that the throughput degradation from divergence is real and reproducible.
Without divergence the throughput is now around 6.4E8 events per second, against 5.7E8 with divergence.
It is however very difficult to correlate the percentage degradation in throughput with the divergence metrics.
In summary, one should simply aim for 100% uniform execution.
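To make the mismatch concrete, here is a small sketch that compares the relative throughput loss with the fraction of divergent branch targets. It uses the rounded throughputs quoted in the note above and the one-warp counters from the divergent build (4 divergent out of 109 branch targets); no causal relation between the two percentages is implied.

```python
# Rounded throughputs quoted in the note above (events per second).
tput_uniform = 6.4e8    # build with 100% uniform branch targets
tput_divergent = 5.7e8  # build with 96.33% uniform branch targets

# Relative throughput loss caused by the added divergence.
loss_pct = 100.0 * (tput_uniform - tput_divergent) / tput_uniform

# Divergent share of branch targets, from the one-warp profile.
divergent_pct = 100.0 * 4 / 109

print(f"throughput loss: {loss_pct:.1f}%")
print(f"divergent branch targets: {divergent_pct:.1f}%")
```

The loss is roughly 11% while fewer than 4% of branch targets are divergent, which illustrates why the metric is more useful as a pass/fail check for 100% uniformity than as a predictor of the size of the slowdown.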
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.741551 sec
2,589,547,187 cycles # 2.655 GHz
3,537,039,425 instructions # 1.37 insn per cycle
1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_uniform.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.602111 sec
2,193,960,041 cycles # 2.654 GHz
2,948,877,241 instructions # 1.34 insn per cycle
0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_uniform.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
=========================================================================
Self merging
Two changes
This essentially resolves issue #25, using command line profiling rather than GUI profiling