[warps] measure cuda thread divergence by valassi · Pull Request #202 · madgraph5/madgraph4gpu

valassi · 2021-05-12T18:19:29Z

Two changes:

- improve the profiling script to also measure CUDA thread divergence
- add an option to artificially introduce some thread divergence in eemumu, just to check what it does

This essentially resolves issue #25, using command line profiling rather than GUI profiling
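As a rough illustration of what "thread divergence" means here, the toy model below classifies a branch as uniform or divergent for each warp. It is not the code added in this PR: the helper names, the 1024-thread launch size and the three example predicates are assumptions chosen for illustration only. The one fact it relies on is that an NVIDIA warp has 32 threads, and that a branch is uniform only when all threads of a warp take the same side.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def classify_branch(lane_predicates):
    """Classify one branch for one warp from the per-lane branch conditions."""
    if len(lane_predicates) != WARP_SIZE:
        raise ValueError("expected one predicate per lane of the warp")
    # Uniform means every lane evaluated the condition to the same value.
    return "uniform" if len(set(lane_predicates)) == 1 else "divergent"


def count_divergent_warps(n_threads, predicate):
    """Count warps in which `predicate(thread_id)` differs between lanes.

    `n_threads` is assumed to be a multiple of WARP_SIZE.
    """
    divergent = 0
    for warp_start in range(0, n_threads, WARP_SIZE):
        lanes = [bool(predicate(tid)) for tid in range(warp_start, warp_start + WARP_SIZE)]
        if classify_branch(lanes) == "divergent":
            divergent += 1
    return divergent


# Hypothetical launch of 1024 threads, i.e. 32 warps.
# 1) The condition does not depend on the thread: no divergence.
uniform_case = count_divergent_warps(1024, lambda tid: True)
# 2) The condition depends on the parity of the thread index: every warp diverges.
divergent_case = count_divergent_warps(1024, lambda tid: tid % 2 == 0)
# 3) The condition depends only on the warp index: each warp stays uniform.
per_warp_case = count_divergent_warps(1024, lambda tid: (tid // WARP_SIZE) % 2 == 0)

print(uniform_case, divergent_case, per_warp_case)
```

The third case shows that a thread-dependent branch costs nothing in this model as long as the condition is constant within each warp.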

…oughput12.sh

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.747784 sec
2,606,062,103 cycles # 2.648 GHz
3,536,734,749 instructions # 1.36 insn per cycle
1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 868,352
: smsp__sass_branch_targets_threads_uniform.sum 868,352
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 16,384
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397452e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.608068 sec
2,198,704,176 cycles # 2.652 GHz
2,956,510,323 instructions # 1.34 insn per cycle
0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 9,053,696
: smsp__sass_branch_targets_threads_uniform.sum 9,053,696
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 512
=========================================================================

…lly profile more details

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.410966e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.755021 sec
2,631,624,147 cycles # 2.658 GHz
3,536,753,184 instructions # 1.34 insn per cycle
1.052640869 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.401510e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.604015 sec
2,186,419,449 cycles # 2.652 GHz
2,935,826,967 instructions # 1.34 insn per cycle
0.887381812 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================

… divergence and measure it

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.012377 sec
3,103,258,724 cycles # 2.652 GHz
4,387,995,862 instructions # 1.41 insn per cycle
1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 1,785,856
: smsp__sass_branch_targets_threads_uniform.sum 1,720,320
: smsp__sass_branch_targets_threads_divergent.sum 65,536
: smsp__warps_launched.sum 16,384
=========================================================================

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.677619e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.748096 sec
2,610,682,594 cycles # 2.651 GHz
3,539,886,609 instructions # 1.36 insn per cycle
1.051570589 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109
: smsp__sass_branch_targets_threads_uniform.sum 105
: smsp__sass_branch_targets_threads_divergent.sum 4
: smsp__warps_launched.sum 1
=========================================================================
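The 96.33% figure in the two divergence runs above appears to be simply the ratio of uniform branch targets to all branch targets. The short check below recomputes it from the raw counters quoted in the logs; the helper name `uniform_pct` is made up for this note, and the reading of the `.pct` metric as uniform/total is an inference from these numbers rather than a statement from the Nsight Compute documentation.

```python
def uniform_pct(uniform_targets, total_targets):
    """Percentage of branch targets for which all threads of the warp agreed."""
    return 100.0 * uniform_targets / total_targets


WARPS_LAUNCHED = 16_384  # smsp__warps_launched.sum in the full-size run

# Full-size divergence run: counters summed over 16,384 warps.
full_run = uniform_pct(1_720_320, 1_785_856)
# Single-warp divergence run: counters for exactly one warp.
single_warp = uniform_pct(105, 109)

print(f"full run:    {full_run:.2f}%")
print(f"single warp: {single_warp:.2f}%")

# The full-run counters are the single-warp counters times the number of warps.
print(109 * WARPS_LAUNCHED, 105 * WARPS_LAUNCHED, 4 * WARPS_LAUNCHED)
```

Both runs give the same percentage because every warp executes the same 109 branch targets, 4 of which diverge. The single-warp run is therefore enough to read off the divergence pattern.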

…ion to add it as an example

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.433666e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.747238 sec
2,606,742,547 cycles # 2.653 GHz
3,539,351,466 instructions # 1.36 insn per cycle
1.051872183 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53
: smsp__sass_branch_targets_threads_uniform.sum 53
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.456046e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.604293 sec
2,202,232,601 cycles # 2.651 GHz
2,955,309,807 instructions # 1.34 insn per cycle
0.889275197 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683
: smsp__sass_branch_targets_threads_uniform.sum 17,683
: smsp__sass_branch_targets_threads_divergent.sum 0
: smsp__warps_launched.sum 1
=========================================================================

Note that I tried the 'stalled_barrier' metrics and they do not seem interesting.

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.745683 sec
2,603,540,638 cycles # 2.655 GHz
3,537,849,260 instructions # 1.36 insn per cycle
1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109 4.18/usecond
: smsp__sass_branch_targets_threads_uniform.sum 105 4.03/usecond
: smsp__sass_branch_targets_threads_divergent.sum 4 153.37/msecond
: smsp__warps_launched.sum 1
=========================================================================

…ce test

Note that the throughput degradation from divergence is real and reproducible: the throughput is now around 6.4E8 events per second without divergence, against 5.7E8 with divergence. It is, however, very difficult to correlate the percentage degradation in throughput with the divergence metrics. In summary, one should just aim at 100% uniform execution.

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.741551 sec
2,589,547,187 cycles # 2.655 GHz
3,537,039,425 instructions # 1.37 insn per cycle
1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_uniform.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.602111 sec
2,193,960,041 cycles # 2.654 GHz
2,948,877,241 instructions # 1.34 insn per cycle
0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_uniform.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
=========================================================================
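To put numbers on the remark that the throughput drop is hard to relate to the metrics, the sketch below compares the relative throughput loss with the fraction of divergent branch targets. The inputs are taken from the two most recent logs (6.425099e+08 without divergence, 5.711994e+08 with divergence, and 4 divergent targets out of 109). The helper `relative_drop` is a name made up for this note, and using a single pair of runs is an assumption of the sketch, not a proper benchmark.

```python
def relative_drop(baseline, degraded):
    """Fractional loss of `degraded` with respect to `baseline`."""
    return 1.0 - degraded / baseline


# EvtsPerSec[MatrixElems] for eemumu, from the logs above.
uniform_throughput = 6.425099e08    # 100% uniform branch targets
divergent_throughput = 5.711994e08  # 96.33% uniform branch targets

# Branch-target counters for one warp in the divergent build.
divergent_targets = 4
total_targets = 109

throughput_drop = relative_drop(uniform_throughput, divergent_throughput)
divergent_fraction = divergent_targets / total_targets

print(f"throughput drop:    {100 * throughput_drop:.1f}%")
print(f"divergent targets:  {100 * divergent_fraction:.1f}%")
print(f"ratio of the two:   {throughput_drop / divergent_fraction:.1f}")
```

With these inputs the throughput loss is roughly 11%, about three times the 3.7% of branch targets that diverge. This is consistent with the observation that the branch-target metric flags the presence of divergence but does not predict its cost.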

valassi · 2021-05-13T08:56:04Z

Self merging

valassi added 8 commits May 12, 2021 19:57

[warps] add a comment with a useful link

3ada0c7

valassi merged commit 0bc1b7a into madgraph5:master May 13, 2021

This was referenced May 13, 2021

[warps] add the divergence metrics also to the profile script for GUI analysis #203

Merged

Branch efficiency: check that we have no issues with branch divergence #25

Closed

valassi merged 8 commits into madgraph5:master from valassi:warps
