[warps] measure cuda thread divergence #202

Merged
valassi merged 8 commits into madgraph5:master from
valassi:warps
May 13, 2021

@valassi
Member

@valassi valassi commented May 12, 2021

Two changes:

  • improve the profiling script to also measure CUDA thread divergence
  • add the option to artificially introduce some thread divergence in eemumu, just to check what it does

This essentially resolves issue #25, using command-line profiling rather than GUI profiling.
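
For readers unfamiliar with the term, here is a minimal Python model of thread divergence. It is not the eemumu code and not what the profiler does internally: it only illustrates that threads are grouped into warps of 32 and that a branch is divergent when threads of the same warp disagree on it. The function name and the three predicates are hypothetical, chosen for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def count_divergent_warps(takes_branch):
    """Count warps in which threads disagree on a branch predicate.

    takes_branch: list of booleans, one per thread, in thread-index order.
    """
    divergent = 0
    for start in range(0, len(takes_branch), WARP_SIZE):
        warp = takes_branch[start:start + WARP_SIZE]
        # A warp diverges only if some, but not all, of its threads take the branch.
        if any(warp) and not all(warp):
            divergent += 1
    return divergent


# Hypothetical predicates for 4 warps (128 threads).
uniform = [True] * 128                                       # every thread agrees
per_warp = [(t // WARP_SIZE) % 2 == 0 for t in range(128)]   # differs only between warps
per_thread = [t % 2 == 0 for t in range(128)]                # differs inside every warp

print(count_divergent_warps(uniform))
print(count_divergent_warps(per_warp))
print(count_divergent_warps(per_thread))
```

A predicate that varies only from warp to warp causes no divergence in this model; one that varies from thread to thread makes every warp diverge.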

valassi added 8 commits May 12, 2021 19:57
…oughput12.sh

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.747784 sec
     2,606,062,103      cycles                    #    2.648 GHz
     3,536,734,749      instructions              #    1.36  insn per cycle
       1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       868,352
                             : smsp__sass_branch_targets_threads_uniform.sum       868,352
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            16,384
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397452e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.608068 sec
     2,198,704,176      cycles                    #    2.652 GHz
     2,956,510,323      instructions              #    1.34  insn per cycle
       0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       9,053,696
                             : smsp__sass_branch_targets_threads_uniform.sum       9,053,696
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            512
=========================================================================
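
The branch-target counters printed above can be collected into a dictionary for further checks. The following is a sketch that assumes the line layout shown in this log (a colon, the metric name, then a comma-grouped integer); the regular expression and the function name `parse_metrics` are illustrative and are not part of the profiling script.

```python
import re

# Matches lines such as ": smsp__sass_branch_targets.sum    868,352"
METRIC_LINE = re.compile(r":\s+(?P<name>smsp__\S+)\s+(?P<value>[\d,]+)")


def parse_metrics(text):
    """Return {metric_name: integer_value} for every smsp__ counter line."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.search(line)
        if match:
            metrics[match.group("name")] = int(match.group("value").replace(",", ""))
    return metrics


# Counter lines copied from the first process in the log above.
sample = """\
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
                             : smsp__sass_branch_targets.sum                       868,352
                             : smsp__sass_branch_targets_threads_uniform.sum       868,352
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            16,384
"""

metrics = parse_metrics(sample)
for name, value in metrics.items():
    print(name, value)
```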
…lly profile more details

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.410966e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.755021 sec
     2,631,624,147      cycles                    #    2.658 GHz
     3,536,753,184      instructions              #    1.34  insn per cycle
       1.052640869 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.401510e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.604015 sec
     2,186,419,449      cycles                    #    2.652 GHz
     2,935,826,967      instructions              #    1.34  insn per cycle
       0.887381812 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
… divergence and measure it

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.012377 sec
     3,103,258,724      cycles                    #    2.652 GHz
     4,387,995,862      instructions              #    1.41  insn per cycle
       1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       1,785,856
                             : smsp__sass_branch_targets_threads_uniform.sum       1,720,320
                             : smsp__sass_branch_targets_threads_divergent.sum     65,536
                             : smsp__warps_launched.sum                            16,384
=========================================================================
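
The counters above are consistent with the reported 96.33% being the ratio of uniform branch targets to all branch targets. This is an observation from these numbers, not a statement of how the profiler defines the metric. A short check, with a hypothetical helper name:

```python
def uniform_pct(uniform_targets, total_targets):
    """Share of branch targets that were uniform across the warp, in percent."""
    return 100.0 * uniform_targets / total_targets


# Values copied from the log above.
total = 1_785_856
uniform = 1_720_320
divergent = 65_536

# Uniform and divergent targets add up to the total.
print(uniform + divergent == total)
print(f"{uniform_pct(uniform, total):.2f}%")
```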
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.677619e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.748096 sec
     2,610,682,594      cycles                    #    2.651 GHz
     3,539,886,609      instructions              #    1.36  insn per cycle
       1.051570589 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       109
                             : smsp__sass_branch_targets_threads_uniform.sum       105
                             : smsp__sass_branch_targets_threads_divergent.sum     4
                             : smsp__warps_launched.sum                            1
=========================================================================
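
The single-warp counters above and the 16,384-warp counters reported earlier for the same divergent build agree exactly: multiplying the per-warp counts by the number of warps launched reproduces the full-grid counts. A quick check with the values from the two logs (the dictionary keys are shorthand, not profiler names):

```python
WARPS_LAUNCHED = 16_384

# Counters from the run profiled with a single warp.
single_warp = {"targets": 109, "uniform": 105, "divergent": 4}

# Counters from the run profiled with 16,384 warps.
full_grid = {"targets": 1_785_856, "uniform": 1_720_320, "divergent": 65_536}

scaled = {name: count * WARPS_LAUNCHED for name, count in single_warp.items()}
print(scaled == full_grid)
```

This suggests that, for this kernel, every warp follows the same branch pattern, so profiling a single warp is enough to obtain the uniform percentage.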
…ion to add it as an example

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.433666e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.747238 sec
     2,606,742,547      cycles                    #    2.653 GHz
     3,539,351,466      instructions              #    1.36  insn per cycle
       1.051872183 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       53
                             : smsp__sass_branch_targets_threads_uniform.sum       53
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            1
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.456046e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.604293 sec
     2,202,232,601      cycles                    #    2.651 GHz
     2,955,309,807      instructions              #    1.34  insn per cycle
       0.889275197 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       17,683
                             : smsp__sass_branch_targets_threads_uniform.sum       17,683
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            1
=========================================================================
Note that I tried the 'stalled_barrier' metrics and they do not seem interesting

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.745683 sec
     2,603,540,638      cycles                    #    2.655 GHz
     3,537,849,260      instructions              #    1.36  insn per cycle
       1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       109        4.18/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       105        4.03/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     4          153.37/msecond
                             : smsp__warps_launched.sum                            1
=========================================================================
…ce test

Note that the throughput degradation from divergence is real and reproducible.
Without divergence, throughput is now around 6.4E8 events per second, against 5.7E8 with divergence (about 11% lower).
It is, however, very difficult to correlate the percent degradation in throughput with the divergence metrics.
In summary, one should just aim at 100% uniform execution.
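
As a worked comparison, the sketch below takes the EvtsPerSec values reported in this PR for the two eemumu builds (6.425099e+08 without divergence, 5.711994e+08 with it) and sets the throughput loss next to the divergent share of branch targets. The helper name is hypothetical.

```python
def degradation_pct(baseline, degraded):
    """Relative throughput loss with respect to the baseline, in percent."""
    return 100.0 * (baseline - degraded) / baseline


throughput_uniform = 6.425099e8    # EvtsPerSec, 100% uniform build
throughput_divergent = 5.711994e8  # EvtsPerSec, 96.33% uniform build

loss = degradation_pct(throughput_uniform, throughput_divergent)
divergent_share = 100.0 * 4 / 109  # divergent / total branch targets per warp

print(f"throughput loss: {loss:.1f}%")
print(f"divergent branch targets: {divergent_share:.2f}%")
```

The two percentages differ by about a factor of three, which illustrates why the metric is hard to translate directly into a throughput prediction.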

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.741551 sec
     2,589,547,187      cycles                    #    2.655 GHz
     3,537,039,425      instructions              #    1.37  insn per cycle
       1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.602111 sec
     2,193,960,041      cycles                    #    2.654 GHz
     2,948,877,241      instructions              #    1.34  insn per cycle
       0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
=========================================================================
@valassi
Member Author

valassi commented May 13, 2021

Self merging
