
Conversation

TroyGarden
Contributor

Summary:

context

  • We are adding fbgemm operators for the KT.regroup function (a minimal sketch of the call appears below).
  • We wanted a good way to measure performance beyond the raw runtime numbers.
  • A trace is necessary to reveal the many nuances of the optimization.
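
For readers unfamiliar with the call under test, here is a minimal sketch of what a KT.regroup invocation looks like (a toy example assuming the public torchrec KeyedTensor API; the keys, lengths, and groupings are made up for illustration):

```python
# Toy sketch of the regroup call being benchmarked; keys, lengths, and groups are illustrative.
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# Two pooled-embedding KeyedTensors, each holding a couple of features side by side.
kt_1 = KeyedTensor(
    keys=["f1", "f2"],
    length_per_key=[8, 8],
    values=torch.randn(1024, 16),  # batch x (8 + 8)
)
kt_2 = KeyedTensor(
    keys=["f3", "f4"],
    length_per_key=[4, 12],
    values=torch.randn(1024, 16),  # batch x (4 + 12)
)

# Regroup features from both KeyedTensors into new groupings;
# returns one dense tensor per group: shapes (1024, 12) and (1024, 20).
grouped = KeyedTensor.regroup([kt_1, kt_2], [["f1", "f3"], ["f2", "f4"]])
```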

usage

  • To generate trace files in the given path (.), run the command below (a simplified profiler sketch follows the file listing):
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
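
The `--profile` flag drives the trace export. Conceptually it boils down to the standard torch.profiler pattern below (a simplified sketch, not the benchmark's exact code; `bench_fn` and the output path are placeholders):

```python
# Simplified sketch of producing a Chrome trace like the files listed above.
# `bench_fn` and the output path are placeholders, not the benchmark's real names.
import torch
from torch.profiler import ProfilerActivity, profile

def bench_fn() -> None:
    x = torch.randn(1024, 1020, device="cuda")
    (x * 2).sum().item()  # stand-in for the regroup op under test

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        bench_fn()

# Writes a JSON trace that can be opened in chrome://tracing or Perfetto.
prof.export_chrome_trace("./trace-example.json")
```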

performance

INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0

traces

  • _regroup_keyed_tenors
    {F1712147044}
  • KeyedTensor.regroup
    {F1712148863}
  • KTRegroupAsDict
    {F1712150411}

Differential Revision: D58906521

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 22, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58906521

Summary:
Pull Request resolved: #2157

Differential Revision: D58906521
facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2024
Summary:

# context
* We are adding fbgemm operators for the KT.regroup function.
* We wanted a good way to measure performance beyond the raw runtime numbers.
* **A trace is very important for evaluating the actual performance impact.**
* For example, judging only by the GPU runtime readings, the native-pytorch implementation (`_regroup_keyed_tenors`) appears to outperform the fbgemm_gpu implementation (`KeyedTensor.regroup`).
* But the CPU/GPU traces show that the native-pytorch implementation is actually CPU-bound, which hurts overall performance badly (the timing sketch below illustrates why wall-clock readings alone can mislead).
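
A small illustrative sketch (not part of the benchmark) of why host-side wall-clock readings alone can mislead on CUDA: kernel launches are asynchronous, so an unsynchronized timer mostly measures how fast the CPU can enqueue work, and even a synchronized timer cannot tell whether the total was spent in kernels or in CPU launch overhead; that distinction is exactly what the trace makes visible.

```python
# Illustrative only: compare enqueue-only timing vs. end-to-end timing on CUDA.
# A CPU-bound implementation spends most of the end-to-end time issuing many
# small ops from Python, which shows up in the trace as gaps between kernels.
import time
import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # placeholder for the regroup implementation under test
    return torch.cat([x[:, :510], x[:, 510:]], dim=1)

x = torch.randn(1024, 1020, device="cuda")

# Naive wall-clock: only measures how long the CPU took to enqueue the kernels.
t0 = time.perf_counter()
for _ in range(100):
    op(x)
enqueue_ms = (time.perf_counter() - t0) * 1e3

# Synchronized wall-clock: includes the time the GPU needed to finish the work.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    op(x)
torch.cuda.synchronize()
end_to_end_ms = (time.perf_counter() - t0) * 1e3

print(f"enqueue-only: {enqueue_ms:.1f} ms, end-to-end: {end_to_end_ms:.1f} ms")
```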

# usage
* To generate trace files in the given path (`.`):
```
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
```
```
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
```
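
Once a trace JSON exists, a rough way to quantify how busy the GPU actually was is to sum the kernel event durations and compare them against the overall trace window (a sketch only; Kineto event category names such as "kernel" may vary across PyTorch versions):

```python
# Rough sketch: estimate GPU busy time from an exported Chrome trace.
# Kineto category names ("kernel", "cpu_op", ...) may differ between PyTorch versions.
import json

with open("trace-KeyedTensor.regroup.json") as f:
    trace = json.load(f)

events = [e for e in trace.get("traceEvents", []) if "ts" in e]
kernel_us = sum(e.get("dur", 0) for e in events if e.get("cat") == "kernel")
window_us = max(e["ts"] + e.get("dur", 0) for e in events) - min(e["ts"] for e in events)
print(f"GPU kernel time: {kernel_us / 1e3:.1f} ms within a {window_us / 1e3:.1f} ms window")
```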

# performance
* GPU
```
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
```
* CPU
```
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 144.8 ms | Memory (P90):   0.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 159.1 ms | Memory (P90):   0.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 203.0 ms | Memory (P90):   0.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 132.4 ms | Memory (P90):   0.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 134.7 ms | Memory (P90):   0.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 131.8 ms | Memory (P90):   0.0
```
# traces
* _regroup_keyed_tenors
 {F1712147044} 
* KeyedTensor.regroup
 {F1712148863} 
* KTRegroupAsDict
 {F1712150411}

Differential Revision: D58906521
@TroyGarden TroyGarden changed the title Add GPU trace for KT.regroup benchmark [KT.regroup Ops][0/N] Add GPU trace for KT.regroup benchmark Jul 13, 2024
@TroyGarden TroyGarden deleted the export-D58906521 branch June 4, 2025 05:47