
Conversation

TroyGarden
Contributor

Summary:

context

  • We are adding fbgemm operators for the KT.regroup function (a minimal sketch of the call appears below).
  • We wanted a good way to measure performance beyond the raw runtime numbers.
  • A trace is necessary to reveal the many nuances of the optimization.
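
For readers unfamiliar with the call under test, here is a minimal sketch of what a KT.regroup invocation looks like (a toy example assuming the public torchrec KeyedTensor API; the keys, lengths, and groupings are made up for illustration):

```python
# Toy sketch of the regroup call being benchmarked; keys, lengths, and groups are illustrative.
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# Two pooled-embedding KeyedTensors, each holding a couple of features side by side.
kt_1 = KeyedTensor(
    keys=["f1", "f2"],
    length_per_key=[8, 8],
    values=torch.randn(1024, 16),  # batch x (8 + 8)
)
kt_2 = KeyedTensor(
    keys=["f3", "f4"],
    length_per_key=[4, 12],
    values=torch.randn(1024, 16),  # batch x (4 + 12)
)

# Regroup features from both KeyedTensors into new groupings;
# returns one dense tensor per group: shapes (1024, 12) and (1024, 20).
grouped = KeyedTensor.regroup([kt_1, kt_2], [["f1", "f3"], ["f2", "f4"]])
```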

usage

  • To generate trace files in the given path (.), run the command below (a simplified profiler sketch follows the file listing):
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
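
The `--profile` flag drives the trace export. Conceptually it boils down to the standard torch.profiler pattern below (a simplified sketch, not the benchmark's exact code; `bench_fn` and the output path are placeholders):

```python
# Simplified sketch of producing a Chrome trace like the files listed above.
# `bench_fn` and the output path are placeholders, not the benchmark's real names.
import torch
from torch.profiler import ProfilerActivity, profile

def bench_fn() -> None:
    x = torch.randn(1024, 1020, device="cuda")
    (x * 2).sum().item()  # stand-in for the regroup op under test

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        bench_fn()

# Writes a JSON trace that can be opened in chrome://tracing or Perfetto.
prof.export_chrome_trace("./trace-example.json")
```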

performance

INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0

traces

  • _regroup_keyed_tenors
    {F1712147044}
  • KeyedTensor.regroup
    {F1712148863}
  • KTRegroupAsDict
    {F1712150411}

Differential Revision: D58906521

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 22, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58906521

Summary:
Pull Request resolved: #2157

Differential Revision: D58906521
facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2024
Summary:

# context
* We are adding fbgemm operators for the KT.regroup function.
* We wanted a good way to measure performance beyond the raw runtime numbers.
* **A trace is very important for evaluating the actual performance impact.**
* For example, judging only by the GPU runtime readings, the native-pytorch implementation (`_regroup_keyed_tenors`) appears to outperform the fbgemm_gpu implementation (`KeyedTensor.regroup`).
* But the CPU/GPU traces show that the native-pytorch implementation is actually CPU-bound, which hurts overall performance badly (the timing sketch below illustrates why wall-clock readings alone can mislead).
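
A small illustrative sketch (not part of the benchmark) of why host-side wall-clock readings alone can mislead on CUDA: kernel launches are asynchronous, so an unsynchronized timer mostly measures how fast the CPU can enqueue work, and even a synchronized timer cannot tell whether the total was spent in kernels or in CPU launch overhead; that distinction is exactly what the trace makes visible.

```python
# Illustrative only: compare enqueue-only timing vs. end-to-end timing on CUDA.
# A CPU-bound implementation spends most of the end-to-end time issuing many
# small ops from Python, which shows up in the trace as gaps between kernels.
import time
import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # placeholder for the regroup implementation under test
    return torch.cat([x[:, :510], x[:, 510:]], dim=1)

x = torch.randn(1024, 1020, device="cuda")

# Naive wall-clock: only measures how long the CPU took to enqueue the kernels.
t0 = time.perf_counter()
for _ in range(100):
    op(x)
enqueue_ms = (time.perf_counter() - t0) * 1e3

# Synchronized wall-clock: includes the time the GPU needed to finish the work.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    op(x)
torch.cuda.synchronize()
end_to_end_ms = (time.perf_counter() - t0) * 1e3

print(f"enqueue-only: {enqueue_ms:.1f} ms, end-to-end: {end_to_end_ms:.1f} ms")
```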

# usage
* To generate trace files in the given path (`.`):
```
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
```
```
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
```
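
Once a trace JSON exists, a rough way to quantify how busy the GPU actually was is to sum the kernel event durations and compare them against the overall trace window (a sketch only; Kineto event category names such as "kernel" may vary across PyTorch versions):

```python
# Rough sketch: estimate GPU busy time from an exported Chrome trace.
# Kineto category names ("kernel", "cpu_op", ...) may differ between PyTorch versions.
import json

with open("trace-KeyedTensor.regroup.json") as f:
    trace = json.load(f)

events = [e for e in trace.get("traceEvents", []) if "ts" in e]
kernel_us = sum(e.get("dur", 0) for e in events if e.get("cat") == "kernel")
window_us = max(e["ts"] + e.get("dur", 0) for e in events) - min(e["ts"] for e in events)
print(f"GPU kernel time: {kernel_us / 1e3:.1f} ms within a {window_us / 1e3:.1f} ms window")
```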

# performance
* GPU
```
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
```
* CPU
```
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 144.8 ms | Memory (P90):   0.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 159.1 ms | Memory (P90):   0.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 203.0 ms | Memory (P90):   0.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 132.4 ms | Memory (P90):   0.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 134.7 ms | Memory (P90):   0.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 131.8 ms | Memory (P90):   0.0
```
# traces
* _regroup_keyed_tenors
 {F1712147044} 
* KeyedTensor.regroup
 {F1712148863} 
* KTRegroupAsDict
 {F1712150411}

Differential Revision: D58906521
@TroyGarden TroyGarden changed the title Add GPU trace for KT.regroup benchmark [KT.regroup Ops][0/N] Add GPU trace for KT.regroup benchmark Jul 13, 2024
@TroyGarden TroyGarden deleted the export-D58906521 branch June 4, 2025 05:47