@wz337 wz337 commented Aug 25, 2025

Summary:

  1. Recent changes from D79128843 introduced a sync point in `clipping.py`, which was visible in the trace.
  2. The code created CPU tensors that were then moved synchronously to CUDA devices, causing long wait times in training, with `CudaStreamSynchronization` showing up in the trace.
  3. This caused QPS degradation in the CTX FM model, which I was actively working on optimizing, and in most models that enable Optimizer Gradient clipping in their YAML config.
  4. This fix improves QPS by around 5% while keeping NE unimpacted.
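
For readers unfamiliar with this kind of sync point, here is a minimal sketch of the general pattern described in items 1 and 2. It is not the actual diff: the helper names `grad_norm_with_host_copy` and `grad_norm_on_device` are hypothetical, and the sketch assumes the sync came from building a scalar accumulator on the CPU and copying it to the device with a blocking `.to(device)`.

```python
import torch


def grad_norm_with_host_copy(grads, device):
    """Hypothetical anti-pattern: accumulator is built on the CPU, then copied over."""
    # torch.tensor(0.0) lives on the CPU; a blocking .to(device) is a
    # host-to-device copy that can make the host wait on the CUDA stream.
    total = torch.tensor(0.0).to(device)
    for g in grads:
        total = total + g.detach().float().pow(2).sum()
    return total.sqrt()


def grad_norm_on_device(grads, device):
    """Hypothetical alternative: accumulator is allocated directly on the device."""
    # No CPU tensor is created, so no host-to-device copy is needed.
    total = torch.zeros((), dtype=torch.float32, device=device)
    for g in grads:
        total = total + g.detach().float().pow(2).sum()
    return total.sqrt()


# Falls back to CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
grads = [torch.tensor([3.0, 4.0], device=device), torch.zeros(2, device=device)]
slow = grad_norm_with_host_copy(grads, device)
fast = grad_norm_on_device(grads, device)
```

Both helpers return the same norm, so a change of this shape would not be expected to affect numerics; only the second avoids creating a CPU tensor on every step.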

Differential Revision: D80959986

@meta-cla meta-cla bot added the CLA Signed label Aug 25, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80959986
