Remove CUDA synchronization in mean_token_accuracy #2902

cyyever · 2025-02-19T08:51:40Z

What does this PR do?

The mean_token_accuracy computation calls item() on the token counts, which forces a CUDA-to-CPU synchronization. That synchronization is a minor performance bottleneck in LLM fine-tuning, as the following profiling snapshot from v0.15.1 shows:

This PR fixes the bottleneck by accumulating the correct and total token counts in tensors. The item() calls are delayed until trainer.log().
A second profile, taken after the change, shows that the bottleneck disappears:

Because the metrics are cleared immediately after logging, this change should be safe and backwards-compatible.
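
The idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `TokenAccuracyAccumulator` and its method names are hypothetical, and the label shift and the `-100` ignore index are assumptions about how the metric is computed. The counts stay in tensors during training steps, and the only device-to-host transfer happens when the value is read for logging.

```python
import torch


class TokenAccuracyAccumulator:
    """Hypothetical helper: keeps running token counts as tensors."""

    def __init__(self, device="cpu"):
        # 0-dim tensors that live on the training device.
        self.correct = torch.zeros((), dtype=torch.long, device=device)
        self.total = torch.zeros((), dtype=torch.long, device=device)

    def update(self, logits, labels, ignore_index=-100):
        # Assumed convention: position t predicts token t + 1.
        preds = logits[..., :-1, :].argmax(dim=-1)
        targets = labels[..., 1:]
        mask = targets != ignore_index
        # Tensor-only arithmetic: no item() call, so no host sync here.
        self.correct += ((preds == targets) & mask).sum()
        self.total += mask.sum()

    def compute_and_reset(self):
        # One device-to-host transfer, performed only at logging time.
        correct, total = torch.stack([self.correct, self.total]).tolist()
        self.correct.zero_()
        self.total.zero_()
        return correct / total if total > 0 else 0.0
```

In this sketch, `update` would be called once per training step and `compute_and_reset` from the logging path, so the synchronization cost is paid once per logging interval rather than once per step.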

Before submitting

Did you read the contributor guideline, Pull Request section?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

qgallouedec · 2025-02-19T09:14:46Z

I don't understand the profiling actually. Where do you get that this line is the bottleneck?
Thanks for contributing!

cyyever · 2025-02-19T14:44:56Z

@qgallouedec It is the sixth line in the first picture. It's not a main bottleneck, but GPU utilization rose a bit after fixing it.

qgallouedec · 2025-02-19T21:57:00Z

This?

dequantize_4bit (bitsandbytes/functional.py:1380)

qgallouedec · 2025-02-19T21:58:18Z

The comparison is not very clear to me tbh. Do you have clearer results, like two trainings (one with main, one with your branch) where we can see the speedup in terms of steps/sec?

cyyever · 2025-02-20T06:16:47Z

@qgallouedec Of course, I will provide a comparison ASAP.

cyyever force-pushed the sync_point branch from 86834a4 to f677ed3 Compare February 19, 2025 09:07

cyyever force-pushed the sync_point branch from f677ed3 to 594faab Compare February 19, 2025 09:15

cyyever changed the title ~~Fix CUDA sync point in mean_token_accuracy~~ Remove CUDA synchronization in mean_token_accuracy Feb 19, 2025

Fix CUDA sync point in mean_token_accuracy

e23796d

cyyever force-pushed the sync_point branch from 594faab to e23796d Compare February 20, 2025 03:33

qgallouedec added the 😴 stale No update from the author, will be closed soon label Mar 16, 2025

qgallouedec closed this Apr 5, 2025
