Add async_stopping_criteria flag to reduce GPU-CPU syncs during generation #43085
+519
−2
Summary
- Adds an `async_stopping_criteria` flag to `GenerationConfig` that enables asynchronous stopping criteria checks during generation
Fixes #43086
Motivation
GPU-CPU synchronization during stopping criteria checks creates a bottleneck in autoregressive text generation. Each call to check if generation should stop requires syncing with the GPU to evaluate conditions like EOS token detection. This PR overlaps these checks with generation, reducing idle time.
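The overlap idea can be illustrated with a small CPU-only simulation. This is a sketch under stated assumptions, not the PR's implementation: `fake_model`, `is_eos`, `generate_sync`, and `generate_async` are hypothetical names, a worker thread stands in for the asynchronous GPU-side check, and `Future.done()` plays the role of a non-blocking readiness query. The loop keeps producing tokens while earlier checks are in flight, then truncates the output at the step where a stop was detected.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

EOS = 0  # hypothetical end-of-sequence token id


def fake_model(step):
    """Stand-in for one decoding step; emits EOS at step 5."""
    return EOS if step == 5 else step + 10


def is_eos(token):
    """Stand-in for a stopping criterion evaluated off the main loop."""
    return token == EOS


def generate_sync(max_len):
    """Baseline: block on the stopping check after every step."""
    tokens = []
    for step in range(max_len):
        tokens.append(fake_model(step))
        if is_eos(tokens[-1]):
            break
    return tokens


def generate_async(max_len):
    """Submit each check to a worker and poll without blocking."""
    tokens = []
    pending = deque()  # (step index, future) pairs, oldest first
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(max_len):
            token = fake_model(step)
            tokens.append(token)
            pending.append((step, pool.submit(is_eos, token)))
            # Consume only the checks that have already finished.
            while pending and pending[0][1].done():
                idx, future = pending.popleft()
                if future.result():
                    # Drop any tokens generated after the stop point.
                    return tokens[: idx + 1]
        # The loop hit max_len; resolve whatever is still outstanding.
        for idx, future in pending:
            if future.result():
                return tokens[: idx + 1]
    return tokens
```

The async loop may generate a few tokens past the stop point before a check resolves. Truncating at the recorded step index is what keeps its output identical to the blocking version.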
Key Optimizations
… `.item()` sync needed
… `event.query()` returns True (no fixed polling interval)
Benchmark Results
Stopping criteria overhead (200 checks):
Usage
Test plan
`test_async_sync_equivalence_max_length`, `test_async_sync_equivalence_eos_token`, `test_async_sync_equivalence_partial_eos`
- `test_async_sync_equivalence_max_length` - max_length criteria
- `test_async_sync_equivalence_eos_token` - EOS token criteria
- `test_async_custom_stopping_criteria` - custom stopping criteria
- `test_async_multiple_eos_tokens` - multiple EOS token IDs
- `test_async_different_batch_sizes` - batch sizes 1, 2, 4, 8, 16
- `test_async_cpu_fallback` - verifies sync fallback on CPU tensors
- `test_async_wrapper_basic` - `__len__`, `__iter__`, `max_length` property
- `test_async_legacy_call_interface` - legacy `__call__` interface
- `test_async_finalize` - cleanup method
🤖 Generated with Claude Code
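The equivalence tests in the plan compare synchronous and asynchronous stopping across criteria and batch sizes. The sketch below shows one way such a check could be structured on a toy model. It is an illustration only, not code from the PR: `toy_step`, `trim`, `run`, and the `lag` parameter are hypothetical, and the delayed check is modelled by acting on a "finished" flag that is `lag` steps old.

```python
EOS_IDS = {0, 1}  # hypothetical set of EOS token ids


def toy_step(step, row):
    """Deterministic stand-in for a model step.

    Row r emits an EOS id at step r + 2 and filler tokens otherwise.
    """
    if step == row + 2:
        return row % 2  # 0 or 1, both treated as EOS
    return 100 + step


def trim(seq):
    """Cut a sequence just after its first EOS token, if any."""
    for i, token in enumerate(seq):
        if token in EOS_IDS:
            return seq[: i + 1]
    return seq


def run(batch_size, max_length, lag=0):
    """Generate until all rows have hit EOS or max_length is reached.

    lag=0 acts on the current step's flag (synchronous behaviour);
    lag>0 acts on a flag that is `lag` steps old (delayed check).
    """
    seqs = [[] for _ in range(batch_size)]
    done_flags = []  # one "all rows finished" flag per step
    for step in range(max_length):
        for row, seq in enumerate(seqs):
            seq.append(toy_step(step, row))
        done_flags.append(all(EOS_IDS.intersection(seq) for seq in seqs))
        visible = step - lag  # newest flag the loop is allowed to see
        if visible >= 0 and done_flags[visible]:
            break
    return [trim(seq) for seq in seqs]


# Same trimmed output with and without lag, for each batch size.
for batch_size in (1, 2, 4, 8, 16):
    assert run(batch_size, 32, lag=0) == run(batch_size, 32, lag=2)
```

Comparing outputs after trimming matters here, because the delayed variant generates extra tokens before it observes the stop flag.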