
@AmitMY (Contributor) commented Jan 3, 2026

Summary

  • Adds async_stopping_criteria flag to GenerationConfig that enables asynchronous stopping criteria checks during generation
  • When enabled, stopping criteria are evaluated on a separate CUDA stream, allowing generation to continue while the check runs
  • Uses pinned (page-locked) CPU memory with numpy view for efficient GPU-CPU communication without explicit synchronization
  • Polls for completion only when the CUDA event signals the async operation is done (no artificial delays)

Fixes #43086

Motivation

GPU-CPU synchronization during stopping criteria checks creates a bottleneck in autoregressive text generation. Each call to check if generation should stop requires syncing with the GPU to evaluate conditions like EOS token detection. This PR overlaps these checks with generation, reducing idle time.

Key Optimizations

  1. Async CUDA stream: Stopping criteria run on a separate stream, overlapping with token generation
  2. Pinned memory + numpy: Results written to pinned memory and read via numpy view - no PyTorch .item() sync needed
  3. Event-based polling: Only check completion when event.query() returns True (no fixed polling interval)
  4. CPU-side max_length: Length checks done on CPU without any GPU sync
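The first three optimizations could be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the class and method names (`AsyncStopChecker`, `launch`, `poll`) are invented, and the sketch degrades to a plain synchronous check when CUDA is unavailable, mirroring the PR's CPU fallback.

```python
import torch

class AsyncStopChecker:
    """Sketch of an async EOS check: side CUDA stream, pinned-memory
    result buffer read through a numpy view, event-based polling.
    Falls back to a plain synchronous check when CUDA is unavailable."""

    def __init__(self, eos_token_id: int):
        self.eos_token_id = eos_token_id
        self.use_cuda = torch.cuda.is_available()
        self.result_cpu = False  # result for the synchronous fallback path
        if self.use_cuda:
            self.stream = torch.cuda.Stream()
            self.event = torch.cuda.Event()
            # Pinned (page-locked) host buffer; reading it through a numpy
            # view avoids the implicit GPU sync of Tensor.item().
            self.result = torch.zeros(1, dtype=torch.bool, pin_memory=True)
            self.result_np = self.result.numpy()

    def launch(self, input_ids: torch.Tensor) -> None:
        """Kick off the check; on CUDA this returns immediately."""
        if not self.use_cuda:
            self.result_cpu = bool((input_ids[:, -1] == self.eos_token_id).all())
            return
        snapshot = input_ids.clone()  # stable copy for the side stream
        # The side stream must not read the clone before it is complete.
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            done = (snapshot[:, -1] == self.eos_token_id).all()
            self.result.copy_(done, non_blocking=True)
            self.event.record(self.stream)

    def poll(self):
        """Non-blocking: True/False once ready, None while in flight."""
        if not self.use_cuda:
            return self.result_cpu
        if self.event.query():  # completion check, no blocking sync
            return bool(self.result_np[0])
        return None
```

The generation loop would call `launch` after each decoding step and keep producing tokens, consulting `poll` opportunistically instead of blocking on the GPU.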

Benchmark Results

Stopping criteria overhead (200 checks):

| Mode  | Time (200 checks) | Per Check | Speedup |
|-------|-------------------|-----------|---------|
| Sync  | 104.65 ms         | 0.523 ms  | 1.00x   |
| Async | 51.76 ms          | 0.259 ms  | 2.02x   |

Usage

```python
model.generate(
    inputs,
    max_new_tokens=200,
    async_stopping_criteria=True,
)
```
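The CPU-side max_length check from the optimization list needs no GPU involvement at all: the generation loop already knows how many tokens it has produced, so a plain Python counter can decide length-based stopping without touching device memory. A minimal sketch, with illustrative names rather than the PR's API:

```python
class CpuMaxLengthCheck:
    """Length-based stopping tracked entirely on the host: the loop
    counts generated tokens itself, so no GPU sync is ever needed."""

    def __init__(self, max_length: int, prompt_length: int):
        self.max_length = max_length
        self.current_length = prompt_length

    def step(self) -> bool:
        """Call once per generated token; True means stop."""
        self.current_length += 1
        return self.current_length >= self.max_length
```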

Test plan

  • Verify correctness: async mode produces identical outputs to sync mode
    • test_async_sync_equivalence_max_length
    • test_async_sync_equivalence_eos_token
    • test_async_sync_equivalence_partial_eos
  • Test with various stopping criteria (EOS, max_length, custom criteria)
    • test_async_sync_equivalence_max_length - max_length criteria
    • test_async_sync_equivalence_eos_token - EOS token criteria
    • test_async_custom_stopping_criteria - custom stopping criteria
    • test_async_multiple_eos_tokens - multiple EOS token IDs
  • Test with different batch sizes
    • test_async_different_batch_sizes - batch sizes 1, 2, 4, 8, 16
  • Ensure graceful fallback when CUDA is not available
    • test_async_cpu_fallback - verifies sync fallback on CPU tensors
  • Test wrapper interface compatibility
    • test_async_wrapper_basic - __len__, __iter__, max_length property
    • test_async_legacy_call_interface - legacy __call__ interface
    • test_async_finalize - cleanup method

🤖 Generated with Claude Code

…ation

When enabled, stopping criteria checks are performed asynchronously on a
separate CUDA stream with pinned memory for result communication. This
allows generation to continue while the check runs, reducing synchronization
overhead.

Key optimizations:
- Uses pinned (page-locked) CPU memory for GPU-CPU communication
- Batched polling: only checks async results every N tokens
- CPU-side max_length check to avoid unnecessary GPU syncs

Benchmark results (utf8-lm-tiny, 200 tokens):
- Sync mode: 80.92 tokens/sec
- Async mode: 137.06 tokens/sec
- Speedup: 1.69x

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@AmitMY force-pushed the async-stopping-criteria branch from 0134f83 to 60e2379 on January 3, 2026 at 10:13
- Add _sync_event for GPU-side synchronization between main and async streams
- Clone input_ids instead of detach to prevent race conditions
- Pre-create _should_stop_gpu tensor to avoid stream issues
- Make async stream wait for current stream before reading cloned tensors
- Fix test_async_sync_equivalence_eos_token to properly test async behavior
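The clone-instead-of-detach point matters because `detach` shares storage with the original tensor: if the main stream mutates or reuses the buffer, the async stream can read torn data. A tiny illustration of the difference, independent of the PR's code:

```python
import torch

ids = torch.tensor([[1, 2, 3]])

detached = ids.detach()  # shares the same underlying storage
cloned = ids.clone()     # independent copy: a stable snapshot

ids[0, -1] = 99          # the main stream mutates the buffer in place

# The detached view observes the mutation; the clone does not.
print(detached[0, -1].item())  # 99
print(cloned[0, -1].item())    # 3
```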

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ArthurZucker (Collaborator) commented:
cc @remi-or, but as we are moving towards just improving continuous batching, I don't think this makes sense without it also being supported in CB!

@remi-or (Collaborator) commented Jan 5, 2026

Async CPU-GPU overlap is super cool; it will be implemented in CB down the line for sure. I don't think the CB implementation needs to block the implementation in generate, though; it would be nice to have the two options improve hand in hand, especially since CB supports much less than generate because it is less flexible on some things.

@AmitMY (Contributor, Author) commented Jan 5, 2026

Happy you like it.
I don't really see how CB has any effect here, because my main use case is batch_size=1.
My main goal is solving #43089, which seems very doable: there are only three spots with explicit syncs.

However, I do not feel comfortable tackling all three spots at once; perhaps there is a generic solution, but I could not think of one.
