
@AmitMY (Contributor) commented Jan 3, 2026

Summary

  • Adds async_stopping_criteria flag to GenerationConfig that enables asynchronous stopping criteria checks during generation
  • When enabled, stopping criteria are evaluated on a separate CUDA stream, allowing generation to continue while the check runs
  • Uses pinned (page-locked) CPU memory with numpy view for efficient GPU-CPU communication without explicit synchronization
  • Polls for completion only when the CUDA event signals the async operation is done (no artificial delays)

Fixes #43086

Motivation

GPU-CPU synchronization during stopping criteria checks creates a bottleneck in autoregressive text generation. Each call to check if generation should stop requires syncing with the GPU to evaluate conditions like EOS token detection. This PR overlaps these checks with generation, reducing idle time.

Key Optimizations

  1. Async CUDA stream: Stopping criteria run on a separate stream, overlapping with token generation
  2. Pinned memory + numpy: Results written to pinned memory and read via numpy view - no PyTorch .item() sync needed
  3. Event-based polling: Only check completion when event.query() returns True (no fixed polling interval)
  4. CPU-side max_length: Length checks done on CPU without any GPU sync
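The first three optimizations could be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the class and method names (`AsyncStopChecker`, `launch`, `poll`) are invented, and the sketch degrades to a plain synchronous check when CUDA is unavailable, mirroring the PR's CPU fallback.

```python
import torch

class AsyncStopChecker:
    """Sketch of an async EOS check: side CUDA stream, pinned-memory
    result buffer read through a numpy view, event-based polling.
    Falls back to a plain synchronous check when CUDA is unavailable."""

    def __init__(self, eos_token_id: int):
        self.eos_token_id = eos_token_id
        self.use_cuda = torch.cuda.is_available()
        self.result_cpu = False  # result for the synchronous fallback path
        if self.use_cuda:
            self.stream = torch.cuda.Stream()
            self.event = torch.cuda.Event()
            # Pinned (page-locked) host buffer; reading it through a numpy
            # view avoids the implicit GPU sync of Tensor.item().
            self.result = torch.zeros(1, dtype=torch.bool, pin_memory=True)
            self.result_np = self.result.numpy()

    def launch(self, input_ids: torch.Tensor) -> None:
        """Kick off the check; on CUDA this returns immediately."""
        if not self.use_cuda:
            self.result_cpu = bool((input_ids[:, -1] == self.eos_token_id).all())
            return
        snapshot = input_ids.clone()  # stable copy for the side stream
        # The side stream must not read the clone before it is complete.
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            done = (snapshot[:, -1] == self.eos_token_id).all()
            self.result.copy_(done, non_blocking=True)
            self.event.record(self.stream)

    def poll(self):
        """Non-blocking: True/False once ready, None while in flight."""
        if not self.use_cuda:
            return self.result_cpu
        if self.event.query():  # completion check, no blocking sync
            return bool(self.result_np[0])
        return None
```

The generation loop would call `launch` after each decoding step and keep producing tokens, consulting `poll` opportunistically instead of blocking on the GPU.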

Benchmark Results

Stopping criteria overhead (200 checks):

| Mode  | Time (200 checks) | Per Check | Speedup |
|-------|-------------------|-----------|---------|
| Sync  | 104.65 ms         | 0.523 ms  | 1.00x   |
| Async | 51.76 ms          | 0.259 ms  | 2.02x   |

Usage

```python
model.generate(
    inputs,
    max_new_tokens=200,
    async_stopping_criteria=True,
)
```
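The CPU-side max_length check from the optimization list needs no GPU involvement at all: the generation loop already knows how many tokens it has produced, so a plain Python counter can decide length-based stopping without touching device memory. A minimal sketch, with illustrative names rather than the PR's API:

```python
class CpuMaxLengthCheck:
    """Length-based stopping tracked entirely on the host: the loop
    counts generated tokens itself, so no GPU sync is ever needed."""

    def __init__(self, max_length: int, prompt_length: int):
        self.max_length = max_length
        self.current_length = prompt_length

    def step(self) -> bool:
        """Call once per generated token; True means stop."""
        self.current_length += 1
        return self.current_length >= self.max_length
```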

Test plan

  • Verify correctness: async mode produces identical outputs to sync mode
    • test_async_sync_equivalence_max_length
    • test_async_sync_equivalence_eos_token
    • test_async_sync_equivalence_partial_eos
  • Test with various stopping criteria (EOS, max_length, custom criteria)
    • test_async_sync_equivalence_max_length - max_length criteria
    • test_async_sync_equivalence_eos_token - EOS token criteria
    • test_async_custom_stopping_criteria - custom stopping criteria
    • test_async_multiple_eos_tokens - multiple EOS token IDs
  • Test with different batch sizes
    • test_async_different_batch_sizes - batch sizes 1, 2, 4, 8, 16
  • Ensure graceful fallback when CUDA is not available
    • test_async_cpu_fallback - verifies sync fallback on CPU tensors
  • Test wrapper interface compatibility
    • test_async_wrapper_basic - __len__, __iter__, max_length property
    • test_async_legacy_call_interface - legacy __call__ interface
    • test_async_finalize - cleanup method

🤖 Generated with Claude Code

…ation

When enabled, stopping criteria checks are performed asynchronously on a
separate CUDA stream with pinned memory for result communication. This
allows generation to continue while the check runs, reducing synchronization
overhead.

Key optimizations:
- Uses pinned (page-locked) CPU memory for GPU-CPU communication
- Batched polling: only checks async results every N tokens
- CPU-side max_length check to avoid unnecessary GPU syncs

Benchmark results (utf8-lm-tiny, 200 tokens):
- Sync mode: 80.92 tokens/sec
- Async mode: 137.06 tokens/sec
- Speedup: 1.69x

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@AmitMY force-pushed the async-stopping-criteria branch from 0134f83 to 60e2379 on January 3, 2026 at 10:13
- Add _sync_event for GPU-side synchronization between main and async streams
- Clone input_ids instead of detach to prevent race conditions
- Pre-create _should_stop_gpu tensor to avoid stream issues
- Make async stream wait for current stream before reading cloned tensors
- Fix test_async_sync_equivalence_eos_token to properly test async behavior
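The clone-instead-of-detach point matters because `detach` shares storage with the original tensor: if the main stream mutates or reuses the buffer, the async stream can read torn data. A tiny illustration of the difference, independent of the PR's code:

```python
import torch

ids = torch.tensor([[1, 2, 3]])

detached = ids.detach()  # shares the same underlying storage
cloned = ids.clone()     # independent copy: a stable snapshot

ids[0, -1] = 99          # the main stream mutates the buffer in place

# The detached view observes the mutation; the clone does not.
print(detached[0, -1].item())  # 99
print(cloned[0, -1].item())    # 3
```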

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ArthurZucker (Collaborator) commented:
cc @remi-or, but as we are moving towards just improving continuous batching, I don't think this makes sense without it also being supported in CB!

@remi-or (Collaborator) commented Jan 5, 2026

Async CPU-GPU overlap is super cool; it will be implemented in CB down the line for sure. I don't think the CB implementation needs to block the implementation in generate, though; it would be nice to have the two options improve hand in hand, especially since CB supports much less than generate because it is less flexible on some things.

@AmitMY (Contributor, Author) commented Jan 5, 2026

Happy you like it.
I don't really see how CB has any effect here, because my main use case is batch_size=1.
My main goal is solving #43089, which seems very doable: there are only three spots with explicit syncs.

However, I do not feel comfortable tackling all three spots at once; perhaps there is a generic solution, but I could not think of one.
