webgpu: Add session-level buffer pool for graph capture reuse#28761
Open
qjia7 wants to merge 2 commits into
When graph capture is enabled, each generator instance owns a per-graph BufferManager whose cache is discarded when the generator is destroyed. For workloads that repeatedly create and destroy generators on the same session (e.g., GenAI's per-request generators), this means every new generator has to reallocate all storage and uniform buffers from scratch, inflating cold-start cost and GPU memory churn.

This change introduces a SessionBufferPool owned by the session. When a retiring BufferManager is released, its cached storage and uniform buffers are donated to the pool; the next BufferManager seeded from the session absorbs those buffers, skipping reallocation entirely.

The pool capacity is controlled by a new provider option "ep.webgpuexecutionprovider.sessionBufferPoolGenerations" (defaults to disabled). The pool evicts the oldest slot when full, keeping the freshest distribution of buffer shapes.

Verified with GenAI multi-generator scripts on phi4: subsequent generators report zero cache misses for both storage and uniform caches and produce coherent output across max_length changes and overlapping generator lifetimes.
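The donate/seed/evict behavior described above can be sketched roughly as follows. This is not the PR's implementation: `SessionBufferPoolSketch`, `PoolSlot`, and `BufferHandle` are hypothetical stand-ins (plain integers replace real WebGPU buffers), and handing out the newest slot first on seeding is an assumption, since the description only states that the oldest slot is evicted when the pool is full.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU buffer handle; the real pool holds WebGPU buffers.
using BufferHandle = std::uint64_t;

// One "generation": the caches donated by a single retired BufferManager.
struct PoolSlot {
  std::vector<BufferHandle> storage;
  std::vector<BufferHandle> uniform;
};

class SessionBufferPoolSketch {
 public:
  // max_generations == 0 means pooling is disabled.
  explicit SessionBufferPoolSketch(std::size_t max_generations)
      : max_generations_(max_generations) {}

  // Called when a per-graph buffer manager retires.
  void Donate(PoolSlot slot) {
    if (max_generations_ == 0) {
      Release(slot);  // nothing is retained when pooling is disabled
      return;
    }
    if (slots_.size() == max_generations_) {
      Release(slots_.front());  // evict the oldest generation
      slots_.pop_front();
    }
    slots_.push_back(std::move(slot));
  }

  // Called when a new per-graph buffer manager is created.
  // Assumption: the most recently donated slot is handed out first.
  bool Seed(PoolSlot& out) {
    if (slots_.empty()) return false;
    out = std::move(slots_.back());
    slots_.pop_back();
    return true;
  }

  std::size_t size() const { return slots_.size(); }
  std::size_t released_buffers() const { return released_buffers_; }

 private:
  // A real pool would release the GPU buffers here; the sketch only counts them.
  void Release(const PoolSlot& slot) {
    released_buffers_ += slot.storage.size() + slot.uniform.size();
  }

  std::size_t max_generations_;
  std::size_t released_buffers_ = 0;
  std::deque<PoolSlot> slots_;  // front = oldest, back = newest
};
```

With a capacity of 2, donating three slots leaves the two most recent ones in the pool and releases the buffers of the first.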
Contributor
Pull request overview
This PR adds a per-session WebGPU buffer pooling mechanism to improve graph-capture reuse across generator lifetimes by retaining a small number of recently retired per-graph buffer caches and seeding them into newly created per-graph BufferManager instances.
Changes:
- Introduces `webgpu::SessionBufferPool` to retain and recycle storage/uniform buffer caches across captured-graph lifetimes.
- Adds a new provider option `ep.webgpuexecutionprovider.sessionBufferPoolGenerations` and parses/logs it in the WebGPU EP factory/config.
- Wires pooling into graph-capture lifecycle: seed buffers on per-graph `BufferManager` creation and donate buffers on `ReleaseCapturedGraph`.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/webgpu_provider_options.h | Adds config key for session buffer pool generations. |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Parses sessionBufferPoolGenerations and logs configured value. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h | Adds config field and EP member for the session-level buffer pool. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Creates/clears pool and integrates donate/seed with graph capture lifecycle. |
| onnxruntime/core/providers/webgpu/session_buffer_pool.h | New pool type definition and slot structure for storage/uniform buffers. |
| onnxruntime/core/providers/webgpu/session_buffer_pool.cc | New implementation for donate/seed/clear and buffer releasing. |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Exposes cache managers and adds extract/absorb APIs to cache interface. |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Implements extract/absorb for graph-mode cache managers to enable pooling. |
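To illustrate the extract/absorb idea listed for `buffer_manager.h` and `buffer_manager.cc`, here is a minimal sketch of a size-bucketed cache that can hand its free buffers to a successor. The class `GraphCacheSketch`, its method names, and the integer `BufferHandle` are assumptions made for illustration; the actual cache interface in the PR may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU buffer handle.
using BufferHandle = std::uint64_t;

class GraphCacheSketch {
 public:
  using Buckets = std::map<std::size_t, std::vector<BufferHandle>>;

  // first_handle only keeps the fake handles of different caches distinct.
  explicit GraphCacheSketch(BufferHandle first_handle)
      : next_handle_(first_handle) {}

  // Reuse a free buffer of exactly `size` bytes if available (hit),
  // otherwise "allocate" a new handle (miss).
  BufferHandle Acquire(std::size_t size) {
    auto it = free_.find(size);
    if (it != free_.end() && !it->second.empty()) {
      BufferHandle handle = it->second.back();
      it->second.pop_back();
      ++hits_;
      return handle;
    }
    ++misses_;
    return next_handle_++;
  }

  // Return a buffer to the cache for later reuse.
  void Release(std::size_t size, BufferHandle handle) {
    free_[size].push_back(handle);
  }

  // Move every cached buffer out, leaving this cache empty.
  Buckets Extract() {
    Buckets out;
    out.swap(free_);
    return out;
  }

  // Merge donated buffers into this cache's free lists.
  void Absorb(Buckets donated) {
    for (auto& entry : donated) {
      auto& bucket = free_[entry.first];
      bucket.insert(bucket.end(), entry.second.begin(), entry.second.end());
    }
  }

  std::size_t hits() const { return hits_; }
  std::size_t misses() const { return misses_; }

 private:
  Buckets free_;
  BufferHandle next_handle_;
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
};
```

A cache that absorbs what its predecessor extracted serves the same buffer sizes as hits, which is the zero-miss effect the PR aims for.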
- webgpu_provider_factory.cc: require std::from_chars to consume the full string for sessionBufferPoolGenerations so values like "1foo" are rejected instead of silently parsed as 1.
- buffer_manager.h: drop const from StorageCache()/UniformCache() so the mutable cache references can no longer be obtained through a const BufferManager&.
- session_buffer_pool.cc: drop slots_.reserve(max_generations_) to avoid a large up-front allocation when the option is set to an extreme value; slots grow on demand instead.
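The full-consumption check for `std::from_chars` mentioned in the first item can look roughly like this. The helper name `TryParseGenerations` and the `std::uint32_t` value type are assumptions for illustration, not taken from the PR.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse an unsigned option value, rejecting trailing junk.
inline bool TryParseGenerations(std::string_view text, std::uint32_t& out) {
  std::uint32_t value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  const auto result = std::from_chars(first, last, value);
  // std::from_chars stops at the first character it cannot parse, so "1foo"
  // yields value 1 with result.ptr pointing at 'f'. Requiring ptr == last
  // rejects such partially parsed input.
  if (result.ec != std::errc{} || result.ptr != last) {
    return false;
  }
  out = value;
  return true;
}
```

On failure `out` is left untouched, so the caller can keep its default value.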
Contributor
Author
@hariharans29 @guschmue Please take a look, thanks.
Summary
- Add a `SessionBufferPool` that lets a session hold on to retired generator buffer caches (storage + uniform) and seed them into newly created generators.
- Add a provider option `ep.webgpuexecutionprovider.sessionBufferPoolGenerations` to bound how many generations of retired buffers are kept (default `1`; set to `0` to disable).
- Donate a retiring `BufferManager`'s cache into the pool and absorb pooled buffers when a new `BufferManager` is created for the next generator.

Motivation
With graph capture enabled, each generator owns its own per-graph `BufferManager`. When the generator is destroyed (e.g., per-request in GenAI), the entire buffer cache is thrown away and the next generator must reallocate all storage and uniform buffers from scratch, increasing cold-start latency and GPU memory churn.

By keeping a small pool of recently retired buffer slots at the session level, the next generator can reuse them and skip reallocation entirely after the first cycle.

Test plan
- Build with `--use_webgpu`: clean build.
- `lintrunner -a` reports no lint issues.
- `verify_multi_gen.py`: sequential and overlapping generators all produce matching, coherent output.
- `verify_max_length_change.py`: generators with varying `max_length` all produce coherent output.
- `storage hits=171 misses=0, uniform hits=296 misses=0`, i.e., the pool actually engages and eliminates reallocation.

Notes
- `SessionReleaseCapturedGraph` from `State::~State()` so the per-graph `BufferManager` is actually released and its buffers reach the pool.