webgpu: Add session-level buffer pool for graph capture reuse #28761

Open: qjia7 wants to merge 2 commits into microsoft:main from qjia7:webgpu-session-buffer-pool

qjia7 commented Jun 3, 2026

Summary

  • Introduces SessionBufferPool that lets a session hold on to retired generator buffer caches (storage + uniform) and seed them into newly created generators.
  • Adds provider option ep.webgpuexecutionprovider.sessionBufferPoolGenerations to bound how many generations of retired buffers are kept (default 1; set to 0 to disable).
  • Wires the WebGPU EP to donate a retiring BufferManager's cache into the pool and absorb pooled buffers when a new BufferManager is created for the next generator.
  • The pool is only created when graph capture is enabled AND the option is > 0, so non-graph-capture sessions are unaffected.
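
The gating rule in the last bullet can be sketched as a single predicate. This is a minimal illustration, not the PR's code: `PoolConfigSketch` and `ShouldCreateSessionPool` are hypothetical names, and the real EP config carries more fields.

```cpp
#include <cstddef>

// Hypothetical mirror of the two settings that matter for the pool.
// Field names are illustrative and not taken from the PR.
struct PoolConfigSketch {
  bool enable_graph_capture;
  std::size_t session_buffer_pool_generations;
};

// The pool exists only when graph capture is on AND at least one
// generation of retired buffers may be kept.
inline bool ShouldCreateSessionPool(const PoolConfigSketch& config) {
  return config.enable_graph_capture &&
         config.session_buffer_pool_generations > 0;
}
```

With this shape, a session that never enables graph capture never constructs a pool, whatever the option value is.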

Motivation

With graph capture enabled, each generator owns its own per-graph BufferManager. When the generator is destroyed (e.g., per-request in GenAI), the entire buffer cache is thrown away and the next generator must reallocate all storage and uniform buffers from scratch, increasing cold-start latency and GPU memory churn.

By keeping a small pool of recently-retired buffer slots at the session level, the next generator can reuse them and skip reallocation entirely after the first cycle.

Test plan

  • Build ORT (Windows, D3D12) with --use_webgpu — clean build.
  • lintrunner -a reports no lint issues.
  • Verified end-to-end with GenAI on phi4 + WebGPU graph capture using two scripts:
    • verify_multi_gen.py: sequential and overlapping generators all produce matching, coherent output.
    • verify_max_length_change.py: generators with varying max_length all produce coherent output.
  • With diagnostic prints (since removed), confirmed that after the first generator donates buffers, subsequent generators report storage hits=171 misses=0, uniform hits=296 misses=0, i.e., the pool actually engages and eliminates reallocation.

Notes

  • Pairs with a GenAI-side change that invokes SessionReleaseCapturedGraph from State::~State() so the per-graph BufferManager is actually released and its buffers reach the pool.

When graph capture is enabled, each generator instance owns a per-graph
BufferManager whose cache is discarded when the generator is destroyed.
For workloads that repeatedly create and destroy generators on the same
session (e.g., GenAI's per-request generators), this means every new
generator has to reallocate all storage and uniform buffers from scratch,
inflating cold-start cost and GPU memory churn.

This change introduces a SessionBufferPool owned by the session. When a
retiring BufferManager is released, its cached storage and uniform
buffers are donated to the pool; the next BufferManager seeded from the
session absorbs those buffers, skipping reallocation entirely.

The pool capacity is controlled by a new provider option
"ep.webgpuexecutionprovider.sessionBufferPoolGenerations" (defaults to
disabled). The pool evicts the oldest slot when full, keeping the
freshest distribution of buffer shapes.
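
A rough sketch of the donate/seed/evict behavior described above, under stated assumptions: `FakeBuffer`, `Slot`, and `SessionBufferPoolSketch` are stand-ins invented for illustration, a slot here is just two vectors rather than real WebGPU buffer caches, and handing out the most recently donated slot on seed is an assumption that the description does not confirm.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for a GPU buffer handle (illustrative only).
struct FakeBuffer {
  std::size_t size_bytes;
};

// One retired generator's caches: storage and uniform buffers.
struct Slot {
  std::vector<FakeBuffer> storage;
  std::vector<FakeBuffer> uniform;
};

class SessionBufferPoolSketch {
 public:
  explicit SessionBufferPoolSketch(std::size_t max_generations)
      : max_generations_(max_generations) {}

  // Called when a per-graph buffer manager retires.
  void Donate(Slot slot) {
    if (max_generations_ == 0) return;  // pooling disabled
    if (slots_.size() == max_generations_) {
      slots_.pop_front();  // full: evict the oldest slot
    }
    slots_.push_back(std::move(slot));
  }

  // Called when a new per-graph buffer manager is created.
  // Assumption: the freshest slot is handed out first.
  std::optional<Slot> Seed() {
    if (slots_.empty()) return std::nullopt;
    Slot slot = std::move(slots_.back());
    slots_.pop_back();
    return slot;
  }

  std::size_t size() const { return slots_.size(); }

 private:
  std::size_t max_generations_;
  std::deque<Slot> slots_;  // grows on demand, oldest at the front
};
```

A `std::deque` keeps eviction of the oldest slot cheap and avoids any up-front allocation proportional to the configured capacity.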

Verified with GenAI multi-generator scripts on phi4: subsequent
generators report zero cache misses for both storage and uniform caches
and produce coherent output across max_length changes and overlapping
generator lifetimes.

Copilot AI left a comment

Pull request overview

This PR adds a per-session WebGPU buffer pooling mechanism to improve graph-capture reuse across generator lifetimes by retaining a small number of recently retired per-graph buffer caches and seeding them into newly created per-graph BufferManager instances.

Changes:

  • Introduces webgpu::SessionBufferPool to retain and recycle storage/uniform buffer caches across captured-graph lifetimes.
  • Adds a new provider option ep.webgpuexecutionprovider.sessionBufferPoolGenerations and parses/logs it in the WebGPU EP factory/config.
  • Wires pooling into graph-capture lifecycle: seed buffers on per-graph BufferManager creation and donate buffers on ReleaseCapturedGraph.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/webgpu_provider_options.h | Adds config key for session buffer pool generations. |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Parses sessionBufferPoolGenerations and logs configured value. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h | Adds config field and EP member for the session-level buffer pool. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Creates/clears pool and integrates donate/seed with graph capture lifecycle. |
| onnxruntime/core/providers/webgpu/session_buffer_pool.h | New pool type definition and slot structure for storage/uniform buffers. |
| onnxruntime/core/providers/webgpu/session_buffer_pool.cc | New implementation for donate/seed/clear and buffer releasing. |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Exposes cache managers and adds extract/absorb APIs to cache interface. |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Implements extract/absorb for graph-mode cache managers to enable pooling. |

Comment thread onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/buffer_manager.h Outdated
Comment thread onnxruntime/core/providers/webgpu/session_buffer_pool.cc
- webgpu_provider_factory.cc: require std::from_chars to consume the full
  string for sessionBufferPoolGenerations so values like "1foo" are
  rejected instead of silently parsed as 1.
- buffer_manager.h: drop const from StorageCache()/UniformCache() so the
  mutable cache references can no longer be obtained through a const
  BufferManager&.
- session_buffer_pool.cc: drop slots_.reserve(max_generations_) to avoid
  a large up-front allocation when the option is set to an extreme value;
  slots grow on demand instead.
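
The full-consumption check from the first item can be sketched as follows. `ParseGenerations` is a hypothetical helper written for illustration, not the code in webgpu_provider_factory.cc; only `std::from_chars` itself is a real standard-library call.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses a non-negative integer option value and
// rejects input with trailing characters such as "1foo".
inline std::optional<std::size_t> ParseGenerations(std::string_view text) {
  std::size_t value = 0;
  const char* first = text.data();
  const char* last = text.data() + text.size();
  const auto [ptr, ec] = std::from_chars(first, last, value);
  // ec flags conversion failures (no digits, out of range);
  // ptr == last confirms the entire string was consumed.
  if (ec != std::errc{} || ptr != last) {
    return std::nullopt;
  }
  return value;
}
```

Because `std::from_chars` stops at the first character it cannot convert and still reports success for the digits it did read, comparing the returned pointer against the end of the input is what turns "1foo" into an error.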

qjia7 commented Jun 4, 2026

@hariharans29 @guschmue Please take a look, thanks.

guschmue added the ep:WebGPU (ort-web webgpu provider) label Jun 4, 2026