Summary
The EmbedLookup / EmbedLookupQ4K Vulkan compute shaders produce garbage output when called from HybridForwardPass.Forward, but the same shaders produce correct output when called from GpuForwardPass.Forward.
Evidence
With instrumentation that downloads _gpuHidden immediately after the embedding dispatch:
[DBG] after-embed: nans=7 big=815 h[0..4]=-4.438109E-20,2.464776E-36,-1.299449E+08,1.798577E+24
Out of 2048 floats, 7 are NaN and 815 have absolute value > 1e6 - clearly not dequantized embedding values (which should be roughly in [-0.1, 0.1]).
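The check behind the [DBG] line can be sketched as follows. This is a minimal illustration in Python, not the project's actual instrumentation: the function names, the 1e6 threshold parameter, and the slack factor are assumptions made for the example, and it presumes the contents of _gpuHidden have already been downloaded into a list of floats.

```python
import math

def summarize_hidden(values, big_threshold=1e6, preview=4):
    """Count NaNs and implausibly large magnitudes in a downloaded buffer.

    Hypothetical helper: mirrors the nans= / big= / h[...] fields of the
    debug line, but is not the code that produced it.
    """
    nans = sum(1 for v in values if math.isnan(v))
    # NaN never compares greater than anything, but exclude it explicitly
    # so the two counters are clearly disjoint.
    big = sum(1 for v in values if not math.isnan(v) and abs(v) > big_threshold)
    head = ",".join(f"{v:.6E}" for v in values[:preview])
    return nans, big, head

def looks_like_embedding(values, expected_abs_max=0.1, slack=10.0):
    """Loose plausibility test: every value finite and within a generous
    multiple of the expected embedding range (assumed to be about 0.1)."""
    limit = expected_abs_max * slack
    return all(math.isfinite(v) and abs(v) <= limit for v in values)
```

Running summarize_hidden on the buffer downloaded from both HybridForwardPass and GpuForwardPass for the same token would give a like-for-like comparison of the two code paths.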
A Clear before the embed lookup does zero the buffer (verified), and a Clear after the embed lookup also zeros it - so both shaders execute. The embed shader runs but writes wrong values. The same shader running from GpuForwardPass produces correct values for the same model.
What was ruled out
- Not a transfer/compute barrier issue (still happens with RecordComputeCopy everywhere)
- Not a buffer-aliasing bug (_buffers dictionary entries verified unique)
- Not an upload-ordering bug (moved embedding upload to after all layer uploads - same garbage)
- Not specific to the Q4_K shader (EmbedLookup produces identical garbage)
- Not a descriptor-set reuse issue (_embedLookupQ4KPipeline._reusableDs is updated only once per Forward; embedding is the only call)
Workaround in place
HybridForwardPass.ShouldKeepFixedWeightsOnCpu now always returns true, forcing CPU dequantization of the single embedding row per token. Cost is negligible (one row vs. all the layer compute).
TODO
Find the actual cause. Suspect something about descriptor-pool / pipeline state that differs between HybridForwardPass and GpuForwardPass constructor sequences.