GPU embedding lookup writes garbage when invoked from HybridForwardPass #3

@pekkah

Summary

The EmbedLookup / EmbedLookupQ4K Vulkan compute shaders produce garbage output when called from HybridForwardPass.Forward, but the same shaders produce correct output when called from GpuForwardPass.Forward.

Evidence

With instrumentation that downloads _gpuHidden immediately after the embedding dispatch:

[DBG] after-embed: nans=7 big=815 h[0..4]=-4.438109E-20,2.464776E-36,-1.299449E+08,1.798577E+24

Out of 2048 floats, 7 are NaN and 815 have an absolute value > 1e6. These are clearly not dequantized embedding values, which should fall roughly in [-0.1, 0.1].
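For anyone reproducing this, a check like the one behind the [DBG] line can be sketched as follows. This is a minimal illustration, not the project's instrumentation: it assumes the downloaded _gpuHidden contents are available as raw little-endian float32 bytes, and the function name and the 1e6 threshold parameter are hypothetical.

```python
import math
import struct


def summarize_hidden(raw: bytes, big_threshold: float = 1e6, head: int = 4):
    """Count NaNs and large-magnitude values in a little-endian float32 buffer.

    Assumption: `raw` is the downloaded hidden-state buffer, 4 bytes per value.
    """
    count = len(raw) // 4
    values = struct.unpack(f"<{count}f", raw[:count * 4])
    nans = sum(1 for v in values if math.isnan(v))
    # NaN compares false against everything, so exclude it explicitly.
    big = sum(1 for v in values if not math.isnan(v) and abs(v) > big_threshold)
    return nans, big, values[:head]


if __name__ == "__main__":
    # Synthetic buffer: two plausible embedding values, one NaN, one huge value.
    sample = struct.pack("<4f", 0.05, float("nan"), 2e6, -0.01)
    nans, big, first = summarize_hidden(sample)
    print(f"nans={nans} big={big} first={first}")
```

A healthy embedding row should report nans=0 and big=0; anything else suggests the shader read or wrote the wrong memory.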

A Clear before the embed lookup does zero the buffer (verified), and a Clear after the embed lookup also zeros it, so both shaders execute. The embed shader runs but writes wrong values. The same shader running from GpuForwardPass produces correct values for the same model.

What was ruled out

  • Not a transfer/compute barrier issue (still happens with RecordComputeCopy everywhere)
  • Not a buffer-aliasing bug (_buffers dictionary entries verified unique)
  • Not an upload-ordering bug (moved the embedding upload to after all layer uploads; same garbage)
  • Not specific to the Q4_K shader (EmbedLookup produces identical garbage)
  • Not a descriptor-set reuse issue (_embedLookupQ4KPipeline._reusableDs is updated only once per Forward; embedding is the only call)

Workaround in place

HybridForwardPass.ShouldKeepFixedWeightsOnCpu now always returns true, forcing CPU dequantization of the single embedding row per token. Cost is negligible (one row vs. all the layer compute).
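To illustrate why the CPU path is cheap, here is a rough sketch of dequantizing a single Q4_K row. It assumes the weights use the GGML/llama.cpp block_q4_K layout (256 values per 144-byte super-block: fp16 d, fp16 dmin, 12 bytes of packed 6-bit scales and mins, 128 bytes of 4-bit quants), which this issue does not confirm. The function names are hypothetical and this is not the project's implementation.

```python
import struct

QK_K = 256         # values per super-block (assumed GGML Q4_K layout)
BLOCK_BYTES = 144  # 2 (d) + 2 (dmin) + 12 (scales) + 128 (packed 4-bit quants)


def _scale_min(j: int, scales: bytes):
    """Unpack the 6-bit scale and 6-bit min for sub-block j (0..7)."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequantize_row_q4k(row: bytes, n_values: int):
    """Dequantize one embedding row of n_values floats from Q4_K bytes."""
    assert n_values % QK_K == 0, "row length must be a multiple of 256"
    out = []
    for b in range(n_values // QK_K):
        block = row[b * BLOCK_BYTES:(b + 1) * BLOCK_BYTES]
        d, dmin = struct.unpack_from("<2e", block, 0)  # two little-endian fp16
        scales = block[4:16]
        qs = block[16:144]
        for pair in range(4):
            # Each 32-byte chunk holds two 32-value sub-blocks:
            # low nibbles first, then high nibbles.
            q = qs[pair * 32:(pair + 1) * 32]
            sc1, m1 = _scale_min(2 * pair, scales)
            sc2, m2 = _scale_min(2 * pair + 1, scales)
            out.extend(d * sc1 * (x & 0x0F) - dmin * m1 for x in q)
            out.extend(d * sc2 * (x >> 4) - dmin * m2 for x in q)
    return out
```

Under that layout, a 2048-float row is 8 super-blocks, or 1152 bytes, so the per-token work is a few thousand multiply-adds.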

TODO

Find the actual cause. Suspect something about descriptor-pool / pipeline state that differs between HybridForwardPass and GpuForwardPass constructor sequences.
