v1.0.0a6: fused CE HVP kernel

Latest

@noahgolmant released this 13 May 15:16

Adds a fused cross-entropy (CE) Hessian-vector product (HVP) kernel in two implementations: one generated via torch.compile (CPU/CUDA/MPS) and a hand-written Triton kernel (CUDA, online softmax). The implementation is auto-selected with fused="auto" on hf_lm_loss_of_output(). On an A100 at LM-scale vocabulary, the fused path is ~3.4× faster and uses 2× less memory than eager (PR #47).
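For readers unfamiliar with the quantity being fused, here is a minimal plain-Python sketch of a CE HVP for a single row of logits. It assumes the HVP is taken with respect to the logits, which the name hf_lm_loss_of_output() suggests but this note does not state. In that case the Hessian is diag(p) - p pᵀ with p = softmax(logits), so Hv = p ⊙ (v - p·v). The function names `ce_hvp_logits` and `ce_hvp_logits_online` are hypothetical and are not part of the library; the second variant illustrates the online-softmax idea (running max, running normalizer, running dot product in one pass) and is not the Triton kernel from this release.

```python
import math


def ce_hvp_logits(logits, v):
    """Reference CE Hessian-vector product w.r.t. one row of logits.

    H = diag(p) - p p^T, hence H v = p * (v - p.v). The result does not
    depend on the target label, because the label only shifts the gradient.
    Hypothetical helper for illustration; assumes finite logits.
    """
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    p = [e / total for e in exps]            # softmax probabilities
    pv = sum(pi * vi for pi, vi in zip(p, v))
    return [pi * (vi - pv) for pi, vi in zip(p, v)]


def ce_hvp_logits_online(logits, v):
    """Same result, with the reductions computed in one streaming pass.

    Keeps a running max m, normalizer s = sum(exp(z - m)), and weighted
    sum d = sum(exp(z - m) * v), rescaling s and d whenever m grows.
    """
    m, s, d = -math.inf, 0.0, 0.0
    for z, vi in zip(logits, v):
        m_new = max(m, z)
        scale = math.exp(m - m_new)          # 0.0 on the first element
        w = math.exp(z - m_new)
        s = s * scale + w
        d = d * scale + w * vi
        m = m_new
    pv = d / s                               # equals p . v
    # Second pass emits the output without storing p separately.
    return [math.exp(z - m) / s * (vi - pv) for z, vi in zip(logits, v)]
```

The streaming form never needs the full probability vector in memory before the reductions finish, which is the kind of saving that matters when the vocabulary dimension is large.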