Fused cross-entropy (CE) Hessian-vector product kernels: a torch.compile version (CPU/CUDA/MPS) and a hand-written Triton kernel (CUDA only, using online softmax). The backend is auto-selected via fused="auto" on hf_lm_loss_of_output(). About 3.4× faster with half the memory of the eager path on an A100 at LM-scale vocabulary sizes (PR #47).
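
For readers unfamiliar with the underlying math, here is a minimal pure-Python sketch of what a fused CE Hessian-vector product computes for one row of logits. It is not the code from PR #47. It only illustrates the standard identity H = diag(p) - p pᵀ (with p = softmax(logits)), so H v = p ⊙ (v - p·v), combined with an online-softmax pass that never materializes p. The function names ce_hvp_row and ce_hvp_row_dense are hypothetical, and the sketch assumes finite logits.

```python
import math


def ce_hvp_row(logits, v):
    """H v for softmax cross-entropy w.r.t. one row of logits.

    Uses H = diag(p) - p p^T, so H v = p * (v - p.v), p = softmax(logits).
    Hypothetical reference sketch, not the kernel from the PR.
    """
    # Pass 1 (online softmax): track a running max m, a running normalizer s,
    # and a running weighted sum d, rescaling s and d whenever m increases.
    m = -math.inf
    s = 0.0
    d = 0.0
    for z, vi in zip(logits, v):
        m_new = max(m, z)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        w = math.exp(z - m_new)
        s = s * scale + w
        d = d * scale + w * vi
        m = m_new
    pv = d / s  # the scalar p . v

    # Pass 2: emit p_i * (v_i - p.v) without storing the probability vector.
    return [math.exp(z - m) / s * (vi - pv) for z, vi in zip(logits, v)]


def ce_hvp_row_dense(logits, v):
    """Same product via the explicit V x V Hessian (for checking only)."""
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    s = sum(e)
    p = [x / s for x in e]
    n = len(p)
    hess = [
        [(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
        for i in range(n)
    ]
    return [sum(hess[i][j] * v[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    row = [2.0, -1.0, 0.5, 3.0]
    vec = [0.3, -0.7, 1.1, 0.2]
    print(ce_hvp_row(row, vec))
    print(ce_hvp_row_dense(row, vec))
```

The dense version builds the full Hessian, which at LM-scale vocabulary sizes is far too large to store per token. The streaming version needs only three scalars per row, which is the general reason a fused kernel can save memory. How the actual torch.compile and Triton kernels tile and batch this work is not described here.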